9.18 Support Vector Machine
The oml.svm class creates a Support Vector Machine (SVM) model for classification, regression, or anomaly detection.
SVM is a powerful, state-of-the-art algorithm with strong theoretical foundations based on Vapnik-Chervonenkis theory. SVM has strong regularization properties. Regularization controls model complexity so that the model generalizes well to new data.
SVM models have a functional form similar to neural networks and radial basis functions, which are both popular machine learning techniques.
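To make the comparison with radial basis functions concrete, the following is a minimal sketch of the textbook kernel-expansion form of an SVM decision function, using a Gaussian kernel. It is not the oml.svm implementation. The function names, the toy support vectors, and the weights are all hypothetical and chosen only for illustration.

```python
import math

def gaussian_kernel(u, v, std_dev):
    """Textbook Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 * std_dev^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * std_dev ** 2))

def decision_value(x, support_vectors, weights, bias, std_dev):
    """Weighted sum of kernel evaluations against each support vector, plus a bias."""
    return bias + sum(w * gaussian_kernel(sv, x, std_dev)
                      for sv, w in zip(support_vectors, weights))

# Hypothetical toy model: two support vectors with opposite signed weights.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
weights = [1.0, -1.0]

# A point close to the first support vector gets a positive value,
# and a point close to the second gets a negative value.
near_first = decision_value((0.1, 0.0), support_vectors, weights, 0.0, 1.0)
near_second = decision_value((2.0, 1.9), support_vectors, weights, 0.0, 1.0)
print(near_first > 0, near_second < 0)
```

The sign of the decision value is what a binary classifier of this form would use to pick a class. A radial basis function network has the same shape: a weighted sum of Gaussian bumps centered on selected points.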
SVM can be used to solve the following problems:
- Classification: SVM classification is based on decision planes that define decision boundaries. A decision plane separates sets of objects that have different class memberships. SVM finds the vectors ("support vectors") that define the separators giving the widest separation of classes.
  SVM classification supports both binary and multiclass targets.
- Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.
  SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.
- Anomaly Detection: Anomaly detection identifies unusual cases in data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.
  Anomaly detection is implemented as one-class SVM classification. An anomaly detection model predicts whether a data point is typical for a given distribution or not.
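The epsilon-insensitive loss described for regression can be sketched in a few lines of plain Python. This is an illustration of the standard loss definition only, not of how oml.svm computes it internally. The function name and the sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return max(0, |y_true - y_pred| - epsilon).

    Residuals no larger than epsilon fall inside the insensitivity
    tube and contribute zero loss.
    """
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions, scored with epsilon = 0.1.
targets = [1.0, 2.0, 3.0]
predictions = [1.05, 2.5, 2.0]
losses = [epsilon_insensitive_loss(t, p, 0.1)
          for t, p in zip(targets, predictions)]

# The first prediction is within 0.1 of its target, so its loss is zero.
# The other two are penalized only for the part of the residual beyond 0.1.
print(losses)
```

A larger epsilon widens the tube, so more predictions count as error-free and the fitted function can be simpler.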
The oml.svm class builds each of these three different types of models. Some arguments apply to classification models only, some to regression models only, and some to anomaly detection models only.
For information on the oml.svm class attributes and methods, invoke help(oml.svm) or see Oracle Machine Learning for Python API Reference.
Support Vector Machine Model Settings
The following table lists settings for SVM models.
Table 9-16 Support Vector Machine Settings
Setting Name | Setting Value | Description |
---|---|---|
CLAS_COST_TABLE_NAME | table_name | The name of a table that stores a cost matrix for the algorithm to use in scoring the model. The cost matrix specifies the costs associated with misclassifications. The cost matrix table is user-created. The following are the column requirements for the table. |
CLAS_WEIGHTS_BALANCED | | Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF. |
CLAS_WEIGHTS_TABLE_NAME | table_name | The name of a table that stores weighting information for individual target values in classification models. The weights are used by the algorithm to bias the model in favor of higher weighted classes. The class weights table is user-created. The following are the column requirements for the table. |
 | Positive integer | Sets the size of the batch for the SGD solver. This setting applies to SVM models with a linear kernel. An input of 0 triggers a data-driven batch size estimate. The default value is |
SVMS_COMPLEXITY_FACTOR | | Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor. The complexity factor applies to both classification and regression. The default value is estimated from the data by the algorithm. |
SVMS_CONV_TOLERANCE | | Convergence tolerance for the SVM algorithm. The default is 0.0001. |
 | | Regularization setting for regression, similar to the complexity factor. Epsilon specifies the allowable residuals, or noise, in the data. Value of the epsilon factor for SVM regression. Default is |
SVMS_KERNEL_FUNCTION | | Kernel for Support Vector Machine: linear or Gaussian. The default value is SVMS_LINEAR. |
SVMS_NUM_ITERATIONS | Positive integer | Sets an upper limit on the number of SVM iterations. The default is system determined because it depends on the SVM solver. |
 | Range [ | Sets an upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is |
 | | The desired rate of outliers in the training data. Valid for One-Class SVM models only (Anomaly Detection). The default value is |
 | | Controls the type of regularization that the SGD SVM solver uses. The setting applies only to linear SVM models. The default value is system determined because it depends on the potential model size. |
SVMS_SOLVER | | Allows the user to choose the SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined. |
 | | Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases. This setting applies only to the Gaussian kernel. The default value is estimated from the data by the algorithm. |
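Several settings in the table are valid only in combination with others. For example, the SGD solver cannot be used with a non-linear kernel, and the outlier rate is valid for anomaly detection only. The following is a minimal sketch of a pre-flight check for two such rules. The helper check_svm_settings is hypothetical and is not part of the oml package. The names SVMS_SOLVER_SGD, SVMS_GAUSSIAN, SVMS_OUTLIER_RATE, and the mining function string 'anomaly_detection' are assumptions based on Oracle naming conventions, so verify them with help(oml.svm) before relying on them.

```python
def check_svm_settings(mining_function, settings):
    """Return a list of conflicts among SVM settings (empty if none found).

    Hypothetical helper: it only encodes two rules stated in the
    settings table and does not contact the database.
    """
    problems = []

    # The table gives SVMS_LINEAR as the default kernel.
    kernel = settings.get("SVMS_KERNEL_FUNCTION", "SVMS_LINEAR")
    is_linear = kernel == "SVMS_LINEAR"

    # Rule: the SGD solver cannot be selected if the kernel is non-linear.
    # "SVMS_SOLVER_SGD" is an assumed constant name.
    if settings.get("SVMS_SOLVER") == "SVMS_SOLVER_SGD" and not is_linear:
        problems.append("SGD solver requires a linear kernel")

    # Rule: the outlier rate is valid for one-class SVM models only.
    # "SVMS_OUTLIER_RATE" and "anomaly_detection" are assumed names.
    if "SVMS_OUTLIER_RATE" in settings and mining_function != "anomaly_detection":
        problems.append("outlier rate applies to anomaly detection only")

    return problems

# A Gaussian kernel combined with the SGD solver is reported as a conflict.
conflicts = check_svm_settings(
    "classification",
    {"SVMS_KERNEL_FUNCTION": "SVMS_GAUSSIAN", "SVMS_SOLVER": "SVMS_SOLVER_SGD"},
)
print(conflicts)
```

Running a check like this before building a model gives a readable message on the client side instead of an error from the database during the build.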
Example 9-18 Using the oml.svm Class
This example demonstrates the use of various methods of the oml.svm class. In the listing for this example, some of the output is not shown, as indicated by ellipses.
import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

# Drop the IRIS table if it already exists.
try:
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create an SVM model object.
svm_mod = oml.svm('classification',
                  svms_kernel_function =
                  'dbms_data_mining.svms_linear')

# Fit the SVM model according to the training data and parameter
# settings.
svm_mod.fit(train_x, train_y)

# Use the model to make predictions on test data.
svm_mod.predict(test_dat.drop('Species'),
                supplemental_cols = test_dat[:, ['Sepal_Length',
                                                 'Sepal_Width',
                                                 'Petal_Length',
                                                 'Species']])

# Return the prediction probability.
svm_mod.predict(test_dat.drop('Species'),
                supplemental_cols = test_dat[:, ['Sepal_Length',
                                                 'Sepal_Width',
                                                 'Species']],
                proba = True)

# Make predictions and return the probability for each class
# on new data.
svm_mod.predict_proba(test_dat.drop('Species'),
                      supplemental_cols = test_dat[:, ['Sepal_Length',
                                                       'Sepal_Width',
                                                       'Species']],
                      topN = 1).sort_values(by = ['Sepal_Length', 'Sepal_Width'])

# Score the model on the test data.
svm_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
...                            {0: 'setosa', 1: 'versicolor',
...                             2:'virginica'}[x], iris.target)),
...                  columns = ['Species'])
>>>
>>> try:
...     oml.drop('IRIS')
... except:
...     pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # Create an SVM model object.
... svm_mod = oml.svm('classification',
... svms_kernel_function =
... 'dbms_data_mining.svms_linear')
>>>
>>> # Fit the SVM model according to the training data and parameter
... # settings.
>>> svm_mod.fit(train_x, train_y)
Algorithm Name: Support Vector Machine

Mining Function: CLASSIFICATION

Target: Species

Settings:
                   setting name                 setting value
0                     ALGO_NAME  ALGO_SUPPORT_VECTOR_MACHINES
1         CLAS_WEIGHTS_BALANCED                           OFF
2                  ODMS_DETAILS                   ODMS_ENABLE
3  ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO
4                 ODMS_SAMPLING         ODMS_SAMPLING_DISABLE
5                     PREP_AUTO                            ON
6           SVMS_CONV_TOLERANCE                         .0001
7          SVMS_KERNEL_FUNCTION                   SVMS_LINEAR

Computed Settings:
             setting name    setting value
0  SVMS_COMPLEXITY_FACTOR               10
1     SVMS_NUM_ITERATIONS               30
2             SVMS_SOLVER  SVMS_SOLVER_IPM

Global Statistics:
   attribute name  attribute value
0       CONVERGED              YES
1      ITERATIONS               14
2        NUM_ROWS              104

Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

COEFFICIENTS:
    TARGET_VALUE  ATTRIBUTE_NAME  ATTRIBUTE_SUBNAME  ATTRIBUTE_VALUE     COEF
0         setosa    Petal_Length               None             None  -0.5809
1         setosa     Petal_Width               None             None  -0.7736
2         setosa    Sepal_Length               None             None  -0.1653
3         setosa     Sepal_Width               None             None   0.5689
4         setosa            None               None             None  -0.7355
5     versicolor    Petal_Length               None             None   1.1304
6     versicolor     Petal_Width               None             None  -0.3323
7     versicolor    Sepal_Length               None             None  -0.8877
8     versicolor     Sepal_Width               None             None  -1.2582
9     versicolor            None               None             None  -0.9091
10     virginica    Petal_Length               None             None   4.6042
11     virginica     Petal_Width               None             None   4.0681
12     virginica    Sepal_Length               None             None  -0.7985
13     virginica     Sepal_Width               None             None  -0.4328
14     virginica            None               None             None  -5.3180

>>> # Use the model to make predictions on test data.
... svm_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
     Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
0             4.9          3.0           1.4      setosa      setosa
1             4.9          3.1           1.5      setosa      setosa
2             4.8          3.4           1.6      setosa      setosa
3             5.8          4.0           1.2      setosa      setosa
...           ...          ...           ...         ...         ...
44            6.7          3.3           5.7   virginica   virginica
45            6.7          3.0           5.2   virginica   virginica
46            6.5          3.0           5.2   virginica   virginica
47            5.9          3.0           5.1   virginica   virginica

>>> # Return the prediction probability.
... svm_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
     Sepal_Length  Sepal_Width     Species  PREDICTION  PROBABILITY
0             4.9          3.0      setosa      setosa     0.761886
1             4.9          3.1      setosa      setosa     0.805510
2             4.8          3.4      setosa      setosa     0.920317
3             5.8          4.0      setosa      setosa     0.998398
...           ...          ...         ...         ...          ...
44            6.7          3.3   virginica   virginica     0.927706
45            6.7          3.0   virginica   virginica     0.855353
46            6.5          3.0   virginica   virginica     0.799556
47            5.9          3.0   virginica   virginica     0.688024

>>> # Make predictions and return the probability for each class
... # on new data.
>>> svm_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... topN = 1).sort_values(by = ['Sepal_Length', 'Sepal_Width'])
     Sepal_Length  Sepal_Width     Species       TOP_1  TOP_1_VAL
0             4.4          3.0      setosa      setosa   0.698067
1             4.4          3.2      setosa      setosa   0.815643
2             4.5          2.3      setosa  versicolor   0.605105
3             4.8          3.4      setosa      setosa   0.920317
...           ...          ...         ...         ...        ...
44            6.7          3.3   virginica   virginica   0.927706
45            6.9          3.1  versicolor  versicolor   0.378391
46            6.9          3.1   virginica   virginica   0.881118
47            7.0          3.2  versicolor      setosa   0.586393

>>> svm_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.895833
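The value returned by score can be read as a mean accuracy, assuming that score reports the fraction of correctly predicted labels for classification models (an assumption here; confirm it with help(oml.svm)). The following plain-Python sketch shows that calculation on hypothetical labels. The function mean_accuracy is illustrative and is not part of the oml package.

```python
def mean_accuracy(actual, predicted):
    """Return the fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one label is required")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Hypothetical labels: three of the four predictions match.
actual = ['setosa', 'setosa', 'versicolor', 'virginica']
predicted = ['setosa', 'versicolor', 'versicolor', 'virginica']
accuracy = mean_accuracy(actual, predicted)
print(accuracy)

# Under the same assumption, 43 correct predictions out of 48 test rows
# would round to the score shown in the listing.
print(round(43 / 48, 6))
```

The test set in the listing has 48 rows (indexes 0 through 47), so a score of 0.895833 is consistent with 43 correct predictions.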