Oracle Data Mining Application Developer's Guide 10g Release 1 (10.1) Part Number B10699-01


This section describes special considerations that apply to clustering models, SVM models, and NMF models.
ODM supports two algorithms for clustering: k-Means and O-Cluster.
The two algorithms treat data differently. This section discusses important considerations about data for clustering.
Binary attributes should have data mining type as follows:
You can either bin the data manually or let the algorithm do the binning. For k-Means, it is usually best to let the algorithm do the binning. If you bin manually, the first bin number must be 1. We recommend using the same number of bins per attribute so that all attributes have the same scale in the distance computation. For example, if age is binned into 20 bins (1 ... 20) and there is a binary attribute such as gender, the binary attribute should be binned as 1 and 20 rather than 1 and 2. If this is not done, the algorithm still works, but the results will be unreliable.
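The equal-scale recommendation above can be sketched as follows. This is an illustration only, not part of the ODM API; the function names and the equal-width binning scheme are assumptions.

```python
# Sketch (not ODM API): manual binning that keeps a binary attribute on
# the same scale as a 20-bin numeric attribute, so that both attributes
# contribute comparably to the k-Means distance computation.

def bin_numeric(values, n_bins=20):
    """Equal-width binning; bin numbers start at 1, as ODM requires."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(n_bins, int((v - lo) / width) + 1) for v in values]

def bin_binary(values, low_label, n_bins=20):
    """Code a binary attribute as 1 or n_bins (e.g., 1 and 20),
    not 1 and 2, to match a numeric attribute binned into n_bins."""
    return [1 if v == low_label else n_bins for v in values]

ages = [18, 25, 33, 47, 62, 80]
genders = ["F", "M", "M", "F", "M", "F"]
print(bin_numeric(ages))          # bin numbers in 1..20
print(bin_binary(genders, "F"))   # 1 or 20, not 1 or 2
```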
You can either bin the data manually or let the algorithm do the binning. For O-Cluster, it is usually best to let the algorithm do the binning. If you bin manually, the first bin number must be 1. For O-Cluster, manual binning should neither oversmooth nor undersmooth the histograms of numeric attributes. The number of bins need not be the same across attributes, but it should be chosen to capture each attribute's value distribution accurately.
This section describes the ways in which you can affect model build quality and performance with SVM models.
The user can influence both the SVM model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings.
Poor choice of settings or data preparation can lead to serious performance degradation. Poor settings choice can also lead to inaccurate models; for example, a model may predict only one class. ODM offers built-in mechanisms that estimate appropriate settings for the problem at hand.
SVM estimates three settings: complexity factor, standard deviation for Gaussian kernels, and epsilon for regression models. These three settings are estimated by default.
Default settings are overridden by specifying a value for the setting when creating the algorithm settings object (Java) or by inserting a row in the settings table for that setting (DBMS_DM).
Default data preparation is overridden by specifying the data as prepared (ODM). In DBMS_DM there is no default data preparation. Data preparation must be specifically invoked.
ODM and DBMS_DM accept two types of predictors: numeric and categorical. In ODM, the logical data specification identifies the data type of each predictor. DBMS_DM identifies all database numeric types as numeric predictors and all database string types as categorical predictors. SVM requires all predictors to be numeric and the range of values restricted to a small interval around 0. Hence numeric predictors are normalized and categorical predictors are exploded (described below).
Normalization of numeric predictors is typically required for two reasons: (1) so that the relative influence of the various predictors on the model is not distorted by their relative scaling, and (2) to avoid computational overflow or underflow. On the first point, note that an SVM model parameter (α_i) applies to an entire training vector, rather than to individual predictors within the vector; hence large individual vector values can dominate. On the second point, note that the kernel computation includes sums of dot products of two potentially large values. Such sums can result in overflow or, as the exponent of a Gaussian, underflow.
Our normalization offerings include z-score and min-max normalization. Each normalization technique has advantages and drawbacks. Z-score has the advantage of not distorting the shape of the distribution. However, z-scored data can still have many large (absolute) values outside the range [-1, 1]. The number and size of large values can differ greatly across attributes, depending on their relative non-normality. In addition, the z-scored range differs from the exploded categorical range {0, 1}. Min-max normalization avoids the large-value issue and has the same range as the categorical data, but it may suffer compression of its dense range in the presence of large extreme values. You can address this min-max drawback with a procedure for handling outliers.
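The trade-off between the two schemes can be sketched as follows. This is illustrative Python, not the ODM implementation; the example income values are invented to show the effect of one extreme value.

```python
# Sketch (not ODM API): the two normalization schemes compared above.
# z-score preserves the shape of the distribution; min-max maps values
# into [0, 1], the same range as exploded categorical predictors.

def zscore(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

def minmax(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

incomes = [20_000, 35_000, 50_000, 75_000, 1_000_000]  # one extreme value
print(zscore(incomes))   # the outlier still lies outside [-1, 1]
print(minmax(incomes))   # bounded, but the dense range is compressed near 0
```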
Categorical predictors are exploded into arrays of indicator variables. This is transparent to the user in both interfaces. For example, a predictor VAR that takes on one of three possible values (A, B, or C) becomes three predictors, which you can think of as VAR_A, VAR_B, and VAR_C. Each exploded predictor is a binary attribute with values in the set {0, 1}. If VAR_A is 1, then VAR = "A"; if VAR_A is 0, then VAR is not equal to "A". For multivalued categorical predictors with no implicit order, explosion, as specified, is the most sensible procedure.
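The explosion of VAR into VAR_A, VAR_B, and VAR_C can be sketched as follows (illustrative Python; the `explode` function is an assumption, not an ODM routine):

```python
# Sketch (not ODM API): exploding a categorical predictor with values
# {A, B, C} into indicator columns VAR_A, VAR_B, VAR_C, each in {0, 1}.

def explode(name, values):
    categories = sorted(set(values))
    return {f"{name}_{c}": [1 if v == c else 0 for v in values]
            for c in categories}

rows = ["A", "C", "B", "A"]
print(explode("VAR", rows))
# VAR_A = [1, 0, 0, 1], VAR_B = [0, 0, 1, 0], VAR_C = [0, 1, 0, 0]
```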
For binary predictors, or multivalued categorical predictors with an implicit order, there are choices. Consider a predictor taking on values in the set {LOW, MEDIUM, HIGH}. You could recode the values to {0, 0.5, 1}, or use some other mapping intended to reflect the relative distance between the categories. The recoded predictor would then be passed as a numeric field.
Binary predictors take on one of two possible values; for example, a binary predictor might take on values from the set {NO, YES}. The decision whether to recode such predictors to {0, 1} depends on whether there is a preferred positive value.
In the Java API, for performance and accuracy, regression targets are normalized internally. In DBMS_DM, you can choose to normalize the target externally. Consider a regression target with large absolute values. Because the predictors are constrained, the coefficients (α_i) must be large, or the number of nonzero coefficients must be large, or both; otherwise it is impossible to predict a large value. The range of the coefficients is restricted by the complexity factor (C) setting. With a small C, the performance, and possibly the accuracy, of SVM can be poor. With a large C, the training time can increase significantly, and the model can be larger than would otherwise be necessary. To avoid this issue, the targets are normalized. Because the goal is to avoid large values, min-max normalization is often a good choice (see the numeric predictor handling discussion above).
Several of the algorithm settings can significantly affect SVM accuracy and performance. The issues and considerations in choosing setting values are discussed briefly in the sections below. In the literature, cross-validation and grid search techniques are often used to select parameters. These methods, while accurate, are very expensive. Rather than saddle users with an expensive procedure, ODM uses computationally inexpensive procedures that fit easily within its implementation framework. The intent is to provide settings that give the model adequate complexity for the problem at hand. If you require greater accuracy and are willing to incur the additional computational expense, the defaults can be used as the starting point for a grid search.
The complexity factor, C, determines the trade-off between minimizing model error on the training data and minimizing model complexity. It serves to avoid overfit (an overly complex model fitting noise in the training data) and underfit (a model that is too simple). Overfit exists if the model fits the training data well but does poorly on held-aside test data. Underfit exists if the model does poorly on both training and test data. If you wish to override the default C after seeing the model result, you can obtain the internally computed default value as a starting point for a grid search. Both the Java API and the PL/SQL interface have methods for getting the complexity factor. The following discussion provides a qualitative description of the impact of C on the model build.
A very large value of C places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes or, in the case of regression, a perfect (epsilon-insensitive; see below) fit. Assuming the data contain at least some noise, this is overfit. A large C leaves the parameters of the model (α_i) unconstrained; that is, the model is complex with a high capacity for fitting data. Conversely, a small C places a low penalty on errors and high constraints on the model parameters, which can lead to underfit.
Standard deviation is an SVM setting applying to Gaussian Kernels only. This parameter, in conjunction with C, affects the tradeoff between error on the training data and generalization. For fixed C, underfit results as the standard deviation gets large, and overfit results as the standard deviation goes to zero.
To facilitate using the default values computed in ODM as a starting point, methods exist to extract the setting values.
Epsilon is an SVM setting applying to regression models only. The parameter separates small errors from large errors. Small errors are considered to have no associated penalty; only large errors are penalized. In the Java API, if the default target normalization is used, the value of epsilon is rescaled accordingly. In DBMS_DM, epsilon needs to be rescaled by the user. By default, epsilon is estimated internally.
The epsilon parameter affects the number of support vectors and, thus, indirectly, the tradeoff between model complexity and generalization (overfit and underfit as possible consequences). An estimate of the standard deviation of the additive noise in the target variable is needed to find an appropriate epsilon value and thus minimize predictive error. The estimate can be based either on domain knowledge or it can be obtained from the residuals of a crude model (e.g., polynomial, KNN) built on the training data.
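The residuals-of-a-crude-model approach above can be sketched as follows. This is an illustration, not the ODM estimation procedure; the k-NN model, function names, and synthetic data are assumptions.

```python
# Sketch: estimating epsilon from the residuals of a crude k-NN
# regressor, one of the approaches suggested in the text. The noise
# std of the target is approximated by the std of the residuals.

import random

def knn_predict(x, xs, ys, k=3):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k

def estimate_epsilon(xs, ys, k=3):
    """Use the std of k-NN residuals as a proxy for the noise level."""
    residuals = [y - knn_predict(x, xs, ys, k) for x, y in zip(xs, ys)]
    n = len(residuals)
    mean = sum(residuals) / n
    return (sum((r - mean) ** 2 for r in residuals) / n) ** 0.5

random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + random.gauss(0, 0.5) for x in xs]  # additive noise, std 0.5
print(estimate_epsilon(xs, ys))  # roughly on the order of the noise std
```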
The most expensive operation in building a Gaussian SVM model is the computation of kernels. The general approach to the build is to converge within one chunk of data at a time, then test for violators outside the chunk. The build is complete when there are no more violators within tolerance. The chunk size is chosen so that the associated kernels can be maintained in memory in a kernel cache. The larger the chunk size, the better the chunk presumably represents the population of training data, and the fewer times new chunks need to be created. Generally, larger caches imply faster builds. The default cache size is 50 MB.
Tolerance is the maximum size of a violation of convergence criteria such that the model is considered to have converged. The default value is 0.001. Larger values imply faster building but less accurate models.
Traditionally, as part of standard numerical analysis, matrix factorization is a common preprocessing procedure prior to solving a linear system of equations. For data mining, matrix factorization offers a way to reduce the dimensionality of a dataset and extract features that reveal interesting structure in the data or provide inputs to further types of analysis. In matrix factorization, the number of independent columns in the dataset is reduced by projection onto a lower-dimensional space (that is, smaller matrices).
Rank reduction by factorization can reveal interesting low-dimensional subspaces embedded in high-dimensional dataset spaces and is a useful operation for pattern discovery and feature extraction. For example, traditional Principal Component Analysis (PCA) projects the data onto the dimensions along which it varies the most and can be used to visualize the most dominant structure in a dataset.
Nonnegative matrix factorization (NMF), by imposing nonnegativity constraints on the factors, has been shown to be a useful decomposition and feature extraction method in fields such as object detection and recognition, and to be a valuable alternative to PCA^{Foot 1}. By forcing a dataset (matrix) to "fit" into a product of smaller datasets (matrices), NMF compresses the data and tends to eliminate some of the redundancies and expose the most common patterns. Because it uses a parts-based or component-based decomposition, and in contrast to PCA and other techniques, the compressed version of the data is not random looking and can be used to understand interesting patterns and common trends in the dataset. The NMF decomposition also induces a numerical taxonomy that can be used for grouping the rows or columns of the original dataset. The extracted features can be used as inputs to other analysis tasks such as classification or indexing. This procedure has proven useful in face recognition problems and in the discovery of semantic features in text^{Foot 2}.
Given an N (rows) x M (columns) dataset A and k < N, M, NMF computes an approximation of the original data, A ~ W H, where W is N x k and H is k x M. Starting from random initial conditions, W and H are iteratively updated until convergence to a local minimum, monitored by the minimization of the Euclidean cost function. A must have nonnegative entries, and W and H are nonnegative by construction. Even though localization is not an explicit property of the algorithm, NMF appears to produce quite localized and sparse features that facilitate the interpretation of results and the transparency of the model. For example, when NMF is applied to a dataset of facial images, the extracted features are facial parts: eyes, noses, and so on. When the dataset is a document/keyword matrix, NMF extracts "semantic" features^{Foot 3}.
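The iterative update described above can be sketched with the standard multiplicative update rules of Lee and Seung (the work cited in the footnotes). This is an illustration of the technique under the Euclidean cost, not the ODM implementation; plain lists and loops are used for self-containment, where a real implementation would use a numeric library.

```python
# Sketch: NMF by multiplicative updates minimizing ||A - W H||^2.
# A (N x M) must be nonnegative; W (N x k) and H (k x M) stay
# nonnegative by construction, since the updates only multiply
# by nonnegative ratios.

import random

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def nmf(A, k, iters=500, tiny=1e-9, seed=0):
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H)
        WtA = matmul(transpose(W), A)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtA[i][j] / (WtWH[i][j] + tiny)
              for j in range(m)] for i in range(k)]
        # W <- W * (A H^T) / (W H H^T)
        AHt = matmul(A, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * AHt[i][j] / (WHHt[i][j] + tiny)
              for j in range(k)] for i in range(n)]
    return W, H

# A 4 x 3 nonnegative matrix of rank 2, approximated with k = 2 features.
A = [[1, 0, 2], [0, 1, 1], [2, 0, 4], [0, 2, 2]]
W, H = nmf(A, k=2)
WH = matmul(W, H)
err = sum((A[i][j] - WH[i][j]) ** 2
          for i in range(len(A)) for j in range(len(A[0])))
print(err)  # small reconstruction error after convergence
```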
When NMF is used as a feature extractor, it can benefit from scaling individual attributes to a common scale via normalization. This facilitates pattern discovery and makes dimensionality reduction more effective. A preferred approach is to perform outlier treatment (for example, a winsorizing transformation) and then perform min-max normalization. Having individual attributes on a common scale helps the ranking and interpretation of feature coefficients in terms of their relative magnitude. Additionally, applying a normalization transformation allows NMF to operate on negative data as well; otherwise, the data need to be shifted to the positive range manually on a per-attribute basis.
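The winsorize-then-normalize preprocessing above can be sketched as follows (illustrative Python, not an ODM transformation; the 10th/90th percentile cutoffs are an assumption):

```python
# Sketch: per-attribute outlier treatment (winsorizing) followed by
# min-max normalization to [0, 1], the preparation suggested before
# using NMF. Note that a negative input value also lands in [0, 1].

def winsorize(values, lower_pct=10, upper_pct=90):
    """Clip values at the given lower/upper percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[n * lower_pct // 100]
    hi = ordered[-1 - n * (100 - upper_pct) // 100]
    return [min(max(v, lo), hi) for v in values]

def minmax(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

raw = [-3, 0, 1, 2, 2, 3, 3, 4, 5, 120]   # a negative value and an outlier
prepared = minmax(winsorize(raw))
print(prepared)  # all values in [0, 1]; the outlier no longer dominates
```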
If NMF is used to cluster column instances (for example, by using the amplitudes from the rows of H), then, depending on the nature of the problem, you may consider normalizing the rows, the columns, or both, prior to using NMF.
^{1} Daniel D. Lee and H. Sebastian Seung, "Learning the parts of objects by non-negative matrix factorization." Nature, Oct 21, 1999, 401(6755), 788-793.
^{2} Same as footnote 1, above.
^{3} Same as footnote 1, above.