Oracle® Data Mining Concepts 11g Release 2 (11.2) E1680807 


This chapter describes Non-Negative Matrix Factorization (NMF), the unsupervised algorithm used by Oracle Data Mining for feature extraction.
Note:
Non-Negative Matrix Factorization (NMF) is described in the paper "Learning the Parts of Objects by Non-Negative Matrix Factorization" by D. D. Lee and H. S. Seung in Nature (401, pages 788-791, 1999).
Non-Negative Matrix Factorization is a state-of-the-art feature extraction algorithm. NMF is useful when there are many attributes and the attributes are ambiguous or have weak predictability. By combining attributes, NMF can produce meaningful patterns, topics, or themes.
Each feature created by NMF is a linear combination of the original attribute set. Each feature has a set of coefficients, which are a measure of the weight of each attribute on the feature. There is a separate coefficient for each numerical attribute and for each distinct value of each categorical attribute. The coefficients are all nonnegative.
Non-Negative Matrix Factorization uses techniques from multivariate analysis and linear algebra. It represents the data as a matrix M and decomposes M into the product of two lower-rank matrices, W and H. The matrix W contains the NMF basis; the matrix H contains the associated coefficients (weights).
The algorithm iteratively modifies the values of W and H so that their product approaches M. The technique preserves much of the structure of the original data and guarantees that both basis and weights are nonnegative. The algorithm terminates when the approximation error converges or a specified number of iterations is reached.
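The iteration can be sketched in a few lines of Python. This is only an illustration, not Oracle's implementation: it assumes the multiplicative update rules from the Lee and Seung paper cited above, measures the approximation error with the Frobenius norm, and treats "converged" as a relative change in error below the tolerance. The function names (`nmf`, `matmul`, `frobenius_error`) are hypothetical; the default tolerance, iteration count, and seed mirror the defaults listed later in this chapter.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(m, w, h):
    """Frobenius norm of M - W*H (the approximation error)."""
    wh = matmul(w, h)
    return sum((m[i][j] - wh[i][j]) ** 2
               for i in range(len(m)) for j in range(len(m[0]))) ** 0.5

def nmf(m, num_features, tolerance=0.05, max_iterations=50, seed=1):
    """Factor a nonnegative matrix M into W * H (assumed update rule: Lee-Seung)."""
    rng = random.Random(seed)
    rows, cols = len(m), len(m[0])
    eps = 1e-9  # guards against division by zero
    # Seeded starting point drawn from a uniform distribution.
    w = [[rng.uniform(0.01, 1.0) for _ in range(num_features)] for _ in range(rows)]
    h = [[rng.uniform(0.01, 1.0) for _ in range(cols)] for _ in range(num_features)]
    previous = error = frobenius_error(m, w, h)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # H <- H * (W^T M) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, m), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(cols)]
             for k in range(num_features)]
        # W <- W * (M H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(m, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(num_features)]
             for i in range(rows)]
        error = frobenius_error(m, w, h)
        # Stop when the error has stopped changing (assumed convergence test).
        if abs(previous - error) <= tolerance * max(previous, eps):
            break
        previous = error
    return w, h, error, iteration

M = [[5.0, 3.0, 1.0], [4.0, 2.0, 1.0], [1.0, 1.0, 5.0], [1.0, 2.0, 4.0]]
W, H, error, iterations = nmf(M, num_features=2)
print(iterations, round(error, 3))
```

Because every update multiplies a nonnegative value by a ratio of nonnegative values, W and H never become negative, which is the guarantee described above.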
The NMF algorithm must be initialized with a seed to indicate the starting point for the iterations. Because of the high dimensionality of the processing space and the fact that there is no global minimization algorithm, the appropriate initialization can be critical in obtaining meaningful results. Oracle Data Mining uses a random seed that initializes the values of W and H based on a uniform distribution. This approach works well in most cases.
NMF can be used as a dimensionality reduction preprocessing step in classification, regression, clustering, and other mining tasks. Scoring an NMF model produces data projections in the new feature space. The magnitude of a projection indicates how strongly a record maps to a feature.
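As an illustration of what a projection is, the sketch below maps one record onto a fixed basis. It is not Oracle's documented scoring procedure: it assumes that each column of the hypothetical `basis` matrix is one feature (attributes by features), and it finds nonnegative weights with the same kind of multiplicative update used in the Lee and Seung paper, which corresponds to scoring with negative values disallowed. The function name `project` and the sample data are made up for the example.

```python
def project(record, basis, iterations=100):
    """Nonnegative weights h such that basis * h approximates the record."""
    n_attr, k = len(basis), len(basis[0])
    eps = 1e-9  # guards against division by zero
    # Precompute basis^T * record and basis^T * basis.
    wtv = [sum(basis[a][f] * record[a] for a in range(n_attr)) for f in range(k)]
    wtw = [[sum(basis[a][f] * basis[a][g] for a in range(n_attr)) for g in range(k)]
           for f in range(k)]
    h = [1.0] * k
    for _ in range(iterations):
        # h <- h * (W^T v) / (W^T W h), element-wise, with the basis held fixed
        h = [h[f] * wtv[f] / (sum(wtw[f][g] * h[g] for g in range(k)) + eps)
             for f in range(k)]
    return h

# Two features over four attributes: feature 0 uses the first two attributes,
# feature 1 uses the last two.
basis = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
record = [2.0, 2.0, 0.0, 0.0]
projection = project(record, basis)
strongest = max(range(len(projection)), key=lambda f: projection[f])
print(projection, strongest)
```

The record has values only for the attributes that make up feature 0, so its projection on feature 0 is large and its projection on feature 1 is zero.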
NMF is especially well-suited for text mining. In a text document, the same word can occur in different places with different meanings. For example, "hike" can be applied to the outdoors or to interest rates. By combining attributes, NMF introduces context, which is essential for explanatory power.
See Also:
"Text Feature Extraction"
Oracle Data Mining supports five configurable parameters for NMF. All of them have default values that are appropriate for most applications of the algorithm. The NMF settings are:
Number of features. By default, the number of features is determined by the algorithm.
Convergence tolerance. The default is 0.05.
Number of iterations. The default is 50.
Random seed. The default is 1.
Nonnegative scoring. You can specify whether negative numbers should be allowed in scoring results. By default they are allowed.
See Also:
Oracle Database PL/SQL Packages and Types Reference for information about model settings
Automatic Data Preparation normalizes numerical attributes for NMF.
When there are missing values in columns with simple data types (not nested), NMF interprets them as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean.
When there are missing values in nested columns, NMF interprets them as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
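The two treatments can be sketched as follows. This is an illustrative Python rendering of the rules described above, not Oracle code: missing values are represented as `None`, a nested (sparse) row is represented as a dictionary that lists only the attributes that are present, and all function names are hypothetical.

```python
from statistics import mean, mode

def fill_missing_numeric(values):
    """Missing at random: replace missing numerical values with the mean."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]

def fill_missing_categorical(values):
    """Missing at random: replace missing categorical values with the mode."""
    present = [v for v in values if v is not None]
    fill = mode(present)
    return [fill if v is None else v for v in values]

def densify_sparse_numeric(nested_row, attribute_names):
    """Sparse data: attributes absent from a nested row are treated as zero."""
    return [nested_row.get(name, 0.0) for name in attribute_names]

ages = fill_missing_numeric([1.0, None, 3.0])
colors = fill_missing_categorical(["a", "b", None, "a"])
term_counts = densify_sparse_numeric({"hike": 2.0}, ["hike", "mountain"])
print(ages, colors, term_counts)
```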
If you choose to manage your own data preparation, keep in mind that outliers can significantly impact NMF. Use a clipping transformation before binning or normalizing. NMF typically benefits from normalization. However, min-max normalization applied to data that contains outliers compresses the remaining values into a narrow range, which leads to poor matrix factorization. To compensate, you must decrease the error tolerance, which in turn leads to longer build times.
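The effect of an outlier on min-max normalization, and the benefit of clipping first, can be seen in a small Python sketch. The clipping shown here is a simple winsorizing-style rule chosen for illustration (values in each tail are pulled in to the nearest retained value); it is an assumption, not a description of Oracle's clipping transformation, and the function names are hypothetical.

```python
def clip_tails(values, tail_fraction=0.2):
    """Pull values in the lowest and highest tails in to the nearest kept value."""
    ordered = sorted(values)
    k = int(len(ordered) * tail_fraction)
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [min(max(v, low), high) for v in values]

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

raw = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is an outlier
unclipped = min_max(raw)            # ordinary values are squeezed near zero
clipped = min_max(clip_tails(raw))  # ordinary values spread over [0, 1]
print(unclipped)
print(clipped)
```

Without clipping, the four ordinary values all land below 0.05 and become nearly indistinguishable; after clipping they span the full normalized range.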
See Also:
Chapter 19, "Automatic and Embedded Data Preparation"
Oracle Data Mining Application Developer's Guide for information about nested columns and missing data