This chapter describes Non-Negative Matrix Factorization (NMF), the unsupervised algorithm used by Oracle Data Mining for feature extraction.
Note:
Non-Negative Matrix Factorization (NMF) is described in the paper "Learning the Parts of Objects by Non-Negative Matrix Factorization" by D. D. Lee and H. S. Seung in Nature (401, pages 788-791, 1999).
Non-Negative Matrix Factorization is a state-of-the-art feature extraction algorithm. NMF is useful when there are many attributes and the attributes are ambiguous or have weak predictability. By combining attributes, NMF can produce meaningful patterns, topics, or themes.
Each feature created by NMF is a linear combination of the original attribute set. Each feature has a set of coefficients, which are a measure of the weight of each attribute on the feature. There is a separate coefficient for each numerical attribute and for each distinct value of each categorical attribute. The coefficients are all non-negative.
Non-Negative Matrix Factorization uses techniques from multivariate analysis and linear algebra. It decomposes a data matrix M into the product of two lower-rank matrices, W and H. The sub-matrix W contains the NMF basis; the sub-matrix H contains the associated coefficients (weights).
The algorithm iteratively modifies the values of W and H so that their product approaches M. The technique preserves much of the structure of the original data and guarantees that both the basis and the weights are non-negative. The algorithm terminates when the approximation error converges or when a specified number of iterations is reached.
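The following sketch illustrates this factorization using NumPy and the multiplicative update rules from the Lee and Seung paper cited in the note at the start of this chapter. It is an illustration only, not the Oracle Data Mining implementation; the function name and its parameters are hypothetical, and the uniform random starting point mirrors the initialization described in the next paragraph.

```python
import numpy as np

def nmf(M, k, n_iter=50, tol=0.05, seed=None):
    """Factor a non-negative matrix M (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    W = rng.uniform(size=(m, k))    # non-negative random starting point
    H = rng.uniform(size=(k, n))
    eps = 1e-9                      # guards against division by zero
    prev_err = np.inf
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ W @ H + eps)   # multiplicative update for H
        W *= (M @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
        err = np.linalg.norm(M - W @ H)        # approximation error
        if abs(prev_err - err) < tol:          # stop when the error converges
            break
        prev_err = err
    return W, H
```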
The NMF algorithm must be initialized with a seed to indicate the starting point for the iterations. Because of the high dimensionality of the processing space and the fact that there is no global minimization algorithm, the appropriate initialization can be critical in obtaining meaningful results. Oracle Data Mining uses a random seed that initializes the values of W and H based on a uniform distribution. This approach works well in most cases.
NMF can be used as a dimensionality reduction pre-processing step in classification, regression, clustering, and other mining tasks. Scoring an NMF model produces data projections in the new feature space. The magnitude of a projection indicates how strongly a record maps to a feature.
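As an illustration of this pre-processing role, the sketch below uses scikit-learn's NMF class as a stand-in for an Oracle Data Mining model: fit_transform returns the projections of the training records, and transform scores new records into the same feature space. The data, parameter values, and variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_train = rng.random((100, 20))    # non-negative training data (100 records, 20 attributes)
X_new = rng.random((5, 20))        # new records to score

model = NMF(n_components=5, init="random", random_state=0, max_iter=500)
train_features = model.fit_transform(X_train)   # projections of the training records
new_features = model.transform(X_new)           # projections of the new records

# Each row of new_features places a record in the 5-dimensional feature space;
# a larger value means the record maps more strongly to that feature.
```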
NMF is especially well-suited for text mining. In a text document, the same word can occur in different places with different meanings. For example, "hike" can be applied to the outdoors or to interest rates. By combining attributes, NMF introduces context, which is essential for explanatory power.
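The sketch below shows this idea with open-source tools rather than Oracle Data Mining: documents are converted to a non-negative term matrix with scikit-learn's TfidfVectorizer, and NMF extracts features that group co-occurring terms, so the two senses of "hike" can end up contributing to different features. The example documents and parameter choices are illustrative assumptions.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "we hike the mountain trail every weekend",
    "the scenic trail is a long hike outdoors",
    "the central bank may hike interest rates",
    "analysts expect another rate hike this quarter",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                       # non-negative document-term matrix
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(X)                          # document-to-feature projections
H = model.components_                               # feature-to-term coefficients

# Print the three highest-weighted terms for each extracted feature.
terms = tfidf.get_feature_names_out()
for i, row in enumerate(H):
    top = [terms[j] for j in row.argsort()[::-1][:3]]
    print(f"feature {i}: {', '.join(top)}")
```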
See Also:
"Text Feature Extraction"Oracle Data Mining supports five configurable parameters for NMF. All of them have default values which will be appropriate for most applications of the algorithm. The NMF settings are:
Number of features. By default, the number of features is determined by the algorithm.
Convergence tolerance. The default is 0.05.
Number of iterations. The default is 50.
Random seed. The default is -1.
Non-negative scoring. You can specify whether negative numbers should be allowed in scoring results. By default they are allowed.
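For comparison only, the sketch below shows rough analogues of these settings in scikit-learn's NMF class. The parameter names are scikit-learn's, not the Oracle Data Mining setting names; see the PL/SQL reference cited below for the actual settings.

```python
from sklearn.decomposition import NMF

model = NMF(
    n_components=10,   # number of features (Oracle Data Mining chooses this automatically by default)
    tol=0.05,          # convergence tolerance
    max_iter=50,       # maximum number of iterations
    random_state=42,   # random seed for initialization
    init="random",
)
# Note: scikit-learn has no non-negative scoring switch; its projections are always non-negative.
```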
See Also:
Oracle Database PL/SQL Packages and Types Reference for information about model settings

Automatic Data Preparation normalizes numerical attributes for NMF.
When there are missing values in columns with simple data types (not nested), NMF interprets them as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean.
When there are missing values in nested columns, NMF interprets them as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
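The sketch below mimics the treatment of simple (non-nested) columns described above, using pandas: missing numerical values are replaced with the column mean and missing categorical values with the column mode. It is an illustration of the described behavior, not Oracle Data Mining code, and the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 72_000, 61_000],    # numerical column with a missing value
    "region": ["east", "west", None, "west"],    # categorical column with a missing value
})

df["income"] = df["income"].fillna(df["income"].mean())     # mean replaces missing numbers
df["region"] = df["region"].fillna(df["region"].mode()[0])  # mode replaces missing categories
print(df)
```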
If you choose to manage your own data preparation, keep in mind that outliers can significantly impact NMF. Use a clipping transformation before binning or normalizing. NMF typically benefits from normalization, but outliers combined with min-max normalization result in poor matrix factorization. To compensate, you must decrease the error tolerance, which in turn leads to longer build times.
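A minimal sketch of that preparation, assuming NumPy and percentile-based clipping cutoffs (5th and 95th), which are illustrative choices rather than Oracle defaults: clipping the outlier first keeps min-max normalization from compressing the remaining values into a narrow range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.5, 1.5, 250.0])   # one extreme outlier

lo, hi = np.percentile(x, [5, 95])               # clipping thresholds
clipped = np.clip(x, lo, hi)                     # clipping transformation
normalized = (clipped - clipped.min()) / (clipped.max() - clipped.min())  # min-max normalization

print(normalized)   # values now spread across [0, 1] instead of clustering near 0
```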
Chapter 19, "Automatic and Embedded Data Preparation"
Oracle Data Mining Application Developer's Guide for information about nested columns and missing data