This chapter describes data requirements and how the data should be prepared before it is mined using either of the Oracle Data Mining (ODM) interfaces. The data preparation required depends on the type of model that you plan to build and the characteristics of the data. For example, data with attributes that take on a small number of values, that is, that have low cardinality, may not require binning.
In general, users must prepare data before invoking ODM algorithms.
The following topics are addressed:
Data used by ODM consists of tables or views stored in an Oracle database. Both ordinary tables and nested tables can be used as input data. The data used in a data mining operation is often called a data set.
Data has a physical organization and a logical interpretation. Column names refer to physical organization; attribute names, described in the next paragraph, refer to the logical interpretation of the data.
The rows of a data table are often called cases, records, or examples. The columns of the data tables are called attributes or fields; each attribute in a record holds a cell of information. Attribute names are constant from record to record for unnested tables; the values in the attributes can vary from record to record. For example, each record may have an attribute labeled "annual income." The value in the annual income attribute can vary from one record to another.
ODM distinguishes two types of attributes: categorical and numerical. Categorical attributes are those that define their values as belonging to a small number of discrete categories or classes; there is no implicit order associated with the values. If there are only two possible values, for example, yes and no, or male and female, the attribute is said to be binary. If there are more than two possible values, for example, small, medium, large, extra large, the attribute is said to be multiclass.
Numerical attributes are numbers that take on a large number of values that have an order, for example, annual income. For numerical attributes, the differences between values are also ordered. Annual income could theoretically be any value from zero to infinity, though in practice annual income occupies a bounded range and takes on a finite number of values.
You can often transform numerical attributes to categorical attributes. For example, annual income could be divided into three categories: low, medium, high. Conversely, you can explode categorical values to transform them into numerical values.
Classification and Regression algorithms require a target attribute. A supervised model can predict a single target attribute. The target attribute for all classification algorithms can be numerical or categorical. The ODM regression algorithm supports only numerical target attributes.
Certain ODM algorithms support unstructured text attributes. Although unstructured data includes images, audio, video, geospatial mapping data, and documents or text, ODM supports mining text data only. An input table can contain one or more text columns.
ODM data must reside in a single table or view in an Oracle database. The table or view must be a standard relational table, where each case is represented by one row in the table, with each attribute represented by a column in the table. The columns must be of one of the types supported by ODM.
ODM does not support all the data types that Oracle supports. Each attribute (column) in a data set used by ODM must have one of the following data types:
DM_NESTED_NUMERICALS (nested column)
DM_NESTED_CATEGORICALS (nested column)
The supported attribute data types have a default attribute type (categorical or numerical). For details, see Oracle Data Mining Application Developer's Guide.
Nested table columns can be used for capturing in a single table or view data that is distributed over many tables (for example, a star schema). Nested columns allow you to capture one-to-many relationships (for example, one customers can buy many products). Nested tables are required if the data has more than 1000 attributes; nested tables are useful if the data is sparse, or if the data is already persisted in a transactional format and must be passed to the data mining interface through an object view.
Note:The Decision Tree algorithm, described in "Decision Tree Algorithm", does not support nested columns.
For a given case identifier, attribute names must be unique across all the collections and individual columns. The fixed collection types enforce this requirement.
Data tables often contain missing values.
Certain algorithms assume that a
NULL value indicates a missing value; others assume that a
NULL value indicates sparse data, as described in "Sparse Data" .
If an algorithm assumes that NULL values indicate sparse data, then you should treat any values that are true missing values.
Data is said to be sparse if only a small fraction (no more than 20%, often 3% or less) of the attributes are non-zero or non-null for any given case. Sparse data occurs, for example, in market basket problems. In a grocery store, there might be 10,000 products in the store, and the average size of a basket (the collection of distinct items that a customer purchases in a typical transaction) is on average 50 products. In this example, a transaction (case or record) has on average 50 out of 10,000 attributes that are not null. This implies that the fraction of non-zero attributes in the table (or the density) is approximately 50/10,000, or 0.5%. This density is typical for market basket and text mining problems.
Sparse data is represented in a table in such a way that avoids the specification of the most common value to save storage. In such a specification of sparse data, a missing value is implicitly interpreted as the most common value.
Different algorithms make different assumptions about what indicates sparse data. For Support Vector Machine, k-Means, association, and Non-Negative Matrix Factorization,
NULL values indicate sparse data; for all other algorithms,
NULL values indicate missing values. See the description of each algorithm for information about how it interprets
An outlier is a value that is far outside the normal range in a data set, typically a value that is several standard deviations from the mean. The presence of outliers can have a significant impact on certain kinds of ODM models. Naive Bayes, Adaptive Bayes Network, Support Vector Machine, Attribute Importance, either clustering algorithm, and Non-Negative Matrix Factorization are sensitive to outliers.
For example, the presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the ability of the model to detect differences in numerical attributes may be significantly lessened. For example, a numerical attribute such as income may have all the data belonging to a single bin except for one entry (the outlier) that belongs to a different bin.
For outlier treatments, see "Winsorizing and Trimming" .
Data is said to be prepared when certain data transformations required by a data mining algorithm are performed by the user before the algorithm is invoked. For most algorithms, data must be prepared before the algorithm is invoked.
Data preparation can take many forms, such as joining two or more tables so that all required data is in a single table or view, transforming numerical attributes by applying numerical functions to them, recoding attributes, treating missing values, treating outliers, omitting selected columns for a training data set, and so forth.
ODM includes transformations that perform the following data-mining-specific transformations:
Winsorizing and trimming to treat outliers
Discretization to reduce the number of distinct values for an attribute
Normalization to convert attribute values to a common range
Winsorizing involves setting the tail values of an attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% of values are set equal to the minimum value in the 6th percentile, while the upper 5% are set equal to the maximum value in the 95th percentile.
Some ODM algorithms may benefit from binning (discretizing) both numeric and categorical data. Naive Bayes, Adaptive Bayes Network, Clustering, Attribute Importance, and Association Rules algorithms may benefit from binning.
Binning means grouping related values together, thus reducing the number of distinct values for an attribute. Having fewer distinct values typically leads to a more compact model and one that builds faster. Binning must be performed carefully. Proper binning can improve model accuracy; improper binning can lead to loss in accuracy.
Top N most frequent items: This technique is used to bin categorical values. The bin definition for each attribute is computed based on the occurrence frequency of values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin_1,..., bin_N corresponds to the values with top frequencies. The bin bin_N+1 corresponds to all remaining values.
Equi-Width Binning: This technique is used to bin numerical values. For numerical attributes, ODM finds the minimum (min) and maximum (max) values for every attribute in the data. Then ODM divides the [min, max] range into N equal bins of size d=(max-min)/N. Thus bin 1 is [min, min+d), bin 2 is [min+d, min+2d), and bin N is [min+(N-1)*d,max]. The number of bins can either be specified by the user or calculated by the transformation.
Quantile Binning: This technique is used to bin numerical values. The definition for each relevant attribute is computed based on the minimum values for each quantile, where quantiles are computed from the data using
NTILE function. Bins bin_1,..., bin_N span the following ranges: bin_1 spans [min_1,min_2]; bin_2,..., bin_i,..., bin_N-1 span (min_i, min_(i+1)] and bin_N spans (min_N, max_N]. Bins with equal left and right boundaries are collapsed.
Normalization converts individual numerical attributes so that each attribute's values lie in the same range. Values are converted to be in the range 0.0 to 1.0 or the range -1.0 to 1.0. Normalization ensures that attributes do not receive artificial weighting caused by differences in the ranges that they span. Some algorithms, such as k-Means, Support Vector Machine, and Non-Negative Matrix Factorization, benefit from normalization.