Attributes are the items of data that are used in data mining. In predictive models, attributes are the predictors that affect a given outcome. In descriptive models, attributes are the items of information being analyzed for natural groupings or associations. For example, a table of employee data might contain attributes such as job title, date of hire, salary, age, and gender.
3.2.1 Data Attributes and Model Attributes
Data attributes are columns in the data set used to build, test, or score a model. Model attributes are the data representations used internally by the model.
Data attributes and model attributes can be the same. For example, a column called SIZE, with values such as L, is an attribute used by an algorithm to build a model. Internally, the model attribute SIZE is most likely the same as the data attribute from which it was derived.
On the other hand, a nested column SALES_PROD, containing the sales figures for a group of products, does not correspond to a model attribute. The data attribute can be SALES_PROD, but each product with its corresponding sales figure (each row in the nested column) is a model attribute.
Transformations also cause a discrepancy between data attributes and model attributes. For example, a transformation can apply a calculation to two data attributes and store the result in a new attribute. The new attribute is a model attribute that has no corresponding data attribute. Other transformations, such as binning, normalization, and outlier treatment, cause the model's representation of an attribute to differ from the data attribute in the case table.
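The discrepancy can be pictured with a small Python sketch. It is only an illustration and does not show how Oracle Data Mining works internally: the column names (LENGTH, WIDTH, AGE), the derived attribute AREA, and the bin boundaries are all hypothetical.

```python
# Illustrative only: hypothetical columns and bin edges, not Oracle internals.

def bin_age(age):
    """Map a numeric age to a coarse bin label (hypothetical boundaries)."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"


def to_model_attributes(row):
    """Turn one case-table row (data attributes) into model attributes."""
    return {
        # Calculated from two data attributes; no matching data attribute exists.
        "AREA": row["LENGTH"] * row["WIDTH"],
        # Same name as the data attribute, but a different representation.
        "AGE": bin_age(row["AGE"]),
    }


data_row = {"LENGTH": 4, "WIDTH": 3, "AGE": 42}
model_row = to_model_attributes(data_row)
```

Here AREA exists only on the model side, while AGE keeps its name but holds a bin label rather than the original number.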
3.2.2 Target Attribute
Understand what a target means in data mining and understand the different target data types.
The target of a supervised model is a special kind of attribute. The target column in the training data contains the historical values used to train the model. The target column in the test data contains the historical values to which the predictions are compared. The act of scoring produces a prediction for the target.
Clustering, Feature Extraction, Association, and Anomaly Detection models do not use a target.
Nested columns and columns of unstructured data (such as BLOB) cannot be used as targets. Target attributes must have a simple data type.
Table 3-1 Target Data Types
You can query the *_MINING_MODEL_ATTRIBUTES view to find the target for a given model.
3.2.3 Numericals, Categoricals, and Unstructured Text
Explains numeric, categorical, and unstructured text attributes.
Numerical attributes can theoretically have an infinite number of values. The values have an implicit order, and the differences between them are also ordered. Oracle Data Mining interprets DM_NESTED_BINARY_FLOATS as numerical.
Categorical attributes have values that identify a finite number of discrete categories or classes. There is no implicit order associated with the values. Some categoricals are binary: they have only two possible values, such as yes or no, or male or female. Other categoricals are multi-class: they have more than two values, such as small, medium, and large.
Oracle Data Mining interprets VARCHAR2 columns as categorical by default; however, these columns may also be identified as columns of unstructured data (text). Oracle Data Mining interprets columns of DM_NESTED_CATEGORICALS as categorical. Columns of BFILE always contain unstructured data.
The target of a classification model is categorical. (If the target of a classification model is numeric, it is interpreted as categorical.) The target of a regression model is numerical. The target of an attribute importance model is either categorical or numerical.
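The interpretation rules above can be summarized in a short Python sketch. It is a reading aid only: the lookup table covers just the data types named in this section, and the `treat_as_text` flag and both function names are hypothetical, not part of any Oracle API.

```python
# Illustrative only: covers just the data types named in this section.

DEFAULT_INTERPRETATION = {
    "DM_NESTED_BINARY_FLOATS": "numerical",
    "VARCHAR2": "categorical",  # default; may instead be identified as text
    "DM_NESTED_CATEGORICALS": "categorical",
    "BFILE": "unstructured",
}


def interpret(data_type, treat_as_text=False):
    """Return how a column of the given data type is interpreted."""
    if data_type == "VARCHAR2" and treat_as_text:
        return "unstructured"
    return DEFAULT_INTERPRETATION[data_type]


def target_type(mining_function, attribute_type):
    """A classification target is categorical even if its column is numeric."""
    if mining_function == "classification":
        return "categorical"
    return attribute_type
```

For instance, `target_type("classification", "numerical")` yields `"categorical"`, mirroring the rule that a numeric classification target is interpreted as categorical.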
3.2.4 Model Signature
The model signature is the set of data attributes that are used to build a model. Some or all of the attributes in the signature must be present for scoring. The model accounts for any missing columns on a best-effort basis. If columns with the same names but different data types are present, the model attempts to convert the data type. If extra, unused columns are present, they are disregarded.
The model signature does not necessarily include all the columns in the build data. Algorithm-specific criteria can cause the model to ignore certain columns. Other columns can be eliminated by transformations. Only the data attributes actually used to build the model are included in the signature.
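The scoring behavior described above (missing columns tolerated, mismatched types converted, extra columns disregarded) can be sketched in Python. This is a toy model of the idea, not Oracle's scoring logic; the function name, the use of Python types to stand in for column data types, and the sample columns are all assumptions.

```python
# Illustrative only: a toy signature check, not Oracle's scoring logic.

def align_to_signature(signature, row):
    """Align a scoring row to a model signature.

    signature: dict mapping attribute name -> expected Python type
    row:       dict mapping column name -> value supplied for scoring
    Returns (aligned_row, missing_names).
    """
    aligned = {}
    missing = []
    for name, expected_type in signature.items():
        if name not in row:
            # Missing column: record it and carry on, best effort.
            aligned[name] = None
            missing.append(name)
            continue
        value = row[name]
        if not isinstance(value, expected_type):
            # Same name, different type: attempt a conversion.
            # A value that cannot be converted raises an exception here.
            value = expected_type(value)
        aligned[name] = value
    # Columns in `row` that are not in the signature are never copied.
    return aligned, missing


signature = {"AGE": int, "SALARY": float, "JOB_TITLE": str}
scoring_row = {"AGE": "42", "JOB_TITLE": "Analyst", "BADGE_COLOR": "blue"}
aligned, missing = align_to_signature(signature, scoring_row)
```

In this run, AGE is converted from a string to an integer, SALARY is reported as missing, and the unused BADGE_COLOR column is dropped.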
3.2.5 Scoping of Model Attribute Name
The column_name component is the name of the data attribute. It is present in all model attribute names. Nested attributes and text attributes also have a subcolumn_name component, as shown in the following example.
Example 3-2 Model Attributes Derived from a Nested Column
SALESPROD(ATTRIBUTE_NAME, VALUE)
--------------------------------
((PROD1, 300),
 (PROD2, 245),
 (PROD3, 679))
The name of the data attribute is SALESPROD. Its associated model attributes are:
SALESPROD.PROD1
SALESPROD.PROD2
SALESPROD.PROD3
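The naming rule in this example can be reproduced with a few lines of Python. This is a minimal sketch of the column_name.subcolumn_name convention only; the function name is hypothetical and the nested column is represented as a plain list of pairs.

```python
# Illustrative only: builds names in the column_name.subcolumn_name form.

def model_attribute_names(column_name, nested_rows):
    """Return one scoped model attribute name per row of a nested column.

    nested_rows is a list of (attribute_name, value) pairs.
    """
    return [f"{column_name}.{attr_name}" for attr_name, _value in nested_rows]


salesprod = [("PROD1", 300), ("PROD2", 245), ("PROD3", 679)]
names = model_attribute_names("SALESPROD", salesprod)
```

Each row of the nested column contributes one name, so three rows yield three model attributes.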
3.2.6 Model Details
Model details reveal information about model attributes and their treatment by the algorithm. There is a separate GET_MODEL_DETAILS routine for each algorithm. Oracle recommends that users leverage the model detail views instead.
Transformation and reverse transformation expressions are associated with model attributes. Transformations are applied to the data attributes before the algorithmic processing that creates the model. Reverse transformations are applied to the model attributes after the model has been built, so that the model details are expressed in the form of the original data attributes, or as close to it as possible.
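The pairing of a transformation with its reverse can be illustrated with min-max normalization in Python. This is a sketch under stated assumptions, not the expressions Oracle Data Mining stores: the helper name, the salary range, and the sample value are invented for the illustration.

```python
# Illustrative only: min-max normalization and its reverse, made-up values.

def make_min_max(lo, hi):
    """Return (transform, reverse) functions for min-max normalization."""
    span = hi - lo

    def transform(x):
        # Applied to the data attribute before the model is built.
        return (x - lo) / span

    def reverse(z):
        # Applied afterward so details read in the original units.
        return z * span + lo

    return transform, reverse


transform, reverse = make_min_max(20000, 120000)  # hypothetical salary range
normalized_value = transform(70000)         # what the algorithm would see
reported_value = reverse(normalized_value)  # what model details would show
```

The algorithm works with the normalized value, while the reverse expression lets the reported detail appear in the original salary units.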
Model Details for an Attribute Importance Model
RETURN DM_RANKED_ATTRIBUTES PIPELINED;
The function returns DM_RANKED_ATTRIBUTES, a virtual table. The columns are the model details. There is one row for each model attribute in the specified model. The columns are described as follows: