Attributes are the items of data that are used in machine learning. Attributes are also referred to as variables, fields, or predictors.
In predictive models, attributes are the predictors that affect a given outcome. In descriptive models, attributes are the items of information being analyzed for natural groupings or associations. For example, a table of employee data might contain attributes such as job title, date of hire, salary, age, gender, and so on.
3.2.1 Data Attributes and Model Attributes
Data attributes are columns in the data set used to build, test, or score a model. Model attributes are the data representations used internally by the model.
Data attributes and model attributes can be the same. For example, a column called SIZE, with values such as L, can be an attribute used by an algorithm to build a model. Internally, the model attribute SIZE is most likely the same as the data attribute from which it was derived.
On the other hand, a nested column SALES_PROD, containing the sales figures for a group of products, does not correspond to a model attribute. The data attribute can be SALES_PROD, but each product with its corresponding sales figure (each row in the nested column) is a model attribute.
Transformations also cause a discrepancy between data attributes and model attributes. For example, a transformation can apply a calculation to two data attributes and store the result in a new attribute. The new attribute is a model attribute that has no corresponding data attribute. Other transformations, such as binning, normalization, and outlier treatment, cause the model's representation of an attribute to differ from the data attribute in the case table.
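The following Python sketch illustrates how transformations can make model attributes diverge from data attributes. It is a conceptual illustration only, not OML4SQL code: the column names (TOTAL_SALES, NUM_ORDERS, AGE), the derived attribute name, the bin edges, and both helper functions are hypothetical.

```python
# Conceptual illustration only; not OML4SQL code. All names are hypothetical.

def derive_ratio(row):
    """Combine two data attributes into one new value."""
    return row["TOTAL_SALES"] / row["NUM_ORDERS"]


def bin_value(value, upper_edges):
    """Return the index of the first bin whose upper edge is >= value."""
    for index, upper in enumerate(upper_edges):
        if value <= upper:
            return index
    return len(upper_edges)  # value lies above the last edge


# One row of the case table (data attributes).
case_row = {"TOTAL_SALES": 1200, "NUM_ORDERS": 8, "AGE": 40}

# What a model might see after transformations (model attributes).
model_row = {
    # New attribute with no corresponding data attribute.
    "SALES_PER_ORDER": derive_ratio(case_row),
    # Same name as the data attribute, but a binned representation.
    "AGE": bin_value(case_row["AGE"], [30, 50]),
}
```

Here SALES_PER_ORDER exists only on the model side, while AGE exists on both sides but holds a bin index in the model rather than the raw value.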
3.2.2 Target Attribute
Understand what a target means in machine learning and understand the different target data types.
The target of a supervised model is a special kind of attribute. The target column in the training data contains the historical values used to train the model. The target column in the test data contains the historical values to which the predictions are compared. The act of scoring produces a prediction for the target.
Clustering, feature extraction, association, and anomaly detection models do not use a target.
Nested columns and columns of unstructured data (such as BLOB) cannot be used as targets.
Table 3-1 Target Data Types
|Machine Learning Function|Target Data Types|
You can query the *_MINING_MODEL_ATTRIBUTES view to find the target for a given model.
3.2.3 Numericals, Categoricals, and Unstructured Text
Explains numeric, categorical, and unstructured text attributes.
Numerical attributes can theoretically have an infinite number of values. The values have an implicit order, and the differences between them are also ordered. Oracle Machine Learning for SQL (OML4SQL) interprets DM_NESTED_BINARY_FLOATS as numerical.
Categorical attributes have values that identify a finite number of discrete categories or classes. There is no implicit order associated with the values. Some categoricals are binary: they have only two possible values, such as yes or no, or male or female. Other categoricals are multi-class: they have more than two values, such as small, medium, and large.
OML4SQL interprets VARCHAR2 as categorical by default; however, these columns may also be identified as columns of unstructured data (text). OML4SQL interprets columns of DM_NESTED_CATEGORICALS as categorical. Columns of BFILE always contain unstructured data.
The target of a classification model is categorical. (If the target of a classification model is numeric, it is interpreted as categorical.) The target of a regression model is numerical. The target of an attribute importance model is either categorical or numerical.
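The interpretation rules in this section can be summarized in a small Python sketch. It is an illustration of the rules as stated here, not the product's implementation: the lookup sets contain only the data types named in this section, and the function names and the treat_as_text flag are hypothetical.

```python
# Conceptual illustration only; not OML4SQL code.
# The sets list only the data types named in this section.
NUMERICAL_TYPES = {"DM_NESTED_BINARY_FLOATS"}
CATEGORICAL_TYPES = {"VARCHAR2", "DM_NESTED_CATEGORICALS"}
UNSTRUCTURED_TYPES = {"BFILE", "BLOB"}


def attribute_kind(data_type, treat_as_text=False):
    """Classify a column as numerical, categorical, or unstructured."""
    if data_type in UNSTRUCTURED_TYPES:
        return "unstructured"
    if data_type == "VARCHAR2" and treat_as_text:
        return "unstructured"  # VARCHAR2 explicitly identified as text
    if data_type in CATEGORICAL_TYPES:
        return "categorical"
    if data_type in NUMERICAL_TYPES:
        return "numerical"
    raise ValueError(f"data type not covered by this sketch: {data_type}")


def target_kind(function, column_kind):
    """Return how the target is interpreted for a supervised function."""
    if column_kind not in ("numerical", "categorical"):
        raise ValueError("unstructured columns cannot be targets")
    if function == "classification":
        return "categorical"  # a numeric target is interpreted as categorical
    if function == "regression":
        if column_kind != "numerical":
            raise ValueError("a regression target must be numerical")
        return "numerical"
    if function == "attribute_importance":
        return column_kind  # either kind is accepted
    raise ValueError(f"function does not use a target: {function}")
```

For example, target_kind("classification", "numerical") returns "categorical", which mirrors the rule that a numeric classification target is interpreted as categorical.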
3.2.4 Model Signature
Learn about model signature and the data types that are considered in the build data.
The model signature is the set of data attributes that are used to build a model. Some or all of the attributes in the signature must be present for scoring. The model accounts for any missing columns on a best-effort basis. If columns with the same names but different data types are present, the model attempts to convert the data type. If extra, unused columns are present, they are disregarded.
The model signature does not necessarily include all the columns in the build data. Algorithm-specific criteria can cause the model to ignore certain columns. Other columns can be eliminated by transformations. Only the data attributes actually used to build the model are included in the signature.
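The scoring behavior described above can be pictured with the following Python sketch. It is a simplified analogy, not the way OML4SQL works internally: the signature is modeled as a dictionary of hypothetical column names and Python types, and None stands in for whatever best-effort handling the model applies to a missing column.

```python
# Conceptual illustration only; not OML4SQL code. All names are hypothetical.

def prepare_scoring_row(signature, row):
    """Align one scoring row with a model signature.

    signature: dict mapping attribute name -> expected Python type
    row:       dict mapping column name -> value
    """
    prepared = {}
    for name, expected_type in signature.items():
        if name not in row:
            # Missing column: placeholder for best-effort handling.
            prepared[name] = None
            continue
        value = row[name]
        if value is not None and not isinstance(value, expected_type):
            # Same name, different type: attempt a conversion.
            value = expected_type(value)
        prepared[name] = value
    # Columns in `row` that are not in the signature are never copied,
    # so extra, unused columns are disregarded.
    return prepared


signature = {"AGE": float, "SIZE": str}
scoring_row = {"AGE": "42", "COMMENTS": "n/a"}
prepared = prepare_scoring_row(signature, scoring_row)
```

In this sketch, AGE is converted from a string to a number, SIZE is marked as missing, and COMMENTS is dropped because it is not part of the signature.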
3.2.5 Scoping of Model Attribute Name
Learn about model attribute names.
The column_name component is the name of the data attribute. It is present in all model attribute names. Nested attributes and text attributes also have a subcolumn_name component, as shown in the following example.
Example 3-2 Model Attributes Derived from a Nested Column
SALESPROD(ATTRIBUTE_NAME, VALUE)
--------------------------------
((PROD1, 300), (PROD2, 245), (PROD3, 679))
The name of the data attribute is SALESPROD. Its associated model attributes are:
SALESPROD.PROD1
SALESPROD.PROD2
SALESPROD.PROD3
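The naming rule in Example 3-2 can be expressed as a short Python sketch. This is an illustration only: the function name is hypothetical, and the nested column is represented as a list of (ATTRIBUTE_NAME, VALUE) pairs.

```python
# Conceptual illustration only; not OML4SQL code. The function is hypothetical.

def model_attribute_names(column_name, nested_rows=None):
    """Build model attribute names for one data attribute.

    A plain column yields a single name equal to the column name.
    A nested column yields one column_name.subcolumn_name per nested row.
    """
    if nested_rows is None:
        return [column_name]
    return [f"{column_name}.{subcolumn}" for subcolumn, _value in nested_rows]


# The nested column from Example 3-2 as (ATTRIBUTE_NAME, VALUE) pairs.
salesprod = [("PROD1", 300), ("PROD2", 245), ("PROD3", 679)]
names = model_attribute_names("SALESPROD", salesprod)
```

Calling model_attribute_names("SIZE") with no nested rows returns only the column name, which corresponds to the case where the data attribute and the model attribute are the same.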
3.2.6 Model Details
Model details reveal information about model attributes and their treatment by the algorithm. Oracle recommends that users leverage the model detail views for the respective algorithm.
Transformation and reverse transformation expressions are associated with model attributes. Transformations are applied to the data attributes before the algorithmic processing that creates the model. Reverse transformations are applied to the model attributes after the model has been built, so that the model details are expressed in the form of the original data attributes, or as close to it as possible.
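As a rough illustration of the relationship between a transformation and its reverse, the following Python sketch uses min-max normalization. The choice of normalization, the value range, and the threshold are assumptions made for the example; they do not describe how any particular algorithm stores its details.

```python
# Conceptual illustration only; not OML4SQL code. All values are hypothetical.

def normalize(value, low, high):
    """Transformation: map a data attribute value onto the range [0, 1]."""
    return (value - low) / (high - low)


def reverse_normalize(scaled, low, high):
    """Reverse transformation: map a model value back to original units."""
    return scaled * (high - low) + low


low, high = 20000, 100000

# Applied to the data attribute before the model is built.
scaled_salary = normalize(60000, low, high)

# A value learned by the model lives on the normalized scale...
threshold_in_model = 0.25
# ...and is reversed so that model details use the original units.
threshold_reported = reverse_normalize(threshold_in_model, low, high)
```

With these values, the model works with a salary of 0.5 internally, while a learned threshold of 0.25 is reported as 40000 in the units of the original data attribute.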
There is a separate GET_MODEL_DETAILS routine for each algorithm. Starting with Oracle Database 12c Release 2, the GET_MODEL_DETAILS routines are deprecated. Oracle recommends using the model detail views for the respective algorithms instead.