Oracle Data Mining Concepts
10g Release 1 (10.1)
Part Number B10698-01
An Oracle proprietary algorithm that can generate "rules". ABN provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute.
A specific technique or procedure for producing a data mining model. An algorithm uses a specific model representation and may support one or more functional areas. Examples of algorithms used by ODM include Naive Bayes, Adaptive Bayes Networks, and Support Vector Machine for classification, Support Vector Machine for regression, k-means and O-Cluster for clustering, MDL for attribute importance, and Apriori for association models.
The settings that specify algorithm-specific behavior for model building.
A user specification in the ODM Java interface describing the kind of output desired from applying a model to data. This output may include predicted values, associated probabilities, key values, and other supplementary data.
A data mining function that captures co-occurrence of items among transactions. A typical rule is an implication of the form A -> B, which means that the presence of itemset A implies the presence of itemset B with certain support and confidence. The support of the rule is the ratio of the number of transactions where the itemsets A and B are present to the total number of transactions. The confidence of the rule is the ratio of the number of transactions where the itemsets A and B are present to the number of transactions where itemset A is present. ODM uses the Apriori algorithm for association models.
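The support and confidence ratios defined above can be worked through on a small example. The following is a minimal sketch that counts transactions directly; it is not the Apriori algorithm and not an ODM interface. The function name and the sample baskets are hypothetical.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    a = set(antecedent)
    both = a | set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of itemset A.
    n_a = sum(1 for t in transactions if a <= set(t))
    # Transactions that contain every item of both A and B.
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence


# Hypothetical market basket data: one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = support_and_confidence(baskets, {"bread"}, {"milk"})
print(support, confidence)
```

Here the rule bread -> milk holds in 2 of the 4 transactions (support 0.5) and in 2 of the 3 transactions that contain bread (confidence about 0.67).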
Generated by association models. See association.
In the Java interface, an instance of Attribute maps to a column with a name and data type. The attribute corresponds to a column in a database table. When assigned to a column, the column must have a compatible data type; if the data type is not compatible, a runtime exception is likely. Attributes are also called variables, features, data fields, or table columns.
A measure of the importance of an attribute in predicting a specified target. Ranking the attributes of a build data table by this measure lets users select the attributes that are most relevant to a mining model. A smaller set of attributes results in a faster model build; the resulting model could also be more accurate. ODM uses the minimum description length principle to discover important attributes. Sometimes referred to as feature selection or key fields.
Specifies how a logical attribute is to be used when building a model, for example, active or supplementary, suppressing automatic data preprocessing, and assigning a weight to a particular attribute. See also attribute usage set.
A collection of attribute usage objects that together determine how the logical attributes specified in a logical data object are to be used.
All the data collected about a specific transaction or related set of values.
An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either non-ordered (nominal) like state, gender, etc., or ordered (ordinal) such as high, medium, or low temperatures.
In the Java interface, corresponds to a distinct value of a categorical attribute. Categories may have string or numeric values. String values must not exceed 64 characters in length.
See cluster centroid.
A data mining function for predicting categorical target values for new records using a model built from records with known target values. ODM supports three algorithms for classification: Naive Bayes, Adaptive Bayes Networks, and Support Vector Machines.
The cluster centroid is the vector that encodes, for each attribute, either the mean (if the attribute is numerical) or the mode (if the attribute is categorical) of the cases in the build data assigned to a cluster.
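The mean-or-mode rule for a centroid can be illustrated as follows. This is a sketch only, assuming each case is a dictionary of attribute values and that the caller names the categorical attributes; the function and the sample cases are hypothetical, not part of ODM.

```python
from collections import Counter
from statistics import mean


def cluster_centroid(cases, categorical):
    """Return the centroid of the cases assigned to one cluster.

    Numerical attributes contribute their mean, categorical attributes
    contribute their mode (most frequent value).
    """
    centroid = {}
    for attr in cases[0]:
        values = [c[attr] for c in cases if c[attr] is not None]
        if attr in categorical:
            centroid[attr] = Counter(values).most_common(1)[0][0]
        else:
            centroid[attr] = mean(values)
    return centroid


# Hypothetical cases assigned to a single cluster.
cases = [
    {"age": 30, "state": "CA"},
    {"age": 40, "state": "CA"},
    {"age": 50, "state": "NY"},
]
centroid = cluster_centroid(cases, categorical={"state"})
print(centroid)
```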
A data mining function for finding naturally occurring groupings in data. More precisely, given a set of data points, each having a set of attributes, and a similarity measure among them, clustering is the process of grouping the data points into different clusters such that data points in the same cluster are more similar to one another and data points in different clusters are less similar to one another. ODM supports two algorithms for clustering: k-means and orthogonal partitioning clustering.
Measures the correctness of predictions made by a model from a test task. The row indexes of a confusion matrix correspond to actual values observed and provided in the test data. The column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing.
When predicted value equals actual value, the model produces correct predictions. All other entries indicate errors.
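A confusion matrix with this layout (rows for actual values, columns for predicted values) can be tallied from two parallel lists. The sketch below is an illustration under that assumption; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts the records whose
    actual value is labels[i] and whose predicted value is labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (predicted equals actual)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical test data: actual target values and model predictions.
actual = ["yes", "yes", "no", "no", "no"]
predicted = ["yes", "no", "no", "no", "yes"]
labels, matrix = confusion_matrix(actual, predicted)
print(labels)            # ['no', 'yes']
print(matrix)            # [[2, 1], [1, 1]]
print(accuracy(matrix))  # 0.6
```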
A two-dimensional, n by n table that defines the cost associated with a prediction versus the actual value. A cost matrix is typically used in classification models, where n is the number of distinct values in the target, and the columns and rows are labeled with target values. The rows are the actual values; the columns are the predicted values.
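One common way to use a cost matrix is to weight each cell of a confusion matrix by the corresponding cost and add up the results. The sketch below assumes both matrices use the same layout (rows for actual values, columns for predicted values) and the same label order; the cost figures are hypothetical, and this is not a description of how ODM applies costs internally.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over every actual/predicted pair."""
    n = len(cost)
    return sum(confusion[i][j] * cost[i][j] for i in range(n) for j in range(n))


# Label order for both matrices: ["no", "yes"].
confusion = [
    [2, 1],  # actual "no":  2 predicted "no", 1 predicted "yes"
    [1, 1],  # actual "yes": 1 predicted "no", 1 predicted "yes"
]
cost = [
    [0, 1],  # predicting "yes" for an actual "no" costs 1
    [5, 0],  # predicting "no" for an actual "yes" costs 5
]
print(total_cost(confusion, cost))  # 6
```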
A technique for evaluating the accuracy of a classification or regression model. This technique is used when there are insufficient cases for using separate sets of data for model building and testing. The data table is divided into several parts, with each part in turn being used to evaluate a model built using the remaining parts. Cross-validation occurs automatically for Naive Bayes and Adaptive Bayes Networks. Available in the Java interface only.
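The partitioning step described above can be sketched as a k-fold split. This is a generic illustration, not the procedure ODM runs for Naive Bayes or Adaptive Bayes Networks; the function name and the round-robin assignment of cases to parts are assumptions.

```python
def k_fold_indices(n_cases, k):
    """Yield (train, test) index lists, using each of the k parts once
    for testing and the remaining parts for building."""
    # Assign case i to part i % k so the parts are nearly equal in size.
    folds = [list(range(i, n_cases, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(6, 3):
    print("build on", train, "evaluate on", test)
```

Each case is used for evaluation exactly once and for building in the other k - 1 rounds; the k evaluation results are then typically averaged.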
The process of discovering hidden, previously unknown, and usable information from a large amount of data. This information is represented in a compact form, often referred to as a model.
The component of the Oracle database that implements the data mining engine and persistent metadata repository.
Discretization groups related values together under a single value (or bin). This reduces the number of distinct values in a column. Fewer bins result in models that build faster. Many ODM algorithms (NB, ABN, etc.) may benefit from input data that is discretized prior to model building, testing, computing lift, and applying (scoring).
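As an illustration of binning, the sketch below maps numerical values to equal-width bins. Equal-width binning is only one possible strategy and is an assumption here; this glossary entry does not say which strategies ODM uses. The function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin number from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 52, 68]
print(equal_width_bins(ages, 5))  # [0, 0, 1, 2, 3, 4]
```

Six distinct ages are reduced to five bin numbers; with fewer bins, the reduction is greater.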
distance-based (clustering algorithm)
Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used.
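The assignment step can be sketched with Euclidean distance as the metric. The choice of Euclidean distance, the function name, and the sample points are assumptions made for illustration; `math.dist` requires Python 3.8 or later.

```python
import math


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the nearest centroid."""
    assignments = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]
print(assign_to_nearest(points, centroids))  # [0, 0, 1, 1]
```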
In text mining, a matrix that represents the terms that are included in a given document.
A combination of attributes in the data that is of special interest and that captures important characteristics of the data.
See also network feature.
Creates a new set of features by decomposing the original data. Feature extraction lets you describe the data with a number of features that is usually far smaller than the number of original dimensions (attributes). See also non-negative matrix factorization.
A measure of how much better prediction results are using a model than could be obtained by chance. For example, suppose that 2% of the customers mailed a catalog without using the model would make a purchase. However, using the model to select catalog recipients, 10% would make a purchase. Then the lift is 10/2 or 5. Lift may also be used as a measure to compare different data mining models. Since lift is computed using a data table with actual outcomes, lift compares how well a model performs with respect to this data on predicted outcomes. Lift indicates how well the model improved the predictions over a random selection given actual results. Lift allows a user to infer how a model will perform on new data.
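The catalog example above reduces to a single ratio. The following sketch restates it in code; the function and parameter names are hypothetical.

```python
def lift(response_rate_with_model, baseline_response_rate):
    """Ratio of the response rate achieved with the model to the rate
    obtained by chance (without the model)."""
    return response_rate_with_model / baseline_response_rate


# 10% of model-selected recipients purchase, versus 2% without the model.
catalog_lift = lift(0.10, 0.02)
print(round(catalog_lift, 2))  # 5.0
```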
Specifies the location of data for a mining operation in the ODM Java interface.
In the Java interface, a description of a domain of data used as input to mining operations. Logical attributes may be categorical, ordinal, or numerical.
A set of mining attributes used as input to building a mining model.
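Given a sample of data and an effective enumeration of the appropriate alternative theories to explain the data, the best theory is the one that minimizes the sum of the length, in bits, of the description of the theory and the length, in bits, of the data when encoded with the help of the theory.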
This principle is used to select important attributes in attribute importance.
See apply output.
ODM supports the following mining functions: classification, regression, association, attribute importance, and clustering.
An object in the ODM Java interface that specifies the type of model to build, the function of the model, and the algorithm to use. ODM supports the following mining functions: classification, regression, association, attribute importance, and clustering.
The result of building a model from mining function settings (Java interface) or mining settings table (PL/SQL interface). The representation of the model is specific to the algorithm specified by the user or selected by the DMS. A model can be used for direct inspection, e.g., to examine the rules produced from an ABN model or association models, or to score data.
In the Java interface, the end product(s) of a mining task. For example, a build task produces a mining model; a test task produces a test result.
A data value that is missing because it was not measured (that is, has a null value), not answered, was unknown, or was lost. Data mining systems vary in the way they treat missing values. There are several typical ways to treat them: ignore them, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values. ODM ignores missing values during mining operations.
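Of the treatments listed above, replacement with the mean or mode is easy to show in a few lines. The sketch below represents missing values as `None` and is a generic illustration only; as noted above, ODM itself ignores missing values. The function name is hypothetical.

```python
from collections import Counter
from statistics import mean


def impute(values, categorical=False):
    """Replace None with the mode (categorical) or the mean (numerical)
    of the values that are present."""
    present = [v for v in values if v is not None]
    if categorical:
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = mean(present)
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0]))                          # [1.0, 2.0, 3.0]
print(impute(["CA", None, "CA", "NY"], categorical=True))  # ['CA', 'CA', 'CA', 'NY']
```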
A mixture model is a type of density model that includes several component functions (usually Gaussian) that are combined to provide a multimodal density.
An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules. See also mining model.
A network feature is a tree-like multi-attribute structure. From the standpoint of the network, features are conditionally independent components. Features contain at least one attribute (the root attribute). Network features are used in the Adaptive Bayes Network algorithm.
A feature extraction algorithm that decomposes multivariate data by creating a user-defined number of features, which results in a reduced representation of the original data.
An attribute whose values are numbers. The numeric value can be either an integer or a real number. Numerical attribute values can be manipulated as continuous values. See also categorical attribute.
An Oracle proprietary clustering algorithm that creates a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters.
A data value that does not come from the typical population of data; in other words, extreme values. In a normal distribution, outliers are typically at least 3 standard deviations from the mean.
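The three-standard-deviation rule of thumb can be applied as a simple filter. This is a minimal sketch that assumes roughly normally distributed values and uses the population standard deviation; the function name, the threshold parameter, and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=3.0):
    """Return the values lying at least `threshold` standard deviations
    from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mu) / sigma >= threshold]


readings = [10.0] * 19 + [100.0]
print(find_outliers(readings))  # [100.0]
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small samples no value can reach the 3 standard deviation threshold.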
In the Java interface, identifies data to be used as input to data mining. Through the use of attribute assignment, attributes of the physical data are mapped to logical attributes of a model's logical data. The data referenced by a physical data object can be used in model building, model application (scoring), lift computation, statistical analysis, etc.
In the Java interface, an object that specifies the characteristics of the physical data used in a mining operation. The physical data specification includes information about the format of the data (transactional or nontransactional) and the roles that the data columns play.
In binary classification problems, you may designate one of the two classes (target values) as positive, the other as negative. When ODM computes a model's lift, it calculates the density of positive target values among a set of test instances for which the model predicts positive values with a given degree of confidence.
An attribute used as input to a supervised model or algorithm to build a model.
The set of prior probabilities specifies the distribution of examples of the various classes in data. Also referred to as priors, these could be different from the distribution observed in the data.
See prior probabilities.
A data mining function for predicting continuous target values for new records using a model built from records with known target values. ODM supports the Support Vector Machine algorithm for regression. See approximation.
An expression of the general form if X, then Y. An output of certain models, such as association models or ABN models. The predicate X may be a compound predicate.
Scoring data means applying a data mining model to new data to generate predictions. See apply output.
Data for which only a small fraction of the attributes are non-zero or non-null in any given case. Examples of sparse data include market basket and text mining data.
The process of building data mining models using a known dependent variable, also referred to as the target. Classification and regression techniques are examples of supervised mining. See unsupervised mining (learning).
A classification and regression prediction algorithm that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data. Support vector machine also has the ability to make predictions with sparse data, i.e., in domains that have a large number of predictor columns and relatively few rows, as is the case with bioinformatics.
In supervised learning, the identified attribute that is to be predicted. Sometimes called target value or target attribute.
A container within which to specify arguments to data mining operations to be performed by the data mining system.
Text mining is conventional data mining done using "text features." Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data. Both ODM interfaces and Oracle Text support text mining.
A function applied to data resulting in a new form or representation of the data. For example, discretization and normalization are transformations on data.
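As an example of a transformation, the sketch below applies min-max normalization, which rescales numerical values to the range 0 to 1. Min-max scaling is one common form of normalization and is an assumption here; the entry above does not specify which form ODM applies. The function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the
    maximum maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```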
The process of building data mining models without the guidance (supervision) of a known, correct result. In supervised learning, this correct result is provided in the target attribute. Unsupervised learning has no such target attribute. Clustering and association are examples of unsupervised mining functions. See supervised mining (learning).