Too much data and not enough information: this is a problem facing many businesses and industries. Most businesses hold an enormous amount of data with a great deal of information hidden within it. So much data exists that it overwhelms traditional methods of data analysis.
Data mining provides a way to get at the information buried in the data. Data mining creates models to find hidden patterns in large, complex collections of data, patterns that sometimes elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of patterns, or the difficulty in performing the analysis.
Data mining projects usually require a significant amount of data collection and data processing before and after model building. Data tables are created by combining many different types and sources of information. Real-world data is often dirty, that is, it includes wrong or missing values, so it must often be cleaned before it can be used. Data is filtered, normalized, sampled, transformed in various ways, and eventually used as input to data mining algorithms. Data preparation can account for up to 80% of the effort in a data mining project. When the data is stored as a table in a database, data preparation can be performed using database facilities.
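To make the preparation steps concrete, here is a minimal Python sketch of two of them: filling in missing values and normalizing a numeric attribute. It is an illustration only, not how ODM performs data preparation; the function names `impute_missing` and `min_max_normalize` and the sample `ages` column are hypothetical.

```python
from statistics import mean

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with one missing value.
ages = [20, None, 40, 60]
cleaned = impute_missing(ages)
normalized = min_max_normalize(cleaned)
```

Mean imputation and min-max scaling are only two of many possible choices; which transformations are appropriate depends on the data and on the algorithm that will consume it.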
Data mining models have to be built, tested, validated, managed, and deployed in their appropriate application domain environments. The data mining results may need to be post-processed as part of domain-specific computations (for example, calculating estimated risks, expected utilities, and response probabilities) and then stored in permanent databases or data warehouses.
Making the entire data mining process work in a reproducible and reliable way is challenging; it may involve automation and transfers across servers, data repositories, applications, and tools. For example, some data mining tools require that data be exported from the corporate database and converted to the tool's format, and the data mining results must then be imported back into the database. Removing or reducing these obstacles allows data mining to be used more often, to extract more valuable information, and, in many cases, to make a significant impact on the bottom line of an enterprise. Data mining in the database eliminates the data movement required by tools that do not operate in the database and makes it much easier to mine up-to-date data. Also, the less data is moved, the less time the entire data mining process takes.
Data movement can make data insecure. If data never leaves the database, database security protects the data.
In summary, data mining in the database offers two advantages: less data movement and more data security.
Oracle Data Mining (ODM) embeds data mining within the Oracle database. ODM algorithms operate natively on relational tables or views, eliminating the need to extract and transfer data into standalone tools or specialized analytic servers. ODM's integrated architecture results in a simpler, more reliable, and more efficient data management and analysis environment. Data mining tasks can run asynchronously and independently of any specific user interface as part of standard database processing pipelines and applications. Data analysts can mine the data in the database, build models and methodologies, and then turn those results and methodologies into full-fledged application components ready to be deployed in production environments. The benefits of database integration are most apparent when deploying models and scoring data in a production environment. ODM allows a user to take advantage of all aspects of Oracle's technology stack as part of an application. Also, fewer "moving parts" result in a simpler, more reliable, more powerful advanced business intelligence application.
ODM provides single-user multi-session access to models. ODM programs that use the Java interface can run either asynchronously or synchronously. ODM programs that use the PL/SQL interface run synchronously; running them asynchronously requires the Oracle Scheduler. For a brief description of the ODM interfaces, see "Java and PL/SQL Interfaces".
Data mining functions can be divided into two categories: supervised (directed) and unsupervised (undirected).
Supervised functions are used to predict a value; they require the specification of a target (known outcome). Targets are either binary attributes indicating yes/no decisions (buy/don't buy, churn or don't churn, etc.) or multi-class targets indicating a preferred alternative (color of sweater, likely salary range, etc.). Naive Bayes for classification is a supervised mining algorithm.
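As an illustration of a supervised function with a binary target, the following Python sketch trains a small categorical Naive Bayes classifier with add-one smoothing. It is a teaching sketch under stated assumptions, not ODM's Naive Bayes implementation; the function names, the two attributes (age group and income level), and the buy/don't buy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, targets):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(targets)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    distinct = defaultdict(set)           # attribute index -> distinct values seen
    for row, target in zip(rows, targets):
        for i, value in enumerate(row):
            value_counts[(i, target)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct

def predict(model, row):
    """Return the class with the highest smoothed log-probability for row."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for cls, n in class_counts.items():
        score = math.log(n / total)       # log prior of the class
        for i, value in enumerate(row):
            count = value_counts[(i, cls)][value]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (n + len(distinct[i])))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Hypothetical training data: (age group, income level) -> buy decision.
rows = [("young", "low"), ("young", "high"), ("old", "high"),
        ("old", "low"), ("old", "high"), ("young", "low")]
targets = ["no", "yes", "yes", "no", "yes", "no"]

model = train_naive_bayes(rows, targets)
decision = predict(model, ("young", "high"))
```

The target column (`targets`) is what makes this supervised: the model is built from cases whose outcome is already known and is then applied to a case whose outcome is not.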
Unsupervised functions are used to find the intrinsic structure, relations, or affinities in data. Unsupervised mining does not use a target. Clustering algorithms can be used to find naturally occurring groups in data.
Data mining can also be classified as predictive or descriptive. Predictive data mining constructs one or more models; these models are used to predict outcomes for new data sets. Predictive data mining functions are classification and regression. Naive Bayes is one algorithm used for predictive data mining. Descriptive data mining describes a data set in a concise way and presents interesting characteristics of the data. Descriptive data mining functions are clustering, association models, and feature extraction. k-Means clustering is an algorithm used for descriptive data mining.
Different algorithms serve different purposes; each algorithm has advantages and disadvantages. A given algorithm can be used to solve different kinds of problems. For example, k-Means clustering is unsupervised data mining; however, if you use k-Means clustering to assign new records to a cluster, it performs predictive data mining. Similarly, decision tree classification is supervised data mining; however, the decision tree rules can be used for descriptive purposes.
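The dual use of k-Means described above can be sketched in a few lines of Python: building the clusters is the unsupervised, descriptive step, and assigning a new record to its nearest centroid is the predictive step. This is a simplified sketch with fixed starting centroids, not ODM's enhanced k-Means implementation; the function names and sample points are hypothetical.

```python
import math

def assign(point, centroids):
    """Return the index of the centroid nearest to point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[assign(p, centroids)].append(p)
        # Each new centroid is the coordinate-wise mean of its group;
        # an empty group keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

# Hypothetical two-dimensional records forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

# Descriptive step: find the natural groupings.
centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])

# Predictive step: assign a previously unseen record to a cluster.
new_record_cluster = assign((7.5, 8.5), centroids)
```

Here `new_record_cluster` is 1, the index of the cluster built from the three points near (8, 8). No target was needed to build the clusters, yet the finished model still predicts a value (the cluster) for new data.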
Oracle Data Mining supports the following data mining functions:
Supervised data mining:
Classification: Grouping items into discrete classes and predicting which class an item belongs to
Regression: Approximating and forecasting continuous values
Attribute Importance: Identifying the attributes that are most important in predicting results
Anomaly Detection: Identifying items that do not satisfy the characteristics of "normal" data (outliers)
Unsupervised data mining:
Clustering: Finding natural groupings in the data
Association models: Analyzing "market baskets"
Feature extraction: Creating new attributes (features) as a combination of the original attributes
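To give a feel for what analyzing "market baskets" involves, the following Python sketch computes support and confidence, two measures commonly used to evaluate association rules. It is an illustration only and does not reflect how ODM builds association models; the function names and the sample baskets are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical market baskets, one set of purchased items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_and_milk = support(baskets, {"bread", "milk"})
bread_implies_milk = confidence(baskets, {"bread"}, {"milk"})
```

In this data, bread and milk appear together in two of the four baskets (support 0.5), and two of the three baskets that contain bread also contain milk (confidence of about 0.67 for the rule "bread implies milk").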
Oracle Data Mining permits mining of one or more columns of text data.
Oracle Data Mining also supports specialized sequence search and alignment algorithms (BLAST) used to detect similarities between nucleotide and amino acid sequences.
The following features are new in Oracle Data Mining 10g, Release 2 (compared with ODM 10g, Release 1):
New algorithms and improvements to existing algorithms:
Decision Tree algorithm, a fast, scalable means of extracting predictive and descriptive information from a database table with respect to a user-supplied target; Decision Tree provides human-understandable rules
One-Class Support Vector Machine algorithm for anomaly detection
Active learning, an enhancement to the Support Vector Machine algorithm that provides a way to deal with large build data sets
Completely revised Java interface consistent with the Java Data Mining (JDM) standard for data mining (JSR-73)
For information about JDM, see the Java Help for the JSR-73 Specification, available on the Oracle Technology Network.
SQL Data Mining Functions for applying a model to new data
O-cluster algorithm added to PL/SQL interface
Oracle Data Miner, a graphical user interface for ODM; Oracle Data Miner is distributed on Oracle Technology Network
The PL/SQL package DBMS_PREDICTIVE_ANALYTICS, which automates the later stages of data mining and includes the following procedures:
EXPLAIN to rank attributes in order of influence in explaining a target column
PREDICT to predict the value of a target attribute (categorical or numerical)
Oracle Spreadsheet Add-In for Predictive Analytics enables Microsoft Excel users to mine their Oracle Database or Excel data using the automated methodologies provided by DBMS_PREDICTIVE_ANALYTICS; the Add-In is distributed on Oracle Technology Network