Train and Deploy Models from Massive Data Sets: Fraud Detection Use Case

As your business goes through digital transformation and increasingly accepts online payments, effective methods to detect and eventually prevent credit card fraud are necessary to avoid losses. Because fraud is expected to account for only a small fraction of all transactions, massive amounts of data are typically needed to build a robust, accurate model that can flag fraud with minimal false positives.

Architecture

This architecture shows how you can use Oracle Cloud Infrastructure services to explore data sets, build a model, and train that model on terabytes to petabytes of data. You can use the trained model to perform batch inferencing as a job, or deploy it as a real-time inferencing REST endpoint. The architecture has three primary phases: Collect, Analyze, and Act.

The following diagram illustrates this reference architecture.

The architecture has the following components:

  • Collect

    The collect phase has the following components:

    • Devices, sensors, and inputs generate the data. In the fraud detection use case, data originates from point-of-sale (POS) systems.
    • Real-Time Ingest receives data points as they are produced. Events can either be queued in a stream using the Streaming service that is connected to a Model Deployment, or the application can call the inference endpoint directly through an API. Real-time and historical data are reconciled in a datastore (cloud storage or a database).
    • Historical Data acquired through the above means is typically stored in a database, or in object storage.
    • Cloud Storage can be used to stage datasets for exploration and model training.
    • Ingest Services, such as the Oracle Cloud Infrastructure Data Integration service or Oracle GoldenGate, can also link and transport external data, such as data sets that reside on-premises or in third-party datastores.
  • Analyze
    • Explore, Analyze and Design a Model

      In the explore phase, data scientists extract a representative subset of the data set that can fit in memory and learn from it to engineer features that are meaningful to the task at hand. Alternatively, data scientists can run Oracle Cloud Infrastructure Data Flow applications from the Oracle Cloud Infrastructure Data Science service to extract a representative data set. For fraud detection, data typically includes information about the customer (for example, account number, address, gender, date of birth) and the transaction (date, time, merchant, merchant location), from which other features can be derived, such as time-of-day, age, distance to merchant, customer city population, and so on.

      Once meaningful features are engineered, you can test various models to find the most accurate candidate. This implies using a sample data set to train and evaluate the model at small scale, in memory. If that isn't possible, data scientists can create and run Data Flow applications from Oracle Cloud Infrastructure Data Science to train larger-scale models.

    • Model Training

      You can use the Apache Spark distributed processing engine that powers the Oracle Cloud Infrastructure Data Flow service to train the selected model at scale on a data set that cannot fit into memory (terabytes or even petabytes). A minimal training sketch follows this component list.

      Data Flow takes care of provisioning driver and executor nodes with shapes selected to handle the amount of training data and can auto-scale as needed.

    • Storing Model Artifacts

      The trained model is serialized and exported to object storage. You can load the artifact for batch inference, or use it to deploy the model for real-time inference.

    • Model Catalog

      The Oracle Cloud Infrastructure Model Catalog stores model code and artifacts, captures metadata related to provenance and taxonomy, and provides capabilities for model introspection and for defining input and output schemas. The Model Catalog is the source for Model Deployments.

  • Act
    • Batch Inference

      You can use Batch Inference either to evaluate past events on a schedule or to audit model performance and drift on a regular basis. Batch inference is performed at scale using the Oracle Cloud Infrastructure Data Flow service. You can create scoring or inference Data Flow applications directly from the Oracle Cloud Infrastructure Data Science notebook, using the code and model artifacts stored on Object Storage. A minimal batch-scoring sketch follows this component list.

    • Real-Time Inference

      Use a model deployment to perform inference on single events or small batches that fit into memory. Like Data Flow applications, you can create and store models in the Model Catalog directly from the Data Science notebook. Inference can then happen on a stream of real-time data or synchronously through a direct API call from the application. A minimal sketch of invoking a model deployment endpoint follows this component list.

    • Orchestration and Scheduling

      When working with batches, it is often useful to run jobs on a schedule or on a trigger. You can use the Oracle Cloud Infrastructure Data Integration service to perform this type of orchestration. The service can trigger and control ingestion and transformation tasks, and can trigger training, scoring, or inference jobs.
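
The following sketch illustrates the Model Training step described under Analyze: a PySpark pipeline trained in a Data Flow run on data read from Object Storage (Data Flow resolves oci:// URIs natively). It is a minimal sketch, not the reference implementation; the bucket, namespace, paths, and column names are hypothetical placeholders.

    # Minimal PySpark training sketch for a Data Flow run.
    # Bucket, namespace, paths, and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("fraud-training").getOrCreate()

    # Read the full historical data set from Object Storage.
    df = spark.read.parquet("oci://<bucket>@<namespace>/transactions/")

    # Assemble engineered features and train a classifier at scale.
    feature_cols = ["amount", "hour_of_day", "age", "distance_to_merchant"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    clf = RandomForestClassifier(labelCol="is_fraud", featuresCol="features")
    model = Pipeline(stages=[assembler, clf]).fit(df)

    # Serialize the trained pipeline back to Object Storage for batch scoring
    # or for packaging into the Model Catalog.
    model.write().overwrite().save("oci://<bucket>@<namespace>/models/fraud-rf/")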
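
For the Batch Inference step under Act, a scheduled Data Flow run can reload the saved pipeline and score new events in bulk. Again a minimal sketch, with hypothetical paths and column names:

    # Minimal batch-scoring sketch for a scheduled Data Flow run.
    from pyspark.sql import SparkSession
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.appName("fraud-batch-scoring").getOrCreate()

    # Load the pipeline trained earlier and score the latest batch of events.
    model = PipelineModel.load("oci://<bucket>@<namespace>/models/fraud-rf/")
    new_events = spark.read.parquet("oci://<bucket>@<namespace>/transactions/daily/")
    scored = model.transform(new_events).select("transaction_id", "prediction", "probability")

    # Persist predictions for reporting or auditing of model drift.
    scored.write.mode("overwrite").parquet("oci://<bucket>@<namespace>/scored/daily/")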
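
For Real-Time Inference, an application calls the Model Deployment HTTPS endpoint directly. The sketch below assumes API-key authentication with the OCI Python SDK request signer; the endpoint URL and payload fields are hypothetical placeholders.

    # Minimal sketch of calling a Model Deployment endpoint for real-time scoring.
    import oci
    import requests

    config = oci.config.from_file()  # reads ~/.oci/config
    signer = oci.signer.Signer(
        tenancy=config["tenancy"],
        user=config["user"],
        fingerprint=config["fingerprint"],
        private_key_file_location=config["key_file"],
    )

    # Hypothetical invoke URL of the model deployment.
    endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<deployment-ocid>/predict"

    # Hypothetical feature payload matching the deployed model's input schema.
    transaction = {
        "amount": 142.50,
        "hour_of_day": 23,
        "age": 34,
        "distance_to_merchant": 87.2,
    }

    response = requests.post(endpoint, json=transaction, auth=signer)
    response.raise_for_status()
    print(response.json())  # for example, a fraud probability or label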

You can use a similar pattern with other use-cases that require very large data sets, such as:

  • Predictive Maintenance

    Predictive maintenance is about cost avoidance and minimizing operational disruptions that increase expenses, such as scheduling additional shifts, paying overtime, expediting freight, and other costs.

  • Energy Production Output

    Predicting throughput from alternative energy farms, like wind or solar, requires vast amounts of data including local weather patterns and past output.

  • Smart Manufacturing and Internet of Things (IoT)

    Smart manufacturing involves finding ways to improve operational efficiencies to increase revenues and profits. This typically requires ingestion of data from hundreds to millions of sensors to predict yield, bottlenecks or trace products to analyze impacts, thereby increasing throughput and output.

  • Health Insurance Claim processing

    Fraud is also a prevalent problem in Health Insurance claims. Automatically determining submission completeness is a critical part of making the process efficient.

  • Prescription Drug Analytics and Logistics

    Predicting what types of prescription drugs are needed at a location is a complex problem that only vast amounts of data can help solve.

  • Health Diagnosis

    Health diagnoses are often made with imaging techniques such as X-ray or MRI. Machine learning has proven very helpful, and sometimes better than humans, at predicting diseases. This type of application requires very large data sets of images, which are voluminous due to their multi-dimensional nature.

Recommendations

Use the following recommendations as a starting point. Your requirements might differ from the architecture described here.
  • Gateway

    The gateway can be a custom hub designed for specific data collection. It might also be a database such as Oracle Autonomous Data Warehouse, Oracle NoSQL Database Cloud Service, or some other database.

  • Transport

    Use Oracle Cloud Infrastructure Data Integration to migrate all historical data offline to Oracle Cloud Infrastructure Object Storage. Once data is transferred to Object Storage, all Oracle Cloud Infrastructure (OCI) services can access the data. You can also use Oracle GoldenGate to move data from on-premises databases.

  • Streaming

    Use Oracle Cloud Infrastructure Streaming for real-time ingestion of events and data that are consumed or stored in Oracle Cloud Infrastructure Object Storage. A minimal producer sketch follows these recommendations.

  • Data Storage
    • Object Storage

      Oracle Cloud Infrastructure Object Storage is the default storage in this architecture. Storing all structured, semi-structured, and unstructured data in Object Storage is the most cost-effective solution.

    • Database

      Use Oracle Autonomous Data Warehouse, Oracle MySQL Database Service, or other SQL and NoSQL databases to store data that must be accessed for analytics and reporting. Typically, only curated and processed data resides in the database while raw, less often accessed data, is more efficiently stored in Object Storage.

    • HDFS datastore

      The Oracle Big Data Cloud Service offers a way to store very large amounts of data in HDFS (the Hadoop Distributed File System). This option is useful if your organization already uses or is migrating other Hadoop-based applications. Oracle offers an HDFS connector to Oracle Cloud Infrastructure Object Storage, which is the recommended storage platform.

  • Data Science

    Oracle Cloud Infrastructure Data Science offers a familiar development environment to data scientists in the form of a hosted Jupyter Lab and multiple conda-based environments to choose from.

    The Data Science service supports the Oracle Accelerated Data Science (ADS) library, which makes it easy to create and store machine learning models in the Model Catalog and deploy them through Model Deployments (a minimal sketch follows these recommendations).

    The Jupyter Lab interface combined with the provided pre-packaged Apache Spark conda environment makes it simple to explore and design Spark-based models on in-memory datasets, and then deploy them as Oracle Cloud Infrastructure Data Flow Applications to run training or batch inference, or as Model Deployments for real-time or in-memory inference.

  • Distributed Data Processing

    Oracle Cloud Infrastructure Data Flow delivers the Apache Spark distributed processing engine as a service, capable of running processing jobs on terabytes or even petabytes of data.

    Spark applications developed in the Data Science service are easily transferred to Data Flow Applications thanks to the Oracle ADS library utilities, and can be configured to run on any shape, at any scale.
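
To illustrate the Streaming recommendation, the following sketch publishes a transaction event to a stream with the OCI Python SDK. The stream OCID, messages endpoint, and event payload are hypothetical placeholders.

    # Minimal sketch of producing an event to Oracle Cloud Infrastructure Streaming.
    import base64
    import json
    import oci

    config = oci.config.from_file()
    stream_ocid = "ocid1.stream.oc1..<unique-id>"                                 # placeholder
    messages_endpoint = "https://cell-1.streaming.<region>.oci.oraclecloud.com"   # placeholder

    client = oci.streaming.StreamClient(config, service_endpoint=messages_endpoint)

    # Streaming expects base64-encoded keys and values.
    event = json.dumps({"transaction_id": "tx-123", "amount": 142.50}).encode()
    entry = oci.streaming.models.PutMessagesDetailsEntry(
        key=base64.b64encode(b"tx-123").decode(),
        value=base64.b64encode(event).decode(),
    )
    client.put_messages(
        stream_ocid,
        oci.streaming.models.PutMessagesDetails(messages=[entry]),
    )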
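
To illustrate the Data Science recommendation, the following sketch stores a trained model in the Model Catalog and deploys it with the ADS library. It assumes ADS 2.x in a notebook session with resource-principal authentication; the estimator (clf), sample data, conda environment slug, and compute shape are hypothetical placeholders, and exact module paths and parameters can vary between ADS versions.

    # Minimal ADS sketch: catalog and deploy a scikit-learn model (assumptions above).
    import ads
    from ads.model.framework.sklearn_model import SklearnModel

    ads.set_auth(auth="resource_principal")

    # `clf`, `X_train`, and `y_train` come from earlier notebook cells (hypothetical).
    sklearn_model = SklearnModel(estimator=clf, artifact_dir="./fraud_model_artifact")
    sklearn_model.prepare(
        inference_conda_env="generalml_p38_cpu_v1",  # placeholder conda slug
        X_sample=X_train.head(5),
        y_sample=y_train.head(5),
    )

    # Store the artifact and its metadata in the Model Catalog.
    model_id = sklearn_model.save(display_name="fraud-detection-model")

    # Deploy as a real-time HTTPS endpoint and send a test payload to it.
    sklearn_model.deploy(
        display_name="fraud-detection-endpoint",
        deployment_instance_shape="VM.Standard2.1",  # placeholder shape
    )
    print(sklearn_model.predict(X_train.head(5)))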

Considerations

When building and training models on massive data sets, consider the following points. Some are specific to the fraud detection use case, but you can apply most of them to any model design process.

  • Data Collection

    When building any machine learning model, data is of critical importance: quality data in sufficient quantity is paramount. In the fraud detection sample code, we used synthetic transaction data with labels, generated using user profiles that are best guesses. In a real-world scenario, transactions will often not be labeled and fraud cases may not be detected or even known, let alone labeled, so building a proper data set is the first challenge to address.

  • Data Quantity

    How much data is needed is always a difficult question to answer. As a rule of thumb, the more features there are, the more data is needed. In some cases, reducing the number of features improves model performance and helps avoid over-fitting; however, in many instances, more data is needed to improve model performance.

    If the data can fit in memory, it is typically more efficient to select a compute shape with enough memory to train on a single node than to work with a distributed processing framework like Apache Spark (or Oracle Cloud Infrastructure Data Flow, its managed equivalent).

  • Data Storage

    Raw data is often voluminous and not directly useful for analytics. It's important to keep raw data, but it is seldom used to train or retrain models, so it is better stored on a cost-effective solution like Oracle Cloud Infrastructure Object Storage.

    Aggregated data, predictions, and, more generally, data that is used for analytics and reporting must be quickly accessible and is better stored in a database. Data that is both voluminous and required for analytics may benefit from distributed storage solutions such as HDFS and Oracle Big Data Cloud Service.

  • Feature Engineering and Data Augmentation

    Ingested raw data often does not carry enough information, or is in the wrong format, so it is necessary to engineer features from it. For example, you may want to use the customer's age rather than their date of birth. Categorical text, which most machine learning models cannot use directly, can be encoded as numbers with a StringIndexer or, more often, with one-hot encoding (a minimal sketch follows this list). These techniques work well when the categorical values are unlikely to change, such as when encoding gender. However, if the values can change over time, the model must be retrained whenever new values appear, which is far from ideal. In that case, you may need to find a better way to encode the data.

    In the fraud detection code sample, we use already generated longitude and latitude from the synthetic data set as a proxy for customer address. This information is likely not readily available from the device source and will require data augmentation. Calling on an external geocoding service to translate an address into geographic coordinates, either during data ingestion, or as a pre-processing step, will provide the required information.

  • Data Exploration

    Data exploration is typically performed on a representative sample data set that fits in memory. If there is no easy way to determine that the sample is truly representative and the full data set is too large to fit into memory, use Oracle Cloud Infrastructure Data Flow to build aggregated statistics about the full data set and extract a meaningful, representative sub-sample (a stratified-sampling sketch follows this list). Engineering features on a sample data set that is not representative of the full data set and then training on the full data set will lead to a model with poor performance.

  • Model Training

    Tuning Spark for best efficiency is challenging. See Sizing Your Data Flow Application and follow the recommendations to estimate the number of executors and compute shapes needed to run a specific job.
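
The following sketch illustrates the feature engineering described above using PySpark: deriving age, time-of-day, and distance to merchant, then one-hot encoding a categorical column. Column names and paths are hypothetical placeholders.

    # Minimal PySpark feature-engineering sketch (hypothetical column names).
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    spark = SparkSession.builder.appName("fraud-features").getOrCreate()
    df = spark.read.parquet("oci://<bucket>@<namespace>/transactions/")

    # Derive age and time-of-day instead of using raw date columns.
    df = (
        df.withColumn("age", F.floor(F.datediff(F.col("trans_ts"), F.col("dob")) / 365.25))
          .withColumn("hour_of_day", F.hour("trans_ts"))
    )

    # Approximate distance to merchant in kilometers (haversine formula).
    df = df.withColumn(
        "distance_to_merchant",
        2 * 6371 * F.asin(F.sqrt(
            F.pow(F.sin((F.radians("merch_lat") - F.radians("cust_lat")) / 2), 2)
            + F.cos(F.radians("cust_lat")) * F.cos(F.radians("merch_lat"))
            * F.pow(F.sin((F.radians("merch_long") - F.radians("cust_long")) / 2), 2)
        )),
    )

    # Encode a categorical column as numbers, then one-hot encode it.
    indexer = StringIndexer(inputCol="category", outputCol="category_idx", handleInvalid="keep")
    indexed = indexer.fit(df).transform(df)
    encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])
    df = encoder.fit(indexed).transform(indexed)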
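
The following sketch illustrates the data exploration point above: computing aggregate statistics on the full data set with Spark, then extracting a stratified sample that keeps the rare fraud class intact. The label column and sampling fractions are hypothetical placeholders.

    # Minimal stratified-sampling sketch (hypothetical label column and fractions).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fraud-sampling").getOrCreate()
    df = spark.read.parquet("oci://<bucket>@<namespace>/transactions/")

    # Aggregate statistics over the full data set, for example the class balance.
    df.groupBy("is_fraud").count().show()

    # Keep all (rare) fraud rows and 1% of legitimate rows so the sample fits
    # in memory while remaining representative of both classes.
    sample = df.stat.sampleBy("is_fraud", fractions={0: 0.01, 1: 1.0}, seed=42)

    # Bring the sample into the notebook as a pandas DataFrame for exploration.
    sample_pdf = sample.toPandas()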

Deploy

The example for this reference architecture is available as a Jupyter notebook on GitHub.

  1. Go to GitHub to view the sample notebook.
  2. Follow the instructions in the README document.

Explore More

Learn more about Oracle Cloud Infrastructure Data Flow.

Acknowledgments

Authors: Emmanuel Leroy, Niranjan Ghadei, Nishant Patel, Alireza Dibazar