A Data Scientist's Guide to Getting Started with OCI
Oracle Cloud Infrastructure (OCI) offers a family of artificial intelligence and machine learning services. This guide takes a data scientist through an introductory tour of our data science offerings using the machine learning lifecycle as its framework.
The Machine Learning Lifecycle
Building a machine learning model is an iterative process. Many of the steps needed to build a machine learning model are reiterated and modified until data scientists are satisfied with the model's performance. This process requires a great deal of data exploration, visualization, and experimentation.
The OCI Data Science service supports a data scientist throughout the full machine learning life cycle: rapidly building, training, deploying, and managing machine learning models. Data Science users work in a familiar JupyterLab notebook interface where they write Python code and have access to open source libraries.
As a prerequisite to using Data Science for the machine learning lifecycle, you need to prepare your OCI environment and workspace.
You can also explore all of our ML and AI offerings and visit additional resources.
The quickest way to configure your tenancy for data science is to use Resource Manager, which takes care of these prerequisites with just a few clicks. See Using the Oracle Resource Manager to Configure Your Tenancy for Data Science for more information.
Before you can get started with data and modeling, you need to ensure that your OCI tenancy is properly configured, including:
- Compartments – A logical container for organizing OCI resources. Read more at Learn Best Practices for Setting Up Your Tenancy.
- User groups – A group of users, including data scientists.
- Dynamic groups – A special type of group that contains resources (such as data science notebook sessions, job runs, and model deployments) that match rules that you define. These matching rules allow group membership to change dynamically as resources that match those rules are created or deleted. These resources can make API calls to services according to policies written for the dynamic group. For example, using the resource principal of a Data Science notebook session, you could call the Object Storage API to read data from a bucket.
- Policies – Define what principals, such as users and resources, have access to in OCI. Access is granted at the group and compartment level. You can write a policy that gives a group a specific type of access within a specific compartment.
For a full tutorial on setting up a tenancy for Data Science, see Manually Configuring a Data Science Tenancy.
After you set up the required OCI infrastructure, you can set up your data science environment.
- Create a data science project within your compartment.
Projects are containers that enable data science teams to organize their work. They represent collaborative workspaces for organizing notebook sessions and models.
You can also use the `create_project` method of the ADS SDK (an equivalent sketch using the OCI Python SDK follows this list).
- Create a notebook session within your project and specify your compartment.
Notebook sessions are JupyterLab interfaces where you can work in an interactive coding environment to build and train models. Environments come with preinstalled open source libraries and the ability to add others.
Notebook sessions run in fully managed infrastructure. When you create your notebook session, you can select CPUs or GPUs, the compute shape, and the amount of storage without any manual provisioning. Every time you reactivate a notebook session, you have the opportunity to modify these options. You can also let the service handle networking for your notebook session.
The `create_notebook_session` method lets you create notebook sessions with the ADS SDK.
- Within your notebook session, install or create a Conda environment. Conda is an open source environment and package management system that you can use to quickly install, run, and update packages and their dependencies. You can isolate different software configurations, switch environments, and publish environments to make your research reproducible.
Tip: The fastest way to get started in a notebook session is to choose an existing Data Science Conda environment. The OCI Data Science team manages these environments, which focus on providing specific tools and a framework for machine learning work, or a comprehensive environment for solving business use cases. Each Data Science environment comes with its own set of notebook examples, which help you get started with the libraries installed in the environment.
- After installing a Conda environment in your notebook session, access your data and start the machine learning lifecycle.
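The following is a minimal sketch of the project and notebook session steps using the OCI Python SDK (the ADS SDK's `create_project` and `create_notebook_session` methods provide similar functionality). The compartment OCID, display names, and compute shape are placeholder assumptions; substitute your own values.

```python
# Hedged sketch: create a project and a notebook session with the OCI Python SDK.
# The OCIDs, display names, and shape below are placeholders -- substitute your own.
import oci

config = oci.config.from_file()  # assumes ~/.oci/config is set up
ds_client = oci.data_science.DataScienceClient(config)

project = ds_client.create_project(
    oci.data_science.models.CreateProjectDetails(
        compartment_id="ocid1.compartment.oc1..example",  # placeholder
        display_name="churn-analysis",                    # placeholder
    )
).data

notebook = ds_client.create_notebook_session(
    oci.data_science.models.CreateNotebookSessionDetails(
        compartment_id=project.compartment_id,
        project_id=project.id,
        display_name="churn-notebook",                    # placeholder
        notebook_session_configuration_details=oci.data_science.models.NotebookSessionConfigurationDetails(
            shape="VM.Standard2.1",        # placeholder CPU shape
            block_storage_size_in_gbs=100,
            # depending on your networking choice, you may also need subnet details
        ),
    )
).data
print(notebook.lifecycle_state)
```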
All machine learning models start with data. Data scientists using OCI Data Science can access and use data sources in any cloud or on-premises environment, allowing for more data features and better models. A complete listing of the data sources and formats that the ADS SDK supports is available in the ADS SDK documentation.
When using the Data Science service, storing data in the notebook session for quick access is recommended. From your notebook session, you can access data from:
- Object Storage - To retrieve your data, first set up a connection to Object Storage. After this setup, you can use the OCI Python SDK in a notebook session to retrieve your data. You can also use the ADS SDK to pull data from Object Storage (a minimal sketch follows this list).
- Local storage - To load a dataframe from a local source using the ADS SDK, use functions from `pandas` directly.
- HTTP and HTTPS endpoints - To load a dataframe from a remote web server source, use `pandas` directly.
- Databases - You can connect to Autonomous Data Warehouse (ADW) from your notebook session. The `autonomous_database.ipynb` example notebook interactively illustrates this type of connection.
- Streaming data sources - The `kafka-python` client library is available in notebook sessions. This Python client for the Apache Kafka distributed stream processing system lets data scientists connect to the Streaming service through its Kafka-compatible API. We provide the `streaming.ipynb` notebook example in the notebook session environment.
- Reference libraries - To open a data set from reference libraries, use `DatasetBrowser`. To see supported libraries, use `DatasetBrowser.list()`.
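As an illustration of the Object Storage case, here is a minimal sketch that reads a CSV into a pandas dataframe using the OCI Python SDK with a notebook session's resource principal. The namespace, bucket, and object names are placeholder assumptions.

```python
# Hedged sketch: read a CSV from Object Storage inside a notebook session.
# Uses the notebook's resource principal, so no API keys are needed;
# the namespace, bucket, and object names are placeholders.
import io

import oci
import pandas as pd

signer = oci.auth.signers.get_resource_principals_signer()
os_client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

resp = os_client.get_object(
    namespace_name="mynamespace",  # placeholder
    bucket_name="training-data",   # placeholder
    object_name="customers.csv",   # placeholder
)
df = pd.read_csv(io.BytesIO(resp.data.content))
print(df.head())
```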
Security
You can use the OCI Vault service to centrally manage the encryption keys that protect your data and the credentials that you use to securely access resources. You can use the `vault.ipynb` example notebook to learn how to use vaults with Data Science.
For more information, see the ADS SDK's Vault documentation.
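For instance, here is a minimal sketch of reading a secret (such as a database password) from a vault inside a notebook session; the secret OCID is a placeholder assumption.

```python
# Hedged sketch: fetch a secret from OCI Vault using the resource principal.
# The secret OCID is a placeholder.
import base64

import oci

signer = oci.auth.signers.get_resource_principals_signer()
secrets_client = oci.secrets.SecretsClient(config={}, signer=signer)

bundle = secrets_client.get_secret_bundle(
    secret_id="ocid1.vaultsecret.oc1..example"  # placeholder
).data
password = base64.b64decode(bundle.secret_bundle_content.content).decode("utf-8")
```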
Data can be prepared, transformed, and manipulated with the ADS SDK's built-in functions. Underlying an `ADSDataset` object is a Pandas dataframe, so any operation that can be performed on a Pandas dataframe can also be applied to an `ADSDataset`.
All ADS data sets are immutable; any transforms that are applied result in a new data set.
Prepare
Your data might be incomplete, inconsistent, or contain errors. You can use the ADS SDK to:
- Combine and clean data using row and column operations
- Impute data by finding and replacing null values
- Encode categories
Feature types allow you to separate how data is represented physically from what the data actually measures. You can create and assign multiple feature types to your data. Read a blog post that explains how feature types improve your workflow.
Transform
You can use the ADS SDK to automatically transform a data set using the following methods:
- `suggest_recommendations()` - Displays issues and recommends changes and code to fix the issues
- `auto_transform()` - Returns a transformed data set with all recommendations and optimizations applied automatically
- `visualize_transforms()` - Visualizes the transformations that have been performed on a data set
OCI Data Science also supports open source data manipulation tools such as Pandas, Dask, and NumPy.
After all data transformations are complete, you can split the data into a train and test set, or a train, test, and validation set, as in the sketch below.
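Here is a minimal sketch of this flow, assuming the `DatasetFactory` loader from earlier ADS releases (check your ADS version's documentation) and a hypothetical CSV with a binary `churn` target column:

```python
# Hedged sketch: automated transforms and a train/test split with ADS.
# DatasetFactory and the file/target names are assumptions -- adjust to
# your data and your ADS release.
from ads.dataset.factory import DatasetFactory

ds = DatasetFactory.open("customers.csv", target="churn")
ds.suggest_recommendations()        # show detected issues and suggested fixes
transformed = ds.auto_transform()   # new data set; ADS data sets are immutable
transformed.visualize_transforms()  # inspect what was changed
train, test = transformed.train_test_split(test_size=0.2)
```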
Visualize and Explore
Visualization is one of the initial steps used to derive value from data. It allows analysts to efficiently gain insights from the data and guides exploratory data analysis. The ADS SDK includes a smart visualization tool that automatically detects the data type and renders plots that optimally represent the characteristics of the data. Automatic visualization methods include:
- `show_in_notebook()` - A comprehensive preview of a data set's basic information
- `show_corr()` - Correlation information, including the correlation ratio, the Pearson method, and the Cramer's V method
- `plot()` - Automatic plotting to explore the relationship between two columns
- `feature_plot()` - Custom plotting and visualizations using feature types

You can also use the ADS SDK's `call()` method to plot data using your preferred libraries and packages, such as Seaborn, Matplotlib, Plotly, Bokeh, and Geographic Information System (GIS) tools. See the ADS SDK examples for more information.
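Continuing the sketch from the data preparation step (the column names here are hypothetical):

```python
# Hedged sketch: the smart visualization methods listed above, applied to
# the data set from the previous example. Column names are hypothetical.
transformed.show_in_notebook()  # summary, types, and sample rows
transformed.show_corr()         # correlation across feature types
transformed.plot("tenure", y="monthly_charges").show_in_notebook()
```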
Modeling builds the best mathematical representation of the relationship among data points. Models are artifacts created by the training process, which captures this relationship or pattern.
After training and evaluating the model, you can deploy it.
Training a model
You can train a model either by using Automated Machine Learning (AutoML) or from an open source library. You can train using:
- Notebooks - Write and run Python code by using libraries in the JupyterLab interface
- Conda environments - Use the ADS SDK, AutoML, or Machine Learning Explainability (MLX) to train
- Jobs - Run machine learning or data science tasks outside of your notebook sessions in JupyterLab
AutoML
Building a successful machine learning model requires many iterations and experimentation, and a model is rarely achieved using an optimal set of hyperparameters in the first iteration. AutoML automates four steps in the machine learning modeling process:
- Algorithm selection - Identifies the best algorithms for the data and problem, faster than an exhaustive search
- Adaptive sampling - Identifies the right sample size and adjusts for unbalanced data
- Feature selection - De-noises the data and reduces the number of features
- Model tuning - Automatically tunes hyperparameters for the best model accuracy
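A minimal sketch of training with Oracle AutoML through the ADS SDK, assuming the `train` split from the earlier example and the AutoML interface of earlier ADS releases (newer versions expose AutoML differently; check your version's documentation):

```python
# Hedged sketch: Oracle AutoML via the ADS SDK (earlier ADS releases).
# Assumes `train` from the data preparation sketch above.
from ads.automl.driver import AutoML
from ads.automl.provider import OracleAutoMLProvider

oracle_automl = AutoML(train, provider=OracleAutoMLProvider())
model, baseline = oracle_automl.train()  # returns the tuned model and a baseline
```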
Evaluating and validating a model
After training your model, you can see how it performs against a series of benchmarks. Use evaluation functions to convert the output of your test data into an interpretable, standardized series of scores and charts.
Automated evaluation using the ADS SDK
Automated evaluation generates a comprehensive suite of metrics and visualizations to measure model performance against new data and to compare model candidates. ADS offers a collection of tools, metrics, and charts for comparing several models side by side (a minimal sketch follows the list below). The evaluators are:
- Binary classifier - Used for models where the output is binary, for example: Yes or No, Up or Down, 1 or 0. These models are a special case of multiclass classification and therefore have specifically tailored metrics.
- Multiclass classifier - Used for models where the output is discrete. These models have a specialized set of charts and metrics for their evaluation.
- Regression - Used for models where the output is continuous. For example: price, height, sales, length. These models have their own specific metrics that help to benchmark the model. How close is close enough?
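A minimal sketch using the `ADSEvaluator` class from the ADS SDK, assuming the `model`, `baseline`, `train`, and `test` objects from the previous examples:

```python
# Hedged sketch: automated evaluation with ADSEvaluator.
# Assumes objects from the earlier AutoML and data preparation sketches.
from ads.evaluations.evaluator import ADSEvaluator

evaluator = ADSEvaluator(test, models=[model, baseline], training_data=train)
evaluator.show_in_notebook()  # standardized scores and comparison charts
evaluator.metrics             # the underlying metrics table
```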
Validation, explanations, and interpretation
Machine learning explainability (MLX) is the process of explaining and interpreting machine learning and deep learning models. Explainability is the ability to explain the reasons behind a model's prediction. Interpretability is the level at which a human can understand that explanation. MLX can help you to:
- Better understand and interpret the model’s behavior
- Debug and improve the quality of the model
- Increase trust in the model and confidence in deploying the model
Read more about model explainability to familiarize yourself with global explainers, local explainers, and WhatIf explainers.
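The MLX classes have moved between ADS releases, so treat the following as an illustrative sketch of the older `ads.explanations` interface rather than a current reference; it assumes the `model` and `test` objects from the earlier examples:

```python
# Heavily hedged sketch: global feature importance with ADS MLX
# (older ads.explanations interface; verify against your ADS version).
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer

explainer = ADSExplainer(test, model)
global_explainer = explainer.global_explanation(provider=MLXGlobalExplainer())
importance = global_explainer.compute_feature_importance()
importance.show_in_notebook()  # which features drive the model's predictions
```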
After the model training and evaluation processes are complete, the best candidate models are saved so they can be deployed. Read about model deployments and their key components.
The ADS SDK has a set of classes that take your model and push it to production in a few steps. See Model Serialization for more information.
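For example, with the ADS model serialization classes, a scikit-learn model can be prepared, saved to the model catalog, and deployed in a few calls. The `clf` estimator, the `X_test` sample, and the Conda environment slug below are placeholder assumptions:

```python
# Hedged sketch: prepare, save, and deploy a scikit-learn model with ADS.
# `clf`, `X_test`, and the Conda environment slug are placeholders.
import tempfile

from ads.model.framework.sklearn_model import SklearnModel

sklearn_model = SklearnModel(estimator=clf, artifact_dir=tempfile.mkdtemp())
sklearn_model.prepare(inference_conda_env="generalml_p38_cpu_v1")  # placeholder slug
sklearn_model.verify(X_test[:2])                 # local test of score.py's predict()
model_id = sklearn_model.save(display_name="churn-model")  # placeholder name
deployment = sklearn_model.deploy()              # managed HTTP endpoint
print(sklearn_model.predict(X_test[:2]))
```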
Introduction to the model catalog
Before you can deploy your model, you need to save it in the model catalog. The model catalog is a centralized, managed repository of model artifacts. Models stored in the model catalog can be shared across members of a team, and they can be loaded back into a notebook session. A model artifact is an archive that contains:
- score.py - A Python script that contains your custom logic for loading serialized model objects into memory and that defines an inference endpoint (`predict()`); a minimal sketch follows this list
- runtime.yaml - The runtime environment of the model, which provides the necessary Conda environment reference for model deployment purposes
- Any additional files that are necessary to run your model. Important: Any code used for inference must be zipped at the same level as score.py or below. If any required files are present at folder levels above the score.py file, they are ignored, which could result in deployment failure.
- Metadata about the provenance of the model, including any Git-related information
- The script or notebook used to push the model to the catalog
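A minimal sketch of a score.py, assuming a model serialized with joblib (the file name and payload shape are assumptions):

```python
# score.py -- hedged sketch of a model artifact's inference script.
# Assumes the model was serialized with joblib; the file name and the
# expected payload shape are placeholders.
import os

import joblib
import pandas as pd

MODEL_FILE = "model.joblib"  # placeholder

def load_model():
    """Load the serialized model from the artifact directory into memory."""
    artifact_dir = os.path.dirname(os.path.realpath(__file__))
    return joblib.load(os.path.join(artifact_dir, MODEL_FILE))

def predict(data, model=load_model()):
    """Inference endpoint: return predictions for the incoming JSON payload."""
    frame = pd.DataFrame(data)  # assumes a records-style JSON payload
    return {"prediction": model.predict(frame).tolist()}
```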
We provide various model catalog examples and templates, including `score.py` files, in the GitHub repo.
Prepare model metadata and documentation
Model metadata is optional but recommended. See Preparing Model Metadata and Working with Metadata for more information. Metadata includes:
- Model input and output schemas - A description of the features that are necessary to make a successful model prediction
- Provenance - Documentation that helps you improve the model's reproducibility and auditability
- Taxonomy - A description of the model that you are saving to the model catalog
- Model introspection tests - A series of tests and checks run on a model artifact to test all aspects of the operational health of the model
The ADS SDK automatically populates the provenance and taxonomy on your behalf when you save a model with ADS.
Save the model to the catalog
You can save a model to the model catalog using the ADS SDK, OCI Python SDK, or the Console. For details, see Saving Models to the Model Catalog.
Model artifacts stored in the model catalog are immutable by design to prevent unwanted changes and ensure that any model in production can be tracked to the exact artifact used. Create a new model to make changes.
Deploy the model
The most common way that models are deployed to production is as HTTP endpoints that serve predictions in real time. The Data Science service manages model deployments as resources and handles all infrastructure operations, including compute provisioning and load balancing. You can deploy a model using the ADS SDK or the Console.
You can also deploy models as a function. Functions are highly scalable, on-demand, serverless architectures in OCI. See this blog post for more details.
Invoke your model
After a model is deployed and active, its endpoint can successfully receive requests made by clients. Invoking a model deployment means that you can pass feature vectors or data samples to the predict endpoint, and then the model returns predictions for those data samples. See Invoking a model deployment for more information and then read about editing, deactivating, and otherwise managing your deployed model.
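As an example, a deployed model's `/predict` endpoint can be invoked with a signed HTTP request; the endpoint URI and the feature payload below are placeholders:

```python
# Hedged sketch: invoke a model deployment endpoint with a signed request.
# The endpoint URI and the feature payload are placeholders.
import oci
import requests

# Inside OCI, use the resource principal; from outside OCI, build an
# oci.signer.Signer from your API key configuration instead.
signer = oci.auth.signers.get_resource_principals_signer()

endpoint = (
    "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/"
    "ocid1.datasciencemodeldeployment.oc1..example/predict"  # placeholder
)
payload = {"data": [[0.5, 1.2, 3.4]]}  # placeholder feature vector

response = requests.post(endpoint, json=payload, auth=signer)
print(response.json())
```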
MLOps is the standardization, streamlining, and automation of machine learning lifecycle management. ML assets are treated like other software assets within an iterative, continuous-integration, continuous-delivery environment.
In DevOps, continuous integration refers to the validation and integration of updated code into the central repository. Continuous deployment refers to the redeployment of those changes into production. In MLOps, continuous integration refers to the validation and integration of new data and ML models. Continuous deployment refers to releasing that model into production.
Continuous training is unique to MLOps and refers to the automatic retraining of ML models for redeployment. If a model isn't updated, its predictions become less and less accurate as the data drifts, so you can use automation to retrain the model on new data as quickly as possible.
Jobs
Data Science jobs enable you to define and run a repeatable machine learning task on fully managed infrastructure; a minimal job-definition sketch follows the list below. Using jobs, you can:
- Run machine learning or data science tasks outside of your notebook session
- Operationalize discrete data science and machine learning tasks as reusable, runnable operations
- Automate your MLOps or CI/CD pipeline
- Run batch workloads or workloads triggered by events or actions
- Perform batch, mini-batch, or distributed batch job inference
- From a JupyterLab notebook session, run long-running or compute-intensive tasks in a Data Science job to keep your notebook free for you to continue your work
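A minimal sketch of defining and running a job with the ADS SDK's jobs interface; the shape, Conda environment slug, and script name are placeholder assumptions:

```python
# Hedged sketch: define and run a Data Science job with the ADS jobs API.
# Shape, Conda environment slug, and script name are placeholders; when run
# from a notebook session, compartment and project can default from its environment.
from ads.jobs import DataScienceJob, Job, ScriptRuntime

job = (
    Job(name="nightly-training")  # placeholder name
    .with_infrastructure(
        DataScienceJob()
        .with_shape_name("VM.Standard2.1")  # placeholder shape
        .with_block_storage_size(50)        # GB
    )
    .with_runtime(
        ScriptRuntime()
        .with_source("train.py")                     # placeholder script
        .with_service_conda("generalml_p38_cpu_v1")  # placeholder Conda slug
    )
)
job.create()
run = job.run()
run.watch()  # stream logs from the job run
```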
Monitoring
Monitoring and logging are the last steps in the jobs life cycle. They provide you with insights into your jobs’ performance and metrics, in addition to a record that you can refer to later for each job run. See About Notebook Session Metrics for more information about monitoring, alarms, and metrics.
- Monitoring consists of metrics and alarms, and it enables you to check the health, capacity, and performance of your cloud resources. You can then use this data to determine when to create more instances to handle increased load, troubleshoot issues with your instance, or better understand system behavior.
- Alarms get triggered when a metric breaches set thresholds.
- Metrics track CPU or GPU utilization, the percentage of available job run container memory usage, container network traffic, and container disk utilization. When these numbers reach a certain threshold, you can scale up your resources, such as block storage and compute shape, to accommodate the workload.
- The Events service lets you subscribe to changes in your resources, like job and job run events and respond to them by using functions, notifications, or streams. See Creating Automation Using Events for more information.
Logging
You can use service logs or custom logs with job runs. A job run emits service logs to the Logging service. With custom logs, you can specify which log events are collected in a particular context and the location where the logs are stored. You can use the Logging service to enable, manage, and browse job run logs for your jobs. For complete information, see Logging and About Logs.
Integrating jobs resources with the Logging service is optional, but recommended, both for debugging any potential issues and for monitoring the progress of running your job artifacts.
Full Listing of ML and AI Services
While this guide focuses on the OCI Data Science service, other ML and AI services can be used with Data Science as a way to consume the services or as part of broader machine learning projects.
OCI machine learning services are used primarily by data scientists to build, train, deploy, and manage machine learning models. Data Science provides curated environments so data scientists can access the open source tools they need to solve business problems faster.
- The Data Science service makes it possible to build, train, and manage machine learning models using open source Python, with added capabilities for automated machine learning (AutoML), model evaluation, and model explanation.
- The Data Labeling service provides labeled datasets to more accurately train AI and machine learning models. Users can assemble data, create and browse datasets, and apply labels to data records through user interfaces and public APIs. The labeled datasets can be exported and used for model development. When you’re building machine learning models that work on images, text, or speech, you need labeled data that can be used to train the models.
- The Data Flow service provides a scalable environment for developers and data scientists to run Apache Spark applications in batch execution, at scale. You can run applications written in any Spark language to perform various data preparation tasks.
- Machine Learning in Oracle Database supports data exploration and preparation as well as building and deploying machine learning models using SQL, R, Python, REST, AutoML, and no-code interfaces. It includes more than 30 in-database algorithms that produce models in Oracle Database for immediate use in applications. Build models quickly by simplifying and automating key elements of the machine learning process.
OCI's AI services contain prebuilt machine learning models for specific uses. Some of the AI services are pretrained, and some you can train with your own data. To use them, you simply call the service's API and pass in the data to be processed; the service returns a result. There's no infrastructure to manage.
- Digital Assistant is an AI service that offers prebuilt skills and templates to create conversational experiences for business applications and customers through text, chat, and voice interfaces.
- The Language service makes it possible to perform sophisticated text analysis at scale. Language includes pretrained models for sentiment analysis, key phrase extraction, text classification, named entity recognition, and more.
- The Speech service uses automatic speech recognition (ASR) to convert speech to text. Built on the same AI models used for Digital Assistant, the Speech service gives developers access to time-tested acoustic and language models that provide highly accurate transcription for audio and video files across many languages.
- The Vision service applies computer vision to analyze image-based content. Developers can easily integrate pretrained models into their applications with APIs or custom train models to meet their specific use cases. These models can be used to detect visual anomalies in manufacturing, extract text from documents to automate business workflows, and tag items in images to count products or shipments.
- Anomaly Detection enables developers to more easily build business-specific anomaly detection models that flag critical incidents, resulting in faster time to detection and resolution. Specialized APIs and automated model selection simplify training and deploying anomaly detection models to applications and operations.