Using Notebook Sessions to Build and Train Models
Once you have a notebook session created, you can write and execute Python code using the machine learning libraries in the JupyterLab interface to build and train models.
Authenticating to the OCI APIs from a Notebook Session
When you are working within a notebook session, you are operating as the Linux user datascience. This user does not have an OCI Identity and Access Management (IAM) identity, so it has no access to the OCI API. OCI resources include Data Science projects and models and the resources of other OCI services, such as Object Storage, Functions, Vault, Data Flow, and so on. To access these resources from the notebook environment, use one of the following two authentication approaches:
(Recommended) Authenticating Using a Notebook Session's Resource Principal
A resource principal is a feature of IAM that enables resources to be authorized principal actors that can perform actions on service resources. Each resource has its own identity, and it authenticates using the certificates that are added to it. These certificates are automatically created, assigned to resources, and rotated, avoiding the need for you to store credentials in your notebook session.
The Data Science service enables you to authenticate using your notebook session's resource principal to access other OCI resources. Resource principals provide a more secure way to authenticate to resources than the OCI configuration file and API key approach.
Your tenancy administrator must write policies to grant permissions for your resource principal to access other OCI resources, see Configuring Your Tenancy for Data Science.
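For orientation, the policies typically pair a dynamic group that matches notebook sessions with policy statements granting that group access. The group name, compartment names, and bucket name below are illustrative placeholders, not values from your tenancy:

```text
# Hypothetical dynamic group matching rule for notebook sessions in one compartment:
ALL {resource.type = 'datasciencenotebooksession', resource.compartment.id = '<compartment-ocid>'}

# Hypothetical policy statements for that dynamic group:
allow dynamic-group my-datascience-dyn-group to read objects in compartment my-compartment
allow dynamic-group my-datascience-dyn-group to manage objects in compartment my-compartment where target.bucket.name = 'my-bucket'
```

Your administrator scopes the verbs (read, use, manage) and resource types to whatever the notebook actually needs.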
You can authenticate with resource principals in a notebook session using the following interfaces:
- Oracle Accelerated Data Science SDK: run the following in a notebook cell:

    import ads
    ads.set_auth(auth='resource_principal')

  For details, see the Accelerated Data Science documentation.
- OCI Python SDK: run the following in a notebook cell:

    import oci
    from oci.data_science import DataScienceClient
    rps = oci.auth.signers.get_resource_principals_signer()
    dsc = DataScienceClient(config={}, signer=rps)
- OCI CLI: use the `--auth=resource_principal` flag with commands.
The resource principal token is cached for 15 minutes. If you change the policy or the dynamic group, you have to wait up to 15 minutes for the change to take effect.
If you don't explicitly use resource principals when invoking an SDK or the CLI, then the configuration file and API key approach is used.
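A small sketch of how code can pick between the two approaches. It assumes the OCI_RESOURCE_PRINCIPAL_VERSION environment variable, which the SDK looks for in sessions where resource principals are available; treat that variable name as an assumption to verify against your environment:

```python
import os

def pick_auth_mode() -> str:
    """Return 'resource_principal' when the session appears to expose
    resource principal credentials, otherwise fall back to 'api_key'.

    OCI_RESOURCE_PRINCIPAL_VERSION is assumed to be set by the service
    in sessions where resource principals are enabled.
    """
    if os.environ.get("OCI_RESOURCE_PRINCIPAL_VERSION"):
        return "resource_principal"
    return "api_key"

print(pick_auth_mode())
```

You could pass the returned string straight to `ads.set_auth(auth=...)`.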
(Default) Authenticating Using OCI Configuration File and API Keys
You can operate as your own personal IAM user by setting up an OCI configuration file and API keys to access OCI resources. This is the default authentication approach.
To authenticate using the configuration file and API key approach, you must upload an OCI configuration file into the notebook session's /home/datascience/.oci/ directory. For the relevant profile defined in the OCI configuration file, you also need to upload or create the required .pem files.
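The configuration file is a plain INI-style file. Every value below is a placeholder to show the expected shape, not a real credential:

```ini
[DEFAULT]
user=ocid1.user.oc1..<unique-id>
fingerprint=20:3b:97:13:55:1c:5b:0d:d3:37:d8:50:4e:c5:3a:34
tenancy=ocid1.tenancy.oc1..<unique-id>
region=us-ashburn-1
key_file=/home/datascience/.oci/oci_api_key.pem
```

The key_file path must point at the private .pem file you uploaded or created in the notebook session.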
Alternatively, you can use the included getting-started.ipynb notebooks to interactively create configuration and key files, see Overview of the Notebook Examples. You can use the api_keys.ipynb notebook to interactively create OCI configuration and API key files. To launch the api_keys.ipynb notebook, click Notebook Examples in the JupyterLab Launcher tab.
Working with Existing Code Files
You can create new files or work with your own existing files.
Files can be uploaded from your local machine by clicking Upload in the JupyterLab interface or by dragging and dropping files.
If you don't have a private key, you can create one in the notebook session by running the ssh-keygen command in the JupyterLab environment.
These instructions use a Git repository as an example, though the steps are similar for other repositories. Flows between third-party version control providers and internal Git servers may differ.
You can run sftp, scp, curl, wget, or rsync commands to pull files into your notebook session environment, subject to the networking limitations imposed by your VCN and subnet selection.
Installing Additional Python Libraries
You can install a library that's not preinstalled in the provided image.
Access to the public internet is required to install additional libraries. Install a library by opening a notebook session and running this command:
%%bash
pip install <library-name>==<library-version>
Data Science doesn't allow root privileges in notebook sessions. You can only install libraries using yum and pip as a normal user. Attempting to use sudo results in errors.
You can install any open source package available on a publicly accessible Python Package Index (PyPI) repository. You can also install private or custom libraries from your own internal repositories.
The VCN or subnet that you used to create the notebook session must have network access to the source locations for the packages you want to download and install, see Manually Configuring Your Tenancy for Data Science.
Using the Provided Environment Variables in Notebook Sessions
When you start up a notebook session, the service creates useful environment variables that you can use in your code:
- NB_SESSION_COMPARTMENT_OCID: The compartment OCID of the current notebook session.
- NB_SESSION_OCID: The OCID of the current notebook session.
- PROJECT_OCID: The OCID of the project associated with the current notebook session.
- USER_OCID: Your user OCID.
- PROJECT_COMPARTMENT_OCID: The compartment OCID of the project associated with the current notebook session.
To access these environment variables in your notebook session, use the Python os library. For example:

import os
project_ocid = os.environ['PROJECT_OCID']
print(project_ocid)
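Because os.environ[...] raises a KeyError when a variable is absent (for example, when the same code runs outside a notebook session), a defensive variant reads all the documented variables with os.environ.get. This is a small sketch, not part of the service itself:

```python
import os

# The service-provided variables documented above; any of them can be
# absent outside a notebook session, so read them with a default of None.
NOTEBOOK_ENV_VARS = [
    "NB_SESSION_COMPARTMENT_OCID",
    "NB_SESSION_OCID",
    "PROJECT_OCID",
    "USER_OCID",
    "PROJECT_COMPARTMENT_OCID",
]

def notebook_metadata() -> dict:
    """Collect the notebook-session environment variables into a dict."""
    return {name: os.environ.get(name) for name in NOTEBOOK_ENV_VARS}

print(notebook_metadata())
```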
The NB_SESSION_COMPARTMENT_OCID and PROJECT_COMPARTMENT_OCID values do not update in a running notebook session if the resources have moved compartments after the notebook session was created.
Using the Oracle Accelerated Data Science SDK
The Oracle Accelerated Data Science (ADS) SDK is a Python library that is included as part of the OCI Data Science service notebook session resource. ADS offers a friendly user interface that covers many of the steps involved in the lifecycle of machine learning models, from connecting to different data sources to using AutoML for model training to model evaluation and explanation. ADS also provides a simple interface to access the OCI Data Science service model catalog and other OCI services including object storage.
For complete documentation on how to use the Accelerated Data Science SDK, see Accelerated Data Science Library and Accessing the Conda Environment Notebook Examples.
Connecting to Your Data
You can connect to your data in these ways:
To retrieve your data, you must first set up a connection to Oracle Cloud Infrastructure Object Storage.
After this setup, you can use the OCI Python SDK in a notebook session to retrieve your data from Object Storage. Also, you can use the ADS SDK to pull data from Object Storage. Example notebooks are provided in the notebook session environment to show you the necessary steps, see Accessing the Conda Environment Notebook Examples.
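ADS and fsspec-style readers accept Object Storage locations in the oci://<bucket>@<namespace>/<object> URI form. A tiny helper makes that shape explicit; the bucket, namespace, and object names below are illustrative placeholders:

```python
def object_storage_uri(bucket: str, namespace: str, object_name: str) -> str:
    """Build an Object Storage URI of the form
    oci://<bucket>@<namespace>/<object>, as accepted by ADS readers."""
    return f"oci://{bucket}@{namespace}/{object_name}"

# Example with placeholder names:
print(object_storage_uri("my-bucket", "my-namespace", "data/train.csv"))
# oci://my-bucket@my-namespace/data/train.csv
```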
You can connect to the Autonomous Data Warehouse (ADW) from your notebook session.
The autonomous_database.ipynb example notebook interactively illustrates this type of connection.
The VCN and subnet configuration that you selected when creating your notebook session should permit access to your ADW database. Contact your IT administrator to confirm that access with the networking configuration you selected is permitted.
To connect to ADW and pull data into a dataframe in your notebook session, use the notebook examples available within notebook sessions; they show the steps involved in connecting to and querying data from ADW and other data sources.
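The underlying pattern is the standard DB-API one: open a connection, run SQL, and load the result. Against ADW you would use an Oracle driver (such as cx_Oracle or python-oracledb) with your database wallet credentials; in this runnable sketch an in-memory SQLite database stands in for ADW so the pattern itself can execute anywhere:

```python
import sqlite3

# sqlite3 stands in for ADW here; with ADW you would create the
# connection via an Oracle driver and wallet credentials instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0)],
)

# pandas.read_sql(query, conn) would return a dataframe directly; the
# raw DB-API equivalent is a cursor fetch:
rows = conn.execute(
    "SELECT region, amount FROM sales ORDER BY region"
).fetchall()
print(rows)  # [('east', 100.0), ('west', 250.0)]
```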
The kafka-python client library is available in the notebook session. It is a client library for the Apache Kafka distributed stream processing system, and it allows data scientists to connect to the Streaming service using its Kafka-compatible API. We provide the streaming.ipynb example notebook in the notebook session environment. It walks step by step through producing and consuming messages to and from a stream. The example includes:
- Creating a Stream Pool and a Stream.
- Storing your Streaming Credentials as Secrets in an OCI Vault.
- Retrieving your Secrets from the Vault.
- Producing Messages to a Stream.
- Consuming Messages from a Stream.
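For orientation, connecting kafka-python to the Streaming service's Kafka-compatible endpoint generally means SASL_SSL with the PLAIN mechanism, a username of the form tenancy-name/user-name/stream-pool-OCID, and an auth token as the password. Every value in this configuration dict is an illustrative placeholder; confirm the endpoint and credential format for your region and tenancy:

```python
# Placeholder settings for kafka-python's KafkaProducer/KafkaConsumer
# against the Streaming service's Kafka-compatible endpoint.
kafka_config = {
    # Assumed endpoint shape; substitute your region's endpoint.
    "bootstrap_servers": ["cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092"],
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    # Username format: <tenancy-name>/<user-name>/<stream-pool-ocid>
    "sasl_plain_username": "mytenancy/myuser/ocid1.streampool.oc1..<unique-id>",
    # Use an auth token (ideally retrieved from a Vault secret),
    # never your console password.
    "sasl_plain_password": "<auth-token>",
}

# producer = kafka.KafkaProducer(**kafka_config)  # then producer.send(...)
print(sorted(kafka_config))
```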
In addition, you can use the OCI Vault service to centrally manage the encryption keys that protect your data and the credentials that you use to securely access resources. You can use the vault.ipynb example notebook to learn how to use vaults with Data Science. It includes:
- Creating a vault.
- Creating a key.
- Working with secrets.
- Listing resources.
- Deleting secrets, keys, and vaults.
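One detail worth knowing when working with secrets: secret bundles return their content base64-encoded, so you decode it before use. The SDK call shown in the comment is a sketch of the usual retrieval path, not a literal excerpt from the vault.ipynb notebook:

```python
import base64

def decode_secret_bundle_content(content_b64: str) -> str:
    """Vault secret bundles carry their content base64-encoded;
    decode it back to the original string."""
    return base64.b64decode(content_b64).decode("utf-8")

# With the OCI Python SDK, the encoded content would come from
# something like:
#   bundle = secrets_client.get_secret_bundle(secret_id=...)
#   encoded = bundle.data.secret_bundle_content.content
encoded = base64.b64encode(b"my-streaming-auth-token").decode("utf-8")
print(decode_secret_bundle_content(encoded))  # my-streaming-auth-token
```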