Using Notebook Sessions to Build and Train Models
Once you have created a notebook session, you can write and execute Python code using the machine learning libraries in the JupyterLab interface to build and train models.
Authenticating to the OCI APIs from a Notebook Session
When you are working within a notebook session, you are operating as the Linux user `datascience`. This user does not have an OCI Identity and Access Management (IAM) identity, so it has no access to the OCI APIs. OCI resources include Data Science projects and models, and the resources of other OCI services such as Object Storage, Functions, Vault, and Data Flow. To access these resources from the notebook environment, use one of the following two authentication approaches:
(Recommended) Authenticating Using a Notebook Session's Resource Principal
A resource principal is a feature of IAM that enables resources to be authorized principal actors that can perform actions on service resources. Each resource has its own identity, and it authenticates using the certificates that are added to it. These certificates are automatically created, assigned to resources, and rotated, avoiding the need for you to store credentials in your notebook session.
The Data Science service enables you to authenticate using your notebook session's resource principal to access other OCI resources. Resource principals provide a more secure way to authenticate to resources than the OCI configuration file and API key approach.
Your tenancy administrator must write policies that grant your resource principal permission to access other OCI resources. See Configuring Your Tenancy for Data Science.
You can authenticate with resource principals in a notebook session using the following interfaces:
- Oracle Accelerated Data Science SDK: run the following in a notebook cell:

  ```python
  import ads

  ads.set_auth(auth='resource_principal')
  ```

  For details, see the Accelerated Data Science documentation.

- OCI Python SDK: run the following in a notebook cell:

  ```python
  import oci
  from oci.data_science import DataScienceClient

  rps = oci.auth.signers.get_resource_principals_signer()
  dsc = DataScienceClient(config={}, signer=rps)
  ```

- OCI CLI: use the `--auth=resource_principal` flag with commands.
The resource principal token is cached for 15 minutes. If you change the policy or the dynamic group, you have to wait 15 minutes for the change to take effect.
If you don't explicitly use resource principals when invoking an SDK or the CLI, then the configuration file and API key approach is used.
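As a minimal sketch of how code might pick between the two approaches, sessions with resource principals enabled expose the standard `OCI_RESOURCE_PRINCIPAL_VERSION` environment variable; the helper name below is illustrative, not part of any SDK:

```python
import os

def pick_auth_mode(environ=os.environ):
    """Return the auth mode a notebook session should use.

    OCI_RESOURCE_PRINCIPAL_VERSION is injected when resource principal
    auth is available; otherwise fall back to the config-file approach.
    """
    if environ.get("OCI_RESOURCE_PRINCIPAL_VERSION"):
        return "resource_principal"
    return "api_key"
```

The returned string matches the values accepted by `ads.set_auth`.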
(Default) Authenticating Using OCI Configuration File and API Keys
You can operate as your own personal IAM user by setting up an OCI configuration file and API keys to access OCI resources. This is the default authentication approach.
To authenticate using the configuration file and API key approach, you must upload an OCI configuration file into the notebook session's `/home/datascience/.oci/` directory. For the relevant profile defined in the OCI configuration file, you also need to upload or create the required `.pem` files.
You can use the `api_keys.ipynb` notebook to interactively create OCI configuration and API key files. To launch the `api_keys.ipynb` notebook, click Notebook Examples in the JupyterLab Launcher tab.
Working with Existing Code Files
You can create new files or work with your own existing files. Files can be uploaded from your local machine by clicking Upload in the JupyterLab interface or by dragging and dropping them.
If you don't have a private key, you can create one in the notebook session by running the `ssh-keygen` command in the JupyterLab environment.
These instructions use a Git repository as an example, though the steps are similar for other repositories. Flows between third-party version control providers and internal Git servers may differ.
You can run `sftp`, `scp`, `curl`, `wget`, or `rsync` commands to pull files into your notebook session environment, subject to the networking limitations imposed by your VCN and subnet selection.
Installing Additional Python Libraries
You can install a library that's not preinstalled in the notebook session image. You can install and modify a pre-built conda environment or create a conda environment from scratch.
For more information, see the section on Installing Extra Libraries in the ADS documentation.
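From a notebook cell, one way to script such installs is through the kernel's own interpreter (the helper name is illustrative). Using `sys.executable` ensures the package lands in the conda environment the kernel is actually running in, rather than some other Python on the path:

```python
import subprocess
import sys

def run_pip(*args):
    """Run pip in the kernel's environment, e.g. run_pip("install", "pkg").

    Returns pip's stdout; raises CalledProcessError on failure.
    """
    return subprocess.run(
        [sys.executable, "-m", "pip", *args],
        check=True, capture_output=True, text=True,
    ).stdout
```

For example, `run_pip("install", "some-package")` behaves like running `pip install some-package` in a terminal.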
Using the Provided Environment Variables in Notebook Sessions
When you start up a notebook session, the service creates useful environment variables that you can use in your code:
| Name | Description |
|---|---|
| `TENANCY_OCID` | The OCID of the tenancy that the notebook session belongs to. |
| `PROJECT_OCID` | The OCID of the project associated with the current notebook session. |
| `PROJECT_COMPARTMENT_OCID` | The OCID of the compartment of the project that the notebook session is associated with. |
| `USER_OCID` | Your user OCID. |
| `NB_SESSION_OCID` | The OCID of the current notebook session. |
| `NB_SESSION_COMPARTMENT_OCID` | The OCID of the compartment of the current notebook session. |
| | The path to the OCI resource principal token. |
| | The ID of the OCI resource principal token. |
To access these environment variables in your notebook session, use the Python `os` library. For example:

```python
import os

project_ocid = os.environ['PROJECT_OCID']
print(project_ocid)
```
The `NB_SESSION_COMPARTMENT_OCID` and `PROJECT_COMPARTMENT_OCID` values do not update in a running notebook session if the resources have moved compartments after the notebook session was created.
Using Custom Environment Variables
Use your own custom environment variables in notebook sessions.
After you define your custom environment variables, access them in your notebook session with the Python `os` library. For example, if you define a key-value pair with the key `MY_CUSTOM_VAR1` and the value `VALUE-1`, then running the following code prints `VALUE-1`:

```python
import os

my_custom_var1 = os.environ['MY_CUSTOM_VAR1']
print(my_custom_var1)
```
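Note that `os.environ[...]` raises a `KeyError` when the variable is not set; `os.environ.get` lets you supply a default instead (the default value here is just an example):

```python
import os

# Fall back to a default when the custom variable may not be set:
my_custom_var1 = os.environ.get('MY_CUSTOM_VAR1', 'default-value')
```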
The system does not allow you to overwrite the system-provided environment variables with your custom ones. For example, you cannot name a custom variable `USER_OCID`.
Using the Oracle Accelerated Data Science SDK
The Oracle Accelerated Data Science (ADS) SDK speeds up common data science activities by providing tools that automate and simplify routine tasks. It also gives data scientists a friendly Python interface to OCI services, including Data Science (and its jobs), Big Data, Data Flow, Object Storage, Streaming, and Vault, as well as to Oracle Database. ADS gives you an interface to manage the life cycle of machine learning models, from data acquisition to model evaluation, interpretation, and deployment.
With ADS you can:
- Read datasets from Object Storage, Oracle Database (ATP, ADW, and on-premises), AWS S3, and other sources into Pandas data frames.
- Tune models using hyperparameter optimization with the `ADSTuner` module.
- Generate detailed evaluation reports of your model candidates with the `ADSEvaluator` module.
- Save machine learning models to the Data Science model catalog.
- Deploy models as HTTP endpoints with model deployment.
- Launch distributed ETL, data processing, and model training jobs in Spark using Data Flow.
- Connect to Big Data Service (BDS) from the notebook session; the cluster must have Kerberos enabled.
- Use feature types to characterize your data, create meaningful summary statistics, and plot. Use the warning and validation system to test the quality of your data.
- Train machine learning models using Data Science jobs.
- Manage the life cycle of conda environments using the `ads conda` CLI.
Using a Git Repository in Notebook Sessions
Clone your Git repositories and use Git commands in your notebook session.
You can have Data Science clone your Git repository into your notebook session. When you create a notebook session, add your Git repository URL to the Runtime Configuration section.
- Git Constraints:
  - The notebook session must have internet access for a Git repository to be cloned.
  - Only public Git repositories are supported.
  - A maximum of three Git repository URLs is allowed.
  - The maximum length of a URL is 256 characters.
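The URL limits above can be sketched as a quick client-side check before you save the Runtime Configuration (the function name is illustrative and not part of any OCI SDK):

```python
def validate_repo_urls(urls):
    """Check Git repository URLs against the notebook session limits:
    at most three URLs, each at most 256 characters long."""
    if len(urls) > 3:
        raise ValueError("at most three Git repository URLs are allowed")
    for url in urls:
        if len(url) > 256:
            raise ValueError(f"URL exceeds 256 characters: {url[:40]}...")
    return True
```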
- Git-Related Directories in the Notebook Sessions:
  - Find the clones of your Git repositories in your notebook session's `/home/datascience/repos` directory.
  - For clone status such as success, failure, or in-progress, go to `/opt/log/odsc.log`.
  - For verbose logs, go to `/var/log/jupyterlab/runtime_config.log`.

  Access the logs from a terminal in your notebook session.
For an existing notebook session, deactivate the notebook session. Then, when you activate it, add the Git repository URL in the Runtime Configuration section.
When you activate a notebook session, any previously saved data or files on the block volume of that deactivated notebook session are available in the activated notebook session. If you activate a notebook session with new Git repository URLs, every URL listed in the Runtime Configuration section, including previous URLs from the deactivated notebook session, is cloned to the notebook session's `/home/datascience/repos` directory.
To remove a cloned repository from a notebook session, delete it from the notebook session's `/home/datascience/repos` directory.
If you want to replace an old clone from a deactivated notebook with a new one, delete the unwanted Git repository URL listed in the Runtime Configuration section, add the new URL, and then activate the notebook session.
Using Git Repositories in Notebook Sessions
You can use the file browser in JupyterLab to view the Git repository and a terminal window to execute Git commands as you would with any Git repository.
Alternatively, you can use the Git interface by clicking Git in the navigation panel, which makes authenticating, creating branches, committing and pushing changes, and cloning easier.
First, initialize a new repository by clicking Initialize. The repository is displayed, showing the current branch; the staged, changed, and untracked changes; and the history. When you make changes to the repository, you can add comments and commit your changes using the dialogs. Then push your changes by clicking the push button at the top of the panel and supplying your Git credentials.
You can use the pull and refresh buttons to make sure that the repository is up to date. If errors occur, they appear in the lower right corner and you can click the error to get more information.
Connecting to Your Data
You can connect to your data in several ways. ADS provides connectors to a variety of data sources, including Oracle Cloud Infrastructure Object Storage, Oracle Database, and AWS S3. More information is available in the Connecting to Data Sources section of the ADS documentation. You can also install additional data connectors by creating or modifying existing conda environments.