Distributed Jobs
Distributed jobs provide a fully managed, multinode computing service that's designed to handle demanding, time-bound AI tasks such as large-scale model training, tuning, and data preprocessing.
Overview
By eliminating infrastructure complexity, distributed jobs offer on-demand scalability and cost efficiency while ensuring fast, secure, and reliable execution of AI workloads. You can now run ML or data workloads as jobs that span several compute nodes (VMs or GPUs) provisioned and orchestrated by Data Science, and specify how those nodes are grouped and interact. Each node group in the cluster can be independently configured (compute shape, replica count, environment, and containers) and provisioned in parallel or in sequence to match the needs of your distributed training or serving framework (such as PyTorch, TensorFlow, or custom solutions). You can bring your own code, artifacts, and containers. All cluster communication is securely managed, and full cluster metadata is made available to your jobs for precise orchestration.
Distributed Jobs provides the following key capabilities:
- Provisioning multinode infrastructure with node groups: Easily provision instances organized into node groups for optimized resource management.
- Node group-specific configurations: Configure each node group independently with its own infrastructure settings and job-specific configuration, including environment variables.
- Configurable provision order for node groups: Provision node groups in parallel when the distributed processing framework requires concurrent setup, or sequentially when it requires an ordered startup (see the illustrative sketch after this list).
- Managed intracluster communication by using a tertiary network: The service automatically manages communication between nodes within the cluster, ensuring seamless interaction. This communication occurs securely and in isolation over a tertiary network.
- Cluster metadata access: A metadata file is made available to each node, containing details about the cluster, such as IP addresses, FQDN, node ID, and rank.
- Bring your own container for multinode job runs: Configure Bring Your Own Container (BYOC) settings through the environment configuration, with the option to use a different environment for each node group.
- Mounted file storage for multinode job runs: Support for mounted file storage during multinode job runs provides shared storage for data, logs, checkpoints, and output.
- All OCI Data Science service shapes supported: All supported shapes for existing workloads are available, except for shapes with two or fewer cores.
- Multinode job run cancellation: Cancel job runs at any time with guaranteed successful termination.
- Multinode job run timeout: Automatically time out job runs that exceed the maximum runtime, with a default (and maximum) limit of 30 days.
- Configuration override on job run create: Override infrastructure and node configurations for both single and multinode job runs during job run creation to adjust and customize resource requirements across job run iterations.
- Logging integration: Integrated logging for job runs, with node-specific metadata.
- Resource principal availability: Full support for resource principal authentication for secure operations within the nodes.
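To make the node group model concrete, the following sketch shows one way a two-group training cluster might be laid out. It is purely illustrative: the field names, the shape, and the container image are assumptions for this example and do not mirror the actual Data Science API schema.

# Purely illustrative layout of a multinode job with two node groups.
# Field names are hypothetical and don't mirror the real API schema.
cluster_layout = {
    "node_groups": [
        {
            "name": "NODE_GROUP_0",      # coordinator group, provisioned first
            "replicas": 1,
            "shape": "VM.GPU.A10.2",     # example shape (assumption)
            "container_image": "<region>.ocir.io/<tenancy>/trainer:latest",
            "environment": {"ROLE": "coordinator"},
            "provision_order": 0,
        },
        {
            "name": "NODE_GROUP_1",      # worker group, provisioned after
            "replicas": 4,
            "shape": "VM.GPU.A10.2",
            "container_image": "<region>.ocir.io/<tenancy>/trainer:latest",
            "environment": {"ROLE": "worker"},
            "provision_order": 1,
        },
    ]
}

Sequential provisioning of this kind suits frameworks whose workers need a coordinator address at startup; fully parallel provisioning suits frameworks that rendezvous only after all nodes are up.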
Distributed Jobs lets you run large-scale AI workloads without worrying about infrastructure, while optimizing cost, performance, and security.
- Scalability: On-demand resource utilization to handle intensive AI tasks, ensuring fast execution.
- Cost efficiency: Pay only for the compute you use, for the time you use it, with no idle resource costs.
- Reduced complexity: No need to provision and manage complex multinode infrastructure, so you can focus on AI development instead.
- High availability: Built-in fault tolerance and guaranteed uptime for reliable operations.
- Seamless integration: Easily integrates with popular AI frameworks, data storage, containers, and monitoring tools.
- Security and compliance: Secure data handling and compliance on OCI.
- Automated workflows: Set up time-bound tasks with scheduling.
Provided Environment Variables
For the environment variables, see Job Environment Variables.
Policy Setup
No new policies are needed. If you already use Data Science, you most likely have the required policies. For more information, see Policies.
Getting Started
Creating a Job Configuration
Start a Job Run
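As a rough sketch of starting a job run programmatically rather than from the Console, the following uses the OCI Python SDK with placeholder OCIDs. Multinode-specific settings are omitted here; they would be supplied through the job definition or the job run's configuration override details.

import oci

# Minimal sketch: create a run of an existing job definition.
# OCIDs are placeholders; multinode-specific overrides are omitted.
config = oci.config.from_file()          # or use resource principal inside OCI
client = oci.data_science.DataScienceClient(config)

details = oci.data_science.models.CreateJobRunDetails(
    project_id="<project_ocid>",
    compartment_id="<compartment_ocid>",
    job_id="<job_ocid>",
    display_name="distributed-training-run",
)
job_run = client.create_job_run(details).data
print(job_run.id, job_run.lifecycle_state)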
Provided Metadata
- Metadata file: On each node, a metadata file is available at the path /home/datascience/job_run_cluster_metadata.json. This path is available as the value of the environment variable CLUSTER_NODES_METADATA_FILE. The file contains the IP address, FQDN, rank, node group name, and node ID of all the nodes in the job run. For example:
  [
    {"IpAddress": "<IP_address_1>", "FQDN": "node-0", "Rank": "0", "NodeGroupName": "NODE_GROUP_1", "NodeId": "<Node_Id_1>"},
    {"IpAddress": "<IP_address_2>", "FQDN": "node-1", "Rank": "1", "NodeGroupName": "NODE_GROUP_1", "NodeId": "<Node_Id_2>"},
    {"IpAddress": "<IP_address_3>", "FQDN": "node-2", "Rank": "2", "NodeGroupName": "NODE_GROUP_0", "NodeId": "<Node_Id_3>"}
  ]
  Note: This file updates as new nodes join the cluster and is eventually consistent.
- FQDN: Apart from the IP address, nodes can communicate with other nodes by using the FQDN. The FQDN can be fetched from the job_run_cluster_metadata.json file.
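As a minimal sketch of how a training script might consume this metadata, the following Python snippet reads the file through CLUSTER_NODES_METADATA_FILE and derives a rendezvous address from the rank-0 node's FQDN. The MASTER_ADDR/MASTER_PORT convention and the port number are assumptions for a PyTorch-style setup, not values provided by the service.

import json
import os

# The service exposes the metadata file path on every node through this
# environment variable (see "Provided Metadata" above).
metadata_path = os.environ["CLUSTER_NODES_METADATA_FILE"]
with open(metadata_path) as f:
    nodes = json.load(f)

# Use the rank-0 node as the rendezvous point; its FQDN is resolvable from
# the other nodes over the managed cluster network.
rank0 = next(n for n in nodes if n["Rank"] == "0")

# Assumption: a PyTorch-style setup that reads MASTER_ADDR/MASTER_PORT
# before torch.distributed.init_process_group(); the port is arbitrary.
os.environ["MASTER_ADDR"] = rank0["FQDN"]
os.environ.setdefault("MASTER_PORT", "29500")

print(f"{len(nodes)} nodes in cluster, rendezvous at {rank0['FQDN']}")

Because the file is eventually consistent, a script that needs every node may have to poll until the expected number of entries appears.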