Manage Compute

This section covers creating, modifying, and deleting compute clusters in your AI Data Platform.

About Compute Clusters

All-purpose compute clusters provide the compute resources to process your workloads in an AI Data Platform workspace.

You manage your compute clusters from the Compute page in your AI Data Platform.


Figure: AI Data Platform Compute page with Compute highlighted in the left pane

Types of Compute

Two types of compute exist in your AI Data Platform: all-purpose compute clusters and the Default Master Catalog Compute Cluster.

You can only create all-purpose compute clusters in your AI Data Platform. All-purpose compute clusters are suitable for a versatile range of workloads and can be attached to your notebooks and used in workflows. Unless otherwise specified, any references to 'compute cluster' or 'cluster' in documentation refer to all-purpose compute clusters.

The Default Master Catalog Compute Cluster is present in all AI Data Platforms. This cluster is responsible for essential AI Data Platform functions such as search crawls; refreshing catalog objects; creating, editing, and deleting objects; and testing connections.

Cluster Runtime

All-purpose compute clusters can be created with an Apache Spark 3.5 runtime. The runtime environment is compatible with:

  • Spark 3.5.0
  • Delta 3.2.0 (pre-included)
  • Python 3.11
  • Hadoop 3.3.4
  • Java 17

Oracle AI Data Platform currently supports only Python and SQL-based user code. Java and Scala support is coming soon.
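The runtime matrix above can be expressed as a small compatibility check. This is an illustrative sketch, not a platform API; the `SUPPORTED_RUNTIME` mapping and the function name are assumptions based on the versions listed:

```python
# Illustrative only: the documented Apache Spark 3.5 runtime matrix.
SUPPORTED_RUNTIME = {
    "spark": "3.5.0",
    "delta": "3.2.0",
    "python": "3.11",
    "hadoop": "3.3.4",
    "java": "17",
}

def unsupported_components(env: dict) -> list:
    """Return components in `env` that do not match the runtime matrix.

    Versions are compared as prefixes, so Python "3.11.9" matches "3.11".
    Components absent from the matrix (e.g. Scala) are reported as
    unsupported, matching the Python/SQL-only note above.
    """
    return [
        name for name, version in env.items()
        if not version.startswith(SUPPORTED_RUNTIME.get(name, ""))
        or name not in SUPPORTED_RUNTIME
    ]
```

For example, `unsupported_components({"python": "3.10.4"})` flags the Python version, while a 3.11.x interpreter passes.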

Maintenance Updates for Compute Clusters

Oracle AI Data Platform compute automatically applies maintenance updates without user intervention. The maintenance updates cover any necessary security patches or bug fixes for operating system and AI Data Platform internal components.

AI Data Platform verifies there are no running clusters before applying these monthly maintenance updates.

NVIDIA GPU Shapes

NVIDIA GPU shapes use the following configurations:

GPU count | OCPU | Block storage (GB) | GPU memory (GB) | CPU memory (GB)
1         | 15   | 1500               | 24              | 240
2         | 30   | 3000               | 48              | 480

Note:

When you use NVIDIA GPU shapes, both the Driver and Worker shape must be an NVIDIA GPU. Mixing CPU and GPU shapes for the same cluster is currently not supported.
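The rule in this note can be sketched as a simple validation. This is illustrative only; `validate_shapes` and the shape string `"NVIDIA GPU"` are assumptions, not a documented API:

```python
def validate_shapes(driver_shape: str, worker_shape: str) -> None:
    """Reject mixed CPU/GPU clusters, per the note above.

    Driver and worker must both be NVIDIA GPU shapes, or both be
    CPU shapes; mixing the two is not supported.
    """
    driver_gpu = driver_shape == "NVIDIA GPU"
    worker_gpu = worker_shape == "NVIDIA GPU"
    if driver_gpu != worker_gpu:
        raise ValueError(
            "Mixing CPU and GPU shapes in the same cluster is not supported"
        )
```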

Create a Cluster

You can create compute clusters to run applications in your AI Data Platform.

When creating a cluster, select driver and worker options that best match the shape of the systems you want to mirror. You can keep a cluster constantly active, or set an interval of inactivity after which the cluster stops. Stopped clusters resume when called on by an attached workflow or notebook.
  1. Navigate to your workspace and click Compute.
  2. Click Create Cluster.
  3. Select Runtime version.
  4. Select the driver options for your cluster.
  5. Select the worker options for your cluster. These options apply to all cluster workers.
  6. Select whether the number of workers is static or scales automatically.
    • If Static amount, specify the number of workers.
    • If Autoscale, specify the minimum and maximum number of workers the cluster can scale to.
  7. For Run duration, select whether the cluster will stop running after a set duration of inactivity. If Idle timeout is selected, specify the idle time, in minutes, before the cluster will time out.
  8. Click Create.
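The choices in steps 3-7 can be sketched as a cluster definition. The function and field names below are hypothetical, not the platform's actual API; they only mirror the options the steps describe:

```python
def cluster_spec(runtime, driver, worker, *, workers=None,
                 autoscale=None, idle_timeout_minutes=None):
    """Build a hypothetical cluster definition mirroring steps 3-7.

    Exactly one of `workers` (static count) or `autoscale`
    ((min, max) tuple) must be given. `idle_timeout_minutes=None`
    means the cluster stays constantly active.
    """
    if (workers is None) == (autoscale is None):
        raise ValueError("specify either a static worker count or autoscale bounds")
    spec = {"runtime": runtime, "driver": driver, "worker": worker}
    if workers is not None:
        spec["workers"] = {"type": "static", "count": workers}
    else:
        lo, hi = autoscale
        spec["workers"] = {"type": "autoscale", "min": lo, "max": hi}
    spec["run_duration"] = (
        "always_on" if idle_timeout_minutes is None
        else {"idle_timeout_minutes": idle_timeout_minutes}
    )
    return spec
```

For example, `cluster_spec("Spark 3.5", "CPU", "CPU", workers=4, idle_timeout_minutes=30)` describes a static four-worker cluster that stops after 30 idle minutes.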

Create an NVIDIA GPU Cluster

You can use NVIDIA GPU shapes in all-purpose compute clusters to accelerate any workload in your unified AI and data pipeline.

  1. Navigate to your workspace and click Compute.
  2. Click Create Cluster.
  3. Select Runtime version.
  4. For your cluster driver options:
    • Select NVIDIA GPU as the Driver Shape.
    • Select 1 or 2 as the GPU count.
  5. For your cluster worker options:
    • Select NVIDIA GPU as the Worker Shape.
    • Select 1 or 2 as the GPU count.
  6. Select whether the number of workers is static or scales automatically.
    • If Static amount, specify the number of workers.
    • If Autoscale, specify the minimum and maximum number of workers the cluster can scale to.
  7. For Run duration, select whether the cluster will stop running after a set duration of inactivity. If Idle timeout is selected, specify the idle time, in minutes, before the cluster will time out.
  8. Click Create.
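Combining the GPU choices in steps 4-5 with the configurations from the NVIDIA GPU Shapes section, a hypothetical helper might look like this (the function and field names are assumptions, not a documented API):

```python
# Per-GPU-count configuration, copied from the NVIDIA GPU Shapes table.
GPU_SHAPES = {
    1: {"ocpu": 15, "block_storage_gb": 1500, "gpu_memory_gb": 24, "cpu_memory_gb": 240},
    2: {"ocpu": 30, "block_storage_gb": 3000, "gpu_memory_gb": 48, "cpu_memory_gb": 480},
}

def gpu_shape(gpu_count: int) -> dict:
    """Return an NVIDIA GPU driver/worker shape for a GPU count of 1 or 2."""
    if gpu_count not in GPU_SHAPES:
        raise ValueError("GPU count must be 1 or 2")
    return {"shape": "NVIDIA GPU", "gpu_count": gpu_count, **GPU_SHAPES[gpu_count]}
```

Because mixing CPU and GPU shapes is not supported, the same shape would be used for both the driver and the workers.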

NVIDIA GPU Cluster Tuning

You can tune your NVIDIA GPU clusters to optimize their performance by using recommendations from the GPU provider and by installing optional libraries.

Tuning GPU clusters can help optimize the performance of those clusters when called on by jobs in your AI Data Platform.

For NVIDIA GPU-based clusters, you can follow NVIDIA's Tuning Guide for recommendations and steps you can take to optimize performance.

You also have the option of installing Spark RAPIDS libraries to assist with optimization:

  • Spark RAPIDS library is a RAPIDS accelerator for Apache Spark and provides a set of plugins that leverage GPUs to accelerate processing.
  • Spark RAPIDS ML library enables GPU-accelerated, distributed machine learning on Apache Spark and provides several PySpark ML compatible algorithms powered by the RAPIDS cuML library.

The Spark RAPIDS library is commonly used first for feature engineering and data cleaning, and then cross validation is performed at scale using the Spark RAPIDS ML library. You can use these libraries for use cases like fraud detection (time series), web clickstream, and A/B experimentation.

Table 13-1 Recommended Spark Configurations

Setting | Value | Note
spark.executor.instances | 4 | Number of workers x GPU count per worker. For example, with 4 workers and 1 GPU per worker, the recommended value is 4 x 1 = 4.
spark.executor.cores | 16 | CPU cores per worker / GPU count per worker, maximum of 16
spark.executor.memory | 32 GB | 2 GB / core, or 80% of CPU memory / GPU count per worker, whichever is less
spark.task.resource.gpu.amount | 0.0625 | 1 / spark.executor.cores
spark.rapids.sql.concurrentGpuTasks | 3 | GPU memory / 8 GB, maximum of 4
spark.rapids.shuffle.multiThreaded.writer.threads | 32 | CPU cores / GPU count per worker
spark.rapids.shuffle.multiThreaded.reader.threads | 32 | CPU cores / GPU count per worker
spark.shuffle.manager | com.nvidia.spark.rapids.spark350.RapidsShuffleManager | -
spark.rapids.shuffle.mode | MULTITHREADED | -
spark.plugins | com.nvidia.spark.SQLPlugin | -
spark.executor.resource.gpu.amount | 1 | -
spark.sql.files.maxPartitionBytes | 2 GB | Optional; recommended for large datasets
spark.rapids.sql.batchSizeBytes | 2 GB | Optional; recommended for large datasets
spark.rapids.memory.host.spillStorageSize | 32 G | Optional; recommended for large datasets
spark.rapids.memory.pinnedPool.size | 8 G | Optional; recommended for large datasets
spark.sql.adaptive.coalescePartitions.minPartitionSize | 32 MB | Optional; recommended for large datasets
spark.sql.adaptive.advisoryPartitionSizeInBytes | 160 MB | Optional; recommended for large datasets
spark.rapids.filecache.enabled | true | Optional; recommended if workloads reuse datasets
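The formulas in the Note column of Table 13-1 can be applied programmatically. The sketch below is illustrative, not an official tool; it assumes a hypothetical per-worker shape (for example, 32 CPU cores per GPU, which reproduces the tabled values) and emits memory in Spark's `g` notation:

```python
def recommended_spark_conf(workers, gpus_per_worker, cpu_cores_per_worker,
                           cpu_memory_gb, gpu_memory_gb):
    """Derive the Table 13-1 recommendations from a cluster shape.

    The parameter names are illustrative, not platform API fields.
    """
    # CPU cores per worker / GPU count per worker, capped at 16
    cores = min(cpu_cores_per_worker // gpus_per_worker, 16)
    # 2 GB per core, or 80% of CPU memory per GPU, whichever is less
    memory_gb = min(2 * cores, int(0.8 * cpu_memory_gb / gpus_per_worker))
    threads = cpu_cores_per_worker // gpus_per_worker
    return {
        "spark.executor.instances": workers * gpus_per_worker,
        "spark.executor.cores": cores,
        "spark.executor.memory": f"{memory_gb}g",
        "spark.task.resource.gpu.amount": 1 / cores,
        "spark.rapids.sql.concurrentGpuTasks": min(gpu_memory_gb // 8, 4),
        "spark.rapids.shuffle.multiThreaded.writer.threads": threads,
        "spark.rapids.shuffle.multiThreaded.reader.threads": threads,
        "spark.shuffle.manager": "com.nvidia.spark.rapids.spark350.RapidsShuffleManager",
        "spark.rapids.shuffle.mode": "MULTITHREADED",
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.executor.resource.gpu.amount": 1,
    }
```

With 4 workers, 1 GPU per worker, 32 CPU cores, 240 GB of CPU memory, and 24 GB of GPU memory per worker, this yields the example values in the table (4 executor instances, 16 cores, 32 GB memory, 0.0625 GPU per task, 3 concurrent GPU tasks).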

Modify a Cluster

You can change settings or add additional parameters for your clusters.

  1. Navigate to your workspace and click Compute.
  2. Next to the compute cluster you want to modify, click Actions (the three-dot icon), and then click Edit.
  3. Modify the attributes of your compute cluster or add additional parameters as needed.
  4. Click Save.

Delete a Cluster

You can delete compute clusters that are unused or no longer needed.

  1. Navigate to your workspace and click Compute.
  2. Next to the cluster you want to delete, click Actions (the three-dot icon), and then click Delete.
  3. Click Delete.

View Cluster Details

You can review the shape and settings of a cluster at any time.

  1. Navigate to your workspace and click Compute.
  2. Click the name of the cluster you want to view details for.
  3. Click the Details tab.