GPU Memory Fabric Firmware Pinning

Your AI cloud infrastructure consists of multiple device types (hosts, GPUs, and NVLink switches), all of which must run compatible firmware versions to ensure stable operation. To maintain this compatibility, OCI enforces firmware version consistency across all components. You can also set the desired firmware bundle on your GPU memory fabric (called pinning) using the Memory Fabric API.

To set the desired firmware bundle, you use memoryFabricPreferences in UpdateComputeGpuMemoryFabric. Your selection takes precedence over any OCI process for determining the target firmware bundle in an automated upgrade, thus giving you greater control over the pace of migration to newer firmware.

(Note that you can also use memoryFabricPreferences in UpdateComputeGpuMemoryFabric to set the fabric recycle level. See GPU Memory Fabric Recycle Level.)

Setting GPU Memory Fabric Preferences

To set GPU memory fabric preferences, you can also use the OCI CLI (see the OCI command line reference for Compute). Use the compute-gpu-memory-fabric update command with its required parameters:

oci compute compute-gpu-memory-fabric update --compute-gpu-memory-fabric-id <customerGmfId> --memory-fabric-preferences '{"customerDesiredFirmwareBundleId": "<firmwareBundleOcid>", "fabricRecycleLevel": "<recycleLevel>" }'
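
If you script this command, quoting the JSON preferences by hand is easy to get wrong. The following is a minimal Python sketch that assembles the same argument list and serializes the preferences with json.dumps. The helper name build_update_command is hypothetical, and treating fabricRecycleLevel as optional is an assumption; the command and field names are taken from the example above.

```python
import json
import shlex
from typing import List, Optional


def build_update_command(fabric_id: str,
                         firmware_bundle_id: str,
                         recycle_level: Optional[str] = None) -> List[str]:
    """Return the argv list for the update command shown above.

    Hypothetical helper: it only builds the arguments, it does not run them.
    """
    preferences = {"customerDesiredFirmwareBundleId": firmware_bundle_id}
    # Assumption: fabricRecycleLevel can be left out when you only pin firmware.
    if recycle_level is not None:
        preferences["fabricRecycleLevel"] = recycle_level
    return [
        "oci", "compute", "compute-gpu-memory-fabric", "update",
        "--compute-gpu-memory-fabric-id", fabric_id,
        "--memory-fabric-preferences", json.dumps(preferences),
    ]


if __name__ == "__main__":
    # Placeholders are kept as-is; substitute your own OCIDs.
    argv = build_update_command("<customerGmfId>", "<firmwareBundleOcid>")
    # shlex.join produces a shell-safe string you can review before running.
    print(shlex.join(argv))
```

Returning a list rather than a single string lets you pass the result directly to subprocess.run without involving a shell.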

Firmware Bundle-Related APIs

Use these APIs to view and pin firmware bundles:

Firmware Bundle-related APIs (task, followed by the related API)

  • Obtain the available firmware bundles: ListFirmwareBundles
  • Specify the desired firmware bundle to be pinned to your GPU memory fabric: UpdateComputeGpuMemoryFabric
  • View the fabric's current firmware bundle: GetComputeGpuMemoryFabric
  • View the host's current firmware bundle: GetComputeHost
  • View the target firmware bundle, which is the bundle the memory fabric will upgrade to the next time it is unoccupied by a GPU memory cluster: GetComputeGpuMemoryFabric
  • View the GPU memory fabric's lifecycleState: GetComputeGpuMemoryFabric (see also GPU Memory Fabric States)
  • List all GPU memory clusters: ListComputeGpuMemoryClusters

See also ComputeGpuMemoryFabric.

Firmware Details

You can obtain firmware details from GetComputeGpuMemoryFabric.

This table describes the firmware-related fields in more detail.

Firmware Fields

Each field is listed below, followed by its description.

targetFirmwareBundleId

The firmware bundle that the fabric will be updated to the next time the GPU memory fabric is unoccupied by a GPU memory cluster.

When you call UpdateComputeGpuMemoryFabric to set memoryFabricPreferences and the call succeeds, the customerDesiredFirmwareBundleId value is reflected in the targetFirmwareBundleId field of the response.

Note: OCI can't upgrade or downgrade a GPU memory fabric to the customerDesiredFirmwareBundleId specified in UpdateComputeGpuMemoryFabric while the fabric's state is OCCUPIED (occupied by Compute GPU memory clusters).

Instead, once memoryFabricPreferences are updated, you must then terminate all Compute GPU memory clusters on the fabric. After termination is complete, OCI reprovisions the fabric with the specified firmware.

currentFirmwareBundleId

The currently installed firmware bundle on the fabric.

Possible values are:

  • null - when the fabric is undergoing a firmware upgrade
  • different from targetFirmwareBundleId - when the fabric is occupied (you must terminate Compute GPU memory clusters before the firmware upgrade can start). Note that OCI might be running its own memory clusters as part of its validation process, and they will be preempted if there's a firmware upgrade.
  • equal to targetFirmwareBundleId - when the firmware upgrade is complete.

firmwareUpdateState

Indicates whether a pending firmware upgrade on the fabric exists. Possible values are:

  • WILL_UPDATE: pending or ongoing firmware upgrade
  • NO_UPDATE: no pending firmware upgrade
  • SKIP_RECYCLE_ENABLED: fabricRecycleLevel has been set to SKIP_RECYCLE

firmwareUpdateReason

Optional message describing the reason behind firmware update decisions.
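
To illustrate how the three documented cases of currentFirmwareBundleId can be told apart, here is a minimal Python sketch. It assumes you already have the GetComputeGpuMemoryFabric response as a dictionary keyed by the API field names; the function names and the labels UPGRADING, PENDING, and COMPLETE are hypothetical and are not values returned by OCI.

```python
from typing import Any, Mapping, Optional


def classify_current_bundle(current: Optional[str], target: Optional[str]) -> str:
    """Map the documented currentFirmwareBundleId cases to a hypothetical label."""
    if current is None:
        # null: the fabric is undergoing a firmware upgrade.
        return "UPGRADING"
    if current != target:
        # Differs from the target: the fabric is occupied and the upgrade
        # can't start until its GPU memory clusters are terminated.
        return "PENDING"
    # Equal to the target: the firmware upgrade is complete.
    return "COMPLETE"


def summarize_firmware(fabric: Mapping[str, Any]) -> str:
    """Build a one-line summary from the firmware-related fields."""
    label = classify_current_bundle(
        fabric.get("currentFirmwareBundleId"),
        fabric.get("targetFirmwareBundleId"),
    )
    state = fabric.get("firmwareUpdateState", "unknown")
    reason = fabric.get("firmwareUpdateReason")  # optional field
    summary = f"{label} (firmwareUpdateState={state})"
    return f"{summary}: {reason}" if reason else summary
```

For example, a fabric whose currentFirmwareBundleId differs from its targetFirmwareBundleId is classified as PENDING, which signals that its GPU memory clusters still need to be terminated.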

FAQs

Can an ongoing firmware update be interrupted?

An ongoing firmware update can't be interrupted or canceled while the GPU memory fabric is in the PROVISIONING state.

To apply a new customerDesiredFirmwareBundleId, you must wait until the fabric transitions to the AVAILABLE state, at which point a new update can be initiated.

What does a 409 Conflict mean when updating a memory fabric?

A 409 Conflict indicates that the GPU memory fabric is currently in the PROVISIONING state and isn't eligible for updates. Updates are allowed only when the fabric is in the AVAILABLE or OCCUPIED state.
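
One way to avoid this error is to check the fabric's lifecycleState before calling UpdateComputeGpuMemoryFabric. The following Python sketch encodes the rule stated above. The function name can_update_preferences is hypothetical, and the lifecycleState value is assumed to come from a prior GetComputeGpuMemoryFabric call.

```python
# States in which the text above says updates are allowed.
UPDATABLE_STATES = frozenset({"AVAILABLE", "OCCUPIED"})


def can_update_preferences(lifecycle_state: str) -> bool:
    """Return True if an update is expected to be accepted in this state.

    Any other state, including PROVISIONING, is treated as not eligible.
    """
    return lifecycle_state in UPDATABLE_STATES
```

The state can still change between the check and the update call, so a caller should treat this as a pre-check and continue to handle a 409 Conflict.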

What happens if I terminate my GPU memory cluster before setting customerDesiredFirmwareBundleId?

All the hosts in the terminated GPU memory cluster are first provisioned with the existing firmware bundle. The remaining behavior depends on whether the GPU memory fabric is associated with a single GPU memory cluster or multiple GPU memory clusters:

Single GPU memory cluster

When the only GPU memory cluster on the fabric is terminated:

  1. The fabric may transition back to the AVAILABLE state after termination completes. In this case, you can immediately set a new customerDesiredFirmwareBundleId. The remaining available hosts will then be provisioned with the new bundle. Hosts returning from the terminated GPU memory cluster will be reprovisioned to align with the new desired bundle.
  2. The fabric may also transition to the PROVISIONING state, if provisioning is needed. In this case, the ongoing provisioning can't be canceled, and you must wait for it to complete before applying a new update.

Multiple GPU memory clusters

When one of multiple GPU memory clusters is terminated:

  1. The fabric remains in the OCCUPIED state. You can still set a new customerDesiredFirmwareBundleId. However, available hosts will not be immediately provisioned with the new bundle.
  2. Provisioning to the new bundle will occur only after all GPU memory clusters are terminated and the fabric becomes eligible for provisioning. Hosts that return with the old bundle will be reprovisioned again to align with the new desired bundle.
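
The single-cluster and multiple-cluster outcomes above can be summarized as a lookup on the fabric's lifecycleState after the termination. This is an illustrative Python sketch only: the function name and the returned guidance strings are hypothetical and paraphrase the cases described in this answer.

```python
def guidance_after_termination(lifecycle_state: str) -> str:
    """Suggest a next step once a GPU memory cluster has been terminated."""
    if lifecycle_state == "AVAILABLE":
        # Single cluster, no provisioning needed.
        return "Set customerDesiredFirmwareBundleId now."
    if lifecycle_state == "PROVISIONING":
        # Single cluster, provisioning started and can't be canceled.
        return "Wait for provisioning to complete, then set customerDesiredFirmwareBundleId."
    if lifecycle_state == "OCCUPIED":
        # Other clusters remain on the fabric.
        return ("Set customerDesiredFirmwareBundleId; provisioning starts "
                "after all GPU memory clusters are terminated.")
    return "Check GPU Memory Fabric States for this lifecycleState."
```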

Do I need to launch and terminate a GPU memory cluster to update firmware?

If the GPU memory fabric is already occupied by a GPU memory cluster, we recommend updating the fabric to a new firmware bundle, and then terminating the GPU memory cluster to start the firmware update.

If the fabric's lifecycleState is AVAILABLE, you don't have to launch a GPU memory cluster. Instead, update the fabric to a new firmware bundle. OCI automatically starts the firmware update process after about 15 to 30 minutes.
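
Because the update starts some time after you set the preference, you might poll the fabric until currentFirmwareBundleId matches targetFirmwareBundleId. The following Python sketch shows one way to do that. The fetch_fabric argument is a hypothetical callable that you supply (for example, a wrapper around GetComputeGpuMemoryFabric) and that returns the fabric as a dictionary keyed by the API field names. The polling interval and timeout defaults are assumptions, not OCI recommendations.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_firmware(fetch_fabric: Callable[[], Mapping[str, Any]],
                      poll_seconds: float = 60.0,
                      timeout_seconds: float = 3600.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the current bundle equals the target bundle.

    Returns True when the bundles match, or False if the timeout elapses first.
    """
    deadline = clock() + timeout_seconds
    while True:
        fabric = fetch_fabric()
        current = fabric.get("currentFirmwareBundleId")
        target = fabric.get("targetFirmwareBundleId")
        # A null current bundle means the upgrade is still in progress.
        if current is not None and current == target:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without real delays; in normal use you can leave them at their defaults.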