GPU Memory Fabric Firmware Pinning
Your AI cloud infrastructure consists of multiple device types (hosts, GPUs, and NVLink switches), all of which must run compatible firmware versions to ensure stable operation. To maintain this compatibility, OCI enforces firmware version consistency across all components. You can also set the desired firmware bundle on your GPU memory fabric (called pinning) by using the Memory Fabric API.
To set the desired firmware bundle, you use memoryFabricPreferences in UpdateComputeGpuMemoryFabric. Your selection takes precedence over any OCI process for determining the target firmware bundle in an automated upgrade, thus giving you greater control over the pace of migration to newer firmware.
(Note that you can also use memoryFabricPreferences in UpdateComputeGpuMemoryFabric to set the fabric recycle level. See GPU Memory Fabric Recycle Level.)
Setting GPU Memory Fabric Preferences
You can also set GPU memory fabric preferences from the command line (see the OCI command line reference for Compute).
Use the compute-gpu-memory-fabric update command with the required parameters:
oci compute compute-gpu-memory-fabric update --compute-gpu-memory-fabric-id <customerGmfId> --memory-fabric-preferences '{"customerDesiredFirmwareBundleId": "<firmwareBundleOcid>", "fabricRecycleLevel": "<recycleLevel>" }'
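If you script this command, quoting the JSON value by hand is error-prone. The following is a minimal sketch that builds the argument list with the flags shown above and serializes the preferences with `json.dumps`. The helper name `build_update_command` is hypothetical, and the sketch assumes that either preference key can be omitted when you only want to set the other; verify that against the API reference before relying on it.

```python
import json
import shlex


def build_update_command(fabric_id, firmware_bundle_id=None, recycle_level=None):
    """Return the argv list for `oci compute compute-gpu-memory-fabric update`.

    Hypothetical helper: only the command name and the two flags come from
    the documentation. Omitting one preference key is an assumption.
    """
    preferences = {}
    if firmware_bundle_id is not None:
        preferences["customerDesiredFirmwareBundleId"] = firmware_bundle_id
    if recycle_level is not None:
        preferences["fabricRecycleLevel"] = recycle_level
    if not preferences:
        raise ValueError("set at least one memory fabric preference")
    return [
        "oci", "compute", "compute-gpu-memory-fabric", "update",
        "--compute-gpu-memory-fabric-id", fabric_id,
        # json.dumps produces valid JSON, so no manual quote escaping is needed.
        "--memory-fabric-preferences", json.dumps(preferences),
    ]


if __name__ == "__main__":
    argv = build_update_command(
        "<customerGmfId>", firmware_bundle_id="<firmwareBundleOcid>"
    )
    # Print a shell-safe rendering for review. To run the command instead,
    # pass the list as-is to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Passing the list directly to `subprocess.run` avoids a shell entirely, so the JSON reaches the CLI unchanged.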
Firmware Bundle-Related APIs
Use these APIs to view and pin firmware bundles:
| Task | Related API |
|---|---|
| Obtain the available firmware bundles. | ListFirmwareBundles |
| Specify the desired firmware bundle to be pinned to your GPU memory fabric. | UpdateComputeGpuMemoryFabric |
| View the fabric's current firmware bundle. | GetComputeGpuMemoryFabric |
| View the host's current firmware bundle. | GetComputeHost |
| View the target firmware bundle that represents what the memory fabric will upgrade to, when the memory fabric's lifecycleState is next unoccupied by a GPU memory cluster. | GetComputeGpuMemoryFabric |
| View the GPU memory fabric's lifecycleState. | GetComputeGpuMemoryFabric. See also GPU Memory Fabric States. |
| List all GPU memory clusters. | ListComputeGpuMemoryClusters |
See also ComputeGpuMemoryFabric.
Firmware Details
You can obtain firmware details from GetComputeGpuMemoryFabric.
This table describes the firmware-related fields in more detail.
| Field | Description |
|---|---|
| targetFirmwareBundleId | The firmware bundle that the current firmware bundle will be set to, when the GPU memory fabric's When the UpdateComputeGpuMemoryFabric API is called to set Note: OCI can't upgrade or downgrade GPU memory fabric to the specified Instead, once |
| currentFirmwareBundleId | The currently installed firmware bundle on the fabric. Possible values are: |
| firmwareUpdateState | Indicates whether a pending firmware upgrade on the fabric exists. Possible values are: |
| firmwareUpdateReason | Optional message describing the reason behind firmware update decisions. |
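As an illustration of how these fields relate, here is a minimal sketch that extracts them from a GetComputeGpuMemoryFabric response. It assumes the response has already been parsed into a Python dict with the camelCase keys from the table. The `FirmwareStatus` class, the `firmware_status` function, and the "target differs from current" check are hypothetical conveniences, not part of any OCI SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirmwareStatus:
    """Firmware-related fields of a GPU memory fabric (hypothetical wrapper)."""

    current_bundle_id: Optional[str]
    target_bundle_id: Optional[str]
    update_state: Optional[str]
    update_reason: Optional[str]

    @property
    def bundle_change_expected(self) -> bool:
        # Assumption: a target that differs from the installed bundle means
        # the fabric will move to the target once it becomes eligible.
        return (
            self.target_bundle_id is not None
            and self.target_bundle_id != self.current_bundle_id
        )


def firmware_status(fabric: dict) -> FirmwareStatus:
    """Pick the firmware fields out of a parsed response dict."""
    return FirmwareStatus(
        current_bundle_id=fabric.get("currentFirmwareBundleId"),
        target_bundle_id=fabric.get("targetFirmwareBundleId"),
        update_state=fabric.get("firmwareUpdateState"),
        # firmwareUpdateReason is optional, so it may be absent.
        update_reason=fabric.get("firmwareUpdateReason"),
    )
```

The sketch deliberately does not interpret firmwareUpdateState; treat its values as opaque strings unless the API reference defines them.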
FAQs
An ongoing firmware update can't be interrupted or canceled while the GPU memory fabric is in the PROVISIONING state.
To apply a new customerDesiredFirmwareBundleId, you must wait until the fabric transitions to the AVAILABLE state, at which point a new update can be initiated.
A 409 Conflict indicates that the GPU memory fabric is currently in the PROVISIONING state and isn't eligible for updates. Updates are allowed only when the fabric is in the AVAILABLE or OCCUPIED state.
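The state rule above can be sketched as a small wait loop that checks the fabric's lifecycleState before attempting an update, rather than retrying on a 409. This is only an illustration: `get_state` is a caller-supplied function (for example, one that calls GetComputeGpuMemoryFabric and returns the lifecycleState string), and the function names, poll interval, and poll limit are assumptions.

```python
import time

# States in which the documentation says updates are allowed.
UPDATABLE_STATES = {"AVAILABLE", "OCCUPIED"}


def can_update(lifecycle_state: str) -> bool:
    """Return True if a fabric in this state accepts an update."""
    return lifecycle_state in UPDATABLE_STATES


def wait_until_updatable(get_state, poll_seconds=60.0, max_polls=30, sleep=time.sleep):
    """Poll get_state() until the fabric accepts updates, then return the state.

    get_state is a hypothetical callable supplied by the caller.
    Raises TimeoutError if the fabric is still not updatable after max_polls.
    """
    for attempt in range(max_polls):
        state = get_state()
        if can_update(state):
            return state
        # Do not sleep after the final unsuccessful poll.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError("fabric did not reach AVAILABLE or OCCUPIED in time")
```

Injecting `sleep` keeps the loop easy to exercise without real delays.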
At first, all the hosts in the terminated GPU memory cluster are provisioned with the existing firmware bundle. What happens next depends on whether the GPU memory fabric is associated with a single GPU memory cluster or with multiple GPU memory clusters:
Single GPU memory cluster
When the only GPU memory cluster on the fabric is terminated:
- The fabric may transition back to the AVAILABLE state after termination completes. In this case, you can immediately set a new customerDesiredFirmwareBundleId. The remaining available hosts will then be provisioned with the new bundle. Hosts returning from the terminated GPU memory cluster will be reprovisioned to align with the new desired bundle.
- The fabric may also transition to the PROVISIONING state, if provisioning is needed. In this case, the ongoing provisioning can't be canceled, and you must wait for it to complete before applying a new update.
Multiple GPU memory clusters
When one of multiple GPU memory clusters is terminated:
- The fabric remains in the OCCUPIED state. You can still set a new customerDesiredFirmwareBundleId. However, available hosts will not be immediately provisioned with the new bundle.
- Provisioning to the new bundle will occur only after all GPU memory clusters are terminated and the fabric becomes eligible for provisioning. Hosts that return with the old bundle will be reprovisioned to align with the new desired bundle.
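The two cases above can be summarized as a small decision function. This is a simplified model of the described behavior, not an OCI API: the names `TerminationOutcome` and `outcome_after_termination` are hypothetical, and `provisioning_needed` stands in for a condition that OCI determines internally.

```python
from typing import NamedTuple


class TerminationOutcome(NamedTuple):
    state: str                 # expected fabric lifecycleState
    can_set_new_bundle: bool   # whether a new customerDesiredFirmwareBundleId is accepted
    note: str                  # when hosts move to a newly pinned bundle


def outcome_after_termination(clusters_remaining: int, provisioning_needed: bool) -> TerminationOutcome:
    """Model the fabric state after one GPU memory cluster is terminated."""
    if clusters_remaining < 0:
        raise ValueError("clusters_remaining can't be negative")
    if clusters_remaining > 0:
        # Multiple clusters: the fabric stays occupied by the others.
        return TerminationOutcome(
            "OCCUPIED", True,
            "hosts are provisioned only after all clusters are terminated",
        )
    if provisioning_needed:
        # Single cluster, provisioning in progress: it can't be canceled.
        return TerminationOutcome(
            "PROVISIONING", False,
            "wait for provisioning to complete before applying a new update",
        )
    # Single cluster, no provisioning needed.
    return TerminationOutcome(
        "AVAILABLE", True,
        "available and returning hosts are provisioned with the new bundle",
    )
```

For example, `outcome_after_termination(0, True).can_set_new_bundle` is `False`, which matches the rule that an update can't be applied while the fabric is in the PROVISIONING state.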
If the GPU memory fabric is already occupied by a GPU memory cluster, we recommend updating the fabric to a new firmware bundle, and then terminating the GPU memory cluster to start the firmware update.
If the fabric's lifecycleState is AVAILABLE, you don't have to launch a GPU memory cluster. Instead, update the fabric to a new firmware bundle. OCI automatically starts the firmware update process after about 15 to 30 minutes.