3.5.2 Replacing a PMEM Device Due to Degraded Performance

If a PMEM device has degraded performance, you might need to replace the module.

If degraded performance is detected on a PMEM device, the module status is set to warning - predictive failure and an alert is generated. The alert includes specific instructions for replacing the PMEM device. If you have configured the system for alert notifications, the alert is also sent by email to the designated address.

The predictive failure status indicates that the PMEM device will fail soon and should be replaced at the earliest opportunity. No new data is cached on the PMEM device until it is replaced.

You can also identify a PMEM device with the status warning - predictive failure by using the following command:

CellCLI> LIST PHYSICALDISK WHERE disktype=PMEM AND status='warning - predictive failure' DETAIL

         name:               PMEM_0_6
         diskType:           PMEM
         luns:               P0_D6
         makeModel:          "Intel NMA1XBD128GQS"
         physicalFirmware:   1.02.00.5365
         physicalInsertTime: 2019-11-30T21:24:45-08:00
         physicalSerial:     8089-A2-1838-00001234
         physicalSize:       126.375G
         slotNumber:         "CPU: 0; DIMM: 6"
         status:             warning - predictive failure
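
If you collect this output from several storage servers, it can help to turn the DETAIL listing into structured records. The following is a minimal Python sketch, not part of CellCLI; the function name parse_detail and the assumption that every record begins with a name attribute are illustrative only.

```python
def parse_detail(text):
    """Parse 'attribute: value' lines of a DETAIL listing into dicts.

    Assumption: each record starts with a 'name' attribute, as in the
    sample output above.
    """
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("CellCLI>"):
            continue  # skip blank lines and the echoed command
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not an 'attribute: value' line
        key = key.strip()
        value = value.strip().strip('"')
        if key == "name" and current:
            records.append(current)  # a new record begins
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


SAMPLE_DETAIL = """\
         name:               PMEM_0_6
         diskType:           PMEM
         luns:               P0_D6
         slotNumber:         "CPU: 0; DIMM: 6"
         status:             warning - predictive failure
"""

disks = parse_detail(SAMPLE_DETAIL)
for disk in disks:
    print(disk["name"], disk["slotNumber"], disk["status"])
```

Only the first colon on each line separates the attribute from its value, so values that contain colons, such as slotNumber and physicalInsertTime, are kept whole.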

You can also locate the PMEM device using the information in the LIST DISKMAP command:

CellCLI> LIST DISKMAP
Name      PhysicalSerial         SlotNumber        Status       PhysicalSize
   CellDisk       DevicePartition    GridDisks
PMEM_0_1  8089-a2-0000-00000460  "CPU: 0; DIMM: 1"  normal      126G
   PM_00_cel01    /dev/dax5.0        PMEMCACHE_PM_00_cel01
PMEM_0_3  8089-a2-0000-000004c2  "CPU: 0; DIMM: 3"  normal      126G
   PM_02_cel01    /dev/dax4.0        PMEMCACHE_PM_02_cel01
PMEM_0_5  8089-a2-0000-00000a77  "CPU: 0; DIMM: 5"  normal      126G
   PM_03_cel01    /dev/dax3.0        PMEMCACHE_PM_03_cel01
PMEM_0_6  8089-a2-0000-000006ff  "CPU: 0; DIMM: 6"  warning -   126G
   PM_04_cel01    /dev/dax0.0        PMEMCACHE_PM_04_cel01
PMEM_0_8  8089-a2-0000-00000750  "CPU: 0; DIMM: 8"  normal      126G
   PM_05_cel01    /dev/dax1.0        PMEMCACHE_PM_05_cel01
PMEM_0_10 8089-a2-0000-00000103  "CPU: 0; DIMM: 10" normal      126G
   PM_01_cel01    /dev/dax2.0        PMEMCACHE_PM_01_cel01
PMEM_1_1  8089-a2-0000-000008f6  "CPU: 1; DIMM: 1"  normal      126G
   PM_06_cel01    /dev/dax11.0       PMEMCACHE_PM_06_cel01
PMEM_1_3  8089-a2-0000-000003bb  "CPU: 1; DIMM: 3"  normal      126G
   PM_08_cel01    /dev/dax10.0       PMEMCACHE_PM_08_cel01
PMEM_1_5  8089-a2-0000-00000708  "CPU: 1; DIMM: 5"  normal      126G
   PM_09_cel01    /dev/dax9.0        PMEMCACHE_PM_09_cel01
PMEM_1_6  8089-a2-0000-00000811  "CPU: 1; DIMM: 6"  normal      126G
   PM_10_cel01    /dev/dax6.0        PMEMCACHE_PM_10_cel01
PMEM_1_8  8089-a2-0000-00000829  "CPU: 1; DIMM: 8"  normal      126G
   PM_11_cel01    /dev/dax7.0        PMEMCACHE_PM_11_cel01
PMEM_1_10 8089-a2-0000-00000435  "CPU: 1; DIMM: 10" normal      126G
   PM_07_cel01    /dev/dax8.0        PMEMCACHE_PM_07_cel01
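
Each device in the LIST DISKMAP output spans two lines, and the Status column can be truncated (for example, warning -). The following Python sketch shows one way to pair the lines and pick out any device whose status is not normal. It is an illustration only: the function name parse_diskmap and the dictionary keys are assumptions, and the layout is taken from the sample output above rather than from a documented format.

```python
import shlex


def parse_diskmap(text):
    """Pair each device line with its indented continuation line."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("CellCLI>"):
            continue
        tokens = shlex.split(line)  # keeps "CPU: 0; DIMM: 6" as one token
        if line[0].isspace():
            # Continuation line: CellDisk, DevicePartition, GridDisks
            if current is not None and len(tokens) >= 3:
                current["cell_disk"] = tokens[0]
                current["device_partition"] = tokens[1]
                current["grid_disks"] = tokens[2:]
        elif len(tokens) >= 5:
            current = {
                "name": tokens[0],
                "serial": tokens[1],
                "slot": tokens[2],
                # Status may contain spaces, such as "warning -"
                "status": " ".join(tokens[3:-1]),
                "size": tokens[-1],
            }
            records.append(current)
        else:
            current = None
    # Drop the column header row
    return [r for r in records if r["name"] != "Name"]


SAMPLE_DISKMAP = """\
Name      PhysicalSerial         SlotNumber        Status       PhysicalSize
   CellDisk       DevicePartition    GridDisks
PMEM_0_5  8089-a2-0000-00000a77  "CPU: 0; DIMM: 5"  normal      126G
   PM_03_cel01    /dev/dax3.0        PMEMCACHE_PM_03_cel01
PMEM_0_6  8089-a2-0000-000006ff  "CPU: 0; DIMM: 6"  warning -   126G
   PM_04_cel01    /dev/dax0.0        PMEMCACHE_PM_04_cel01
"""

suspect = [r for r in parse_diskmap(SAMPLE_DISKMAP) if r["status"] != "normal"]
for r in suspect:
    print(r["name"], r["slot"], r["cell_disk"], r["device_partition"])
```

The filter compares against normal instead of matching the full warning text, because the listing may cut the status short.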

If the PMEM device is used for write-back caching, then the data is flushed from the PMEM device to the flash cache. To confirm that the data has been flushed, check the cachedBy attribute of all the grid disks and verify that the affected PMEM device is not listed.
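
The cachedBy check can be scripted against the output of a command such as LIST GRIDDISK ATTRIBUTES name, cachedBy. The Python sketch below is illustrative only. It assumes that each output line holds a grid disk name followed by an optionally quoted, comma-separated cachedBy list, and that the affected device appears in that list under its cell disk name (PM_04_cel01 in the earlier example). The grid disk names in the sample are hypothetical, so verify both assumptions against the output on your own system.

```python
import shlex


def grid_disks_cached_by(listing, device):
    """Return the grid disks whose cachedBy list still includes device."""
    hits = []
    for line in listing.splitlines():
        tokens = shlex.split(line)
        if len(tokens) < 2 or tokens[0] == "CellCLI>":
            continue  # blank line, echoed command, or empty cachedBy
        name = tokens[0]
        cached_by = [c.strip() for c in " ".join(tokens[1:]).split(",")]
        if device in cached_by:
            hits.append(name)
    return hits


# Hypothetical listing; grid disk names are made up for illustration.
SAMPLE_GRIDDISKS = """\
DATA_CD_00_cel01   "PM_04_cel01, PM_05_cel01"
DATA_CD_01_cel01   "PM_05_cel01"
RECO_CD_00_cel01
"""

still_cached = grid_disks_cached_by(SAMPLE_GRIDDISKS, "PM_04_cel01")
if still_cached:
    print("Not yet flushed for:", ", ".join(still_cached))
else:
    print("Affected device is no longer listed in cachedBy.")
```

An empty result means that no grid disk still lists the affected device.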

  1. Locate the storage server that contains the failing PMEM device.
    A white Locator LED is lit to help locate the affected storage server. When you have located the server, you can use the Fault Remind button to locate the failed DIMM.

    Caution:

    Do not attempt to remove a faulty DCPMM DIMM when the Do Not Service LED indicator is illuminated.
  2. Power down the storage server with the failing PMEM device and unplug the power cable for the server.
  3. Replace the failing PMEM device.
  4. Restart the storage server.

    Note:

    During the restart, the storage server will shut down a second time to complete the initialization of the new PMEM device.

The new PMEM device is automatically used by the system. If the PMEM device is used for caching, then the effective cache size increases. If the PMEM device is used for commit acceleration, then commit acceleration is enabled on the device.