Working with Problems

To aid serviceability, Oracle ZFS Storage Appliance detects persistent hardware failures (faults) and software failures (defects, often included under faults). It reports them as active problems on the Maintenance: Problems page in the BUI and in the maintenance problems context in the CLI.

If the Phone Home service is enabled, active problems are automatically reported to Oracle Support, where a support case might be opened, depending on the service contract and the nature of the fault. Problem notification can be suspended while you are servicing Oracle ZFS Storage Appliance.

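The following topics are described in this section:

  • Viewing Active Problems

  • Repairing Active Problems

  • Suspending and Resuming Problem Notification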

Viewing Active Problems

The following table shows some example faults as they would be displayed in the Active Problems section of the Maintenance: Problems page in the BUI. For each problem, Oracle ZFS Storage Appliance reports what happened, when the problem was detected, the severity and type of the problem, and whether the problem has been phoned home. Severity can be Minor, Major, or Critical. Type can be Alert, Defect, Error, or Fault. Phoned Home is a date and time or Never. The table can be sorted by Date.

Table 1-9 Example BUI Problem Displays

Date                 Description                                       Type         Phoned Home

2022-09-16 13:56:36  SMART health-monitoring firmware reported that a  Major Fault  Never
                     disk failure is imminent.

2022-09-05 17:42:55  A disk of a different type (cache, log, or data)  Minor Fault  Never
                     was inserted into a slot. The newly inserted
                     device must be of the same type.

2022-08-21 16:40:37  The ZFS pool has experienced currently            Major Error  Never
                     unrecoverable I/O failures.

2022-07-16 22:03:22  A memory module is experiencing excessive         Major Fault  Never
                     correctable errors affecting large numbers of
                     pages.

Clicking on a problem shows more information about the problem in the Problem Details section of the page, including the impact to the system, affected components, the system's automated response (if any), and the recommended action for the administrator (if any).

To view the affected hardware component for a hardware fault and to optionally turn on its locator LED on Oracle ZFS Storage Appliance, see Locating a Failed Component - BUI, CLI.

The CLI provides similar information, as shown in the following example:

hostname:maintenance problems> show
Problems:

COMPONENT    DIAGNOSED            TYPE            DESCRIPTION
problem-000  2022-4-3 20:30:12    Major Fault     A sensor indicates that the
                                                  power supply '1235FM401W/PSU
                                                  01' is not operating
                                                  properly due to some
                                                  external condition.

problem-001  2022-4-3 17:53:58    Major Fault     External sensors indicate
                                                  that the power supply
                                                  'hostname/PSU 1' is no
                                                  longer operating correctly.

For more information, select a problem. Only the uuid, diagnosed, severity, type, and description fields are considered to be stable. Other property values might change in a new release.

hostname:maintenance problems> select problem-000
hostname:maintenance problem-000> show
Properties:
                          uuid = uuid
                          code = SENSOR-8000-7L
                     diagnosed = 2022-4-3 20:30:12
                   phoned_home = never
                      severity = Major
                          type = Fault
                           url = https://support.oracle.com/msg/SENSOR-8000-7L
                   description = A sensor indicates that the power supply
                                 '1235FM401W/PSU 01' is not operating properly
                                 due to some external condition.
                        impact = The enclosure may be getting inadequate
                                 power. Subsequent loss of power supplies may
                                 force the enclosure to shutdown.
                      response = None.
                        action = Check to see if the power cord is connected
                                 properly or if there are other conditions
                                 that may be causing inadequate power to be
                                 provided to the indicated power supply.
                                 Please refer to the associated reference
                                 document at
                                 https://support.oracle.com/msg/SENSOR-8000-7L
                                 for the latest service procedures and
                                 policies regarding this diagnosis.

Components:

component-000  100%  1235FM401W: PSU 01 (degraded)
                     Manufacturer: Oracle
                     Part number: part-number
                     Serial number: serial-number

hostname:maintenance problem-000> select component-000
hostname:maintenance problem-000 component-000> show
Properties:
                     certainty = 100
                        status = degraded
                 chassis_label = 1235FM401W
               component_label = PSU 01
                  manufacturer = Oracle
                          part = part-number
                        serial = serial-number
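
Because only the uuid, diagnosed, severity, type, and description fields are described as stable, a script that consumes this output should probably depend on those fields alone. The following Python sketch illustrates one way to do that. It assumes the text of a problem's show output has already been captured by some means outside the scope of this example. The names parse_properties, stable_view, STABLE_FIELDS, and SAMPLE are hypothetical and are not part of the appliance software.

```python
import re
from datetime import datetime

# Fields this section describes as stable across releases.
STABLE_FIELDS = ("uuid", "diagnosed", "severity", "type", "description")

# A property line looks like "   name = value". Wrapped continuation
# lines in the example output do not match this pattern.
_PROP_RE = re.compile(r"^\s*([a-z_]+) = (.*)$")

# Abbreviated copy of the example output shown above (illustration only).
SAMPLE = """\
Properties:
                          uuid = uuid
                          code = SENSOR-8000-7L
                     diagnosed = 2022-4-3 20:30:12
                   phoned_home = never
                      severity = Major
                          type = Fault
                   description = A sensor indicates that the power supply
                                 '1235FM401W/PSU 01' is not operating properly
                                 due to some external condition.

Components:
"""


def parse_properties(text):
    """Parse the Properties part of captured `show` output into a dict.

    Wrapped values are joined with single spaces. Parsing stops at the
    first blank line after at least one property has been read.
    """
    props = {}
    current = None
    in_props = False
    for line in text.splitlines():
        if line.strip() == "Properties:":
            in_props = True
            continue
        if not in_props:
            continue
        if not line.strip():
            if props:
                break
            continue
        match = _PROP_RE.match(line)
        if match:
            current = match.group(1)
            props[current] = match.group(2).strip()
        elif current is not None:
            props[current] += " " + line.strip()
    return props


def stable_view(props):
    """Keep only the stable fields and convert diagnosed to a datetime."""
    view = {name: props[name] for name in STABLE_FIELDS if name in props}
    if "diagnosed" in view:
        # strptime accepts the non-zero-padded month and day shown in the CLI.
        view["diagnosed"] = datetime.strptime(
            view["diagnosed"], "%Y-%m-%d %H:%M:%S"
        )
    return view


if __name__ == "__main__":
    print(stable_view(parse_properties(SAMPLE)))
```

Properties such as code, url, impact, and action are still parsed, but stable_view drops them so that downstream logic does not come to rely on values that might change in a new release.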

Related Topics

  • Persistent logs of all faults, defects, errors, and alerts are available under Maintenance: Logs in the BUI, and under maintenance logs in the CLI. For more information, see Using Logs.

  • Faults and defects are subcategories of alerts. Filter rules can be configured to cause Oracle ZFS Storage Appliance to email administrators or perform other actions when faults are detected. For more information about alerts, see Configuring Alerts in Oracle ZFS Storage Appliance Administration Guide, Release OS8.8.x.

Repairing Active Problems

Active problems can result from a hardware fault or a software defect. To repair an active problem, perform the steps described in the recommended action for that problem. For hardware faults, repair typically involves replacing a physical component. For software defects, repair typically involves reconfiguring and restarting the affected service.

After a problem is repaired, the problem no longer appears in the list of active problems.

Although the system can detect repairs automatically, some cases require manual intervention. If a problem persists after you complete the recommended action, contact Oracle Support. You might be instructed to mark the problem as repaired. Mark a problem as repaired manually only under the direction of Oracle service personnel or as part of a documented Oracle repair procedure.

Suspending and Resuming Problem Notification

Servicing the appliance can generate false failures. For example, replacing a disk generates FRU remove and Invalid Configuration events, which can in turn generate service requests (SRs).

To avoid sending SRs when no problem exists, you can suspend all notifications during the period when you are performing the service.

Suspending Problem Notification

To suspend all notifications, do one of the following:

  • BUI – Check the Suspend Notifications box at the top of the Maintenance: Problems page.

  • CLI – Enable the suspend_notification property in maintenance problems.

    hostname:maintenance problems> ls
    Properties:
              suspend_notification = disabled
                            period =

    The period property is read-only. As in the BUI, it displays the remaining amount of time that notifications will be suspended.
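
A maintenance script might want to confirm the suspension state before service begins. The following Python sketch reads that state from captured ls output like the example above. It assumes that the property reads enabled while notifications are suspended, which this section implies but does not show. The function name suspension_state and the LS_OUTPUT sample are hypothetical. The period value is returned as raw text because its format is not documented here.

```python
# Copy of the example `ls` output shown above (illustration only).
LS_OUTPUT = """\
Properties:
          suspend_notification = disabled
                        period =
"""


def suspension_state(ls_output):
    """Return (suspended, period) from captured `ls` output.

    suspended is True or False, or None if the property is not found.
    Assumption: the value is "enabled" while notifications are suspended.
    period is the raw text after "=", which may be empty.
    """
    values = {}
    for line in ls_output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            values[name.strip()] = value.strip()
    state = values.get("suspend_notification")
    suspended = None if state is None else state == "enabled"
    return suspended, values.get("period", "")


if __name__ == "__main__":
    print(suspension_state(LS_OUTPUT))
```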

To enable or disable notification suspension, the user must be assigned the maintenance authorization in the Appliance scope.

Notification suspension behaves in the following way:

  • All external notifications are suspended, including the following:

    • Phone Home

    • Emails

    • Any user-configured alert actions, as described in Configuring Alerts in Oracle ZFS Storage Appliance Administration Guide, Release OS8.8.x

  • If you suspend notifications for one node of a cluster, notifications are suspended for both cluster nodes.

  • While notifications are suspended, events continue to be logged and will be sent when event notification is resumed. See Resuming Problem Notification.

  • By default, notifications are suspended for 8 hours (480 minutes).

  • While notifications are suspended, a persistent minor alert is displayed in the Active Problems section of the Maintenance: Problems BUI page, or in the Problems section of the maintenance problems CLI context: "The suspending of notifications has started."
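
When planning a service window, it can help to check whether the work is expected to finish before the default suspension period ends. The following Python sketch does that arithmetic using the 480-minute default stated above. The function names and the example timestamps are hypothetical, and the sketch does not query the appliance.

```python
from datetime import datetime, timedelta

# Default suspension period stated in this section: 8 hours (480 minutes).
DEFAULT_SUSPENSION = timedelta(minutes=480)


def suspension_ends(started_at):
    """Return when the default suspension period would end."""
    return started_at + DEFAULT_SUSPENSION


def fits_in_window(started_at, service_done_at):
    """Return True if service finishes no later than the suspension ends."""
    return service_done_at <= suspension_ends(started_at)


if __name__ == "__main__":
    start = datetime(2022, 9, 16, 9, 0)
    print(suspension_ends(start))
    print(fits_in_window(start, datetime(2022, 9, 16, 16, 30)))
```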

Resuming Problem Notification

While notifications are suspended, events continue to be logged and will be sent when event notification is resumed.

Note:

Before you resume normal problem notification, clear any problem events that should not be sent to Oracle.

Before you resume normal problem notification, the only accumulated events in the Problems BUI page or in the maintenance problems CLI context should be problems that still need to be corrected and that need to be sent to Oracle for further action.

To end notification suspension and resume normal problem notification before the default suspension period ends, do one of the following:

  • BUI – From the Maintenance menu, select Problems, and clear the Suspend Notifications box.

  • CLI – Disable the suspend_notification property in maintenance problems.

To enable or disable notification suspension, the user must be assigned the maintenance authorization in the Appliance scope.