Monitoring the Sun Storage J4000, F5100, and Sun Blade 6000 Array Families

Chapter 4

This chapter describes the monitoring process and how to set up monitoring system wide and on individual arrays. It contains the following sections:

Monitoring Overview

Setting Up Notification for Fault Management

Configuring Array Health Monitoring

Monitoring Alarms and Events

Monitoring Field-Replaceable Units (FRUs)

Viewing Activity on All Arrays

Monitoring Storage Utilization

For more information about the concepts introduced in this chapter, see the appropriate topic in the Online Help.

Monitoring Overview

The Fault Management Service (FMS) is a software component of the Sun StorageTek Common Array Manager that is used to monitor and diagnose the storage systems. The primary monitoring and diagnostic functions of the software are:

Array health monitoring

Event and alarm generation

Notification to configured recipients

Device and device component reporting

An FMS agent, which runs as a background process, monitors all devices managed by the Sun StorageTek Common Array Manager.

The high-level steps of a monitoring cycle are as follows.

1. Verify that the agent is idle.

The system generates instrumentation reports by probing the device for all relevant information, and it saves this information. The system then compares the report data to previous reports and evaluates the differences to determine whether health-related events need to be generated.

Events are also created from problems reported by the array. If the array reports a problem, an alarm is generated directly. When the problem is no longer reported by the array, the alarm is removed unless the specific alarm must be manually cleared.

2. Store instrumentation reports for future comparison.

You can view event logs on the Events page for an array, which you open from the navigation pane in the user interface. The software updates the database with the necessary statistics. Some events are generated only after a certain threshold is attained. For example, an increase of one in the cyclic redundancy check (CRC) error count of a switch port is not sufficient to trigger an event, because a certain threshold is required.

When proxy agents are used, CAM stores all reports related to the arrays attached to the proxy host on the main server. The proxy is simply used as a "pass-through" for the primary instance of CAM.

3. Send the alarms to interested parties.

Alarms are sent only to recipients that have been set up for notification. The types of alarms can be filtered so that only pertinent alarms are sent to each individual.

Note - If they are enabled, the email providers receive notification of all alarms.

Alarms are created when a problem is encountered that requires action. When the root-cause problem of the alarm is corrected, the alarm will either be cleared automatically or you must manually clear the alarm. See the CAM Service Advisor procedures for details.
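The monitoring cycle described above can be illustrated with a minimal Python sketch. It is not the FMS implementation: the Agent class, the report fields (state, crc_errors), and the CRC_THRESHOLD value are hypothetical names and numbers chosen for illustration. The sketch only shows the idea of comparing a new instrumentation report with the stored one and generating an event when a state changes or a counter crosses a threshold.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: a counter must grow by at least this much
# between two reports before an event is generated.
CRC_THRESHOLD = 10


@dataclass
class Agent:
    # Stored instrumentation reports, keyed by device name.
    previous: dict = field(default_factory=dict)

    def run_cycle(self, device, report):
        """Compare a new report with the stored one and return any events."""
        events = []
        old = self.previous.get(device)
        if old is not None:
            # A change of state between two reports generates an event.
            if old["state"] != report["state"]:
                events.append(f"{device}: state {old['state']} -> {report['state']}")
            # A counter generates an event only when the threshold is attained.
            if report["crc_errors"] - old["crc_errors"] >= CRC_THRESHOLD:
                events.append(f"{device}: CRC error threshold reached")
        # Store the report for future comparison.
        self.previous[device] = report
        return events
```

In this sketch the first report for a device only sets the baseline, an increase of one in the counter produces nothing, and a later report with a changed state and a large counter increase produces two events.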

Monitoring Strategy

The following procedure is a typical strategy for monitoring.

1. Monitor the devices.

To get a broad view of the problem, the site administrator or Sun personnel can review reported information in context. This can be done by:

Displaying the device itself

Analyzing the device’s event log

2. Isolate the problem.

For many alarms, information regarding the probable cause and recommended action can be accessed from the alarm view. In most cases, this information enables you to isolate the source of the problem. In cases where the problem is still undetermined, diagnostic tests are necessary.

Once the problem is fixed, in most cases the management software automatically clears the alarm for the device.

About Event Life-Cycles

Most storage network events are based on health transitions. For example, a health transition occurs when the state of a device goes from online to offline. It is the transition from online to offline that generates an event, not the actual offline value. If the state alone were used to generate events, the same events would be generated repeatedly. Transitions cannot be used for monitoring log files, so log events can be repetitive. To minimize this problem, the agent applies predefined thresholds to entries in the log files.

The software includes an event maximums database that keeps track of the number of events generated about the same subject in a single eight-hour time frame. This database prevents the generation of repetitive events. For example, if the port of a switch toggles between offline and online every few minutes, the event maximums database ensures that this toggling is reported only once every eight hours instead of every five minutes.
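The behavior of the event maximums database can be approximated by the following Python sketch. The class name EventMaximums, its methods, and the limit of one event per window are assumptions made for illustration; the actual database schema and limits used by the software are not described in this chapter.

```python
EIGHT_HOURS = 8 * 60 * 60  # window length in seconds


class EventMaximums:
    """Allow at most `limit` events per subject in each eight-hour window."""

    def __init__(self, limit=1, window=EIGHT_HOURS):
        self.limit = limit
        self.window = window
        self._windows = {}  # subject -> (window start time, events reported)

    def should_report(self, subject, now):
        """Return True if an event about `subject` may be reported at time `now`."""
        start, count = self._windows.get(subject, (now, 0))
        if now - start >= self.window:
            # The previous window has expired; start a new one.
            start, count = now, 0
        if count >= self.limit:
            self._windows[subject] = (start, count)
            return False
        self._windows[subject] = (start, count + 1)
        return True
```

With this sketch, a switch port that toggles every five minutes is reported once, then suppressed until eight hours have passed since the first report. Other subjects are tracked independently.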

Event generation usually follows this process:

1. The first time a device is monitored, a discovery event is generated. It is not actionable but is used to set a monitoring baseline. This event describes, in detail, the components of the storage device. Every week after a device is discovered, an audit event is generated with the same content as the discovery event.

2. A log event can be generated when interesting information is found in storage log files. This information is usually associated with storage devices and sent to all users.

3. Events are generated when the software detects a change in the Field Replaceable Unit (FRU) status. The software periodically probes the device and compares the current FRU status to the previously reported FRU status, which is usually only minutes old. ProblemEvent, LogEvent, and ComponentRemovalEvent categories represent most of the events that are generated.

Note - Aggregated events and events that require action by service personnel (known as actionable events) are also referred to as alarms. Some alarms are based on a single state change and others are a summary of events where the event determined to be the root cause is advanced to the head of the queue as an alarm. The supporting events are grouped under the alarm and are referred to as aggregated events.

Setting Up Notification for Fault Management

The fault management features of the Sun StorageTek Common Array Manager software enable you to monitor and diagnose your arrays and storage environment. Alarm notification can be provided by:

Email notification

Simple Network Management Protocol (SNMP) traps

You can also set up Sun Service notification by enabling Auto Service Request as described in Setting Up Auto Service Request.

1. In the navigation pane, under General Configuration, choose Notification.

The following Notification Setup page is displayed.

Screen capture of the Notification Setup page.

2. Enable local email.

a. Enter the name of the SMTP server.

If the host running this software has the sendmail daemon running, you can accept the default server, localhost, or enter the name of this host in the required field.

b. Specify the other optional parameters, as desired.

c. If you have changed or entered any parameters, click Save.

d. (Optional) Click Test Local Email to test your local email setup by sending a test email.

If you need help on any of the fields, click the Help button.

3. (Optional) Set up remote notifications by SNMP traps to an enterprise management application.

a. Select SNMP as the provider.

b. Click Save.

4. Set up local email notification recipients.

a. Click Administration > Notification > Email.

The following Email Notification page is displayed.

Screen capture of the Email Notification page where you specify SMTP servers and recipients of email notification.

b. Click New.

The following Add Email Notification page is displayed.

Screen capture of the Add Email Notification page.

c. Enter an email address for local notification. At least one address is required to begin monitoring events. You can customize emails to specific severity, event type, or product type.

d. Click Save.

5. (Optional) Set up email filters to prevent email notification about specific events that occur frequently. You can still view filtered events in the event log.

a. Click Administration > Notification > Email Filters.

The following Email Filters page is displayed.

Screen capture of the Email Filters page.

b. Click Add New Filter.

The following Add Filter page is displayed.

Screen capture of the Add Filter page.

c. Enter the event code that you want to filter. Email notification is then suppressed for events with that event code. You can obtain the event code from the Event Details page of the event you want to filter.

d. Click Save.

6. (Optional) Set up SNMP trap recipients.

a. Click Administration > Notification > SNMP.

The following SNMP Notification page is displayed.

Screen capture showing the SNMP Notification page.

See SNMP Trap MIB for MIB definitions.

b. Click New.

The following Add SNMP Notification page is displayed.

Screen capture showing SNMP properties.

c. Enter the properties of the SNMP trap recipient: the IP address, the port used to send SNMP notifications, and, optionally, the minimum alarm level. These fields are described in Step 7.

d. Click Save.

7. (Optional) Set up remote notifications by SNMP traps to an enterprise management application.

a. Click Administration > Notification > SNMP.

The SNMP Notification page is displayed.

b. Click New.

The Add SNMP Notification page is displayed.

c. Enter the following information:

IP address of the SNMP recipient

The port used to send SNMP notifications.

(Optional) From the drop-down menu, select the minimum alarm level for which SNMP notifications are to be sent to the new SNMP recipient.

(Optional) Specify whether you want to send configuration change events.

d. Click Save.

8. Perform optional fault management setup tasks:

Confirm administration information.

Add and activate agents.

Specify system timeout settings.
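The effect of the email filters set up in this procedure can be pictured with a short Python sketch. It is only an illustration: the Event class, the dispatch function, and the event codes used in the usage note are hypothetical and do not correspond to real Common Array Manager codes or APIs. The point it shows is that a filtered event code suppresses email notification while the event itself still reaches the event log.

```python
from dataclasses import dataclass


@dataclass
class Event:
    code: str         # event code, as shown on the Event Details page
    severity: str
    description: str


def dispatch(event, recipients, filtered_codes, event_log):
    """Log every event; email it only if its code is not filtered."""
    # Filtered events remain visible in the event log.
    event_log.append(event)
    if event.code in filtered_codes:
        # Suppressed: no email is sent for this event code.
        return []
    # One (address, message) pair per configured recipient.
    return [(address, event.description) for address in recipients]
```

For example, with the made-up code "00.00.01" in the filter set, an event carrying that code is appended to the log and produces no messages, while an event with any other code produces one message per recipient.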

Configuring Array Health Monitoring

To enable array health monitoring, you must configure the Fault Management Service (FMS) agent, which probes devices. Events are generated with content, such as probable cause and recommended action, to help facilitate isolation to a single field-replaceable unit (FRU).

You must also enable array health monitoring for each array you want monitored.

Configuring the FMS Agent

1. In the navigation pane, expand General Configuration.

The navigation tree is expanded.

2. Choose General Health Monitoring.

The following General Health Monitoring Setup page is displayed.

Screen capture showing the General Health Monitoring Setup page.

3. Select the types of arrays that you want to monitor from the Categories to Monitor field. Use the shift key to select more than one array type.

4. Specify how often you want to monitor the arrays by selecting a value in the Monitoring Frequency field.

5. Specify the maximum number of arrays to monitor concurrently by selecting a value in the Maximum Monitoring Thread field.

6. In the Timeout Setting section, set the agent timeout settings.

The default timeout settings are appropriate for most storage area network (SAN) devices. However, network latencies, I/O loads, and other device and network characteristics may require that you customize these settings to meet your configuration requirements. Click in the value field for the parameter and enter the new value.

7. When all required changes are complete, click Save.

The configuration is saved.
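The Maximum Monitoring Thread setting limits how many arrays are probed at the same time. The following Python sketch shows one common way to express such a limit, using a thread pool. It assumes nothing about how the FMS agent is actually implemented; the probe function and the array names in the usage note are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def probe(array_name):
    """Stand-in for probing one array; returns a placeholder health report."""
    return {"array": array_name, "health": "OK"}


def monitor_all(arrays, max_threads):
    """Probe every array, running at most `max_threads` probes at once."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns the reports in the same order as the input list.
        return list(pool.map(probe, arrays))
```

Calling monitor_all(["j4200-a", "j4400-b", "f5100-c"], 2) probes three arrays with no more than two probes in progress at any moment. A higher limit shortens a monitoring pass at the cost of more concurrent load on the management host.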

Enabling Health Monitoring for an Array

1. In the navigation pane, select an array for which you want to display or edit the health monitoring status.

2. Click Array Health Monitoring.

The following Array Health Monitoring Setup page is displayed.

Screen capture showing the health monitoring status.

3. For the array to be monitored, ensure that the monitoring agent is active and that the Device Category Monitored is set to Yes. If not, go to Configuring Array Health Monitoring.

4. Select the checkbox next to Health Monitoring to enable health monitoring for this array; deselect the checkbox to disable health monitoring for the array.

5. Click Save.

Monitoring Alarms and Events

Events are generated to signify a health transition in a monitored device or device component. Events that require action are classified as alarms.

There are four event severity levels:

Down - Identifies a device or component as not functioning and in need of immediate service

Critical - Identifies a device or component in which a significant error condition is detected that requires immediate service

Major - Identifies a device or component in which a major error condition is detected and service may be required

Minor - Identifies a device or component in which a minor error condition is detected or an event of significance is detected

You can display alarms for all arrays listed or for an individual array. Events are listed for each array only.
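When a notification recipient is configured with a minimum alarm level, the severity levels have to be compared with one another. The following Python sketch shows one way to model that comparison. The numeric ranking is an assumption based on the order in which the levels are listed above (Down as the most severe, Minor as the least); the names Severity and meets_minimum are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering: a higher value means a more severe condition.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3
    DOWN = 4


def meets_minimum(severity, minimum):
    """Return True if an alarm is at or above a recipient's minimum level."""
    return severity >= minimum
```

Under this assumption, a recipient whose minimum level is Major would receive Major, Critical, and Down alarms, but not Minor ones.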

Displaying Alarm Information

1. To display alarms for all registered arrays, in the navigation pane, choose Alarms.

The following Alarm Summary page for all arrays is displayed.

Screen capture of the Alarms page.

2. To display alarms that apply to an individual array, in the navigation pane select the array whose alarms you want to view and choose Alarms below it.

The following Alarm Summary page for that array is displayed.

Screen capture showing an example Alarms Summary page.

3. To view detailed information about an alarm, in the Alarm Summary page, click Details for the alarm.

The following Alarm Details page is displayed.

Screen capture showing the Alarm Details page.

4. To view a list of events associated with an alarm, from the Alarm Details page, click Aggregated Events.

The following Aggregated Events page is displayed.

Note - The aggregation of events associated with an alarm can vary based on the time that an individual host probes the device. When not aggregated, the list of events is consistent across all hosts.

Screen capture showing the Aggregated Events page.

Managing Alarms

An alarm that has the Auto Clear function set is automatically deleted from the Alarms page when the underlying fault has been addressed and corrected. To determine whether an alarm will be automatically deleted when it has been resolved, view the Alarm Summary page and examine the Auto Clear column. If the Auto Clear column is set to Yes, the alarm is automatically deleted when the fault has been corrected.

If the Auto Clear column is set to No, the alarm is not automatically deleted when the fault has been resolved. You must manually delete it from the Alarms page after the service operation has been completed.
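The Auto Clear rule can be summarized in a few lines of Python. This is a sketch of the rule as stated above, not the software's own logic; the Alarm class and the remaining_alarms function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    alarm_id: int
    auto_clear: bool       # value of the Auto Clear column
    resolved: bool = False  # True once the underlying fault is corrected


def remaining_alarms(alarms):
    """Return the alarms still listed after resolved faults are processed."""
    # Only alarms that are both Auto Clear and resolved disappear on their own.
    return [a for a in alarms if not (a.auto_clear and a.resolved)]
```

A resolved alarm with Auto Clear set to No stays in the list until it is deleted manually, and an unresolved alarm stays regardless of its Auto Clear value.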

Acknowledging Alarms

When an alarm is generated, it remains open in the Alarm Summary page until you acknowledge it. Acknowledging an alarm is an optional feature that provides a way for administrators to indicate that an alarm has been seen and evaluated; it does not affect if or when an alarm will be cleared.

Acknowledging One or More Alarms

1. Display the Alarm Summary page by doing one of the following in the navigation pane:

To see the Alarm Summary page for all arrays, choose Alarms.

To see alarms for a particular array, expand that array and choose Alarms below it.

2. Select the check box for each alarm you want to acknowledge, and click Acknowledge.

The following Acknowledge Alarms confirmation window is displayed.

3. Enter an identifying name to be associated with this action, and click Acknowledge.

The Alarm Summary page is redisplayed, and the state of the acknowledged alarms is displayed as Acknowledged.

Note - You can also acknowledge an alarm from the Alarm Details page. You can also reopen acknowledged alarms from the Alarm Summary and Alarm Details pages.

Deleting Alarms

When you delete an open or acknowledged alarm, it is permanently removed from the Alarm Summary page.

Note - You cannot delete alarms which are designated as Auto Clear alarms. These alarms are removed from the Alarm Summary page either when the array is removed from the list of managed arrays or when the condition related to the problem is resolved.

Deleting One or More Alarms

1. In the navigation pane, display the Alarm Summary page for all registered arrays or for one particular array:

To see the Alarm Summary page for all arrays, choose Alarms.

To see alarms for a particular array, select that array and choose Alarms below it.

The Alarm Summary page displays a list of alarms.

2. Select the check box for each acknowledged alarm you want to delete, and click Delete.

The Delete Alarms confirmation window is displayed.

3. Click OK.

The Alarm Summary page is redisplayed without the deleted alarms.

Displaying Event Information

To gather additional information about an alarm, you can display the event log to view the underlying events on which the alarm is based.

Note - The event log is a historical representation of events in an array. In some cases the event log may differ when viewed from multiple hosts since the agents run at different times on separate hosts. This has no impact on fault isolation.

Displaying Information About Events

1. In the navigation pane select the array for which you want to view the event log and choose Events.

The following Events page is displayed.

Screen capture showing an example Events page.

2. To see detailed information about an event, click Details in the row that corresponds to the event.

The Event Details page is displayed for the selected event.

Screen capture showing the Event Details page.

Monitoring Field-Replaceable Units (FRUs)

The Common Array Manager software enables you to view a listing of the FRU components in the array, and to get detailed information about the health of each type of FRU. For a listing of the FRU components in your system, go to the FRU Summary page.

Note - All FRUs in the J4000 Array Family are also Customer Replaceable Units (CRUs).

For detailed information about each FRU type, refer to the hardware documentation for your array.

Viewing the Listing of FRUs in the Array

1. In the navigation pane, select the array whose FRUs you want to list and click FRUs.

The FRU Summary page is displayed. It lists the FRU types available and provides basic information about the FRUs. The types of FRU components available depend on the model of your array.

The following figure shows the FRU Summary page for the Sun Storage J4200 array.

Screen capture showing the FRU Summary page.

2. To view the list of FRU components of a particular type, click the name of the FRU in the FRU Type column.

The Component Summary page displays the list of FRUs available, along with basic information about each FRU component.

Screen capture showing the Component Summary page.

3. To view detailed health information about a particular FRU component, click on the component name.

Depending on the FRU type of the selected component, one of the following pages will display:

Disk Health Details Page

Fan Health Details Page

Power Supply Health Details Page

SIM Health Details Page for J4200/J4400 Arrays

Disk Health Details Page

The disk drives are used to store data. For detailed information about the disk drives and their components, refer to the hardware documentation for your array.

The following figure shows the Disk Health Detail page.

Screen capture showing the Disk Health Details page.

Note - See the Online Help for a complete description of health details for all arrays.

Note - The disk health details vary for each array and disk type.

Fan Health Details Page

The fans in the Sun Storage J4000 Array Family circulate air inside the tray. Some array models, such as the J4200 array, contain two hot-swappable fans to provide redundant cooling. Other array models, such as the J4400, include fans in the power supplies. For detailed information, consult the hardware installation guide for your array.

The following figure shows the Fan Health Detail page.

Screen capture showing the Fan Health Details page.

NEM Health Details Page

The Sun Blade 6000 Multi-Fabric Network Express Module (NEM) connects server blades to disks through the use of a SAS expander. For detailed information about the NEM and its components, refer to the hardware documentation for your array.

Power Supply Health Details Page

Each tray in an array has hot-swappable, redundant power supplies. If one power supply is turned off or malfunctions, the other power supply maintains electrical power to the array.

The following figure shows the Power Supply Health Detail page.

Screen capture showing the Power Supply Health details page.

SIM Health Details Page for J4200/J4400 Arrays

The SAS Interface Module (SIM) is a hot-swappable board that contains two SAS outbound connectors, one SAS inbound connector, and one serial management port. The serial management port is reserved for Sun Service personnel only.

The following figure shows the SIM Health Detail page.

Screen capture showing the SIM Health Details page.

Storage Module Health Details Page for the B6000 Array

The storage module is available as part of the Sun Storage B6000 array. For information about the storage module, refer to the hardware documentation for your array.

Note - See the Online Help for a complete description of health details for all arrays.

The system controller is available as part of the Sun Storage J4500 array. The system controller is a hot-swappable board that contains four LSI SAS x36 expanders. These expanders provide a redundant set of independent SAS fabrics (two expanders per fabric), enabling two paths to the array’s disk drives. The serial management port is reserved for Sun Service personnel only.

For more information about the system controller, refer to the hardware documentation for your array.

The following figure shows the Component Summary for the System Controller page.

Screen capture showing the System Controller component summary.

Viewing Activity on All Arrays

The activity log lists user-initiated actions performed for all registered arrays, in chronological order. These actions may have been initiated through either the Sun StorageTek Common Array Manager or the command-line interface (CLI).

Viewing the Activity Log

1. In the navigation pane, click General Configuration > Activity Log.

The Activity Log Summary page is displayed.

Screen capture showing the Activity Log Summary page.

Monitoring Storage Utilization

Common Array Manager graphically provides a summary of the total storage capacity of an array and the number of disk drives that provide that storage.

Screen capture showing the Storage Utilization page.