5 Platform and Application Alarms

This chapter provides recovery procedures for platform and application alarms related to the E5-APP-B.

Alarm Categories

This chapter describes recovery procedures to use when an alarm condition or other problem exists on the MPS system. For information about how and when alarm conditions are detected and reported, see Detecting and Reporting Problems.

When an alarm code is reported, locate the alarm in Table 5-1. The procedures for correcting alarm conditions are described in MPS Alarm Recovery Procedures.

Note:

Sometimes the alarm string may consist of multiple alarms and must be decoded in order to use the Alarm Recovery Procedures in this manual. If the alarm code is not listed, see Decode Alarm Strings.

Platform and application errors are grouped by category and severity. The categories are listed from most to least severe:
  • Critical Platform Alarms

  • Critical Application Alarms

  • Major Platform Alarms

  • Major Application Alarms

  • Minor Platform Alarms

  • Minor Application Alarms

Table 5-1 shows the alarm numbers and alarm text for all alarms generated by the MPS platform and the ELAP application. The order within a category is not significant. Some of the alarms described in this chapter are not available with specific configurations.

Table 5-1 Platform and Application Alarms

Alarm Codes and Error Descriptor UAM Number
Critical Platform Alarms (There are no critical EPAP Platform Alarms)
1000000000002000 - Uncorrectable ECC Memory Error 0370
1000000000008000 – Server NTP Daemon lost NTP synchronization for extended time 0370
1000000000010000 – Server's time has gone backwards 0370
Critical Application Alarms (There are no critical EPAP Application Alarms)
2000000000000001 - LSMS DB Maintenance Required 0371
Major Platform Alarms
3000000000000001 – Server fan failure 0372
3000000000000002 - Server Internal Disk Error 0372
3000000000000008 - Server Platform Error 0372
3000000000000010 - Server File System Error 0372
3000000000000020 - Server Platform Process Error 0372
3000000000000080 - Server Swap Space Shortage Failure 0372
3000000000000100 - Server provisioning network error 0372
3000000000000200 – Server Eagle Network A error 0372
3000000000000400 – Server Eagle Network B error 0372
3000000000000800 – Server Sync network error 0372
3000000000001000 - Server Disk Space Shortage Error 0372
3000000000002000 - Server Default Route Network Error 0372
3000000000004000 - Server Temperature Error 0372
3000000000008000 - Server Mainboard Voltage Error 0372
3000000000010000 - Server Power Feed Error 0372
3000000000020000 - Server Disk Health Test Error 0372
3000000000040000 - Server Disk Unavailable Error 0372
3000000000080000 - Device Error 0372
3000000000100000 - Device Interface Error 0372
3000000000200000 - Correctable ECC Memory Error 0372
3000000400000000 - Multipath device access link problem 0372
3000000800000000 – Switch Link Down Error 0372
3000001000000000 - Half-open Socket Limit 0372
3000002000000000 - Flash Program Failure 0372
3000004000000000 - Serial Mezzanine Unseated 0372
3000000008000000 - Server HA Keepalive Error 0372
3000000010000000 - DRBD block device can not be mounted 0372
3000000020000000 - DRBD block device is not being replicated to peer 0372
3000000040000000 - DRBD peer needs intervention 0372
3000020000000000 - Server NTP Daemon never synchronized 0372
Major Application Alarms
4000000000000001 - Mate ELAP Unavailable 0373
4000000000000004 - Congestion 0373
4000000000000008 - File System Full 0373
4000000000000010 - Log Failure 0373
4000000000000040 - Fatal Software Error 0373
4000000000000080 - RTDB Corrupt 0373
4000000000000100 - RTDB Inconsistent 0373
4000000000000200 - RTDB Incoherent 0373
4000000000000800 - Transaction Log Full 0373
4000000000001000 - RTDB 100% Full 0373
4000000000002000 - RTDB Resynchronization In Progress 0373
4000000000004000 - RTDB Reload Is Required 0373
4000000000400000 - LVM Snapshot detected that is too old 0373
4000000000800000 - LVM Snapshot detected that is too full 0373
4000000001000000 - LVM Snapshot detected with invalid attributes 0373
4000000002000000 - DRBD Split Brain 0373
4000000010000000 - An instance of Snapmon already running 0373
Minor Platform Alarms
1000000000000001 – Breaker panel feed unavailable 0374
5000000000000001 - Server Disk Space Shortage Warning 0374
5000000000000002 - Server Application Process Error 0374
5000000000000004 - Server Hardware Configuration Error 0374
5000000000000020 - Server Swap Space Shortage Warning 0374
5000000000000040 - Server Default Router Not Defined 0374
5000000000000080 – Server temperature warning 0374
5000000000000100 - Server Core File Detected 0374
5000000000000200 - Server NTP Daemon Not Synchronized 0374
5000000000000800 - Server Disk Self Test Warning 0374
5000000000001000 - Device Warning 0374
5000000000002000 - Device Interface Warning 0374
5000000000004000 - Server Reboot Watchdog Initiated 0374
5000000000008000 - Server HA Failover Inhibited 0374
5000000000010000 - Server HA Active To Standby Transition 0374
5000000000020000 - Server HA Standby To Active Transition 0374
5000000000080000 - NTP Offset Check Failure 0374
5000000000100000 - NTP Stratum Check Failure 0374
5000000020000000 – Server Kernel Dump File Detected 0374
5000000040000000 – TPD Upgrade Failed 0374
5000000080000000 – Half Open Socket Warning Limit 0374
5000000000800000 - DRBD failover busy 0374
5000000400000000 – NTP Source Server is not able to provide correct time 0374
Minor Application Alarms
4000000000020000 - Automatic RTDB Backup is not configured 0375
6000000000000010 - Minor Software Error 0375
6000000000000200 - RTDB Backup Failed 0375
6000000000000400 - Automatic RTDB Backup Failed 0375
6000000000000800 - Automatic Backup cron entry modified 0375
6000000000002000 - Configurable Quantity Threshold Exceeded 0375
6000000000020000 - Automatic RTDB Backup is not configured 0375
NOTE: The order within a category is not significant.
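
The codes in Table 5-1 suggest a regular layout: the leading hexadecimal digit identifies the category (1 through 6), and the remaining 15 digits carry one bit per alarm, so several alarms of one category can be summed into a single alarm string. The following Python sketch splits such a string into individual codes under that assumption. The category mapping and the function name `decode_alarm_string` are illustrative and are not part of the MPS software; a few table entries do not follow the pattern, so the Decode Alarm Strings procedure and Table 5-1 remain the authoritative references.

```python
# Assumed mapping from the leading hex digit to the alarm category,
# inferred from the code ranges in Table 5-1 (not taken from MPS software).
CATEGORIES = {
    "1": "Critical Platform",
    "2": "Critical Application",
    "3": "Major Platform",
    "4": "Major Application",
    "5": "Minor Platform",
    "6": "Minor Application",
}


def decode_alarm_string(alarm_string):
    """Split a 16-digit hex alarm string into (category, [single-alarm codes])."""
    text = alarm_string.strip().upper()
    if len(text) != 16:
        raise ValueError("expected 16 hexadecimal digits")
    category = CATEGORIES.get(text[0])
    if category is None:
        raise ValueError("unknown category digit: " + text[0])
    bits = int(text[1:], 16)  # raises ValueError on non-hex input
    codes = []
    position = 0
    while bits:
        if bits & 1:
            # Rebuild the single-alarm code: category digit + one set bit.
            codes.append("%s%015X" % (text[0], 1 << position))
        bits >>= 1
        position += 1
    return category, codes


if __name__ == "__main__":
    # 0x200 + 0x400 + 0x800 = 0xE00: three network alarms reported together.
    category, codes = decode_alarm_string("3000000000000E00")
    print(category)
    for code in codes:
        print(code)
```

Each code returned by the sketch can then be looked up individually in Table 5-1.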

MPS Alarm Recovery Procedures

This section provides recovery procedures for the MPS and ELAP alarms, listed by alarm category and Alarm Code (alarm data string) within each category.

Critical Platform Alarms

Critical platform alarms are issued if uncorrectable memory problems are detected.

1000000000002000 - Uncorrectable ECC Memory Error

This alarm indicates that the chipset has detected an uncorrectable (multiple-bit) memory error that the Error-Correcting Code (ECC) circuitry in the memory is unable to correct.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 to request hardware replacement.

1000000000008000 – Server NTP Daemon lost NTP synchronization for extended time

Alarm Type: TPD

Description: This alarm indicates that a TPD syscheck test determined that the time last synchronized with an NTP server has exceeded the critical threshold (LAST_SYNCHRONIZED_TIME_PERIOD_CRITICAL), as configured by the application.

Severity: Critical

Alarm ID: TKSPLATCR16

Recovery

Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

1000000000010000 – Server's time has gone backwards

Alarm Type: TPD

Description: This alarm indicates that syscheck determined that a server's time has gone backwards.

Severity: Critical

Alarm ID: TKSPLATCR17

Recovery

Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

Critical Application Alarms

This section describes the critical application alarms.

2000000000000001 - LSMS DB Maintenance Required

This alarm indicates that database maintenance is required.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

Major Platform Alarms

Major platform alarms involve hardware components, memory, and network connections.

3000000000000001 – Server fan failure

Alarm Type: TPD

Description: This alarm indicates that a fan on the application server is either failing or has failed completely. In either case, there is a danger of component failure due to overheating.

Description: This alarm indicates that a fan in the EAGLE fan tray in the EAGLE shelf where the E5-APP-B is "jacked in" is either failing or has failed completely. In either case, there is a danger of component failure due to overheating.

Severity: Major

OID: TpdFanErrorNotify 1.3.6.1.4.1.323.5.3.18.3.1.2.1

Alarm ID: TKSPLATMA1 3000000000000001

Recovery

  1. Run syscheck in Verbose mode to verify a fan failure using the following command:
    [admusr@hostname1351690497 ~]$ sudo syscheck -v hardware fan
    Running modules in class hardware...
             fan: Checking Status of Server Fans.
    *         fan: FAILURE:: MAJOR::3000000000000001 -- Server Fan Failure. This test uses the leaky bucket algorithm.
    *         fan: FAILURE:: Fan RPM is too low, fana: 0, CHIP: FAN
    One or more module in class "hardware" FAILED
    
    LOG LOCATION: /var/TKLC/log/syscheck/fail_log
    
  2. Refer to the procedure for determining the location of the fan assembly that contains the failed fan and replacing a fan assembly in the appropriate hardware manual. After you have opened the front lid to access the fan assemblies, determine whether any objects are interfering with the fan rotation. If some object is interfering with fan rotation, remove the object.
  3. Run "syscheck -v hardware fan" (see Running syscheck Through the ELAP GUI).
    • If the alarm has been cleared (as shown below), the problem is resolved.
    [admusr@hostname1351691862 ~]$ sudo syscheck -v hardware fan
    Running modules in class hardware...
    Discarding cache...
             fan: Checking Status of Server Fans.
             fan: Fan is OK. fana: 1, CHIP: FAN
             fan: Server Fan Status OK.
                                     OK
    
    • If the alarm has not been cleared (as shown below), continue with the next step.
    [admusr@hostname1351690497 ~]$ sudo syscheck -v hardware fan
    Running modules in class hardware...
             fan: Checking Status of Server Fans.
    *         fan: FAILURE:: MAJOR::3000000000000001 -- Server Fan Failure. This test uses the leaky bucket algorithm.
    *         fan: FAILURE:: Fan RPM is too low, fana: 0, CHIP: FAN
    One or more module in class "hardware" FAILED
    
    LOG LOCATION: /var/TKLC/log/syscheck/fail_log
    
  4. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
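
When syscheck output is captured to a file or collected from several servers, the alarm lines can be picked out programmatically. The following Python sketch extracts the module, severity, alarm code, and alarm text from lines shaped like the failure line in the sample output above. The regular expression is modeled only on that sample; the severity keywords CRITICAL and MINOR are assumed by analogy with MAJOR, and the name `find_alarms` is illustrative.

```python
import re

# Matches lines such as:
#   *  fan: FAILURE:: MAJOR::3000000000000001 -- Server Fan Failure. ...
# The layout is taken from the sample output in this procedure; other
# syscheck versions may format failures differently.
FAILURE_RE = re.compile(
    r"^\*?\s*(?P<module>\w+):\s+FAILURE::\s*"
    r"(?P<severity>CRITICAL|MAJOR|MINOR)::(?P<code>[0-9A-Fa-f]{16})"
    r"\s+--\s+(?P<text>.*)$"
)


def find_alarms(output):
    """Return (module, severity, code, text) for every alarm line in the output."""
    alarms = []
    for line in output.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            alarms.append((
                match.group("module"),
                match.group("severity"),
                match.group("code"),
                match.group("text").strip(),
            ))
    return alarms


SAMPLE = """\
Running modules in class hardware...
         fan: Checking Status of Server Fans.
*         fan: FAILURE:: MAJOR::3000000000000001 -- Server Fan Failure. This test uses the leaky bucket algorithm.
*         fan: FAILURE:: Fan RPM is too low, fana: 0, CHIP: FAN
One or more module in class "hardware" FAILED
"""

if __name__ == "__main__":
    for module, severity, code, text in find_alarms(SAMPLE):
        print(module, severity, code, text)
```

Detail lines that report a failure without an alarm code, such as the fan RPM line, are deliberately skipped.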

3000000000000002 - Server Internal Disk Error

This alarm indicates that the server is experiencing issues replicating data to one or more of its mirrored disk drives. This could indicate that one of the server disks has failed or is approaching failure.

Recovery

  1. Run syscheck in Verbose mode.
  2. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 and provide the system health check output.
  3. Run syscheck in verbose mode.
  4. The syscheck output will indicate which of the possible failures have been detected.
    Depending on which failure was detected, go to the step shown as follows:
    1. If both the md status check and the md configuration check failed, go to platform-and-application-alarms.html.
    2. If only the md status check failed, go to platform-and-application-alarms.html.
    3. If only the md configuration check failed, go to platform-and-application-alarms.html.
  5. Check to see if an md resync is in progress on the server:
    1. Log in as root to the server that is generating the alarm.
      
      Login:  root
      Password: <Enter root password>
      
    2. Examine the file /proc/mdstat using the command:
      # more /proc/mdstat
      If any lines are printed that indicate an md group is resyncing, wait up to two hours to allow the system to finish resyncing.
      If the resyncing has not completed in this time, proceed to platform-and-application-alarms.html. When an md group is in the process of resyncing, its entry in /proc/mdstat will look similar to the following:
      
      md2 : active raid1 hdc2[2] hda2[0]
            538112 blocks [2/1] [U_]
            [>....................]  recovery =  3.6% (20088/538112)
      finish=0.4min speed=20088K/sec
      
    3. After waiting the required amount of time, rerun syscheck.
      If the alarm has been cleared, the problem is solved. If the alarm has not been cleared, go to the next step.
  6. If the syscheck output contains the text:
    
    md status check failed
    
    or
    
    Both the md status check and the md configuration check failed
    
    1. Log in as root to the server that is generating the alarm.
      
      Login:  root
      Password: <Enter root password>
      
    2. Check the status of the md mirrors: # syscheck -v disk meta
    3. For each md group that is detected to have problems, an entry will appear in the output similar to the following:
      
      meta: Number of active devices for md2 does not match expected.
      meta: md2 is reporting faulty status, "_U".
      meta: "md2" is in error state ->
      meta: md2 : active raid1 hdc2[1]
      meta: 538112 blocks [2/1] [_U]
      
      This indicates which md group has a problem. In the example above, md group #2 is reporting failure and the information reported indicates that the group is still active on device hdc2, but 2 devices were expected and only 1 is functional. NOTE: The syscheck output shown only lists the devices in the md group that are currently functional and doesn’t explicitly indicate that a particular device has failed. Proceed to the next sub-step to determine which of the server’s disks has issues.
    4. The server is always configured with each disk drive mastering one of the system’s IDE channels.
      This means the logical device names for the disk drives will always be consistent (see Table 5-2). The disk connected to the primary IDE channel will always be /dev/hda and the drive on the secondary channel is always /dev/hdc. Referring back to the syscheck output shown in the previous sub-step, the device hdc2 is reported to be still active, so the problem is with the /dev/hda (Primary Master) device. Now that the device name is known, proceed to the next sub-step to attempt to correct this issue.

      Table 5-2 Logical Disk Name Matrix

      Physical Disk       Logical Name
      Primary Master      /dev/hda
      Secondary Master    /dev/hdc

    5. To determine whether this disk requires replacing, it is very important to keep a record of every reported occurrence of the 3000000000000002 - Server Internal Disk Error alarm.
      With this in mind, create a record of this incident, making note of the date and time, the hostname of the server, and the device name determined in the previous sub-step. If this is the first reported problem with this particular disk drive, execute "Procedure 6-2: Creating or Repairing Mirrors on a Disk Drive" to attempt to repair the disk mirroring on the drive. If any previous occurrences of the 3000000000000002 - Server Internal Disk Error alarm have been recorded for this disk, the drive must be replaced. Follow "Procedure 6-1: Replacing a Faulty Disk Drive with a New Disk Drive" to replace the disk.
  7. If the output from syscheck contains the text:
    
    md configuration check failed
    
    1. Log in as root to the server that is generating the alarm.
      
      Login:  root
      Password: <Enter root password>
      
    2. Check validity of the /etc/raidtab file.
      # more /etc/raidtab
      Scan through the output to ensure that there are no duplicate entries and that each of the md groups active on the server is listed. (To get an idea which md groups are currently active on the server look at the file /proc/mdstat.) Each md group active on the server should have one and only one entry in the /etc/raidtab file that looks similar to:
      
      raiddev             /dev/md2
      raid-level                  1
      nr-raid-disks               2
      chunk-size                  64k
      persistent-superblock       1
      nr-spare-disks              0
          device          /dev/hda2
          raid-disk     0
          device          /dev/hdc2
          raid-disk     1
      
      If any discrepancies are found between /etc/raidtab and the /proc/mdstat file, make note of them and proceed to platform-and-application-alarms.html.
  8. Run savelogs to gather all application logs (see Saving Logs Using the ELAP GUI).
  9. Run savelogs_plat to gather system information for further troubleshooting (see Saving Logs Using the ELAP GUI), and contact Tekelec Platform Engineering.
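
The /proc/mdstat checks in this procedure (whether a resync is in progress, and which md group is missing a device) can be sketched in Python as follows. The parsing is based only on the two excerpts shown above, so real /proc/mdstat files may contain states or markers this sketch does not handle. The names `parse_mdstat` and `degraded_groups` are illustrative and are not part of syscheck.

```python
import re

# Patterns modeled on the /proc/mdstat excerpts shown in this procedure.
HEADER_RE = re.compile(r"^(md\d+)\s*:\s*(\w+)\s+(\w+)\s+(.*)$")
COUNT_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]")
PROGRESS_RE = re.compile(r"(?:recovery|resync)\s*=\s*([\d.]+)%")


def parse_mdstat(text):
    """Return {md group name: details} for each md group found in the text."""
    groups = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.match(line)
        if header:
            current = {
                # Device names listed on the header line, e.g. hdc2[2] -> hdc2.
                "members": re.findall(r"(\w+)\[\d+\]", header.group(4)),
                "expected": None,
                "active": None,
                "rebuild_percent": None,
            }
            groups[header.group(1)] = current
            continue
        if current is None:
            continue
        count = COUNT_RE.search(line)
        if count:
            # [2/1] means 2 devices expected, 1 currently active.
            current["expected"] = int(count.group(1))
            current["active"] = int(count.group(2))
        progress = PROGRESS_RE.search(line)
        if progress:
            current["rebuild_percent"] = float(progress.group(1))
    return groups


def degraded_groups(groups):
    """Names of md groups that are missing a device and are not rebuilding."""
    return sorted(
        name for name, info in groups.items()
        if info["active"] is not None
        and info["active"] < info["expected"]
        and info["rebuild_percent"] is None
    )


RESYNCING = """\
md2 : active raid1 hdc2[2] hda2[0]
      538112 blocks [2/1] [U_]
      [>....................]  recovery =  3.6% (20088/538112)
finish=0.4min speed=20088K/sec
"""

DEGRADED = """\
md2 : active raid1 hdc2[1]
      538112 blocks [2/1] [_U]
"""

if __name__ == "__main__":
    print(parse_mdstat(RESYNCING)["md2"])
    print(degraded_groups(parse_mdstat(DEGRADED)))
```

A group that reports rebuild progress is treated as resyncing and is left out of the degraded list, which mirrors the instruction to wait before acting on a resync in progress.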

3000000000000008 - Server Platform Error

This alarm indicates a major platform error such as a corrupt system configuration or missing files, or indicates that syscheck itself is corrupt.

Recovery

  1. Run syscheck in Verbose mode.
  2. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 and provide the system health check output.
  3. Log in as root to the server that is generating the alarm.
    
    Login:  root
    Password: <Enter root password>
    
  4. Reconfigure syscheck: # syscheck -reconfig
    By running this command syscheck will rewrite its configuration files.
  5. Exit from root shell: # exit
  6. Run syscheck.
    1. If the alarm has been cleared, the problem is resolved.
    2. If the alarm has not been cleared, continue with the next step.
  7. Run savelogs to gather all application logs (see Saving Logs Using the ELAP GUI).
  8. Run savelogs_plat to gather system information for further troubleshooting (see Saving Logs Using the ELAP GUI), and contact Tekelec Platform Engineering.

3000000000000010 - Server File System Error

This alarm indicates that syscheck was unsuccessful in writing to at least one of the server file systems.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000000000020 - Server Platform Process Error

This alarm indicates that either the minimum number of instances for a required process is not currently running or too many instances of a required process are running.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for recovery procedures.

3000000000000080 - Server Swap Space Shortage Failure

This alarm indicates that the server’s swap space is in danger of being depleted. This is usually caused by a process that has allocated a very large amount of memory over time.

Note:

In order for this alarm to clear, the underlying failure condition must be consistently undetected for a number of polling intervals. Therefore, the alarm may continue to be reported for several minutes after corrective actions are completed.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000000000100 - Server provisioning network error

Alarm Type: TPD

Description: This alarm indicates that the connection between the server’s eth01 Ethernet interface and the customer network is not functioning properly. The eth01 interface is at the upper right port on the rear of the server on the EAGLE backplane.

Note:

The interface identified as eth01 on the hardware is identified as eth91 by the software (in syscheck output, for example).

Severity: Major

OID: TpdProvNetworkErrorNotify 1.3.6.1.4.1.323.5.3.18.3.1.2.9

Alarm ID: TKSPLATMA9 3000000000000100

Recovery

  1. Perform the following substeps to verify that the network configuration is correct.
    1. Log in as elapconfig on the E5-APP-B server.

      Enter option 1, Display Configuration, from the ELAP Configuration Menu.

      
       /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice:  1
      
      Output similar to the following is displayed. The network configuration information related to the provisioning network is highlighted in bold.
      
      MPS Side A:  hostname: bahamas-a  hostid: a8c0ca3d
                   Platform Version: 2.0.2-4.0.0_50.26.0
                   Software Version: ELAP 1.0.1-4.0.0_50.31.0
                   Wed Sep  7 15:05:55 EDT 2005
      ELAP A Provisioning Network IP Address = 192.168.61.202
      ELAP B Provisioning Network IP Address = 192.168.61.203
      Provisioning Network Netmask           = 255.255.255.0
      Provisioning Network Default Router    = 192.168.61.250
      ELAP A Backup Prov Network IP Address  = Not configured
      ELAP B Backup Prov Network IP Address  = Not configured
      Backup Prov Network Netmask            = Not configured
      Backup Prov Network Default Router     = Not configured
      ELAP A Sync Network Address            = 192.168.2.100
      ELAP B Sync Network Address            = 192.168.2.200
      ELAP A Main DSM Network Address        = 192.168.120.100
      ELAP B Main DSM Network Address        = 192.168.120.200
      ELAP A Backup DSM Network Address      = 192.168.121.100
      ELAP B Backup DSM Network Address      = 192.168.121.200
      ELAP A HTTP Port                       = 80
      ELAP B HTTP Port                       = 80
      ELAP A HTTP SuExec Port                = 8001
      ELAP B HTTP SuExec Port                = 8001
      ELAP A Banner Connection Port          = 8473
      ELAP B Banner Connection Port          = 8473
      ELAP A Static NAT Address              = Not configured
      ELAP B Static NAT Address              = Not configured
      ELAP A LSMS Connection Port            = 7483
      ELAP B LSMS Connection Port            = 7483
      ELAP A EBDA Connection Port            = 1030
      ELAP B EBDA Connection Port            = 1030
      Time Zone                              = America/New_York
      
      Press return to continue...
      
    2. Verify that the provisioning network IP address, netmask, and network default router IP address for the server reporting this alarm are correct.
      If configuration changes are needed, refer to the Administration and LNP Feature Activation Guide for ELAP.
  2. Verify that a customer-supplied cable labeled TO CUSTOMER NETWORK is securely connected to the appropriate server. Follow the cable to its connection point on the local network and verify this connection is also secure.
  3. Test the customer-supplied cable labeled TO CUSTOMER NETWORK with an Ethernet Line Tester. If the cable does not test positive, replace it.
  4. Have your network administrator verify that the network is functioning properly.
  5. If no other nodes on the local network are experiencing problems and the fault has been isolated to the server or the network administrator is unable to determine the exact origin of the problem, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
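
As a quick consistency check on the values displayed in step 1, the following Python sketch confirms that both provisioning addresses and the default router fall inside the subnet implied by the netmask. It uses only the standard `ipaddress` module. The function name `check_provisioning_network` is illustrative, and the check covers internal consistency only: it cannot tell whether the values match the site's network plan.

```python
import ipaddress


def check_provisioning_network(elap_a, elap_b, netmask, router):
    """Return a list of problems found; an empty list means the values are consistent."""
    # Derive the subnet from the ELAP A address and the netmask.
    network = ipaddress.ip_network("%s/%s" % (elap_a, netmask), strict=False)
    problems = []
    if elap_a == elap_b:
        problems.append("ELAP A and ELAP B use the same address %s" % elap_a)
    for label, address in (("ELAP B", elap_b), ("Default router", router)):
        if ipaddress.ip_address(address) not in network:
            problems.append("%s %s is outside %s" % (label, address, network))
    return problems


if __name__ == "__main__":
    # Values taken from the sample Display Configuration output above.
    for problem in check_provisioning_network(
        "192.168.61.202", "192.168.61.203", "255.255.255.0", "192.168.61.250"
    ):
        print(problem)
```

With the sample values the sketch prints nothing, because all three addresses belong to the same subnet.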

3000000000000200 – Server Eagle Network A error

Alarm Type: TPD

Description: This alarm is generated by the MPS syscheck software package and is not part of the TPD distribution.

Description:

Note:

If these three alarms exist, the probable cause is a failed mate server.
  • 3000000000000200 - Server Eagle Network A Error

  • 3000000000000400 - Server Eagle Network B Error

  • 3000000000000800 - Server Sync Network Error

This alarm indicates an error in the Main SM network, which connects to the SM A ports. The error may be caused by one or more of the following conditions:
  • One or both of the servers is not operational.

  • One or both of the switches is not powered on.

  • The link between the switches is not working.

  • The connection between server A and server B is not working.

The following are some of the connections between the servers of the SM networks (main and backup):
  • The eth01 interface (top ethernet port on the rear of the server) connects to the customer provisioning network.

  • The eth02 interface (2nd from top ethernet port on the rear of the server) connects to port 3 of switch A.

  • The eth03 interface (2nd from bottom ethernet port on the rear of the server) connects to port 3 of switch B.

  • The eth04 interface (bottom ethernet port on the rear of the server) connects to port 5 of switch A.

  • The interfaces on the switch are ports 1 through 20 (from left to right) located on the front of the switch.

  • Ports 1 and 2 of switch A connect to ports 1 and 2 of switch B.

  • Ports 7 to 24 of switch A and ports 5 through 24 of switch B can be used for links to the Main SM ports (SM A ports) on the EAGLE.

Severity: Major

OID: 1.3.6.1.4.1.323.5.3.18.3.1.2.10

Alarm ID: TKSPLATMA10 3000000000000200

Recovery

  1. Perform the following:
    1. Verify that both servers are powered on by ensuring that the POWER LEDs on both servers are illuminated green.
    2. Verify that the Ethernet hubs or switches are powered on.
    3. Verify that no fault lights on the Ethernet hubs or switches are illuminated.
  2. Perform the following substeps to verify that the network configuration is correct.
    1. Log in as elapconfig on the E5-APP-B server.
      Enter option 1, Display Configuration, from the ELAP Configuration Menu.
      
      /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  8 | Mate Disaster Recovery            |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice:  1
      
      Output similar to the following is displayed. The network configuration information related to the EAGLE Network A (the Main DSM network) is highlighted in bold.
      
      MPS Side A:  hostname: bahamas-a  hostid: a8c0ca3d
                   Platform Version: 2.0.2-4.0.0_50.26.0
                   Software Version: ELAP 1.0.1-4.0.0_50.31.0
                   Wed Sep  7 15:05:55 EDT 2005
      
      ELAP A Provisioning Network IP Address = 192.168.61.202
      ELAP B Provisioning Network IP Address = 192.168.61.203
      Provisioning Network Netmask           = 255.255.255.0
      Provisioning Network Default Router    = 192.168.61.250
      ELAP A Backup Prov Network IP Address  = Not configured
      ELAP B Backup Prov Network IP Address  = Not configured
      Backup Prov Network Netmask            = Not configured
      Backup Prov Network Default Router     = Not configured
      ELAP A Sync Network Address            = 192.168.2.100
      ELAP B Sync Network Address            = 192.168.2.200
      ELAP A Main DSM Network Address        = 192.168.120.100
      ELAP B Main DSM Network Address        = 192.168.120.200
      ELAP A Backup DSM Network Address      = 192.168.121.100
      ELAP B Backup DSM Network Address      = 192.168.121.200
      ELAP A HTTP Port                       = 80
      ELAP B HTTP Port                       = 80
      ELAP A HTTP SuExec Port                = 8001
      ELAP B HTTP SuExec Port                = 8001
      ELAP A Banner Connection Port          = 8473
      ELAP B Banner Connection Port          = 8473
      ELAP A Static NAT Address              = Not configured
      ELAP B Static NAT Address              = Not configured
      ELAP A LSMS Connection Port            = 7483
      ELAP B LSMS Connection Port            = 7483
      ELAP A EBDA Connection Port            = 1030
      ELAP B EBDA Connection Port            = 1030
      Time Zone                              = America/New_York
      
      Press return to continue...
      
    2. Verify that the Main DSM Network IP address for the server reporting this alarm is correct.

      If configuration changes are needed, refer to the Administration and LNP Feature Activation Guide for ELAP.

  3. Verify that both servers are powered on by confirming that the POWER LEDs on both servers are illuminated green.
    1. Verify that the switch is powered on.
    2. Verify that the switch does not have any fault lights illuminated.
    3. Verify that the eth01 cable is securely connected to the top port on the server that is reporting the error.
    4. Trace the eth01 cable to the switch. Verify that the eth01 cable is securely connected at the correct point of the customer uplink.
    5. Verify that the cable connecting the switches is securely connected at both switches.
  4. Run syscheck.
    1. If the alarm is cleared, the problem is resolved.
    2. If the alarm is not cleared, continue with the next step.
  5. Verify that the cable from eth01 to the switch tests positive with an Ethernet Line Tester. Replace any faulty cables.
  6. If the problem persists, call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
  7. Perform general IP troubleshooting.
    The syscheck utility reports this error when it tries to ping hosts dsmm-a and dsmm-b a set number of times and fails. This failure could mean any number of things are at fault on the network, but general IP troubleshooting will usually resolve the issue. The platcfg utility can be used to help isolate the problem. To access the platcfg utility:
    1. Log in as platcfg to the server that is generating the alarm.
      
      Login:  platcfg
      Password: <Enter platcfg password>
      
    2. To display various network information and statistics, select menu options: Diagnostics->Network Diagnostics->Netstat
    3. To ping dsmm-a and/or dsmm-b, select menu options: Diagnostics->Network Diagnostics->Ping
    4. To verify that no routing issues exist, select menu options: Diagnostics->Network Diagnostics->Traceroute
  8. Run savelogs to gather all application logs.
  9. Run savelogs_plat to gather system information for further troubleshooting, and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
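
The reachability test described in step 7 can be approximated outside of syscheck. The following Python sketch pings each host a fixed number of times and reports the hosts that never answered. It is not the syscheck implementation: the attempt count of 3, the rule that a host fails only when every attempt fails, and the function names are assumptions, and `ping_once` relies on the Linux `ping` options `-c` (count) and `-W` (timeout in seconds).

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Send one ICMP echo request using the Linux ping command."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(hosts, attempts=3, ping=ping_once):
    """Return the hosts that did not answer any of the ping attempts."""
    failed = []
    for host in hosts:
        # any() stops at the first successful attempt for this host.
        if not any(ping(host) for _ in range(attempts)):
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Host names taken from the description in step 7.
    print(unreachable_hosts(["dsmm-a", "dsmm-b"]))
```

The `ping` parameter accepts any callable that returns True or False, so the retry logic can be exercised without network access.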

3000000000000400 – Server Eagle Network B error

Alarm Type: TPD

Description: This alarm is generated by the MPS syscheck software package and is not part of the TPD distribution.

Description:

Note:

If these three alarms exist, the probable cause is a failed mate server.
  • 3000000000000200 - Server Eagle Network A Error
  • 3000000000000400 - Server Eagle Network B Error
  • 3000000000000800 - Server Sync Network Error
This alarm indicates an error in the Backup SM network, which connects to the SM B ports. The error may be caused by one or more of the following conditions:
  • One or both of the servers is not operational.
  • One or both of the switches is not powered on.
  • The link between the switches is not working.
  • The connection between server A and server B is not working.
The following are some of the connections between the servers of the SM networks (main and backup):
  • The eth01 interface (top ethernet port on the rear of the server) connects to the customer provisioning network.
  • The eth02 interface (2nd from top ethernet port on the rear of the server) connects to port 4 of switch A.
  • The eth03 interface (2nd from bottom ethernet port on the rear of the server) connects to port 4 of switch B.
  • The eth04 interface (bottom ethernet port on the rear of the server) connects to port 6 of switch A.
  • The interfaces on the switch are ports 1 through 24 (from left to right) located on the front of the switch.
  • Ports 1 and 2 of switch A connect to ports 1 and 2 of switch B.
  • Ports 7 to 24 of switch A and ports 5 through 24 of switch B can be used for links to the Main SM ports (SM A ports) on the EAGLE.

Severity: Major

OID: 1.3.6.1.4.1.323.5.3.18.3.1.2.11

Alarm ID: TKSPLATMA113000000000000400

Recovery

  1. Perform the following:
    1. Verify that both servers are powered on by ensuring that the POWER LEDs on both servers are illuminated green.
    2. Verify that the Ethernet hubs or switches are powered on.
    3. Verify that no fault lights on the Ethernet hubs or switches are illuminated.
  2. Perform the following substeps to verify that the network configuration is correct.
    1. Log in as elapconfig on the E5-APP-B server.

      Enter option 1, Display Configuration, from the ELAP Configuration Menu.

      
      /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  8 | Mate Disaster Recovery            |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice:  1
      
      Output similar to the following is displayed. The network configuration information related to the EAGLE Network B (the Backup DSM network) is shown in the ELAP A Backup DSM Network Address and ELAP B Backup DSM Network Address fields.
      
      MPS Side A:  hostname: bahamas-a  hostid: a8c0ca3d
                   Platform Version: 2.0.2-4.0.0_50.26.0
                   Software Version: ELAP 1.0.1-4.0.0_50.31.0
                   Wed Sep  7 15:05:55 EDT 2005
      
      ELAP A Provisioning Network IP Address = 192.168.61.202
      ELAP B Provisioning Network IP Address = 192.168.61.203
      Provisioning Network Netmask           = 255.255.255.0
      Provisioning Network Default Router    = 192.168.61.250
      ELAP A Backup Prov Network IP Address  = Not configured
      ELAP B Backup Prov Network IP Address  = Not configured
      Backup Prov Network Netmask            = Not configured
      Backup Prov Network Default Router     = Not configured
      ELAP A Sync Network Address            = 192.168.2.100
      ELAP B Sync Network Address            = 192.168.2.200
      ELAP A Main DSM Network Address        = 192.168.120.100
      ELAP B Main DSM Network Address        = 192.168.120.200
      ELAP A Backup DSM Network Address      = 192.168.121.100
      ELAP B Backup DSM Network Address      = 192.168.121.200
      ELAP A HTTP Port                       = 80
      ELAP B HTTP Port                       = 80
      ELAP A HTTP SuExec Port                = 8001
      ELAP B HTTP SuExec Port                = 8001
      ELAP A Banner Connection Port          = 8473
      ELAP B Banner Connection Port          = 8473
      ELAP A Static NAT Address              = Not configured
      ELAP B Static NAT Address              = Not configured
      ELAP A LSMS Connection Port            = 7483
      ELAP B LSMS Connection Port            = 7483
      ELAP A EBDA Connection Port            = 1030
      ELAP B EBDA Connection Port            = 1030
      Time Zone                              = America/New_York
      
      Press return to continue...
      
    2. Verify that the Backup DSM Network IP address for the server reporting this alarm is correct.

      If configuration changes are needed, refer to Administration and LNP Feature Activation Guide for ELAP.

  3. Verify that both servers are powered on by confirming that the POWER LEDs on both servers are illuminated green.
    1. Verify that the switch is powered on.
    2. Verify that the switch does not have any fault lights illuminated.
    3. Verify that the eth01 cable is securely connected to the top port of the server that is reporting the error.
    4. Trace the eth01 cable to the switch. Verify that the eth01 cable is securely connected to the correct point of the customer uplink.
    5. Verify that the cable connecting the switches is securely connected at both switches.
  4. Run syscheck.
    1. If the alarm is cleared, the problem is resolved.
    2. If the alarm is not cleared, continue with the next step.
  5. Verify that the cable from eth01 to the hub tests positive with an Ethernet Line Tester. Replace any faulty cables.
  6. If the problem persists, call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
  7. Perform general IP troubleshooting.
    The syscheck utility reports this error when it tries to ping hosts dsmb-a and dsmb-b a set number of times and fails. This failure could mean any number of things are at fault on the network, but general IP troubleshooting will usually resolve the issue. The platcfg utility can be used to help isolate the problem. To access the platcfg utility:
    1. Log in as platcfg to the server that is generating the alarm.
      
      Login:  platcfg
      Password: <Enter platcfg password>
      
    2. To display various network information and statistics, select menu options: Diagnostics->Network Diagnostics->Netstat
    3. To ping dsmb-a, dsmb-b, or both, select menu options: Diagnostics->Network Diagnostics->Ping
    4. To verify that no routing issues exist, select menu options: Diagnostics->Network Diagnostics->Traceroute
  8. Run savelogs to gather all application logs (see Saving Logs Using the ELAP GUI).
  9. Run savelogs_plat to gather system information for further troubleshooting (see Saving Logs Using the ELAP GUI), and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
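
The ping check described in step 7 can be approximated with the following Python sketch. It is not the syscheck implementation. It assumes that a host counts as failed only when every one of a fixed number of attempts goes unanswered, and the function names unreachable_hosts and system_ping are hypothetical. The ping options shown (-c for count, -W for timeout in seconds) are those of the common Linux ping utility.

```python
import subprocess
from typing import Callable, Iterable, List


def system_ping(host: str) -> bool:
    """Send one echo request with a 1-second timeout (Linux ping options)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(hosts: Iterable[str], attempts: int,
                      ping: Callable[[str], bool] = system_ping) -> List[str]:
    """Return the hosts that did not answer any of `attempts` pings."""
    failed = []
    for host in hosts:
        # any() stops at the first successful reply for this host.
        if not any(ping(host) for _ in range(attempts)):
            failed.append(host)
    return failed
```

For example, unreachable_hosts(["dsmb-a", "dsmb-b"], 3) returns the subset of the two hosts that never replied. The ping argument can be replaced with a stub when testing the logic without network access.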

3000000000000800 – Server Sync network error

Alarm Type: TPD

Description: This alarm is generated by the MPS syscheck software package and is not part of the TPD distribution.

Note:

If these three alarms exist, the probable cause is a failed mate server.
  • 3000000000000200-Server Eagle Network A Error

  • 3000000000000400-Server Eagle Network B Error

  • 3000000000000800-Server Sync Network Error

This alarm indicates that the eth03 connection between the two servers is not functioning properly. The eth03 connection provides a network path over which the servers synchronize data with one another. The eth03 interface is the 2nd from the bottom ethernet port on the rear of the server.

Note:

The sync interface uses eth03 and goes through switch B. All pairs are required.

Severity: Major

OID: 1.3.6.1.4.1.323.5.3.18.3.1.2.12

Alarm ID: TKSPLATMA123000000000000800

Recovery

  1. Verify that both servers are booted up by ensuring that the POWER LEDs on both servers are illuminated green.
  2. Perform the following substeps to verify that the network configuration is correct.
    1. Log in as elapconfig on the E5-APP-B server.

      Enter option 1, Display Configuration, from the ELAP Configuration Menu.

      
      /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  8 | Mate Disaster Recovery            |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice:  1
      

      Output similar to the following is displayed. The network configuration information related to the Sync Network is shown in the ELAP A Sync Network Address and ELAP B Sync Network Address fields.

      
      MPS Side A:  hostname: bahamas-a  hostid: a8c0ca3d
                   Platform Version: 2.0.2-4.0.0_50.26.0
                   Software Version: ELAP 1.0.1-4.0.0_50.31.0
                   Wed Sep  7 15:05:55 EDT 2005
      ELAP A Provisioning Network IP Address = 192.168.61.202
      ELAP B Provisioning Network IP Address = 192.168.61.203
      Provisioning Network Netmask           = 255.255.255.0
      Provisioning Network Default Router    = 192.168.61.250
      ELAP A Backup Prov Network IP Address  = Not configured
      ELAP B Backup Prov Network IP Address  = Not configured
      Backup Prov Network Netmask            = Not configured
      Backup Prov Network Default Router     = Not configured
      ELAP A Sync Network Address            = 192.168.2.100
      ELAP B Sync Network Address            = 192.168.2.200
      ELAP A Main DSM Network Address        = 192.168.120.100
      ELAP B Main DSM Network Address        = 192.168.120.200
      ELAP A Backup DSM Network Address      = 192.168.121.100
      ELAP B Backup DSM Network Address      = 192.168.121.200
      ELAP A HTTP Port                       = 80
      ELAP B HTTP Port                       = 80
      ELAP A HTTP SuExec Port                = 8001
      ELAP B HTTP SuExec Port                = 8001
      ELAP A Banner Connection Port          = 8473
      ELAP B Banner Connection Port          = 8473
      ELAP A Static NAT Address              = Not configured
      ELAP B Static NAT Address              = Not configured
      ELAP A LSMS Connection Port            = 7483
      ELAP B LSMS Connection Port            = 7483
      ELAP A EBDA Connection Port            = 1030
      ELAP B EBDA Connection Port            = 1030
      Time Zone                              = America/New_York
      
      Press return to continue...
      
    2. Verify that the Sync Network IP address for the server reporting this alarm is correct.

      If configuration changes are needed, refer to ELAP Administration and LNP Feature Activation.

  3. If the problem persists, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
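
When the Display Configuration output has been captured to a text file, the address check in step 2 can be assisted by a small script. The sketch below is only an illustration. It assumes the "name = value" layout shown in the sample output above, and the function names parse_display_configuration and is_configured_ipv4 are hypothetical.

```python
import ipaddress


def parse_display_configuration(text: str) -> dict:
    """Turn 'name = value' lines into a dictionary, ignoring other lines."""
    settings = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def is_configured_ipv4(value: str) -> bool:
    """Return True for a well-formed IPv4 address, False for 'Not configured' or junk."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


sample = """\
ELAP A Backup Prov Network IP Address  = Not configured
ELAP A Sync Network Address            = 192.168.2.100
ELAP B Sync Network Address            = 192.168.2.200
"""
config = parse_display_configuration(sample)
for name in ("ELAP A Sync Network Address", "ELAP B Sync Network Address"):
    print(name, config[name], is_configured_ipv4(config[name]))
```

The script confirms only that each field holds a syntactically valid address. Whether the address is the correct one for the site still has to be checked against the site's network plan.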

3000000000001000 - Server Disk Space Shortage Error

This alarm indicates that one of the following conditions has occurred:

  • A file system has exceeded a failure threshold, which means that more than 90% of the available disk storage has been used on the file system.

  • More than 90% of the total number of available files have been allocated on the file system.

  • A file system has a different number of blocks than it had when installed.
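
The first two conditions can be checked by hand with a short Python sketch such as the following. It is not how syscheck performs the test. It assumes a POSIX system, where os.statvfs reports block and file (inode) counts, and it uses the 90% figure quoted above as the default limit. The function names are hypothetical, and the third condition (a changed block count) is not covered.

```python
import os


def threshold_problems(total_blocks: int, free_blocks: int,
                       total_files: int, free_files: int,
                       limit: float = 0.90) -> list:
    """Report which of the two usage thresholds are exceeded."""
    problems = []
    if total_blocks and (total_blocks - free_blocks) / total_blocks > limit:
        problems.append("disk space above threshold")
    # Some file systems report zero total files; skip the check for them.
    if total_files and (total_files - free_files) / total_files > limit:
        problems.append("file allocation above threshold")
    return problems


def check_filesystem(path: str, limit: float = 0.90) -> list:
    """Apply the thresholds to the file system that holds `path` (POSIX only)."""
    st = os.statvfs(path)
    return threshold_problems(st.f_blocks, st.f_bfree,
                              st.f_files, st.f_ffree, limit)
```

Calling check_filesystem("/var/TKLC/elap/free") returns an empty list when both values are at or below the limit.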

Recovery

  1. Run syscheck (see Running syscheck Through the ELAP GUI).
  2. Examine the syscheck output to determine if the file system /var/TKLC/elap/free is low on space. If it is, continue to the next step; otherwise, go to step 4.
  3. If possible, recover space on the free partition by deleting unnecessary files:
    1. Log in to the ELAP GUI (see Running syscheck Through the ELAP GUI).
    2. Select Debug>Manage Logs & Backups.
      A screen similar to Figure 5-1 is displayed. This screen displays information about the total amount of space allocated for and currently used by logs and backups. The display includes logs and backup files which might be selected for deletion to recover additional disk space.

      Figure 5-1 Manage Logs and Backups


      img/t_3000000000001000_mps_plat_alarms_t1000mps_maintmanual-fig1.jpg

      img/t_3000000000001000_mps_plat_alarms_t1100mps_maintmanual-fig1.jpg
    3. Click the checkbox of each file to be deleted, then click Delete Selected File(s).
  4. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000000002000 - Server Default Route Network Error

This alarm indicates that the default network route of the server is experiencing a problem. Running syscheck in Verbose mode provides information about the type of problem.

Caution:

When changing the network routing configuration of the server, verify that the modifications will not impact the method of connectivity for the current login session. The route information must be entered correctly and set to the correct values. Incorrectly modifying the routing configuration of the server may result in total loss of remote network access.

Recovery

  1. Run syscheck in Verbose mode (see Running the System Health Check).

    The output should indicate one of the following errors:

    • The default router at <IP_address> cannot be pinged.

      This error indicates that the router may not be operating or is unreachable. If the syscheck Verbose output returns this error, go to step 2.

    • The default route is not on the provisioning network.

      This error indicates that the default route has been defined in the wrong network. If the syscheck Verbose output returns this error, go to step 3.

    • An active route cannot be found for a configured default route.

      This error indicates that a mismatch exists between the active configuration and the stored configuration. If the syscheck Verbose output returns this error, go to step 4.

    Note:

    If the syscheck Verbose output does not indicate one of the errors above, go to step 5.
  2. Perform the following substeps when syscheck Verbose output indicates:
    
    The default router at <IP_address> cannot be pinged
    
    1. Verify that the network cables are firmly attached to the server, network switch, router, Ethernet switch or hub, and any other connection points.
    2. Verify that the configured router is functioning properly.

      Request that the network administrator verify the router is powered on and routing traffic as required.

    3. Request that the router administrator verify that the router is configured to reply to pings on that interface.
    4. Run syscheck.
      • If the alarm is cleared, the problem is resolved and this procedure is complete.
      • If the alarm is not cleared, go to step 5.
  3. Perform the following substeps when syscheck Verbose output indicates:
    
    The default route is not on the provisioning network
    
    1. Obtain the proper Provisioning Network netmask and the IP address of the appropriate Default Route on the provisioning network.

      This information is maintained by the customer network administrators.

    2. Log in to the MPS with user name elapconfig.
      The server designation at this site is displayed as well as hostname, hostid, Platform Version, Software Version, and the date. Verify that the side displayed is the MPS that is reporting the problem. In this example, it is MPS A. Enter option 2, Configure Network Interfaces Menu, from the ELAP Configuration Menu.

      
      MPS Side A:  hostname: mpsa-d1a8f8  hostid: 80d1a8f8
                   Platform Version: x.x.x-x.x.x
                   Software Version: ELAP x.x.x-x.x.x
                   Wed Jul 17 09:51:47 EST 2005
      /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  8 | Mate Disaster Recovery            |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice:  2
      
    3. Enter option 1, Configure Provisioning Network, from the Configure Network Interfaces Menu.

      The submenu for configuring communications networks and other information is displayed.

      
      /-----Configure Network Interfaces Menu-\
      /-----------------------------------------\
      |  1 | Configure Provisioning Network     |
      |----|------------------------------------|
      |  2 | Configure DSM Network              |
      |----|------------------------------------|
      |  3 | Configure Forwarded Ports          |
      |----|------------------------------------|
      |  4 | Configure Static NAT Addresses     |
      |----|------------------------------------|
      |  e | Exit                               |
      \-----------------------------------------/
      
      Enter choice:  1
      
      This warning is displayed.
      ELAP software is running. Stop it? [N] Y
    4. Type Y and press Enter.
    5. If the LSMS Connection has not been previously disabled, the following prompt is displayed. Type Y and press Enter.
      
      The LSMS Connection is currently enabled. Do you want to disable it? [Y]  Y
      
      This confirmation is displayed.
      
      The LSMS Connection has been disabled.
      
      The ELAP A provisioning network IP address is displayed.
      
      Verifying connectivity with mate ...
      Enter the ELAP A provisioning network IP Address [192.168.61.90]:
      
    6. Press Enter after each address is displayed until the Default Route address displays.
      
      Verifying connectivity with mate ...
      Enter the ELAP A provisioning network IP Address [192.168.61.90]:
      Enter the ELAP B provisioning network IP Address [192.168.61.91]:
      Enter the ELAP provisioning network netmask [255.255.255.0]:
      Enter the ELAP provisioning network default router IP Address: 192.168.61.250
      
    7. If the default router IP address is incorrect, type the correct address and press Enter.
    8. After the Provisioning Network configuration information is verified and corrected, type e to return to the Configure Network Interfaces Menu.
    9. Type e again to return to the ELAP Configuration Menu.
    10. Run syscheck.
      • If the alarm is cleared, the problem is resolved. Restart the ELAP software and enable the connection to the LSMS as described in substeps 11, 12, and 13 of this step.
      • If the alarm is not cleared, go to step 5.
    11. Restart the ELAP software.
    12. Select Maintenance>LSMS Connection>Change Allowed. A window similar to the example shown in Figure 5-2 is displayed.

      Figure 5-2 Enable LSMS Connection Window


      img/t_3000000000002000_mps_plat_alarms_t1100mps_maintmanual-fig1.jpg
    13. Click the Enable LSMS Connection button.

      After the connection is enabled, the workspace displays the information shown in Figure 5-3.

      Figure 5-3 LSMS Connection Enabled


      img/t_3000000000002000_mps_plat_alarms_t1100mps_maintmanual-fig2.jpg

      This procedure is complete.

  4. Perform the following substeps to reboot the server if the syscheck Verbose output indicates the following error:
    
    An active route cannot be found for a configured default route, . .
    
    1. Log in as elapconfig on the E5-APP-B server.

      Enter option 6, Platform Menu, from the ELAP Configuration Menu.

      
      /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  8 | Mate Disaster Recovery            |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice: 6
      
    2. Enter option 2, Reboot MPS, from the ELAP Platform Menu.

      At the prompt, enter the identifier of the MPS to which you are logged in (A or B); in this example, A is used.

      
      /-----ELAP Platform Menu-\
      /--------------------------\
      |  1 | Initiate Upgrade    |
      |----|---------------------|
      |  2 | Reboot MPS          |
      |----|---------------------|
      |  3 | MySQL Backup        |
      |----|---------------------|
      |  4 | RTDB Backup         |
      |----|---------------------|
      |  e | Exit                |
      \--------------------------/
      Enter Choice:  2
      
      Are you sure you want to reboot the MPS?
      Reboot MPS A, MPS B or BOTH? [BOTH]:  A
      Reboot local MPS...
      
    3. Wait for the reboot to complete.
    4. Run syscheck.
      • If the alarm is cleared, the problem is resolved. Restart the ELAP software and enable the connection to the LSMS as described in substeps 5, 6, and 7 of this step.

      • If the alarm is not cleared, go to step 5.

    5. Restart the ELAP software.
    6. Select Maintenance>LSMS Connection>Change Allowed. A window similar to the example shown in Figure 5-4 is displayed.

      Figure 5-4 Enable LSMS Connection Window


      img/t_3000000000002000_mps_plat_alarms_t1100mps_maintmanual-fig3.jpg
    7. Click the Enable LSMS Connection button.

      After the connection is enabled, the workspace displays the information shown in Figure 5-5.

      Figure 5-5 LSMS Connection Enabled


      img/t_3000000000002000_mps_plat_alarms_t1100mps_maintmanual-fig4.jpg

      This procedure is complete.

  5. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for further assistance. Provide the syscheck output collected in the previous steps.
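
The condition behind the message "The default route is not on the provisioning network" can be checked offline with Python's standard ipaddress module. The sketch below is an illustration rather than the syscheck test itself, and the function name default_route_on_network is hypothetical. The addresses come from the sample Display Configuration output in this chapter.

```python
import ipaddress


def default_route_on_network(server_ip: str, netmask: str, router_ip: str) -> bool:
    """Return True when the router address lies inside the server's subnet."""
    network = ipaddress.ip_interface(f"{server_ip}/{netmask}").network
    return ipaddress.ip_address(router_ip) in network


# Provisioning address, netmask, and default router from the sample output.
print(default_route_on_network("192.168.61.202", "255.255.255.0", "192.168.61.250"))
# A router on the sync network would not be on the provisioning network.
print(default_route_on_network("192.168.61.202", "255.255.255.0", "192.168.2.1"))
```

The first call prints True and the second prints False. Running the check before entering a new default router address in step 3 can catch an address typed into the wrong subnet.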

3000000000004000 - Server Temperature Error

Alarm Type: TPD

Description: The internal temperature within the server is unacceptably high.

Severity: Major

OID: TpdTemperatureErrorNotify 1.3.6.1.4.1.323.5.3.18.3.1.2.15

Alarm ID: TKSPLATMA153000000000004000

Recovery

  1. Ensure that nothing is blocking the fan's intake. Remove any blockage.
  2. Verify that the temperature in the room is within the normal range shown in the following table. If it is too hot, lower the temperature in the room to an acceptable level.

    Table 5-3 Server Environmental Conditions

    Ambient Temperature

    Operating: 5 degrees C to 40 degrees C

    Exceptional Operating Limit: 0 degrees C to 50 degrees C

    Storage: -20 degrees C to 60 degrees C

    Ambient Temperature

    Operating: 5° C to 35° C

    Storage: -20° C to 60° C

    Relative Humidity

    Operating: 5% to 85% non-condensing

    Storage: 5% to 95% non-condensing

    Elevation

    Operating: -300m to +300m

    Storage: -300m to +1200m

    Heating, Ventilation, and Air Conditioning

    Capacity must compensate for up to 5100 BTUs/hr for each installed frame.

    Calculate HVAC capacity as follows:

    Determine the wattage of the installed equipment. Use the formula: watts x 3.412 = BTUs/hr

    Note:

    Be prepared to wait the appropriate period of time before continuing with the next step. Conditions need to be below alarm thresholds consistently for the alarm to clear. The alarm may take up to five minutes to clear after conditions improve. It may take about ten minutes after the room returns to an acceptable temperature before syscheck shows the alarm cleared.
  3. Run syscheck to check whether the alarm has cleared.
    • If the alarm has been cleared, the problem is resolved.
    • If the alarm has not been cleared, continue with the next step.
  4. Replace the filter (refer to the appropriate hardware manual).

    Note:

    Be prepared to wait the appropriate period of time before continuing with the next step. Conditions need to be below alarm thresholds consistently for the alarm to clear. The alarm may take up to five minutes to clear after conditions improve. It may take about ten minutes after the filter is replaced before syscheck shows the alarm cleared.
  5. Run syscheck.
    • If the alarm has been cleared, the problem is resolved.
    • If the alarm has not been cleared, continue with the next step.
  6. If the problem has not been resolved, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
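
The HVAC calculation in Table 5-3 can be worked through with a short Python sketch. The sketch uses the standard physical conversion of 1 watt to approximately 3.412 BTU/hr, which is an assumption of this example. The 5100 BTU/hr figure is the per-frame value from the table, and the function names are hypothetical.

```python
WATTS_TO_BTU_PER_HOUR = 3.412     # standard conversion: 1 W is about 3.412 BTU/hr
BTU_PER_HOUR_PER_FRAME = 5100     # per-frame planning figure from Table 5-3


def heat_load_btu_per_hour(watts: float) -> float:
    """Convert the wattage of the installed equipment to a heat load."""
    return watts * WATTS_TO_BTU_PER_HOUR


def planned_capacity_btu_per_hour(frames: int) -> int:
    """HVAC capacity the table asks for, given the number of installed frames."""
    return frames * BTU_PER_HOUR_PER_FRAME


def load_within_plan(watts: float, frames: int) -> bool:
    """Return True when the equipment heat load fits the planned capacity."""
    return heat_load_btu_per_hour(watts) <= planned_capacity_btu_per_hour(frames)
```

For example, 1000 watts of equipment produces about 3412 BTU/hr, which fits within the figure for a single frame, while 1500 watts (about 5118 BTU/hr) does not.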

3000000000008000 - Server Mainboard Voltage Error

This alarm indicates that at least one monitored voltage on the server mainboard is not within the normal operating range.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000000010000 - Server Power Feed Error

This alarm indicates that one of the power feeds to the server has failed.

Recovery

  1. Locate the server supplied by the faulty power feed. Verify that all connections to the power supply units are connected securely. To determine where the cables connect to the servers, see the Power Connections and Cables page of the ELAP E5-APP-B Interconnect.
  2. Run syscheck (see Running syscheck Through the ELAP GUI).
    1. If the alarm is cleared, the problem is resolved.
    2. If the alarm is not cleared, go to step 3.
  3. Trace the power feed to its connection on the power source.
    Verify that the power source is on and that the power feed is properly secured.
  4. Run syscheck (see Running syscheck Through the ELAP GUI).
    1. If the alarm is cleared, the problem is resolved.
    2. If the alarm is not cleared, go to step 5.
  5. If the power source is functioning properly and all connections are secure, request that an electrician check the voltage on the power feed.
  6. Run syscheck (see Running syscheck Through the ELAP GUI).
    1. If the alarm is cleared, the problem is resolved.
    2. If the alarm is not cleared, go to step 7.
  7. If the problem is not resolved, call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
  8. Run savelogs_plat to gather system information for further troubleshooting (see Saving Logs Using the ELAP GUI), and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

3000000000020000 - Server Disk Health Test Error

This alarm indicates that the hard drive has failed or failure is imminent.

Recovery

  1. Immediately contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance with a disk replacement.

3000000000040000 - Server Disk Unavailable Error

This alarm indicates that the smartd service is not able to read the disk status because the disk has other problems that are reported by other alarms. This alarm appears only while a server is booting.

Recovery

  1. Perform the recovery procedures for the other alarms that accompany this alarm.

3000000000080000 - Device Error

This alarm indicates that the offboard storage server has a problem with its disk volume filling.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000000100000 - Device Interface Error

This alarm indicates that the IP bond is either not configured or not functioning.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000000200000 - Correctable ECC Memory Error

This alarm indicates that the chipset has detected a correctable (single-bit) memory error that has been corrected by the Error-Correcting Code (ECC) circuitry in the memory.

Recovery

  1. No recovery necessary.

    If the condition persists, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 to request hardware replacement.

3000000400000000 - Multipath device access link problem

Alarm Type: TPD

Description: One or more "access paths" of a multipath device are failing or are not healthy, or the multipath device does not exist.

Severity: Major

OID: TpdMpathDeviceProblemNotify 1.3.6.1.4.1.323.5.3.18.3.1.2.35

Alarm ID: TKSPLATMA353000000400000000

Recovery

  1. unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 should do the following:
    1. In the MSA administration console (web application), check that the correct "volumes" exist on the MSA and that read/write access is granted to the blade server.
    2. Check whether the multipath daemon/service is running on the blade server: service multipathd status. Resolution:
      1. Start multipathd: service multipathd start
    3. Check the output of "multipath -ll", which shows all multipath devices existing in the system and their access paths, and check that the particular /dev/sdX devices exist. Devices may be missing because the SCSI bus and/or FC HBAs have not been rescanned for new devices. Resolution:
      1. Run "/opt/hp/hp_fibreutils/hp_rescan -a",
      2. "echo 1 > /sys/class/fc_host/host*/issue_lip",
      3. "echo '- - -' > /sys/class/scsi_host/host*/scan"
    4. Check whether the syscheck::disk::multipath test is configured to monitor the right multipath devices and their access paths: compare the output of "multipath -ll" with the output of "syscheckAdm disk multipath --get --var=MPATH_LINKS". Resolution:
      1. Configure the disk::multipath check correctly.
  2. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
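
Reviewing "multipath -ll" output by eye is error-prone when many paths exist, so the following Python sketch shows one way to list the access paths that are not healthy. It is an illustration only. It assumes path lines of the form "H:C:T:L sdX major:minor dm-state path-state online-state", which is the layout printed by common multipath-tools releases and may differ on other versions. The function name unhealthy_paths is hypothetical.

```python
import re

# Assumed path-line layout: "1:0:0:1 sdb 8:16 active ready running".
PATH_LINE = re.compile(
    r"(\d+:\d+:\d+:\d+)\s+(sd[a-z]+)\s+\d+:\d+\s+(\w+)\s+(\w+)\s+(\w+)"
)


def unhealthy_paths(multipath_ll_output: str) -> list:
    """Return device names whose path is not reported as 'active ready'."""
    bad = []
    for line in multipath_ll_output.splitlines():
        match = PATH_LINE.search(line)
        if match:
            _hctl, device, dm_state, path_state, _online = match.groups()
            if (dm_state, path_state) != ("active", "ready"):
                bad.append(device)
    return bad
```

The captured output of "multipath -ll" is passed in as a string. An empty result means every listed path was reported as active and ready; it does not prove that all expected devices are present.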

3000000800000000 – Switch Link Down Error

This alarm indicates that the switch is reporting that the link is down. The link that is down is reported in the alarm. For example, port 1/1/2 is reported as 1102.
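
The port notation can be converted in both directions with a short Python sketch. The encoding used here is inferred from the single example above (port 1/1/2 reported as 1102) and is an assumption: one digit for the first field, one digit for the second, and a two-digit, zero-padded port number. The function names are hypothetical.

```python
def port_to_code(unit: int, slot: int, port: int) -> str:
    """Encode unit/slot/port as the four-character form, assuming 1 + 1 + 2 digits."""
    return f"{unit}{slot}{port:02d}"


def code_to_port(code: str) -> str:
    """Decode a four-character port code back to unit/slot/port notation."""
    unit, slot, port = code[0], code[1], int(code[2:])
    return f"{unit}/{slot}/{port}"


print(code_to_port("1102"))
```

The call prints 1/1/2, matching the example in the alarm description.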

Recovery Procedure:

  1. Verify cabling between the offending port and remote side.
  2. Verify networking on the remote end.
  3. If the problem persists, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 to verify port settings on both the server and the switch.

3000001000000000 - Half-open Socket Limit

Alarm Type: TPD

Description: This alarm indicates that the number of half-open TCP sockets has reached the major threshold. This problem is caused by a remote system failing to complete the TCP 3-way handshake.

Severity: Major

OID: tpdHalfOpenSocketLimit 1.3.6.1.4.1.323.5.3.18.3.1.2.37

Alarm ID: TKSPLATMA37 3000001000000000

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
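
When gathering information for support, it can help to see how many sockets are currently half open. The sketch below is an illustration only and is not how the platform computes the alarm. It assumes a Linux /proc/net/tcp layout, where the fourth column holds the kernel TCP state and 03 means SYN_RECV (handshake started but not completed). The sample rows are made up.

```python
# Kernel TCP state code as it appears in the "st" column of /proc/net/tcp.
SYN_RECV = "03"  # handshake started by the peer but not yet completed

# Illustrative /proc/net/tcp content; addresses and counters are made up.
SAMPLE_PROC_NET_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 5A3DA8C0:0016 0A3DA8C0:C350 01 00000000:00000000 00:00000000 00000000     0        0 12345
   1: 5A3DA8C0:1F90 0B3DA8C0:C351 03 00000000:00000000 01:00000064 00000002     0        0 0
   2: 5A3DA8C0:1F90 0C3DA8C0:C352 03 00000000:00000000 01:00000064 00000001     0        0 0
"""


def count_half_open(proc_net_tcp_text):
    """Count sockets in SYN_RECV state in /proc/net/tcp style text."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == SYN_RECV:
            count += 1
    return count
```

On a live server the same function could be fed the contents of /proc/net/tcp. The threshold that raises the alarm is not stated in this chapter, so none is assumed here.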

3000002000000000 - Flash Program Failure

Alarm Type: TPD

Description: This alarm indicates there was an error while trying to update the firmware flash on the E5-APP-B cards.

Severity: Major

OID: tpdFlashProgramFailure 1.3.6.1.4.1.323.5.3.18.3.1.2.38

Alarm ID: TKSPLATMA38 3000002000000000

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

3000004000000000 - Serial Mezzanine Unseated

Alarm Type: TPD

Description: This alarm indicates that the serial mezzanine board was not properly seated.

Severity: Major

OID: tpdSerialMezzUnseated 1.3.6.1.4.1.323.5.3.18.3.1.2.39

Alarm ID: TKSPLATMA39 3000004000000000

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

3000000008000000 - Server HA Keepalive Error

This alarm indicates that the heartbeat process did not receive a heartbeat packet within the timeout period.

Recovery

  1. Determine if the mate server is currently operating. If the mate server is not operating, attempt to restore it to operation.
  2. Determine if the keepalive interface is operating.
  3. Determine if heartbeat is running (service TKLCha status).
  4. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000010000000 - DRBD block device can not be mounted

This alarm indicates that DRBD is not functioning properly on the local server. The DRBD state (disk state, node state, or connection state) indicates a problem.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
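
The three states named above (connection state, node state, and disk state) can be read from a DRBD status line before calling for assistance. This is a minimal sketch that assumes the DRBD 8.x /proc/drbd layout, in which a resource line carries cs: (connection), ro: (roles), and ds: (disk states) fields. The sample line and the function names are illustrative, and the health rule is a simplification rather than the platform's own check.

```python
import re

# Illustrative resource line in the style of /proc/drbd on DRBD 8.x.
SAMPLE_PROC_DRBD = " 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----"


def parse_drbd_state(line):
    """Extract connection state, roles, and disk states from one resource line."""
    fields = dict(re.findall(r"\b(cs|ro|ds):(\S+)", line))
    local_role, _, peer_role = fields["ro"].partition("/")
    local_disk, _, peer_disk = fields["ds"].partition("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


def looks_healthy(state):
    """Simplified rule: connected, with both disks up to date."""
    return (
        state["connection"] == "Connected"
        and state["local_disk"] == "UpToDate"
        and state["peer_disk"] == "UpToDate"
    )
```

A connection state other than Connected points toward the replication alarms described in the following sections, while a local disk state other than UpToDate points toward a local problem.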

3000000020000000 - DRBD block device is not being replicated to peer

This alarm indicates that DRBD is not replicating to the peer server. Usually this alarm indicates that DRBD is not connected to the peer server. A DRBD Split Brain may have occurred.

Recovery

  1. Determine if the mate server is currently operating.
  2. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000000040000000 - DRBD peer needs intervention

This alarm indicates that DRBD is not functioning properly on the peer server. DRBD is connected to the peer server, but the DRBD state on the peer server is either unknown or indicates a problem.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

3000020000000000 - Server NTP Daemon never synchronized

Alarm Type: TPD

Description: This alarm indicates that the NTP sync file (/var/TKLC/log/syscheck/ntp_sync_config) and the NTP last known good time file (/var/TKLC/log/syscheck/ntp_last_good_time) have not been synchronized.

Severity: Major

Alarm ID: TKSPLATMA42

Recovery

Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

Major Application Alarms

The major application alarms involve the ELAP software, RTDBs, file system, and logs.
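
Every alarm code listed in this category begins with the hexadecimal digit 4 and sets a single bit in the remaining digits. The sketch below shows how a combined alarm string could be split into individual codes under the assumption that a combined string is the bitwise OR of the individual codes in one category. Decode Alarm Strings remains the authoritative procedure; the function name and the short lookup table are for illustration only.

```python
# A few major application alarms taken from this chapter (not the full list).
MAJOR_APPLICATION_ALARMS = {
    "4000000000000001": "Mate ELAP Unavailable",
    "4000000000000002": "RTDB Mate Unavailable",
    "4000000000000004": "Congestion",
    "4000000000000008": "File System Full",
    "4000000000000010": "Log Failure",
}

CATEGORY_MASK = 0xF000000000000000  # leading hex digit selects the alarm category


def decode_alarm_string(alarm_string):
    """Split a 16-digit hex alarm string into single-bit alarm codes.

    Assumption: a combined string is the bitwise OR of individual codes
    that share the same leading category digit.
    """
    value = int(alarm_string, 16)
    category = value & CATEGORY_MASK
    bits = value & ~CATEGORY_MASK
    codes = []
    bit = 1
    while bit <= bits:
        if bits & bit:
            codes.append("%016X" % (category | bit))
        bit <<= 1
    return codes


def describe(alarm_string):
    """Pair each decoded code with its name, or 'not in table' if unknown."""
    return [
        (code, MAJOR_APPLICATION_ALARMS.get(code, "not in table"))
        for code in decode_alarm_string(alarm_string)
    ]
```

For example, the string 4000000000000003 would decode to the Mate ELAP Unavailable and RTDB Mate Unavailable codes, each of which has its own recovery procedure below.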

4000000000000001 - Mate ELAP Unavailable

One ELAP has reported that the other ELAP is unreachable.

Recovery

  1. Log in to the ELAP GUI (see Accessing the ELAP GUI Interface).
  2. View the ELAP status on the banner.
    • If the mate ELAP status is DOWN, go to 4.
    • If the mate ELAP status is ACTIVE or STANDBY, go to 7.
  3. Select the Select Mate menu item to change to the mate ELAP.
  4. Select Process Control>Start Software to start the mate ELAP software.
  5. View the ELAP status on the banner.
    • If the mate ELAP status is ACTIVE or STANDBY, the problem is resolved.
    • If the mate ELAP status is still DOWN, continue with 6.
  6. Select the Select Mate menu item to change back to the side that reported the alarm.
  7. Stop and start the software on the side that is reporting the alarm (see Restarting the ELAP Software).
  8. If the problem persists, run savelogs to gather system information for further troubleshooting (see Saving Logs Using the ELAP GUI), and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000002 - RTDB Mate Unavailable

The local ELAP cannot use the direct link to the Standby for RTDB database synchronization.

Recovery

  1. Log in to the ELAP GUI (see Accessing the ELAP GUI Interface).
  2. View the ELAP status on the banner.
    • If the mate ELAP status is DOWN, go to 4.
    • If the mate ELAP status is ACTIVE or STANDBY, go to 7.
  3. Select the Select Mate menu item to change to the mate ELAP.
  4. Select Process Control>Start Software to start the mate ELAP software.
  5. Determine whether the alarm has cleared by verifying whether it is still being displayed in the banner or in the Alarm View window.
    • If the alarm has cleared, the problem is resolved.
    • If the alarm has not yet cleared, continue with 6.
  6. Ensure that you are logged into the side opposite from the side reporting the alarm.

    If it is necessary to change sides, select the Select Mate menu item to change to the side opposite the side that reported the alarm.

  7. Stop and start the software on the side that is reporting the alarm (see Restarting the ELAP Software).
  8. Select RTDB>View RTDB Status to verify that the RTDB status on both sides is coherent, as shown in Figure 5-6.

    Figure 5-6 Coherent RTDB Status


  9. If the problem persists, run savelogs to gather system information for further troubleshooting (see Saving Logs Using the ELAP GUI), and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000004 - Congestion

The ELAP RTDB database record cache used to keep updates currently being provisioned is above 80% capacity.

Recovery

  1. At the EAGLE input terminal, enter the rept-stat-mps command to verify the status.

    Refer to Commands User's Guide to interpret the output.

  2. If the problem does not clear within 2 hours with an "ELAP Available" notice, capture the log files on both ELAPs (see Saving Logs Using the ELAP GUI) and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000008 - File System Full

This alarm indicates that the server file system is full.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

4000000000000010 - Log Failure

This alarm indicates that the system was unsuccessful in writing to at least one log file.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

4000000000000040 - Fatal Software Error

A major software component on the ELAP has failed.

Recovery

  1. Restart the ELAP software (see Restarting the ELAP Software).
  2. Capture the log files on both ELAPs (see Saving Logs Using the ELAP GUI) and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000080 - RTDB Corrupt

A real-time database is corrupt. The calculated checksum did not match the checksum value stored for one or more records.

Recovery

  1. Capture the log files on both ELAPs (see Saving Logs Using the ELAP GUI) and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000100 - RTDB Inconsistent

The real-time database for one or more Service Module cards is inconsistent with the current real-time database on the Active ELAP fixed disks.

Recovery

  1. Capture the log files on both ELAPs (see Saving Logs Using the ELAP GUI) and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000200 - RTDB Incoherent

This message usually indicates that the RTDB database download is in progress.

When the download is complete, the following UIM message will appear:


0452 - RTDB reload complete

Recovery

  1. If this alarm displays while an RTDB download is in progress, no further action is necessary.
  2. If this alarm displays when an RTDB download is not in progress, capture the log files on both ELAPs (see Saving Logs Using the ELAP GUI) and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000000800 - Transaction Log Full

The transaction log is full.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000001000 - RTDB 100% Full

The RTDB on the ELAP is at capacity. The ELAP RTDB is not updating.

You may be able to free up space by deleting unnecessary data in the database. The ELAP must be upgraded in order to add disk capacity.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

4000000000002000 - RTDB Resynchronization In Progress

This message indicates that the RTDB resynchronization is in progress.

Recovery

  1. No further action is necessary.

4000000000004000 - RTDB Reload Is Required

This message indicates that the RTDB reload is required because the transaction logs did not contain enough information to resynchronize the databases (the transaction logs may be too small).

Caution:

If both sides are reporting this error, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

If only one side is reporting this error, use the following procedure.

Recovery

  1. Log in to the User Interface screen of the ELAP (see Accessing the ELAP Text Interface).
  2. Refer to LNP Database Synchronization for the correct procedures.
  3. If the problem persists, contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000000400000 - LVM Snapshot detected that is too old

Note:

LVM alarms are valid for ELAP 8.0 and later.

This alarm indicates that an LVM snapshot has been present on the system for longer than 30 minutes. LVM snapshots should not exist for longer than 15 minutes or performance may be degraded as the LVM snapshot overfills with data.

The Logical Volume Manager (LVM) creates read-only snapshots of the database. These LVM snapshots are present when an audit of the LSMS database is active and when the database is being downloaded to EAGLE. An LVM snapshot provides rollback and recovery capability to the active database. All LVM snapshots are removed when no longer needed, or are removed and recreated when in continuous use such as during an LSMS audit which may last several hours.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
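
The two time limits in the description can be expressed as a small age check. This is an illustrative sketch only: the function name and return labels are hypothetical, and how the platform obtains a snapshot's creation time is not covered in this chapter.

```python
from datetime import datetime, timedelta

RECOMMENDED_MAX_AGE = timedelta(minutes=15)  # beyond this, performance may degrade
ALARM_AGE = timedelta(minutes=30)            # beyond this, the alarm is raised


def classify_snapshot_age(created, now):
    """Classify a snapshot by age using the limits stated in the alarm text."""
    age = now - created
    if age > ALARM_AGE:
        return "alarm"
    if age > RECOMMENDED_MAX_AGE:
        return "older than recommended"
    return "ok"
```

A snapshot that is removed and recreated during a long LSMS audit restarts its age at each recreation, which is why continuous use does not by itself raise this alarm.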

4000000000800000 - LVM Snapshot detected that is too full

This alarm usually occurs when an LVM snapshot has remained in existence too long and has a higher full percentage than expected; however, the alarm may occur also if an unusually large number of updates, distributed evenly across the entire database, have been received.

The Logical Volume Manager (LVM) creates read-only snapshots of the database. These LVM snapshots are present when an audit of the LSMS database is active and when the database is being downloaded to EAGLE. An LVM snapshot provides rollback and recovery capability to the active database. All LVM snapshots are removed when no longer needed, or are removed and recreated when in continuous use such as during an LSMS audit which may last several hours.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000001000000 - LVM Snapshot detected with invalid attributes

An LVM snapshot has been detected with invalid attributes. This alarm may occur if an LVM snapshot cannot be removed completely due to an error in the LVM subsystem. Restarting the ELAP software may clear this condition.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000002000000 - DRBD Split Brain

This alarm occurs when the ELAP A and B servers have simultaneous outages or if the three heartbeat paths are lost between the two servers. If either condition occurs, neither server can determine which server has the most recent copy of the database. The first system to recover becomes the HA active system. Manual action is required to determine which copy of the shared data is valid and to resynchronize with the other system.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

4000000010000000 - An instance of Snapmon already running

This alarm indicates that the ELAP monitoring of LVM snapshots is in progress. The monitoring is performed every 10 minutes by the snapmon cron job. The following lnpdb snapshots are monitored:
  • prov1snap
  • prov2snap
  • auditsnap
  • backupsnap

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

Minor Platform Alarms

Minor platform alarms involve disk space, application processes, RAM, and configuration errors.

1000000000000001 – Breaker panel feed unavailable

Alarm Type: TPD

Description: This alarm is generated by the MPS syscheck software package and is not part of the TPD distribution. Refer to MPS-specific documentation for information regarding this alarm.

Severity: Critical

OID: 1.3.6.1.4.1.323.5.3.18.3.1.1.1

Alarm ID: TKSPLATCR1

Recovery

  1. See 910-5129-001 Rev. A, PM&C/T5100 Platform Troubleshooting Guide.
  2. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000000000001 - Server Disk Space Shortage Warning

This alarm indicates that one of the following conditions has occurred:

  • A file system has exceeded a warning threshold, which means that more than 80% (but less than 90%) of the available disk storage has been used on the file system.

  • More than 80% (but less than 90%) of the total number of available files have been allocated on the file system.
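
Both conditions use the same band, above 80% and below 90%, applied once to disk space and once to allocated files (inodes). The sketch below is a hypothetical helper, not syscheck itself; it uses os.statvfs, which is available on Unix-like systems, and its percentages may differ slightly from those reported by syscheck or df.

```python
import os


def percent_used(used, total):
    """Return used/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * used / total if total else 0.0


def in_warning_range(percent):
    """True when usage is above 80% and below 90%, as described for this alarm."""
    return 80.0 < percent < 90.0


def filesystem_usage(path):
    """Return (space_percent, inode_percent) for the file system holding path."""
    st = os.statvfs(path)
    space = percent_used(st.f_blocks - st.f_bfree, st.f_blocks)
    inodes = percent_used(st.f_files - st.f_ffree, st.f_files)
    return space, inodes
```

Either value falling inside the band would match one of the two conditions listed above.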

Recovery

  1. Run syscheck (see Running syscheck Through the ELAP GUI).
  2. Examine the syscheck output to determine if the file system /var/TKLC/elap/free is the one that is low on space. If the file system is low on disk space, continue to 3; otherwise go to 4.
  3. You may be able to free up space on the free partition by deleting unnecessary files, as follows:
    1. Log in to the ELAP GUI (see Accessing the ELAP GUI Interface).
    2. Select Debug>Manage Logs & Backups.

      A screen similar to Figure 5-7 is displayed. This screen displays information about the total amount of space allocated for and currently used by logs and backups. The display includes logs and backup files which might be selected for deletion to recover additional disk space.

      Figure 5-7 Manage Logs and Backups


    3. Click the checkbox of each file to be deleted, then click Delete Selected File(s).
  4. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000000002 - Server Application Process Error

This alarm indicates that either the minimum number of instances for a required process are not currently running or too many instances of a required process are running.

Recovery

  1. If a 3000000000000020 - Server Platform Process Error alarm is also present, execute the recovery procedure associated with that alarm before proceeding.
  2. Log in to the User Interface screen of the ELAP GUI (see Accessing the ELAP GUI Interface).
  3. Check the banner information above the menu to verify that you are logged into the problem ELAP indicated in the UAM.

    If it is necessary to switch to the other side, select Select Mate.

  4. Open the Process Control folder, and select the Stop Software menu item.
  5. Open the Process Control folder, and select the Start Software menu item.
  6. Capture the log files on both ELAPs (see Saving Logs Using the ELAP GUI) and contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000000000004 - Server Hardware Configuration Error

This alarm indicates that one or more of the server’s hardware components are not in compliance with proper specifications (refer to Application B Card Hardware and Installation Guide).

Recovery

  1. Run syscheck in verbose mode.
  2. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000000008 - Server RAM Shortage Warning

This alarm indicates one of two conditions:

  • Less memory than the expected amount is installed.
  • The system is swapping pages in and out of physical memory at a fast rate, indicating a possible degradation in system performance.

This alarm may not clear immediately when conditions fall below the alarm threshold. Conditions must be below the alarm threshold consistently for the alarm to clear. The alarm may take up to five minutes to clear after conditions improve.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000000020 - Server Swap Space Shortage Warning

This alarm indicates that the swap space available on the server is less than expected. This is usually caused by a process that has allocated a very large amount of memory over time.

Note:

In order for this alarm to clear, the underlying failure condition must be consistently undetected for a number of polling intervals. Therefore, the alarm may continue to be reported for several minutes after corrective actions are completed.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
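
Current swap usage can be noted before calling for assistance. The following sketch assumes the Linux /proc/meminfo layout with SwapTotal and SwapFree lines reported in kB. The sample values are made up, and the level at which the platform raises the alarm is not stated here, so none is assumed.

```python
# Illustrative /proc/meminfo excerpt; the values are made up.
SAMPLE_MEMINFO = """\
MemTotal:        8167848 kB
MemFree:          512340 kB
SwapTotal:       2097148 kB
SwapFree:         209714 kB
"""


def swap_free_percent(meminfo_text):
    """Return free swap as a percentage of total swap, or None if no swap exists."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    total = values.get("SwapTotal", 0)
    if total == 0:
        return None
    return 100.0 * values.get("SwapFree", 0) / total
```

Recording this value at intervals can show whether a process is steadily consuming memory, which is the usual cause described above.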

5000000000000040 - Server Default Router Not Defined

This alarm indicates that the default network route is either not configured or the current configuration contains an invalid IP address or hostname.
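
Before entering a default router address, it may be worth confirming that the address is valid and lies on the provisioning network. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, and the addresses in the usage note are the example values shown in the procedure that follows.

```python
import ipaddress


def router_on_provisioning_network(host_ip, netmask, router_ip):
    """Return True if router_ip lies in the network defined by host_ip and netmask.

    Raises ValueError if any of the three values is not a valid IPv4 address
    or netmask.
    """
    network = ipaddress.ip_network("%s/%s" % (host_ip, netmask), strict=False)
    return ipaddress.ip_address(router_ip) in network
```

With the example values from this procedure, router_on_provisioning_network("192.168.61.90", "255.255.255.0", "192.168.61.250") returns True. A ValueError indicates a malformed address, which corresponds to the invalid IP address case described above.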

Caution:

When changing the server’s network routing configuration it is important to verify that the modifications will not impact the method of connectivity for the current login session. It is also crucial that this information not be entered incorrectly or set to improper values. Incorrectly modifying the server’s routing configuration may result in total loss of remote network access.

Recovery

  1. Perform the following substeps to define the default router:
    1. Obtain the proper Provisioning Network netmask and the IP address of the appropriate Default Route on the provisioning network.

      These are maintained by the customer network administrators.

    2. Log in to the MPS with username elapconfig (see Accessing the ELAP Text Interface).

      The server designation at this site is displayed as well as hostname, hostid, Platform Version, Software Version, and the date. Ensure that the side displayed is the MPS that is reporting the problem. In this example, it is MPS A. Enter option 2, Configure Network Interfaces Menu, from the ELAP Configuration Menu.

      
      MPS Side A:  hostname: mpsa-d1a8f8  hostid: 80d1a8f8
                   Platform Version: x.x.x-x.x.x
                   Software Version: ELAP x.x.x-x.x.x
                   Wed Jul 17 09:51:47 EST 2005
      
      /-----ELAP Configuration Menu----------\
      /----------------------------------------\
      |  1 | Display Configuration             |
      |----|-----------------------------------|
      |  2 | Configure Network Interfaces Menu |
      |----|-----------------------------------|
      |  3 | Set Time Zone                     |
      |----|-----------------------------------|
      |  4 | Exchange Secure Shell Keys        |
      |----|-----------------------------------|
      |  5 | Change Password                   |
      |----|-----------------------------------|
      |  6 | Platform Menu                     |
      |----|-----------------------------------|
      |  7 | Configure NTP Server              |
      |----|-----------------------------------|
      |  8 | Mate Disaster Recovery            |
      |----|-----------------------------------|
      |  e | Exit                              |
      \----------------------------------------/
      Enter Choice:  2
      
    3. Enter option 1, Configure Provisioning Network, from the Configure Network Interfaces Menu.

      This displays the submenu for configuring communications networks and other information.

      
      /-----Configure Network Interfaces Menu-\
      /-----------------------------------------\
      |  1 | Configure Provisioning Network     |
      |----|------------------------------------|
      |  2 | Configure DSM Network              |
      |----|------------------------------------|
      |  3 | Configure Forwarded Ports          |
      |----|------------------------------------|
      |  4 | Configure Static NAT Addresses     |
      |----|------------------------------------|
      |  e | Exit                               |
      \-----------------------------------------/
      Enter choice:  1
      
      The following warning appears:

      ELAP software is running. Stop it?

    4. Type Y and press Enter.
      If the LSMS Connection has not been previously disabled, the following prompt appears:
      
      The LSMS Connection is currently enabled. Do you want to disable it? [Y]  Y
      
    5. Type Y and press Enter.
      The following confirmation appears:
      
      The LSMS Connection has been disabled.
      
      The ELAP A provisioning network IP address displays:
      
      Verifying connectivity with mate ...
      Enter the ELAP A provisioning network IP Address [192.168.61.90]:
      
    6. Press Enter after each address is displayed until the Default Route address displays:
      
      Verifying connectivity with mate ...
      Enter the ELAP A provisioning network IP Address [192.168.61.90]:
      Enter the ELAP B provisioning network IP Address [192.168.61.91]:
      Enter the ELAP provisioning network netmask [255.255.255.0]:
      Enter the ELAP provisioning network default router IP Address: 192.168.61.250
      
    7. If the default router IP address is incorrect, correct it, and press Enter.
    8. After verifying or correcting the Provisioning Network configuration information, enter e to return to the Configure Network Interfaces Menu.
    9. Enter e again to return to the ELAP Configuration Menu.
  2. Rerun syscheck.
  3. Perform the following substeps to restart the ELAP and enable the LSMS connection.
    1. Restart the ELAP software (see Restarting the ELAP Software).
    2. Select Maintenance>LSMS Connection>Change Allowed. A window similar to the example shown in Figure 5-8 is displayed.

      Figure 5-8 Enable LSMS Connection Window


    3. Click the Enable LSMS Connection button.

      When the connection has been enabled, the workspace displays the information shown in Figure 5-9.

      Figure 5-9 LSMS Connection Enabled



5000000000000080 – Server temperature warning

Alarm Type: TPD

Description: This alarm indicates that the internal temperature within the server is outside of the normal operating range. A server Fan Failure may also exist along with the Server Temperature Warning.

Severity: Minor

OID: tpdTemperatureWarningNotify 1.3.6.1.4.1.323.5.3.18.3.1.3.8

Alarm ID: TKSPLATMI8 5000000000000080

Recovery

  1. Ensure that nothing is blocking the fan's intake. Remove any blockage.
  2. Verify that the temperature in the room is normal. If it is too hot, lower the temperature in the room to an acceptable level.

    Table 5-4 Server Environmental Conditions

    Ambient Temperature

    Operating: 5 degrees C to 40 degrees C

    Exceptional Operating Limit: 0 degrees C to 50 degrees C

    Storage: -20 degrees C to 60 degrees C

    Relative Humidity

    Operating: 5% to 85% non-condensing

    Storage: 5% to 95% non-condensing

    Elevation

    Operating: -300m to +300m

    Storage: -300m to +1200m

    Heating, Ventilation, and Air Conditioning

    Capacity must compensate for up to 5100 BTUs/hr for each installed frame.

    Calculate HVAC capacity as follows:

    Determine the wattage of the installed equipment. Use the formula: watts x 3.412 = BTUs/hr

    Note:

    Be prepared to wait the appropriate period of time before continuing with the next step. Conditions need to be below alarm thresholds consistently for the alarm to clear. The alarm may take up to five minutes to clear after conditions improve. It may take about ten minutes after the room returns to an acceptable temperature before syscheck shows the alarm cleared.
  3. Run syscheck to see if the alarm has cleared.
    • If the alarm has been cleared, the problem is resolved.
    • If the alarm has not been cleared, continue with the next step.
  4. Replace the filter (refer to the appropriate hardware manual).

    Note:

    Be prepared to wait the appropriate period of time before continuing with the next step. Conditions need to be below alarm thresholds consistently for the alarm to clear. It may take about ten minutes after the filter is replaced before the alarm clears.
  5. Run syscheck to see if the alarm has cleared.

5000000000000100 - Server Core File Detected

This alarm indicates that an application process has failed and debug information is available.

Recovery

  1. Run savelogs to gather system information (see Saving Logs Using the ELAP GUI).
  2. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

    They will examine the files in /var/TKLC/core and remove them after all information has been extracted.
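
To report which core files are present without modifying them, a read-only listing is sufficient. The sketch below is a hypothetical helper that only lists names and sizes in /var/TKLC/core. It deliberately deletes nothing, since removal is left to support as described above.

```python
import os


def list_core_files(core_dir="/var/TKLC/core"):
    """Return (name, size_in_bytes) for each regular file in core_dir.

    Returns an empty list if the directory does not exist. Nothing is
    modified or removed.
    """
    if not os.path.isdir(core_dir):
        return []
    entries = []
    for name in sorted(os.listdir(core_dir)):
        path = os.path.join(core_dir, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries
```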

5000000000000200 - Server NTP Daemon Not Synchronized

This alarm indicates that the NTP daemon (background process) has been unable to locate a server to provide an acceptable time reference for synchronization.

Severity: Minor

Alarm ID: TKSPLATMI10

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000000000400 - Server CMOS Battery Voltage Low

This alarm indicates that the CMOS battery voltage is below the expected value. It is an early warning of CMOS battery end-of-life, which will cause problems if the server is powered off.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000000000800 - Server Disk Self Test Warning

A non-fatal disk issue exists, such as a sector that cannot be read.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000000001000 - Device Warning

This alarm indicates that either an snmpget cannot be performed on the configured SNMP OID or the returned value failed the specified comparison operation.

Recovery

  1. Run syscheck in Verbose mode. (See Running the System Health Check.)
  2. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000002000 - Device Interface Warning

This alarm can be generated by either an SNMP trap or an IP bond error. If syscheck is configured to receive SNMP traps, this alarm indicates that an SNMP trap was received with the set state. If syscheck is configured for IP bond monitoring, this alarm can mean that a slave device is not operating, a primary device is not active, or syscheck is unable to read bonding information from interface configuration files.

Recovery

  1. Run syscheck in Verbose mode. (See Running the System Health Check.)
  2. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000004000 - Server Reboot Watchdog Initiated

This alarm indicates that the server has been rebooted due to a hardware watchdog.

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.
    This condition should never happen.

5000000000008000 - Server HA Failover Inhibited

This alarm indicates that the server has been inhibited and HA failover is prevented from occurring.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000010000 - Server HA Active To Standby Transition

This alarm indicates that the server is in the process of transitioning HA state from Active to Standby.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000020000 - Server HA Standby To Active Transition

This alarm indicates that the server is in the process of transitioning HA state from Standby to Active.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000040000 - Platform Health Check Failure

This alarm indicates a syscheck configuration error.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000080000 - NTP Offset Check Failure

This alarm indicates that time on the server is outside the acceptable range or offset from the NTP server. The alarm message provides the offset value of the server from the NTP server and the offset limit set for the system by the application.

Alarm Type: TPD

Severity: Minor

Alarm ID: TKSPLATMI20

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

5000000000100000 - NTP Stratum Check Failure

This alarm indicates that NTP is syncing to a server, but the stratum level of the NTP server is outside the acceptable limit. The alarm message provides the stratum value of the NTP server and the stratum limit set for the system by the application.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
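
The offset and stratum values named in the two NTP alarms above can also be read from the peer table printed by "ntpq -p", where the line beginning with an asterisk is the peer currently selected for synchronization and the offset column is in milliseconds. The sketch below parses text in that layout. The sample hosts and numbers are made up, and the limits passed to check_peer are hypothetical, because the actual limits are set for the system by the application.

```python
# Illustrative `ntpq -p` style output; hosts and numbers are made up.
SAMPLE_NTPQ = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntpserver1      .GPS.            1 u   33   64  377    0.512    1.204   0.087
+ntpserver2      192.168.61.1     2 u   40   64  377    0.634   -0.911   0.120
"""


def selected_peer(ntpq_text):
    """Return the peer marked with '*', or None if no peer is selected."""
    for line in ntpq_text.splitlines():
        if line.startswith("*"):
            fields = line[1:].split()
            return {
                "remote": fields[0],
                "stratum": int(fields[2]),
                "offset_ms": float(fields[8]),
            }
    return None


def check_peer(peer, max_stratum, max_offset_ms):
    """Return a list of problems; the limits are caller-supplied assumptions."""
    if peer is None:
        return ["no synchronization peer selected"]
    problems = []
    if peer["stratum"] > max_stratum:
        problems.append("stratum above limit")
    if abs(peer["offset_ms"]) > max_offset_ms:
        problems.append("offset above limit")
    return problems
```

An empty peer selection corresponds to the Server NTP Daemon Not Synchronized condition, while the two limit checks correspond to the offset and stratum alarms.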

5000000020000000 – Server Kernel Dump File Detected

Alarm Type: TPD

Description: This alarm indicates that the kernel has crashed and debug information is available.

Severity: Minor

OID: 1.3.6.1.4.1.323.5.3.18.3.1.3.30

Alarm ID: TKSPLATMI30 5000000020000000

Recovery

  1. Run syscheck in Verbose mode (see Running the System Health Check).
  2. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000040000000 – TPD Upgrade Failed

Alarm Type: TPD

Description: This alarm indicates that a TPD upgrade has failed.

Severity: Minor

OID: tpdServerUpgradeFailDetectedNotify 1.3.6.1.4.1.323.5.3.18.3.1.3.31

Alarm ID: TKSPLATMI31 5000000040000000

Recovery

  1. Run the following command to clear the alarm.
    /usr/TKLC/plat/bin/alarmMgr --clear TKSPLATMI31
  2. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

5000000080000000 – Half Open Socket Warning Limit

Alarm Type: TPD

This alarm indicates that the number of half open TCP sockets has reached the major threshold. This problem is caused by a remote system failing to complete the TCP 3-way handshake.

Severity: Minor

OID: tpdHalfOpenSocketWarningNotify 1.3.6.1.4.1.323.5.3.18.3.1.3.32

Alarm ID: TKSPLATMI32 5000000080000000

Recovery

  1. Run syscheck in verbose mode (see Running the System Health Check).
  2. Run syscheck (see Running syscheck Using the syscheck Login).
  3. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 and provide the system health check output.

5000000000200000 - SAS Presence Sensor Missing

This alarm indicates that the server drive sensor is not working.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance with a replacement server.

5000000000400000 - SAS Drive Missing

This alarm indicates that the number of drives configured for this server is not being detected.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 to determine if the alarm is caused by a failed drive or failed configuration.

5000000000800000 - DRBD failover busy

This alarm indicates that a DRBD sync is in progress from the peer server to the local server. The local server is not ready to be the primary DRBD node because its data is not current.

Recovery

  1. Wait for approximately 20 minutes, then check if the DRBD sync has completed. A DRBD sync should take no more than 15 minutes to complete.
  2. If the alarm persists longer than this time interval, call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
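
The wait-then-escalate logic in the steps above can be sketched in Python as follows. This is a minimal illustration, not a platform tool: the function name `wait_for_sync` and the `is_sync_complete` callback are hypothetical, and how the sync status is actually checked is left to the caller. The 20-minute window is taken from step 1; the 60-second polling interval is an assumption.

```python
import time


def wait_for_sync(is_sync_complete, timeout_s=20 * 60, poll_s=60,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until is_sync_complete() returns True or timeout_s elapses.

    Returns True if the sync finished within the window, or False if the
    alarm has persisted long enough to call for assistance.
    """
    deadline = clock() + timeout_s
    while True:
        if is_sync_complete():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The `clock` and `sleep` parameters exist only so that the timing can be replaced in a test; in normal use the defaults apply.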

5000000001000000 - HP disk resync

This alarm indicates that the HP disk subsystem is currently resyncing after a failed or replaced drive, or after another change in the configuration of the HP disk subsystem. The output of the message includes the disk that is resyncing and the percentage complete. This alarm eventually clears after the resync of the disk is completed. The time to clear depends on the size of the disk and the amount of activity on the system.

Recovery

  1. Run syscheck in Verbose mode.
  2. If the percent recovering is not updating between runs of syscheck made at least 5 minutes apart, call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 with the syscheck output.
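
The comparison in step 2 can be sketched in Python as follows. The function name `resync_is_stalled` is hypothetical, and the percent values are assumed to be copied by hand from two syscheck outputs; the sketch does not parse syscheck output, whose format is not described here.

```python
MIN_INTERVAL_S = 5 * 60  # step 2 asks for at least 5 minutes between runs


def resync_is_stalled(first, second):
    """Compare two (timestamp_seconds, percent_complete) readings.

    Returns True when the percentage has not advanced between readings
    taken at least 5 minutes apart.
    """
    (t1, p1), (t2, p2) = first, second
    if t2 - t1 < MIN_INTERVAL_S:
        raise ValueError("readings must be at least 5 minutes apart")
    return p2 <= p1
```

A result of True corresponds to the condition in step 2 under which the syscheck output should be provided when calling for assistance.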

5000000400000000 – NTP Source Server is not able to provide correct time

This alarm indicates that an NTP server was not able to provide the correct time.

Severity: Minor

Alarm ID: TKSPLATMI35

Recovery

  1. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

Minor Application Alarms

Minor application alarms involve RTDB capacity and software errors.

4000000000020000 - Automatic RTDB Backup is not configured

This alarm indicates that the Automatic RTDB Backup is not configured on the system; that is, the Backup Type is "None".

Recovery

  1. Configure the Automatic RTDB Backup with a backup type other than None. Refer to Automatic RTDB Backup for details on how to configure the Automatic RTDB Backup.

6000000000000010 - Minor Software Error

A minor software error has been detected.

Recovery

  1. Run syscheck.
  2. Contact unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7.

    Have the system health check data available.

6000000000000200 - RTDB Backup Failed

This alarm indicates that the system was unable to complete an RTDB backup.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

6000000000000400 - Automatic RTDB Backup Failed

This alarm indicates that the system was unable to complete an automatic RTDB backup.

Recovery

  1. Call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.

6000000000000800 - Automatic Backup cron entry modified

This alarm indicates that the cron entry for automatic backups has been modified. No further action is required.

6000000000002000 - Configurable Quantity Threshold Exceeded

This alarm indicates that the RTDB file system has reached the user-configured threshold.

Recovery

  1. If the user-configurable threshold is set to less than 90%, then the user may increase the threshold to a higher value.
    1. Log in to the User Interface of the ELAP GUI. See Accessing the ELAP GUI Interface.
    2. Select User Administration, and then Modify Defaults to change the threshold value (1-99). See ELAP Administration and LNP Feature Activation for additional information.
  2. If the user-configurable threshold is set to 90% or higher, call unresolvable-reference.html#GUID-A2C37E16-F0BA-4FB6-9D93-1D4A95A40DC7 for assistance.
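
The decision in the recovery steps above can be summarized in a short Python sketch. The function name `threshold_action` and the returned strings are hypothetical labels for the two outcomes; the 90% boundary and the 1-99 range are taken from the steps above.

```python
def threshold_action(threshold_percent):
    """Return the recovery action for a configured threshold (1-99)."""
    if not 1 <= threshold_percent <= 99:
        raise ValueError("threshold must be between 1 and 99")
    if threshold_percent < 90:
        # Step 1: the user may raise the threshold in the ELAP GUI.
        return "increase_threshold"
    # Step 2: at 90% or higher, call for assistance.
    return "call_for_assistance"
```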

6000000000020000 - Automatic RTDB Backup is not configured

This alarm indicates that the Automatic RTDB Backup is not configured on the system; that is, the Backup Type is "None".

Recovery

  1. Configure the Automatic RTDB Backup with a backup type other than None. Refer to Automatic RTDB Backup for details on how to configure the Automatic RTDB Backup.