2 Maintenance

This chapter provides maintenance information, problem detection description, and general recovery procedures for the E5-APP-B.

2.1 Introduction

This chapter provides preventive and corrective maintenance information. Customers perform a small number of daily preventive maintenance tasks. The ELAP application performs automatic monitoring and problem reporting.

Detailed information about recovery procedures is contained in the remaining chapters of this manual.

2.2 Preventive Maintenance

This section describes the following recommended periodic maintenance:

2.2.1 Daily Maintenance Procedures

Use the Automatic PDB/RTDB Backup feature to back up all data stored in the PDB/RTDB. The manual backup procedures are included in this section in case the database backup needs to be performed manually. Storing database backups in a secure off-site location ensures the ability to recover from system failures.

This section describes the following recommended daily maintenance procedures:

2.2.1.1 Backing Up the RTDB

For ELAP 8.0 or later, a daily RTDB backup is created automatically. For automatic RTDB Backup, see Automatic RTDB Backup.

  1. Log in to the ELAP GUI on MPS A as the elapall user.
    For information about how to log in to the ELAP GUI, see Accessing the ELAP GUI Interface.

    Note:

    For ELAP 8.0 or later, the ELAP software can continue to operate while performing the RTDB backup.
  2. From the ELAP menu, select RTDB, and then Maintenance, and then Backup RTDB.
    The window in Figure 2-1 is displayed.

    Figure 2-1 Backup the RTDB


    img/capture_backup_rtdb_elap.jpg
  3. Click Backup RTDB.
    The window in Figure 2-2 displays a request for confirmation.

    Figure 2-2 Backup the RTDB Confirmation


    img/capture_backup_rtdb_confirm_elap.jpg
  4. Click Confirm RTDB Backup.
    If the backup starts successfully, the following message will scroll through the GUI banner:
    Backup RTDB in progress.
    After the backup completes successfully, the success window is displayed.
  5. The RTDB backup procedure is complete.
  6. Select Process Control, and then Start Software from the ELAP Menu.
  7. On the Start ELAP Software screen as shown in Figure 2-3, click Start ELAP Software.

    Figure 2-3 Start ELAP Software


    img/t_daily_maintenance_mps_maintenance_t1100mps_maintmanual-fig11.jpg
    After the ELAP software has started successfully, the screen in Figure 2-4 is displayed.

    Figure 2-4 Start ELAP Software - Success


    img/t_daily_maintenance_mps_maintenance_t1100mps_maintmanual-fig12.jpg
  8. Select Maintenance, and then LSMS Connection, and then Change Allowed from the ELAP Menu.
  9. Click the Enable LSMS Connection button to enable the LSMS connection.
    Figure 2-5 shows the Change LSMS Connection Allowed window with the LSMS connection disabled.

    Figure 2-5 Change LSMS Connection Allowed


    img/t_daily_maintenance_mps_maintenance_t1100mps_maintmanual-fig13.jpg
    After the LSMS Connection is successfully enabled, the screen in Figure 2-6 is displayed.

    Figure 2-6 Successfully Enabled LSMS Connection


    img/t_daily_maintenance_mps_maintenance_t1100mps_maintmanual-fig14.jpg
2.2.1.2 Transferring RTDB Backup File

Perform this procedure once each day. The estimated time required to complete this procedure depends on network bandwidth. File sizes can be several gigabytes for the database.

  1. Log in to the ELAP command line interface with user name elapdev and the password associated with that name.
  2. Use the Secure File Transfer Protocol (sftp) to transfer the RTDB backup file created by the procedure Backing Up the RTDB to a remote, secure location.
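
The transfer in step 2 can be scripted. The following is a minimal Python sketch, assuming the OpenSSH sftp client is available and that non-interactive authentication (for example, SSH keys) is set up, because sftp batch mode cannot prompt for a password. The file names, user, and host shown in the usage note are hypothetical placeholders, and paths are assumed to contain no whitespace. The sketch only builds the batch file content and the command line; it does not run the transfer.

```python
from typing import List


def build_sftp_batch(local_file: str, remote_dir: str) -> str:
    """Return sftp batch commands that upload local_file into remote_dir."""
    # 'cd', 'put', and 'bye' are standard sftp commands.
    return f"cd {remote_dir}\nput {local_file}\nbye\n"


def build_sftp_command(batch_path: str, user: str, host: str) -> List[str]:
    """Return the argument list for an sftp batch-mode invocation."""
    # '-b' makes the OpenSSH sftp client read its commands from a batch file.
    return ["sftp", "-b", batch_path, f"{user}@{host}"]
```

For example, build_sftp_command("/tmp/rtdb_transfer.batch", "backupuser", "backuphost") yields an argument list that can be passed to subprocess.run once the batch file has been written to disk.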
2.2.1.3 Automatic RTDB Backup
Automatic RTDB Backup can be scheduled during off-peak provisioning hours, eliminating the need for human intervention. Automatic ELAP RTDB backups are scheduled at 6:00 a.m. every morning on the Active server.

User Interface

The menu item circled in the following image is available on the ELAP GUI of the Active ELAP server only:

Figure 2-7 Automatic RTDB Backup Menu Item


Automatic RTDB Backup Menu Item

Clicking Automatic RTDB Backup opens the page shown in Figure 2-8.

Figure 2-8 Automatic RTDB Backup GUI Screen


img/automatic-rtdb-backup-gui.jpg
The Backup Type field has five options:
  1. Local
  2. Mate
  3. Local and Mate
  4. Remote
  5. None

By default, backups are stored on both the local and mate ELAP servers. If Automatic RTDB Backup is not configured, the "None" option is not available in the Backup Type field.

Note:

The following semantic rules must be followed:
  • Time of day must be in hh:mm 24-hour format. Example: 14:03
  • File path (in remote only) must be the absolute path from root
  • IP address must be in xxx.yyy.zzz.aaa format. Example: 192.168.210.111
  • Password entered will be displayed with asterisks (*)
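
The semantic rules above can be illustrated with a short validation sketch in Python. This is not the ELAP GUI's own validation logic, only an approximation of the stated rules; the function names are hypothetical.

```python
import ipaddress
import posixpath
import re

# Two-digit hour 00-23, a colon, and two-digit minutes 00-59.
_TIME_RE = re.compile(r"([01]\d|2[0-3]):[0-5]\d")


def is_valid_time_of_day(value: str) -> bool:
    """Check the hh:mm 24-hour format, for example 14:03."""
    return _TIME_RE.fullmatch(value) is not None


def is_absolute_path(value: str) -> bool:
    """Check that the path starts at the root directory."""
    return posixpath.isabs(value)


def is_valid_ipv4(value: str) -> bool:
    """Check the dotted-quad format, for example 192.168.210.111."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True
```

With these helpers, "14:03" and "192.168.210.111" pass, while "24:00", a relative path such as "backup/today", and a three-octet address such as "192.168.210" are rejected.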

Backup Type: Local

Selecting the Backup Type “Local” creates the backup on the same ELAP server. The user must provide the following inputs:
  • Time of day to start Local Backup
  • Frequency:
    • 12 hours
    • 1 day (daily)
    • 2 days
    • 3 days
    • 5 days
    • 7 days

    Note:

    Daily backup frequency is the default. Selecting an option other than 1 day prompts the user for reconfirmation of the backup frequency, as daily is the recommended frequency.
  • File path, where the user can specify a subdirectory within the directory "/var/TKLC/elap/free/backup/"

    Note:

    By default, the backup file is saved in the default file path.
  • Option to delete old backups. When the user selects "yes," the server deletes the old backups, keeping only the most recent backup files, up to the number specified by the user in the "Specify the number of files to maintain" field. By default, 5 backup files are maintained. If this option is "yes," a minimum of 1 and a maximum of 7 backup files may be maintained.
  • Specify the number of files to maintain
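
The retention rule for old backups can be sketched as follows. This is an illustrative Python example, not the server's actual implementation; it assumes that "latest" means the most recent modification time, and the function and parameter names are hypothetical.

```python
from typing import List, Tuple

DEFAULT_KEEP = 5          # default number of backup files maintained
MIN_KEEP, MAX_KEEP = 1, 7  # allowed range when deletion is enabled


def backups_to_delete(
    backups: List[Tuple[str, float]], keep: int = DEFAULT_KEEP
) -> List[str]:
    """Return the names of backup files that fall outside the retention count.

    backups holds (file name, modification timestamp) pairs.
    """
    if not MIN_KEEP <= keep <= MAX_KEEP:
        raise ValueError(f"keep must be between {MIN_KEEP} and {MAX_KEEP}")
    # Sort newest first, then everything after the first 'keep' entries is old.
    newest_first = sorted(backups, key=lambda item: item[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]
```

With three backups and keep=2, only the oldest file is returned for deletion; with the default of 5, nothing is deleted until a sixth backup exists.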

Backup Type: Mate

Selecting the Backup Type "Mate" creates the backup on the local ELAP server and transfers (moves) the same backup to the mate ELAP server. The user must provide the following inputs:
  • Time of day to start Backup
  • Frequency (same configuration as Local)
  • File path (same configuration as Local)
  • Option to delete old backups (same configuration as Local)
  • Specify the number of files to maintain

Backup Type: Local and Mate

Selecting the Backup Type "Local and Mate" creates the backup on the local ELAP server and transfers (moves) the same backup to the mate ELAP server. The user must provide the following inputs:
  • Time of day to start Backup
  • Frequency (same configuration as Local)
  • File path (same configuration as Local)
  • Option to delete old backups (same configuration as Local)
  • Specify the number of files to maintain

Backup Type: Remote

Selecting the Backup Type "Remote" creates the backup on the local ELAP server and transfers (moves) the same backup to the remote ELAP server. The user must provide the following inputs:
  • Time of day to start Backup
  • Frequency (same configuration as Local)
  • File path, which includes the absolute path for storing the backup file. If the user provides a non-existent directory, the directory will not be created and transfer of RTDB Backup file to the Remote Machine will fail.
  • IP address of the Remote Machine
  • User Login
  • User Password
  • Save the local copies in the default path. When the user selects "yes," the server also saves the RTDB Backup files on the local machine.

Backup Type: None

Selecting the Backup Type "None" cancels all currently scheduled backups. All items on the form will be disabled except the submit button.
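
The inputs required for each Backup Type can be summarized as a lookup table. The following Python sketch is only a restatement of the lists above; the field identifiers (such as time_of_day) are hypothetical names chosen for illustration and are not taken from the ELAP software.

```python
from typing import Dict, List, Set

_COMMON = ["time_of_day", "frequency", "file_path"]
_RETENTION = ["delete_old_backups", "files_to_maintain"]

# Required form inputs per Backup Type, in the order listed in this section.
REQUIRED_INPUTS: Dict[str, List[str]] = {
    "Local": _COMMON + _RETENTION,
    "Mate": _COMMON + _RETENTION,
    "Local and Mate": _COMMON + _RETENTION,
    "Remote": _COMMON
    + ["ip_address", "user_login", "user_password", "save_local_copy"],
    "None": [],  # cancels scheduled backups; no other inputs are enabled
}


def missing_inputs(backup_type: str, provided: Set[str]) -> List[str]:
    """Return the required inputs that have not been provided yet."""
    return [name for name in REQUIRED_INPUTS[backup_type] if name not in provided]
```

For example, a Remote configuration that supplies only the time of day, frequency, and file path is still missing the IP address, login, password, and local-copy choice.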

2.3 System Health Check Overview

The server runs a self-diagnostic utility program called syscheck to monitor itself. The system health check utility syscheck tests the server hardware and platform software. Checks and balances verify the health of the server and platform software for each test, and verify the presence of required application software.

If the syscheck utility detects a problem, an alarm code is generated. The alarm code is a 16-character data string in hexadecimal format. All alarm codes are ranked by severity: critical, major, and minor. Alarm Categories lists the platform alarms and their alarm codes.
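
As a small illustration of the alarm code format, the following Python sketch checks that a string is exactly 16 hexadecimal characters. The function name is hypothetical, and the sample value in the usage note is made up for illustration; refer to Alarm Categories for the real codes.

```python
import re

# Exactly 16 hexadecimal digits, upper or lower case.
_ALARM_CODE_RE = re.compile(r"[0-9A-Fa-f]{16}")


def is_alarm_code(text: str) -> bool:
    """Return True if text has the shape of a 16-character hex alarm code."""
    return _ALARM_CODE_RE.fullmatch(text) is not None
```

A value such as "1000000000000001" has the right shape, while a shorter string or one containing a non-hexadecimal character does not.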

The syscheck output can be in either of the following forms (see Health Check Outputs for output examples):

  • Normal: a summary of the results of the checks performed by syscheck
  • Verbose: detailed results for each check performed by syscheck

The syscheck utility can be run in the following ways:

  • The operator can invoke syscheck manually, using one of the procedures in Running the System Health Check.
  • syscheck runs automatically by timer at the following frequencies:

    • Tests for critical platform errors run automatically every 30 seconds.
    • Tests for major and minor platform errors run automatically every 60 seconds.

Functions Checked by syscheck

Table 2-1 summarizes the functions checked by syscheck.

Table 2-1 System Health Check Operation

System Check | Function
Disk Access | Verify that disk read and write functions continue to be operable. This test attempts to write test data in the file system to verify disk operability. If the test shows the disk is not usable, an alarm is reported to indicate the file system cannot be written to.
Smart | Verify that the smartd service has not reported any problems.
File System | Verify that the file systems have space available to operate. Determine what file systems are currently mounted and perform checks accordingly. Failures in the file system are reported if certain thresholds are exceeded, if the file system size is incorrect, or if the partition could not be found. Alarm thresholds are reported in a similar manner.
Memory | Verify that 8 GB of RAM is installed.
Network | Verify that all ports are functioning by pinging each network connection (provisioning, sync, and DSM networks). Check the configuration of the default route.
Process | Verify that the following critical processes are running. If a program is not running the minimum required number of processes, an alarm is reported. If more than the recommended number of processes are running, an alarm is also reported.
  • sshd (Secure Shell daemon)
  • ntpd (NTP daemon)
  • syscheck (System Health Check daemon)
Hardware Configuration | Verify that the processor is running at an appropriate speed and that the processor matches what is required on the server. Alarms are reported when a processor is not available as expected.
Cooling Fans | Verify that no fan alarm is present. A fan alarm is issued if the fans are outside the expected RPM range.
Voltages | Measure all monitored voltages on the server main board. Verify that all monitored voltages are within the expected operating range.
Temperature | Measure the following temperatures and verify that they are within a specified range.
  • Inlet and Outlet temperatures
  • Processor internal temperature
  • MCH internal temperature
MPS Platform | Provide an alarm if internal diagnostics detect any other error, such as server syscheck script failures.

2.3.1 Health Check Outputs

System health check utility syscheck output can be either Normal (brief) or Verbose (more detailed), depending upon how syscheck was initiated. The following examples show Normal and Verbose output formats:

Normal Output

Running modules in class disk...
                                  OK
Running modules in class hardware...
                                  OK
Running modules in class net...
                                  OK
Running modules in class proc...
                                  OK
Running modules in class services...
                                  OK
Running modules in class system...
                                  OK
Running modules in class upgrade...
                                  OK
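
Normal output lends itself to simple automated parsing. The following Python sketch assumes the layout shown above, where each "Running modules in class" line is followed by a status line. Only the "OK" status appears in this example, so the sketch stores whatever status text follows without interpreting it. The function name is hypothetical.

```python
import re
from typing import Dict, Optional

_CLASS_RE = re.compile(r"Running modules in class (\w+)\.\.\.")


def parse_normal_output(text: str) -> Dict[str, str]:
    """Map each syscheck class name to the status line that follows it."""
    results: Dict[str, str] = {}
    current: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = _CLASS_RE.fullmatch(line)
        if match:
            current = match.group(1)  # remember the class awaiting a status
        elif current is not None and line:
            results[current] = line   # first non-empty line is the status
            current = None
    return results
```

Applied to the output above, the result maps disk, hardware, net, proc, services, system, and upgrade to "OK".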

Verbose Output Containing Errors

If an error occurs, the system health check utility syscheck provides alarm data strings and diagnostic information for platform errors in its output. The following is an example of Verbose syscheck output:

Running modules in class disk...
         drbd: Checking DRBD status file, /proc/drbd
         drbd: line #1: DRBD version=[8.3.11]
         drbd: line #2 contains DRBD compilation info
         drbd: line #3: resource=[0]
         drbd: line #3: cs{0}=[Connected]
         drbd: line #3: st_self{0}=[Primary] st_peer{0}=[Secondary]
         drbd: line #3: ds_self{0}=[UpToDate] ds_peer{0}=[UpToDate]
         drbd: line #4 contains network stats
         drbd: processing alarms for resource=0
           fs: Current file space use in "/" is 43%.
           fs: Current Inode used in "/" is 14%.
           fs: Current file space use in "/boot" is 41%.
           fs: Current Inode used in "/boot" is 0%.
           fs: Current file space use in "/usr" is 57%.
           fs: Current Inode used in "/usr" is 20%.
           fs: Current file space use in "/var" is 34%.
           fs: Current Inode used in "/var" is 4%.
           fs: Current file space use in "/var/TKLC" is 40%.
           fs: Current Inode used in "/var/TKLC" is 1%.
           fs: Current file space use in "/tmp" is 0%.
           fs: Current Inode used in "/tmp" is 0%.
           fs: Current file space use in "/usr/TKLC/elap" is 6%.
           fs: Current Inode used in "/usr/TKLC/elap" is 0%.
           fs: Current file space use in "/var/TKLC/elap/drbd/mysql" is 4%.
           fs: Current Inode used in "/var/TKLC/elap/drbd/mysql" is 0%.
           fs: Current file space use in "/var/TKLC/elap/logs" is 0%.
           fs: Current Inode used in "/var/TKLC/elap/logs" is 0%.
           fs: Current file space use in "/var/TKLC/elap/free" is 3%.
           fs: Current Inode used in "/var/TKLC/elap/free" is 0%.
       hpdisk: Only HP ProLiant servers support hpdisk diagnostics.
          lsi: Could not find LSI controller. Not running test.
         meta: Checking md status on system.
         meta: md Status OK, with 3 active volumes.
         meta: Checking md configuration on system.
         meta: Server md configuration OK.
    multipath: No multipath devices configured to be checked.
          sas: Only T1200 supports SAS diagnostics.
        smart: Finished examining logs for disk: sdb.
        smart: Finished examining logs for disk: sda.
        smart: SMART status OK.
        write: Successfully read from file system "/".
        write: Successfully read from file system "/boot".
        write: Successfully read from file system "/usr".
        write: Successfully read from file system "/var".
        write: Successfully read from file system "/var/TKLC".
        write: Successfully read from file system "/tmp".
        write: Successfully read from file system "/usr/TKLC/elap".
        write: Successfully read from file system "/var/TKLC/elap/logs".
        write: Successfully read from file system "/var/TKLC/elap/free".
Running modules in class hardware...
   cmosbattery: This hardware does not support monitoring the CMOS battery.
   cmosbattery: The test will not be ran.
          ecc: Checking ECC hardware.
          ecc: Correctible Error Count: 0
          ecc: Uncorrectible Error Count: 0
06/20/2016 05:11:30 EDT | inf | Discarding cache...
          fan: Checking Status of Server Fans.
          fan: Fan is OK. fana: 1, CHIP: FAN
          fan: Server Fan Status OK.
   fancontrol: EAGLE_E5APPB does not support Fan Controls
   fancontrol: Will not run the test.
   flashdevice: Checking programmable devices.
   flashdevice: PSOC OK.
   flashdevice: CPLD OK.
   flashdevice: BIOS OK.
   flashdevice: ALL Programmable Devices OK.
         mezz: Checking Status of Serial Mezzanine.
         mezz: Serial Mezzanine is OK. mezza: 1, CHIP: MEZZ
         mezz: Serial Mezzanine is OK. mezzb: 1, CHIP: MEZZ
         mezz: Server Serial Mezz Status OK.
        oemHW: Only Oracle servers support hwmgmt.
          psu: This hardware does not support power feed monitoring.
          psu: Will not run test.
          psu: This hardware does not support PSU monitoring.
          psu: Will not run test.
       serial: Running serial port configuration test
       serial: EAGLE_E5APPB does not support serial port configuration monitoring
       serial: Will not run test.
         temp: Checking server temperature.
         temp: Server Temp OK. Inlet Air Temp: +25.0 C (high = +70.0 C, warn = +66 C, hyst = +75.0 C), CHIP: lm75-i2c-0-48
         temp: Server Temp OK. Outlet Air Temp: +30.0 C (high = +70.0 C, warn = +66 C, hyst = +75.0 C), CHIP: lm75-i2c-0-49
         temp: Server Temp OK. MCH Diode Temp: +41.0 C (high = +95.0 C, warn = +90 C, low = +10.0 C), CHIP: sch311x-isa-0a70
         temp: Server Temp OK. Internal Temp: +26.8 C (high = +95.0 C, warn = +90 C, low = +10.0 C), CHIP: sch311x-isa-0a70
         temp: Server Temp OK. Core 0: +30.0 C (high = +71.0 C, crit = +95.0 C, warn = +67 C), CHIP: coretemp-isa-0000
         temp: Server Temp OK. Core 1: +24.0 C (high = +71.0 C, crit = +95.0 C, warn = +67 C), CHIP: coretemp-isa-0000
      voltage: Checking server voltages.
      voltage: Voltage is OK. V2.5: +2.44 V (min = +2.37 V, max = +2.63 V), CHIP: sch311x-isa-0a70
      voltage: Voltage is OK. Vccp: +1.04 V (min = +0.85 V, max = +1.35 V), CHIP: sch311x-isa-0a70
      voltage: Voltage is OK. V3.3: +3.27 V (min = +3.13 V, max = +3.47 V), CHIP: sch311x-isa-0a70
      voltage: Voltage is OK. V5: +4.97 V (min = +4.74 V, max = +5.26 V), CHIP: sch311x-isa-0a70
      voltage: Voltage is OK. V1.8: +1.81 V (min = +1.69 V, max = +1.88 V), CHIP: sch311x-isa-0a70
      voltage: Voltage is OK. V3.3stby: +3.28 V (min = +3.13 V, max = +3.47 V), CHIP: sch311x-isa-0a70
      voltage: Voltage is OK. V3.3: +3.29 V (min = +3.13 V, max = +3.46 V), CHIP: cy8c27x43-i2c-0-28
      voltage: Voltage is OK. V1.8: +1.81 V (min = +1.71 V, max = +1.89 V), CHIP: cy8c27x43-i2c-0-28
      voltage: Voltage is OK. V1.5: +1.50 V (min = +1.42 V, max = +1.57 V), CHIP: cy8c27x43-i2c-0-28
      voltage: Voltage is OK. V1.2: +1.20 V (min = +1.14 V, max = +1.26 V), CHIP: cy8c27x43-i2c-0-28
      voltage: Voltage is OK. V1.05: +1.04 V (min = +1.00 V, max = +1.10 V), CHIP: cy8c27x43-i2c-0-28
      voltage: Voltage is OK. V1.0: +1.00 V (min = +0.95 V, max = +1.05 V), CHIP: cy8c27x43-i2c-0-28
      voltage: Server Voltages OK.
Running modules in class net...
   defaultroute: Checking default route(s)
   defaultroute:   Checking static default route through device eth01 to gateway 192.168.61.250...
         ping: Checking ping hosts
         ping: prova-ip network connection OK
         ping: provb-ip network connection OK
         ping: dsmm-a network connection OK
         ping: dsmm-b network connection OK
         ping: dsmb-a network connection OK
         ping: dsmb-b network connection OK
         ping: sync-a network connection OK
         ping: sync-b network connection OK
                                  OK
Running modules in class proc...
          run: Checking RTCtimeStampd...
          run: Found 1 instance(s) of the RTCtimeStampd process.
          run: Checking ntdMgr...
          run: Found 1 instance(s) of the ntdMgr process.
          run: Checking smartd...
          run: Found 1 instance(s) of the smartd process.
          run: Checking switchMon...
          run: Found 1 instance(s) of the switchMon process.
          run: Checking atd...
          run: Found 1 instance(s) of the atd process.
          run: Checking crond...
          run: Found 1 instance(s) of the crond process.
          run: Checking snmpd...
          run: Found 1 instance(s) of the snmpd process.
          run: Checking sshd...
          run: Found 7 instance(s) of the sshd process.
          run: Checking syscheck...
          run: Found 1 instance(s) of the syscheck process.
          run: Checking rsyslogd...
          run: Found 1 instance(s) of the rsyslogd process.
          run: Checking tklcTpdCardCfgS...
          run: Found 1 instance(s) of the tklcTpdCardCfgS process.
          run: Checking alarmMgr...
          run: Found 1 instance(s) of the alarmMgr process.
          run: Checking tpdProvd...
          run: Found 1 instance(s) of the tpdProvd process.
          run: Checking trpd...
          run: Found 1 instance(s) of the trpd process.
          run: Checking prov...
          run: Found 1 instance(s) of the prov process.
          run: Checking ebdad...
          run: Found 1 instance(s) of the ebdad process.
          run: Checking hsopd...
          run: Found 1 instance(s) of the hsopd process.
          run: Checking maint...
          run: Found 1 instance(s) of the maint process.
          run: Checking exinit...
          run: Found 1 instance(s) of the exinit process.
          run: Checking gs...
          run: Found 1 instance(s) of the gs process.
          run: Checking mysqld...
          run: Found 1 instance(s) of the mysqld process.
          run: Checking hamond...
          run: Found 1 instance(s) of the hamond process.
                                  OK
Running modules in class services...
   ha_keepalive: HA Keepalive Syscheck Test Start
   ha_keepalive: {   Broadcast        eth04              17401}: UP
   ha_keepalive: HA Keepalive Test Complete
   ha_transition: HA Transition Syscheck Test Start
   ha_transition: HA ACTIVE, no transition in progress.
   ha_transition: HA Transition Syscheck Test Complete
                                  OK
Running modules in class system...
         core: Checking for core files.
          cpu: Found "2" CPU(s)... OK
          cpu: CPU 0 is on-line... OK
          cpu: CPU 0 speed: 2660.017 MHz... OK
          cpu: CPU 1 is on-line... OK
          cpu: CPU 1 speed: 2660.017 MHz... OK
        kdump: Checking for kernel dump files.
          mem: Skipping expected memory check.
          mem: Minimum expected memory found.
          mem: 8252936192 bytes (~7871 Mb) of RAM installed.
                                  OK
Running modules in class upgrade...
    snapshots: No snapshots found. Not running test.
                                  OK
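
Verbose output can be scanned for specific values. As an illustration, the following Python sketch extracts the file space usage reported by the fs lines in the example above and lists the file systems above a chosen limit. The limit is a hypothetical value supplied by the caller and is not a syscheck alarm threshold; the function names are also hypothetical.

```python
import re
from typing import Dict, List

# Matches lines such as: fs: Current file space use in "/var" is 34%.
_FS_RE = re.compile(r'fs: Current file space use in "([^"]+)" is (\d+)%\.')


def file_space_usage(text: str) -> Dict[str, int]:
    """Return a mapping of file system path to percentage of space used."""
    return {m.group(1): int(m.group(2)) for m in _FS_RE.finditer(text)}


def over_limit(usage: Dict[str, int], limit_percent: int) -> List[str]:
    """Return the file systems whose usage exceeds limit_percent, sorted."""
    return sorted(path for path, used in usage.items() if used > limit_percent)
```

For the example output, a limit of 50 would report only "/usr", which is at 57%.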

2.4 Running the System Health Check

The operator can run syscheck to obtain the operational platform status with one of the following procedures:

2.4.1 Running syscheck from the Command line

The admusr can use sudo to run syscheck from the command line. This method can be used whether or not an application is installed and whether or not the GUI is available.

  1. Log in to the MPS as the admusr:
    
    Login:  admusr
    Password:  <Enter admusr password>
    
  2. Run syscheck with any command line arguments.
    $ sudo syscheck

    For help on command syntax, use the -h option.
    $ sudo syscheck -h
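
The command-line invocation can also be wrapped in a script. The following Python sketch assumes it runs on the server as a user permitted to run syscheck through sudo, as described above. The function names are hypothetical, and the runner parameter exists only so that the call can be substituted during testing.

```python
import subprocess
from typing import Callable, List, Sequence


def syscheck_command(extra_args: Sequence[str] = ()) -> List[str]:
    """Build the argument list for running syscheck through sudo."""
    return ["sudo", "syscheck", *extra_args]


def run_syscheck(
    extra_args: Sequence[str] = (),
    runner: Callable = subprocess.run,
) -> str:
    """Run syscheck and return its standard output as text."""
    completed = runner(
        syscheck_command(extra_args),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output to str
        check=False,          # do not raise on a non-zero exit status
    )
    return completed.stdout
```

Calling run_syscheck(["-h"]) corresponds to the help invocation in step 2.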

2.4.2 Running syscheck Through the ELAP GUI

Refer to ELAP Administration and LNP Feature Activation for more details and information about logins and permissions.

  1. Log in to the ELAP GUI (see Accessing the ELAP GUI Interface).
  2. Check the banner information above the menu to verify that you are logged in to the ELAP whose system health information you want.

    Figure 2-9 Login Window


    Login Window

  3. If it is necessary to switch to the other ELAP, click the Select Mate menu item.
  4. When the GUI shows you are logged in to the ELAP about which you want system health information, select Platform > Run Health Check, as shown in the following window.

    Figure 2-10 Run Health Check


    img/t_elap_gui_running_system_health_check_general_procedures_t1100mps_maintmanual-fig2.jpg
  5. On the Run Health Check window, use the pull-down menu to select Normal or Verbose for the Output detail level desired.
  6. Click the Perform Check button to run the system health check on the selected server.
    The system health check output data is displayed, as shown in Figure 2-11.

    Figure 2-11 Displaying System Health Check on ELAP GUI


    img/t_elap_gui_running_system_health_check_general_procedures_t1100mps_maintmanual-fig3.jpg

2.4.3 Running syscheck Using the syscheck Login

If the ELAP application has not been installed on the server or you are unable to log in to the ELAP user interface, you cannot run syscheck through the GUI. Instead, you can run syscheck from the syscheck login, and report the results to My Oracle Support.

  1. Connect the Local Access Terminal to the server whose status you want to check (see Connecting a Local Access Terminal to Server’s Serial Port).
  2. Log in as the syscheck user.
    
    Login:  syscheck
    Password:  syscheck
    
    The syscheck utility runs and its output is displayed to the screen.