System Health Check Overview

The server monitors itself with a self-diagnostic utility called syscheck. The syscheck utility tests the health of the server hardware and platform software, and verifies that the required application software is present.

If the syscheck utility detects a problem, an alarm code is generated. The alarm code is a 16-character data string in hexadecimal format. All alarm codes are ranked by severity: critical, major, and minor. Alarm Categories lists the platform alarms and their alarm codes.
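The alarm code format described above can be checked mechanically. The following is a minimal sketch, not part of syscheck itself: the function names are hypothetical, and the sample strings in the usage note are made up for illustration rather than taken from the platform alarm list.

```python
import re

# Format stated in the text: exactly 16 hexadecimal characters.
ALARM_CODE_RE = re.compile(r"[0-9A-Fa-f]{16}")

# Severity levels named in the text, most severe first.
SEVERITIES = ("critical", "major", "minor")


def is_valid_alarm_code(code: str) -> bool:
    """Return True if code is a 16-character hexadecimal string."""
    return ALARM_CODE_RE.fullmatch(code) is not None


def most_severe(severities):
    """Return the highest-ranked severity among the given severity names."""
    # A lower index in SEVERITIES means a more severe alarm.
    return min(severities, key=SEVERITIES.index)
```

For example, is_valid_alarm_code("00000000000000AB") returns True for this made-up string, while a 15-character string or one containing a non-hexadecimal letter returns False. most_severe(["minor", "critical"]) returns "critical".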

The syscheck output can take either of the following forms (see Health Check Outputs for output examples):

  • Normal: a summary of the results of the checks performed by syscheck
  • Verbose: detailed results for each check performed by syscheck

The syscheck utility can be run in the following ways:

  • The operator can invoke syscheck manually.
  • syscheck runs automatically by timer at the following frequencies:

    • Tests for critical platform errors run automatically every 30 seconds.
    • Tests for major and minor platform errors run automatically every 60 seconds.
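The two timer frequencies above can be illustrated with a short sketch. It assumes that both timers start together and are counted in whole seconds from zero. The text does not say how the actual timers are implemented, and the names INTERVALS and due_classes are hypothetical.

```python
# Intervals taken from the text, in seconds.
INTERVALS = {"critical": 30, "major": 60, "minor": 60}


def due_classes(elapsed_seconds: int):
    """Return the severity classes whose tests are due at this second."""
    return [
        name
        for name, period in INTERVALS.items()
        if elapsed_seconds > 0 and elapsed_seconds % period == 0
    ]
```

Under these assumptions, due_classes(30) returns ["critical"], due_classes(60) returns ["critical", "major", "minor"], and due_classes(45) returns an empty list. Critical tests therefore run twice as often as major and minor tests.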

Functions Checked by syscheck

Table 3-1 summarizes the functions checked by syscheck.

Table 3-1 System Health Check Operation

System Check: Function
Disk Access: Verify that disk read and write functions remain operable. This test attempts to write test data to the file system to verify disk operability. If the test shows that the disk is not usable, an alarm is reported to indicate that the file system cannot be written to.
Smart: Verify that the smartd service has not reported any problems.
File System: Verify that the file systems have enough space available to operate. Determine which file systems are currently mounted and perform checks accordingly. Failures in the file system are reported if certain thresholds are exceeded, if the file system size is incorrect, or if the partition could not be found. Alarm thresholds are reported in a similar manner.
Swap Space: Verify that disk swap space is sufficient for efficient operation. All TPD installations are configured with 16 gigabytes of swap space. The swap space is allocated between two physical disk partitions:
  • The first partition is 2 gigabytes in size. It resides on a software RAID device, /dev/md2, which is a RAID-1 mirror set made up of physical devices /dev/hda2 and /dev/hdc2.
  • The second partition is 14 gigabytes in size and is formatted with a file system. The 14 gigabytes of space on this partition is divided into multiple 2-gigabyte swap files. The second partition is software RAID device /dev/md11, which is a mirror set consisting of physical partitions /dev/hda11 and /dev/hdc11, and is mounted under /var/TKLC/swap.
Memory: Verify that 8 GB of RAM is installed.
Network: Verify that all ports are functioning by pinging each network connection (provisioning, sync, and DSM networks). Check the configuration of the default route.
Process: Verify that the following critical processes are running. If fewer than the minimum required number of instances of a program are running, an alarm is reported. If more than the recommended number of instances are running, an alarm is also reported.
  • sshd (Secure Shell daemon)
  • ntpd (NTP daemon)
  • syscheck (System Health Check daemon)
Hardware Configuration: Verify that the processor is running at an appropriate speed and that the processor matches what is required on the server. Alarms are reported when a processor is not available as expected.
Cooling Fans: Verify that no fan alarm is present. A fan alarm is issued if fan speeds are outside the expected RPM range.
Voltages: Measure all monitored voltages on the server main board. Verify that all monitored voltages are within the expected operating range.
Temperature: Measure the following temperatures and verify that they are within a specified range:
  • Inlet and outlet temperatures
  • Processor internal temperature
  • MCH internal temperature
MPS Platform: Provide an alarm if internal diagnostics detect any other error, such as server syscheck script failures.
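
The Disk Access check in Table 3-1 writes test data to the file system to confirm that the disk is usable. A write test of that kind can be sketched as follows. This is an illustration under stated assumptions, not the syscheck implementation: the function name disk_is_writable, the test payload, and the choice of directory are all hypothetical.

```python
import os
import tempfile


def disk_is_writable(directory: str, payload: bytes = b"syscheck-test") -> bool:
    """Write test data under directory, read it back, and report success."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data through to the device
            handle.seek(0)
            return handle.read() == payload
    except OSError:
        # Read-only file system, missing directory, full disk, and so on.
        return False
```

A caller would raise an alarm when disk_is_writable returns False for a monitored file system. For instance, disk_is_writable(tempfile.gettempdir()) is expected to return True on a healthy system, and a path that does not exist returns False.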