Alarm Type: TPD
Description: This alarm indicates that an application process has failed and debug information is available.
Severity: Minor
OID: tpdCoreFileDetectedNotify 1.3.6.1.4.1.323.5.3.18.3.1.3.9
Alarm ID: TKSPLATMI95000000000000100
Recovery
In one special case, the heartbeat process aborts and produces a core file not because of a bug, but as an expected and intentional response to unexpected activity on the network connecting the cluster nodes. An example of such activity is switch configuration performed while the cluster nodes are coupling, or are already coupled, together. To recognize this case, the investigator first needs to determine whether the core file was produced by the heartbeat process:
    core: Checking for core files.
    core: There are core files on the system:
    core: CORE DIR: /var/TKLC/core
    core: CORE: core.heartbeat.<pid>
    core: CORE: core.heartbeat.<pid>.bt
    * core: FAILURE:: MINOR::5000000000000100 -- Server Core File Detected

Here, the investigator finds a core file named core.heartbeat.<pid>, where <pid> is the process ID of the failed heartbeat process.
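The file-name check above can also be scripted. The following is a minimal Python sketch, not part of the platform tooling: the function name find_heartbeat_cores is hypothetical, and it assumes that <pid> is always a decimal number and that the core directory is readable by the user running it.

```python
import re
from pathlib import Path

# Matches core.heartbeat.<pid> only; the companion core.heartbeat.<pid>.bt
# file is deliberately excluded by the end-of-string anchor.
# Assumption: <pid> consists of decimal digits.
HEARTBEAT_CORE = re.compile(r"^core\.heartbeat\.(\d+)$")


def find_heartbeat_cores(core_dir="/var/TKLC/core"):
    """Return sorted (pid, path) pairs for heartbeat core files in core_dir."""
    found = []
    for entry in Path(core_dir).iterdir():
        match = HEARTBEAT_CORE.match(entry.name)
        if match and entry.is_file():
            found.append((int(match.group(1)), entry))
    return sorted(found)
```

Calling find_heartbeat_cores() with no argument inspects /var/TKLC/core. An empty result means that no core file in that directory was produced by the heartbeat process, so the special case does not apply.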
Next, the investigator loads the core file into gdb:

    gdb /usr/lib/heartbeat/heartbeat /var/TKLC/core/core.heartbeat.<pid>

Once in the gdb shell, enter bt. The output should be similar to the following:
    (gdb) bt
    #0 0x00002b872c2c0215 in raise () from /lib64/libc.so.6
    #1 0x00002b872c2c1cc0 in abort () from /lib64/libc.so.6
    #2 0x000000000040b20c in update_ackseq ()
    #3 0x000000000040d225 in send_cluster_msg ()
    #4 0x000000000040d8d7 in send_local_status ()
    #5 0x000000000040da63 in hb_send_local_status ()
    #6 0x00002b872b2733d7 in Gmain_timeout_dispatch (src=0x13b66bc8, func=0x40da40, user_data=0x0) at GSource.c:1570
    #7 0x00002b872b8bbdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
    #8 0x00002b872b8bec0d in ?? () from /lib64/libglib-2.0.so.0
    #9 0x00002b872b8bef1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
    #10 0x000000000040e8de in initialize_heartbeat ()
    #11 0x000000000040f235 in main ()

The investigator is interested in the lines beginning with #0 through #5. In each of these lines, the fourth column, after the word "in", gives the name of a function called within the heartbeat process. If the order of called functions matches the example above (that is, raise on line #0, then abort, update_ackseq, send_cluster_msg, send_local_status, and hb_send_local_status on line #5), the special case has likely occurred. In that case, the investigator can safely delete the files /var/TKLC/core/core.heartbeat.<pid> and /var/TKLC/core/core.heartbeat.<pid>.bt, and then clear the alarm by running:

    alarmMgr --clear TKSPLATMI9
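The comparison of frames #0 through #5 can be sketched in Python as follows. This is an illustrative helper under stated assumptions, not part of gdb or the platform: the names EXPECTED_FRAMES, frame_functions, and is_expected_heartbeat_abort are hypothetical, and the parser assumes that each frame appears on its own line in the form shown in the example backtrace.

```python
import re

# Function names expected in frames #0 through #5 for the special case,
# taken from the example backtrace.
EXPECTED_FRAMES = [
    "raise",
    "abort",
    "update_ackseq",
    "send_cluster_msg",
    "send_local_status",
    "hb_send_local_status",
]

# "#<n> [0x<address> in] <function> (" ; the address part is optional
# because gdb may omit it for some frames.
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)\s*\(")


def frame_functions(backtrace_text):
    """Map frame number to function name for every '#N ...' line."""
    frames = {}
    for line in backtrace_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames[int(match.group(1))] = match.group(2)
    return frames


def is_expected_heartbeat_abort(backtrace_text):
    """Return True if frames #0 to #5 match the expected call sequence."""
    frames = frame_functions(backtrace_text)
    actual = [frames.get(i) for i in range(len(EXPECTED_FRAMES))]
    return actual == EXPECTED_FRAMES
```

The input can be the text copied from the gdb session. A True result only indicates that the call sequence matches the documented pattern; the decision to delete the core files and clear the alarm remains with the investigator.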
If the core file does not match this special case, refer the alarm to support personnel. They will examine the files in /var/TKLC/core and remove them after all information has been extracted.