There is a special case in which the heartbeat process aborts and produces a core file not because of a bug, but as an expected and intentional response to unexpected activity on the network connecting the cluster nodes. An example of such activity is switch configuration performed while the cluster nodes are trying to couple, or are already coupled together. To recognize such a case, the investigator first needs to determine whether the core file was produced by the heartbeat process:
Inspect the syscheck verbose output and look for the "core" module. The output will be similar to the following:
core: Checking for core files.
core: There are core files on the system:
core: CORE DIR: /var/TKLC/core
core: CORE: core.heartbeat.<pid>
core: CORE: core.heartbeat.<pid>.bt
* core: FAILURE:: MINOR::5000000000000100 -- Server Core File Detected
Here the investigator can see a core file named core.heartbeat.<pid>, where <pid> is the process ID of the failed heartbeat process.
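The check for a heartbeat core file can be sketched in a few lines of Python. This is only an illustration, not part of the platform tooling: the function name find_heartbeat_cores is hypothetical, and the sketch assumes the syscheck verbose output has been captured as text with one "core:" entry per line.

```python
import re
from typing import List

# Matches syscheck lines such as "core: CORE: core.heartbeat.12345".
# The end-of-line anchor excludes the companion "core.heartbeat.<pid>.bt" file.
_HEARTBEAT_CORE = re.compile(r"CORE:\s+(core\.heartbeat\.\d+)\s*$")


def find_heartbeat_cores(syscheck_output: str) -> List[str]:
    """Return the heartbeat core file names listed in captured syscheck output.

    Hypothetical helper: assumes each "core:" entry sits on its own line.
    """
    names = []
    for line in syscheck_output.splitlines():
        match = _HEARTBEAT_CORE.search(line)
        if match:
            names.append(match.group(1))
    return names
```

An empty result means that no heartbeat core file was reported, so the special case described here does not apply.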
If a heartbeat core file was found, the investigator must get the backtrace of the process from the core file by running the command:
Once in the gdb shell, enter bt. The output will be similar to the following:
(gdb) bt
#0 0x00002b872c2c0215 in raise () from /lib64/libc.so.6
#1 0x00002b872c2c1cc0 in abort () from /lib64/libc.so.6
#2 0x000000000040b20c in update_ackseq ()
#3 0x000000000040d225 in send_cluster_msg ()
#4 0x000000000040d8d7 in send_local_status ()
#5 0x000000000040da63 in hb_send_local_status ()
#6 0x00002b872b2733d7 in Gmain_timeout_dispatch (src=0x13b66bc8, func=0x40da40, user_data=0x0) at GSource.c:1570
#7 0x00002b872b8bbdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#8 0x00002b872b8bec0d in ?? () from /lib64/libglib-2.0.so.0
#9 0x00002b872b8bef1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#10 0x000000000040e8de in initialize_heartbeat ()
#11 0x000000000040f235 in main ()
The investigator is concerned with the lines beginning with #0 through #5, where the fourth column, after the word "in", lists the names of the functions called within the heartbeat process. If the order of called functions is the same as in the example above (that is, raise on line #0, then abort, update_ackseq, send_cluster_msg, send_local_status, and hb_send_local_status on line #5), it is likely that the special case occurred. If the special case was recognized, the investigator can safely delete the files /var/TKLC/core/core.heartbeat.<pid> and /var/TKLC/core/core.heartbeat.<pid>.bt and then clear the alarm itself by running alarmMgr --clear TKSPLATMI9.
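The frame comparison described above can also be automated. The following Python sketch is only an illustration and assumes the gdb backtrace has been saved as text with one frame per line. The names EXPECTED_FRAMES, innermost_functions, and is_expected_heartbeat_abort are hypothetical and are not part of gdb or of the platform.

```python
import re

# Function names expected in frames #0 through #5 for the special case.
EXPECTED_FRAMES = [
    "raise",
    "abort",
    "update_ackseq",
    "send_cluster_msg",
    "send_local_status",
    "hb_send_local_status",
]

# Matches a gdb frame line such as "#2 0x000000000040b20c in update_ackseq ()".
# The address and the word "in" are optional, because gdb omits them for some frames.
_FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def innermost_functions(backtrace, count=6):
    """Return the function names of frames #0 to #(count-1), or None where missing."""
    frames = {}
    for line in backtrace.splitlines():
        match = _FRAME.match(line.strip())
        if match:
            frames[int(match.group(1))] = match.group(2)
    return [frames.get(index) for index in range(count)]


def is_expected_heartbeat_abort(backtrace):
    """True when frames #0 to #5 match the expected call order exactly."""
    return innermost_functions(backtrace, len(EXPECTED_FRAMES)) == EXPECTED_FRAMES
```

A True result only indicates that the special case is likely. The investigator should still review the backtrace before deleting the core files and clearing the alarm.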
They will examine the files in /var/TKLC/core and remove them after all information has been extracted.