31228 - HA standby offline

Alarm Group:

Description:

High availability standby server is offline.

Severity:

Critical

Instance:

May include AlarmLocation, AlarmId, AlarmState, AlarmSeverity, and bindVarNamesValueStr

HA Score:

Normal

Auto Clear Seconds:

0 (zero)

OID:

eagleXgDsrHaStandbyOfflineNotify

Cause:

The servers exchange HA heartbeat messages. If a server, such as an NO or SO, does not receive the HA heartbeat from its mate after several attempts, the alarm is raised. The default heartbeat interval is 250 ms, and the alarm is raised after five retries.
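As a rough illustration of the timing described above, the following sketch estimates how long heartbeats must be missing before the alarm is raised. It assumes the detection window is simply the heartbeat interval multiplied by the retry count; the exact timing on a real system may differ. The function name and parameters are hypothetical and are not part of the product.

```python
def detection_window_ms(interval_ms: int = 250, fail_count: int = 5) -> int:
    """Estimate the time without heartbeats before the alarm is raised.

    Assumption: window = heartbeat interval * number of retries.
    The defaults mirror the values stated in the Cause text.
    """
    if interval_ms <= 0 or fail_count <= 0:
        raise ValueError("interval_ms and fail_count must be positive")
    return interval_ms * fail_count


# Default settings: 250 ms interval, 5 retries
print(detection_window_ms())               # 1250
# Retry count raised to 10, interval unchanged
print(detection_window_ms(fail_count=10))  # 2500
```

Under this assumption, the default settings tolerate roughly 1.25 seconds of lost heartbeats, and doubling the retry count doubles that tolerance.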

Diagnostic Information:

To diagnose the alarm further, perform the following:

Collect the platform savelogs from the active NO and SO servers.
Collect the output of iqt -E HaCfg from the active NO and SO servers.

Recovery:

If loss of communication between the active and standby servers is caused intentionally by maintenance activity, the alarm can be ignored. It clears automatically when communication is restored between the two servers.
If communication fails at any other time, look for network connectivity issues. If needed, contact My Oracle Support.
A workaround for this problem is to increase the failCount value for all server groups in the HaCfg table. Increasing it from 5 to 10 should resolve the problem. Check with the application team before applying this workaround. On the active NO, run the command iset -ffailCount=10 HaCfg where "1=1".

Note:
This command is disruptive and causes active servers in the entire topology to lose service for about one minute while HA is reconfigured. A different server may be selected as active after the change is applied. If less disruption is required, apply the change to one server group at a time instead.