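After a resource is started on a node, the RGM does not call the PROBE method directly. Instead, it calls the Monitor_start method to start the monitor. The xfnts_monitor_start method starts the fault monitor under the control of PMF. The xfnts_monitor_stop method stops the fault monitor.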
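The fault monitor periodically monitors the health of the xfs server daemon by using utilities that are specifically designed to check simple TCP-based services such as xfs.
It tracks the problems that the application encounters within a time window (using the Retry_count and Retry_interval properties) and, if the application fails completely, decides whether to restart the data service or fail it over. The scds_fm_action() and scds_fm_sleep() functions provide built-in support for this tracking and decision mechanism.
It implements the failover or restart decision by using scds_fm_action().
It updates the resource state and makes that state available to administrative tools and graphical user interfaces.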
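The xfnts_probe method implements a loop. Before entering the loop, xfnts_probe does the following:
Retrieves the network address resources for the xfnts resource, as follows.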
/* Get the ip addresses available for this resource */
if (scds_get_netaddr_list(scds_handle, &netaddr)) {
    scds_syslog(LOG_ERR,
        "No network address resource in resource group.");
    scds_close(&scds_handle);
    return (1);
}

/* Return an error if there are no network resources */
if (netaddr == NULL || netaddr->num_netaddrs == 0) {
    scds_syslog(LOG_ERR,
        "No network address resource in resource group.");
    return (1);
}
Calls scds_fm_sleep() and passes the value of Thorough_probe_interval as the timeout value. Between probes, the probe sleeps for the duration given by Thorough_probe_interval.
timeout = scds_get_ext_probe_timeout(scds_handle);

for (;;) {
    /*
     * sleep for a duration of thorough_probe_interval between
     * successive probes.
     */
    (void) scds_fm_sleep(scds_handle,
        scds_get_rs_thorough_probe_interval(scds_handle));
The xfnts_probe method implements the body of the loop as follows.
    for (ip = 0; ip < netaddr->num_netaddrs; ip++) {
        /*
         * Grab the hostname and port on which the
         * health has to be monitored.
         */
        hostname = netaddr->netaddrs[ip].hostname;
        port = netaddr->netaddrs[ip].port_proto.port;
        /*
         * HA-XFS supports only one port and
         * hence obtain the port value from the
         * first entry in the array of ports.
         */
        ht1 = gethrtime(); /* Latch probe start time */
        scds_syslog(LOG_INFO, "Probing the service on port: %d.", port);

        probe_result =
            svc_probe(scds_handle, hostname, port, timeout);

        /*
         * Update service probe history,
         * take action if necessary.
         * Latch probe end time.
         */
        ht2 = gethrtime();

        /* Convert to milliseconds */
        dt = (ulong_t)((ht2 - ht1) / 1e6);

        /*
         * Compute failure history and take
         * action if needed
         */
        (void) scds_fm_action(scds_handle,
            probe_result, (long)dt);
    } /* Each net resource */
} /* Keep probing forever */
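The loop times each probe in milliseconds, whereas svc_probe() works in whole seconds. Both are derived from gethrtime(), which reports nanoseconds. The following is a minimal sketch of the two conversions using integer arithmetic. The helper names elapsed_ms and elapsed_sec are hypothetical and are not part of the sample data service.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_MS   INT64_C(1000000)
#define NS_PER_SEC  INT64_C(1000000000)

/*
 * Hypothetical helper: elapsed time in whole milliseconds between two
 * nanosecond timestamps (the unit reported by gethrtime()).
 */
int64_t
elapsed_ms(int64_t start_ns, int64_t end_ns)
{
    assert(end_ns >= start_ns);
    return ((end_ns - start_ns) / NS_PER_MS);
}

/*
 * Hypothetical helper: elapsed time in whole seconds, the unit in which
 * svc_probe() does its timeout bookkeeping.
 */
int64_t
elapsed_sec(int64_t start_ns, int64_t end_ns)
{
    assert(end_ns >= start_ns);
    return ((end_ns - start_ns) / NS_PER_SEC);
}
```

Both helpers truncate toward zero, so a probe that takes 2.5 ms is reported as 2 ms.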
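The svc_probe() function implements the probe logic. The return value from svc_probe() is passed to scds_fm_action(), which decides whether to restart the application, fail over the resource group, or do nothing.
The svc_probe() function makes a simple socket connection to the specified port by calling scds_fm_tcp_connect(). If the connection fails, svc_probe() returns 100, which indicates a complete failure. If the connection succeeds but the disconnect fails, svc_probe() returns 50, which indicates a partial failure. If both the connection and the disconnect succeed, svc_probe() returns 0, which indicates success.
The code for svc_probe() is as follows.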
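The return-value convention described above can be sketched as a small pure function. The names classify_probe, PROBE_SUCCESS, PROBE_PARTIAL_FAILURE, and PROBE_COMPLETE_FAILURE are hypothetical stand-ins chosen for illustration. Only the values 0, 50, and 100 come from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants mirroring the values named in the text. */
#define PROBE_SUCCESS           0
#define PROBE_COMPLETE_FAILURE  100
#define PROBE_PARTIAL_FAILURE   (PROBE_COMPLETE_FAILURE / 2)

/*
 * Hypothetical helper: map the outcome of the connect and disconnect
 * steps to a probe result.
 */
int
classify_probe(bool connected, bool disconnected)
{
    /* Two partial failures must add up to one complete failure. */
    assert(PROBE_PARTIAL_FAILURE * 2 == PROBE_COMPLETE_FAILURE);

    if (!connected)
        return (PROBE_COMPLETE_FAILURE);  /* service unreachable */
    if (!disconnected)
        return (PROBE_PARTIAL_FAILURE);   /* alive, but hung or loaded */
    return (PROBE_SUCCESS);
}
```

Defining the partial failure as half of the complete failure is what allows two consecutive partial failures to reach the restart threshold.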
int svc_probe(scds_handle_t scds_handle,
    char *hostname, int port, int timeout)
{
    int rc;
    hrtime_t t1, t2;
    int sock;
    char testcmd[2048];
    int time_used, time_remaining;
    time_t connect_timeout;

    /*
     * probe the data service by doing a socket connection to the port
     * specified in the port_list property to the host that is
     * serving the XFS data service. If the XFS service which is configured
     * to listen on the specified port, replies to the connection, then
     * the probe is successful. Else we will wait for a time period set
     * in probe_timeout property before concluding that the probe failed.
     */

    /*
     * Use the SVC_CONNECT_TIMEOUT_PCT percentage of timeout
     * to connect to the port
     */
    connect_timeout = (SVC_CONNECT_TIMEOUT_PCT * timeout)/100;
    t1 = (hrtime_t)(gethrtime()/1E9);

    /*
     * the probe makes a connection to the specified hostname and port.
     * The connection is timed for 95% of the actual probe_timeout.
     */
    rc = scds_fm_tcp_connect(scds_handle, &sock, hostname, port,
        connect_timeout);
    if (rc) {
        scds_syslog(LOG_ERR,
            "Failed to connect to port <%d> of resource <%s>.",
            port, scds_get_resource_name(scds_handle));
        /* this is a complete failure */
        return (SCDS_PROBE_COMPLETE_FAILURE);
    }
    t2 = (hrtime_t)(gethrtime()/1E9);

    /*
     * Compute the actual time it took to connect. This should be less than
     * or equal to connect_timeout, the time allocated to connect.
     * If the connect uses all the time that is allocated for it,
     * then the remaining value from the probe_timeout that is passed to
     * this function will be used as disconnect timeout. Otherwise, the
     * remaining time from the connect call will also be added to
     * the disconnect timeout.
     */
    time_used = (int)(t2 - t1);

    /*
     * Use the remaining time(timeout - time_took_to_connect) to disconnect
     */
    time_remaining = timeout - time_used;

    /*
     * If all the time is used up, use a small hardcoded timeout
     * to still try to disconnect. This will avoid the fd leak.
     */
    if (time_remaining <= 0) {
        scds_syslog_debug(DBG_LEVEL_LOW,
            "svc_probe used entire timeout of "
            "%d seconds during connect operation and exceeded the "
            "timeout by %d seconds. Attempting disconnect with timeout"
            " %d ",
            timeout,
            abs(time_remaining),
            SVC_DISCONNECT_TIMEOUT_SECONDS);
        time_remaining = SVC_DISCONNECT_TIMEOUT_SECONDS;
    }
    /*
     * Return partial failure in case of disconnection failure.
     * Reason: The connect call is successful, which means
     * the application is alive. A disconnection failure
     * could happen due to a hung application or heavy load.
     * If it is the latter case, don't declare the application
     * as dead by returning complete failure. Instead, declare
     * it as partial failure. If this situation persists, the
     * disconnect call will fail again and the application will be
     * restarted.
     */
    rc = scds_fm_tcp_disconnect(scds_handle, sock, time_remaining);
    if (rc != SCHA_ERR_NOERR) {
        scds_syslog(LOG_ERR,
            "Failed to disconnect from port %d of resource %s.",
            port, scds_get_resource_name(scds_handle));
        /* this is a partial failure */
        return (SCDS_PROBE_COMPLETE_FAILURE/2);
    }
    t2 = (hrtime_t)(gethrtime()/1E9);
    time_used = (int)(t2 - t1);
    time_remaining = timeout - time_used;

    /*
     * If there is no time left, don't do the full test with
     * fsinfo. Return SCDS_PROBE_COMPLETE_FAILURE/2
     * instead. This will make sure that if this timeout
     * persists, server will be restarted.
     */
    if (time_remaining <= 0) {
        scds_syslog(LOG_ERR, "Probe timed out.");
        return (SCDS_PROBE_COMPLETE_FAILURE/2);
    }

    /*
     * The connection and disconnection to port is successful,
     * Run the fsinfo command to perform a full check of
     * server health.
     * Redirect stdout, otherwise the output from fsinfo
     * ends up on the console.
     */
    /* snprintf bounds the write to the size of testcmd */
    (void) snprintf(testcmd, sizeof (testcmd),
        "/usr/openwin/bin/fsinfo -server %s:%d > /dev/null",
        hostname, port);
    scds_syslog_debug(DBG_LEVEL_HIGH,
        "Checking the server status with %s.", testcmd);
    if (scds_timerun(scds_handle, testcmd, time_remaining,
        SIGKILL, &rc) != SCHA_ERR_NOERR || rc != 0) {
        scds_syslog(LOG_ERR,
            "Failed to check server status with command <%s>",
            testcmd);
        return (SCDS_PROBE_COMPLETE_FAILURE/2);
    }
    return (0);
}
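The way svc_probe() divides the probe timeout between the connect and disconnect steps can be sketched in isolation. The 95 percent figure comes from the comment in the listing. The function names connect_budget and disconnect_budget and the two-second fallback are assumptions made for illustration, because the listing does not show how SVC_CONNECT_TIMEOUT_PCT and SVC_DISCONNECT_TIMEOUT_SECONDS are defined.

```c
#include <assert.h>

/* 95% is taken from the listing's comment; the fallback is an assumed value. */
#define CONNECT_TIMEOUT_PCT         95
#define DISCONNECT_TIMEOUT_SECONDS  2   /* hypothetical */

/* Hypothetical helper: seconds allotted to the connect step. */
int
connect_budget(int probe_timeout)
{
    assert(probe_timeout >= 0);
    return ((CONNECT_TIMEOUT_PCT * probe_timeout) / 100);
}

/*
 * Hypothetical helper: seconds allotted to the disconnect step. If the
 * connect consumed the whole probe timeout, fall back to a small fixed
 * value so that the socket is still closed.
 */
int
disconnect_budget(int probe_timeout, int time_used)
{
    int remaining = probe_timeout - time_used;

    if (remaining <= 0)
        remaining = DISCONNECT_TIMEOUT_SECONDS;
    return (remaining);
}
```

With a 30-second probe timeout, the connect step gets 28 seconds because of integer division. A connect that takes 10 seconds leaves 20 seconds for the disconnect.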
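When it finishes, svc_probe() returns a value that indicates success (0), partial failure (50), or complete failure (100). The xfnts_probe method passes this value to scds_fm_action().
The xfnts_probe method calls scds_fm_action() to determine the action to take. The logic in scds_fm_action() is as follows: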
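Maintains a cumulative failure history within the time window defined by the Retry_interval property.
If the cumulative failure value reaches 100 (complete failure), restarts the data service. If Retry_interval is exceeded, resets the history.
If the number of restarts within the time specified by Retry_interval exceeds the value of the Retry_count property, fails over the data service.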
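For example, suppose that the probe connects to the xfs server but the disconnect fails. This indicates that the server is running but might be hung or temporarily under heavy load. The failed disconnect sends a value that indicates partial failure (50) to scds_fm_action(). This value is below the threshold for restarting the data service, but it is retained in the failure history.
If the disconnect fails again during the next probe, another value of 50 is added to the failure history that scds_fm_action() maintains. The cumulative failure value is now 100, so scds_fm_action() restarts the data service.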
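The decision logic described above can be approximated with a small, self-contained model. This is a sketch of the behavior as the text describes it, not the library's implementation. The type fm_history_t, the function fm_record, the ACTION_* values, and the fixed history size are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PROBE_COMPLETE_FAILURE  100
#define MAX_HISTORY             32   /* hypothetical fixed capacity */

typedef enum { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER } fm_action_t;

typedef struct {
    long   when[MAX_HISTORY];          /* time of each recorded failure (s) */
    int    value[MAX_HISTORY];         /* failure value, 1..100 */
    size_t count;
    long   restart_when[MAX_HISTORY];  /* time of each restart (s) */
    size_t restarts;
} fm_history_t;

/* Keep only entries newer than cutoff; value may be NULL. */
static size_t
prune(long *when, int *value, size_t n, long cutoff)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (when[i] > cutoff) {
            when[kept] = when[i];
            if (value != NULL)
                value[kept] = value[i];
            kept++;
        }
    }
    return (kept);
}

/* Record one probe result and return the action the text describes. */
fm_action_t
fm_record(fm_history_t *h, long now, int probe_result,
    long retry_interval, int retry_count)
{
    long cutoff = now - retry_interval;
    int total = 0;

    assert(h != NULL && retry_count >= 0);
    assert(probe_result >= 0 && probe_result <= PROBE_COMPLETE_FAILURE);

    /* Forget failures and restarts that fell out of the window. */
    h->count = prune(h->when, h->value, h->count, cutoff);
    h->restarts = prune(h->restart_when, NULL, h->restarts, cutoff);

    /* A full history drops the new entry; adequate for a sketch. */
    if (probe_result > 0 && h->count < MAX_HISTORY) {
        h->when[h->count] = now;
        h->value[h->count] = probe_result;
        h->count++;
    }
    for (size_t i = 0; i < h->count; i++)
        total += h->value[i];
    if (total < PROBE_COMPLETE_FAILURE)
        return (ACTION_NONE);

    h->count = 0;  /* threshold reached: start a fresh failure history */
    if (h->restarts >= (size_t)retry_count || h->restarts >= MAX_HISTORY)
        return (ACTION_FAILOVER);
    h->restart_when[h->restarts++] = now;
    return (ACTION_RESTART);
}
```

With Retry_interval set to 300 seconds and Retry_count set to 2, two partial failures 60 seconds apart trigger a restart, and a third restart request inside the same window becomes a failover. A partial failure that is older than the window no longer counts toward the threshold.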