Tuning the HA for PostgreSQL Fault Monitor

The HA for PostgreSQL fault monitor verifies that the data service is running in a healthy condition.

An HA for PostgreSQL fault monitor is contained in each resource that represents a PostgreSQL instance. You created these resources when you registered and configured HA for PostgreSQL. For more information, see Registering and Configuring HA for PostgreSQL.

Standard properties and extension properties of the PostgreSQL resources control the behavior of the fault monitor. The default values of these properties determine the preset behavior of the fault monitor. Because the preset behavior should be suitable for most Oracle Solaris Cluster installations, tune the HA for PostgreSQL fault monitor only if you need to modify this preset behavior.

Tuning the HA for PostgreSQL fault monitor involves the following tasks:

Setting the return value for failed PostgreSQL monitor connections
Setting the interval between fault monitor probes
Setting the time-out for fault monitor probes
Defining the criteria for persistent faults
Specifying the failover behavior of a resource

The HA for PostgreSQL fault monitor differentiates between connection problems and definitive application failures. The value of NOCONRET in the PostgreSQL parameter file specifies the return code for connection problems. As long as consecutive failed probes all return the value of NOCONRET, a limited number of them are ignored. The first successful probe resets this “failed probe counter” to zero. The maximum number of failed probes is calculated as 100 / NOCONRET. A definitive application failure results in an immediate restart or failover.
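The counter described above can be illustrated with a minimal Python sketch. It is a simplified model, not the fault monitor's actual code: the function name `probes_until_restart` is hypothetical, only connection failures are modeled, and all probes are assumed to fall within one retry_interval.

```python
def probes_until_restart(probe_results, noconret):
    """Return the 1-based number of the probe that triggers a restart or
    failover, or None if the failed-probe counter never reaches the limit.

    probe_results: iterable of booleans, True for a successful probe and
    False for a probe that failed with the connection return code NOCONRET.
    """
    max_failed = 100 / noconret  # limit stated in the text: 100 / NOCONRET
    failed = 0                   # the "failed probe counter"
    for index, ok in enumerate(probe_results, start=1):
        if ok:
            failed = 0           # a successful probe resets the counter
            continue
        failed += 1
        if failed >= max_failed:
            return index
    return None


# Ten consecutive connection failures with NOCONRET=10 reach the limit.
print(probes_until_restart([False] * 10, 10))
# A single successful probe in between resets the counter.
print(probes_until_restart([False] * 9 + [True] + [False] * 9, 10))
```

In this model a lower NOCONRET tolerates more consecutive connection failures before the resource is restarted or failed over.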

The value that you choose for NOCONRET selects one of two behaviors for failed database connections of a PostgreSQL resource:

Retry the connection to the test database several times before considering the PostgreSQL resource as failed and triggering a restart or failover.
Complain at every probe that the connection to the test database failed. No restart or failover will be triggered.

To achieve either of these behaviors, you need to consider the standard resource properties retry_interval and thorough_probe_interval.

A “just complaining” probe results when the following inequality is true:
```
retry_interval < thorough_probe_interval * 100 / NOCONRET
```
When this inequality is false, the PostgreSQL resource restarts or fails over after 100 / NOCONRET consecutive probe failures.

The value 100/NOCONRET defines the maximum number of retries for the probe in the case of a failed connection.

Assume that the following resource parameters are set:

thorough_probe_interval=60
retry_interval=900
NOCONRET=10

If you encounter, for example, a shortage of available database sessions for 7 minutes, you will see 7 complaints in /var/adm/messages, but no resource restart. If the shortage lasts 10 minutes, you will have a restart of the PostgreSQL resource after the 10th probe.

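If you do not want a resource restart in the previous example, change the value of NOCONRET from 10 to 5 or less. With NOCONRET=5, the value of 100 / NOCONRET is 20, and 60 * 20 = 1200 is greater than the retry_interval of 900, so the probe only complains.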
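The check in the preceding example can be reproduced with a short Python sketch that evaluates the inequality for a given set of values. The function name `connection_failure_behavior` and its return strings are hypothetical and chosen only for illustration.

```python
def connection_failure_behavior(thorough_probe_interval, retry_interval, noconret):
    """Classify what repeated connection failures lead to.

    Returns "complain" if the probe only logs the failed connection, or
    "restart" if the resource restarts or fails over after
    100 / NOCONRET consecutive failed probes.
    """
    max_retries = 100 / noconret
    # Inequality from the text:
    # retry_interval < thorough_probe_interval * 100 / NOCONRET
    if retry_interval < thorough_probe_interval * max_retries:
        return "complain"
    return "restart"


# Values from the example: 900 < 60 * 10 is false.
print(connection_failure_behavior(60, 900, 10))  # restart
# With NOCONRET=5: 900 < 60 * 20 is true.
print(connection_failure_behavior(60, 900, 5))   # complain
```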

For more information, see Tuning Fault Monitors for Oracle Solaris Cluster Data Services in Oracle Solaris Cluster 4.3 Data Services Planning and Administration Guide.

Operation of the HA for PostgreSQL Parameter File

The HA for PostgreSQL resources use a parameter file to pass parameters to the start, stop, and probe commands. Changes to these parameters take effect, at the latest, when the resource is restarted or is disabled and re-enabled.

Changes to the following parameters take effect at the next probe of the PostgreSQL resource:

USER
PGROOT
PGPORT
PGHOST
LD_LIBRARY_PATH
SCDB
SCUSER
SCTABLE
SCPASS
NOCONRET

Note - An incorrect change to the parameters while the PostgreSQL resource is enabled might result in an unplanned service outage. Therefore, disable the PostgreSQL resource first, make the change, and then re-enable the resource.

Operation of the Fault Monitor for HA for PostgreSQL

The fault monitor for HA for PostgreSQL ensures that all the requirements for the PostgreSQL database server to run are met:

The HA for PostgreSQL main postmaster process is running.

If this process is not running, the fault monitor restarts the PostgreSQL database server. If the fault persists, the fault monitor fails over the resource group that contains the PostgreSQL resource.
Connections to the PostgreSQL database server are possible, and the database catalog is accessible.

If the connection fails, the probe exits with the connection failed return code NOCONRET. If the database catalog is not accessible, the fault monitor restarts the PostgreSQL resource.
The test database is healthy.

If the test table in the test database can be manipulated, the database server is considered healthy. If table manipulation fails, the fault monitor distinguishes between a connection error and a manipulation that was unsuccessful for any other reason.

If the connection was impossible, the probe exits with the connection failed return code NOCONRET. If the table manipulation itself was unsuccessful, the fault monitor triggers a restart or a failover of the PostgreSQL database server resource.
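The checks described in this section can be summarized in a minimal Python sketch of the probe's decision logic. This is an illustrative model and not the actual probe implementation. The function name `probe_return_code`, its boolean inputs, the value 100 for a definitive failure, and the value 0 for a healthy result are all assumptions.

```python
FULL_FAILURE = 100  # assumed return value for a definitive failure
HEALTHY = 0         # assumed return value for a successful probe


def probe_return_code(postmaster_running, connection_ok, catalog_ok,
                      table_ok, noconret):
    """Map the outcome of the individual checks to a probe return code."""
    if not postmaster_running:
        # Main postmaster process is missing: restart, then fail over
        # if the fault persists.
        return FULL_FAILURE
    if not connection_ok:
        # Connection problems return NOCONRET, whether they occur during
        # the catalog check or the test table manipulation.
        return noconret
    if not catalog_ok or not table_ok:
        # Catalog not accessible, or table manipulation failed for a
        # reason other than the connection.
        return FULL_FAILURE
    return HEALTHY


print(probe_return_code(True, False, True, True, noconret=10))  # 10
print(probe_return_code(True, True, True, False, noconret=10))  # 100
```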