17 Health Check Daemon

This chapter describes the Session Monitor Health Check Daemon, which monitors the memory usage of the services.

Introduction

Oracle Communications Operations Monitor consists of several interdependent parts implemented as system services. To prevent one part from consuming so much memory that it impairs the whole system, Operations Monitor includes a mechanism that restarts services with high memory usage. For some installations, the default memory limit may not be suitable, or the mechanism may not restart the services that are actually responsible for the high memory usage. The Health Check Daemon, cghealth, is built on systemd and cgroups and controls the memory consumption of the services on Operations Monitor.

cghealth sets a hard limit on total memory consumption. When memory consumption approaches this limit, cghealth attempts to restart specific services, and it writes a system log entry each time it restarts a service. The hard limit is set by the 'limit' setting in the 'memory' section. It applies to the total memory consumption of all OCSM services (called the PLD slice), not to individual processes.

Note:

Configuration changes should be made only by system administrators who understand how cghealth works.

Important:

cghealth restarts services only when the total memory usage is above the high water mark (setting 'hiwm' in section 'memory'), which is by default defined as a percentage of the hard limit.

Editing cghealth Daemon Configuration File

You can edit the cghealth daemon configuration from the command line interface. To edit the configuration:

  1. Log in as admin (root user) to a system that has Operations Monitor installed.

  2. Open the command line interface.

    Note:

    The current default values are stored in the file /opt/oracle/ocsm/etc/iptego/cghealth.conf. Read this file first to learn the current limits, but do not change the values there. Instead, set the values in a new file, /opt/oracle/ocsm/etc/iptego/cghealth.conf.local. When you add a value to cghealth.conf.local, also add the corresponding [section] header, copied from the default file, cghealth.conf.
  3. Run the following command to edit specific values:

    vi /opt/oracle/ocsm/etc/iptego/cghealth.conf.local
    

    In this file, you can set the memory thresholds for the services. For example:

    [memory_high]
    pld-vsi = 40%
    
  4. To restart the service, run the following command:

    systemctl restart pld-cghealth.service
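
To review which values would be in effect after such an edit, the default file and the local file can be read together. The following is a minimal sketch in Python, not part of cghealth. It assumes that both files use standard INI syntax and that a value in cghealth.conf.local simply overrides the same value in cghealth.conf; the function name effective_settings is hypothetical.

```python
import configparser

DEFAULT_CONF = "/opt/oracle/ocsm/etc/iptego/cghealth.conf"
LOCAL_CONF = "/opt/oracle/ocsm/etc/iptego/cghealth.conf.local"


def effective_settings(paths=(DEFAULT_CONF, LOCAL_CONF)):
    """Return {section: {key: value}}, with later files overriding earlier ones.

    The override order is an assumption about how the two files combine.
    """
    # interpolation=None keeps values such as "60%" as plain strings;
    # the default interpolation would treat "%" as a special character.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(paths)  # files that do not exist are skipped
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, values in effective_settings().items():
        print(f"[{section}]")
        for key, value in values.items():
            print(f"{key} = {value}")
```

The sketch only reads the files. Changes still take effect only after the service restart described in step 4.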
    

Setting Hard Limit for Services

cghealth sets a limit on the total memory consumption, including swap, of the PLD slice. If the restart measures fail to keep usage below this limit, the kernel enforces it by killing processes. The limit is configured in the cghealth.conf file. Make local changes in the cghealth.conf.local file.

[memory]
limit = 60%

The percentage is relative to the total physical memory of the machine. You can also specify an absolute value in bytes, for example, limit = 17179869184, or, equivalently, a value with a unit suffix, for example, limit = 16G.
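
The three notations can be compared with a small conversion helper. This is a sketch in Python, not the parser that cghealth uses. The function name parse_memory_value is hypothetical. Only the G suffix is confirmed by the example above (16G = 17179869184 bytes, that is, 16 x 1024^3); the K, M, and T suffixes are assumptions added for illustration.

```python
def parse_memory_value(value, reference_bytes):
    """Convert a setting such as '60%', '16G' or '17179869184' to bytes.

    reference_bytes is the amount a percentage refers to, for example the
    total physical memory when parsing the 'limit' setting.
    """
    text = value.strip()
    if text.endswith("%"):
        return int(reference_bytes * float(text[:-1]) / 100)
    # Binary multiples; only "G" is confirmed by the documented example,
    # the other suffixes are assumptions.
    suffixes = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    last = text[-1].upper()
    if last in suffixes:
        return int(float(text[:-1]) * suffixes[last])
    return int(text)  # plain number: already in bytes
```

Under these assumptions, parse_memory_value("16G", total) and parse_memory_value("17179869184", total) return the same number for any value of total, which matches the equivalence stated above.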

Note:

  • Customizing cghealth services is optional.

  • MySQL is not part of pld.slice. Choose the limit low enough to leave sufficient memory for MySQL.

Restarting Services

cghealth restarts services when the total memory consumption approaches the hard limit. The threshold is the high water mark.

For example, the following setting defines the high water mark:

[memory]
hiwm = 90%

The percentage is in relation to the hard limit. Absolute values are also possible.

If the total memory consumption of all services in the pld.slice exceeds the high water mark, cghealth will, via systemd, restart all services in that slice whose memory consumption is high.

Here is an example of the individual limit for the vsi service:

[memory_high]
pld-vsi = 30%

This defines that the memory consumption of pld-vsi.service is considered high if it is above 30% of the configured total limit. Again, absolute values can also be used.

For services that do not have an explicit value set, a default value is used. cghealth writes a system log entry whenever services are restarted. The following is an example of such a log entry from /var/log/messages:

Jun 12 15:21:21 ocsm journal: OCSM memory usage: 20,498,919,424 bytes
Jun 12 15:21:21 ocsm journal: * 7,272,095,744 pld-vsi.service
...
Jun 12 15:21:21 ocsm journal: 5,971,968 pld-enum-probe.service
Jun 12 15:21:21 ocsm journal: Restarting marked services.

Here is a calculation example:

For this example, we will assume that the entire system memory size is 36 GB.

From file, /opt/oracle/ocsm/etc/iptego/cghealth.conf, you can view the existing limits:

limit = 60% (of total memory: 60% x 36 GB = 21.6 GB)
hiwm = 90% (of limit: 90% x 21.6 GB = 19.44 GB)
pld-vsi = 30% (of limit: 30% x 21.6 GB = 6.48 GB)

This means that as soon as the memory usage of all OCSM services together exceeds 19.44 GB, cghealth checks each service against its individual limit and restarts the services that are above it.

For example, because the limit for pld-vsi is 30%, that service is restarted if its current memory usage is over 6.48 GB.
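
The same calculation can be written as a short Python sketch. It illustrates the rule described above and is not the cghealth implementation. The function name services_to_restart, the usage figures, and the default of 10% for services without an explicit value are hypothetical.

```python
def services_to_restart(usage_gb, total_memory_gb, limit_pct, hiwm_pct,
                        memory_high_pct, default_pct):
    """Return the services that exceed their individual limit, but only
    when total usage is above the high water mark."""
    limit_gb = total_memory_gb * limit_pct / 100   # hard limit for the slice
    hiwm_gb = limit_gb * hiwm_pct / 100            # high water mark
    if sum(usage_gb.values()) <= hiwm_gb:
        return []                                  # total usage is still acceptable
    return sorted(
        name for name, used in usage_gb.items()
        if used > limit_gb * memory_high_pct.get(name, default_pct) / 100
    )


# Hypothetical usage in GB on the 36 GB machine from the example.
usage = {"pld-vsi": 18.0, "pld-meco": 1.0, "pld-apid": 1.0}
print(services_to_restart(usage, 36, 60, 90, {"pld-vsi": 30}, default_pct=10))
# prints ['pld-vsi']
```

Here the total of 20 GB is above the 19.44 GB high water mark, so the individual limits are checked. Only pld-vsi exceeds its limit of 6.48 GB; the other two services stay below the assumed default of 2.16 GB.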

Nightly Check for Individual Service Memory Usage

Every night at 03:27, cghealth performs an additional check of individual memory usage. It looks at the memory each service used throughout the day. If that usage is above the service's value in the 'memory_high' section (or the 'default' value in that section), the service is restarted. The nightly check is independent of the memory used by the full PLD slice: only the individual service memory usage is checked.

For example:

Assume that the 'memory_high' section in the /opt/oracle/ocsm/etc/iptego/cghealth.conf file (or the /opt/oracle/ocsm/etc/iptego/cghealth.conf.local file) looks as follows:

[memory_high]
default = 10%
pld-vsi = 40%
pld-meco = 30%

During the night, cghealth checks the memory each service used throughout the day. If pld-vsi memory usage went above 40% of the limit (the limit defined in the 'memory' section), it is restarted. If pld-meco went above 30% of the limit, it is restarted. For all other services in the PLD slice (for example, pld-apid), the memory usage is compared to 10% of the limit, because that is the default value.
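
The nightly rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the cghealth implementation. The function name nightly_restarts and the usage figures are hypothetical, the limit of 21.6 GB is taken from the earlier calculation example, and "memory used throughout the day" is interpreted here as a single figure per service.

```python
def nightly_restarts(daily_usage_gb, limit_gb, memory_high_pct):
    """Return the services whose usage exceeds their 'memory_high' value.

    Unlike the high water mark check, the total usage of the slice is
    not considered here.
    """
    default_pct = memory_high_pct["default"]
    return sorted(
        name for name, used in daily_usage_gb.items()
        if used > limit_gb * memory_high_pct.get(name, default_pct) / 100
    )


memory_high = {"default": 10, "pld-vsi": 40, "pld-meco": 30}
usage = {"pld-vsi": 9.0, "pld-meco": 5.0, "pld-apid": 2.5}  # hypothetical, in GB
print(nightly_restarts(usage, 21.6, memory_high))
# prints ['pld-apid', 'pld-vsi']
```

With a limit of 21.6 GB, the thresholds are 8.64 GB for pld-vsi, 6.48 GB for pld-meco, and 2.16 GB for every other service. pld-vsi and pld-apid exceed their thresholds and would be restarted, even though the total of 16.5 GB is below the high water mark.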