3 Monitoring and Managing

This chapter gives some hints and tips for system administrators to manage the Convergent Charging Controller product.

Monitoring and Managing Overview

For system administrators new to the Convergent Charging Controller solution, the monitoring and managing of the solution can seem a little daunting due to the large number of different processes and potentially complex interactions with different network types and services.

As with anything, though, the more you use it, the more familiar it becomes and the easier it gets.

The aim of this chapter is to introduce some common tools and aids to help administrators become more comfortable in monitoring the solution without getting involved with the technical detail of each component and its operation.

Software Version Levels

The Convergent Charging Controller base components, or applications, and Convergent Charging Controller application patches are installed on the system using the operating system commands.

Running Processes

In the UNIX world a running application, or program, is referred to as a process.

Basically, a process is the active instance of a running program to which the operating system allocates system resources, allowing it to execute instructions.

In the UNIX environment, the ps command (see man ps) is used to generate a snapshot report of the current status of processes; this is how you can tell that an application is running.
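As a minimal sketch of this check (the is_running helper is illustrative, not a product utility), the following matches ps -ef output against a process name; slee_acs is used as the example daemon:

```shell
# is_running NAME: succeed if a process whose ps -ef line contains NAME
# is currently running. Wrapping the first character of the pattern in
# brackets (e.g. "[s]lee_acs") stops the grep command itself matching.
is_running() {
    pattern=$(printf '%s' "$1" | sed 's/^\(.\)/[\1]/')
    ps -ef | grep "$pattern" > /dev/null
}

# Example: check for the slee_acs service logic engine.
if is_running slee_acs; then
    echo "slee_acs is running"
else
    echo "slee_acs is not running"
fi
```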

Complex environments

In a complex environment, where multiple running processes make up an application suite, the trick is knowing which processes to look for.

To do this, you would need to identify each Convergent Charging Controller daemon that is configured to run in the SLEE configuration file (SLEE.cfg) and in the following:

  • For Solaris:

/etc/inittab file

  • For Linux:

Run the following command:

systemctl start service

where

service

is the name of the service.

The use of start-up scripts can further complicate this task, requiring each script to be checked for the name of the executable binary file.

To simplify this task the Convergent Charging Controller support tools package provides a utility called pslist that, among other things, will identify and quickly verify the status of the Convergent Charging Controller processes running on the platform.

pslist command

The pslist command uses a process list configuration (.plc) file containing regular expressions of the relevant Convergent Charging Controller processes that are matched against the ps -ef command output. Enter:

$ pslist 

Response:


------------------------ Tue Nov 23 01:03:46 GMT 2010 -------------------------- 
C APP  USER       PID PPID  STIME COMMAND 
1 ACS  acs_oper  1004    1 04-Oct /IN/service_packages/ACS/bin/acsCompilerDaemon 
1 ACS  acs_oper  1008    1 04-Oct IN/service_packages/ACS/bin/acsProfileCompiler 
1 ACS  acs_oper  7553    1 28-Oct rvice_packages/ACS/bin/acsStatisticsDBInserter 
1 OSD  acs_oper  1047    1 04-Oct IN/service_packages/OSD/bin/osdWsdlRegenerator 
1 CCS  ccs_oper  1011    1 04-Oct /IN/service_packages/CCS/bin/ccsCDRLoader 
1 CCS  ccs_oper  1033    1 04-Oct N/service_packages/CCS/bin/ccsCDRFileGenerator 
2 CCS  ccs_oper  1406 1043 04-Oct /IN/service_packages/CCS/bin/ccsProfileDaemon 
1 CCS  ccs_oper 20252    1 28-Oct /IN/service_packages/CCS/bin/ccsBeOrb 
1 CCS  ccs_oper  9413    1 04-Oct /IN/service_packages/CCS/bin/ccsChangeDaemon 
1 EFM  smf_oper   995    1 04-Oct /IN/service_packages/EFM/bin/smsAlarmManager 
1 PI   smf_oper  1080    1 04-Oct /IN/service_packages/PI/bin/PImanager 
6 PI   smf_oper  1319 1080 04-Oct PIprocess 
1 PI   smf_oper  9186 1080 04-Oct PIbeClient 
2 SMS  smf_oper  6173    1 26-Oct /IN/service_packages/SMS/bin/smsMaster 
1 SMS  smf_oper   943    1 04-Oct /IN/service_packages/SMS/bin/smsNamingServer 
1 SMS  smf_oper   944    1 04-Oct /IN/service_packages/SMS/bin/smsReportsDaemon 
1 SMS  smf_oper   946    1 04-Oct IN/service_packages/SMS/bin/smsReportScheduler 
1 SMS  smf_oper   947    1 04-Oct /IN/service_packages/SMS/bin/smsAlarmDaemon 
1 SMS  smf_oper   948    1 04-Oct /IN/service_packages/SMS/bin/smsStatsThreshold
1 SMS  smf_oper   949    1 04-Oct /IN/service_packages/SMS/bin/smsTaskAgent 
1 SMS  smf_oper   969    1 04-Oct /IN/service_packages/SMS/bin/smsTrigDaemon 
2 SMS  smf_oper   979    1 04-Oct /IN/service_packages/SMS/bin/smsConfigDaemon 
1 SMS  smf_oper   980    1 04-Oct /IN/service_packages/SMS/bin/smsStatsDaemonRep 
total processes found = 31 [ 31 expected ] 
================================= run-level 3 ================================== 

Note:

The listed output is from an SMS platform and is grouped by user and application component.

Default plc file

When a UNIX user first runs the pslist command without command line options, it automatically generates a default .plc file by scanning the /etc/inittab and /IN/service_packages/SLEE/etc/SLEE.cfg files, if they exist.

The user's default .plc file can be viewed with the pslist -v command line option. Enter:

$ pslist -v 

Response:

[ /IN/service_packages/ACS/tmp/ps_processes.telco-p-slc01.plc ] 
############################################################################ 
# pslist: default process list configuration (plc) file used to match and  # 
# display running processes.                                               #  
# File creation time: Tue Nov 23 01:58:39 GMT 2010                         #  
# Lines beginning with a hash (#) character are ignored.                   # 
# $1="grouped-apps name (max 5-char)" $2="regex of process" [$3+=comments] # 
############################################################################ 
ACS   acs_oper.*\/IN\/service_packages\/ACS\/bin\/acsStatsMaster               inittab  
ACS   acs_oper.*\/IN\/service_packages\/ACS\/bin\/cmnPushFiles                 inittab  
SMS   smf_oper.*\/IN\/service_packages\/SMS\/bin\/cmnPushFiles                 inittab  
SMS   smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsAlarmDaemon               inittab  
SMS   smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsConfigDaemon$             inittab  
SMS   smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsStatsDaemon               inittab  
SMS   smf_oper.*\/IN\/service_packages\/SMS\/bin\/updateLoader                 inittab  
SMS   smf_oper.*bin\/(smsC|c)+ompareResync(Client|Recv)+     inittab: Update Loader child process 
UIS   uis_oper.*\/IN\/service_packages\/UIS\/bin\/UssdMfileD                   inittab  
UIS   uis_oper.*\/IN\/service_packages\/UIS\/bin\/cmnPushFiles                 inittab  
XMS   smf_oper.*\/IN\/service_packages\/XMS\/bin\/oraPStoreCleaner             inittab  
SLEE  .*_oper.*\/IN\/service_packages\/ACS\/bin\/acsStatsLocalSLEE              
SLEE  .*_oper.*\/IN\/service_packages\/ACS\/bin\/slee_acs                       
SLEE  .*_oper.*\/IN\/service_packages\/E2BE\/bin\/BeClient                      
SLEE  .*_oper.*\/IN\/service_packages\/SLEE\/bin\/\/watchdog                    
SLEE  .*_oper.*\/IN\/service_packages\/SLEE\/bin\/alarmIF                       
SLEE  .*_oper.*\/IN\/service_packages\/SLEE\/bin\/replicationIF                 
SLEE  .*_oper.*\/IN\/service_packages\/SLEE\/bin\/timerIF                       
SLEE  .*_oper.*\/IN\/service_packages\/SLEE\/bin\/xmlTcapInterface              
ps_processes.telco-p-slc01.plc: END 
[Press space to continue, q to quit, h for help]
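Each line above pairs an application group name with an extended regular expression that pslist matches against the ps -ef output. As an illustration (assuming GNU grep; the sample line is taken from the earlier pslist output), one of these regexes can be tried by hand:

```shell
# Match one .plc regular expression against a sample ps -ef line,
# the same way pslist matches each .plc entry against the ps output.
line='smf_oper   947    1 04-Oct /IN/service_packages/SMS/bin/smsAlarmDaemon'
regex='smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsAlarmDaemon'
if printf '%s\n' "$line" | grep -E "$regex" > /dev/null; then
    echo "matched"
else
    echo "no match"
fi
```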

SLEE against Service Daemons

There is an important distinction to make for the processes that are managed by either the service daemon or SLEE.

To turn a process on or off in the /etc/inittab file, edit the file and then apply the change by doing the following:

For Solaris:

init q

For Linux:

/IN/bin/OUI_systemctl.sh  

When started, the SLEE creates a common backbone that ties a group of applications together.

Note:

You can configure each process to run individually, except for SLEE-managed processes.

pslist SLEE only example

The following pslist example is taken on a VWS and shows running SLEE processes, but no running init managed processes. Enter:

$ pslist 

Response:

------------------------ Tue Nov 23 02:48:46 GMT 2010 -------------------------- 
C APP  USER       PID PPID    STIME COMMAND 
1 SLEE ebe_oper 16139    1 02:23:25 /IN/service_packages/SLEE/bin/timerIF 
16 SLEE ebe_oper 16143    1 02:23:25 /IN/service_packages/E2BE/bin/beVWARS 
1 SLEE ebe_oper 16156    1 02:23:25 /IN/service_packages/E2BE/bin/beSync 
1 SLEE ebe_oper 16157    1 02:23:25 /IN/service_packages/E2BE/bin/beServer 
4 SLEE ebe_oper 16166    1 02:23:25 /IN/service_packages/E2BE/bin/beGroveller 
1 SLEE ebe_oper 16178    1 02:23:26 /IN/service_packages/SLEE/bin/replicationIF 
1 SLEE ebe_oper 16179    1 02:23:26 /IN/service_packages/DAP/bin/dapIF 
1 SLEE ebe_oper 16182    1 02:23:26 /service_packages/E2BE/bin/beEventStorageIF 
1 SLEE ebe_oper 16183    1 02:23:26 /service_packages/E2BE/bin/beServiceTrigger 
1 SLEE ebe_oper 16184    1 02:23:26 ervice_packages/CCS/bin/ccsSLEEChangeDaemon 
1 SLEE ebe_oper 16185    1 02:23:26 /IN/service_packages/SLEE/bin//watchdog 
0 CCS  Did not match regex: /ccs_oper.*.\/bin\/updateLoader( |$)+/ 
0 CCS  Did not match regex: /ccs_oper.*\/IN\/service_packages\/CCS\/bin\/ccsMFileCompiler( |$)+/ 
0 CCS  Did not match regex: /ccs_oper.*bin\/(smsC|c)+ompareResync(Client|Recv)+( |$)+/ 
0 CCS  Did not match regex: /ccs_oper.*cmnPushFiles( |$)+/ 
0 E2BE Did not match regex: /ebe_oper.*\/IN\/service_packages\/E2BE\/bin\/beCDRMover( |$)+/ 
0 E2BE Did not match regex: /ebe_oper.*cmnPushFiles( |$)+/ 
0 SMS  Did not match regex: /smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsAlarmDaemon( |$)+/ 
0 SMS  Did not match regex: /smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsConfigDaemon$/ 
0 SMS  Did not match regex: /smf_oper.*\/IN\/service_packages\/SMS\/bin\/smsStatsDaemon( |$)+/ 
total processes found = 29 [ 38 expected, 9 not found ] 
================================= run-level 2 ================================== 

Note:

The count column (C) is 0 for the processes that were not matched and the total processes found output highlights that the expected number of processes were not matched. In this case there is a clue on the bottom line as to why the init managed processes are not running.

Expected against Not Found processes

It is worthwhile noting that the expected and not found counts are only an estimate of the number of processes missing.

Some processes may spawn multiple child processes that may have the same process name (for example, PI processes on the SMS) making it impossible to accurately predict how many processes are expected or not found.

Recreate default plc file

If changes are made to the platform and the running services then it may be necessary to re-create the default .plc file.

You can do this using the pslist -d option.

Warning: Only processes that are configured to run are added to the default .plc file. For example, if an /etc/inittab process is commented out, or set to off, then the commented entry will not be added to the list when the pslist -d command is run.

Another option is to become the root user and run pslist -xy to delete all users' default .plc files. The next time a user runs the pslist command, it automatically re-creates a default .plc file.

pslist syntax

To see a full list of supported command line options, with a brief description, use the pslist -h (help) option, or use man pslist for full documentation on pslist usage.

Syntax:

pslist -[acCehiqrRstvxyz] [-d (esg|slee|init)] [-S SLEE config] [-f .plc file] [-u user] [-k (1|9|15)] [-l (1|2)] [regex (app-name|process name)] 

pslist parameters

This table gives a brief description of the supported options.

Parameter Description
-d

Creates the default process list configuration

[~user/tmp/ps_processes.hostname.plc] file.

Valid arguments are:

  • esg scan both the inittab and SLEE config files (default)
  • slee scan the defined SLEE configuration file
  • init[tab] scan the /etc/inittab file
-r Displays system resource use information (for example, %cpu, %memory).
-s Scans the SLEE config file and prints the status of the matched process(es).
-i Scans /etc/inittab file for start-up scripts and prints the status of the matched process(es).
-e Scans both the defined SLEE config file and the /etc/inittab file and prints the status of the matched process(es).
-c Clustered SMS option that creates a temporary .plc file to list cluster managed processes (requires scstat command).
-f

Specify a process list configuration (.plc) file different from the default one

[~user/tmp/ps_processes.hostname.plc].

-S Specify a different SLEE config file.
-u Specify a different user as the process owner (can be regex).
-C The output in the COMMAND column is not trimmed so the full command line is displayed.
-a Displays the command line arguments of matched process(es) (SunOS only).
-q No output in quiet mode. The exit value is the count of the matched processes (up to a maximum of 99).
-t Displays a process tree of PPID and COMMAND (SunOS only).
-v View the process list configuration (.plc) file.
-x Ask to remove any default process list configuration files (ps_processes.hostname.plc) found in the /tmp, ~user/tmp, and ~<*_oper>/tmp directories.
-k

Ask if one of the following kill signals should be sent to all matched processes. Valid arguments are:

1 SIGHUP

9 SIGKILL

15 SIGTERM

-y Specify yes. Use with -x to confirm remove, or -k option to confirm kill.
-l Adds an entry to the system log with the expected and actual process count info.
-z

Creates a softlink for the shorter "pslist" command in directory /usr/local/bin.

Note: Users must have that directory set in their PATH environment variable for it to work.

-R Displays pslist revision info.
-h Displays the above usage text.

Creating own plc file

As can be seen, there are multiple command line options and the more experienced administrator may like to make their own .plc file (-f option) with an alias command to list other important processes.

For example, you could create a command to list the Oracle database processes on the VWS.

$ cat <<EOF >ora.plc 
ORA ora_.*E2BE       core db processes 
ORA oracleE2BE       shadow process 
EOF 
$ alias psdb='pslist -f ora.plc $*' # note: add this line to your user ~/.profile. 
$ psdb -r 
------------------------ Tue Nov 23 21:28:57 GMT 2010 -------------------------- 
APP USER     PID PPID S %CPU %MEM     VSZ     RSS  TIME     ELAPSED COMMAND 
ORA oracle   744    1 S  0.0 51.3 4319336 4091784 30:40 50-07:34:48 ora_pmon_E2BE 
ORA oracle   746    1 S  0.0 51.2 4317864 4090304 02:13 50-07:34:48 ora_psp0_E2BE 
ORA oracle   748    1 S  0.0 51.2 4317864 4090528 02:41 50-07:34:48 ora_mman_E2BE 
ORA oracle   750    1 S  0.0 51.3 4323928 4095976 05:02 50-07:34:48 ora_dbw0_E2BE 
ORA oracle   752    1 S  0.0 51.2 4330840 4091264 03:47 50-07:34:48 ora_lgwr_E2BE 
ORA oracle   754    1 S  0.0 51.2 4319928 4091472 45:09 50-07:34:48 ora_ckpt_E2BE 
ORA oracle   756    1 S  0.0 51.3 4318952 4092184 04:40 50-07:34:48 ora_smon_E2BE 
ORA oracle   758    1 S  0.0 51.2 4317928 4090904 00:03 50-07:34:48 ora_reco_E2BE 
ORA oracle   760    1 S  0.0 51.3 4319720 4092048 04:59 50-07:34:48 ora_mmon_E2BE 
ORA oracle   762    1 S  0.0 51.2 4317928 4090744 02:32 50-07:34:48 ora_mmnl_E2BE 
ORA oracle 16159    1 S  0.0 46.1 4317920 3681768 00:00    19:05:33 oracleE2BE 
ORA oracle 16161    1 S  0.0 46.1 4317920 3681768 00:00    19:05:33 oracleE2BE 
ORA oracle 16163    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16168    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16170    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16172    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16175    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16177    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16181    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16189    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16191    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16193    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16195    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16197    1 S  0.0 46.1 4317920 3681768 00:00    19:05:32 oracleE2BE 
ORA oracle 16199    1 S  0.0 46.1 4317920 3681704 00:00    19:05:31 oracleE2BE 
ORA oracle 16201    1 S  0.0 46.1 4317920 3681704 00:00    19:05:31 oracleE2BE 
ORA oracle 16203    1 S  0.0 46.1 4317984 3681832 00:00    19:05:31 oracleE2BE 
ORA oracle 16205    1 S  0.0 46.1 4317920 3681768 00:00    19:05:31 oracleE2BE 
ORA oracle 16207    1 S  0.0 46.1 4317920 3681768 00:00    19:05:31 oracleE2BE 
ORA oracle 16209    1 S  0.0 46.1 4317920 3681704 00:00    19:05:31 oracleE2BE 
ORA oracle 16212    1 S  0.0 46.1 4317920 3681704 00:00    19:05:31 oracleE2BE 
ORA oracle 16214    1 S  0.0 46.1 4317920 3681704 00:00    19:05:31 oracleE2BE 
ORA oracle 16216    1 S  0.0 46.1 4317920 3681768 00:00    19:05:31 oracleE2BE 
ORA oracle 16218    1 S  0.0 46.1 4317920 3681704 00:01    19:05:31 oracleE2BE 
ORA oracle 16222    1 S  0.0 46.1 4317920 3681832 00:00    19:05:31 oracleE2BE 
total processes found = 35 
================================= run-level 2 ================================== 

General comment

Generally speaking, pslist is the only command you need to remember when checking for the Convergent Charging Controller processes configured to run on any platform and it is a good place to start when investigating or troubleshooting issues.

Note: The slee-ctrl status and resource commands respectively call the pslist -s and pslist -sr commands to generate output.

SLEE Resource Usage

SLEE resources are created at run time and are used during service control for passing information between SLEE interfaces and applications.

These are internal values that have no specific meaning to the rest of the network and are entirely local to the running SLEE.

The Convergent Charging Controller SLEE provides an application called check that can monitor and report on current SLEE resource usage. See SLEE Technical Guide for more information on this utility.

SLEE resources

There are three main types of internal SLEE resources:

  • Dialogs - exist for the duration of the call.
  • Events - these are transient and are deleted upon delivery to the relevant SLEE entity.
  • Calls - represent each service request handled by the platform; a call instance exists until the request completes.

If you have a networking background then you may notice a similarity to the SS7 model here. In fact, at a high level, the SLC SLEE considers everything that hits the platform to be a Call instance, be it a voice or data call, an SMS, or some sort of business process logic.

Like the SS7 model, a call instance has a source and destination address for a communication channel or dialog to pass data or events back and forth.

For example, a new service request hits a SLEE interface, generating a new call instance.

The SLEE Interface will create a dialog to the relevant SLEE application and the call context data is put into an event and sent (through the dialog) to the SLEE application for processing.

The SLEE application can create new dialogs to other SLEE interfaces or applications for the call instance.

Resource snapshot

The following is an example of typical SLEE resource usage snapshots taken on an SLC at one-second intervals.


/IN/service_packages/SLEE/bin$ check 1 5 
22:24:02    Dialogs  Apps    AppIns        Servs        Events   Calls 
            [70000]  [10]     [101]         [10]       [190242] [25000] 
22:24:02    61753     9        90            0          196170    21874 
22:24:03    61797     9        90            0          196164    21892 
22:24:04    61845     9        90            0          196164    21900 
22:24:05    61841     9        90            0          196170    21900 
22:24:06    61886     9        90            0          196192    21909

Notes:

  • The dialogs, events, and calls values change between snapshots.
  • The values in square brackets, on the second row, indicate the maximum SLEE resource values allocated on SLEE start-up.
  • Note that the maximum events value appears to be lower than the reported events values. This is a known issue: events can be configured with pre-allocated sizes of different values (SLEE.cfg), causing the check output value to be miscalculated. This issue can safely be ignored.

SLEE health

The SLC SLEE resource usage in a healthy, well-functioning, network will fluctuate up and down during the course of a day as the call rates and call mix changes.

Note: The SLC SLEE configuration includes the concept of a maximum call rate. This is the maximum call activity that can be supported and is used as an input in SLEE.cfg to size resource requirements.

During high or peak call volume, more SLEE resources are used (that is, the free SLEE resource values are lower), but they will return towards their maximum values at low call volume times, such as early morning.

The configured amount of SLEE resource is a lot higher than necessary for normal operation to allow the SLC to cope for a period of time with abnormal network behaviour and/or software bugs which can prevent the SLEE resources from being freed up correctly (this situation is often referred to as a call leak).

Normal check output

So what is normal for the SLC check output?

There is no simple answer as every network is different. What is normal for your network can only be found by monitoring the SLC SLEE resource usage over a period of time, taking regular SLEE resource snapshots to create a normal historical trend.

Once a normal pattern of SLEE resource usage has been established, over weeks and months, anything outside this normal trend may indicate an issue with either the Convergent Charging Controller software or, more often than not, the connecting SS7 network itself.

Experience has shown that this usually occurs due to routing failure and/or congestion within the SS7 network elements. The Convergent Charging Controller term we use for leaked voice calls is an AWOL (absent without leave) call.

AWOL calls

An AWOL voice call usually occurs when the SLC does not receive an applyChargingReport (ACR) from the SS7 network and therefore does not know whether the call has ended or not. This will tie up the SLEE resource because, as far as the SLC is concerned, the call never ends.

The CAP (CAMEL Application Part) signalling standard does not cover the scenario of stale calls on the SLC where the SLC does not receive a response from the SS7 network for an in progress call.

It can be argued that when a response from the Network does not arrive, after a certain amount of time since the last response, that the SLC could assume it will never receive any more responses and should therefore abort the call, thereby freeing up the SLEE resource.

Another consideration is that most billing reservations have a defined length of validity and call flow transactions within the SS7 network need to be within this time limit also.

As a defense mechanism to this type of stale call scenario, the Convergent Charging Controller solution has an optional AWOL setting that allows the SLC to clear out stale calls once they pass a configurable period of time. This allows the finite SLEE resources on the SLC to be freed up, if abnormal SS7 network behaviour occurs.

Note: The timeout parameters in the MAP (Mobile Application Part) signaling protocol, used for the Short Message Service (SMS), allow the SLC to clear out stale transactions.

Scarce SLEE resources

As a rule of thumb, the following values should indicate a problem condition that may require urgent investigation:

  • Serious - should be investigated:

SLEE dialogs or calls are less than 25% of their maximum value.

  • Critical - must be urgently investigated:

SLEE dialogs or calls are less than 10% of their maximum value.

SLEE events are around 50% of their maximum value.

Note: New calls can still be made while the free dialog and call resources are greater than 0. The only time new calls cannot be made is when the free dialogs and/or calls are completely exhausted. At this stage the best course of action is a quick SLEE restart to reset the SLEE resources, allowing new calls to be made. If the situation reoccurs after a SLEE restart, review the max call rate setting in case significant additional traffic has been cut over.
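The thresholds above can be checked mechanically. The following is an illustrative sketch only (the slee_level function name and its messages are not part of the product), using resource values from the earlier check snapshot:

```shell
# Classify a free SLEE resource count against its configured maximum
# using the rule-of-thumb thresholds: under 10% free is critical,
# under 25% free is serious.
slee_level() {
    free=$1
    max=$2
    pct=$(( free * 100 / max ))
    if [ "$pct" -lt 10 ]; then
        echo "CRITICAL ($pct% free)"
    elif [ "$pct" -lt 25 ]; then
        echo "SERIOUS ($pct% free)"
    else
        echo "OK ($pct% free)"
    fi
}

slee_level 61753 70000   # dialogs from the earlier SLC snapshot
slee_level 5000 70000    # below 10% of maximum
```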

Warning messages

The Convergent Charging Controller SLEE will generate one warning event, in the /var/adm/messages log file, for each SLEE resource that breaches a threshold of 80% of its maximum value, similar to this example:

Sep 12 01:36:37 telco-p-slc01 watchdog: [ID 953149 user.warning] watchdog(18965) 
WARNING: 20006 Call Instances Locked breaches 80% of total available instances 
(25000) 

If SLEE resources become exhausted, the /var/adm/messages log file will start filling with entries, similar to the following notice message, for each failed call attempt:

Sep 17 03:22:59 telco-p-slc01 xmsTrigger: [ID 167701 user.notice] xmsTrigger(18957) 
NOTICE SleeInterfaceAPI=8006 sleeInterfaceAPI.cc@248: Overload: SLEE is out of call 
instances  

Resource leak

If a SLEE resource leak is caught early then it will be the rate of the call leak that determines the true severity of the issue.

A slow leak over days, weeks, or months may be acceptable, with general platform maintenance clearing leaked calls during SLEE restarts.

A rapid SLEE resource loss over hours, minutes or seconds (depending on current call rates) requires instant action and investigation.

Good call tracing in the signaling network is vital in tracking down where in the network the issue is originating.

Monitoring SLEE resources

The Convergent Charging Controller Support Tools package provides the check-SLEE.sh script that, among other things, can be used for generating a timestamped log file of SLEE resource usage.

It is designed to run from the SLC acs_oper user's crontab file. If the following entry does not exist, add it with the crontab command, as below:

acs_oper@telco-p-slc01$ crontab -e 
...skipped... 
# the following entry is logging internal SLEE resource usage
* * * * * ( . /IN/service_packages/ACS/.profile ; 
/IN/service_packages/SUPPORT/bin/check-SLEE.sh ) >> 
/IN/service_packages/SUPPORT/tmp/check-SLEE.log 2>&1 

Tip: One-minute snapshot intervals are recommended (the * * * * * schedule in the crontab entry above runs the script every minute).

check-SLEE.sh output

This example output is generally what you will see on a well-functioning healthy network with SLEE resources adjusting up and down by small margins over the course of a minute.

The output of the check-SLEE.sh script is written to the check-Time_Dialogs_Events_Calls_CAPS.log file in the /IN/service_packages/SUPPORT/tmp directory. You can view this log with the ckslee command, as follows:

$ ckslee 
Date-Time Diags Events Calls[Change]      CAPS 
20101125-03:43:00 86889 253630 43287[+0.04%]        5  
20101125-03:44:01 86904 253630 43300[+0.05%]        5  
20101125-03:45:00 86905 253631 43283[-0.07%]        5  
20101125-03:46:00 86911 253627 43297[+0.06%]        6  
20101125-03:47:00 86924 253632 43296[-0.00%]        5  
20101125-03:48:00 86870 253629 43264[-0.13%]        5  
20101125-03:49:00 86926 253631 43296[+0.13%]        5  
20101125-03:50:00 86945 253630 43306[+0.04%]        4  
20101125-03:51:00 86908 253627 43293[-0.05%]        4  
20101125-03:52:00 86938 253628 43302[+0.04%]        5  

Notes:

The [Change] value is the percentage difference from the previous Calls value and is useful for highlighting a sudden change in resource usage.

The CAPS (call attempts per second) column is an extra that requires the SIGTRAN stack(s) to be configured to output their average CAPS rate. If they are not configured, this column shows N/A. See Convergent Charging Controller SIGTRAN Technical Guide and the monitorperiod, rejectlevel, reportperiod, and displaymonitors parameters to enable CAPS reporting.
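The [Change] calculation can be reproduced with a short awk one-liner. This is a sketch of the arithmetic only, not the actual check-SLEE.sh code:

```shell
# Compute the percentage change between successive Calls values,
# one value per input line, in the style of the [Change] column.
printf '43287\n43300\n43283\n' | awk '
    NR > 1 { printf "%d[%+.2f%%]\n", $1, ($1 - prev) * 100 / prev }
    { prev = $1 }
'
```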

check-SLEE.sh archiving

Problems with SLEE resources will usually be found when the solution is first installed, or when the underlying signaling network has changes made to it.

This is when having a historical log of SLEE resource usage over time is a great tool in determining the trend, or a trigger point to an issue occurring.

To archive the check-Time_Dialogs_Events_Calls_CAPS.log file, add the following line to the /IN/service_packages/ACS/etc/logjob.conf file.

log /IN/service_packages/SUPPORT/tmp/check-Time_Dialogs_Events_Calls_CAPS.log age 
240 size 1M arcdir /IN/service_packages/SUPPORT/tmp/archive 

check-SLEE.sh usage

The head of the check-SLEE.sh script is extensively commented and explains some other ways the script can be used, such as automatic SLEE restarts if SLEE resources go below a defined threshold.

VWS SLEE resources

As all the VWS processes are SLEE interfaces, the SLEE resource usage pattern is different from that of the SLC.

On SLEE start-up, dialogs are created between the different SLEE interfaces for events to be passed back and forth. However, the operations on the VWS are purely transactional and the concept of a call instance is not necessary. Therefore only the events value ever changes, with the dialogs and calls values remaining static, as shown here:

$ slee-ctrl check 1 5 
SLEE Control: v,1.0.10: script v,1.19: functions v,1.46: pslist v,1.118 
 
[ebe_oper] slee-ctrl> check 1 5 
SLEE: Using shared memory offset: 0xc0000000 
03:05:06        Dialogs  Apps    AppIns     Servs      Events   Calls 
               [1000]    [10]     [100]      [10]      [71250]  [500] 
03:05:06        964       10       100        10        71252   500 
03:05:07        964       10       100        10        71253   500 
03:05:08        964       10       100        10        71255   500 
03:05:09        964       10       100        10        71254   500 
03:05:10        964       10       100        10        71255   500

Note:

You can use the slee-ctrl command to call the check program. Unless advised otherwise, monitoring VWS SLEE resource usage is of little benefit.

Rolling Snoop Archives

The Convergent Charging Controller solution sits between the SS7 signaling network and LAN with Network Connectivity Agents (NCA) providing interfaces into the SLEE to interpret, convert and retransmit various network protocols, such as SIGTRAN (SS7 over IP), DIAMETER, SOAP/XML, SMPP and internal messaging protocols.

Being able to capture the reality of what is actually being sent over the wire is a vital tool when analyzing NCA related issues to help identify the source of the problem, be it local or remote.

Running the tshark/snoop utility allows you to trace and capture incoming and outgoing traffic passing through a server's Network Interface Card (NIC) devices.

To aid in the tracing of network devices and the retention of their snoop capture files, the Support Tools package provides what have become known as the rolling snoop scripts.

On Linux, the tcpdump command is used to trace the traffic passing through the server's NIC devices.
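For example, tcpdump has built-in rotation options that give a similar rolling-capture effect to the scripts described below; the interface name, filter, and output path here are illustrative only, and the command must be run as root:

```shell
# Rotate the capture file every 100 MB (-C, in millions of bytes)
# and keep at most 24 files (-W), overwriting the oldest.
# eth1 and the sctp filter are example values only.
tcpdump -i eth1 -C 100 -W 24 \
    -w /IN/service_packages/SUPPORT/tmp/sig-capture.pcap \
    sctp port 14000
```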

Scripts

You can find the latest version of the rolling snoop scripts in the latest Support Tools package; they are installed in the /IN/service_packages/SUPPORT/bin directory.

The rolling-snoop.sh and start-rolling-snoop.sh scripts are heavily commented and it is advisable to read these to get a better understanding of the scripts and their configuration.

For ease of use and convenience it is recommended that the start-rolling-snoop.sh and stop-rolling-snoop.sh wrapper scripts are used to control the rolling-snoop.sh script even though it may be run independently.

rolling-snoop.sh

This is the script that runs the snoop command.

It is responsible for checking disk space and for archiving capture files before exiting.

The command line arguments let you specify a capture file name prefix and pass in valid snoop command line expressions to filter packets on, such as IP traffic that uses either the sctp or tcp protocol, on port x or host y (use man snoop for more details of valid expressions).

Usage:

rolling-snoop.sh [capture file prefix:]network_card_interface_device_name 
[snoop options] 

For example:

rolling-snoop.sh e1000g2 
rolling-snoop.sh uas01-sig-pri:e1000g2 sctp port 14000 
rolling-snoop.sh usms2-chrg-sec:nxge1 -c 1000000 tcp port 3989 
rolling-snoop.sh usms01a-mgmt-pi:e1000g0 not icmp port 2999 or port 3000 

Note:

This script must be run as the root super-user.

start-rolling-snoop.sh

This is a wrapper script where you configure all the network devices you want to snoop, which are then passed as command line parameters to the rolling-snoop.sh script.

Each configured NIC starts a rolling-snoop.sh script, which in turn starts and controls the snoop command. The example section below is where you define the network device(s) to trace.

/IN/service_packages/SUPPORT/bin$ vi  start-rolling-snoop.sh 
...skipped... 
cat << CONFIG_END |sed '/^ *#/d; s/^ *//; s/\(.*\)\(#.*\)/\1/; /^ *$/d' > $TEMP_FILE 
### CONFIGURE NETWORK DEVICES HERE ### 
# configuration examples below: 
# e1000g1 # comments ignored after hash 
# ${HOST}-sig-pri:e1000g1 
# ${HOST}-sig-sec:nxge1 $PACK_COUNT sctp port 14000 
# uas01-chrg-pri:e1000g0 -c 250000 tcp host charging-gw port 3989 
CONFIG_END 
... 

stop-rolling-snoop.sh

When stop-rolling-snoop.sh is run, all snoop processes are terminated (including any independently started snoop commands), quickly followed by termination of the rolling-snoop.sh scripts.

Stopping the rolling-snoop.sh script is what triggers archiving of the current capture files, so by regularly stopping and starting it you can easily build an archived repository of snooped traffic.

snoop_archiver.sh

This is a wrapper script to run the start-rolling-snoop.sh and stop-rolling-snoop.sh scripts and manage the removal of old archived capture files.

This script can be configured as an hourly root crontab job, thereby creating an archive repository of capture files in hourly time-stamped directories.

Here is an example snoop_archiver.sh script:

/IN/service_packages/SUPPORT/bin$ cat snoop_archiver.sh 
#!/bin/ksh 
####################################################################### 
# Revision: : snoop_archiver.sh,v 1.4 2010/01/05 01:55:10 gcato Exp $ 
# 
# script to regularly archive rolling snoops files 
# see rolling_snoop.sh KEEP_SECONDS variable to define archiving frequency 
# (usually 60 minutes intervals)
# 
# run from root crontab 
# 59 * * * * /IN/service_packages/SUPPORT/bin/snoop_archiver.sh >/dev/null 2>&1 
# 
####################################################################### 
# how long to keep archived snoop files before deleting 
KEEP_DAYS=3 
# snoops archive directory 
SNOOP_DIR=/IN/service_packages/SUPPORT/snoops/archive 
# snoop directory suffix (usually TZ variable value) 
SUFFIX=GMT 
# stop all snoops (this will automatically archive the files) 
/IN/service_packages/SUPPORT/bin/stop-rolling-snoop.sh 
sleep 1 
# restart the rolling snoops 
/IN/service_packages/SUPPORT/bin/start-rolling-snoop.sh 
# delete old archived snoop dirs 
find ${SNOOP_DIR} -type d -name \*${SUFFIX} -mtime +${KEEP_DAYS} |xargs rm -rf 
# compress snoop files 
find ${SNOOP_DIR}/*${SUFFIX}/ \( -name \*snoop -a ! -name \*gz \) |xargs nice gzip 2>/dev/null 

Default directory

The default output directory is /IN/service_packages/SUPPORT/snoops/(current|archive). The current directory contains the capture files currently being written, and the archive directory contains time-stamped directories holding the saved capture files.

Analyzing the Capture Files

A search of the web will provide a list of different protocol analyzer products, such as Wireshark, that can be used to view the capture file data. Please see the documentation of your protocol analyzer of choice for more information on interpreting the output.

Rolling Snoop Risks

Rolling snoop carries the potential problems inherent in any data capture tool. This topic covers the major issues to look out for.

Missing packets

Watch out for the routing of packets through secondary, or fail-over, NIC devices, which are configured in a multipathing group. You will need to snoop both network interfaces to capture all incoming and outgoing traffic.

# man ifconfig 
...skipped... 
MULTIPATHING GROUPS 
Physical interfaces that share the same IP broadcast domain 
can be collected into a multipathing group using the group 
keyword. Interfaces assigned to the same multipathing group 
are  treated  as  equivalent  and outgoing traffic is spread 
across the interfaces on a per-IP-destination basis.  In 
addition, individual interfaces in a multipathing group are 
monitored for failures; the addresses associated with failed 
interfaces  are automatically transferred to other function- 
ing interfaces within the group. 
For more details on IP multipathing, see in.mpathd(1M) and 
...<snip>... 

Basically, just because a packet came in on one network interface does not mean it will go out on the same interface. To identify multipathed interfaces, use the ifconfig -a command and look for network interfaces configured with the same groupname (if any).

Some monitoring and testing will usually show you which interfaces you need to monitor to catch all the traffic you want.
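The groupname check above can be sketched as follows. This is a self-contained illustration that scans canned ifconfig-style output; the interface and group names are hypothetical sample data.

```shell
# Self-contained sketch: find interfaces sharing a multipathing group by
# scanning ifconfig-style output for the "groupname" keyword.
# Interface and group names below are hypothetical sample data.
sample_output='e1000g1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500
        groupname sig-grp
nxge1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500
        groupname sig-grp'
printf '%s\n' "$sample_output" |
  awk '/^[a-z]/ {iface = $1} /groupname/ {print iface, $2}'
# On a live Solaris system, replace the sample data with: ifconfig -a
```

Both interfaces print with the same group, telling you that both must be snooped to capture all the traffic.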

The crontab-configured stop and restart also leaves a small window during which packets are missed.

Disk space

The rolling-snoop.sh script has a MAX_DISK_PERCENTAGE variable (default 75%) and will not run if disk space usage on the output capture file partition exceeds this threshold (checked only at start-up and on each subsequent stop/start of the snoop command).

This prevents the capture files from consuming so much disk space that Event Data Records and process log files can no longer be created.

Warning: Change with extreme caution.
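The threshold check can be sketched as follows. This is an illustrative reimplementation of the idea, not the actual rolling-snoop.sh code, and the df output is canned sample data so the example is self-contained.

```shell
# Illustrative sketch of the MAX_DISK_PERCENTAGE guard (not the actual
# rolling-snoop.sh code). df output is simulated here; on a live system
# use: df -k /IN/service_packages/SUPPORT
MAX_DISK_PERCENTAGE=75
df_output='Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t0d0s5    1000000  800000  200000      80%  /IN'
usage=$(printf '%s\n' "$df_output" | awk 'NR == 2 {sub(/%/, "", $5); print $5}')
if [ "$usage" -gt "$MAX_DISK_PERCENTAGE" ]; then
    echo "Disk usage ${usage}% exceeds ${MAX_DISK_PERCENTAGE}% - snoop not started"
fi
```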

If disk space is limited, you can either reduce the KEEP_DAYS variable in the snoop_archiver.sh script, or soft-link the /IN/service_packages/SUPPORT/snoops directory to a disk with spare capacity. For example:

# mkdir /volA/snoops 
# rm -r /IN/service_packages/SUPPORT/snoops 
# ln -s /volA/snoops /IN/service_packages/SUPPORT/snoops

Capture file size

The snoop command does not have a max duration option.

Do not confuse the KEEP_SECONDS variable in the rolling-snoop.sh script with how long the snoop command actually runs for.

The MAX_PACKET_COUNT variable (default 100000 packets), also configured in the rolling-snoop.sh script, sets the limit to how big a capture file will grow to before a new capture file is started. If there is a lot of traffic on an interface, you may want to decrease this value to keep the capture file to a manageable size.

It is recommended that this is set inside the configurable section of the start-rolling-snoop.sh script by defining a -c option. Further filtering options, at the per-device level, can also help keep the capture file size manageable (read the script's comments for more details).

Depending on the amount of IP traffic, you may want to increase or decrease the frequency that the snoop_archiver.sh runs in the crontab.

If you increase the interval to more than 60 minutes, you must also increase the KEEP_SECONDS variable (default 3600) in the rolling-snoop.sh script; otherwise, when rolling-snoop.sh is stopped, capture files older than 60 minutes will be rolled over.
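For example, if the archiver cron job runs every two hours, KEEP_SECONDS must cover at least that interval. The values below sketch a hypothetical two-hourly setup (illustrative, not from the shipped scripts):

```shell
# Hypothetical two-hourly archiving setup:
# root crontab entry:
#   59 */2 * * * /IN/service_packages/SUPPORT/bin/snoop_archiver.sh >/dev/null 2>&1
# and in rolling-snoop.sh, KEEP_SECONDS must cover the longer interval:
#   KEEP_SECONDS=7200
echo $((2 * 60 * 60))   # seconds in a 2-hour interval
```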

Warning: Leaving the MAX_PACKET_COUNT variable unset, or setting it to a huge value, increases the risk of a snoop capture file filling the output disk partition to 100%.

Using External Tools for Monitoring

This topic describes how to monitor Oracle Communications Convergent Charging Controller (CCC) using external monitoring tools. You can configure them to provide a real-time operational view of CCC and to monitor the status of all three components (SMS, SLC, and VWS).

In this topic, open source tools such as Pushgateway, Prometheus, and Grafana are used as examples. However, you are not restricted to these tools; you can use any third-party tool that supports the metric output.

Prometheus collects and stores the metric data in a time-series database, and Grafana is used for graphical visualization.

Prometheus collects the following application metrics:

  • CPU Utilization
  • Memory Utilization
  • SLEE Resource Usage
  • Process-wise Memory Utilization

Architecture Diagram

(Figure: architecture diagram showing the monitoring scripts posting metrics to Pushgateway, Prometheus scraping Pushgateway, and Grafana visualizing the Prometheus data.)

The functions of the various components are described below:

  • Monitoring Scripts: Collect metrics, transform, and post to Pushgateway.
  • Pushgateway: Serves as scraping target to Prometheus.
  • Prometheus: Stores the metric data in time-series database.
  • Grafana: Uses Prometheus data for graphical visualization and alerts.

Monitoring Scripts

For collecting metric data, transforming them into Prometheus format, and posting them to Pushgateway, the following scripts are used:

  • start_system_monitor.py
  • stop_system_monitor.py
  • start_service_monitor.py
  • stop_service_monitor.py
  • start_memory_monitor.py
  • stop_memory_monitor.py
  • start_SLEE_resource_monitor.py
  • stop_SLEE_resource_monitor.py

All the monitoring scripts are available in the /IN/service_packages/MONITORING/bin directory.

Use the following command to run the start scripts in the background on all three application nodes (SMS, SLC, and VWS):

nohup <start_script_name> &

start_system_monitor.py

This script collects overall CPU and memory usage metrics.

The Prometheus format metric generated with this script is as follows:

['<metric_name>{resource="CPU Usage"} <value in percentage>\n',
'<metric_name>{resource="Total Physical Memory"} <value in megabytes>\n',
'<metric_name>{resource="Free Memory"} <value in megabytes>\n']

Example


['ocncc_SLC_system_details{resource="CPU Usage"} 72.7\n', 
'ocncc_SLC_system_details{resource="Total Physical Memory"} 14761\n', 
'ocncc_SLC_system_details{resource="Free Memory"} 12786\n']

stop_system_monitor.py

This script is used to stop the system monitoring. The start_system_monitor.py script writes its pid into a file at /IN/service_packages/MONITORING/tmp, which is used by this script to stop the start_system_monitor.py script.
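The start/stop pairing used by all the monitor script pairs follows a simple pid-file pattern, sketched below. The file name here is a placeholder; the actual file name written under /IN/service_packages/MONITORING/tmp is not documented in this chapter.

```shell
# Shell sketch of the pid-file start/stop pattern (placeholder file name).
PID_FILE=/tmp/start_system_monitor.pid
echo $$ > "$PID_FILE"                     # the start script records its own pid
pid=$(cat "$PID_FILE")                    # the stop script reads the pid back...
echo "stop script would run: kill $pid"   # ...and signals that process
rm -f "$PID_FILE"
```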

start_service_monitor.py

This script collects the details about running smf_oper processes in the node.

The Prometheus format metric generated with this script is as follows:


['<metric_name>{process=<process1>} <number of instances of process1 currently 
running>\n', '<metric_name>{process=<process2>} <number of instances of process2 
currently running>\n', ..... \n'] 

Example


['ocncc_SLC_service_status{process="slee_acs"} 0\n', 
'ocncc_SLC_service_status{process="xmsTrigger"} 0\n', 
'ocncc_SLC_service_status{process="diameterControlAgent"} 0\n', 
'ocncc_SLC_service_status{process="diameterBeClient"} 0\n', 
'ocncc_SLC_service_status{process="capgw"} 0\n', 
'ocncc_SLC_service_status{process="ussdgw"} 0\n']

stop_service_monitor.py

This script is used to stop the service monitoring. The start_service_monitor.py script writes its pid into a file at /IN/service_packages/MONITORING/tmp, which is used by this script to stop the start_service_monitor.py script.

start_memory_monitor.py

This script collects the memory usage of smf_oper processes in the node.

The Prometheus format metric generated with this script is as follows:

['<metric_name>{process=<process1>} <process1 memory usage in 
kilobytes>\n', '<metric_name>{process=<processX>} <processX memory usage in 
kilobytes>\n', ..... ] 

When no processes are running, then the Prometheus format metric generated is as follows:

['<metric_name>{process="Service Down"} 0\n'] 

Example


['ocncc_SLC_service_memory_usage{process="slee_acs.14179"} 201044\n', 
'ocncc_SLC_service_memory_usage{process="ussdgw.14182"} 35044\n', 
'ocncc_SLC_service_memory_usage{process="capgw.14184"} 20592\n', 
'ocncc_SLC_service_memory_usage{process="xmsTrigger.14196"} 58184\n', 
'ocncc_SLC_service_memory_usage{process="diameterControlAgent.14213"} 23052\n', 
'ocncc_SLC_service_memory_usage{process="diameterBeClient.14218"} 20724\n'] 

stop_memory_monitor.py

This script is used to stop the memory monitoring. The start_memory_monitor.py script writes its pid into a file at /IN/service_packages/MONITORING/tmp, which is used by this script to stop the start_memory_monitor.py script.

start_SLEE_resource_monitor.py

This script collects the following SLEE resources in the system:

  • Max Dialogs
  • Used Dialogs
  • Max Events
  • Used Events
  • Max Calls
  • Used Calls

The Prometheus format metric generated with this script is as follows:


['<metric_name>{resource="Max Dialogs", } <number of total dialogs at max>\n', 
'<metric_name>{resource="Used Dialogs", } <number of dialogs in use currently>\n', 
'<metric_name>{resource="Max Events", } <number of total events at max>\n', 
'<metric_name>{resource="Used Events", } <number of events in use currently>\n', 
'<metric_name>{resource="Max Calls", } <number of total calls at max>\n', 
'<metric_name>{resource="Used Calls", } <number of calls in use currently>\n']

Example

['ocncc_SLC_resources{resource="Max Dialogs", } 70000\n', 
'ocncc_SLC_resources{resource="Used Dialogs", } 0\n', 
'ocncc_SLC_resources{resource="Max Events", } 196208\n', 
'ocncc_SLC_resources{resource="Used Events", } 22\n', 
'ocncc_SLC_resources{resource="Max Calls", } 25000\n', 
'ocncc_SLC_resources{resource="Used Calls", } 0\n'] 
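As a usage illustration (not from the product itself), a Grafana panel could derive dialog utilization from these samples with a PromQL expression such as the following, using the metric name from the example above:

```
100 * ocncc_SLC_resources{resource="Used Dialogs"}
    / ignoring(resource) ocncc_SLC_resources{resource="Max Dialogs"}
```

The ignoring(resource) modifier lets the two single-series vectors match even though their resource label values differ.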

stop_SLEE_resource_monitor.py

This script is used to stop the resource monitoring. The start_SLEE_resource_monitor.py script writes its pid into a file at /IN/service_packages/MONITORING/tmp, which is used by this script to stop the start_SLEE_resource_monitor.py script.

Note:

All the monitoring scripts are written in, and require, Python 3.

Configuring Monitoring Scripts

You can configure the properties of the monitoring scripts through the configurations.yml file. Keep this file in the /IN/service_packages/MONITORING/etc directory. Sample configuration files are provided in the same directory. Configuration file entries follow standard YAML notation.

The following parameters are configured through the configurations.yml file:

Note:

All the parameters are mandatory.

pushgateway: This section is used to specify the server details hosting the Pushgateway.

  • host: Fully qualified Pushgateway hostname.
  • port: Listening port for the Pushgateway server.
  • protocol: Protocol used for communication with the Pushgateway server (http/https).

prometheus: This section provides the details of the metric that would be sent out. It has the following four sub-sections:

  • service_monitoring: CCC processes metric detail. This section describes the properties of the metric that is used to determine how many instances of the monitored services (processes) are running in the system.
  • memory_monitoring: CCC process memory metric detail. This section configures the properties of the metric that reports the memory usage of individual CCC processes.
  • system_monitoring: CCC node CPU and physical memory usage metric detail.
  • SLEE_resource_monitoring: CCC SLEE resource usage. This is only used in SLC and VWS nodes.

All the above sub-sections have the following parameters:

  • ocncc_status_metric_name: Unique metric name. Metric names across all the nodes must be different.
  • scrape_interval: Metric collection interval in seconds.
  • logging: Whether to log metric collection output:
    • 0 - Disable logging
    • 1 - Enable logging

Log files are available in the /IN/service_packages/MONITORING/tmp directory.

ocncc-services: This section lists the CCC processes to be monitored. It is used by the following sub-sections:

  • service_monitoring
  • memory_monitoring
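Putting the parameters above together, a configurations.yml for an SLC node might look like the following sketch. The metric names are taken from the examples earlier in this chapter, and all other values are illustrative assumptions; check the sample files shipped in /IN/service_packages/MONITORING/etc before relying on any field.

```yaml
# Hypothetical configurations.yml sketch for an SLC node (illustrative values).
pushgateway:
  host: pushgateway.example.com     # placeholder hostname
  port: 9091
  protocol: http
prometheus:
  service_monitoring:
    ocncc_status_metric_name: ocncc_SLC_service_status
    scrape_interval: 30
    logging: 0
  memory_monitoring:
    ocncc_status_metric_name: ocncc_SLC_service_memory_usage
    scrape_interval: 30
    logging: 0
  system_monitoring:
    ocncc_status_metric_name: ocncc_SLC_system_details
    scrape_interval: 30
    logging: 0
  SLEE_resource_monitoring:
    ocncc_status_metric_name: ocncc_SLC_resources
    scrape_interval: 30
    logging: 1
ocncc-services:
  - slee_acs
  - ussdgw
```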

Setting up Pushgateway

Pushgateway serves as the target for Prometheus to scrape. It listens on an HTTP port. All the monitoring scripts push their metric data to Pushgateway.

Installing Pushgateway

To download Pushgateway, visit https://prometheus.io/download/#pushgateway.

Note: Pushgateway can also be set up as a service or init job to run at system start. By default, Pushgateway listens on HTTP port 9091.

Integrating Pushgateway with Monitoring Scripts

To integrate the monitoring scripts so that they send metric data to Pushgateway, configure the Pushgateway hostname and port details under the pushgateway section in the configurations.yml file.
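Conceptually, each push is an HTTP POST of Prometheus text-format samples to Pushgateway's standard push endpoint (/metrics/job/&lt;job&gt;). The sketch below builds a sample line and prints it; the hostname and job name are placeholders, and the curl line is commented out because it needs a reachable Pushgateway.

```shell
# Hedged sketch of a single metric push (placeholder host and job name).
PUSHGW_URL="http://pushgateway.example.com:9091/metrics/job/ocncc"
METRIC='ocncc_SLC_system_details{resource="CPU Usage"} 72.7'
echo "$METRIC"
# To actually push (requires a reachable Pushgateway):
# echo "$METRIC" | curl --data-binary @- "$PUSHGW_URL"
```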

Setting up Prometheus

To download Prometheus, visit https://prometheus.io/download/

For instructions on how to install Prometheus, visit Prometheus website.

Note: A sample prometheus.yml config file is available on the SMS node in the /IN/service_packages/MONITORING/etc/Prometheus-Grafana-Samples directory.

Accessing Prometheus Web UI

You can access the Prometheus UI on port 9090 (the default) of the Prometheus server:

http://<prometheus-ip>:9090/graph

Checking Targets in Prometheus UI

To check targets, access Prometheus UI and click Status > Targets.

It shows the targets Prometheus is scraping (based on scrape_configs in the prometheus.yml file).
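A minimal scrape_configs entry pointing Prometheus at the Pushgateway might look like the following sketch. The hostname is a placeholder; honor_labels preserves the labels pushed by the monitoring scripts instead of overwriting them with the scrape target's own labels.

```yaml
# Hypothetical prometheus.yml fragment (placeholder hostname).
scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.example.com:9091']
```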

Setting up Grafana

To download and install Grafana, visit https://grafana.com/.

Configuring Grafana

Default configurations are sufficient for basic use. However, if you need to change any specific parameter, refer to the configuration details at https://grafana.com/docs/grafana/latest/administration/configuration/.

Accessing Grafana Web UI

You can access Grafana web UI using the following URL:

http://hostname:3000

where hostname is the name of the host where Grafana is running.

By default, Grafana listens on port 3000. Details about the default credentials are available in the official Grafana documentation.

Creating Data Source

For instructions on creating data source, visit https://grafana.com/.

Creating Dashboards

You can configure Grafana dashboards to visualize and monitor the metrics from the data source. For instructions on creating dashboards, visit https://grafana.com/.

Note: Sample JSON files for dashboards are available on the SMS node in the /IN/service_packages/MONITORING/etc/Prometheus-Grafana-Samples directory.

You can import and edit them as required.

Configuring Alerts in Grafana

You can configure alerts in Grafana for all the metrics being exported from CCC.

For information about creating alerts, visit https://grafana.com/docs/grafana/latest/alerting/.

Monitoring SIGTRAN Traffic with Prometheus and Grafana

You can monitor SIGTRAN traffic in real time using Prometheus and Grafana. Dashboards display current CAP messages, transactions per second (TPS), and transaction latency, providing operational health and performance tracking.

The SIGTRAN application collects metrics and pushes them to a configured Prometheus PushGateway endpoint.

Prometheus scrapes this data at regular intervals, and Grafana visualizes these metrics. Metrics remain available even if the application or node restarts.

The following metrics are collected:

Table 3-1 Collected Metrics

<moduleName>_caps_total
  Type: Counter
  Description: Total number of calls received.
  Labels: direction="ingress", event="CAPs", node, process
  Details/Buckets: Increments on each CAP event.

<moduleName>_tps_total
  Type: Counter
  Description: Total number of transactions received.
  Labels: direction="ingress", event="TPS", node, process
  Details/Buckets: Increments on each TPS event.

<moduleName>_milliseconds
  Type: Histogram
  Description: End-to-end latency of CAP transactions.
  Labels: direction="ingress", event="latency", node, process
  Details/Buckets: 1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000 milliseconds.

How to Enable Monitoring

To enable SIGTRAN monitoring, follow these steps:

Note:

It is assumed that the Pushgateway host and port are already configured.
  1. Open the ESERV configuration file used by the m3ua process for editing.
  2. In the METRICS section, add or update a block for your subsystem (e.g., M3UA). Set all the required parameters as shown in the example below.
    ```
    METRICS = {
        pushGatewayHost = "YOUR_PUSHGATEWAY_HOSTNAME"  # Required: Hostname or IP for PushGateway
        pushGatewayPort = "9091"                       # Required: PushGateway port
        M3UA = {
            moduleName = "sigtran"                     # Required: Metric name prefix
            pushGatewayJob = "sigtran_metrics"         # Required: Job label/group in Prometheus
            nodeLabel = "slc"                          # Required: Node/site/instance identifier label
            processLabel = "m3ua_if"                   # Required: Process/service identifier label
            enableMetrics = "true"                     # Set "true" to enable monitoring for this process/subsystem (default: false)
            latencyBuckets = "150,200,500"             # Optional: Histogram buckets in ms (default: 1,5,10,20,50,100,200,500,1000,2000,5000)
            pushInterval = 1000                        # Optional: Push interval in ms (default: 5000)
        }
    }
    ```
  3. Save the configuration file.
  4. Restart your application or the relevant subsystem for changes to take effect.

How to Disable Metric Collection

To disable metric collection for a subsystem, either:
  • Set enableMetrics = "false", or
  • Omit the enableMetrics parameter entirely.

No metrics will be collected or pushed for that subsystem. The application log confirms metric reporting is disabled.

Parameter Requirements and Effects

  • Metrics will only be collected and pushed if:
    • enableMetrics is set to true, and
    • All other required parameters (pushGatewayHost, pushGatewayPort, moduleName, pushGatewayJob, nodeLabel, processLabel) are present and valid.
  • If any required parameter is missing/invalid, or if enableMetrics is set to false or omitted, no metrics will be collected or pushed for that process/subsystem; the application logs will state the reason.
  • Optional parameters (latencyBuckets, pushInterval) will use default values if not specified.
  • After making configuration changes, restart the application/process to apply the new monitoring settings.

Troubleshooting

If you do not see metrics in Prometheus or Grafana:
  • Confirm all required parameters are set and correct.
  • Ensure the PushGateway is reachable from the application node.
  • Check application logs for any warnings or errors about metrics export or configuration.

Example Dashboard Queries

  • Total CAPs:

    sum(sigtran_caps_total{direction="ingress"})

  • Total TPS:

    sum(sigtran_tps_total{direction="ingress"})

  • 99th percentile Latency:

    histogram_quantile(0.99, sum(rate(sigtran_latency_milliseconds_bucket[5m])) by (le))

  • 1-min average Latency:

    rate(sigtran_latency_milliseconds_sum[1m]) / rate(sigtran_latency_milliseconds_count[1m])

Using External Tools for Logging

You can use log aggregators and data visualization dashboards for logging services in Convergent Charging Controller (CCC). For example, you can use an open source tool such as Fluentd for collecting, transforming, and shipping log data to the log aggregator. Fluentd runs as a data collector on the nodes to tail the log files, filter and transform the log data, and deliver it to the log aggregator, where it is indexed and stored. You can use any log aggregator and visualization dashboard to monitor logs.

This setup helps you to have a real-time operational view of CCC. You can integrate this with the application to visualize all the application logs through the data visualization dashboard. You can also configure it to notify or trigger alerts whenever a specified error message occurs in the logs.

Architecture Diagram

(Figure: architecture diagram showing Fluentd collecting logs on the CCC nodes and shipping them to the log aggregator and data visualization dashboard.)

Installing Fluentd

If you are using Fluentd as the data collector, perform the following steps after the installation:

  1. Install fluent-plugin-concat for merging multiline log events. To install the plugin, run the following command:
    /usr/sbin/td-agent-gem install fluent-plugin-concat 
  2. To add td-agent to the esg group, run the following command as the root user:
    usermod -G esg td-agent
  3. To check that td-agent is part of the esg group, run the following command:
    id td-agent
  4. Create three different indexes for the three machines (SMS, SLC, and VWS) to display the logs in their respective index. To create a new index, run the following commands on the host where the log aggregator and the data visualization tools are running:
    curl -XPUT http://localhost:9200/sms_index 
    curl -XGET http://localhost:9200/sms_index 
    curl -XPUT http://localhost:9200/slc_index 
    curl -XGET http://localhost:9200/slc_index 
    curl -XPUT http://localhost:9200/vws_index 
    curl -XGET http://localhost:9200/vws_index 
  5. To delete your index, run the following command:
    curl -XDELETE http://localhost:9200/sms_index
  6. To display the index in a more structured view, run the following command:
    curl -XGET http://localhost:9200/sms_index?pretty 
  7. Edit the /etc/td-agent/td-agent.conf file to define the required Fluentd plugins and patterns, so that Fluentd collects the logs from the source path and ships them to the log aggregator.

    Sample td-agent.conf file path in SMS: /IN/service_packages/MONITORING/etc/

Note:

After editing the td-agent.conf file, restart td-agent.service for the changes to take effect.
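For orientation, a minimal td-agent.conf might look like the following sketch: a tail source, a concat filter for multiline entries, and an output to the aggregator index created in step 4. All paths, tags, the multiline regular expression, and the output host are hypothetical placeholders; use the sample file shipped on the SMS node as your starting point.

```
# Hypothetical td-agent.conf fragment (placeholder paths, tags, and host).
<source>
  @type tail
  path /IN/service_packages/SMS/tmp/*.log      # placeholder log path
  pos_file /var/log/td-agent/ccc-sms.pos
  tag ccc.sms
  <parse>
    @type none
  </parse>
</source>

<filter ccc.**>
  @type concat                                 # fluent-plugin-concat
  key message
  multiline_start_regexp /^\w{3}\s+\d+/        # placeholder start-of-entry pattern
</filter>

<match ccc.**>
  @type elasticsearch
  host localhost                               # aggregator host
  port 9200
  index_name sms_index
</match>
```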