Fault Management

Unified Assurance’s Fault Management platform aggregates and correlates fault and event information. The platform includes many applications that can receive, process, and enrich events in any format, from virtually any device. The out-of-the-box de-duplication and the highly customizable rules engine allow you to easily tailor how Unified Assurance handles events. Fault management applications actively process events to isolate and pinpoint the underlying cause of a problem. Advanced correlation tools are also available and can use topological information to correlate events.

The following section covers event monitoring and management in Unified Assurance.

Objectives

Fault Management Sample

This procedure shows you how to start the Trapd and Syslog aggregators to receive SNMP traps and syslog messages. It also creates an event filter and an event tool, and views the event rules using the Rules interface.

Event Collection - Aggregators

  1. Navigate to Services.

    Configuration -> Broker Control -> Services

  2. Click the Trapd Aggregator service. This opens the Service (Edit) form on the right.

    1. The Trapd Aggregator is a generic SNMP Trap message listener (listening on port 162 by default) that receives messages from devices, parses the results with customizable rules and creates de-duplicated events within Unified Assurance.
  3. In the form, change the Status to Enabled and click Submit to save the changes.

  4. Select the Trapd Aggregator service, and click Start to start the service.

    1. Note: Depending on the server specifications and loads, the number of Threads that the aggregator uses may need to be increased. A good starting point for threads is 3 * (# of CPU Cores).

    2. Note: By default, devices send SNMP Trap Messages on UDP port 162. If messages are not reaching the Unified Assurance system, check the firewall settings of the network.

  5. Using the same method outlined in the steps above, enable and start the Syslog Aggregator service.

    1. The Syslog Aggregator is a generic syslog message listener that receives messages from devices, parses the results with customizable rules and creates de-duplicated events within Unified Assurance.
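Once both aggregators are running, it can help to send a test message and confirm that an event appears. The following is a minimal Python sketch, not part of Unified Assurance: `suggested_threads` applies the 3 * (# of CPU Cores) starting point from the note above, and `build_syslog_message` builds a simplified BSD-style syslog line (timestamp omitted). The function names, the example host name, and UDP port 514 (the conventional syslog port, not stated in this document) are assumptions.

```python
import socket


def suggested_threads(cpu_cores: int) -> int:
    """Starting point from the note above: 3 threads per CPU core."""
    return 3 * cpu_cores


def build_syslog_message(facility: int, severity: int, host: str, tag: str, text: str) -> bytes:
    """Build a simplified syslog line of the form <PRI>HOST TAG: TEXT."""
    priority = facility * 8 + severity  # PRI value = facility * 8 + severity
    return f"<{priority}>{host} {tag}: {text}".encode("ascii")


def send_test_syslog(server: str, payload: bytes, port: int = 514) -> int:
    """Send one UDP datagram to the aggregator; returns the number of bytes sent.

    Port 514 is an assumption; use the port your Syslog Aggregator listens on.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (server, port))


# Example usage (replace the address with your Unified Assurance server):
# send_test_syslog("192.0.2.10", build_syslog_message(1, 3, "router1", "test", "link down"))
```

Because UDP is connectionless, a successful send does not prove that the message arrived. Check the Event List, and the firewall settings if nothing shows up.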


Event List

Once started, the Trapd and Syslog Aggregator services will listen on their respective ports for incoming events. Incoming events are saved to the Events database and are available in the Events interface.

  1. In the Events navigation, expand the Filters by Group: Global folder and select the All Events filter. This opens the Event List UI, which shows all events (provided your devices have been configured to forward SNMP traps and syslog messages).

    1. All Events is a filter to display all events, regardless of event type, method, severity, etc. Underneath All Events are filters with more specific search criteria, such as events in the past 5 minutes, Traps only, Syslog only, etc.
  2. The refresh timer in the top left of the Event List UI shows the time remaining before the UI will refresh. Click the timer to pause the Event List refresh. Click it again to resume the refresh timer.

    1. The default refresh interval is 60 seconds. This value can be changed within Unified Assurance.
  3. The filter bar on the bottom of the UI is used to filter events by severity. The number of events of a particular severity is shown on the respective filter button. You can click the filter button to filter by that severity. For example, clicking on the Critical (red) filter button will display events of critical severity only.

  4. Right-click an event to open the event menu. From this menu, you can select from a list of tools to run on a particular alarm, or on a number of alarms simultaneously (depending on the tool).

  5. Click the Alarm Info tool (from the Administrators sub-menu). This will display detailed information on the selected alarm, such as Node, Summary, IP Address, and First reported. The Custom1 field shows the raw alarm data received by Unified Assurance (before it was processed in the rules).
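The behavior of the severity filter bar can be sketched in a few lines of Python. This is only an illustration of the counting and filtering described above, not how the Event List is implemented. The event dictionaries and function names are hypothetical; the only severity values taken from this document are 5 (Critical) and 0 (cleared).

```python
from collections import Counter

CLEAR, CRITICAL = 0, 5  # severity values mentioned in this document


def count_by_severity(events):
    """Return how many events exist per severity (the number on each filter button)."""
    return Counter(event["Severity"] for event in events)


def filter_by_severity(events, severity):
    """Return only the events with the given severity (what a button click shows)."""
    return [event for event in events if event["Severity"] == severity]
```

For example, `filter_by_severity(events, CRITICAL)` corresponds to clicking the Critical (red) button.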

The following procedure shows you how to create your own event tools to be executed from the Event List, specify a menu to display the tool, and create your own menus.

  1. Navigate to Tools.

    Configuration -> Events -> Tools

  2. From the Tools UI, click Add to add a new tool. A drop-down list will appear, giving you the choice of creating an SQL Tool or a View Tool.

    1. The SQL Tool is a tool based on an SQL database query.

    2. The View Tool is used to run a script from the Event List.

  3. Click to create an SQL Tool. The SQL Tool (New) form will open. Use this tool to escalate the severity of an alarm to Critical.

  4. In the Tool Name field, enter Escalate Alarm as the tool name. The tool icon is optional, but for this example the red flag icon is used.

  5. Enter the following query in the SQL field:

    UPDATE Alarm SET Severity=5 WHERE Severity !=5 AND AlarmId IN ($ALARMLIST);

    1. The query takes the AlarmId of each selected event, finds the matching rows in the Alarm table of the Events database, and sets the Severity field to 5 (Critical).

  6. Click Submit to save the tool.

  7. The following steps show you how to add the tool to a menu. Navigate to Menus.

    Configuration -> Events -> Menus

  8. Click the Administrators (default) menu to open the Edit form on the right of the UI.

  9. In the form, select the Escalate Alarm tool in the Available section and use the down arrow button to add it to the Selected section.

    1. The tool is added to the bottom of the list in the Selected section. You can reorder the list by clicking and dragging the tools.

    2. Click and drag the Escalate Alarm tool so that it sits between the Clear and Delete tools.

  10. Click Submit to save the changes.

  11. Click the Unified Assurance logo to refresh the interface.

  12. In the Events navigation, expand the Filters by Group: Global folder and select the All Events filter.

  13. Select a low-severity alarm, right-click it, and run the new Escalate Alarm tool.

    1. The severity of the alarm changes to critical (red) after the tool is executed.

    2. You can also create sub-menus in the Menus interface, allowing for a more advanced hierarchy of menus and tools.

  14. The following steps show you how to create a View Tool. Navigate to Tools.

    Configuration -> Events -> Tools

  15. Click Add and select View Tool.

  16. In this example, an SNMP Walk tool is created. When the tool is executed on an alarm, it gathers SNMP information from the originating device and displays it in a new window.

  17. Add the tool information to the form.

    1. In the Path field, enter the path to the executable file or script.

    2. The AlarmId value is passed to the snmp.php script (?AlarmId=<AlarmId>). Using the AlarmId value, the script queries the database to find the device associated with that alarm, runs the snmpwalk command on the device, and outputs the results as an HTML table.

  18. Once the tool is created, add it to the Administrators menu.

  19. From the Event List, when the tool is executed on an alarm, it returns SNMP information for the associated device, if available.
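To see what the Escalate Alarm query does before running it against live data, you can try it on a throwaway database. The following Python sketch uses an in-memory SQLite table as a stand-in for the Alarm table. It assumes that $ALARMLIST expands to a comma-separated list of the selected alarm IDs; the two-column schema and the function names are hypothetical and exist only for this illustration.

```python
import sqlite3

# Query text copied from the procedure above.
ESCALATE_SQL = "UPDATE Alarm SET Severity=5 WHERE Severity !=5 AND AlarmId IN ($ALARMLIST);"


def expand_alarm_list(sql, alarm_ids):
    """Replace the $ALARMLIST token with a comma-separated list of integer IDs."""
    id_list = ",".join(str(int(alarm_id)) for alarm_id in alarm_ids)
    return sql.replace("$ALARMLIST", id_list)


def make_demo_db():
    """Build an in-memory stand-in for the Alarm table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Alarm (AlarmId INTEGER PRIMARY KEY, Severity INTEGER)")
    conn.executemany("INSERT INTO Alarm VALUES (?, ?)", [(1, 2), (2, 5), (3, 3)])
    conn.commit()
    return conn


def escalate(conn, alarm_ids):
    """Run the expanded query and return the number of rows changed."""
    cursor = conn.execute(expand_alarm_list(ESCALATE_SQL, alarm_ids))
    conn.commit()
    return cursor.rowcount
```

Selecting alarms 1 and 2 in this demo changes only alarm 1, because the `Severity !=5` condition skips alarms that are already Critical.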

Rules

When a trap or syslog message is received in Unified Assurance, it is processed by the Trapd or Syslog Aggregator rules files, respectively. These rules files process the raw trap data or syslog message and determine the output to the Alarms table and Event List.

  1. Navigate to Rules.

    Configuration -> Rules

    1. The UI contains a list of rules directories and subdirectories.

    2. Click the white "arrow" symbol to the immediate left of a folder icon to expand that directory. Clicking on the black "arrow" symbol will collapse the directory.

  2. Click to expand Event Rules (event) -> Default read-write branch (default) -> eventStdAggregator -> trap.

  3. Click the base.rules file to open it for viewing or editing. It opens on the right of the UI.

    1. Unified Assurance rules are separated into BaseRules, LoadRules and IncludeRules files.

    2. The LoadRules file, by default base.load, is loaded and executed once during poller/aggregator start-up, and is used to pre-load enrichment data that can be used throughout the rules files. Here you can define hashes, run SQL lookups, and define any other variables that you want to use throughout the rules. In addition, the Trapd base.load file includes the code necessary for:

      1. The correct functioning of Extended Rules.

      2. The aggregator to support event maintenance windows (commented out by default).

      3. The aggregator to support priority scoring (commented out by default).

    3. The IncludeRules file, by default base.includes, lists any additional rules files in your rules repository that you want to execute. For Trapd, the default entries are:

      CustomerRules,eventBase/common/aggregators/customer.rules
      StandardMIB2Rules,eventStdAggregator/trap/vendor/standardmib2.rules

      Each entry has the following format:

      NameOfSubfunctionToCreate,Path/RelativeToBaseDir/rules-file.rules

      You can then create a rules file in the trap rules repository and execute it by calling the sub-function name defined in the IncludeRules file, for example StandardMIB2Rules();.

    4. The BaseRules file, by default base.rules, is executed for each trap received. Depending on the rules defined within it, the aggregator decides which additional rules files (those defined in IncludeRules) to execute. The BaseRules process the trap based on the Generic Trap value and, if it is an Enterprise Specific Trap (Generic 6), on the Enterprise OID and Specific value, for example:

      elsif ($generic == 6) {    # Enterprise Specific Trap
          if ($enterprise =~ /^1\.3\.6\.1\.6\.3\.1\.1\.5/) {
              # Standard MIB2 Trap Rules
              $Event->{'SubMethod'} = "Standard MIB2 Traps";
              $Log->Message('DEBUG', "Using Standard MIB2 Trap Rules - $specific - $generic");
              StandardMIB2Rules();
          }
          else {
              # NO Rules
              $Event->{'SubMethod'} = "Unknown Traps";
              $Event->{'Summary'}   = "Unknown Trap From $ip [$enterprise] Generic: $generic Specific: $specific";
              $Event->{'AlarmKey'}  = "Unknown Trap From $ip [$enterprise] Generic: $generic Specific: $specific";
              $Log->Message('ERROR', "!ERROR! No Rules Defined");
          }
      }
      

In the above example, if an enterprise-specific trap arrives with an Enterprise OID beginning with 1.3.6.1.6.3.1.1.5, the rules run the StandardMIB2Rules sub-function, passing the data to the rules file named in IncludeRules, where you handle the Specific trap value.
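The relationship between the IncludeRules entries and the BaseRules dispatch can be modeled outside the product. The following Python sketch is an illustration only (the real rules are Perl). It parses entries in the Name,Path format into a lookup table and mirrors just the Generic 6 branch shown above. The function names are hypothetical.

```python
import re

# Default Trapd IncludeRules entries, copied from this document.
DEFAULT_INCLUDES = """\
CustomerRules,eventBase/common/aggregators/customer.rules
StandardMIB2Rules,eventStdAggregator/trap/vendor/standardmib2.rules
"""

# Same anchored prefix match as the Perl excerpt above.
STANDARD_MIB2_PREFIX = re.compile(r"^1\.3\.6\.1\.6\.3\.1\.1\.5")


def parse_include_rules(text):
    """Map each sub-function name to its rules file path, skipping blank lines."""
    includes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, path = line.split(",", 1)
        includes[name.strip()] = path.strip()
    return includes


def select_rules(generic, enterprise):
    """Return the sub-function name chosen for a trap, or None for 'Unknown Traps'.

    Only the Enterprise Specific (Generic 6) branch from the excerpt is modeled.
    """
    if generic == 6 and STANDARD_MIB2_PREFIX.match(enterprise):
        return "StandardMIB2Rules"
    return None
```

Looking up the selected name in the parsed table gives the rules file that would handle the trap, for example `parse_include_rules(DEFAULT_INCLUDES)[select_rules(6, "1.3.6.1.6.3.1.1.5")]`.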

The customer.rules file is a rules file that is executed per event in the BaseRules after all other IncludeRules have executed. The customer.rules is not specific to any one aggregator but is a rules file in which common event rules logic can occur across the aggregators.

Process Order

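As described above, the Trapd rules are processed in the following order: the LoadRules file runs once at aggregator start-up; the BaseRules file runs for each trap received; the BaseRules call the sub-functions defined in IncludeRules as needed; and customer.rules runs after all other IncludeRules have executed.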

The Trapd Aggregator application configuration defines which files are the BaseRules, IncludeRules and LoadRules and other settings. The application configuration can be viewed and edited under Services.

[*Configuration -> Broker Control -> Services*](../../users-guide/ui/broker/Services.md)

Creating Custom Rules Files

To add custom rules files for use within Unified Assurance, you must understand the COM functionality. The COM Overview documentation explains how to convert a vendor MIB file into a Unified Assurance rules file.

Event Filters

The Event Filters UI allows you to customize the default event filters and create your own filters, to filter events based on certain criteria that you define. In this example, a filter is created to display all critical severity events that have not been acknowledged.

  1. Navigate to Filters.

    Configuration -> Events -> Filters

  2. Click Add and fill out the form with the relevant information.

    1. The Filter Clause is where you enter the search criteria for the filter.

    2. You can restrict access to a filter to a specific user, to that user's group, to any group, or to any user. Event Lists that use defined filters are accessible through the Event Filters in the Events accordion (far left of the UI), or are shown visually as a gauge in the Gauge View section of the Event Console. To access the Gauge View section, click an Event Filter folder within the Events accordion.

    3. Private filters set to a specific user will show up in their own 'Private Filters' group for the user in the Event Filters navigation tree.

    4. Note: Administrators can view private user filters, but can NOT modify or delete them.

  3. Click Submit to save the filter.

  4. Navigate to Filter Groups.

    Configuration -> Events -> Filter Groups

  5. Click the Global filter group to open it for editing and add the new filter to the group.

  6. Click Submit to save the changes. The filter can now be used by navigating to the Events UI and clicking on the filter.
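The document does not show the Filter Clause used for this example. As a sketch, assuming the clause is an SQL condition applied to the Alarm table and that acknowledgement is stored in a column named Ack (both are assumptions, as is the demo schema), a clause for unacknowledged critical events could look like the one below. The Python code runs it against an in-memory SQLite stand-in.

```python
import sqlite3

# Hypothetical clause: critical severity (5) and not acknowledged.
# The "Ack" column name is an assumption; check your Alarm table schema.
FILTER_CLAUSE = "Severity = 5 AND Ack = 0"


def make_demo_db():
    """Build an in-memory stand-in for the Alarm table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Alarm (AlarmID INTEGER PRIMARY KEY, Severity INTEGER, Ack INTEGER, Summary TEXT)"
    )
    conn.executemany(
        "INSERT INTO Alarm VALUES (?, ?, ?, ?)",
        [(1, 5, 0, "Link down"), (2, 5, 1, "Fan failure"), (3, 3, 0, "High CPU")],
    )
    conn.commit()
    return conn


def apply_filter(conn, clause):
    """Return the alarms matched by a filter clause (trusted input only)."""
    query = f"SELECT AlarmID, Summary FROM Alarm WHERE {clause} ORDER BY AlarmID"
    return conn.execute(query).fetchall()
```

In this demo only alarm 1 matches: alarm 2 is critical but acknowledged, and alarm 3 is not critical.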

Mechanizations

Unified Assurance Mechanizations are MySQL stored procedures that are run at scheduled times and facilitate event correlation, the deletion of expired events, etc. The following example demonstrates the setup of a basic mechanization that will delete cleared (green) alarms that are over 60 minutes old.

  1. Navigate to Mechanizations.

    Configuration -> Events -> Processing -> Mechanizations

  2. Click Add to create a new Mechanization.

  3. Fill out the form. Enter the following in the "Stored Procedure" text area:

    DECLARE StartTime int default UNIX_TIMESTAMP();
    DECLARE StopTime int;
    DECLARE Rows int default 0;
    
    DELETE FROM Alarm
          WHERE Severity = 0 
            AND LastReported < (UNIX_TIMESTAMP() - 3600);
    
    SET Rows     = ROW_COUNT();    -- read immediately after the DELETE
    SET StopTime = UNIX_TIMESTAMP();
    
    INSERT INTO EventMechanizationLog
         VALUES (StartTime,
                 'delete_clears',
                 Rows,
                 StopTime - StartTime);
    
  4. This deletes events from the Alarm table that are clear events (Severity 0) and have not been updated for more than 60 minutes (3600 seconds).

  5. An entry is also made to the EventMechanizationLog table, which logs the changes made by event mechanizations.

  6. Click Submit to save the changes.
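The effect of this mechanization can be checked on sample data before you enable it. The following Python sketch reproduces the DELETE and the log entry against an in-memory SQLite database. The table layouts, including the four EventMechanizationLog column names, are hypothetical stand-ins, and the current time is passed in as a parameter so the result is repeatable.

```python
import sqlite3
import time


def make_demo_db():
    """Build in-memory stand-ins for the two tables (hypothetical schemas)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Alarm (AlarmID INTEGER PRIMARY KEY, Severity INTEGER, LastReported INTEGER)")
    conn.execute(
        "CREATE TABLE EventMechanizationLog (StartTime INTEGER, Name TEXT, RowsAffected INTEGER, Duration INTEGER)"
    )
    conn.executemany("INSERT INTO Alarm VALUES (?, ?, ?)", [(1, 0, 1000), (2, 0, 9000), (3, 5, 1000)])
    conn.commit()
    return conn


def delete_old_clears(conn, now, max_age_seconds=3600):
    """Delete cleared alarms last reported more than max_age_seconds before now."""
    started = time.monotonic()
    cursor = conn.execute(
        "DELETE FROM Alarm WHERE Severity = 0 AND LastReported < ?",
        (now - max_age_seconds,),
    )
    rows = cursor.rowcount  # captured right after the DELETE
    duration = int(time.monotonic() - started)
    conn.execute(
        "INSERT INTO EventMechanizationLog VALUES (?, ?, ?, ?)",
        (now, "delete_clears", rows, duration),
    )
    conn.commit()
    return rows
```

With `now=10000`, only alarm 1 is removed: alarm 2 is cleared but recent, and alarm 3 is old but not cleared.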

Logs

The Trapd and Syslog Aggregators write to log files under the /opt/assure1/logs/ directory by default. These logs are very useful for debugging and troubleshooting purposes. The log files are set to the ERROR level by default, which means that only errors will be written to the logs. For debugging purposes, it is good practice to set the log level to DEBUG, which will output all error, warning, info and debug information to the log. The following example demonstrates how to do this for the Trapd Aggregator, and how to view the log file after.

  1. Navigate to Services.

    Configuration -> Broker Control -> Services

  2. Find the Event Trap Aggregator service and click to select it.

  3. Change the LogLevel value from ERROR to DEBUG using the drop-down list.

  4. Click Submit to save the changes to the configuration.

  5. Find the Event Trap Aggregator service and click to select it again.

  6. Click the Restart button (top button menu) to restart the service so that the new configuration takes effect.

  7. Navigate to the Logs UI.

  8. In the search bar, enter the following to only show logs for the trap aggregator:

    app:"Trapd"
    

    Additionally, you can click the All Systems button to show logs for a single server only. With the log level set to DEBUG, the log contains additional information that is useful for debugging and troubleshooting.
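The difference between the ERROR and DEBUG levels can be illustrated with Python's standard logging module. This is an analogy only and does not use the Unified Assurance logger; it simply shows which message levels pass a given threshold. The `ListHandler` class and `levels_written` function are hypothetical names for this sketch.

```python
import logging


class ListHandler(logging.Handler):
    """Collect the level name of every record that reaches the handler."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def emit(self, record):
        self.levels.append(record.levelname)


def levels_written(level_name):
    """Log one message per level and return the levels that were written."""
    logger = logging.getLogger(f"loglevel-demo-{level_name}")
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    logger.debug("debug detail")
    logger.info("informational message")
    logger.warning("warning message")
    logger.error("error message")
    logger.removeHandler(handler)
    return handler.levels
```

Calling `levels_written("ERROR")` records only the error message, while `levels_written("DEBUG")` records all four.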

Watcher Policies

Watcher Policies are used to send out new synthetic events when certain criteria are met. These policies poll for certain events (or lack thereof) within a time period and trigger a Meta Event depending on the results. Examples of Watcher policies include sending an event when there have been no syslog messages from a device in 15 minutes and sending an event when a user has three failed login attempts within 15 minutes.

[*Configuration -> Events -> Processing -> Watcher Policies*](../../users-guide/ui/event/WatcherPolicies.md)
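The two example policies can be expressed as simple time-window checks. The following Python sketch is only a model of the logic described above, not the Watcher Policy implementation. Timestamps are plain seconds, the function names are hypothetical, and the window defaults to 15 minutes (900 seconds).

```python
def device_is_silent(last_message_time, now, window_seconds=900):
    """True when no message has been seen from the device within the window."""
    return now - last_message_time >= window_seconds


def too_many_failed_logins(failure_times, now, threshold=3, window_seconds=900):
    """True when at least `threshold` failures fall inside the window ending at now."""
    recent = [t for t in failure_times if 0 <= now - t <= window_seconds]
    return len(recent) >= threshold
```

When either check returns True, the corresponding policy would raise its synthetic event.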

CAPE Policies

CAPE Policies trigger custom Perl code to perform a custom set of actions. CAPE Policies are used to perform special processing based on information from within the Unified Assurance database. However, nodes are not limited to Unified Assurance information only. The policies act similarly to services in that they have a defined poll interval. A policy queries the Unified Assurance database and, if the query returns a value, triggers a CAPE Node (defined within the Policy). The CAPE Node contains Perl-based logic to perform custom actions on the data returned by the CAPE Policy.

The following example demonstrates how to set up a CAPE Policy and its accompanying CAPE Node.

  1. Create the CAPE Node via the Nodes UI.

  2. Navigate to Nodes.

    Configuration -> Events -> CAPE -> Nodes

  3. Click Add to bring up the form.

  4. Give the Node a useful name and description.

  5. Use the following code in the Rules Text section:

    # This section grabs the values selected in the 'SELECT' statement used in
    # the CAPE Policy configuration. Only columns referenced in that SQL SELECT
    # statement are available within the CAPE Node.
    
    my $AlarmID   = $EventData->{'AlarmID'};
    my $IPAddress = $EventData->{'IPAddress'};
    my $Severity  = $EventData->{'Severity'};
    my $Custom5   = $EventData->{'Custom5'};
    
    # Because of the SQL used in the CAPE Policy, every event returned is a
    # DeviceUpDown event, created by the Threshold Engine for the Device Down
    # threshold. The CAPE Node now attempts to ping the device and stores the
    # output of the attempt in the $ping variable.
    
    $Log->Message("DEBUG", "Running Ping check command");
    my $ping = `ping -c10 $IPAddress`;
    $Log->Message("DEBUG", "Result of Ping test ----> [ $ping ]");
    
    # If the result of the ping contains the text 'ttl', the ping was successful. The event severity will be set to '0' (cleared) and a log message is saved to the CAPE log.
    
    if ($ping =~ m/ttl/i){
       $Log->Message("DEBUG", "Result of Ping Test SUCCESSFUL - suppressing event");
       $Severity = 0;
    }
    
    # An Alarm Journal entry is created for the AlarmID in question with the output of the ping from the step above.
    
    my $time = time();
    $Log->Message("DEBUG", "Inserting into Journal [$AlarmID:1:$time:$ping]");
    
    my ($ErrorFlag, $Message) = AddJournal({
        DBH       => $EventsDBH,
        AlarmID   => $AlarmID,
        TimeStamp => time(),
        UserID    => '1',           # Admin
        Entry     => $ping
    });
    
    # The next step is to update the Alarm with new information. The severity is
    # updated (if the ping was successful), and Custom5 is also updated so that
    # this event is not caught by the CAPE Policy SQL again. This prevents CAPE
    # from reprocessing events it has already acted upon.
    
    $Log->Message("DEBUG", "Updating alarm entry [$AlarmID:$Severity:$Custom5]");
    $Custom5 = $Custom5 . "-Ping Check";
    
    # $ErrorFlag and $Message were already declared above, so reuse them here.
    ($ErrorFlag, $Message) = UpdateEvent({
        DBH     => $EventsDBH,
        AlarmID => $AlarmID,
        Values  => {
            Custom5  => $Custom5,
            Severity => $Severity
        }
    });
    
  6. Create a CAPE Policy from the Policies UI.

  7. Navigate to Policies.

    Configuration -> Events -> CAPE -> Policies

  8. Click Add to bring up the form.

  9. Give the policy a useful name, set it to enabled, and describe what it is used for.

  10. The Event Grouping form field determines whether the returned data is processed individually (one row at a time), or all of the returned data at one time. In this example, Event Grouping is set to Process Events Individually.

  11. Set the poll time to 5 minutes; this will cause the SQL to be run every 5 minutes against the event list.

  12. Set 'First Node Executed' to the Node created earlier in this procedure.

  13. Use the following SQL statement in the Event Select Statement section:

    SELECT AlarmID, Severity, IPAddress, Custom5 
      FROM Alarm 
     WHERE Severity   = 5 
       AND AlarmGroup = 'DeviceUpDown' 
       AND Custom5    NOT LIKE '%Ping Check%'
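The interaction between the policy query and the node logic can be traced with a small simulation. The following Python sketch is a model only: it runs the same SELECT against an in-memory SQLite table with a hypothetical schema, replaces the real ping command with a caller-supplied function, and omits the journal entry. It shows why appending "-Ping Check" to Custom5 stops an alarm from being selected again on the next poll.

```python
import sqlite3

# Same selection criteria as the Event Select Statement above.
SELECT_SQL = """
SELECT AlarmID, Severity, IPAddress, Custom5
  FROM Alarm
 WHERE Severity   = 5
   AND AlarmGroup = 'DeviceUpDown'
   AND Custom5    NOT LIKE '%Ping Check%'
"""


def node_decision(event, ping_output):
    """Mirror the node logic: clear the alarm if the ping output contains 'ttl'."""
    severity = 0 if "ttl" in ping_output.lower() else event["Severity"]
    return {"Severity": severity, "Custom5": f"{event['Custom5']}-Ping Check"}


def run_policy_once(conn, ping):
    """Run one poll cycle; `ping` is a function taking an IP and returning output text."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(SELECT_SQL).fetchall()
    for row in rows:
        event = dict(row)
        update = node_decision(event, ping(event["IPAddress"]))
        conn.execute(
            "UPDATE Alarm SET Severity = ?, Custom5 = ? WHERE AlarmID = ?",
            (update["Severity"], update["Custom5"], event["AlarmID"]),
        )
    conn.commit()
    return len(rows)  # number of events handed to the node
```

Running the cycle twice on the same data processes each matching alarm once; the second run finds nothing because Custom5 now contains the marker text.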