Examples of Semantic Clustering Using Natural Language Processing

The nlp command can be used to extract keywords from a string field, or to cluster records based on these extracted keywords. Keyword extraction can be controlled using a custom NLP dictionary. If no dictionary is provided, the default out-of-the-box dictionary is used.

For more information on semantic clustering, see Semantic Clustering.

Cluster Kernel Errors in Linux Syslog Logs

The following query clusters Kernel messages in Linux Syslog Logs:

'Log Source' = 'Linux Syslog Logs' and kernel
| link cluster()
| where 'Potential Issue' = '1'
| nlp table = 'iSCSI Errors' cluster('Cluster Sample') as 'Cluster ID',
              keywords('Cluster Sample') as Summary
| sort 'Cluster ID'

In the above query:

  • link cluster() runs the traditional cluster and returns a Cluster Sample field.

  • nlp cluster('Cluster Sample') processes each Cluster Sample and assigns a Cluster ID. Messages that have similar meaning get the same Cluster ID.

  • keywords('Cluster Sample') extracts the keywords used in clustering. This is returned in the Summary field.

The following image shows the link results returned:

[Image: semantic clustering of the Linux Syslog Logs for the kernel errors]

  • The first and second rows are not similar, and hence get different cluster IDs.

  • The third and fourth rows have similarity in the Cluster Sample. This can be seen in the overlap of keywords extracted in the Summary field.

  • By default, a 70% overlap is required to form a cluster. This can be overridden using the similarity parameter of the nlp command.

  • The Cluster ID generated is deterministic. Thus, the Cluster ID can be used as a shortcut for the list of keywords shown in the Summary column.

Use similarity to Control the Number of Clusters

Running nlp cluster with the default dictionary and a lower similarity threshold produces fewer clusters:

'Log Source' = 'Linux Syslog Logs' and kernel
| link cluster()
| where 'Potential Issue' = '1'
| nlp similarity=0.2 cluster('Cluster Sample') as 'Cluster ID',
                     keywords('Cluster Sample') as Summary
| sort 'Cluster ID'

This merges some of the rows into the existing clusters and reduces the number of clusters:

[Image: semantic clustering of the Linux Syslog Logs for the kernel errors after reducing the number of clusters based on similarity]
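To make the effect of the similarity threshold concrete, the following Python script is a small model of keyword-overlap clustering, added here as an example. It is not Oracle's implementation, and the page does not describe the product's actual similarity measure or dictionary. The script assumes Jaccard similarity of keyword sets as the overlap measure, a tiny stopword list in place of the NLP dictionary, six constructed kernel messages standing in for the Cluster Sample field, and a Cluster ID derived by hashing the sorted keywords of a cluster's first message so that the ID is deterministic. Running it at the default 0.7 and again at 0.2 shows the merge described above: messages that are separate clusters at 70% overlap fold into existing clusters at the lower threshold.

```python
"""Toy model of keyword-overlap clustering with a similarity threshold.

Constructed example; not the product's implementation. The overlap measure
(Jaccard), the stopword list and the sample messages are all assumptions.
"""
import hashlib
import re
from dataclasses import dataclass, field

# Constructed sample messages standing in for the 'Cluster Sample' field.
SAMPLES = [
    "kernel: iSCSI connection 1:0 to target iqn.example: detected conn error 1020",
    "kernel: iSCSI connection 2:0 to target iqn.example: detected conn error 1020",
    "kernel: sd 3:0:0:0: [sdb] Unhandled error code, aborting",
    "kernel: EXT4-fs error (device sdb1): ext4_find_entry: bad entry in directory",
    "kernel: EXT4-fs error (device sdc1): ext4_lookup: deleted inode referenced",
    "kernel: iSCSI session 3: connection to target iqn.example timed out",
]

# Minimal stopword list; a stand-in for the NLP dictionary, which is not shown on the page.
STOPWORDS = {"to", "the", "in", "on", "of", "at", "for", "and", "after", "with", "is", "a", "an", "by"}


def keywords(message: str) -> frozenset:
    """Extract keywords: lowercase alphabetic tokens, dropping stopwords and tokens under 3 chars.

    Digits are token separators, so device numbers and error codes (1020, sdb1) do not
    become keywords; this is what lets messages that differ only in an ID cluster together.
    """
    tokens = re.findall(r"[a-z]+", message.lower())
    return frozenset(t for t in tokens if len(t) > 2 and t not in STOPWORDS)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Keyword overlap as Jaccard similarity (assumed measure): |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_id(seed_keywords: frozenset) -> int:
    """Deterministic ID: hash of the sorted keyword list, so equal keyword sets map to equal IDs."""
    digest = hashlib.sha256(" ".join(sorted(seed_keywords)).encode()).hexdigest()
    return int(digest[:16], 16)


@dataclass
class Cluster:
    cluster_id: int
    summary: frozenset                      # keywords of the first (seed) message
    members: list = field(default_factory=list)
    member_keywords: list = field(default_factory=list)


def cluster_messages(messages, similarity: float = 0.7):
    """Greedy single-linkage clustering: a message joins the best-matching existing cluster
    whose closest member reaches the threshold, otherwise it seeds a new cluster."""
    clusters = []
    for msg in messages:
        kw = keywords(msg)
        best, best_score = None, 0.0
        for c in clusters:
            score = max(jaccard(kw, mk) for mk in c.member_keywords)
            if score >= similarity and score > best_score:
                best, best_score = c, score
        if best is None:
            best = Cluster(cluster_id(kw), kw)
            clusters.append(best)
        best.members.append(msg)
        best.member_keywords.append(kw)
    return clusters


def cluster_sizes(messages, similarity: float = 0.7):
    """Cluster sizes in creation order; handy for comparing thresholds."""
    return [len(c.members) for c in cluster_messages(messages, similarity)]


if __name__ == "__main__":
    for threshold in (0.7, 0.2):
        print(f"similarity={threshold}")
        for c in cluster_messages(SAMPLES, threshold):
            print(f"  Cluster ID {c.cluster_id}  Summary={sorted(c.summary)}")
            for m in c.members:
                print(f"      {m}")
```

At 0.7 the two iSCSI connection errors share one cluster and the other four messages are each their own cluster; at 0.2 the iSCSI timeout joins the iSCSI cluster and the two EXT4 errors join the sdb cluster, leaving two clusters. Because the ID is a hash of the seed's keyword set, rerunning over the same data yields the same IDs, which is what lets a Cluster ID stand in for its Summary keyword list in later queries.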

Cluster the Database Alert Logs

The following query shows an example of semantically clustering Database Alert Logs:

'Log Source' = 'Database Alert Logs'
| link cluster()
| nlp cluster('Cluster Sample') as 'Cluster ID',
      keywords('Cluster Sample') as Summary
| where Summary != null
| classify 'Start Time', Summary, 'Cluster ID' as 'Database Messages'

[Image: semantic clustering of the Database Alert Logs for the cluster ID 1188814328]

[Image: semantic clustering of the Database Alert Logs for the selected cluster ID and the adjacent summary of the keywords]