Examples of Semantic Clustering Using Natural Language Processing

The nlp command can be used to extract keywords from a string field, or to cluster records based on these extracted keywords. Keyword extraction can be controlled using a custom NLP dictionary. If no dictionary is provided, the default out-of-the box-dictionary is used.

Topics:

For more information on semantic clustering, see Semantic Clustering Using Natural Language Processing.

Cluster Kernel Errors in Linux Syslog Logs

The following query clusters Kernel messages in Linux Syslog Logs:

'Log Source' = 'Linux Syslog Logs' and kernel
| link cluster()
| where 'Potential Issue' = '1'
| nlp table = 'iSCSI Errors' cluster('Cluster Sample') as 'Cluster ID',
              keywords('Cluster Sample') as Summary
| sort 'Cluster ID'

In the above query:

link cluster() runs the traditional cluster and returns a Cluster Sample field.
nlp cluster('Cluster Sample') processes each Cluster Sample and assigns a Cluster ID. Messages that have similar meaning would get the same Cluster ID.
keywords('Cluster Sample') extracts the keywords used in clustering. This is returned in the Summary field.

The following image shows the link results returned:

Description of nlp_cluster_syslog_kernel.png follows

Description of the illustration nlp_cluster_syslog_kernel.png

The first and second rows are not similar, and hence get different cluster IDs.
The third and fourth rows have similarity in the Cluster Sample. This can be seen in the overlap of keywords extracted in the Summary field.
By default, a 70% overlap is required to form a cluster. This can be overridden using the similarity parameter to cluster.
The Cluster ID generated is deterministic. Thus, the Cluster ID can be used as a shortcut for the list of keywords shown in the Summary column.

Use similarity to Control the Number of Clusters

Running cluster using the default dictionary and a lower similarity threshold would produce fewer clusters:

'Log Source' = 'Linux Syslog Logs' and kernel
| link cluster()
| where 'Potential Issue' = '1'
| nlp similarity=0.2 cluster('Cluster Sample') as 'Cluster ID',
                     keywords('Cluster Sample') as Summary
| sort 'Cluster ID'

This merged some of the rows into the existing clusters, as well as reduced the number of clusters:

Description of nlp_cluster_syslog_kernel2.png follows

Description of the illustration nlp_cluster_syslog_kernel2.png

Cluster the Database Alert Logs

The following query shows an example of semantically clustering Database Alert Logs:

'Log Source' = 'Database Alert Logs'
| link cluster()
| nlp cluster('Cluster Sample') as 'Cluster ID',
      keywords('Cluster Sample') as Summary
| where Summary != null
| classify 'Start Time', Summary, 'Cluster ID' as 'Database Messages'

Description of db_alert_logs_cluster1.png follows

Description of the illustration db_alert_logs_cluster1.png

Description of db_alert_logs_cluster2.png follows

Description of the illustration db_alert_logs_cluster2.png