Examples of Semantic Clustering Using Natural Language Processing

The nlp command can be used to extract keywords from a string field, or to cluster records based on these extracted keywords. Keyword extraction can be controlled using a custom NLP dictionary. If no dictionary is provided, the default out-of-the box-dictionary is used.

Cluster Kernel Errors in Linux Syslog Logs

The following query clusters Kernel messages in Linux Syslog Logs:

'Log Source' = 'Linux Syslog Logs' and kernel
| link cluster()
| where 'Potential Issue' = '1'
| nlp table = 'iSCSI Errors' cluster('Cluster Sample') as 'Cluster ID',
              keywords('Cluster Sample') as Summary
| sort 'Cluster ID'

In the above query:

  • link cluster() runs the traditional cluster and returns a Cluster Sample field.

  • nlp cluster('Cluster Sample') processes each Cluster Sample and assigns a Cluster ID. Messages that have similar meaning would get the same Cluster ID.

  • keywords('Cluster Sample') extracts the keywords used in clustering. This is returned in the Summary field.

The following image shows the link results returned:



  • The first and second rows are not similar, and hence get different cluster IDs.

  • The third and fourth rows have similarity in the Cluster Sample. This can be seen in the overlap of keywords extracted in the Summary field.

  • By default, a 70% overlap is required to form a cluster. This can be overridden using the similarity parameter to cluster.

  • The Cluster ID generated is deterministic. Thus, the Cluster ID can be used as a shortcut for the list of keywords shown in the Summary column.

Use similarity to Control the Number of Clusters

Running cluster using the default dictionary and a lower similarity threshold would produce fewer clusters:

'Log Source' = 'Linux Syslog Logs' and kernel
| link cluster()
| where 'Potential Issue' = '1'
| nlp similarity=0.2 cluster('Cluster Sample') as 'Cluster ID',
                     keywords('Cluster Sample') as Summary
| sort 'Cluster ID'

This merged some of the rows into the existing clusters, as well as reduced the number of clusters:



Cluster the Database Alert Logs

The following query shows an example of semantically clustering Database Alert Logs:

'Log Source' = 'Database Alert Logs'
| link cluster()
| nlp cluster('Cluster Sample') as 'Cluster ID',
      keywords('Cluster Sample') as Summary
| where Summary != null
| classify 'Start Time', Summary, 'Cluster ID' as 'Database Messages'