Tuning strategy for clusters

This topic provides guidelines for tuning the clustering parameters.

For each parameter, the guidelines give an initial value (to use for the first trial clustering run) and a recommended value. Tuning consists of moving each parameter from its initial value toward its recommended value, with some variation depending on the properties of the particular data set and the needs of the application.

In general, the tuning strategy is to start with each parameter at a permissive setting and then gradually decrease its value. Tune the parameters by observing their effect on the results of several different queries at once (no query or node 0; broad queries; narrow queries; single-term queries; multi-term queries). In other words, avoid tuning the parameters against a single specific query.
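One way to observe several queries at once is a small comparison harness. The following is a minimal sketch, not part of the product: `run_clustering` is a hypothetical function that you supply to run one clustering request, and the query strings in `QUERY_PANEL` are invented placeholders for the query types listed above.

```python
from typing import Callable, Dict, List

Settings = Dict[str, float]
Clusters = List[List[str]]

# Placeholder query panel covering the query types named above.
# The query strings are invented for illustration; substitute your own.
QUERY_PANEL: Dict[str, str] = {
    "no query (node 0)": "",
    "broad": "wine",
    "narrow": "2005 napa valley merlot",
    "single-term": "merlot",
    "multi-term": "red wine",
}


def compare_settings(
    run_clustering: Callable[[str, Settings], Clusters],
    candidates: List[Settings],
    queries: Dict[str, str] = QUERY_PANEL,
) -> Dict[str, List[Clusters]]:
    """Run every query in the panel under every candidate setting.

    Returns {query label: [clusters for candidate 0, candidate 1, ...]},
    so the effect of a parameter change can be read across all queries.
    `run_clustering` is a hypothetical callable supplied by the caller.
    """
    return {
        label: [run_clustering(query, settings) for settings in candidates]
        for label, query in queries.items()
    }
```

Reviewing the output per query label, rather than per setting, makes it easier to spot a change that helps one query type while hurting another.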

The following procedure is intended as a tool for gradual tuning, as it allows you to observe the effect of changing the parameters on several different queries at once. Use the suggested order, as it maps to the order in which these parameters impact the clustering algorithm, from upstream to downstream.

1: Number of records sampled from the navigation state

Recommended value: 500

Initial value: 500

Strategy: Start with the parameter set to 500, and increase it if the terms at the bottom of your related terms list (roughly terms 100-120) appear in fewer than 3 records.

Note that the recommended value of 500 is for data sets with 20 or more terms tagged onto each record. Use a higher value for data sets with fewer terms per record. If the average record has only 2 to 3 terms, set this value to 2000. A good rule of thumb is: when the 120 most frequent terms are sampled for clustering, the 120th term should be present in at least 3 records. If it is present in fewer, increase this setting.
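The rule of thumb above can be checked on an exported sample. The following is a minimal sketch, assuming that each sampled record is available as a list of its tagged terms; the function name and its parameters are illustrative, not part of the product. When the sample contains fewer than 120 distinct terms, the sketch checks the least frequent term it has, which is an assumption rather than something the guideline specifies.

```python
from collections import Counter
from typing import Iterable, List


def sample_size_is_sufficient(
    sampled_records: Iterable[List[str]],
    top_n: int = 120,
    min_records: int = 3,
) -> bool:
    """Return True if the top_n-th most frequent term occurs in at
    least min_records of the sampled records."""
    record_counts: Counter = Counter()
    for terms in sampled_records:
        # Count each term once per record, however often it repeats.
        record_counts.update(set(terms))

    ranked = record_counts.most_common(top_n)
    if not ranked:
        return False  # no terms at all: the sample is unusable

    # Assumption: with fewer than top_n distinct terms, test the last one.
    _, count_of_last_term = ranked[-1]
    return count_of_last_term >= min_records
```

If the function returns False for a representative navigation state, increase the number of sampled records and check again.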

2: Maximum refinement precision

Recommended value: 0.25

Initial value: 1.0

Strategy: Start with this value set to 1.0 (no precision filtering). Try several different queries and pick a maximum precision level that separates the useful terms from the frequent but uninformative ones. Note that, typically, only values between 0.05 and 0.5 are useful.

3: Maximum number of terms per cluster

Recommended value: 6 - 8

Initial value: 10

Strategy: Start with a value of 10 to see all the terms that are getting into clusters. Reduce the value until the clusters are small enough to fit into the space available in your UI. Using a value of 2 is not recommended. Note that reducing the number of terms also reduces cluster coverage (and recall).

4: Cluster coherence

Recommended value: 5

Initial value: 5

Strategy: Start with the default value of 5. If you see undesirable cluster splintering (several clusters that seem to map to the same semantic area), decrease this value; if, on the other hand, the cluster set is missing some semantic areas, increase it. Note that it is acceptable to have several overlapping clusters remaining after tuning this value, because they will be removed in the next step.

5: Maximum cluster overlap

Recommended value: 5

Initial value: 10

Strategy: Start with a value of 10, then decrease this parameter until the desired number of overlapping clusters remains. In some cases, depending on customer needs, some cluster overlap can be retained, particularly if the smaller cluster is an especially coherent one.

6: Maximum number of clusters

Recommended value: 6

Initial value: 10

Strategy: Start with a value of 10 to see all the available clusters after all the other settings have been applied. Reduce this number if you still see more clusters than the available UI space permits.
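For reference, the initial and recommended values from the six steps above can be collected in one place. The following is a minimal sketch; the parameter keys are hypothetical names chosen for illustration and do not correspond to actual configuration setting names.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]
Recommended = Union[Number, Tuple[int, int]]

# (hypothetical key, initial value, recommended value or (low, high) range),
# listed in the suggested tuning order, from upstream to downstream.
TUNING_ORDER: List[Tuple[str, Number, Recommended]] = [
    ("records_sampled", 500, 500),
    ("max_refinement_precision", 1.0, 0.25),
    ("max_terms_per_cluster", 10, (6, 8)),
    ("cluster_coherence", 5, 5),
    ("max_cluster_overlap", 10, 5),
    ("max_clusters", 10, 6),
]


def initial_settings() -> Dict[str, Number]:
    """Return the settings for the first trial clustering run."""
    return {name: initial for name, initial, _ in TUNING_ORDER}


if __name__ == "__main__":
    # Print the tuning plan, one step per line, in the suggested order.
    for step, (name, initial, recommended) in enumerate(TUNING_ORDER, start=1):
        print(f"{step}: {name}: start at {initial}, aim for {recommended}")
```

The recommended values are targets to move toward, not fixed end points; the final settings depend on the data set and on the application needs.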