28 Quality Reports

A skill that can distinguish between its intents easily will have fewer intent resolution errors and better user adoption. The quality reports can help you reach these goals.

You can use these reports when you’re forming your training data and later on, when you’ve published your skill and want to find out how your intents are fielding customer messages at any point in time.

How Do I Use the Data Quality Reports?

Using the Utterances, Suggestions, and History pages, you can find out whether your skill has enough intents in the first place and, if so, whether these intents overlap, need editing, or behave as expected in a production environment.
  • Utterances—Assigns quality rankings to pairs of intents as follows:
    • High—The intents are distinct.

    • Medium—The intents have similar utterances.

    • Low—The intent pairs aren’t differentiated enough.

    You can edit or delete utterances from this page.
  • Suggestions—Tells you if your skill is viable. You can find out if you’ve added enough intents and if you defined a sufficient number of utterances for each intent.

  • History—Shows your skill’s resolution history, so that you can identify when the intents worked as expected and when they didn’t. You can use this feedback to retrain your skill.


When you’re building your training corpus, you can gauge how distinct your intents are from one another by running an utterance quality report. This report shows you different combinations of intent pairs, each rated on the similarity of their respective utterances. It generates these results by randomly splitting the utterances into two sets: training and testing. It builds and trains a model from 80% of the utterances and then uses the remaining 20% to test this model. If you don’t already have a lot of training data, you can build high-quality intents by combining this report with the utterance guidelines.
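The random 80/20 split described above can be illustrated with a short Python sketch. This is not the report's actual implementation: the function name `split_utterances`, the fixed seed, and the sample corpus are assumptions made only for illustration.

```python
import random

def split_utterances(utterances, train_fraction=0.8, seed=0):
    """Randomly split (utterance, intent) pairs into training and testing sets."""
    shuffled = list(utterances)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded only to make the sketch repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical corpus: ten utterances across two intents.
corpus = [(f"order pizza {i}", "OrderPizza") for i in range(5)] + \
         [(f"cancel pizza {i}", "CancelPizza") for i in range(5)]

train, test = split_utterances(corpus)
print(len(train), len(test))  # 8 2
```

With a small corpus, the testing set holds only a handful of utterances, which is one reason to pair the report with the utterance guidelines rather than rely on its scores alone.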
Run an Utterance Quality Report

Use this report to find out which utterances are too alike or can be potentially misclassified (associated with the wrong intent).

  1. Before you run a report, you need to train your skill.
  2. When the training is complete, click Quality in the left navbar.

  3. Click Run Report. The report scores the intent pairs in terms of their utterances that are too alike.

    Score | What Does This Mean? (And What Do I Do?)
    High | While your skill can easily distinguish between these intents, they still may have utterances that are too alike, so you should continue to edit and add training data.
    Medium | The utterances are so similar that they potentially blur the meanings of these intents. Because your skill may have trouble distinguishing between these intents, edit or delete these utterances.
    Low | The utterances are too alike, so the skill can’t distinguish between them. To fix this, edit or delete these utterances and then retrain the skill. You can also add more utterances to your intent.
  4. If needed, click Show All. By default, this switch is toggled off, so the report shows only the medium- and low-ranking intent pairs. Keep in mind that a high-quality ranking for an intent pair doesn’t mean that your corpus is complete or that it doesn’t need more utterances.

  5. If needed, choose a sorting option to view the intent pairs.

  6. Click an intent pair. The report has two categories:
    • Similar Utterances—For both intents, this report shows the utterances that are too alike. By hovering over an utterance, you can edit or delete it.
    • Misclassified Utterances—Gauges the probability that an utterance used to test the model might be classified under the wrong intent.


      By default, the report displays only the medium- and low-accuracy utterances because the Show All toggle is switched off.

      The report has the following columns:
      • Utterance: For each of the expected intents in this report, we detect the utterances that were used in testing.
      • Expected Intent: The intent for which the utterance was initially created.
      • Observed Intent: The intent to which the utterance may belong.
      • Accuracy: How utterance quality contributes to the potential for misclassification:
        • Low: Intent resolution is low due to the high probability of misclassification.
        • Medium: The probability of misclassification means that the likelihood of successful classification is only moderate.
        • High: Accurate intent resolution is likely because the probability of misclassification is low.

  7. If needed, adjust the utterances.

    Based on the score of the similar utterances, your typical course of action is one of the following:
    • Adding new utterances.

    • Modifying the similar utterances.

    • Deleting similar utterances.

    • Leaving the utterances alone, even if they clash.

    • Collapsing the two intents into a single intent if they have too many utterances in common. This common intent uses entities (such as list value entities with synonyms) to recognize the distinctions in the user input.

    For misclassified utterances:
    • To improve accuracy, add more example utterances to the expected intent.

    • If there are disproportionately more utterances for the observed intent than for the expected intent, consider removing some unnecessary ones.


    Keep your sights on how modifying the data improves your skill’s coverage, not the report results. While you can increase accuracy within the context of this report by adding similar utterances to an intent, you should instead focus on anticipating real-world user input and maintaining a diverse set of utterances for each intent. If you pad your intents to suit the report, your skill won’t perform well.
  8. After you’ve made your changes, retrain your skill and then click Rerun Report.

Unexpected Results in Utterance Quality Reports

Sometimes the report can output results that seem contradictory. Here are some issues and possible causes.

Issue | Cause
The report shows similar utterances for high-quality intent pairs. | The report not only compares utterances, but also looks at each intent as a whole. If most of the utterances for an intent pair are distinct, then a low number of similar ones won’t detract from the overall quality rating. For example, two intents that have each been trained on 100 utterances can be easily distinguished even though they share a small number of similar utterances.
The report doesn't show similar utterances for low-quality pairs. | This can happen because, on the whole, the report can’t distinguish between the intents in the pair even though they don’t share any utterances. Factors like a low number of utterances, or vague and general wording, can cause this.
The report continues to show utterances as similar, even after you've edited them. | Whenever you delete or edit an utterance, you need to retrain the skill before you run the report again.


When you’re starting out with your data set, check the Suggestions page to find out if your skill meets the minimum standards of having at least two intents, each of which has two or more utterances.
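As a rough illustration of the minimum standards that the Suggestions page checks, the sketch below validates a corpus held in a plain dictionary. The function `check_minimums` and the dictionary layout are hypothetical; they only mirror the rule stated above (at least two intents, each with two or more utterances) and are not part of the product.

```python
def check_minimums(intents, min_intents=2, min_utterances=2):
    """Return a list of problems; an empty list means the minimums are met.

    `intents` maps an intent name to its list of example utterances.
    """
    problems = []
    if len(intents) < min_intents:
        problems.append(
            f"Need at least {min_intents} intents, found {len(intents)}.")
    for name, utterances in intents.items():
        if len(utterances) < min_utterances:
            problems.append(
                f"Intent '{name}' has {len(utterances)} utterance(s); "
                f"needs at least {min_utterances}.")
    return problems

# Hypothetical skill: the second intent is one utterance short.
skill = {
    "OrderPizza": ["I want a pizza", "order a large pepperoni"],
    "CancelPizza": ["cancel my order"],
}
for problem in check_minimums(skill):
    print(problem)
```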


While the Utterances and Suggestions pages help you evaluate your skill as you develop it, you’d use the History page when your training data is robust. The reports that you run from this page return user messages along with the intents that resolved them, ranked by win margin and confidence level. The report is designed to help you look for:
  • Complete failures (unresolved intents)—When your skill can’t classify the user comment to any of its intents.

  • Potentially misclassified user messages—When the top intent is separated from the second intent by only a narrow margin.

  • Low confidence levels—When the intended intent resolves the message, but just barely, as indicated by a low confidence level.

Run a History Report
  1. Choose the time period. You can use one of the preset periods, like Today, Yesterday, or Last 90 Days, or add your own by first choosing Custom and then by setting the collection period using the date picker.


    If you don’t want to filter this data any further, then just delete the filter criteria and then click Search.
  2. If you want to use the report to find out about intents that resolve the messages correctly but with only a low confidence level or by a thin margin, first choose one of the operators (All or Any) and then apply your search criteria.

  3. Click Search. For each message within the time frame, the report shows you which intent your skill used to resolve the message along with the runner-up. To reflect the intent ranking, the report shows you the intents’ confidence ranking and, for the top intent, its win margin: the difference, in terms of confidence, between it and the second intent.


    In general, you set the win margin at around 10%.

    If you click Show All, you can see the lower-ranking intents (if any).

    By expanding the General section of the page, you can see which entities played a role in resolving the message and the channel (which you can set as a filter).
  4. If you think the message improves your corpus by, say, widening the win margin between the top two intents, select the intent’s confidence level radio button and then click Add Example. Remember that since you’ve now added a new utterance to your corpus, you need to retrain your skill.
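The win margin shown in the report is the confidence of the top intent minus the confidence of the second intent. The following Python sketch computes it for one message; the function name `win_margin`, the sample scores, and the handling of a single-intent result are assumptions for illustration, not the product's implementation.

```python
def win_margin(confidences):
    """Return the confidence gap between the top intent and the runner-up.

    `confidences` maps intent names to confidence scores between 0 and 1.
    With fewer than two intents, the runner-up is treated as 0.0 (an assumption).
    """
    ranked = sorted(confidences.values(), reverse=True)
    top = ranked[0] if ranked else 0.0
    second = ranked[1] if len(ranked) > 1 else 0.0
    return top - second

# Hypothetical resolution result for one user message.
scores = {"OrderPizza": 0.62, "CancelPizza": 0.55, "TrackOrder": 0.10}
margin = win_margin(scores)
is_narrow = margin < 0.10   # compare against the typical 10% win margin
print(f"{margin:.2f}", is_narrow)  # 0.07 True
```

A message like this one, where the top two intents are separated by less than the configured margin, is a candidate for review and possibly for Add Example.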
Run Failure Reports

To identify all of the messages that your skill treated as unresolved because the resolution fell below the confidence threshold, set Top Intent Confidence to a value lower than the one set for the skill’s Confidence Threshold property. You can add the messages returned by this report to an existing intent, or if they indicate that users want your skill to perform some other action entirely, you can use them to define a new intent.

Run Low Confidence Reports

When the top intent resolves the message but only with low confidence, you might need to revise the utterances that belong to the intents because they’re potentially misclassified. To run a report of low-confidence intents, set Top Intent Confidence to a value that’s just above the one set for the skill’s Confidence Threshold property.
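The difference between the failure report and the low confidence report comes down to where the top intent's confidence sits relative to the Confidence Threshold. The sketch below applies both filters to an in-memory list; the record layout, the function names, and the `band` parameter (how far above the threshold still counts as "just above") are hypothetical and chosen only to make the distinction concrete.

```python
def failure_report(records, confidence_threshold):
    """Messages whose top intent confidence fell below the threshold (unresolved)."""
    return [r for r in records if r["top_confidence"] < confidence_threshold]

def low_confidence_report(records, confidence_threshold, band=0.1):
    """Messages that resolved, but with confidence only slightly above the threshold."""
    upper = confidence_threshold + band
    return [r for r in records
            if confidence_threshold <= r["top_confidence"] < upper]

# Hypothetical history records.
history = [
    {"message": "pizza?", "top_confidence": 0.40},
    {"message": "I'd like to order", "top_confidence": 0.75},
    {"message": "order a large pepperoni pizza", "top_confidence": 0.95},
]
threshold = 0.7   # stands in for the skill's Confidence Threshold property

print([r["message"] for r in failure_report(history, threshold)])
print([r["message"] for r in low_confidence_report(history, threshold)])
```

Messages from the first list are candidates for an existing or new intent; messages from the second list point to intents whose utterances may need revising.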

Troubleshoot Narrow Win Margins

Thin win margins might indicate where user messages fall in between your bot’s intents. Review these messages to make sure that they are getting resolved by the right intent. You can also configure the Win Margin property in Settings > Configuration to help your bot respond to vaguely worded or compound user messages.