You will need to test your model's understanding for each skill and, later, for the digital assistant as a whole. A skill whose model correctly maps in-domain messages to an intent and does not respond to out-of-domain messages is an important pillar of a well-trained digital assistant.
Oracle Digital Assistant provides an utterance tester in its skills that allows you to perform manual and batch testing of how well the model resolves intents from user messages. Batch testing is where you use the 20% of the utterances that you defined for an intent but held back for testing.
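The 80/20 split mentioned above can be sketched as follows. This is a minimal, illustrative helper, not part of the Oracle Digital Assistant tooling; the sample utterances are invented for the example.

```python
import random

def split_utterances(utterances, train_ratio=0.8, seed=42):
    """Shuffle an intent's utterances and split them into training and test sets."""
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical utterances for a "create expense" intent.
create_expense = [
    "I want to file an expense",
    "submit a new expense report",
    "log my taxi receipt",
    "add a meal expense for yesterday",
    "I need to expense my hotel stay",
]

train, test = split_utterances(create_expense)
print(len(train), len(test))  # 4 utterances for training, 1 held back for batch testing
```

The held-back utterances become the input for the batch tester, while the rest are used to train the model.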
In general, you should test your models early and often, but not before you have enough good utterances for the skill's intents. The goal of your tests is for the model to resolve intents with a high level of confidence.
Create a Baseline
After development is completed, you should run tests and use the results to establish a baseline of the model’s level of understanding. You can use that baseline as a point of comparison when you update the training model with additional and improved utterances and when you later test the skill on updated versions of the Digital Assistant platform. For these and future tests, you need a model that is trained with a sufficient number of quality utterances.
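Comparing a new test run against the recorded baseline can be sketched as below. The data structure and threshold are assumptions made for illustration; in practice you would export the batch-test results and compare the per-intent confidence scores.

```python
def compare_to_baseline(baseline, current, tolerance=0.05):
    """Flag intents whose average confidence dropped noticeably since the baseline run."""
    regressions = {}
    for intent, score in current.items():
        base = baseline.get(intent)
        if base is not None and base - score > tolerance:
            regressions[intent] = (base, score)
    return regressions

# Hypothetical per-intent average confidence scores from two test runs.
baseline = {"create expense": 0.92, "check balance": 0.88}
current  = {"create expense": 0.81, "check balance": 0.90}

print(compare_to_baseline(baseline, current))
# {'create expense': (0.92, 0.81)}
```

A drop like this signals that recent changes to the training utterances (or a platform update) weakened the model for that intent, so it needs attention before release.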
Perform Positive and Negative Testing
You should have both positive and negative tests:
In positive tests, you want the utterances to resolve to the intent you have designated. The more tests that pass, the better the model is trained.
For negative tests, you want the utterances to not resolve. Negative tests help you tighten the boundaries of understanding for an intent.
As an example of a positive test, assume that in an expense report skill you are testing the "create expense" intent. All utterances in the positive test contain messages that should resolve to this intent.
Negative testing includes the following types of tests:
Neighbour testing: Test an intent with the utterances you created to test the other intents in a skill.
Out-of-domain testing: With these tests you try utterances that semantically don't belong to the intent but use similar words. For example, an expense report skill should understand "I bought a family calendar for work" as a user requesting to file a new expense, but should not respond to "create a new entry in my family calendar".
Random phrase testing: Random messages should not resolve to the intent you are testing. For example, "the cookie cutter cuts cookies" or "I am on a stairway to heaven" should not lead to a match for the "create expense" intent.
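The positive and negative cases above can be sketched as a small test harness. The `resolve_intent` function here is a hypothetical, keyword-based stand-in for the skill's real intent resolver, included only so the example runs; the threshold value is likewise an assumption.

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed value; use your skill's configured threshold

def resolve_intent(message):
    """Toy keyword-based resolver standing in for the skill's trained model."""
    if "bought" in message.lower() or "expense" in message.lower():
        return "create expense", 0.85
    return "unresolvedIntent", 0.0

def run_tests(cases):
    """Each case is (message, expected_intent); 'unresolvedIntent' expects no match."""
    failures = []
    for message, expected in cases:
        intent, confidence = resolve_intent(message)
        # Below the threshold, treat the message as unresolved.
        resolved = intent if confidence >= CONFIDENCE_THRESHOLD else "unresolvedIntent"
        if resolved != expected:
            failures.append((message, expected, resolved))
    return failures

cases = [
    # Positive test: should resolve to the designated intent.
    ("I bought a family calendar for work", "create expense"),
    # Out-of-domain test: similar words, different meaning -- should not resolve.
    ("create a new entry in my family calendar", "unresolvedIntent"),
    # Random phrase test: should not resolve.
    ("the cookie cutter cuts cookies", "unresolvedIntent"),
]

print(run_tests(cases))  # an empty list means every case behaved as expected
```

Note that both negative types use the same mechanics as the positive test; only the expectation changes, from "resolves to my intent" to "does not resolve at all".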
Checklist for Model Testing
- ☑ Test early and test often.
- ☑ Don't test an undertrained model.
- ☑ Use positive and negative testing.
- ☑ Utterances used for testing should be of the same quality as training utterances, but must not be the same utterances used for training.
- ☑ Aim for results well above the confidence threshold when testing utterances. (However, a 100% confidence rate is not a goal.)
- ☑ Before putting your skill into production, record the test results as a baseline for future tests you run.
- Oracle Design Camp video: Testing Strategies