Best Practices for Creating Evaluation Sets
A well-designed evaluation set makes your evaluation more effective and reliable. Here are some best practices for creating evaluation data sets and testing protocols for agent teams.
The foundation of a successful evaluation is a high-quality data set. This data set consists of paired questions or inputs, and expected responses. The expected response must be factually correct and grounded in the source context.
Example Data Set
| Question | Expected Response |
|---|---|
| Is aromatherapy covered? | No, aromatherapy isn't covered. According to the provided context, aromatherapy is listed under "Alternative Treatments" which aren't covered by UnitedHealthcare Medical Plans. |
| Do you pay for thermometers? | Based on the provided context, thermometers aren't covered. The document "MEDICAL SUPPLIES AND APPLIANCES" lists thermometers as excluded supplies. |
| Is laser surgery for eyes covered? | Based on the provided context, laser surgery for eyes isn't covered. The "VISION" section states that surgery to correct nearsightedness, including laser surgery, is listed under plan exclusions. |
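A data set like the one above can be kept as structured question/expected-response records. Here is a minimal Python sketch using the rows above; the field names and the simple grounding check are illustrative assumptions, not part of any product API:

```python
# Illustrative representation of the evaluation set above as paired
# question / expected-response records (field names are assumptions).
eval_set = [
    {
        "question": "Is aromatherapy covered?",
        "expected_response": (
            "No, aromatherapy isn't covered. According to the provided "
            "context, aromatherapy is listed under \"Alternative Treatments\" "
            "which aren't covered by UnitedHealthcare Medical Plans."
        ),
    },
    {
        "question": "Do you pay for thermometers?",
        "expected_response": (
            "Based on the provided context, thermometers aren't covered. "
            "The document \"MEDICAL SUPPLIES AND APPLIANCES\" lists "
            "thermometers as excluded supplies."
        ),
    },
]

def is_grounded(response: str) -> bool:
    """Cheap sanity check: the expected response should cite the source context."""
    return "context" in response.lower() or "document" in response.lower()

# Every expected response in the set must be grounded in the source context.
assert all(is_grounded(row["expected_response"]) for row in eval_set)
```

Keeping the set as plain records makes it easy to run the same questions against the agent repeatedly and to add sanity checks on the expected responses themselves.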
General Evaluation Guidelines
These principles apply to all agent teams and help ensure comprehensive testing.
- Safety and grounding: Validate that the agent adheres to safety policies and avoids hallucinations: if no relevant information is found, it should say so rather than guess.
- Ambiguity and multipart queries: Evaluate the agent’s ability to handle vague queries, implicit reasoning, or single turns containing multiple distinct questions.
- Negative testing: Introduce out-of-scope or unanswerable questions, for example, topics not covered by the available tools, to ensure the agent responds with an appropriate "I don't know" or redirect.
- Multiturn dialogue: If the agent is conversational, design tests that require maintaining context and history across several turns, for example, answering a follow-up question based on the previous answer.
- Latency and performance: Measure response time and efficiency, particularly in scenarios involving calls to multiple tools or complex reasoning.
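To illustrate the negative-testing point above, the following Python sketch checks that out-of-scope questions produce an explicit refusal. `ask_agent` is a stand-in stub for your real agent call, and the refusal markers are assumptions you would tune to your agent's actual phrasing:

```python
# Sketch of a negative test: out-of-scope questions should produce an
# explicit "I don't know" style answer instead of a guess.
OUT_OF_SCOPE = [
    "What is the capital of France?",
    "Can you book me a flight?",
]

# Phrases that count as an acceptable refusal (illustrative).
REFUSAL_MARKERS = ("i don't know", "not covered by the available", "cannot help")

def ask_agent(question: str) -> str:
    # Stub standing in for the real agent call: a well-behaved agent
    # declines questions it has no tools or context for.
    return "I don't know; that topic isn't covered by the available tools."

def passes_negative_test(question: str) -> bool:
    answer = ask_agent(question).lower()
    return any(marker in answer for marker in REFUSAL_MARKERS)

results = {q: passes_negative_test(q) for q in OUT_OF_SCOPE}
```

A real harness would replace the stub with a call to the deployed agent and report which out-of-scope questions triggered a guess instead of a refusal.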
Evaluation Guidelines for Supervisor Agent Teams
Agent teams of type Supervisor are evaluated based on the tools they use.
| Tool | Guidelines |
|---|---|
| Document Tool (RAG) | Questions designed for Retrieval-Augmented Generation (RAG) must test the agent's ability to handle complexity, not just simple keyword retrieval. |
| Business Object | |
| REST | |
| Deep Link | Validation: Confirm that deep links are generated correctly and for the appropriate scenarios. |
| MCP (Model Context Protocol) | |
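The Deep Link validation guideline can be automated with a simple pattern check. In the sketch below, the URL shape and the generated links are purely illustrative assumptions, not a real product URL scheme:

```python
import re

# Hypothetical deep-link validation: confirm generated links match the
# expected URL shape for the scenario (pattern and links are assumptions).
DEEP_LINK_PATTERN = re.compile(r"https://app\.example\.com/claims/\d+")

generated_links = [
    "https://app.example.com/claims/1042",
    "https://app.example.com/claims/7",
]

def is_valid_deep_link(link: str) -> bool:
    # fullmatch ensures the whole string conforms, not just a prefix.
    return DEEP_LINK_PATTERN.fullmatch(link) is not None

assert all(is_valid_deep_link(link) for link in generated_links)
```

Pattern checks catch malformed links cheaply; whether the link points at the *appropriate* scenario still needs scenario-level test cases.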
Evaluation Guidelines for Workflow Agent Teams
Evaluations for Workflow agents must test the overall logic flow and the robustness of individual nodes.
Workflow Structure and Logic
- Path coverage: If the workflow has multiple paths or branches, ensure the evaluation tests all paths.
- Scenario depth: Test multiple distinct scenarios for the same path to ensure consistency.
- Unsupported scenarios: Include tests that identify scenarios the workflow doesn't support, to verify graceful failure or error handling.
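The path-coverage and scenario-depth guidelines above can be organized as a small test matrix. The path and scenario names in this Python sketch are illustrative assumptions:

```python
# Sketch of a path-coverage matrix for a branching workflow: each path is
# paired with several distinct scenarios so every branch is exercised more
# than once (path and scenario names are illustrative).
paths = ["approved", "rejected", "needs_review"]
scenarios_per_path = {
    "approved": ["small claim", "large claim"],
    "rejected": ["missing documents", "out-of-network provider"],
    "needs_review": ["ambiguous diagnosis code", "duplicate submission"],
}

test_cases = [
    (path, scenario)
    for path in paths
    for scenario in scenarios_per_path[path]
]

# Every path is covered, and each path has at least two scenarios.
assert {p for p, _ in test_cases} == set(paths)
assert all(len(s) >= 2 for s in scenarios_per_path.values())
```

Maintaining the matrix explicitly makes it obvious when a newly added branch has no test cases at all.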
Workflow Nodes
A workflow includes multiple node types, and the evaluation must cover the key scenarios applicable to each type.
| Node | Scenario |
|---|---|
| LLM | Test the LLM prompt to identify if it can handle all types of questions and formatting instructions. |
| Code | |
| Tools | Apply the same guidelines as specified for agent teams of type Supervisor. Test functions from different angles and with varied parameters. |
| RAG Document Tool | Apply the same guidelines as specified for agent teams of type Supervisor. |
| Document Processor | Format Handling: Test the node for various attachment types (PDF, txt, and so on) to ensure consistent text extraction. |
| Vector DB Reader | Retrieval Accuracy: Test if the node retrieves the most semantically relevant chunks based on the input query. |

Make sure the evaluation also covers the other supported node types, such as Vector DB Writer.
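A retrieval-accuracy check for the Vector DB Reader node might look like the following Python sketch. The stub index, chunk IDs, and recall metric are assumptions for illustration; a real test would call the actual node:

```python
# Sketch of a retrieval-accuracy check for a Vector DB Reader node:
# compare the chunk IDs the node returns against the chunks a human
# marked as relevant (the retriever here is a stub, not a real node).
def retrieve_chunks(query: str) -> list[str]:
    stub_index = {
        "is aromatherapy covered?": ["alt-treatments-01", "exclusions-03"],
    }
    return stub_index.get(query.lower(), [])

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0

score = recall_at_k(
    retrieve_chunks("Is aromatherapy covered?"),
    relevant={"alt-treatments-01", "exclusions-03"},
)
```

Tracking a simple metric like recall@k per query makes regressions in the reader node visible as soon as the index or embedding model changes.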