Best Practices for Creating Evaluation Sets

A well-constructed evaluation set makes your evaluation more efficient and more reliable. The following best practices apply to creating evaluation data sets and testing protocols for agent teams.

The foundation of a successful evaluation is a high-quality data set. The data set consists of pairs: a question or input, and the expected response. Each expected response must be factually correct and grounded in the source context.

Example Data Set

Question: Is aromatherapy covered?
Expected response: No, aromatherapy isn't covered. According to the provided context, aromatherapy is listed under "Alternative Treatments", which aren't covered by UnitedHealthcare Medical Plans.

Question: Do you pay for thermometers?
Expected response: Based on the provided context, thermometers aren't covered. The document "MEDICAL SUPPLIES AND APPLIANCES" lists thermometers as excluded supplies.

Question: Is laser surgery for eyes covered?
Expected response: Based on the provided context, laser surgery for eyes isn't covered. The "VISION" section states that surgery to correct nearsightedness, including laser surgery, is listed under plan exclusions.
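A data set like the one above can be stored as simple structured records. The following is a minimal sketch in Python; the field names are illustrative, not a required schema:

```python
# Minimal evaluation data set: each record pairs an input question
# with the expected, context-grounded response.
eval_set = [
    {
        "question": "Is aromatherapy covered?",
        "expected_response": (
            "No, aromatherapy isn't covered. According to the provided "
            "context, aromatherapy is listed under \"Alternative Treatments\", "
            "which aren't covered by UnitedHealthcare Medical Plans."
        ),
    },
    {
        "question": "Do you pay for thermometers?",
        "expected_response": (
            "Based on the provided context, thermometers aren't covered. "
            "The document \"MEDICAL SUPPLIES AND APPLIANCES\" lists "
            "thermometers as excluded supplies."
        ),
    },
]

# Basic sanity checks to run before any evaluation:
for record in eval_set:
    assert record["question"].strip(), "every record needs a question"
    assert record["expected_response"].strip(), "every record needs an expected response"
```

Keeping records in a plain structure like this makes the data set easy to version-control and to validate before each evaluation run.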

General Evaluation Guidelines

These principles apply to all agent teams and ensure comprehensive testing.

  • Safety and grounding: Validate that the agent adheres to safety policies and avoids hallucinations: if no relevant information is found, the agent should say so rather than guess.
  • Ambiguity and multipart queries: Evaluate the agent’s ability to handle vague queries, implicit reasoning, or single turns containing multiple distinct questions.
  • Negative testing: Introduce out-of-scope or unanswerable questions, for example, topics not covered by the available tools, to ensure the agent responds with an appropriate "I don't know" or redirect.
  • Multiturn dialogue: If the agent is conversational, design tests that require maintaining context and history across several turns, for example, answering a follow-up question based on the previous answer.
  • Latency and performance: Measure response time and efficiency, particularly in scenarios involving calls to multiple tools or complex reasoning.
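Several of these guidelines, such as negative testing and grounding, can be automated with simple checks. The sketch below is illustrative only; the agent invocation and the refusal phrases are assumptions, not a fixed API:

```python
# Illustrative negative-testing check: out-of-scope questions should
# produce a refusal or redirect, not a fabricated answer.
REFUSAL_MARKERS = [
    "i don't know",
    "not covered in the provided context",
    "no relevant information",
]

def looks_like_refusal(response: str) -> bool:
    """Heuristic: does the response admit the question is unanswerable?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# Hypothetical out-of-scope questions: topics no available tool covers.
out_of_scope_cases = [
    "What is the weather in Paris today?",
    "Summarize last year's stock performance.",
]

def evaluate_negative_tests(agent_fn):
    """agent_fn is a placeholder for however you invoke your agent."""
    failures = []
    for question in out_of_scope_cases:
        response = agent_fn(question)
        if not looks_like_refusal(response):
            failures.append((question, response))
    return failures
```

In practice you would replace the substring heuristic with your platform's grounding or faithfulness metric; the point is that negative tests belong in the data set alongside the answerable questions.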

Evaluation Guidelines for Supervisor Agent Teams

Agent teams of type Supervisor are evaluated based on the tools they use.

Tool Guidelines
Document Tool (RAG)

Questions designed for Retrieval-Augmented Generation (RAG) must test the agent's ability to handle complexity, not just simple keyword retrieval.

  • Long Range Context: Test if the agent can resolve dependencies scattered across distant sections or several pages of a document.
  • Distributed Context: Ensure the agent can aggregate information from multiple noncontiguous parts of the document to answer comprehensively.
  • Concealed Context: Test the ability to find and extract specific, obscure details deep within the text.
  • Reasoning: Check if the agent can apply reasoning or logic to the retrieved information to provide a correct answer.
  • Table-Sourced: Test the ability to interpret and pull accurate data from tables within the document.
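These RAG categories can be encoded directly in the data set so that coverage is measurable. A minimal sketch follows; the category names mirror the list above, and the questions are illustrative:

```python
from collections import Counter

# Each RAG evaluation record is tagged with the complexity it exercises.
rag_cases = [
    {"category": "long_range_context",
     "question": "How does the exclusion in section 2 affect the benefit in section 9?"},
    {"category": "distributed_context",
     "question": "List every service the plan excludes, across all sections."},
    {"category": "concealed_context",
     "question": "What is the exact copay for out-of-network lab work?"},
    {"category": "reasoning",
     "question": "If thermometers are excluded, is a digital thermometer reimbursable?"},
    {"category": "table_sourced",
     "question": "What deductible is listed in the coverage table?"},
]

# Coverage report: every category should appear at least once.
required = {"long_range_context", "distributed_context",
            "concealed_context", "reasoning", "table_sourced"}
coverage = Counter(case["category"] for case in rag_cases)
missing = required - set(coverage)
```

A quick check that `missing` is empty before each run keeps the RAG portion of the data set from silently drifting toward simple keyword-retrieval questions.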
Business Object
  • Function Coverage: Ensure evaluations test all business functions available in the business objects.
  • Parameter Variation: Test the same business function from different angles and with different parameters. For example, if a BO creates an object, test it with various input types.
REST
  • Endpoint Coverage: Ensure evaluations test all functions available in the REST tool.
  • Scenario Variation: Test the same REST tool for different scenarios and parameters, for example, handling different payload structures or update types.
  • Deep Link Validation: Confirm that deep links are generated correctly and for the appropriate scenarios.
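Scenario variation for a REST tool means exercising the same operation with distinct payload shapes. A sketch, where the operation name and fields are illustrative:

```python
# The same hypothetical "update_ticket" operation, driven with
# different payload structures to cover distinct update types.
rest_cases = [
    {"operation": "update_ticket",
     "payload": {"id": 1, "status": "closed"}},                         # status-only update
    {"operation": "update_ticket",
     "payload": {"id": 1, "priority": "high", "assignee": "agent-7"}},  # multi-field update
    {"operation": "update_ticket",
     "payload": {"id": 1}},                                             # minimal payload
]

# Confirm the cases really differ in shape, not just in values.
shapes = {tuple(sorted(case["payload"])) for case in rest_cases}
```

Each case targets the same endpoint, so any behavioral difference between runs isolates how the agent handles the payload structure itself.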
MCP (Model Context Protocol)
  • Tool selection: Validate that the correct MCP tool is called.
  • Function accuracy: Validate that the correct functions are called within the MCP tool.
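Tool-selection and function-accuracy checks can be expressed as assertions on the agent's call trace. The sketch below assumes the trace is available as a list of (tool, function) pairs; the actual trace format depends on your agent platform:

```python
def check_tool_calls(trace, expected):
    """Compare observed (tool, function) calls against the expected sequence.

    Both arguments are lists of (tool_name, function_name) tuples.
    Returns a list of mismatches; empty means the trace matched.
    """
    mismatches = []
    for i, exp in enumerate(expected):
        got = trace[i] if i < len(trace) else None
        if got != exp:
            mismatches.append({"step": i, "expected": exp, "got": got})
    return mismatches

# Example: the agent should call a hypothetical "claims" MCP tool's
# "get_coverage" function for a coverage question.
trace = [("claims", "get_coverage")]
expected = [("claims", "get_coverage")]
```

Recording the expected call sequence next to each question in the data set lets the same records drive both answer-quality and tool-selection checks.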

Evaluation Guidelines for Workflow Agent Teams

Evaluations for Workflow agents must test the overall logic flow and the robustness of individual nodes.

Workflow Structure and Logic

  • Path coverage: If the workflow has multiple paths or branches, ensure the evaluation tests all paths.
  • Scenario depth: Test multiple distinct scenarios for the same path to ensure consistency.
  • Unsupported scenarios: Include tests that identify scenarios the workflow doesn't support, to verify graceful failure or error handling.
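Path coverage can be tracked the same way as category coverage: tag each test case with the branch it exercises. A minimal sketch, with branch names that are purely illustrative:

```python
# Each test case declares which workflow path it exercises,
# including a deliberately unsupported scenario.
workflow_cases = [
    {"path": "approved",      "input": "Claim under the auto-approval limit"},
    {"path": "manual_review", "input": "Claim flagged for missing documents"},
    {"path": "rejected",      "input": "Claim for an excluded service"},
    {"path": "unsupported",   "input": "Request outside the workflow's scope"},
]

all_paths = {"approved", "manual_review", "rejected", "unsupported"}
covered = {case["path"] for case in workflow_cases}
uncovered = all_paths - covered
```

Adding a second or third case per path then gives the scenario depth described above without changing the bookkeeping.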

Workflow Nodes

A workflow includes multiple node types, and the evaluation must cover the key scenarios applicable to each type.

LLM
  • Prompt robustness: Test the LLM prompt to verify that it can handle all types of questions and formatting instructions.
Code
  • Robustness: Include questions that test the stability of the code.
  • Edge Cases: Test scenarios that cover different edge cases to ensure the code doesn't break the flow.
Business Object and REST
  • Apply the same guidelines as specified for agent teams of type Supervisor. Test functions from different angles and with varied parameters.
RAG Document Tool
  • Apply the same guidelines as specified for agent teams of type Supervisor.
Document Processor
  • Format Handling: Test the node for various attachment types (PDF, txt, and so on) to ensure consistent text extraction.
Vector DB Reader
  • Retrieval Accuracy: Test whether the node retrieves the most semantically relevant chunks based on the input query.

Make sure that the evaluation tests also cover the remaining supported nodes, including Tools and Vector DB Writer.