Evaluate Agents

Use evaluation sets to assess your agents’ performance. An evaluation set contains one or more test questions, the expected agent responses, and the metrics to be measured. Evaluation sets are specific to each agent, and an agent can have multiple evaluation sets.

  1. Create an evaluation set for an agent.
    1. Go to Navigator > Tools > AI Agent Studio.
    2. Select the Monitoring and Evaluation tab.

      All evaluations that have been run on your agents are displayed on this tab.

    3. To create an evaluation set, click Manage Evaluations and select Add.

    4. Enter a name, code, and description for the evaluation set, and select the agent team to be evaluated.
    5. Choose the run mode.

      Sequential: Runs the questions in the exact order in which you define them. Use this mode if a question depends on the context of the previous one.

      Random: Runs the questions in a random order.
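
      As an illustration of the difference, the following Python sketch orders a hypothetical question list the way each mode would. It is conceptual only; the question texts are made up and this is not the product's implementation.

        import random

        # Hypothetical evaluation questions; the texts are illustrative only.
        questions = [
            "How do I reset my password?",
            "What happens after the reset email is sent?",  # relies on the previous question
            "How do I check the status of my request?",
        ]

        # Sequential mode: the questions run in the exact order defined above.
        sequential_order = list(questions)

        # Random mode: the questions run in a shuffled order, so a question
        # cannot rely on context from the one before it.
        random_order = random.sample(questions, k=len(questions))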

    6. From the Questions tab, add common questions that users are likely to ask the agent, along with the answers you would like the agent to deliver. Ensure that both the questions and the answers are concise, user-friendly, and reflective of best practices.

      You can either add your questions and the expected answers individually, or upload a CSV file with the questions in the first column and the expected answers in the second column.
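
      For example, a small Python script like the following sketch could produce a suitable file. The file name and question-answer pairs are hypothetical, and the sketch omits a header row; it only illustrates the two-column layout described above.

        import csv

        # Questions in the first column, expected answers in the second column,
        # one question-answer pair per row.
        rows = [
            ("How do I reset my password?",
             "Select Forgot Password on the login page and follow the emailed link."),
            ("How do I check the status of my request?",
             "Open My Requests and select the request to view its current status."),
        ]

        with open("evaluation_questions.csv", "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)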

    7. From the Metrics tab, edit each metric to set the pass and fail criteria. For example, to indicate that the test fails if the correctness score is less than 0.7, choose < as the threshold condition and enter 0.7 as the threshold value.
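
      Conceptually, a threshold condition and value map to a pass/fail check like the Python sketch below. The condition symbols and scores are illustrative examples, not the product's internal implementation.

        import operator

        # Map each threshold condition symbol to a comparison function.
        THRESHOLD_CONDITIONS = {
            "<": operator.lt,
            "<=": operator.le,
            ">": operator.gt,
            ">=": operator.ge,
        }

        def test_fails(score: float, condition: str, threshold: float) -> bool:
            """Return True when the score meets the failure condition."""
            return THRESHOLD_CONDITIONS[condition](score, threshold)

        print(test_fails(0.65, "<", 0.7))  # True: correctness below 0.7, so the test fails
        print(test_fails(0.82, "<", 0.7))  # False: correctness meets the threshold
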
    8. Select Create to save the evaluation set.
  2. Run the evaluation set.
    1. On the Manage Evaluations page, select the Initiate Evaluation Run action for your evaluation set.
    2. Choose the version of the agent team to evaluate and run the evaluation.
  3. Analyze the results.
    1. Click the evaluation set to view the Evaluation Runs page.
    2. Select the evaluation run, and then select the View Run Results action.

      The run results are organized into the following tabs:

      Tab: Response performance
      Information displayed:
      • A comparison of the expected response versus the actual response from the agent for each question, along with the metrics for each question in the evaluation.
      • Trace, which provides information about the detailed timeline for each question in the evaluation.

      Tab: Correctness
      Information displayed: A detailed breakdown of the correctness score. The LLM provides an initial score and feedback, and you can add your own feedback for record keeping in the Correctness Score by Human column.