Fusion AI | How can I monitor and evaluate AI agents?
[Title and music]
In AI Agent Studio, you can monitor your agents to see how they're performing and analyze their accuracy.
The metrics shown in AI Agent Studio are collected on a regular basis by a scheduled process.
Go to Scheduled Processes and select Schedule New Process.
Search for Aggregate AI Agent Usage and Metrics.
For this example, we'll schedule it to run once daily.
In AI Agent Studio, under the Monitoring and Evaluation tab, the Monitoring subtab displays the aggregated metrics of supervisor and workflow agent runs over the selected time frame, including agents in draft status.
Each row represents a single session and displays the number of back-and-forth messages, whether the session finished successfully, any errors encountered, and the number of tokens used.
I'll select a session that we can take a closer look at.
The detailed trace view displays a step-by-step timeline of the entire conversation. It shows which tools were called, along with the duration and metrics for each step.
You can also use evaluation sets to assess the performance of your agents.
These are specific to each agent, and an agent can have multiple evaluation sets.
To create an evaluation set, go to Manage Evaluations and select the Add icon.
Enter a name and a description for the evaluation set, then select the agent team to be evaluated.
For this example, I'll leave the run mode as Sequential so the questions run in this exact order, because at least one of my questions depends on the context of a previous one.
Select Enable Document Tool Evaluation Metrics if you want to view RAG metrics in the evaluation report.
In the Questions tab, add common questions that users are likely to ask the agent and the answers that you want the agent to deliver.
Here, I'll use some sample questions that are already established for this agent.
You can add questions individually or you can add them from a CSV file with questions listed in the first column and the expected answers in the second column.
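The expected CSV layout can be sketched in a few lines of Python; the questions and answers below are hypothetical placeholders, not part of the product:

```python
import csv
import io

# Hypothetical question/answer pairs; real content depends on your agent.
rows = [
    ["What is my remaining vacation balance?", "You have 12 vacation days remaining."],
    ["How do I submit an expense report?", "Go to Expenses and select Create Report."],
]

# Questions go in the first column, expected answers in the second,
# which is the layout the evaluation set import expects.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing the same rows to a file on disk produces a CSV you can import directly into the evaluation set.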
In the Metrics tab, edit each metric to set the pass and fail criteria.
For this example, I'll set the correctness score to indicate that the test fails if its value is less than 0.7.
Select Create to save the evaluation set.
When you're ready, on the Manage Evaluations page, select Initiate Evaluation Run for your evaluation set.
Once it's done running, we can view the results.
In the Correctness subtab, the correctness score uses an LLM-as-a-judge approach to compare the actual results to the expected results, and provides both a numeric score and an explanation.
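The general shape of LLM-as-a-judge scoring can be sketched as a prompt that hands the judge model the question, the expected answer, and the agent's actual answer. This is a generic illustration, not the product's internal prompt or API:

```python
def build_judge_prompt(question: str, expected: str, actual: str) -> str:
    # Illustrative only: real systems use their own judge prompts and models.
    return (
        "You are grading an AI agent's answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        "Return a correctness score between 0 and 1 and a short explanation."
    )

prompt = build_judge_prompt(
    "How many vacation days do I have?",
    "You have 12 vacation days remaining.",
    "You have twelve days of vacation left.",
)
print(prompt)
```

The judge model's numeric score is then compared against the pass/fail threshold configured in the Metrics tab.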
To spot any regressions or performance issues after modifying your agent, you can compare different runs of the same evaluation.
The Summary tab displays a high-level overview of the performance differences between the selected runs.
The Details tab provides a question-by-question breakdown of the runs. You can also compare latencies, token counts, and trace links for each question.
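Spotting a regression between two runs comes down to per-question metric deltas. A minimal sketch, with hypothetical question IDs and metric names:

```python
# Hypothetical per-question metrics from two evaluation runs.
run_a = {"Q1": {"latency_ms": 1200, "tokens": 450},
         "Q2": {"latency_ms": 900,  "tokens": 300}}
run_b = {"Q1": {"latency_ms": 1500, "tokens": 480},
         "Q2": {"latency_ms": 850,  "tokens": 290}}

def diff_runs(a: dict, b: dict) -> dict:
    """Per-question metric deltas (b minus a); a positive latency
    delta suggests the later run got slower on that question."""
    return {q: {m: b[q][m] - a[q][m] for m in a[q]} for q in a if q in b}

deltas = diff_runs(run_a, run_b)
print(deltas["Q1"]["latency_ms"])  # Q1 got 300 ms slower in run B
```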
Thanks for watching.