Monitor and Evaluate AI Agents
- Monitoring - Tracks performance and provides insight into how your agents behave in production. Monitor agents to ensure that your quality bars for response time and token counts are maintained over time. You can also review any logged errors here.
- Evaluation - Evaluate agents before you deploy them to ensure that they're ready for production. Test your agents for response correctness, response time, and token usage to meet your quality standards. After you change your agent, or after a model update, rerun evaluations to confirm that your agent continues to perform as expected. This proactive approach helps you maintain high-quality experiences for your users.
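As a rough illustration of this kind of pre-deployment check, the sketch below runs a small evaluation set against a hypothetical `run_agent` function and enforces example thresholds for correctness, response time, and token usage. The function name, evaluation set, and thresholds are assumptions for the example, not part of AI Agent Studio.

```python
import time

def run_agent(prompt: str) -> tuple[str, int]:
    """Hypothetical agent call: returns the answer text and the tokens used."""
    return "Paris", 42  # stand-in for a real agent invocation

# Tiny evaluation set with reference answers.
EVAL_SET = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
]

MAX_LATENCY_MS = 2000  # example quality bar for response time
MAX_TOKENS = 500       # example quality bar for token usage

def evaluate() -> None:
    for case in EVAL_SET:
        start = time.perf_counter()
        answer, tokens = run_agent(case["prompt"])
        latency_ms = (time.perf_counter() - start) * 1000

        # Simple exact-match correctness; real evaluations might use
        # semantic similarity or an LLM-based judge instead.
        assert answer.strip().lower() == case["reference"].strip().lower(), \
            f"Wrong answer for {case['prompt']!r}: {answer!r}"
        assert latency_ms <= MAX_LATENCY_MS, f"Too slow: {latency_ms:.0f} ms"
        assert tokens <= MAX_TOKENS, f"Too many tokens: {tokens}"

if __name__ == "__main__":
    evaluate()
    print("All evaluation cases passed.")
```

Rerunning a check like this after each agent change or model update gives you a quick regression signal before you redeploy.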
This table summarizes some key metrics, their descriptions, and their availability for monitoring or evaluation.
Metric | Description | Available to Evaluate | Available to Monitor |
---|---|---|---|
Error Rate | Percentage of user sessions that ended in an error. | Yes | Yes |
Error Count | Total number of errors recorded. | Yes | Yes |
Session Count | Total number of unique conversational sessions between a user and an AI agent. One session can include multiple messages or evaluation runs. | Yes | Yes |
P99 Latency | The maximum wait time in milliseconds for 99% of users, revealing any areas where you should review and optimize the prompts or structure of the agent. | Yes | Yes |
P50 Latency | The maximum wait time in milliseconds for 50% of users, helping identify performance issues. You can view this metric in the details of the monitoring or evaluation results. | Yes | Yes |
Total Tokens | Cumulative number of tokens used by all agents. | Yes | Yes |
Input Token Count | Total tokens sent to the LLM for requests. This includes system prompts, user messages, retrieved or context data, chat history, and tool or function definitions. | No | Yes |
Output Token Count | Total tokens generated by the LLM in response to the requests sent to it. | Yes | Yes |
Median Correctness | The 50th percentile of correctness scores across evaluation runs. Each score (0–1) is computed by comparing the agent’s answer to the reference answer provided in the evaluation set. | Yes | No |
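To make the latency-percentile and correctness definitions above concrete, here is a small sketch, assuming hypothetical per-session records, that computes P50 and P99 latency with a nearest-rank percentile alongside session count, error count, error rate, and median correctness. It illustrates the metric definitions only; it isn't the implementation AI Agent Studio uses.

```python
import math
import statistics

# Hypothetical per-session records; in practice these come from your
# agent's logs or evaluation results.
sessions = [
    {"latency_ms": 420, "error": False, "correctness": 0.9},
    {"latency_ms": 650, "error": False, "correctness": 0.7},
    {"latency_ms": 980, "error": True,  "correctness": 0.0},
    {"latency_ms": 310, "error": False, "correctness": 1.0},
]

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at or below which pct% of observations fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies = [s["latency_ms"] for s in sessions]
errors = sum(s["error"] for s in sessions)

metrics = {
    "session_count": len(sessions),
    "error_count": errors,
    "error_rate_pct": 100 * errors / len(sessions),
    "p50_latency_ms": percentile(latencies, 50),  # half of sessions wait at most this long
    "p99_latency_ms": percentile(latencies, 99),  # worst-case wait for 99% of sessions
    "median_correctness": statistics.median(s["correctness"] for s in sessions),
}
print(metrics)
```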
Prerequisites
Aggregate the metrics that are displayed on the Monitoring and Evaluation tab in AI Agent Studio by running a scheduled process:
- Go to the Scheduled Processes work area.
- Click Schedule New Process.
- Leave the type as Job.
- Search for and select Aggregate AI Agent Usage and Metrics.
- Run the Aggregate AI Agent Usage and Metrics scheduled process.
You can schedule this process to run on a recurring basis, for example, once a day.
The process aggregates the metrics that are displayed in the Monitoring and Evaluation tab of AI Agent Studio.
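If you'd rather submit the process programmatically than through the UI, a sketch follows. It assumes your environment exposes the Oracle Fusion REST resource for submitting scheduled processes; the host, job package path, and job definition name are hypothetical placeholders that you would need to confirm for your own pod before relying on this.

```python
import requests

# Placeholder values: confirm the host, job package path, and job
# definition name in your own environment before using this sketch.
HOST = "https://your-pod.example.oraclecloud.com"
payload = {
    "OperationName": "submitESSJobRequest",
    "JobPackageName": "/oracle/apps/ess/your/package/path",  # hypothetical
    "JobDefName": "AggregateAIAgentUsageAndMetrics",          # hypothetical
}

response = requests.post(
    f"{HOST}/fscmRestApi/resources/11.13.18.05/erpintegrations",
    json=payload,
    auth=("integration.user", "your-password"),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
# The response typically includes the request ID of the submitted job.
print(response.json())
```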