
- Capture golden-path sessions in the web UI,
- Configure simple metrics (tool trajectory vs. response similarity),
- Run evaluations from the UI and CLI,
- Iterate on prompts and tools to improve scores.
- Tool trajectory (tool usage and order): Did the agent call the right tools in the right order?
- Final response similarity: Is the final agent message close to the reference/expected response?
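ADK computes these scores internally; as a rough illustration of the idea only (not ADK's actual implementation), a trajectory score of this kind can be computed as the average of exact per-step matches between expected and actual tool calls:

```python
# Illustrative sketch only -- not ADK's implementation.
# Scores 1.0 when expected and actual tool calls match exactly
# (name, order, and arguments); averages per-step matches otherwise.

def tool_trajectory_avg_score(expected, actual):
    """Average of exact per-step matches between two tool-call lists."""
    if not expected and not actual:
        return 1.0
    steps = max(len(expected), len(actual))
    matches = sum(
        1 for e, a in zip(expected, actual)
        if e["name"] == a["name"] and e.get("args") == a.get("args")
    )
    return matches / steps

expected = [{"name": "lookup_user", "args": {"email": "carol@example.com"}}]
actual = [{"name": "lookup_user", "args": {"email": "carol@example.com"}}]
print(tool_trajectory_avg_score(expected, actual))  # 1.0
```

A wrong tool, wrong order, or mismatched arguments at any step pulls the average below 1.0, which is why a strict threshold of 1.0 is common for this metric.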
- Python Basics — minimal imports and local test data referenced below.
- Record golden-path sessions in the ADK web UI.
- Group sessions into an eval set (e.g., `helpdesk_core_flows`).
- Configure threshold criteria (JSON file).
- Run the evaluation (UI or `adk eval`).
- Review per-case metrics and iterate on prompts, tool definitions, or expected responses.
- Locked account flow:
  - Agent should call `lookup_user` and detect a locked account.
  - Agent should inform the user the account is locked and offer to open a ticket.
  - Agent should call the ticket-creation tool if the user agrees.
- VPN outage flow:
  - Agent should call `check_service_status`.
  - If the service is degraded or in outage for multiple users, create a high-impact ticket.
  - Agent should call the ticket-creation tool with high impact if the user agrees.

Both flows are scored on the two metrics, `tool_trajectory_avg_score` and `response_match_score`.
Metrics summary
| Metric | What it measures | Example threshold |
|---|---|---|
| tool_trajectory_avg_score | Correct tools called in the correct order with the correct arguments | 1.0 |
| response_match_score | Similarity of the final response to the expected response | 0.8 |
- User: “My email says my account is locked. My email is carol@example.com”
- Desired agent: “I see that your account carol@example.com is indeed locked. To unlock your account, you’ll need to contact IT directly. Would you like me to open a ticket for this issue?”
- User (VPN case): “My whole team can’t use the VPN this morning. We’re all blocked. My email is alice@example.com”
- Desired agent: “I’ve checked the VPN service status and it appears degraded. Since this is affecting your team, I can open a ticket. Would you like me to do that?”
Create an eval set (e.g., `helpdesk_core_flows`) and add these captured sessions.
Configuring thresholds via JSON
Create folder helpdesk_agent/evals and add test_config.json to tweak pass/fail thresholds:
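A minimal `test_config.json` using ADK's `criteria` format, with the thresholds from the table above:

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```

Any case whose scores fall below these values is reported as a failure.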
If you get errors like `ModuleNotFoundError: No module named 'rouge_score'`, install the ADK eval extras, which include the text-similarity dependencies: `pip install "google-adk[eval]"`. This pulls in packages used by the evaluator for the response similarity metric.
- Create an eval set (e.g., `helpdesk_core_flows`).
- Add the captured sessions to that eval set.
- Configure thresholds inline or reference your `test_config.json`.
- Click “Run evaluation” to see detailed results: expected vs. actual tool calls and response similarity scores.
- Ensure eval dependencies are installed (see the `pip install "google-adk[eval]"` note above).
- Run the ADK web server if you want the UI.
- Run evaluations programmatically with `adk eval`, providing your agent module path, the evalset file, and the config file.
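Putting the steps above together, the commands might look like this (the evalset filename below is an assumption; use the file your web UI session produced):

```shell
# Install eval extras (pulls in rouge_score for response matching)
pip install "google-adk[eval]"

# Optional: launch the ADK web UI for interactive evaluation
adk web

# Run the evaluation from the CLI
adk eval \
  helpdesk_agent \
  helpdesk_core_flows.evalset.json \
  --config_file_path helpdesk_agent/evals/test_config.json \
  --print_detailed_results
```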
Here, `helpdesk_agent` is the importable agent module/directory.
Sample CLI output (condensed)
- Trajectory passed: the expected tool was called in the correct order.
- Response similarity failed: the similarity score (~0.639) did not meet the 0.8 threshold.
LLMs are non-deterministic: a case that passes once may fail later. If you see flakiness, try multiple runs, relax thresholds temporarily, or increase robustness by improving prompts and tool grounding.
`adk eval` supports flags for config files, eval storage URIs, and detailed printing:

- `--config_file_path` — JSON criteria file.
- `--eval_storage_uri` — where to store eval results (e.g., `gs://bucket/...`).
- `--print_detailed_results` — prints the full invocation result to the console.
- Use CI jobs or nightly runs to detect regressions over time.
- If an eval fails, start with the prompt and tool instructions — improving the initial prompt often yields the best gains.
- Lower thresholds to get passing CI quickly, but prioritize improving the underlying prompts and tool grounding to achieve reliable results.
- Add more golden-path and edge cases to your eval sets to cover real-world variations.
- Evaluator invocation objects include:
- Expected and actual tool calls and their arguments,
- Expected and actual final responses,
- Per-invocation metric scores.
- Ingest these objects into a database or CI artifact store for nightly regressions or pre-release checks.
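As a sketch of that ingestion idea, assuming you have exported per-invocation results to plain Python dicts (the field names here are illustrative, not ADK's exact result schema), you could load scores into SQLite for trend tracking:

```python
# Hypothetical sketch: store per-invocation eval scores in SQLite so CI
# or nightly jobs can track trends. Field names ("case_id", "scores")
# are illustrative -- adapt them to the evaluator output you export.
import sqlite3

def ingest_results(db_path, run_id, results):
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_scores (
               run_id TEXT, case_id TEXT, metric TEXT, score REAL
           )"""
    )
    for case in results:
        for metric, score in case["scores"].items():
            conn.execute(
                "INSERT INTO eval_scores VALUES (?, ?, ?, ?)",
                (run_id, case["case_id"], metric, score),
            )
    conn.commit()
    conn.close()

# Example payload shaped like the locked-account case discussed above
results = [
    {"case_id": "locked_account", "scores": {
        "tool_trajectory_avg_score": 1.0, "response_match_score": 0.64}},
]
ingest_results("evals.db", "nightly-run", results)
```

Querying this table across run IDs makes regressions visible as score trends rather than single pass/fail flags.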
- A smart IT helpdesk assistant that:
  - Troubleshoots email, VPN, GitLab, and Wi‑Fi issues,
  - Looks up users and service status with tools,
  - Creates structured tickets using a `Ticket` schema,
  - Uses evaluation sets to guard against regressions.
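For reference, a structured ticket schema of the kind described might be sketched as follows; the field names and allowed values are assumptions for illustration, not the tutorial's exact `Ticket` schema:

```python
# Hypothetical sketch of a structured Ticket schema using a stdlib
# dataclass. Fields and allowed values are illustrative only.
from dataclasses import dataclass, asdict

@dataclass
class Ticket:
    user_email: str
    service: str          # e.g. "email", "vpn", "gitlab", "wifi"
    summary: str
    impact: str = "low"   # "low" | "medium" | "high"

t = Ticket(
    user_email="alice@example.com",
    service="vpn",
    summary="Team-wide VPN outage this morning",
    impact="high",
)
print(asdict(t))
```

Keeping ticket creation behind a typed schema like this makes the expected tool arguments explicit, which is exactly what the trajectory metric checks.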
- Iterate on your instruction prompt to be explicit about which tools to call and when.
- Add more golden-path and edge cases to your eval set.
- Store evaluation results in cloud storage or a database to track trends over time.
- If your helpdesk grows, split responsibilities across smaller agents (triage, ticketing, knowledge) for clearer eval boundaries.
- Python basics course: Python Basics