This lesson demonstrates a lightweight, event-driven evaluation loop for ADK agents. The goal is a minimum viable evaluation setup that lets you:
  • Capture golden-path sessions in the web UI,
  • Configure simple metrics (tool trajectory vs. response similarity),
  • Run evaluations from the UI and CLI,
  • Iterate on prompts and tools to improve scores.
Two core signals ADK encourages you to measure:
  1. Tool trajectory (tool usage and order): Did the agent call the right tools in the right order?
  2. Final response similarity: Is the final agent message close to the reference/expected response?
Both signals are important: correct tool usage AND a correct final message.
Quick links
  • Python Basics — minimal imports and local test data referenced below.
Overview of the evaluation flow
  1. Record golden-path sessions in the ADK web UI.
  2. Group sessions into an eval set (e.g., helpdesk_core_flows).
  3. Configure threshold criteria (JSON file).
  4. Run the evaluation (UI or adk eval).
  5. Review per-case metrics and iterate on prompts, tool definitions, or expected responses.
Local test data and minimal imports
from typing import Dict, Any
from datetime import datetime
import uuid

from google.adk.tools import FunctionTool
from pydantic import BaseModel, Field

from schemas.ticket import Ticket

# Minimal fake directory and service status for local testing
_FAKE_USER_DIRECTORY: Dict[str, Dict[str, Any]] = {
    # Sample entries matching the flows below; field names are illustrative.
    "carol@example.com": {"name": "Carol", "account_status": "locked"},
    "alice@example.com": {"name": "Alice", "account_status": "active"},
}

_FAKE_SERVICE_STATUS: Dict[str, str] = {
    "email": "operational",
    "vpn": "degraded",
    "gitlab": "outage",
    "wifi": "operational",
}
High-level test flows to capture
  • Locked account flow:
    • Agent should call lookup_user and detect a locked account.
    • Agent should inform the user the account is locked and offer to open a ticket.
  • VPN outage flow:
    • Agent should call check_service_status.
    • If the service is degraded or in outage for multiple users, create a high-impact ticket.
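The lesson names a check_service_status tool but does not show its body. A minimal sketch, assuming the same result-object convention as lookup_user and reusing the fake status table defined above (the implementation details are assumptions, not ADK code):

```python
from typing import Any, Dict

# Fake status table (mirrors the local test data defined earlier).
_FAKE_SERVICE_STATUS: Dict[str, str] = {
    "email": "operational",
    "vpn": "degraded",
    "gitlab": "outage",
    "wifi": "operational",
}


def check_service_status_impl(service: str) -> Dict[str, Any]:
    """Return the current status of a backend service.

    Follows the same result-object convention as lookup_user:
    status is 'success' or 'error', plus a payload or error_message.
    """
    state = _FAKE_SERVICE_STATUS.get(service.lower())
    if state is None:
        return {"status": "error", "error_message": f"Unknown service: {service}"}
    return {"status": "success", "service": service.lower(), "state": state}
```

With this shape, the agent can branch on state == "degraded" or "outage" to decide whether to offer a high-impact ticket.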
Capture these two flows as golden sessions in the ADK web UI, then evaluate them against two built-in metrics: tool_trajectory_avg_score and response_match_score.
Metrics summary
| Metric | What it measures | Example threshold |
| --- | --- | --- |
| tool_trajectory_avg_score | Correct tools called in the correct order, with the correct arguments | 1.0 |
| response_match_score | Similarity of the final response to the expected response | 0.8 |
Example UI session (condensed)
  • User: “My email says my account is locked. My email is carol@example.com”
  • Desired agent: “I see that your account carol@example.com is indeed locked. To unlock your account, you’ll need to contact IT directly. Would you like me to open a ticket for this issue?”
  • User (VPN case): “My whole team can’t use the VPN this morning. We’re all blocked. My email is alice@example.com”
  • Desired agent: “I’ve checked the VPN service status and it appears degraded. Since this is affecting your team, I can open a ticket. Would you like me to do that?”
Create an eval set in the UI (example id/name: helpdesk_core_flows) and add these captured sessions.
Configuring thresholds via JSON
Create the folder helpdesk_agent/evals and add test_config.json to tweak pass/fail thresholds:
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
The JSON controls what counts as PASS for each metric.
Install evaluation dependencies
If you get errors like “ModuleNotFoundError: No module named ‘rouge_score’”, install the ADK eval extras, which include the text-similarity dependencies:
pip install "google-adk[eval]"
This pulls in the packages used by the evaluator for response similarity metrics.
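The rouge_score dependency suggests the response match is a ROUGE-style n-gram overlap. To build intuition for why a correct paraphrase can score below 1.0 (like the 0.639 seen later), here is a simple unigram-F1 sketch; this is an illustration in the spirit of ROUGE-1, not ADK's actual scorer:

```python
def unigram_f1(expected: str, actual: str) -> float:
    """Rough unigram-overlap F1, in the spirit of ROUGE-1.

    Not ADK's exact implementation; just shows why a correct
    paraphrase can still land well below a 0.8 threshold.
    """
    exp = expected.lower().split()
    act = actual.lower().split()
    if not exp or not act:
        return 0.0
    # Clipped word-overlap count between the two token lists.
    overlap = sum(min(exp.count(w), act.count(w)) for w in set(exp))
    precision = overlap / len(act)
    recall = overlap / len(exp)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0, while swapping even one word ("the" for "a") drops the score, which is why paraphrased agent responses need either a relaxed threshold or tighter response wording in the prompt.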
Running evaluations from the ADK web UI
  • Create an eval set (e.g., helpdesk_core_flows).
  • Add the captured sessions to that eval set.
  • Configure thresholds inline or reference your test_config.json.
  • Click “Run evaluation” to see detailed results: expected vs. actual tool calls and response similarity scores.
Common CLI evaluation workflow
  1. Ensure eval dependencies are installed:
(.venv) $ pip install "google-adk[eval]"
  2. Run the ADK web server if you want the UI:
(.venv) $ adk web
  3. Run evaluations programmatically with adk eval. Provide your agent module path, the evalset file, and the config file:
(.venv) $ adk eval \
  helpdesk_agent \
  helpdesk_agent/helpdesk_core_flows.evalset.json \
  --config_file_path=helpdesk_agent/evals/test_config.json \
  --print_detailed_results
Make sure helpdesk_agent is the importable agent module/directory.
Sample CLI output (condensed)
Eval Set Id: helpdesk_core_flows
Eval Id: casef6b2f7
Overall Eval Status: FAILED
---------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 1.0
---------------------------------------------------------------
Metric: response_match_score, Status: FAILED, Score: 0.6391, Threshold: 0.8
---------------------------------------------------------------
Invocation Details:
- Prompt: My email says my account is locked. My email is carol@example.com
- Expected response: I see that your account carol@example.com is indeed locked. To unlock your account, you will need to contact IT directly. Would you like me to open a ticket for this issue?
- Actual response: It looks like your account, carol@example.com, is indeed locked. You'll need IT to unlock it for you. Would you like me to open a ticket for this issue?
- Expected tool calls: lookup_user(email=carol@example.com)
- Actual tool calls: lookup_user(email=carol@example.com)
Interpreting results
  • Trajectory passed: the expected tool was called in the correct order.
  • Response similarity failed: the similarity score (~0.639) did not meet the 0.8 threshold.
LLMs are non-deterministic: a case that passes once may fail later. If you see flakiness, try multiple runs, relax thresholds temporarily, or increase robustness by improving prompts and tool grounding.
Using the CLI in automated workflows
  • adk eval supports flags for config files, eval storage URIs, and detailed printing:
    • --config_file_path — JSON criteria file.
    • --eval_storage_uri — where to store eval results (e.g., gs://bucket/...).
    • --print_detailed_results — prints the full invocation result to the console.
  • Use CI jobs or nightly runs to detect regressions over time.
Handling common issues and tuning guidance
  • If an eval fails, start with the prompt and tool instructions — improving the initial prompt often yields the best gains.
  • Lower thresholds to get passing CI quickly, but prioritize improving the underlying prompts and tool grounding to achieve reliable results.
  • Add more golden-path and edge cases to your eval sets to cover real-world variations.
Automating collection and storage of results
  • Evaluator invocation objects include:
    • Expected and actual tool calls and their arguments,
    • Expected and actual final responses,
    • Per-invocation metric scores.
  • Ingest these objects into a database or CI artifact store for nightly regressions or pre-release checks.
Agent orchestration & schema snippets
Root agent configuration (instruction and model selection):
from google.adk.agents import Agent

root_agent = Agent(
    model='gemini-2.5-flash',
    name="helpdesk_root_agent",
    description=(
        "Smart IT Helpdesk assistant that troubleshoots common IT issues "
        "using clarifying questions and internal tools."
    ),
    instruction=(
        "You are a friendly but efficient IT helpdesk assistant for an internal company.\n"
        "\n"
        "You are running inside a multi-turn session. ADK will give you the full "
        "conversation history each time, so you should remember what has already "
        "been asked and answered.\n"
        "\n"
        "=== OVERALL GOAL ===\n"
        "- Help the user troubleshoot issues with email, VPN, GitLab, Wi-Fi and similar services.\n"
        "- When appropriate, look up their account and check the status of backend services.\n"
        "- Explain what steps the user should take and offer to open a support ticket when needed.\n"
    ),
)
Minimal tool implementation pattern:
def lookup_user_impl(email: str) -> Dict[str, Any]:
    """Look up a user in the internal directory.

    Args:
        email: The user's work email address.

    Returns:
        dict: A result object with:
            - status: 'success' or 'error'
            - user: user details if found
            - error_message: explanation when status='error'
    """
    user = _FAKE_USER_DIRECTORY.get(email.lower())
    if not user:
        return {"status": "error", "error_message": "User not found", "user": None}
    return {"status": "success", "user": user}
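Before registering the tool with the agent, it is worth sanity-checking the result-object contract locally. The snippet repeats the function and a sample directory entry so it runs standalone (the directory contents and field names are illustrative):

```python
from typing import Any, Dict

# Illustrative fake directory for the sanity check (fields are assumptions).
_FAKE_USER_DIRECTORY: Dict[str, Dict[str, Any]] = {
    "carol@example.com": {"name": "Carol", "account_status": "locked"},
}


def lookup_user_impl(email: str) -> Dict[str, Any]:
    """Same tool body as above, repeated so this snippet is self-contained."""
    user = _FAKE_USER_DIRECTORY.get(email.lower())
    if not user:
        return {"status": "error", "error_message": "User not found", "user": None}
    return {"status": "success", "user": user}


# Quick local checks: case-insensitive hit, and a clean error for misses.
found = lookup_user_impl("Carol@Example.com")
missing = lookup_user_impl("nobody@example.com")
```

Verifying the success/error shapes here pays off later: the eval's tool-trajectory check compares the calls and arguments the agent actually made, so a tool that raises instead of returning an error object will fail cases in confusing ways.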
What we built in this lesson
  • A smart IT helpdesk assistant that:
    • Troubleshoots email, VPN, GitLab, and Wi‑Fi issues,
    • Looks up users and service status with tools,
    • Creates structured tickets using a Ticket schema,
    • Uses evaluation sets to guard against regressions.
Example eval set fragment
{
  "eval_set_id": "helpdesk_core_flows",
  "name": "helpdesk_core_flows",
  "eval_cases": [
    {
      "eval_id": "casee4c486",
      "conversation": [
        {
          "invocation_id": "e-c5ca4cfa-9f76-48bb-8f9b-28ea39475115",
          "user_content": {
            "parts": [
              {
                "text": "My email says my account is locked. My email is carol@example.com"
              }
            ],
            "role": "user"
          },
          "final_response": {
            "parts": [
              {
                "text": "I see that your account carol@example.com is indeed locked..."
              }
            ]
          }
        }
      ]
    }
  ]
}
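A small structural check can catch malformed eval set files before a run. This sketch assumes only the fragment shape shown above; it is not an official ADK validator:

```python
import json
from typing import Any, Dict


def summarize_eval_set(evalset_json: str) -> Dict[str, Any]:
    """Lightweight structural summary of an eval set file.

    Assumes the fragment shape shown above (eval_set_id, eval_cases,
    conversation turns); field names come from that fragment.
    """
    data = json.loads(evalset_json)
    cases = data.get("eval_cases", [])
    return {
        "eval_set_id": data.get("eval_set_id"),
        "num_cases": len(cases),
        "case_ids": [c.get("eval_id") for c in cases],
        "turns_per_case": [len(c.get("conversation", [])) for c in cases],
    }
```

Running this against each evalset file in CI (and asserting num_cases > 0) is a cheap guard against accidentally committing an empty or truncated eval set.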
Next steps and recommendations
  • Iterate on your instruction prompt to be explicit about which tools to call and when.
  • Add more golden-path and edge cases to your eval set.
  • Store evaluation results in cloud storage or a database to track trends over time.
  • If your helpdesk grows, split responsibilities across smaller agents (triage, ticketing, knowledge) for clearer eval boundaries.