Pydantic Evals
Pydantic Evals is a powerful evaluation framework for systematically testing and evaluating AI systems, from simple LLM calls to complex multi-agent applications.
Design Philosophy
Code-First Approach
Pydantic Evals follows a code-first philosophy where all evaluation components are defined in Python. This differs from platforms with web-based configuration. You write and run evals in code, and can write the results to disk or view them in your terminal or in Pydantic Logfire.
Evals are an Emerging Practice
Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored. We've designed Pydantic Evals to be flexible and useful without being too opinionated.
Quick Navigation
Getting Started:
- Quick Start - Run your first evaluation
- Core Concepts - Understand datasets, cases, experiments, and evaluators
Evaluators:
- Evaluators Overview - Compare evaluator types and learn when to use each approach
- Built-in Evaluators - Complete reference for exact match, instance checks, and other ready-to-use evaluators
- LLM as a Judge - Use LLMs to evaluate subjective qualities, complex criteria, and natural language outputs
- Custom Evaluators - Implement domain-specific scoring logic and custom evaluation metrics
- Span-Based Evaluation - Evaluate internal agent behavior (tool calls, execution flow) using OpenTelemetry traces. Essential for complex agents where correctness depends on how the answer was reached, not just the final output. Also ensures eval assertions align with production telemetry.
How-To Guides:
- Logfire Integration - Visualize results
- Dataset Management - Save, load, generate
- Concurrency & Performance - Control parallel execution
- Retry Strategies - Handle transient failures
- Metrics & Attributes - Track custom data
Examples:
- Simple Validation - Basic example
Reference:
- API Reference - Detailed documentation of all classes, methods, and configuration options
Code-First Evaluation
Pydantic Evals follows a code-first approach where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.
When you run an Experiment you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc.). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.
If you are using Pydantic Logfire, your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer: you write and run evals in code, then view and analyze results in the web UI.
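For example, a minimal sketch of that split (assuming the pydantic-evals[logfire] extra is installed and a Logfire write token is configured; simple_eval_dataset is the dataset module built in "Datasets and Cases" below, and the task function is hypothetical):

import logfire

from simple_eval_dataset import dataset  # the Dataset built in "Datasets and Cases" below

logfire.configure(send_to_logfire='if-token-present')  # only exports if a write token is available


async def answer_question(question: str) -> str:  # hypothetical task under test
    return 'Paris'


# Run the experiment as usual; the results are also sent to Logfire for viewing in the web UI.
report = dataset.evaluate_sync(answer_question)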
Installation
To install the Pydantic Evals package, run:
pip install pydantic-evals
uv add pydantic-evals
pydantic-evals does not depend on pydantic-ai, but it has an optional dependency on logfire if you'd like to use OpenTelemetry traces in your evals or send evaluation results to Logfire.
pip install 'pydantic-evals[logfire]'
uv add 'pydantic-evals[logfire]'
Pydantic Evals Data Model
Pydantic Evals is built around a simple data model:
Data Model Diagram
Dataset (1) ──────────── (Many) Case
   │                         │
   │                         │
   └─── (Many) Experiment ───┴─── (Many) Case results
              │
              ├─── (1) Task
              │
              └─── (Many) Evaluator
Key Relationships
- Dataset → Cases: One Dataset contains many Cases
- Dataset → Experiments: One Dataset can be used across many Experiments over time
- Experiment → Case results: One Experiment generates results by executing each Case
- Experiment → Task: One Experiment evaluates one defined Task
- Experiment → Evaluators: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all Cases, and Case-specific Evaluators against their respective Cases
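To make that last relationship concrete, here is a minimal sketch (assuming the built-in EqualsExpected evaluator; the case name and strings are illustrative) showing a Case-specific evaluator alongside a Dataset-wide one:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, IsInstance

capital_case = Case(
    name='capital_question',
    inputs='What is the capital of France?',
    expected_output='Paris',
    evaluators=(EqualsExpected(),),  # case-specific: only runs for this Case
)

dataset = Dataset(
    cases=[capital_case],
    evaluators=[IsInstance(type_name='str')],  # dataset-wide: runs against every Case
)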
Data Flow
- Dataset creation: Define cases and evaluators in YAML/JSON, or directly in Python
- Experiment execution: Run dataset.evaluate_sync(task_function)
- Cases run: Each Case is executed against the Task
- Evaluation: Evaluators score the Task outputs for each Case
- Results: All Case results are collected into a summary report
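For example, a minimal sketch of this flow with cases defined in YAML rather than Python (the file name, YAML layout, and task function are illustrative):

# capital_cases.yaml (illustrative contents)
# cases:
#   - name: simple_case
#     inputs: What is the capital of France?
#     expected_output: Paris

from pydantic_evals import Dataset


async def answer_question(question: str) -> str:  # hypothetical task under test
    return 'Paris'


dataset = Dataset.from_file('capital_cases.yaml')  # 1. dataset creation (loaded from YAML)
report = dataset.evaluate_sync(answer_question)  # 2-4. experiment execution, cases run, evaluation
report.print()  # 5. results collected into a summary report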
A metaphor
A useful metaphor (although not perfect) is to think of evals like a Unit Testing framework:
- Cases + Evaluators are your individual unit tests - each one defines a specific scenario you want to test, complete with inputs and expected outcomes. Just like a unit test, a case asks: "Given this input, does my system produce the right output?"
- Datasets are like test suites - they are the scaffolding that holds your unit tests together. They group related cases and define shared evaluation criteria that should apply across all tests in the suite.
- Experiments are like running your entire test suite and getting a report. When you execute dataset.evaluate_sync(my_ai_function), you're running all your cases against your AI system and collecting the results - just like running pytest and getting a summary of passes, failures, and performance metrics.
The key difference from traditional unit testing is that AI systems are probabilistic. A type check still gives a simple pass/fail, but scores for text outputs are often qualitative and/or categorical, and more open to interpretation.
For a deeper understanding, see Core Concepts.
Datasets and Cases
In Pydantic Evals, everything begins with Datasets and Cases:
- Dataset: A collection of test Cases designed for the evaluation of a specific task or function
- Case: A single test scenario corresponding to Task inputs, with optional expected outputs, metadata, and case-specific evaluators
from pydantic_evals import Case, Dataset
case1 = Case(
    name='simple_case',
    inputs='What is the capital of France?',
    expected_output='Paris',
    metadata={'difficulty': 'easy'},
)
dataset = Dataset(cases=[case1])
(This example is complete, it can be run "as is")
See Dataset Management to learn about saving, loading, and generating datasets.
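As a rough sketch of the save/load round trip (the file name is arbitrary; YAML is shown, but a .json extension works the same way):

from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[Case(name='capital', inputs='What is the capital of France?', expected_output='Paris')]
)

dataset.to_file('capital_cases.yaml')  # serialize the cases (and any evaluators) to disk

loaded = Dataset.from_file('capital_cases.yaml')  # load them back later, e.g. in CI
print(len(loaded.cases))
#> 1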
Evaluators
Evaluators analyze and score the results of your Task when tested against a Case.
These can be deterministic, code-based checks (such as testing model output format with a regex, or checking for the appearance of PII or sensitive data), or they can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucinations, or instruction-following.
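For example, a quick sketch of the first, deterministic kind - a custom evaluator (the class name and regex are illustrative) that asserts no email-address-shaped PII appears in the output:

import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class NoEmailAddresses(Evaluator):
    """Deterministic, code-based check: fail if the output looks like it contains an email address."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return not re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', str(ctx.output))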
While both kinds of testing are useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs.
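And a sketch of the second, model-reviewed kind, using the built-in LLMJudge evaluator (the rubric text and model name are placeholders, and an API key for the chosen provider is assumed):

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(inputs='Summarize: the Eiffel Tower is in Paris, France.')],
    evaluators=[
        LLMJudge(
            rubric='The output is a faithful, single-sentence summary of the input.',
            model='openai:gpt-4o',  # placeholder model name
        )
    ],
)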
Pydantic Evals includes several built-in evaluators and allows you to define custom evaluators:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.evaluators.common import IsInstance
from simple_eval_dataset import dataset
dataset.add_evaluator(IsInstance(type_name='str')) # (1)!
@dataclass
class MyEvaluator(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:  # (2)!
        if ctx.output == ctx.expected_output:
            return 1.0
        elif (
            isinstance(ctx.output, str)
            and ctx.expected_output.lower() in ctx.output.lower()
        ):
            return 0.8
        else:
            return 0.0


dataset.add_evaluator(MyEvaluator())
1. You can add built-in evaluators to a dataset using the add_evaluator method.
2. This custom evaluator returns a simple score based on whether the output matches the expected output.
(This example is complete, it can be run "as is")
Learn more:
- Evaluators Overview - When to use different types
- Built-in Evaluators - Complete reference
- LLM Judge - Using LLMs as evaluators
- Custom Evaluators - Write your own logic
- Span-Based Evaluation - Analyze execution traces
Running Experiments
Performing evaluations involves running a task against all cases in a dataset, also known as running an "experiment".
Putting the above two examples together and using the more declarative evaluators kwarg to Dataset:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance
case1 = Case(  # (1)!
    name='simple_case',
    inputs='What is the capital of France?',
    expected_output='Paris',
    metadata={'difficulty': 'easy'},
)


class MyEvaluator(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0
        elif (
            isinstance(ctx.output, str)
            and ctx.expected_output.lower() in ctx.output.lower()
        ):
            return 0.8
        else:
            return 0.0


dataset = Dataset(
    cases=[case1],
    evaluators=[IsInstance(type_name='str'), MyEvaluator()],  # (2)!
)


async def guess_city(question: str) -> str:  # (3)!
    return 'Paris'
report = dataset.evaluate_sync(guess_city) # (4)!
report.print(include_input=True, include_output=True, include_durations=False) # (5)!
"""
Evaluation Summary: guess_city
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Case ID     ┃ Inputs                         ┃ Outputs ┃ Scores            ┃ Assertions ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ simple_case │ What is the capital of France? │ Paris   │ MyEvaluator: 1.00 │ ✔          │
├─────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┤
│ Averages    │                                │         │ MyEvaluator: 1.00 │ 100.0% ✔   │
└─────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┘
"""
1. Create a test case as above.
2. Create a Dataset with test cases and evaluators.
3. Our function to evaluate.
4. Run the evaluation with evaluate_sync, which runs the function against all test cases in the dataset and returns an EvaluationReport object.
5. Print the report with print, which shows the results of the evaluation. We have omitted durations here just to keep the printed output from changing from run to run.
(This example is complete, it can be run "as is")
See Quick Start for more examples and Concurrency & Performance to learn about controlling parallel execution.
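As a sketch only (the max_concurrency keyword is an assumption here; see the Concurrency & Performance guide for the exact options), you can limit how many cases run in parallel:

# Assumed keyword: limit parallel case execution, e.g. to respect provider rate limits.
report = dataset.evaluate_sync(guess_city, max_concurrency=4)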
API Reference
For comprehensive coverage of all classes, methods, and configuration options, see the detailed API Reference documentation.
Next Steps
- Start with simple evaluations using Quick Start
- Understand the data model with Core Concepts
- Explore built-in evaluators in Built-in Evaluators
- Integrate with Logfire for visualization: Logfire Integration
- Build comprehensive test suites with Dataset Management
- Implement custom evaluators for domain-specific metrics: Custom Evaluators