Evals
Evals are structured tests for AI agents — you define a set of inputs, describe what good output looks like, and run them repeatedly to catch regressions before they reach production. Unlike unit tests, evals account for the non-deterministic nature of LLMs: they measure behavior across a range of inputs, track quality over time, and can use other models as judges when rule-based checks are not enough.
sentinel-ai-evals is the built-in eval framework for Sentinel AI agents. It is fully generic (Agent<R, T, A>), works for both string and structured outputs, and integrates into any CI pipeline with a single assertion.
Why use evals?
Agents are non-deterministic. Without evals you won't know when a prompt change silently breaks a tool-calling flow, when a model upgrade regresses output quality, or whether your agent handles edge-case inputs correctly before they reach production. Evals give you a repeatable, versioned safety net — run them on every pull request as a gate, or on demand against live models for quality benchmarking.
Getting started
Add the dependency (the version is managed via the Sentinel BOM):
    <dependency>
        <groupId>com.phonepe.sentinel-ai</groupId>
        <artifactId>sentinel-ai-evals</artifactId>
    </dependency>
Create a dataset, attach expectations, and run it against your agent to generate a report:
    import java.util.List;

    import com.phonepe.sentinelai.evals.EvalEngine;
    import com.phonepe.sentinelai.evals.tests.Dataset;
    import com.phonepe.sentinelai.evals.tests.Expectations;
    import com.phonepe.sentinelai.evals.tests.TestCase;

    // One dataset ("smoke") holding a single test case with three expectations
    var dataset = new Dataset<>("smoke",
            List.of(
                    new TestCase<>("What is the status?",
                            List.of(
                                    Expectations.outputContains("OK"),
                                    Expectations.toolCalled("fetch_status"),
                                    Expectations.jsonPathEquals("$.status", "OK")
                            ))));

    // `agent` is your already-configured Sentinel AI agent
    var report = new EvalEngine().run(dataset, agent);
    System.out.printf("Passed=%d Failed=%d Skipped=%d%n",
            report.getPassedTestCases(), report.getFailedTestCases(), report.getSkippedTestCases());
Core concepts
Before looking at the available eval types, it helps to understand the building blocks:
- Dataset is a named collection of eval scenarios, usually representing one suite such as a smoke test set, regression pack, or domain-specific benchmark. It contains multiple TestCase entries.
- TestCase represents one concrete input/output evaluation scenario by combining the agent input, the expectations to evaluate, and an optional per-test timeout. For example: input = "What is the status?" with expectations that the output contains "OK" and that the tool fetch_status was called.
- Expectation is a pass/fail rule applied to the agent result and execution context. It is best suited for deterministic assertions such as exact output checks, JSONPath assertions, and tool-call verification. When an expectation fails, the test case fails.
- Metric is a scoring-based evaluator that produces a numeric score instead of only pass/fail, making it useful for similarity, topical relevance, and LLM-judge style quality checks. Metrics are usually wrapped as expectations through helpers like Expectations.outputSimilarity(...) or Expectations.answerRelevance(...), often with a threshold.
- Reports capture the outcomes of an eval run: ExpectationReport captures one expectation or metric result, TestCaseReport aggregates expectation outcomes for one test case, and EvalReport aggregates the full dataset run, including passed/failed/skipped counts and collected metric scores.
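The relationship between expectations and metrics can be sketched in plain Java. This is an illustrative model only, not the sentinel-ai-evals API: EvalConceptsSketch, thresholded, and testCasePasses are hypothetical names, and treating a score as passing when it is greater than or equal to the threshold is an assumption.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

/** Hypothetical model of the concepts above; none of these names exist in the library. */
final class EvalConceptsSketch {

    private EvalConceptsSketch() {
    }

    /** Wraps a numeric metric as a pass/fail expectation (assumes score >= threshold passes). */
    static Predicate<String> thresholded(ToDoubleFunction<String> metric, double threshold) {
        return output -> metric.applyAsDouble(output) >= threshold;
    }

    /** A test case passes only when every one of its expectations passes. */
    static boolean testCasePasses(String output, List<Predicate<String>> expectations) {
        return expectations.stream().allMatch(expectation -> expectation.test(output));
    }

    public static void main(String[] args) {
        // Deterministic expectation: a plain pass/fail rule
        Predicate<String> containsOk = output -> output.contains("OK");
        // Toy metric standing in for a real scorer: longer answers score higher, capped at 1.0
        ToDoubleFunction<String> lengthScore = output -> Math.min(1.0, output.length() / 20.0);

        List<Predicate<String>> expectations = List.of(containsOk, thresholded(lengthScore, 0.5));

        System.out.println(testCasePasses("Status: OK, all systems nominal", expectations)); // true
        System.out.println(testCasePasses("OK", expectations)); // false: the metric falls below 0.5
    }
}
```

The second call fails even though the deterministic check passes, which mirrors the rule that a single failing expectation fails the test case.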
Available evals
Deterministic expectations
Deterministic expectations evaluate the agent output without any external model call. They are fast, free, and should be the first layer of every eval suite.
All helpers are available via com.phonepe.sentinelai.evals.tests.Expectations.
Output content
| Expectation | What it checks |
|---|---|
| outputContains(String substring) | Output string contains the given substring |
| outputEquals(R expected) | Output is exactly equal to the expected value |
JSON Path
Evaluates structured output (POJOs or JSON) using JSONPath expressions. Available operators: EQ, NE, GT, GTE, LT, LTE, IN, NOT_IN.
| Expectation | What it checks |
|---|---|
| jsonPathEquals(String path, Object value) | Shorthand for where(path).eq(value) |
| where(String path).eq(value) | Field at path equals value |
| where(String path).ne(value) | Field at path does not equal value |
| where(String path).gt(value) | Field at path is greater than value |
| where(String path).gte(value) | Field at path is greater than or equal to value |
| where(String path).lt(value) | Field at path is less than value |
| where(String path).lte(value) | Field at path is less than or equal to value |
| where(String path).in(Collection) | Field at path is contained in the set |
| where(String path).notIn(Collection) | Field at path is not contained in the set |
| at(String path) | Alias for where(path) |
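To make the operator semantics concrete, here is a self-contained sketch of how the eight operators could be applied once a value has been extracted at a path. It is not the library's implementation: ComparisonOpSketch is a hypothetical name, and comparing numbers by numeric value (so that 1 equals 1.0) is an assumption about behavior the documentation does not specify.

```java
import java.math.BigDecimal;
import java.util.Collection;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for the comparison step; not the library's implementation. */
enum ComparisonOpSketch {
    EQ, NE, GT, GTE, LT, LTE, IN, NOT_IN;

    /** Applies this operator to the value found at the path and the expected value. */
    boolean test(Object actual, Object expected) {
        switch (this) {
            case EQ:     return sameValue(actual, expected);
            case NE:     return !sameValue(actual, expected);
            case GT:     return compare(actual, expected) > 0;
            case GTE:    return compare(actual, expected) >= 0;
            case LT:     return compare(actual, expected) < 0;
            case LTE:    return compare(actual, expected) <= 0;
            case IN:     return contains(expected, actual);
            case NOT_IN: return !contains(expected, actual);
            default:     throw new IllegalStateException("Unhandled operator: " + this);
        }
    }

    // Assumption: numbers compare by numeric value; everything else uses equals().
    private static boolean sameValue(Object a, Object b) {
        if (a instanceof Number && b instanceof Number) {
            return toDecimal((Number) a).compareTo(toDecimal((Number) b)) == 0;
        }
        return Objects.equals(a, b);
    }

    @SuppressWarnings("unchecked")
    private static int compare(Object a, Object b) {
        if (a instanceof Number && b instanceof Number) {
            return toDecimal((Number) a).compareTo(toDecimal((Number) b));
        }
        if (a instanceof Comparable && b != null && a.getClass() == b.getClass()) {
            return ((Comparable<Object>) a).compareTo(b);
        }
        throw new IllegalArgumentException("Values are not comparable: " + a + ", " + b);
    }

    private static boolean contains(Object candidates, Object value) {
        if (!(candidates instanceof Collection)) {
            throw new IllegalArgumentException("IN / NOT_IN expect a collection");
        }
        for (Object candidate : (Collection<?>) candidates) {
            if (sameValue(candidate, value)) {
                return true;
            }
        }
        return false;
    }

    // Finite numbers only: NaN and infinities have no BigDecimal form.
    private static BigDecimal toDecimal(Number n) {
        return new BigDecimal(n.toString());
    }

    public static void main(String[] args) {
        System.out.println(GT.test(5, 3));                        // true
        System.out.println(IN.test("OK", List.of("OK", "WARN"))); // true
    }
}
```

Ordering operators throw on values that cannot be compared rather than silently returning false, so a mistyped path or wrong expected type surfaces as an error instead of a quiet failure. Whether the real framework behaves this way is not stated in this page.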
Tool call assertions
Asserts that the agent called a specific tool during the run — useful for verifying agentic routing behavior.
| Expectation | What it checks |
|---|---|
| toolCalled(String toolName) | Tool was called at least once |
| toolCalled(String toolName, int times) | Tool was called exactly N times |
| toolCalled(String toolName, int times, Map<String,Object> params) | Tool was called N times with matching parameters |
| ordered(MessageExpectation... expectations) | Message-level expectations are satisfied in order |
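The three toolCalled variants amount to counting matching entries in the list of tool calls recorded during a run. The sketch below illustrates that idea with hypothetical types (ToolCallChecksSketch, ToolCall); it is not the framework's code, and it assumes that "matching parameters" means every expected key/value pair is present in the actual call, with extra actual parameters ignored.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration of tool-call checks; not part of sentinel-ai-evals. */
final class ToolCallChecksSketch {

    /** One recorded tool invocation: the tool name plus the arguments it received. */
    static final class ToolCall {
        final String name;
        final Map<String, Object> params;

        ToolCall(String name, Map<String, Object> params) {
            this.name = Objects.requireNonNull(name, "name");
            this.params = Objects.requireNonNull(params, "params");
        }
    }

    private ToolCallChecksSketch() {
    }

    /** True when the tool was called at least once. */
    static boolean called(List<ToolCall> calls, String toolName) {
        return calls.stream().anyMatch(call -> call.name.equals(toolName));
    }

    /** True when the tool was called exactly the given number of times. */
    static boolean called(List<ToolCall> calls, String toolName, int times) {
        return calls.stream().filter(call -> call.name.equals(toolName)).count() == times;
    }

    /** True when exactly `times` calls to the tool carried all expected parameters. */
    static boolean called(List<ToolCall> calls, String toolName, int times, Map<String, Object> expected) {
        return calls.stream()
                .filter(call -> call.name.equals(toolName))
                .filter(call -> paramsMatch(call.params, expected))
                .count() == times;
    }

    // Assumption: subset match; the actual call may carry additional parameters.
    private static boolean paramsMatch(Map<String, Object> actual, Map<String, Object> expected) {
        for (Map.Entry<String, Object> entry : expected.entrySet()) {
            if (!actual.containsKey(entry.getKey())
                    || !Objects.equals(actual.get(entry.getKey()), entry.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<ToolCall> calls = List.of(
                new ToolCall("fetch_status", Map.of("service", "payments")),
                new ToolCall("fetch_status", Map.of("service", "ledger")));
        System.out.println(called(calls, "fetch_status"));                                  // true
        System.out.println(called(calls, "fetch_status", 2));                               // true
        System.out.println(called(calls, "fetch_status", 1, Map.of("service", "ledger")));  // true
    }
}
```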
Custom expectations
If built-in expectations are not enough, create your own expectation type and executor, then register it in ExpectationExecutorRegistry.
    // The expectation itself is plain data: the allowed output length range
    record StringLengthExpectation(int min, int max) implements Expectation<String, Object> {}

    // The executor holds the logic that evaluates the expectation against the agent output
    final class StringLengthExpectationExecutor implements ExpectationExecutor<String, Object> {
        private final StringLengthExpectation expectation;

        StringLengthExpectationExecutor(StringLengthExpectation expectation) {
            this.expectation = expectation;
        }

        @Override
        public boolean evaluate(String result, EvalExpectationContext<Object> context) {
            // A null output fails; otherwise the length must fall within [min, max]
            return result != null
                    && result.length() >= expectation.min()
                    && result.length() <= expectation.max();
        }
    }

    // Register an executor factory for the new expectation type alongside the defaults
    var expectationExecutorFactory = ExpectationExecutorRegistry.withDefaults()
            .register(StringLengthExpectation.class,
                    (agent, expectation, objectMapper, executorService) ->
                            (ExpectationExecutor) new StringLengthExpectationExecutor(
                                    (StringLengthExpectation) expectation));

    // `mapper` is the ObjectMapper your application already uses
    var report = new EvalEngine(mapper, expectationExecutorFactory).run(dataset, agent);
For a complete working example (including multiple custom expectations), see
sentinel-ai-evals/src/test/java/com/phonepe/sentinelai/evals/tests/expectations/CustomExpectationExampleTest.java.
Embedding-based metrics
Embedding-based metrics use vector similarity to evaluate output quality. They require an EmbeddingModel, which is resolved through MetricExecutorRegistry at runtime, but are otherwise deterministic for a fixed model.
Use Expectations.* for one-step expectation wiring, or Metrics.* to get a raw Metric<String, T> to compose with a custom threshold via MetricExpectation. The embedding model is not passed inline — it is supplied once to MetricExecutorRegistry.withDefaults(...) via an EmbeddingModelFactory.
| Expectation helper | Metric class | What it measures |
|---|---|---|
| Expectations.outputSimilarity(referenceText [, threshold]) | OutputSimilarityMetric | Cosine similarity between output and a fixed reference text |
| Metrics.outputSimilarity(referenceText) | OutputSimilarityMetric | Raw metric; wire your own threshold via MetricExpectation |
| Metrics.outputRelevanceBySimilarity() | OutputRelevanceBySimilarityMetric | Cosine similarity between output and the original request, a proxy for topical relevance |
When to use: output similarity is useful when you have a reference or golden answer and want to catch semantic regressions without an exact-match assertion. outputRelevanceBySimilarity is useful when you only want to ensure the output stays on-topic with the original request.
Limitation: both are proximity measures — they cannot reason about intent, constraints, or factual accuracy. Use LLM-judged metrics below for that.
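For readers unfamiliar with the underlying measure, the following is a minimal sketch of cosine similarity over two embedding vectors, followed by a threshold comparison. CosineSimilaritySketch is a hypothetical helper, the vectors are made-up values rather than real embeddings, and returning 0.0 for a zero-magnitude vector is a choice made here, not documented framework behavior.

```java
import java.util.Objects;

/** Hypothetical helper showing the math behind similarity metrics; not the library's code. */
final class CosineSimilaritySketch {

    private CosineSimilaritySketch() {
    }

    /** Cosine similarity of two equal-length vectors; 0.0 when either vector has zero magnitude. */
    static double cosine(double[] a, double[] b) {
        Objects.requireNonNull(a, "a");
        Objects.requireNonNull(b, "b");
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] output = {0.9, 0.1, 0.0};    // made-up embedding of the agent output
        double[] reference = {1.0, 0.0, 0.0}; // made-up embedding of the reference text
        double score = cosine(output, reference);
        double threshold = 0.8;               // illustrative threshold
        System.out.printf("score=%.3f pass=%b%n", score, score >= threshold);
    }
}
```

Because the score depends only on the angle between the two vectors, two texts with similar wording but opposite meaning can still score highly, which is the limitation described above.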
LLM-judged metrics
LLM-judged metrics delegate scoring to a judge model that receives the original request and the agent answer, then returns a structured {"score": 0.0–1.0, "reason": "..."} payload. They are the most capable evaluators but require a live model call per test case.
Use Expectations.answerRelevance(...) for one-step wiring, or Metrics.answerRelevance(...) to get the raw metric. The judge model is injected through the metric executor registry.
| Expectation helper | Metric class | What it measures |
|---|---|---|
| Expectations.answerRelevance([promptTemplate] [, threshold]) | OutputRelevanceMetric | LLM-judged relevance of the answer to the request |
| Metrics.answerRelevance([promptTemplate]) | OutputRelevanceMetric | Raw metric; wire your own threshold via MetricExpectation |
The default judge prompt evaluates three dimensions:
- Intent coverage — did the answer address what was asked?
- Request scope alignment — did the answer stay within the factual scope of the question?
- No off-topic content — no irrelevant filler or tangential information?
Custom judge prompts must contain {request} and {answer} placeholders, which are validated at construction time.
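The placeholder rule can be illustrated with a small stand-alone sketch that rejects a template at construction time and fills it in later. JudgePromptTemplateSketch is a hypothetical class written for this page; the framework's own validation and rendering logic may differ.

```java
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of placeholder validation; not the library's implementation. */
final class JudgePromptTemplateSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(request|answer)\\}");

    private final String template;

    /** Fails fast when either required placeholder is missing. */
    JudgePromptTemplateSketch(String template) {
        if (template == null || !template.contains("{request}") || !template.contains("{answer}")) {
            throw new IllegalArgumentException("Judge prompt must contain {request} and {answer}");
        }
        this.template = template;
    }

    /** Substitutes both placeholders in a single pass, so substituted text is never re-scanned. */
    String render(String request, String answer) {
        Objects.requireNonNull(request, "request");
        Objects.requireNonNull(answer, "answer");
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer rendered = new StringBuffer();
        while (matcher.find()) {
            String value = "request".equals(matcher.group(1)) ? request : answer;
            // quoteReplacement keeps '$' and '\' in user text from being treated as group references
            matcher.appendReplacement(rendered, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(rendered);
        return rendered.toString();
    }

    public static void main(String[] args) {
        JudgePromptTemplateSketch prompt = new JudgePromptTemplateSketch(
                "Rate how well the answer addresses the request.\nRequest: {request}\nAnswer: {answer}");
        System.out.println(prompt.render("What is the status?", "All systems are OK."));
    }
}
```

Validating in the constructor means a malformed prompt fails when the suite is assembled rather than midway through a run, after judge calls have already been paid for.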
    ObjectMapper mapper = ...;

    // Wire the judge model via factory + identifier (preferred API)
    LLMModelFactory llmFactory = id -> new SimpleOpenAIModel<>(id.modelId(), openAiProvider, mapper, options);

    var metricExecutorRegistry = MetricExecutorRegistry.withDefaults(
            new EmbeddingModelIdentifier("text-embedding-3-small"),
            embeddingFactory, // EmbeddingModelFactory; pass null to skip embedding-based metrics
            new LLMIdentifier("gpt-4o"),
            llmFactory);
    var expectationExecutorFactory = ExpectationExecutorRegistry.withDefaults(metricExecutorRegistry);
    var evalEngine = new EvalEngine(mapper, expectationExecutorFactory);
Running evals
On demand
Run directly from a test class or a main method. No extra setup is required.
    var report = new EvalEngine().run(dataset, agent);
    System.out.printf("Passed=%d Failed=%d Skipped=%d%n",
            report.getPassedTestCases(), report.getFailedTestCases(), report.getSkippedTestCases());
In CI
Wrap the run inside a JUnit test and inspect the generated report. The test can fail the build when eval failures are detected.
    @Test
    void agentSmokeEvals() {
        var report = new EvalEngine().run(dataset, agent,
                EvalRunConfig.defaults()
                        .withSamplePercentage(30) // run 30% of cases on PRs
                        .withFailFast(true)
                        .withDefaultTestCaseTimeout(Duration.ofSeconds(15)));
        System.out.printf("Passed=%d Failed=%d Skipped=%d%n",
                report.getPassedTestCases(), report.getFailedTestCases(), report.getSkippedTestCases());
        // Fail the build when any test case failed
        if (report.getFailedTestCases() > 0) {
            throw new AssertionError(report.getFailedTestCases() + " eval test case(s) failed");
        }
    }
Use samplePercentage to keep PR builds fast while running the full suite on merges to main.
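One possible way to apply this split is to derive the percentage from the branch being built. The helper below is a sketch under assumptions: SamplePercentageSketch and forBranch are hypothetical names, BRANCH_NAME is a made-up environment variable (CI systems expose the branch under different names), and the 30/100 values simply mirror the example on this page.

```java
/** Hypothetical helper for choosing a sample percentage per branch; not part of the library. */
final class SamplePercentageSketch {

    private SamplePercentageSketch() {
    }

    /** Full suite on main; a 30% sample on every other branch, including an unknown (null) one. */
    static int forBranch(String branch) {
        return "main".equals(branch) ? 100 : 30;
    }

    public static void main(String[] args) {
        // BRANCH_NAME is a placeholder; substitute the variable your CI system provides.
        String branch = System.getenv("BRANCH_NAME");
        System.out.println("samplePercentage=" + forBranch(branch));
    }
}
```

The returned value could then be passed to EvalRunConfig.defaults().withSamplePercentage(...) in the JUnit test shown earlier in this section.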
JUnit 5 integration (optional)
If you want rich assertion diagnostics in JUnit 5 tests, add the sentinel-ai-evals test-jar:
    <dependency>
        <groupId>com.phonepe.sentinel-ai</groupId>
        <artifactId>sentinel-ai-evals</artifactId>
        <type>test-jar</type>
        <scope>test</scope>
    </dependency>
Use EvalReportAssertions to fail with expectation-level diagnostics:
    import com.phonepe.sentinelai.evals.junit.assertions.EvalReportAssertions;

    @Test
    void agentSmokeEvals() {
        var report = new EvalEngine().run(dataset, agent);
        EvalReportAssertions.assertNoFailures(report);
    }
Real agent integration example
For end-to-end validation with a live model (no model mocking), see
sentinel-ai-evals/src/test/java/com/phonepe/sentinelai/evals/integration/RealNicknameAgentExpectationsIntegrationTest.java.
This test demonstrates:
- typed request object (name, age) and structured nickname output
- deterministic expectations (outputEquals, jsonPathEquals, where/at operators)
- tool-call expectations (toolCalled, ordered)
- metric expectations (outputSimilarity, answerRelevance)
The test is gated to run only when real endpoints are enabled (for example with -Preal-tests).