Building Trust in AI: A Guide to LLM Evaluations
Large language models (LLMs) are inherently probabilistic, meaning the same input can produce different outputs. That variability makes traditional unit tests, which verify exact results, ineffective for AI systems. In healthcare, where quality and accuracy are nonnegotiable, this creates a unique challenge: how do you ensure AI performs reliably at scale? At HealthEdge, we address this through a multi-layered evaluation strategy that combines human evaluations, LLM-as-a-Judge, CI/CD automation, and online, real-time monitoring to meet healthcare’s rigorous quality standards.
Why do we need multiple evaluation types?
Each serves a distinct purpose in the AI development lifecycle:
- Human evaluations establish ground truth. Only domain experts can judge whether an AI summary captures clinically relevant details or if generated test cases are actually executable. Humans define what “good” looks like.
- LLM-as-a-Judge scales human judgment. We can’t have subject matter experts (SMEs) review every output during rapid development. A judge-LLM applies human-defined criteria consistently across thousands of examples, enabling fast iteration.
- CI/CD regression evaluations prevent quality backslides. When prompts or models change, automated tests catch regressions before they reach production, which is essential when multiple teams ship AI features weekly.
- Online (real-time) evaluations catch real-world drift. Production traffic contains edge cases that no test dataset anticipates. Continuous monitoring detects degradation before users complain.
We’ll illustrate each type of evaluation using our QA Test Case Generation Agent, which reads Jira tickets and generates test cases with titles, preconditions, steps, and expected results.
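To make the examples below concrete, here is a minimal sketch of the shape of the agent's output. The `GeneratedTestCase` class and the sample values are hypothetical illustrations, not the agent's actual schema:

```python
from dataclasses import dataclass

@dataclass
class GeneratedTestCase:
    """One test case produced by the QA Test Case Generation Agent (hypothetical schema)."""
    title: str
    preconditions: list[str]
    steps: list[str]
    expected_results: list[str]

# Example: a minimal test case the agent might emit for a login ticket.
case = GeneratedTestCase(
    title="Verify member login with valid credentials",
    preconditions=["Member account exists", "Portal is reachable"],
    steps=["Navigate to login page", "Enter valid credentials", "Submit"],
    expected_results=["Member dashboard is displayed"],
)
```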
Human Evaluations
Human evaluations are the gold standard. For healthcare AI, human oversight is non-negotiable. Amazon Bedrock supports this through human-based evaluation jobs: collect inference examples, upload them to S3, create evaluation jobs with custom metrics, and review results through the Bedrock console.

SMEs are best suited to evaluating highly complex operations. The QA Test Generation Agent, for instance, takes a nontrivial input (a Jira ticket) and produces an entire spreadsheet of test cases, each with multiple steps. Translating that input to output involves many intermediate steps, all of which simulate the work of a QA engineer.
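One practical detail is capturing SME reviews in a consistent, machine-readable form so they can later serve as calibration data. The rubric criteria and record layout below are hypothetical, meant only to sketch the idea:

```python
import json

# Hypothetical rubric: each SME scores an agent output on a 1-5 scale per criterion.
RUBRIC = ["clinical_relevance", "executability", "completeness"]

def make_eval_record(ticket_id: str, output: str, scores: dict[str, int]) -> str:
    """Serialize one SME review as a JSON line, ready for upload (e.g., to S3)."""
    assert set(scores) == set(RUBRIC), "score every rubric criterion"
    assert all(1 <= s <= 5 for s in scores.values()), "scores are on a 1-5 scale"
    return json.dumps({"ticket_id": ticket_id, "output": output, "scores": scores})
```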
LLM-as-a-Judge
LLM-as-a-Judge uses a second LLM to evaluate primary agent outputs, scaling human-like judgment across large datasets without requiring SME time for every evaluation run.
Each evaluation metric is defined by a prompt that instructs the judge LLM what to assess and how to score. For example, a “Relevance” evaluator prompt asks the LLM to compare the generated output to the source input and rate how relevant the response is. These evaluation prompts can be customized for domain-specific criteria, allowing teams to encode their quality standards into reusable, automated checks.
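A "Relevance" evaluator of the kind described above can be as simple as a prompt template filled in per example. The wording and 1-5 scale here are a hypothetical sketch, not Bedrock's built-in prompt:

```python
# Hypothetical judge prompt for the "Relevance" metric described above.
RELEVANCE_JUDGE_PROMPT = """\
You are grading the output of a QA test-case generation agent.

Source Jira ticket:
{ticket}

Generated test cases:
{output}

Rate how relevant the generated test cases are to the ticket on a scale
of 1 (irrelevant) to 5 (fully relevant). Respond with only the number.
"""

def build_relevance_prompt(ticket: str, output: str) -> str:
    """Fill the template for one (input, output) pair before sending it to the judge LLM."""
    return RELEVANCE_JUDGE_PROMPT.format(ticket=ticket, output=output)
```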
When initially building LLM-as-a-Judge evaluators, it’s helpful to compare their scores against human evaluations on the same dataset. This calibration ensures the LLM evaluators resemble SME judgment as closely as possible. If the judge LLM's scores differ significantly from human reviewers', the evaluation prompt needs refinement until alignment improves. Bedrock offers built-in evaluators for correctness, relevance, and hallucination, as well as support for custom prompts.
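The calibration check can be a simple agreement statistic over paired scores. The metric choice (mean absolute difference) and the tolerance value are assumptions for illustration:

```python
def judge_human_agreement(judge: list[int], human: list[int]) -> float:
    """Mean absolute difference between judge-LLM and SME scores on the same
    examples; lower is better (0.0 means perfect agreement)."""
    assert judge and len(judge) == len(human), "need paired, non-empty score lists"
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(judge)

# Hypothetical tolerance on a 1-5 scale: above this, refine the judge prompt and re-run.
CALIBRATION_THRESHOLD = 0.5
```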
For the QA Test Generation Agent, the same criteria SMEs evaluate can be encoded as prompts for the LLM judges, producing a second, automated set of metrics. Used as a baseline, those metrics flag any dip in the agent's performance.
CI/CD Regression Evaluations
CI/CD evaluations automate quality gates. When developers merge changes to prompts, models, or agent architecture, automated evaluations catch regressions before they move into production.
Amazon Bedrock AgentCore integrates with GitHub Actions: the workflow configures datasets, defines LLM-as-a-Judge evaluators, and specifies task functions. The pipeline triggers on merge, blocking deployment if thresholds aren’t met.
For example, with the HealthEdge QA agent, we block deployment if test comprehensiveness (scored by an LLM judge against ground-truth data) drops below 80%, or if CSV output-format adherence falls below 95%.
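In a CI pipeline, a gate like this typically reduces to a small script that exits non-zero when any metric misses its threshold. The metric names and scores below are stand-ins, not our actual pipeline code:

```python
import sys

# Hypothetical thresholds matching the example above.
THRESHOLDS = {"comprehensiveness": 0.80, "csv_format_adherence": 0.95}

def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return the metrics that miss their threshold; an empty list means deploy."""
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]

if __name__ == "__main__":
    scores = {"comprehensiveness": 0.84, "csv_format_adherence": 0.97}  # stand-in results
    failures = failing_metrics(scores)
    if failures:
        print(f"Blocking deployment; failing metrics: {failures}")
        sys.exit(1)  # non-zero exit fails the GitHub Actions job
    print("All evaluation gates passed.")
```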
Online (Real-Time) Evaluations
Online evaluations monitor production traffic, sampling live requests to detect drift that static datasets miss. These evaluations use the same LLM-as-a-Judge evaluators defined during development, applying them continuously to production data rather than pre-constructed test sets. AgentCore supports configurable sampling (1-5% of traffic), running judge prompts on sampled requests and surfacing score trends through observability dashboards. If quality degrades from unexpected inputs, online evaluations catch it before users report issues.
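Conceptually, online evaluation is random sampling plus a rolling score trend. The sketch below assumes a 2% sample rate (within the 1-5% range above) and a hypothetical 1-5 judge scale with an invented alert threshold; it is not AgentCore's implementation:

```python
import random
from collections import deque

SAMPLE_RATE = 0.02   # sample 2% of production traffic (within the 1-5% range)
WINDOW = 200         # number of recent judge scores to keep
ALERT_BELOW = 3.5    # hypothetical alert threshold on a 1-5 judge scale

recent_scores: deque = deque(maxlen=WINDOW)

def maybe_evaluate(score_fn, request, rng=random) -> None:
    """Sample a live request and, if selected, record the judge LLM's score for it."""
    if rng.random() < SAMPLE_RATE:
        recent_scores.append(score_fn(request))

def drifting() -> bool:
    """True when the rolling mean of sampled judge scores dips below the alert line."""
    return bool(recent_scores) and sum(recent_scores) / len(recent_scores) < ALERT_BELOW
```

A dashboard would chart the rolling mean over time; the `drifting()` check is where an alert would fire before users notice degraded quality.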
The Evaluation Lifecycle
These four evaluation types form a continuous loop: human evaluations establish ground truth; LLM-as-a-Judge enables rapid iteration; CI/CD gates releases; online monitoring feeds edge cases back into development.
AI evaluation requires fundamentally different approaches than traditional testing. By combining human evaluations, LLM-as-a-Judge, CI/CD automation, and real-time monitoring, HealthEdge ensures AI features meet healthcare’s quality standards.
To follow HealthEdge’s AI strategy in greater detail, visit our Resource Center.