
Building Trust in LLM Solutions: A Practical Guide to Evaluation Planning 

Artificial intelligence (AI) is fundamentally changing how healthcare software is built. From automated test case generation to intelligent documentation and decision support, large language models are becoming embedded within the software development lifecycle itself.

As AI becomes part of how solutions are designed and validated, the question is no longer just whether it adds efficiency. It’s whether organizations can systematically evaluate and trust the outputs it produces.

At HealthEdge®, we’re deploying the Wellframe QA team’s test case generation agent. The agent takes Jira tickets for new front-end functionality, including acceptance criteria, and generates test cases as CSV files for a downstream test management tool. This collaboration has demonstrated that successful LLM deployment requires building trust through rigorous evaluation.

What Are LLM Evaluations?

Traditional software applications are straightforward to assess with well-established patterns: unit testing, integration testing, UAT, and so on. LLM applications are different: they produce an effectively unbounded space of possible outputs, respond differently depending on context, and fail in subtle ways.

LLM evaluations systematically measure whether your LLM application solves the problem you built it to solve. They provide concrete evidence of what works and reveal specific areas that need improvement.

Evaluations serve different audiences with different needs.

  • For stakeholders, they provide transparency and set realistic expectations about what the system can and cannot do.
  • For developers, they highlight specific shortcomings that need attention and help prioritize improvement efforts.
  • For users, they build confidence that the system has been rigorously tested.

The ultimate goal is trust. Users need to trust that your LLM solution will perform reliably. Evaluations are how you earn and maintain that trust.

The Four Components of a Robust Evaluation Plan

Our QA test generation agent presents a complex evaluation challenge. Given a Jira ticket, it generates test cases with sections, titles, preconditions, steps, expected results, and metadata. There’s no single correct output, and quality is multidimensional.

Consequently, we devised a complete evaluation plan with four components: criteria, methods, dataset, and execution strategy.

Component 1 – Evaluation Criteria: Criteria should stem directly from the problem the model is solving. For our QA test generation agent, we identified multiple critical criteria based on what makes test cases valuable to our QA team:

  • Required Test Recall measures comprehensiveness. Are we generating all the necessary test cases that a human QA engineer would write? We calculate this as the number of “required” test cases covered by the agent divided by the total number of required test cases a human would write. We set a realistic target recall based on task complexity and risk.
  • Acceptance Criteria Coverage measures thoroughness. Does the generated test suite adequately test all the acceptance criteria mentioned in the Jira ticket? We target 90%+ coverage to ensure nothing slips through the cracks.
  • Test Comprehensiveness relies on human evaluators, who score the generated suite on a 1-5 scale based on their holistic judgment of its quality.
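The two automated criteria above reduce to simple set arithmetic. Here is a minimal sketch, assuming an SME has already mapped generated test cases and tested acceptance criteria to IDs; the function names and IDs are illustrative, not HealthEdge code.

```python
# Hypothetical helpers for the two automated criteria described above.
# Generated test cases are matched to "required" ones by ID beforehand.

def required_test_recall(required_ids, covered_ids):
    """Fraction of required test cases that the agent covered."""
    required = set(required_ids)
    if not required:
        return 1.0
    return len(required & set(covered_ids)) / len(required)

def acceptance_criteria_coverage(criteria_ids, tested_ids):
    """Fraction of the ticket's acceptance criteria exercised by the suite."""
    criteria = set(criteria_ids)
    if not criteria:
        return 1.0
    return len(criteria & set(tested_ids)) / len(criteria)

# A human would write T1-T4; the agent covered three of them.
recall = required_test_recall(["T1", "T2", "T3", "T4"], ["T1", "T3", "T4"])
coverage = acceptance_criteria_coverage(["AC1", "AC2"], ["AC1", "AC2"])
print(recall)    # 0.75
print(coverage)  # 1.0
```

The same two numbers can then be compared against the targets chosen per criterion (for example, the 90%+ coverage target mentioned above).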

Each criterion targets a specific aspect of quality that matters to our end users (the QA team). We’re measuring concrete traits that determine whether the agent provides real value.

The key is to measure quality from multiple angles, including edge cases. A test suite could score high on recall (finding all the important scenarios) but low on coverage (missing acceptance criteria details). Both matter, so we measure both.

Component 2 – Evaluation Methods: The HealthEdge team pursued three approaches:

  1. Automated computable metrics (exact match, fuzzy match) work when success is mathematically defined.
  2. Human evaluation handles judgment requiring domain expertise.
  3. LLM-as-a-judge uses another LLM to evaluate based on a rubric.

For this project, we used automated checks for format and human subject matter experts (SMEs) for quality assessment.
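As one illustration of the automated approach, a fuzzy match can decide whether a generated test title corresponds to an expected one. This is a sketch using Python's standard `difflib`; the threshold and example titles are assumptions, not the project's actual checks.

```python
# Sketch of a fuzzy-match metric: two test titles count as the same
# case when their similarity ratio clears a chosen threshold.
from difflib import SequenceMatcher

def fuzzy_match(expected: str, generated: str, threshold: float = 0.85) -> bool:
    """Return True if the two titles are near-identical."""
    ratio = SequenceMatcher(None, expected.lower(), generated.lower()).ratio()
    return ratio >= threshold

print(fuzzy_match("Verify login button is disabled",
                  "Verify the login button is disabled"))  # True
print(fuzzy_match("Verify login button is disabled",
                  "Check password reset email"))           # False
```

Exact match works the same way with the threshold set to 1.0; anything requiring real domain judgment falls through to the human or LLM-as-a-judge methods.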

Component 3 – The Evaluation Dataset: This is the most critical component. If the dataset doesn’t match production, the process will miss problems. For example, a resume-screening tool evaluated only against software engineer resumes might fail on designer or marketer resumes in production. Evaluation datasets must follow three rules:

  1. Representative means it reflects the actual distribution of cases you’ll see in production. If 60% of production tickets describe UI features, 30% describe API changes, and 10% describe infrastructure work, your evaluation dataset should match those proportions. If edge cases happen 5% of the time in production, they should appear roughly 5% of the time in your dataset.
  2. Diverse means covering the full range of scenarios, including edge cases and failure modes. For our QA agent, we need Jira tickets that vary in complexity (simple bug fixes vs. major features), clarity (well-written vs. vague requirements), and completeness (detailed acceptance criteria vs. minimal descriptions). Each variation might affect output quality differently.
  3. Consistent means the ground truth labels or expected outputs are reliable and reproducible. If three QA engineers evaluate the same test cases, they should largely agree on what’s required and what’s comprehensive. Inconsistent ground truth means you’re measuring noise instead of signal.
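Consistency can be spot-checked numerically. Below is a minimal sketch of pairwise percent agreement between annotators, assuming three QA engineers label the same candidate test cases as required (1) or not (0); the labels and workflow are illustrative, not the actual Wellframe process.

```python
# Pairwise percent agreement: the average fraction of items on which
# each pair of annotators assigns the same label.
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Average agreement across all annotator pairs (labels aligned by item)."""
    scores = []
    for a, b in combinations(labels_by_annotator, 2):
        agree = sum(x == y for x, y in zip(a, b))
        scores.append(agree / len(a))
    return sum(scores) / len(scores)

# Three QA engineers labeling five candidate test cases (1 = required).
annotations = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
]
print(round(pairwise_agreement(annotations), 2))  # 0.73
```

If agreement is low, the labeling guidelines need tightening before the dataset is trusted as ground truth; chance-corrected statistics such as Cohen's kappa are a common next step.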

For this project, the Wellframe QA team curated a substantial dataset of real Jira tickets spanning different feature types and created the “required” test cases for each. This gave the team reliable ground truth to measure against, as it was built by the very subject-matter experts who will be using the agent in production.

Component 4 – The Execution Plan: Evaluations can be offline (using your dataset), which is comprehensive and controlled, or online (monitoring production), which catches unexpected inputs but often lacks ground truth.

For our QA agent, we chose offline evaluation because our criteria require judgment from human subject matter experts. The strategy centered on periodic manual reviews conducted every few weeks during development. Before releases, a comprehensive evaluation served as a quality gate. In the post-deployment phase, the team shifted to continuous monitoring.

Putting It All Together 

To recap our successful process, the critical steps we followed included defining concrete criteria, choosing appropriate methods, investing in a high-quality dataset, and designing an execution plan. For our QA agent, we accepted that evaluation requires human SMEs. We prioritized offline evaluation and invested in a diverse dataset with ground truth.

The result: A confident deployment with evidence of strengths and visibility into limitations.

Contact HealthEdge to learn how our AI solutions are reinventing the way our software solutions are being designed and tested. 

About the Author

Justin Wolkowicz is a software engineer at HealthEdge. During his time with the company, he has contributed to a range of initiatives spanning software and data science, with his current focus centering on the development of the company's AI platform. A Boston College graduate, he has carried his love of innovative problem-solving into both his career and personal projects. Outside of HealthEdge, Justin is passionate about the intersection of tech and philanthropy, and has developed a range of projects uniting immersive digital experiences and non-profit education.