Last week was a solo parenting week. That meant two things: a delay in writing, but also less screen time and lots of time to think about one question: what's the feedback loop for teams building agents (in healthcare)? Evals seem like the answer, but it's not that simple.
Recently, two questions have been featuring frequently in my conversations, both with customers and with peers.
Aren't evals and audits the same? They seem aligned towards the same goal.
In a world where Opus 4.6 & Claude Code have become exceedingly good, do we even need an eval layer?
The answer to both starts with a story about insurance underwriters and noise.
The 55% Problem
In 2015, Daniel Kahneman studied 48 insurance underwriters at a major company. He gave them five identical customer profiles - same age, same health, same risk factors - and asked each to quote a premium. The executives expected maybe 10% variation: "We all follow the same underwriting rules."
The actual variation: 55%.
One underwriter quoted $9,500. Another quoted $16,700. Same customer. Same risk. Same guidelines. But here's the thing: most orgs don't know their noise level because they've never measured it.
Noise is real. Across decisions. Across industries.
So how do you reduce noise? The goal is simple: get decisions and judgments more aligned with documented policies. The obvious answer: throw humans at auditing and quality evaluation.
Meet Low-Cost Sampling
Throw humans at the problem? That's expensive. Over decades, quality assurance has graduated and settled comfortably at sampling. Call centers review 3-5% of interactions. Healthcare providers audit random charts. Financial services spot-check transactions.
And everyone accepts the trade-off: "expert judgment" (the gold standard) versus scale (too expensive). This sampling problem isn't unique to insurance. Take a collections agency with 250,000 mortgage collection calls annually:
Take 20 auditors doing this day in, day out. Even if each review takes 20 minutes, their annual audit capacity is ~40,000 reviews. Quality coverage: ~16%, and that's optimistic. In reality, a 20-person audit team is rare; most organizations achieve 3-5%. Which means you're blind to 95% of what's happening.
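To make the coverage math concrete, here's a quick back-of-the-envelope sketch. The one-third review-time split is my assumption to reconcile the figures; the rest of an auditor's year typically goes to calibration, disputes, and reporting:

```python
# Back-of-the-envelope audit coverage. Staffing and time-split numbers
# are illustrative assumptions, not figures from a real program.

calls_per_year = 250_000
auditors = 20
minutes_per_review = 20

# Assume ~1/3 of a ~2,000-hour work year is pure review time.
review_hours_per_auditor = 2_000 / 3
reviews_per_auditor = review_hours_per_auditor * 60 / minutes_per_review  # ~2,000

total_reviews = auditors * reviews_per_auditor  # ~40,000
coverage = total_reviews / calls_per_year       # ~0.16

print(f"reviews/year: {total_reviews:,.0f}, coverage: {coverage:.0%}")
```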
The Promise of LLM-as-Judge
When LLMs emerged, they promised a breakthrough: evaluate 100% of cases at $0.01 per review instead of $50 for a human review. And for quality monitoring, it genuinely works: you can detect patterns at scale, guide improvement efforts, track performance trends, and more.
Eval platforms provide real value here. There is a sizeable LLM token bill, but nothing to the tune of the ~$2M a human audit team costs.
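Run the numbers from the collections example above (the per-review prices are the illustrative figures from this post; real LLM costs vary with model and transcript length):

```python
# Cost of the human sampling program vs. LLM-as-judge at full coverage.
# Per-review prices are illustrative, taken from the figures above.

calls_per_year = 250_000
human_cost_per_review = 50    # expert reviewer
llm_cost_per_review = 0.01    # LLM-as-judge token cost

human_reviews = 40_000        # the 16%-coverage program above
print(f"human sampling:     ${human_reviews * human_cost_per_review:>12,.0f}")   # $2,000,000
print(f"LLM, 100% coverage: ${calls_per_year * llm_cost_per_review:>12,.0f}")    # $2,500
```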
Why LLM-as-Judge Can't Be an Audit
Here's what happens when "100% coverage with 85/100 quality scores" meets regulatory reality: The AI team presents their eval results. Quality looks solid. Coverage is comprehensive. They're raring to move pilots to production.
The compliance officer asks a simple question: "Show me how this interaction complied with our protocols. Which specific criteria from the protocol were met?"
The team pulls up the eval score: 85/100. The reasoning says "appropriate care with good documentation." That isn't an answer to the question.
And it gets worse: run the evaluation again and the score is now 82/100. The compliance officer stops them. "We can't defend a system to regulators if the same input produces different outputs. Not approved."
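You can demonstrate this failure mode in a few lines. The judge below is simulated with random jitter, but a real LLM judge varies on identical input for analogous reasons (sampling temperature, model updates, prompt drift):

```python
# Minimal sketch of judge noise. judge_score() SIMULATES an LLM judge;
# swap in a real model call and the same spread shows up in practice.
import random
import statistics

def judge_score(transcript: str) -> float:
    # Stand-in for an LLM-as-judge call: "true" quality ~85, plus noise.
    return 85 + random.gauss(0, 2)

scores = [judge_score("the same call transcript") for _ in range(10)]
print(f"min={min(scores):.0f}  max={max(scores):.0f}  "
      f"stdev={statistics.stdev(scores):.1f}")
# Same input, different outputs: exactly what an audit cannot tolerate.
```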
Some version of this conversation happens constantly in healthcare, financial services, and pharma. While solving the sampling problem, AI engineers create a new one: variability.
Evals ≠ Audits (Different Problems Entirely)
Here's what that compliance officer require that many AI teams don't fully comprehend. Evals and audits aren't solutions to fundamentally different problems.
Evals answer: "How good is this?" (Quality measurement)
Audits answer: "Does this provably follow regulation X?" (Compliance proof)
Here’s a better analogy.
Evals are your speedometer - track performance, identify trends, signal problems. Some variance is fine. You want to know if you're generally going 65 or 85 mph.
Audits are your safety inspection - prove you meet regulatory standards, pass emissions, are legal to drive. No variance allowed. Your brakes either meet the standard or they don't.
You wouldn't skip the speedometer. But you also can't pass inspection by showing your average speed.
Why LLM-as-Judge Works for One But Not the Other
This isn't a criticism of LLM-as-judge. It's brilliantly designed for what it does. For quality monitoring (evals), LLM-based (System 1) approaches are perfect. Noise is acceptable when you're tracking trends.
For compliance proof (audits), you need something else: deterministic, rule-based reasoning that produces identical results every time. Kahneman called this System 1 vs System 2 thinking:
System 1: Fast, intuitive, noisy (where LLMs operate)
System 2: Slow, deliberate, deterministic (what audits require)
A probabilistic system can't provide deterministic proof. It's not broken - it's just designed for a different use case.
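Here's a minimal sketch of what that System 2 layer can look like in code. The protocol criteria and phrase-matching rules are invented for illustration; the point is that the logic is explicit and the verdict is identical on every run:

```python
# Sketch of a deterministic (System 2) audit check. Criteria and
# matching rules are hypothetical; a real audit maps each criterion
# to a documented policy clause.

AUDIT_CRITERIA = {
    "identity_verified":      ["verified your identity", "confirmed your date of birth"],
    "mini_miranda_disclosed": ["attempt to collect a debt"],
    "payment_terms_stated":   ["total amount due", "due date"],
}

def audit(transcript: str) -> dict[str, bool]:
    """Same transcript in, same verdict out, on every run."""
    text = transcript.lower()
    return {
        criterion: any(phrase in text for phrase in phrases)
        for criterion, phrases in AUDIT_CRITERIA.items()
    }

verdict = audit("Hi, this call is an attempt to collect a debt. "
                "I've verified your identity; the total amount due is $1,200 "
                "and the due date is June 1.")
print(verdict)                # per-criterion evidence, not a score
print(all(verdict.values()))  # a defensible pass/fail
```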
POC Hell and the Path to Production
Most teams building AI agents go through three stages:
Stage 1: Observability — "What happened?"
Logs, metrics, monitoring
Can see what the agent did
Stage 2: Evals — "How good was it?"
Quality assessment, testing
Can measure performance
Stage 3: Audits — "Can you prove compliance?"
Regulatory verification
Can defend to regulators
Most teams are stuck between Stage 2 and Stage 3.
They have great eval coverage. Quality scores look good. But compliance won't approve production deployment because they need deterministic proof, not probabilistic scores. Regulated industries need Stage 3 to deploy.
The Second Question: Do We Even Need an Eval Layer?
The latest breed of models has become exceedingly good at code generation, and therein lies the answer. The complexity and subjectivity of the task decide whether you need just evals, or evals plus audits. Here's a detailed breakdown, along with a fun interactive visual, to understand this better: what requires evals, where models are already exceedingly good, and where subjectivity demands more.

If you've read this far, do let me know if this made sense. Just reply "got it." If something didn't land, tell me that as well. The fun part is figuring out how to say complicated stuff in a way that actually sticks.
And if you liked it, maybe send it to someone who is wrestling with all this jargon?

