Last edition, I covered why evals are not audits. Claude rephrased it as: evals are vibe audits. And it’s only natural to ask: can we do an audit at runtime and catch policy violations? That’s what guardrails are supposed to be.

It’s always important to ask how things were done before Nov ’22. Did we even have guardrails before GPT?

Guardrails Are Not New

Of course we did. We just didn't call them that. Remember how sign-up forms required a business email and wouldn’t accept Gmail IDs? That was a classic marketing guardrail.

  • Every claims processing system had validation rules.

  • Every EHR had required fields and range checks.

  • Every prior auth workflow had decision trees that enforced policy before a human saw the output.

  • If a dosage fell outside the approved range, the system rejected it. Not probabilistically. Deterministically.

These weren't sexy. Nobody built a pitch deck around them. But they worked. A business rule that says "pediatric dosage cannot exceed X mg/kg" doesn't hallucinate. It doesn't drift.
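A rule like that is a few lines of deterministic code. Here is a minimal sketch; the 2.0 mg/kg ceiling and the function name are hypothetical, purely for illustration:

```python
# A pre-GPT-style guardrail: a hard range check on structured fields.
# The ceiling value below is hypothetical, not a real clinical threshold.
MAX_PEDIATRIC_DOSE_MG_PER_KG = 2.0

def check_pediatric_dose(dose_mg: float, weight_kg: float) -> tuple[bool, str]:
    """Reject any order whose per-kg dose exceeds the ceiling. No model, no scores."""
    dose_per_kg = dose_mg / weight_kg
    if dose_per_kg > MAX_PEDIATRIC_DOSE_MG_PER_KG:
        return False, (f"Rejected: {dose_per_kg:.2f} mg/kg exceeds "
                       f"ceiling of {MAX_PEDIATRIC_DOSE_MG_PER_KG} mg/kg")
    return True, "Within approved range"

# Same input, same verdict, every time. It cannot hallucinate or drift.
print(check_pediatric_dose(50.0, 10.0))  # 5.0 mg/kg: over the ceiling, rejected
print(check_pediatric_dose(15.0, 10.0))  # 1.5 mg/kg: approved
```

The whole point is that the check operates on discrete, machine-readable fields, which is exactly what disappears in the next section.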

Here's what actually changed in the post-GPT world, and people underestimate it.

Take prior authorization as an example.

Input

  • Pre-GPT: Structured form — CPT code, ICD-10 diagnosis code, patient age, plan ID. Discrete fields. Machine-readable.

  • Post-GPT: A doctor's free-text note — "Patient presents with chronic lower back pain radiating to left leg, failed physical therapy and NSAIDs over the past several months, requesting MRI lumbar spine."

Processing

  • Pre-GPT: Decision tree matches codes against policy criteria. If CPT 72148 (lumbar MRI) requires 6 weeks of documented conservative treatment, the system checks for the treatment codes. Present or absent. Binary.

  • Post-GPT: The agent has to interpret the narrative. Extract clinical facts. Determine that "several months" satisfies the 6-week conservative treatment requirement. Match to the right policy. Apply the right version of the criteria.

Output

  • Pre-GPT: Approved or denied. Reason code. Done.

  • Post-GPT: A natural language explanation of the decision — not a code, but generated text.

Guardrail

  • Pre-GPT: If the dosage field says 500mg and the policy ceiling is 200mg, reject. No ambiguity. No interpretation. The guardrail operates on the same structured data as the workflow.

  • Post-GPT: ...what exactly? The old validation rule checked if a coded field fell within range. Now there's no coded field. There's a paragraph of generated text that sounds right but might have interpreted "several months" as meeting a threshold it doesn't actually meet, or pulled the 2021 criteria when the 2024 update changed the required treatment duration.

Both the input and the output went from structured to unstructured. And the guardrails that worked on structured data — the ones that actually enforced your SOPs — have nothing to grab onto.

So what filled the gap? Content safety. Toxicity filters. PII detection. Not because they solve the compliance problem, but because they're the guardrails that can operate on unstructured text. They check the surface - is the language professional? is there PII leakage? - because checking the substance requires understanding your SOPs, your policies, your specific decision logic. Kahneman calls this the substitution effect.

When faced with a hard question, we answer an easier one instead, and don't notice we've done it.

Daniel Kahneman
The hard question here is: "Does this AI output comply with our SOPs?" That requires reasoning over policy logic, version checks, patient-category matching - slow, deliberate, System 2 work. 

The easy question is: "Does this output pass content safety?" Toxicity check, PII scan, jailbreak filter — fast, pattern-based, System 1 work. The dashboard shows green. Everyone nods.

Just Like Audits, Guardrails Have Feedback

Here's the thing most teams miss: guardrails aren't just a safety or compliance enforcement system. They provide feedback and lay the foundation for a closed-loop learning system.

The ideal closed-loop system learns from its own execution.

The point isn't just to block bad outputs. It's to catch what went wrong, figure out why, and feed that back into the system so it gets better.

Shallow catch → shallow improvement. 
Rich catch → structural improvement.

Think about it:

  • Shallow feedback: "Toxic content detected" → block the response, move on. System learns nothing useful.

  • Rich feedback: "Response cited Protocol v2021 when v2024 changed the threshold from 5.0 to 7.5. Knowledge base version control needs update." → System learns exactly what to fix.

The first triggers a whack-a-mole feeling. The second is an actual learning loop. Most "guardrails" today are the first kind. Regulated industries need the second.

Simply put: Guardrails = Real-Time Audits.

If evals are post-deployment quality checks, and audits are post-deployment compliance proof - then guardrails should be audits done in real time,
with enforcement. Not "did the AI say something harmful?" but "did the AI follow Protocol X, Section 4.2, using the correct version and the right patient category?" — checked before the response goes out. And if it didn't, blocked with a specific reason why.

The 2X2

                                     Basic Quality         SOP Compliance
Runtime (before response)            Content Guardrails    Real-Time Audit (SOP enforcement)
Post-deployment (after the fact)     Evals                 Audits

Most teams have the left column. Very few have the right column at runtime. And that gap is where the real risk lives. To be fair, this is an evolving landscape. Teams are still figuring out how to go about this. Most use cases haven’t even hit production, and missing guardrails to enforce SOPs remain a key reason. Options?

Meet - Human In The Loop

"Don't worry, we have a human-expert-in-the-loop." This is the phrase that ends the safety conversation too early. Lender of last resort. And to be fair - HITL makes sense. As a bridge. When your system isn't mature enough to enforce SOPs automatically, having a clinician review outputs before go live is rational. Most teams building AI in healthcare are here. That's fine.

But here's where it breaks: human-in-the-loop distributes liability without solving the underlying problem. Read Claire Hast on her experience, as a patient, with an AI scribe: the AI fabricated findings to make the note look complete. The doctor scrolled to the bottom and signed. The chart now says "clinician-reviewed and signed."

The doctor's calculus was rational. Reading the AI note: 3-5 minutes. Patient backlog: 45 minutes behind. Risk of not reading carefully: low (nobody checks,
AI usually sounds right). Risk of falling further behind: high.

Throughput beats accuracy. Every time. Not because doctors are lazy. Because the incentives are misaligned. And HITL under throughput pressure becomes accountability theater - the signature means "I am now liable," not "I verified accuracy."

So what would be an alternative approach to guardrails?

In the last edition we covered why LLM-as-a-Judge approaches for evals don’t move the needle. The same applies to guardrails.

Trying to build guardrails with the same probabilistic logic that produced the output doesn't work. Embeddings retrieved the wrong context; now we're using embeddings to check whether the output matches the (wrong) context. That's checking your work with the same logic that failed.

Guardrails need to be grounded in a source of truth - Not 1,000 embeddings of chunks of text. This is exactly where neuro-symbolic approaches come in.

Neuro-symbolic approaches combine the strengths of both. The LLM handles what it's good at - interpreting unstructured language, extracting clinical intent from messy notes, understanding context. Then a symbolic reasoning layer - ontologies, decision graphs, versioned policy rules - handles what it's good at: checking that extracted intent against your actual SOPs. Deterministically. With traceability.

That's how you get from "does this sound right?" to "does this provably follow the right protocol?" - which is what guardrails were always supposed to do.
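A minimal sketch of that division of labor, reusing the prior-auth example. The LLM extraction step is stubbed out, and the rule table, version dates, and week thresholds are all hypothetical:

```python
from datetime import date

# Symbolic layer: versioned, deterministic policy rules (hypothetical values).
POLICY = {
    ("CPT-72148", "v2021"): {"min_conservative_weeks": 4},
    ("CPT-72148", "v2024"): {"min_conservative_weeks": 6},
}

def current_version(today: date) -> str:
    # Hypothetical effective date for the v2024 update.
    return "v2024" if today >= date(2024, 1, 1) else "v2021"

def check_prior_auth(extracted: dict, today: date) -> tuple[bool, str]:
    """Check LLM-extracted facts against the *current* versioned rule."""
    version = current_version(today)
    rule = POLICY[(extracted["cpt"], version)]
    weeks = extracted["conservative_weeks"]
    if weeks < rule["min_conservative_weeks"]:
        return False, (f"{version}: requires {rule['min_conservative_weeks']} weeks "
                       f"of conservative treatment, note documents {weeks}")
    return True, f"{version}: criteria met ({weeks} weeks documented)"

# In a real system, an LLM would produce `extracted` from the free-text note.
# Here it is stubbed: "failed PT and NSAIDs over the past several months".
extracted = {"cpt": "CPT-72148", "conservative_weeks": 10}
print(check_prior_auth(extracted, date(2025, 1, 15)))
```

The neural side can still misread "several months," but the symbolic side always applies the right version of the rule and says exactly which criterion passed or failed. That is the traceability the content-safety layer can't give you.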

That’s this edition. Next edition, we will cover chain of thought.

If you have read this far, do let me know. Just reply "got it." If something didn't land, tell me that as well. The fun part is figuring out how to say complicated stuff in a way that actually sticks.

And if you liked it - maybe send it to someone who is wrestling with all this jargon?
