How We See It

The AI Arms Race in RCM Is Optimizing the Wrong Thing

Why reliable AI in healthcare billing turns out to be less about the AI itself and more about what's built around it.

By Joe Chou, Head of Data

8 min read

Contents

The Core Question

What Happens When the AI Gets Something Wrong?

There's a question every AI vendor in the revenue cycle space should be answering right now: what happens when the AI gets something wrong? And can AI truly replace human judgment in its current state?

A lot of companies are promising their systems can reduce or even eliminate human error. But healthcare billing doesn't happen in a controlled environment. Claims are submitted for real patients, to real payers, each with their own rules, edge cases, and unpredictable reasons for denying or rejecting something.

The answer to those two questions tells you more about a vendor than anything in their sales deck. And understanding why starts with a concept that has been quietly shaping how the best engineering teams are building AI in 2026.

The Evolution of AI Engineering

From Telling AI What to Do, to What It Should Know, to Never Making the Same Mistake Twice

The first wave of AI tools was built on the belief that better prompts lead to better outputs. Give the model clearer instructions and it performs better.

Then came context engineering. The focus shifted from "how do we ask better questions?" to "how do we give the model better information?" In healthcare billing, that means feeding in payer contracts, fee schedules, and historical claims data so the system could make smarter decisions about how to get claims paid.

But a huge chunk of billing knowledge doesn't live in any dataset, it lives in people's heads. Experienced billers know things that never make it into a contract or rules engine: which payers quietly require an undocumented modifier, which denial codes are basically noise, which claims will bounce back no matter what and why. You can get closer with better inputs and better prompting, but you're still relying on humans to surface and maintain that knowledge. And that doesn't scale.

That's the gap harness engineering closes. Prompts and data still matter, but the real shift is building systems that learn from what actually happens: capturing corrections, adapting over time, and turning tribal knowledge into something persistent. Without it, the system only stays current if people keep updating it. With it, it gets better on its own over time.

Good Engineering Principles

Harness Engineering Existed Before AI, but AI Is Where Its Importance Really Shows

What the best engineering teams have figured out is that building reliable AI is not a new problem. It's the same set of problems they've always dealt with: clear responsibilities, well-defined inputs and outputs, components that don't blindly trust each other, solid testing, and enough visibility to understand what went wrong when something breaks.

A lot of teams, caught up in the hype, dropped those fundamentals in favor of "move fast and let the model figure it out." The teams seeing the strongest results in 2026 are the ones who didn't do that. They kept applying standard engineering discipline to AI the same way they would to any other complex system. Some call it harness engineering, but really it's just good engineering.

More Difficult Than The Demo Suggests

Applying AI to Billing Is Harder Than It Looks

A strong AI model on its own isn't enough for RCM because billing isn't just a prediction problem. It depends on a long list of payer-specific rules, strict sequencing, and judgment calls about when to escalate to a human. The model can help with the reasoning, but the system around it is what makes that reasoning usable in production and durable over time.

Billing isn't one decision. It's a chain of connected checks that all have to be answered correctly: is eligibility active, was the service authorized, is the coding appropriate for the diagnosis, does this payer require a modifier in this situation, is the documentation sufficient for audit, does this claim look unusual for this provider. Each step depends on the ones before it.

Every step in that chain matters. Small issues early in the process often don't show up until a claim is denied weeks later.

A general-purpose model tends to handle each step in isolation, based on what it has seen in training. It doesn't naturally stay in sync with constantly shifting payer rules or the operational context of how billing actually gets done. That's where a harness comes in. It structures the workflow into steps with explicit inputs and outputs, adds checkpoints so errors are caught earlier, and logs what happens at each stage so failures are traceable.

This also helps explain why some AI billing tools look strong in demos but struggle in real use. Demos are usually run on clean, predictable data where models perform well. Real billing environments are inconsistent, noisy, and constantly changing. A model without a supporting system tends to work in the first setting, but not the second.

What Actually Matters

AI Isn't the Product

Most billing tools get evaluated on the AI itself. The demos look impressive, the accuracy metrics sound strong, and the pitch is polished. But none of that tells you how the system performs in the real world: when a payer rule changes unexpectedly, a modifier requirement varies by state, or a denied claim exposes an edge case the model has never seen before.

What actually determines whether the system holds up is everything built around the AI:

The rules that govern what it does in edge cases

Without explicit rules for handling unusual situations, the AI defaults to whatever pattern it saw most often in training data - which may be completely wrong for your payer mix.

The checkpoints that catch its mistakes before they become denied claims

Pre-submission validation that flags likely denials before they hit the payer - turning costly rework into a quick fix.

The feedback loops that make it smarter every time something goes wrong

When a denial occurs, that outcome should automatically update the system's logic - not just get logged and forgotten.

The human oversight that provides a final check on high-stakes decisions

In a regulated environment where every decision carries financial and compliance implications, human review of exceptions isn't a failure of automation - it's the design.

In other words, the AI is just one part of the workflow. Let's compare the two approaches side by side:

Think of it this way. Investigating a denied claim requires getting ten things right in sequence. A single AI doing all of it is like rolling a dice ten times in a row and hoping they all land in your favor. When the outcome is wrong, you have no way of knowing which roll caused the problem. A well-built harness rolls one dice at a time. Each step gets its own check, and when something goes wrong you know exactly which roll failed and why. You fix that one thing, not the whole chain.

In practice, each part of the denial workflow becomes its own controlled process. The system records what happens at every stage, so failures are traceable. You fix the specific step that's off, validate the improvement, and move on. Over time, each step gets more reliable because you're improving it individually, rather than trying to tune the whole sequence at once.

Practical Building Blocks

What a Harness Actually Looks Like in RCM

Harness engineering in RCM comes down to a few practical building blocks:

Feedback loops that turn denials into lasting fixes

When a claim is denied, the right question isn't just "how do we fix this claim." It's "why did this happen and how do we make sure it never happens again." A harness takes denial outcomes and converts them into structured rules or updates in the system. Over time, patterns that once lived in individual billers' experience get captured in the workflow itself.

Structured workflows rather than one big black box

A harness breaks the process into discrete steps, each with defined inputs, outputs, and validation. Eligibility, authorization, coding, and denial review can run as separate stages, but each one is observable and measurable. When something breaks, you can pinpoint the step.

A context layer built on real adjudication data

The value of context depends entirely on what data it comes from. Generic healthcare data leads to generic decisions. A system trained on specialty-specific claims adjudication history across hundreds of payers, dozens of states, and millions of claims knows how this payer actually behaves on this code with this modifier.

Human-in-the-loop design that scales trust over time

The right design starts with human oversight and expands automation only as confidence in the system's accuracy is established. The AI surfaces what to look at and what the data says; the human makes the final call. As accuracy improves, the human role gradually shifts from making every decision to focusing on exceptions and oversight.

Evaluating RCM Technology

What This Means If You Are Evaluating RCM Technology

A lot of RCM AI evaluations end up focusing on the wrong signals. Model accuracy in controlled tests. How fast implementation can happen. Projected reductions in cost-to-collect. These are useful, but they don't really answer the question that matters most: how the system behaves once it's live in a messy, changing environment over years.

The questions that matter are:

What happens when a claim is denied? Does that outcome feed back into the system in a structured way, or does it just become another case for a human to work? Vendors that can explain how denials actually change future behavior are describing a learning system. If they can't, it's closer to a workflow tool wrapped around AI.
What does performance look like in year two compared to year one? A system with a real feedback loop and structured learning should improve as it processes more claims, denials, and payer behavior. For a multi-site clinic or a PE-backed platform with a four to seven year horizon, this compounding effect is often the difference between incremental gains and sustained improvement.
Where does the human fit in the process? Not as a fallback for when the system fails, but as part of the design. In practice, humans review edge cases, validate higher-risk decisions, and help shape how the system handles exceptions. As reliability improves, their role can shift toward oversight and exception handling, but they remain an intentional part of the workflow.

The Bottom Line

The AI Arms Race Is Already Over

There's a lot of noise right now about which AI model is the best right now: most powerful, most accurate, best at reasoning. In RCM, that's not really the deciding factor. The goal was never to use the most or latest AI, but to get claims paid correctly and consistently.

The vendors that will matter over time are the ones that use AI where it actually helps and don't force it where it doesn't. Some parts of the billing workflow are well-suited for automation, while others still depend on human judgment.

That judgment doesn't come from models, but rather from being close to the work: working through real claims, real denials, and the edge cases that never show up in clean datasets. It's less about having the latest model and more about understanding how billing actually behaves when things get messy. And trying to replace that too early usually creates more problems than it solves.

Systems built from that perspective tend to look different because they don't assume every step should be automated the same way. Instead, they focus on where things tend to break, where risk concentrates, and where humans still consistently do better than models. And they improve those areas over time based on what actually happens in production.

The advantage doesn't come from swapping in a better model. It comes from accumulated operational knowledge, built up through actual claims over time, and in this space, that tends to matter more than model selection.