Scale AI and the Center for AI Safety tested frontier AI agents on 240 real professional projects pulled from Upwork (the Remote Labor Index benchmark): financial modeling, data analysis, architectural work. The average project cost $630 and took a human 29 hours to complete. The best AI agent completed 2.5% of projects at a quality a paying client would accept.

That is a 97.5% failure rate on real professional work. Not because the models are weak. Because the context required to do the job well existed outside the prompt.

In a commercial real estate private equity (CRE PE) firm, that missing context has a home. It lives in the investment committee (IC) room.

The deals that go to the IC room are the ones AI handles worst

Most teams start deploying AI on the lower-stakes work: property summaries, market comps, first-draft IC narratives, portfolio reporting templates. That is the right place to start. The models handle these tasks reasonably well because the deliverable format is known, the inputs are structured, and the definition of good enough is clear.

The problem is what happens as AI fluency builds confidence and deployment scope expands. Teams start applying AI-assisted analysis to complex underwriting. The outputs look polished. The models are articulate. The numbers are internally consistent.

And then the deal goes to the IC room.

Research on AI agent performance shows a consistent pattern: agents perform best on the middle of the distribution and worst at the edges. Routine cases get handled well. Unusual cases at either extreme of the risk spectrum get handled poorly. The agent is most confident precisely where confidence is least warranted.

The IC room is not where you bring routine cases. It is where you bring the bridge loan approaching maturity on an asset whose comp set has deteriorated. The preferred equity position with a general partner (GP) who has a strong track record but a capital structure that doesn't survive 50 basis points of further cap rate expansion. The value-add thesis that depends on a rent growth assumption requiring a specific view on Class B absorption in your market. These are the deals that need the room. They are also exactly the cases where AI is most likely to produce an output that looks authoritative and isn't.
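
To make the cap rate point concrete: under direct capitalization, value is NOI divided by the cap rate, so a 50 basis point expansion hits levered equity far harder than it hits asset value. A back-of-envelope sketch, where every input is a hypothetical placeholder:

```python
# Hypothetical back-of-envelope: 50 bps of cap rate expansion vs a thin equity cushion.
# All figures below are illustrative assumptions, not a real deal.
noi = 1_000_000          # assumed stabilized net operating income
entry_cap = 0.050        # assumed entry cap rate (5.0%)
debt = 15_000_000        # assumed total debt across the stack (75% of entry value)

value_at_entry = noi / entry_cap                 # $20.0M
value_after_50bps = noi / (entry_cap + 0.0050)   # ~$18.2M

equity_at_entry = value_at_entry - debt          # $5.0M cushion
equity_after = value_after_50bps - debt          # ~$3.2M

print(f"Value decline: {1 - value_after_50bps / value_at_entry:.1%}")    # ~9.1%
print(f"Equity cushion lost: {1 - equity_after / equity_at_entry:.1%}")  # ~36.4%
```

A 9% move in value takes more than a third of the equity cushion in this toy stack. A preferred position sitting behind thinner common equity fares worse still.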

Your aggregate accuracy metrics won't surface this problem. An AI tool that performs at, say, 87% accuracy across your full deal pipeline looks excellent. But concentration of failures at the edges means the remaining 13% is not randomly distributed: it clusters in the cases where being wrong is most expensive.
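
The arithmetic behind that claim is worth a sketch. Assume, hypothetically, a pipeline that is 80% routine deals handled at 95% accuracy and 20% edge cases handled at 55%, with an edge-case miss costing five times a routine one. The aggregate reads 87% while the tail carries most of the expected loss:

```python
# Hypothetical illustration: aggregate accuracy hides where errors concentrate.
# All rates below are assumptions, not measurements.
routine_share, edge_share = 0.80, 0.20  # pipeline mix
routine_acc, edge_acc = 0.95, 0.55      # per-segment accuracy
cost_multiple = 5                       # assumed cost of an edge-case miss vs a routine miss

aggregate_acc = routine_share * routine_acc + edge_share * edge_acc
routine_err = routine_share * (1 - routine_acc)
edge_err = edge_share * (1 - edge_acc)

edge_error_share = edge_err / (routine_err + edge_err)
edge_loss_share = (edge_err * cost_multiple) / (routine_err + edge_err * cost_multiple)

print(f"Aggregate accuracy: {aggregate_acc:.0%}")               # 87%
print(f"Errors from edge cases: {edge_error_share:.0%}")        # ~69%
print(f"Expected loss from edge cases: {edge_loss_share:.0%}")  # ~92%
```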

The sponsor narrative anchors the output

The Mount Sinai Health System published a landmark study in early 2026 evaluating an AI medical triage tool. When a family member added a note minimizing a patient's symptoms — "she looks fine to me" — the system was twelve times more likely to recommend less urgent care, even when the objective clinical data pointed toward an emergency. The social framing hijacked the output. The agent didn't weigh the note as one data point among many. It anchored to it.

This is not a health-specific failure. It is a structural property of how language models process inputs that combine structured data with unstructured narrative. The financial model is the structured data. The offering memorandum is the narrative. The sponsor's track record summary, the broker's positioning language, the way the deal was introduced in the deal screening email — all of it is unstructured narrative that biases the output in ways the model never discloses.

A GP with confident, well-written deal materials anchors the AI's read of the financial model. The model does not tell you this happened. It produces a polished analysis that reflects the anchor without acknowledging it. The more institutional the sponsor's marketing, the more pronounced the effect.

The second-order consequence is subtle and worth sitting with. AI-assisted underwriting does not eliminate the influence of narrative framing on investment decisions. It hides it inside a document that looks like objective analysis. The IC room has always been where experienced partners push back on optimistic assumptions. That function becomes more critical, not less, when the optimism is embedded in AI output rather than in a sponsor's pitch deck.

The Verification Tax compounds at the IC level

There is a cost that most firms are not measuring. I call it the Verification Tax: the person-hours consumed verifying AI-assisted work that looks correct on the surface but requires expert review to confirm.

At the routine end of the pipeline, the Verification Tax is low. A property summary with a factual error is easy to catch. The cost of a miss is modest.

At the IC level, the Verification Tax is highest because the stakes are highest and the errors are hardest to detect. The AI-generated sensitivity analysis that uses an exit cap rate assumption mirroring current conditions rather than conditions at the projected hold period exit. The comp set that reflects CoStar headline figures rather than actual rent rolls. The waterfall model that is technically correct but misses a promote structure nuance that materially changes the GP-LP return split in the base case.

None of these errors are visible without domain expertise. The model produces them with the same formatting and apparent precision as a correct output. The only thing that catches them is a person who has built enough deals to know what to look for — and who knows to look.

The verification burden falls on your most expensive people. When AI-assisted underwriting becomes the norm and the quality of that work is assumed rather than verified, the probability of an uncaught error reaching IC increases. That is not a failure of AI capability. It is a failure of deployment architecture.

Judgment Filters: encoding what the IC room knows

The fix is not a better model or more detailed prompting. The fix is what I call Judgment Filters: structured tests that encode your firm's accumulated investment experience into a systematic verification layer that runs before AI-assisted analysis reaches the IC room.

A Judgment Filter is a specific, testable question drawn from your firm's deal history. Not a general best practice from an AI vendor's documentation. Not a surface-level accuracy check. The kind of question that only someone who has lived through a deal cycle in your specific market, with your specific risk tolerance, can formulate.

Assumption integrity. Does this underwriting assume absorption above the trailing twelve-month market average without an explicit thesis for why this asset outperforms? Is the exit cap rate assumption a function of the current environment or the projected exit environment? Does the sensitivity analysis show the deal breaking before we reach the downside scenario we would actually consider in an IC conversation — or does the model only stress to a level that still produces an acceptable return?

Source verification. Has the comp set been checked against actual rent rolls, or does it reflect CoStar headline figures? Does the construction cost assumption reflect current subcontractor pricing, or does it embed a pre-tariff baseline? Are the operating expense projections derived from this asset class in this market, or from a generic template the model populated with defaults?

Context the model doesn't have. Is there any element of this deal's thesis that depends on a GP relationship, a verbal commitment from a municipality, or an anchor tenant relationship that is not documented in the data room? Does the capital structure survive a scenario where the construction lender tightens covenant terms at the first extension? What does the GP's asset management track record look like on deals that didn't execute to plan — and is that history reflected anywhere in this analysis?

These are not questions a model generates on its own. They are questions a CIO generates by thinking through the ways deals have gone wrong and encoding those patterns into a systematic check. The senior person who builds a library of Judgment Filters is doing something more valuable than reviewing AI output. They are turning hard-won loss avoidance into reusable infrastructure.
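
What that reusable infrastructure can look like is straightforward to sketch. The following is one minimal pattern, not a prescribed implementation; every field name, threshold, and rationale below is a hypothetical placeholder for a firm's own standards:

```python
# A minimal sketch of a Judgment Filter library, assuming a simple dict-based
# underwriting summary. Field names and thresholds are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgmentFilter:
    name: str                      # short handle for the check
    rationale: str                 # the loss pattern this filter encodes
    check: Callable[[dict], bool]  # returns True if the underwriting passes

def exit_cap_reflects_exit_environment(deal: dict) -> bool:
    # Flag models that project exit value at today's cap rate with no expansion.
    return deal["exit_cap_rate"] >= deal["entry_cap_rate"] + deal["min_exit_cap_spread"]

def absorption_has_explicit_thesis(deal: dict) -> bool:
    # Above-market absorption is allowed only with a documented outperformance thesis.
    above_market = deal["assumed_absorption"] > deal["trailing_12m_market_absorption"]
    return (not above_market) or deal["outperformance_thesis_documented"]

FILTERS = [
    JudgmentFilter(
        name="exit-cap-anchoring",
        rationale="Exit caps anchored to the entry environment overstated proceeds",
        check=exit_cap_reflects_exit_environment,
    ),
    JudgmentFilter(
        name="absorption-above-market",
        rationale="Lease-up missed pro forma when absorption reverted to market",
        check=absorption_has_explicit_thesis,
    ),
]

def run_filters(deal: dict) -> list[str]:
    """Return the names of filters the deal fails; an empty list clears it for IC."""
    return [f.name for f in FILTERS if not f.check(deal)]
```

The detail that matters is the rationale field: each check travels with the loss pattern that produced it, which is what makes the library institutional memory rather than a linter.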

The second-order effect of this work compounds. A Judgment Filter library built from five years of IC decisions becomes a form of institutional memory that survives personnel turnover. When the partner who caught the promote structure error three years ago leaves the firm, the Judgment Filter remains. The knowledge is no longer locked in one person's head.

What the Harvard data is telling the market

A Harvard study of 62 million American workers across 285,000 firms between 2015 and 2025 found that companies adopting generative AI saw junior employment drop roughly 8% relative to non-adopters within 18 months. Senior employment kept rising. The conventional interpretation, that AI replaces junior workers, is the first-order read.

The second-order read is more precise: AI replaces task execution. Junior employees have historically been hired for tasks — first-pass underwriting, document review, comp pulls, draft IC narratives. AI handles these tasks adequately in isolation. Senior employees survive because they do something different. They hold the mental model of what can go wrong. They know which assumptions are load-bearing. They know the decision history that never made it into the deal files. They know when technically correct output is organizationally wrong.

The market is relearning, in real time, that context is the scarce resource — not analytical throughput. The firms discovering this too late are the ones that reduced senior headcount assuming AI had replicated their judgment. Gartner's February 2026 research predicted that by 2027, half the companies that cut staff for AI will rehire workers performing similar functions. Forrester's data showed 55% of employers already expressing regret over AI-driven layoffs.

The pattern is consistent enough to state as a rule. When you remove the people who hold institutional context and replace them with AI that can execute tasks but cannot carry that context, you have not automated judgment. You have eliminated it and left the appearance of it behind.

The question worth asking before your next IC

AI will make your deal pipeline faster. It will make first-pass underwriting cheaper and more consistent. It will reduce the time between deal identification and IC memo. All of that is real and worth building.

The IC room has always been where pattern recognition beats process. Where the partner who has seen a similar structure fail in a different rate environment raises a question that no model would surface. Where the asset manager who knows the submarket corrects the absorption assumption before it becomes the basis for a capital commitment. That function does not get replaced by AI. It gets more important, because AI produces outputs that carry the appearance of rigor regardless of whether the underlying judgment is sound.

The question worth sitting with before your next IC: which of your firm's hard-won lessons about how deals go wrong are encoded in a Judgment Filter that runs before AI-assisted analysis reaches the room? And which ones live only in the heads of the people who were there when the lesson was learned?

The second category is where your next loss is hiding.

AI Maturity Index

Is your firm's AI infrastructure producing signal or noise at the IC level?

The AIM Index evaluates your firm across ten operational dimensions — including how AI-assisted analysis is verified before it reaches investment decisions. It takes four minutes and surfaces where deployment architecture is creating risk you may not be measuring.

Take the Assessment