
AI Plays Telephone With Your Workplace Documents

By David Barry
The most dangerous AI failure does not crash a system or trigger an alert. It produces a document that looks perfect and reads completely wrong.

The most dangerous AI failure in the enterprise right now does not crash a system, trigger an alert or produce obviously garbled text. Instead, it produces a document that looks polished, reads coherently and travels through approvals while the meaning has shifted underneath.

A Microsoft Research preprint published in April tested 19 large language models on long, delegated document workflows across 52 professional domains. Even the best-performing frontier models corrupted an average of 25% of document content. Across all models tested, average degradation reached 50%. Giving models agentic tool access made no measurable difference.

It’s like a high-tech version of the Telephone game, only what comes out isn't merely garbled. It comes out polished, and wrong. 

The implications for organizations that have AI embedded in document-intensive work are clear. The question is whether businesses are noticing.

The Error That Looks Like Good Work

In practice, 25% corruption looks like a contract draft where a non-compete clause has been altered, a financial summary where a number is off by a digit or a research brief that cites a study that does not exist, said Gartner managing partner Jackie Swanson.

"The damage is silent and plausible, the kind of text that survives a casual review and only surfaces when someone tries to act on it," Swanson said.

"The real risk is not that AI makes a visible mess of a document," said Richard Harbridge, principal industry advisor at ShareGate by Workleap. "It is that the document still looks polished while the meaning has quietly changed. In an enterprise, that can be more dangerous than an obvious error because it moves through approvals, decisions and compliance processes with a sense of confidence it may not deserve."

This is even more serious in regulated environments, said Maxime Vermeir, VP of AI strategy at ABBYY. In banking, a Know Your Customer pipeline processing hundreds of onboarding documents daily could see salary figures pulled from the wrong column or entity names corrupted, feeding bad data into credit decisions that no loan officer questions because the dashboard looks clean, he said.

In healthcare, a 25% degradation rate could mean a medication dosage extracted incorrectly or a patient history partially overwritten during summarization.

"None of these errors announce themselves," Vermeir said. "They travel quietly through the system until a clinician catches them, or doesn't."

A Model Problem, Not a Product One

When degradation occurs, the instinct is to blame the vendor. "But if the core issue lives in the underlying model, switching products solves nothing," Vermeir said.

Microsoft Research tested models from six families including Google, Anthropic and OpenAI and found the same problem across all of them. It is a structural limitation.

Some enterprise buyers evaluate AI reliability as if it were purely a product feature, Harbridge said. But the failure could happen in multiple places: the model, the agent design, the workflow, the retrieval approach and the governance controls around it.

"Switching applications may not solve the problem if the same underlying limitations remain," Harbridge said. "At the same time, good product design can reduce the risk through validation, constrained workflows and human review."

The model does not lose fidelity because it lacks tools, explained Brian Sathianathan, co-founder and CTO at Iterate.ai. It loses fidelity because it loses grip on the source document as workflows lengthen.

"More capability isn't going to be a fix when the root issue is a comprehension problem," Sathianathan said.

More Tools, Same Failure

The finding that agentic tool access made no difference to degradation rates is significant given where enterprise AI adoption is heading. Organizations building autonomous document workflows assume the model at their core is a reliable delegate, when the research says it is not.

In fact, four models tested agentically performed worse than without tools, incurring an average additional degradation of 6%.

Calling tools does not fix long-context coherence, Swanson said. "It just gives the model more surface area to make confident mistakes," she said. "The failure is upstream of the workflow, in the model's ability to hold a document together over many steps."

The problem is that LLM-based extraction is inherently non-deterministic, and each step in a multi-tool agentic chain relies on the output of the step before it, Vermeir explained. If step one returns a subtly incorrect value, the rest of the chain propagates that error.
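
A rough way to see why chained steps are risky is to work through the arithmetic. The sketch below is illustrative only: it assumes every step independently preserves the same fraction of what it receives, which is a simplification and not a model taken from the research. The function name and the 0.98 figure are hypothetical.

```python
def retained_fraction(per_step_fidelity: float, steps: int) -> float:
    """Fraction of content still intact after a chain of steps.

    Assumption (not from the research): each step independently keeps
    `per_step_fidelity` of whatever the previous step handed it.
    """
    if not 0.0 <= per_step_fidelity <= 1.0:
        raise ValueError("per_step_fidelity must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_fidelity ** steps


# Hypothetical example: a step that is 98% faithful, repeated in a chain.
for steps in (1, 5, 10, 20):
    print(steps, round(retained_fraction(0.98, steps), 3))
```

Under that assumption, even a step that looks highly reliable in isolation loses a noticeable share of the document once it is repeated many times, and no later step has a way to recover what an earlier one changed.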

That has regulatory consequences. For organizations operating under the EU AI Act, DORA or GDPR, an agentic system where errors compound across workflow steps and no step can explain why it chose a specific value is structurally incompatible with regulatory requirements, Vermeir warned.

Where the Governance Issue Is Biggest

The highest-risk professional domains are not necessarily the ones with the most AI usage, but the ones where small changes carry large consequences: legal, finance, healthcare, compliance, cybersecurity, procurement and regulated reporting.

The problem is precision combined with accountability, Harbridge said. Where a document influences money, obligations, safety or regulatory posture, degradation becomes a material risk, he said.

"Most enterprises have no verification framework worth the name," Swanson said. They have data residency policies, prompt sensitivity classifications and vendor security reviews. But almost nobody is running structured output verification at scale, which is what the research says is missing.

Mature deployments require field-level audit trails, deterministic extraction for high-stakes fields and explainability documentation that could survive regulatory scrutiny, Vermeir said. Most organizations have some human-in-the-loop review and periodic spot-checks, but few have anything close to that standard. "That gap is where liability is accumulating quietly, one unverified output at a time," Vermeir said.
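
What a field-level check might look like can be sketched in a few lines. This is a minimal illustration, not a description of any vendor's product or of the method used in the research. The field names, the regular expressions and the `audit_fields` helper are all hypothetical, and it assumes the high-stakes values appear in the source text in a predictable pattern.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldAudit:
    """One audit-trail record: what the source says vs. what the model returned."""
    field: str
    source_value: Optional[str]
    model_value: str
    matches: bool


# Hypothetical high-stakes fields and the patterns used to read them
# deterministically from the source document.
PATTERNS = {
    "salary": re.compile(r"Salary:\s*([\d,]+)"),
    "dosage_mg": re.compile(r"Dosage:\s*(\d+)\s*mg"),
}


def audit_fields(source_text: str, model_output: Dict[str, str]) -> List[FieldAudit]:
    """Compare model-extracted values with values parsed directly from the source."""
    records = []
    for name, pattern in PATTERNS.items():
        if name not in model_output:
            continue
        match = pattern.search(source_text)
        source_value = match.group(1) if match else None
        model_value = model_output[name]
        records.append(
            FieldAudit(name, source_value, model_value, source_value == model_value)
        )
    return records


# Hypothetical usage: the model has transposed two digits in the salary.
source = "Salary: 52,000\nDosage: 50 mg"
for record in audit_fields(source, {"salary": "25,000", "dosage_mg": "50"}):
    print(record)
```

The point of the design is that each high-stakes value gets its own record, so a mismatch can be traced to a specific field rather than discovered later in a decision that depended on it.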

Enterprise buyers need to know from vendors where autonomous editing is reliable, where it is not, what problems are known and what controls are required before AI is used in consequential processes. "Trust improves when vendors are specific about limits, not just confident about capabilities," Harbridge said.

For organizations already running AI in document-intensive work, it becomes a liability issue when a corrupted document influences a consequential decision and nobody caught it, Sathianathan said.

"Clearly, the research is saying this is happening," Sathianathan said. "Now the question is whether anyone will be able to trace it back when something goes wrong."

The audit-committee conversation about who owns AI-output failures is coming regardless, Swanson warned. "Five years from now, that is the conversation that defines mature AI governance," she said.

Given the findings of the research, the question is whether organizations have five years to prepare.

Editor's Note: How else is AI influencing the direction of document management?

About the Author
David Barry

David is a European-based journalist of 35 years who has spent the last 15 following the development of workplace technologies, from the early days of document management, enterprise content management and content services. Now, with the development of new remote and hybrid work models, he covers the evolution of technologies that enable collaboration, communications and work and has recently spent a great deal of time exploring the far reaches of AI, generative AI and General AI.

Main image: Adobe Stock