Coding Agents Proved That Writing Code Was Never the Hard Part

Coding agents haven't failed — they've accidentally revealed that writing code was never the bottleneck in software engineering. Here's the data that proves it, and what it means for the profession.

Rori Hinds · 10 min read

Here’s an uncomfortable number: experienced developers using AI coding agents are 19% slower on real-world tasks — while believing they’re 24% faster. That’s not a rounding error. It’s a 43-percentage-point gap between perception and reality, measured in a rigorous controlled trial by METR across 246 real tasks.

But this article isn’t another “coding agents are overhyped” take. The data tells a far more interesting story.

Coding agents haven’t failed. They’ve accidentally run the largest experiment in software engineering history — and the result is a revelation: writing code was never the hard part. The hard part was always maintaining systems over time, understanding why code exists, reviewing changes for architectural coherence, and preventing regressions across complex systems. Coding agents just proved it with data we can no longer ignore.

Three things happened in rapid succession that make this undeniable: Alibaba’s SWE-CI benchmark showed 75% of AI models break working code during maintenance. LinearB’s analysis of 8.1 million pull requests revealed end-to-end delivery is 19% slower despite AI adoption. And Cursor — the most successful AI coding company on the planet — paid over $290 million to acquire Graphite, a code review tool. Not a code generation tool. A code review tool.

The industry is telling you something. Let’s listen.

The bottleneck in software engineering was never about writing code — it was everything that comes after.

The Seduction: Individual Productivity Gains Are Real

Let’s be fair to the optimists — the individual-level numbers are genuinely impressive. Developer adoption of AI coding tools has hit 85%. Claude Code alone generates approximately 135,000 public GitHub commits per day — roughly 4% of all public commits — projected to reach 20% by end of 2026.

At Anthropic, an internal survey of 132 engineers found 67% more merged PRs per day when using Claude Code. Boilerplate tasks see genuine 10x speedups. For isolated, well-defined problems, coding agents are extraordinary.

Sourcegraph CTO Beyang Liu reported on the a16z podcast that 90% of his code now originates from AI agents. That sounds like a revolution.

But here’s the twist: when Anthropic checked their organizational dashboard after that survey, delivery metrics hadn’t moved. More PRs merged. Same delivery. If you’ve been following the current state of coding agents, this paradox is becoming impossible to ignore.

The Perception Gap

In METR's controlled trial, developers believed AI tools made them 24% faster. The measured result: 19% slower. That's a 43-percentage-point gap between how productive developers feel and how productive they are. Google's 2024 DORA report found the same pattern: 75% of developers reported feeling more productive with AI, while every 25% increase in AI adoption correlated with 1.5% slower delivery throughput.
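The gap is plain percentage-point arithmetic on the two figures just quoted: a perceived speedup is compared against a measured slowdown, so the distances add rather than subtract.

```python
# Perception vs. reality, in percentage points (figures from the METR trial).
perceived_speedup = 24   # developers believed they were 24% faster
measured_change = -19    # the measured result: 19% slower

# The gap is the distance between the perceived and the measured change.
gap = perceived_speedup - measured_change
print(gap)  # 43 percentage points
```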

The Reveal: Organizational Data Tells a Different Story

LinearB’s 2026 Software Engineering Benchmarks Report analyzed 8.1 million pull requests from 4,800 engineering teams across 42 countries. It’s the largest dataset we’ve ever had on AI’s impact on real engineering organizations. The findings are striking:

  • PRs per author up 20% — developers are producing more code
  • End-to-end delivery 19% slower — shipping takes longer despite the extra code
  • Incidents per PR jumped 23.5% — more things break
  • Review times increased 91% — reviewers are drowning
  • AI-generated PRs wait 4.6x longer for review — because reviewers don’t trust them

The most devastating number: only 32.7% of AI-generated code passes review without modification, compared to 84.4% for human-written code. That’s a 51.7 percentage point gap.

Remember Beyang Liu, the Sourcegraph CTO whose code is 90% AI-generated? He now spends 90% of his engineering time on code review. The bottleneck didn’t disappear — it moved. Senior engineers at large organizations now spend an average of 4.3 minutes per AI suggestion versus 1.2 minutes for human code during review.
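A rough back-of-the-envelope model shows why the bottleneck moved. Using the per-review figures above, and assuming (hypothetically) that 60% of post-adoption PRs are AI-generated, total reviewer effort balloons even before queue wait times are counted:

```python
# Illustrative sketch of reviewer load, not a claim about any one team.
# Figures: PRs per author +20%; 4.3 min per AI suggestion vs. 1.2 min human.
prs_before, prs_after = 100, 120       # PRs per period, before/after adoption
mins_human, mins_ai = 1.2, 4.3         # review minutes per item

load_before = prs_before * mins_human  # all-human baseline

# Assumed (not from the article): 60% of post-adoption PRs are AI-generated.
ai_share = 0.6
load_after = prs_after * (ai_share * mins_ai + (1 - ai_share) * mins_human)

print(round(load_after / load_before, 2))  # reviewer-load multiplier, ~3x
```

Even with generous assumptions, reviewer effort roughly triples while headcount stays flat — which is exactly what "review times increased 91%" looks like from the inside.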

The Smoking Gun: Maintenance Is Where Agents Break

If individual gains don’t translate to organizational outcomes, why not? Alibaba and Sun Yat-sen University researchers answered this with the SWE-CI benchmark — the first benchmark designed to test AI agents on what software engineers actually do most: maintaining codebases over time.

They tested 18 AI models across 100 real repositories with average histories of 233 days and 71 commits. The result: 75% of AI models broke previously working code during maintenance — even when their initial patches passed all tests.

This isn’t a bug in a particular model. It’s a structural limitation. Coding agents can fix the issue in front of them, but they can’t hold the context of an entire system in mind. They don’t understand why a function exists, what implicit contracts it maintains with other modules, or what will break three layers away when you change it.
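A hypothetical sketch makes this failure mode concrete (every name here is invented for illustration): an agent that sees only one function can break an ordering contract a distant caller depends on, while the local behavior still looks fine.

```python
# Hypothetical illustration: an implicit ordering contract between a helper
# and a caller "three layers away". All names are invented.
def recent_events(events):
    # Original author's version: last five events, sorted oldest -> newest.
    return sorted(events, key=lambda e: e["ts"])[-5:]

def recent_events_agent(events):
    # An agent's "optimization": drops the sort it saw no test for.
    return events[-5:]

def latest_status(fetch, events):
    # Distant caller: silently assumes the last item is the newest event.
    return fetch(events)[-1]["status"]

events = [{"ts": 3, "status": "ok"},
          {"ts": 1, "status": "boot"},
          {"ts": 2, "status": "warn"}]

print(latest_status(recent_events, events))        # "ok"   -- contract held
print(latest_status(recent_events_agent, events))  # "warn" -- silent regression
```

Nothing crashes, no test fails, and the patch looks like a clean simplification — which is why 75% of models in SWE-CI broke working code even when their initial patches passed.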

As software engineer Anuradha Weeraman explains in The Compounding Problem: the model reads your codebase, infers patterns from it, and replicates those patterns — including the wrong ones. Errors don’t just persist; they compound superlinearly as each new feature propagates architectural mistakes.

The data from GitClear’s analysis of 211 million lines of code confirms this at scale: duplicated code blocks rose 8-fold in 2024, while refactoring dropped from 24-25% of changed lines (2020-2021) to under 10%. For the first time ever, copy-pasted lines exceeded refactored lines. Agents don’t refactor. They duplicate. And the codebase quietly rots. If you’re building with AI tools, understanding these hidden risks of AI-generated code is essential.

“We are witnessing a fundamental shift in software engineering where value is no longer defined by the speed of writing code, but by the confidence in deploying it.”
Tariq Shaukat, CEO of Sonar — Sonar 2026 State of Code Developer Survey

The $290 Million Confession

If you want to know what the smartest people in AI coding actually believe, don’t listen to their marketing. Watch where they spend their money.

In December 2025, Cursor — the company that built the most popular AI code editor on the planet — acquired Graphite for over $290 million. Graphite isn’t an AI code generation tool. It’s a code review and merge workflow platform.

This is a confession disguised as an acquisition. Cursor’s CEO Michael Truell knows that making code generation faster without solving the review bottleneck is like widening a highway that feeds into a single-lane bridge. The traffic doesn’t flow faster — it just backs up in a different place.

Amazon learned this the hard way. In March 2026, the company experienced multiple Sev-1 outages, including one linked to its AI coding assistant Q providing inaccurate advice. The result: approximately 120,000 lost orders and a 90-day safety reset across 335 Tier-1 systems. When AI-generated code moves fast and breaks things in production, the cost isn’t theoretical.

Organizational impact of AI coding tool adoption (Sources: LinearB 2026, GitClear 2025, CodeRabbit 2025-2026)
Metric                      | Before AI Adoption | After AI Adoption | Change
PRs per author              | Baseline           | +20%              | 📈 More code produced
End-to-end delivery time    | Baseline           | +19% slower       | 📉 Slower shipping
Incidents per PR            | Baseline           | +23.5%            | 📉 More things break
Review time per PR          | Baseline           | +91%              | 📉 Review bottleneck
Code duplication            | Baseline           | +8x (2024)        | 📉 Codebase degradation
Refactoring as % of changes | 24-25%             | Under 10%         | 📉 Maintenance collapse
Technical debt              | Baseline           | +30-41%           | 📉 Compounding costs

The Reframe: What Coding Agents Actually Taught Us

Here’s the insight that reframes everything: coding agents haven’t failed. They’ve succeeded at exactly what they were designed to do — generate code fast. And in doing so, they’ve revealed that code generation was never the bottleneck.

The actual hard problems in software engineering are:

  1. Maintaining coherence across a codebase over months and years
  2. Understanding intent — why code exists, not just what it does
  3. Reviewing changes for architectural fit within a larger system
  4. Preventing regressions in interconnected, complex systems
  5. Making judgment calls about tradeoffs that require business context

These are precisely the tasks where agents fail most spectacularly. And they’re precisely what experienced software engineers spend most of their time doing.

As veteran engineer Denny Britz puts it: “Productivity with coding agents ranges from 0.1x (net negative) to 10x depending on the task. Most daily work is messy brownfield work. I estimate the average productivity gain to be closer to 1-2x, not 10x.”

The SonarSource 2026 survey captures the paradox perfectly: 96% of developers don’t fully trust AI-generated code’s functional accuracy, yet 84% use AI coding tools. Developers intuitively know the code needs human judgment. The research on AI productivity tools vs. reality shows this pattern extends well beyond coding.

What This Means for Your Career

The most valuable developer skill of the next decade isn't prompting AI to write code faster. It's the judgment, architectural thinking, and systems understanding that agents have proven they cannot replicate. If you've been worried that coding agents will replace you — the data suggests the opposite. The skills that matter most are becoming more valuable, not less. Invest in understanding systems deeply, not just generating code quickly.

Where the Industry Goes From Here

The market is already responding. Cursor’s Graphite acquisition is the most visible signal, but it’s part of a broader pattern: the $4.7 billion AI coding market is pivoting from “generate code faster” to “manage the consequences of generating code faster.”

Expect to see massive investment in:

  • AI-powered code review — not just linting, but architectural coherence checking
  • Automated regression detection across complex dependency chains
  • Intent documentation tools that capture why code exists, not just what it does
  • Codebase health monitoring that tracks the compounding debt AI introduces
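None of this has to wait for products. A crude codebase-health signal fits in a short script; the sketch below (a toy heuristic, not GitClear's methodology) hashes sliding windows of lines and reports what fraction of windows appear more than once — a rough proxy for the copy-paste growth described above.

```python
import hashlib

def duplication_ratio(sources, window=4):
    """Toy duplication monitor: fraction of `window`-line blocks that repeat."""
    seen = {}
    total = 0
    for text in sources:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        for i in range(len(lines) - window + 1):
            block = "\n".join(lines[i:i + window])
            key = hashlib.sha1(block.encode()).hexdigest()
            seen[key] = seen.get(key, 0) + 1
            total += 1
    # Count every occurrence of a window that shows up more than once.
    dupes = sum(n for n in seen.values() if n > 1)
    return dupes / total if total else 0.0

# A four-line block pasted twice around an unrelated line gets flagged.
block = "a = load()\nb = clean(a)\nc = score(b)\nsave(c)\n"
print(duplication_ratio([block + "x = 1\n" + block]))  # 2 of 6 windows repeat
```

Tracking even a signal this crude over time would surface the duplication curve GitClear measured — long before it shows up as incidents.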

The companies that win the next phase won’t be the ones that generate code fastest. They’ll be the ones that help teams maintain confidence in their systems as AI-generated code proliferates.

For individual developers and engineering managers, the takeaway is clear: don’t optimize for code output. Optimize for code understanding. The teams that invest in review processes, architectural documentation, and system-level thinking will dramatically outperform those chasing PR counts.

Coding agents didn’t fail us. They held up a mirror. And what we see reflected back is a profession that was always about much more than writing code — we just needed a machine to prove it.
