System & Signal

Systems thinking for the agent era

What Comes After the Pull Request

The developer toolchain was built for humans writing code. Coding agents broke that assumption. Now the rest of the system is trying to catch up.

Introduction

Every team using coding agents now runs into the same contradiction.

Engineers feel more productive than ever. Prototypes come together in hours. Pull requests stack up faster than anyone expected. But the amount of useful, trusted software actually reaching end users has not increased at the same rate.

Cognition captured this well with a paraphrase of Churchill: "Never in the field of software engineering has so much code been created by so many, yet shipped to so few."

Now that coding agents basically work, teams are generating code faster than ever before. But writing code was only one part of the software development lifecycle. We accelerated code generation without equally accelerating review, comprehension, verification, and trust.

And the broader data points in the same direction. Faros AI looked at activity across more than 10,000 developers on 1,200+ teams and found that teams with high AI adoption complete 21% more tasks, merge 98% more pull requests, and spend 91% more time reviewing code each day.

That is the new bottleneck.

The toolchain we built around software delivery assumed that code would be written by humans at human speed. Pull requests, review queues, CI, code ownership, approval flows, and collaboration norms all emerged from that assumption. In an agent-heavy world, that assumption no longer holds.

This is why the most important questions in AI-assisted software development are no longer just about code generation. They are about what happens after.

This post is about that transition: why coding agents are shifting the bottleneck from writing code to reviewing, verifying, and shipping it; the three main bets on what replaces the pull request; and why many of those new layers may themselves be temporary scaffolding rather than the final form.

The New Constraint

When agents can produce large volumes of code quickly, the limiting factor stops being authorship and becomes evaluation.

Humans can only absorb so much change at once. They can only review so many diffs, keep so much context in working memory, and spend so much time tracing downstream impact. As code generation accelerates, everything downstream starts to strain:

  • Review queues get longer
  • Signal-to-noise gets worse
  • More time is spent reading generated code than understanding it
  • The quality system sits downstream from the speed system

The software lifecycle still assumes that the pull request is where quality gets inspected and trust gets established. But if agents can generate diffs faster than humans can understand them, then the pull request stops being a control point and becomes a traffic jam.

That is the problem the major AI companies are now racing to solve.

Three Bets on How to Fix It

Bet 1: Make Review AI-Native

Code review was one of the earliest agentic product categories to emerge, with companies like CodeRabbit, Greptile, and others building systems to help teams review AI-generated code. But common developer sentiment has been that the signal-to-noise ratio is off when AI simply comments on a PR as if it were another contributor.

In response, Cognition shipped Devin Review. The thesis is straightforward: if humans cannot absorb agent-generated diffs, give them an AI-powered tool that helps close the gap.

What does that actually mean in practice?

  • Intelligent diff organization that groups changes by logical relationship
  • Copy and move detection so a file rename does not render as a full delete-and-rewrite (sketched after this list)
  • Inline chat with codebase context, so you can ask what a change impacts mid-review and get an answer back
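
The copy and move detection in the second item is the most mechanical of the three, and it takes surprisingly little machinery. Here is a minimal sketch, assuming the diff has already been split into deleted and added file contents; the helper and threshold are hypothetical illustrations, not Devin Review's implementation.

```python
import difflib

SIMILARITY_THRESHOLD = 0.85  # treat near-identical content as a rename

def detect_renames(
    deleted: dict[str, str], added: dict[str, str]
) -> list[tuple[str, str, float]]:
    """Pair each deleted file with any added file whose content is nearly
    identical, so the review surface can render a rename instead of a
    delete-and-rewrite."""
    renames = []
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if ratio >= SIMILARITY_THRESHOLD:
                renames.append((old_path, new_path, ratio))
    return renames

if __name__ == "__main__":
    deleted = {"utils/helpers.py": "def slugify(s):\n    return s.lower()\n"}
    added = {"core/text.py": "def slugify(s):\n    return s.lower()\n"}
    for old, new, score in detect_renames(deleted, added):
        print(f"{old} -> {new} (similarity {score:.2f})")
```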

The most interesting part is how this closes the loop. An agent writes code. A review agent evaluates it, runs checks, identifies issues. The output of the review agent gets fed back into the coding agent for another cycle until the result is clean.

You can already see that loop-closing behavior emerging across leading systems.
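
As a minimal sketch of that control flow, with hypothetical callables standing in for the coding and review agents rather than any vendor's actual API:

```python
from typing import Callable

def run_task(
    task: str,
    generate: Callable[..., str],        # coding agent: writes or revises a patch
    review: Callable[[str], list[str]],  # review agent: returns issues found
    max_cycles: int = 5,
) -> str:
    """Generate, review, and repair until the review agent finds no issues."""
    patch = generate(task)
    for _ in range(max_cycles):
        findings = review(patch)          # evaluate, run checks, identify issues
        if not findings:
            return patch                  # clean: ready to merge
        patch = generate(task, feedback=findings)  # review output fed back in
    raise RuntimeError("did not converge; escalate to a human")

if __name__ == "__main__":
    drafts = iter(["patch-with-bug", "clean-patch"])
    generate = lambda task, feedback=None: next(drafts)
    review = lambda patch: [] if patch == "clean-patch" else ["failing test"]
    print(run_task("add slugify helper", generate, review))
```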

The bet is simple: humans stay in the loop, but code review has to evolve. It does not disappear. It becomes AI-native.

Bet 2: Kill the Review

Ankit Jain published this argument in Latent Space: code review, as most teams actually practice it, was already breaking down before coding agents arrived.

The symptoms are familiar:

  • PRs sitting for days
  • Rubber-stamp approvals
  • Reviewers skimming 500-line diffs because they have their own sprint work

We call code review a quality gate, but teams shipped software without line-by-line review for decades before GitHub made PRs the norm around 2012 to 2014. Agents just make that contradiction impossible to ignore.

Jain's alternative is to move the human checkpoint upstream. Humans review specs, plans, and acceptance criteria. Replace the single downstream gate with layered trust: BDD specs as source of truth, deterministic guardrails, permission scoping, adversarial verification, observability, and fast rollback.

In that model, humans define what should happen. Systems verify whether it happened. Code becomes a generated artifact rather than the primary object of review.
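
To make that shape concrete, here is a minimal sketch with invented criteria: acceptance checks are the human-authored artifact, and verification is the machine's job. It illustrates the pattern, not Jain's actual stack.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str           # human-authored: what should happen
    check: Callable[[], bool]  # machine-run: did it happen

def verify(criteria: list[Criterion]) -> bool:
    """Run every layered check; any failure blocks the release."""
    ok = True
    for c in criteria:
        passed = c.check()
        print(f"{'PASS' if passed else 'FAIL'}: {c.description}")
        ok = ok and passed
    return ok

if __name__ == "__main__":
    release_gate = [
        Criterion("signup rejects duplicate emails", lambda: True),         # adversarial test
        Criterion("agent never touched the billing module", lambda: True),  # permission scope
        Criterion("p95 latency under 200 ms in staging", lambda: True),     # observability gate
    ]
    assert verify(release_gate)
```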

The bet is that review, as currently practiced, is already mostly theater. Agent-heavy development simply makes that impossible to hide. Jain's proposal is that we break camp and move upstream in the river of abstraction.

Bet 3: Replace the Platform

OpenAI is building a GitHub competitor.

The immediate trigger may have been GitHub outages, but the broader strategic logic is more interesting. A repository platform built from the ground up for AI-native workflows would not treat agents as awkward participants inside a human-first system. It would treat them as first-class actors.

That same direction is visible in Symphony, which OpenAI open-sourced as an orchestration layer for agent-driven software work. Symphony watches project boards, launches agents for tasks, manages implementation, runs CI, handles review flow, and moves work through to PR completion. Engineers manage work and intent. Agents handle much of the execution between.
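
In rough terms, the shape of such an orchestrator might look like the sketch below. The function names are hypothetical stand-ins for the described behavior, not Symphony's actual API.

```python
def run_once(board, launch_agent, run_ci, open_pr, request_review) -> None:
    """One pass over the project board: implement, gate, and route each task."""
    for ticket in board.ready_tasks():   # watch the project board
        patch = launch_agent(ticket)     # an agent handles implementation
        if run_ci(patch):                # deterministic gates run first
            pr = open_pr(ticket, patch)
            request_review(pr)           # humans weigh in on intent, not every line
        else:
            board.flag(ticket, "CI failed; queue another agent pass")
```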

This is a different thesis from "improve code review." It says the constraint may not just be the review layer. It may be the entire platform the review layer sits inside.

The bet is that the toolchain itself is the bottleneck. Patching GitHub's existing review flow may not be enough for an agent-first world. You will need a new platform where agents and humans collaborate natively, where review is a first-class feature, and where agents are the primary authors of code.

What These Bets Share

These three approaches are different, but they converge on the same broader shift.

The human role is moving away from "did you write this correctly?" and "did I inspect every line?" and toward "are we solving the right problem?", "are the constraints correct?", and "have we made intent legible enough for the agent?"

The value is moving upstream, into intent and specification. Away from diffs.

And this fits the broader history of software as an industry of rising levels of abstraction. Programmers are having to adapt their workflows again, just like they had to when compilers arrived.

The Engineer's Job Is Changing

OpenAI ran an experiment they call harness engineering. They built a million-line production application where zero lines were written by humans. Engineers still mattered, but their role changed. They designed environments, specified intent, and built the feedback loops that made the system reliable.

The key lesson from their postmortem:

When something failed, the answer was almost never try harder. It was to identify what capability was missing and make it legible and enforceable for the agent.

What that looks like operationally:

Architecture Enforcement Moves Into Systems

Instead of relying on human reviewers to catch structural mistakes, teams increasingly encode architecture directly into tooling: dependency rules, naming conventions, file size limits, structural tests, linters with custom error messages written specifically to teach the agent how to fix the violation.

The harness does not just block mistakes. It teaches the agent how to recover.
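
A minimal sketch of what one such rule can look like, assuming a hypothetical layering convention where API modules must not import the database layer directly. Note that the error message carries the fix:

```python
import pathlib
import re
import sys

# Hypothetical layering rule: API modules may not import the database layer.
FORBIDDEN = re.compile(r"^\s*(from|import)\s+app\.db\b", re.MULTILINE)

def check_api_layer(root: str = "app/api") -> list[str]:
    errors = []
    for path in pathlib.Path(root).rglob("*.py"):
        if FORBIDDEN.search(path.read_text()):
            errors.append(
                f"{path}: API modules must not import app.db directly. "
                "Fix: call app.services instead; it wraps all database access. "
                "See docs/architecture.md#layering."
            )
    return errors

if __name__ == "__main__":
    problems = check_api_layer()
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```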

Documentation Becomes Part of the Execution Environment

When documentation only exists in Slack threads or in someone's head, it is invisible to the system. In an agent-heavy world, that is a serious weakness.

Useful documentation has to live in the repo as structured, versioned artifacts. Background agents scan for stale docs and open cleanup PRs. Knowledge that is not legible to the system might as well not exist.
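
A background check like that can be quite small. Here is a minimal sketch using git commit timestamps and a hypothetical doc-to-code mapping; a real agent would open the cleanup PR rather than just print.

```python
import subprocess

def last_commit_ts(path: str) -> int:
    """Unix timestamp of the last commit touching path (assumes we run inside the repo)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out or 0)

def stale_docs(doc_to_code: dict[str, str]) -> list[str]:
    """Return docs last updated before the code they document."""
    return [
        doc for doc, code in doc_to_code.items()
        if last_commit_ts(doc) < last_commit_ts(code)
    ]

if __name__ == "__main__":
    mapping = {"docs/payments.md": "src/payments.py"}  # hypothetical mapping
    for doc in stale_docs(mapping):
        print(f"{doc} is stale; open a cleanup PR")
```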

Engineers Increasingly Operate Through Intent

As agents take on more code generation, engineers spend less time writing implementation and more time working through prompts, specifications, plans, review criteria, feedback loops, and environment design.

The code is the agent's output. The engineer's output is the system that makes the code trustworthy.

Cognition's data from customer deployments backs this up: only 20% of engineering time is spent coding. Most of it goes into planning, reviewing, and scoping. As tools get better, that ratio shifts further.

The Scaffolding Gets Eaten

These three bets are different in scope and ambition.

Each one is building a layer to absorb the mismatch between agent-speed generation and the human-speed systems around it. Cognition redesigns the review surface because humans cannot absorb diffs at the rate agents produce them. Spec-driven development moves the human checkpoint upstream because downstream review no longer scales. OpenAI rebuilds the platform because the surrounding infrastructure was never designed for agents as first-class participants.

All three are producing real results today. But all three have the structural shape of something temporary.

We have seen this before.

Noam Brown gives a useful example. Before reasoning models, teams built elaborate scaffolding to coax reasoning-like behavior out of weaker systems: multi-call orchestration, chain-of-thought structures, routers, specialized harnesses. Then better models arrived and made much of that structure not only unnecessary but detrimental.

The evidence keeps pointing in the same direction: better models keep absorbing the scaffolding built around their weaker predecessors.

And yet the harness is still the highest-leverage variable in the short term. Can Boluk's harness result makes this concrete: changing only the edit tool improved performance across 15 LLMs by 5 to 14 points.
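
For readers who have not built a harness, "the edit tool" is simply the schema the model is handed for modifying files. The sketch below shows one common style, targeted search-and-replace, in the JSON-schema form used for function calling; it illustrates the kind of thing being swapped, not Boluk's actual tool definitions.

```python
# Illustrative tool schema; alternatives include whole-file rewrite or
# unified-diff emission, which differ only in this definition. That is what
# makes "changing only the edit tool" a clean experiment.
EDIT_TOOL = {
    "name": "str_replace",
    "description": "Replace one exact occurrence of old_str in a file with new_str.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File to edit."},
            "old_str": {
                "type": "string",
                "description": "Exact text to find; must appear exactly once.",
            },
            "new_str": {"type": "string", "description": "Replacement text."},
        },
        "required": ["path", "old_str", "new_str"],
    },
}
```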

This is the paradox at the center of the current moment.

The most impactful thing you can build right now is almost certainly temporary.

Teams build these layers anyway because they cannot wait for the next model to fix today's bottleneck. Shipping today means solving today's review problem concretely. And the scaffolding itself is often how teams learn what the model eventually needs to absorb.

You cannot skip the intermediate step. Building the review harness teaches you what good review actually means in an agent context. Building the spec layer teaches you what intent has to look like to be enforceable. Building orchestration systems teaches you where the human handoff points really are.

The layer may be temporary. The learning is not. You have to build the muscle of solving the problem that's here now to earn insight into what's coming next.

What's Still Missing

There is more missing than this, but these are the two gaps I see most clearly right now.

Observability Is Still Weak

Teams are being asked to build high-trust systems on top of workflows they still cannot inspect clearly.

Observability is still basically nonexistent. The labs talk constantly about context engineering, prompt quality, and system behavior, yet they still provide limited tooling for visualizing and understanding what is actually happening inside the context window.

So practitioners end up piecing together the truth from scattered sources, all just to see what is going on under the hood.

Specs Cannot Replace Review Yet

The argument for moving human judgment fully upstream depends on specs being precise enough to generate against and verify against.

That is a much harder problem than it sounds. As Kevlin Henney puts it, the act of describing a program in unambiguous detail and the act of programming are one and the same.

That is the trap in spec-driven development: once a spec becomes precise enough for a machine to execute against, it starts to look like a programming language.
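
A toy example makes the trap visible. Take an invented business rule and tighten the spec until a machine can verify against it; the unambiguous version is, unavoidably, a program:

```python
# Spec, v1 (ambiguous): "Orders over $100 get a 10% discount."
#   Over or exactly at $100? Before or after tax? Does it stack with coupons?

# Spec, v2 (unambiguous): answering every question turns the spec into code.
def discounted_total(subtotal: float, has_coupon: bool) -> float:
    if subtotal > 100.00 and not has_coupon:  # strictly over, pre-tax, no stacking
        return round(subtotal * 0.90, 2)
    return subtotal

assert discounted_total(100.00, False) == 100.00  # not strictly over
assert discounted_total(150.00, False) == 135.00
assert discounted_total(150.00, True) == 150.00   # no stacking with coupons
```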

The abstraction layer between human intent and machine execution is still missing its compiler.

So What?

All of that is to say these new layers are incomplete. But those gaps are also the opportunity. Raindrop is one such example on the observability side: a small team bringing AI-native approaches to market and delighting real customers today.

These are the conditions and opportunities to look for in whatever market you enter next.

Conclusion

The pull request is not dying because someone invented a better review surface. It is dying because the assumptions underneath it no longer hold. Code authored by humans at human speed, reviewed by humans with full context, approved through social trust built on direct authorship. None of that describes how software gets made in an agent-heavy world.

What replaces it will be some combination of the bets outlined here: AI-native review, stronger upstream specs, and more agent-native platforms. All of it is necessary right now. All of it is probably scaffolding.

The teams that navigate this well will be the ones that build the thinnest possible layer between human intent and shipped software, treat every layer as a bet with an expiration date, and know when to rip it out because the model has caught up.

That is the new skill. Not building elaborate systems, but building disposable ones well.