From copilots to coworkers at AAAI: the gap between agentic research and production
Based on the AAAI 2026 panel "From Copilots to Co-Workers: What Changes When AI Writes, Reads, and Reasons About Code?" — Singapore, January 27, 2026
At the AAAI 2026 workshop on collaborative AI agents, a panel of researchers and practitioners from Microsoft, Mistral, National University of Singapore (NUS), LinkedIn, and Amazon Web Services (AWS) sat down to compare notes on what actually happens when you try to ship coding agents into production. Not whether it works (that debate is over) but what breaks, what surprises you, and what has to be rebuilt along the way.
Panelists:
Shengyu Fu — Partner Applied Science Manager, Microsoft CoreAI. Leads an AI for Code applied research team driving innovation across GitHub Copilot, from completions to coding agents.
Abhik Roychoudhury — Provost’s Chair Professor of Computer Science, National University of Singapore. Fellow of the ACM, Editor-in-Chief of ACM TOSEM. Leads the Trustworthy and Secure Software Engineering group, with contributions spanning semantic program repair, specification inference, fuzz testing, and AutoCodeRover.
Baptiste Rozière — Leads the code generation team at Mistral AI. Previously at Meta AI, where he contributed to Llama and led Code Llama.
Alborz Geramifard — Distinguished Scientist at LinkedIn
Omer Tripp — Principal Applied Scientist at AWS
The gap between research and production
Current research primarily optimizes for capability, while production optimizes for reliability, cost, latency, trust, and organizational fit.
There is a familiar pattern in AI for software development. A paper demonstrates impressive results on a benchmark. A team tries to ship it. And then the real work begins, not on the model, but on everything around it: the orchestration layer that decides when to invoke which model, the cost architecture that keeps inference viable at scale, the latency budget that determines whether a developer waits or walks away, the evaluation framework that tells you if the agent is actually helping or just generating plausible noise, and the trust surface (the explanations, audit trails, and interrupt points) that determines whether anyone will delegate real work to the agent.
The panelists made the same point repeatedly, from different angles: the challenges of deploying coding agents are fundamentally different from the challenges of building them in a lab. Research optimizes for capability. Production optimizes for reliability, cost, latency, trust, and organizational fit — simultaneously. And that gap isn’t a single problem. It shows up at several distinct levels, each with its own set of hard-won lessons.
Where the gap shows up
Building from scratch: shipping fast while incorporating the latest research
The first challenge is architectural. When models were limited, chat was a natural interaction model. Shengyu Fu (Microsoft) described how, once VS Code shipped chat in Copilot, it was tempting to believe the system was complete — prompting felt like the whole story. That illusion doesn’t survive more capable models. As reasoning and tool-use improved, agents began performing multi-step actions, invoking tools autonomously, and self-verifying. The challenges shifted from prompt engineering to orchestration, system design, and evaluation. Teams that didn’t recognize this early paid in rework later.
In this modern agentic paradigm, architectural specialization can help. For example, GitHub Copilot uses dedicated sub-agents for tasks like code search, reserving expensive reasoning models for where they matter. But this specialization introduces its own problems. Each hand-off between agents adds latency, and as context passes through multiple stages, information degrades. At the scale of tens of millions of lines of code, driving down cost requires system-level design, not just swapping in a cheaper model.
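The routing idea can be made concrete with a small sketch. This is illustrative only, not GitHub Copilot’s actual implementation: the agent names, model tiers, and per-call costs below are invented to show the pattern of reserving an expensive reasoning model for open-ended work while cheap specialized sub-agents handle narrow tasks.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str            # sub-agent handling this task type (hypothetical)
    model: str           # model tier assigned to it (hypothetical)
    cost_per_call: float # illustrative relative cost

# Illustrative routing table: narrow tasks go to cheap specialized
# sub-agents; only the fallback uses the large reasoning model.
ROUTES = {
    "code_search": Route("search-agent", "small-embedding-model", 0.001),
    "lint_fix":    Route("lint-agent",   "small-code-model",      0.01),
    "default":     Route("planner",      "large-reasoning-model", 0.50),
}

def dispatch(task_type: str) -> Route:
    """Pick the dedicated sub-agent for a known task type; anything
    unrecognized falls through to the expensive generalist."""
    return ROUTES.get(task_type, ROUTES["default"])
```

The design choice this encodes is the one Shengyu described: cost control comes from the routing layer, not from swapping the whole system onto a cheaper model.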
How users perceive that latency has also evolved. In 2025, speed was the proxy for quality. By 2026, users were willing to wait longer in exchange for more autonomy on complex prompts and more comprehensive solutions. The bar moved from "respond quickly" to "handle more so I don’t have to." Abhik Roychoudhury (NUS) described how early agent teams, his included, optimized heavily for inference cost using program analysis techniques. What surprised many was how quickly organizational readiness overtook cost as the binding constraint. Three years earlier, at the International Conference on Software Engineering (ICSE), many companies were adamant that LLMs would never touch their code. That reversed once developers experienced the productivity gains. Cost mattered less than willingness, and with the right scaffolding, Abhik added, higher quality at lower cost is achievable.
Main takeaway: Research gives you capability; production demands an architecture balancing cost, latency, and quality — and that architecture is where most of the real engineering lives.
Learning and evaluation: making agents more autonomous
The second level of the gap is about knowing whether your agent is actually good, and improving it over time.
Reinforcement learning is a natural fit, but it hits infrastructure walls before algorithmic ones. RL should be the obvious approach for training coding agents that take actions in environments; in practice, the hardest problems are engineering problems. Alborz described LinkedIn’s experience treating agent training as a fully RL-driven problem. GPU/CPU utilization became a systems challenge — builds and execution are CPU-heavy, creating imbalances with GPU-centric training clusters. Collecting trajectories while training models in parallel introduced coordination complexity. Naive reward signals invited reward hacking: agents learned to remove tests to appear successful.
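The test-deletion hack has a simple structural defense: make the reward depend on the test suite not shrinking. The function below is a minimal sketch of that idea, not LinkedIn’s actual reward; the penalty values and the pass-fraction shaping are illustrative assumptions.

```python
def shaped_reward(tests_passed: int, tests_before: int, tests_after: int) -> float:
    """Reward a trajectory only if the test suite was not shrunk.

    Guards against the reward hack described above: an agent that
    deletes failing tests would otherwise look 'successful'.
    Penalty values and shaping here are illustrative.
    """
    if tests_after < tests_before:
        return -1.0                      # hard penalty for removing tests
    if tests_after == 0:
        return 0.0                       # nothing was verified, no credit
    return tests_passed / tests_after    # fraction of the suite passing
```

An agent that deletes two failing tests now scores -1.0 instead of looking like it fixed them, so the degenerate strategy stops paying off.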
To scale this, LinkedIn ran each environment in a virtual machine, with every pod starting as a blank slate. Early designs that performed full git checkouts at each step saturated the system. Caching and strict artifact hygiene became essential. At scale, they ran close to 800 problems, each executed multiple times, generating trajectories online. Shengyu echoed similar experiences at Microsoft: agentic RL is far harder than RL for completion due to long trajectories and repeated model calls. Baptiste confirmed that Mistral saw the same pattern: RL environments demand significantly more CPU than pre-training clusters.
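The caching point generalizes to a simple pattern: pay for the expensive checkout once per commit, then give each episode a cheap copy of the cached snapshot. The sketch below assumes a caller-supplied `fetch` function standing in for the slow git checkout; it is a pattern illustration, not LinkedIn’s infrastructure.

```python
import shutil
import tempfile
from pathlib import Path

class SnapshotCache:
    """Cache one extracted repo snapshot per commit SHA; each episode
    gets a fresh copy instead of a fresh (expensive) full checkout."""

    def __init__(self, fetch):
        self.fetch = fetch              # fetch(sha, dest) does the slow checkout
        self.cache: dict[str, Path] = {}
        self.fetches = 0                # counts expensive checkouts performed

    def workspace(self, sha: str) -> Path:
        if sha not in self.cache:       # slow path: one checkout per commit
            base = Path(tempfile.mkdtemp(prefix=f"snap-{sha[:8]}-"))
            self.fetch(sha, base)
            self.fetches += 1
            self.cache[sha] = base
        # fast path: copy the cached snapshot into a blank-slate workspace
        ws = Path(tempfile.mkdtemp(prefix="ws-"))
        shutil.copytree(self.cache[sha], ws, dirs_exist_ok=True)
        return ws
```

Running hundreds of problems, each executed multiple times, means the fast path dominates: the checkout cost is amortized to once per commit rather than once per step.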
The consensus across all three organizations: scalable RL environments are the missing infrastructure layer for agentic AI. Currently, this is an engineering constraint, not an algorithmic frontier.
Benchmarks are saturated and drifting from reality. SWE-bench dominated evaluation in 2025, but by 2026 it became structurally misaligned with how developers actually use agents. Alborz pointed out that benchmarks typically operate over a single repository, while real work spans multiple repos. Abhik emphasized that most benchmarks measure code writing, while reading code, understanding intent, and analyzing impact are equally important but largely unmeasured. Baptiste added that correctness alone doesn’t differentiate agents because instruction following and generalization matter at least as much. Omer argued we should optimize for production conditions, not laboratory settings. Companies are building internal benchmarks on proprietary code, but shared standards for agentic evaluation remain missing, and for deployed systems like GitHub Copilot, agentic workloads still lack clean quantitative signals.
Main takeaway: evaluation and training infrastructure are the two missing layers. Benchmarks need to move beyond single-repo correctness to capture code reading, multi-repo workflows, and instruction following. Scalable RL environments are the missing training layer, and every team that has tried to build them has hit the same wall: an engineering one, not an algorithmic one.
Agents don’t operate in a vacuum: the role of humans and agent-to-agent interaction
The third level is about what happens around the agent: how humans interact with it, how trust is built, and what the human role becomes.
Latency and auditability become product-defining. Omer drew a sharp distinction between two kinds of latency. Autocomplete latency (the delay as an AI suggests the next line while you type) is tolerable within reasonable limits because it fits within an existing flow. Delegation latency is fundamentally different: when you hand an agent a whole task and it works autonomously across multiple steps, you’re no longer typing alongside it; you’re just waiting. That wait immediately couples with auditability: users want to know what the agent is doing, whether it’s stuck, and whether they can interrupt. These questions define whether a system feels trustworthy or opaque.
Quality is more than correctness. Abhik reframed quality along two under-appreciated dimensions. First, signal-to-noise: does the agent respect the developer’s attention, or does it generate noise that requires effort to filter? An agent producing ten suggestions where only one is useful imposes a hidden cost, even if each suggestion is technically correct. Second, explainability: can the agent justify unexpected decisions? Surprising behavior is acceptable if the reasoning is legible. In this framing, quality becomes inseparable from trust.
Verification lands in a practical middle ground. Formal verification remains out of reach for most systems. The panel converged on a pragmatic alternative: spec-driven development (extracting specifications from generated code so humans don’t need to revisit implementations), property-based testing (models are well-suited to generating these, and they serve as executable explanations), and AI-assisted code review (using AI to guardrail AI, so humans can focus on business logic).
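Property-based testing deserves a concrete illustration of why the panel called these tests "executable explanations." The sketch below hand-rolls the idea with stdlib `random` so it is self-contained (dedicated tools like Hypothesis automate the generation and shrinking); the function under test stands in for agent-generated code, and the properties state its intended behavior without restating its implementation.

```python
import random

def dedupe_keep_order(xs):
    """Stand-in for an agent-generated function: remove duplicates
    while preserving first-occurrence order."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(trials: int = 200) -> bool:
    """Each property is a readable, executable statement of intent."""
    rng = random.Random(0)               # fixed seed for reproducibility
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        ys = dedupe_keep_order(xs)
        assert len(set(ys)) == len(ys)   # no duplicates remain
        assert set(ys) == set(xs)        # no elements lost
        assert all(x in xs for x in ys)  # nothing invented
    return True
```

A reviewer who reads only the three assertions learns what the code is supposed to do, which is exactly the spec-recovery role the panel described.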
The human role is changing, not disappearing. An audience question captured a common anxiety: how should junior developers grow when agents write much of the code? The emerging skill set centers on reading and reviewing code (more important than writing it), delegation (knowing what to automate and what requires human judgment), testing and validation (moving to the center of the developer’s role), and resolving ambiguity (humans remain the glue where specifications are incomplete). NUS is already building courses around these principles. As Alborz put it: if he were a student, he’d double down on resolving ambiguity. That skill will remain valuable even as abstractions come and go.
Main takeaway: trust is the product, and it’s built through auditability, signal-to-noise discipline, and explainability. The human role shifts from writing code to judging, delegating, and resolving what agents can’t.
Examples of how teams are addressing these challenges
The panel wasn’t purely theoretical. Several concrete examples emerged of how teams are closing the research-production gap:
Architectural specialization at GitHub Copilot. Rather than routing everything through one expensive model, Copilot uses dedicated sub-agents for tasks like code search, reserving powerful reasoning models for complex problems. This achieves higher quality at lower cost through system design rather than model substitution.
LinkedIn’s RL infrastructure. To train agents via RL at scale, LinkedIn built an environment using VMs as blank slates, with aggressive caching and artifact hygiene. Early designs that did full git checkouts per step couldn’t scale. The final system ran ~800 problems with online trajectory generation — a significant infrastructure investment that preceded any algorithmic gains.
Mistral’s compute rebalancing. Baptiste’s team discovered that RL environments for code agents demand far more CPU than their pre-training clusters were designed for. Recognizing this as an infrastructure problem (not an algorithmic one) was the key insight.
NUS curriculum redesign. Abhik’s team is building courses that reflect the new reality: code reading as a core skill, delegation to agents as a competency, testing as central, and responsible AI agent use as a requirement.
Pairwise comparison for judge calibration. Alborz proposed building datasets in the style of "this code is better than that code" — relatively low cost to create, and sufficient for judges to learn reliable relative rankings even when absolute calibration is hard.
AI reviewing AI. Shengyu’s team at Microsoft is among the first working on AI-assisted code review, where code review agents check consistency using specifications, summaries, intended behavior, and tests. The goal: humans focus on business logic while AI handles structural verification.
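The pairwise-comparison idea for judge calibration has a standard statistical backbone: a Bradley-Terry model turns "this code is better than that code" labels into per-item scores. The sketch below uses the classic minorization-maximization update and is an illustration of the technique, not LinkedIn’s pipeline; `pairs` is a list of `(winner, loser)` index pairs.

```python
def bradley_terry(pairs, n_items, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) comparisons
    using the standard MM update; returns one score per item."""
    wins = [0] * n_items
    for winner, _ in pairs:
        wins[winner] += 1
    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for i, j in pairs:
            d = 1.0 / (p[i] + p[j])      # each comparison contributes to both items
            denom[i] += d
            denom[j] += d
        p = [wins[k] / denom[k] if denom[k] else p[k] for k in range(n_items)]
        s = sum(p)
        p = [x * n_items / s for x in p]  # renormalize for numerical stability
    return p
```

The appeal for judge calibration is exactly what Alborz described: relative labels are cheap to collect and a reliable ranking falls out, even when absolute quality scores are hard to calibrate.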
What we are looking at now: open research directions
The panel surfaced several threads that are active areas of work but far from resolved:
Scalable RL environments. This was the strongest consensus point. Every team that has tried RL for agentic coding has hit the same infrastructure wall. Building environments that reflect real-world diversity — multiple repos, real build systems, realistic execution — at the scale needed for RL training is the missing layer.
Meta-evaluation: judging the judges. If an AI judge scores a code change, how do we know that score is reliable? Explanation is the proposed next layer of trust: a system that produces both a reliable score and high-quality comments could make auto-shipping above a confidence threshold plausible. Establishing that reliability is the difficult part.
Spec extraction from generated code. The long-term ambition is recovering intent directly from generated code so humans don’t need to revisit implementations. This connects spec-driven development to impact analysis and understanding how changes propagate through a codebase.
Regeneration replacing repair. If code can be regenerated cheaply, why maintain it? This is a provocative idea with deep implications for how we think about software longevity, backward compatibility, and technical debt.
Shared evaluation standards. Companies are building internal benchmarks on proprietary code, but the community lacks shared standards for evaluating agentic systems. Public benchmarks are saturated; what replaces them needs to capture instruction following, generalization, code reading, and multi-repo workflows.
Customizable judges. Different organizations care about different things during code review — severity, style, ordering, cosmetic issues. One-size-fits-all judges won’t work at scale. The path forward requires judges that can be tuned to organizational context.
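Two of these threads, confidence-gated auto-shipping and organization-tunable judges, can be combined in one small sketch. Everything here is hypothetical: the `JudgeVerdict` fields, the thresholds, and the three-way decision are assumptions illustrating the shape of the mechanism, not any panelist’s system.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    score: float         # judge's quality score in [0, 1] (hypothetical field)
    confidence: float    # judge's self-reported calibration in [0, 1]
    comments: list       # explanation layer: human-readable review comments

def review_decision(v: JudgeVerdict, auto_ship_threshold: float = 0.95) -> str:
    """Gate a change on the judge's score *and* its confidence.

    Auto-ship only when both clear the (organization-tunable) bar;
    otherwise fall back to human review. Thresholds are illustrative.
    """
    if v.score >= auto_ship_threshold and v.confidence >= auto_ship_threshold:
        return "auto-ship"
    if v.score < 0.5:
        return "reject"
    return "human-review"
```

The key property is that a confident-looking score alone is never sufficient: an uncalibrated judge with low self-reported confidence routes the change back to a human, which is where the meta-evaluation research problem lives.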