AI-Native Engineering Is Moving from Models to Execution Systems

As code generation gets cheaper, the real battleground is no longer just who writes code faster. It shifts toward who can design a system in which Agents execute continuously, reliably, and recoverably.

1. TL;DR: What This Article Is Actually Arguing

Over the last two years, the most visible changes in AI engineering came from the models: they write better code, call tools more reliably, and increasingly feel like coworkers who can get real work done. But zoom out a little, and the hard question is often not whether the model can do the task. It is how the system absorbs a long-running execution process that touches the outside world, fails, branches, and creates side effects. What OpenAI showed in Harness engineering was not simply that Codex produced one million lines of code. It was that once the engineer’s job shifts from “writing code by hand” to “designing environments, clarifying intent, and building feedback loops,” the engineering system itself has to be reorganized.4

This article really comes down to one sentence:

The AI-native engineering stack is shifting from model-centric to execution-system-centric.

More concretely, engineering systems used to optimize how a model produces an answer. More and more teams now have to optimize how a system handles a long-running, stateful, branching execution process that can fail and produce side effects. The durable execution work in DBOS and Pydantic AI, Microsoft Durable Agents' emphasis on session recovery semantics, and Luis Cardoso’s boundary / policy / lifecycle framing for sandboxes all point in the same direction: in the Agent era, the scarce thing is no longer just generation capability, but state, recovery, boundaries, and verification.1 6 8

The article follows a broader evolution:

  1. In the SFT era, engineering was still mostly about how models generate answers.
  2. RLHF, RLVR, and reasoning RL exposed rollout, environment interaction, and execution-system problems much more directly.2
  3. Agents then carried those problems from training systems into the real world, which is why durable execution, effect logs, capability boundaries, and recovery semantics are becoming new infrastructure battlegrounds.1 5

If I compress the whole article into one line, it is this: models will keep improving, but the thing that will really separate engineering teams is the execution system, not code generation by itself.

2. Why We Keep Misreading the Infrastructure Problem in the Agent Era

It is easy to misread the Agent era because model progress is the most visible thing: longer context windows, stronger tool use, better coding, and models that look increasingly like humans who can “do work.” But model progress is not the same thing as system maturity. Harness engineering is a powerful counterexample. OpenAI started from an empty Git repository and, over five months, built a real product in active use. The repository grew to around one million lines of code and roughly 1,500 PRs, while the team maintained the principle that no application logic was written directly by hand. What got rewritten was not just the code, but the organization of the codebase, the docs, tests, CI, observability, and merge flow.4

This highlights an easily missed fact: once Agents begin to participate in delivery for real, the first thing that gets exposed is not whether the model is capable enough, but whether the environment is clear enough, whether the state is explicit enough, and whether validation is automated enough. OpenAI says explicitly that early progress was slower than expected not because Codex lacked capability, but because the environment was not specified clearly enough. The engineer’s job became helping the agent do useful work, not by “prompting harder,” but by asking what capability is missing and how to make that capability both legible and enforceable for the agent.4

There is a deeper issue here: agentic RL and ordinary online inference are not the same infrastructure object at all. MiniMax’s piece on Agent Runtime offers a very practical lens: once training enters on-policy rollout, and agents need to explore, interact, and make mistakes in real environments at the scale of millions or tens of millions of trajectories, sandboxing stops being an auxiliary tool and becomes infrastructure that directly determines throughput, stability, and cost.11

So the reason we misjudge Agent-era infrastructure is not just that we overestimate how well traditional application infra can adapt. It is also that we underestimate environment supply itself as a bottleneck. Many teams think an LLM wrapper plus a few tool calls plus some harness logic is enough. But once the workload becomes large-scale agentic RL, long-horizon rollout, rich interactive environments, and frequent checkpoint replay, the problems stop being edge-case engineering annoyances and become system problems in scheduling, isolation, observability, and image distribution.11

3. What “AI-Native Engineering” Actually Means

My definition of AI-native engineering is:

AI-native engineering is not traditional software engineering with a Copilot or Agent added on top. It is an engineering approach that treats Agents as first-class executors and first-class readers from the start, and then reorganizes code structure, knowledge management, execution semantics, state persistence, sandbox boundaries, observability, and validation flow accordingly.

The key is those two kinds of first-class citizenship.

First, the Agent is a first-class executor. It does not merely autocomplete code. It reads the repository, edits code, runs tests, uses tools, waits for external feedback, and continues making decisions. The most important part of OpenAI’s article is not that Codex generated one million lines of code, but that software engineers stopped primarily being coders and instead became designers of environment, intent, and feedback loops.4

Second, the Agent is a first-class reader. That means code, docs, logs, metrics, traces, and runtime state are no longer written only for humans. They must also be directly consumable by the Agent. OpenAI ultimately shrank AGENTS.md from a huge instruction manual into a directory, moved durable knowledge into a structured docs/ tree, and turned logs, metrics, and Chrome DevTools signals into direct inputs for Codex.4

From that angle, the core difference between AI-native engineering and “traditional engineering plus an AI assistant” is not whether the IDE now has a chat panel. The difference is whether the engineering system has been reorganized around the Agent’s reading and execution process. The former embeds a model into the old flow. The latter rewrites the flow itself. More precisely, traditional software engineering optimizes how humans write, read, edit, and collaborate, while AI-native engineering optimizes how humans and Agents continuously produce, execute, validate, and recover together inside one system.4

That is also why AI-native engineering naturally pushes toward durable execution, externalized state, sandbox lifecycle, effect logs, and recovery semantics. Once an Agent moves from “can generate an answer” to “can execute a task,” execution is no longer a single request. It becomes a long trajectory. And state no longer belongs only to the process; it becomes a system object. Systems like DBOS and Durable Agents matter not because they invented a new database, but because they made that transition operational.1 6

4. What OpenAI’s Harness Engineering Reveals About Modern Engineering Work

What matters most to me in Harness engineering is not that AI can write a lot of code. It is that the optimization target of the engineering team has changed. The article opens by saying OpenAI delivered a real product from an empty repo in five months, with application logic, tests, CI, docs, observability, and internal tools all generated by Codex, using only about one-tenth of the human coding time. The key signal is not the throughput number itself. The key signal is what it implies: once code generation becomes cheap enough, the scarce resource in engineering stops being code production and becomes environment design, intent expression, and reliable feedback loops.4

OpenAI repeatedly emphasizes the shift from “manual coding” to “systems, architecture, and leverage.” When progress stalls, the answer is no longer “prompt a bit harder.” It is to add missing abstractions, tools, internal structure, and enforceable rules so the Agent can clear the obstacle. At that point, this is no longer software development in the old sense. It becomes a new kind of harness engineering: not writing the implementation yourself, but building a system that is explicit enough, verifiable enough, and executable enough for the Agent to keep making progress.4

That is also why OpenAI treats the code repository as a system of record. One of their early lessons was that Codex needs a map, not a 1,000-page instruction manual. A giant AGENTS.md wastes context, rots quickly, and is hard to verify. A better approach is to make knowledge in the repository hierarchical, structured, and navigable. The important shift is not simply that the docs got shorter. It is that the code repository is starting to become a system of record for knowledge and execution, not just a collection of source files.4

OpenAI also did something even more important, and easier to miss: they made UI, logs, metrics, and traces direct inputs for Codex. As code throughput rose, human QA quickly became the bottleneck. So they let Codex drive Chrome DevTools directly, inspect DOM snapshots, read screenshots, query LogQL and PromQL, reproduce bugs, validate fixes, restart applications, and validate again. The importance of this move is that observability stopped being a dashboard for humans and became a debugging interface for Agents. Once you accept that, your definition of observability changes. It is no longer just an ops tool. It becomes part of the execution system itself.4

The recent popularity of Karpathy’s autoresearch points in the same direction. What makes it compelling is not merely that “the Agent can iteratively modify training code.” It is that the project compresses self-iteration into a visible, comparable, and judgeable minimal loop: the Agent edits only one train.py, each run gets a fixed five-minute budget, the system evaluates outcomes with a single val_bpb metric, keeps or discards the change, and then continues, leaving behind a full experiment trail. What this proves is not that Agents can already self-improve without limit. It proves that once the problem boundary is small enough, the feedback loop is fast enough, the metric is simple enough, and the keep/discard rule is explicit enough, Agents can already begin self-iteration inside a local system.13
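
The shape of that loop is simple enough to sketch. The code below is an illustrative stand-in, not Karpathy’s autoresearch: `propose_edit` and `run_budgeted` replace the real edit-and-train steps with a toy scalar objective, but the minimal structure is the point, including a fixed budget per run, a single metric, an explicit keep/discard rule, and a full trail left behind.

```python
import random

def propose_edit(params):
    # Hypothetical stand-in for "the Agent edits train.py":
    # perturb one hyperparameter of the current configuration.
    new = dict(params)
    new["lr"] = new["lr"] * random.choice([0.5, 1.0, 2.0])
    return new

def run_budgeted(params):
    # Stand-in for one fixed-budget training run that reports a
    # single scalar metric (lower is better, like val_bpb).
    return (params["lr"] - 0.01) ** 2

def autoresearch_loop(rounds=20, seed=0):
    random.seed(seed)
    params = {"lr": 0.1}
    best = run_budgeted(params)
    trail = []  # full experiment trail: every attempt, kept or not
    for i in range(rounds):
        candidate = propose_edit(params)
        score = run_budgeted(candidate)
        kept = score < best  # explicit keep/discard rule
        if kept:
            params, best = candidate, score
        trail.append({"round": i, "score": score, "kept": kept})
    return params, best, trail
```

Everything that makes the real loop trustworthy lives in the scaffolding, not in the proposer: the budget caps each experiment, the single metric makes rounds comparable, and the trail makes every decision auditable.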

So the real lesson of Harness engineering is that the most important capability of future engineering teams may no longer be “writing code.” It may be turning the system into an environment where Agents can execute, validate, and recover reliably.

5. Why Durable Execution Is Becoming a New Primitive

If Harness engineering showed what changes when Agents become first-class executors and readers, DBOS, Durable Agents, and Pydantic AI + DBOS show the lower-level consequence: once execution becomes a long-running process, workflow state must be persisted and externalized.

DBOS says this very clearly in its architecture docs: in traditional workflow systems, workflow state lives in runtime memory and is lost when the service crashes; DBOS stores workflow state and step checkpoints in Postgres, making the database the system of record for durable execution.1 The key point is not merely “they used Postgres again.” The key point is that execution state is taken out of private process memory and turned into a system-level object.

The Pydantic AI + DBOS integration says the same thing more implicitly. It does not package Agents as a more complex prompt loop. It explicitly connects Agent workflows and steps into durable workflow semantics, so the Agent’s execution path naturally acquires persisted recovery points and retry points.7

Microsoft Durable Agents makes the same issue explicit: if a session is held by a single worker and that instance dies, the state may disappear; the goal of durable agents is any worker can resume a session.6 In production that is categorically different from “starting over in a new session,” because it means user interaction state, task context, and long-running execution traces are no longer pinned to one process.

So the real meaning of durable execution is not just “automatic retry on failure.” It means:

  • state no longer lives only in-process
  • a task is no longer equivalent to one request
  • recovery becomes a default capability rather than an exceptional branch
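
Those three properties can be sketched in a few lines. This is not the DBOS API, just a minimal illustration of the checkpoint-per-step idea, assuming a SQLite table as the checkpoint store and JSON-serializable step outputs:

```python
import json
import sqlite3

def make_store(path=":memory:"):
    # The checkpoint store lives outside the worker process; a file
    # path (or Postgres, in DBOS's case) would survive a crash.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS steps
                  (workflow_id TEXT, step TEXT, output TEXT,
                   PRIMARY KEY (workflow_id, step))""")
    return db

def run_step(db, workflow_id, step, fn):
    # If this step already has a checkpoint, replay the recorded
    # output instead of re-executing: recovery is the default path.
    row = db.execute(
        "SELECT output FROM steps WHERE workflow_id=? AND step=?",
        (workflow_id, step)).fetchone()
    if row is not None:
        return json.loads(row[0])
    out = fn()
    db.execute("INSERT INTO steps VALUES (?,?,?)",
               (workflow_id, step, json.dumps(out)))
    db.commit()
    return out
```

A workflow that crashed after step one can be re-run from the top with the same `workflow_id`: completed steps replay from their checkpoints, and execution genuinely resumes at the first step without one.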

I think the importance of this is still underestimated in Agent infrastructure. Many Agent frameworks are basically extended request/response loops: model output, tool call, model output again, tool call again. As long as the chain is short, the weakness stays hidden. But once tasks span minutes, hours, human approval waits, and external system state changes, three system-level questions immediately appear:

  1. Where does the state live?
  2. Where does execution resume after failure?
  3. How are side effects recorded and verified?

If a system cannot answer those three questions, it is not yet a mature execution system. That is why I increasingly think of durable execution as a new primitive in the Agent era, more comparable to message queues, database transactions, or job scheduling than to an optional enhancement.1 6 7

6. Why Agents and RL Force Execution-System Problems into the Open

When people hear ideas like durable execution, sandboxing, and externalized state, they often feel these sound less like AI and more like distributed systems. But it is precisely Agents and RL that dragged these questions back to center stage.

OpenRLHF and a range of reasoning / agentic post-training work increasingly make one point clear: in RLHF, RLVR, and reasoning RL, the truly expensive part is often not parameter updates themselves but rollout, environment interaction, and long-chain inference. The OpenRLHF systems paper says explicitly that as post-training shifts from supervised learning toward RLHF and RLVR, rollout and inference become the dominant runtime bottleneck.2

Let It Flow pushes in the same direction. It is not about isolated algorithm tricks. It is about designing rollout environments, context engineering, and post-training optimization together inside one agentic learning ecosystem.3 That means once the system target shifts from “produce an answer” to “repeatedly interact with an environment, generate trajectories, get feedback, and update policy,” the environment stops being background scenery and becomes part of the training system itself.

Once that happens, execution-system problems get amplified:

  • each execution is no longer a short request but a long trajectory rollout
  • state is no longer temporary process memory but something that must survive across steps, environments, and nodes
  • failure and recovery become high-frequency paths instead of rare exceptions
  • side effects are no longer just logs but actual changes to environments, file systems, networks, and downstream systems

In other words, RL first exposed the fundamental importance of execution systems inside training. Agents then carry those same constraints into real business environments. The former makes rollout, environments, and checkpoints core problems. The latter makes durable execution, sandbox boundaries, effect logs, and capability policy into production problems.

That is why I see Agent Infra and RL Infra as two stages of the same system evolution: RL first makes the execution problem visible in training, then Agents bring it into the real world.

7. Why Existing Agent Infra Usually Fails to Support Production Directly

I agree strongly with the core judgment in the article Why Existing Agent Infra Cannot Support Production-Grade Applications: many current Agent Infra systems are still mostly packaging prompt loops more conveniently, without actually providing the execution semantics required by production systems.5

The key criticism is not that models are still too weak. It is that the abstraction layer is wrong. The problems it points out are all foundational:

  • missing execution semantics: many frameworks package multi-step decision-making into a chain without clear recovery points, compensation points, or state semantics
  • missing side-effect semantics: once an Agent can call external APIs, write databases, edit files, or send messages, the system must know which effects have already happened, which can be replayed, and which must be idempotent
  • missing long-lifecycle task semantics: production tasks may span minutes, hours, and human approval waits, so they no longer fit a single request model

These are inseparable from durable execution. A more direct way to put it is this: production-grade Agent systems have to carry long-running, recoverable, verifiable execution processes, not just smarter model outputs. Once you frame the problem at that level, the limits of many flashy Agent frameworks become obvious. They are fine for demos, single-host tool invocation, and short-chain prototypes. But once you enter a world of real permissions, real state, and real side effects, you need a much deeper execution substrate.5
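
One concrete shape for side-effect semantics is an effect log keyed by idempotency keys. The sketch below is illustrative, not an API from the article; the class and key scheme are assumptions:

```python
class EffectLog:
    """Records external side effects under idempotency keys so a
    retried or resumed execution never performs the same effect
    twice; the recorded result is replayed instead."""

    def __init__(self):
        self._done = {}  # idempotency_key -> recorded result

    def perform(self, key, effect):
        # The system can now answer "has this effect already
        # happened?" by consulting the log, not by guessing.
        if key in self._done:
            return self._done[key]
        result = effect()
        self._done[key] = result
        return result

# Example: a retried send is replayed from the log, not re-sent.
sent = []
log = EffectLog()
log.perform("send:invoice-42", lambda: sent.append("invoice-42") or "ok")
log.perform("send:invoice-42", lambda: sent.append("invoice-42") or "ok")
# sent now holds a single entry: the second call was a replay
```

In a real system the log would live in durable storage next to the workflow checkpoints, so that an instance taking over a crashed execution sees the same answers.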

So I increasingly dislike thinking of Agent Infra as just an orchestration layer for models. A more accurate formulation is:

Production-grade Agent Infra is fundamentally an execution system organized around long-horizon execution, externalized state, side-effect management, recovery semantics, and environment boundaries.

Without that semantic layer, it becomes hard to answer:

  • where execution resumes after interruption
  • whether an effect has already been performed
  • which instance takes over after a crash
  • how state is preserved during human-in-the-loop waits
  • whether rollback refers to model state, environment state, or business state

Those are not framework ergonomics problems. They are system-semantics problems.5

8. Why Session Recovery and “Any Worker Can Resume a Session” Matter So Much

The line from Durable Agents that matters most to me is: any worker can resume a session.6 It sounds like an implementation detail. In practice, it is one of the clearest dividing lines between an Agent system that can enter production and one that cannot.

Why?

Because production Agents are almost never held by one process, one machine, or one Pod from start to finish. You will always have:

  • instance restarts
  • machine failures
  • rolling deploys
  • long waits for human input
  • long-latency feedback from external systems

If session state lives only in the current worker’s memory, then once any of those things happen, the system can only:

  1. restart the whole task from scratch
  2. drag the user back into an inconsistent state
  3. rely on manual rescue

None of those is acceptable production semantics.

So the importance of Durable Agents is not merely that it “can restore sessions.” It is that it makes an important hidden fact explicit: an Agent session is increasingly closer to a long-lived task object in a distributed system than to a web session.6

Once you accept that framing, the rest of the system design follows naturally:

  • session state must be persisted
  • workers no longer own sessions but only execute them temporarily
  • crash recovery becomes the default path
  • human waiting becomes a system-level state, not a UI hack

From there, the next step toward effect logs and recovery semantics is obvious. If any instance can take over, then the system must answer: which side effects have already occurred? Which can be retried? Which require compensation? Which must wait for confirmation?

In other words, session recovery is not a convenient resume feature. It is pressure that forces the system to make execution semantics explicit.
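
The minimal requirement behind “any worker can resume a session” fits in a short sketch. An in-memory dict stands in for a shared persistent store; all names here are illustrative, not Durable Agents APIs:

```python
import json

class SessionStore:
    """Stand-in for a shared persistent store (e.g. a database):
    session state lives here, never in a worker's memory."""

    def __init__(self):
        self._rows = {}

    def load(self, session_id):
        return json.loads(self._rows.get(session_id, "[]"))

    def save(self, session_id, history):
        self._rows[session_id] = json.dumps(history)

def worker_turn(store, session_id, user_msg, worker_name):
    # Any worker can pick up the session: state is loaded from the
    # store, advanced by one turn, and written straight back. The
    # worker executes the session; it does not own it.
    history = store.load(session_id)
    history.append({"user": user_msg, "handled_by": worker_name})
    store.save(session_id, history)
    return history
```

Because the store is the only place state lives, a rolling deploy or a crash between turns changes nothing: the next worker loads exactly what the last one saved.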

9. Sandbox: The Real Question Is Not Just Isolation, but What Is Shared, What Is Reachable, and What Persists Across Runs

Luis Cardoso’s A field guide to sandboxes for AI matters because it reduces all sandbox discussion to one compact framing:

Sandbox = boundary + policy + lifecycle.8

The value of that definition is that it separates concepts people routinely blur together.

First comes boundary. Luis lays out four common ones: container boundaries still share the host kernel; gVisor boundaries intercept workload syscalls in a userspace kernel; microVM boundaries hand syscalls to a guest kernel; runtime boundaries do not expose the guest to a syscall ABI at all and instead expose only explicit host APIs. His summary line is powerful: the boundary is the line you are betting an attacker cannot cross. That is a strong starting point for almost any sandbox discussion.8

Then comes policy. Luis does not reduce policy to a seccomp profile. He expands it to file system paths, network destinations and protocols, process creation, device access, time/memory/CPU/disk quotas, syscalls, and import surface. More importantly, he gives a line that deserves to be repeated often: a tight policy in a weak boundary is still a weak sandbox; a strong boundary with a permissive policy is a missed opportunity. That neatly separates the concepts of boundary and capability scope.8

Finally there is lifecycle, arguably the dimension that matters most to Agents and RL and is easiest to underestimate. Luis splits lifecycle into three types: fresh run, workspace, and snapshot/restore, and explicitly says this matters a lot for agents and RL. Hostile code tends to want fresh runs. Agent workspaces want long-lived file systems and sessions. RL rollouts and pre-warmed agents naturally prefer snapshot/restore. Which means sandboxing is not a binary safe/unsafe question. It is a set of engineering choices around boundary, policy, and lifecycle.8

Once you use that frame, many sandbox debates become much clearer. The real question is not “containers or microVMs?” It is: are you serving hostile code, agent workspaces, or fast-reset RL? What lifecycle do you actually need? Once the question is correct, the technical decision tends to converge naturally.8
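
One way to make the frame concrete is to treat the three dimensions as explicit fields in a configuration object and route the choice by workload. The types, field names, and routing below are hypothetical illustrations of the framing, not anything from the field guide:

```python
from dataclasses import dataclass, field

@dataclass
class SandboxSpec:
    # boundary: the line you are betting an attacker cannot cross
    boundary: str          # "container" | "gvisor" | "microvm" | "runtime"
    # lifecycle: what persists across runs
    lifecycle: str         # "fresh_run" | "workspace" | "snapshot_restore"
    # policy: what is reachable and shared inside the boundary
    allow_network: list = field(default_factory=list)  # reachable hosts
    allow_paths: list = field(default_factory=list)    # writable paths
    cpu_seconds: int = 300                             # resource quota

def spec_for(workload: str) -> SandboxSpec:
    # Route the decision by workload shape, not technology fashion.
    if workload == "hostile_code":
        return SandboxSpec(boundary="microvm", lifecycle="fresh_run")
    if workload == "agent_workspace":
        return SandboxSpec(boundary="gvisor", lifecycle="workspace",
                           allow_paths=["/workspace"])
    if workload == "rl_rollout":
        return SandboxSpec(boundary="microvm", lifecycle="snapshot_restore")
    raise ValueError(f"unknown workload: {workload}")
```

The value of writing it down this way is that each axis must be decided on its own: a strong boundary does not imply a tight policy, and neither implies the lifecycle your workload actually needs.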

10. Choosing the Runtime Base: Containers, gVisor, MicroVMs, Wasm, and Unikernels

Once you have the boundary / policy / lifecycle frame, runtime choice becomes much easier to reason about.

Luis already disentangles the major boundary types clearly: containers share the host kernel; gVisor traps workload syscalls in a userspace kernel; microVMs provide a stronger boundary via a guest kernel; Wasm / isolates move the boundary into the runtime and expose only explicit host capabilities rather than ambient file system or network access.8

Containers: usable, but you need to be honest about the boundary

Luis is direct here: containers alone are not a sufficient security boundary for hostile-code scenarios. The issue is not that they provide no isolation at all. The issue is that they still share the host kernel. If there is a bug on an allowed syscall path, it is a host-kernel bug. Containers can certainly be hardened, but “a hardened shared-kernel environment” is still not the same thing as “a stronger boundary.”8

Gao Ce’s piece Possible Agent Sandbox Choices and the Opportunity for Unikernels adds a useful practical nuance: many people assume containers are slower to start than Firecracker-style lightweight VMs, but once you exclude image pulling and similar external factors, container startup itself can be measured in tens of milliseconds or less. The bigger issue is security: namespaces and cgroups give some isolation, but the host kernel is still shared.9

MicroVMs: why they increasingly look like the practical answer

Luis also points out a key historical shift: in 2010, “use a VM for sandboxing” almost automatically implied second-level startup latency and major density tradeoffs; by 2026, microVMs can be “fast like containers,” and snapshot/restore can make reset almost free. That matters because the trade space has changed, while many people’s mental vocabulary is still stuck in the old “container vs VM” framing.8

Gao Ce’s article reflects the same practical direction. The realistic mainstream candidates for agent sandboxes are already Firecracker, gVisor, Kata Containers, and maybe Wasm. He also lists core agent-sandbox requirements explicitly: ideally sub-100ms cold start, strong security, compatibility with the Python ecosystem, and a workable image build flow.9

Wasm / runtime sandboxes: clearer capability boundaries, but compatibility is the constraint

Luis is right to isolate runtime sandboxes as a separate category. The biggest advantage of Wasm / isolates is not raw speed. It is that they do not come with ambient capabilities by default: no default file system, no default network, no default syscall ABI. The guest can only call host APIs that are explicitly exposed. That is beautiful from a capability-oriented design perspective.8

But Gao Ce also points out the practical limit: Wasm is not naturally friendly to complex dynamic languages like Python. WASI provides a smaller and more portable interface layer than native Linux. Once you need rich Python ecosystems and many C bindings, you get pulled back into compatibility trouble. So the real question is not whether Wasm is “good.” The question is whether it matches your workload shape.9

Unikernels: theoretically appealing, practically scarce

Gao Ce revisits unikernels because, in theory, they look almost ideal for agent sandboxes: application and kernel compiled into a single image, running on hardware virtualization, with a tiny attack surface and very fast startup. But the industry has not broadly converged on them. Instead it is still balancing across Firecracker, gVisor, Kata, and Wasm. That is a useful reminder that the real engineering question is not “which route is theoretically most elegant,” but “which route can, in today’s ecosystem, jointly satisfy boundary strength, compatibility, and lifecycle needs.”9

So my conclusion for this section is simple: there is no permanently correct substrate, but Agents will keep pushing systems toward stronger boundaries, clearer capabilities, and more snapshot/restore-friendly runtimes.8

If I had to reduce the runtime choice into a short selection guide, it would be:

  • hostile code or high-risk permissions: prioritize stronger boundaries before chasing convenience
  • long-lived agent workspaces: focus on state retention, session semantics, and recovery cost
  • RL rollout / fast reset: focus on snapshot/restore speed and density

The core question in runtime debates is never “which is more advanced?” It is which one actually matches the shape of your workload.

11. Why Traditional Kubernetes Sandbox Approaches Hit Bottlenecks Early

What the MiniMax case study is really criticizing is not “all current Agent Infra abstractions are wrong.” It is that using a conventional Kubernetes stack as-is to serve agentic RL sandboxes tends to hit bottlenecks early.11

1) Kubernetes is optimized for microservices, not bursty sandbox supply

Kubernetes is excellent at resource orchestration, container management, and long-running services. But agentic RL wants large-scale, bursty, short-lived sandbox floods. The article explicitly says that in such scenarios, etcd, the scheduler, external storage, and networking all become bottlenecks, while controllers and state-coordination paths further increase latency.11

2) Security isolation and multi-tenancy magnify the problem

For multi-tenant sandboxes, Kubernetes' default network model, in which pods can reach each other unless network policies say otherwise, introduces additional security exposure on its own. In other words, Kubernetes is not incapable of running sandboxes. It is just that its default design target is not “strong isolation plus massive short-lived environment supply.” For agentic RL, that is usually not the best starting point.11

3) The question is not merely whether you have a sandbox, but whether the sandbox was designed for this workload

Tencent Cloud Agent Runtime is used in the article to prove exactly that point: only when lightweight scheduling, warm pools, pooled devices, on-demand image loading, block-level deduplication, snapshot restoration, VNC observability, and MCP / CLI integration are built into one substrate does the sandbox stop being an auxiliary tool and become scalable infrastructure for agentic RL.11
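
A toy sketch of the warm-pool idea, the simplest of those mechanisms: sandboxes are created ahead of demand so acquisition avoids the cold-start path. `cold_create`, the pool sizing, and the accounting are illustrative assumptions, not Tencent Cloud Agent Runtime behavior:

```python
import collections

class WarmPool:
    """Sketch of a warm pool: `cold_create` stands in for an
    expensive boot (image load, VM start); refills happen off the
    request path so acquire is cheap until a burst drains the pool."""

    def __init__(self, cold_create, target=4):
        self._create = cold_create
        self._idle = collections.deque()
        self._target = target
        self.cold_starts = 0   # boots paid on the request path
        self.refill()

    def refill(self):
        # In a real runtime this would run in the background.
        while len(self._idle) < self._target:
            self._idle.append(self._create())

    def acquire(self):
        if self._idle:
            return self._idle.popleft()  # warm path: no boot cost
        self.cold_starts += 1
        return self._create()            # burst exceeded the pool
```

The scheduling question for bursty rollout floods then becomes concrete: how large a pool, refilled how aggressively, keeps `cold_starts` near zero at acceptable idle cost.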

That brings us back to the main thesis of this article: runtime is not better simply because it is more powerful. It has to be designed against the workload. For agentic RL, the keywords are not request serving but rollout, environment supply, checkpoints, and experiment replay.11

12. A Stack Model for AI-Native Engineering That I Actually Find Useful

If I compress all the material above into one map, I would divide the AI-native engineering stack into four layers. The point is not to invent fresh terminology. The point is to put the systems discussed above into one coordinate system, so that later we can reduce them to practical evaluation questions.

Layer 1: model and tool layer

This layer handles inference, planning, tool invocation, code generation, and multimodal interaction. It determines whether the Agent can do something. Codex, browser automation, code-fix tools, and similar capabilities live here.4

Layer 2: durable execution / workflow layer

This layer organizes model outputs into an execution process that is persistable, recoverable, and schedulable. DBOS durable workflows, Pydantic AI + DBOS’s workflow/step wrapping, and Microsoft Durable Agents' session recovery all live here. This layer does not answer “how does the model think?” It answers “where is execution now, how does it resume after failure, and how does another instance take over?”1 7 6

Layer 3: sandbox / runtime layer

This layer answers “where does execution happen?” and “under what boundary does it happen?” Luis’s boundary/policy/lifecycle framing, isolation substrates like Firecracker, gVisor, and Wasm, ROCK’s environment control plane, Zeroboot’s CoW VM fork primitives, and the Agentic-RL-oriented sandbox platforms in the MiniMax case all belong here. This layer answers what is shared, what is reachable, what persists across runs, and whether environments can be supplied at scale with low latency.8 10 12 11

Layer 4: observability / validation / experiment-loop layer

This is the easiest layer to underestimate and one that will matter more and more. OpenAI made logs, metrics, and UI into direct Codex inputs. MiniMax elevates checkpoints, rollback, forked experiments, trajectory recording, VNC observability, and execution logs into part of training infrastructure. In other words, observability is no longer just an operations tool. It becomes part of both the execution system and the experiment system.4 11

From this perspective, the phrase “the AI-native engineering stack is shifting from model-centric to execution-system-centric” becomes more precise: it does not mean models no longer matter. It means model capability can only be realized reliably once execution, state, boundaries, and validation are all designed as one system. That is why I do not want to stop at a conceptual stack diagram. I want to reduce it to an evaluation frame.

13. Why the Most Important Engineering Problem in 2026 Is No Longer Code Generation

I want to write one judgment as plainly as possible:

The most important engineering problem in 2026 is no longer how to generate more code, but how to absorb more results generated and executed by Agents.

OpenAI’s example already shows that once code throughput rises, the bottleneck quickly becomes human QA rather than additional generation speed. In other words, once generation cost drops, the truly scarce resources reappear clearly: human attention, trustworthy validation systems, sound state semantics, and reliable recovery mechanisms.4

That is why concepts like durable execution, sandbox lifecycle, checkpoint / snapshot, and observability, which do not sound especially “AI-native” at first glance, will become more valuable over the next few years. They are not competitors to model capability. They are the next bottlenecks revealed after model throughput is amplified. OpenAI shows the first stage: once Agents can write a lot of code, validation, observability, and knowledge organization become bottlenecks. DBOS and Durable Agents show the second stage: execution state must be externalized. MiniMax makes the third stage concrete: in agentic RL, environment supply, image distribution, snapshot replay, and full-path observability become hard constraints on training throughput and stability.4 1 11

The popularity of autoresearch pushes the point further. It makes more people see directly that “Agents can start iterating on themselves” is not magic. It rests on whether two things have been systematized. First, observability: can the system preserve exactly what changed, what ran, and what happened in each round? Second, evaluability: can each round’s result be compared under a fixed budget, a fixed metric, and explicit keep/discard rules? Without those two pillars, Agent self-iteration easily collapses into drift. With them, it begins to look like a real experiment system.13

There is also an important boundary to state explicitly here: I am not arguing that models no longer matter, and I am not claiming every team will immediately hit the same systems problems. The narrower claim is this: once your workflows get longer, start touching external systems, and start requiring recovery, verification, and multi-worker handoff, execution-system problems surface faster than incremental gains in model capability. That claim comes from the shared pattern across OpenAI, DBOS, Durable Agents, and MiniMax. It is a systems-trend claim, not a cross-sectional census of every Agent product.4 1 6 11

So the gap between teams in the future may no longer be primarily:

  • who has the model that can write one more function

It will be:

  • who lets Agents understand the system faster
  • who persists execution state more correctly
  • who makes recovery the default behavior
  • who turns environment supply, checkpoints, validation, and observability into first-class system primitives

Put differently, the team that can generate more code does not necessarily win. The team that can validate more outcomes, recover from more failures, and absorb more execution is more likely to win.

That is also why a useful evaluation framework should no longer start with a checklist of model capabilities. It should start by asking whether the system has made absorption, recovery, boundaries, and verification into real infrastructure.

14. The 10 Questions I Now Use to Evaluate Any Agent Infra

If I compress the stack model above into something directly usable, these are basically the 10 questions I ask when evaluating Agent infrastructure:

  1. Does state live in process memory, or is it persisted in external storage?
    If state exists only in memory, then what you really have is still just a session, not recoverable execution.1

  2. After failure, does execution restart from the beginning, or resume from the last completed step?
    This is almost the dividing line between ordinary workflow systems and durable workflow systems.1

  3. Are checkpointing, rollback, and branchable experiments available?
    If environment state cannot be precisely saved, replayed, and forked, many agentic RL training and debugging workloads become too expensive to tolerate.11

  4. What is the isolation boundary?
    Containers, gVisor, microVMs, and Wasm have fundamentally different boundaries, and they should not be discussed as if they were interchangeable.8

  5. What exactly does policy restrict?
    File system, network, devices, process creation, syscalls, quotas, and so on. Which are constrained and which are not?8

  6. What is the lifecycle: fresh run, workspace, or snapshot/restore?
    Different workloads require very different answers.8

  7. When instances switch, is session and task state lost?
    “Any worker can resume a session” versus “state is lost on crash” is a decisive difference in production.6

  8. Are logs, metrics, and traces only for humans, or can Agents consume them directly too?
    If they are only for humans, your debugging throughput will quickly become the bottleneck.4

  9. Is the validation system an after-the-fact safety net, or a front-loaded design constraint?
    This is not just about having tests. It also includes unified evaluation metrics and clear keep/discard rules. Without those, Agent self-iteration easily degrades into drift.4 13

  10. Is the system built for demos, or can it withstand real permissions, long-horizon execution, and real side effects?
    The distance between a working prototype and a production-ready system is often precisely the distance across the previous nine questions.5 12

Almost none of these questions directly ask how strong the model is. That is not because the model does not matter. It is because once the model is strong enough, what actually blocks you is usually the execution system itself. The four-layer stack above describes the problem space. These ten questions turn it into a practical reading framework for evaluating systems.
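Question 3 in particular is easy to nod along to and hard to picture. A toy sketch: real systems snapshot filesystems or VM memory (for example via CoW forks); a dict stands in here, and the class name and methods are illustrative.

```python
import copy

# Toy sketch of checkpoint / rollback / branchable experiments over
# environment state. Real systems snapshot disks or VM memory; a
# plain dict stands in here for illustration.
class Env:
    def __init__(self):
        self.state = {"files": {}, "step": 0}
        self._snaps = {}

    def checkpoint(self, name):
        self._snaps[name] = copy.deepcopy(self.state)

    def rollback(self, name):
        self.state = copy.deepcopy(self._snaps[name])

    def fork(self):
        child = Env()
        child.state = copy.deepcopy(self.state)  # branch shares nothing mutable
        return child

env = Env()
env.state["step"] = 3
env.checkpoint("before-experiment")
branch = env.fork()           # run a risky experiment on a branch
branch.state["step"] = 99
env.rollback("before-experiment")
print(env.state["step"], branch.state["step"])  # → 3 99
```

If any of `checkpoint`, `rollback`, or `fork` is missing or too slow, entire classes of agentic RL debugging and training workloads become too expensive to run.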

15. Practical Implications for Engineering Teams

If I turn those ten questions into actions a team can take today, I think there are at least six.

1) Write documentation for Agents, not just for humans

If your docs are still written as “complete manuals for human readers,” you are probably already behind. OpenAI’s experience suggests that what Agents really need is a map, not a giant handbook. More concretely, your documentation system should at least be structured, navigable, and freshness-verifiable. Otherwise model context quickly gets filled with stale instructions and vague rules.4

2) Move tests and acceptance standards forward

Once generation cost drops, validation becomes the bottleneck. The earlier a team moves testing, regression checks, acceptance criteria, and runtime guards into the system, the more likely Agent throughput turns into actual production output. OpenAI’s bottleneck moving from coding to QA is the clearest example. Validation should not be treated as a final safety net. It should be treated as an entry condition for execution.4

And if you want Agents to truly self-iterate, you have to move this even further forward: fixed budgets, unified metrics, and explicit keep/discard rules all need to be written into the system first. That is exactly why autoresearch is convincing.13
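What an explicit keep/discard rule looks like in practice can be sketched in a few lines. This is an illustrative gate, not autoresearch's actual implementation; the function name, threshold, and budget semantics are assumptions.

```python
# Sketch of an explicit keep/discard rule for Agent self-iteration:
# a candidate change is kept only if it beats the baseline on one
# unified metric by a minimum margin, within a fixed budget.
# Names and thresholds are illustrative.
def keep_or_discard(baseline_score, candidate_score, cost, budget,
                    min_improvement=0.01):
    if cost > budget:
        return "discard"      # over budget: never accept, however good
    if candidate_score >= baseline_score + min_improvement:
        return "keep"         # clear win on the unified metric
    return "discard"          # no measurable improvement: drift guard

print(keep_or_discard(0.80, 0.85, cost=5, budget=10))   # → keep
print(keep_or_discard(0.80, 0.805, cost=5, budget=10))  # → discard
print(keep_or_discard(0.80, 0.95, cost=20, budget=10))  # → discard
```

The point is not the specific numbers; it is that the rule is written down in the system, so every iteration round is judged the same way and drift has nowhere to hide.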

3) Observability should serve both humans and Agents

Logs, metrics, and traces should not exist only for human operators. They should also become direct Agent inputs. Making it possible for Agents to read LogQL, PromQL, and runtime UI signals will increasingly feel like a foundational capability, comparable to automated test execution today. For many teams, this means observability is no longer just a dashboard. It is the Agent’s debugging interface.4

4) State cannot live only in process memory

If you are building multi-step Agents, long-running workflows, tool-driven execution, or mixed human/Agent flows, you should assume processes will die, instances will drift, and tasks will wait. The core lesson from DBOS and Durable Agents is simple: state must be externalized if recovery is supposed to exist. If your system still depends on process memory to survive, it is not ready for production.1 6
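The “any worker can resume a session” property follows directly from externalizing state. Here is a minimal sketch of the handoff pattern, assuming a shared SQLite task table with hypothetical status values; this is not the Durable Agents implementation.

```python
import sqlite3

# Sketch of worker handoff: task ownership and status live in shared
# external storage, so a second worker can pick up where a crashed
# one left off. Schema and statuses are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, owner TEXT)")
db.execute("INSERT INTO tasks VALUES ('t1', 'running', 'worker-a')")
db.commit()

def recover_orphans(db, dead_worker):
    # Possible only because ownership is external, not in-process memory.
    db.execute("UPDATE tasks SET status='pending', owner=NULL WHERE owner=?",
               (dead_worker,))
    db.commit()

def claim_next(db, worker):
    row = db.execute("SELECT id FROM tasks WHERE status='pending'").fetchone()
    if row:
        db.execute("UPDATE tasks SET status='running', owner=? WHERE id=?",
                   (worker, row[0]))
        db.commit()
        return row[0]

recover_orphans(db, "worker-a")     # worker-a crashed mid-task
task = claim_next(db, "worker-b")   # worker-b resumes the same task
print(task)  # → t1
```

If the `tasks` table lived in worker-a's process memory, `recover_orphans` would have nothing to recover; that is the entire argument in miniature.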

5) Sandbox choice has to answer three questions

Using Luis’s framework, any sandbox decision should begin by answering:

  • what does it share with the host?
  • what can the code touch?
  • what persists across runs?

If you cannot answer those three clearly, then you do not actually understand your runtime boundary. Many infrastructure debates drag on only because workload shape, permission model, and lifecycle assumptions were never made explicit up front.8
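One way to force those three questions to be answered up front is to encode them as data rather than tribal knowledge. A sketch, with field names that are illustrative and not any real sandbox product's config:

```python
from dataclasses import dataclass, field

# Sketch: a sandbox decision expressed as an explicit spec answering
# boundary, policy, and lifecycle. Field names are illustrative.
@dataclass
class SandboxSpec:
    # Boundary: what does it share with the host?
    shares_host_kernel: bool
    shares_host_network: bool
    # Policy: what can the code touch?
    writable_paths: list = field(default_factory=list)
    allowed_hosts: list = field(default_factory=list)
    # Lifecycle: what persists across runs?
    lifecycle: str = "fresh"  # "fresh" | "workspace" | "snapshot_restore"

# Example: a microVM-style choice written down explicitly.
microvm = SandboxSpec(
    shares_host_kernel=False,        # own guest kernel, unlike containers
    shares_host_network=False,
    writable_paths=["/workspace"],
    allowed_hosts=["pypi.org"],
    lifecycle="snapshot_restore",
)
print(microvm.lifecycle)  # → snapshot_restore
```

Once the spec is explicit, the container-versus-gVisor-versus-microVM debate becomes a comparison of concrete field values instead of a drawn-out argument over unstated assumptions.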

6) The real organizational difference will show up in environment design

The biggest difference between teams in the future may not be whether they “use Agents.” It may be whether they know how to design environments for Agents. That includes codebase structure, documentation systems, test frameworks, durable execution, state storage, sandbox lifecycle, side-effect logs, and recovery semantics. Harness engineering matters not because it showcases a strong model, but because it shows how a team begins rewriting its engineering system around an Agent.4

16. Conclusion: From Writing Code to Designing Execution Systems

If you summarize the pre-2023 large-model engineering narrative as “bigger models, bigger training, stronger model capability,” then 2024 to 2026 looks like a different line of evolution: first RL and rollout made generation, state, and environment interaction central to training; then Agents brought those same constraints into real business systems; then durable execution, sandbox lifecycle, checkpoint / snapshot, environment supply, and observability began to emerge as new infrastructure prerequisites.2 3 5 11

So my prediction for the next two years is simple:

Model capability will keep improving, but the real differentiator will be the execution system.

More plainly, the gap between teams will not be only:

  • who can generate code faster

It will increasingly be:

  • who can organize the codebase into a record system readable by Agents
  • who can externalize execution state and make recovery the default
  • who can define sandbox boundary, policy, and lifecycle clearly
  • who can elevate checkpoints, validation, and observability into first-class system primitives rather than after-the-fact patches

The teams that solve those first will be closer to a real AI-native engineering stack.4 11

That is why I chose the title of this piece:

The AI-native engineering stack is shifting from model-centric to execution-system-centric.

This is not just a decorative trend statement. It is a systems-level evolution that has already started and is likely to continue for years. The timing will not be identical for every team. But once a system starts carrying longer, more real, more side-effectful Agent execution, this line tends to appear. The teams that rewrite their engineering systems along that line first are the ones most likely to capture the real compounding benefits of the Agent era.

17. Further Reading, Grouped by Theme

Durable Execution / Workflow

  1. DBOS Architecture: durable execution via workflow / step checkpointing into Postgres.1
  2. Durable Agents: session recovery, worker handoff, and human-in-the-loop waiting as persistent state.6
  3. Pydantic AI + DBOS: explicitly wiring Agent workflows into durable execution.7

Harness / Eval / Self-Iteration

  1. Harness engineering: how OpenAI rewrote repositories, validation, and observability into an Agent-consumable execution system.4
  2. autoresearch: a minimal self-iteration loop with unified metrics, fixed budgets, and explicit keep/discard rules.13

Production Agent Infra / Runtime

  1. Why Existing Agent Infra Cannot Support Production-Grade Applications: semantic mismatch, missing execution primitives, and the structural requirements of production Agent infrastructure.5
  2. Million-Scale Throughput, Hundred-Thousand Concurrency: The Sandbox Substrate Behind MiniMax Agentic RL Training: the four major demands agentic RL places on sandbox infrastructure and Tencent Cloud Agent Runtime’s implementation path.11
  3. Zeroboot: sub-millisecond VM sandbox primitives and snapshot / CoW fork direction.12

RL / Agentic Post-Training / Environment

  1. OpenRLHF: rollout / inference as the dominant runtime bottleneck in RLHF and RLVR.2
  2. Let It Flow: jointly designing rollout environments, context engineering, and post-training optimization in an agentic learning ecosystem.3
  3. ROCK: an environment control plane for agentic reinforcement learning.10

Sandbox / Boundary / Runtime Base

  1. A field guide to sandboxes for AI: the boundary / policy / lifecycle framework.8
  2. Possible Agent Sandbox Choices and the Opportunity for Unikernels: practical tradeoffs across Firecracker, gVisor, Wasm, and unikernels.9