Summary
OpenAI describes a five-month experiment building and shipping an internal beta product with zero lines of manually written code. Every line — application logic, tests, CI, docs, observability, tooling — was written by Codex agents. The result: ~1 million lines of code and ~1,500 merged PRs, built by a small team (initially 3 engineers, now 7) in roughly one-tenth the time manual coding would have taken. The article lays out the operational lessons learned and establishes a new engineering paradigm: humans steer, agents execute.
Key Takeaways
1. The Experiment’s Scale
- Started from an empty git repository in late August 2025
- ~1 million lines of code in 5 months
- ~1,500 PRs opened and merged
- 3.5 PRs per engineer per day average (throughput increased as team grew from 3 to 7)
- Product has hundreds of internal users, including daily power users
- Individual Codex runs regularly work for 6+ hours on a single task
2. The Engineer’s New Role
The primary job shifted from writing code to enabling agents to do useful work:
- Designing environments and specifying intent
- Building feedback loops for reliable agent work
- When something fails, the question is always: “What capability is missing, and how do we make it legible and enforceable for the agent?”
- Humans interact almost entirely through prompts
- Code review has shifted toward agent-to-agent review rather than human review
3. Application Legibility
The bottleneck shifted from code throughput to human QA capacity. Solutions:
- Made the app bootable per git worktree so Codex can launch one instance per change
- Wired Chrome DevTools Protocol into the agent runtime (DOM snapshots, screenshots, navigation)
- Built local observability stack (logs via LogQL, metrics via PromQL) that’s ephemeral per worktree
- Agents can now validate prompts like “ensure startup completes in under 800ms”
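A budget check like “startup completes in under 800ms” can be sketched as a small validation over samples returned by the per-worktree metrics stack. This is a hypothetical illustration: the metric name, sample shape, and function are assumptions modeled on a Prometheus-style instant-query result, not code from the article.

```typescript
// Hypothetical: one sample per running app instance, shaped like a
// Prometheus instant-query result entry (assumed, not from the article).
interface InstantSample {
  metric: Record<string, string>;
  value: [number, string]; // [unix timestamp, value as string]
}

// Check the worst observed startup duration against a millisecond budget.
function checkStartupBudget(
  samples: InstantSample[],
  budgetMs: number
): { ok: boolean; worstMs: number } {
  const worstMs = Math.max(0, ...samples.map((s) => Number(s.value[1])));
  return { ok: worstMs <= budgetMs, worstMs };
}

// Example: two worktree instances reporting a startup_duration_ms metric.
const result = checkStartupBudget(
  [
    { metric: { instance: "worktree-a" }, value: [1700000000, "642"] },
    { metric: { instance: "worktree-b" }, value: [1700000000, "731"] },
  ],
  800
);
// result.ok === true, result.worstMs === 731
```

An agent can run a check like this inside its worktree and treat a failing result as a signal to keep iterating before opening a PR.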
4. Repository Knowledge as System of Record
The “one big AGENTS.md” approach failed because:
- Context is scarce — giant files crowd out the actual task
- Too much guidance becomes non-guidance
- It rots instantly and is hard to verify
Instead: AGENTS.md as table of contents (~100 lines), pointing to a structured docs/ directory:
- Design docs with verification status and core beliefs
- Architecture documentation with domain maps and package layering
- Quality grades per domain and layer, tracked over time
- Execution plans as first-class versioned artifacts (active, completed, tech debt)
- Progressive disclosure: agents start with a small entry point, taught where to look next
Enforced mechanically: linters and CI validate that the knowledge base is current, cross-linked, and structured. A “doc-gardening” agent scans for stale docs and opens fix-up PRs.
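The doc-gardening idea can be sketched as a staleness scan over the docs/ index: flag any doc whose last-verified date has aged past a freshness window, so an agent can open fix-up PRs for it. The field names and the 90-day window below are assumptions for illustration.

```typescript
// Hypothetical doc-index entry; `lastVerified` is assumed to be tracked
// alongside each doc's verification status.
interface DocEntry {
  path: string;
  lastVerified: string; // ISO date
}

// Return the paths of docs older than `maxAgeDays`, as candidates for
// a doc-gardening fix-up PR.
function findStaleDocs(docs: DocEntry[], now: Date, maxAgeDays: number): string[] {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return docs
    .filter((d) => new Date(d.lastVerified).getTime() < cutoff)
    .map((d) => d.path);
}

const stale = findStaleDocs(
  [
    { path: "docs/design-docs/core-beliefs.md", lastVerified: "2025-12-01" },
    { path: "docs/FRONTEND.md", lastVerified: "2025-06-15" },
  ],
  new Date("2026-01-10"),
  90
);
// stale === ["docs/FRONTEND.md"]
```

The same check runs naturally in CI, where a non-empty result either fails the build or enqueues a fix-up task for a background agent.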
5. Agent Legibility as the Goal
Core principle: anything Codex can’t access in-context effectively doesn’t exist.
- Slack discussions, Google Docs, tribal knowledge are all illegible to the agent
- Push context into the repo: versioned markdown, schemas, executable plans
- Favored “boring” technologies — composable, stable APIs, well-represented in training data
- Sometimes cheaper to reimplement functionality than fight opaque upstream behavior
- Example: built custom map-with-concurrency helper instead of p-limit, tightly integrated with OpenTelemetry
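A helper in that spirit can be sketched as follows. This is a minimal version with the telemetry stripped out: the article says the real one is tightly integrated with OpenTelemetry, which is omitted here, and the function name and signature are assumptions.

```typescript
// Map over items with at most `limit` tasks in flight, preserving order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Spawn `limit` workers that each pull the next unclaimed index until
  // items run out. JS is single-threaded, so `next++` between awaits is safe.
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  });
  await Promise.all(workers);
  return results;
}

// Usage: double each number with at most 2 tasks in flight.
mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * 2).then((doubled) => {
  console.log(doubled); // → [2, 4, 6, 8]
});
```

Owning a small helper like this makes it trivial to wrap each `fn` call in a span, which is exactly the kind of integration that is awkward to bolt onto an opaque upstream dependency.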
6. Enforcing Architecture and Taste
A rigid architectural model with strict boundaries:
- Each business domain divided into fixed layers: Types → Config → Repo → Service → Runtime → UI
- Cross-cutting concerns enter through a single interface: Providers
- Enforced via custom linters and structural tests (all Codex-generated)
- “Taste invariants”: structured logging, naming conventions, file size limits, platform reliability requirements
- Custom lint error messages include remediation instructions injected into agent context
- Philosophy: enforce boundaries centrally, allow autonomy locally
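A lint rule with agent-facing remediation can be sketched like this. The rule name, the 500-line limit, and the message format are all hypothetical; the point is that the error object carries concrete fix instructions that get injected into the agent’s context.

```typescript
// Hypothetical structured lint error; `remediation` is the text an agent
// sees when the rule fires.
interface LintError {
  rule: string;
  message: string;
  remediation: string;
}

// A "taste invariant": cap file size, and tell the agent how to fix it.
function checkFileSize(path: string, lineCount: number, maxLines = 500): LintError | null {
  if (lineCount <= maxLines) return null;
  return {
    rule: "max-file-size",
    message: `${path} has ${lineCount} lines (limit ${maxLines}).`,
    remediation:
      "Split this file along domain-layer boundaries (Types/Config/Repo/Service/Runtime/UI) " +
      "and move shared helpers into the shared utility package.",
  };
}

const err = checkFileSize("src/orders/service.ts", 742);
// err.rule === "max-file-size"; err.remediation tells the agent what to do
```

Compared with a terse human-oriented lint message, the remediation field turns each failure into a self-contained task description, so the agent can act on it without extra prompting.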
7. Merge Philosophy at Scale
Conventional norms become counterproductive at agent throughput:
- Minimal blocking merge gates
- Short-lived PRs
- Test flakes addressed with follow-up runs rather than blocking
- “Corrections are cheap, waiting is expensive”
8. Increasing Levels of Autonomy
The system recently crossed a threshold where Codex can drive a feature end-to-end:
- Validate codebase state
- Reproduce a reported bug
- Record a video demonstrating failure
- Implement a fix
- Validate fix by driving the application
- Record a second video demonstrating resolution
- Open a PR
- Respond to agent and human feedback
- Detect and remediate build failures
- Escalate to human only when judgment is required
- Merge the change
9. Entropy and Garbage Collection
Agent autonomy introduces drift — Codex replicates patterns, even suboptimal ones.
- Initially, humans spent every Friday (20% of the week) cleaning up “AI slop” — an approach that didn’t scale
- Solution: “Golden principles” encoded in the repo + recurring cleanup process
- Prefer shared utility packages over hand-rolled helpers
- Don’t probe data “YOLO-style” — validate boundaries or use typed SDKs
- Background Codex tasks scan for deviations, update quality grades, open refactoring PRs
- Functions like garbage collection: continuous small incremental paydowns vs. painful bursts
- “Human taste is captured once, then enforced continuously on every line of code”
Notable Quotes
“Humans steer. Agents execute.”
“The primary job of our engineering team became enabling the agents to do useful work.”
“When something failed, the fix was almost never ‘try harder.’”
“Give Codex a map, not a 1,000-page instruction manual.”
“From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist.”
“Technologies often described as ‘boring’ tend to be easier for agents to model due to composability, API stability, and representation in the training set.”
“This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite.”
“Corrections are cheap, and waiting is expensive.”
“Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound.”
“Building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code.”
Architectural Patterns
Repository Knowledge Structure
AGENTS.md # ~100 lines, table of contents
ARCHITECTURE.md # Top-level domain + package layering map
docs/
├── design-docs/ # Catalogued, indexed, verification status
│ ├── index.md
│ ├── core-beliefs.md
│ └── ...
├── exec-plans/ # First-class versioned artifacts
│ ├── active/
│ ├── completed/
│ └── tech-debt-tracker.md
├── generated/
│ └── db-schema.md
├── product-specs/
│ ├── index.md
│ └── ...
├── references/ # External docs made local
│ ├── design-system-reference-llms.txt
│ └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
Domain Layer Model
Types → Config → Repo → Service → Runtime → UI
↑
Providers (auth, connectors, telemetry, feature flags)
Dependencies can only flow “forward.” Cross-cutting concerns enter through Providers only.
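A structural test for this rule can be sketched as a check over import edges: a later layer may depend on an earlier one, never the reverse, and anything may depend on Providers. The edge representation is an assumption for illustration.

```typescript
// Layer order from the model above; earlier layers must not import later ones.
const LAYERS = ["Types", "Config", "Repo", "Service", "Runtime", "UI"] as const;

// "from" imports "to".
interface Edge { from: string; to: string }

// Return every edge that points "backward" through the layer order.
function findViolations(edges: Edge[]): Edge[] {
  const rank = new Map<string, number>(
    LAYERS.map((l, i) => [l, i] as [string, number])
  );
  return edges.filter((e) => {
    // Providers sits outside the chain: any layer may depend on it.
    if (e.to === "Providers") return false;
    const from = rank.get(e.from);
    const to = rank.get(e.to);
    // Violation: an earlier layer depending on a later one.
    return from !== undefined && to !== undefined && from < to;
  });
}

const bad = findViolations([
  { from: "Service", to: "Repo" },   // ok: later depends on earlier
  { from: "Repo", to: "UI" },        // violation: earlier depends on later
  { from: "UI", to: "Providers" },   // ok: cross-cutting via Providers
]);
// bad === [{ from: "Repo", to: "UI" }]
```

Run as a structural test in CI, a check like this is what makes the layering enforceable for agents rather than merely documented.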
Implications for Practitioners
- Invest in the harness, not just the model — the agent’s effectiveness is bounded by its environment, not its intelligence.
- Make everything repo-local — if knowledge isn’t in the repository, it doesn’t exist for agents. Push Slack discussions, design decisions, and tribal knowledge into versioned markdown.
- AGENTS.md should be a map, not a manual — keep it short (~100 lines) and point to deeper sources of truth.
- Enforce architecture mechanically — custom linters with remediation instructions in error messages are force multipliers for agents.
- Build for agent legibility first — boring tech > novel tech; composable > opaque; in-repo > external.
- Treat technical debt as garbage collection — continuous small cleanups via background agent tasks, not Friday cleanup marathons.
- Rethink merge gates — at agent throughput, corrections are cheaper than waiting; minimize blocking gates.
- Progressive disclosure over context dumping — teach agents where to look; don’t overwhelm them upfront.
Open Questions (from the article)
- How does architectural coherence evolve over years in a fully agent-generated system?
- Where does human judgment add the most leverage, and how do you encode it so it compounds?
- How will this system evolve as models continue to become more capable?