Summary
OpenAI describes a five-month experiment building and shipping an internal beta product with zero lines of manually written code. Every line — application logic, tests, CI, docs, observability, tooling — was written by Codex agents. The result: ~1 million lines of code and ~1,500 merged PRs, built by a small team (initially 3 engineers, now 7) in roughly one-tenth the time manual coding would have taken. The article lays out the operational lessons learned and establishes a new engineering paradigm: humans steer, agents execute.
Key Takeaways
1. The Experiment’s Scale
- Started from an empty git repository in late August 2025
- ~1 million lines of code in 5 months
- ~1,500 PRs opened and merged
- 3.5 PRs per engineer per day average (throughput increased as team grew from 3 to 7)
- Product has hundreds of internal users, including daily power users
- Individual Codex runs regularly work for 6+ hours on a single task
2. The Engineer’s New Role
The primary job shifted from writing code to enabling agents to do useful work:
- Designing environments and specifying intent
- Building feedback loops for reliable agent work
- When something fails, the question is always: “What capability is missing, and how do we make it legible and enforceable for the agent?”
- Humans interact almost entirely through prompts
- Code review has shifted toward agent-to-agent review rather than human review
3. Application Legibility
The bottleneck shifted from code throughput to human QA capacity. Solutions:
- Made the app bootable per git worktree so Codex can launch one instance per change
- Wired Chrome DevTools Protocol into the agent runtime (DOM snapshots, screenshots, navigation)
- Built local observability stack (logs via LogQL, metrics via PromQL) that’s ephemeral per worktree
- Agents can now validate prompts like “ensure startup completes in under 800ms”
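A budget check like “startup completes in under 800ms” can be sketched as a small validation over samples returned by the per-worktree metrics stack. This is a hypothetical illustration: the metric name, sample shape, and function are assumptions modeled on a Prometheus-style instant-query result, not code from the article.

```typescript
// Hypothetical: one sample per running app instance, shaped like a
// Prometheus instant-query result entry (assumed, not from the article).
interface InstantSample {
  metric: Record<string, string>;
  value: [number, string]; // [unix timestamp, value as string]
}

// Check the worst observed startup duration against a millisecond budget.
function checkStartupBudget(
  samples: InstantSample[],
  budgetMs: number
): { ok: boolean; worstMs: number } {
  const worstMs = Math.max(0, ...samples.map((s) => Number(s.value[1])));
  return { ok: worstMs <= budgetMs, worstMs };
}

// Example: two worktree instances reporting a startup_duration_ms metric.
const result = checkStartupBudget(
  [
    { metric: { instance: "worktree-a" }, value: [1700000000, "642"] },
    { metric: { instance: "worktree-b" }, value: [1700000000, "731"] },
  ],
  800
);
// result.ok === true, result.worstMs === 731
```

An agent can run a check like this inside its worktree and treat a failing result as a signal to keep iterating before opening a PR.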
4. Repository Knowledge as System of Record
The “one big AGENTS.md” approach failed because:
- Context is scarce — giant files crowd out the actual task
- Too much guidance becomes non-guidance
- It rots instantly and is hard to verify
Instead: AGENTS.md as table of contents (~100 lines), pointing to a structured docs/ directory:
- Design docs with verification status and core beliefs
- Architecture documentation with domain maps and package layering
- Quality grades per domain and layer, tracked over time
- Execution plans as first-class versioned artifacts (active, completed, tech debt)
- Progressive disclosure: agents start with a small entry point, taught where to look next
Enforced mechanically: linters and CI validate that the knowledge base is current, cross-linked, and structured. A “doc-gardening” agent scans for stale docs and opens fix-up PRs.
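The doc-gardening idea can be sketched as a staleness scan over the docs/ index: flag any doc whose last-verified date has aged past a freshness window, so an agent can open fix-up PRs for it. The field names and the 90-day window below are assumptions for illustration.

```typescript
// Hypothetical doc-index entry; `lastVerified` is assumed to be tracked
// alongside each doc's verification status.
interface DocEntry {
  path: string;
  lastVerified: string; // ISO date
}

// Return the paths of docs older than `maxAgeDays`, as candidates for
// a doc-gardening fix-up PR.
function findStaleDocs(docs: DocEntry[], now: Date, maxAgeDays: number): string[] {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return docs
    .filter((d) => new Date(d.lastVerified).getTime() < cutoff)
    .map((d) => d.path);
}

const stale = findStaleDocs(
  [
    { path: "docs/design-docs/core-beliefs.md", lastVerified: "2025-12-01" },
    { path: "docs/FRONTEND.md", lastVerified: "2025-06-15" },
  ],
  new Date("2026-01-10"),
  90
);
// stale === ["docs/FRONTEND.md"]
```

The same check runs naturally in CI, where a non-empty result either fails the build or enqueues a fix-up task for a background agent.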
5. Agent Legibility as the Goal
Core principle: anything Codex can’t access in-context effectively doesn’t exist.
- Slack discussions, Google Docs, tribal knowledge are all illegible to the agent
- Push context into the repo: versioned markdown, schemas, executable plans
- Favored “boring” technologies — composable, stable APIs, well-represented in training data
- Sometimes cheaper to reimplement functionality than fight opaque upstream behavior
- Example: built custom map-with-concurrency helper instead of p-limit, tightly integrated with OpenTelemetry
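A helper in that spirit can be sketched as follows. This is a minimal version with the telemetry stripped out: the article says the real one is tightly integrated with OpenTelemetry, which is omitted here, and the function name and signature are assumptions.

```typescript
// Map over items with at most `limit` tasks in flight, preserving order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Spawn `limit` workers that each pull the next unclaimed index until
  // items run out. JS is single-threaded, so `next++` between awaits is safe.
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  });
  await Promise.all(workers);
  return results;
}

// Usage: double each number with at most 2 tasks in flight.
mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * 2).then((doubled) => {
  console.log(doubled); // → [2, 4, 6, 8]
});
```

Owning a small helper like this makes it trivial to wrap each `fn` call in a span, which is exactly the kind of integration that is awkward to bolt onto an opaque upstream dependency.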
6. Enforcing Architecture and Taste
A rigid architectural model with strict boundaries:
- Each business domain divided into fixed layers: Types → Config → Repo → Service → Runtime → UI
- Cross-cutting concerns enter through a single interface: Providers
- Enforced via custom linters and structural tests (all Codex-generated)
- “Taste invariants”: structured logging, naming conventions, file size limits, platform reliability requirements
- Custom lint error messages include remediation instructions injected into agent context
- Philosophy: enforce boundaries centrally, allow autonomy locally
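A lint rule with agent-facing remediation can be sketched like this. The rule name, the 500-line limit, and the message format are all hypothetical; the point is that the error object carries concrete fix instructions that get injected into the agent’s context.

```typescript
// Hypothetical structured lint error; `remediation` is the text an agent
// sees when the rule fires.
interface LintError {
  rule: string;
  message: string;
  remediation: string;
}

// A "taste invariant": cap file size, and tell the agent how to fix it.
function checkFileSize(path: string, lineCount: number, maxLines = 500): LintError | null {
  if (lineCount <= maxLines) return null;
  return {
    rule: "max-file-size",
    message: `${path} has ${lineCount} lines (limit ${maxLines}).`,
    remediation:
      "Split this file along domain-layer boundaries (Types/Config/Repo/Service/Runtime/UI) " +
      "and move shared helpers into the shared utility package.",
  };
}

const err = checkFileSize("src/orders/service.ts", 742);
// err.rule === "max-file-size"; err.remediation tells the agent what to do
```

Compared with a terse human-oriented lint message, the remediation field turns each failure into a self-contained task description, so the agent can act on it without extra prompting.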
7. Merge Philosophy at Scale
Conventional norms become counterproductive at agent throughput:
- Minimal blocking merge gates
- Short-lived PRs
- Test flakes addressed with follow-up runs rather than blocking
- “Corrections are cheap, waiting is expensive”
8. Increasing Levels of Autonomy
The system recently crossed a threshold where Codex can drive a feature end-to-end:
- Validate codebase state
- Reproduce a reported bug
- Record a video demonstrating failure
- Implement a fix
- Validate fix by driving the application
- Record a second video demonstrating resolution
- Open a PR
- Respond to agent and human feedback
- Detect and remediate build failures
- Escalate to human only when judgment is required
- Merge the change
9. Entropy and Garbage Collection
Agent autonomy introduces drift — Codex replicates patterns, even suboptimal ones.
- Initially, humans spent every Friday (20% of the week) cleaning up “AI slop” — an approach that didn’t scale
- Solution: “Golden principles” encoded in the repo + recurring cleanup process
- Prefer shared utility packages over hand-rolled helpers
- Don’t probe data “YOLO-style” — validate boundaries or use typed SDKs
- Background Codex tasks scan for deviations, update quality grades, open refactoring PRs
- Functions like garbage collection: continuous small incremental paydowns vs. painful bursts
- “Human taste is captured once, then enforced continuously on every line of code”
Notable Quotes
“Humans steer. Agents execute.”
“The primary job of our engineering team became enabling the agents to do useful work.”
“When something failed, the fix was almost never ‘try harder.’”
“Give Codex a map, not a 1,000-page instruction manual.”
“From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist.”
“Technologies often described as ‘boring’ tend to be easier for agents to model due to composability, API stability, and representation in the training set.”
“This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite.”
“Corrections are cheap, and waiting is expensive.”
“Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound.”
“Building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code.”
Architectural Patterns
Repository Knowledge Structure
AGENTS.md # ~100 lines, table of contents
ARCHITECTURE.md # Top-level domain + package layering map
docs/
├── design-docs/ # Catalogued, indexed, verification status
│ ├── index.md
│ ├── core-beliefs.md
│ └── ...
├── exec-plans/ # First-class versioned artifacts
│ ├── active/
│ ├── completed/
│ └── tech-debt-tracker.md
├── generated/
│ └── db-schema.md
├── product-specs/
│ ├── index.md
│ └── ...
├── references/ # External docs made local
│ ├── design-system-reference-llms.txt
│ └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
Domain Layer Model
Types → Config → Repo → Service → Runtime → UI
↑
Providers (auth, connectors, telemetry, feature flags)
Dependencies can only flow “forward.” Cross-cutting concerns enter through Providers only.
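A structural test for this rule can be sketched as a check over import edges: a later layer may depend on an earlier one, never the reverse, and anything may depend on Providers. The edge representation is an assumption for illustration.

```typescript
// Layer order from the model above; earlier layers must not import later ones.
const LAYERS = ["Types", "Config", "Repo", "Service", "Runtime", "UI"] as const;

// "from" imports "to".
interface Edge { from: string; to: string }

// Return every edge that points "backward" through the layer order.
function findViolations(edges: Edge[]): Edge[] {
  const rank = new Map<string, number>(
    LAYERS.map((l, i) => [l, i] as [string, number])
  );
  return edges.filter((e) => {
    // Providers sits outside the chain: any layer may depend on it.
    if (e.to === "Providers") return false;
    const from = rank.get(e.from);
    const to = rank.get(e.to);
    // Violation: an earlier layer depending on a later one.
    return from !== undefined && to !== undefined && from < to;
  });
}

const bad = findViolations([
  { from: "Service", to: "Repo" },   // ok: later depends on earlier
  { from: "Repo", to: "UI" },        // violation: earlier depends on later
  { from: "UI", to: "Providers" },   // ok: cross-cutting via Providers
]);
// bad === [{ from: "Repo", to: "UI" }]
```

Run as a structural test in CI, a check like this is what makes the layering enforceable for agents rather than merely documented.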
Implications for Practitioners
- Invest in the harness, not just the model — the agent’s effectiveness is bounded by its environment, not its intelligence.
- Make everything repo-local — if knowledge isn’t in the repository, it doesn’t exist for agents. Push Slack discussions, design decisions, and tribal knowledge into versioned markdown.
- AGENTS.md should be a map, not a manual — keep it short (~100 lines) and point to deeper sources of truth.
- Enforce architecture mechanically — custom linters with remediation instructions in error messages are force multipliers for agents.
- Build for agent legibility first — boring tech > novel tech; composable > opaque; in-repo > external.
- Treat technical debt as garbage collection — continuous small cleanups via background agent tasks, not Friday cleanup marathons.
- Rethink merge gates — at agent throughput, corrections are cheaper than waiting; minimize blocking gates.
- Progressive disclosure over context dumping — teach agents where to look; don’t overwhelm them upfront.
Open Questions (from the article)
- How does architectural coherence evolve over years in a fully agent-generated system?
- Where does human judgment add the most leverage, and how do you encode it so it compounds?
- How will this system evolve as models continue to become more capable?