Building Proper Tests for Coding Agents in Harness Engineering Frameworks
Executive Summary
Harness engineering — the discipline of building the infrastructure that wraps around AI coding agents to make them reliable, governable, and productive — has emerged as one of the most important new fields in software engineering. The field, named by Mitchell Hashimoto in February 2026, rests on a central insight: the harness matters more than the model. A mid-tier model in a great harness beats a frontier model in a bad one. Tests are the backbone of effective harnesses, functioning simultaneously as specifications for agents, feedback mechanisms, and verification gates. This report provides a comprehensive guide to building proper tests for applications used by coding agents within harness engineering frameworks.
1. Background: What Is a Coding Agent Harness?
A coding agent harness is the complete infrastructure wrapping an LLM-based coding agent — human approvals, sub-agent coordination, filesystem access, prompt presets, lifecycle hooks, planning, and execution. The term draws from horse tack: reins, saddle, and bit that channel a powerful but unpredictable animal in the right direction (Hashimoto, 2026).
The harness engineering formula: Agent = Model + Harness. The model provides intelligence; the harness makes that intelligence useful (Parallel AI).
The Empirical Evidence: Harness > Model
The Hashline experiment demonstrated this empirically: merely changing the harness’s tool format improved Grok Code Fast 1 from 6.7% to 68.3% on coding benchmarks — no model weights were modified. LangChain’s ranking jumped from 30th to 5th place on Terminal-Bench 2.0 by changing only the harness — same model, 13.7-point improvement (Fowler, 2026; LangChain, 2026).
Key Components
Harness engineering involves two main practices (Hashimoto, 2026):
- Better implicit prompting (AGENTS.md): For simple issues like wrong commands or wrong APIs, update the AGENTS.md file. Each line targets a specific bad agent behavior and, in most cases, almost completely resolves it.
- Programmed tools: Actual scripts — screenshots, filtered tests, etc. — paired with AGENTS.md instructions.
2. Tests as Specifications: The Core Insight
The foundational principle: test suites function as the most reliable specification language for coding agents. As Simon Willison noted, “the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against” (Willison, 2025).
Tests provide what agents need:
- Concrete, verifiable success criteria that ground the agent’s work
- Reduced hallucination risk through falsifiable outputs
- Iterative self-correction: write code → run tests → fix errors → repeat
OpenAI’s harness engineering report validates this: “Your likelihood of successfully solving a problem with a coding agent is strongly correlated with the agent’s ability to verify its own work” (OpenAI, 2026).
The SWE-bench Model: Tests as Hidden Oracle
SWE-bench operationalizes tests-as-specifications rigorously. Each task has “FAIL_TO_PASS” tests (verifying the fix works) and “PASS_TO_PASS” tests (verifying nothing broke). Tests are hidden from the agent — it must solve the problem from natural language alone. The test suite acts as a hidden oracle (Jimenez et al., 2024).
The Specification Gaming Risk
Agents may write tests that verify their own broken behavior. Test-first development prevents this: “when the tests exist before the code, agents cannot cheat by writing tests that simply confirm whatever incorrect implementation they produced” (The Register, 2026). Always include “Do NOT modify the test files” in implementation prompts.
3. The Agent Testing Pyramid
The traditional testing pyramid breaks down for AI agents because agents violate the assumption of deterministic outputs. Multiple organizations — Block Engineering, Zapier, LangWatch, AWS — have independently converged on a restructured pyramid organized around uncertainty tolerance (Block Engineering, 2026).
Layer 1: Deterministic Foundations (Unit Tests)
Mock out the LLM entirely and test everything around it: retry behavior, turn limits, tool validation, delegation logic, prompt assembly, guardrail enforcement.
- Run in milliseconds, cost nothing (no API calls)
- Run on every commit
- If tests fail here, the problem is in your code, not the AI
# Example: Test tool validation without LLM
def test_edit_tool_requires_exact_match():
    """Agent's edit tool must reject ambiguous replacements."""
    result = edit_tool.apply(
        old_string="foo",
        new_string="bar",
        file_content="foo bar foo"  # Two matches - should fail
    )
    assert result.error == "Multiple matches found"

Layer 2: Component-Level Evals (Integration Tests)
Test each component separately — retrieval, parsing, prompt construction, tool orchestration. Block Engineering introduced record-and-replay testing: record a good agent session, commit the fixture, create a regression test capturing real model behavior (Block Engineering, 2026).
Zapier’s “trajectory evals” score entire workflow executions, combining deterministic assertions with LLM-as-judge rubrics. Critical lesson: “unit test evals penalize different approaches, even when they’re smarter or more efficient” (Zapier/rwilinski, 2025).
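Block's record-and-replay idea can be sketched as a thin wrapper around the model client: in record mode real responses are written to a JSON fixture; in replay mode the fixture answers instead of the live model, making the regression test deterministic and free. The class and method names below are illustrative, not Block's actual API.

```python
# Sketch of record-and-replay testing: record a good session once,
# commit the fixture, replay it in CI. Names are illustrative.
import json
from pathlib import Path

class ReplayClient:
    def __init__(self, live_client=None,
                 fixture_path="fixtures/session.json", record=False):
        self.live = live_client
        self.path = Path(fixture_path)
        self.record = record
        self.cache = (json.loads(self.path.read_text())
                      if self.path.exists() else {})

    def complete(self, prompt: str) -> str:
        if not self.record:
            return self.cache[prompt]          # replay: deterministic, free
        response = self.live.complete(prompt)  # record: hit the real model
        self.cache[prompt] = response
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.cache, indent=2))
        return response
```

Committing the fixture turns one observed good session into a permanent regression test that captures real model behavior.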
Layer 3: Probabilistic Performance (Benchmark Evals)
Validate behaviors requiring multiple runs. Track four key metrics (AWS, 2026):
- pass@k: Probability at least one of k trials succeeds
- pass^k: Probability all k trials succeed
- Latency: Time to completion
- Token usage: Cost per task
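The two reliability metrics above can be computed from n recorded trials of which c succeeded. A minimal sketch: pass@k uses the standard unbiased combinatorial estimator, and pass^k is estimated naively as the per-trial success rate raised to the k-th power.

```python
# Sketch: multi-trial metrics from n trials with c successes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled trials succeeds)."""
    if n - c < k:
        return 1.0  # too few failures to draw an all-failure k-sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k trials succeed), estimated as (c/n)^k."""
    return (c / n) ** k
```

For an agent that passed 5 of 10 trials, `pass_at_k(10, 5, 2)` is about 0.78 while `pass_hat_k(10, 5, 2)` is 0.25 — the gap between "works if you retry" and "works every time".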
Layer 4: Judgment and Simulation (End-to-End Evals)
Agent simulations and human-judgment assessments. Use LLM-as-judge with clear rubrics, running evaluations three times with majority voting (Block Engineering, 2026).
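The triple-run majority vote can be sketched in a few lines. `judge` here is an assumption standing in for any LLM-as-judge call that returns a verdict string against a rubric.

```python
# Sketch: run the judge three times and take the majority verdict,
# smoothing over single-run judge noise. `judge` is a stand-in for
# a real model call.
from collections import Counter

def majority_judgment(judge, transcript: str, runs: int = 3) -> str:
    """Call the judge `runs` times and return the most common verdict."""
    verdicts = [judge(transcript) for _ in range(runs)]
    return Counter(verdicts).most_common(1)[0][0]
```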
CI/CD Integration Strategy
| Layer | When to Run | Cost | Signal |
|---|---|---|---|
| Deterministic (1-2) | Every commit | Free | Is the scaffolding broken? |
| Benchmark (3) | Nightly / pre-release | Moderate | Has agent behavior regressed? |
| Judgment (4) | On-demand / pre-release | High | Does the system work end-to-end? |
4. Designing Tests for Agent Consumption
What Makes Tests “Agent-Friendly”
- One assertion per test: Agents parse failure output to decide what to fix. Multiple assertions in a single test create ambiguity about what’s wrong.
- Descriptive error messages: Include context explaining why the test failed and what was expected:

  assert result.status == "success", (
      f"Expected successful login but got {result.status}. "
      f"Error: {result.error_message}. "
      f"This usually means the auth token is expired or malformed."
  )

- Fast execution: Unit tests should complete in seconds. Agents iterate by running tests after every change — slow suites break the feedback loop.
- Deterministic setup/teardown: Use fixtures, not shared state. Each test must be independently runnable.
- Machine-parseable output: Use structured formats (TAP, JUnit XML, pytest JSON) that agents can programmatically interpret (TAP Protocol).
- The AAA Pattern: Structure every test as Arrange → Act → Assert for maximum agent readability.
Test Output Formats for Agents
| Format | Agent Parseability | Language Support | Best For |
|---|---|---|---|
| TAP (Test Anything Protocol) | Excellent — ok/not ok is trivially parseable | 15+ languages | Cross-language agent workflows |
| JUnit XML | Good — structured XML | Java, Python, JS | CI/CD integration |
| pytest verbose | Good — human and machine readable | Python | Python-specific agents |
| JSON reporters | Excellent — native data structure | Most frameworks | Programmatic consumption |
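TAP's line protocol is simple enough that an agent-side parser fits in a few lines. This sketch handles only the basic `ok` / `not ok` forms, not TAP's directives or subtests.

```python
# Sketch: parse basic TAP output into structured results an agent
# can act on. Covers "ok N - desc" / "not ok N - desc" lines only.
import re

TAP_LINE = re.compile(r"^(not ok|ok)\s+(\d+)\s*-?\s*(.*)$")

def parse_tap(output: str) -> list:
    """Return [{'passed': bool, 'num': int, 'desc': str}, ...]."""
    results = []
    for line in output.splitlines():
        m = TAP_LINE.match(line.strip())
        if m:
            results.append({
                "passed": m.group(1) == "ok",
                "num": int(m.group(2)),
                "desc": m.group(3),
            })
    return results
```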
Anti-Patterns: “Vibe Testing”
“Vibe testing” occurs when agents generate tests that technically pass but verify nothing, inflating coverage metrics while providing false confidence. Combat this by pairing coverage thresholds with assertion-quality metrics based on AST analysis (DEV Community, 2026).
5. Test-Driven Agent Development (TDAD)
Why TDD Is a Natural Fit for Agents
“Everything that makes TDD a slog for humans makes it the perfect workflow for an AI agent” — AI thrives on clear, measurable goals, and a binary test is the clearest goal possible. AI eliminates TDD’s biggest weakness (manual labor of writing tests) while preserving its biggest strength (fast, unambiguous feedback) (Elliott, 2025).
The TDD Prompting Paradox
The TDAD paper (March 2026) revealed a critical finding: adding procedural TDD instructions without contextual test information increased regressions to 9.94% — worse than no intervention. But providing targeted context about which tests are at risk via graph-based impact analysis reduced regression rates by 70% (TDAD, 2026).
Key takeaway: Context over instruction. Agents benefit more from knowing which tests matter for a given change than from verbose “how to do TDD” instructions.
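As a toy illustration of "context over instruction": instead of telling the agent how to do TDD, compute which tests touch a changed module and put only those in the prompt. The TDAD paper uses graph-based impact analysis; this static import scan is a minimal stand-in, and all names here are illustrative.

```python
# Toy sketch of test-impact analysis: find tests that import a changed
# module. A stand-in for real dependency-graph analysis.
import ast
from pathlib import Path

def imported_modules(test_file: Path) -> set:
    """Top-level module names imported by a test file."""
    tree = ast.parse(test_file.read_text())
    mods = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def at_risk_tests(changed_module: str, test_dir: str) -> list:
    """Test files whose imports touch the changed module."""
    return [
        str(p) for p in sorted(Path(test_dir).rglob("test_*.py"))
        if changed_module in imported_modules(p)
    ]
```

The output of `at_risk_tests` becomes targeted context in the implementation prompt: "these tests must still pass after your change".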
TDAD Results
| Metric | Baseline | With TDAD |
|---|---|---|
| Test-level regression rate | 6.08% | 1.82% (-70%) |
| PASS_TO_PASS failures | 562 | 155 (-72%) |
| Resolution rate (15 iterations) | 12% | 60% |
Practical Workflow
- Write tests first (or have the agent help write them)
- Audit tests to ensure they capture intended behavior
- Lock test files: Include “Do NOT modify the test files” in implementation prompts
- Start small: 3-5 tests covering core behavior, then iterate
- Let the agent iterate: Agent runs tests → reads failures → fixes code → repeats
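Step 5 above can be sketched as a bounded loop. `ask_agent_to_fix` is an assumption standing in for whatever mechanism sends failing output back to the agent; the default test command is illustrative.

```python
# Sketch of the agent iterate loop: run tests, feed failures back,
# stop on green or after a round cap. `ask_agent_to_fix` is a stand-in.
import subprocess

def iterate_until_green(ask_agent_to_fix,
                        test_cmd=("pytest", "tests/", "-x", "--tb=short"),
                        max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        run = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if run.returncode == 0:
            return True                           # all tests pass: done
        ask_agent_to_fix(run.stdout + run.stderr) # only failures go back
    return False
```

The round cap matters: without it, an agent stuck on an unsatisfiable test burns tokens indefinitely.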
6. Sandboxing and Isolation
Safe test execution requires isolating agent-generated code from the host system.
Sandbox Comparison
| Sandbox | Startup | Isolation Level | Best For |
|---|---|---|---|
| Docker | ~50ms | Process (shared kernel) | Development, general eval |
| E2B (Firecracker) | ~150ms | Hardware (dedicated kernel) | Production agent execution |
| Modal | ~90ms | Container (managed) | Parallel eval pipelines |
| gVisor | 50-100ms | User-space kernel | K8s-native workloads |
| nsjail | ~10ms | Process + seccomp | Lightweight sandboxing |
Recommendation: Docker for development; E2B or Modal for production; Firecracker/Proxmox for highest-security evaluations.
The Inspect Sandboxing Toolkit (UK AISI) provides a reference architecture where “Inspect itself sits outside of the sandbox and sends commands into it” — commands originate externally, everything inside is explicitly authorized (AISI, 2026).
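For the Docker development case, a minimal sketch of a locked-down invocation: no network, resource caps, and the source tree mounted read-only. The image name and paths are illustrative; building the argv separately keeps it unit-testable without Docker installed.

```python
# Sketch: run agent-written tests in a throwaway Docker container
# with no network and a read-only source mount. Image/paths illustrative.
import subprocess

def build_sandbox_cmd(workdir: str, image: str = "python:3.12-slim") -> list:
    return [
        "docker", "run", "--rm",
        "--network=none",              # no internet for agent code
        "--memory=1g", "--cpus=1",     # resource caps
        "-v", f"{workdir}:/work:ro",   # source mounted read-only
        "-w", "/work",
        image, "pytest", "-q",
    ]

def run_sandboxed(workdir: str) -> int:
    """Return the test run's exit code from inside the sandbox."""
    return subprocess.run(build_sandbox_cmd(workdir)).returncode
```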
7. Evaluation Frameworks Comparison
| Framework | Type | Best For | Key Feature |
|---|---|---|---|
| Inspect AI | General framework | Production eval pipelines | 100+ built-in evals, Docker/K8s/Proxmox sandboxing |
| SWE-bench | Benchmark | Coding agent ranking | Real GitHub issues, FAIL_TO_PASS/PASS_TO_PASS pattern |
| BigCodeBench | Benchmark | Realistic coding tasks | 1,140 tasks across 139 libraries |
| Terminal-Bench | Benchmark | CLI agent testing | Real terminal environments |
| Aider Bench | Benchmark | Code editing tools | Tests full edit-apply-debug loop |
| METR Task Standard | Specification | Portable task definitions | 1,000+ tasks, adopted by UK AISI |
| FeatureBench | Benchmark | Feature development | Exposes the feature development gap |
8. Industry Best Practices
How Leading Companies Build Tests
Anthropic (Claude Code): Minimal scaffold philosophy — bash + edit tools, single-threaded master loop. Grade outcomes, not paths. Use three grader types: deterministic, LLM-based, and human. Start with 20-30 real failures as eval tasks (Anthropic, 2026).
Cursor: Private CursorBench sourced from real developer sessions via “Cursor Blame” tool. Uses “agentic graders” that can understand multiple valid solutions. Supplements offline evals with online A/B experiments (Cursor, 2026).
Cognition (Devin): Evaluator agents with browser and shell access autonomously judge outcomes. Simulated users test interactive capabilities. Production Devin achieves 74.2% without prior exposure to evaluation tasks (Cognition, 2026).
OpenAI: Codex operates in sandboxed containers with internet disabled. Self-bootstrapping: GPT-5.3-Codex was used to debug its own training. Eval-driven development with “measure, improve, ship” loop (OpenAI, 2026).
Consolidated Best Practices
- Start with failures: Collect 20-30 real failures and turn them into eval tasks
- Grade outcomes, not paths: Don’t test tool call sequences; test end states
- Use multi-trial statistics: pass@k and pass^k capture stochastic variance
- Layer grading approaches: Deterministic first → LLM judges for nuance → humans for calibration
- Test the system, not the model: Evaluate the full agent+harness pipeline
- Keep evals private: Public benchmarks invite gaming and contamination
- Combine offline and online evals: Offline catches regressions; online detects UX gaps
- Read the transcripts: No substitute for reviewing actual multi-step agent behavior
9. Cutting-Edge Developments (2025-2026)
Benchmark Evolution
SWE-bench Verified has been effectively retired due to data contamination — frontier models can reproduce gold patches verbatim. The field has shifted to:
- SWE-bench Pro: Multi-language, private codebases; scores drop from 70-81% on Verified to roughly 23%
- FeatureBench: Feature development (not bug-fixing) — best agents solve only 11% vs 74% on SWE-bench
- SWE-CI: Long-term maintenance — zero-regression rates below 25% for most models
- LiveCodeBench: Continuously sourced fresh problems to prevent contamination
Multi-Agent Testing
Empirical evidence shows that separating code generation from test generation improves quality. AgentCoder (multi-agent) achieved 79.9% vs 71.3% (single agent) on HumanEval (AgentCoder, 2024). The key: tests written by the same agent that wrote the code suffer from confirmation bias.
Property-Based Testing with Agents
Anthropic’s agentic PBT agent discovered that numpy.random.wald sometimes returns negative numbers — a real bug in NumPy that was patched upstream. Running against 100+ Python packages, the approach demonstrates that agents can find novel bugs through property-based testing at scale (Anthropic Red Team, 2026).
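The core of property-based testing needs no framework: sample many random inputs and check an invariant. A minimal sketch of the same kind of non-negativity property as the wald bug, where `sample_wait_time` is a hypothetical function standing in for the code under test (real agentic PBT would generate both the properties and the inputs).

```python
# Sketch: framework-free property check of a non-negativity invariant.
# `sample_wait_time` is a hypothetical stand-in for the code under test.
import random

def sample_wait_time(mu: float, rng: random.Random) -> float:
    # stand-in implementation: log-normal draws are positive by construction
    return rng.lognormvariate(mu, 1.0)

def check_nonnegative_property(trials: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(trials):
        mu = rng.uniform(-2.0, 2.0)
        value = sample_wait_time(mu, rng)
        assert value >= 0, f"negative sample {value} for mu={mu}"
```

Libraries like Hypothesis automate input generation and shrinking; the agentic version additionally proposes which invariants are worth checking.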
Continuous Eval Pipelines
CI/CD for agent capabilities is becoming standard: quality gates that block releases on regressions, LLM-as-judge scoring on every PR, production monitoring for quality drift. Recommended stack: DeepEval for CI/CD gates, RAGAS for metric exploration, Langfuse/LangSmith for production monitoring.
The Self-Testing Paradox
A surprising finding: in high-autonomy settings, agent-written tests provide marginal utility. GPT-5.2 achieves nearly identical results (71.8% vs 74.4%) while writing almost no tests. Agent-written tests function primarily as observational tools (prints) rather than verification mechanisms (arXiv, 2026). This doesn’t negate pre-existing tests as specifications, but challenges the assumption that agents writing their own tests during resolution is always beneficial.
10. Practical Getting-Started Guide
Step 1: Set Up Your Testing Infrastructure
project/
├── AGENTS.md # Agent instructions including test commands
├── tests/
│ ├── unit/ # Fast, deterministic, run on every commit
│ ├── integration/ # Component-level, may use real APIs
│ └── evals/ # Agent behavior evals, run nightly
├── .claude/ # Agent harness configuration
└── scripts/
└── run-tests.sh # Single command to run all tests
Step 2: Write Agent-Friendly Tests
# GOOD: Focused assertions, descriptive messages, fast
def test_user_creation_returns_valid_id():
    user = create_user(name="Alice", email="alice@example.com")
    assert user.id is not None, (
        "create_user returned None id. "
        "Check database connection and user validation."
    )
    assert isinstance(user.id, int), (
        f"Expected int id but got {type(user.id).__name__}. "
        "Database may be returning string UUIDs."
    )

# BAD: Multiple concerns, no diagnostic info
def test_user():
    user = create_user(name="Alice", email="alice@example.com")
    assert user.id
    assert user.name == "Alice"
    assert user.email == "alice@example.com"
    assert user.created_at
    assert validate_user(user)

Step 3: Configure Your AGENTS.md
## Testing
- Run tests with: `pytest tests/ -v --tb=short`
- Run only unit tests: `pytest tests/unit/ -v`
- NEVER modify test files — implement code to pass existing tests
- If a test fails, read the error message carefully before changing code
- Run tests after every code change

Step 4: Implement Feedback Loops
Use hooks that run tests automatically on every agent code change. On failure, surface only the error output (back-pressure pattern). On success, hooks are silent — nothing added to context.
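The back-pressure pattern can be sketched as a post-edit hook: run the test command, stay silent on success, and surface only the tail of the output on failure. The command and tail length are illustrative choices.

```python
# Sketch of the back-pressure pattern: silent on green, only the
# failure tail on red, so the agent's context stays lean.
import subprocess

def post_edit_hook(test_cmd: list, tail_lines: int = 30) -> str:
    """Return '' on green (nothing enters the context), failures otherwise."""
    run = subprocess.run(test_cmd, capture_output=True, text=True)
    if run.returncode == 0:
        return ""  # silent: no tokens added to the agent's context
    output = (run.stdout + run.stderr).splitlines()
    return "\n".join(output[-tail_lines:])
```

Returning the empty string on success is the point: a hook that announces every green run steadily pollutes the context window with noise.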
Step 5: Add Evaluation Layers Incrementally
- Start with deterministic unit tests (Layer 1)
- Add record-and-replay integration tests (Layer 2)
- Implement pass@k benchmark evals when ready (Layer 3)
- Reserve LLM-as-judge for subjective quality assessment (Layer 4)
11. Open Challenges
- The feature development gap: Best agents solve only 11% of FeatureBench vs 74% on SWE-bench
- Long-term maintenance regression: Zero-regression rates below 25% for most models during sustained development
- The oracle problem: Who verifies the specification (test) is correct when AI writes both code and tests?
- Statistical evaluation tooling: No mainstream tools offer confidence intervals or formal statistical aggregation for agent eval results
- Cross-model regression detection: How to systematically detect behavioral regressions when model providers update weights
- Visual and multimodal testing: 73% performance drop when images are involved
12. Key Principles (Summary)
- Tests are the primary interface between human intent and agent behavior. They function as specifications, feedback signals, and verification mechanisms simultaneously.
- The testing pyramid must be restructured around uncertainty tolerance, not test granularity.
- Harness quality determines agent quality. Model capabilities are necessary but not sufficient.
- Context over instruction. Agents benefit more from targeted contextual information than from verbose procedural instructions.
- Specification must be executable. Static documentation drifts; executable specifications enforce compliance mechanically.
- Grade outcomes, not paths. Testing specific sequences is brittle; test end states.
- Start with failures. Turn real production failures into eval tasks.
- Keep infrastructure lightweight. Every model release changes the optimal structure — design for replaceability.