Building Proper Tests for Coding Agents in Harness Engineering Frameworks

Executive Summary

Harness engineering — the discipline of building the infrastructure that wraps around AI coding agents to make them reliable, governable, and productive — has emerged as one of the most important new fields in software engineering. The term was coined by Mitchell Hashimoto in February 2026, and its central insight is that the harness matters more than the model: a mid-tier model in a great harness beats a frontier model in a bad one. Tests are the backbone of effective harnesses, functioning simultaneously as specifications for agents, feedback mechanisms, and verification gates. This report provides a comprehensive guide to building proper tests for applications used by coding agents within harness engineering frameworks.


1. Background: What Is a Coding Agent Harness?

A coding agent harness is the complete infrastructure wrapping an LLM-based coding agent — human approvals, sub-agent coordination, filesystem access, prompt presets, lifecycle hooks, planning, and execution. The term draws from horse tack: reins, saddle, and bit that channel a powerful but unpredictable animal in the right direction (Hashimoto, 2026).

The harness engineering formula: Agent = Model + Harness. The model provides intelligence; the harness makes that intelligence useful (Parallel AI).

The Empirical Evidence: Harness > Model

The Hashline experiment demonstrated this empirically: merely changing the harness’s tool format improved Grok Code Fast 1 from 6.7% to 68.3% on coding benchmarks — no model weights were modified. LangChain’s ranking jumped from 30th to 5th place on Terminal-Bench 2.0 by changing only the harness — same model, 13.7-point improvement (Fowler, 2026; LangChain, 2026).

Key Components

Harness engineering involves two main practices (Hashimoto, 2026):

  1. Better implicit prompting (AGENTS.md): For simple issues like wrong commands or wrong APIs, update the AGENTS.md file. Each line targets a specific bad agent behavior observed in practice and, once added, almost completely eliminates it.
  2. Programmed tools: Actual scripts — screenshots, filtered tests, etc. — paired with AGENTS.md instructions.

2. Tests as Specifications: The Core Insight

The foundational principle: test suites function as the most reliable specification language for coding agents. As Simon Willison noted, “the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against” (Willison, 2025).

Tests provide what agents need:

  • Concrete, verifiable success criteria that ground the agent’s work
  • Reduced hallucination risk through falsifiable outputs
  • Iterative self-correction: write code → run tests → fix errors → repeat

OpenAI’s harness engineering report validates this: “Your likelihood of successfully solving a problem with a coding agent is strongly correlated with the agent’s ability to verify its own work” (OpenAI, 2026).
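The write code → run tests → fix errors → repeat cycle can be sketched as a minimal harness routine. This is an illustrative sketch, not any particular product's implementation: `propose_fix` is a hypothetical stand-in for the model call that edits files, and the pytest invocation is an assumption about the project's test runner.

```python
import subprocess

def run_tests(cmd=("pytest", "-x", "--tb=short")):
    """Run the suite and return (passed, output) for the agent to read."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def self_correct(propose_fix, run=run_tests, max_iterations=5):
    """Iterate until the suite is green or the iteration budget is spent."""
    for _ in range(max_iterations):
        passed, output = run()
        if passed:
            return True
        propose_fix(output)  # agent edits files based on the failure output
    return False
```

The failure output is the only channel back to the agent, which is exactly why the error-message quality discussed later matters so much.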

The SWE-bench Model: Tests as Hidden Oracle

SWE-bench operationalizes tests-as-specifications rigorously. Each task has “FAIL_TO_PASS” tests (verifying the fix works) and “PASS_TO_PASS” tests (verifying nothing broke). Tests are hidden from the agent — it must solve the problem from natural language alone. The test suite acts as a hidden oracle (Jimenez et al., 2024).

The Specification Gaming Risk

Agents may write tests that verify their own broken behavior. Test-first development prevents this: “when the tests exist before the code, agents cannot cheat by writing tests that simply confirm whatever incorrect implementation they produced” (The Register, 2026). Always include “Do NOT modify the test files” in implementation prompts.


3. The Agent Testing Pyramid

The traditional testing pyramid breaks down for AI agents because agents violate the assumption of deterministic outputs. Multiple organizations — Block Engineering, Zapier, LangWatch, AWS — have independently converged on a restructured pyramid organized around uncertainty tolerance (Block Engineering, 2026).

Layer 1: Deterministic Foundations (Unit Tests)

Mock out the LLM entirely and test everything around it: retry behavior, turn limits, tool validation, delegation logic, prompt assembly, guardrail enforcement.

  • Run in milliseconds, cost nothing (no API calls)
  • Run on every commit
  • If tests fail here, the problem is in your code, not the AI

# Example: Test tool validation without LLM
def test_edit_tool_requires_exact_match():
    """Agent's edit tool must reject ambiguous replacements."""
    result = edit_tool.apply(
        old_string="foo",
        new_string="bar",
        file_content="foo bar foo"  # Two matches - should fail
    )
    assert result.error == "Multiple matches found"

Layer 2: Component-Level Evals (Integration Tests)

Test each component separately — retrieval, parsing, prompt construction, tool orchestration. Block Engineering introduced record-and-replay testing: record a good agent session, commit the fixture, create a regression test capturing real model behavior (Block Engineering, 2026).

Zapier’s “trajectory evals” score entire workflow executions, combining deterministic assertions with LLM-as-judge rubrics. Critical lesson: “unit test evals penalize different approaches, even when they’re smarter or more efficient” (Zapier/rwilinski, 2025).

Layer 3: Probabilistic Performance (Benchmark Evals)

Validate behaviors requiring multiple runs. Track four key metrics (AWS, 2026):

  • pass@k: Probability at least one of k trials succeeds
  • pass^k: Probability all k trials succeed
  • Latency: Time to completion
  • Token usage: Cost per task
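With n recorded trials and c observed successes, the first two metrics can be estimated in a few lines. `pass_at_k` below uses the standard combinatorial unbiased estimator; `pass_hat_k` is a simple plug-in estimate — a sketch, and production evals should also report confidence intervals.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k sampled trials succeeds),
    given c successes observed across n recorded trials."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Plug-in estimate of P(all k sampled trials succeed)."""
    return (c / n) ** k
```

Note how the two diverge: a flaky 80%-reliable behavior gives a high pass@k but a rapidly shrinking pass^k as k grows, which is why pass^k is the better gate for reliability-critical workflows.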

Layer 4: Judgment and Simulation (End-to-End Evals)

Agent simulations and human-judgment assessments. Use LLM-as-judge with clear rubrics, running evaluations three times with majority voting (Block Engineering, 2026).
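The majority-voting step reduces to a small wrapper. In this sketch, `judge` is a hypothetical callable standing in for an LLM judge that returns a verdict string such as "pass" or "fail".

```python
from collections import Counter

def judge_with_majority(judge, transcript, rubric, runs=3):
    """Run an LLM judge several times and take the majority verdict,
    smoothing out the judge's own nondeterminism."""
    verdicts = [judge(transcript, rubric) for _ in range(runs)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / runs  # verdict plus agreement ratio
```

Surfacing the agreement ratio alongside the verdict is useful: low agreement is itself a signal that the rubric is ambiguous and needs tightening.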

CI/CD Integration Strategy

| Layer | When to Run | Cost | Signal |
|---|---|---|---|
| Deterministic (1-2) | Every commit | Free | Is the scaffolding broken? |
| Benchmark (3) | Nightly / pre-release | Moderate | Has agent behavior regressed? |
| Judgment (4) | On-demand / pre-release | High | Does the system work end-to-end? |

4. Designing Tests for Agent Consumption

What Makes Tests “Agent-Friendly”

  1. One assertion per test: Agents parse failure output to decide what to fix. Multiple assertions in a single test create ambiguity about what’s wrong.

  2. Descriptive error messages: Include context explaining why the test failed and what was expected:

    assert result.status == "success", (
        f"Expected successful login but got {result.status}. "
        f"Error: {result.error_message}. "
        f"This usually means the auth token is expired or malformed."
    )
  3. Fast execution: Unit tests should complete in seconds. Agents iterate by running tests after every change — slow suites break the feedback loop.

  4. Deterministic setup/teardown: Use fixtures, not shared state. Each test must be independently runnable.

  5. Machine-parseable output: Use structured formats (TAP, JUnit XML, pytest JSON) that agents can programmatically interpret (TAP Protocol).

  6. The AAA Pattern: Structure every test as Arrange → Act → Assert for maximum agent readability.
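Putting several of these guidelines together, an AAA-structured, agent-friendly test might look like the following. `Session` and `Result` are minimal stand-ins invented for the illustration, not a real API.

```python
import time
from dataclasses import dataclass

@dataclass
class Result:
    status: str

class Session:
    """Minimal stand-in service used to illustrate the pattern."""
    def __init__(self, token_expiry):
        self.token_expiry = token_expiry

    def get_profile(self):
        if self.token_expiry < time.time():
            return Result(status="unauthorized")
        return Result(status="ok")

def test_expired_token_is_rejected():
    # Arrange: a session whose token expired an hour ago
    session = Session(token_expiry=time.time() - 3600)
    # Act: attempt the authenticated call
    result = session.get_profile()
    # Assert: one claim, with a diagnostic message the agent can act on
    assert result.status == "unauthorized", (
        f"Expected 'unauthorized' for an expired token, got {result.status!r}. "
        f"Check that token expiry is enforced before the profile lookup."
    )
```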

Test Output Formats for Agents

| Format | Agent Parseability | Language Support | Best For |
|---|---|---|---|
| TAP (Test Anything Protocol) | Excellent — ok/not ok is trivially parseable | 15+ languages | Cross-language agent workflows |
| JUnit XML | Good — structured XML | Java, Python, JS | CI/CD integration |
| pytest verbose | Good — human and machine readable | Python | Python-specific agents |
| JSON reporters | Excellent — native data structure | Most frameworks | Programmatic consumption |
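TAP's line protocol is simple enough that a harness can parse it directly. This sketch handles only the plain `ok` / `not ok` subset and ignores TAP directives such as `# SKIP`.

```python
import re

TAP_LINE = re.compile(r"^(ok|not ok)\s+(\d+)\s*-?\s*(.*)$")

def parse_tap(output):
    """Parse TAP output into (passed, test_number, description) tuples."""
    results = []
    for line in output.splitlines():
        m = TAP_LINE.match(line.strip())
        if m:
            results.append((m.group(1) == "ok", int(m.group(2)), m.group(3)))
    return results

def failures(output):
    """The subset an agent actually needs: which tests failed, and why."""
    return [(num, desc) for ok, num, desc in parse_tap(output) if not ok]
```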

Anti-Patterns: “Vibe Testing”

“Vibe testing” occurs when agents generate tests that technically pass but verify nothing — inflating coverage metrics while providing false confidence. Combat this by pairing coverage thresholds with assertion-quality metrics derived from AST analysis (DEV Community, 2026).


5. Test-Driven Agent Development (TDAD)

Why TDD Is a Natural Fit for Agents

“Everything that makes TDD a slog for humans makes it the perfect workflow for an AI agent” — AI thrives on clear, measurable goals, and a binary test is the clearest goal possible. AI eliminates TDD’s biggest weakness (manual labor of writing tests) while preserving its biggest strength (fast, unambiguous feedback) (Elliott, 2025).

The TDD Prompting Paradox

The TDAD paper (March 2026) revealed a critical finding: adding procedural TDD instructions without contextual test information increased regressions to 9.94% — worse than no intervention. But providing targeted context about which tests are at risk via graph-based impact analysis reduced regression rates by 70% (TDAD, 2026).

Key takeaway: Context over instruction. Agents benefit more from knowing which tests matter for a given change than from verbose “how to do TDD” instructions.

TDAD Results

| Metric | Baseline | With TDAD |
|---|---|---|
| Test-level regression rate | 6.08% | 1.82% (-70%) |
| PASS_TO_PASS failures | 562 | 155 (-72%) |
| Resolution rate (15 iterations) | 12% | 60% |

Practical Workflow

  1. Write tests first (or have the agent help write them)
  2. Audit tests to ensure they capture intended behavior
  3. Lock test files: Include “Do NOT modify the test files” in implementation prompts
  4. Start small: 3-5 tests covering core behavior, then iterate
  5. Let the agent iterate: Agent runs tests → reads failures → fixes code → repeats

6. Sandboxing and Isolation

Safe test execution requires isolating agent-generated code from the host system.

Sandbox Comparison

| Sandbox | Startup | Isolation Level | Best For |
|---|---|---|---|
| Docker | ~50ms | Process (shared kernel) | Development, general eval |
| E2B (Firecracker) | ~150ms | Hardware (dedicated kernel) | Production agent execution |
| Modal | ~90ms | Container (managed) | Parallel eval pipelines |
| gVisor | 50-100ms | User-space kernel | K8s-native workloads |
| nsjail | ~10ms | Process + seccomp | Lightweight sandboxing |

Recommendation: Docker for development; E2B or Modal for production; Firecracker/Proxmox for highest-security evaluations.
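For the development case, a minimal Docker invocation might look like the following sketch. `my-project-test` is a hypothetical image containing the project's dependencies, and `run-tests.sh` follows the single-entry-point convention recommended later in this report; the flags shown are standard Docker options.

```shell
# --network=none         no outbound access for agent-generated code
# --memory / --cpus      hard resource caps
# --read-only + --tmpfs  immutable root filesystem, scratch space in RAM
# -v "$PWD":/app:ro      project source mounted read-only
docker run --rm --network=none --memory=2g --cpus=2 \
  --read-only --tmpfs /tmp \
  -v "$PWD":/app:ro -w /app \
  my-project-test ./scripts/run-tests.sh
```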

The Inspect Sandboxing Toolkit (UK AISI) provides a reference architecture where “Inspect itself sits outside of the sandbox and sends commands into it” — commands originate externally, everything inside is explicitly authorized (AISI, 2026).


7. Evaluation Frameworks Comparison

| Framework | Type | Best For | Key Feature |
|---|---|---|---|
| Inspect AI | General framework | Production eval pipelines | 100+ built-in evals, Docker/K8s/Proxmox sandboxing |
| SWE-bench | Benchmark | Coding agent ranking | Real GitHub issues, FAIL_TO_PASS/PASS_TO_PASS pattern |
| BigCodeBench | Benchmark | Realistic coding tasks | 1,140 tasks across 139 libraries |
| Terminal-Bench | Benchmark | CLI agent testing | Real terminal environments |
| Aider Bench | Benchmark | Code editing tools | Tests full edit-apply-debug loop |
| METR Task Standard | Specification | Portable task definitions | 1,000+ tasks, adopted by UK AISI |
| FeatureBench | Benchmark | Feature development | Exposes the feature development gap |

8. Industry Best Practices

How Leading Companies Build Tests

Anthropic (Claude Code): Minimal scaffold philosophy — bash + edit tools, single-threaded master loop. Grade outcomes, not paths. Use three grader types: deterministic, LLM-based, and human. Start with 20-30 real failures as eval tasks (Anthropic, 2026).

Cursor: Private CursorBench sourced from real developer sessions via “Cursor Blame” tool. Uses “agentic graders” that can understand multiple valid solutions. Supplements offline evals with online A/B experiments (Cursor, 2026).

Cognition (Devin): Evaluator agents with browser and shell access autonomously judge outcomes. Simulated users test interactive capabilities. Production Devin achieves 74.2% without prior exposure to evaluation tasks (Cognition, 2026).

OpenAI: Codex operates in sandboxed containers with internet disabled. Self-bootstrapping: GPT-5.3-Codex was used to debug its own training. Eval-driven development with “measure, improve, ship” loop (OpenAI, 2026).

Consolidated Best Practices

  1. Start with failures: Collect 20-30 real failures and turn them into eval tasks
  2. Grade outcomes, not paths: Don’t test tool call sequences; test end states
  3. Use multi-trial statistics: pass@k and pass^k capture stochastic variance
  4. Layer grading approaches: Deterministic first → LLM judges for nuance → humans for calibration
  5. Test the system, not the model: Evaluate the full agent+harness pipeline
  6. Keep evals private: Public benchmarks invite gaming and contamination
  7. Combine offline and online evals: Offline catches regressions; online detects UX gaps
  8. Read the transcripts: No substitute for reviewing actual multi-step agent behavior

9. Cutting-Edge Developments (2025-2026)

Benchmark Evolution

SWE-bench Verified has been effectively retired due to data contamination — frontier models can reproduce gold patches verbatim. The field has shifted to:

  • SWE-bench Pro: Multi-language, private codebases; scores fall from 70-81% on Verified to roughly 23%
  • FeatureBench: Feature development (not bug-fixing) — best agents solve only 11% vs 74% on SWE-bench
  • SWE-CI: Long-term maintenance — zero-regression rates below 25% for most models
  • LiveCodeBench: Continuously sourced fresh problems to prevent contamination

Multi-Agent Testing

Empirical evidence shows that separating code generation from test generation improves quality. AgentCoder (multi-agent) achieved 79.9% vs 71.3% (single agent) on HumanEval (AgentCoder, 2024). The key: tests written by the same agent that wrote the code suffer from confirmation bias.

Property-Based Testing with Agents

Anthropic’s agentic PBT agent discovered that numpy.random.wald sometimes returns negative numbers — a real bug in NumPy that was patched upstream. Running against 100+ Python packages, the approach demonstrates that agents can find novel bugs through property-based testing at scale (Anthropic Red Team, 2026).
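The underlying pattern can be illustrated without the agentic machinery: a property-based check samples many generated inputs and asserts an invariant over each. This is a hand-rolled sketch using only the standard library — libraries such as Hypothesis add shrinking and smarter generation — and `wald_like_sample` is an invented stand-in, not the NumPy function.

```python
import random

def check_property(prop, generate, trials=1000, seed=0):
    """Check `prop(x)` over many generated inputs; return a counterexample
    if one is found, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = generate(rng)
        if not prop(x):
            return x  # counterexample found
    return None

# Property in the spirit of the NumPy finding: samples drawn from a
# non-negative distribution must themselves be non-negative.
def wald_like_sample(rng):
    # stand-in sampler; the real bug was in numpy.random.wald
    return rng.expovariate(1.0)

assert check_property(lambda x: x >= 0, wald_like_sample) is None
```

The agentic version replaces the hand-written `prop` and `generate` with model-proposed properties and generators, which is what lets it scale across 100+ packages.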

Continuous Eval Pipelines

CI/CD for agent capabilities is becoming standard: quality gates that block releases on regressions, LLM-as-judge scoring on every PR, production monitoring for quality drift. Recommended stack: DeepEval for CI/CD gates, RAGAS for metric exploration, Langfuse/LangSmith for production monitoring.

The Self-Testing Paradox

A surprising finding: in high-autonomy settings, agent-written tests provide only marginal utility. GPT-5.2 achieves nearly identical resolution rates whether or not it writes tests (71.8% vs 74.4%). Agent-written tests function primarily as observational tools — closer to print statements than to verification mechanisms (arXiv, 2026). This doesn’t negate pre-existing tests as specifications, but it challenges the assumption that agents writing their own tests during resolution is always beneficial.


10. Practical Getting-Started Guide

Step 1: Set Up Your Testing Infrastructure

project/
├── AGENTS.md          # Agent instructions including test commands
├── tests/
│   ├── unit/          # Fast, deterministic, run on every commit
│   ├── integration/   # Component-level, may use real APIs
│   └── evals/         # Agent behavior evals, run nightly
├── .claude/           # Agent harness configuration
└── scripts/
    └── run-tests.sh   # Single command to run all tests

Step 2: Write Agent-Friendly Tests

# GOOD: One assertion, descriptive message, fast
def test_user_creation_returns_valid_id():
    user = create_user(name="Alice", email="[email protected]")
    assert user.id is not None, (
        f"create_user returned None id. "
        f"Check database connection and user validation."
    )
    assert isinstance(user.id, int), (
        f"Expected int id but got {type(user.id).__name__}. "
        f"Database may be returning string UUIDs."
    )
 
# BAD: Multiple concerns, no diagnostic info
def test_user():
    user = create_user(name="Alice", email="[email protected]")
    assert user.id
    assert user.name == "Alice"
    assert user.email == "[email protected]"
    assert user.created_at
    assert validate_user(user)

Step 3: Configure Your AGENTS.md

## Testing
- Run tests with: `pytest tests/ -v --tb=short`
- Run only unit tests: `pytest tests/unit/ -v`
- NEVER modify test files — implement code to pass existing tests
- If a test fails, read the error message carefully before changing code
- Run tests after every code change

Step 4: Implement Feedback Loops

Use hooks that run tests automatically on every agent code change. On failure, surface only the error output (back-pressure pattern). On success, hooks are silent — nothing added to context.
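The hook's filtering logic reduces to a small function — a sketch, assuming the hook receives the test runner's exit code and captured output:

```python
def hook_output(returncode, test_output, max_lines=40):
    """Back-pressure pattern: on failure, surface only the error tail;
    on success, return nothing so the agent's context stays clean."""
    if returncode == 0:
        return ""  # silent on success — nothing added to context
    lines = test_output.splitlines()
    return "\n".join(lines[-max_lines:])
```

Capping the tail at a fixed line count keeps a pathological failure (thousands of identical errors) from flooding the context window.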

Step 5: Add Evaluation Layers Incrementally

  1. Start with deterministic unit tests (Layer 1)
  2. Add record-and-replay integration tests (Layer 2)
  3. Implement pass@k benchmark evals when ready (Layer 3)
  4. Reserve LLM-as-judge for subjective quality assessment (Layer 4)

11. Open Challenges

  1. The feature development gap: Best agents solve only 11% of FeatureBench vs 74% on SWE-bench
  2. Long-term maintenance regression: Zero-regression rates below 25% for most models during sustained development
  3. The oracle problem: Who verifies the specification (test) is correct when AI writes both code and tests?
  4. Statistical evaluation tooling: No mainstream tools offer confidence intervals or formal statistical aggregation for agent eval results
  5. Cross-model regression detection: How to systematically detect behavioral regressions when model providers update weights
  6. Visual and multimodal testing: Agents show a 73% performance drop on tasks that involve images

12. Key Principles (Summary)

  1. Tests are the primary interface between human intent and agent behavior. They function as specifications, feedback signals, and verification mechanisms simultaneously.
  2. The testing pyramid must be restructured around uncertainty tolerance, not test granularity.
  3. Harness quality determines agent quality. Model capabilities are necessary but not sufficient.
  4. Context over instruction. Agents benefit more from targeted contextual information than from verbose procedural instructions.
  5. Specification must be executable. Static documentation drifts; executable specifications enforce compliance mechanically.
  6. Grade outcomes, not paths. Testing specific sequences is brittle; test end states.
  7. Start with failures. Turn real production failures into eval tasks.
  8. Keep infrastructure lightweight. Every model release changes the optimal structure — design for replaceability.