Building Proper Tests for Coding Agents in Harness Engineering Frameworks
Executive Summary
Harness engineering — the discipline of building the infrastructure that wraps around AI coding agents to make them reliable, governable, and productive — has emerged as one of the most important new fields in software engineering. The field, named by Mitchell Hashimoto in February 2026, rests on a central insight: the harness matters more than the model. A mid-tier model in a great harness beats a frontier model in a bad one. Tests are the backbone of effective harnesses, functioning simultaneously as specifications for agents, feedback mechanisms, and verification gates. This report provides a comprehensive guide to building proper tests for applications used by coding agents within harness engineering frameworks.
1. Background: What Is a Coding Agent Harness?
A coding agent harness is the complete infrastructure wrapping an LLM-based coding agent — human approvals, sub-agent coordination, filesystem access, prompt presets, lifecycle hooks, planning, and execution. The term draws from horse tack: reins, saddle, and bit that channel a powerful but unpredictable animal in the right direction (Hashimoto, 2026).
The harness engineering formula: Agent = Model + Harness. The model provides intelligence; the harness makes that intelligence useful (Parallel AI).
The Empirical Evidence: Harness > Model
The Hashline experiment demonstrated this empirically: merely changing the harness’s tool format improved Grok Code Fast 1 from 6.7% to 68.3% on coding benchmarks — no model weights were modified. LangChain’s ranking jumped from 30th to 5th place on Terminal-Bench 2.0 by changing only the harness — same model, 13.7-point improvement (Fowler, 2026; LangChain, 2026).
Key Components
Harness engineering involves two main practices (Hashimoto, 2026):
- Better implicit prompting (AGENTS.md): For simple issues like wrong commands or wrong APIs, update the AGENTS.md file. Each line targets a specific bad agent behavior and, in most cases, almost completely resolves it.
- Programmed tools: Actual scripts — screenshots, filtered tests, etc. — paired with AGENTS.md instructions.
2. Tests as Specifications: The Core Insight
The foundational principle: test suites function as the most reliable specification language for coding agents. As Simon Willison noted, “the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against” (Willison, 2025).
Tests provide what agents need:
- Concrete, verifiable success criteria that ground the agent’s work
- Reduced hallucination risk through falsifiable outputs
- Iterative self-correction: write code → run tests → fix errors → repeat
OpenAI’s harness engineering report validates this: “Your likelihood of successfully solving a problem with a coding agent is strongly correlated with the agent’s ability to verify its own work” (OpenAI, 2026).
The SWE-bench Model: Tests as Hidden Oracle
SWE-bench operationalizes tests-as-specifications rigorously. Each task has “FAIL_TO_PASS” tests (verifying the fix works) and “PASS_TO_PASS” tests (verifying nothing broke). Tests are hidden from the agent — it must solve the problem from natural language alone. The test suite acts as a hidden oracle (Jimenez et al., 2024).
The Specification Gaming Risk
Agents may write tests that verify their own broken behavior. Test-first development prevents this: “when the tests exist before the code, agents cannot cheat by writing tests that simply confirm whatever incorrect implementation they produced” (The Register, 2026). Always include “Do NOT modify the test files” in implementation prompts.
3. The Agent Testing Pyramid
The traditional testing pyramid breaks down for AI agents because agents violate the assumption of deterministic outputs. Multiple organizations — Block Engineering, Zapier, LangWatch, AWS — have independently converged on a restructured pyramid organized around uncertainty tolerance (Block Engineering, 2026).
Layer 1: Deterministic Foundations (Unit Tests)
Mock out the LLM entirely and test everything around it: retry behavior, turn limits, tool validation, delegation logic, prompt assembly, guardrail enforcement.
- Run in milliseconds, cost nothing (no API calls)
- Run on every commit
- If tests fail here, the problem is in your code, not the AI
# Example: Test tool validation without LLM
def test_edit_tool_requires_exact_match():
    """Agent's edit tool must reject ambiguous replacements."""
    result = edit_tool.apply(
        old_string="foo",
        new_string="bar",
        file_content="foo bar foo"  # Two matches - should fail
    )
    assert result.error == "Multiple matches found"

Layer 2: Component-Level Evals (Integration Tests)
Test each component separately — retrieval, parsing, prompt construction, tool orchestration. Block Engineering introduced record-and-replay testing: record a good agent session, commit the fixture, create a regression test capturing real model behavior (Block Engineering, 2026).
Zapier’s “trajectory evals” score entire workflow executions, combining deterministic assertions with LLM-as-judge rubrics. Critical lesson: “unit test evals penalize different approaches, even when they’re smarter or more efficient” (Zapier/rwilinski, 2025).
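Block's record-and-replay idea can be sketched as a thin wrapper around the model client: in record mode real responses are written to a JSON fixture; in replay mode the fixture answers instead of the live model, making the regression test deterministic and free. The class and method names below are illustrative, not Block's actual API.

```python
# Sketch of record-and-replay testing: record a good session once,
# commit the fixture, replay it in CI. Names are illustrative.
import json
from pathlib import Path

class ReplayClient:
    def __init__(self, live_client=None,
                 fixture_path="fixtures/session.json", record=False):
        self.live = live_client
        self.path = Path(fixture_path)
        self.record = record
        self.cache = (json.loads(self.path.read_text())
                      if self.path.exists() else {})

    def complete(self, prompt: str) -> str:
        if not self.record:
            return self.cache[prompt]          # replay: deterministic, free
        response = self.live.complete(prompt)  # record: hit the real model
        self.cache[prompt] = response
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.cache, indent=2))
        return response
```

Committing the fixture turns one observed good session into a permanent regression test that captures real model behavior.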
Layer 3: Probabilistic Performance (Benchmark Evals)
Validate behaviors requiring multiple runs. Track four key metrics (AWS, 2026):
- pass@k: Probability at least one of k trials succeeds
- pass^k: Probability all k trials succeed
- Latency: Time to completion
- Token usage: Cost per task
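The two reliability metrics above can be computed from n recorded trials of which c succeeded. A minimal sketch: pass@k uses the standard unbiased combinatorial estimator, and pass^k is estimated naively as the per-trial success rate raised to the k-th power.

```python
# Sketch: multi-trial metrics from n trials with c successes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled trials succeeds)."""
    if n - c < k:
        return 1.0  # too few failures to draw an all-failure k-sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k trials succeed), estimated as (c/n)^k."""
    return (c / n) ** k
```

For an agent that passed 5 of 10 trials, `pass_at_k(10, 5, 2)` is about 0.78 while `pass_hat_k(10, 5, 2)` is 0.25 — the gap between "works if you retry" and "works every time".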
Layer 4: Judgment and Simulation (End-to-End Evals)
Agent simulations and human-judgment assessments. Use LLM-as-judge with clear rubrics, running evaluations three times with majority voting (Block Engineering, 2026).
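The triple-run majority vote can be sketched in a few lines. `judge` here is an assumption standing in for any LLM-as-judge call that returns a verdict string against a rubric.

```python
# Sketch: run the judge three times and take the majority verdict,
# smoothing over single-run judge noise. `judge` is a stand-in for
# a real model call.
from collections import Counter

def majority_judgment(judge, transcript: str, runs: int = 3) -> str:
    """Call the judge `runs` times and return the most common verdict."""
    verdicts = [judge(transcript) for _ in range(runs)]
    return Counter(verdicts).most_common(1)[0][0]
```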
CI/CD Integration Strategy
| Layer | When to Run | Cost | Signal |
|---|---|---|---|
| Deterministic (1-2) | Every commit | Free | Is the scaffolding broken? |
| Benchmark (3) | Nightly / pre-release | Moderate | Has agent behavior regressed? |
| Judgment (4) | On-demand / pre-release | High | Does the system work end-to-end? |
4. Designing Tests for Agent Consumption
What Makes Tests “Agent-Friendly”
- One assertion per test: Agents parse failure output to decide what to fix. Multiple assertions in a single test create ambiguity about what’s wrong.
- Descriptive error messages: Include context explaining why the test failed and what was expected:

  assert result.status == "success", (
      f"Expected successful login but got {result.status}. "
      f"Error: {result.error_message}. "
      f"This usually means the auth token is expired or malformed."
  )

- Fast execution: Unit tests should complete in seconds. Agents iterate by running tests after every change — slow suites break the feedback loop.
- Deterministic setup/teardown: Use fixtures, not shared state. Each test must be independently runnable.
- Machine-parseable output: Use structured formats (TAP, JUnit XML, pytest JSON) that agents can programmatically interpret (TAP Protocol).
- The AAA Pattern: Structure every test as Arrange → Act → Assert for maximum agent readability.
Test Output Formats for Agents
| Format | Agent Parseability | Language Support | Best For |
|---|---|---|---|
| TAP (Test Anything Protocol) | Excellent — ok/not ok is trivially parseable | 15+ languages | Cross-language agent workflows |
| JUnit XML | Good — structured XML | Java, Python, JS | CI/CD integration |
| pytest verbose | Good — human and machine readable | Python | Python-specific agents |
| JSON reporters | Excellent — native data structure | Most frameworks | Programmatic consumption |
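TAP's line protocol is simple enough that an agent-side parser fits in a few lines. This sketch handles only the basic `ok` / `not ok` forms, not TAP's directives or subtests.

```python
# Sketch: parse basic TAP output into structured results an agent
# can act on. Covers "ok N - desc" / "not ok N - desc" lines only.
import re

TAP_LINE = re.compile(r"^(not ok|ok)\s+(\d+)\s*-?\s*(.*)$")

def parse_tap(output: str) -> list:
    """Return [{'passed': bool, 'num': int, 'desc': str}, ...]."""
    results = []
    for line in output.splitlines():
        m = TAP_LINE.match(line.strip())
        if m:
            results.append({
                "passed": m.group(1) == "ok",
                "num": int(m.group(2)),
                "desc": m.group(3),
            })
    return results
```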
Anti-Patterns: “Vibe Testing”
“Vibe testing” occurs when agents generate tests that technically pass but verify nothing, inflating coverage metrics while providing false confidence. Combat this by pairing coverage thresholds with assertion-quality metrics based on AST analysis (DEV Community, 2026).
5. Test-Driven Agent Development (TDAD)
Why TDD Is a Natural Fit for Agents
“Everything that makes TDD a slog for humans makes it the perfect workflow for an AI agent” — AI thrives on clear, measurable goals, and a binary test is the clearest goal possible. AI eliminates TDD’s biggest weakness (manual labor of writing tests) while preserving its biggest strength (fast, unambiguous feedback) (Elliott, 2025).
The TDD Prompting Paradox
The TDAD paper (March 2026) revealed a critical finding: adding procedural TDD instructions without contextual test information increased regressions to 9.94% — worse than no intervention. But providing targeted context about which tests are at risk via graph-based impact analysis reduced regression rates by 70% (TDAD, 2026).
Key takeaway: Context over instruction. Agents benefit more from knowing which tests matter for a given change than from verbose “how to do TDD” instructions.
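As a toy illustration of "context over instruction": instead of telling the agent how to do TDD, compute which tests touch a changed module and put only those in the prompt. The TDAD paper uses graph-based impact analysis; this static import scan is a minimal stand-in, and all names here are illustrative.

```python
# Toy sketch of test-impact analysis: find tests that import a changed
# module. A stand-in for real dependency-graph analysis.
import ast
from pathlib import Path

def imported_modules(test_file: Path) -> set:
    """Top-level module names imported by a test file."""
    tree = ast.parse(test_file.read_text())
    mods = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def at_risk_tests(changed_module: str, test_dir: str) -> list:
    """Test files whose imports touch the changed module."""
    return [
        str(p) for p in sorted(Path(test_dir).rglob("test_*.py"))
        if changed_module in imported_modules(p)
    ]
```

The output of `at_risk_tests` becomes targeted context in the implementation prompt: "these tests must still pass after your change".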
TDAD Results
| Metric | Baseline | With TDAD |
|---|---|---|
| Test-level regression rate | 6.08% | 1.82% (-70%) |
| PASS_TO_PASS failures | 562 | 155 (-72%) |
| Resolution rate (15 iterations) | 12% | 60% |
Practical Workflow
- Write tests first (or have the agent help write them)
- Audit tests to ensure they capture intended behavior
- Lock test files: Include “Do NOT modify the test files” in implementation prompts
- Start small: 3-5 tests covering core behavior, then iterate
- Let the agent iterate: Agent runs tests → reads failures → fixes code → repeats
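Step 5 above can be sketched as a bounded loop. `ask_agent_to_fix` is an assumption standing in for whatever mechanism sends failing output back to the agent; the default test command is illustrative.

```python
# Sketch of the agent iterate loop: run tests, feed failures back,
# stop on green or after a round cap. `ask_agent_to_fix` is a stand-in.
import subprocess

def iterate_until_green(ask_agent_to_fix,
                        test_cmd=("pytest", "tests/", "-x", "--tb=short"),
                        max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        run = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if run.returncode == 0:
            return True                           # all tests pass: done
        ask_agent_to_fix(run.stdout + run.stderr) # only failures go back
    return False
```

The round cap matters: without it, an agent stuck on an unsatisfiable test burns tokens indefinitely.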
6. Sandboxing and Isolation
Safe test execution requires isolating agent-generated code from the host system.
Sandbox Comparison
| Sandbox | Startup | Isolation Level | Best For |
|---|---|---|---|
| Docker | ~50ms | Process (shared kernel) | Development, general eval |
| E2B (Firecracker) | ~150ms | Hardware (dedicated kernel) | Production agent execution |
| Modal | ~90ms | Container (managed) | Parallel eval pipelines |
| gVisor | 50-100ms | User-space kernel | K8s-native workloads |
| nsjail | ~10ms | Process + seccomp | Lightweight sandboxing |
Recommendation: Docker for development; E2B or Modal for production; Firecracker/Proxmox for highest-security evaluations.
The Inspect Sandboxing Toolkit (UK AISI) provides a reference architecture where “Inspect itself sits outside of the sandbox and sends commands into it” — commands originate externally, everything inside is explicitly authorized (AISI, 2026).
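For the Docker development case, a minimal sketch of a locked-down invocation: no network, resource caps, and the source tree mounted read-only. The image name and paths are illustrative; building the argv separately keeps it unit-testable without Docker installed.

```python
# Sketch: run agent-written tests in a throwaway Docker container
# with no network and a read-only source mount. Image/paths illustrative.
import subprocess

def build_sandbox_cmd(workdir: str, image: str = "python:3.12-slim") -> list:
    return [
        "docker", "run", "--rm",
        "--network=none",              # no internet for agent code
        "--memory=1g", "--cpus=1",     # resource caps
        "-v", f"{workdir}:/work:ro",   # source mounted read-only
        "-w", "/work",
        image, "pytest", "-q",
    ]

def run_sandboxed(workdir: str) -> int:
    """Return the test run's exit code from inside the sandbox."""
    return subprocess.run(build_sandbox_cmd(workdir)).returncode
```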
7. Evaluation Frameworks Comparison
| Framework | Type | Best For | Key Feature |
|---|---|---|---|
| Inspect AI | General framework | Production eval pipelines | 100+ built-in evals, Docker/K8s/Proxmox sandboxing |
| SWE-bench | Benchmark | Coding agent ranking | Real GitHub issues, FAIL_TO_PASS/PASS_TO_PASS pattern |
| BigCodeBench | Benchmark | Realistic coding tasks | 1,140 tasks across 139 libraries |
| Terminal-Bench | Benchmark | CLI agent testing | Real terminal environments |
| Aider Bench | Benchmark | Code editing tools | Tests full edit-apply-debug loop |
| METR Task Standard | Specification | Portable task definitions | 1,000+ tasks, adopted by UK AISI |
| FeatureBench | Benchmark | Feature development | Exposes the feature development gap |
8. Industry Best Practices
How Leading Companies Build Tests
Anthropic (Claude Code): Minimal scaffold philosophy — bash + edit tools, single-threaded master loop. Grade outcomes, not paths. Use three grader types: deterministic, LLM-based, and human. Start with 20-30 real failures as eval tasks (Anthropic, 2026).
Cursor: Private CursorBench sourced from real developer sessions via “Cursor Blame” tool. Uses “agentic graders” that can understand multiple valid solutions. Supplements offline evals with online A/B experiments (Cursor, 2026).
Cognition (Devin): Evaluator agents with browser and shell access autonomously judge outcomes. Simulated users test interactive capabilities. Production Devin achieves 74.2% without prior exposure to evaluation tasks (Cognition, 2026).
OpenAI: Codex operates in sandboxed containers with internet disabled. Self-bootstrapping: GPT-5.3-Codex was used to debug its own training. Eval-driven development with “measure, improve, ship” loop (OpenAI, 2026).
Consolidated Best Practices
- Start with failures: Collect 20-30 real failures and turn them into eval tasks
- Grade outcomes, not paths: Don’t test tool call sequences; test end states
- Use multi-trial statistics: pass@k and pass^k capture stochastic variance
- Layer grading approaches: Deterministic first → LLM judges for nuance → humans for calibration
- Test the system, not the model: Evaluate the full agent+harness pipeline
- Keep evals private: Public benchmarks invite gaming and contamination
- Combine offline and online evals: Offline catches regressions; online detects UX gaps
- Read the transcripts: No substitute for reviewing actual multi-step agent behavior
9. Cutting-Edge Developments (2025-2026)
Benchmark Evolution
SWE-bench Verified has been effectively retired due to data contamination — frontier models can reproduce gold patches verbatim. The field has shifted to:
- SWE-bench Pro: Multi-language, private codebases; scores drop from 70-81% on Verified to roughly 23%
- FeatureBench: Feature development (not bug-fixing) — best agents solve only 11% vs 74% on SWE-bench
- SWE-CI: Long-term maintenance — zero-regression rates below 25% for most models
- LiveCodeBench: Continuously sourced fresh problems to prevent contamination
Multi-Agent Testing
Empirical evidence shows that separating code generation from test generation improves quality. AgentCoder (multi-agent) achieved 79.9% vs 71.3% (single agent) on HumanEval (AgentCoder, 2024). The key: tests written by the same agent that wrote the code suffer from confirmation bias.
Property-Based Testing with Agents
Anthropic’s agentic PBT agent discovered that numpy.random.wald sometimes returns negative numbers — a real bug in NumPy that was patched upstream. Running against 100+ Python packages, the approach demonstrates that agents can find novel bugs through property-based testing at scale (Anthropic Red Team, 2026).
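The core of property-based testing needs no framework: sample many random inputs and check an invariant. A minimal sketch of the same kind of non-negativity property as the wald bug, where `sample_wait_time` is a hypothetical function standing in for the code under test (real agentic PBT would generate both the properties and the inputs).

```python
# Sketch: framework-free property check of a non-negativity invariant.
# `sample_wait_time` is a hypothetical stand-in for the code under test.
import random

def sample_wait_time(mu: float, rng: random.Random) -> float:
    # stand-in implementation: log-normal draws are positive by construction
    return rng.lognormvariate(mu, 1.0)

def check_nonnegative_property(trials: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(trials):
        mu = rng.uniform(-2.0, 2.0)
        value = sample_wait_time(mu, rng)
        assert value >= 0, f"negative sample {value} for mu={mu}"
```

Libraries like Hypothesis automate input generation and shrinking; the agentic version additionally proposes which invariants are worth checking.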
Continuous Eval Pipelines
CI/CD for agent capabilities is becoming standard: quality gates that block releases on regressions, LLM-as-judge scoring on every PR, production monitoring for quality drift. Recommended stack: DeepEval for CI/CD gates, RAGAS for metric exploration, Langfuse/LangSmith for production monitoring.
The Self-Testing Paradox
A surprising finding: in high-autonomy settings, agent-written tests provide marginal utility. GPT-5.2 achieves nearly identical results (71.8% vs 74.4%) while writing almost no tests. Agent-written tests function primarily as observational tools (prints) rather than verification mechanisms (arXiv, 2026). This doesn’t negate pre-existing tests as specifications, but challenges the assumption that agents writing their own tests during resolution is always beneficial.
10. Practical Getting-Started Guide
Step 1: Set Up Your Testing Infrastructure
project/
├── AGENTS.md # Agent instructions including test commands
├── tests/
│ ├── unit/ # Fast, deterministic, run on every commit
│ ├── integration/ # Component-level, may use real APIs
│ └── evals/ # Agent behavior evals, run nightly
├── .claude/ # Agent harness configuration
└── scripts/
└── run-tests.sh # Single command to run all tests
Step 2: Write Agent-Friendly Tests
# GOOD: Focused assertions, descriptive messages, fast
def test_user_creation_returns_valid_id():
    user = create_user(name="Alice", email="alice@example.com")
    assert user.id is not None, (
        "create_user returned None id. "
        "Check database connection and user validation."
    )
    assert isinstance(user.id, int), (
        f"Expected int id but got {type(user.id).__name__}. "
        "Database may be returning string UUIDs."
    )

# BAD: Multiple concerns, no diagnostic info
def test_user():
    user = create_user(name="Alice", email="alice@example.com")
    assert user.id
    assert user.name == "Alice"
    assert user.email == "alice@example.com"
    assert user.created_at
    assert validate_user(user)

Step 3: Configure Your AGENTS.md
## Testing
- Run tests with: `pytest tests/ -v --tb=short`
- Run only unit tests: `pytest tests/unit/ -v`
- NEVER modify test files — implement code to pass existing tests
- If a test fails, read the error message carefully before changing code
- Run tests after every code change

Step 4: Implement Feedback Loops
Use hooks that run tests automatically on every agent code change. On failure, surface only the error output (back-pressure pattern). On success, hooks are silent — nothing added to context.
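The back-pressure pattern can be sketched as a post-edit hook: run the test command, stay silent on success, and surface only the tail of the output on failure. The command and tail length are illustrative choices.

```python
# Sketch of the back-pressure pattern: silent on green, only the
# failure tail on red, so the agent's context stays lean.
import subprocess

def post_edit_hook(test_cmd: list, tail_lines: int = 30) -> str:
    """Return '' on green (nothing enters the context), failures otherwise."""
    run = subprocess.run(test_cmd, capture_output=True, text=True)
    if run.returncode == 0:
        return ""  # silent: no tokens added to the agent's context
    output = (run.stdout + run.stderr).splitlines()
    return "\n".join(output[-tail_lines:])
```

Returning the empty string on success is the point: a hook that announces every green run steadily pollutes the context window with noise.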
Step 5: Add Evaluation Layers Incrementally
- Start with deterministic unit tests (Layer 1)
- Add record-and-replay integration tests (Layer 2)
- Implement pass@k benchmark evals when ready (Layer 3)
- Reserve LLM-as-judge for subjective quality assessment (Layer 4)
11. Open Challenges
- The feature development gap: Best agents solve only 11% of FeatureBench vs 74% on SWE-bench
- Long-term maintenance regression: Zero-regression rates below 25% for most models during sustained development
- The oracle problem: Who verifies the specification (test) is correct when AI writes both code and tests?
- Statistical evaluation tooling: No mainstream tools offer confidence intervals or formal statistical aggregation for agent eval results
- Cross-model regression detection: How to systematically detect behavioral regressions when model providers update weights
- Visual and multimodal testing: 73% performance drop when images are involved
12. Key Principles (Summary)
- Tests are the primary interface between human intent and agent behavior. They function as specifications, feedback signals, and verification mechanisms simultaneously.
- The testing pyramid must be restructured around uncertainty tolerance, not test granularity.
- Harness quality determines agent quality. Model capabilities are necessary but not sufficient.
- Context over instruction. Agents benefit more from targeted contextual information than from verbose procedural instructions.
- Specification must be executable. Static documentation drifts; executable specifications enforce compliance mechanically.
- Grade outcomes, not paths. Testing specific sequences is brittle; test end states.
- Start with failures. Turn real production failures into eval tasks.
- Keep infrastructure lightweight. Every model release changes the optimal structure — design for replaceability.