Adaptive Harness Foundry — Architecture
Executive Summary
The Adaptive Harness Foundry (AHF) is a proof-of-concept implementation of the HarnessX architecture using Google ADK Python v2.2+. It demonstrates that agent behavior can be composed, versioned, evaluated, and evolved through configuration-level mutations only — without modifying model weights, source code, or training data.
The system wraps an ADK LlmAgent in a lifecycle processor pipeline, captures structured traces of every invocation, evaluates outputs deterministically, and uses a constrained meta-agent pipeline to propose bounded harness modifications. A deterministic promotion gate ensures candidates only replace the active harness when they provably improve target metrics without regressing safety or overall performance.
Why This Architecture
Most agent improvement systems blur three separate concerns together: the runtime that executes the agent, the scoring logic that decides whether it worked, and the adaptation logic that tries to improve it. That makes failures hard to diagnose. If an agent changes, you cannot always tell whether the cause was a prompt tweak, an evaluator drift, or a hidden runtime behavior.
AHF splits those concerns into planes so each one can change on its own. The runtime plane only compiles and runs harnesses. The trace plane only records what happened. The evaluation plane only scores those recorded facts. The evolution plane can suggest bounded config changes, but it cannot directly rewrite code or bypass the gate. That separation is what makes deterministic promotion possible.
To ground the rest of this page, use one concrete example: a policy compliance task family. A support agent answers questions about internal policy. The runtime injects the right processors and tools, the trace plane records which tools were called and what the model returned, the evaluation plane checks correctness and grounding, and the evolution plane can respond to repeated failures by tightening the policy-specific harness variant instead of rewriting the whole system.
Architecture Overview
graph TD
OP[Operator Plane]
EV[Evolution Plane]
EVAL[Evaluation Plane]
TR[Trace Plane]
RT[Runtime Plane]
CAT[Catalog Plane]
OP --> EV
EV --> EVAL
EVAL --> TR
TR --> RT
RT --> CAT
OP1[FastAPI Server]
OP2[Minimal React UI]
EV1[Digester]
EV2[Planner]
EV3[Evolver]
EV4[Critic]
EV5[Validator]
EV6[Gate]
EVAL1[Benchmark Runner]
EVAL2[Deterministic Evaluators]
TR1[TraceRecorder]
RT1[Harness Compiler]
RT2[ADK Agent Assembly]
RT3[Task Runner]
CAT1[SQLite Catalog]
OP --> OP1
OP --> OP2
EV --> EV1
EV --> EV2
EV --> EV3
EV --> EV4
EV --> EV5
EV --> EV6
EVAL --> EVAL1
EVAL --> EVAL2
TR --> TR1
RT --> RT1
RT --> RT2
RT --> RT3
CAT --> CAT1 The diagram above shows the planes as a stack. In practice, they behave more like a contract system. Each plane accepts a small set of inputs, emits a narrow set of outputs, and hands evidence to the next plane without sharing authority.
Trust Boundaries
| # | Boundary | Protects |
|---|---|---|
| TB1 | Evolution plane → Catalog | Catalog immutability — evolved harnesses are new versions, never mutations of existing ones |
| TB2 | Evolution plane → Runtime | No source code emission — only structured YAML/JSON patches |
| TB3 | Evaluation plane → Promotion | Deterministic gate — LLM cannot approve its own output |
| TB4 | Task family → Variant | Variant isolation — task-family-specific configs can’t leak into other families |
| TB5 | Meta-agent → Held-out tasks | Held-out split never exposed to evolution — prevents benchmark overfitting |
Trust Boundaries at a Glance
flowchart TD
subgraph CAT[Catalog Plane]
CAT1[Immutable harness versions]
end
subgraph RT[Runtime Plane]
RT1[Compiled ADK agent]
end
subgraph TR[Trace Plane]
TR1[Structured trace events]
end
subgraph EVAL[Evaluation Plane]
EVAL1[Deterministic scores]
end
subgraph EVO[Evolution Plane]
EVO1[Patch proposal pipeline]
EVO2[Held-out tasks hidden]
end
EVO -->|TB1: new versions only| CAT
EVO -->|TB2: patches, not code| RT
EVAL -->|TB3: gate controls promotion| CAT
RT -->|TB4: family-scoped variants| CAT
EVO -.->|TB5: no access| EVO2 Component Architecture
Catalog Plane
- What it does: Stores the harness definitions that the rest of the system treats as source of truth. A harness is not an informal prompt bundle; it is a typed, hashed record that can be recompiled later.
- Inputs and outputs: Inputs are baseline harness registrations, variant definitions, and promotion records. Outputs are immutable versions that the runtime can compile and the operator can inspect.
- Failure mode: If canonicalization, hashing, lineage, or version linkage is wrong, the rest of the system loses provenance. A candidate may still run, but you can no longer prove exactly what was promoted.
- HarnessDefinition: Immutable YAML/JSON configuration with canonical hash (SHA-256 of normalized serialization). Contains model config, agent config, tool policy, processor pipeline assignment.
- HarnessVersion: Each edit produces a new version. Version chain: parent → child. Authorship tracked (human vs meta_agent).
- ProcessorSpec: Typed processor with declared hook, capability set (read/write permissions), version.
- VariantDefinition: Inherits from a base harness, overrides specific fields for a task family.
- PromotionRecord: Links candidate to baseline, records gate-by-gate results, stores evidence references.
In the policy compliance example, the catalog might hold a base enterprise-support harness plus a policy_question variant that adds stricter citation requirements and a narrower tool allowlist.
Runtime Plane
- What it does: Compiles a harness into a working Google ADK agent, attaches lifecycle processors to callbacks, and executes tasks one by one.
- Inputs and outputs: Inputs are a harness version, task definition, task family, and available tools. Outputs are model responses, tool invocations, latencies, token counts, and callback events.
- Failure mode: Bad config resolution, callback miswiring, or processor capability mismatches cause runtime errors or incorrect behavior before evaluation even starts.
- HarnessCompiler: Validates harness config, resolves processor references, compiles into an ADK
LlmAgentwith attached callbacks. - ADKApp: Factory function creating the ADK application (agent + tools + runner).
- TaskRunner: Executes a benchmark task against a harness, collecting trace events.
- CallbackBridge: Maps
LifecycleHookenum to ADK callback types (before_agent,after_agent,before_model,after_model,before_tool,after_tool).
For a policy question, the runtime can attach processors that inject task-family context before the model runs, block forbidden tools before invocation, and require citations after tool results come back.
Trace Plane
- What it does: Turns runtime behavior into durable evidence. Every relevant callback, tool call, latency sample, and error becomes a normalized record.
- Inputs and outputs: Inputs are raw runtime events, provisional agent state, and redaction rules. Outputs are
TraceEventrows in SQLite plus optional JSONL exports. - Failure mode: Missing or malformed traces undermine evaluation and evolution. If secrets are not redacted before persistence, the trace plane becomes a liability instead of a debugging asset.
- TraceRecorder: Writes normalized
TraceEventrecords to SQLite and optionally exports JSONL. - TraceRepository: Query interface for traces by run_id, task_id, harness_id, failure classification.
- RedactionService: Strips API keys, PII, and credentials before persistence.
When the policy agent answers incorrectly, the trace plane shows whether the root cause was missing retrieval, a disallowed tool call, unsupported grounding, or a slow multi-step loop.
Evaluation Plane
- What it does: Converts traces into deterministic judgments. It does not ask another model whether an answer “seems good”; it applies code to the recorded facts.
- Inputs and outputs: Inputs are benchmark fixtures, traces, and evaluator rules for correctness, safety, grounding, and efficiency. Outputs are task-level scores, family summaries, and comparison reports.
- Failure mode: Weak or incomplete evaluators produce false confidence. The plane stays deterministic, but the benchmark may still measure the wrong thing if the rubric is poorly designed.
- BenchmarkRunner: Loads benchmark tasks, routes to correct task family variant, executes serially.
- ScoringEngine: Computes TaskScore from trace events using deterministic rules.
- ComparisonEngine: Diffs two runs at task, family, and global levels.
In the policy family, evaluation can reward correct answers with grounded citations while penalizing unapproved tools, missing evidence, or excessive tool chatter.
Evolution Plane
- What it does: Reads repeated failures and proposes the smallest allowed harness change that might fix them. It is deliberately constrained: no source file writes, no evaluator edits, no secret access to held-out tasks.
- Inputs and outputs: Inputs are failed traces from the evolution split, benchmark summaries, and allowed patch operations. Outputs are candidate harness patches plus structured review artifacts.
- Failure mode: Unsafe, leaky, or ineffective patches are common, which is why this plane is surrounded by critics, linters, and a deterministic promotion gate. Rejection is expected behavior, not system failure.
- Digester: Analyzes failed traces, clusters failures, identifies recurring patterns.
- Planner: Selects one bounded adaptation objective from observed failures.
- Evolver: Generates a
HarnessPatchusing allowed operations (add/remove/replace processor, update config, create variant). - Critic: Reviews patch for safety, benchmark leakage, reward hacking.
- PatchLinter: Static analysis rejecting patches with benchmark IDs, expected answers, evaluator tampering.
- PromotionGate: Deterministic acceptance policy (all-gates-must-pass).
In the policy example, the evolver might tighten a variant-specific citation processor or adjust a model instruction block. If that patch improves policy tasks but harms incident handling, the gate can still reject it.
Data Flow
sequenceDiagram
actor Operator
participant Catalog
participant CLI
participant Runtime
participant Trace
participant Evaluation
participant Evolution
participant Gate
Operator->>Catalog: Register baseline harness
CLI->>Runtime: Compile harness and run benchmark
Runtime->>Trace: Record callbacks, tool calls, outputs, errors
Evaluation->>Trace: Read traces for scoring
Evaluation-->>CLI: BenchmarkReport
CLI->>Evolution: Start evolve cycle with failed traces
Evolution->>Trace: Read failure clusters from evolution split
Evolution->>Catalog: Write candidate patch as new version
CLI->>Runtime: Run candidate benchmark
Evaluation-->>Gate: Compare baseline vs candidate
Gate->>Catalog: Promote or reject candidate This flow is intentionally one-way at each stage. The evaluation plane can read traces, but it cannot rewrite them. The evolution plane can propose a new catalog entry, but it cannot directly mutate the active one.
Evolution Pipeline
flowchart TD
F[Failed traces] --> D[Digester]
D --> P[Planner]
P --> E[Evolver]
E --> C{Critic: safe?}
C -->|No| R1[Reject candidate]
C -->|Yes| L{Patch linter: clean?}
L -->|No| R2[Reject candidate]
L -->|Yes| B{Benchmark: improved?}
B -->|No| R3[Reject candidate]
B -->|Yes| G{Promotion gate passes?}
G -->|No| R4[Reject candidate]
G -->|Yes| A[Promote candidate] The important architectural point is that generation happens early and judgment happens late. AHF lets the meta-agent explore bounded ideas, but only after multiple non-LLM checks decide whether those ideas are acceptable.
Key Design Decisions
- Google ADK v2 as runtime engine — Not wrapped in unnecessary abstractions. The foundry owns configuration, tracing, evaluation, and versioning; ADK handles LLM orchestration. ADK v2 callbacks map cleanly to HarnessX lifecycle hooks.
- Configuration-only evolution — The meta-agent proposes YAML/JSON patches, never Python code. This bounds the adaptation surface to what can be validated and linted.
- Deterministic promotion gate — Implemented as code, not delegated to an LLM. Every criterion is independently testable.
- Variant isolation — Task families get separate harness variants. This limits blast radius of changes and enables family-specific optimization.
- Held-out evaluation — 6 tasks (20%) are never exposed to the meta-agent. This tests generalization.
Trade-offs
| Decision | Trade-off |
|---|---|
| YAML/JSON patches only | Safe and auditable, but limited adaptation surface vs. code generation |
| Deterministic gate | Provably correct, but cannot express nuanced “better but different” judgments |
| Fake model for testing | CI runs without API keys, but may not catch model-specific behavior |
| SQLite for catalog + traces | Simple deployment, but not multi-writer safe |
| Single-agent architecture | Clearer trace causality, but cannot demonstrate multi-agent evolution |
These trade-offs bite in specific places. Configuration-only evolution means you cannot repair a missing tool implementation from inside the system; a human still has to change code. Deterministic gates make promotion explainable, but they also reject candidates that may look better to a human reviewer while failing one explicit threshold. SQLite keeps the proof of concept easy to run, but it becomes the first pressure point if multiple operators or runners need concurrent writes.
Deployment Topology
Local development:
Python venv → ADK → Gemini (live) or fake model (test)
SQLite database (single file)
FastAPI dev server (uvicorn)
CLI via Typer
Docker:
Dockerfile + docker-compose.yml
Environment variables for API keys
Volume mount for SQLite persistence