🏗️ Building High-Quality AI Agents 🤖 — A Comprehensive, Actionable Field Guide - P2 📘
A synthesis of hard-won lessons from Claude Code, OpenHands, SWE-agent, GoClaw, Nanobot, PicoClaw, and the emerging discipline of harness engineering. This is the guide we wish existed when we started building agents.
The goal: an AI agent that is fast, scalable, capable, reliable, efficient, and secure — not by accident, but by design.
📖 How to read this guide
- Read top-to-bottom for the full mental model. Each section builds on the previous one.
- Skim the boxes if you only want the takeaways — every section ends with an Actionable rules box.
- Jump to Part 14 — The Build-Your-Own Roadmap if you already know the theory and want a sequenced plan.
- Bookmark Part 15 — Anti-Patterns for design reviews.
📋 Table of Contents
- ⚡ Part 0 — The Core Equation
- 🧠 Part 1 — Mental Model: What an AI Agent Actually Is
- 🔄 Part 2 — The Agent Loop (the Kernel)
- 🛠️ Part 3 — Tools: The Agent's Hands
- 💭 Part 4 — Context Engineering
- 💾 Part 5 — Memory (Long-Term Knowledge)
- ⚡ Part 6 — Concurrency & Multi-Agent Patterns
- 🔧 Part 7 — Reliability: Error Recovery, Stuck Detection, Autosubmit
- 🔒 Part 8 — Security: Defense-in-Depth
- 🏢 Part 9 — Multi-Tenancy from Day One
- 🚀 Part 10 — Performance & Efficiency
- 🔌 Part 11 — Provider Abstraction & Resilience
- 📡 Part 12 — Channels & Integration Surface
- 📊 Part 13 — Observability & Evaluation
- 🗺️ Part 14 — The Build-Your-Own Roadmap
- ⚠️ Part 15 — Anti-Patterns to Avoid
- 🎯 Part 16 — Closing: The Harness Mindset
Parts 0–12 are covered in Part 1: https://viblo.asia/p/building-high-quality-ai-agents-a-comprehensive-actionable-field-guide-part-1-ymJXDQ9rJkq
📊 Part 13 — Observability & Evaluation
13.1 🔎 Trace everything
Three span types: `agent`, `llm_call`, `tool_call`. Wrap every LLM call in a span. Wrap every tool call in a span. The trace tree then mirrors the shape of the run.
| Detail | Value |
|---|---|
| Batch size | 100 spans |
| On batch failure | retry individually |
| Verbose mode | full input/output truncated at 50 KB |
| Span exporters | OpenTelemetry compatible |
13.2 💰 Cost tracking from step 1
Every API response runs through a cost accumulator:
- Per-model usage in bootstrap state.
- Reports to OpenTelemetry.
- Recursively processes nested model calls (sub-agents, recall queries).
- Persists to project config on process exit.
- Restores on next session if persisted session ID matches.
Histograms use reservoir sampling (Algorithm R) with 1,024 entries to compute p50/p95/p99. Averages hide tail latency, and tail latency is what users feel.
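As a concrete sketch of that sampler, here is Algorithm R in Go. The type and method names are illustrative, not from any of the cited codebases; only the 1,024-sample size comes from the guide:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// Reservoir keeps a fixed-size uniform sample of an unbounded stream
// (Vitter's Algorithm R), so percentile queries stay cheap at any volume.
type Reservoir struct {
	samples []float64
	seen    int64
	size    int
	rng     *rand.Rand
}

func NewReservoir(size int) *Reservoir {
	return &Reservoir{size: size, rng: rand.New(rand.NewSource(42))}
}

func (r *Reservoir) Add(v float64) {
	r.seen++
	if len(r.samples) < r.size {
		r.samples = append(r.samples, v)
		return
	}
	// Replace a random slot with probability size/seen, keeping the
	// sample uniform over everything seen so far.
	if j := r.rng.Int63n(r.seen); j < int64(r.size) {
		r.samples[j] = v
	}
}

// Percentile returns the p-th percentile (0-100) of the current sample.
func (r *Reservoir) Percentile(p float64) float64 {
	if len(r.samples) == 0 {
		return 0
	}
	s := append([]float64(nil), r.samples...)
	sort.Float64s(s)
	idx := int(p / 100 * float64(len(s)-1))
	return s[idx]
}

func main() {
	res := NewReservoir(1024)
	for i := 1; i <= 10000; i++ {
		res.Add(float64(i % 500)) // simulated latencies in ms
	}
	fmt.Printf("p50=%.0f p95=%.0f p99=%.0f\n",
		res.Percentile(50), res.Percentile(95), res.Percentile(99))
}
```

Unlike a full histogram, memory stays fixed at 1,024 floats no matter how many calls you record.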
Even in v0, instrument cost and latency. You cannot decide what to optimize by feel.
13.3 ⏮️ Replayable trajectories
Every step() writes a .traj JSON file containing history, model output, observations, costs. SWE-agent's run-replay re-executes any old run. The append-only event log is the source of truth.
Worth it just for debugging. When the agent does something weird at minute 47, you can rewind to any event and try a different model or prompt.
13.4 🧪 Eval taxonomies
Three layers of evaluation:
| Eval | What it measures |
|---|---|
| Single-step | Does one tool call work correctly? |
| Full-run | Does the complete task get solved? |
| Multi-turn | Does the agent handle evolving goals? |
13.5 📝 Trace grading
Grade agent traces directly — especially helpful for multi-step tasks where final output alone doesn't reveal process quality. Use a separate LLM as a judge with a clear rubric.
13.6 🎯 Skill-level evals
Measure whether a specific skill actually helps using:
- Bounded tasks — reproducible inputs.
- Deterministic verifiers — automated pass/fail.
- No-skill baseline — does the skill move the needle?
- Trace review — human-spotcheck the failures.
13.7 📡 Infrastructure noise
Runtime configuration can move coding benchmark scores by more than many leaderboard gaps.
Infrastructure choices may matter more than model intelligence. The same model with a better harness, better tools, better verification, lands a higher score.
13.8 📒 Activity log for every admin action
Every admin write to global tables (settings, permissions, tool config) appends to an audit log: { tenant_id, actor_id, action, target, timestamp, ip }. Cheap to write, invaluable when "who changed X?" comes up.
✅ Actionable rules
- Spans on every LLM and tool call. Trace tree mirrors the run.
- Cost + reservoir-sampled latency from day one.
- Append-only event log = replayable trajectories.
- Eval at three layers: single-step, full-run, multi-turn. Trace-grade.
- Skill-level evals with no-skill baselines. If it doesn't move the needle, drop it.
- Audit log for every admin action.
🗺️ Part 14 — The Build-Your-Own Roadmap
A pragmatic order to implement everything above. Each step compiles and runs on its own.
🌱 Milestone 0 — Foundation (1–2 days)
- Pick the language. Go for small/portable; Python for ML/research/speed.
- Pick the DB: PostgreSQL + pgvector if you ever want vector search.
- Skeleton: `cmd/`, `internal/`, `pkg/`, `migrations/`, `docs/`, `Makefile`, `docker-compose.yml`.
- Define the `Provider` interface (4 methods).
- Implement one provider — start with OpenAI-compatible (covers Groq, DeepSeek, Together for free).
- `cmd/serve` loads config, makes one HTTP request, prints the response.
🔁 Milestone 1 — Minimum Viable Agent Loop (1 week)
- Define the `Tool` interface: `name`, `description`, `schema`, `execute(ctx, args)`.
- Implement 3 tools: `read_file`, `write_file`, `list_files` — workspace-scoped, with a `resolvePath()` traversal guard.
- Build the loop: `for i := 0; i < 20; i++ { think; if no tools break; act; observe }`.
- Persist sessions: a `SessionStore` interface + an in-memory implementation.
- Emit events via callback. Three only: `run.started`, `tool.call`, `run.completed`.
- HTTP endpoint `/v1/chat/completions` (OpenAI-compatible). One agent. No streaming yet.
You now have an LLM that can read/write files in a workspace.
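The Milestone 1 loop fits in a few dozen lines of Go. This is a simplified sketch: `Provider`, `Tool`, and the scripted model below are stand-ins for the real interfaces, not the guide's actual types:

```go
package main

import "fmt"

type ToolCall struct{ Name, Args string }

// Provider is a stand-in for the real LLM interface: given history,
// return a reply plus any requested tool calls.
type Provider interface {
	Think(history []string) (reply string, calls []ToolCall)
}

type Tool func(args string) string

// RunLoop: think; if no tools were requested, stop; otherwise act and
// feed observations back into history.
func RunLoop(p Provider, tools map[string]Tool, prompt string, maxIters int) []string {
	history := []string{prompt}
	for i := 0; i < maxIters; i++ {
		reply, calls := p.Think(history)
		history = append(history, reply)
		if len(calls) == 0 {
			break // model is done
		}
		for _, c := range calls {
			obs := "unknown tool: " + c.Name
			if t, ok := tools[c.Name]; ok {
				obs = t(c.Args)
			}
			history = append(history, obs) // observe
		}
	}
	return history
}

// scripted fakes a model: one tool call, then a final answer.
type scripted struct{ step int }

func (s *scripted) Think(h []string) (string, []ToolCall) {
	s.step++
	if s.step == 1 {
		return "reading file", []ToolCall{{Name: "read_file", Args: "notes.txt"}}
	}
	return "done", nil
}

func main() {
	tools := map[string]Tool{
		"read_file": func(a string) string { return "contents of " + a },
	}
	h := RunLoop(&scripted{}, tools, "summarize notes.txt", 20)
	fmt.Println(h[len(h)-1])
}
```

Swap the scripted provider for a real API client and the map for the registry from Milestone 6, and the shape stays the same.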
🧩 Milestone 2 — System Prompt Architecture (3–4 days)
- Bootstrap files: `agent_context_files` (agent-level) + `user_context_files` (per-user).
- 6 known files: SOUL, IDENTITY, AGENTS, TOOLS, BOOTSTRAP, USER.
- `ContextFileInterceptor` — when a tool reads/writes a known name, route to DB instead of disk.
- System prompt builder — assemble from sections. Persona early, persona reminder late.
- Two modes: `PromptFull` and `PromptMinimal`.
- Per-user file seeding on first chat.
🏢 Milestone 3 — Multi-Tenancy from the Start (3–4 days)
- `tenants` and `api_keys` tables. UUID v7 PKs.
- `tenant_id NOT NULL` on every table that holds tenant data.
- `WithTenantID(ctx)` / `TenantIDFromContext(ctx)` helpers.
- Resolve API key → SHA-256 lookup → set tenant on ctx at the gateway.
- Update every store query to add `WHERE tenant_id = $N`. Audit the diff.
- Master tenant for legacy/single-user data; master scope guard for global writes.
🔧 Milestone 4 — Pipeline Refactor (1 week)
Once your loop has > 3 conditional branches, split it:
- Define the `Stage` interface, `StageResult` enum, `RunState` struct.
- Implement `ContextStage`, `ThinkStage`, `ToolStage`, `ObserveStage`, `CheckpointStage`, `FinalizeStage`. Add `PruneStage` later.
- `Pipeline.Run` orchestrates: setup → iteration loop → finalize.
- Feature flag (`pipeline_enabled`) so V2 (monolithic) and V3 (pipeline) coexist during migration.
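A sketch of the `Stage` / `StageResult` / `RunState` split. The stages below are toy stand-ins; a real `Pipeline.Run` would wrap this single iteration in the setup → iteration loop → finalize sequence:

```go
package main

import "fmt"

type StageResult int

const (
	Continue StageResult = iota // proceed to the next stage
	Done                        // run finished successfully
	Failed                      // run finished with an error
)

// RunState is the single mutable value threaded through all stages.
type RunState struct {
	Messages []string
	Iter     int
}

type Stage interface {
	Name() string
	Run(s *RunState) StageResult
}

// RunIteration executes one pass over the stages, stopping early when
// any stage reports Done or Failed.
func RunIteration(stages []Stage, s *RunState) StageResult {
	for _, st := range stages {
		if r := st.Run(s); r != Continue {
			return r
		}
	}
	s.Iter++
	return Continue
}

// recordStage is a test double that logs its name and returns a fixed result.
type recordStage struct {
	name   string
	result StageResult
}

func (r recordStage) Name() string { return r.name }
func (r recordStage) Run(s *RunState) StageResult {
	s.Messages = append(s.Messages, r.name)
	return r.result
}

func main() {
	stages := []Stage{
		recordStage{"context", Continue},
		recordStage{"think", Continue},
		recordStage{"finalize", Done},
	}
	s := &RunState{}
	fmt.Println(RunIteration(stages, s), s.Messages)
}
```

Because each stage only sees `*RunState`, every stage is unit-testable in isolation, which is the whole point of splitting the monolithic loop.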
💾 Milestone 5 — Memory & Search (1–2 weeks)
- `memory_documents` + `memory_chunks` tables. `tsvector` (FTS) + `vector(1536)` columns.
- `MemoryInterceptor` — auto-chunks + embeds on `.md` writes inside `memory/*`.
- Hybrid search: `0.7 * vector + 0.3 * fts`. Per-user 1.2× boost. Dedup.
- `memory_search` and `memory_get` tools.
- Later: `episodic_summaries` + an `EpisodicWorker` subscribed to `run.completed`.
- Later: `kg_entities` + `kg_relations` with temporal validity for L2.
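The hybrid scoring step as a sketch. The 0.7/0.3 weights and the 1.2× per-user boost come from the guide; the `Hit` type and merge helper are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

type Hit struct {
	ChunkID string
	Score   float64
}

// hybridScore combines vector similarity and FTS rank, with a boost for
// chunks owned by the requesting user.
func hybridScore(vec, fts float64, ownedByUser bool) float64 {
	s := 0.7*vec + 0.3*fts
	if ownedByUser {
		s *= 1.2
	}
	return s
}

// mergeHits deduplicates by ChunkID (keeping the best score) and sorts
// descending, so the same chunk surfaced by both retrievers appears once.
func mergeHits(hits []Hit) []Hit {
	best := map[string]float64{}
	for _, h := range hits {
		if h.Score > best[h.ChunkID] {
			best[h.ChunkID] = h.Score
		}
	}
	out := make([]Hit, 0, len(best))
	for id, s := range best {
		out = append(out, Hit{id, s})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	hits := []Hit{
		{"a", hybridScore(0.9, 0.1, false)}, // 0.66
		{"b", hybridScore(0.5, 0.8, true)},  // 0.59 * 1.2 = 0.708
		{"a", hybridScore(0.2, 0.9, false)}, // 0.41, loses to a's 0.66
	}
	fmt.Println(mergeHits(hits))
}
```

In practice both retrievals run as SQL (`pgvector` distance + `ts_rank`) and this combination happens either in the query or in a thin layer like this one.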
🛡️ Milestone 6 — Tool Registry Hardening (1 week)
- Funnel every tool call through `Registry.ExecuteWithContext`.
- Token-bucket rate limiting per session key (defaults: 60/min, burst 5).
- Credential scrubber — start with 5–10 high-value patterns.
- Policy engine: profiles (`full`/`coding`/`messaging`/`minimal`), groups, allow/deny lists.
- Shell deny groups (start with: `destructive_ops`, `reverse_shell`, `dangerous_paths`, `package_install`).
- Capability metadata on every tool.
📡 Milestone 7 — Channels (per channel, ~2 days each)
- Define the `Channel` interface: `Listen(ctx, onMessage)`, `Send(ctx, OutboundMessage) error`.
- Telegram first (simplest, long-polling).
- `channel_instances` table with `tenant_id` baked in.
- Outbound dispatcher routes by `channel_instance_id`.
- Pairing flow: 8-char code, 60-min TTL.
- Then: Discord, Slack, WhatsApp, Feishu, Zalo.
📊 Milestone 8 — Observability (3–4 days)
- `traces` and `spans` tables. Three span types.
- Wrap every LLM call and tool call in a span.
- `BatchCreateSpans` in batches of 100; on failure, retry individually.
- Verbose mode (`TRACE_VERBOSE=1`) for full input/output, truncated at 50 KB.
- Optional: OpenTelemetry exporter.
🔄 Milestone 9 — Resilience (3–4 days)
- Wrap providers with retry middleware.
- Per-model cooldown.
- Failover chain.
- Mid-loop compaction at 75%; post-run at 50 messages or 75%.
- Per-session `TryLock` for the compaction goroutine.
- Stuck detector (5 patterns, semantic comparison).
- Autosubmit on every fatal error path.
🤖 Milestone 10 — Multi-Agent (1–2 weeks)
- `subagent` table. Limits: depth 1, max 5 children, max 8 concurrent.
- `spawn` tool (async return), `delegate` tool (sync with timeout).
- `agent_links` table for delegation eligibility.
- When ready: `teams`, `agent_team_members`, `team_tasks`, `team_messages`.
- Atomic task claim: `UPDATE … WHERE status = 'pending' AND owner_agent_id IS NULL`.
🔒 Milestone 11 — Production Hardening (ongoing)
- Add the remaining 4 security layers (input guard, output sanitizer, isolation).
- AES-256-GCM encryption for all at-rest secrets; `aes-gcm:` prefix convention.
- API keys: 16 random bytes, SHA-256 hash, constant-time compare.
- Activity log for every admin action.
- Hourly snapshot aggregations.
- Per-tenant config UI.
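The API-key scheme above is small enough to sketch in full: 16 random bytes, store only the SHA-256 hash, compare in constant time via `crypto/subtle`. Function names are illustrative:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// NewAPIKey returns the plaintext (shown to the user once) and the hash
// (the only thing persisted in the api_keys table).
func NewAPIKey() (plaintext, hash string, err error) {
	buf := make([]byte, 16)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(buf)
	sum := sha256.Sum256([]byte(plaintext))
	return plaintext, hex.EncodeToString(sum[:]), nil
}

// VerifyAPIKey hashes the presented key and compares in constant time,
// so response timing leaks nothing about the stored hash.
func VerifyAPIKey(presented, storedHash string) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(
		[]byte(hex.EncodeToString(sum[:])), []byte(storedHash)) == 1
}

func main() {
	key, hash, err := NewAPIKey()
	if err != nil {
		panic(err)
	}
	fmt.Println("verify:", VerifyAPIKey(key, hash), "wrong:", VerifyAPIKey("nope", hash))
}
```

The gateway then does: hash the presented key, look up the tenant by hash, and set it on the context.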
🧩 Milestone 12 — Optional Surface Area
- Knowledge Vault with wikilinks (`[[target]]`).
- MCP bridge (stdio + SSE + streamable-http transports, per-agent + per-user grants).
- Custom shell tools (DB-stored, hot-reloaded).
- Cron jobs.
- Browser automation (headless Chrome).
🕰️ Save for last (don't build until milestone 12)
- Fork agents (cache-driven sub-agents)
- Swarm teams
- Remote tasks across machines
- KAIROS continuous-mode logs
- Auto-mode permission classifier
- Renderer optimization (cell-diffing, BSU/ESU)
- Bitmap search index for huge filesystems
⚠️ Part 15 — Anti-Patterns to Avoid
Each row is a trap that's burned multiple production teams.
🔄 Loop / control flow
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Callbacks or event emitters for the agent loop | You'll re-invent backpressure poorly | `async function*` (or channels) |
| A single `error` terminal state | Loses information about why | Encode 10+ specific reasons in a discriminated union |
| Stop-hooks on error responses | Creates error → hook blocks → retry → error infinite loops | Skip them on errors |
| Forgetting to pair `tool_use` with `tool_result` on abort | API rejects the next message | Drain queued tools with synthetic results on every cancel path |
| Trusting the model's tool-call format | Models hallucinate `<tool_call>` XML, `[Tool Call: ...]` text | 7-step output sanitizer strips them all |
| One giant `runLoop()` function | 2k-line functions become untestable | 8-stage pipeline; each stage isolated |
🛠️ Tools
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Constructor literal instead of factory | Defaults will be unsafe | Always go through `buildTool()` |
| Per-tool-type concurrency safety | Bash is sometimes safe, sometimes not | Decide per invocation: pass the parsed input |
| Concatenating built-ins and MCP tools then sorting flat | Cache breakpoint dies | Sort within each partition, then concat |
| Returning huge raw output | Context blows up | Cap with `maxResultSizeChars`; persist to disk + return a preview |
| Using the SDK's `BetaMessageStream` | O(n²) JSON re-parsing | Read raw stream events |
| Bypassing the tool registry "just for this one call" | Loses scrubbing, rate limits, RBAC | Every tool call through the registry, no exceptions |
| Reusing the human shell (`cat`, `grep -rn`) | Bad agent tools — too much output, no error story | Build agent-shaped commands with bounded output |
| Free-form `sed -i` edits | Frequent syntactic collapses | Line-range edit with lint + auto-rollback |
🔐 Permissions
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Scattering `if mode === ...` checks throughout tool code | Untestable; drifts | Centralize in modes + a resolution chain |
| Trusting a partial bash parse | Bypassable | If `parseForSecurity()` fails, treat as unsafe |
| Sub-agent default = default mode | Needs a UI to prompt; background agents have none | Default to `bubble` (sync) or `dontAsk` (async) |
⚡ Caching / API
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Runtime conditionals in the static prompt prefix | Each one doubles cache key space | Move below the dynamic boundary |
| Mid-session feature toggles that change request headers | Bust cache | Use sticky latches |
| Reserving 64K output tokens by default | Over-reserve 8–16× | Cap at 8K, escalate on demand |
| Regenerating the system prompt for fork children | Feature flags or session date may have moved | Pass parent's bytes |
| Filtering tools per child agent in fork mode | Different array → different cache key | `useExactTools: true` and runtime guards |
💾 Memory
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Storing what `git log` can answer | Useless duplication that goes stale | Derivability test: if git/code can answer it, don't memorize |
| Embedding-only retrieval | Misses negation ("do NOT mock") | LLM recall over a manifest, hybrid with FTS |
| Hard expiration | Stale memories are still data | Annotate with age; let model decide |
| Letting `MEMORY.md` grow past 200 lines | Truncated silently | Treat the index as a budget |
🤖 Multi-agent
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Coordinators with the full tool set | They'll do the work themselves | Restrict to `Agent`, `SendMessage`, `TaskStop` |
| Workers asked to "based on the research, implement X" | Re-derive context, miss specifics, hallucinate paths | Synthesis is the coordinator's job; give exact paths/lines |
| Mid-tool-execution message delivery | Race conditions | Queue at tool-round boundaries |
| Unbounded teammate state | 36.8 GB / 292 agents was a real incident | Cap message history |
| General-purpose agents that can spawn `Agent` | Exponential fan-out | Block recursive spawning at the schema level |
🏢 Multi-tenancy
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Single-tenant first, "we'll add it later" | Migration is brutal — every query, test, cache key | tenant_id NOT NULL on day one |
| Trusting a client-supplied `tenant_id` header | Spoofable; cross-tenant leakage | Tenant resolved from the API key at the gateway |
🪝 Bootstrap / hooks
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Loading the world for `--version` | Slow startup | Fast-path dispatch first |
| Hook config that updates live mid-session | Lets a malicious repo redefine permissions after trust dialog | Snapshot at startup; update only via explicit user channel |
| Treating MCP skills like local skills | They are content-only | Never execute their inline shell commands |
🔌 Provider / API
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Hard-coding one LLM provider | You'll need 5 within a year | Provider interface + adapters |
| Storing secrets unencrypted because "it's the same DB" | Database dumps leak; insider widens blast radius | AES-256-GCM with aes-gcm: prefix |
| `time.Sleep` between LLM retries | Wastes time + cost; thundering herd | Exponential backoff with jitter; honor `Retry-After` |
| Distributed lock for "claim this task" | Adds Redis/Zookeeper; race conditions still possible | Atomic SQL UPDATE with WHERE status = 'pending' |
| Loading the full agent config on every request | Slow; chatty | Router cache with TTL + pub/sub invalidation |
| Synchronous summarization on the request path | User waits 10+ seconds | Synchronous flush, asynchronous summarize |
| Letting the agent self-modify its prompts unguarded | One bad cycle, quality craters | Suggestion engine + admin approval + rollback_on_drop_pct |
🎯 Part 16 — Closing: The Harness Mindset
Three closing observations distilled from every source.
1. 🎯 Push complexity to the boundaries
Permission resolution, protocol translation, state reconciliation, tool I/O — these are the messy edges. Concentrate the mess there. Keep the loop, the tool composition, the memory recall, and the streaming logic clean and exhaustively typed.
2. 🔁 The agent is a function from event history to next event, run in a loop
Everything else is a hook into that one loop:
- "Function" → a stateless `Agent`.
- "Event history" → an append-only `EventLog`.
- "Next event" → an `Action`, executed by the `Workspace`, producing an `Observation`.
- "Run in a loop" → a `Conversation`, until `Finish` or stuck.
There is no big design. There is one tight kernel and a lot of small components hanging off it.
3. 🔧 Iterate on failures
The single most important cultural practice from the harness-engineering discipline:
Anytime an agent makes a mistake, engineer a solution so it never makes that mistake again.
Ship first. Add configuration reactively. Throw away what doesn't help. Distribute battle-tested configurations. Treat technical debt as a high-interest loan.
After many production incidents the pattern is the same:
- "GPT-6 will fix it" → almost always wrong.
- "It's a configuration problem" → almost always right.
The fix is in your harness — context management, tool selection, verification loops, handoff artifacts, prompt reinforcement zones, hook ordering, error ladders.
🍳 The shortest possible recipe
If you only build six things well, you have a great agent:
- An async-generator loop with typed terminal states and a continue-state ladder for recovery.
- A self-describing tool registry with per-invocation safety, the 14-step pipeline, and bounded output.
- A 4-layer context compression pipeline preserving the prompt cache architecture.
- File-based memory with always-loaded index + LLM recall side-query.
- Defense-in-depth security with five independent layers.
- Multi-tenancy on day one — `tenant_id NOT NULL` everywhere.
Build those, and you've shipped a real agent. The rest of this guide is layering and polish.
📚 Appendix — Source Map
| Source | The lessons learned |
|---|---|
| Claude Code (from-source guide) | Async-generator loop, prompt cache as architecture, fork agents, file-based memory, hooks, 4-layer compression |
| OpenHands | CodeAct (code as universal action), append-only event log, Workspace abstraction, Skills/microagents, stuck detection, risk-aware confirmation |
| SWE-agent | The Agent-Computer Interface thesis, line-bounded edit + lint + rollback, autosubmit on error, cost-budget |
| GoClaw | Multi-tenancy from day one, 8-stage pipeline, 3-tier memory, 5-layer security, channel adapters, provider resilience stack |
| Nanobot | Bus-based decoupling, per-session lock + pending queue, files + git for memory, progressive skill loading |
| PicoClaw | Lean Go runtime, capability-based polymorphism, JSONL persistence with sidecar metadata, 64-shard mutex, cheap-first routing, JSON-RPC stdio hooks |
| Harness Engineering | Agent = Model + Harness; feedforward + feedback control; sensors/guides; sub-agents as context firewalls; iterate on failures |
"It's not a model problem. It's a configuration problem." — every team, after enough incidents.
If you found this helpful, let me know by leaving a 👍 or a comment — and if you think this post could help someone else, feel free to share it! Thank you very much! 😃
All rights reserved