Đã đăng vào thg 4 28, 10:00 SA 18 phút đọc

🦊GoClaw Deep Dive 🤖 — A Builder's Guide to a Multi-Tenant AI Agent Platform 📘

Source: https://github.com/nextlevelbuilder/goclaw — a Go-based, multi-tenant AI agent gateway with 20+ LLM providers, 7 messaging channels, an 8-stage pipeline, 3-tier memory, and 5-layer security.

This document distills GoClaw's architecture into the principles, patterns, and concrete building blocks you need to build a similar platform from scratch. Read top-to-bottom for theory, jump to Part 4 — Build-It-Yourself Blueprint for a sequenced implementation plan.

🧠 What GoClaw Actually Is (mental model)
⚙️ The 11 Core Principles
🔁 Cross-Cutting Patterns
🗺️ Build-It-Yourself Blueprint
⚠️ Anti-Patterns to Avoid
📚 Reference Map

Part 1 — 🧠 What GoClaw Actually Is

GoClaw is not a chatbot or "wrapper around OpenAI." It is an AI agent gateway — a backend service that sits between your application and LLM providers + tools + storage, and exposes a stable RPC/HTTP surface to the outside world.

[Browser / Telegram / Discord / Your SaaS Backend / CLI]
            │ (WebSocket RPC, HTTP REST, OpenAI-compat /v1/chat/completions)
            ▼
   ┌─────────────────────────────────┐
   │        GoClaw Gateway           │
   │  Auth · RBAC · Rate-limit       │
   │  Tenant Isolation Layer         │
   └────────────────┬────────────────┘
                    ▼
   ┌─────────────────────────────────┐
   │        Agent Engine             │
   │  Loop · Pipeline · Router       │
   │  Tools · Memory · Skills · MCP  │
   └────────────────┬────────────────┘
                    ▼
   ┌─────────────────────────────────┐
   │  PostgreSQL  ·  Redis  ·  Files │
   │  (sessions · agents · memory ·  │
   │   traces · KG · vault · keys)   │
   └─────────────────────────────────┘
                    │
                    ▼
        20+ LLM Providers (Anthropic, OpenAI, Gemini, …)

Three sentences that capture the design

Agents are configurations, not code — defined by rows in a DB plus a few markdown bootstrap files (SOUL.md, IDENTITY.md, AGENTS.md, TOOLS.md).
Everything is multi-tenant from day one — every table carries tenant_id, every query enforces it, and tenant scope flows through context.Context.
Every concern is an interface with at least one implementation — providers, stores, channels, tools, all behind small interfaces so they can be swapped or mocked.

Part 2 — ⚙️ The 11 Core Principles

2.1 🔄 The Agent Loop: Think → Act → Observe

The fundamental shape of any agent is a loop. GoClaw caps it at 20 iterations by default and structures each iteration as three actions:

loop (≤ 20 times):
    THINK   → Build prompt → call LLM → get response (text + tool calls?)
    if no tool calls: BREAK
    ACT     → Execute tool calls (parallel if multiple)
    OBSERVE → Append tool results back into the message history
finalize → sanitize output, persist messages, emit completion event

Key implementation details:

Detail	Value	Why
Max iterations	20	Prevents runaway loops; configurable per-agent and per-request
Parallel tools	goroutines + result sort by index	Latency win when LLM calls 3+ tools at once
Single tool	sequential	Goroutine overhead isn't worth it
Mid-loop compaction	trigger at 75% of context window	Summarize first ~70% of history in-place to avoid overflow
Cancel handling	`context.Background()` fallback for trace finalize	Ensures the trace record always saves even on `/stop`

Build it yourself: start here. A loop with one provider, one tool (echo), one in-memory session store, and a for i := 0; i < 20; i++ is a 200-line program that already works.

2.2 🔧 The 8-Stage Pluggable Pipeline

The V3 architecture turns the monolithic loop into 8 independent stages. Each stage is a Stage interface implementation that mutates a shared RunState.

Setup (once)
└─ ContextStage      Inject ctx (agentID, userID, locale), resolve workspace,
                     ensure per-user files exist, persist IDs on session.

Iteration loop (≤ 20)
├─ ThinkStage        Build system prompt (15+ sections), filter tools via policy,
│                    call LLM, record span, emit `chunk` events.
├─ PruneStage        If context > 25%: soft-trim oversized tool results.
│                    If > 50%: hard-clear. Run sanitizeHistory after.
├─ ToolStage         Execute tool calls (parallel for multi-call).
│                    Emit `tool.call` / `tool.result`.
├─ ObserveStage      Append tool results to message buffer.
│                    Handle `NO_REPLY` convention (silent completion).
└─ CheckpointStage   Increment iteration. Break on max-iters or ctx cancel.

Finalize (once)
└─ FinalizeStage     7-step output sanitization, atomic message flush,
                     update session metadata, emit `run.completed`.

Why this matters:

Each stage is testable in isolation (stages_test.go per stage).
New behavior (e.g. a RagStage) is one file — no surgery on a 2k-line runLoop().
Both V2 (monolithic) and V3 (pipeline) can coexist behind a feature flag.

Stage interface (sketch):

type Stage interface {
    Name() string
    Run(ctx context.Context, state *RunState) (StageResult, error)
}

type StageResult int
const (
    Continue   StageResult = iota // proceed to next stage
    BreakLoop                      // exit iteration loop
    AbortRun                       // abort the entire run
)

Lesson: Pluggable pipelines beat monolithic loops once the loop has more than ~3 conditional branches. Pay the abstraction cost early.

2.3 🤖 Provider Abstraction & Resilience

A Provider is a tiny interface. Everything that's hard about LLMs lives inside this seam.

type Provider interface {
    Name() string
    DefaultModel() string
    Chat(ctx context.Context, req ChatRequest) (ChatResponse, error)
    ChatStream(ctx context.Context, req ChatRequest, onChunk func(Chunk)) (ChatResponse, error)
}

Every backend — Anthropic native HTTP+SSE, OpenAI-compatible (Groq, DeepSeek, Gemini, Mistral via the same wire format), Claude CLI subprocess, ACP JSON-RPC, DashScope wrapper — implements this interface. The agent loop never knows which one it's talking to.

Resilience layers wrapped around providers:

Layer	Purpose
Retry	Exponential backoff with jitter; honors `Retry-After`; retries 5xx + network errors only (not 4xx)
Cooldown	Per-model cooldown timer after repeated failures — skip the model for N seconds
Failover	2-tier: rotate API profiles, then degrade to a fallback model
Cache	Composable middleware — caches identical prompts within a TTL
Service tier	Middleware that picks `priority`/`flex`/`auto` tier per request
Error classify	Map raw provider errors to 9 canonical reasons (rate-limit, context-overflow, auth, etc.)

Wire-format quirks live in the adapter, not the loop. Examples:

Anthropic uses x-api-key; OpenAI-compat uses Bearer; Codex uses OAuth + token refresh.
Claude CLI is a subprocess speaking stdio; ACP is JSON-RPC 2.0 over stdio.
DashScope wraps Qwen with a custom thinking-budget mapping.

Lesson: When you support N providers, the spread of behaviors is enormous. Force every quirk through one interface and you keep the agent loop boringly simple.

2.4 🛠️ The Tool Registry Pattern

Tools are the agent's hands. Every tool call goes through one place: Registry.ExecuteWithContext. The registry mediates every invocation.

Agent Loop
    │ ExecuteWithContext(name, args, channel, chatID, ...)
    ▼
[Registry]
    1. Inject per-call context (channel, chatID, peerKind, sandbox key, workspace)
    2. Rate-limit check (token bucket per session key)
    3. Policy check (RBAC: is this tool allowed for this agent?)
    4. Execute the Tool.Execute(ctx, args)
    5. Scrub credentials from output (regex + dynamic registered values)
    6. Return Result{ ForLLM, ForUser, IsError, MediaRefs, ... }

Tool capabilities (metadata that drives policy):

Capability	Examples
`read-only`	`read_file`, `web_search`, `memory_search` — safe to retry
`mutating`	`write_file`, `exec`, `cron`, `team_tasks`
`async`	`spawn` — returns immediately, result delivered later
`mcp-bridged`	Anything proxied to an external MCP server

The Policy Engine filters tools through 7 layers before sending the list to the LLM:

Global profile (full / coding / messaging / minimal)
Provider profile override
Global allow list
Provider allow override
Agent allow
Agent + provider allow
Group allow → then deny lists → then AlsoAllow (additive) → then subagent deny → final list

The 4-tier config overlay (most specific wins):

Per-agent override (agents.builtin_tool_settings)
Per-tenant override (builtin_tool_tenant_configs)
Global default (builtin_tools.settings)
Hardcoded fallback (in tool code)

Built-in tool inventory (the floor you should aim for):

Group	Tools
`fs`	`read_file`, `write_file`, `list_files`, `edit`, `send_file`
`runtime`	`exec` (with credentialed CLI mode for secret injection)
`web`	`web_search`, `web_fetch` (with allow/block domains)
`memory`	`memory_search`, `memory_get`, `memory_expand`
`sessions`	`sessions_list`, `sessions_history`, `sessions_send`, `spawn`
`automation`	`cron`, `datetime`, `heartbeat`
`messaging`	`message`, `create_forum_topic`, `list_group_members`
`team`	`team_tasks` (create/list/claim/complete/comment/attach/...)
`media-gen`	`create_image`, `create_audio`, `create_video`, `tts`
`media-read`	`read_image`, `read_audio`, `read_document`, `read_video`
`knowledge`	`vault_search`, `vault_read`, `knowledge_graph_search`, `skill_search`

Custom tools are shell commands with Go-template placeholders, stored in custom_tools table. Hot-reloaded via pub/sub on change. Supports encrypted env vars for credentials.

Virtual filesystem interceptors route specific paths to the database, not disk:

ContextFileInterceptor → routes SOUL.md, IDENTITY.md, etc. to agent_context_files / user_context_files.
MemoryInterceptor → routes MEMORY.md, memory/* to memory_documents. Writing a .md triggers chunking + embedding automatically.

Path security: every filesystem op runs through resolvePath() which filepath.Clean()s and verifies the result starts with the workspace prefix. Blocks path traversal.

Lesson: the tool registry is where security lives. If every tool call doesn't go through one chokepoint, you have no place to enforce rate-limit / RBAC / scrubbing.

2.5 🧠 3-Tier Memory (L0/L1/L2)

GoClaw treats memory as a progressive loading problem: cheap context first, expensive context only when asked.

L0 — Working Memory                L1 — Episodic                L2 — Semantic
┌────────────────────┐             ┌──────────────────┐         ┌──────────────────┐
│ Current session    │             │ Session summaries│         │ Knowledge Graph  │
│ messages           │   ───────►  │ + L0 abstracts   │ ──────► │ entities +       │
│ (auto-injected     │             │ (~50 tokens)     │         │ relations        │
│  if relevant)      │             │ + embeddings     │         │ + temporal       │
│ Threshold-based    │             │ 90-day retention │         │   validity       │
│ compaction         │             │ Hybrid search    │         │ (valid_from/to)  │
└────────────────────┘             └──────────────────┘         └──────────────────┘
        ▲                                  ▲                            ▲
   auto-inject                        memory_search                memory_expand
   (ContextStage)                     (tool, top-K)                (tool, full doc)

The progressive flow:

L0 auto-injection — On every turn, ContextStage runs AutoInjector which scores the user message against episodic summaries + KG entities. If relevance ≥ 0.3, inject up to 5 entries / 200 tokens at the top of the system prompt. Free for the agent — no tool call.
L1 unified search — When the agent calls memory_search(query), it runs hybrid search (BM25 + vector) across both episodic L0 abstracts and KG entities. Returns top K within score threshold.
L2 deep retrieval — When the agent calls memory_expand(episodic_id), it loads the full summary plus linked KG edges.

Hybrid search formula:

combined_score = vector_score * 0.7 + fts_score * 0.3
                                              │
              FTS: PostgreSQL tsvector +     │
                   plainto_tsquery('simple')  │
              Vector: pgvector with <=>       │
                      cosine distance         │
              Per-user boost: 1.2x            │
              Dedup: per-user wins over global│

Event-driven consolidation (the magic that fills L1/L2 over time):

run.completed event
       │
       ▼
EpisodicWorker → extract summary + L0 abstract via LLM
       │
       │ episodic.created event
       ▼
SemanticWorker → extract entities/relations from summary, write to KG
       │
       │ entity.upserted event
       ▼
DedupWorker → embedding-similarity merge, redirect relations

(separately, debounced 10m)
DreamingWorker → batch unpromoted summaries scored by:
                 0.30 * frequency + 0.35 * relevance +
                 0.20 * recency  + 0.15 * freshness  (14-day half-life)
              → LLM synthesis → write to long-term memory / vault

Two compaction strategies for L0:

When	Trigger	Strategy
Mid-loop	`prompt_tokens >= 75%` of context window during iteration	Summarize first ~70% of in-memory messages, keep last ~30%
Post-run	`> 50 messages` OR `> 75%` context window after run	Per-session try-lock → memory flush → background summarize → save summary + truncate to last 4 messages

Lesson: Memory is not a single tier. Treat it as a hierarchy with cost gradients (free auto-inject → tool call for L1 → tool call for L2). Use embeddings + FTS together, not either-or.

2.6 🏢 Multi-Tenant Isolation by Default

This is the single most consequential design decision — and the one most projects skip until it's painful.

Three rules, never broken:

Every isolatable table has tenant_id NOT NULL. 40+ tables in GoClaw enforce this.
Every query includes WHERE tenant_id = $N. No exceptions. Fail-closed.
Tenant flows through context.Context. Resolved at the gateway, propagated everywhere, never taken from client headers (which can be spoofed).

Tenant resolution at the gateway:

Credential	How tenant is resolved
Tenant-bound API key	Auto from `api_keys.tenant_id` (the recommended path)
System-level API key + `X-GoClaw-Tenant-Id` header	From header (UUID or slug); only system keys can do this
Gateway token + owner user ID	All tenants (cross-tenant admin)
Channel webhook (Telegram, Discord, …)	Baked into `channel_instances.tenant_id` at registration
No credentials	Master tenant only (dev mode)

Per-tenant overrides — each tenant gets its own:

LLM provider configs and API keys
Tool settings (web_search providers, TTS voice, etc.)
Skills enabled/disabled
MCP servers + per-user credentials
Channel instances

API key flow:

[Your SaaS Backend] ── Bearer goclaw_sk_abc... ── [GoClaw]
                                                       │
                                                       ▼
                                         api_keys table:
                                         hash = SHA-256(key)
                                         tenant_id = UUID
                                         scopes = [...]
                                                       │
                                                       ▼
                                         ctx = WithTenantID(parent, tenantID)
                                                       │
                                                       ▼
                                    All downstream queries:
                                    WHERE tenant_id = $N

Storage hardening:

API keys: SHA-256 at rest, constant-time compare for validation (crypto/subtle.ConstantTimeCompare).
Provider/MCP/custom-tool secrets: AES-256-GCM with aes-gcm: prefix + 12-byte nonce + ciphertext + tag, base64'd.
Master scope guard: writes to global tables (builtin_tools, config.*) require IsMasterScope(ctx) — otherwise tenant admin only.

Identity propagation pattern: GoClaw doesn't authenticate end-users. The upstream service (your SaaS backend, your auth proxy) provides user_id, opaque, max 255 chars. The recommended convention for multi-tenant deployments is tenant.{tenantId}.user.{userId}.

Lesson: Retrofitting multi-tenancy is one of the most painful migrations in software. Make tenant_id a column on day one, even if you only have one tenant.

2.7 🛡️ 5-Layer Defense-in-Depth Security

Each layer is independent — even if one is bypassed, the others still protect.

Layer 1 — 🌐 Transport

CORS allow-list validation
WebSocket message size limit: 512 KB
HTTP body limit: MaxBytesReader 1 MB
Timing-safe token comparison (crypto/subtle)
Rate limiting (token bucket, per user / per IP)
Ping/pong every 30s; read deadline 60s; write deadline 10s

Layer 2 — 🔍 Input Validation (`InputGuard`)

6 regex patterns scan every user message:

Pattern	Catches
`ignore_instructions`	"Ignore all previous instructions"
`role_override`	"You are now a different assistant"
`system_tags`	`<\\|im_start\\|>system`, `[SYSTEM]`
`instruction_injection`	"New instructions:", "override:"
`null_bytes`	`\x00`
`delimiter_escape`	`</instructions>`, "end of system"

4 action modes: off / log / warn (default) / block.

Layer 3 — ⚙️ Tool Execution

Shell deny groups — 15 classes, all denied by default: destructive_ops, data_exfiltration, reverse_shell, code_injection, privilege_escalation, dangerous_paths, env_injection, container_escape, crypto_mining, filter_bypass, network_recon, package_install, persistence, process_control, env_dump. Live-reloadable via pub/sub.
Path traversal prevention — resolvePath() cleans + prefix-checks every filesystem op.
SSRF guards — validateProviderURL() blocks 127.0.0.1/localhost for provider base URLs.
Credentialed CLI gate — when calling registered binaries (gh, gcloud, aws, kubectl, terraform), the exec tool injects encrypted env vars directly into the child process (no shell), unwraps sh -c wrappers up to depth 3 to prevent bypass, and fails-closed on DB error.
Domain allow/block — web_fetch honors per-tenant allow_domains / block_domains.

Layer 4 — 🧹 Output Sanitization

Credential scrubber — static regex patterns for OpenAI, Anthropic, GitHub, AWS keys + dynamic registry of runtime values. Replaces with [REDACTED]. Always-on.
Output sanitizer (7 steps applied to LLM output before delivery):
1. Strip garbled tool XML (<tool_call>, <minimax:tool_call>, etc. from broken models)
2. Strip downgraded text-format tool calls ([Tool Call: ...])
3. Strip thinking tags (<think>, <thinking>, <antThinking>)
4. Strip final wrapper tags (preserve inner content)
5. Strip echoed [System Message] blocks
6. Collapse consecutive duplicate paragraphs (model stuttering)
7. Strip leading blank lines

Layer 5 — 🔒 Isolation

Per-user workspace — base + "/" + sanitize(userID), injected via WithToolWorkspace(ctx)
Docker sandbox — read-only root, dropped capabilities, scoped per-session
Subagent depth limit — max depth 1, max children 5/parent, max concurrent 8 system-wide

Lesson: Don't pick one security strategy. Layer them. Assume each one will fail and ask "what's the next line of defense?"

2.8 💾 Persistence: Interface-First, Dual Backend

Every store is a Go interface. Each interface has both a PostgreSQL implementation (server) and a SQLite implementation (Lite desktop). Selected at compile time via //go:build tags.

type SessionStore interface {
    GetOrCreate(ctx context.Context, key string) (*Session, error)
    AddMessage(ctx context.Context, key string, msg Message) error
    SetSummary(ctx context.Context, key, summary string) error
    Save(ctx context.Context, key string) error
    Delete(ctx context.Context, key string) error
    List(ctx context.Context, opts ListOpts) ([]*Session, error)
}

// PG: writes through to PostgreSQL, in-memory write-behind cache
// SQLite: same interface, plain SQLite, no FTS5/vector

Why this matters:

Write the agent loop once, ship a server edition (PG) and a desktop edition (SQLite + Wails app).
Tests use mocks against the interface.
Replace any backend without touching call sites.

The 22+ stores in the system:

Store	What it owns
SessionStore	Conversation history (with in-memory write-behind cache)
AgentStore	Agent definitions, soft-delete, RBAC sharing
ProviderStore	LLM provider configs, encrypted keys
MemoryStore	Memory docs + chunks (FTS + pgvector hybrid)
EpisodicStore	Session summaries with embeddings + recall scoring
KnowledgeGraphStore	Entities + relations with temporal validity
VaultStore	Knowledge vault docs + bidirectional wikilinks
TeamStore	Teams, tasks (atomic claim), members, messages
CronStore	Scheduled jobs + run logs
TracingStore	Traces + spans (LLM, tool, agent)
MCPServerStore	MCP server configs + grants
CustomToolStore	Dynamic shell-based tools
ChannelInstanceStore	Channel configs (Telegram bot tokens, Discord guild IDs, …)
ConfigSecretsStore	Encrypted config values
BuiltinToolStore	System tool metadata + per-tenant settings
PendingMessageStore	Offline group-chat queue with auto-compaction
ContactStore	Cross-channel contact dedup + merge
ActivityStore	Audit log
SnapshotStore	Hourly usage aggregations for dashboards
SecureCLIStore	Credentialed binary configs (encrypted env)
APIKeyStore	Gateway API keys (SHA-256 hashed)
HookStore	Lifecycle hook definitions + execution audit

Two power patterns from the PG layer:

xmax trick for "is this row new?"
```
INSERT INTO user_agent_profiles (...) VALUES (...) 
ON CONFLICT (...) DO UPDATE SET last_seen_at = NOW()
RETURNING xmax = 0 AS is_new
```
is_new = true means a real INSERT happened → trigger first-time setup (seed context files). false means it was an UPDATE → returning user.

Atomic task claim (race-safe without distributed locks):

UPDATE team_tasks
SET status = 'in_progress', owner_agent_id = $1
WHERE id = $2 AND status = 'pending' AND owner_agent_id IS NULL
-- 1 row updated = claimed; 0 rows = someone else got it

Other PG conventions:

No ORM. database/sql with pgx/v5/stdlib. Raw SQL, $1/$2/$3 positional params.
Nullable columns via Go pointers (*string, *time.Time); helpers like nilStr() convert zero-values to nil.
execMapUpdate(map[string]any) builds dynamic UPDATE statements without one-function-per-field-combo.
UUID v7 (time-ordered) for all primary keys via GenNewID().
Required extensions: pgvector + pgcrypto.

Session caching pattern (write-behind):

Read:    GetOrCreate(key) → cache miss? load from DB into cache → return
Write:   AddMessage / SetSummary → in-memory only (no DB write)
Save:    Save(key) → snapshot under read lock → flush to DB via UPDATE
Delete:  Delete(key) → remove from cache + DB

Reads of List() go straight to DB to avoid stale results.

Lesson: Define stores as interfaces from line one. You'll thank yourself when you need a desktop edition, an in-memory test, or to swap PG for CockroachDB.

2.9 📡 Channels as Pluggable Adapters

Each external messaging platform is an adapter that converts platform-specific events to a unified InboundMessage and platform-specific replies from a unified OutboundMessage.

7 supported channels:

Channel	Transport	DM	Group	STT	Streaming
Telegram	Long polling (telego)	✓	✓	✓	✓
Feishu/Lark	WebSocket / webhook	✓	✓	✓	✓
Discord	Gateway WebSocket	✓	✓	✓	—
Slack	Socket Mode	✓	✓	—	✓
WhatsApp	Multi-device protocol	✓	✓	✓	—
Zalo OA	Webhook	✓	—	—	—
Zalo Personal	Reverse-engineered	✓	✓	—	—

4 internal channels (cli, system, subagent, browser) are silently skipped by the outbound dispatcher — they never reach an external platform.

Three DM access policies: pairing (8-character code, 60-min validity) / allowlist / open.

Session key format encodes everything you need:

agent:{agentId}:{channel}:direct:{peerId}     ← DM
agent:{agentId}:{channel}:group:{groupId}     ← Group
agent:{agentId}:subagent:{label}              ← Subagent
agent:{agentId}:cron:{jobId}:run:{runId}      ← Cron run
agent:{agentId}:main                          ← Default/main session

This single key fully scopes session state and enables cross-channel deduplication.

Lesson: Channels look diverse but reduce to two functions: Listen() -> InboundMessage and Send(OutboundMessage) -> error. Keep the agent loop ignorant of platform specifics.

2.10 🤝 Teams, Delegation, and Subagents

Three orchestration modes determine which inter-agent tools are available:

Mode	Tools available	When
`Spawn` (default)	`spawn`	No team, no delegate links
`Delegate`	`spawn`, `delegate`	`agent_links` table has rows for this agent
`Team`	`spawn`, `delegate`, `team_tasks`	`teams` table has a row for this agent

Resolution priority: Team > Delegate > Spawn.

Subagents (parallel child agents):

Limit	Default
Max concurrent (system-wide)	8
Max spawn depth	1
Max children per parent	5
Auto-archive after	60 min
Max iterations per subagent	20

Subagent actions: spawn (async), run (sync), list, cancel (id/all/last), steer (cancel + respawn with new message). Subagents share the parent's SecureCLIStore — credentialed binary gate cannot be bypassed by delegation.

Teams (collaborative multi-agent with a shared task board):

User → Team Lead (sees TEAM.md with member list + roles)
         │
         ▼ creates task on board
       team_tasks table
         │  status: pending
         ▼ atomic claim (SQL row lock)
       Member Agent → works in their own session
         │
         ▼ on completion: result via message bus with "teammate:" prefix
       Team Lead → synthesizes results → replies to user

Only the lead receives TEAM.md in its system prompt. Members discover context through tools (team_tasks list, list_group_members). This saves tokens on idle agents.

Task states: pending / in_progress / in_review / completed / failed / cancelled / blocked / stale.

Task dependencies via blocked_by UUID[]: completing a task auto-unblocks dependents whose blockers are all complete.

Lesson: Don't overload a single agent with everything. Start with spawn for simple parallelism. Add delegate when agents have distinct skills. Add team_tasks when you need a board (work tracking, dependencies, peer messages).

2.11 🌱 Self-Evolution with Guardrails

Agents adapt their behavior based on metrics — within strict bounds.

Three rules for the suggestion engine:

Rule	Detects	Suggests
`LowRetrievalUsageRule`	`memory_search` / `knowledge_graph_search` underused	Enable vault, adjust retrieval weights
`ToolFailureRule`	Frequently failing tools	Limit tool set or reword tool descriptions
`RepeatedToolRule`	Same tool called many times in a row (loop)	Adjust prompt to break the loop

Adaptation guardrails (in agents.other_config.evolution_guardrails):

Field	Default	Purpose
`max_delta_per_cycle`	0.1	Max parameter change per cycle (no wild swings)
`min_data_points`	100	Need ≥ N metrics before applying
`rollback_on_drop_pct`	20.0	Auto-revert if quality drops > 20% after change
`locked_params`	`[]`	Names that cannot auto-change (e.g. `temperature`)

The workflow:

SuggestionEngine.Analyze() runs over a 7-day metrics window.
Generates EvolutionSuggestion records with status="pending".
Admin reviews in dashboard, approves/rejects.
On approval, the auto-adapt worker applies and records baseline metrics.
Next cycle detects regression and rolls back if rollback_on_drop_pct exceeded.

Lesson: "Self-evolving agents" without guardrails is a recipe for production incidents. Bound the change rate, require admin approval, and always keep a rollback path.

Part 3 — 🔁 Cross-Cutting Patterns

A handful of patterns repeat across every module. They're worth internalizing as habits.

Pattern A — 🔗 Context Propagation, Not Mutable State

Everything per-request flows through context.Context:

ctx = store.WithTenantID(ctx, tenantID)
ctx = store.WithUserID(ctx, userID)
ctx = store.WithAgentID(ctx, agentID)
ctx = store.WithAgentType(ctx, "predefined")
ctx = store.WithLocale(ctx, "en")
ctx = tools.WithToolChannel(ctx, "telegram")
ctx = tools.WithToolChatID(ctx, chatID)
ctx = tools.WithToolWorkspace(ctx, "/data/workspaces/u_123")

Tools and store calls read from ctx, never from globals. This is what makes per-tenant + per-user concurrent execution thread-safe without mutexes.

Pattern B — 📢 Event Bus for Decoupling

Agent run completion fires run.completed on a domain event bus. Workers subscribe asynchronously:

EpisodicWorker → extract summary
SemanticWorker → extract entities
DedupWorker → merge duplicates
DreamingWorker → debounced batch synthesis

The agent loop never imports any of them. New workers just subscribe.

Pattern C — 📝 System Prompt as 19+ Composable Sections

The system prompt is assembled at request time from these sections (build order matters):

Identity (channel-aware)
First-run bootstrap notice (if BOOTSTRAP.md exists)
Persona (SOUL.md, IDENTITY.md) — early "primacy zone"
Tooling (filtered + sandbox-aware)
Credentialed CLI context (optional)
Safety preamble + identity anchoring
Self-Evolution rules (predefined agents only)
Skills inline (≤ 15 skills) OR via skill_search tool
MCP tools inline OR via mcp_tool_search
Workspace info
Team workspace (team agents)
Sandbox container info
User identity / owner IDs
Time (UTC)
Channel formatting hints
Extra context (<extra_context> tags)
Project/bootstrap context files (defensive preamble)
Sub-agent spawning rules
Runtime info (agent ID, model, pricing)
Persona reminder — late "recency zone" — fights "lost in the middle"
Memory reminders (run memory_search first)

Two modes: PromptFull (main runs) and PromptMinimal (subagents, cron, memory flush — only AGENTS.md + TOOLS.md).

Two reinforcement zones (primacy + recency) are the cheapest reliability win in agent prompting.

Pattern D — 🧹 Always Sanitize, Always Trace, Always Scrub

Three callbacks that wrap every run:

Sanitize output (7 steps) before delivery.
Record a span for every LLM call and every tool call. Trace tree mirrors the run shape.
Scrub credentials from every tool result via static + dynamic patterns.

Pattern E — ⚛️ Atomic, Race-Safe Mutations via SQL, Not Locks

Don't reach for distributed locks. Instead:

Atomic claim: UPDATE … WHERE status = 'pending' (row-level lock, 1 winner)
Upsert: INSERT … ON CONFLICT … DO UPDATE (idempotent)
Dynamic update: execMapUpdate(map[string]any) — no one-function-per-field-combo

Pattern F — 🔒 Per-Session Try-Lock for Long-Running Side Effects

When a run finishes and decides to compact:

if !sessionLock.TryLock(sessionKey) { return }   // someone else is already compacting
defer sessionLock.Unlock(sessionKey)
runMemoryFlush()
go runSummarize(ctx, ...)

Try-lock instead of blocking lock — skip if another concurrent run is already doing it.

Pattern G — ⚡ Write-Behind Cache for Hot Data

Session messages are written to memory only during a run. One Save(key) flushes to DB at the end. This collapses 10–20 individual INSERTs into 1 UPDATE.

Pattern H — 🔀 Two-Phase Tool Registry (Global + Per-Agent)

Global tools loaded at startup into a shared registry. Per-agent custom tools merged on first agent access into a clone of the global registry — never mutating the shared one.

Part 4 — 🗺️ Build-It-Yourself Blueprint

A concrete, sequenced plan to build a similar system. Each milestone is a runnable, testable deliverable.

Milestone 0 — 🏗️ Foundation (1–2 days)

[ ] Pick the language (Go is a great fit; Python is too).
[ ] Pick the DB (PostgreSQL + pgvector if you want vector search).
[ ] Set up project skeleton: cmd/, internal/, pkg/, migrations/, docs/, Makefile, docker-compose.yml.
[ ] Define the Provider interface (4 methods).
[ ] Implement one provider — start with OpenAI-compatible (covers Groq, DeepSeek, Together, etc. for free).
[ ] Wire a cmd/serve that loads config, makes one HTTP request to the provider, and prints the response.

Milestone 1 — 🔄 Minimum Viable Agent Loop (1 week)

[ ] Define Tool interface: Name() string, Description() string, Schema() JSONSchema, Execute(ctx, args) (Result, error).
[ ] Implement 3 tools: read_file, write_file, list_files (workspace-scoped, with resolvePath() traversal guard).
[ ] Build the loop: Loop.Run(req) → for i := 0; i < 20; i++ { think; if no tools break; act; observe }.
[ ] Persist sessions: SessionStore interface + in-memory implementation. Add PG implementation behind it.
[ ] Emit events via callback (onEvent func(EventType, payload)). Just three: run.started, tool.call, run.completed.
[ ] Build cmd/serve HTTP /v1/chat/completions (OpenAI-compatible). One agent. No streaming yet.

You should now have an LLM that can read/write files in a workspace.

Milestone 2 — 📝 System Prompt Architecture (3–4 days)

[ ] Bootstrap files in DB: agent_context_files (agent-level) + user_context_files (per-user). 6 known files: SOUL, IDENTITY, AGENTS, TOOLS, BOOTSTRAP, USER.
[ ] ContextFileInterceptor — when a tool reads/writes one of these names, route to DB instead of disk.
[ ] System prompt builder — assemble from sections (start with 5–6, grow as needed). Persona early, persona reminder late.
[ ] Two modes: PromptFull and PromptMinimal.
[ ] Per-user file seeding on first chat (use the xmax trick with PG; on SQLite use last_insert_rowid() after INSERT ... ON CONFLICT DO NOTHING).

Milestone 3 — 🏢 Multi-Tenancy from the Start (3–4 days)

[ ] tenants and api_keys tables. UUID v7 PKs.
[ ] tenant_id NOT NULL on every table that holds tenant data (agents, sessions, memory_documents, traces, agent_context_files, …).
[ ] Add WithTenantID(ctx) / TenantIDFromContext(ctx) helpers.
[ ] At the gateway: resolve API key → SHA-256 lookup → set tenant on ctx.
[ ] Update every store query to add WHERE tenant_id = $N. Audit the diff.
[ ] Master tenant for legacy/single-user data. Master scope guard for global writes.

Milestone 4 — 🔧 Pipeline Refactor (1 week)

Once your monolithic loop has > 3 conditional branches, split it:

[ ] Define Stage interface, StageResult enum, RunState struct.
[ ] Implement: ContextStage, ThinkStage, ToolStage, ObserveStage, CheckpointStage, FinalizeStage. Add PruneStage later.
[ ] Pipeline.Run orchestrates: setup → iteration loop → finalize.
[ ] Add a feature flag (pipeline_enabled) so V2 (monolithic) and V3 (pipeline) coexist during the migration.

Milestone 5 — 🧠 Memory & Search (1–2 weeks)

[ ] memory_documents + memory_chunks tables. tsvector (FTS) + vector(1536) (pgvector) columns.
[ ] MemoryInterceptor — auto-chunks + embeds on .md writes inside memory/*.
[ ] Hybrid search: 0.7 * vector + 0.3 * fts, with per-user 1.2x boost and dedup (per-user wins).
[ ] memory_search and memory_get tools.
[ ] (Later) episodic_summaries table + EpisodicWorker subscribed to run.completed.
[ ] (Later) kg_entities + kg_relations with valid_from / valid_until for L2.

Milestone 6 — 🛠️ Tool Registry Hardening (1 week)

[ ] Funnel every tool call through Registry.ExecuteWithContext.
[ ] Add rate limiting (token bucket per session key, defaults: 60/min, burst 5).
[ ] Add credential scrubber — start with 5–10 high-value patterns (OpenAI sk-, Anthropic sk-ant-, GitHub ghp_, AWS AKIA, generic 64-char hex).
[ ] Add policy engine: profiles (full / coding / messaging / minimal), groups (fs, runtime, web, …), allow/deny lists.
[ ] Add shell deny groups (start with: destructive_ops, reverse_shell, dangerous_paths, package_install).
[ ] Capability metadata on every tool (read-only / mutating / async).

Milestone 7 — 📡 Channels (per channel, ~2 days each)

[ ] Define Channel interface: Name() string, Listen(ctx, onMessage), Send(ctx, OutboundMessage) error.
[ ] Telegram first (simplest, long-polling library exists).
[ ] Add channel_instances table with tenant_id baked in.
[ ] Outbound dispatcher routes by channel_instance_id. Internal channels (cli, system, subagent) silently skipped.
[ ] Pairing flow: 8-char code, 60-min TTL, paired-device tracking.
[ ] Then add: Discord (websocket), Slack (Socket Mode), WhatsApp, Feishu, Zalo.

Milestone 8 — 🔭 Observability (3–4 days)

[ ] traces and spans tables. Three span types: agent, llm_call, tool_call.
[ ] Wrap every LLM call in a span. Wrap every tool call in a span.
[ ] BatchCreateSpans in batches of 100; on batch failure, retry individually.
[ ] Verbose mode (TRACE_VERBOSE=1) records full input/output truncated at 50 KB.
[ ] Optional: OpenTelemetry exporter for spans.

Milestone 9 — 💪 Resilience (3–4 days)

[ ] Wrap providers with retry middleware (exponential backoff, jitter, honor Retry-After, only retry 5xx + network).
[ ] Per-model cooldown — track failures per model, skip cooldown'd models for N seconds.
[ ] Failover — try API profile A, then profile B, then degraded model.
[ ] Mid-loop compaction at 75% context. Post-run compaction at 50 messages or 75% context.
[ ] Per-session TryLock for compaction goroutine.

Milestone 10 — 🤝 Multi-Agent (1–2 weeks)

[ ] subagent table for spawn tracking. Limits: depth 1, max 5 children, max 8 concurrent.
[ ] spawn tool (async return), delegate tool (sync with timeout).
[ ] agent_links table for delegation eligibility.
[ ] When ready: teams, agent_team_members, team_tasks, team_messages.
[ ] Atomic task claim: UPDATE … WHERE status = 'pending' AND owner_agent_id IS NULL.
[ ] team_tasks tool with actions: create / list / claim / complete / comment / attach / approve / reject.

Milestone 11 — 🔐 Production Hardening (ongoing)

[ ] Add the remaining 4 security layers (input guard, output sanitizer, isolation).
[ ] AES-256-GCM encryption for all at-rest secrets. aes-gcm: prefix convention.
[ ] API keys: 16 random bytes, SHA-256 hash, constant-time compare.
[ ] Activity log for every admin action.
[ ] Hourly SnapshotStore aggregations.
[ ] Per-tenant config UI.
[ ] Self-evolution suggestion engine (only after you have ≥ 100 metrics per agent).

Milestone 12 — 🌟 Optional Surface Area

[ ] Knowledge Vault with wikilinks ([[target]] syntax).
[ ] MCP bridge (stdio + SSE + streamable-http transports, per-agent + per-user grants).
[ ] Custom shell tools (DB-stored, hot-reloaded).
[ ] Cron jobs (cron expressions + cron_run_logs).
[ ] Browser automation (headless Chrome, browser.act / browser.snapshot / browser.screenshot).

Part 5 — ⚠️ Anti-Patterns to Avoid

GoClaw earns its design by not doing these things:

Anti-pattern	Why it's a trap	What GoClaw does instead
Hard-coding one LLM provider	You'll need 5 within a year	`Provider` interface; adapters per provider
Single-tenant first, "we'll add it later"	Migration is brutal — every query, every test, every cache key	`tenant_id NOT NULL` on day one
Mutable global agent state	Race conditions across concurrent runs	Per-call data lives in `context.Context`
Bypassing the tool registry "just for this one call"	Loses scrubbing, rate-limit, RBAC	Every tool call through `Registry.ExecuteWithContext`, no exceptions
Trusting the model's tool-call format	Models hallucinate `<tool_call>` XML, `[Tool Call: ...]` text, etc.	7-step output sanitizer strips them all
Storing secrets unencrypted because "it's the same DB"	Database dumps leak; insider access widens blast radius	AES-256-GCM with `aes-gcm:` prefix on every secret
One giant `runLoop()` function	2k-line functions become untestable	8-stage pipeline, each stage isolated
Using `time.Sleep` between LLM retries	Wastes time + cost; no jitter → thundering herd	Exponential backoff with jitter, honors `Retry-After`
One memory tier ("just embeddings")	Slow, expensive, irrelevant matches	L0 auto-inject + L1 hybrid search + L2 deep retrieval
Distributed lock for "claim this task"	Adds Redis/Zookeeper dependency; race conditions still possible	Atomic SQL UPDATE with `WHERE status = 'pending'`
Trusting client-supplied `tenant_id` header	Spoofable; cross-tenant leakage	Tenant resolved from API key at gateway, never from clients
Loading the full agent config on every request	Slow; chatty	Router cache with TTL + pub/sub invalidation
Synchronous summarization on the request path	User waits 10+ seconds	Synchronous memory flush, asynchronous summarization in background goroutine
Letting the agent self-modify its prompts	One bad cycle and quality cratters	Suggestion engine + admin approval + `rollback_on_drop_pct` guardrail

Part 6 — 📚 Reference Map

📁 Repo structure (the parts that matter)

goclaw/
├── cmd/                                  130+ files: serve, onboard, migrate
│   ├── gateway*.go                       Gateway lifecycle + setup + wiring
│   └── tui_*.go                          TUI for onboarding/setup
├── internal/
│   ├── agent/                            V2 monolithic loop, router, system prompt,
│   │                                     resolver, sanitize, compaction, evolution
│   ├── pipeline/                         V3 8-stage pipeline (context_stage.go,
│   │                                     think_stage.go, tool_stage.go, …)
│   ├── providers/                        Provider interface + adapters per backend
│   │                                     + retry, cooldown, failover, middleware
│   ├── tools/                            Registry, capabilities, policy engine,
│   │                                     scrubber, rate limiter, custom tools
│   ├── memory/                           3-tier memory + auto-injector + embeddings
│   ├── consolidation/                    Episodic/semantic/dreaming workers
│   ├── vault/                            Knowledge vault + wikilinks + FS sync
│   ├── knowledgegraph/                   KG entities + relations + traversal
│   ├── store/                            Store interfaces (the contract)
│   │   ├── pg/                           PostgreSQL implementations
│   │   └── sqlitestore/                  SQLite implementations
│   ├── gateway/                          WS server, HTTP mux, method router,
│   │                                     rate limiter, client lifecycle
│   ├── http/                             HTTP API handlers (/v1/*)
│   ├── channels/                         Telegram, Discord, Slack, WhatsApp,
│   │                                     Feishu, Zalo OA, Zalo Personal
│   ├── mcp/                              MCP bridge (stdio/sse/http transports)
│   ├── crypto/                           AES-256-GCM with `aes-gcm:` prefix
│   ├── permissions/                      RBAC: viewer/operator/admin
│   ├── eventbus/                         Domain event bus for consolidation
│   ├── tracing/                          Trace + span hierarchy
│   ├── tokencount/                       tiktoken-based counter
│   ├── workspace/                        Per-user workspace resolver
│   ├── bootstrap/                        SOUL/IDENTITY system prompt loading
│   ├── config/                           JSON5 config + env overlay
│   ├── i18n/                             EN/VI/ZH backend message catalog
│   ├── audio/                            TTS provider layer (5 providers)
│   ├── media/                            Image / audio / video generation
│   └── sandbox/                          Docker sandbox for shell exec
├── pkg/
│   ├── browser/                          Browser automation
│   └── protocol/                         Frame types, RPC method names, errors
├── migrations/                           PostgreSQL migrations (45+)
├── docker/                               Docker compose variants
├── docs/                                 31 architecture docs (00-architecture-overview, 
│                                         01-agent-loop, 03-tools-system, …)
└── ui/
    ├── web/                              React SPA (Vite, Tailwind, Radix, Zustand)
    └── desktop/                          Wails v2 desktop app (SQLite, embedded gateway)

🗝️ Key files to read first (in order)

docs/00-architecture-overview.md — system map
docs/01-agent-loop.md — the loop in detail (V2 + V3)
docs/03-tools-system.md — tool registry, policy, security
docs/06-store-data-model.md — every table and store interface
docs/09-security.md — the 5 layers
docs/23-multi-tenant-architecture.md — tenant resolution + isolation
docs/24-knowledge-vault.md — vault, wikilinks, hybrid search
docs/04-gateway-protocol.md — RPC + HTTP API surface
docs/02-providers.md — provider abstraction + resilience
docs/codebase-summary.md — module map

💡 The shortest possible "what is GoClaw"

A multi-tenant AI agent gateway in Go that exposes WebSocket RPC + HTTP REST + OpenAI-compatible APIs. Behind a single Provider interface it talks to 20+ LLM backends. Behind a single Tool registry it offers 50+ built-in tools plus MCP and custom shell tools, all gated by RBAC + rate limits + credential scrubbing + path/SSRF/shell-deny guards. Agent runs flow through an 8-stage pluggable pipeline (think→prune→tool→observe→checkpoint→finalize). Memory is 3-tier (working / episodic / semantic) with hybrid BM25+vector search. Every isolatable table carries tenant_id; every query enforces it; tenant scope flows through context.Context. Channels (Telegram, Discord, Slack, …) are pluggable adapters. Teams of agents collaborate on a SQL-claimed task board.

💭 Closing Thoughts

GoClaw is a study in disciplined boundaries. The agent loop never knows which provider it's talking to. The provider never knows which channel a message came from. The tool never knows which tenant owns the data. Each layer reduces to a small interface and a context-propagated set of values.

If you take only one thing from this document: make every concern an interface from line one, and make multi-tenancy and security non-optional from line one. Everything else can be added incrementally — those two cannot.

If you found this helpful, let me know by leaving a 👍 or a comment!, or if you think this post could help someone, feel free to share it! Thank you very much! 😃

Table of Contents