Last Updated: 2026-04-03
Status: Draft
1. High-Level Design (HLD)
1.1 Core Architectural Components
| Component | Responsibility | Technology |
|---|---|---|
| Brain Service | API layer — receives requests, streams responses, exposes insight endpoints | Hono on Cloudflare Workers |
| Router Agent | Intent classification, specialist delegation, multi-step orchestration | Haiku-class LLM (fast, cheap) |
| Specialist Agents | Domain-specific reasoning (blog, content, analytics, chat) | Sonnet/Opus-class LLM (capable) |
| Tool Registry | Maps service capabilities as callable tools for agents | TypeScript module per service |
| Context Assembler | Two-pass lazy context assembly — intent-first, then targeted data pull | TypeScript + Neon queries |
| Memory Layer | Tri-layer memory: working (KV), episodic (PG), semantic (pgvector) | CF KV + Neon + pgvector |
| Consolidation Engine | Three-tier memory consolidation: immediate, deferred, periodic | CF Queues + Cron Triggers |
| Insight Engine | Proactive anomaly detection, pattern surfacing, notification generation | Cron Triggers + Haiku-class LLM |
| Cost Governor | Per-tenant usage tracking, plan-based limits, model routing by tier | Core-billing integration |
| Eval Scorer | Decision tracking, outcome measurement, quality metrics | PostgreSQL + Cron Triggers |
1.2 Overall System Diagram (HLD)
2. Low-Level Design (LLD)
2.1 Request Lifecycle — From User Message to Streamed Response
2.2 Two-Pass Lazy Context Assembly
The Challenge:
- Pulling context from all services on every request wastes tokens and adds latency
- Irrelevant context confuses the LLM and degrades response quality
- Token budgets are finite — every wasted token is a missed insight
The Solution: Intent-First, Then Targeted Pull
Pass 1 — Intent Classification (fast, ~100ms):
The router agent receives only the raw user message + minimal session info. No service data, no memories. It classifies:
{
"intent": "content_creation",
"confidence": 0.92,
"services": ["blog"],
"entities": ["posts", "schedule", "topics"],
"time_scope": "future",
"specialist": "blog",
"action_required": true
}

Pass 2 — Targeted Pull (parallel, ~30ms):
Based on the intent, the context assembler fetches ONLY relevant data:
{
"token_budget": {
"system_prompt": 500,
"tools": 1500,
"tenant_profile": 200,
"memories": 1500,
"service_data": 2000,
"conversation_history": 2000,
"safety_buffer": 300,
"total": 8000
},
"fetch_plan": {
"services": ["blog"],
"memory_entities": ["posts", "schedule", "topics"],
"memory_types": ["preference", "pattern", "fact"],
"memory_min_importance": 0.3,
"memory_max_age": null,
"history_window": 10
}
}

If the assembled context exceeds any section budget, the compressor:
- Drops lowest-scored memories first
- Filters service data to only intent-relevant fields
- Summarizes older conversation turns, keeps recent turns verbatim
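The first compressor step can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Memory` shape, the `fitMemoriesToBudget` name, and the 4-characters-per-token estimate are all assumptions made for the example.

```typescript
// Hypothetical memory shape; field names are illustrative, not the real schema.
interface Memory {
  id: string;
  text: string;
  score: number; // relevance/importance score, higher is better
}

// Rough token estimate: about 4 characters per token (an assumption, not a real tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the highest-scored memories that fit within the budget.
// Walking the list in descending score order and stopping at the first overflow
// is equivalent to dropping the lowest-scored memories first.
function fitMemoriesToBudget(memories: Memory[], budgetTokens: number): Memory[] {
  const byScoreDesc = [...memories].sort((a, b) => b.score - a.score);
  const kept: Memory[] = [];
  let used = 0;
  for (const memory of byScoreDesc) {
    const cost = estimateTokens(memory.text);
    if (used + cost > budgetTokens) break; // everything after this scores lower
    kept.push(memory);
    used += cost;
  }
  return kept;
}
```

With the example budget above, the call would be `fitMemoriesToBudget(memories, 1500)`. A production version would likely use the target model's tokenizer instead of a character-count heuristic.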
2.3 Multi-LLM Routing & Fallback
Model Selection Strategy:
| Task | Model Tier | Latency Target | Cost/1K tokens |
|---|---|---|---|
| Intent classification | Haiku-class (Claude Haiku, GPT-4o-mini) | < 150ms | ~$0.001 |
| Simple queries (facts, status) | Haiku-class | < 200ms | ~$0.001 |
| Content generation, analysis | Sonnet-class (Claude Sonnet, GPT-4o) | < 500ms first token | ~$0.01 |
| Complex reasoning, multi-step planning | Opus-class (Claude Opus) | < 1s first token | ~$0.05 |
Plan-Based Model Access:
| Plan | Models Available | Max Interactions/Month |
|---|---|---|
| Free | Haiku-class only | 50 |
| Starter | Haiku + Sonnet | 500 |
| Pro | Haiku + Sonnet + Opus (for complex tasks) | 5,000 |
| Business | All models, full access | Unlimited |
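The two tables can be combined into a single lookup, sketched below. The type names, task keys, and `selectTier` function are hypothetical. Downgrading to the plan's highest allowed tier when the preferred tier is unavailable is an assumption; the design does not say whether such requests should be downgraded or rejected.

```typescript
type Tier = "haiku" | "sonnet" | "opus";
type Plan = "free" | "starter" | "pro" | "business";
type Task =
  | "intent_classification"
  | "simple_query"
  | "content_generation"
  | "complex_reasoning";

const TIER_RANK: Record<Tier, number> = { haiku: 0, sonnet: 1, opus: 2 };

// Preferred tier per task, from the Model Selection Strategy table.
const PREFERRED_TIER: Record<Task, Tier> = {
  intent_classification: "haiku",
  simple_query: "haiku",
  content_generation: "sonnet",
  complex_reasoning: "opus",
};

// Highest tier each plan may use, from the Plan-Based Model Access table.
const MAX_TIER: Record<Plan, Tier> = {
  free: "haiku",
  starter: "sonnet",
  pro: "opus",
  business: "opus",
};

// Monthly interaction limits from the same table.
const MONTHLY_LIMIT: Record<Plan, number> = {
  free: 50,
  starter: 500,
  pro: 5000,
  business: Infinity,
};

// Pick the preferred tier for the task, capped at the plan's ceiling (assumed policy).
function selectTier(task: Task, plan: Plan): Tier {
  const preferred = PREFERRED_TIER[task];
  const ceiling = MAX_TIER[plan];
  return TIER_RANK[preferred] <= TIER_RANK[ceiling] ? preferred : ceiling;
}
```

For example, `selectTier("complex_reasoning", "starter")` resolves to `"sonnet"` under this policy, since Starter plans do not include Opus.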
Fallback Chain:
Primary Provider (e.g., Claude Sonnet)
│ timeout/error
▼
Secondary Provider (e.g., GPT-4o)
│ timeout/error
▼
Tertiary Provider (e.g., Gemini 1.5 Flash)
│ timeout/error
▼
Graceful Degradation Message
"I'm having trouble right now — try again in a few minutes."

Each fallback is logged. If a provider fails more than 3 times within 5 minutes, the router skips it for 10 minutes (circuit breaker).
2.4 Streaming Architecture (SSE)
The brain service streams responses via Server-Sent Events:
Client Brain Service
│ │
│── POST /brain/chat ─────────────>│
│ │── classify intent
│ │── assemble context
│ │── call LLM (streaming)
│<── event: token ─────────────────│ "Based"
│<── event: token ─────────────────│ " on"
│<── event: token ─────────────────│ " your"
│<── event: token ─────────────────│ " data"
│<── event: tool_call ─────────────│ { name: "blog.analytics", params: {...} }
│<── event: tool_result ───────────│ { views: 340, trend: "up" }
│<── event: token ─────────────────│ " your traffic"
│<── event: token ─────────────────│ " increased..."
│<── event: done ──────────────────│ { usage: { input: 2500, output: 180 } }
│ │

Event types:
| Event | Payload | Purpose |
|---|---|---|
| `token` | `{ text: string }` | Streaming text token |
| `tool_call` | `{ name, params }` | Agent is calling a tool (shown as loading state in UI) |
| `tool_result` | `{ name, result }` | Tool completed (UI can show inline data) |
| `thinking` | `{ text: string }` | Agent's reasoning (optional, collapsible in UI) |
| `done` | `{ usage, decision_id }` | Stream complete, usage stats |
| `error` | `{ code, message }` | Error during processing |
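The event table maps naturally onto a discriminated union, sketched below with a helper that serializes one event into the SSE wire format (an `event:` line, a `data:` line, then a blank line). The `BrainEvent` and `toSseFrame` names are hypothetical. Field types the table does not spell out (`params`, `result`, `decision_id`, `code`) are assumptions, and the `usage` shape is taken from the `done` event in the sequence above.

```typescript
// Payload shapes follow the event table; unspecified fields are typed loosely.
type BrainEvent =
  | { event: "token"; data: { text: string } }
  | { event: "tool_call"; data: { name: string; params: unknown } }
  | { event: "tool_result"; data: { name: string; result: unknown } }
  | { event: "thinking"; data: { text: string } }
  | {
      event: "done";
      data: { usage: { input: number; output: number }; decision_id: string };
    }
  | { event: "error"; data: { code: string; message: string } };

// Serialize one event as an SSE frame. JSON.stringify never emits raw
// newlines, so the payload always fits on a single "data:" line.
function toSseFrame(e: BrainEvent): string {
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}
```

On the client, an `EventSource`-style consumer can register one listener per event name and switch on the same union, so the compiler flags any event type the UI forgets to handle.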
2.5 Insight Engine — Proactive Intelligence
The Problem: A reactive-only AI waits for users to ask questions. Most business owners don't know what to ask — they need the AI to surface opportunities and problems proactively.
The Solution: A cron-triggered insight engine that scans each tenant's data every 6 hours and generates actionable insight cards.
Insight Types:
| Type | Trigger | Example |
|---|---|---|
| `opportunity` | Content gap detected from chatbot questions | "40% of chatbot queries are about eggless recipes — you have no post on this" |
| `anomaly` | Traffic spike or drop beyond 2 standard deviations | "Your traffic dropped 35% this week — likely due to 0 posts published" |
| `win` | A post/campaign significantly outperformed average | "Your Diwali post got 3x average views — seasonal recipes are your strength" |
| `suggestion` | Pattern-based optimization | "Your Monday posts get 40% more views than Friday posts — consider shifting" |
| `reminder` | Scheduled content gap detected | "You have no posts scheduled for next week — want me to plan some?" |
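The `anomaly` trigger (a value more than 2 standard deviations from the norm) can be checked without any LLM call. The sketch below is one possible rule-based check. The function names are hypothetical, and the use of the sample standard deviation over a plain array of historical values is an assumption about how the baseline is computed.

```typescript
// z-score of `current` relative to `history`, or null when it is undefined
// (fewer than two data points, or a perfectly flat history).
function zScore(history: number[], current: number): number | null {
  if (history.length < 2) return null;
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance =
    history.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (history.length - 1);
  const std = Math.sqrt(variance);
  if (std === 0) return null;
  return (current - mean) / std;
}

// True when the current value is a spike or drop beyond `threshold` standard deviations.
function isTrafficAnomaly(history: number[], current: number, threshold = 2): boolean {
  const z = zScore(history, current);
  return z !== null && Math.abs(z) > threshold;
}
```

For a weekly history of `[100, 110, 90, 105, 95]` views, a week with 65 views is flagged while a week with 98 is not. A check like this is cheap enough to run for every tenant on each scan.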
2.6 Durable Objects — Session State Management
Each active copilot session is backed by a Cloudflare Durable Object:
Responsibilities:
- Hold WebSocket connection for real-time streaming
- Maintain in-flight conversation state (no DB round-trip for active turns)
- Buffer tool call results during multi-step agent execution
- Enforce single-session-per-user (prevent duplicate streams)
Lifecycle:
- User opens copilot → Durable Object created (or resumed)
- Conversation turns stored in-memory during active session
- On session idle (5 min timeout) → flush to CF KV + PostgreSQL
- On explicit close → full persistence + memory extraction
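The buffering and flush behaviour in this lifecycle can be modelled with a plain class, sketched below. This is deliberately not written against the Durable Objects API: `SessionBuffer`, its methods, and the `flush` callback are hypothetical stand-ins, and time is passed in explicitly to keep the logic testable. Memory extraction on close is left out.

```typescript
interface Turn {
  role: "user" | "assistant";
  text: string;
}

const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // 5-minute idle timeout from the lifecycle above

class SessionBuffer {
  private turns: Turn[] = [];
  private lastActivity: number;
  // Hypothetical persistence hook, standing in for the KV + PostgreSQL write.
  private readonly flush: (turns: Turn[]) => void;

  constructor(flush: (turns: Turn[]) => void, now: number) {
    this.flush = flush;
    this.lastActivity = now;
  }

  // Active turns stay in memory; no database round-trip here.
  addTurn(turn: Turn, now: number): void {
    this.turns.push(turn);
    this.lastActivity = now;
  }

  // Intended to be called periodically; flushes once the session has been idle long enough.
  checkIdle(now: number): boolean {
    if (this.turns.length === 0 || now - this.lastActivity < IDLE_TIMEOUT_MS) {
      return false;
    }
    this.flush(this.turns);
    this.turns = [];
    return true;
  }

  // Explicit close: persist whatever is still buffered.
  close(): void {
    if (this.turns.length > 0) this.flush(this.turns);
    this.turns = [];
  }
}
```

In an actual Durable Object the idle check would more likely be driven by a scheduled wake-up than by polling, but the state transitions stay the same.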
2.7 Tool Execution & Safety
When an agent calls a tool (e.g., blog.create_draft), the execution goes through safety checks:
Agent decides to call tool
│
▼
┌──────────────────────┐
│ Permission Check │ Does this user have blog.write?
│ (core-access) │ NO → reject, inform user
└──────────┬───────────┘
│ YES
▼
┌──────────────────────┐
│ Destructive Check │ Is this a delete/update/irreversible action?
│ │ YES → require user confirmation
└──────────┬───────────┘
│ SAFE or CONFIRMED
▼
┌──────────────────────┐
│ Rate Limit Check │ Has this tenant hit tool call limits?
│ (cost governor) │ YES → graceful limit message
└──────────┬───────────┘
│ OK
▼
┌──────────────────────┐
│ Execute via Gateway │ Internal service call with x-gateway-key
│ │ Returns result to agent
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Log Decision │ Record tool call + params + result
│ │ for outcome tracking
└──────────────────────┘

3. Data Flow Guarantees
- Tenant isolation is absolute. Every query (memory retrieval, context assembly, insight generation) is scoped to `tenant_id`. Row-Level Security on all AI tables acts as a safety net. A memory from Tenant A is never visible to Tenant B, even during vector similarity search.
- No data loss on stream disconnect. If the SSE connection drops mid-response, the Durable Object retains the full response. The client can reconnect and receive the completed response. Conversation state is flushed to KV on disconnect.
- At-least-once for deferred consolidation. CF Queues guarantee at-least-once delivery. Memory consolidation jobs are idempotent: processing the same conversation twice produces the same memories. Deduplication is via `source_id` on `ai_memory`.
- Cost tracking is synchronous. Usage is checked BEFORE the LLM call, not after. If a tenant has 2 interactions remaining, the cost governor blocks the 3rd call before any tokens are consumed. Usage is incremented atomically.
- LLM calls never touch tenant data directly. The agent calls tools, tools call services, services access the database. The LLM provider never receives raw database credentials or direct data access. All data is pre-assembled by the context assembler and passed as prompt context.
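The synchronous cost-tracking guarantee amounts to a check-and-reserve step that runs before the model is invoked. The sketch below uses an in-memory map purely for illustration. `UsageCounter` and `guardedCall` are hypothetical names, and a real deployment would need the check and increment to be a single atomic operation in the backing store, which this sketch does not show.

```typescript
// In-memory stand-in for the usage store (assumption: the real store
// performs this check-and-increment atomically).
class UsageCounter {
  private used = new Map<string, number>();

  // Reserve one interaction if the tenant is under its limit.
  // Returns false when the tenant is already at the limit.
  tryConsume(tenantId: string, monthlyLimit: number): boolean {
    const current = this.used.get(tenantId) ?? 0;
    if (current >= monthlyLimit) return false;
    this.used.set(tenantId, current + 1);
    return true;
  }
}

// Run the LLM call only after usage has been reserved.
// Returns null when the tenant is blocked, so no tokens are consumed.
function guardedCall<T>(
  counter: UsageCounter,
  tenantId: string,
  monthlyLimit: number,
  callLlm: () => T,
): T | null {
  if (!counter.tryConsume(tenantId, monthlyLimit)) return null;
  return callLlm();
}
```

With a limit of 50 and 48 interactions already used, the next two calls go through and the third returns `null` without invoking `callLlm`.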
4. Open Design Decisions (TBD)
- Durable Object per-tenant vs per-session? Per-session is simpler but means losing DO state when the user refreshes. Per-tenant allows resuming but requires more complex state management. Leaning toward per-tenant with session windowing.
- Should the insight engine use an LLM for every tenant? At scale (10k+ tenants), running Haiku-class inference every 6 hours per tenant could be expensive. Alternative: rule-based anomaly detection first (cheap), then LLM only for tenants with detected anomalies (targeted). This hybrid approach is likely optimal.
- Vector embedding model. Currently planning `text-embedding-3-small` (1536 dims, cheap). Should we use `text-embedding-3-large` (3072 dims, better quality) for semantic memory? Trade-off: storage cost vs retrieval accuracy. Start with small, benchmark, upgrade if needed.
- Multi-step agent execution timeout. If an agent needs to call 5+ tools sequentially (e.g., creating a month-long content calendar), the total execution time could exceed user patience. Should we cap at N tool calls per turn? Or show progressive results via SSE?
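For the embedding model question, the storage side of the trade-off can be estimated directly. The sketch below relies on pgvector's documented storage size for the `vector` type (4 bytes per dimension plus 8 bytes of overhead). It covers raw column storage only; index size and row overhead are excluded, and the function names are hypothetical.

```typescript
// Bytes needed to store one pgvector `vector` value with the given dimensions.
function vectorBytes(dims: number): number {
  return 4 * dims + 8;
}

// Raw vector storage in decimal gigabytes for a given number of memories.
function vectorStorageGb(dims: number, rows: number): number {
  return (vectorBytes(dims) * rows) / 1e9;
}

const smallPerRow = vectorBytes(1536); // 6152 bytes
const largePerRow = vectorBytes(3072); // 12296 bytes
```

At one million stored memories that is roughly 6.2 GB for the small model versus 12.3 GB for the large one, so the larger model about doubles vector storage before any index is built.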