System Architecture: HLD & LLD — AI Brain

Last Updated: 2026-04-03
Status: Draft

1. High-Level Design (HLD)

1.1 Core Architectural Components

Component	Responsibility	Technology
Brain Service	API layer — receives requests, streams responses, exposes insight endpoints	Hono on Cloudflare Workers
Router Agent	Intent classification, specialist delegation, multi-step orchestration	Haiku-class LLM (fast, cheap)
Specialist Agents	Domain-specific reasoning (blog, content, analytics, chat)	Sonnet/Opus-class LLM (capable)
Tool Registry	Maps service capabilities as callable tools for agents	TypeScript module per service
Context Assembler	Two-pass lazy context assembly — intent-first, then targeted data pull	TypeScript + Neon queries
Memory Layer	Tri-layer memory: working (KV), episodic (PG), semantic (pgvector)	CF KV + Neon + pgvector
Consolidation Engine	Three-tier memory consolidation: immediate, deferred, periodic	CF Queues + Cron Triggers
Insight Engine	Proactive anomaly detection, pattern surfacing, notification generation	Cron Triggers + Haiku-class LLM
Cost Governor	Per-tenant usage tracking, plan-based limits, model routing by tier	Core-billing integration
Eval Scorer	Decision tracking, outcome measurement, quality metrics	PostgreSQL + Cron Triggers

1.2 Overall System Diagram (HLD)

2. Low-Level Design (LLD)

2.1 Request Lifecycle — From User Message to Streamed Response

2.2 Two-Pass Lazy Context Assembly

The Challenge:

Pulling context from all services on every request wastes tokens and adds latency
Irrelevant context confuses the LLM and degrades response quality
Token budgets are finite — every wasted token is a missed insight

The Solution: Intent-First, Then Targeted Pull

Pass 1 — Intent Classification (fast, ~100ms):

The router agent receives only the raw user message + minimal session info. No service data, no memories. It classifies:

{
  "intent": "content_creation",
  "confidence": 0.92,
  "services": ["blog"],
  "entities": ["posts", "schedule", "topics"],
  "time_scope": "future",
  "specialist": "blog",
  "action_required": true
}
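
Because Pass 2 is driven entirely by this object, it may be worth validating the router's output before acting on it. The following is a minimal sketch, assuming the router returns the JSON above as a raw string. The field names come from the example; `IntentResult`, `isStringArray`, and `parseIntent` are hypothetical names, and returning `null` on malformed output is an assumed policy.

```typescript
// Shape of the router agent's classification, mirroring the JSON example above.
interface IntentResult {
  intent: string;
  confidence: number; // expected in the range 0..1
  services: string[];
  entities: string[];
  time_scope: string;
  specialist: string;
  action_required: boolean;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Hypothetical helper: parse raw router output and reject anything off-schema.
function parseIntent(raw: string): IntentResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // the model returned non-JSON text
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;
  const ok =
    typeof d.intent === "string" &&
    typeof d.confidence === "number" &&
    d.confidence >= 0 &&
    d.confidence <= 1 &&
    isStringArray(d.services) &&
    isStringArray(d.entities) &&
    typeof d.time_scope === "string" &&
    typeof d.specialist === "string" &&
    typeof d.action_required === "boolean";
  return ok ? (d as unknown as IntentResult) : null;
}
```

A `null` result could be treated as a classification failure, for example by retrying Pass 1 or falling back to a general chat specialist; the source does not specify which.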

Pass 2 — Targeted Pull (parallel, ~30ms):

Based on the intent, the context assembler fetches ONLY relevant data:

{
  "token_budget": {
    "system_prompt": 500,
    "tools": 1500,
    "tenant_profile": 200,
    "memories": 1500,
    "service_data": 2000,
    "conversation_history": 2000,
    "safety_buffer": 300,
    "total": 8000
  },
  "fetch_plan": {
    "services": ["blog"],
    "memory_entities": ["posts", "schedule", "topics"],
    "memory_types": ["preference", "pattern", "fact"],
    "memory_min_importance": 0.3,
    "memory_max_age": null,
    "history_window": 10
  }
}
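
One way to derive the fetch plan from the Pass 1 result is sketched below. The interfaces mirror the JSON above; `buildFetchPlan`, `sectionSum`, and `DEFAULT_BUDGET` are hypothetical names, and the fixed defaults for memory types, importance, and history window are simply copied from the example, since the source does not say how they are chosen.

```typescript
interface TokenBudget {
  system_prompt: number;
  tools: number;
  tenant_profile: number;
  memories: number;
  service_data: number;
  conversation_history: number;
  safety_buffer: number;
  total: number;
}

interface FetchPlan {
  services: string[];
  memory_entities: string[];
  memory_types: string[];
  memory_min_importance: number;
  memory_max_age: number | null; // null means no age limit
  history_window: number;
}

// Values copied from the example above.
const DEFAULT_BUDGET: TokenBudget = {
  system_prompt: 500,
  tools: 1500,
  tenant_profile: 200,
  memories: 1500,
  service_data: 2000,
  conversation_history: 2000,
  safety_buffer: 300,
  total: 8000,
};

// Sum of the per-section budgets, used to check they add up to `total`.
function sectionSum(b: TokenBudget): number {
  return (
    b.system_prompt + b.tools + b.tenant_profile + b.memories +
    b.service_data + b.conversation_history + b.safety_buffer
  );
}

// Hypothetical mapping from the Pass 1 intent to a Pass 2 fetch plan.
function buildFetchPlan(intent: { services: string[]; entities: string[] }): FetchPlan {
  return {
    services: [...intent.services],
    memory_entities: [...intent.entities],
    memory_types: ["preference", "pattern", "fact"], // assumed default
    memory_min_importance: 0.3, // assumed default
    memory_max_age: null,
    history_window: 10, // assumed default
  };
}
```

In the example, the section budgets sum to exactly 8000, so `sectionSum(DEFAULT_BUDGET) === DEFAULT_BUDGET.total` holds and could serve as a startup check.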

If the assembled context exceeds any section budget, the compressor:

Drops lowest-scored memories first
Filters service data to only intent-relevant fields
Summarizes older conversation turns, keeps recent turns verbatim
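
Two of the compressor steps above can be sketched as follows. This is an illustration under assumptions, not the actual compressor: `Memory`, `Turn`, `fitMemories`, and `compressHistory` are hypothetical names, token counts are assumed to be precomputed, and the summarizer is passed in as a function because the source does not say how summaries are produced.

```typescript
interface Memory {
  text: string;
  score: number; // relevance score; higher-scored memories are kept longer
  tokens: number; // assumed precomputed token count
}

// Drop the lowest-scored memories until the section fits its token budget.
function fitMemories(memories: Memory[], budget: number): Memory[] {
  const kept = [...memories].sort((a, b) => b.score - a.score);
  let used = kept.reduce((sum, m) => sum + m.tokens, 0);
  while (used > budget) {
    const dropped = kept.pop(); // last element has the lowest score
    if (!dropped) break;
    used -= dropped.tokens;
  }
  return kept;
}

interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Keep the most recent `keepRecent` turns verbatim and replace all older
// turns with a single summary turn.
function compressHistory(
  turns: Turn[],
  keepRecent: number,
  summarize: (older: Turn[]) => string,
): Turn[] {
  if (turns.length <= keepRecent) return turns;
  const older = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(turns.length - keepRecent);
  return [{ role: "assistant", text: summarize(older) }, ...recent];
}
```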

2.3 Multi-LLM Routing & Fallback

Model Selection Strategy:

Task	Model Tier	Latency Target	Cost/1K tokens
Intent classification	Haiku-class (Claude Haiku, GPT-4o-mini)	< 150ms	~$0.001
Simple queries (facts, status)	Haiku-class	< 200ms	~$0.001
Content generation, analysis	Sonnet-class (Claude Sonnet, GPT-4o)	< 500ms first token	~$0.01
Complex reasoning, multi-step planning	Opus-class (Claude Opus)	< 1s first token	~$0.05

Plan-Based Model Access:

Plan	Models Available	Max Interactions/Month
Free	Haiku-class only	50
Starter	Haiku + Sonnet	500
Pro	Haiku + Sonnet + Opus (for complex tasks)	5,000
Business	All models, full access	Unlimited
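
The two tables above can be combined into a single lookup, sketched below. The task and plan identifiers are hypothetical labels for the table rows, and downgrading to the best tier the plan allows (rather than rejecting the request) is an assumption; the source does not state which behavior is intended.

```typescript
type Tier = "haiku" | "sonnet" | "opus";
type Plan = "free" | "starter" | "pro" | "business";
type Task =
  | "intent_classification"
  | "simple_query"
  | "content_generation"
  | "complex_reasoning";

// Preferred tier per task, from the model selection table.
const PREFERRED_TIER: Record<Task, Tier> = {
  intent_classification: "haiku",
  simple_query: "haiku",
  content_generation: "sonnet",
  complex_reasoning: "opus",
};

// Tiers each plan may use, from the plan-based access table.
const PLAN_TIERS: Record<Plan, Tier[]> = {
  free: ["haiku"],
  starter: ["haiku", "sonnet"],
  pro: ["haiku", "sonnet", "opus"],
  business: ["haiku", "sonnet", "opus"],
};

const TIER_ORDER: Tier[] = ["haiku", "sonnet", "opus"]; // least to most capable

// Pick the preferred tier for the task, downgrading to the most capable
// tier the plan allows when the preferred one is not available.
function selectTier(plan: Plan, task: Task): Tier {
  const preferred = PREFERRED_TIER[task];
  const allowed = PLAN_TIERS[plan];
  const candidates = TIER_ORDER.slice(0, TIER_ORDER.indexOf(preferred) + 1).reverse();
  return candidates.find((t) => allowed.indexOf(t) !== -1) ?? "haiku";
}
```

Under these assumptions, a Free-plan tenant asking for complex reasoning is served by the Haiku tier, while a Pro-plan tenant gets Opus for the same task.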

Fallback Chain:

Primary Provider (e.g., Claude Sonnet)
    │ timeout/error
    ▼
Secondary Provider (e.g., GPT-4o)
    │ timeout/error
    ▼
Tertiary Provider (e.g., Gemini 1.5 Flash)
    │ timeout/error
    ▼
Graceful Degradation Message
"I'm having trouble right now — try again in a few minutes."

Each fallback is logged. If a provider fails more than 3 times within 5 minutes, the router opens a circuit breaker and skips that provider for the next 10 minutes.
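
The circuit breaker rule can be sketched as below. The thresholds come from the text; `CircuitBreaker`, `fallbackChain`, and the provider label strings are hypothetical, and the clock is passed in explicitly so the logic is easy to test. Where breaker state would actually live (for example KV or a Durable Object) is not specified in the source.

```typescript
const FAILURE_WINDOW_MS = 5 * 60 * 1000; // look-back window for counting failures
const FAILURE_THRESHOLD = 3; // the circuit opens when failures exceed this
const COOLDOWN_MS = 10 * 60 * 1000; // how long an open circuit is skipped

class CircuitBreaker {
  private failures = new Map<string, number[]>(); // provider -> failure timestamps
  private openUntil = new Map<string, number>(); // provider -> earliest retry time

  recordFailure(provider: string, now: number): void {
    const recent = (this.failures.get(provider) ?? []).filter(
      (t) => now - t < FAILURE_WINDOW_MS,
    );
    recent.push(now);
    this.failures.set(provider, recent);
    if (recent.length > FAILURE_THRESHOLD) {
      this.openUntil.set(provider, now + COOLDOWN_MS);
      this.failures.delete(provider); // start counting afresh after the cooldown
    }
  }

  isAvailable(provider: string, now: number): boolean {
    return now >= (this.openUntil.get(provider) ?? 0);
  }
}

// Providers to try, in order, skipping any whose circuit is currently open.
function fallbackChain(providers: string[], breaker: CircuitBreaker, now: number): string[] {
  return providers.filter((p) => breaker.isAvailable(p, now));
}
```

If `fallbackChain` returns an empty list, the router would go straight to the graceful degradation message.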

2.4 Streaming Architecture (SSE)

The brain service streams responses via Server-Sent Events:

Client                          Brain Service
  │                                  │
  │── POST /brain/chat ─────────────>│
  │                                  │── classify intent
  │                                  │── assemble context
  │                                  │── call LLM (streaming)
  │<── event: token ─────────────────│ "Based"
  │<── event: token ─────────────────│ " on"
  │<── event: token ─────────────────│ " your"
  │<── event: token ─────────────────│ " data"
  │<── event: tool_call ─────────────│ { name: "blog.analytics", params: {...} }
  │<── event: tool_result ───────────│ { views: 340, trend: "up" }
  │<── event: token ─────────────────│ " your traffic"
  │<── event: token ─────────────────│ " increased..."
  │<── event: done ──────────────────│ { usage: { input: 2500, output: 180 } }
  │                                  │

Event types:

Event	Payload	Purpose
`token`	`{ text: string }`	Streaming text token
`tool_call`	`{ name, params }`	Agent is calling a tool (shown as loading state in UI)
`tool_result`	`{ name, result }`	Tool completed (UI can show inline data)
`thinking`	`{ text: string }`	Agent's reasoning (optional, collapsible in UI)
`done`	`{ usage, decision_id }`	Stream complete, usage stats
`error`	`{ code, message }`	Error during processing
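
The event table maps naturally onto a discriminated union, and each event can be serialized using the standard SSE wire format (an `event:` line, a `data:` line, then a blank line). The sketch below assumes JSON payloads; `BrainEvent` and `encodeSse` are hypothetical names, and the types of `decision_id`, `code`, and `params` are assumptions, since the table does not specify them.

```typescript
type BrainEvent =
  | { type: "token"; payload: { text: string } }
  | { type: "tool_call"; payload: { name: string; params: Record<string, unknown> } }
  | { type: "tool_result"; payload: { name: string; result: unknown } }
  | { type: "thinking"; payload: { text: string } }
  | {
      type: "done";
      payload: { usage: { input: number; output: number }; decision_id: string };
    }
  | { type: "error"; payload: { code: string; message: string } };

// Serialize one event into SSE wire format. JSON.stringify escapes newlines,
// so the payload always fits on a single `data:` line.
function encodeSse(event: BrainEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event.payload)}\n\n`;
}
```

On the client, a handler that switches on `type` gets compile-time checking that every event in the table is handled.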

2.5 Insight Engine — Proactive Intelligence

The Problem: A reactive-only AI waits for users to ask questions. Most business owners don't know what to ask — they need the AI to surface opportunities and problems proactively.

The Solution: A cron-triggered insight engine that scans each tenant's data every 6 hours and generates actionable insight cards.

Insight Types:

Type	Trigger	Example
`opportunity`	Content gap detected from chatbot questions	"40% of chatbot queries are about eggless recipes — you have no post on this"
`anomaly`	Traffic spike or drop beyond 2 standard deviations	"Your traffic dropped 35% this week — likely due to 0 posts published"
`win`	A post/campaign significantly outperformed average	"Your Diwali post got 3x average views — seasonal recipes are your strength"
`suggestion`	Pattern-based optimization	"Your Monday posts get 40% more views than Friday posts — consider shifting"
`reminder`	Scheduled content gap detected	"You have no posts scheduled for next week — want me to plan some?"
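
The `anomaly` trigger ("beyond 2 standard deviations") can be computed without an LLM. The following is a minimal sketch, assuming a per-tenant history of periodic traffic counts; `stats` and `detectAnomaly` are hypothetical names, and the choice of population standard deviation and of the baseline window is an assumption.

```typescript
// Mean and population standard deviation of a series.
function stats(values: number[]): { mean: number; stdDev: number } {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) * (v - mean), 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flag the latest value when it lies more than `threshold` standard
// deviations away from the mean of the historical baseline.
function detectAnomaly(
  history: number[],
  latest: number,
  threshold = 2,
): "spike" | "drop" | null {
  if (history.length < 2) return null; // not enough data for a baseline
  const { mean, stdDev } = stats(history);
  if (stdDev === 0) return null; // flat baseline: no variation to compare against
  const z = (latest - mean) / stdDev;
  if (z > threshold) return "spike";
  if (z < -threshold) return "drop";
  return null;
}
```

For a baseline of `[100, 110, 90, 105, 95]` the mean is 100 and the standard deviation is about 7.07, so a latest value of 60 is flagged as a drop while 104 is not flagged.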

2.6 Durable Objects — Session State Management

Each active copilot session is backed by a Cloudflare Durable Object:

Responsibilities:

Hold the client's streaming connection (SSE, per section 2.4) for real-time streaming
Maintain in-flight conversation state (no DB round-trip for active turns)
Buffer tool call results during multi-step agent execution
Enforce single-session-per-user (prevent duplicate streams)

Lifecycle:

User opens copilot → Durable Object created (or resumed)
Conversation turns stored in-memory during active session
On session idle (5 min timeout) → flush to CF KV + PostgreSQL
On explicit close → full persistence + memory extraction
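
The buffering and flush rules in this lifecycle can be sketched independently of the Durable Objects runtime. In the sketch below, `SessionState` and `SessionTurn` are hypothetical names, the `persist` callback stands in for the write to CF KV and PostgreSQL, and the clock is passed in explicitly. Memory extraction on close is not modeled.

```typescript
interface SessionTurn {
  role: "user" | "assistant";
  text: string;
}

const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // 5 minute idle timeout from the lifecycle above

class SessionState {
  private turns: SessionTurn[] = [];
  private lastActivity: number;
  private persist: (turns: SessionTurn[]) => void;

  constructor(persist: (turns: SessionTurn[]) => void, now: number) {
    this.persist = persist;
    this.lastActivity = now;
  }

  // Active turns stay in memory; no database round-trip here.
  addTurn(turn: SessionTurn, now: number): void {
    this.turns.push(turn);
    this.lastActivity = now;
  }

  // Intended to be called periodically; flushes only after the idle timeout.
  flushIfIdle(now: number): boolean {
    if (this.turns.length === 0) return false;
    if (now - this.lastActivity < IDLE_TIMEOUT_MS) return false;
    this.flush();
    return true;
  }

  // Explicit close: persist whatever is still buffered.
  close(): void {
    this.flush();
  }

  private flush(): void {
    if (this.turns.length === 0) return;
    this.persist(this.turns);
    this.turns = [];
  }
}
```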

2.7 Tool Execution & Safety

When an agent calls a tool (e.g., blog.create_draft), the execution goes through safety checks:

Agent decides to call tool
    │
    ▼
┌──────────────────────┐
│ Permission Check     │ Does this user have blog.write?
│ (core-access)        │ NO → reject, inform user
└──────────┬───────────┘
           │ YES
           ▼
┌──────────────────────┐
│ Destructive Check    │ Is this a delete/update/irreversible action?
│                      │ YES → require user confirmation
└──────────┬───────────┘
           │ SAFE or CONFIRMED
           ▼
┌──────────────────────┐
│ Rate Limit Check     │ Has this tenant hit tool call limits?
│ (cost governor)      │ YES → graceful limit message
└──────────┬───────────┘
           │ OK
           ▼
┌──────────────────────┐
│ Execute via Gateway  │ Internal service call with x-gateway-key
│                      │ Returns result to agent
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Log Decision         │ Record tool call + params + result
│                      │ for outcome tracking
└──────────────────────┘
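
The first three gates in the diagram can be expressed as a pure function evaluated in the same order. This is a sketch under assumptions: `ToolCall`, `CallContext`, `Verdict`, and `checkToolCall` are hypothetical names, and marking a tool as destructive via a boolean flag in its definition is an assumed convention. Execution via the gateway and decision logging are not modeled.

```typescript
interface ToolCall {
  name: string; // e.g. "blog.create_draft"
  permission: string; // permission the tool requires, e.g. "blog.write"
  destructive: boolean; // delete/update/irreversible actions
}

interface CallContext {
  userPermissions: string[];
  confirmedByUser: boolean;
  tenantCallsUsed: number;
  tenantCallLimit: number;
}

type Verdict =
  | { status: "execute" }
  | { status: "rejected"; reason: string }
  | { status: "needs_confirmation" }
  | { status: "rate_limited" };

// Apply the checks in diagram order: permission, destructive, rate limit.
function checkToolCall(call: ToolCall, ctx: CallContext): Verdict {
  if (ctx.userPermissions.indexOf(call.permission) === -1) {
    return { status: "rejected", reason: `missing permission ${call.permission}` };
  }
  if (call.destructive && !ctx.confirmedByUser) {
    return { status: "needs_confirmation" };
  }
  if (ctx.tenantCallsUsed >= ctx.tenantCallLimit) {
    return { status: "rate_limited" };
  }
  return { status: "execute" };
}
```

Keeping the checks free of side effects means the same function can be unit-tested and reused when a confirmed call is re-submitted with `confirmedByUser` set.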

3. Data Flow Guarantees

Tenant isolation is absolute. Every query — memory retrieval, context assembly, insight generation — is scoped to tenant_id. Row-Level Security on all AI tables as a safety net. A memory from Tenant A is never visible to Tenant B, even during vector similarity search.
No data loss on stream disconnect. If the SSE connection drops mid-response, the Durable Object retains the full response. The client can reconnect and receive the completed response. Conversation state is flushed to KV on disconnect.
At-least-once for deferred consolidation. CF Queues guarantee at-least-once delivery. Memory consolidation jobs are idempotent — processing the same conversation twice produces the same memories. Deduplication via source_id on ai_memory.
Cost tracking is synchronous. Usage is checked BEFORE the LLM call, not after. If a tenant has 2 interactions remaining, the cost governor blocks the 3rd call before any tokens are consumed. Usage is incremented atomically.
LLM calls never touch tenant data directly. The agent calls tools, tools call services, and services access the database. The LLM provider never receives raw database credentials or direct data access. Data reaches the model only as prompt context pre-assembled by the context assembler, or as tool results returned through the gateway.
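
The "check before the call, increment atomically" guarantee for cost tracking can be illustrated with a reserve-then-call pattern. The sketch below keeps counters in memory, where a single-threaded check-and-increment is trivially atomic; `UsageCounter` and `tryReserve` are hypothetical names. In a real deployment the same effect would need a conditional update in the backing store, which the source does not detail.

```typescript
class UsageCounter {
  private used = new Map<string, number>(); // tenant_id -> interactions consumed

  // Reserve one interaction before the LLM call. Returns false, without
  // changing the count, when the tenant has no interactions remaining.
  tryReserve(tenantId: string, limit: number): boolean {
    const current = this.used.get(tenantId) ?? 0;
    if (current >= limit) return false;
    this.used.set(tenantId, current + 1);
    return true;
  }

  remaining(tenantId: string, limit: number): number {
    return Math.max(0, limit - (this.used.get(tenantId) ?? 0));
  }
}
```

With 2 interactions remaining, the first two `tryReserve` calls succeed and the third returns `false` before any tokens are consumed, matching the example in the guarantee above.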

4. Open Design Decisions (TBD)

Durable Object per-tenant vs per-session? Per-session is simpler but means losing DO state when the user refreshes. Per-tenant allows resuming but requires more complex state management. Leaning toward per-tenant with session windowing.
Should the insight engine use an LLM for every tenant? At scale (10k+ tenants), running Haiku-class inference every 6 hours per tenant could be expensive. Alternative: rule-based anomaly detection first (cheap), then LLM only for tenants with detected anomalies (targeted). This hybrid approach is likely optimal.
Vector embedding model. Currently planning text-embedding-3-small (1536 dims, cheap). Should we use text-embedding-3-large (3072 dims, better quality) for semantic memory? Trade-off: storage cost vs retrieval accuracy. Start with small, benchmark, upgrade if needed.
Multi-step agent execution timeout. If an agent needs to call 5+ tools sequentially (e.g., creating a month-long content calendar), the total execution time could exceed user patience. Should we cap at N tool calls per turn? Or show progressive results via SSE?