System Architecture: HLD & LLD — AI Brain

Last Updated: 2026-04-03
Status: Draft


1. High-Level Design (HLD)

1.1 Core Architectural Components

| Component | Responsibility | Technology |
|---|---|---|
| Brain Service | API layer: receives requests, streams responses, exposes insight endpoints | Hono on Cloudflare Workers |
| Router Agent | Intent classification, specialist delegation, multi-step orchestration | Haiku-class LLM (fast, cheap) |
| Specialist Agents | Domain-specific reasoning (blog, content, analytics, chat) | Sonnet/Opus-class LLM (capable) |
| Tool Registry | Maps service capabilities as callable tools for agents | TypeScript module per service |
| Context Assembler | Two-pass lazy context assembly: intent first, then targeted data pull | TypeScript + Neon queries |
| Memory Layer | Tri-layer memory: working (KV), episodic (PG), semantic (pgvector) | CF KV + Neon + pgvector |
| Consolidation Engine | Three-tier memory consolidation: immediate, deferred, periodic | CF Queues + Cron Triggers |
| Insight Engine | Proactive anomaly detection, pattern surfacing, notification generation | Cron Triggers + Haiku-class LLM |
| Cost Governor | Per-tenant usage tracking, plan-based limits, model routing by tier | Core-billing integration |
| Eval Scorer | Decision tracking, outcome measurement, quality metrics | PostgreSQL + Cron Triggers |

1.2 Overall System Diagram (HLD)


2. Low-Level Design (LLD)

2.1 Request Lifecycle — From User Message to Streamed Response

2.2 Two-Pass Lazy Context Assembly

The Challenge:

  1. Pulling context from all services on every request wastes tokens and adds latency
  2. Irrelevant context confuses the LLM and degrades response quality
  3. Token budgets are finite — every wasted token is a missed insight

The Solution: Intent-First, Then Targeted Pull

Pass 1 — Intent Classification (fast, ~100ms):

The router agent receives only the raw user message and minimal session info, with no service data and no memories. It returns a classification such as:

{
  "intent": "content_creation",
  "confidence": 0.92,
  "services": ["blog"],
  "entities": ["posts", "schedule", "topics"],
  "time_scope": "future",
  "specialist": "blog",
  "action_required": true
}
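
Because the router's output drives everything in Pass 2, it is worth validating before use. The sketch below is one way to do that, assuming the field names from the JSON example above; the value types, the `[0, 1]` range for `confidence`, and the names `IntentClassification` and `parseIntent` are illustrative assumptions, not part of the design.

```typescript
// Shape of the router's Pass 1 output. Field names mirror the JSON example;
// the value types are inferred from it and are assumptions.
interface IntentClassification {
  intent: string;
  confidence: number; // assumed to lie in [0, 1]
  services: string[];
  entities: string[];
  time_scope: string;
  specialist: string;
  action_required: boolean;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Parses raw router output and returns null for anything off-shape, so the
// caller can fall back (for example, to a default specialist).
function parseIntent(raw: string): IntentClassification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const { intent, confidence, services, entities, time_scope, specialist, action_required } =
    data as Record<string, unknown>;
  if (
    typeof intent !== "string" ||
    typeof confidence !== "number" ||
    confidence < 0 ||
    confidence > 1 ||
    !isStringArray(services) ||
    !isStringArray(entities) ||
    typeof time_scope !== "string" ||
    typeof specialist !== "string" ||
    typeof action_required !== "boolean"
  ) {
    return null;
  }
  return { intent, confidence, services, entities, time_scope, specialist, action_required };
}
```

Returning `null` rather than throwing keeps the hot path free of exception handling; what the router does on a `null` result is left open here.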

Pass 2 — Targeted Pull (parallel, ~30ms):

Based on the intent, the context assembler sets per-section token budgets and a fetch plan, then pulls only the data that plan names:

{
  "token_budget": {
    "system_prompt": 500,
    "tools": 1500,
    "tenant_profile": 200,
    "memories": 1500,
    "service_data": 2000,
    "conversation_history": 2000,
    "safety_buffer": 300,
    "total": 8000
  },
  "fetch_plan": {
    "services": ["blog"],
    "memory_entities": ["posts", "schedule", "topics"],
    "memory_types": ["preference", "pattern", "fact"],
    "memory_min_importance": 0.3,
    "memory_max_age": null,
    "history_window": 10
  }
}
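
A minimal sketch of how the fetch plan above might be derived from the Pass 1 result. The budget numbers and plan defaults are copied from the example; the names `TokenBudget`, `FetchPlan`, `totalTokens`, and `buildFetchPlan` are hypothetical, and a real assembler would presumably vary the defaults by intent.

```typescript
interface TokenBudget {
  system_prompt: number;
  tools: number;
  tenant_profile: number;
  memories: number;
  service_data: number;
  conversation_history: number;
  safety_buffer: number;
}

// Per-section budgets from the example above.
const DEFAULT_BUDGET: TokenBudget = {
  system_prompt: 500,
  tools: 1500,
  tenant_profile: 200,
  memories: 1500,
  service_data: 2000,
  conversation_history: 2000,
  safety_buffer: 300,
};

// Sums the sections; for DEFAULT_BUDGET this matches the "total" field (8000).
function totalTokens(b: TokenBudget): number {
  return (
    b.system_prompt + b.tools + b.tenant_profile + b.memories +
    b.service_data + b.conversation_history + b.safety_buffer
  );
}

interface FetchPlan {
  services: string[];
  memory_entities: string[];
  memory_types: string[];
  memory_min_importance: number;
  memory_max_age: number | null; // null = no age limit (assumed meaning)
  history_window: number;
}

// Copies the intent's services and entities into the plan; the remaining
// values are the defaults shown in the example.
function buildFetchPlan(intent: { services: string[]; entities: string[] }): FetchPlan {
  return {
    services: intent.services.slice(),
    memory_entities: intent.entities.slice(),
    memory_types: ["preference", "pattern", "fact"],
    memory_min_importance: 0.3,
    memory_max_age: null,
    history_window: 10,
  };
}
```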

If the assembled context exceeds any section budget, the compressor:

  1. Drops lowest-scored memories first
  2. Filters service data to only intent-relevant fields
  3. Summarizes older conversation turns, keeps recent turns verbatim
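
Steps 1 and 3 above can be sketched as two small pure functions. This assumes each memory already carries a relevance `score` and a precomputed `tokens` count; those fields and the function names are hypothetical. The summarization call itself is out of scope here, so the second function only decides which turns would be summarized.

```typescript
interface Memory {
  id: string;
  score: number;  // higher = more relevant (assumed scoring)
  tokens: number; // precomputed token count (assumed)
}

// Step 1: drop the lowest-scored memories until the section fits its budget.
function dropLowestScored(memories: Memory[], budget: number): Memory[] {
  const kept = memories.slice().sort((a, b) => b.score - a.score);
  let total = kept.reduce((sum, m) => sum + m.tokens, 0);
  while (total > budget && kept.length > 0) {
    const dropped = kept.pop(); // lowest score sits at the end
    if (dropped) total -= dropped.tokens;
  }
  return kept;
}

interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Step 3: keep the most recent turns verbatim; the rest are summarization candidates.
function splitHistory(turns: Turn[], keepRecent: number): { toSummarize: Turn[]; verbatim: Turn[] } {
  const cut = Math.max(0, turns.length - keepRecent);
  return { toSummarize: turns.slice(0, cut), verbatim: turns.slice(cut) };
}
```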

2.3 Multi-LLM Routing & Fallback

Model Selection Strategy:

| Task | Model Tier | Latency Target | Cost/1K tokens |
|---|---|---|---|
| Intent classification | Haiku-class (Claude Haiku, GPT-4o-mini) | < 150ms | ~$0.001 |
| Simple queries (facts, status) | Haiku-class | < 200ms | ~$0.001 |
| Content generation, analysis | Sonnet-class (Claude Sonnet, GPT-4o) | < 500ms first token | ~$0.01 |
| Complex reasoning, multi-step planning | Opus-class (Claude Opus) | < 1s first token | ~$0.05 |

Plan-Based Model Access:

| Plan | Models Available | Max Interactions/Month |
|---|---|---|
| Free | Haiku-class only | 50 |
| Starter | Haiku + Sonnet | 500 |
| Pro | Haiku + Sonnet + Opus (for complex) | 5,000 |
| Business | All models, full access | Unlimited |
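
The two tables combine into a simple lookup: pick the tier the task prefers, then clamp it to what the plan allows. The sketch below assumes that a task whose preferred tier is unavailable is served by the most capable tier the plan does include; the document does not state this downgrade rule, and the identifiers are hypothetical.

```typescript
type Tier = "haiku" | "sonnet" | "opus";
type Plan = "free" | "starter" | "pro" | "business";
type Task = "intent_classification" | "simple_query" | "content_generation" | "complex_reasoning";

// Least to most capable.
const TIER_ORDER: Tier[] = ["haiku", "sonnet", "opus"];

// Preferred tier per task, from the Model Selection Strategy table.
const PREFERRED_TIER: Record<Task, Tier> = {
  intent_classification: "haiku",
  simple_query: "haiku",
  content_generation: "sonnet",
  complex_reasoning: "opus",
};

// Tiers per plan, from the Plan-Based Model Access table.
const PLAN_TIERS: Record<Plan, Tier[]> = {
  free: ["haiku"],
  starter: ["haiku", "sonnet"],
  pro: ["haiku", "sonnet", "opus"],
  business: ["haiku", "sonnet", "opus"],
};

// Walks down from the preferred tier to the best tier the plan allows.
// The downgrade behavior is an assumption, not a stated requirement.
function selectTier(plan: Plan, task: Task): Tier {
  const allowed = PLAN_TIERS[plan];
  for (let rank = TIER_ORDER.indexOf(PREFERRED_TIER[task]); rank >= 0; rank--) {
    const tier = TIER_ORDER[rank];
    if (tier !== undefined && allowed.indexOf(tier) !== -1) return tier;
  }
  return "haiku"; // every plan includes the Haiku tier
}
```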

Fallback Chain:

Primary Provider (e.g., Claude Sonnet)
    │ timeout/error
    ▼
Secondary Provider (e.g., GPT-4o)
    │ timeout/error
    ▼
Tertiary Provider (e.g., Gemini 1.5 Flash)
    │ timeout/error
    ▼
Graceful Degradation Message
"I'm having trouble right now — try again in a few minutes."

Each fallback is logged. If a provider fails more than 3 times within 5 minutes, the router skips it for the next 10 minutes (circuit breaker).
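
A sketch of the chain plus circuit breaker, under stated assumptions: each provider is wrapped in a hypothetical `call` function that rejects on timeout or error, breaker state lives in memory (a real deployment would need shared state across Workers), and the clock is injected so the thresholds can be tested. When every provider is skipped or fails, the function returns nulls and the caller shows the degradation message.

```typescript
const FAILURE_WINDOW_MS = 5 * 60 * 1000;
const FAILURE_THRESHOLD = 3; // the breaker opens when failures exceed this
const COOLDOWN_MS = 10 * 60 * 1000;

class CircuitBreaker {
  private failures: number[] = [];
  private openUntil = 0;

  isOpen(now: number): boolean {
    return now < this.openUntil;
  }

  recordFailure(now: number): void {
    // Keep only failures inside the 5-minute window, then add this one.
    this.failures = this.failures.filter((t) => now - t < FAILURE_WINDOW_MS);
    this.failures.push(now);
    if (this.failures.length > FAILURE_THRESHOLD) {
      this.openUntil = now + COOLDOWN_MS;
      this.failures = [];
    }
  }
}

interface Provider {
  name: string;
  call: (prompt: string) => Promise<string>; // hypothetical wrapper; rejects on timeout/error
}

async function callWithFallback(
  providers: Provider[],
  breakers: Map<string, CircuitBreaker>,
  prompt: string,
  now: () => number = Date.now,
): Promise<{ provider: string | null; text: string | null }> {
  for (const p of providers) {
    let breaker = breakers.get(p.name);
    if (!breaker) {
      breaker = new CircuitBreaker();
      breakers.set(p.name, breaker);
    }
    if (breaker.isOpen(now())) continue; // skipped during cooldown
    try {
      return { provider: p.name, text: await p.call(prompt) };
    } catch (err) {
      console.warn(`fallback: ${p.name} failed`, err); // each fallback is logged
      breaker.recordFailure(now());
    }
  }
  return { provider: null, text: null }; // caller renders the degradation message
}
```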

2.4 Streaming Architecture (SSE)

The brain service streams responses via Server-Sent Events:

Client                          Brain Service
  │                                  │
  │── POST /brain/chat ─────────────>│
  │                                  │── classify intent
  │                                  │── assemble context
  │                                  │── call LLM (streaming)
  │<── event: token ─────────────────│ "Based"
  │<── event: token ─────────────────│ " on"
  │<── event: token ─────────────────│ " your"
  │<── event: token ─────────────────│ " data"
  │<── event: tool_call ─────────────│ { name: "blog.analytics", params: {...} }
  │<── event: tool_result ───────────│ { name: "blog.analytics", result: { views: 340, trend: "up" } }
  │<── event: token ─────────────────│ " your traffic"
  │<── event: token ─────────────────│ " increased..."
  │<── event: done ──────────────────│ { usage: { input: 2500, output: 180 } }
  │                                  │

Event types:

| Event | Payload | Purpose |
|---|---|---|
| token | { text: string } | Streaming text token |
| tool_call | { name, params } | Agent is calling a tool (shown as loading state in UI) |
| tool_result | { name, result } | Tool completed (UI can show inline data) |
| thinking | { text: string } | Agent's reasoning (optional, collapsible in UI) |
| done | { usage, decision_id } | Stream complete, usage stats |
| error | { code, message } | Error during processing |
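
The event table maps naturally onto a discriminated union, with one encoder that produces the SSE wire format (`event:` line, `data:` line, blank line). Payload field types beyond what the table shows (for example `decision_id` and `code` as strings) are assumptions, as are the names `BrainEvent` and `encodeSse`.

```typescript
type BrainEvent =
  | { event: "token"; data: { text: string } }
  | { event: "tool_call"; data: { name: string; params: Record<string, unknown> } }
  | { event: "tool_result"; data: { name: string; result: unknown } }
  | { event: "thinking"; data: { text: string } }
  | { event: "done"; data: { usage: { input: number; output: number }; decision_id: string } }
  | { event: "error"; data: { code: string; message: string } };

// Serializes one event as an SSE frame. JSON.stringify never emits raw
// newlines, so the payload always fits on a single "data:" line.
function encodeSse(e: BrainEvent): string {
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}
```

On the client, a `switch` on `event` narrows `data` to the matching payload type, which keeps UI handlers for loading states and inline results type-safe.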

2.5 Insight Engine — Proactive Intelligence

The Problem: A reactive-only AI waits for users to ask questions. Most business owners don't know what to ask — they need the AI to surface opportunities and problems proactively.

The Solution: A cron-triggered insight engine that scans each tenant's data every 6 hours and generates actionable insight cards.

Insight Types:

| Type | Trigger | Example |
|---|---|---|
| opportunity | Content gap detected from chatbot questions | "40% of chatbot queries are about eggless recipes, but you have no post on this" |
| anomaly | Traffic spike or drop beyond 2 standard deviations | "Your traffic dropped 35% this week, likely due to 0 posts published" |
| win | A post/campaign significantly outperformed average | "Your Diwali post got 3x average views. Seasonal recipes are your strength" |
| suggestion | Pattern-based optimization | "Your Monday posts get 40% more views than Friday posts. Consider shifting" |
| reminder | Scheduled content gap detected | "You have no posts scheduled for next week. Want me to plan some?" |
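
The anomaly trigger ("beyond 2 standard deviations") can be checked without any LLM call. A minimal sketch, assuming the baseline is a list of recent per-period traffic counts and using the sample standard deviation; the function names and the choice of baseline window are illustrative.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 in the denominator).
function sampleStdDev(xs: number[]): number {
  const mu = mean(xs);
  const variance = xs.reduce((sum, x) => sum + (x - mu) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

type Anomaly = { kind: "spike" | "drop"; zScore: number } | null;

// Flags `current` when it lies more than `threshold` standard deviations
// from the mean of `history`. Returns null when there is too little
// history or no variance to measure against.
function detectAnomaly(history: number[], current: number, threshold = 2): Anomaly {
  if (history.length < 2) return null;
  const sigma = sampleStdDev(history);
  if (sigma === 0) return null;
  const zScore = (current - mean(history)) / sigma;
  if (Math.abs(zScore) <= threshold) return null;
  return { kind: zScore > 0 ? "spike" : "drop", zScore };
}
```

For a baseline of `[100, 110, 90, 105, 95]` the mean is 100 and the sample standard deviation is about 7.9, so a reading of 65 is flagged as a drop while 104 is not flagged.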

2.6 Durable Objects — Session State Management

Each active copilot session is backed by a Cloudflare Durable Object:

Responsibilities:

  • Hold WebSocket connection for real-time streaming
  • Maintain in-flight conversation state (no DB round-trip for active turns)
  • Buffer tool call results during multi-step agent execution
  • Enforce single-session-per-user (prevent duplicate streams)

Lifecycle:

  1. User opens copilot → Durable Object created (or resumed)
  2. Conversation turns stored in-memory during active session
  3. On session idle (5 min timeout) → flush to CF KV + PostgreSQL
  4. On explicit close → full persistence + memory extraction
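
The lifecycle above reduces to a small state holder with two persistence hooks. The sketch is deliberately independent of the Durable Objects API: `SessionSinks` is a hypothetical stand-in for the KV/PostgreSQL write and the memory extraction step, and timestamps are passed in so the 5-minute idle rule is easy to test.

```typescript
const IDLE_TIMEOUT_MS = 5 * 60 * 1000;

interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Hypothetical persistence hooks; real versions would be async.
interface SessionSinks {
  flush(turns: Turn[]): void;           // stand-in for the CF KV + PostgreSQL write
  extractMemories(turns: Turn[]): void; // stand-in for memory extraction
}

class SessionState {
  private turns: Turn[] = [];
  private unflushed = 0;
  private lastActivity: number;
  private sinks: SessionSinks;

  constructor(sinks: SessionSinks, now: number) {
    this.sinks = sinks;
    this.lastActivity = now;
  }

  // Step 2: turns stay in memory while the session is active.
  addTurn(turn: Turn, now: number): void {
    this.turns.push(turn);
    this.unflushed += 1;
    this.lastActivity = now;
  }

  // Step 3: flush once the session has been idle for the timeout.
  flushIfIdle(now: number): boolean {
    if (this.unflushed === 0 || now - this.lastActivity < IDLE_TIMEOUT_MS) return false;
    this.persist();
    return true;
  }

  // Step 4: explicit close persists everything, then extracts memories.
  close(): void {
    this.persist();
    this.sinks.extractMemories(this.turns.slice());
  }

  private persist(): void {
    if (this.unflushed === 0) return;
    this.sinks.flush(this.turns.slice(this.turns.length - this.unflushed));
    this.unflushed = 0;
  }
}
```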

2.7 Tool Execution & Safety

When an agent calls a tool (e.g., blog.create_draft), the execution goes through safety checks:

Agent decides to call tool
           │
           ▼
┌──────────────────────┐
│ Permission Check     │ Does this user have blog.write?
│ (core-access)        │ NO → reject, inform user
└──────────┬───────────┘
           │ YES
           ▼
┌──────────────────────┐
│ Destructive Check    │ Is this a delete/update/irreversible action?
│                      │ YES → require user confirmation
└──────────┬───────────┘
           │ SAFE or CONFIRMED
           ▼
┌──────────────────────┐
│ Rate Limit Check     │ Has this tenant hit tool call limits?
│ (cost governor)      │ YES → graceful limit message
└──────────┬───────────┘
           │ OK
           ▼
┌──────────────────────┐
│ Execute via Gateway  │ Internal service call with x-gateway-key
│                      │ Returns result to agent
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ Log Decision         │ Record tool call + params + result
│                      │ for outcome tracking
└──────────────────────┘
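
The pipeline above is a sequence of guards followed by execute-then-log. In the sketch below, the checks are injected as a hypothetical `SafetyDeps` interface standing in for core-access, the cost governor, and the gateway; they are synchronous only to keep the example short, and `requiredPermission` and `destructive` are assumed to come from the tool's registry entry.

```typescript
interface ToolCall {
  name: string;               // e.g. "blog.create_draft"
  requiredPermission: string; // e.g. "blog.write" (hypothetical field)
  destructive: boolean;       // delete/update/irreversible
  params: Record<string, unknown>;
}

interface CallContext {
  userId: string;
  tenantId: string;
  confirmed: boolean; // the user has confirmed a destructive action
}

interface DecisionLogEntry {
  tool: string;
  params: Record<string, unknown>;
  result: unknown;
}

// Hypothetical stand-ins for core-access, the cost governor, and the gateway.
interface SafetyDeps {
  hasPermission(userId: string, permission: string): boolean;
  withinRateLimit(tenantId: string): boolean;
  execute(call: ToolCall): unknown;
  logDecision(entry: DecisionLogEntry): void;
}

type ToolOutcome =
  | { status: "rejected"; reason: "permission" | "rate_limit" }
  | { status: "needs_confirmation" }
  | { status: "ok"; result: unknown };

// Runs the checks in the order shown in the diagram; the tool executes
// only when all three pass, and only executed calls are logged.
function runTool(call: ToolCall, ctx: CallContext, deps: SafetyDeps): ToolOutcome {
  if (!deps.hasPermission(ctx.userId, call.requiredPermission)) {
    return { status: "rejected", reason: "permission" };
  }
  if (call.destructive && !ctx.confirmed) {
    return { status: "needs_confirmation" };
  }
  if (!deps.withinRateLimit(ctx.tenantId)) {
    return { status: "rejected", reason: "rate_limit" };
  }
  const result = deps.execute(call);
  deps.logDecision({ tool: call.name, params: call.params, result });
  return { status: "ok", result };
}
```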

3. Data Flow Guarantees

  1. Tenant isolation is absolute. Every query (memory retrieval, context assembly, insight generation) is scoped to tenant_id. Row-Level Security is enabled on all AI tables as a safety net. A memory from Tenant A is never visible to Tenant B, even during vector similarity search.

  2. No data loss on stream disconnect. If the SSE connection drops mid-response, the Durable Object retains the full response. The client can reconnect and receive the completed response. Conversation state is flushed to KV on disconnect.

  3. At-least-once for deferred consolidation. CF Queues guarantee at-least-once delivery. Memory consolidation jobs are idempotent: processing the same conversation twice produces the same memories. Deduplication is keyed on source_id in ai_memory.

  4. Cost tracking is synchronous. Usage is checked BEFORE the LLM call, not after. If a tenant has 2 interactions remaining, the cost governor blocks the 3rd call before any tokens are consumed. Usage is incremented atomically.

  5. LLM calls never touch tenant data directly. The agent calls tools, tools call services, services access the database. The LLM provider never receives raw database credentials or direct data access. All data is pre-assembled by the context assembler and passed as prompt context.
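
Guarantees 3 and 4 both come down to making a write safe to repeat or race. The sketch below models them in memory: `tryReserve` checks and increments usage in one step before any LLM call, and `upsert` keys memories on `source_id` so redelivered jobs overwrite rather than duplicate. In production the same effect would come from the database (for example a conditional update and a unique constraint); the class names and the `conversationId:index` key format are hypothetical.

```typescript
// Guarantee 4: reserve an interaction BEFORE the LLM call.
class CostGovernor {
  private used = new Map<string, number>();

  // Check and increment in one step. Single-threaded JS makes this atomic
  // here; a shared store would need an atomic conditional update instead.
  tryReserve(tenantId: string, monthlyLimit: number): boolean {
    const used = this.used.get(tenantId) ?? 0;
    if (used >= monthlyLimit) return false; // blocked before any tokens are spent
    this.used.set(tenantId, used + 1);
    return true;
  }
}

// Guarantee 3: idempotent consolidation via source_id.
interface MemoryRow {
  source_id: string;
  content: string;
}

class MemoryStore {
  private rows = new Map<string, MemoryRow>();

  // Same source_id overwrites, so a redelivered job cannot duplicate rows.
  upsert(row: MemoryRow): void {
    this.rows.set(row.source_id, row);
  }

  size(): number {
    return this.rows.size;
  }
}

// Derives a deterministic source_id per extracted fact (hypothetical format).
function consolidate(store: MemoryStore, conversationId: string, facts: string[]): void {
  facts.forEach((content, i) => {
    store.upsert({ source_id: `${conversationId}:${i}`, content });
  });
}
```

With a limit of 50 and 48 interactions already used, the next two `tryReserve` calls succeed and the third returns `false`, which matches the "2 interactions remaining" example in guarantee 4.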


4. Open Design Decisions (TBD)

  1. Durable Object per-tenant vs per-session? Per-session is simpler but means losing DO state when the user refreshes. Per-tenant allows resuming but requires more complex state management. Leaning toward per-tenant with session windowing.

  2. Should the insight engine use an LLM for every tenant? At scale (10k+ tenants), running Haiku-class inference every 6 hours per tenant could be expensive. Alternative: run rule-based anomaly detection first (cheap), then call the LLM only for tenants with detected anomalies (targeted). This hybrid approach is likely optimal.

  3. Vector embedding model. Currently planning text-embedding-3-small (1536 dims, cheap). Should we use text-embedding-3-large (3072 dims, better quality) for semantic memory? Trade-off: storage cost vs retrieval accuracy. Start with small, benchmark, upgrade if needed.

  4. Multi-step agent execution timeout. If an agent needs to call 5+ tools sequentially (e.g., creating a month-long content calendar), the total execution time could exceed user patience. Should we cap at N tool calls per turn? Or show progressive results via SSE?
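
For decision 3, the storage side of the trade-off is easy to estimate. The sketch assumes pgvector's documented storage size for the `vector` type of 4 bytes per dimension plus 8 bytes of overhead, and ignores index size and row overhead; the row count is a made-up figure for illustration.

```typescript
// Per-vector size for pgvector's `vector` type: 4 bytes per dimension
// plus 8 bytes of overhead (index and row overhead not included).
function vectorBytes(dims: number): number {
  return 4 * dims + 8;
}

// Raw vector storage for `rows` memories, in MiB.
function storageMiB(dims: number, rows: number): number {
  return (vectorBytes(dims) * rows) / (1024 * 1024);
}

const HYPOTHETICAL_ROWS = 1_000_000; // illustrative only, not a projection

const smallMiB = storageMiB(1536, HYPOTHETICAL_ROWS); // text-embedding-3-small
const largeMiB = storageMiB(3072, HYPOTHETICAL_ROWS); // text-embedding-3-large

console.log(`small: ${smallMiB.toFixed(0)} MiB, large: ${largeMiB.toFixed(0)} MiB`);
```

Per vector that is 6,152 bytes versus 12,296 bytes, so the larger model roughly doubles raw vector storage; whether the retrieval gain justifies it is what the proposed benchmark would decide.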
