Memory & Context Orchestration Spec: AI Brain

Last Updated: 2026-04-03
Status: Draft

1. The Challenge

LLMs are stateless. Every request starts from zero. Without a memory system, the AI asks "What's your business?" on every conversation — even after 6 months of use.
Flat vector search is naive. Cosine similarity alone has no sense of time, importance, or relationships. It can rank a 3-month-old stale fact above yesterday's correction.
Pulling full context is wasteful. Loading blog stats, chat metrics, content calendar, billing info, and 50 memories for a simple "change my blog title" query burns tokens and confuses the LLM.
Nightly batch consolidation is too slow. If a user teaches the AI something at 9am, waiting until midnight to "learn" it means 15 hours of suboptimal responses.
Memory without evaluation is just storage. Without tracking whether memories actually improve responses, the system accumulates noise instead of knowledge.

2. The Tri-Layer Memory Architecture

Layer	Technology	Purpose	Speed	TTL
Working Memory	Cloudflare KV	Active conversation turns, session preferences, in-flight tool results	< 5ms	Session (flush on idle/close)
Episodic Memory	PostgreSQL	Full conversation history, decisions, outcomes — the audit trail	< 20ms	Permanent
Semantic Memory	pgvector on Neon	Learned knowledge as embeddings — facts, preferences, patterns, entities	< 30ms	Importance-gated (pruned when < 0.1)

3. Multi-Signal Memory Retrieval

3.1 The Problem with Pure Vector Search

Standard approach: embed query → cosine similarity → top K results.

This fails because:

A 3-month-old memory about "traffic was 500" scores high on similarity but is stale (traffic is now 800)
A rarely-accessed memory ranks the same as a frequently-used one
No awareness of which entity the user is asking about

3.2 The Scoring Formula

Every memory candidate receives a composite retrieval score:

RetrievalScore = (similarity × 0.40)
               + (recency × 0.25)
               + (importance × 0.20)
               + (access_frequency × 0.10)
               + (entity_match × 0.05)

Signal	Weight	Computation
`similarity`	0.40	Cosine similarity between query embedding and memory embedding
`recency`	0.25	`1 - (days_since_access / 365)`, clamped to [0, 1]
`importance`	0.20	Memory's current importance score (0.0–1.0)
`access_frequency`	0.10	`min(access_count / 20, 1.0)` — normalized to [0, 1]
`entity_match`	0.05	`1.0` if memory's `entity_ids` overlap with query entities, else `0.0`
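
As an illustration of the formula and signal table above, here is a minimal Python sketch. The `Candidate` class, the `clamp01` helper, and the sample values are assumptions made for this example; only the weights and the per-signal computations come from the table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Weights copied from the scoring formula.
WEIGHTS = {
    "similarity": 0.40,
    "recency": 0.25,
    "importance": 0.20,
    "access_frequency": 0.10,
    "entity_match": 0.05,
}


@dataclass
class Candidate:
    """Hypothetical in-memory view of one memory row."""
    similarity: float        # cosine similarity from the vector query
    importance: float        # 0.0-1.0
    access_count: int
    accessed_at: datetime
    entity_ids: set[str] = field(default_factory=set)


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def retrieval_score(c: Candidate, query_entities: set[str], now: datetime) -> float:
    days_since_access = (now - c.accessed_at).total_seconds() / 86400
    signals = {
        "similarity": c.similarity,
        "recency": clamp01(1 - days_since_access / 365),
        "importance": c.importance,
        "access_frequency": min(c.access_count / 20, 1.0),
        "entity_match": 1.0 if c.entity_ids & query_entities else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


# Illustrative values: a stale memory with higher similarity vs. a fresh correction.
now = datetime(2026, 4, 3, tzinfo=timezone.utc)
stale = Candidate(0.90, 0.5, 2, now - timedelta(days=90), {"traffic"})
fresh = Candidate(0.85, 0.5, 2, now - timedelta(days=1), {"traffic"})
```

With these sample values the fresh memory scores about 0.749 and the stale one about 0.708, so the recency signal is enough to reverse the order that pure cosine similarity would produce.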

3.3 Retrieval Process

Step 1: Embed query using text-embedding-3-small
Step 2: pgvector query → top 30 by cosine similarity (tenant-scoped)
Step 3: Compute RetrievalScore for each candidate
Step 4: Re-rank by RetrievalScore
Step 5: Return top 10 (or fewer, based on token budget)

SQL (Step 2):

SELECT id, content, memory_type, importance, access_count,
       entity_ids, accessed_at, created_at,
       1 - (embedding <=> $1) AS similarity
FROM ai_memory
WHERE tenant_id = $2
  AND (expires_at IS NULL OR expires_at > NOW())
  AND importance >= 0.1
ORDER BY embedding <=> $1
LIMIT 30;

The remaining scoring (recency, frequency, entity match) is computed in application code after the initial vector retrieval.
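
Steps 4 and 5 could look roughly like the sketch below. It assumes each candidate already carries a precomputed `retrieval_score` and a `tokens` estimate (both field names are hypothetical), and it uses the 1500-token memory budget from Section 6 as the default. Skipping an item that does not fit, rather than stopping, is a design choice of this sketch and is not stated in the spec.

```python
def rerank(candidates, max_items=10, token_budget=1500):
    """Sort by retrieval score and keep the best items that fit the token budget."""
    ranked = sorted(candidates, key=lambda c: c["retrieval_score"], reverse=True)
    selected, used = [], 0
    for c in ranked:
        if len(selected) == max_items:
            break
        if used + c["tokens"] > token_budget:
            continue  # does not fit; a smaller, lower-ranked item may still fit
        selected.append(c)
        used += c["tokens"]
    return selected


# Illustrative candidates, not real data.
candidates = [
    {"id": "m1", "retrieval_score": 0.71, "tokens": 600},
    {"id": "m2", "retrieval_score": 0.75, "tokens": 700},
    {"id": "m3", "retrieval_score": 0.60, "tokens": 400},
    {"id": "m4", "retrieval_score": 0.55, "tokens": 150},
]
top = rerank(candidates)
```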

3.4 Entity-Scoped Retrieval

When the intent classifier identifies specific entities (e.g., ["blog", "traffic"]), the retrieval adds an entity filter:

-- Entity-filtered branch: only memories linked to the relevant entities
AND entity_ids && ARRAY['blog', 'traffic']

The entity-filtered query is combined with the plain vector search using a UNION. During re-ranking, entity-matched memories get the 0.05 entity_match bonus and the others do not, which helps entity-linked memories surface even if their embedding similarity is slightly lower.
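
The effect of the UNION can be sketched in application code as follows. This is an approximation under assumptions: `union_candidates` and the row shape are hypothetical, and the spec does not say whether the merge happens in SQL or in the application.

```python
def union_candidates(vector_hits, entity_hits, query_entities):
    """Merge both result sets, keeping one row per memory id, and flag entity matches."""
    wanted = set(query_entities)
    merged = {}
    for row in list(vector_hits) + list(entity_hits):
        if row["id"] in merged:
            continue  # UNION semantics: a memory found by both branches appears once
        overlap = set(row["entity_ids"]) & wanted
        merged[row["id"]] = {**row, "entity_match": 1.0 if overlap else 0.0}
    return list(merged.values())


# Illustrative rows, not real data.
vector_hits = [
    {"id": 1, "entity_ids": ["billing"]},
    {"id": 2, "entity_ids": ["blog"]},
]
entity_hits = [
    {"id": 2, "entity_ids": ["blog"]},
    {"id": 3, "entity_ids": ["traffic", "blog"]},
]
pool = union_candidates(vector_hits, entity_hits, ["blog", "traffic"])
```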

4. Memory Lifecycle

4.1 Creation — Where Memories Come From

Source	Memory Types Created	Timing
User says something	`fact`, `preference`	Immediate (inline)
Agent makes a decision	`episode`	Immediate (inline)
Outcome is tracked	`pattern` (if repeated)	Deferred (via queue)
Consolidation compresses episodes	`pattern`	Periodic (6hr cron)
Entity extraction from conversations	`entity`	Deferred (via queue)
Insight engine detects trend	`fact`, `pattern`	Periodic (6hr cron)

4.2 Immediate Extraction (Inline, < 10ms)

Runs synchronously after every AI response. Must be fast — no LLM calls.

Rules-based extraction:

IF message contains "I prefer" / "I like" / "I want" / "don't"
  → Extract preference memory
 
IF message contains "we are" / "our business" / "we sell" / "located in"
  → Extract fact memory
 
IF agent made a tool call
  → Extract episode memory (what was done + why)

Example:

User: "I don't want posts longer than 800 words"

Extracted memory:

{
  "type": "preference",
  "content": "Prefers blog posts under 800 words",
  "entity_ids": ["blog", "posts"],
  "importance": 0.8,
  "source_type": "conversation"
}
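
A minimal sketch of the rules-based extractor, assuming plain substring matching on the lowercased message. The function name, the tool-call fields (`name`, `reason`), and the choice to store the raw message as `content` are assumptions; the rules above do not say how the paraphrased content, `entity_ids`, or `importance` in the example are produced, so the sketch leaves them out.

```python
PREFERENCE_CUES = ("i prefer", "i like", "i want", "don't")
FACT_CUES = ("we are", "our business", "we sell", "located in")


def extract_memories(user_message, tool_calls):
    """Apply the three inline rules; no LLM calls, only string checks."""
    text = user_message.lower().replace("\u2019", "'")  # normalize curly apostrophes
    memories = []
    if any(cue in text for cue in PREFERENCE_CUES):
        memories.append({"type": "preference", "content": user_message})
    if any(cue in text for cue in FACT_CUES):
        memories.append({"type": "fact", "content": user_message})
    for call in tool_calls:
        # One episode per tool call: what was done and why.
        memories.append({
            "type": "episode",
            "content": f"{call['name']}: {call['reason']}",
        })
    return memories


extracted = extract_memories("I don't want posts longer than 800 words", [])
```

Substring cues such as "don't" are broad and will also match messages that state no preference, so a production version would likely need tighter patterns.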

4.3 Deferred Consolidation (CF Queue, ~5min delay)

After a conversation turn, a consolidation job is queued:

{
  "type": "consolidate",
  "tenant_id": "tnt_123",
  "conversation_id": "conv_abc",
  "tasks": ["merge_duplicates", "extract_entities", "compute_importance"]
}

Task: Merge Duplicates

If the new memory is semantically similar (cosine > 0.92) to an existing memory of the same type and tenant, merge instead of creating a duplicate:

Existing: "User prefers short blog posts"
New:      "Prefers blog posts under 800 words"
Similarity: 0.94
 
→ Merge: "Prefers blog posts under 800 words" (more specific wins)
→ Importance: min(max(existing, new) + 0.05, 1.0) (reinforced preference, capped at the 1.0 maximum)
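
The merge decision can be sketched as below. Several parts are assumptions: the dict field names, using the newer text as a stand-in for "more specific wins" (the spec does not say how specificity is judged), and capping the result at 1.0 to stay inside the 0.0-1.0 importance range.

```python
import math

MERGE_THRESHOLD = 0.92   # cosine threshold from the spec
REINFORCE_BOOST = 0.05   # boost for a reinforced preference


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_or_insert(existing, new):
    """Return ("merge", merged_memory) or ("insert", new)."""
    same_scope = (existing["type"] == new["type"]
                  and existing["tenant_id"] == new["tenant_id"])
    if same_scope and cosine(existing["embedding"], new["embedding"]) > MERGE_THRESHOLD:
        merged = dict(existing)
        merged["content"] = new["content"]      # assumption: newer text is the more specific one
        merged["embedding"] = new["embedding"]
        boosted = max(existing["importance"], new["importance"]) + REINFORCE_BOOST
        merged["importance"] = min(boosted, 1.0)  # assumption: cap at the top of the range
        return "merge", merged
    return "insert", new


# Illustrative two-dimensional embeddings; real embeddings are much larger.
existing = {"type": "preference", "tenant_id": "tnt_123", "importance": 0.7,
            "content": "User prefers short blog posts", "embedding": [1.0, 0.0]}
new = {"type": "preference", "tenant_id": "tnt_123", "importance": 0.8,
       "content": "Prefers blog posts under 800 words", "embedding": [0.95, 0.2]}
action, result = merge_or_insert(existing, new)
```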

Task: Extract Entities

Scan conversation for entity references. Create MemoryEntity records if they don't exist. Link new memories to recognized entities.

"My Diwali Sweet Recipes post got 340 views"
 
Entities extracted:
  - "diwali-sweet-recipes" (type: content)
  - "blog" (type: service)
  - "views" (type: metric)

Task: Compute Importance

If this memory relates to a previous decision with a positive outcome, boost importance:

Memory: "Recipe posts get 2.8x more views"
Related decision: Suggested recipe post → got 340 views (positive)
 
→ importance += 0.1 (outcome-boosted, capped at the 1.0 maximum)

4.4 Periodic Consolidation (Cron, Every 6 Hours)

Task 1: Decay Unaccessed Memories

UPDATE ai_memory
SET importance = GREATEST(importance - decay_rate, 0.0)
WHERE tenant_id = $1
  AND accessed_at < NOW() - INTERVAL '7 days'
  AND importance > 0.1;

Task 2: Compress Episodes → Patterns

When 5+ episodes share a common theme, compress into a pattern:

Episodes:
  - "Suggested recipe post → 340 views"
  - "Suggested recipe post → 290 views"
  - "Suggested recipe post → 280 views"
  - "Suggested news post → 95 views"
  - "Suggested news post → 110 views"
 
Pattern extracted:
  "Recipe posts consistently outperform news posts by 2.8x for this tenant"
  (type: pattern, importance: 0.85)

The original episodes remain in PostgreSQL (episodic memory) but are not loaded into agent context — the pattern replaces them.

Task 3: Prune Low-Importance

DELETE FROM ai_memory
WHERE tenant_id = $1
  AND importance < 0.1
  AND accessed_at < NOW() - INTERVAL '30 days';

Memories pruned here are truly forgotten: their importance is below 0.1 and nothing has accessed them for 30 days.
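
To show how Task 1 and Task 3 interact, here is an in-memory sketch that mirrors the two SQL statements. The spec does not define `decay_rate`, so it is passed in as a parameter, and the 0.05 used in the example is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

PRUNE_THRESHOLD = 0.1


def decay_and_prune(memories, now, decay_rate):
    """Apply the decay rule, then the prune rule; return the surviving memories."""
    survivors = []
    for m in memories:
        idle = now - m["accessed_at"]
        importance = m["importance"]
        # Task 1: decay memories not accessed for more than 7 days.
        if idle > timedelta(days=7) and importance > PRUNE_THRESHOLD:
            importance = max(importance - decay_rate, 0.0)
        # Task 3: prune low-importance memories not accessed for more than 30 days.
        if importance < PRUNE_THRESHOLD and idle > timedelta(days=30):
            continue
        survivors.append({**m, "importance": importance})
    return survivors


# Illustrative rows, not real data.
now = datetime(2026, 4, 3, tzinfo=timezone.utc)
memories = [
    {"id": "a", "importance": 0.12, "accessed_at": now - timedelta(days=40)},
    {"id": "b", "importance": 0.50, "accessed_at": now - timedelta(days=10)},
    {"id": "c", "importance": 0.90, "accessed_at": now - timedelta(days=1)},
]
kept = decay_and_prune(memories, now, decay_rate=0.05)
```

In this example memory "a" decays below the threshold and is pruned in the same pass, "b" decays but survives, and "c" is untouched because it was accessed recently.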

5. Working Memory — The Session Layer

5.1 CF KV Structure

Key: brain:session:{tenant_id}:{user_id}
TTL: 24 hours (auto-expire)
 
Value: {
  "conversation_id": "conv_abc123",
  "turns": [
    { "role": "user", "content": "...", "ts": 1712150000 },
    { "role": "assistant", "content": "...", "ts": 1712150003 }
  ],
  "active_preferences": [
    "blog-only focus",
    "prefers short posts"
  ],
  "pending_confirmations": [],
  "tool_results_buffer": []
}

5.2 Sliding Window Protocol

Working memory holds the last 20 conversation turns. When a new turn arrives:

Append new turn to turns array
If turns.length > 20, remove the oldest turn other than the first
The first turn is ALWAYS preserved (contains initial context/greeting)
Write back to KV

Why 20 turns? At ~200 tokens per turn, 20 turns = ~4000 tokens of raw history. That is double the 2000-token conversation history budget, so the window only fits after older turns are summarized (the last 4 turns stay verbatim).
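
A minimal sketch of the window update, reading the rules together so that the evicted turn is the oldest one other than the preserved first turn. The function name is hypothetical, and the KV read and write around it are omitted.

```python
MAX_TURNS = 20


def append_turn(turns, new_turn, max_turns=MAX_TURNS):
    """Return a new list with new_turn appended and the window trimmed to max_turns."""
    updated = turns + [new_turn]
    while len(updated) > max_turns:
        del updated[1]  # index 0 is the preserved first turn, so evict the next-oldest
    return updated


# Illustrative session with a full window.
turns = [{"role": "user", "content": f"t{i}"} for i in range(20)]
window = append_turn(turns, {"role": "user", "content": "t20"})
```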

5.3 Flush to PostgreSQL

On session idle (5 minutes of no activity) or close:

All turns in KV → bulk insert to messages table
Extract immediate memories from conversation
Queue deferred consolidation
Clear KV entry (or let TTL expire)

6. Constructing the LLM Payload

When the agent is ready to call the LLM, the context assembler builds the final payload:

Step 1: System Prompt (Fixed, ~500 tokens)

You are an AI copilot for LogicSpike, a business platform. You help
business owners manage their blog, content, and customer engagement.
 
RULES:
- Always be specific. Use the tenant's actual data, not generic advice.
- When suggesting actions, explain WHY based on their data.
- If you're not sure, say so. Never fabricate metrics.
- For destructive actions (delete, bulk update), always ask for confirmation.
- Respond in the same language the user writes in.

Step 2: Tenant Profile (~200 tokens)

TENANT CONTEXT:
- Business: Mumbai Bakery (food & bakery)
- Plan: Pro (4658 AI interactions remaining this month)
- Blog: 18 posts, 3 categories (recipes, news, tutorials)
- Top content: recipe posts (2.8x avg engagement)
- Content Engine: not connected
- Chat Engine: active, 45 conversations this week

Step 3: Retrieved Memories (~1500 tokens max)

MEMORIES (what you've learned about this business):
- [preference] Focus on blog only, not social media (importance: 0.82)
- [preference] Prefers posts under 800 words (importance: 0.78)
- [pattern] Recipe posts get 2.8x more views than news posts (importance: 0.85)
- [pattern] Monday posts get 40% more engagement than Friday (importance: 0.71)
- [fact] Business is a bakery in Mumbai, owner is Priya (importance: 0.90)
- [episode] Last week: suggested Diwali post → 340 views, 3x average (importance: 0.75)

Step 4: Service Data (~2000 tokens max, only relevant services)

CURRENT DATA (blog analytics, last 7 days):
- Total views: 315 (previous week: 840, change: -62.5%)
- Posts published this week: 0 (previous: 3)
- Top post: "Chocolate Truffle Recipe" — 85 views, declining 8%/day
- Category breakdown: recipes 70%, tutorials 20%, news 10%

Step 5: Conversation History (~2000 tokens max)

CONVERSATION:
User: My blog traffic dropped a lot this week. What happened?

Final Token Budget Check

System prompt:        480 / 500  ✅
Tenant profile:       185 / 200  ✅
Memories (6 items):  1120 / 1500 ✅
Service data:        1650 / 2000 ✅
Conversation:          45 / 2000 ✅
Tools:               1200 / 1500 ✅
Safety buffer:        320 / 300  ⚠️ (slightly over, drop lowest memory)
─────────────────────────────────
Total:               5000 / 8000 ✅

If any section exceeds its budget:

Memories: drop lowest-scored items
Service data: filter to only intent-relevant fields
History: summarize older turns, keep last 4 verbatim
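
The budget check and the memory-trimming rule can be sketched as follows. The section keys and function names are hypothetical; the limits are the per-section budgets from the table above. Trimming service data and summarizing history are not shown.

```python
BUDGETS = {
    "system_prompt": 500,
    "tenant_profile": 200,
    "memories": 1500,
    "service_data": 2000,
    "conversation": 2000,
    "tools": 1500,
    "safety_buffer": 300,
}


def over_budget(usage):
    """Return {section: tokens_over} for every section that exceeds its budget."""
    return {name: used - BUDGETS[name]
            for name, used in usage.items() if used > BUDGETS[name]}


def trim_memories(memories, budget=BUDGETS["memories"]):
    """Drop the lowest-scored memories until the section fits its budget."""
    kept = sorted(memories, key=lambda m: m["score"], reverse=True)
    while kept and sum(m["tokens"] for m in kept) > budget:
        kept.pop()  # list is sorted best-first, so pop() removes the lowest score
    return kept


# Usage figures from the budget check above.
usage = {"system_prompt": 480, "tenant_profile": 185, "memories": 1120,
         "service_data": 1650, "conversation": 45, "tools": 1200,
         "safety_buffer": 320}
violations = over_budget(usage)
```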

7. Memory Maturity Timeline

Time	Memory State	AI Behavior
Day 1	0 memories	Generic responses, asks clarifying questions
Week 1	5–10 facts + preferences	Knows the business, respects stated preferences
Month 1	20–30 memories, first patterns emerging	Starts making data-backed suggestions
Month 3	50+ memories, strong patterns	Personalized advice, knows what works for this business
Month 6	80+ memories, consolidated patterns	Proactive recommendations, anticipates needs
Year 1	100+ high-quality memories (noise pruned)	Deep business understanding, acts as a trusted advisor

The key insight: memory count doesn't grow linearly. Consolidation compresses episodes into patterns and prunes noise. A 1-year-old tenant has ~100 high-quality memories, not 10,000 raw episodes.