Last Updated: 2026-04-03
Status: Draft
1. The Challenge
- LLMs are stateless. Every request starts from zero. Without a memory system, the AI asks "What's your business?" in every conversation, even after 6 months of use.
- Flat vector search is naive. Cosine similarity alone has no sense of time, importance, or relationships. It can rank a 3-month-old stale fact above yesterday's correction.
- Pulling full context is wasteful. Loading blog stats, chat metrics, content calendar, billing info, and 50 memories for a simple "change my blog title" query burns tokens and confuses the LLM.
- Nightly batch consolidation is too slow. If a user teaches the AI something at 9am, waiting until midnight to "learn" it means 15 hours of suboptimal responses.
- Memory without evaluation is just storage. Without tracking whether memories actually improve responses, the system accumulates noise instead of knowledge.
2. The Tri-Layer Memory Architecture
| Layer | Technology | Purpose | Speed | TTL |
|---|---|---|---|---|
| Working Memory | Cloudflare KV | Active conversation turns, session preferences, in-flight tool results | < 5ms | Session (flush on idle/close) |
| Episodic Memory | PostgreSQL | Full conversation history, decisions, outcomes — the audit trail | < 20ms | Permanent |
| Semantic Memory | pgvector on Neon | Learned knowledge as embeddings — facts, preferences, patterns, entities | < 30ms | Importance-gated (pruned when < 0.1) |
3. Multi-Signal Memory Retrieval
3.1 The Problem with Pure Vector Search
Standard approach: embed query → cosine similarity → top K results.
This fails because:
- A 3-month-old memory about "traffic was 500" scores high on similarity but is stale (traffic is now 800)
- A rarely-accessed memory ranks the same as a frequently-used one
- No awareness of which entity the user is asking about
3.2 The Scoring Formula
Every memory candidate receives a composite retrieval score:
RetrievalScore = (similarity × 0.40)
               + (recency × 0.25)
               + (importance × 0.20)
               + (access_frequency × 0.10)
               + (entity_match × 0.05)

| Signal | Weight | Computation |
|---|---|---|
| similarity | 0.40 | Cosine similarity between query embedding and memory embedding |
| recency | 0.25 | 1 - (days_since_access / 365), clamped to [0, 1] |
| importance | 0.20 | Memory's current importance score (0.0–1.0) |
| access_frequency | 0.10 | min(access_count / 20, 1.0), normalized to [0, 1] |
| entity_match | 0.05 | 1.0 if memory's entity_ids overlap with query entities, else 0.0 |
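As a rough illustration, the formula and signal table above can be written as a small scoring function. This is a sketch, not the production implementation: the `Candidate` class, its field names, and the sample values are assumptions made for the example; only the weights and the per-signal computations come from the table.

```python
from dataclasses import dataclass, field

# Weights copied from the scoring formula; they sum to 1.0.
WEIGHTS = {
    "similarity": 0.40,
    "recency": 0.25,
    "importance": 0.20,
    "access_frequency": 0.10,
    "entity_match": 0.05,
}


@dataclass
class Candidate:
    """Hypothetical shape of one row returned by the vector query."""
    similarity: float           # cosine similarity to the query, 0.0-1.0
    days_since_access: float
    importance: float           # 0.0-1.0
    access_count: int
    entity_ids: set = field(default_factory=set)


def retrieval_score(c: Candidate, query_entities: set) -> float:
    """Weighted sum of the five signals, each normalized to [0, 1]."""
    signals = {
        "similarity": c.similarity,
        "recency": max(0.0, min(1.0, 1.0 - c.days_since_access / 365.0)),
        "importance": c.importance,
        "access_frequency": min(c.access_count / 20.0, 1.0),
        "entity_match": 1.0 if c.entity_ids & query_entities else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


# Illustrative values only: a stale memory with slightly higher similarity
# versus a fresh correction of the same fact.
stale = Candidate(0.90, 90, 0.5, 2, {"traffic"})
fresh = Candidate(0.85, 1, 0.5, 2, {"traffic"})
query_entities = {"blog", "traffic"}
print(retrieval_score(stale, query_entities), retrieval_score(fresh, query_entities))
```

With these sample values the fresh memory scores about 0.75 against about 0.71 for the stale one, so the recency term is enough to outweigh a small similarity advantage.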
3.3 Retrieval Process
Step 1: Embed query using text-embedding-3-small
Step 2: pgvector query → top 30 by cosine similarity (tenant-scoped)
Step 3: Compute RetrievalScore for each candidate
Step 4: Re-rank by RetrievalScore
Step 5: Return top 10 (or fewer, based on token budget)
SQL (Step 2):
SELECT id, content, memory_type, importance, access_count,
entity_ids, accessed_at, created_at,
1 - (embedding <=> $1) AS similarity
FROM ai_memory
WHERE tenant_id = $2
AND (expires_at IS NULL OR expires_at > NOW())
AND importance >= 0.1
ORDER BY embedding <=> $1
LIMIT 30;
The remaining scoring (recency, frequency, entity match) is computed in application code after the initial vector retrieval.
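Steps 4 and 5 could look roughly like the following once each candidate has a RetrievalScore. This is a sketch under assumptions: the dict keys `id`, `score`, and `tokens` are hypothetical, the per-memory token estimate is assumed to be computed elsewhere, and the 1500-token default is borrowed from the memory budget in section 6.

```python
def select_memories(scored, max_items=10, token_budget=1500):
    """Re-rank candidates by score and keep those that fit the token budget.

    `scored` is a list of dicts with hypothetical keys "id", "score"
    (the RetrievalScore) and "tokens" (estimated prompt cost).
    """
    ranked = sorted(scored, key=lambda m: m["score"], reverse=True)
    selected, used = [], 0
    for memory in ranked:
        if len(selected) >= max_items:
            break
        if used + memory["tokens"] > token_budget:
            continue  # too large for the remaining budget; try smaller ones
        selected.append(memory)
        used += memory["tokens"]
    return selected


candidates = [
    {"id": "a", "score": 0.9, "tokens": 1000},
    {"id": "b", "score": 0.8, "tokens": 600},
    {"id": "c", "score": 0.7, "tokens": 400},
]
print([m["id"] for m in select_memories(candidates)])
```

Here `b` is skipped because it would push the total to 1600 tokens, while the smaller `c` still fits. Skipping instead of stopping at the first oversized item is a design choice of this sketch, not something the document specifies.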
3.4 Entity-Scoped Retrieval
When the intent classifier identifies specific entities (e.g., ["blog", "traffic"]), the retrieval adds an entity filter:
-- Boost memories linked to relevant entities
AND entity_ids && ARRAY['blog', 'traffic']
This entity-filtered query is combined with the vector search using a UNION. Entity-matched memories get the 0.05 entity_match bonus; others don't. This ensures entity-linked memories surface even if their embedding similarity is slightly lower.
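In application code, the effect of that UNION can be approximated by merging the two result sets and de-duplicating by id. The sketch below assumes each row is a dict with `id` and `entity_ids` keys; the function name and row shape are hypothetical.

```python
def union_candidates(vector_hits, entity_hits, query_entities):
    """Merge both result sets, keep one row per id, and flag entity overlap."""
    wanted = set(query_entities)
    merged = {}
    for row in [*vector_hits, *entity_hits]:
        if row["id"] in merged:
            continue  # already seen in the other result set
        overlap = bool(set(row["entity_ids"]) & wanted)
        merged[row["id"]] = {**row, "entity_match": 1.0 if overlap else 0.0}
    return list(merged.values())


vector_hits = [
    {"id": 1, "entity_ids": ["blog"]},
    {"id": 2, "entity_ids": ["billing"]},
]
entity_hits = [
    {"id": 1, "entity_ids": ["blog"]},
    {"id": 3, "entity_ids": ["traffic"]},
]
merged = union_candidates(vector_hits, entity_hits, ["blog", "traffic"])
print([(m["id"], m["entity_match"]) for m in merged])
```

Memory 3 never appeared in the vector results but still reaches the scoring step, which is the behavior the paragraph above describes.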
4. Memory Lifecycle
4.1 Creation — Where Memories Come From
| Source | Memory Types Created | Timing |
|---|---|---|
| User says something | fact, preference | Immediate (inline) |
| Agent makes a decision | episode | Immediate (inline) |
| Outcome is tracked | pattern (if repeated) | Deferred (via queue) |
| Consolidation compresses episodes | pattern | Periodic (6hr cron) |
| Entity extraction from conversations | entity | Deferred (via queue) |
| Insight engine detects trend | fact, pattern | Periodic (6hr cron) |
4.2 Immediate Extraction (Inline, < 10ms)
Runs synchronously after every AI response. Must be fast — no LLM calls.
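A minimal sketch of the keyword rules listed in this section, assuming plain substring matching on the lowercased message. The function name, the returned dict shape, and the `tool_calls` format are hypothetical. Unlike the worked example in this section, the sketch stores the raw message as the memory content and does not assign importance or entity_ids.

```python
PREFERENCE_CUES = ("i prefer", "i like", "i want", "don't")
FACT_CUES = ("we are", "our business", "we sell", "located in")


def extract_immediate(message, tool_calls=()):
    """Return candidate memories for one turn without calling an LLM."""
    text = message.lower()
    memories = []
    if any(cue in text for cue in PREFERENCE_CUES):
        memories.append(
            {"type": "preference", "content": message, "source_type": "conversation"}
        )
    if any(cue in text for cue in FACT_CUES):
        memories.append(
            {"type": "fact", "content": message, "source_type": "conversation"}
        )
    for call in tool_calls:
        # Hypothetical tool-call shape: {"name": ..., "reason": ...}
        memories.append(
            {
                "type": "episode",
                "content": f"{call['name']}: {call['reason']}",
                "source_type": "conversation",
            }
        )
    return memories


print(extract_immediate("I don't want posts longer than 800 words"))
```

Substring matching keeps the step cheap but is deliberately crude: a cue such as "don't" also fires on "I don't know", and a typographic apostrophe would not match the straight one used here.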
Rules-based extraction:
IF message contains "I prefer" / "I like" / "I want" / "don't"
→ Extract preference memory
IF message contains "we are" / "our business" / "we sell" / "located in"
→ Extract fact memory
IF agent made a tool call
→ Extract episode memory (what was done + why)
Example:
User: "I don't want posts longer than 800 words"
Extracted memory:
{
"type": "preference",
"content": "Prefers blog posts under 800 words",
"entity_ids": ["blog", "posts"],
"importance": 0.8,
"source_type": "conversation"
}
4.3 Deferred Consolidation (CF Queue, ~5min delay)
After a conversation turn, a consolidation job is queued:
{
"type": "consolidate",
"tenant_id": "tnt_123",
"conversation_id": "conv_abc",
"tasks": ["merge_duplicates", "extract_entities", "compute_importance"]
}
Task: Merge Duplicates
If the new memory is semantically similar (cosine > 0.92) to an existing memory of the same type and tenant, merge instead of creating a duplicate:
Existing: "User prefers short blog posts"
New: "Prefers blog posts under 800 words"
Similarity: 0.94
→ Merge: "Prefers blog posts under 800 words" (more specific wins)
→ Importance: max(existing, new) + 0.05 boost (reinforced preference)
Task: Extract Entities
Scan conversation for entity references. Create MemoryEntity records if they don't exist. Link new memories to recognized entities.
"My Diwali Sweet Recipes post got 340 views"
Entities extracted:
- "diwali-sweet-recipes" (type: content)
- "blog" (type: service)
- "views" (type: metric)
Task: Compute Importance
If this memory relates to a previous decision with a positive outcome, boost importance:
Memory: "Recipe posts get 2.8x more views"
Related decision: Suggested recipe post → got 340 views (positive)
→ importance += 0.1 (outcome-boosted)
4.4 Periodic Consolidation (Cron, Every 6 Hours)
Task 1: Decay Unaccessed Memories
UPDATE ai_memory
SET importance = GREATEST(importance - decay_rate, 0.0)
WHERE tenant_id = $1
AND accessed_at < NOW() - INTERVAL '7 days'
AND importance > 0.1;
Task 2: Compress Episodes → Patterns
When 5+ episodes share a common theme, compress into a pattern:
Episodes:
- "Suggested recipe post → 340 views"
- "Suggested recipe post → 290 views"
- "Suggested recipe post → 280 views"
- "Suggested news post → 95 views"
- "Suggested news post → 110 views"
Pattern extracted:
"Recipe posts consistently outperform news posts by 2.8x for this tenant"
(type: pattern, importance: 0.85)
The original episodes remain in PostgreSQL (episodic memory) but are not loaded into agent context; the pattern replaces them.
Task 3: Prune Low-Importance
DELETE FROM ai_memory
WHERE tenant_id = $1
AND importance < 0.1
AND accessed_at < NOW() - INTERVAL '30 days';
Memories pruned here are truly forgotten: they were never important enough and nobody accessed them.
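The importance adjustments described in sections 4.3 and 4.4 (merge reinforcement, outcome boost, decay, and pruning) can be sketched as plain functions. The thresholds and increments come from the text. The function names, the clamp to 1.0, and any `decay_rate` value passed in are assumptions, since the document does not define them.

```python
MERGE_THRESHOLD = 0.92   # cosine similarity above which memories merge
PRUNE_FLOOR = 0.1        # importance below which a memory can be pruned


def clamp01(value):
    return max(0.0, min(1.0, value))


def should_merge(similarity, same_type, same_tenant):
    return similarity > MERGE_THRESHOLD and same_type and same_tenant


def merged_importance(existing, new):
    # max(existing, new) + 0.05 reinforcement; the clamp is an assumption.
    return clamp01(max(existing, new) + 0.05)


def outcome_boost(importance):
    # +0.1 when a related decision had a positive outcome.
    return clamp01(importance + 0.1)


def decay(importance, days_since_access, decay_rate):
    # Mirrors the UPDATE in Task 1: only unaccessed memories above the floor.
    if days_since_access > 7 and importance > PRUNE_FLOOR:
        return max(importance - decay_rate, 0.0)
    return importance


def should_prune(importance, days_since_access):
    # Mirrors the DELETE in Task 3.
    return importance < PRUNE_FLOOR and days_since_access > 30


print(should_merge(0.94, True, True), merged_importance(0.8, 0.7))
```

One detail that falls out of the conditions as written: a memory sitting at exactly 0.1 is neither decayed (that requires importance > 0.1) nor pruned (that requires importance < 0.1).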
5. Working Memory — The Session Layer
5.1 CF KV Structure
Key: brain:session:{tenant_id}:{user_id}
TTL: 24 hours (auto-expire)
Value: {
"conversation_id": "conv_abc123",
"turns": [
{ "role": "user", "content": "...", "ts": 1712150000 },
{ "role": "assistant", "content": "...", "ts": 1712150003 }
],
"active_preferences": [
"blog-only focus",
"prefers short posts"
],
"pending_confirmations": [],
"tool_results_buffer": []
}
5.2 Sliding Window Protocol
Working memory holds the last 20 conversation turns. When a new turn arrives:
- Append new turn to the turns array
- If turns.length > 20, remove the oldest turn after the first
- The first turn is ALWAYS preserved (contains initial context/greeting)
- Write back to KV
Why 20 turns? At ~200 tokens per turn, 20 turns = ~4000 tokens of history. That is about twice the 2000-token conversation history budget, so the window fits only after older turns are summarized.
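The window update above can be sketched as a pure function over the turns list. This assumes the pinned first turn sits at index 0 and that eviction removes the oldest turn after it; the function name is hypothetical, and reading from and writing back to KV is left out.

```python
MAX_TURNS = 20


def append_turn(turns, new_turn, max_turns=MAX_TURNS):
    """Return a new list with `new_turn` added and the window enforced."""
    updated = [*turns, new_turn]
    while len(updated) > max_turns:
        del updated[1]  # index 0 is the pinned first turn
    return updated


history = [{"role": "user", "content": str(i)} for i in range(20)]
history = append_turn(history, {"role": "user", "content": "20"})
print(len(history), history[0]["content"], history[1]["content"])
```

After the call the list still holds 20 turns: turn "0" is kept, turn "1" is evicted, and turn "20" is at the end.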
5.3 Flush to PostgreSQL
On session idle (5 minutes of no activity) or close:
- All turns in KV → bulk insert to the messages table
- Extract immediate memories from conversation
- Queue deferred consolidation
- Clear KV entry (or let TTL expire)
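The flush sequence can be sketched with the storage, extraction, and queue operations passed in as callables. All four callables and the function names are hypothetical stand-ins; the job payload mirrors the consolidation job shown in section 4.3.

```python
IDLE_SECONDS = 5 * 60  # idle threshold from the text above


def is_idle(last_activity_ts, now_ts):
    return now_ts - last_activity_ts >= IDLE_SECONDS


def flush_session(tenant_id, session, insert_messages, extract_memories,
                  enqueue, delete_session):
    """Run the four flush steps in order and return the extracted memories."""
    # 1. Persist all buffered turns to the messages table.
    insert_messages(session["conversation_id"], session["turns"])
    # 2. Inline, rules-based extraction.
    memories = extract_memories(session["turns"])
    # 3. Hand the slower work to the consolidation queue.
    enqueue({
        "type": "consolidate",
        "tenant_id": tenant_id,
        "conversation_id": session["conversation_id"],
        "tasks": ["merge_duplicates", "extract_entities", "compute_importance"],
    })
    # 4. Clear the KV entry (skipping this relies on the TTL instead).
    delete_session()
    return memories
```

Deleting the KV entry last means a failure in an earlier step leaves the session in place for a retry, although that ordering is a choice made for this sketch.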
6. Constructing the LLM Payload
When the agent is ready to call the LLM, the context assembler builds the final payload:
Step 1: System Prompt (Fixed, ~500 tokens)
You are an AI copilot for LogicSpike, a business platform. You help
business owners manage their blog, content, and customer engagement.
RULES:
- Always be specific. Use the tenant's actual data, not generic advice.
- When suggesting actions, explain WHY based on their data.
- If you're not sure, say so. Never fabricate metrics.
- For destructive actions (delete, bulk update), always ask for confirmation.
- Respond in the same language the user writes in.
Step 2: Tenant Profile (~200 tokens)
TENANT CONTEXT:
- Business: Mumbai Bakery (food & bakery)
- Plan: Pro (4658 AI interactions remaining this month)
- Blog: 18 posts, 3 categories (recipes, news, tutorials)
- Top content: recipe posts (2.8x avg engagement)
- Content Engine: not connected
- Chat Engine: active, 45 conversations this week
Step 3: Retrieved Memories (~1500 tokens max)
MEMORIES (what you've learned about this business):
- [preference] Focus on blog only, not social media (importance: 0.82)
- [preference] Prefers posts under 800 words (importance: 0.78)
- [pattern] Recipe posts get 2.8x more views than news posts (importance: 0.85)
- [pattern] Monday posts get 40% more engagement than Friday (importance: 0.71)
- [fact] Business is a bakery in Mumbai, owner is Priya (importance: 0.90)
- [episode] Last week: suggested Diwali post → 340 views, 3x average (importance: 0.75)
Step 4: Service Data (~2000 tokens max, only relevant services)
CURRENT DATA (blog analytics, last 7 days):
- Total views: 315 (previous week: 840, change: -62.5%)
- Posts published this week: 0 (previous: 3)
- Top post: "Chocolate Truffle Recipe" — 85 views, declining 8%/day
- Category breakdown: recipes 70%, tutorials 20%, news 10%
Step 5: Conversation History (~2000 tokens max)
CONVERSATION:
User: My blog traffic dropped a lot this week. What happened?
Final Token Budget Check
System prompt: 480 / 500 ✅
Tenant profile: 185 / 200 ✅
Memories (6 items): 1120 / 1500 ✅
Service data: 1650 / 2000 ✅
Conversation: 45 / 2000 ✅
Tools: 1200 / 1500 ✅
Safety buffer: 320 / 300 ⚠️ (slightly over, drop lowest memory)
─────────────────────────────────
Total: 4680 / 8000 ✅
If any section exceeds its budget:
- Memories: drop lowest-scored items
- Service data: filter to only intent-relevant fields
- History: summarize older turns, keep last 4 verbatim
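A small sketch of the budget check and the history fallback. The section budgets and usage figures are taken from the table above. The names are hypothetical, and treating the 300-token safety buffer as reserved headroom on top of the six section budgets is an assumption of this sketch.

```python
SECTION_BUDGETS = {
    "system_prompt": 500,
    "tenant_profile": 200,
    "memories": 1500,
    "service_data": 2000,
    "conversation": 2000,
    "tools": 1500,
}
SAFETY_BUFFER = 300   # assumed to be reserved headroom, not a section
TOTAL_BUDGET = 8000


def over_budget(usage):
    """Return the names of sections whose usage exceeds their budget."""
    return [name for name, used in usage.items() if used > SECTION_BUDGETS[name]]


def split_history(turns, keep_last=4):
    """Split turns into (older turns to summarize, recent turns kept verbatim)."""
    cut = max(len(turns) - keep_last, 0)
    return list(turns[:cut]), list(turns[cut:])


usage = {
    "system_prompt": 480,
    "tenant_profile": 185,
    "memories": 1120,
    "service_data": 1650,
    "conversation": 45,
    "tools": 1200,
}
print(over_budget(usage), sum(usage.values()))
```

For the figures in the table, no section is over budget and the six sections add up to the 4680 tokens shown in the total row.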
7. Memory Maturity Timeline
| Time | Memory State | AI Behavior |
|---|---|---|
| Day 1 | 0 memories | Generic responses, asks clarifying questions |
| Week 1 | 5–10 facts + preferences | Knows the business, respects stated preferences |
| Month 1 | 20–30 memories, first patterns emerging | Starts making data-backed suggestions |
| Month 3 | 50+ memories, strong patterns | Personalized advice, knows what works for this business |
| Month 6 | 80+ memories, consolidated patterns | Proactive recommendations, anticipates needs |
| Year 1 | 100+ high-quality memories (noise pruned) | Deep business understanding, acts as a trusted advisor |
The key insight: memory count doesn't grow linearly. Consolidation compresses episodes into patterns and prunes noise. A 1-year-old tenant has ~100 high-quality memories, not 10,000 raw episodes.