Normal Memory: Because Forgetting Is a Feature for Humans.

Let’s start with where normal-memory came from. A couple of months back I was on a call with a bunch of builder friends, all of us discussing AI tools, agents, and memory. This was right around the time SuperMemory founder Dhravya Shah announced raising capital, and we finally agreed on one simple truth:
“AI is a stateless piece of shit.”
No, seriously. Every single LLM we love is fundamentally brain-damaged:
It forgets everything the moment you close the tab.
It only “remembers” whatever you managed to cram into the last 8k–128k tokens.
Cross the context limit, and poof, your AI now has dementia.
Want it to remember your name, your dog’s birthday, or that you’re allergic to coriander? Good luck manually stuffing that into every single prompt forever.
At one point, we were all laughing because we’ve all built the same ugly hacks:
shoving JSON blobs into system prompts
praying the model reads the 69th message in the history
watching it confidently hallucinate that you live in Antarctica because that one fact fell off the context cliff.
But the real question here is, how do you actually fix it? How do you make an AI remember everything, forever, without ever hitting the context window wall? How do companies like Mem0 and SuperMemory even pull this off?
That question has been stuck in my head ever since, and it’s exactly what this post is about. No AI slop, I promise. I tried building this three times; this is my fourth attempt, and I’d say I finally managed to build something.
I’m going to rip open the memory layer I built, no fluff, no marketing, just the raw architecture, code, queues, vector DBs, background workers, and all the dirty little LLM calls that make it work.
By the end of this, you’ll know exactly how to stop your AI from being a stateless piece of shit, and turn it into something that actually remembers you, updates when you change your mind, and can answer “What do you know about me?” six months later like it was yesterday.
Section 2: First, Let’s Agree on What Proper Memory Actually Means
Before we dive into queues, Pinecone upserts, and the 69th LLM call that finally stops your AI from gaslighting you, let’s get brutally clear on the bar we’re trying to beat.
Most people say “I want memory” and think that means “dump the last 20 messages into the prompt.” That’s not memory. That’s a slightly longer context window. Still stateless slop.
Real long-term memory has to satisfy three non-negotiable rules. If it breaks even one, it’s fake:
- Infinite retention: once you tell it something, it remembers forever (or until you explicitly contradict it).
- Zero manual stuffing: you never again write “You are a helpful assistant who knows the user is vegan and lives in Berlin…” in the system prompt. Ever.
- Automatic conflict resolution: you say “I became a vegetarian” in January → “I quit vegetarianism” in March → the AI must notice and update/delete the old fact.
Part 1: The Split-Second Lie We Tell Users
When you type a message and hit enter, you want two things at the exact same time:
An instant reply (like ChatGPT – no lag allowed)
The AI to perfectly understand and remember everything you just said forever
If we wait to extract facts, resolve contradictions, update summaries, and upsert vectors before sending the reply, the user stares at a spinner for 4–12 seconds. Feels slow and sloppy.
If we reply instantly and do the memory work later, we’re lying to the user for a split second. We chose to lie for 300 ms and tell the truth forever after.
Synchronous (the part the user actually sees – sub-500 ms)
User → POST /chat →
1. Save user message instantly
2. Grab summary + last 20 messages
3. One fast LLM call → generate reply
4. Save assistant reply
5. Return reply to user
→ User feels zero lag
Asynchronous (the real brain – happens in the background)
→ Fire and forget a job into Redis/BullMQ
→ Memory worker wakes up
→ Extracts facts, resolves contradictions, updates Pinecone, maybe triggers summary refresh
→ All the slow, expensive, correct stuff
This is the first and most important design pattern. Everything else in this blog is just details on how to make the asynchronous half not screw things up.
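To make the split concrete, here is a minimal sketch of the synchronous half. It is not the actual endpoint code: `deps` and every helper on it (`saveMessage`, `getSummary`, `getRecentMessages`, `generateReply`, `enqueueMemoryJob`) are hypothetical names I am assuming for illustration, injected so the flow can run without a real database, LLM, or Redis.

```javascript
// Sketch only: `deps` is a hypothetical bundle of storage, LLM and queue helpers.
async function handleChat(deps, conversationId, text) {
  // 1. Save the user message instantly.
  const userMessage = await deps.saveMessage(conversationId, "user", text);

  // 2. Grab the summary + last 20 messages (in parallel to stay fast).
  const [summary, recentMessages] = await Promise.all([
    deps.getSummary(conversationId),
    deps.getRecentMessages(conversationId, 20),
  ]);

  // 3. One fast LLM call generates the reply.
  const reply = await deps.generateReply({ summary, recentMessages });

  // 4. Save the assistant reply.
  await deps.saveMessage(conversationId, "assistant", reply);

  // Fire and forget: deliberately not awaited, and a queue failure is logged
  // instead of breaking the reply the user is waiting for.
  Promise.resolve()
    .then(() => deps.enqueueMemoryJob({ conversationId, messageId: userMessage.id }))
    .catch((err) => console.error("memory job enqueue failed:", err));

  // 5. Return the reply to the user.
  return reply;
}
```

The one design choice worth noticing is that the enqueue call is never awaited, so the slow half cannot add latency to the fast half.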
Part 2: What Actually Gets Stored, and Where the Hell It Lives
Our system has exactly four places where truth lives.
| Place | What lives there | Forever? | Used for retrieval? | Updated how often? |
| --- | --- | --- | --- | --- |
| messages | Everything you and the AI ever said | Yes | Never directly | Instantly (synchronous) |
| memories | Clean, deduped, contradiction-free facts | Yes | Yes – via Pinecone | Background, per message |
| summaries | One short paragraph (≤400 tokens) of the entire conversation so far | Yes | No – only as prompt context, to prevent re-extraction | Every few minutes or when big changes happen |
| Pinecone index | 1536-dim embeddings of every memory | Yes | Yes – the actual search | Instantly when memories change |
1. messages table
Every user and assistant message, exactly as typed.
Never edited, never deleted (unless the user explicitly deletes the conversation).
This is your audit log. If something goes wrong, you can replay the entire conversation from here.
2. memories table – the actual long-term brain
id UUID -- same ID lives in Pinecone
conversation_id UUID
content TEXT -- e.g. "User is no longer vegan"
created_at TIMESTAMP
updated_at TIMESTAMP -- changes on UPDATE actions
Key rules that stop the system from turning into garbage:
One fact = one row
Same fact never appears twice
If you contradict yourself → we UPDATE or DELETE the old row (same UUID!)
Every row has an identical twin vector in Pinecone with the exact same UUID
3. summaries table
conversation_id UUID (PK)
text TEXT -- "Alex used to be vegan but quit last month. Lives in Berlin..."
updated_at TIMESTAMP
Why this exists:
When extracting new facts, we feed the old summary to the LLM so it doesn’t re-extract shit we already know.
It keeps the eventual prompt tiny: during normal chat we only send this summary + the last 20 messages, never the whole history.
4. Pinecone vector database
Metadata stored: { conversationId, content }
Every memory row has exactly one vector with the same ID
Filtered queries → we only ever search inside one conversation using the conversationId
If you ever feel confused while reading the rest of this post, come back to this table. Everything we do is just moving data between these four boxes without breaking them.
Part 3: Memory Extraction Background Process – The Real Brain
This is the single most important section of the entire blog. Everything else is just plumbing. This worker is where raw conversation becomes clean, contradiction-free, permanent memory.
It runs completely asynchronously in a separate process listening to the memory-process queue. By the time it finishes, the user has already had their reply for a few seconds.
Worker Configuration
Queue name: memory-process
Concurrency: 5 jobs at once because we don’t want to hammer OpenAI
Job retention: completed → 1 hour, failed → 24 hours
Backend: Redis + BullMQ
When you send a message, the chat endpoint does one thing at the end:
memoryProcessQueue.add('process', { conversationId, messageId }) // BullMQ's add() takes a job name first; 'process' is a placeholder name
That’s it. Fire and forget.
Step 1: Job Arrives – worker wakes up
{
conversationId: "cvt_123",
messageId: "msg_789"
}
Step 2: Grab Every Piece of Context We Need (4 DB queries – yes, it’s a lot, I know, but it’s necessary)
We do four queries. Here’s exactly why:
Query 1 – Get the exact message we’re processing
SELECT * FROM messages WHERE id = $1 LIMIT 1
Early exit if missing (should never happen).
Query 2 – Find the previous message
Current dumb (but working) way – fetch ALL messages:
SELECT * FROM messages WHERE conversation_id = $1 ORDER BY created_at ASC
Then in JS: find index of current → grab index - 1
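The "find index, grab index - 1" step can be sketched as a small pure function. The name `findPreviousMessage` is mine, not the SDK's, and I am assuming each row carries an `id` field. A cheaper alternative (also an assumption, not what the worker currently does) would be a single query for the newest message with a `created_at` earlier than the current one, which avoids loading the whole conversation.

```javascript
// Sketch of the in-memory lookup described above.
// `messages` must already be sorted by created_at ascending.
function findPreviousMessage(messages, currentId) {
  const index = messages.findIndex((m) => m.id === currentId);
  // index === -1 → current message not found; index === 0 → it is the first message.
  // Both cases have no previous message, so return null.
  return index > 0 ? messages[index - 1] : null;
}
```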
Query 3 – Get current conversation summary (so we don’t re-extract old facts)
SELECT * FROM summaries WHERE conversation_id = $1 LIMIT 1
Query 4 – Last 10 recent messages (extra context for extraction LLM)
SELECT * FROM messages WHERE conversation_id = $1
ORDER BY created_at DESC LIMIT 10
Final context object passed to every LLM call:
{
newMessage,
previousMessage, // can be null
summary, // string or ""
recentMessages // last 10, chronological
}
Step 3: Extract ONLY New Facts
Conversation summary (for context only – DO NOT extract from this):
Alex is a vegan. Lives in Berlin. Has a dog named Max.
Recent messages (for context only – DO NOT extract from these):
user: Yeah the weather sucks today
assistant: Tell me about it...
Previous message: user: By the way, I stopped being vegan last week
NEW message → user: I'm eating chicken now, feels good man
Extract ONLY new, permanent, factual statements from the NEW message above.
Ignore chit-chat, emotions, temporary states.
Return JSON: { "facts": ["string", "string"] }
Example output:
{
"facts": [
"User stopped being vegan last week",
"User now eats chicken"
]
}
If facts array is empty → early exit. Job done. No work needed.
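Here is a rough sketch of how the prompt above could be assembled from the context object, plus a defensive parser for the model's JSON. The function names and the `role`/`content` message shape are assumptions for illustration; the parser's "return an empty list on anything malformed" behaviour is my choice, picked because an empty list already means "early exit" in this pipeline.

```javascript
function formatMessage(m) {
  return `${m.role}: ${m.content}`;
}

// Builds the extraction prompt from the Step 2 context object.
function buildExtractionPrompt({ newMessage, previousMessage, summary, recentMessages }) {
  return [
    "Conversation summary (for context only – DO NOT extract from this):",
    summary || "(none)",
    "Recent messages (for context only – DO NOT extract from these):",
    ...recentMessages.map(formatMessage),
    `Previous message: ${previousMessage ? formatMessage(previousMessage) : "(none)"}`,
    `NEW message → ${formatMessage(newMessage)}`,
    "Extract ONLY new, permanent, factual statements from the NEW message above.",
    "Ignore chit-chat, emotions, temporary states.",
    'Return JSON: { "facts": ["string", "string"] }',
  ].join("\n");
}

// Parses the model output; malformed output is treated as "no facts".
function parseFacts(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!parsed || !Array.isArray(parsed.facts)) return [];
  const seen = new Set();
  const facts = [];
  for (const fact of parsed.facts) {
    if (typeof fact !== "string") continue;
    const trimmed = fact.trim();
    // Keep non-empty strings only, and drop exact duplicates within one response.
    if (trimmed && !seen.has(trimmed)) {
      seen.add(trimmed);
      facts.push(trimmed);
    }
  }
  return facts;
}
```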
Step 4: For Each Fact → The Full Decision Pipeline
Now we enter the per-fact loop. This is where contradictions are caught.
4a. Create embedding for the candidate fact
openai.embeddings.create({
model: "text-embedding-3-small",
input: fact
})
→ 1536-dim vector
4b. Get all memory IDs for this conversation (needed for filtering)
SELECT id FROM memories WHERE conversation_id = $1
4c. Search Pinecone for semantically similar memories
index.query({
vector: embedding,
topK: Math.min(10, totalMemoriesInConversation),
filter: { conversationId: { $eq: conversationId } },
includeMetadata: true
})
Then:
Keep only matches with score >= 0.5
Double-check they actually belong to this conversation
Fetch full memory rows from PostgreSQL
Result → similarMemories array, max 10 items
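The post-query filtering can be sketched as one pure function. I am assuming each match has the `id`, `score`, and `metadata` fields shown in the query above, and that `ownedIds` is the ID list from step 4b; the function name is hypothetical.

```javascript
const MIN_SCORE = 0.5;

// Returns the IDs of matches worth showing to the judge LLM, best first.
function selectSimilarMemoryIds(matches, conversationId, ownedIds, limit = 10) {
  const owned = new Set(ownedIds);
  return matches
    // Drop weak matches.
    .filter((m) => m.score >= MIN_SCORE)
    // Double-check ownership two ways: vector metadata and the PostgreSQL ID list.
    .filter((m) => m.metadata && m.metadata.conversationId === conversationId && owned.has(m.id))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((m) => m.id);
}
```

The returned IDs are then used to fetch the full rows from PostgreSQL, which also quietly drops any vector whose row no longer exists.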
4d. LLM Decides: ADD / UPDATE / DELETE (LLM Call as the judge)
Tool-calling, forced function call:
tools: [{
  type: "function", // required by the OpenAI tools format
  function: {
    name: "decide_memory_action",
    parameters: {
      type: "object",
      properties: {
        action: { type: "string", enum: ["ADD", "UPDATE", "DELETE"] },
        memoryId: { type: ["string", "null"] } // null for ADD
      },
      required: ["action"]
    }
  }
}]
Prompt fed to the model:
Candidate fact: "User now eats chicken"
Existing similar memories:
1. ID: mem_abc123 → "User is vegan" (score: 0.91)
2. ID: mem_def456 → "User became vegan 3 months ago" (score: 0.87)
Decide: ADD (new), UPDATE (refines), or DELETE (contradicts)?
Return only the JSON via tool call.
The LLM is terrifyingly good at this.
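Good as the judge is, I would still not execute its verdict blindly. Below is a minimal validation sketch, assuming the tool-call arguments arrive as a JSON string and that `similarMemories` is the array from step 4c with an `id` on each item. Throwing on a bad verdict is my assumption about how to handle it (so the job fails loudly rather than deleting the wrong row); the function name is hypothetical.

```javascript
const ACTIONS = new Set(["ADD", "UPDATE", "DELETE"]);

// Validates the judge's tool-call arguments before anything touches the database.
function validateDecision(rawArguments, similarMemories) {
  let args;
  try {
    args = JSON.parse(rawArguments);
  } catch {
    throw new Error("decision is not valid JSON");
  }
  if (!args || !ACTIONS.has(args.action)) {
    throw new Error("decision has an unknown action");
  }
  if (args.action === "ADD") {
    return { action: "ADD", memoryId: null };
  }
  // UPDATE and DELETE must point at a memory the model was actually shown.
  const known = similarMemories.some((m) => m.id === args.memoryId);
  if (!known) {
    throw new Error(`decision references unknown memory: ${args.memoryId}`);
  }
  return { action: args.action, memoryId: args.memoryId };
}
```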
4e. Execute the Verdict
Action: ADD → Brand new fact
Insert into PostgreSQL → get new UUID
Upsert into Pinecone with same UUID + new embedding + metadata
Action: UPDATE → Refine existing memory
UPDATE memories SET content = $new, updated_at = NOW() WHERE id = $oldId
Upsert the same ID in Pinecone with new embedding + new content
→ Same memory lives forever, just evolves
Action: DELETE → Direct contradiction
DELETE FROM memories WHERE id = $oldId
index.deleteMany([oldId]) in Pinecone
→ Memory is erased from existence. Gone. Dead.
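The three branches can be sketched as one function that keeps PostgreSQL and Pinecone in lockstep. Treat this as a sketch under assumptions: `db.query(text, params)` mimics a node-postgres style client, `index` only needs the `upsert` and `deleteMany` methods used in this post, `embed` is a hypothetical helper wrapping the embedding call from step 4a, and the INSERT assumes the `id` and timestamp columns have database defaults.

```javascript
// Applies one verdict. `clients` = { db, index, embed }, all injected.
async function applyDecision(clients, conversationId, fact, decision) {
  const { db, index, embed } = clients;

  if (decision.action === "DELETE") {
    await db.query("DELETE FROM memories WHERE id = $1", [decision.memoryId]);
    await index.deleteMany([decision.memoryId]);
    return { action: "DELETE", id: decision.memoryId };
  }

  if (decision.action !== "ADD" && decision.action !== "UPDATE") {
    throw new Error(`unknown action: ${decision.action}`);
  }

  let id = decision.memoryId;
  if (decision.action === "ADD") {
    // Assumes the table generates the UUID (e.g. a column default).
    const result = await db.query(
      "INSERT INTO memories (conversation_id, content) VALUES ($1, $2) RETURNING id",
      [conversationId, fact]
    );
    id = result.rows[0].id;
  } else {
    await db.query(
      "UPDATE memories SET content = $1, updated_at = NOW() WHERE id = $2",
      [fact, id]
    );
  }

  // Same ID in both stores, so the row and its vector stay identical twins.
  const values = await embed(fact);
  await index.upsert([{ id, values, metadata: { conversationId, content: fact } }]);
  return { action: decision.action, id };
}
```

Note that the two stores are not updated atomically here; if the Pinecone call fails after the SQL succeeds, the job would need a retry to bring them back in sync.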
Step 5: Did We Change Enough to Refresh the Summary?
After all facts processed:
If at least 3 facts were added or at least 2 were updated, we refresh the summary.
if (added >= 3 || updated >= 2) {
summaryUpdateQueue.add('update', { conversationId }, { delay: 5000, jobId: unique })
}
Why delay 5 seconds? It lets multiple messages sent in quick succession batch together.
Real-World Examples
Example 1: ADD – "I live in Berlin"
No similar memories → score < 0.5
LLM decides ADD
New UUID created
Row + vector inserted
Memory now forever retrievable
Example 2: UPDATE – "I’m a vegetarian now" (was vegan earlier)
Embedding similarity to old "User is vegan" → 0.84
LLM sees contradiction → chooses UPDATE
Same memory ID kept
Content becomes "User is a vegetarian"
Vector updated
No duplicate created
Example 3: DELETE – "I don’t have a girlfriend anymore"
Old memory: "User's girlfriend is Kitkat"
Similarity → 0.89
LLM: direct contradiction → DELETE
Memory + vector completely removed
Asking “Who is my girlfriend?” later → honest “You told me you broke up”
Example 4: Chain reaction in one message
User says: “I quit veganism and broke up with Sarah”
Two facts extracted:
"User is no longer vegan" → UPDATEs old vegan memory
"User broke up with Sarah" → DELETEs girlfriend memory
One message, two surgical operations, zero duplicates.
Part 4: Summary Generation Background Process
This worker is the single most underrated piece of the entire memory system. It runs completely in the background and is triggered in two ways:
The memory worker reports ≥3 adds or ≥2 updates
Automatically every 3 minutes (periodic refresh), because users don’t always trigger thresholds.
Worker Configuration
Queue name: summary-update
Concurrency: 3 jobs at once
Job retention: completed → 1 hour, failed → 24 hours
Backend: Redis + BullMQ
Two job types:
update → one-off, fired by memory worker
periodic → repeat job, every 3 minutes per conversation
How Jobs Get Scheduled
On conversation creation:
summaryUpdateQueue.add('periodic', { conversationId }, {
repeat: { every: 3 * 60 * 1000 },
jobId: `summary-periodic-${conversationId}`
})
On server startup: we clean up old jobs and re-schedule periodic ones for every existing conversation.
Full Step-by-Step Flow
Step 1: Job Arrives
{ conversationId: "cvt_123" }
Step 2: Verify the Conversation Still Exists
SELECT 1 FROM conversations WHERE id = $1 LIMIT 1
If gone, kill periodic job + exit. No zombie summaries.
Step 3: Fetch the Current Summary
SELECT text FROM summaries WHERE conversation_id = $1 LIMIT 1
If it exists, use it
If not, fall back to an empty string
This is fed to the LLM so it doesn’t repeat itself.
Step 4: Grab the Last 50 Messages
SELECT role, content, created_at
FROM messages
WHERE conversation_id = $1
ORDER BY created_at DESC
LIMIT 50
We reverse them to chronological order before sending to LLM.
Edge case: for 0 messages, force empty summary + exit early
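The reverse-then-format step, including the zero-message edge case, can be sketched like this. The function name is hypothetical and I am assuming rows shaped like the SELECT above (`role`, `content`, `created_at`).

```javascript
// Turns the newest-first rows from the query into a chronological transcript.
// Returns null when there is nothing to summarize, so the caller can
// store an empty summary and exit early.
function buildTranscript(rowsNewestFirst) {
  if (rowsNewestFirst.length === 0) return null;
  // Copy before reversing so the caller's array is not mutated.
  return [...rowsNewestFirst]
    .reverse()
    .map((row) => `${row.role}: ${row.content}`)
    .join("\n");
}
```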
Step 5: Build the LLM Prompt
This prompt is the reason summaries never go rogue:
You are maintaining a concise, factual summary of a long-running conversation.
Current summary (may be outdated or empty):
Alex is a vegan who became vegan last month. He lives in Berlin with his dog Max.
Recent messages (last 50 – chronological order):
user: Yeah I actually stopped being vegan two weeks ago
assistant: Oh really? What made you quit?
user: Health reasons + I missed cheese too much
user: Also I'm moving to Barcelona next month
assistant: Nice! When exactly?
user: End of June
Task:
Rewrite the summary incorporating ONLY new permanent facts from the recent messages.
Rules:
• ≤400 tokens
• Only factual, permanent info (names, location, diet, relationships, plans, preferences)
• DO NOT include temporary chit-chat (“user asked about weather”)
• DO NOT repeat facts already in current summary UNLESS they changed
• Write in coherent paragraphs, not bullet points
• Focus on what the user has revealed about themselves
• Return ONLY the new summary text – no explanations
Return only the summary.
Step 6: One Single LLM Call – The Summarizer
openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0.3,
max_tokens: 400,
messages: [
{ role: "system", content: "You are a factual summarizer. Never hallucinate." },
{ role: "user", content: fullPromptAbove }
]
})
Example output:
Alex used to be vegan but quit two weeks ago due to health reasons and missing cheese. He lives in Berlin with his dog Max and is moving to Barcelona at the end of June.
Step 7: Write It Back to Database (UPSERT pattern)
SELECT 1 FROM summaries WHERE conversation_id = $1
If it exists, UPDATE:
UPDATE summaries
SET text = $1, updated_at = NOW()
WHERE conversation_id = $2
If it does not exist, INSERT:
INSERT INTO summaries (conversation_id, text, updated_at)
VALUES ($1, $2, NOW())
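One thing to watch with the check-then-write pattern: with 3 summary jobs running concurrently, two jobs for the same conversation could both see "no row" and both try to INSERT. A possible alternative, assuming PostgreSQL and that `conversation_id` is the primary key as in the schema above, is to collapse the three statements into one `INSERT ... ON CONFLICT`. This is a sketch of that variant, not what the worker currently runs; `db.query(text, params)` again stands in for a node-postgres style client.

```javascript
// Single-statement upsert: insert the row, or overwrite it if it already exists.
const UPSERT_SUMMARY_SQL = `
  INSERT INTO summaries (conversation_id, text, updated_at)
  VALUES ($1, $2, NOW())
  ON CONFLICT (conversation_id)
  DO UPDATE SET text = EXCLUDED.text, updated_at = NOW()
`;

async function saveSummary(db, conversationId, text) {
  await db.query(UPSERT_SUMMARY_SQL, [conversationId, text]);
}
```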
Step 8: Done
{ summaryLength: 312 }
Example
Before (old summary):
Alex is a full-time software engineer. He is vegan and has been for 18 months.
User then says over 40 messages:
“I actually quit veganism last month”
“I’m dating someone new named Kitkat”
“Moving to Lisbon in September for a new job”
“I adopted a cat named Luna”
“No longer doing keto either”
New summary generated:
Alex was vegan for 18 months but quit last month. He is now dating Kitkat and recently adopted a cat named Luna. He is moving to Lisbon in September for a new job as a software engineer.
Periodic Refresh
Even if the user sends 100 tiny messages that never trigger the threshold, every 3 minutes the summary worker wakes up and checks whether there is anything new to process.
Without this worker, your memory layer is just a fancy log file. With it, your AI actually evolves with the user.
Part 5: How Developers and Users Actually Talk to the Brain
Now let’s zoom out and see how clean it looks from the outside. Nobody should ever have to touch the queues directly, so I’ve built a simple Node.js SDK for that.
memory.chat("your message")
User message → instant assistant reply (sub-500 ms)
Behind the scenes: message stored → reply generated → memory extraction job fired → summary refreshed if needed
Feels exactly like ChatGPT, but with a real brain attached.
memory.ask("What do you know about my diet?") → Pure memory retrieval.
Question → embedding → Pinecone search (top 25 most relevant memories) → final LLM call that answers using only real stored facts.
Returns an answer grounded only in stored facts in ~400 ms, even after years of conversation.
memory.say("anything") → The smart router. You don’t have to decide if it’s chat or ask.
Looks at the message: if it contains question words, “what”, “summarize”, “remind”, “do you know”, etc. → routes to ask()
Otherwise → routes to chat()
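A keyword router like this can be sketched in a few lines. The trigger list below takes “what”, “summarize”, “remind”, and “do you know” from the description above; the other question words are my guess at what the “etc.” covers, and the function name is hypothetical.

```javascript
// Words and phrases that send a message to ask(); anything else goes to chat().
// "who", "when", "where", "why", "how" are assumed extras, not confirmed triggers.
const ASK_TRIGGERS = /\b(what|who|when|where|why|how|summarize|remind|do you know)\b/i;

function route(message) {
  return ASK_TRIGGERS.test(message) ? "ask" : "chat";
}
```

A heuristic this simple will misroute some statements (for example “I moved when I was ten” contains “when”), which is the trade-off for not spending an LLM call on routing.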
What Really Happens Inside memory.ask()
User calls: memory.ask("What do you know about my diet right now?")
Question arrives at the API endpoint
We immediately create an embedding for the entire question using text-embedding-3-small
That vector is thrown at Pinecone with a filter for this exact conversationId
Top-K = 25
Only memories with similarity ≥ 0.78 make the cut.
We pull the full text content of those matching memories from PostgreSQL
One final LLM call receives:
The always-fresh conversation summary
The ranked list of relevant memories (with relative timestamps: “3 days ago”, “2 months ago”)
The original question
Strict instructions: only use the provided memories, never hallucinate, be warm but concise
Answer sent back to the user in real time
The moment the answer is sent, a tiny background job logs:
Question text
Which memory IDs were used
Retrieval latency
Model used
The user never waits for logging. The response is already in their hands.
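The middle of that pipeline (threshold, ranking, relative timestamps) can be sketched as pure functions. I am assuming matches with `id` and `score`, and a `Map` of PostgreSQL rows keyed by ID with `content` and an `updated_at` Date; the function names and the 30-day month approximation are mine.

```javascript
const MIN_ASK_SCORE = 0.78;
const DAY_MS = 24 * 60 * 60 * 1000;

// Rough human-readable age; months are approximated as 30 days.
function relativeAge(date, now) {
  const days = Math.floor((now.getTime() - date.getTime()) / DAY_MS);
  if (days < 1) return "today";
  if (days < 30) return `${days} day${days === 1 ? "" : "s"} ago`;
  const months = Math.floor(days / 30);
  if (months < 12) return `${months} month${months === 1 ? "" : "s"} ago`;
  const years = Math.floor(months / 12);
  return `${years} year${years === 1 ? "" : "s"} ago`;
}

// Builds the ranked memory list that goes into the final LLM call.
function buildMemoryLines(matches, rowsById, now) {
  return matches
    // Keep confident matches whose row still exists in PostgreSQL.
    .filter((m) => m.score >= MIN_ASK_SCORE && rowsById.has(m.id))
    .sort((a, b) => b.score - a.score)
    .map((m) => {
      const row = rowsById.get(m.id);
      return `- ${row.content} (${relativeAge(row.updated_at, now)})`;
    });
}
```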
Conclusion
What I just showed you is an incomplete, minimalistic, yet fully functional implementation of the Mem0 research paper (2025). I shipped the vector + LLM-guided self-correction part first.
The graph layer from the paper, the one with entities, relationships, and temporal edges, is next. I’m starting it next week.
Below is a small thought experiment I have been mulling over on how to make the memory layer more cognitive and human-like. If you have insights or thoughts on it, feel free to share them.
A Thought Experiment
When we were kids, someone told us “Your birthday is 12th April 2003” exactly once, and it stuck forever. In 10th grade I could recite every mark, every Merchant of Venice line, every history-book date end-to-end. Five years later? Gone. Not even a trace.
Why?
Because human memory isn’t a log. It’s prioritised and decaying, and it works like an attention-driven cache.
My birthday gets recalled every single year (and dozens of times in between) → reinforcement → score near 1.0 → instant, crystal-clear access. 10th-grade scores got hammered into my head for one year, then never accessed again → logarithmic decay → eventually evicted from high-priority recall.
Memory Importance Scoring + Reinforcement + Decay
Give every memory a dynamic importance score (0–1).
Extremely personal facts (birthday, name, trauma, core values) → seed score 0.95+
Transient facts (exam scores, random preferences) → seed score 0.3–0.6
Then run a background worker that:
Logs every single recall (every time a memory is retrieved in ask() or injected into chat)
Increases score slightly on recall
Applies slow logarithmic decay on untouched memories
Result:
Your birthday score climbs to 1.0 or somewhere close and stays there because it’s retrieved every April + random conversations
Your 10th-grade marks spike to 0.9 during 11th grade, then decay to 0.05 over five years: still stored, but they no longer surface unless explicitly asked for in a high-detail mode
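To make the idea tangible, here is one possible shape for the two update rules. Nothing here is built yet: the formulas, the function names, and the `boost` and `rate` constants are all placeholders I am assuming for the sake of the thought experiment, and they would need tuning before they reproduce the numbers above.

```javascript
function clamp01(x) {
  return Math.min(1, Math.max(0, x));
}

// Reinforcement: each recall closes a fixed fraction of the gap to 1.0,
// so frequently recalled memories approach 1.0 but never overshoot.
function reinforce(score, boost = 0.05) {
  return clamp01(score + boost * (1 - score));
}

// Logarithmic decay: the score shrinks with ln(1 + idleDays),
// dropping quickly at first and ever more slowly afterwards.
function decay(score, idleDays, rate = 0.1) {
  return clamp01(score / (1 + rate * Math.log1p(Math.max(0, idleDays))));
}
```

At retrieval time the decayed score could be used as a tie-breaker or multiplier on vector similarity, which is how a stale memory would stay stored but stop surfacing.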
Final Words
There is still so much to fix. The SDK needs to be more minimal with fewer configs. We need to support every major provider and local models properly (right now it’s just OpenAI + Gemini).
But even in its current raw, unpolished state, this memory layer already turns a stateless LLM into something that feels alive.
I truly believe the next real leap in AI won’t come from another 175B → 1.8T parameter jump.
It will come from agents that remember you.
If you’ve made it to the end, thank you so much. See you in the next one.




