Normal Memory: Because Forgetting Is a Feature for Humans.


Well, let’s discuss the origin of normal-memory before getting into the weeds. I was on a call with a bunch of builder friends a couple of months back, all of us discussing AI tools, agents, and memory. This was right around the time SuperMemory founder Dhravya Shah announced raising capital, and we finally agreed on one simple truth:

“AI is a stateless piece of shit.”

No, seriously. Every single LLM we love is fundamentally brain-damaged:

  • It forgets everything the moment you close the tab.

  • It only “remembers” whatever you managed to cram into the last 8k–128k tokens.

  • Cross the context limit, and poof, your AI now has dementia.

  • Want it to remember your name, your dog’s birthday, or that you’re allergic to coriander? Good luck manually stuffing that into every single prompt forever.

At one point, we were all laughing because we’ve all built the same ugly hacks:

  • shoving JSON blobs into system prompts

  • praying the model reads the 69th message in the history

  • watching it confidently hallucinate that you live in Antarctica because that one fact fell off the context cliff.

But the real question here is, how do you actually fix it? How do you make an AI remember everything, forever, without ever hitting the context window wall? How do companies like mem0 and supermemory even pull this off?

That question has been stuck in my head ever since, and that’s exactly what this post is about. No AI slop, I promise. I tried building this three times; this was my fourth attempt, and I’d say I finally managed to build something.

I’m going to rip open the memory layer I built, no fluff, no marketing, just the raw architecture, code, queues, vector DBs, background workers, and all the dirty little LLM calls that make it work.

By the end of this, you’ll know exactly how to stop your AI from being a stateless piece of shit, and turn it into something that actually remembers you, updates when you change your mind, and can answer “What do you know about me?” six months later like it was yesterday.

Section 2: First, Let’s Agree on What Proper Memory Actually Means

Before we dive into queues, Pinecone upserts, and the 69th LLM call that finally stops your AI from gaslighting you, let’s get brutally clear on the bar we’re trying to beat.

Most people say “I want memory” and think that means “dump the last 20 messages into the prompt.” That’s not memory. That’s a slightly longer context window. Still stateless slop.

Real long-term memory has to satisfy three non-negotiable rules. If it breaks even one, it’s fake:

  1. Infinite retention
  • Once you tell it something, it remembers forever (or until you explicitly contradict it).
  2. Zero manual stuffing
  • You never again write “You are a helpful assistant who knows the user is vegan and lives in Berlin…” in the system prompt. Ever.
  3. Automatic conflict resolution
  • You say “I became a vegetarian” in January → “I quit vegetarianism” in March → the AI must notice and update/delete the old fact.

Part 1: The Split-Second Lie We Tell Users

When you type a message and hit enter, you want two things at the exact same time:

  1. An instant reply (like ChatGPT – no lag allowed)

  2. The AI to perfectly understand and remember everything you just said forever

If we wait to extract facts, resolve contradictions, update summaries, and upsert vectors before sending the reply, the user stares at a spinner for 4–12 seconds. Feels slow and sloppy.

If we reply instantly and do the memory work later, we’re lying to the user for a split second. We chose to lie for 300 ms and tell the truth forever after.

Synchronous (the part the user actually sees – sub-500 ms)

User → POST /chat → 
   1. Save user message instantly  
   2. Grab summary + last 20 messages  
   3. One fast LLM call → generate reply  
   4. Save assistant reply  
   5. Return reply to user  
→ User feels zero lag

Asynchronous (the real brain – happens in the background)

→ Fire and forget a job into Redis/BullMQ  
   → Memory worker wakes up  
   → Extracts facts, resolves contradictions, updates Pinecone, maybe triggers summary refresh  
   → All the slow, expensive, correct stuff

This is the first and most important design pattern. Everything else in this blog is just details on how to make the asynchronous half not screw things up.
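
To make the split concrete, here is a minimal sketch of the synchronous half handing off to the asynchronous half. It is not the actual endpoint: `db`, `llm`, and `queue` are hypothetical injected interfaces, and the job name "process" is a placeholder.

```javascript
// Sketch of "reply first, remember later". All dependencies are injected,
// so `db`, `llm`, and `queue` are hypothetical interfaces, not a real SDK.
async function handleChat({ db, llm, queue }, conversationId, text) {
  // 1. Persist the user message before anything else.
  const userMsg = await db.saveMessage(conversationId, "user", text);

  // 2. Small, bounded context: summary + last 20 messages.
  const [summary, history] = await Promise.all([
    db.getSummary(conversationId),
    db.getRecentMessages(conversationId, 20),
  ]);

  // 3. One fast LLM call, then persist the reply.
  const reply = await llm.reply({ summary, history, text });
  await db.saveMessage(conversationId, "assistant", reply);

  // 4. Fire and forget: a queue failure is logged but never blocks the reply.
  queue
    .add("process", { conversationId, messageId: userMsg.id })
    .catch((err) => console.error("memory job not enqueued:", err));

  return reply;
}
```

The one deliberate choice here is that the enqueue is not awaited: if Redis is down, the user still gets an answer and only the memory update is lost.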

Part 2: What Actually Gets Stored, and Where the Hell It Lives

Our system has exactly four places where truth lives.

| Place | What lives there | Forever? | Used for retrieval? | Updated how often? |
| --- | --- | --- | --- | --- |
| messages | Everything you and the AI ever said | Yes | Never directly | Instantly (synchronous) |
| memories | Clean, deduped, contradiction-free facts | Yes | Yes, via Pinecone | Background, per message |
| summaries | One short paragraph (≤400 tokens) of the entire conversation so far | Yes | Only to prevent re-extraction | Every few minutes or when big changes happen |
| Pinecone index | 1536-dim embeddings of every memory | Yes | Yes, the actual search | Instantly when memories change |

1. messages table

  • Every user and assistant message, exactly as typed.

  • Never edited, never deleted (unless the user explicitly deletes the conversation).

  • This is your audit log. If something goes wrong, you can replay the entire conversation from here.

2. memories table – the actual long-term brain

id               UUID          -- same ID lives in Pinecone
conversation_id  UUID
content          TEXT          -- e.g. "User is no longer vegan"
created_at       TIMESTAMP
updated_at       TIMESTAMP     -- changes on UPDATE actions

Key rules that stop the system from turning into garbage:

  • One fact = one row

  • Same fact never appears twice

  • If you contradict yourself → we UPDATE or DELETE the old row (same UUID!)

  • Every row has an identical twin vector in Pinecone with the exact same UUID

3. summaries table

conversation_id  UUID (PK)
text             TEXT     -- "Alex used to be vegan but quit last month. Lives in Berlin..."
updated_at       TIMESTAMP

Why this exists:

  • When extracting new facts, we feed the old summary to the LLM so it doesn’t re-extract shit we already know.

  • Keeps the eventual prompt tiny: during normal chat we only send this summary + the last 20 messages, never the whole history.

4. Pinecone vector database

  • Metadata stored: { conversationId, content }

  • Every memory row has exactly one vector with the same ID

  • Filtered queries → we only ever search inside one conversation using the conversationId

If you ever feel confused while reading the rest of this post, come back to this table. Everything we do is just moving data between these four boxes without breaking them.

Part 3: Memory Extraction Background Process – The Real Brain

This is the single most important section of the entire blog. Everything else is just plumbing. This worker is where raw conversation becomes clean, contradiction-free and forever memory.

This runs completely asynchronously in a separate process listening to the memory-process queue. By the time this finishes, the user has already received their reply seconds ago.

Worker Configuration

  • Queue name: memory-process

  • Concurrency: 5 jobs at once because we don’t want to hammer OpenAI

  • Job retention: completed → 1 hour, failed → 24 hours

  • Backend: Redis + BullMQ
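
For reference, the settings above could be expressed roughly like this. It is a sketch, assuming the options are passed to BullMQ's `Worker` constructor, where `removeOnComplete` and `removeOnFail` take an `age` in seconds; `processJob` and `connection` in the usage comment are placeholders.

```javascript
// Worker options mirroring the configuration listed above.
// Assumption: retention is configured on the worker, with `age` in seconds.
const memoryWorkerOptions = {
  concurrency: 5,                          // at most 5 jobs in flight
  removeOnComplete: { age: 60 * 60 },      // keep completed jobs for 1 hour
  removeOnFail: { age: 24 * 60 * 60 },     // keep failed jobs for 24 hours
};

// Usage (needs the bullmq package and a Redis connection, so not run here):
// new Worker("memory-process", processJob, { connection, ...memoryWorkerOptions });
```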

When you send a message, the chat endpoint does one thing at the end:

// BullMQ's add() takes a job name first; 'process' is a placeholder name
memoryProcessQueue.add('process', { conversationId, messageId })

That’s it. Fire and forget.

Step 1: Job Arrives – worker wakes up

{
  conversationId: "cvt_123",
  messageId:     "msg_789"
}

Step 2: Grab Every Piece of Context We Need (4 DB queries – yes, it’s a lot, I know, but it’s necessary)

We do four queries. Here’s exactly why:

Query 1 – Get the exact message we’re processing

SELECT * FROM messages WHERE id = $1 LIMIT 1

Early exit if missing (should never happen).

Query 2 – Find the previous message

Current dumb (but working) way – fetch ALL messages:

SELECT * FROM messages WHERE conversation_id = $1 ORDER BY created_at ASC

Then in JS: find index of current → grab index - 1
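
The in-JS lookup is tiny. The helper below is a sketch of it, plus one possible way to skip the full fetch: a single query that asks for the newest message older than the current one. That query is an assumption on my part (it relies on `created_at` values being distinct within a conversation), and both names are hypothetical.

```javascript
// `messages` is the full, oldest-first result of the query above.
function findPreviousMessage(messages, currentId) {
  const index = messages.findIndex((m) => m.id === currentId);
  // index === -1 → current message missing; index === 0 → it is the first message.
  return index > 0 ? messages[index - 1] : null;
}

// Possible replacement for fetching everything ($2 = created_at of the current message).
const PREVIOUS_MESSAGE_SQL = `
  SELECT * FROM messages
  WHERE conversation_id = $1 AND created_at < $2
  ORDER BY created_at DESC
  LIMIT 1
`;
```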

Query 3 – Get current conversation summary, so we don’t re-extract old facts

SELECT * FROM summaries WHERE conversation_id = $1 LIMIT 1

Query 4 – Last 10 recent messages (extra context for extraction LLM)

SELECT * FROM messages WHERE conversation_id = $1 
ORDER BY created_at DESC LIMIT 10

Final context object passed to every LLM call:

{
  newMessage,
  previousMessage,   // can be null
  summary,           // string or ""
  recentMessages     // last 10, chronological
}

Step 3: Extract ONLY New Facts

Conversation summary (for context only – DO NOT extract from this):
Alex is a vegan. Lives in Berlin. Has a dog named Max.

Recent messages (for context only – DO NOT extract from these):
user: Yeah the weather sucks today
assistant: Tell me about it...

Previous message: user: By the way, I stopped being vegan last week
NEW message → user: I'm eating chicken now, feels good man

Extract ONLY new, permanent, factual statements from the NEW message above.
Ignore chit-chat, emotions, temporary states.
Return JSON: { "facts": ["string", "string"] }

Example output (only the NEW message counts; the previous message was handled by its own job):

{
  "facts": [
    "User now eats chicken"
  ]
}

If facts array is empty → early exit. Job done. No work needed.
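
Models do not always return clean JSON, so it helps to parse defensively before entering the per-fact loop. This is a minimal sketch with a hypothetical `parseFacts` helper: anything malformed becomes an empty list, which triggers the early exit above.

```javascript
// Turn the raw model output into a clean list of fact strings.
function parseFacts(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // not JSON at all → treat as "no facts"
  }
  if (!parsed || !Array.isArray(parsed.facts)) return [];

  const seen = new Set();
  const facts = [];
  for (const item of parsed.facts) {
    if (typeof item !== "string") continue; // drop non-string entries
    const fact = item.trim();
    const key = fact.toLowerCase();
    if (fact && !seen.has(key)) {
      seen.add(key); // drop exact duplicates within one response
      facts.push(fact);
    }
  }
  return facts;
}
```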

Step 4: For Each Fact → The Full Decision Pipeline

Now we enter the per-fact loop. This is where contradictions are caught.

4a. Create embedding for the candidate fact

openai.embeddings.create({
  model: "text-embedding-3-small",
  input: fact
})

→ 1536-dim vector

4b. Get all memory IDs for this conversation (needed for filtering)

SELECT id FROM memories WHERE conversation_id = $1

4c. Search Pinecone for semantically similar memories

index.query({
  vector: embedding,
  topK: Math.min(10, totalMemoriesInConversation),
  filter: { conversationId: { $eq: conversationId } },
  includeMetadata: true
})

Then:

  • Keep only matches with score >= 0.5

  • Double-check they actually belong to this conversation

  • Fetch full memory rows from PostgreSQL

Result → similarMemories array, max 10 items
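
The post-query filtering can be sketched as one pure function. `selectSimilar` is a hypothetical name; `matches` is assumed to have the `{ id, score, metadata }` shape of a Pinecone query response, and `ownedIds` is the result of step 4b.

```javascript
// Keep only strong matches that really belong to this conversation.
function selectSimilar(matches, conversationId, ownedIds, minScore = 0.5) {
  const owned = new Set(ownedIds);
  return matches
    .filter((m) => m.score >= minScore)                  // similarity threshold
    .filter(
      (m) => owned.has(m.id) && m.metadata?.conversationId === conversationId
    )                                                    // double-check ownership
    .sort((a, b) => b.score - a.score)                   // best match first
    .slice(0, 10);                                       // hard cap of 10
}
```

The full rows are then fetched from PostgreSQL by the returned IDs.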

4d. LLM Decides: ADD / UPDATE / DELETE (LLM Call as the judge)

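Tool-calling, forced function call:

tools: [{
  type: "function",
  function: {
    name: "decide_memory_action",
    parameters: {
      type: "object",
      properties: {
        action: { type: "string", enum: ["ADD", "UPDATE", "DELETE"] },
        memoryId: { type: "string" }   // omitted for ADD
      },
      required: ["action"]
    }
  }
}],
// forces the model to answer through this function
tool_choice: { type: "function", function: { name: "decide_memory_action" } }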

Prompt fed to the model:

Candidate fact: "User now eats chicken"

Existing similar memories:
1. ID: mem_abc123 → "User is vegan" (score: 0.91)
2. ID: mem_def456 → "User became vegan 3 months ago" (score: 0.87)

Decide: ADD (new), UPDATE (refines), or DELETE (contradicts)?
Return only the JSON via tool call.

The LLM is terrifyingly good at this.
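
Good as it is, the verdict is still model output, so a guard before execution is cheap insurance. The sketch below is an assumption about how such a guard might look, not part of the pipeline described here: an UPDATE or DELETE that points at a memory the model was never shown is downgraded to ADD.

```javascript
const ACTIONS = new Set(["ADD", "UPDATE", "DELETE"]);

// `args` is the parsed tool-call arguments; `similarMemories` is the list from step 4c.
function validateDecision(args, similarMemories) {
  const knownIds = new Set(similarMemories.map((m) => m.id));
  const action = ACTIONS.has(args?.action) ? args.action : "ADD";

  if (action === "ADD") return { action: "ADD", memoryId: null };

  // UPDATE/DELETE must reference a memory we actually showed the model.
  if (!knownIds.has(args.memoryId)) return { action: "ADD", memoryId: null };

  return { action, memoryId: args.memoryId };
}
```

Falling back to ADD is the conservative choice: a stray duplicate is easier to clean up than a wrongly deleted memory.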

4e. Execute the Verdict

Action: ADD → Brand new fact
  1. Insert into PostgreSQL → get new UUID

  2. Upsert into Pinecone with same UUID + new embedding + metadata

Action: UPDATE → Refine existing memory
  1. UPDATE memories SET content = $new, updated_at = NOW() WHERE id = $oldId

  2. Upsert the same ID in Pinecone with new embedding + new content
    → Same memory lives forever, just evolves

Action: DELETE → Direct contradiction
  1. DELETE FROM memories WHERE id = $oldId

  2. index.deleteMany([oldId]) in Pinecone
    → Memory is erased from existence. Gone. Dead.
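
Put together, the three branches can be sketched as a single function. `db` and `embed` are hypothetical injected helpers; `index` is assumed to expose `upsert(records)` and `deleteMany(ids)` in the style of the Pinecone Node client.

```javascript
// Apply one verdict to both stores, PostgreSQL first, then the vector index.
async function applyDecision({ db, index, embed }, conversationId, fact, decision) {
  // Build the vector record that mirrors a memories row.
  const toRecord = async (id) => ({
    id,
    values: await embed(fact),
    metadata: { conversationId, content: fact },
  });

  switch (decision.action) {
    case "ADD": {
      const id = await db.insertMemory(conversationId, fact); // returns the new UUID
      await index.upsert([await toRecord(id)]);
      return { action: "ADD", id };
    }
    case "UPDATE": {
      await db.updateMemory(decision.memoryId, fact);
      await index.upsert([await toRecord(decision.memoryId)]); // same ID, new vector
      return { action: "UPDATE", id: decision.memoryId };
    }
    case "DELETE": {
      await db.deleteMemory(decision.memoryId);
      await index.deleteMany([decision.memoryId]);
      return { action: "DELETE", id: decision.memoryId };
    }
    default:
      throw new Error(`Unknown action: ${decision.action}`);
  }
}
```

The two writes are not atomic. If the vector write fails after the row write, the stores disagree until the job is retried, so retries on the queue matter here.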

Step 5: Did We Change Enough to Refresh the Summary?

After all facts processed:

Basically, if the count of newly added facts is at least 3 or the count of updated facts is at least 2, we refresh the summary.

if (added >= 3 || updated >= 2) {
  // 'update' is the one-off job type handled by the summary worker
  summaryUpdateQueue.add('update', { conversationId }, { delay: 5000, jobId: unique })
}

Why delay 5 seconds? It lets multiple messages sent in quick succession batch together.
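
The `added` and `updated` counters come straight from the per-fact loop. A minimal sketch, assuming each processed fact yields a result object with an `action` field (the helper name and option names are hypothetical):

```javascript
// Decide whether this job changed enough to justify a summary refresh.
function shouldRefreshSummary(results, { minAdds = 3, minUpdates = 2 } = {}) {
  const added = results.filter((r) => r.action === "ADD").length;
  const updated = results.filter((r) => r.action === "UPDATE").length;
  return added >= minAdds || updated >= minUpdates;
}
```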

Real-World Examples

Example 1: ADD – "I live in Berlin"

  • No similar memories → score < 0.5

  • LLM decides ADD

  • New UUID created

  • Row + vector inserted

  • Memory now forever retrievable

Example 2: UPDATE – "I’m a vegetarian now" (was vegan earlier)

  • Embedding similarity to old "User is vegan" → 0.84

  • LLM sees contradiction → chooses UPDATE

  • Same memory ID kept

  • Content becomes "User is a vegetarian"

  • Vector updated

  • No duplicate created

Example 3: DELETE – "I don’t have a girlfriend anymore"

  • Old memory: "User's girlfriend is Kitkat"

  • Similarity → 0.89

  • LLM: direct contradiction → DELETE

  • Memory + vector completely removed

  • Asking “Who is my girlfriend?” later → honest “You told me you broke up”

Example 4: Chain reaction in one message

User says: “I quit veganism and broke up with Sarah”

Two facts extracted:

  1. "User is no longer vegan" → UPDATEs old vegan memory

  2. "User broke up with Sarah" → DELETEs girlfriend memory

One message, two surgical operations, zero duplicates.

Part 4: Summary Generation Background Process

This worker is the single most underrated piece of the entire memory system. It runs completely in the background, triggered in two ways:

  1. Memory worker says it has ≥3 adds or ≥2 updates

  2. Every 3 minutes automatically (periodic refresh), because users don’t always trigger thresholds.

Worker Configuration

  • Queue name: summary-update

  • Concurrency: 3 jobs at once

  • Job retention: completed → 1 hour, failed → 24 hours

  • Backend: Redis + BullMQ

  • Two job types:

    • update → one-off, fired by memory worker

    • periodic → repeat job, every 3 minutes per conversation

How Jobs Get Scheduled

On conversation creation:

summaryUpdateQueue.add('periodic', { conversationId }, {
  repeat: { every: 3 * 60 * 1000 },
  jobId: `summary-periodic-${conversationId}`
})

On server startup: we clean up old jobs and re-schedule periodic ones for every existing conversation.

Full Step-by-Step Flow

Step 1: Job Arrives


{ conversationId: "cvt_123" }

Step 2: Verify the Conversation Still Exists

SELECT 1 FROM conversations WHERE id = $1 LIMIT 1

If gone, kill periodic job + exit. No zombie summaries.

Step 3: Fetch the Current Summary

SELECT text FROM summaries WHERE conversation_id = $1 LIMIT 1
  • If it exists, use it

  • If not, fall back to an empty string

This is fed to the LLM so it doesn’t repeat itself.

Step 4: Grab the Last 50 Messages

SELECT role, content, created_at 
FROM messages 
WHERE conversation_id = $1 
ORDER BY created_at DESC 
LIMIT 50

We reverse them to chronological order before sending to LLM.

Edge case: for 0 messages, force empty summary + exit early
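
The reversal and formatting step is small enough to show in full. This is a sketch with a hypothetical `toTranscript` helper; it takes the newest-first rows from the query above and produces the chronological `role: content` lines used in the prompt.

```javascript
// Newest-first rows in, chronological transcript out.
function toTranscript(rowsNewestFirst) {
  return [...rowsNewestFirst]   // copy so the caller's array is not mutated
    .reverse()
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");
}
```

An empty result gives an empty string, which is the signal for the early exit.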

Step 5: Build the LLM Prompt

This prompt is the reason summaries never go rogue:

You are maintaining a concise, factual summary of a long-running conversation.

Current summary (may be outdated or empty):
Alex is a vegan who became vegan last month. He lives in Berlin with his dog Max.

Recent messages (last 50 – chronological order):
user: Yeah I actually stopped being vegan two weeks ago
assistant: Oh really? What made you quit?
user: Health reasons + I missed cheese too much
user: Also I'm moving to Barcelona next month
assistant: Nice! When exactly?
user: End of June

Task:
Rewrite the summary incorporating ONLY new permanent facts from the recent messages.

Rules:
• ≤400 tokens
• Only factual, permanent info (names, location, diet, relationships, plans, preferences)
• DO NOT include temporary chit-chat (“user asked about weather”)
• DO NOT repeat facts already in current summary UNLESS they changed
• Write in coherent paragraphs, not bullet points
• Focus on what the user has revealed about themselves
• Return ONLY the new summary text – no explanations

Return only the summary.

Step 6: One Single LLM Call – The Summarizer

openai.chat.completions.create({
  model: "gpt-4o-mini",
  temperature: 0.3,
  max_tokens: 400,
  messages: [
    { role: "system", content: "You are a factual summarizer. Never hallucinate." },
    { role: "user",   content: fullPromptAbove }
  ]
})

Example output:

Alex used to be vegan but quit two weeks ago due to health reasons and missing cheese. He lives in Berlin with his dog Max and is moving to Barcelona at the end of June.

Step 7: Write It Back to Database (UPSERT pattern)

SELECT 1 FROM summaries WHERE conversation_id = $1

If it exists, UPDATE:

UPDATE summaries 
SET text = $1, updated_at = NOW() 
WHERE conversation_id = $2

If it does not exist, INSERT:

INSERT INTO summaries (conversation_id, text, updated_at) 
VALUES ($1, $2, NOW())
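
With three summary jobs running concurrently, a SELECT followed by a separate write leaves a small window where two workers both decide to INSERT. One possible alternative, not what the current code does, is PostgreSQL's single-statement upsert. The sketch assumes `conversation_id` is the primary key (as in the schema above) and a client with a node-postgres style `query(text, values)` method.

```javascript
// Single-statement upsert: insert, or overwrite the existing row on conflict.
const UPSERT_SUMMARY_SQL = `
  INSERT INTO summaries (conversation_id, text, updated_at)
  VALUES ($1, $2, NOW())
  ON CONFLICT (conversation_id)
  DO UPDATE SET text = EXCLUDED.text, updated_at = NOW()
`;

// `client` is any object exposing query(text, values).
async function upsertSummary(client, conversationId, text) {
  await client.query(UPSERT_SUMMARY_SQL, [conversationId, text]);
}
```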

Step 8: Done

{ summaryLength: 312 }

Example

Before (old summary):

Alex is a full-time software engineer. He is vegan and has been for 18 months.

The user then says, over the next 40 messages:

  • “I actually quit veganism last month”

  • “I’m dating someone new named Kitkat”

  • “Moving to Lisbon in September for a new job”

  • “I adopted a cat named Luna”

  • “No longer doing keto either”

New summary generated:

Alex was vegan for 18 months but quit last month. He is now dating Kitkat and recently adopted a cat named Luna. He is moving to Lisbon in September for a new job as a software engineer.

Periodic Refresh

Even if the user sends 100 tiny messages that never trigger the threshold, every 3 minutes the summary worker wakes up and checks if there is anything new to process.

Without this worker, your memory layer is just a fancy log file. With it, your AI actually evolves with the user.

Part 5: How Developers and Users Actually Talk to the Brain

Now let’s zoom out and see how clean it looks from the outside. Nobody should ever have to touch the queues directly, so I’ve built a simple Node.js SDK for it.

  • memory.chat("your message")

    • User message → instant assistant reply (sub-500 ms)

    • Behind the scenes: message stored → reply generated → memory extraction job fired → summary refreshed if needed

    • Feels exactly like ChatGPT, but with a real brain attached.

  • memory.ask("What do you know about my diet?") – Pure memory retrieval.

    • Question → embedding → Pinecone search (top 25 most relevant memories) → final LLM call that answers using only real stored facts.

    • Returns an answer grounded only in stored facts in ~400 ms, even after years of conversation.

  • memory.say("anything") – The smart router. You don’t have to decide if it’s chat or ask.

    • Looks at the message: if it contains question words or recall phrases (“what”, “summarize”, “remind”, “do you know”, etc.) → routes to ask()

    • Otherwise → routes to chat()
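
A keyword router like that fits in a few lines. The sketch below is illustrative only: the exact pattern list is my assumption beyond the four examples named above, and `routeMessage` is a hypothetical name.

```javascript
// Patterns that suggest the user wants retrieval rather than conversation.
const ASK_PATTERNS = [
  /\?\s*$/,                                      // ends with a question mark
  /\b(what|who|when|where|which|why|how)\b/i,    // question words
  /\b(summari[sz]e|remind|recall)\b/i,           // explicit recall verbs
  /\bdo you (know|remember)\b/i,
];

function routeMessage(text) {
  return ASK_PATTERNS.some((re) => re.test(text)) ? "ask" : "chat";
}
```

Keyword routing is crude: a statement like “I know what I want” would be sent to ask(), so a real router may need a cheap classifier call for ambiguous cases.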

What Really Happens Inside memory.ask()

User calls: memory.ask("What do you know about my diet right now?")

  1. Question arrives at the API endpoint

  2. We immediately create an embedding for the entire question using text-embedding-3-small

  3. That vector is thrown at Pinecone with a filter for this exact conversationId

    • Top-K = 25

    • Only memories with similarity ≥ 0.78 make the cut.

  4. We pull the full text content of those matching memories from PostgreSQL

  5. One final LLM call receives:

    • The always-fresh conversation summary

    • The ranked list of relevant memories (with relative timestamps: “3 days ago”, “2 months ago”)

    • The original question

    • Strict instructions: only use the provided memories, never hallucinate, be warm but concise

  6. Answer sent back to the user in real time

  7. The moment the answer is sent, a tiny background job logs:

    • Question text

    • Which memory IDs were used

    • Retrieval latency

    • Model used

The user never waits for logging. The response is already in their hands.
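
The memory list handed to the final LLM call in step 5 can be sketched as follows. The field names (`score`, `content`, `updatedAt`) and the 30-day month approximation are assumptions for illustration; only the 0.78 threshold and the relative-timestamp idea come from the flow above.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Rough human-readable age; months are approximated as 30 days.
function relativeTime(then, now = new Date()) {
  const days = Math.floor((now.getTime() - then.getTime()) / DAY_MS);
  if (days < 1) return "today";
  if (days < 30) return `${days} day${days === 1 ? "" : "s"} ago`;
  if (days < 365) {
    const months = Math.floor(days / 30);
    return `${months} month${months === 1 ? "" : "s"} ago`;
  }
  const years = Math.floor(days / 365);
  return `${years} year${years === 1 ? "" : "s"} ago`;
}

// Filter by similarity, rank, and render one numbered line per memory.
function formatMemories(memories, now = new Date(), minScore = 0.78) {
  return memories
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .map((m, i) => `${i + 1}. ${m.content} (${relativeTime(m.updatedAt, now)})`)
    .join("\n");
}
```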

Conclusion

What I just showed you is an incomplete, minimalistic, yet fully functional implementation of the Mem0 research paper (2024). I shipped the vector + LLM-guided self-correction part first.

The graph layer from the paper, the one with entities, relationships, and temporal edges, is next. I’m starting it next week.

Below is a small thought experiment I have been mulling over on how to make the memory layer more cognitive and human-like. If it sparks any insights or thoughts, feel free to share them.

A Thought Experiment

When we were kids, someone told us “Your birthday is 12th April 2003” exactly once, and it stuck forever. In 10th grade I could recite every mark, every Merchant of Venice line, every history-book date end-to-end. Five years later? Gone. Not even a trace.

Why?

Because human memory isn’t a log. It’s a prioritised, decaying, attention-driven cache.

My birthday gets recalled every single year (and dozens of times in between) → reinforcement → score near 1.0 → instant, crystal-clear access. 10th-grade scores got hammered into my head for one year, then never accessed again → logarithmic decay → eventually evicted from high-priority recall.

Memory Importance Scoring + Reinforcement + Decay

Give every memory a dynamic importance score (0–1).

  • Extremely personal facts (birthday, name, trauma, core values) → seed score 0.95+

  • Transient facts (exam scores, random preferences) → seed score 0.3–0.6

Then run a background worker that:

  • Logs every single recall (every time a memory is retrieved in ask() or injected into chat)

  • Increases score slightly on recall

  • Applies slow logarithmic decay on untouched memories

Result:

  • Your birthday score climbs to 1.0 or somewhere close and stays there because it’s retrieved every April + random conversations

  • Your 10th-grade marks spike to 0.9 during 11th grade, then decay to 0.05 over five years. They are still stored, but no longer surface unless explicitly requested in a high-detail mode
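
To make the idea tangible, here is one way the two operations might look. The formulas and constants are my own assumptions, not a tuned design: decay divides the score by a factor that grows with the logarithm of idle days, and reinforcement moves the score a fixed fraction of the way toward 1.0.

```javascript
// Logarithmic decay: score / (1 + k * ln(1 + idleDays)).
// k controls how quickly untouched memories fade (assumed value).
function decay(score, idleDays, k = 0.5) {
  return score / (1 + k * Math.log1p(Math.max(0, idleDays)));
}

// Reinforcement on recall: close a fixed fraction of the gap to 1.0.
function reinforce(score, rate = 0.1) {
  return Math.min(1, score + rate * (1 - score));
}
```

A background worker would call `reinforce` whenever a memory is retrieved and apply `decay` based on days since the last recall; retrieval could then rank by similarity multiplied by this score.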

Final Words

There is still so much to fix. The SDK needs to be more minimal with fewer configs. We need to support every major provider and local models properly (right now it’s just OpenAI + Gemini).

But even in its current raw, unpolished state, this memory layer already turns a stateless LLM into something that feels alive.

I truly believe the next real leap in AI won’t come from another 175B → 1.8T parameter jump.

It will come from agents that remember you.

If you’ve made it to the end, thank you so much. See you in the next one.