What is short-term memory in an AI agent?

It is the conversation so far — the running list of user and assistant messages that you resend on every call. Because a single LLM call is stateless, this list is the only thing that lets the agent remember earlier turns within a session.

Why does my AI agent forget what I just said?

Because each LLM call only sees the messages you pass in that request. If you send just the latest question, the model has no idea what came before. The fix is to keep every turn in a messages list and resend it each time.

What is the context window and why does it matter for memory?

The context window is the maximum number of tokens a model can read in one call. Your memory lives inside it, so a long conversation eventually hits the limit — at which point you must trim or summarize older turns to keep going.

What is the difference between short-term and long-term agent memory?

Short-term memory lasts one session and lives in the messages list. Long-term memory survives across sessions and usually lives in a database or vector store you search and load back in. This post covers short-term; long-term is a later step.

AI Agent Memory in Python: Short-Term Memory From Scratch


By Sukhveer Kaur


June 19, 2026


Beginner


Table of Contents

Why Your Agent Forgets

Short-Term AI Agent Memory: Keep the Messages List

Memory Has a Limit: the Context Window

Trim and Summarize

Short-Term vs Long-Term Memory

Quick Recap

Conclusion

Series: AI Agents from Scratch in Python. This is Part 4. So far, Part 1 made your first LLM call, Part 2 added tools, and Part 3 wrapped them in the agent loop. Now we add memory, the one thing that loop can’t do. If your Part 3 agent runs, you’re ready.

The agent you built in Part 3 has amnesia. Tell it your name, ask “what’s my name?” on the next turn, and it has no idea, because it forgets everything the moment a turn ends. This post fixes that by giving your agent short-term memory that carries the conversation from one turn to the next.

The good news is that short-term memory is almost embarrassingly simple — it is a Python list. The interesting part is what happens when that list grows too big for the model to read, which is where most beginners get stuck. We’ll build the memory first, then handle the limit.

🎯 Key takeaways

Short-term memory is just a list of messages you resend on every call — there’s no special API feature.
The context window is the real constraint: a long chat overflows the token limit and quietly raises your bill.
Two fixes: a sliding window (keep the last N turns) and a rolling summary (condense the old ones).

Why Your Agent Forgets

A single LLM call is stateless. The model has no hidden memory of your last request — it only sees the messages you send in this call. So if you send just the newest question, the model genuinely cannot know what came before.

Figure: Without memory, the call sees only the current message and forgets earlier turns. With memory, you resend the whole conversation so the model remembers.

The fix follows directly from the cause: if the model only knows what you send, then send the whole conversation every time. That is all short-term memory is — you keep a list of every message, user and assistant, and pass the full list on each call. There is no special “memory” feature in the API; you build it by what you choose to resend.
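To see the difference without spending API calls, here is a minimal sketch that uses a stand-in model. `fake_model` is hypothetical and not part of any library: it can only "know" a name if that name appears somewhere in the messages it receives, which is the same constraint a real stateless call has.

```python
import re

def fake_model(messages):
    """Hypothetical stand-in for an LLM call: it sees only `messages`."""
    # Look for "my name is X" in any user message it was actually given.
    for m in messages:
        if m["role"] == "user":
            found = re.search(r"my name is (\w+)", m["content"], re.IGNORECASE)
            if found:
                return f"Your name is {found.group(1)}."
    return "I don't know your name."

history = [
    {"role": "user", "content": "My name is Sukhveer."},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "What's my name?"},
]

# Without memory: send only the newest message.
print(fake_model(history[-1:]))   # I don't know your name.

# With memory: send the whole conversation.
print(fake_model(history))        # Your name is Sukhveer.
```

The function is identical in both calls. Only the list passed in changes, and that list is the memory.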

It helps to recall the three message roles from earlier in the series. The system message sets the agent’s behaviour, user messages are what the person says, and assistant messages are the model’s own replies. Memory is just keeping all three, in order — the transcript of the whole conversation, replayed on every call so the model can read its own past.

Short-Term AI Agent Memory: Keep the Messages List

Here is a complete chat agent that remembers. The only new idea versus Part 1 is that we append both sides of the conversation to one list and reuse it every turn.

python

from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user = input("You: ")
    if user.strip() == "quit":
        break
    messages.append({"role": "user", "content": user})         # remember the question

    reply = client.chat.completions.create(
        model="gpt-5.4-mini", messages=messages,
    )
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})   # remember the answer
    print("Agent:", answer)

Run it, say “My name is Sukhveer,” then ask “What’s my name?” — this time it answers correctly. The two append lines are the entire memory system. Because the assistant’s replies go back into messages, the model sees its own past answers too, so it stays consistent across the whole chat.

Notice the system message is created once, before the loop, and never removed — it should survive every turn so the agent keeps its instructions. This is exactly how a chat assistant remembers within a single session: there is no hidden store behind the scenes, only a transcript that grows with each exchange and gets sent back in full.

That is the same messages list from Parts 1 through 3, just never thrown away. If you plugged your Part 3 tool loop in here instead of a plain call, you would have an agent that both acts and remembers — which is most of what a real assistant is.

💡 Test it in 30 seconds. Run three quick checks: tell the agent a fact, ask for that fact a few turns later, then ask it to summarize the chat. If all three land, your short-term memory is wired correctly.

Memory Has a Limit: the Context Window

There is a catch, and it is the part the simple version hides. Every model has a context window — the maximum number of tokens it can read in one call. Your growing messages list lives inside that window, so a long enough conversation will eventually overflow it.

Figure: As the conversation grows, it passes the token limit. A sliding window keeps the system prompt plus recent turns, while a rolling summary condenses older turns.

A growing list hurts you in two ways. Once it overflows the window, the API returns an error. Well before that point, your bill rises quietly, because you pay per token and you are resending the entire history on every single turn. A fifty-message chat means you resend forty-nine old messages just to add one. Unbounded memory is not free memory. So past a certain length, you have to decide what the agent keeps.

A quick example shows the stakes. If each turn is roughly 100 tokens and you chat for fifty turns, the final call resends about 5,000 tokens of history just to add one new line — and you pay for all of it, on every turn. A bigger context window does not rescue you here: even when everything fits, more history means more tokens read per call, so cost and latency both climb with the length of the chat.
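The arithmetic above can be sketched in a few lines. This is a simplified model, not a billing calculator: it assumes every turn adds one message of exactly 100 tokens, counts input tokens only, and ignores the system prompt and output tokens. The function names are made up for illustration.

```python
def input_tokens_per_call(call_number, tokens_per_message=100):
    """Input tokens sent on the Nth call when all earlier messages are resent."""
    return call_number * tokens_per_message

def total_input_tokens(calls, tokens_per_message=100):
    """Input tokens summed over every call made so far."""
    return sum(
        input_tokens_per_call(n, tokens_per_message)
        for n in range(1, calls + 1)
    )

print(input_tokens_per_call(50))   # 5000   -> the final call alone
print(total_input_tokens(50))      # 127500 -> all fifty calls together
print(total_input_tokens(100))     # 505000 -> twice as long, about 4x the total
```

Under these assumptions the per-call size grows linearly with the chat, while the running total grows roughly with the square of its length. That is why a cap on history pays off even when everything still fits in the window.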

🔑 Key point: Memory has a running cost, not just a size limit. Every message in the list is re-sent — and re-billed — on every single turn, so trimming saves money as well as space.

To know when you are close, you can count tokens before you send — OpenAI’s tiktoken library does exactly that. But you do not need exact counts to start; a simple cap on the number of messages is enough for most small agents, and it is what we’ll do next.
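If you do want a token count, a rough sketch with `tiktoken` might look like the following. Several things here are assumptions: the `o200k_base` encoding may not be the one your model uses, the count covers message content only (the API adds a little overhead per message), and `TOKEN_BUDGET` is an illustrative number rather than a real model limit. If `tiktoken` is not installed or cannot load its vocabulary, the sketch falls back to the common "about four characters per token" estimate for English text.

```python
def make_token_counter(encoding_name="o200k_base"):
    """Return a function that maps text to an (approximate) token count."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding(encoding_name)
        # disallowed_special=() treats text like "<|endoftext|>" as plain text.
        return lambda text: len(enc.encode(text, disallowed_special=()))
    except Exception:
        # tiktoken missing or unusable: rough estimate of 4 characters per token.
        return lambda text: max(1, len(text) // 4)

count = make_token_counter()

def history_tokens(messages):
    """Approximate tokens across all message contents (no per-message overhead)."""
    return sum(count(m["content"]) for m in messages)

TOKEN_BUDGET = 8000  # hypothetical budget, chosen only for illustration

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Sukhveer."},
]

if history_tokens(messages) > TOKEN_BUDGET:
    print("History is over budget: trim or summarize before the next call.")
else:
    print("History fits:", history_tokens(messages), "tokens (approx.)")
```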

Trim and Summarize

There are two standard ways to keep memory under the limit, and real agents often use both. The first is a sliding window: keep the system prompt and the most recent turns, drop the oldest.

python

MAX_MESSAGES = 20                       # keep the last 20 messages (10 exchanges)

def trim(messages, keep=MAX_MESSAGES):
    system = messages[:1]               # always keep the system prompt
    recent = messages[1:][-keep:]       # keep only the most recent messages
    return system + recent

messages = trim(messages)               # call this each loop, before the API call

Trimming is deterministic, instant, and free, with no extra model call. Its weakness is just as clear: a fact from early in the chat (“my name is Sukhveer”) vanishes once it scrolls out of the window. For many agents that is fine; for some it is not.

A practical default is to keep the last ten exchanges. And if the agent depends on a fixed instruction or an identifier the user gave early — an order number, a project name — pin that detail into the system message, which trimming never touches, so it survives no matter how long the conversation runs.
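One way to pin a detail is to append it to the system message. The sketch below assumes the system prompt sits at index 0, as it does in this post. `pin_fact` is a hypothetical helper, and the order number is made up for the example.

```python
def pin_fact(messages, fact):
    """Append a fact to the system message so trimming cannot drop it."""
    system = messages[0]
    if system["role"] != "system":
        raise ValueError("expected the system prompt at index 0")
    if fact not in system["content"]:            # do not pin the same fact twice
        system["content"] += f"\nPinned fact: {fact}"
    return messages

def trim(messages, keep=20):
    """Sliding window: system prompt plus the last `keep` messages."""
    return messages[:1] + messages[1:][-keep:]

messages = [{"role": "system", "content": "You are a helpful assistant."}]
pin_fact(messages, "Order number is A-1042")     # illustrative value

# Simulate a long chat: 30 exchanges = 60 messages.
for i in range(30):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

messages = trim(messages)
print(len(messages))                             # 21
print("A-1042" in messages[0]["content"])        # True
```

The early turns are gone, but the pinned detail still reaches the model on every call. Keep pinned facts short, since they are billed on every turn like everything else in the list.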

The second method is a rolling summary: condense the old turns into a short note instead of dropping them.

python

def summarize(old_messages):
    # Flatten the old turns into plain "role: content" lines for the prompt.
    text = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    resp = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in 3 sentences:\n{text}"}],
    )
    # Return one message that stands in for all the turns it replaces.
    return {"role": "system", "content": "Summary so far: " + resp.choices[0].message.content}

⚠️ Common mistake: Summarizing on every turn. That adds a model call — latency and cost — to each message. Only summarize when the history actually grows large, then replace the old turns with the one summary message.

My own rule, after shipping a few of these: start with a sliding window, and only add summarization when losing early facts actually causes a bug. Most small agents never need more. The combination most production agents land on is a summary of old turns plus the last few messages kept verbatim — the recent detail stays sharp, the old gist survives.
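A minimal sketch of that combination follows. `compact` is a hypothetical helper, and its thresholds are arbitrary. It takes the summarizer as an argument, so in real use you would pass the `summarize` function from this post; the stub here only keeps the example runnable without an API key.

```python
def compact(messages, summarize, max_messages=20, keep_recent=6):
    """Summarize old turns, but only once the history exceeds max_messages.

    Assumes messages[0] is the system prompt and keep_recent >= 1.
    """
    if len(messages) <= max_messages:
        return messages                      # small enough: no extra model call
    system = messages[:1]
    old = messages[1:-keep_recent]           # everything between system and recent
    recent = messages[-keep_recent:]         # kept verbatim
    return system + [summarize(old)] + recent

def fake_summarize(old_messages):
    """Stub summarizer so the sketch runs offline."""
    return {"role": "system",
            "content": f"Summary so far: {len(old_messages)} earlier messages."}

messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(15):                          # 15 exchanges = 30 messages
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

compacted = compact(messages, fake_summarize)
print(len(compacted))                        # 8
print(compacted[1]["content"])               # Summary so far: 24 earlier messages.
```

Because the summary message sits right after the system prompt, the next compaction folds it into the new summary along with the other old turns, which is what makes the summary roll forward.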

Short-Term vs Long-Term Memory

Everything above is short-term memory: it lives in a Python list and disappears the moment your program exits. Restart the script and the agent has forgotten you again. That is the right scope for a single conversation, and it is all many agents ever need.

Long-term memory is the next step up — memory that survives across sessions. Instead of a list in RAM, you store messages or facts in a database, and on a new session you search for the relevant ones and load them back into the context window.

That retrieval step is usually powered by a vector store, and it is a bigger topic than this post. If you want the production version with a managed store, I cover persistent memory in the build series’ memory part.
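To make the idea concrete, here is a deliberately small sketch that uses SQLite from the standard library. It is not the managed-store approach mentioned above: a plain keyword match stands in for vector search, and every function name is hypothetical. Pass a file path instead of `":memory:"` if you want the facts to survive a restart.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tiny fact store."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS facts (text TEXT NOT NULL)")
    return db

def remember(db, fact):
    """Save one fact for later sessions."""
    db.execute("INSERT INTO facts (text) VALUES (?)", (fact,))
    db.commit()

def recall(db, keyword, limit=3):
    """Keyword search: a crude stand-in for vector similarity search."""
    rows = db.execute(
        "SELECT text FROM facts WHERE text LIKE ? LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return [row[0] for row in rows]

def start_session(db, keyword):
    """Build a fresh messages list with the relevant facts loaded in."""
    content = "You are a helpful assistant."
    facts = recall(db, keyword)
    if facts:
        content += "\nKnown facts:\n" + "\n".join(f"- {f}" for f in facts)
    return [{"role": "system", "content": content}]

db = open_store()
remember(db, "The user's name is Sukhveer.")
remember(db, "Preferred language is Python.")

messages = start_session(db, "name")
print(messages[0]["content"])
```

The shape is the same as the real thing: store facts outside the program, search for the relevant ones, and place them in the context window at the start of a new session.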

The honest boundary: do not reach for a vector database to remember a phone number from three messages ago. Short-term memory handles the conversation; long-term memory handles everything that should outlive it. Knowing which problem you actually have saves you a lot of wasted infrastructure.

Quick Recap

The whole of short-term memory, in five lines:

Memory = a messages list you append to and resend on every call.
Keep all three roles (system, user, assistant), in order.
Watch the context window — long chats overflow the token limit and cost more.
Trim with a sliding window (keep system + last N turns) for a simple, free fix.
Summarize older turns only when trimming starts dropping facts you need.

Conclusion

You gave your agent a memory, and it turned out to be a list you resend, plus a plan for when that list gets too big. That is the whole of short-term memory: keep the conversation, watch the context window, and trim or summarize before you overflow. No framework, no database — just the messages you choose to carry forward.

Across four parts, your agent has learned to talk (Part 1), act (Parts 2 and 3), and remember (Part 4), which are the core pieces of a real assistant. The next frontier is memory that survives a restart, which is where long-term storage comes in.

What would you want your agent to never forget — your name, your preferences, the last file it edited? Tell me in the comments; the answer usually reveals whether you need short-term or long-term memory.

Read next: AI Agent Memory with a Managed Store (build series) for the persistent, production version, or revisit the agent loop in Part 3 to plug this memory into a tool-using agent.