Series: AI Agents from Scratch in Python. This is Part 4. So far, Part 1 made your first LLM call, Part 2 added tools, and Part 3 wrapped them in the agent loop. Now we add memory, the one thing that loop can’t do on its own. If your Part 3 agent runs, you’re ready.
The agent you built in Part 3 has amnesia. Tell it your name, ask “what’s my name?” on the next turn, and it has no idea, because it forgets everything the moment a turn ends. This post fixes that by giving your agent short-term memory that carries the conversation from one turn to the next.
The good news is that short-term memory is almost embarrassingly simple — it is a Python list. The interesting part is what happens when that list grows too big for the model to read, which is where most beginners get stuck. We’ll build the memory first, then handle the limit.
- Short-term memory is just a list of messages you resend on every call — there’s no special API feature.
- The context window is the real constraint: a long chat overflows the token limit and quietly raises your bill.
- Two fixes: a sliding window (keep the last N turns) and a rolling summary (condense the old ones).
Why Your Agent Forgets
A single LLM call is stateless. The model has no hidden memory of your last request — it only sees the messages you send in this call. So if you send just the newest question, the model genuinely cannot know what came before.
The fix follows directly from the cause: if the model only knows what you send, then send the whole conversation every time. That is all short-term memory is — you keep a list of every message, user and assistant, and pass the full list on each call. There is no special “memory” feature in the API; you build it by what you choose to resend.
It helps to recall the three message roles from earlier in the series. The system message sets the agent’s behaviour, user messages are what the person says, and assistant messages are the model’s own replies. Memory is just keeping all three, in order — the transcript of the whole conversation, replayed on every call so the model can read its own past.
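To see statelessness without spending any tokens, here is a small sketch that swaps the real model for a stand-in. `fake_model` is a hypothetical helper, not part of any API; like a real call, it can only use the messages it is handed.

```python
import re

def fake_model(messages):
    """Stand-in for an LLM call (hypothetical): it can only use what it is sent."""
    for m in messages:
        found = re.search(r"[Mm]y name is (\w+)", m["content"])
        if m["role"] == "user" and found:
            return f"Your name is {found.group(1)}."
    return "I don't know your name."

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Sukhveer."},
    {"role": "assistant", "content": "Nice to meet you, Sukhveer."},
]
question = {"role": "user", "content": "What's my name?"}

# Sending only the newest question: the earlier turn is invisible.
without_memory = fake_model([question])

# Sending the whole transcript: the same function can now answer.
with_memory = fake_model(history + [question])

print(without_memory)
print(with_memory)
```

Nothing about the function changed between the two calls. Only the list it received did, which is the whole idea behind short-term memory.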
Short-Term AI Agent Memory: Keep the Messages List
Here is a complete chat agent that remembers. The only new idea versus Part 1 is that we append both sides of the conversation to one list and reuse it every turn.
```python
from openai import OpenAI

client = OpenAI()

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user = input("You: ")
    if user.strip() == "quit":
        break
    messages.append({"role": "user", "content": user})  # remember the question
    reply = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=messages,
    )
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # remember the answer
    print("Agent:", answer)
```
Run it, say “My name is Sukhveer,” then ask “What’s my name?” — this time it answers correctly. The two append lines are the entire memory system. Because the assistant’s replies go back into messages, the model sees its own past answers too, so it stays consistent across the whole chat.
Notice the system message is created once, before the loop, and never removed — it should survive every turn so the agent keeps its instructions. This is exactly how a chat assistant remembers within a single session: there is no hidden store behind the scenes, only a transcript that grows with each exchange and gets sent back in full.
That is the same messages list from Parts 1 through 3, just never thrown away. If you plugged your Part 3 tool loop in here instead of a plain call, you would have an agent that both acts and remembers — which is most of what a real assistant is.
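One way to make that swap easy is to pull a single turn into a function that takes the model call as a parameter. This is only a sketch: `chat_turn`, `call_model`, and `echo_model` are names invented here, and the stub stands in for either a plain API call or your Part 3 tool loop.

```python
def chat_turn(messages, user_text, call_model):
    """Run one turn: record the question, get a reply, record the reply."""
    messages.append({"role": "user", "content": user_text})
    answer = call_model(messages)  # sees the full transcript so far
    messages.append({"role": "assistant", "content": answer})
    return answer

def echo_model(messages):
    """Hypothetical stub used here instead of a real API call."""
    return f"I have seen {len(messages)} messages."

messages = [{"role": "system", "content": "You are a helpful assistant."}]
first = chat_turn(messages, "Hello", echo_model)
second = chat_turn(messages, "Still there?", echo_model)

print(first)
print(second)
```

Because the memory lives in `messages` and not in the model function, any callable that accepts the list and returns a string can be dropped in without touching the two append lines.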
Memory Has a Limit: the Context Window
There is a catch, and it is the part the simple version hides. Every model has a context window — the maximum number of tokens it can read in one call. Your growing messages list lives inside that window, so a long enough conversation will eventually overflow it.
There are two costs here. Overflow the window and you get an error from the API. Long before that point, your bill rises quietly, because you pay per token and you are resending the entire history on every single turn. A fifty-message chat means you re-send forty-nine old messages just to add one. Unbounded memory is not free memory. So past a certain length, you have to decide what the agent keeps.
A quick example shows the stakes. If each turn is roughly 100 tokens and you chat for fifty turns, the final call resends about 5,000 tokens of history just to add one new line — and you pay for all of it, on every turn. A bigger context window does not rescue you here: even when everything fits, more history means more tokens read per call, so cost and latency both climb with the length of the chat.
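The same arithmetic can be run for the whole chat, not just the final call. This sketch uses only the rough figures from the example (100 tokens per turn, fifty turns) and ignores output tokens and any caching discounts a provider may offer.

```python
TOKENS_PER_TURN = 100  # rough figure from the example above
TURNS = 50

# Call n resends everything so far, so it carries about n * 100 tokens.
per_call = [TOKENS_PER_TURN * n for n in range(1, TURNS + 1)]

final_call = per_call[-1]              # size of the last request
total_sent = sum(per_call)             # input tokens sent across the whole chat
new_content = TOKENS_PER_TURN * TURNS  # what was actually said

print(final_call, total_sent, new_content)
```

Under these assumptions the chat contains 5,000 tokens of conversation but sends 127,500 input tokens in total, because each call repeats everything before it. Total cost grows with the square of the number of turns, not with the number of turns.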
To know when you are close, you can count tokens before you send — OpenAI’s tiktoken library does exactly that. But you do not need exact counts to start; a simple cap on the number of messages is enough for most small agents, and it is what we’ll do next.
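If you do want a number, a budget check might look like the sketch below. The helper names and the budget are made up for illustration. The fallback of about four characters per token is only a rule of thumb for English text, and real requests add a few tokens of overhead per message, so treat the result as an estimate.

```python
def estimate_tokens(messages, encode=None):
    """Estimate the tokens in a messages list.

    encode: optional callable mapping text to a list of tokens. With
    tiktoken installed you could pass
    tiktoken.get_encoding("cl100k_base").encode; that encoding name is an
    assumption, so check which one your model actually uses.
    Without an encoder, fall back to a rough 4-characters-per-token guess.
    """
    total = 0
    for m in messages:
        text = m["content"]
        if encode is not None:
            total += len(encode(text))
        else:
            total += max(1, len(text) // 4)
    return total

TOKEN_BUDGET = 3000  # hypothetical budget, set well below the model's limit

def over_budget(messages, budget=TOKEN_BUDGET, encode=None):
    """True when the estimated size of the list exceeds the budget."""
    return estimate_tokens(messages, encode) > budget
```

Calling `over_budget(messages)` before each API call gives you a trigger for trimming that is based on size, not on message count.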
Trim and Summarize
There are two standard ways to keep memory under the limit, and real agents often use both. The first is a sliding window: keep the system prompt and the most recent turns, drop the oldest.
```python
MAX_TURNS = 20  # messages to keep: 20 messages is the last 10 exchanges

def trim(messages, keep=MAX_TURNS):
    system = messages[:1]          # always keep the system prompt
    recent = messages[1:][-keep:]  # keep only the most recent messages
    return system + recent

messages = trim(messages)  # call this each loop, before the API call
```
Trimming is deterministic, instant, and free — no extra model call. Its weakness is honest: a fact from early in the chat (“my name is Sukhveer”) vanishes once it scrolls past the window. For many agents that is fine; for some it is not.
A practical default is to keep the last ten exchanges, which is the 20 messages that MAX_TURNS allows above. And if the agent depends on a fixed instruction or an identifier the user gave early (an order number, a project name), pin that detail into the system message, which trimming never touches, so it survives no matter how long the conversation runs.
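Pinning can be as small as appending a line to the system message. The sketch below is one possible shape, not a standard API: `pin_fact` and the order number are invented for illustration, and `trim` is repeated so the block runs on its own.

```python
def trim(messages, keep=20):
    """Sliding window, same idea as the trim function above."""
    return messages[:1] + messages[1:][-keep:]

def pin_fact(messages, fact):
    """Append a fact to the system message so trimming cannot drop it."""
    system = dict(messages[0])  # copy so the caller's dict is not mutated
    system["content"] += f"\nPinned fact: {fact}"
    return [system] + messages[1:]

messages = [{"role": "system", "content": "You are a helpful assistant."}]
messages.append({"role": "user", "content": "My order number is A-1042."})
messages = pin_fact(messages, "Order number is A-1042.")

# Simulate a long chat that pushes the original message out of the window.
for i in range(30):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim(messages)
print(trimmed[0]["content"])
```

After trimming, the user message that first mentioned the order number is gone, but the copy in the system message is still there for the model to read.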
The second method is a rolling summary: condense the old turns into a short note instead of dropping them.
```python
def summarize(old_messages):
    # Flatten the old turns into plain text the model can read.
    text = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    # Reuses the client created in the chat loop above.
    resp = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 3 sentences:\n{text}",
        }],
    )
    return {
        "role": "system",
        "content": "Summary so far: " + resp.choices[0].message.content,
    }
```
My own rule, after shipping a few of these: start with a sliding window, and only add summarization when losing early facts actually causes a bug. Most small agents never need more. The combination most production agents land on is a summary of old turns plus the last few messages kept verbatim — the recent detail stays sharp, the old gist survives.
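Here is one way that combination could be wired together. It is a sketch under assumptions: `compact` and `keep_recent` are names chosen here, and `fake_summarize` is a stub so the example runs without an API call. In a real agent you would pass the `summarize` function from above instead.

```python
def compact(messages, summarize, keep_recent=6):
    """Replace old turns with one summary message; keep recent ones verbatim.

    summarize: callable that takes a list of messages and returns a single
    message dict, like the summarize function above.
    """
    system, rest = messages[:1], messages[1:]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to condense yet
    split = len(rest) - keep_recent
    old, recent = rest[:split], rest[split:]
    return system + [summarize(old)] + recent

def fake_summarize(old_messages):
    """Hypothetical stub so the sketch runs without an API call."""
    return {"role": "system",
            "content": f"Summary so far: {len(old_messages)} earlier messages."}

messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

compacted = compact(messages, fake_summarize)
print(len(messages), "->", len(compacted))
```

On later calls the previous summary sits among the old messages and gets folded into the next summary, which is what makes it rolling. Each compaction costs one extra model call, so it makes sense to run it only when the list passes a threshold, not on every turn.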
Short-Term vs Long-Term Memory
Everything above is short-term memory: it lives in a Python list and disappears the moment your program exits. Restart the script and the agent has forgotten you again. That is the right scope for a single conversation, and it is all many agents ever need.
Long-term memory is the next step up — memory that survives across sessions. Instead of a list in RAM, you store messages or facts in a database, and on a new session you search for the relevant ones and load them back into the context window.
That retrieval step is usually powered by a vector store, and it is a bigger topic than this post. If you want the production version with a managed store, I cover persistent memory in the build series’ memory part.
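To make the store-and-reload idea concrete, here is a deliberately small sketch using Python’s built-in sqlite3 module. Plain keyword matching stands in for vector search, which is a simplification and not how a production store retrieves. The table layout and every function name are assumptions made for illustration.

```python
import sqlite3

def open_store(path):
    """Open (or create) a small facts table. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (text TEXT NOT NULL)")
    return conn

def save_fact(conn, text):
    """Persist one fact so it outlives the current process."""
    conn.execute("INSERT INTO facts (text) VALUES (?)", (text,))
    conn.commit()

def search_facts(conn, keyword, limit=3):
    """Keyword match standing in for vector search."""
    rows = conn.execute(
        "SELECT text FROM facts WHERE text LIKE ? LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return [row[0] for row in rows]

def start_session(conn, keyword):
    """Build a fresh messages list with relevant stored facts loaded in."""
    facts = search_facts(conn, keyword)
    system = "You are a helpful assistant."
    if facts:
        system += "\nKnown facts:\n" + "\n".join(f"- {f}" for f in facts)
    return [{"role": "system", "content": system}]
```

Opened with a file path, the store keeps its facts between runs, so a new session can start with a system message that already contains what earlier sessions saved. The short-term list from the rest of this post then takes over for the conversation itself.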
The honest boundary: do not reach for a vector database to remember a phone number from three messages ago. Short-term memory handles the conversation; long-term memory handles everything that should outlive it. Knowing which problem you actually have saves you a lot of wasted infrastructure.
Quick Recap
The whole of short-term memory, in five lines:
- Memory = a messages list you append to and resend on every call.
- Keep all three roles (system, user, assistant), in order.
- Watch the context window — long chats overflow the token limit and cost more.
- Trim with a sliding window (keep system + last N turns) for a simple, free fix.
- Summarize older turns only when trimming starts dropping facts you need.
Conclusion
You gave your agent a memory, and it turned out to be a list you resend, plus a plan for when that list gets too big. That is the whole of short-term memory: keep the conversation, watch the context window, and trim or summarize before you overflow. No framework, no database — just the messages you choose to carry forward.
Your agent can now talk (Part 1), act (Parts 2 and 3), and remember (Part 4): the three core abilities of a real assistant, built across four parts. The next frontier is memory that survives a restart, which is where long-term storage comes in.
What would you want your agent to never forget — your name, your preferences, the last file it edited? Tell me in the comments; the answer usually reveals whether you need short-term or long-term memory.
Read next: AI Agent Memory with a Managed Store (build series) for the persistent, production version, or revisit the agent loop in Part 3 to plug this memory into a tool-using agent.








