I’ve been watching the agent space for a while, and one thing that keeps bugging me is how these systems handle the same problems over and over again. You’d think that after the hundredth time an agent falls into an infinite scroll trap, it would learn. But most of them don’t. They just keep banging their heads against the same wall.
Google’s new ReasoningBank framework, published at ICLR this year, is trying to fix that. And honestly, it’s one of the more sensible approaches I’ve seen in a while.
The core problem: agents don’t learn from experience
Agents are getting pretty good at navigating websites, editing code, and handling multi-step tasks. But once they’re deployed and running long-term, they hit a wall. They don’t have a way to look back at what worked and what didn’t, and then adjust their strategy for next time.
Existing memory systems have two big flaws:
First, they tend to log every single action in excruciating detail. You get a massive dump of “clicked button X”, “scrolled down”, “typed text Y”. That’s useful for replaying a specific task, but it doesn’t teach the agent anything about the why behind the actions. It’s like giving someone a transcript of a chess game without explaining the strategy behind each move.
Second, most systems only bother to remember successful runs. They completely ignore failures. That’s insane. Some of the best lessons come from screwing up. If you only learn from wins, you never build guardrails against the dumb mistakes you’ll keep making.
How ReasoningBank works
ReasoningBank is a memory framework that stores high-level, structured reasoning patterns instead of raw action logs. Each memory has three parts:
- Title: A short label like “Handle pagination on dynamic sites”
- Description: A quick summary of when to use this strategy
- Content: The actual distilled reasoning steps, decision rationales, and operational insights
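To make the three-part structure concrete, here is a minimal sketch of what one memory item could look like in code. The class name `MemoryItem`, the `to_prompt` helper, and the example description text are my own illustrative assumptions, not the paper's schema or the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical container mirroring the three parts listed above."""

    title: str        # short label for the strategy
    description: str  # when the strategy applies
    content: str      # distilled reasoning steps and rationale

    def to_prompt(self) -> str:
        """Render the item as plain text that could be placed in an agent's context."""
        return f"## {self.title}\nWhen to use: {self.description}\n{self.content}"


# Illustrative values only; the description and content wording is made up here.
item = MemoryItem(
    title="Handle pagination on dynamic sites",
    description="Use when results load incrementally as the page is scrolled.",
    content="Verify the current page identifier before attempting to load more results.",
)
print(item.to_prompt())
```

The point of keeping the fields separate is that the short title and description can be used for matching against a new task, while the longer content is only what gets read once the item is selected.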
The system runs in a continuous loop. Before the agent takes action, it pulls relevant memories from the bank into its context. After it finishes a task, it uses an LLM-as-a-judge to assess its own trajectory, then extracts success insights from wins and, crucially, failure reflections from losses. Those new memories go back into the bank for the next task.
Here’s the kicker: the self-judgment doesn’t need to be perfect. The researchers found ReasoningBank is surprisingly robust against noisy or imperfect evaluations. That’s a big deal, because perfect self-assessment is basically impossible for complex tasks.
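The retrieve, act, judge, distill cycle described above can be sketched roughly like this. Everything here is an assumption for illustration: the function names, the tuple layout, and especially the word-overlap retrieval, which is a crude stand-in for whatever retrieval the real system uses. The `act`, `judge`, and `distill` callables are placeholders for the agent, the LLM judge, and the memory extractor.

```python
from typing import Callable, List, Tuple

Memory = Tuple[str, str, str]  # (title, description, content)


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; deliberately simplistic."""
    return set(text.lower().split())


def retrieve(bank: List[Memory], task: str, k: int = 2) -> List[Memory]:
    """Return up to k memories whose title/description share words with the task."""
    task_tokens = _tokens(task)
    scored = [(len(task_tokens & _tokens(m[0] + " " + m[1])), m) for m in bank]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked if score > 0][:k]


def run_task(
    task: str,
    bank: List[Memory],
    act: Callable[[str, List[Memory]], str],
    judge: Callable[[str, str], bool],
    distill: Callable[[str, str, bool], List[Memory]],
) -> bool:
    """One pass of the loop: retrieve, act, self-judge, distill, append."""
    memories = retrieve(bank, task)              # pull relevant memories into context
    trajectory = act(task, memories)             # agent attempts the task
    succeeded = judge(task, trajectory)          # possibly noisy success label
    bank.extend(distill(task, trajectory, succeeded))  # learn from wins and losses
    return succeeded


# Stub components so the loop can run without any model behind it.
def fake_act(task, memories):
    return f"steps for {task!r} using {len(memories)} memories"


def fake_judge(task, trajectory):
    return "checkout" not in task


def fake_distill(task, trajectory, succeeded):
    kind = "Success insight" if succeeded else "Failure reflection"
    return [(f"{kind}: {task}", f"Applies to tasks like: {task}", trajectory)]


bank: List[Memory] = []
run_task("search product listings", bank, fake_act, fake_judge, fake_distill)
run_task("complete checkout form", bank, fake_act, fake_judge, fake_distill)
print([title for title, _, _ in bank])
```

Note that the failed run still produces a memory. That is the part most memory systems skip, and it is also why a slightly wrong judge is tolerable: a mislabeled trajectory gets distilled under the wrong heading, but it does not stop the loop.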
Learning from failures is the secret sauce
This is where ReasoningBank really stands out. Instead of just recording “click the ‘Load More’ button” from a successful run, the agent might learn from a past failure to “always verify the current page identifier first to avoid infinite scroll traps before attempting to load more results.”
That’s a much more transferable piece of knowledge. It’s not tied to a specific website or task. It’s a general principle that applies across many web navigation scenarios.
The paper reports results to back this up. On web browsing and software engineering benchmarks, agents using ReasoningBank achieved higher success rates and completed tasks in fewer steps than baseline approaches. That efficiency gain is exactly what you’d expect when an agent stops repeating the same mistakes.
What I like and what I’m skeptical about
I like that this approach treats memory as something dynamic and evolving, not just a static log. The idea of distilling high-level strategies from specific experiences is solid. It’s how humans learn too: you don’t remember every time you touched a hot stove, but you remember the rule “don’t touch hot stoves.”
I’m a bit skeptical about the long-term scalability. The paper mentions they simply append new memories to the bank and leave “more sophisticated consolidation strategies for future work.” That’s fine for a research paper, but in practice, a memory bank that just grows forever is going to have serious retrieval and relevance problems. You’ll end up with thousands of memories, most of which are useless for any given task.
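To be clear about what I mean by consolidation, here is one naive possibility, and it is entirely my own sketch, not something from the paper or its code. The hypothetical `BoundedBank` class keeps one entry per normalized title and evicts the least recently used entry once a size cap is hit.

```python
from collections import OrderedDict
from typing import List, Optional


class BoundedBank:
    """Hypothetical bank with title-level dedup and least-recently-used eviction."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self._items: "OrderedDict[str, str]" = OrderedDict()  # normalized title -> content

    @staticmethod
    def _key(title: str) -> str:
        """Normalize case and whitespace so near-identical titles collide."""
        return " ".join(title.lower().split())

    def add(self, title: str, content: str) -> None:
        key = self._key(title)
        self._items[key] = content            # a repeated title overwrites the older entry
        self._items.move_to_end(key)          # mark as most recently used
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)   # evict the least recently used entry

    def get(self, title: str) -> Optional[str]:
        key = self._key(title)
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # a read also counts as use
        return self._items[key]

    def titles(self) -> List[str]:
        return list(self._items)


store = BoundedBank(max_items=2)
store.add("Handle pagination", "v1")
store.add("handle   Pagination", "v2")   # same normalized title, replaces v1
store.add("Check login state", "a")
store.get("Handle pagination")           # touch, so it survives the next eviction
store.add("Validate form input", "b")    # bank is full, oldest unused entry is dropped
print(store.titles())
```

Title matching and recency are blunt instruments. A real consolidation strategy would probably need to merge semantically similar memories and weigh how often each one actually helped, which is presumably the harder work the authors are deferring.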
The other question is how well this works across very different domains. The benchmarks are focused on web and software engineering tasks. I’d love to see how it handles something like robotic control or multi-agent coordination, where the action space is much larger and more continuous.
Still, this is a genuinely useful step forward. The code is on GitHub, so you can actually try it out yourself. If you’re building any kind of long-running agent system, I’d take a close look at this. Your agents might finally stop making the same stupid mistakes.