New NLP papers map agent memory and tool use
A June 24 arXiv roundup highlights agent memory, tool-use signals, and conversational search papers that push practical NLP forward.

A June 24 arXiv roundup on Zhihu pulls together several papers that matter for people building agents, retrieval systems, and product search assistants. The three titles that jump out are Metis, When Retrieval Metrics Mislead, and Dialogue to Discovery.
| Paper | Focus | Why it matters |
|---|---|---|
| Metis | Text and code memory for self-evolving agents | Targets how agents remember across modalities |
| When Retrieval Metrics Mislead | Policy signal in long-horizon tool use | Questions whether retrieval scores reflect agent behavior |
| Dialogue to Discovery | Attribute-aware preference elicitation | Improves conversational product search |
Agent memory is becoming a systems problem
The biggest theme in this roundup is memory, but not in the casual chatbot sense. Metis points at a harder problem: agents that write code, read text, revise plans, and keep learning need memory that works across formats.

That matters because a lot of current agent stacks still treat text notes, code artifacts, and tool traces as separate buckets. Once an agent starts changing its own behavior, those buckets stop being enough. The memory layer has to preserve context, but it also has to support updates, versioning, and retrieval when the agent returns days later.
This is where the paper title is doing real work. “Bridging text and code memory” suggests the authors are looking at the handoff between natural language reasoning and executable artifacts. For developers, that is the difference between a demo agent and one that can keep improving after repeated use.
- Text memory captures plans, instructions, and explanations.
- Code memory captures functions, patches, and implementation details.
- Self-evolving agents need both to stay consistent over time.
- Memory failures can turn a useful agent into a brittle one after a few iterations.
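To make the idea concrete, here is a minimal sketch of a memory store that keeps text and code entries in one versioned index. It is not the Metis design, which this roundup does not describe in detail. The class names, the key scheme (`plan/...`, `impl/...`), and the keyword search are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryEntry:
    key: str       # stable identifier, e.g. "impl/parse_csv" (hypothetical naming scheme)
    kind: str      # "text" or "code"
    content: str
    version: int   # 1 for the first write to a key, then incremented


@dataclass
class AgentMemory:
    """Toy store that keeps text and code entries in one versioned index."""
    history: Dict[str, List[MemoryEntry]] = field(default_factory=dict)

    def write(self, key: str, kind: str, content: str) -> MemoryEntry:
        if kind not in ("text", "code"):
            raise ValueError(f"unknown kind: {kind}")
        versions = self.history.setdefault(key, [])
        entry = MemoryEntry(key, kind, content, version=len(versions) + 1)
        versions.append(entry)  # older versions are kept, not overwritten
        return entry

    def latest(self, key: str) -> MemoryEntry:
        return self.history[key][-1]

    def search(self, query: str) -> List[MemoryEntry]:
        """Naive keyword match over the latest version of every entry."""
        terms = query.lower().split()
        hits = []
        for versions in self.history.values():
            entry = versions[-1]
            if all(term in entry.content.lower() for term in terms):
                hits.append(entry)
        return hits


memory = AgentMemory()
memory.write("plan/parse_csv", "text", "Plan: parse the CSV, skip malformed rows.")
memory.write("impl/parse_csv", "code", "def parse_csv(path): ...")
memory.write("impl/parse_csv", "code", "def parse_csv(path, skip_bad=True): ...")

latest_impl = memory.latest("impl/parse_csv")  # the revised function, version 2
hits = memory.search("parse")                  # returns both the plan and the code
```

The point of the sketch is that a single query can surface both the plan and the current implementation, while earlier versions remain available for comparison. A real system would likely replace the keyword match with embedding-based retrieval.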
Retrieval scores can hide weak policies
The second paper in the roundup takes aim at a common assumption in long-horizon agent work: if retrieval looks good, the policy must be good too. When Retrieval Metrics Mislead argues that retrieval metrics can overstate progress when the agent’s actual decision-making is still shaky.
That is a useful warning for anyone benchmarking tool-using systems. A model can fetch the right document, call the right API, or surface the right snippet and still fail at the larger task because the policy that decides what to do next is weak. In other words, the retrieval layer can look healthy while the control layer, which chooses the next action, performs poorly.
“Evaluating retrieval in isolation is insufficient for understanding the behavior of tool-using agents.”
That line, from the paper’s framing, gets to the heart of the issue. If your evaluation suite rewards a good hit rate but ignores planning quality, you may end up optimizing the wrong thing. For teams shipping agents, this means looking beyond top-k retrieval and checking whether the system actually completes multi-step tasks with fewer dead ends.
There is also a product angle here. Tool-using agents are increasingly deployed in workflows where one mistake compounds into several more. Search, database calls, code execution, and browser actions all depend on policy quality. A retrieval metric alone cannot tell you whether the model knows when to ask for help, retry, or stop.
- High retrieval accuracy does not guarantee strong multi-step reasoning.
- Long-horizon tasks expose policy errors that short benchmarks miss.
- Tool-use systems need evaluations that measure both selection and control.
- Benchmarks should track task completion, retries, and failure recovery.
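One way to act on these points is to report retrieval and task-level metrics side by side. The sketch below is an assumption about what such a report could look like, not the paper's evaluation protocol. The `Episode` fields and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    retrieval_hits: int    # retrieval steps that returned the relevant item
    retrieval_calls: int   # total retrieval steps
    retries: int           # tool calls repeated after a failure
    errors: int            # tool errors encountered
    recovered: int         # errors after which the agent got back on track
    completed: bool        # whether the full multi-step task succeeded


def summarize(episodes: List[Episode]) -> dict:
    """Aggregate retrieval-level and task-level metrics over a set of episodes."""
    n = len(episodes)
    calls = sum(e.retrieval_calls for e in episodes)
    hits = sum(e.retrieval_hits for e in episodes)
    errors = sum(e.errors for e in episodes)
    recovered = sum(e.recovered for e in episodes)
    return {
        "retrieval_hit_rate": hits / calls if calls else None,
        "task_completion_rate": sum(e.completed for e in episodes) / n if n else None,
        "mean_retries": sum(e.retries for e in episodes) / n if n else None,
        "recovery_rate": recovered / errors if errors else None,
    }


# Illustrative, made-up episodes: retrieval looks strong, the policy does not.
episodes = [
    Episode(9, 10, retries=3, errors=2, recovered=0, completed=False),
    Episode(10, 10, retries=0, errors=0, recovered=0, completed=True),
    Episode(9, 10, retries=4, errors=2, recovered=1, completed=False),
    Episode(8, 10, retries=1, errors=0, recovered=0, completed=False),
]

report = summarize(episodes)
```

With these sample numbers the hit rate is 36 out of 40 while only one task in four completes, which is the kind of gap a retrieval-only benchmark would hide.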
Conversational search is getting more structured
The third paper, Dialogue to Discovery, moves the focus from agent internals to user intent. Instead of treating a product search assistant as a general chat interface, the paper frames search as a process of preference elicitation, where the assistant asks better questions to narrow down what the user actually wants.

That is a practical shift. Product search often fails because users describe needs in fuzzy language: “lightweight,” “good battery,” “for travel,” “under budget.” An attribute-aware system can turn that vague dialogue into structured signals. The assistant can ask about size, battery life, material, or price range, then rank results using those attributes rather than raw keyword overlap.
For e-commerce teams, this is one of the more actionable ideas in the roundup. Search assistants do better when they model the conversation as a sequence of constraints instead of a stream of chat responses. That means fewer generic answers and more targeted product discovery.
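The "sequence of constraints" framing can be sketched in a few lines. This is a simplified illustration, not the method from Dialogue to Discovery. The catalog, the attribute names, and the scoring rule (count of satisfied constraints) are assumptions chosen for the example.

```python
from typing import Callable, Dict, List, Optional

Product = Dict[str, object]
Constraint = Callable[[object], bool]

# Hypothetical attributes the assistant knows how to ask about, in priority order.
ATTRIBUTES = ["weight_kg", "battery_h", "price"]

# Hypothetical catalog used only for this example.
CATALOG: List[Product] = [
    {"name": "Laptop A", "weight_kg": 1.1, "battery_h": 14, "price": 1300},
    {"name": "Laptop B", "weight_kg": 2.3, "battery_h": 8, "price": 900},
    {"name": "Laptop C", "weight_kg": 1.4, "battery_h": 11, "price": 1000},
]


def next_attribute(constraints: Dict[str, Constraint]) -> Optional[str]:
    """Return the first attribute the user has not constrained yet, if any."""
    for attr in ATTRIBUTES:
        if attr not in constraints:
            return attr
    return None


def rank(catalog: List[Product], constraints: Dict[str, Constraint]) -> List[Product]:
    """Order products by how many of the elicited constraints they satisfy."""
    def score(product: Product) -> int:
        return sum(1 for attr, ok in constraints.items() if ok(product[attr]))
    # sorted() is stable, so ties keep their catalog order.
    return sorted(catalog, key=score, reverse=True)


# "Lightweight" and "under budget" turned into structured constraints.
constraints: Dict[str, Constraint] = {
    "weight_kg": lambda v: v <= 1.5,
    "price": lambda v: v <= 1100,
}

ranked = rank(CATALOG, constraints)
ask_next = next_attribute(constraints)  # battery life has not been asked about yet
```

Here the vague phrases become two predicates, the product that satisfies both ranks first, and the assistant knows its next question should be about battery life. A production system would likely choose the next question by how much it narrows the candidate set rather than by a fixed order.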
The broader pattern across these papers is easy to spot. Agent research is moving from “can the model answer?” toward “can the system remember, decide, and adapt over time?” That includes memory design, evaluation design, and interaction design.
What this roundup says about the next wave of NLP work
These papers do not point in one direction. They split across memory, evaluation, and conversational search, but the common thread is practical reliability. The field is asking harder questions about what happens after the first good demo.
That matters because the next generation of NLP systems will be judged less by single-turn fluency and more by whether they can keep state, handle tools, and recover from mistakes. If OpenAI’s Cookbook and similar agent tooling have taught builders anything, it is that the gap between a working prototype and a dependable system is mostly in the details: memory format, evaluation metrics, and user interaction design.
For readers tracking the space, this roundup is a reminder to watch for papers that measure the messy parts of agent behavior. The most useful work this year may not be the flashiest model release. It may be the research that tells teams where their current metrics are lying to them, and how to design systems that keep working after the first five steps.
If you are building agents or search assistants, the next question is simple: are you optimizing for visible output, or for the hidden machinery that makes that output trustworthy?