ArXiv AI papers push agents, memory, and data
This arXiv AI batch centers on agentic reasoning, long-context data, and benchmark design across navigation, workflows, and health.

New arXiv papers show AI agents getting better at planning, memory, and domain-specific reasoning.
ArXiv’s Artificial Intelligence feed on papers.cool lists 214 papers for June 17, 2026, and the strongest theme is easy to spot: agents are moving from static response generators to systems that remember, plan, and act. Several papers also lean hard into data infrastructure, which matters just as much as model design when training data gets scarce.
| Paper | Key numbers | What it changes |
|---|---|---|
| EvolveNav | 10.1% success-rate gain | Test-time learning for zero-shot navigation |
| SEFD | 152B tokens, 18.5M filings, 550B-token archive estimate | Open long-context data for financial modeling |
| DRFLOW | 100 tasks, 1,246 workflow steps, 3,900+ sources | Benchmark for personalized workflows |
Agents are starting to plan before they act
The most interesting paper in the batch is EvolveNav, which attacks zero-shot object-goal navigation. The setup is simple to state and hard to solve: an embodied agent has to find an object it has never been trained for, using only what it can infer at test time.

Most prior systems lean on foundation models with fixed priors, then spend a lot of time correcting avoidable mistakes. EvolveNav tries a different route. It builds an agentic rule memory from past trajectories, uses upper confidence bound retrieval to pick rules, and adds a preflection module that predicts likely outcomes before the agent moves.
The result is practical rather than flashy. The paper reports a 10.1% improvement in success rate, plus fewer unnecessary steps. That matters because in embodied AI, wasted motion is often the real cost, whether the agent is a robot in a house or a simulated explorer in a maze.
- Rule memory turns past trajectories into reusable action knowledge.
- UCB retrieval balances semantic match with historical success.
- Preflection reduces blind exploration before the next action.
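The retrieval idea in the bullets above can be sketched in a few lines. This is a minimal illustration, not EvolveNav's implementation: the `Rule` fields, the `weight` that mixes similarity with success rate, and the exploration constant `c` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule-memory entry; field names are illustrative."""
    text: str
    similarity: float  # semantic match to the current situation, in [0, 1]
    successes: int = 0  # times the rule was used and the episode succeeded
    uses: int = 0       # times the rule was retrieved


def ucb_score(rule, total_uses, c=1.0, weight=0.5):
    """Blend semantic match and historical success, plus a UCB exploration bonus."""
    if rule.uses == 0:
        # Untried rules are always worth one attempt.
        return math.inf
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(total_uses) / rule.uses)
    return weight * rule.similarity + (1 - weight) * mean_success + bonus


def retrieve(rules, k=1, c=1.0, weight=0.5):
    """Return the k highest-scoring rules."""
    total_uses = max(1, sum(r.uses for r in rules))
    ranked = sorted(rules, key=lambda r: ucb_score(r, total_uses, c, weight), reverse=True)
    return ranked[:k]


rules = [
    Rule("check counters first for mugs", similarity=0.9, successes=1, uses=10),
    Rule("scan the room before moving", similarity=0.6, successes=9, uses=10),
]
best = retrieve(rules, k=1)[0]
print(best.text)
```

In this toy setup the second rule wins despite its weaker semantic match, because its track record is much better. The bonus term shrinks as a rule accumulates uses, so rarely tried rules still get sampled occasionally.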
Benchmarks are getting more specific, and that is a good thing
Two papers in this set are really about measurement. DRFLOW asks a different question from the usual deep-research benchmark: can an agent recover the actual workflow a user needs, step by step, from scattered sources?
That shift matters because many enterprise tasks are procedural, not summarization tasks. DRFLOW includes 100 tasks across five domains, 1,246 reference workflow steps, and more than 3,900 sources. The authors also define seven diagnostic metrics that test grounding, step recovery, ordering, condition handling, and personalization.
“The challenge is not to generate a report, but to identify the correct action-step sequence for the user’s task.” — Md Tawkat Islam Khondaker and coauthors, DRFLOW
The benchmark result is telling: the reference agent, DRFLOW-Agent, improves over strong baselines by up to 10.02% average F1, but the paper still says there is a lot of room left. That is usually a sign the benchmark is measuring something real rather than something already solved.
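To make the F1 figure concrete, here is a rough sketch of how a step-level F1 could be computed for a recovered workflow. It assumes steps are compared by exact string match; DRFLOW's own metrics and matching procedure are not described here and are likely more nuanced, so `step_f1` is a hypothetical helper.

```python
def step_f1(predicted, reference):
    """F1 over workflow steps, treating each step as an exact-match label.

    Assumption for illustration: order is ignored and steps match only
    when their strings are identical.
    """
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = ["open ticket", "verify identity", "reset password", "notify user"]
predicted = ["open ticket", "verify identity", "close ticket"]
print(round(step_f1(predicted, reference), 3))
```

A score like this only rewards recovering the right steps. Ordering, conditions, and personalization would each need their own checks, which is presumably why the benchmark defines several diagnostic metrics rather than one.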
Another useful comparison is how these papers define progress. EvolveNav optimizes behavior in the world. DRFLOW optimizes planning over documents and sources. Both are agent papers, but they test different failure modes: one is about physical exploration, the other about workflow recovery.
- DRFLOW: 100 tasks, 5 domains, 7 diagnostic metrics.
- DRFLOW-Agent: up to 10.02% average F1 improvement over strong baselines.
- EvolveNav: 10.1% success-rate gain with fewer unnecessary steps.
Data is becoming the bottleneck, so new corpora matter
The most strategic paper in the batch may be The Stanford EDGAR Filings Dataset. It treats SEC filings as a training resource, reconstructing them into layout-faithful MultiMarkdown for long-context pretraining and evaluation.

That is a smart response to a very real problem: good public web text is getting harder to find in bulk, and a lot of the remaining long-context corpora are proprietary, synthetic, or too narrow. EDGAR filings are dense, audited, and full of structure that language models usually struggle to preserve.
The scale is the headline. The authors release SEFD-v1 as a 152B-token snapshot, describe a larger 18.5M-filing archive, and estimate that archive at 550B tokens. They also report less than 0.1% overlap with Common Crawl-derived corpora, which makes the dataset useful for pretraining without simply recycling the same internet text again.
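A quick back-of-the-envelope check puts those figures in proportion. The only inputs are the three numbers quoted above; the per-filing average assumes the 550B-token estimate covers all 18.5M filings, and real filings will vary widely in length.

```python
SNAPSHOT_TOKENS = 152_000_000_000   # SEFD-v1 snapshot
ARCHIVE_TOKENS = 550_000_000_000    # estimated size of the full archive
ARCHIVE_FILINGS = 18_500_000        # filings in the full archive

# Share of the estimated archive that the released snapshot covers.
snapshot_share = SNAPSHOT_TOKENS / ARCHIVE_TOKENS

# Implied average length of a filing, in tokens.
avg_tokens_per_filing = ARCHIVE_TOKENS / ARCHIVE_FILINGS

print(f"{snapshot_share:.1%}")          # 27.6%
print(f"{avg_tokens_per_filing:,.0f}")  # 29,730
```

An implied average near 30,000 tokens per document helps explain why the corpus is pitched at long-context work.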
They also introduce two benchmarks: EDGAR-Forecast for filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR for transcription of complex financial tables. The pairing tests both reasoning and document fidelity, two areas where many models still falter.
For teams building finance-focused models, the message is clear: better data can matter as much as another round of tuning. If the corpus is cleaner and more structured, the model gets a better shot at learning long-context behavior that holds up in practice.
Agentic AI is spreading into medicine, robotics, and simulation
The rest of the batch reinforces the same pattern. WEQA combines language models with wearable-health tools and reports 24% better accuracy than LLM and agentic baselines, plus a blinded study with 12 medical experts and 8 users that found stronger usefulness and clinical soundness.
LEADS applies an LLM agent to cardiac electrophysiology digital twins, using structured action spaces to discover hybrid models that stay physically grounded and numerically stable. That is a nice example of where agents make sense: not as free-form writers, but as guided search systems inside a scientific workflow.
Then there is Fixed-Point Reasoners, which uses fixed-point convergence as a halting mechanism in looped Transformers. The paper targets Sudoku, Maze, state-tracking, and ARC-AGI, which is a good reminder that algorithmic reasoning is still one of the cleanest places to test whether a model can actually think in steps.
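The halting idea can be illustrated without a Transformer at all. In the sketch below, a toy contraction map stands in for the looped block, and the loop stops once the state stops changing. The function name, the tolerance, and the loop cap are assumptions for illustration, not details from the paper.

```python
def iterate_to_fixed_point(step, state, tol=1e-6, max_loops=100):
    """Apply `step` repeatedly and halt once the state stops changing.

    Returns (final_state, loops_used, converged).
    """
    for loops in range(1, max_loops + 1):
        new_state = step(state)
        # Largest per-coordinate change between consecutive iterations.
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, loops, True
    return state, max_loops, False


def toy_block(state):
    """Stand-in for one pass of a looped layer; its fixed point is x = 2."""
    return [0.5 * x + 1.0 for x in state]


final_state, loops_used, converged = iterate_to_fixed_point(toy_block, [0.0])
print(converged, loops_used, round(final_state[0], 4))
```

The appeal of this style of halting is that compute adapts to the input: states that settle quickly use few loops, while harder ones run longer, up to the cap.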
One more paper, Memory as a Wasting Asset, pulls the conversation in a different direction. It prices flash endurance for embodied agents, showing that memory writes have a real lifetime cost on hardware with limited program/erase (P/E) cycles. According to the paper, the endurance budget is dormant (not a binding constraint) on premium 3,000-P/E TLC at datasheet prices, but it becomes binding on commodity QLC/eMMC at around 1,000 P/E. That is exactly the kind of hardware detail AI teams ignore until deployment starts eating budget.
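A simple way to see why the P/E rating matters is to amortize a device's price over the data it can absorb before wearing out. The sketch below uses the common approximation that lifetime writes are roughly capacity times P/E cycles, divided by write amplification. It is not the paper's pricing model, and the price and capacity are placeholder values chosen only to show the ratio.

```python
def cost_per_gb_written(device_price, capacity_gb, pe_cycles, write_amplification=1.0):
    """Amortized hardware cost of writing one GB before the flash wears out."""
    lifetime_writes_gb = capacity_gb * pe_cycles / write_amplification
    return device_price / lifetime_writes_gb


# Placeholder numbers: same price and capacity, different endurance ratings.
PRICE = 100.0        # hypothetical device price
CAPACITY_GB = 1000   # hypothetical capacity

tlc_cost = cost_per_gb_written(PRICE, CAPACITY_GB, pe_cycles=3000)
qlc_cost = cost_per_gb_written(PRICE, CAPACITY_GB, pe_cycles=1000)

print(round(qlc_cost / tlc_cost, 2))
```

With price and capacity held equal, dropping from 3,000 to 1,000 P/E cycles triples the wear cost of every write. Real devices differ in price as well, so the actual gap depends on the figures a team plugs in.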
That mix of papers points to a broader shift: agent research is no longer just about getting a chat model to sound helpful. It is about memory that changes over time, benchmarks that measure actual workflows, and data sources that can support longer context windows without collapsing into noise.
If there is a prediction worth making from this batch, it is this: the next wave of agent papers will be judged less by clever prompts and more by whether they improve task completion, data efficiency, and test-time adaptation. The models that matter will be the ones that can remember, plan, and justify their actions without wasting steps or tokens.