Tag
reinforcement learning
Reinforcement learning studies how models learn to make decisions from feedback over time, and it underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, training stability, and planning in changing environments.
24 articles

Google OpenRL brings RL fine-tuning to Kubernetes
Google’s OpenRL lets teams run LLM post-training and fine-tuning on their own Kubernetes clusters.

RiVER trains LLMs without ground-truth answers
RiVER shows LLMs can improve from score-based tasks without ground-truth answers by calibrating rewards from execution feedback.

Self-Distillation Can Shrink Model Diversity
Self-distillation can boost pass@1 while quietly reducing rollout diversity and hurting out-of-distribution robustness.

CoorDex lets humanoids move while manipulating
CoorDex turns humanoid body and hand control into latent priors so dexterous manipulation can happen while the robot is moving.

Turing-RL trains user simulators by fooling judges
Turing-RL trains user simulators to sound indistinguishable from real users instead of matching one fixed reply.

OmniAgent brings active perception to video understanding
OmniAgent turns long-video understanding into an active reasoning loop that scales with turns, not video length.

ContextRL teaches LLMs to pick the right evidence
ContextRL uses contrastive context selection to improve grounding in long and multimodal reasoning.

ART fine-tunes multimodal LLMs via pixels
ART fine-tunes frozen multimodal LLMs by optimizing a single input image instead of model weights.

Mana turns articulated tools into animation tasks
Mana reframes dexterous tool use as animation, enabling zero-shot sim-to-real manipulation of articulated tools.

RL Training That Hands Off Control Gradually
This paper shows how to start RL from a working baseline policy and gradually hand control to a learned policy.

Reinforcement-aware distillation for LLM reasoning
This paper proposes reinforcement-aware knowledge distillation to improve LLM reasoning, but the abstract provides no benchmark numbers.

MobileGym makes mobile GUI agents testable at scale
MobileGym adds deterministic judging and parallel rollouts for mobile GUI agent research.

Vector Policy Optimization boosts search diversity
VPO trains language models to produce diverse solutions that work better in test-time search.

MARLIN tackles greener LLM inference in datacenters
MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.

ATLAS Makes Visual Reasoning Use One Token
ATLAS uses one discrete token for both agentic and latent visual reasoning, aiming to cut overhead without changing standard training.

AlphaGRPO teaches multimodal models to self-correct
AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Synthetic computers for long-horizon agent training
A method for building synthetic user computers at scale, then simulating month-long productivity tasks to train and evaluate agents.

Safe Continual RL for Changing Real-World Systems
This paper studies how to keep RL controllers safe while they adapt to non-stationary systems, and shows why existing methods still fall short.

Why Bounded Ratio RL Replaces PPO's Clipped Objective
BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.

Why LLMs Generalize on Maps but Fail on Scale
A synthetic shortest-path setup shows LLMs transfer across maps, but break when problems get longer because recursive reasoning gets unstable.

PreRL: Training LLMs in pre-train space
PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.

Physics Simulators as RL Data for LLM Reasoning
Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.

Act Wisely: Teaching Agents When Not to Call Tools
A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.

Five AI Infra Frontiers Bessemer Expects for 2026
Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.