Tag

reinforcement learning

Reinforcement learning studies how models learn to make decisions from feedback over time. It underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, stability, and planning in changing environments.

24 articles

Model Releases/Jun 27

Google OpenRL brings RL fine-tuning to Kubernetes

Google’s OpenRL lets teams run LLM post-training and fine-tuning on their own Kubernetes clusters.

Research/Jun 26

RiVER trains LLMs without ground-truth answers

RiVER shows LLMs can improve from score-based tasks without ground-truth answers by calibrating rewards from execution feedback.

Research/Jun 25

Self-Distillation Can Shrink Model Diversity

Self-distillation can boost pass@1 while quietly reducing rollout diversity and hurting out-of-distribution robustness.

Research/Jun 23

CoorDex lets humanoids move while manipulating

CoorDex turns humanoid body and hand control into latent priors so dexterous manipulation can happen while the robot is moving.

Research/Jun 18

Turing-RL trains user simulators by fooling judges

Turing-RL trains user simulators to sound indistinguishable from real users instead of matching one fixed reply.

Research/Jun 18

OmniAgent brings active perception to video understanding

OmniAgent turns long-video understanding into an active reasoning loop that scales with turns, not video length.

Research/Jun 16

ContextRL teaches LLMs to pick the right evidence

ContextRL uses contrastive context selection to improve grounding in long and multimodal reasoning.

Research/Jun 12

ART fine-tunes multimodal LLMs via pixels

ART fine-tunes frozen multimodal LLMs by optimizing a single input image instead of model weights.

Research/Jun 12

Mana turns articulated tools into animation tasks

Mana reframes dexterous tool use as animation, enabling zero-shot sim-to-real manipulation of articulated tools.

Research/Jun 9

RL Training That Hands Off Control Gradually

This paper shows how to start RL from a working baseline policy and gradually hand control to a learned policy.

Research/Jun 5

Reinforcement-aware distillation for LLM reasoning

This paper proposes reinforcement-aware knowledge distillation to improve LLM reasoning, but the abstract provides no benchmark numbers.

Research/May 26

MobileGym makes mobile GUI agents testable at scale

MobileGym adds deterministic judging and parallel rollouts for mobile GUI agent research.

Research/May 22

Vector Policy Optimization boosts search diversity

VPO trains language models to produce diverse solutions that work better in test-time search.

Research/May 18

MARLIN tackles greener LLM inference in datacenters

MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.

Research/May 16

ATLAS Makes Visual Reasoning Use One Token

ATLAS uses one discrete token for both agentic and latent visual reasoning, aiming to cut overhead without changing standard training.

Research/May 13

AlphaGRPO teaches multimodal models to self-correct

AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Research/May 1

Synthetic computers for long-horizon agent training

This method builds synthetic user computers at scale, then simulates month-long productivity tasks to train and evaluate agents.

Research/Apr 22

Safe Continual RL for Changing Real-World Systems

This paper studies how to keep RL controllers safe while they adapt to non-stationary systems, and shows why existing methods still fall short.

Research/Apr 21

Why Bounded Ratio RL Replaces PPO's Clipped Objective

BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.

Research/Apr 17

Why LLMs Generalize on Maps but Fail on Scale

A synthetic shortest-path setup shows that LLMs transfer across maps but break down as problems get longer, because recursive reasoning becomes unstable.

Research/Apr 16

PreRL: Training LLMs in pre-train space

PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.

Research/Apr 14

Physics Simulators as RL Data for LLM Reasoning

Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.

Research/Apr 10

Act Wisely: Teaching Agents When Not to Call Tools

A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.

Industry News/Apr 3

Five AI Infra Frontiers Bessemer Expects for 2026

Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.