Tag

reinforcement learning

Reinforcement learning studies how models learn to make decisions from feedback over time, and it underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, stability, and planning under changing environments.

24 articles

Google OpenRL brings RL fine-tuning to Kubernetes
Model Releases/Jun 27

Google’s OpenRL lets teams run LLM post-training and fine-tuning on their own Kubernetes clusters.

RiVER trains LLMs without ground-truth answers
Research/Jun 26

RiVER shows LLMs can improve from score-based tasks without ground-truth answers by calibrating rewards from execution feedback.

Self-Distillation Can Shrink Model Diversity
Research/Jun 25

Self-distillation can boost pass@1 while quietly reducing rollout diversity and hurting out-of-distribution robustness.

CoorDex lets humanoids move while manipulating
Research/Jun 23

CoorDex turns humanoid body and hand control into latent priors so dexterous manipulation can happen while the robot is moving.

Turing-RL trains user simulators by fooling judges
Research/Jun 18

Turing-RL trains user simulators to sound indistinguishable from real users instead of matching one fixed reply.

OmniAgent brings active perception to video understanding
Research/Jun 18

OmniAgent turns long-video understanding into an active reasoning loop that scales with turns, not video length.

ContextRL teaches LLMs to pick the right evidence
Research/Jun 16

ContextRL uses contrastive context selection to improve grounding in long and multimodal reasoning.

ART fine-tunes multimodal LLMs via pixels
Research/Jun 12

ART fine-tunes frozen multimodal LLMs by optimizing a single input image instead of model weights.

Mana turns articulated tools into animation tasks
Research/Jun 12

Mana reframes dexterous tool use as animation, enabling zero-shot sim-to-real manipulation of articulated tools.

RL Training That Hands Off Control Gradually
Research/Jun 9

This paper shows how to start RL from a working baseline policy and gradually hand control to a learned policy.

Reinforcement-aware distillation for LLM reasoning
Research/Jun 5

This paper proposes reinforcement-aware knowledge distillation to improve LLM reasoning, but the abstract provides no benchmark numbers.

MobileGym makes mobile GUI agents testable at scale
Research/May 26

MobileGym adds deterministic judging and parallel rollouts for mobile GUI agent research.

Vector Policy Optimization boosts search diversity
Research/May 22

VPO trains language models to produce diverse solutions that work better in test-time search.

MARLIN tackles greener LLM inference in datacenters
Research/May 18

MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.

ATLAS Makes Visual Reasoning Use One Token
Research/May 16

ATLAS uses one discrete token for both agentic and latent visual reasoning, aiming to cut overhead without changing standard training.

AlphaGRPO teaches multimodal models to self-correct
Research/May 13

AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Synthetic computers for long-horizon agent training
Research/May 1

A method for building synthetic user computers at scale, then simulating month-long productivity tasks to train and evaluate agents.

Safe Continual RL for Changing Real-World Systems
Research/Apr 22

This paper studies how to keep RL controllers safe while they adapt to non-stationary systems, and shows why existing methods still fall short.

Why Bounded Ratio RL Replaces PPO's Clipped Objective
Research/Apr 21

BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.

Why LLMs Generalize on Maps but Fail on Scale
Research/Apr 17

A synthetic shortest-path setup shows LLMs transfer across maps but break when problems get longer, because recursive reasoning becomes unstable.

PreRL: Training LLMs in pre-train space
Research/Apr 16

PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.

Physics Simulators as RL Data for LLM Reasoning
Research/Apr 14

Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.

Act Wisely: Teaching Agents When Not to Call Tools
Research/Apr 10

A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.

Five AI Infra Frontiers Bessemer Expects for 2026
Industry News/Apr 3

Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.