[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-reinforcement-learning":3},{"tag":4,"articles":11,"peer_article_count":184},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"d52d08ae-f7f9-4625-ada6-d32a7bcd1036","reinforcement learning","reinforcement-learning",15,"強化學習研究如何讓模型在回饋訊號下逐步學會決策，常見於機器人控制、長期代理訓練與 LLM 微調。這個主題也涵蓋 PPO、BRRL、持續學習與安全約束等方法，重點在穩定更新、長期規劃與部署風險。","Reinforcement learning studies how models learn decisions from feedback over time, and it underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, stability, and planning under changing environments.",[12,21,29,36,43,50,57,64,71,78,85,92,99,106,113,120,127,134,141,148,155,162,169,176],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"35368bfc-0dbe-45dc-b422-87b1bd350ac0","google-openrl-llm-fine-tuning-kubernetes-en","Google OpenRL brings RL fine-tuning to Kubernetes","Google’s OpenRL lets teams run LLM post-training and fine-tuning on their own Kubernetes clusters.","model-release","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782572578249-jlty.png","en","2026-06-27T15:02:27.543012+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":26,"image_url":27,"cover_image":27,"language":19,"created_at":28},"c05899fc-dd62-4fad-a249-9748376c1ef2","river-llm-reinforcement-learning-without-answers-en","RiVER trains LLMs without ground-truth answers","RiVER shows LLMs can improve from score-based tasks without ground-truth answers by calibrating rewards from execution feedback.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782454678234-6mk1.png","2026-06-26T06:17:27.491779+00:00",{"id":30,"slug":31,"title":32,"summary":33,"category":26,"image_url":34,"cover_image":34,"language":19,"created_at":35},"17884e8b-86d6-431c-8e83-d628bb4d060a","self-distillation-shrinks-output-diversity-en","Self-Distillation Can Shrink Model Diversity","Self-distillation can boost pass@1 while quietly reducing rollout diversity and hurting out-of-distribution robustness.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782369170326-a6te.png","2026-06-25T06:32:27.005106+00:00",{"id":37,"slug":38,"title":39,"summary":40,"category":26,"image_url":41,"cover_image":41,"language":19,"created_at":42},"1ebf2fd0-d54e-46ce-8be1-3c0afe10cf29","coordex-humanoid-loco-manipulation-priors-en","CoorDex lets humanoids move while manipulating","CoorDex turns humanoid body and hand control into latent priors so dexterous manipulation can happen while the robot is moving.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782196377805-l76f.png","2026-06-23T06:32:32.755081+00:00",{"id":44,"slug":45,"title":46,"summary":47,"category":26,"image_url":48,"cover_image":48,"language":19,"created_at":49},"03e7168c-77a8-40ea-924b-96f86204d88e","turing-rl-user-simulator-rewards-en","Turing-RL trains user simulators by fooling judges","Turing-RL trains user simulators to sound indistinguishable from real users instead of matching one fixed reply.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781763480946-dpwl.png","2026-06-18T06:17:31.584257+00:00",{"id":51,"slug":52,"title":53,"summary":54,"category":26,"image_url":55,"cover_image":55,"language":19,"created_at":56},"0e33a353-6482-43dc-a0d7-646b9b1a2a2a","omniagent-active-perception-video-understanding-en","OmniAgent brings active perception to video understanding","OmniAgent turns long-video understanding into an active reasoning loop that scales with turns, not video length.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781762581923-hx7i.png","2026-06-18T06:02:32.210704+00:00",{"id":58,"slug":59,"title":60,"summary":61,"category":26,"image_url":62,"cover_image":62,"language":19,"created_at":63},"79767774-adbe-4e97-93d9-9c5bf674b35e","contextrl-teaches-llms-to-pick-right-evidence-en","ContextRL teaches LLMs to pick the right evidence","ContextRL uses contrastive context selection to improve grounding in long and multimodal reasoning.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781590673379-8nq0.png","2026-06-16T06:17:30.366185+00:00",{"id":65,"slug":66,"title":67,"summary":68,"category":26,"image_url":69,"cover_image":69,"language":19,"created_at":70},"b1779b30-e9e3-4406-aa29-d44e94f7ca67","art-fine-tunes-multimodal-llms-via-pixels-en","ART fine-tunes multimodal LLMs via pixels","ART fine-tunes frozen multimodal LLMs by optimizing a single input image instead of model weights.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781266683694-z93k.png","2026-06-12T12:17:32.187899+00:00",{"id":72,"slug":73,"title":74,"summary":75,"category":26,"image_url":76,"cover_image":76,"language":19,"created_at":77},"a09335be-d07a-4675-9601-8b57d1870398","mana-articulated-tool-manipulation-animation-en","Mana turns articulated tools into animation tasks","Mana reframes dexterous tool use as animation, enabling zero-shot sim-to-real manipulation of articulated tools.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781246883418-afa8.png","2026-06-12T06:47:30.169865+00:00",{"id":79,"slug":80,"title":81,"summary":82,"category":26,"image_url":83,"cover_image":83,"language":19,"created_at":84},"55e7197e-f114-4b6c-b3e2-af1a3cd9dfa4","rl-training-hands-off-control-gradually-en","RL Training That Hands Off Control Gradually","This paper shows how to start RL from a working baseline policy and gradually hand control to a learned policy.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780986801034-gf8m.png","2026-06-09T06:32:33.516452+00:00",{"id":86,"slug":87,"title":88,"summary":89,"category":26,"image_url":90,"cover_image":90,"language":19,"created_at":91},"37bb5c43-947c-48da-a02c-091da7b99319","reinforcement-aware-distillation-llm-reasoning-en","Reinforcement-aware distillation for LLM reasoning","This paper proposes reinforcement-aware knowledge distillation to improve LLM reasoning, but the abstract provides no benchmark numbers.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780646587562-pbu3.png","2026-06-05T08:02:34.575637+00:00",{"id":93,"slug":94,"title":95,"summary":96,"category":26,"image_url":97,"cover_image":97,"language":19,"created_at":98},"cf14ef80-3ca8-4323-9468-1bb7fa19ad3e","mobilegym-verifiable-parallel-mobile-gui-sim-en","MobileGym makes mobile GUI agents testable at scale","MobileGym adds deterministic judging and parallel rollouts for mobile GUI agent research.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779775565646-l5ai.png","2026-05-26T06:05:36.223532+00:00",{"id":100,"slug":101,"title":102,"summary":103,"category":26,"image_url":104,"cover_image":104,"language":19,"created_at":105},"08e121ad-d16a-4f61-a124-0530101f4665","vector-policy-optimization-search-diversity-en","Vector Policy Optimization boosts search diversity","VPO trains language models to produce diverse solutions that work better in test-time search.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779432356045-kl6u.png","2026-05-22T06:45:30.65682+00:00",{"id":107,"slug":108,"title":109,"summary":110,"category":26,"image_url":111,"cover_image":111,"language":19,"created_at":112},"8b3832ee-9b1b-4684-9d11-919559a92b28","marlin-greener-llm-inference-datacenters-en","MARLIN tackles greener LLM inference in datacenters","MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779084239926-7642.png","2026-05-18T06:03:36.916559+00:00",{"id":114,"slug":115,"title":116,"summary":117,"category":26,"image_url":118,"cover_image":118,"language":19,"created_at":119},"2a05602e-4f77-4e7a-a073-0f3878a9d9de","atlas-one-token-visual-reasoning-en","ATLAS Makes Visual Reasoning Use One Token","ATLAS uses one discrete token for both agentic and latent visual reasoning, aiming to cut overhead without changing standard training.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778912030332-58uq.png","2026-05-16T06:13:36.193661+00:00",{"id":121,"slug":122,"title":123,"summary":124,"category":26,"image_url":125,"cover_image":125,"language":19,"created_at":126},"4a7fe7e7-0731-47ec-96a5-2758c5bfd8f9","alphagrpo-self-reflective-multimodal-generation-en","AlphaGRPO teaches multimodal models to self-correct","AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778652656972-4yog.png","2026-05-13T06:10:34.985001+00:00",{"id":128,"slug":129,"title":130,"summary":131,"category":26,"image_url":132,"cover_image":132,"language":19,"created_at":133},"14c7a767-8a49-4a9f-9531-3ea654444daf","synthetic-computers-long-horizon-agent-training-en","Synthetic computers for long-horizon agent training","A method for building synthetic user computers at scale, then simulating month-long productivity tasks to train and evaluate agents.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777620239939-k8q2.png","2026-05-01T06:30:47.562935+00:00",{"id":135,"slug":136,"title":137,"summary":138,"category":26,"image_url":139,"cover_image":139,"language":19,"created_at":140},"89d74343-03a7-4325-88e0-14029dab320d","safe-continual-rl-changing-environments-en","Safe Continual RL for Changing Real-World Systems","This paper studies how to keep RL controllers safe while they adapt to non-stationary systems—and shows why existing methods still fall short.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776838195882-6v8v.png","2026-04-22T06:09:33.432376+00:00",{"id":142,"slug":143,"title":144,"summary":145,"category":26,"image_url":146,"cover_image":146,"language":19,"created_at":147},"19f116fd-02dd-4a7d-9638-75a3bb70cae2","bounded-ratio-reinforcement-learning-ppo-en","Why Bounded Ratio RL Replaces PPO's Clipped Objective","BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776751796218-p4in.png","2026-04-21T06:09:40.318224+00:00",{"id":149,"slug":150,"title":151,"summary":152,"category":26,"image_url":153,"cover_image":153,"language":19,"created_at":154},"443c85ce-62b3-4336-ad93-7a8a1538d271","llm-generalization-shortest-path-scale-en","Why LLMs Generalize on Maps but Fail on Scale","A synthetic shortest-path setup shows LLMs transfer across maps, but break when problems get longer because recursive reasoning gets unstable.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776406022431-jsmd.png","2026-04-17T06:06:34.142981+00:00",{"id":156,"slug":157,"title":158,"summary":159,"category":26,"image_url":160,"cover_image":160,"language":19,"created_at":161},"d1bbd868-15d4-459c-9e2b-2626c779b4ef","prerl-training-llms-in-pre-train-space-en","PreRL: Training LLMs in pre-train space","PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319621187-aig1.png","2026-04-16T06:06:38.24406+00:00",{"id":163,"slug":164,"title":165,"summary":166,"category":26,"image_url":167,"cover_image":167,"language":19,"created_at":168},"8a95a2d8-eb3a-442c-b9c4-c835c79d75c5","physics-simulators-rl-llm-reasoning-en","Physics Simulators as RL Data for LLM Reasoning","Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776146992039-q2sc.png","2026-04-14T06:09:33.23692+00:00",{"id":170,"slug":171,"title":172,"summary":173,"category":26,"image_url":174,"cover_image":174,"language":19,"created_at":175},"3cefc37f-e116-4597-a5cb-55bfb3fc4aa4","act-wisely-tool-use-agentic-multimodal-models-en","Act Wisely: Teaching Agents When Not to Call Tools","A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775801032138-7jih.png","2026-04-10T06:03:34.728615+00:00",{"id":177,"slug":178,"title":179,"summary":180,"category":181,"image_url":182,"cover_image":182,"language":19,"created_at":183},"15c2f00f-4c48-4580-a13e-74626eb520f7","five-ai-infra-frontiers-bessemer-2026-en","Five AI Infra Frontiers Bessemer Expects for 2026","Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775164380914-xfye.png","2026-04-02T21:12:40.223864+00:00",17]