[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-one-transformer-layer-can-carry-rl-gains-en":3,"article-related-one-transformer-layer-can-carry-rl-gains-en":30,"series-research-b8167640-c431-4064-be79-10c877d15087":77},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"b8167640-c431-4064-be79-10c877d15087","one-transformer-layer-can-carry-rl-gains-en","One Transformer Layer Can Carry RL Gains","\u003Cp data-speakable=\"summary\">Training one transformer layer can recover most of the gains from full RL post-training.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: Seven models\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: Measure “layer contribution” by isolating RL updates to one layer\u003C\u002Fli>\u003C\u002Ful>\u003Cp>For engineers working on \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> post-training, this paper points to a simple but important possibility: you may not need to update every parameter to get most of the benefit from \u003Ca href=\"\u002Ftag\u002Freinforcement-learning\">reinforcement learning\u003C\u002Fa>. The authors study where RL gains actually land inside a transformer and find that the effect is concentrated, not evenly spread.\u003C\u002Fp>\u003Cp>The paper is \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2607.01232\">Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training\u003C\u002Fa>, and its core message is practical: if RL adaptation is mostly carried by a small slice of the network, then training strategy, compute budget, and debugging all change. That matters whether you are trying to reduce post-training cost or understand why a model improves on one task but not another.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>Most RL post-training setups for large language models update all parameters uniformly. That assumes every transformer layer contributes roughly equally to the gains you get from RL. The authors argue that this assumption has not been well understood, so they run a systematic layer-wise study to see where the improvement actually comes from.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782973978159-8klr.png\" alt=\"One Transformer Layer Can Carry RL Gains\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That question matters because RL post-training is expensive and often opaque. If only a few layers matter, then full-parameter training may be doing a lot of unnecessary work. If the useful signal is concentrated in a predictable part of the stack, that also gives researchers a cleaner way to study \u003Ca href=\"\u002Fnews\u002Fself-explanation-training-tracks-model-behavior-en\">model behavior\u003C\u002Fa>.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>The paper introduces a quantity called \u003Cem>layer contribution\u003C\u002Fem>. In simple terms, it measures how much of the improvement from full RL training you can recover by training one layer in isolation. Instead of asking, “Did the whole model get better?”, the authors ask, “Which layer is actually doing the heavy lifting?”\u003C\u002Fp>\u003Cp>They test this idea across seven models from two model families, Qwen3 and Qwen2.5. They also cover three RL algorithms: GRPO, GiGPO, and Dr. GRPO. The task mix includes mathematical reasoning, code generation, and agentic decision-making, so this is not a single-task curiosity.\u003C\u002Fp>\u003Cp>The method is straightforward but revealing: train layers separately, compare each layer’s isolated gain to the full RL gain, and rank layers by how much they contribute. The paper then checks whether those rankings stay stable across datasets, tasks, model families, and RL algorithms.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The main finding is surprising in its simplicity: training a single transformer layer can recover most of the gains from full-parameter RL training, and in some cases even surpass it. The paper does not give \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> numbers in the abstract, so it is not possible to quote exact scores here. But the qualitative claim is strong: RL gains are highly concentrated.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782973974695-dzdb.png\" alt=\"One Transformer Layer Can Carry RL Gains\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That concentration does not look random. The authors report a stable structural pattern in which high-contribution layers tend to sit in the middle of the transformer stack, while layers near the input and output ends contribute much less. In other words, the model’s RL adaptation appears to cluster around the middle rather than being evenly distributed from bottom to top.\u003C\u002Fp>\u003Cp>They also say the resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms. That is important because it suggests the pattern is not just an artifact of one benchmark or one training recipe. The same broad structure shows up repeatedly across the experiments they ran.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>If this result holds up beyond the paper’s setup, it could reshape how teams think about RL post-training. A smaller trainable subset could mean lower compute cost, simpler experimentation, and faster iteration when tuning models for reasoning, coding, or agentic behavior.\u003C\u002Fp>\u003Cp>It also gives practitioners a new diagnostic lens. Instead of treating the transformer as a black box, you can ask which layers are responsible for a given RL improvement. That could help with debugging training instability, comparing algorithms, or designing more targeted adaptation methods.\u003C\u002Fp>\u003Cp>There is also a software-engineering angle. If only a subset of layers matters, then selective fine-tuning, parameter-efficient training, or layer-specific scheduling may be worth exploring more aggressively. The paper does not claim those methods are solved here, but it gives a concrete signal that the usual “update everything” default may be overkill in some RL settings.\u003C\u002Fp>\u003Ch2>What this paper does not prove\u003C\u002Fh2>\u003Cp>The abstract is clear about the broad pattern, but it does not provide benchmark tables, exact recovery percentages, or compute savings. So while the qualitative result is compelling, the magnitude of the effect is not quantified in the source text available here.\u003C\u002Fp>\u003Cp>It is also worth being careful about scope. The study covers seven models, two \u003Ca href=\"\u002Ftag\u002Fqwen\">Qwen\u003C\u002Fa> families, three RL algorithms, and several task domains, which is a solid spread, but it is still a specific slice of the LLM ecosystem. The paper shows a stable pattern within that slice; it does not claim that every transformer, every training recipe, or every downstream use case will behave the same way.\u003C\u002Fp>\u003Cp>Still, the engineering takeaway is hard to ignore: RL gains may be much more localized inside the network than most training pipelines assume. If you are building or optimizing post-training workflows, that is exactly the kind of result worth testing in your own stack.\u003C\u002Fp>\u003Ch2>Bottom line\u003C\u002Fh2>\u003Cp>This paper argues that RL adaptation in transformers is concentrated in a small number of layers, often just one, with the middle of the stack doing most of the work. For anyone shipping or tuning \u003Ca href=\"\u002Ftag\u002Fllms\">LLMs\u003C\u002Fa>, that is a strong hint that full-parameter RL may not always be necessary to get useful post-training gains.\u003C\u002Fp>\u003Cul>\u003Cli>RL gains are not spread evenly across transformer layers.\u003C\u002Fli>\u003Cli>Middle layers tend to carry the strongest contribution.\u003C\u002Fli>\u003Cli>The paper suggests a path toward cheaper, more targeted RL post-training.\u003C\u002Fli>\u003C\u002Ful>","A layer-wise RL study finds that training one transformer layer can recover most post-training gains.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2607.01232",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782973978159-8klr.png","research","en","5b59165e-18fd-4c10-afa4-1307e39a11f0",[17,18,19,20,21],"reinforcement learning","transformers","LLM post-training","layer-wise analysis","Qwen",[23,24,25],"A single transformer layer can recover most of the gains from full RL training.","High-contribution layers are concentrated in the middle of the transformer stack.","The layer-ranking pattern stays stable across models, tasks, and RL algorithms.",0,"2026-07-02T06:32:29.644564+00:00","2026-07-02T06:32:29.632+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":36,"relatedPosts":40},[32,34],{"name":21,"slug":33},"qwen",{"name":17,"slug":35},"reinforcement-learning",{"id":15,"slug":37,"title":38,"language":39},"one-transformer-layer-can-carry-rl-gains-zh","單層 Transformer 也能扛住 RL 增益","zh",[41,47,53,59,65,71],{"id":42,"slug":43,"title":44,"cover_image":45,"image_url":45,"created_at":46,"category":13},"cc12b2b9-0f6f-4dbf-8e2e-49d52008dda2","language-critiques-imitation-learning-en","Language critiques improve imitation learning","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782975783575-ibss.png","2026-07-02T07:02:29.283153+00:00",{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"8d35bb8a-3563-4ac6-8c45-745d4e606f7f","bineval-binary-questions-llm-evals-en","BINEVAL uses binary questions to score LLM outputs","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782927166631-h8c1.png","2026-07-01T17:32:24.15899+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"4987870f-92aa-4f80-8eb7-aa8f0109337e","rlmf-teaches-llms-express-uncertainty-better-en","RLMF teaches LLMs to express uncertainty better","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782887573710-gn6d.png","2026-07-01T06:32:29.360612+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"c31a1ae3-05aa-445e-a8c4-efafed7fbc2d","qval-dense-supervision-testbed-long-horizon-agents-en","QVal tests dense supervision before training","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782886678947-rwaj.png","2026-07-01T06:17:34.353581+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"28e23e1d-1463-4129-9d01-f0aa4e3578e6","self-explanation-training-tracks-model-behavior-en","Self-Explanation Training Still Tracks Model Behavior","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782885775255-0o56.png","2026-07-01T06:02:31.014016+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"c6744f0f-9be6-4da8-8bab-3b4fbfe127ba","worldevolver-self-evolving-world-models-llm-planning-en","WorldEvolver lets LLM agents revise foresight","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782801184442-vqwa.png","2026-06-30T06:32:29.368198+00:00",[78,83,88,93,98,103,108,113,118,123],{"id":79,"slug":80,"title":81,"created_at":82},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":84,"slug":85,"title":86,"created_at":87},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]