[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-fixing-llm-forgetting-es-fine-tuning-en":3,"article-related-fixing-llm-forgetting-es-fine-tuning-en":30,"series-research-9383f93b-9272-4bd3-81b9-1b3e84f4663e":83},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"9383f93b-9272-4bd3-81b9-1b3e84f4663e","fixing-llm-forgetting-es-fine-tuning-en","Fixing LLM forgetting in ES fine-tuning","\u003Cp data-speakable=\"summary\">This paper shows LLM fine-tuning with evolution strategies can drift, and anchored weight decay can curb it.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: No benchmark numbers in abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: Anchored Weight Decay constrains updates toward initial model parameters\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Fine-tuning large \u003Ca href=\"\u002Fnews\u002Faudio-language-models-arbitration-reversals-en\">language models\u003C\u002Fa> is usually sold as a straightforward trade: adapt the model to a new task and hope the old \u003Ca href=\"\u002Ftag\u002Fskills\">skills\u003C\u002Fa> stay intact. This paper argues that the story is messier. In the authors’ view, the “forgetting” people see during evolution-strategy-based fine-tuning is often not permanent loss at all, but performance drift that can recover later in training.\u003C\u002Fp>\u003Cp>That matters for engineers because it changes how you diagnose regressions. If prior-task performance can bounce around during optimization, then a temporary dip does not necessarily mean the method is broken. 
It may mean the update path is wandering through a part of parameter space that is only weakly constrained.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>The paper is about a familiar problem in \u003Ca href=\"\u002Ftag\u002Fcontinual-learning\">continual learning\u003C\u002Fa>: after you fine-tune a model on a new task, it may get worse on earlier tasks. Recent work had suggested that evolution strategies, or ES, were especially prone to this kind of forgetting when used for LLM fine-tuning.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780604273180-xa1x.png\" alt=\"Fixing LLM forgetting in ES fine-tuning\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>ES is attractive because it is simple, scalable, and \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa>-only during training. The issue is that if it causes models to lose prior capability, that limits its usefulness for multi-stage or continual adaptation. The authors are trying to separate a real algorithmic weakness from a misleading training dynamic.\u003C\u002Fp>\u003Cp>Instead of treating forgetting as a fixed failure mode, the paper asks whether the observed drop in prior-task performance is actually reversible. That distinction matters because reversible drift can be managed, while true forgetting often requires a different training strategy altogether.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>The paper’s first move is conceptual: it reframes prior-task forgetting as performance drift rather than irreversible forgetting. In the authors’ experiments and analysis, prior-task performance often recovers during ES training, which suggests the model is not always “losing” something permanently.\u003C\u002Fp>\u003Cp>The second move is diagnostic. 
The paper says this drift is not unique to ES. Similar behavior can also show up in \u003Ca href=\"\u002Ftag\u002Freinforcement-learning\">reinforcement learning\u003C\u002Fa> fine-tuning, which means the problem is broader than one optimization method.\u003C\u002Fp>\u003Cp>Then the authors look at why the drift happens. Their explanation points to ES training dynamics, especially random walk behavior in weakly constrained directions of the weight space. In other words, if the optimization has room to move in directions that are not strongly anchored, the model can wander enough to hurt earlier-task performance.\u003C\u002Fp>\u003Cp>To address that, they introduce Anchored Weight Decay, or AWD. The idea is simple: add parameter-space regularization that keeps optimization closer to the initial model parameters. Rather than letting the weights drift freely, AWD nudges training back toward the starting point.\u003C\u002Fp>\u003Cp>That design choice is practical because it does not require changing the overall ES setup. The paper presents AWD as a stabilizer for training, not as a new model architecture or a new \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> suite.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The abstract does not give benchmark numbers, so there is no numeric score to quote here. What it does claim is qualitative but still useful: AWD stabilizes prior-task performance while preserving target-task performance.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780604277965-k8ac.png\" alt=\"Fixing LLM forgetting in ES fine-tuning\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>It also says AWD can deliver benefits comparable to using much larger ES population sizes, but at much lower computational cost. 
For developers, that is the most concrete engineering takeaway in the abstract: if a bigger population was your brute-force fix for instability, AWD may offer a cheaper way to get similar behavior.\u003C\u002Fp>\u003Cp>The paper’s broader claim is that prior-task forgetting under ES is largely avoidable. That is a stronger statement than simply saying “we improved one metric,” because it reframes the problem as something you can control with regularization and training dynamics rather than accept as a built-in limitation.\u003C\u002Fp>\u003Cp>Another important detail is that the authors position ES as a promising approach for continual learning in LLMs. That matters because ES has already been attractive for its simplicity and inference-only training, and this paper argues that its biggest perceived weakness may be manageable.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>If you are fine-tuning models in stages, especially across multiple tasks, this paper suggests you should watch for transient drift instead of assuming every dip is permanent forgetting. That can change how you evaluate checkpoints and when you stop training.\u003C\u002Fp>\u003Cp>It also gives you a concrete regularization idea to try: anchor the weights toward the starting model. Even without the full paper’s implementation details, the concept is easy to understand and fits into the broader family of parameter-space regularization methods.\u003C\u002Fp>\u003Cp>For teams balancing quality and compute, the “large population size versus AWD” comparison is especially relevant. If the abstract’s claim holds up in the full paper, AWD could reduce the need to spend extra compute just to keep earlier capabilities from wobbling.\u003C\u002Fp>\u003Ch2>Limitations and open questions\u003C\u002Fh2>\u003Cp>The abstract is clear about the direction of the result, but it does not provide benchmark numbers, task names, or training setup details. 
That means you should treat the claims as promising but not yet quantified by the information available here.\u003C\u002Fp>\u003Cp>It also leaves open how AWD behaves across different model sizes, task types, or more realistic production fine-tuning pipelines. The abstract says the method preserves target-task performance and stabilizes prior-task performance, but it does not describe the trade-off between the two.\u003C\u002Fp>\u003Cp>Finally, the paper’s explanation of drift points to weakly constrained directions in weight space, which is a useful mental model. The practical question is how well that diagnosis holds across other fine-tuning regimes. The abstract suggests the issue is broader than ES, but it does not map out exactly where the boundary lies.\u003C\u002Fp>\u003Ch2>Bottom line\u003C\u002Fh2>\u003Cp>This paper’s main contribution is not just a new regularizer. It is a reframing of “forgetting” during LLM fine-tuning with evolution strategies as a mostly manageable training dynamic, plus a simple way to reduce it.\u003C\u002Fp>\u003Cp>For engineers, that means ES may be more viable for continual adaptation than recent concerns suggested. 
The key idea is to keep the model anchored, avoid unnecessary drift, and judge regressions carefully before assuming the method has truly forgotten the old task.\u003C\u002Fp>","This paper shows LLM fine-tuning with evolution strategies can drift, and anchored weight decay can curb it.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30148",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780604273180-xa1x.png","research","en","923bb0c4-95f3-49a0-8e01-5cdd6bcd2e32",[17,18,19,20,21],"LLM fine-tuning","evolution strategies","catastrophic forgetting","continual learning","weight decay",[23,24,25],"Forgetting during ES fine-tuning is often performance drift, not permanent loss.","Anchored Weight Decay keeps updates closer to the initial model parameters.","The abstract claims AWD can match large-population benefits with lower compute.",0,"2026-06-04T20:17:26.230817+00:00","2026-06-04T20:17:26.222+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":42,"relatedPosts":46},[32,34,36,38,40],{"name":20,"slug":33},"continual-learning",{"name":19,"slug":35},"catastrophic-forgetting",{"name":21,"slug":37},"weight-decay",{"name":18,"slug":39},"evolution-strategies",{"name":17,"slug":41},"llm-fine-tuning",{"id":15,"slug":43,"title":44,"language":45},"fixing-llm-forgetting-es-fine-tuning-zh","ES 微調忘記問題有解了","zh",[47,53,59,65,71,77],{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"0feba7dc-6027-4e75-bcf3-62d3e2a090a7","tls-turns-insecure-links-into-encrypted-sessions-en","TLS turns insecure links into encrypted 
sessions","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780596200962-3rqr.png","2026-06-04T18:02:51.489159+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"9426a6bd-912e-444b-893d-ef9a0434d0ae","streamma-multi-agent-reasoning-latency-en","StreamMA cuts multi-agent reasoning latency","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780554790437-pffi.png","2026-06-04T06:32:33.361195+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"dfcbc7e1-aadb-4fe2-b572-c2e0372a3022","audio-language-models-arbitration-reversals-en","How audio-language models lose to text","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780553874831-f2dl.png","2026-06-04T06:17:28.510747+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"b940c037-352c-4c68-8e44-62748fafa560","stride-training-data-attribution-sparse-recovery-en","STRIDE tracks training data influence faster","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780552977778-4t7h.png","2026-06-04T06:02:29.766655+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"c9c264b1-3a0d-4f5b-ada3-02687c9ab795","mathematicians-warn-ai-could-distort-math-en","Mathematicians Warn AI Could Distort Math","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780504385180-uln0.png","2026-06-03T16:32:29.94161+00:00",{"id":78,"slug":79,"title":80,"cover_image":81,"image_url":81,"created_at":82,"category":13},"50db75e4-31d8-4222-9f32-476b682a3848","humanoid-gpt-zero-shot-motion-tracking-en","Humanoid-GPT 
scales motion tracking with a GPT-style model","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780469286641-cfel.png","2026-06-03T06:47:34.975723+00:00",[84,89,94,99,104,109,114,119,124,129],{"id":85,"slug":86,"title":87,"created_at":88},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation 
Method","2026-03-28T14:55:02.646943+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":130,"slug":131,"title":132,"created_at":133},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]