[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-randomized-yarn-long-context-reasoning-en":3,"article-related-randomized-yarn-long-context-reasoning-en":30,"series-research-d81e3cd8-ad4e-430c-a71e-c66d867a627f":74},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"d81e3cd8-ad4e-430c-a71e-c66d867a627f","randomized-yarn-long-context-reasoning-en","Randomized YaRN boosts long-context reasoning","\u003Cp data-speakable=\"summary\">Randomized YaRN helps \u003Ca href=\"\u002Ftag\u002Fllms\">LLMs\u003C\u002Fa> generalize better from short training contexts to much longer reasoning windows.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: Context lengths from 16K to 128K\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: YaRN positional extrapolation plus randomized positional encoding and a length curriculum\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Long-context support is one of those features that looks solved until you push a model beyond the range it saw during training. This paper tackles that gap directly: instead of assuming a model trained on short sequences will naturally stretch to very long ones, it changes how positional information is presented during training so the model learns to handle out-of-distribution lengths more gracefully.\u003C\u002Fp>\u003Cp>For engineers building retrieval-heavy assistants, multi-document reasoning systems, or agents that need to keep track of references across huge prompts, the practical question is not just whether a model can accept a long input. 
It is whether it can still reason well when the context gets much longer than the training distribution. That is the problem Randomized YaRN is trying to fix.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>Most large language models are pretrained on relatively short sequences and then extended to longer contexts through extra training. That gets them part of the way there, but the abstract says they still struggle to generalize to very long sequences. In other words, a model may look fine at the lengths it was trained on and then degrade once you push it farther out.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782195478116-wxaz.png\" alt=\"Randomized YaRN boosts long-context reasoning\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This matters because long-context reasoning is not just about fitting more tokens into the window. The model has to preserve relationships across distant parts of the prompt, and positional encoding can become a hidden failure mode. If the model only ever sees one narrow range of positions during training, it may not learn how to behave when those positions are shifted far beyond the original regime.\u003C\u002Fp>\u003Cp>The paper’s framing is straightforward: improve length generalization by exposing the model to harder positional conditions during training, even when the underlying text examples are still short. That is the core idea behind Randomized YaRN.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>Randomized YaRN combines three ingredients: YaRN-based positional extrapolation, randomized positional encoding, and a length curriculum. The key move is that during training on short-context data, tokens are assigned YaRN positional encodings sampled from a larger position range. 
So even though the text itself is short, the model sees positional representations that look like they came from much longer contexts.\u003C\u002Fp>\u003Cp>That is a clever way to create out-of-distribution pressure without needing every training sample to actually be huge. Instead of only learning “short context behavior,” the model is repeatedly exposed to position patterns that sit outside the usual training range. The paper argues that this helps the model become more robust when it later has to handle much longer sequences.\u003C\u002Fp>\u003Cp>The length curriculum adds another layer: the model is not thrown into the hardest setting all at once. The abstract does not spell out every curriculum detail, but it clearly presents the method as a progressive exposure strategy. For practitioners, that usually signals training that ramps difficulty over time rather than relying on a single static setup.\u003C\u002Fp>\u003Cp>One important detail: as described in the abstract, this is about positional generalization, not a new attention architecture. The method changes how positions are assigned during training so the model’s existing long-context machinery can generalize better.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The paper evaluates Randomized YaRN on two long-context reasoning benchmarks: BABILong and Multi-Round Coreference Resolution, or MRCR. 
Those are both challenging settings because they test whether a model can track information and reason across very long spans of text, not just retrieve a single fact.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782195481100-fsq8.png\" alt=\"Randomized YaRN boosts long-context reasoning\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The abstract gives one concrete range that matters: the training data has less than 8K context, while evaluation covers context lengths from 16K to 128K. That is the key stress test. The method is being asked to generalize well beyond the sequence lengths it saw in training, and the paper says it consistently improves reasoning performance across that range.\u003C\u002Fp>\u003Cp>It also says Randomized YaRN outperforms standard fine-tuning, with the largest gains appearing at the far out-of-distribution lengths. The abstract does not provide exact \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> scores, percentage gains, or latency numbers, so those are not available from the source. What we can say is that the improvement is consistent and strongest when the context length is most extreme.\u003C\u002Fp>\u003Cp>That pattern is important. A method that helps only a little near the training boundary is useful, but a method that keeps helping at 128K is more interesting for real systems. It suggests the training recipe is doing something structural to generalization, not just nudging performance on easy cases.\u003C\u002Fp>\u003Ch2>Why engineers should care\u003C\u002Fh2>\u003Cp>If you are building with long-context LLMs, the usual failure mode is not obvious until production: the model can accept the prompt but miss dependencies, confuse references, or degrade when the context becomes truly long. 
This paper points at a training-time fix for that class of problems.\u003C\u002Fp>\u003Cp>The practical takeaway is that you may not need to rely only on longer pretraining sequences to improve long-context behavior. Randomized exposure to out-of-distribution positional encodings during training on shorter data may be enough to make the model more reliable at much longer lengths. That is appealing because long-sequence training is expensive, and not every team can afford to build a dataset dominated by very long contexts.\u003C\u002Fp>\u003Cp>For teams working on retrieval-augmented generation, multi-hop QA, codebase assistants, or document analysis, this kind of recipe could be a useful part of the training stack. It is especially relevant if your deployment window is much larger than your available training budget.\u003C\u002Fp>\u003Ch2>Limitations and open questions\u003C\u002Fh2>\u003Cp>The abstract is solid on the high-level idea, but it leaves out several things developers would want before adopting the method. We do not get benchmark numbers, training compute, model sizes, ablations, or details about how much each component of Randomized YaRN contributes on its own. So while the paper claims consistent gains, the source here does not let us quantify the cost-benefit tradeoff.\u003C\u002Fp>\u003Cp>We also do not know from the abstract how broadly the method transfers beyond the two reported benchmarks. BABILong and MRCR are both long-context reasoning tasks, but they are still specific tasks. The big open question is whether the same training recipe helps across other long-context workloads, including retrieval, summarization, or agentic tool use.\u003C\u002Fp>\u003Cp>Another unresolved point is implementation complexity. The abstract makes the method sound conceptually simple, but training recipes can hide real engineering overhead once you move from paper to production. 
Developers would want to know how sensitive the approach is to the positional range, the curriculum schedule, and the base model’s architecture.\u003C\u002Fp>\u003Ch2>Bottom line\u003C\u002Fh2>\u003Cp>Randomized YaRN is a training recipe for making short-context models generalize better to very long contexts by randomizing positional encodings during training and using a length curriculum. The paper’s main claim is that this improves reasoning on 16K to 128K contexts, especially when the model is pushed far beyond its training range.\u003C\u002Fp>\u003Cp>For practitioners, the appeal is clear: it targets a real failure mode in long-context LLMs without requiring the model to be trained only on giant sequences. The source does not give exact scores, but it does give a concrete direction for improving long-context robustness: make positional distributions harder during training, not just the text length.\u003C\u002Fp>\u003Cp>That makes Randomized YaRN worth watching if your product depends on long prompts, deep reference tracking, or reasoning that has to survive well past the training window.\u003C\u002Fp>\u003Cul>\u003Cli>It targets length generalization, not just longer input support.\u003C\u002Fli>\u003Cli>It uses randomized positional encodings to expose models to OOD positions during training.\u003C\u002Fli>\u003Cli>It reports consistent gains on BABILong and MRCR from 16K to 128K contexts.\u003C\u002Fli>\u003C\u002Ful>","Randomized YaRN helps LLMs generalize better from short training contexts to much longer reasoning windows.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.23687",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782195478116-wxaz.png","research","en","7171fed6-f304-4f46-9efe-f691ea304b65",[17,18,19,20,21],"long-context LLMs","positional encoding","YaRN","length generalization","reasoning",[23,24,25],"Randomized YaRN improves long-context generalization 
by randomizing positional encodings during training.","The paper reports consistent gains from 16K to 128K contexts on BABILong and MRCR.","The abstract provides no exact benchmark scores, so the result is directional rather than fully quantified.",0,"2026-06-23T06:17:32.896933+00:00","2026-06-23T06:17:32.888+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":33,"relatedPosts":37},[32],{"name":21,"slug":21},{"id":15,"slug":34,"title":35,"language":36},"randomized-yarn-long-context-reasoning-zh","Randomized YaRN 讓長上下文更穩","zh",[38,44,50,56,62,68],{"id":39,"slug":40,"title":41,"cover_image":42,"image_url":42,"created_at":43,"category":13},"96178a82-96e4-42e6-ab00-6c8c09059d5a","lifescibench-tests-biotech-models-en","LifeSciBench lets you test biotech models","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782198211594-rl4h.png","2026-06-23T07:02:47.704936+00:00",{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":13},"1ebf2fd0-d54e-46ce-8be1-3c0afe10cf29","coordex-humanoid-loco-manipulation-priors-en","CoorDex lets humanoids move while manipulating","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782196377805-l76f.png","2026-06-23T06:32:32.755081+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"5044acd9-3264-427c-803a-97955cd42bd9","autodex-automates-dexterous-grasp-data-collection-en","AutoDex automates dexterous grasp data collection","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782194577248-yvij.png","2026-06-23T06:02:31.714363+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"fa4555ac-ba1b-4d3a-8563-b43f6a2757b3","anthropic-scale-lead-frontier-ai-moat-en","Anthropic’s scale lead is the 
real moat in frontier AI","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782169363684-kjh1.png","2026-06-22T23:02:23.725574+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"7b888d1b-5890-4f27-b580-f8bb958ea5a2","teampcp-supply-chain-ai-poisoning-en","TeamPCP supply chain poisoning exposes escalating AI attacks","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782162171698-7dpn.png","2026-06-22T21:02:23.140079+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"f05d7971-4858-4384-81d8-00299b99ed17","ethereum-wikipedia-dev-cheat-sheet-en","Ethereum turns Wikipedia into a dev cheat sheet","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782152297559-pocz.png","2026-06-22T18:17:50.367827+00:00",[75,80,85,90,95,100,105,110,115,120],{"id":76,"slug":77,"title":78,"created_at":79},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":81,"slug":82,"title":83,"created_at":84},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into 
Failure","2026-03-28T03:03:18.899465+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]