[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-reroute-keeps-useful-vision-tokens-alive-en":3,"article-related-reroute-keeps-useful-vision-tokens-alive-en":30,"series-research-e9cb5863-f541-4d53-8f38-289660919a1f":82},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"e9cb5863-f541-4d53-8f38-289660919a1f","reroute-keeps-useful-vision-tokens-alive-en","Reroute Keeps Useful Vision Tokens Alive","\u003Cp data-speakable=\"summary\">Reroute lets vision-language models defer, not discard, visual tokens so later layers can still use them.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: No benchmark numbers in abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: Recoverable routing that reintroduces deferred tokens at later stages\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Vision-language models are expensive because they turn images into hundreds or thousands of visual tokens, and every extra \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> adds work to decoder attention and KV-cache storage. This paper argues that the usual fix—score tokens, keep a few, and permanently drop the rest—can be too blunt, especially when the model needs to ground later answers in image details it did not care about early on.\u003C\u002Fp>\u003Cp>That matters for engineers because token reduction is one of the clearest ways to cut \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> cost in multimodal systems, but a bad reduction strategy can quietly damage accuracy. 
The paper’s core idea is simple: instead of removing a token forever, route it out temporarily and let it come back into consideration later.\u003C\u002Fp>\u003Ch2>What problem the paper is trying to fix\u003C\u002Fh2>\u003Cp>In a vision-language model, the image is not fed in as pixels all the way through the decoder. It is projected into a sequence of visual tokens, and those tokens participate in attention just like text tokens do. The catch is scale: the more visual tokens you keep, the more expensive decoding becomes in both compute and memory.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781157784473-28u1.png\" alt=\"Reroute Keeps Useful Vision Tokens Alive\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Existing visual-token reduction methods usually follow a rank-and-remove pattern. They assign importance scores, keep the top tokens, and permanently discard the rest. That works if importance is stable across the whole decoder, but the paper says that assumption breaks down in practice.\u003C\u002Fp>\u003Cp>The authors’ argument is that visual-token importance changes as the decoder gets deeper. A token that looks unimportant in an early stage may become relevant later, especially for grounding-sensitive questions where the model has to connect language to a specific part of the image. If you remove that token too early, the model never gets the chance to use it again.\u003C\u002Fp>\u003Cp>So the problem is not just “how do we compress visual tokens?” It is “how do we reduce them without making the reduction irreversible?” That framing is the main shift in the paper.\u003C\u002Fp>\u003Ch2>How Reroute works in plain English\u003C\u002Fh2>\u003Cp>Reroute is a training-free plug-in, which means it does not require retraining the base model. 
Instead, it changes the routing behavior during decoding. The method keeps the existing attention-score ranking rules and stage-wise schedules used by pruning methods, so it can slot into those systems rather than replace them entirely.\u003C\u002Fp>\u003Cp>The key difference is what happens to tokens that are not selected at a given stage. In a standard pruning setup, those tokens are removed. In Reroute, they are deferred: they bypass the current decoder stage and re-enter the candidate pool at the next routing decision.\u003C\u002Fp>\u003Cp>That makes the token flow recoverable. A token can miss one round and still survive long enough to be reconsidered later. In other words, Reroute treats token reduction as a sequence of routing decisions, not a one-way deletion process.\u003C\u002Fp>\u003Cp>Because it reuses the same ranking and scheduling logic, Reroute is designed to preserve the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. The abstract does not give the exact overhead numbers, so the safe takeaway is that the method is intended to keep the same efficiency regime while improving robustness.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The paper evaluates Reroute across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and \u003Ca href=\"\u002Ftag\u002Fqwen\">Qwen\u003C\u002Fa> backbones. That is a useful spread because it suggests the idea is not tied to a single model family or one specific pruning recipe.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781157781129-4kfz.png\" alt=\"Reroute Keeps Useful Vision Tokens Alive\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>According to the abstract, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. 
The paper does not provide \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> values in the abstract, so there are no exact scores to quote here. Still, the qualitative result is important: the method aims to recover the kinds of image details that pruning tends to lose, without giving up the broader visual question answering behavior developers still need.\u003C\u002Fp>\u003Cp>The authors specifically call out grounding-sensitive queries as the setting where recoverable routing helps most. That makes sense: if a question depends on the model revisiting a token tied to a small object, region, or relation in the image, a temporary defer-and-retry strategy is more forgiving than a hard delete.\u003C\u002Fp>\u003Cp>One practical implication is that the paper is not claiming token reduction is bad. It is saying the implementation choice matters. If you are already using a pruning-style reducer, the paper suggests you may be able to get better behavior by changing the fate of low-ranked tokens instead of changing the ranking model itself.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>For teams building multimodal products, visual-token reduction is one of the few levers that directly affects latency and memory. If your decoder is paying attention over thousands of visual tokens, every optimization matters. But if the optimization is too aggressive, you can end up with a model that answers general questions fine and still fails on image grounding.\u003C\u002Fp>\u003Cp>Reroute is interesting because it is training-free and compatible with existing ranking rules. That lowers adoption friction. You do not need to redesign the whole VLM stack to test the idea; you can treat it as a plug-in on top of pruning methods you already understand.\u003C\u002Fp>\u003Cp>It is also a reminder that multimodal efficiency is not just about fewer tokens. It is about smarter token lifecycles. 
Some tokens are not globally important, but they may be locally important at the right decoder depth. Recoverable routing is a straightforward way to encode that assumption.\u003C\u002Fp>\u003Ch2>Limitations and open questions\u003C\u002Fh2>\u003Cp>The abstract is clear about the direction of the result, but it leaves out the hard numbers. There are no benchmark tables, no exact latency or memory figures, and no detailed failure cases in the source material provided here. So while the method appears promising, the abstract alone does not let us measure the size of the gain.\u003C\u002Fp>\u003Cp>Another open question is how broadly the routing idea transfers beyond the tested combinations of FastV, PDrop, Nüwa, LLaVA-1.5, and Qwen backbones. The abstract says the method works across those variants, but it does not tell us how it behaves under different decoder depths, different routing schedules, or more extreme token budgets.\u003C\u002Fp>\u003Cp>There is also a systems question: if a token is deferred several times, when does it finally stop being useful? The abstract does not spell out the stopping rule, edge cases, or any overhead introduced by reintroducing tokens into the candidate pool. Those details matter if you want to ship this in production.\u003C\u002Fp>\u003Cp>Still, the core message is strong and practical. If a vision-language model is going to reduce tokens, the paper argues it should do so in a way that keeps the door open for later recovery. 
For developers, that is a useful design pattern: reduce aggressively, but do not assume the first ranking is the last word.\u003C\u002Fp>\u003Cul>\u003Cli>Reroute replaces irreversible token pruning with recoverable routing.\u003C\u002Fli>\u003Cli>It is training-free and reuses existing attention-score ranking and schedules.\u003C\u002Fli>\u003Cli>The abstract reports better grounding under aggressive reduction, but gives no benchmark numbers.\u003C\u002Fli>\u003C\u002Ful>","Reroute lets vision-language models defer, not discard, visual tokens so later layers can still use them.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.12412",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781157784473-28u1.png","research","en","2a097023-5013-40ba-81e1-014bc4ef713d",[17,18,19,20,21],"vision-language models","token pruning","KV-cache","grounding","inference efficiency",[23,24,25],"Reroute defers low-ranked visual tokens instead of deleting them permanently.","The method is training-free and plugs into existing pruning pipelines.","The abstract claims better grounding under aggressive reduction, but provides no exact benchmark numbers.",0,"2026-06-11T06:02:32.556043+00:00","2026-06-11T06:02:32.548+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":42,"relatedPosts":46},[32,33,36,38,40],{"name":20,"slug":20},{"name":34,"slug":35},"KV cache","kv-cache",{"name":21,"slug":37},"inference-efficiency",{"name":17,"slug":39},"vision-language-models",{"name":18,"slug":41},"token-pruning",{"id":15,"slug":43,"title":44,"language":45},"reroute-keeps-useful-vision-tokens-alive-zh","Reroute 讓視覺 token 可回流","zh",[47,53,59,65,71,77],{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"f20df85e-3b45-4eec-a44c-7fa0940e0d39","factr-2-force-sensing-robot-arms-en","FACTR 2 brings force sensing to cheap robot 
arms","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781159586943-1hon.png","2026-06-11T06:32:36.958637+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"3d767fcf-58cf-4a92-9652-00f1ec1e3677","c-dic-incremental-compression-dialogue-memory-en","C-DIC compresses long dialogue memory turn by turn","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781158686061-b943.png","2026-06-11T06:17:38.690213+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"ffb2e7ac-bff8-4c03-a4d4-1c19264c6967","sequential-fine-tuning-essay-scoring-en","Sequential fine-tuning improves essay scoring","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781146976434-ism9.png","2026-06-11T03:02:29.860107+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"21a693ca-7c72-49e6-886e-4d190baa33c1","nvidia-nemotron-3-ultra-open-models-compete-en","NVIDIA Nemotron 3 Ultra proves open models can still compete","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781108276690-iat5.png","2026-06-10T16:17:24.880013+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"3e763732-2a73-4539-8990-b8af7d671b3e","speechllm-l2-assessment-rationales-en","SpeechLLM Gives L2 Scores and 
Rationales","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781103780733-r41k.png","2026-06-10T15:02:34.030726+00:00",{"id":78,"slug":79,"title":80,"cover_image":11,"image_url":11,"created_at":81,"category":13},"ef3677ab-2c91-4c09-9c61-b19dbd7d12fb","eevee-test-time-prompt-learning-real-world-en","EEVEE tackles prompt learning across real-world streams","2026-06-10T06:32:32.554039+00:00",[83,88,93,98,103,108,113,118,123,128],{"id":84,"slug":85,"title":86,"created_at":87},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with 
WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]