[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-kv-cache":3},{"tag":4,"articles":11,"peer_article_count":137},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"422aade2-8ccd-4b7c-b4a5-7836c6353ec7","KV cache","kv-cache",13,"KV cache 是大型語言模型推論時最吃記憶體的部分之一，長上下文、低延遲服務與雲端部署都會直接受它影響。這個主題涵蓋量化、壓縮、HBM 容量與頻寬取捨，以及像 TurboQuant 這類降低 KV cache 成本的方法。","KV cache is the working memory that lets LLMs reuse past tokens during inference, and it often becomes the main limit on context length, latency, and serving cost. This tag covers quantization, compression, HBM capacity and bandwidth trade-offs, and papers like TurboQuant.",[12,21,28,36,43,50,57,65,73,80,88,95,102,109,116,123,130],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"9fd702bc-6c80-4d27-8f85-5971f898bef3","ultraquant-4bit-kv-caching-agents-en","UltraQuant: 4-bit KV caching for long agents","UltraQuant shows 4-bit KV caching can speed long, multi-turn agent serving while keeping more context resident.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782331384598-tjhi.png","en","2026-06-24T20:02:33.028079+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":17,"image_url":26,"cover_image":26,"language":19,"created_at":27},"434fbb0a-e925-43f3-9c3d-a3fbd187acdc","variable-width-transformers-cut-wasted-capacity-en","Variable-Width Transformers cut wasted capacity","A new transformer design widens early and late layers while shrinking the middle to save compute and memory.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781677980601-tp4b.png","2026-06-17T06:32:32.993101+00:00",{"id":29,"slug":30,"title":31,"summary":32,"category":33,"image_url":34,"cover_image":34,"language":19,"created_at":35},"093f7c46-be7c-4b62-be00-73808a61e0a0","turboquant-amd-gpus-kv-cache-latency-en","TurboQuant on AMD GPUs cuts KV-cache latency","TurboQuant on AMD GPUs improves long-context LLM serving with up to 3.6x speedup and far lower KV-cache pressure.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781299067778-3pzd.png","2026-06-12T21:17:26.07+00:00",{"id":37,"slug":38,"title":39,"summary":40,"category":33,"image_url":41,"cover_image":41,"language":19,"created_at":42},"0ac121b9-de23-42b9-94f7-fac9ea703e18","turboquant-makes-long-context-ai-cheaper-en","TurboQuant makes long-context AI much cheaper","4 ways TurboQuant’s 100x KV cache cut could lower long-context AI costs, ease GPU needs, and change model serving.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781272983524-0j31.png","2026-06-12T14:02:27.64087+00:00",{"id":44,"slug":45,"title":46,"summary":47,"category":17,"image_url":48,"cover_image":48,"language":19,"created_at":49},"e9cb5863-f541-4d53-8f38-289660919a1f","reroute-keeps-useful-vision-tokens-alive-en","Reroute Keeps Useful Vision Tokens Alive","Reroute lets vision-language models defer, not discard, visual tokens so later layers can still use them.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781157784473-28u1.png","2026-06-11T06:02:32.556043+00:00",{"id":51,"slug":52,"title":53,"summary":54,"category":17,"image_url":55,"cover_image":55,"language":19,"created_at":56},"9f0c9505-6d75-411c-ba46-2382e8f295a5","turboquant-cuts-kv-cache-memory-6x-google-tests-en","TurboQuant cuts KV cache memory 6x in Google tests","Google Research says TurboQuant compresses KV caches by over 4x, with up to 6x less memory and no loss on long-context tests.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906679116-fqdo.png","2026-06-08T08:17:22.276769+00:00",{"id":58,"slug":59,"title":60,"summary":61,"category":62,"image_url":63,"cover_image":63,"language":19,"created_at":64},"0117641d-93d6-40f1-8b9e-158b8240493a","tether-turboquant-cuts-ai-memory-use-5x-en","Tether’s TurboQuant cuts AI memory use 5x","Tether released TurboQuant in QVAC SDK 0.12.0, claiming up to 5x lower AI memory use for local sessions on laptops and phones.","blockchain","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780543069267-cwa3.png","2026-06-04T03:17:20.409795+00:00",{"id":66,"slug":67,"title":68,"summary":69,"category":70,"image_url":71,"cover_image":71,"language":19,"created_at":72},"1247e920-56ea-4e12-9d8c-5a4a7d4df9dd","why-tether-is-right-to-push-local-ai-memory-into-everyday-de-en","Why Tether Is Right to Push Local AI Memory Into Everyday Devices","Tether’s TurboQuant matters because it makes long-context AI practical on local devices, not just in data centers.","tools","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780542172839-ie86.png","2026-06-04T03:02:19.993669+00:00",{"id":74,"slug":75,"title":76,"summary":77,"category":17,"image_url":78,"cover_image":78,"language":19,"created_at":79},"3a65bf83-79cf-4b24-a099-b102054e1465","videomla-low-rank-kv-cache-video-diffusion-en","VideoMLA cuts video KV cache memory 92.7%","VideoMLA compresses video diffusion KV caches with a shared low-rank latent and cuts per-token memory 92.7%.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780035485810-fkg5.png","2026-05-29T06:17:31.115044+00:00",{"id":81,"slug":82,"title":83,"summary":84,"category":85,"image_url":86,"cover_image":86,"language":19,"created_at":87},"e71cb6f6-c753-4b14-9e37-19634bdad1d8","why-verkor-turboquant-silicon-ip-matters-en","Why Verkor’s TurboQuant silicon IP matters more than the headline says","Verkor’s TurboQuant accelerator is a real step for LLM inference, but the bigger story is how quickly algorithm ideas are becoming silicon IP.","ai-agent","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779896872842-2hm8.png","2026-05-27T15:47:25.880442+00:00",{"id":89,"slug":90,"title":91,"summary":92,"category":70,"image_url":93,"cover_image":93,"language":19,"created_at":94},"8a164bd6-6f92-47a6-87fb-72a6371aae17","why-llama-cpp-should-treat-turboquant-as-default-en","Why llama.cpp should treat TurboQuant as the new default path","TurboQuant is the right direction for llama.cpp because asymmetric KV compression cuts memory without breaking compatibility.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779481556833-a9v3.png","2026-05-22T20:25:23.12744+00:00",{"id":96,"slug":97,"title":98,"summary":99,"category":70,"image_url":100,"cover_image":100,"language":19,"created_at":101},"cbaeb6db-c465-4659-b35b-640435c673bf","why-kv-cache-compression-will-decide-edge-ai-inference-en","Why KV-cache compression will decide edge AI inference","TurboQuant-style KV-cache compression is the real bottleneck-breaker for edge AI inference.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779285828871-4n8z.png","2026-05-20T14:03:20.811149+00:00",{"id":103,"slug":104,"title":105,"summary":106,"category":17,"image_url":107,"cover_image":107,"language":19,"created_at":108},"a259bf3b-e800-46fa-8550-605b5b8f4115","why-turboquant-changes-kv-cache-debate-en","Why TurboQuant changes the KV cache debate","TurboQuant makes KV cache compression a theoretical win, not just an engineering trick.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778016643980-zx6u.png","2026-05-05T21:30:24.349733+00:00",{"id":110,"slug":111,"title":112,"summary":113,"category":17,"image_url":114,"cover_image":114,"language":19,"created_at":115},"fdb997e1-6691-46c5-bb2d-e1ca3f730c25","turboquant-google-paper-explained-en","TurboQuant Explained: Why Google’s New Paper Matters","Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775160958409-7jj5.png","2026-04-02T20:15:40.601225+00:00",{"id":117,"slug":118,"title":119,"summary":120,"category":17,"image_url":121,"cover_image":121,"language":19,"created_at":122},"d4867ede-353b-4812-aac7-aebe28ef3613","turboquant-wont-fix-memory-crunch-en","TurboQuant Won’t Fix the Memory Crunch","Google’s TurboQuant can cut KV-cache memory use 6x, but longer contexts may keep DRAM and NAND demand climbing.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775132152400-1kew.png","2026-04-02T12:15:32.095995+00:00",{"id":124,"slug":125,"title":126,"summary":127,"category":17,"image_url":128,"cover_image":128,"language":19,"created_at":129},"cdcfe76f-c9bf-44ac-98d9-e9041d414d6c","sebastian-raschka-llm-architecture-gallery-en","Sebastian Raschka’s LLM Architecture Gallery","Raschka’s gallery compares GPT-2, Llama 3, OLMo 2, DeepSeek, and Qwen stacks with exact layer, cache, and attention data.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775121663908-8tcs.png","2026-04-02T07:27:33.848813+00:00",{"id":131,"slug":132,"title":133,"summary":134,"category":17,"image_url":135,"cover_image":135,"language":19,"created_at":136},"27f0d044-b9f9-4a58-99e8-1a181ea32f19","universal-yoco-efficient-depth-scaling-en","Universal YOCO aims to scale depth without cache bloat","YOCO-U mixes recursive computation with efficient attention to scale LLM depth while keeping inference overhead and KV cache growth in check.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775115621645-wqql.png","2026-04-02T06:06:26.960639+00:00",25]