[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-llm-inference":3},{"tag":4,"articles":11,"peer_article_count":129},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"a487ff8b-bc7c-473d-b9f2-867dd22c9327","LLM inference","llm-inference",4,"LLM 推論聚焦模型在部署時的延遲、吞吐量與記憶體成本，尤其是 KV cache、量化與加速器友善的實作。這類技術直接影響大模型能否在雲端與邊緣裝置上穩定運行。","LLM inference covers the runtime side of large models: latency, throughput, memory footprint, and how KV cache, quantization, and accelerator-friendly kernels shape deployment. It matters because these choices determine whether a model is practical on GPUs, servers, or edge devices.",[12,21,28,35,42,50,58,65,73,80,87,94,101,108,115,122],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"59866fce-b78e-4d8a-ad3e-7ef7d607979e","turboquant-cuts-llm-memory-use-without-retraining-en","TurboQuant cuts LLM memory use without retraining","5 ways TurboQuant shrinks KV cache memory and speeds LLM inference, with near-lossless results around 3–4 bits on retrieval benchmarks.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782710265164-q297.png","en","2026-06-29T05:17:22.810166+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":17,"image_url":26,"cover_image":26,"language":19,"created_at":27},"ae186d76-a3b0-4ac1-bc6e-bc4f3ceb488f","openai-jalapeno-llm-inference-chip-en","OpenAI’s Jalapeño chip points to faster LLM inference","1 chip, 1 partnership, and 1 new compute platform aimed at making LLM inference faster, more reliable, and more available.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782598657738-7bm5.png","2026-06-27T22:17:20.647493+00:00",{"id":29,"slug":30,"title":31,"summary":32,"category":17,"image_url":33,"cover_image":33,"language":19,"created_at":34},"adf04097-64e9-416a-845e-3a376ed6289e","v100-raw-gguf-vs-prepacked-weight-cache-en","V100 raw GGUF vs prepacked weight cache","This compares raw GGUF Q4_K kernels and prepacked weight caches for V100 decode inference.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781441282199-hh84.png","2026-06-14T12:47:38.493638+00:00",{"id":36,"slug":37,"title":38,"summary":39,"category":17,"image_url":40,"cover_image":40,"language":19,"created_at":41},"0ac121b9-de23-42b9-94f7-fac9ea703e18","turboquant-makes-long-context-ai-cheaper-en","TurboQuant makes long-context AI much cheaper","4 ways TurboQuant’s 100x KV cache cut could lower long-context AI costs, ease GPU needs, and change model serving.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781272983524-0j31.png","2026-06-12T14:02:27.64087+00:00",{"id":43,"slug":44,"title":45,"summary":46,"category":47,"image_url":48,"cover_image":48,"language":19,"created_at":49},"58924f21-83f4-405d-8d9a-4af334e9d030","bentoml-turns-model-serving-into-python-apis-en","BentoML turns model serving into Python APIs","I break down BentoML’s serving model and give you a copy-ready template for OpenAI-compatible model APIs.","tools","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781054304942-bxxs.png","2026-06-10T01:17:56.721066+00:00",{"id":51,"slug":52,"title":53,"summary":54,"category":55,"image_url":56,"cover_image":56,"language":19,"created_at":57},"9f0c9505-6d75-411c-ba46-2382e8f295a5","turboquant-cuts-kv-cache-memory-6x-google-tests-en","TurboQuant cuts KV cache memory 6x in Google tests","Google Research says TurboQuant compresses KV caches by over 4x, with up to 6x less memory and no loss on long-context tests.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906679116-fqdo.png","2026-06-08T08:17:22.276769+00:00",{"id":59,"slug":60,"title":61,"summary":62,"category":17,"image_url":63,"cover_image":63,"language":19,"created_at":64},"e1f89f09-de96-4b90-95cd-405b3cf14807","tensormesh-raises-20m-cut-llm-memory-waste-en","Tensormesh raises $20M to cut LLM memory waste","Tensormesh raised $20 million from Nvidia, AMD and CoreWeave to reduce LLM reprocessing with KV caching.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780012972119-fiwk.png","2026-05-29T00:02:29.028672+00:00",{"id":66,"slug":67,"title":68,"summary":69,"category":70,"image_url":71,"cover_image":71,"language":19,"created_at":72},"e71cb6f6-c753-4b14-9e37-19634bdad1d8","why-verkor-turboquant-silicon-ip-matters-en","Why Verkor’s TurboQuant silicon IP matters more than the headline says","Verkor’s TurboQuant accelerator is a real step for LLM inference, but the bigger story is how quickly algorithm ideas are becoming silicon IP.","ai-agent","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779896872842-2hm8.png","2026-05-27T15:47:25.880442+00:00",{"id":74,"slug":75,"title":76,"summary":77,"category":55,"image_url":78,"cover_image":78,"language":19,"created_at":79},"8b3832ee-9b1b-4684-9d11-919559a92b28","marlin-greener-llm-inference-datacenters-en","MARLIN tackles greener LLM inference in datacenters","MARLIN uses multi-agent game-theoretic RL to make cloud LLM inference more sustainable.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779084239926-7642.png","2026-05-18T06:03:36.916559+00:00",{"id":81,"slug":82,"title":83,"summary":84,"category":55,"image_url":85,"cover_image":85,"language":19,"created_at":86},"407ca117-f24b-4ff9-96b8-09d4d4733b31","taming-black-box-llm-inference-scheduling-en","Taming Black-Box LLM Inference Scheduling","A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778740250597-fhpf.png","2026-05-14T06:30:33.21401+00:00",{"id":88,"slug":89,"title":90,"summary":91,"category":55,"image_url":92,"cover_image":92,"language":19,"created_at":93},"01b8c278-3f2b-4c2c-8505-63dea2a0fd5f","saga-workflow-atomic-scheduling-gpu-clusters-en","SAGA makes AI agent GPU scheduling workflow-aware","SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778567457823-o68t.png","2026-05-12T06:30:33.774584+00:00",{"id":95,"slug":96,"title":97,"summary":98,"category":55,"image_url":99,"cover_image":99,"language":19,"created_at":100},"3d747e63-24a0-4e20-9e83-e2263d06a779","speckv-adaptive-speculative-decoding-gamma-en","SpecKV tunes speculative decoding on the fly","SpecKV adapts speculative decoding’s token budget per step, using draft-model signals to beat fixed gamma across compression settings.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777961487463-lssf.png","2026-05-05T06:10:40.207648+00:00",{"id":102,"slug":103,"title":104,"summary":105,"category":55,"image_url":106,"cover_image":106,"language":19,"created_at":107},"bc8a4577-e218-43ae-a08b-4898abf26e2a","turboquant-online-vector-quantization-near-optimal-en","TurboQuant brings near-optimal online vector quantization","TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777467656845-z759.png","2026-04-29T13:00:40.593903+00:00",{"id":109,"slug":110,"title":111,"summary":112,"category":55,"image_url":113,"cover_image":113,"language":19,"created_at":114},"d7b529f2-02b7-4d5b-bf82-490aa5fe8362","turboquant-eden-citation-fight-en","TurboQuant, EDEN, and the citation fight","TurboQuant’s KV-cache quantization claims are under fire: EDEN authors say the paper reuses older ideas, weaker scales, and shaky benchmarks.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777467061610-ug4x.png","2026-04-29T12:50:47.131528+00:00",{"id":116,"slug":117,"title":118,"summary":119,"category":55,"image_url":120,"cover_image":120,"language":19,"created_at":121},"fdb997e1-6691-46c5-bb2d-e1ca3f730c25","turboquant-google-paper-explained-en","TurboQuant Explained: Why Google’s New Paper Matters","Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775160958409-7jj5.png","2026-04-02T20:15:40.601225+00:00",{"id":123,"slug":124,"title":125,"summary":126,"category":55,"image_url":127,"cover_image":127,"language":19,"created_at":128},"6fd1f021-a7ca-4fa7-9aae-6ca84b22dc6c","googles-turboquant-cuts-llm-memory-costs-en","Google's TurboQuant Cuts LLM Memory Costs","Google says TurboQuant uses QJL and PolarQuant to shrink vector-quantization memory and speed up LLM inference by up to 8x.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775160776347-4esa.png","2026-04-02T20:12:32.387326+00:00",6]