[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-量化":3},{"tag":4,"articles":10,"peer_article_count":11},{"id":5,"name":6,"slug":6,"article_count":7,"description_zh":8,"description_en":9},"a045e526-4ad5-4abd-afac-067f4f8cfd66","量化",5,"量化在 AI 推論裡多半指把權重或 KV cache 轉成更低位元表示，以換取更少記憶體、更低延遲與更高吞吐。近期焦點集中在 TurboQuant 這類方法，及其對長上下文、伺服器成本與 benchmark 公平性的影響。","Quantization in AI inference usually means storing weights or KV cache in lower-bit formats to cut memory use, latency, and cost. Recent coverage centers on TurboQuant-style methods and their trade-offs for long-context workloads, server economics, and benchmark fairness.",[],8]