[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-sebastian-raschka-llm-architecture-gallery-zh":3,"article-related-sebastian-raschka-llm-architecture-gallery-zh":30,"series-research-e7d8242f-edab-4282-8317-9a27fec3cb91":87},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":11,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":11},"e7d8242f-edab-4282-8317-9a27fec3cb91","sebastian-raschka-llm-architecture-gallery-zh","Sebastian Raschka 的 LLM 架構圖鑑","\u003Cp>\u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fllm-architecture-gallery\u002F\" target=\"_blank\" rel=\"noopener\">Sebastian Raschka’s LLM Architecture Gallery\u003C\u002Fa> 很像工程師的作弊表。它把 30 多個語言模型攤開來看。從 \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Fresearch\u002Fgpt-2\" target=\"_blank\" rel=\"noopener\">GPT-2\u003C\u002Fa> 到 \u003Ca href=\"https:\u002F\u002Fwww.llama.com\u002F\" target=\"_blank\" rel=\"noopener\">Llama 4\u003C\u002Fa>，每個模型都有層數、上下文長度、注意力型態，還有 KV cache 數字。\u003C\u002Fp>\u003Cp>這頁最猛的地方，是它不講空話。你只要看幾個欄位，就知道模型在伺服器上會多吃資源。像 \u003Ca href=\"https:\u002F\u002Fwww.llama.com\u002Fllama3\u002F\" target=\"_blank\" rel=\"noopener\">Llama 3\u003C\u002Fa> 8B 用 32 層，bf16 下每個 token 只要 128 KiB KV cache。\u003Ca href=\"https:\u002F\u002Fallenai.org\u002Folmo\" target=\"_blank\" rel=\"noopener\">OLMo 2\u003C\u002Fa> 7B 也是 32 層，但每個 token 要 512 KiB。差了 4 倍，這種差距不是小事。\u003C\u002Fp>\u003Ch2>這頁到底在幹嘛\u003C\u002Fh2>\u003Cp>講白了，這是一個模型架構資料庫。不是宣傳頁，也不是 benchmark 排行榜。它把架構圖、設定檔、技術報告連在一起，讓你能追到原始資料。這對做軟體的人很重要，因為很多成本問題，都藏在看起來很無聊的細節裡。\u003C\u002Fp>\u003Cp>像是 attention 用什麼形式、layer norm 放哪裡、layer 數多少、context 開多長。這些東西不會直接出現在行銷文案裡。可是它們會直接影響推論延遲、顯存壓力，還有一台卡能塞幾個 session。\u003C\u002Fp>\u003Cp>Raschka 也把他自己的比較文章串進來。像 \u003Ca 
href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2024\u002Fthe-big-llm-architecture-comparison.html\" target=\"_blank\" rel=\"noopener\">The Big LLM Architecture Comparison\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2024\u002Ffrom-gpt2-to-gpt-oss.html\" target=\"_blank\" rel=\"noopener\">From GPT-2 to gpt-oss\u003C\u002Fa>，還有 \u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2025\u002Ffrom-deepseek-v3-to-v3-2.html\" target=\"_blank\" rel=\"noopener\">From DeepSeek V3 to V3.2\u003C\u002Fa>。你可以把它當成一個入口，直接跳去看原始脈絡。\u003C\u002Fp>\u003Cul>\u003Cli>GPT-2 XL：15 億參數，1,024 token context，48 層 MHA，300 KiB KV cache\u003C\u002Fli>\u003Cli>Llama 3 8B：80 億參數，8,192 token context，32 層 GQA，128 KiB KV cache\u003C\u002Fli>\u003Cli>OLMo 2 7B：70 億參數，4,096 token context，32 層 MHA，512 KiB KV cache\u003C\u002Fli>\u003Cli>DeepSeek V3：6710 億總參數，370 億 active，61 層，68.6 KiB KV cache\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>架構差異，真的會影響部署\u003C\u002Fh2>\u003Cp>很多人看模型，先看參數量。說真的，這只看一半。真正決定你伺服器會不會爆掉的，常常是 cache 和 attention。Dense 模型比較好理解，但不一定好跑。MoE 模型參數很多，可是 active compute 可能低很多。\u003C\u002Fp>\u003Cp>像 \u003Ca href=\"https:\u002F\u002Fwww.deepseek.com\u002F\" target=\"_blank\" rel=\"noopener\">DeepSeek\u003C\u002Fa> V3 和 \u003Ca href=\"https:\u002F\u002Fwww.llama.com\u002Fllama4\u002F\" target=\"_blank\" rel=\"noopener\">Llama 4 Maverick\u003C\u002Fa> 這類 MoE 架構，就是把容量分散到多個 expert。這樣做的好處很直接。總參數可以很大，但每次只喚醒一部分，推論成本不一定跟著爆。\u003C\u002Fp>\u003Cp>注意力設計也很有戲。有人用標準 multi-head attention。有人用 grouped-query attention。有人加 QK-Norm。有人把長上下文切成 chunk，再混一點 full attention。Raschka 把這些設計放在同一頁，差異一眼就看得出來。\u003C\u002Fp>\u003Cblockquote>“The best way to understand a model is to look at its architecture.” — Sebastian Raschka, \u003Ca 
href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2024\u002Fthe-big-llm-architecture-comparison.html\" target=\"_blank\" rel=\"noopener\">The Big LLM Architecture Comparison\u003C\u002Fa>\u003C\u002Fblockquote>\u003Cp>這句話很直白，也很對。Benchmark 只告訴你結果。架構會告訴你，這模型為什麼能跑出這個結果。\u003C\u002Fp>\u003Cp>我覺得這頁還有一個加分點。它不是一次性圖表。它把來源、版本、差異都整理起來。頁面也有 issue tracker，錯了可以回報到 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frasbt\u002Fllm-architecture-gallery\u002Fissues\" target=\"_blank\" rel=\"noopener\">Architecture Gallery issue tracker\u003C\u002Fa>。在 LLM 世界，規格常常改很快。這種維護很實際。\u003C\u002Fp>\u003Cul>\u003Cli>Llama 4 Maverick：4000 億總參數，170 億 active，1,000,000 token context，36 chunked + 12 full GQA layers\u003C\u002Fli>\u003Cli>Qwen3 235B-A22B：2350 億總參數，220 億 active，94 層，188 KiB KV cache\u003C\u002Fli>\u003Cli>Gemma 3 27B：270 億參數，128,000 token context，52 個 sliding-window + 10 個 global layers\u003C\u002Fli>\u003Cli>Mistral Small 3.1：240 億參數，128,000 token context，40 層 GQA，160 KiB KV cache\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>為什麼比對工具比海報更有用\u003C\u002Fh2>\u003Cp>這頁也有海報版，還能在 \u003Ca href=\"https:\u002F\u002Fwww.redbubble.com\u002F\" target=\"_blank\" rel=\"noopener\">Redbubble\u003C\u002Fa> 買到，或去 \u003Ca href=\"https:\u002F\u002Fgumroad.com\u002F\" target=\"_blank\" rel=\"noopener\">Gumroad\u003C\u002Fa> 找可列印版本。拿來掛牆上很帥，這點我不否認。但真正有用的是比較\u003Ca href=\"\u002Fnews\u002Fai-coding-tool-prices-2026-free-vs-paid-zh\">工具\u003C\u002Fa>。牆上海報是裝飾。比較工具才是工程師會一直開著的東西。\u003C\u002Fp>\u003Cp>因為很多模型在參數大小上差不多，部署成本卻差超多。Llama 3 8B 每個 token 只要 128 KiB cache。OLMo 2 7B 卻要 512 KiB。這不是小差異。這會直接影響 batch size、吞吐量、延遲，還有你到底能不能在同一張卡上多開幾個 request。\u003C\u002Fp>\u003Cp>更大的模型差異更明顯。DeepSeek V3 有 671B total parameters，但 active 只有 37B。這種配置很適合拿來討論 serving 策略。你不能只說它大。你要問的是，實際推論時到底啟動多少參數。\u003C\u002Fp>\u003Cp>Llama 4 Maverick 更誇張。它把 context 拉到 1,000,000 token。這種數字很容易讓人喊哇塞，但工程師會先問另一件事：長上下文到底要多少記憶體，吞吐量會掉多少。這才是重點。\u003C\u002Fp>\u003Cul>\u003Cli>Dense 8B 與 7B 模型，cache 差 4 倍\u003C\u002Fli>\u003Cli>DeepSeek V3 的 active 參數遠低於 total 
參數\u003C\u002Fli>\u003Cli>1,000,000 token context 會改變 serving 方式\u003C\u002Fli>\u003Cli>GQA 通常比傳統 MHA 更省記憶體\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這頁放在產業脈絡裡怎麼看\u003C\u002Fh2>\u003Cp>LLM 這幾年很像從比誰大，變成比誰會省。早期大家在意參數量。後來大家開始看 context。現在更現實。大家在意的是，跑一次要多少顯存，能不能撐住長對話，API 成本會不會炸掉。\u003C\u002Fp>\u003Cp>這也解釋了為什麼架構圖會越來越重要。當模型數量一多，單看排行榜很容易失真。你可能以為兩個模型差不多。結果一個是 dense，一個是 MoE。或是一個用 128K context，另一個只有 4K。部署上的麻煩完全不同。\u003C\u002Fp>\u003Cp>對台灣團隊來說，這種資料很實用。很多新創和內部工具，不一定有超大 GPU 叢集。你更需要知道，哪個模型比較省 cache，哪個 attention 比較穩，哪個 stack 比較適合本地伺服器。這種時候，架構比行銷更誠實。\u003C\u002Fp>\u003Cp>如果你想對照不同模型的新聞解讀，也可以看 OraCore 先前的整理，例如 \u003Ca href=\"\u002Fnews\u002Fllama-4-maverick-architecture-notes\">Llama 4 Maverick architecture notes\u003C\u002Fa>，還有 \u003Ca href=\"\u002Fnews\u002Fdeepseek-v3-2-what-changed\">DeepSeek V3.2 breakdown\u003C\u002Fa>。這些內容跟 Raschka 的圖鑑放在一起看，會更有感。\u003C\u002Fp>\u003Ch2>工程師該怎麼用這份圖鑑\u003C\u002Fh2>\u003Cp>如果你在做 LLM 產品，我會建議你直接把這頁存書籤。真的。你在選模型時，先看 layer、cache、attention，再看 benchmark。順序別顛倒。因為 benchmark 很容易讓人高潮，架構才會決定你能不能上線。\u003C\u002Fp>\u003Cp>如果你在學 LLM，這頁也很適合拿來對照。你可以從 GPT-2 看起，接著看 Llama 3、OLMo 2，再看 DeepSeek 和 Qwen。你會很快發現，模型演進不是只靠參數變大。很多時候，差異來自更好的 attention、更聰明的 cache 設計，還有更務實的 serving 思維。\u003C\u002Fp>\u003Cp>我的判斷很簡單。接下來幾個月，大家會更常用這種架構圖鑑來做模型選型。不是因為它很潮。是因為它真的省時間，也真的能少踩坑。你如果要選一個模型上線，先看這頁，再看文件，通常比較不會翻車。\u003C\u002Fp>\u003C\u002Fcontent>","Raschka 的 LLM Architecture Gallery 把 GPT-2、Llama 3、OLMo 2、DeepSeek、Qwen 等模型的層數、注意力與 KV cache 數字攤開來比，工程師一眼就能看出部署差異。","sebastianraschka.com","https:\u002F\u002Fsebastianraschka.com\u002Fllm-architecture-gallery\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775121663540-srg4.png","research","zh","cdcfe76f-c9bf-44ac-98d9-e9041d414d6c",[17,18,19,20,21,22,23,24,25,26],"LLM architecture","Sebastian Raschka","KV cache","attention","GPT-2","Llama 
3","DeepSeek","Qwen","模型架構","人工智慧",5,"2026-04-02T07:27:33.561537+00:00","2026-04-02T07:27:33.502+00:00",{"tags":31,"relatedLang":46,"relatedPosts":50},[32,34,36,37,39,41,43,45],{"name":18,"slug":33},"sebastian-raschka",{"name":24,"slug":35},"qwen",{"name":26,"slug":26},{"name":19,"slug":38},"kv-cache",{"name":17,"slug":40},"llm-architecture",{"name":22,"slug":42},"llama-3",{"name":23,"slug":44},"deepseek",{"name":25,"slug":25},{"id":15,"slug":47,"title":48,"language":49},"sebastian-raschka-llm-architecture-gallery-en","Sebastian Raschka’s LLM Architecture Gallery","en",[51,57,63,69,75,81],{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":13},"4a829d2a-24a3-42dd-8be4-49e5ab35435a","why-prompt-engineering-is-wrong-about-2026-zh","為什麼 2026 年 prompt engineering 錯了","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780661884287-ow45.png","2026-06-05T12:17:19.813402+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":13},"52a37532-880d-4261-8f62-2f254d6c592d","spire-evidence-grounded-ai-humanities-zh","SPIRE 讓人文 AI 
更重證據","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780647483844-bcuj.png","2026-06-05T08:17:29.603104+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":13},"b38c56a6-e7f3-45fb-b100-d37e7b3ed417","reinforcement-aware-distillation-llm-reasoning-zh","強化感知蒸餾，想把推理一起學進去","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780646589500-0me6.png","2026-06-05T08:02:33.908932+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":13},"60f7d702-20a7-4cec-9a80-185f072c8dfe","next-token-models-plan-ahead-zh","次詞模型其實會先想一步","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780645684780-roea.png","2026-06-05T07:47:34.35089+00:00",{"id":76,"slug":77,"title":78,"cover_image":79,"image_url":79,"created_at":80,"category":13},"7ec803f7-2658-4c9e-baa6-2b8528407d7f","google-deepmind-co-scientist-researchers-zh","Google DeepMind 對外開放 Co-Scientist","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780636679231-q694.png","2026-06-05T05:17:30.68789+00:00",{"id":82,"slug":83,"title":84,"cover_image":85,"image_url":85,"created_at":86,"category":13},"923bb0c4-95f3-49a0-8e01-5cdd6bcd2e32","fixing-llm-forgetting-es-fine-tuning-zh","ES 微調忘記問題有解了","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780604276240-arx4.png","2026-06-04T20:17:25.720929+00:00",[88,93,98,103,108,113,118,123,128,133],{"id":89,"slug":90,"title":91,"created_at":92},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 
AI","2026-03-26T08:16:02.367355+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","超越距離測量：用微分幾何重新理解神經網路","2026-03-31T06:01:01.241968+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","讓AI繪圖更有創意：用排斥力提升生成多樣性","2026-03-31T06:01:25.439673+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","影像輔助波束預測升級 CNN","2026-04-01T10:00:25.8073+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","三頻 FSS MIMO 天線瞄準 sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","OpenClaw 1299 個 Repo 的資料解讀","2026-04-02T05:03:45.208411+00:00",{"id":134,"slug":135,"title":136,"created_at":137},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","毫米波 MIMO 的雙階合併法","2026-04-02T05:27:26.897188+00:00"]