[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-evaluation-protocols-fine-tuned-llms-2026-zh":3,"article-related-evaluation-protocols-fine-tuned-llms-2026-zh":30,"series-research-404bac33-b9b4-41bb-bb9a-1d98a63aa536":73},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"404bac33-b9b4-41bb-bb9a-1d98a63aa536","evaluation-protocols-fine-tuned-llms-2026-zh","2026 微調 LLM 評估流程","\u003Cp data-speakable=\"summary\">建立一套可落地的微調 \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> 評估流程，涵蓋任務指標、LLM 評審、安全檢查與人工複核。\u003C\u002Fp>\u003Cp>這篇給 ML 工程師、應用研究員與產品團隊看。照著做完，你會得到一套可執行的評估協議，能用來檢查微調後模型的任務品質、安全性與真實情境可靠度。\u003C\u002Fp>\u003Cp>它適用於摘要、程式生成、聊天與其他下游任務。你也會知道何時該用自動指標、何時該交給 LLM 評審、何時必須拉人進來做人工複核。\u003C\u002Fp>\u003Ch2>開始之前\u003C\u002Fh2>\u003Cul>\u003Cli>Python 3.11+\u003C\u002Fli>\u003Cli>Node 20+，如果你要做 Web dashboard 或 review UI\u003C\u002Fli>\u003Cli>至少一個可呼叫的微調 LLM endpoint\u003C\u002Fli>\u003Cli>Judge model 供應商的 API key，或自架 judge model\u003C\u002Fli>\u003Cli>已標註的 validation set 與獨立的 held-out test set\u003C\u002Fli>\u003Cli>官方文件：DeepEval \u003Ca href=\"https:\u002F\u002Fdocs.deepeval.com\u002F\">docs.deepeval.com\u003C\u002Fa>，GitHub \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fdeepeval\">github.com\u002Fconfident-ai\u002Fdeepeval\u003C\u002Fa>\u003C\u002Fli>\u003Cli>官方文件：LightEval \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Findex\">huggingface.co\u002Fdocs\u002Flighteval\u003C\u002Fa>，GitHub \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval\">github.com\u002Fhuggingface\u002Flighteval\u003C\u002Fa>\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Step 1: 定義成功規格\u003C\u002Fh2>\u003Cp>目的：先把「好模型」寫成可檢查的標準，避免後面所有分數都偏離產品目標。\u003C\u002Fp>\n\u003Cfigure 
class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783101776530-b8eu.png\" alt=\"2026 微調 LLM 評估流程\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>把你在意的行為列清楚，例如 factuality、brevity、tone、安全性、\u003Ca href=\"\u002Fnews\u002Fcodex-chat-to-delivery-ai-coding-zh\">code\u003C\u002Fa> correctness 或 instruction adherence。先寫規格，再跑任何測試，這樣評估才會對齊產品，\u003Ca href=\"\u002Fnews\u002Fdeepspec-data-regeneration-pipeline-qwen3-eagle3-zh\">而不是\u003C\u002Fa>對齊最容易算的數字。\u003C\u002Fp>\u003Cpre>\u003Ccode>Support model success criteria example:\n- Directly answer the user’s question\n- Stay under 120 words unless detail is required\n- Avoid unsafe or private-data content\n- Use a calm, professional tone\n- Escalate uncertain cases instead of inventing facts\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>驗收：你應該看到一份可供審查者一致套用的 rubric 文件。若兩位審查者對同一批樣本的分數接近，這份規格就足夠驅動後續流程。\u003C\u002Fp>\u003Ch2>Step 2: 選定任務指標\u003C\u002Fh2>\u003Cp>目的：建立一組快速、便宜的基準指標，先篩掉明顯不合格的輸出，再把深度檢查留給後續層。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783101772635-xogv.png\" alt=\"2026 微調 LLM 評估流程\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>分類與 QA 用 exact match 或 F1，摘要用 ROUGE-L 或 BERTScore，程式生成用 Pass@k 加單元測試。對開放式聊天，只把輕量 helpfulness 或 coherence 當初步濾網，不要當最終裁決。\u003C\u002Fp>\u003Cp>驗收：你應該看到一份 baseline report，會按任務類型分開列分數，而不是把所有任務平均成單一總分。若程式模型相似度高卻測試失敗，代表你選錯主指標。\u003C\u002Fp>\u003Ch2>Step 3: 加入 LLM 評審層\u003C\u002Fh2>\u003Cp>目的：補上語意層的判斷，讓評估不只停在字串重疊，而是能接近人類對品質的理解。\u003C\u002Fp>\u003Cp>把 prompt、模型輸出與清楚 rubric 一起餵給 judge model，維度可包含 coherence、relevance、completeness 與 safety。建議用 1 到 5 分這種結構化評分，方便追蹤趨勢與比較不同 run。\u003C\u002Fp>\u003Cp>若能選專門的 judge 就不要只用通用 judge，因為後者容易有風格偏誤或長度偏誤。若做 pairwise 比較，記得隨機化候選順序，降低位置偏誤。\u003C\u002Fp>\u003Cp>驗收：你應該看到每個維度的 judge 
分數與文字理由。若 judge 能說清楚為何 A 比 B 好，這一層就已經可用。\u003C\u002Fp>\u003Ch2>Step 4: 執行安全與偏誤檢查\u003C\u002Fh2>\u003Cp>目的：把有害行為獨立量化，避免只看任務成功卻忽略毒性、偏見或隱私外洩。\u003C\u002Fp>\u003Cp>用 red-team prompts、jailbreak 嘗試與對抗式邊界案例測試模型，並量化有害輸出率、拒答品質，以及在壓力下是否洩漏私人或訓練來源內容。\u003C\u002Fp>\u003Cp>把 fairness 與 toxicity 一起納入同一輪評估，避免安全被當成後期附加項。如果模型在 helpfulness 表現很好，但有害內容失敗率高，安全分數就應該直接擋下發布。\u003C\u002Fp>\u003Cp>驗收：你應該看到一份安全儀表板，失敗案例會按風險類型分組。若經過 prompt 或資料調整後失敗率下降，代表你的緩解策略有效。\u003C\u002Fp>\u003Ch2>Step 5: 驗證留出集與真實樣本\u003C\u002Fh2>\u003Cp>目的：證明模型不只在訓練分布內表現好，也能在沒看過的資料上維持品質。\u003C\u002Fp>\u003Cp>保持 test data 與 training、validation 完全分離，再加入 out-of-distribution 樣本與真實使用者查詢，去捕捉資料洩漏、記憶化與脆弱行為。\u003C\u002Fp>\u003Cp>抽一小批輸出做人工複核，並和自動分數對照。如果相關性很弱，就先修 rubric、judge prompt 或指標組合，再決定是否上線。\u003C\u002Fp>\u003Cp>驗收：你應該看到一份 validation summary，同時包含 offline scores 與 human spot-check 結果。若留出集表現穩定，且真實提示也通過人工檢查，這套評估協議就可信。\u003C\u002Fp>\u003Ch2>Step 6: 建立上線後漂移監控\u003C\u002Fh2>\u003Cp>目的：把評估\u003Ca href=\"\u002Fnews\u002Fornith-1-agent-coding-server-template-zh\">變成\u003C\u002Fa>持續循環，讓模型在上線後的品質變化能被及早發現。\u003C\u002Fp>\u003Cp>追蹤核心指標的時間趨勢、記錄被拒答的輸出，並把失敗案例回灌到評估集。對新樣本重跑安全檢查與 judge scoring，讓回歸問題提早浮現。\u003C\u002Fp>\u003Cp>如果你看到 helpfulness 下降或安全事件增加，應把它視為評估失敗，而不只是客服問題。重點是讓評估協議跟真實使用情境保持同步。\u003C\u002Fp>\u003Cp>驗收：你應該看到固定的報告節奏、趨勢線、告警與持續擴充的失敗案例庫。若監控循環已啟動，評估系統就成了產品生命週期的一部分。\u003C\u002Fp>\u003Ch2>常見錯誤\u003C\u002Fh2>\u003Cul>\u003Cli>把 perplexity 當成主要成功指標。修法：它只適合 pre-training 或 token prediction 任務，微調輸出請改用任務指標。\u003C\u002Fli>\u003Cli>讓訓練樣本滲入 test set。修法：強制資料切分，並在最終評估加入 out-of-distribution prompts。\u003C\u002Fli>\u003Cli>只信一個 judge 分數。修法：拿 judge 結果對照人工標註，若相關性弱就調整 rubric 與 judge prompt。\u003C\u002Fli>\u003C\u002Ful>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>指標\u003C\u002Fth>\u003Cth>基準／優化前\u003C\u002Fth>\u003Cth>結果／優化後\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>評估範圍\u003C\u002Ftd>\u003Ctd>只看 perplexity 或 ROUGE\u003C\u002Ftd>\u003Ctd>任務指標 + judge + safety + human 
review\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>程式品質訊號\u003C\u002Ftd>\u003Ctd>只看文字相似度\u003C\u002Ftd>\u003Ctd>Pass@k 搭配 unit tests\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>安全覆蓋\u003C\u002Ftd>\u003Ctd>臨時人工檢查\u003C\u002Ftd>\u003Ctd>Red-team prompts 與 toxicity scoring\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>上線準備度\u003C\u002Ftd>\u003Ctd>只有離線 benchmark\u003C\u002Ftd>\u003Ctd>留出集加真實世界驗證\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>接下來可以看什麼\u003C\u002Fh2>\u003Cp>等這套評估流程穩定後，可以往領域專屬 \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa>、持續監控與 CI 自動回歸閘門延伸，讓每次新微調都用同一標準檢查。\u003C\u002Fp>","建立一套可落地的微調 LLM 評估流程，涵蓋任務指標、LLM 評審、安全檢查與人工複核。","brics-econ.org","https:\u002F\u002Fbrics-econ.org\u002Fevaluation-protocols-for-fine-tuned-llms-what-to-measure-in",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783101776530-b8eu.png","research","zh","c2d5749a-f8c3-460f-bdcc-019aa1bf2552",[17,18,19,20,21],"Fine-tuned LLM","evaluation protocol","LLM judge","safety checks","human review",[23,24,25],"先定義成功規格，再選任務指標，避免評估目標跑偏。","用自動指標、LLM 評審、安全檢查與人工複核組成分層流程。","把留出集、真實樣本與上線監控接進同一套評估協議。",0,"2026-07-03T18:02:24.572198+00:00","2026-07-03T18:02:24.567+00:00","0c35a120-52fc-41fc-afa3-d404eb934158",{"tags":31,"relatedLang":32,"relatedPosts":36},[],{"id":15,"slug":33,"title":34,"language":35},"evaluation-protocols-fine-tuned-llms-2026-en","Evaluation Protocols for Fine-Tuned LLMs in 2026","en",[37,43,49,55,61,67],{"id":38,"slug":39,"title":40,"cover_image":41,"image_url":41,"created_at":42,"category":13},"8f3122c8-9eb1-4aa6-b780-3b62003b3418","deepspec-data-regeneration-pipeline-qwen3-eagle3-zh","DeepSpec 
應被視為資料重生管線，而不是訓練技巧","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783080165006-321z.png","2026-07-03T12:02:18.375863+00:00",{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"6cfddc0d-ce6e-4a14-baf7-3531bf32bc5d","program-as-weights-fuzzy-functions-zh","PAW把提示詞編成可重用工具","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783062178440-pnt0.png","2026-07-03T07:02:32.5878+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"5bd0dc27-5a7f-4563-8086-acccc98eb2fc","lacuna-llm-unlearning-localization-testbed-zh","LACUNA：檢驗 LLM 真的有沒有忘記","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783060373883-d92j.png","2026-07-03T06:32:31.28626+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"ff17d0f0-f249-41e3-b62e-658282631451","persistent-state-ai-agents-attack-surface-zh","持久狀態 AI 代理的新攻擊面","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1783058580349-ldhu.png","2026-07-03T06:02:30.282788+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"4c1c0228-6f8e-4be6-b948-61bc48e67746","language-critiques-imitation-learning-zh","語言批註讓模仿學習更準","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782975775937-7kd6.png","2026-07-02T07:02:28.766504+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"5b59165e-18fd-4c10-afa4-1307e39a11f0","one-transformer-layer-can-carry-rl-gains-zh","單層 Transformer 也能扛住 RL 
增益","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782973979895-px83.png","2026-07-02T06:32:29.183313+00:00",[74,79,84,89,94,99,104,109,114,119],{"id":75,"slug":76,"title":77,"created_at":78},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":80,"slug":81,"title":82,"created_at":83},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","超越距離測量：用微分幾何重新理解神經網路","2026-03-31T06:01:01.241968+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","讓AI繪圖更有創意：用排斥力提升生成多樣性","2026-03-31T06:01:25.439673+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","影像輔助波束預測升級 CNN","2026-04-01T10:00:25.8073+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","三頻 FSS MIMO 天線瞄準 sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","OpenClaw 1299 個 Repo 
的資料解讀","2026-04-02T05:03:45.208411+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","毫米波 MIMO 的雙階合併法","2026-04-02T05:27:26.897188+00:00"]