[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cuda-asinf-accuracy-no-performance-hit-zh":3,"article-related-cuda-asinf-accuracy-no-performance-hit-zh":27,"series-tools-83e2a967-1919-4771-857f-37fb8d4cfd00":82},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":11,"views":24,"created_at":25,"published_at":26,"topic_cluster_id":11},"83e2a967-1919-4771-857f-37fb8d4cfd00","cuda-asinf-accuracy-no-performance-hit-zh","CUDA asinf() 更準，速度沒掉","\u003Cp>GPU 上的三角函式，常常很現實。多 1、2 條指令，整個 kernel 就可能變味。這次在 \u003Ca href=\"https:\u002F\u002Fforums.developer.nvidia.com\u002F\" target=\"_blank\" rel=\"noopener\">NVIDIA Developer Forums\u003C\u002Fa> 上，有人把 \u003Ca href=\"https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002F\" target=\"_blank\" rel=\"noopener\">CUDA\u003C\u002Fa> 的 \u003Ccode>asinf()\u003C\u002Fcode> 拿來重做，目標很直白：準度更好，效能別掉。\u003C\u002Fp>\u003Cp>更狠的是，CUDA 12.8 原生 \u003Ccode>asinf()\u003C\u002Fcode> 編譯後是 26 條指令。這代表你想贏它，不能靠嘴砲。你得在同樣級距內，把誤差壓得更漂亮。講白了，這就是 GPU 數學工程的硬仗。\u003C\u002Fp>\u003Cp>我覺得這種題目很有意思。因為它不是在玩花俏演算法。它是在碰實際開發會遇到的痛點。你要的是能塞進現有 kernel 的版本，不是紙上談兵的漂亮公式。\u003C\u002Fp>\u003Ch2>為什麼 GPU 數學這麼難搞\u003C\u002Fh2>\u003Cp>在 GPU 上，函式不是單獨存在。它會被一整批 thread 重複呼叫。只要一個 \u003Ccode>asinf()\u003C\u002Fcode> 多幾條指令，吞吐量就可能被拖到。這在模擬、渲染、訊號處理，還有前處理資料時都很常見。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142948311-udy5.png\" alt=\"CUDA asinf() 更準，速度沒掉\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>問題是，精度和速度常常互相拉扯。你想把誤差壓低，通常就得多做幾步近似修正。你想快，就可能得接受粗一點的結果。這次的重點，正是想把這條線往前推一點。\u003C\u002Fp>\u003Cp>CUDA 的標準數學函式本來就有做過硬體優化。要在這種基準上再改進，難度不低。尤其 
\u003Ccode>asinf()\u003C\u002Fcode> 這種反三角函式，輸入靠近 -1 或 1 時，數值敏感度會上來，誤差很容易被放大。\u003C\u002Fp>\u003Cul>\u003Cli>CUDA 12.8 原生 \u003Ccode>asinf()\u003C\u002Fcode>：26 條指令\u003C\u002Fli>\u003Cli>目標：提高精度，別增加明顯成本\u003C\u002Fli>\u003Cli>適用場景：大量重複呼叫的 GPU kernel\u003C\u002Fli>\u003Cli>風險：邊界輸入的誤差會被放大\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這次改的是哪個痛點\u003C\u002Fh2>\u003Cp>\u003Ccode>asinf()\u003C\u002Fcode> 看起來很單純。其實它很挑輸入。靠近區間邊界時，arcsine 的斜率變化很大。這表示一點點近似誤差，可能在輸出端變得很明顯。對做數值運算的人來說，這種地方最容易出事。\u003C\u002Fp>\u003Cp>這篇討論的出發點，和之前的 \u003Ccode>acosf()\u003C\u002Fcode> 優化很像。先找出內建函式的誤差弱點，再用更細的近似策略補上。這種做法很務實。它不是追求理論上最漂亮，而是追求在真實 GPU 上比較好用。\u003C\u002Fp>\u003Cp>重點還有一個。它不是只看精度。它同時盯著指令數。因為在 CUDA 世界裡，指令數很誠實。你多寫一點，編譯器和硬體通常都會讓你付帳。這也是為什麼 26 條指令這個基準很重要。\u003C\u002Fp>\u003Cblockquote>“The built-in implementation of CUDA 12.8 served as my baseline. It compiles to 26 instructions ...”\u003C\u002Fblockquote>\u003Cp>這句話很乾脆。它把比較基準講清楚了。不是拿舊版本、不是拿 debug build、也不是拿一個慢到不行的參考實作。它直接對準 \u003Ca href=\"\u002Fnews\u002Fcuda-tile-basic-nvidia-april-fools-post-zh\">NVIDIA\u003C\u002Fa> 現成版本。\u003C\u002Fp>\u003Cp>如果你想看原始討論，來源在 \u003Ca href=\"https:\u002F\u002Fforums.developer.nvidia.com\u002Ft\u002Fimplementation-of-asinf-with-improved-accuracy-and-without-negative-performance-impact\u002F365423\" target=\"_blank\" rel=\"noopener\">NVIDIA Developer Forums\u003C\u002Fa>。相關背景也可以搭配 OraCore 的 \u003Ca href=\"\u002Fnews\u002Fcuda-12-8-math-updates\" target=\"_blank\" rel=\"noopener\">CUDA 12.8 math updates\u003C\u002Fa> 一起看。\u003C\u002Fp>\u003Ch2>跟原生版本比，差在哪裡\u003C\u002Fh2>\u003Cp>這類優化最怕一件事。你以為自己贏了，結果只是把誤差從 A 換成 B。真正有價值的比較，必須在同一顆 GPU、同一個編譯器條件下做。這樣才知道差異是不是實際存在。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142961548-rnqy.png\" alt=\"CUDA asinf() 更準，速度沒掉\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>原生 \u003Ccode>asinf()\u003C\u002Fcode> 已經很強。它能維持 26 條指令，代表 
NVIDIA 早就把很多細節磨過了。你要在這個基準上改善，通常得靠更精細的分段近似，或更好的誤差修正策略。\u003C\u002Fp>\u003Cp>我覺得這類工作最有價值的地方，不是單次結果，而是方法論。先找 vendor baseline。再看誤差分佈。最後才決定要不要換掉內建函式。這種流程，比看到一個漂亮數字就高潮來得可靠多了。\u003C\u002Fp>\u003Cul>\u003Cli>原生版本已經高度優化，不是隨便就能超過\u003C\u002Fli>\u003Cli>比較重點是同硬體、同編譯條件\u003C\u002Fli>\u003Cli>邊界區間的誤差最值得盯\u003C\u002Fli>\u003Cli>能直接塞進既有 kernel，實用性才高\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這件事放到產業裡怎麼看\u003C\u002Fh2>\u003Cp>GPU 數學優化，通常不會上新聞首頁。可是它真的會影響產品。做 3D、科學計算、影像管線、ML 前處理的人，都可能碰到這種函式。你平常看不到它，但它會藏在熱點裡偷吃效能。\u003C\u002Fp>\u003Cp>這也解釋了為什麼很多團隊會自己寫近似函式。不是因為官方版本爛。是因為不同工作負載，容忍的誤差不同。像有些圖學管線，能接受一點誤差換吞吐量；但某些物理模擬，就得把誤差壓得更死。\u003C\u002Fp>\u003Cp>這裡可以順手對比一下。NVIDIA 的原生數學庫，優勢在穩定和硬體貼合。自寫近似函式，優勢在可控。前者像現成工具箱。後者像自己改扳手。哪個好，要看你手上的工作。\u003C\u002Fp>\u003Cul>\u003Cli>原生函式：穩定、好用、貼近硬體\u003C\u002Fli>\u003Cli>自寫近似：可調整誤差與成本\u003C\u002Fli>\u003Cli>適合大量重複呼叫的熱點函式\u003C\u002Fli>\u003Cli>數值工作越敏感，越需要自己量測\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>背景再往前看一點\u003C\u002Fh2>\u003Cp>這種討論其實不是新鮮事。從 CPU 時代開始，數學函式就一直在精度和速度之間拉扯。到了 GPU，這個問題更明顯。因為一個 kernel 可能同時跑上千個 thread，任何微小成本都會被放大。\u003C\u002Fp>\u003Cp>另一個背景是，現代編譯器和硬體已經很會優化。這代表你不能再用「我自己寫一定比較快」這種老派想法。很多時候，內建版本就是很強。你要贏它，得拿出明確證據，不然只是自嗨。\u003C\u002Fp>\u003Cp>也因為這樣，這次的案例才值得看。它沒有亂吹。它直接把目標鎖在 26 條指令這個硬門檻上。這種做法很工程，也很誠實。對開發者來說，這比空談精度有用多了。\u003C\u002Fp>\u003Ch2>你可以怎麼用這個思路\u003C\u002Fh2>\u003Cp>如果你自己在寫 CUDA，我會建議先看熱點。先找出哪些函式被呼叫最多。再看它們是不是剛好落在 \u003Ccode>asinf()\u003C\u002Fcode>、\u003Ccode>acosf()\u003C\u002Fcode> 這種高敏感區。不要一開始就改整包，先動最痛的地方。\u003C\u002Fp>\u003Cp>接著，自己做測試。量誤差。量指令數。量 kernel 時間。三個都要看。少一個，你就很容易被假象騙到。尤其是資料量一大，單次函式差一點點，最後都會變成真金白銀的成本。\u003C\u002Fp>\u003Cp>我自己的看法是，這類優化會越來越實際。不是因為大家突然愛研究數學。是因為 GPU 算力很貴，誰都不想把時間浪費在不必要的近似誤差上。你如果能把準度拉高，還不多花指令，這種成果很難不讓人心動。\u003C\u002Fp>\u003Cp>下一步最值得看的，不是這個版本本身，而是它能不能在更多 GPU、更多輸入分佈、更多編譯設定下維持表現。你要是正在做 CUDA 專案，現在就該把熱函式列出來，重新量一次。別猜，直接測。\u003C\u002Fp>","NVIDIA Developer Forums 上有人替 CUDA 12.8 的 asinf() 做精度優化，指令數仍維持 26 條。這篇看它怎麼在 GPU 
數學裡，硬拚準度與效能。","forums.developer.nvidia.com","https:\u002F\u002Fforums.developer.nvidia.com\u002Ft\u002Fimplementation-of-asinf-with-improved-accuracy-and-without-negative-performance-impact\u002F365423",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142948311-udy5.png","tools","zh","5dda57f2-dfb7-4970-98ec-2e6ad298dd8c",[17,18,19,20,21,22,23],"CUDA","asinf","GPU math","NVIDIA","數值精度","效能優化","CUDA 12.8",5,"2026-04-02T15:15:32.933149+00:00","2026-04-02T15:15:32.901+00:00",{"tags":28,"relatedLang":41,"relatedPosts":45},[29,30,33,35,36,38,40],{"name":22,"slug":22},{"name":31,"slug":32},"Nvidia","nvidia",{"name":17,"slug":34},"cuda",{"name":18,"slug":18},{"name":19,"slug":37},"gpu-math",{"name":23,"slug":39},"cuda-128",{"name":21,"slug":21},{"id":15,"slug":42,"title":43,"language":44},"cuda-asinf-accuracy-no-performance-hit-en","CUDA asinf() Gets More Accurate Without Slowing Down","en",[46,52,58,64,70,76],{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"91822854-0010-478e-b70c-6a624d039703","cloudflare-turns-startup-traffic-into-a-moat-zh","Cloudflare 讓流量變護城河","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780590804649-xc2z.png","2026-06-04T16:32:50.96702+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"6ea3977e-ea7f-4d71-9472-08b512f81593","ai-code-review-tools-catch-hard-bugs-zh","AI code review 讓你抓到硬 bug","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780582701702-jnoi.png","2026-06-04T14:17:50.313258+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"0342ff17-feea-4e43-81ff-d12c43cc93c0","claude-partner-network-learning-path-launches-zh","Claude 
合作夥伴課程上線","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780578178111-1za9.png","2026-06-04T13:02:27.319581+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"1a92ac0a-75ea-4877-874d-4a309cd0085b","nvidia-research-gpu-template-zh","NVIDIA 研究頁把 GPU 資源變模板","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780567412863-e8oq.png","2026-06-04T10:02:58.043845+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"3ead09ec-5656-4165-9bb0-f602add3c409","qdrant-filter-first-rag-design-decoded-zh","Qdrant 讓 RAG 先過濾再找相似","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780566519640-bdds.png","2026-06-04T09:47:59.450347+00:00",{"id":77,"slug":78,"title":79,"cover_image":80,"image_url":80,"created_at":81,"category":13},"7b5e6965-307e-4492-bf65-d922cd7818ad","anthropic-code-review-tool-ai-generated-code-zh","Anthropic 讓 AI 程式變可審","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780563813320-5wc7.png","2026-06-04T09:02:56.999212+00:00",[83,88,93,98,103,108,113,118,123,128],{"id":84,"slug":85,"title":86,"created_at":87},"855cd52f-6fab-46cc-a7c1-42195e8a0de4","surepath-real-time-mcp-policy-controls-zh","SurePath 推出即時 MCP 政策控管","2026-03-26T07:57:40.77233+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"9b19ab54-edef-4dbd-9ce4-a51e4bae4ebb","mcp-in-2026-the-ai-tool-layer-teams-use-zh","2026 年 MCP：團隊真的在用的 AI 工具層","2026-03-26T08:01:46.589694+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"af9c46c3-7a28-410b-9f04-32b3de30a68c","prompting-in-2026-what-actually-works-zh","2026 
提示工程，真正有用的是什麼","2026-03-26T08:08:12.453028+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"05553086-6ed0-4758-81fd-6cab24b575e0","garry-tan-open-sources-claude-code-toolkit-zh","Garry Tan 開源 Claude Code 工具包","2026-03-26T08:26:20.068737+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"042a73a2-18a2-433d-9e8f-9802b9559aac","github-ai-projects-to-watch-in-2026-zh","2026 必看 20 個 GitHub AI 專案","2026-03-26T08:28:09.619964+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"a5f94120-ac0d-4483-9a8b-63590071ac6a","claude-code-vs-cursor-2026-zh","Claude Code 與 Cursor 深度對比：202…","2026-03-26T13:27:14.279193+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"0975afa1-e0c7-4130-a20d-d890eaed995e","practical-github-guide-learning-ml-2026-zh","2026 機器學習入門 GitHub 實用指南","2026-03-27T01:16:49.712576+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"bfdb467a-290f-4a80-b3a9-6f081afb6dff","aiml-2026-student-ai-ml-lab-repo-review-zh","AIML-2026：像課綱的學生實驗 Repo","2026-03-27T01:21:51.467798+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"80cabc3e-09fc-4ff5-8f07-b8d68f5ae545","ai-trending-github-repos-and-research-feeds-zh","AI Trending：把 AI 資源收成一張表","2026-03-27T01:31:35.262183+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"3ce6e6e2-bac5-463e-9f8d-45caabcc61f7","awesome-ai-for-science-research-tools-map-zh","AI 科研工具清單，開始像地圖了","2026-03-27T01:46:50.521945+00:00"]