[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-多模態模型":3},{"tag":4,"articles":10,"peer_article_count":11},{"id":5,"name":6,"slug":6,"article_count":7,"description_zh":8,"description_en":9},"45b5a7e9-4a2b-4a85-9a36-984bf468fd0e","多模態模型",4,"多模態模型把影像、文字、程式碼與語音放進同一套推理流程，適合代理式工作流、視覺理解與人機互動。這裡聚焦模型架構、長上下文、微調策略與部署成本，從 Qwen3.5 視覺分層訓練到 Kimi K2.5、MiMo 這類新模型的實作差異。","Multimodal models combine text, vision, code, and sometimes speech in one inference stack, making them relevant to agentic workflows, visual understanding, and human-computer interaction. This tag covers model design, long-context handling, fine-tuning, and deployment trade-offs, from Qwen3.5 vision tuning to Kimi K2.5 and MiMo.",[],8]