[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-xiaomi-mimo-v2-omni-perception-action-en":3,"article-related-xiaomi-mimo-v2-omni-perception-action-en":33,"series-industry-d023a8fa-d96f-40f7-bc2c-31e00f459c29":80},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":25,"views":29,"created_at":30,"published_at":31,"topic_cluster_id":32},"d023a8fa-d96f-40f7-bc2c-31e00f459c29","xiaomi-mimo-v2-omni-perception-action-en","Xiaomi MiMo-V2-Omni turns perception into action","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Fnews\u002Fxiaomi-mimo-v2-5-pro-pricing-benchmarks-limits-en\">Xiaomi MiMo\u003C\u002Fa>-V2-Omni is a multimodal \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> model that links perception with browser and office actions.\u003C\u002Fp>\u003Cp>Xiaomi’s MiMo-V2-Omni is built for agents that need to see, hear, and act, not just answer questions. 
The release says the model is now available via \u003Ca href=\"\u002Ftag\u002Fapi\">API\u003C\u002Fa> at $0.4 per million input tokens and $2 per million output tokens.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Item\u003C\u002Fth>\u003Cth>What it does\u003C\u002Fth>\u003Cth>Noted spec\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Visual understanding\u003C\u002Ftd>\u003Ctd>Chart analysis and visual reasoning\u003C\u002Ftd>\u003Ctd>Surpasses Claude 4.6 Opus; closing in on Gemini 3\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Audio understanding\u003C\u002Ftd>\u003Ctd>Sound classification, speaker separation, long audio\u003C\u002Ftd>\u003Ctd>Handles continuous audio over 10 hours\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Video understanding\u003C\u002Ftd>\u003Ctd>Native audio-video joint input\u003C\u002Ftd>\u003Ctd>Built for situational awareness and prediction\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>API pricing\u003C\u002Ftd>\u003Ctd>Model access through Xiaomi MiMo API\u003C\u002Ftd>\u003Ctd>$0.4 input \u002F $2 output per million tokens\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>1. A unified model for text, vision, and speech\u003C\u002Fh2>\u003Cp>MiMo-V2-Omni is presented as a single foundation model for text, vision, and speech. Xiaomi says that unified setup helps perception and action work together instead of living in separate systems.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782419571606-lhdb.png\" alt=\"Xiaomi MiMo-V2-Omni turns perception into action\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The practical pitch is simple: fewer handoffs, less glue code, and a cleaner path from understanding to execution. 
That matters for \u003Ca href=\"\u002Fnews\u002Flibghostty-terminal-substrate-agent-workflows-en\">agent workflows\u003C\u002Fa> where a model has to read, watch, listen, decide, and then do something useful.\u003C\u002Fp>\u003Cul>\u003Cli>Text, image, audio, and video inputs are part of the same stack.\u003C\u002Fli>\u003Cli>The model is aimed at real-world multimodal interaction.\u003C\u002Fli>\u003Cli>It is designed to support agent frameworks directly.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>2. Visual reasoning that targets charts and complex scenes\u003C\u002Fh2>\u003Cp>Xiaomi says the model has strong visual reasoning across multidisciplinary tasks, including chart analysis. In the release, it is described as outperforming \u003Ca href=\"\u002Ftag\u002Fclaude\">Claude\u003C\u002Fa> 4.6 Opus and moving closer to top closed models such as \u003Ca href=\"\u002Ftag\u002Fgemini\">Gemini\u003C\u002Fa> 3.\u003C\u002Fp>\u003Cp>That makes the visual side more than a demo feature. If a model can read dense charts, compare figures, and track details in messy scenes, it becomes more useful for office work, research, and browser-based tasks.\u003C\u002Fp>\u003Cul>\u003Cli>Chart interpretation\u003C\u002Fli>\u003Cli>Multidisciplinary visual reasoning\u003C\u002Fli>\u003Cli>General scene understanding for agent workflows\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>3. Audio understanding that goes past short clips\u003C\u002Fh2>\u003Cp>The audio system is built for more than speech-to-text. 
Xiaomi highlights environmental sound classification, multi-speaker separation, audio-visual joint reasoning, and deep comprehension of continuous audio longer than 10 hours.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782419568993-ueb3.png\" alt=\"Xiaomi MiMo-V2-Omni turns perception into action\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That broad range matters for assistants that need to listen in the wild, not just in a clean studio setting. The company says its audio performance exceeds Gemini 3 Pro, which, if it holds up, would make the model a serious contender for long-form audio analysis.\u003C\u002Fp>\u003Cul>\u003Cli>Environmental sound classification\u003C\u002Fli>\u003Cli>Multi-speaker separation\u003C\u002Fli>\u003Cli>Audio-visual joint reasoning\u003C\u002Fli>\u003Cli>Long continuous audio comprehension\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>4. Video understanding with native audio-video input\u003C\u002Fh2>\u003Cp>MiMo-V2-Omni supports native audio-video joint input, which Xiaomi says gives it true multimodal video comprehension. The release also points to video pre-training that improves situational awareness and predictive reasoning.\u003C\u002Fp>\u003Cp>In plain terms, the model is not only watching frames. It is meant to connect sound, motion, and context so it can follow events as they unfold. That is useful for surveillance-style review, content analysis, and any task where timing matters.\u003C\u002Fp>\u003Cp>Example use cases:\u003C\u002Fp>\u003Cul>\u003Cli>Follow a live event with audio and video together\u003C\u002Fli>\u003Cli>Identify what changed in a scene\u003C\u002Fli>\u003Cli>Predict the next step in a sequence\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>5. Agent actions in browsers and office apps\u003C\u002Fh2>\u003Cp>The strongest part of the release is the action layer. 
Xiaomi says the model can invoke tools, execute functions, operate GUIs, and plug into major agent frameworks. The release also demonstrates browser tasks such as shopping, bargaining with customer service, and publishing TikTok videos.\u003C\u002Fp>\u003Cp>Office workflows are part of the pitch too. According to Xiaomi, the model can generate Word documents, Excel sheets, PDFs, and PPTs from natural-language dialogue, then use web search and file \u003Ca href=\"\u002Ftag\u002Fskills\">skills\u003C\u002Fa> to produce structured outputs like college application recommendations.\u003C\u002Fp>\u003Cul>\u003Cli>Browser use with multi-tab context management\u003C\u002Fli>\u003Cli>Workflow recovery after anti-automation checks\u003C\u002Fli>\u003Cli>Document generation for Word, Excel, PDF, and PPT\u003C\u002Fli>\u003Cli>API access through platform.xiaomimimo.com\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>How to decide\u003C\u002Fh2>\u003Cp>If you need a model mainly for chat or text-only automation, MiMo-V2-Omni may be more than you need. If your work depends on images, audio, video, browsers, and office files in one pipeline, this release is aimed at that mix.\u003C\u002Fp>\u003Cp>Choose it if you want one model that can observe a task, plan the next move, and finish the job with tools. 
If your priority is cheaper basic text generation, a smaller model may still be the better fit.\u003C\u002Fp>","5 takeaways from Xiaomi MiMo-V2-Omni, a multimodal agent model that pairs visual, audio, video, and browser action skills.","mimo.mi.com","https:\u002F\u002Fmimo.mi.com\u002Fdocs\u002Fen-US\u002Fnews\u002Flatest\u002Fv2-omni-release",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782419571606-lhdb.png","industry","en","526c4740-6990-4cda-ad85-02e1cbd8061d",[17,18,19,20,21,22,23,24],"Xiaomi MiMo","MiMo-V2-Omni","multimodal AI","agentic model","browser automation","audio understanding","video understanding","API pricing",[26,27,28],"MiMo-V2-Omni combines text, vision, speech, and tool use in one agent-focused model.","The release claims strong results in visual, audio, and video understanding, including long audio.","Xiaomi also positions it for browser tasks and office document workflows through API access.",0,"2026-06-25T20:32:23.968289+00:00","2026-06-25T20:32:23.962+00:00","f387c695-5c1b-40a6-9c25-94628cae173d",{"tags":34,"relatedLang":39,"relatedPosts":43},[35,37],{"name":19,"slug":36},"multimodal-ai",{"name":21,"slug":38},"browser-automation",{"id":15,"slug":40,"title":41,"language":42},"xiaomi-mimo-v2-omni-perception-action-zh","Xiaomi MiMo-V2-Omni 把感知接到動作","zh",[44,50,56,62,68,74],{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":13},"c1ad19fc-aa93-45c5-8e83-f935c896fbd0","ethereum-foundation-reorganizes-cuts-54-staff-en","54 staff cut as Ethereum Foundation reorganizes","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782424967353-kced.png","2026-06-25T22:02:23.440455+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"9113a59f-8dc7-4735-a6ac-c4b83b35246d","ai-companies-must-earn-trust-on-jobs-en","AI companies 
will win only by proving they won’t hollow out jobs","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782416874209-4bn7.png","2026-06-25T19:47:26.232743+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"2584390e-bd1f-4d7d-a835-aedd9abb4b29","microsoft-ai-education-report-adoption-support-en","Microsoft says AI is now normal in classrooms","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782415073845-lq3n.png","2026-06-25T19:17:28.358298+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"7daeae3a-965a-44c3-88f2-7a7f0ff6092c","ruffle-keeps-flash-games-playable-en","Ruffle keeps Flash games playable after Flash died","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782414171182-ggjn.png","2026-06-25T19:02:27.873606+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"0eb3e265-8500-4256-9c96-f718e1750aa1","jalapeno-turns-openai-into-chip-designer-en","Jalapeño turns OpenAI into a chip designer","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782407897778-icsf.png","2026-06-25T17:17:56.901981+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":13},"c6750f6e-fd92-4c65-97f4-8e4b01d1d9d3","anthropic-overseas-data-center-push-right-move-en","Anthropic’s overseas data-center push is the right 
move","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782406974135-qpfa.png","2026-06-25T17:02:28.979286+00:00",[81,86,91,96,101,106,111,116,121,126],{"id":82,"slug":83,"title":84,"created_at":85},"d35a1bd9-e709-412e-a2df-392df1dc572a","ai-impact-2026-developments-market-en","AI's Impact in 2026: Key Developments and Market Shifts","2026-03-25T16:20:33.205823+00:00",{"id":87,"slug":88,"title":89,"created_at":90},"5ed27921-5fd6-492e-8c59-78393bf37710","trumps-ai-legislative-framework-en","Trump's AI Legislative Framework: What's Inside?","2026-03-25T16:22:20.005325+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"e454a642-f03c-4794-b185-5f651aebbaca","nvidia-gtc-2026-key-highlights-innovations-en","NVIDIA GTC 2026: Key Highlights and Innovations","2026-03-25T16:22:47.882615+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"0ebb5b16-774a-4922-945d-5f2ce1df5a6d","claude-usage-diversifies-learning-curves-en","Claude Usage Diversifies, Learning Curves Emerge","2026-03-25T16:25:50.770376+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"69934e86-2fc5-4280-8223-7b917a48ace8","openclaw-ai-commoditization-concerns-en","OpenClaw's Rise Raises Concerns of AI Model Commoditization","2026-03-25T16:26:30.582047+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"b4b2575b-2ac8-46b2-b90e-ab1d7c060797","google-gemini-ai-rollout-2026-en","Google's Gemini AI Rollout Extended to 2026","2026-03-25T16:28:14.808842+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"6e18bc65-42ae-4ad0-b564-67d7f66b979e","meta-llama4-fabricated-results-scandal-en","Meta's Llama 4 Scandal: Fabricated AI Test Results Unveiled","2026-03-25T16:29:15.482836+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"bf888e9d-08be-4f47-996c-7b24b5ab3500","accenture-mistral-ai-deployment-en","Accenture and Mistral AI Team Up for AI 
Deployment","2026-03-25T16:31:01.894655+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"5382b536-fad2-49c6-ac85-9eb2bae49f35","mistral-ai-high-stakes-2026-en","Mistral AI: Facing High Stakes in 2026","2026-03-25T16:31:39.941974+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"9da3d2d6-b669-4971-ba1d-17fdb3548ed5","cursors-meteoric-rise-pressures-en","Cursor's Meteoric Rise Faces Industry Pressures","2026-03-25T16:32:21.899217+00:00"]