Xiaomi MiMo-V2-Omni turns perception into action
5 takeaways from Xiaomi MiMo-V2-Omni, a multimodal agent model that pairs visual, audio, video, and browser action skills.

Xiaomi MiMo-V2-Omni is a multimodal agent model that links perception with browser and office actions.
Xiaomi’s MiMo-V2-Omni is built for agents that need to see, hear, and act, not just answer questions. The release says the model is now available via API at $0.4 per million input tokens and $2 per million output tokens.
| Item | What it does | Noted spec |
|---|---|---|
| Visual understanding | Chart analysis and visual reasoning | Surpasses Claude 4.6 Opus; closing in on Gemini 3 |
| Audio understanding | Sound classification, speaker separation, long audio | Handles continuous audio over 10 hours |
| Video understanding | Native audio-video joint input | Built for situational awareness and prediction |
| API pricing | Model access through Xiaomi MiMo API | $0.4 input / $2 output per million tokens |
1. A unified model for text, vision, and speech
Get the latest AI news in your inbox
Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.
No spam. Unsubscribe at any time.
MiMo-V2-Omni is presented as a single foundation model for text, vision, and speech. Xiaomi says that unified setup helps perception and action work together instead of living in separate systems.

The practical pitch is simple: fewer handoffs, less glue code, and a cleaner path from understanding to execution. That matters for agent workflows where a model has to read, watch, listen, decide, and then do something useful.
- Text, image, audio, and video inputs are part of the same stack.
- The model is aimed at real-world multimodal interaction.
- It is designed to support agent frameworks directly.
2. Visual reasoning that targets charts and complex scenes
Xiaomi says the model has strong visual reasoning across multidisciplinary tasks, including chart analysis. In the release, it is described as outperforming Claude 4.6 Opus and moving closer to top closed models such as Gemini 3.
That makes the visual side more than a demo feature. If a model can read dense charts, compare figures, and track details in messy scenes, it becomes more useful for office work, research, and browser-based tasks.
- Chart interpretation
- Multidisciplinary visual reasoning
- General scene understanding for agent workflows
3. Audio understanding that goes past short clips
The audio system is built for more than speech-to-text. Xiaomi highlights environmental sound classification, multi-speaker separation, audio-visual joint reasoning, and deep comprehension of continuous audio longer than 10 hours.

That broad range matters for assistants that need to listen in the wild, not just in a clean studio setting. The company says its audio performance exceeds Gemini 3 Pro, which puts the model in a serious spot for long-form audio analysis.
- Environmental sound classification
- Multi-speaker separation
- Audio-visual joint reasoning
- Long continuous audio comprehension
4. Video understanding with native audio-video input
MiMo-V2-Omni supports native audio-video joint input, which Xiaomi says gives it true multimodal video comprehension. The release also points to video pre-training that improves situational awareness and predictive reasoning.
In plain terms, the model is not only watching frames. It is meant to connect sound, motion, and context so it can follow events as they unfold. That is useful for surveillance-style review, content analysis, and any task where timing matters.
Example use cases:
- follow a live event with audio and video together
- identify what changed in a scene
- predict the next step in a sequence5. Agent actions in browsers and office apps
The strongest part of the release is the action layer. Xiaomi says the model can invoke tools, execute functions, operate GUIs, and plug into major agent frameworks. It also shows browser tasks such as shopping, bargaining with customer service, and publishing TikTok videos.
Office workflows are part of the pitch too. The model can generate Word documents, Excel sheets, PDFs, and PPTs from natural dialogue, then use web search and file skills to produce structured outputs like college application recommendations.
- Browser use with multi-tab context management
- Workflow recovery after anti-automation checks
- Document generation for Word, Excel, PDF, and PPT
- API access through platform.xiaomimimo.com
How to decide
If you need a model mainly for chat or text-only automation, MiMo-V2-Omni may be more than you need. If your work depends on images, audio, video, browsers, and office files in one pipeline, this release is aimed at that mix.
Choose it if you want one model that can observe a task, plan the next move, and finish the job with tools. If your priority is cheaper basic text generation, a smaller model may still be the better fit.
// Related Articles
- [IND]
AI companies will win only by proving they won’t hollow out jobs
- [IND]
Microsoft says AI is now normal in classrooms
- [IND]
Ruffle keeps Flash games playable after Flash died
- [IND]
Jalapeño turns OpenAI into a chip designer
- [IND]
Anthropic’s overseas data-center push is the right move
- [IND]
Nx Polygraph targets AI agent bottlenecks