Tag

vision-language

Vision-language models connect images, text, and reasoning in one pipeline, supporting work such as visual question answering (VQA), preference alignment, and multimodal mixture-of-experts (MoE) architectures. This topic centers on how models interpret visuals, route inputs to the right experts, and stay reliable under task-specific constraints.

2 articles

Tools & Apps/Jun 30

PixelRAG turns screenshots into retrievable context

I break down PixelRAG’s screenshot-first RAG pipeline and give you a copy-ready template for visual retrieval.

Research/Apr 10

Why multimodal MoE models get distracted

A study of multimodal MoE models finds visual inputs can derail routing to reasoning experts, and a routing-guided fix improves results.