[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-learning-action-priors-cross-embodiment-manipulation-en":3,"article-related-learning-action-priors-cross-embodiment-manipulation-en":30,"series-research-627d2830-fad8-4df9-ab53-16040cd5efa8":73},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"627d2830-fad8-4df9-ab53-16040cd5efa8","learning-action-priors-cross-embodiment-manipulation-en","Learning Action Priors for Cross-Embodiment Manipulation","\u003Cp data-speakable=\"summary\">A two-stage training scheme gives VLA robots an explicit motion prior before cross-modal alignment.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: 13 cross-embodiment tasks\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: Pretrain a flow-matching action module on unconditioned trajectories\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Vision-language-action systems are good at inheriting visual and linguistic knowledge from a foundation model, but the action side often starts nearly from zero. This paper argues that gap matters more in cross-embodiment robotics, where the model has to map the same intent onto different bodies, dynamics, and control spaces.\u003C\u002Fp>\u003Cp>Instead of forcing one training run to discover motion structure and cross-modal alignment at the same time, the authors split the problem into two stages. 
That is the core idea engineers should care about: give the policy a motion prior first, then let it learn how language and vision line up with that prior.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>Most VLA models attach an action module to a vision-language backbone and then optimize the full policy jointly. That sounds simple, but it leaves the action module to learn physical motion almost from scratch. The result is a policy that has strong perceptual and linguistic priors, but no explicit motion prior.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782367379107-moh2.png\" alt=\"Learning Action Priors for Cross-Embodiment Manipulation\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The abstract says this creates a hard optimization problem early in training. The model has to discover temporal action dynamics and cross-modal alignment at the same time, and that difficulty gets worse when the robot embodiment changes. In other words, the policy is not just learning what to do; it is also learning how motion should look for a particular body.\u003C\u002Fp>\u003Cp>For robotics developers, that distinction matters. If your system is trying to generalize across different platforms, a policy that only learns from end-to-end VLA training may spend too much capacity relearning basic dynamics instead of focusing on task understanding and transfer.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>The paper proposes a two-stage training framework. Stage 1 is about learning motion structure before any vision or language conditioning is introduced. 
Stage 2 is about transferring that learned structure into VLA training so the action module does not start cold.\u003C\u002Fp>\u003Cp>In Stage 1, the authors use a lightweight flow-matching-based encoder-decoder action module. It learns temporal motion structure solely from unconditioned action trajectories, meaning it does not process visual or language tokens. That makes Stage 1 a pure motion-learning phase, focused on the sequence structure of actions rather than on task semantics.\u003C\u002Fp>\u003Cp>In Stage 2, the learned prior is reused during VLA training through decoder reuse and early-stage latent distillation. The goal is to align visual-language features with the action embedding space while still allowing end-to-end policy refinement. So the system is not frozen into a precomputed motion template; it still gets to adapt during full policy training.\u003C\u002Fp>\u003Cp>The trained encoder has a second job too: it acts as a compact history compressor. According to the abstract, it summarizes state-action histories into a single temporal context \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> for history-aware modeling at negligible cost. That is a useful design detail because it suggests the motion prior is not just initialization, but also a reusable representation for temporal context.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The paper reports extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms. 
The abstract does not give per-task numbers, exact success rates, or any \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> table, so there are no concrete numeric scores to quote here beyond the task count.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782367379890-j0qk.png\" alt=\"Learning Action Priors for Cross-Embodiment Manipulation\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>What it does claim is directional but important: compared with VLA training without action priors, the model converges faster, reaches higher success rates, and performs substantially better on data-scarce real-world tasks. That combination is the practical signal. Faster convergence means less training time and less wasted compute. Higher success rates mean the policy is actually using the prior rather than fighting it. Better data-scarce performance suggests the motion prior helps where real robot data is expensive.\u003C\u002Fp>\u003Cp>The abstract also says that scaling up the action data in Stage 1 produces a more generalizable action prior, and that this improved prior directly boosts downstream VLA performance. That is a useful scaling lesson: if you have extra action trajectories, they are not just more data for the same model; they can become a reusable motion foundation for later cross-modal training.\u003C\u002Fp>\u003Ch2>Why this matters for developers\u003C\u002Fh2>\u003Cp>If you build robot policies, this paper is basically a reminder that the action stack deserves its own pretraining strategy. 
Vision-language backbones already bring strong priors, but action modules often remain underpowered because they are expected to learn control dynamics and task grounding at the same time.\u003C\u002Fp>\u003Cp>The proposed separation is attractive because it matches how engineers often think about systems anyway: first learn the dynamics, then learn the interface. In this case, the interface is the alignment between visual-language features and the action embedding space. The method keeps end-to-end refinement in play, so it is not just a rigid two-step pipeline.\u003C\u002Fp>\u003Cp>The history-compression angle is also practical. A single temporal context token is a compact way to carry state-action history, which could matter in systems where memory and latency are tight. The abstract does not provide runtime measurements, so we cannot say how much overhead it saves, only that the authors describe it as negligible.\u003C\u002Fp>\u003Ch2>Limitations and open questions\u003C\u002Fh2>\u003Cp>The abstract is strong on method and broad outcomes, but light on implementation detail. It does not provide benchmark numbers, ablation results, or task-by-task breakdowns, so it is hard to judge how much of the gain comes from the motion prior itself versus decoder reuse or latent distillation.\u003C\u002Fp>\u003Cp>It also does not spell out which robot embodiments were included, how different the control spaces were, or what kinds of trajectories were used in Stage 1. Those details matter if you want to port the approach to a new platform. Without them, the safest takeaway is architectural: pretraining the action module on motion structure appears to help, especially when data is scarce and embodiments vary.\u003C\u002Fp>\u003Cp>Another open question is how general the action prior becomes as the action dataset scales. The abstract says more Stage 1 action data improves downstream VLA performance, but it does not define the scaling curve or the point of diminishing returns. 
For practitioners, that means the approach is promising, but still needs careful validation on your own robot, your own control space, and your own data budget.\u003C\u002Fp>\u003Ch2>The bottom line\u003C\u002Fh2>\u003Cp>This paper makes a clear case for treating action modeling as a first-class pretraining problem in VLA robotics. By learning a motion prior before cross-modal alignment, the model can start with a better inductive bias for control instead of discovering motion dynamics from scratch during full policy training.\u003C\u002Fp>\u003Cp>For teams working on cross-embodiment manipulation, the main lesson is simple: if the action module is the weakest part of your stack, give it its own training phase. The abstract reports that doing so speeds convergence, raises success rates, and improves real-world performance, especially when robot data is limited.\u003C\u002Fp>","A two-stage training scheme gives VLA robots an explicit motion prior before cross-modal alignment.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.26095",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782367379107-moh2.png","research","en","978e67d0-1acb-479e-af06-9ead35e4eb74",[17,18,19,20,21],"vision-language-action","robot manipulation","cross-embodiment","motion priors","flow matching",[23,24,25],"Pretrain the action module on unconditioned trajectories before VLA alignment.","The method uses flow matching plus decoder reuse and latent distillation.","Reported gains include faster convergence and better real-world performance, but no benchmark numbers are 
given.",0,"2026-06-25T06:02:30.294341+00:00","2026-06-25T06:02:30.28+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":32,"relatedPosts":36},[],{"id":15,"slug":33,"title":34,"language":35},"learning-action-priors-cross-embodiment-manipulation-zh","先學動作先驗，再對齊多模態","zh",[37,43,49,55,61,67],{"id":38,"slug":39,"title":40,"cover_image":41,"image_url":41,"created_at":42,"category":13},"cb071ec2-19f7-44b6-936e-6f37a9c43b33","ai-papers-code-music-rare-disease-en","3 AI papers on code, music, and diagnosis","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782372780798-rpru.png","2026-06-25T07:32:27.739296+00:00",{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"cd6be4d9-484d-4fa6-8736-8a3b564c4477","new-nlp-papers-agent-memory-tool-use-en","New NLP papers map agent memory and tool use","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782371891968-0m9y.png","2026-06-25T07:17:39.682691+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"17884e8b-86d6-431c-8e83-d628bb4d060a","self-distillation-shrinks-output-diversity-en","Self-Distillation Can Shrink Model Diversity","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782369170326-a6te.png","2026-06-25T06:32:27.005106+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"671fd56c-27db-4f72-956d-7ef067cbe2b4","revengebench-reverse-engineering-game-policies-en","RevengeBench tests reverse-engineering game 
policies","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782368277490-091s.png","2026-06-25T06:17:29.467265+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"06b86f04-a846-4cd5-95f0-1a5d3925c846","opsd-user-feedback-training-loop-en","OPSD lets you turn user clicks into training","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782335103738-zb9h.png","2026-06-24T21:04:40.861287+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"9fd702bc-6c80-4d27-8f85-5971f898bef3","ultraquant-4bit-kv-caching-agents-en","UltraQuant: 4-bit KV caching for long agents","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782331384598-tjhi.png","2026-06-24T20:02:33.028079+00:00",[74,79,84,89,94,99,104,109,114,119],{"id":75,"slug":76,"title":77,"created_at":78},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":80,"slug":81,"title":82,"created_at":83},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into 
Failure","2026-03-28T03:03:18.899465+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]