Meta V-JEPA 2 world model uses raw video to train robots

Source: roboticsbusinessreview
Author: @SteveCrowe
Published: 6/11/2025
Meta has introduced V-JEPA 2, a 1.2-billion-parameter world model designed to enhance robotic understanding, prediction, and planning by training primarily on raw video data. Built on the Joint Embedding Predictive Architecture (JEPA), V-JEPA 2 undergoes a two-stage training process: first, self-supervised learning from over one million hours of video and a million images to capture physical interaction patterns; second, action-conditioned learning using about 62 hours of robot control data to incorporate agent actions for outcome prediction. This approach enables the model to support planning and closed-loop control in robots without requiring extensive domain-specific training or human annotations.
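To make the two-stage recipe concrete, here is a minimal, hypothetical PyTorch sketch of a JEPA-style training loop: stage one predicts latent embeddings of one video view from another without labels, and stage two adds an action input to the predictor using a small amount of robot data. The module sizes, shapes, and names below are illustrative assumptions, not Meta's actual architecture or API.

```python
import torch
import torch.nn as nn

# Toy dimensions; the real V-JEPA 2 encoder is a 1.2B-parameter video transformer.
LATENT_DIM, ACTION_DIM = 256, 7

class Encoder(nn.Module):
    """Stand-in for the video encoder: maps a clip to a latent embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(LATENT_DIM))
    def forward(self, clip):
        return self.net(clip)

class Predictor(nn.Module):
    """Predicts the latent of a future/masked segment from a context latent,
    optionally conditioned on an action vector."""
    def __init__(self, cond_dim=0):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM + cond_dim, LATENT_DIM)
    def forward(self, z, cond=None):
        x = z if cond is None else torch.cat([z, cond], dim=-1)
        return self.net(x)

encoder, target_encoder = Encoder(), Encoder()
predictor = Predictor()

# Stage 1: self-supervised pretraining on raw video -- predict target-view
# embeddings from context-view embeddings (no labels, no actions).
context_clip = torch.randn(8, 3, 16, 64, 64)   # toy batch of video clips
target_clip  = torch.randn(8, 3, 16, 64, 64)
with torch.no_grad():
    z_target = target_encoder(target_clip)     # targets from a frozen/EMA encoder
z_pred = predictor(encoder(context_clip))
stage1_loss = nn.functional.mse_loss(z_pred, z_target)

# Stage 2: action-conditioned learning on a small robot dataset -- the
# predictor also receives the action and learns to predict the next latent.
action_predictor = Predictor(cond_dim=ACTION_DIM)
obs, next_obs = torch.randn(8, 3, 16, 64, 64), torch.randn(8, 3, 16, 64, 64)
action = torch.randn(8, ACTION_DIM)
with torch.no_grad():
    z_next = target_encoder(next_obs)
stage2_loss = nn.functional.mse_loss(action_predictor(encoder(obs), action), z_next)
```

Because the losses live entirely in latent space, the second stage only has to learn how actions move the embedding, which is why a few dozen hours of robot data can suffice after large-scale video pretraining.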
In practical tests within Meta’s labs, V-JEPA 2 demonstrated strong performance on common robotic tasks such as pick-and-place, achieving success rates between 65% and 80% in previously unseen environments. The model uses vision-based goal representations: for simpler tasks it generates and evaluates candidate actions directly against an image of the goal, while more complex tasks are decomposed into sequences of visual subgoals.
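The planning loop described above can be illustrated with a short, hypothetical random-shooting planner: sample candidate action sequences, roll the world model forward in latent space, and execute the first action of the sequence whose predicted end state lands closest to the goal image's embedding. The `encode`/`predict` stand-ins, shapes, and hyperparameters here are assumptions for illustration only.

```python
import torch

LATENT_DIM, ACTION_DIM, HORIZON, N_CANDIDATES = 256, 7, 5, 512

# Stand-ins for the learned components (assumed interfaces, not Meta's API):
# encode(obs)        -> latent embedding of a camera image
# predict(z, action) -> predicted next latent given an action
encode = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(LATENT_DIM))
predict_net = torch.nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)
def predict(z, a):
    return predict_net(torch.cat([z, a], dim=-1))

def plan(obs, goal_image):
    """Score candidate action sequences by how close the predicted final
    latent lands to the embedding of the goal image."""
    with torch.no_grad():
        z = encode(obs.unsqueeze(0)).expand(N_CANDIDATES, -1)
        z_goal = encode(goal_image.unsqueeze(0))
        candidates = torch.randn(N_CANDIDATES, HORIZON, ACTION_DIM)
        for t in range(HORIZON):
            z = predict(z, candidates[:, t])          # roll the world model forward
        scores = -torch.cdist(z, z_goal).squeeze(-1)  # higher = closer to the goal
        best = scores.argmax()
    return candidates[best, 0]                        # execute first action, then replan

obs = torch.randn(3, 64, 64)    # current camera frame (toy shape)
goal = torch.randn(3, 64, 64)   # image showing the desired end state
first_action = plan(obs, goal)
```

For longer-horizon tasks, the same loop would be run against each visual subgoal in turn rather than the final goal image directly.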
Tags
robotics, AI, world-models, machine-learning, vision-based-control, robotic-manipulation, self-supervised-learning