MV-VDP

A Spatio-Temporal-Aware Video Action Model

Peiyan Li,1,2,* Yixiang Chen,1,2,* Yuan Xu,1,2 Jiabing Yang,1,2 Xiangnan Wu,1,2
Jun Guo,4 Nan Sun,4 Long Qian,5 Xinghang Li,4 Xin Xiao,6 Jing Liu,3 Nianfeng Liu,3
Tao Kong,4,† Yan Huang,1,2,3,† Liang Wang,1,2 Tieniu Tan,1,2,7
*Equal Contribution, †Corresponding Author
1New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3FiveAges 4Tsinghua University 5Xi'an Jiaotong University
6Wuhan University 7Nanjing University

TL;DR:

We introduce a 3D spatio-temporal-aware video action model that jointly predicts multi-view RGB frames and heatmaps, bridging video prediction and action prediction in a data-efficient, robust, generalizable, and interpretable manner.

Abstract

Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image–text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only the actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction–based, 3D-based, and vision–language–action models, establishing a new state of the art in data-efficient multi-task manipulation.

Figure 1: Overview of MV-VDP

Method

As illustrated in Fig. 2, MV-VDP employs a unified framework for joint spatio-temporal modeling and action prediction. The overall framework can be decomposed into three main components: multi-view representation and projection, joint video and heatmap diffusion, and hierarchical action decoding.

Multi-View Representation and Projection

MV-VDP transforms input colored point clouds and robot end-effector poses into a format compatible with video foundation models. Specifically, it uses orthographic projection to generate multi-view RGB images from the point cloud and represents robot states as multi-view Gaussian heatmaps. This design implicitly encodes 3D spatial structure while aligning with the representation used in video pre-training.
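
The projection step can be illustrated with a minimal NumPy sketch. It assumes three axis-aligned orthographic views of a cubic workspace, a camera placed on the positive side of each viewing axis, and hypothetical values for the image resolution and the Gaussian width; the function names are illustrative and none of these choices are taken from the MV-VDP implementation.

```python
import numpy as np

def to_pixels(points, axis, bounds, res):
    """Orthographic projection: drop `axis`, scale the other two to pixel units."""
    lo, hi = bounds
    keep = [a for a in range(3) if a != axis]
    return (np.asarray(points, dtype=float)[..., keep] - lo) / (hi - lo) * (res - 1)

def render_view(points, colors, axis, bounds, res=64):
    """Render one RGB view; the point with the largest coordinate along `axis` wins."""
    uv = np.rint(to_pixels(points, axis, bounds, res)).astype(int)
    inside = np.all((uv >= 0) & (uv < res), axis=1)
    order = np.argsort(points[inside, axis])  # far to near for a camera at +axis
    image = np.zeros((res, res, 3))
    for (u, v), c in zip(uv[inside][order], colors[inside][order]):
        image[v, u] = c  # later (nearer) points overwrite earlier ones
    return image

def gaussian_heatmap(center_uv, res=64, sigma=2.0):
    """2D Gaussian bump centered on the projected end-effector position."""
    vs, us = np.mgrid[0:res, 0:res]
    d2 = (us - center_uv[0]) ** 2 + (vs - center_uv[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Synthetic stand-ins for a colored point cloud and an end-effector position.
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(2000, 3))
colors = rng.uniform(0.0, 1.0, size=(2000, 3))
ee_pos = np.array([0.1, -0.2, 0.3])
bounds = (-0.5, 0.5)  # assumed workspace cube limits

rgb_views = [render_view(points, colors, axis, bounds) for axis in range(3)]
heatmaps = [gaussian_heatmap(to_pixels(ee_pos, axis, bounds, 64)) for axis in range(3)]
```

Both outputs are ordinary image-shaped arrays, which is what makes them compatible with a video backbone: the RGB views and the heatmaps can be handled as frames of the same kind.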

Joint Video and Heatmap Diffusion

The model leverages a 5B-parameter video foundation model (Wan2.2) augmented with view-attention modules to ensure multi-view consistency. It is trained to jointly predict future multi-view RGB video sequences and corresponding heatmap sequences. By modeling how the environment evolves as a consequence of actions, the model captures environmental dynamics and future end-effector trajectories within a shared latent space.
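
The page does not specify how the view-attention modules are built. As a rough illustration only, the sketch below shows one common form of cross-view attention, in which each token attends to the tokens at the same position in the other views. The single-head design, the residual connection, the tensor layout, and all names are assumptions, not the Wan2.2 or MV-VDP architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def view_attention(tokens, wq, wk, wv):
    """Single-head attention across views (illustrative, not the paper's module).

    tokens: (V, N, C) array = views, tokens per view, channels.
    Each token attends to the tokens at the same position in every view.
    """
    x = np.swapaxes(tokens, 0, 1)                              # (N, V, C)
    q, k, v = x @ wq, x @ wk, x @ wv                           # (N, V, C) each
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(q.shape[-1])   # (N, V, V)
    out = softmax(scores) @ v                                  # (N, V, C)
    return tokens + np.swapaxes(out, 0, 1)                     # residual, (V, N, C)

rng = np.random.default_rng(0)
V, N, C = 3, 10, 8  # hypothetical sizes
tokens = rng.normal(size=(V, N, C))
wq, wk, wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
mixed = view_attention(tokens, wq, wk, wv)
```

Because the attention runs only over the view axis, the per-view spatio-temporal layers of a pretrained backbone can in principle stay unchanged while information is still exchanged between views.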

Hierarchical Action Decoding

To recover executable actions, MV-VDP employs a two-stream decoding process. The predicted heatmap peaks are back-projected into 3D space using camera parameters to recover a continuous 3D end-effector trajectory. Simultaneously, a lightweight rotation and gripper predictor takes the denoised video latents and localized heatmap features as input to estimate the end-effector's rotation (Euler angles) and gripper states.
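
The translation branch of the decoder can be sketched as follows, assuming three axis-aligned orthographic views over a cubic workspace so that the "camera parameters" reduce to a scale and an offset. The heatmaps here are synthetic stand-ins for model predictions, and the resolution, bounds, and function names are hypothetical.

```python
import numpy as np

RES, BOUNDS = 64, (-0.5, 0.5)  # hypothetical image size and workspace cube

def gaussian_heatmap(center_uv, res=RES, sigma=2.0):
    """Synthetic stand-in for a predicted heatmap frame."""
    vs, us = np.mgrid[0:res, 0:res]
    d2 = (us - center_uv[0]) ** 2 + (vs - center_uv[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def backproject(view_heatmaps, bounds=BOUNDS, res=RES):
    """Recover one 3D point from per-view heatmaps.

    view_heatmaps maps the dropped axis (0, 1 or 2) to that view's heatmap.
    Each 3D coordinate is seen in two views, so the two estimates are averaged.
    """
    lo, hi = bounds
    sums, counts = np.zeros(3), np.zeros(3)
    for axis, heatmap in view_heatmaps.items():
        keep = [a for a in range(3) if a != axis]
        v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # peak pixel
        sums[keep] += np.array([u, v]) / (res - 1) * (hi - lo) + lo
        counts[keep] += 1
    return sums / counts

# Build heatmap "videos" for a short ground-truth trajectory, then decode them.
trajectory = np.array([[0.1, -0.2, 0.3], [0.12, -0.15, 0.25], [0.15, -0.1, 0.2]])
decoded = []
for pos in trajectory:
    frames = {}
    for axis in range(3):
        keep = [a for a in range(3) if a != axis]
        uv = (pos[keep] - BOUNDS[0]) / (BOUNDS[1] - BOUNDS[0]) * (RES - 1)
        frames[axis] = gaussian_heatmap(uv)
    decoded.append(backproject(frames))
decoded = np.array(decoded)  # (T, 3) end-effector positions
```

In this sketch the recovered positions are accurate to about half a pixel of the workspace resolution, since only the integer peak location is used. Rotation and gripper state are not recoverable from peaks alone, which is why the page describes a separate learned predictor for them.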

Figure 2: Network architecture of MV-VDP

Experiments

Meta-World Simulation

The simulation experiments focus on seven distinct Meta-World tasks, including Door-Open, Button-Press, and Faucet-Close, using only five demonstration trajectories per task to evaluate learning efficiency. MV-VDP achieves a state-of-the-art average success rate of 89.1%, significantly outperforming the next-best video-prediction baseline, Track2Act (67.4%), as well as standard behavioral cloning methods, which struggle under such limited data. These results demonstrate that aligning action fine-tuning with video foundation model pretraining effectively reduces the gap between perception and control.

| Method | D-Open | D-Close | Btn | Btn-Top | Fct-Cls | Fct-Open | Handle | Avg. Succ. (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| UniPi (Du et al., 2023) | 0/25 | 9/25 | 3/25 | 0/25 | 1/25 | 3/25 | 4/25 | 11.40 |
| BC-Scratch (Nair et al., 2022) | 6/25 | 9/25 | 9/25 | 3/25 | 5/25 | 5/25 | 9/25 | 26.20 |
| BC-R3M (Nair et al., 2022) | 1/25 | 15/25 | 9/25 | 1/25 | 6/25 | 17/25 | 13/25 | 35.40 |
| DP (Chi et al., 2025) | 12/25 | 12/25 | 10/25 | 5/25 | 6/25 | 15/25 | 6/25 | 37.70 |
| AVDC (Ko et al., 2023) | 18/25 | 23/25 | 15/25 | 6/25 | 14/25 | 6/25 | 21/25 | 58.90 |
| DreamZero (Ye et al., 2026) | 0/25 | 11/25 | 23/25 | 3/25 | 20/25 | 25/25 | 25/25 | 61.10 |
| Track2Act (Bharadhwaj et al., 2024b) | 22/25 | 19/25 | 14/25 | 10/25 | 12/25 | 22/25 | 19/25 | 67.40 |
| MV-VDP (Ours) | 25/25 | 25/25 | 25/25 | 24/25 | 8/25 | 24/25 | 25/25 | 89.10 |

Table 1: Success rates on seven Meta-World tasks under a low-data regime (5 demonstrations per task). Each entry reports the number of successful rollouts out of 25 trials. Our method (MV-VDP) achieves the highest average success rate and matches or exceeds every baseline on five of the seven tasks; the exceptions are Faucet-Close (8/25) and Faucet-Open (24/25 vs. 25/25 for DreamZero).
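
As a quick sanity check on the averages, the snippet below recomputes each method's pooled success rate from the per-task counts in Table 1 (successes out of 25 trials on each of seven tasks). The pooled values agree with the reported averages to within about 0.1 percentage points; the small residual differences presumably come from how the reported numbers were rounded, which the page does not state.

```python
TRIALS_PER_TASK = 25

# Successful rollouts per task, in the column order of Table 1:
# D-Open, D-Close, Btn, Btn-Top, Fct-Cls, Fct-Open, Handle
successes = {
    "UniPi": [0, 9, 3, 0, 1, 3, 4],
    "BC-Scratch": [6, 9, 9, 3, 5, 5, 9],
    "BC-R3M": [1, 15, 9, 1, 6, 17, 13],
    "DP": [12, 12, 10, 5, 6, 15, 6],
    "AVDC": [18, 23, 15, 6, 14, 6, 21],
    "DreamZero": [0, 11, 23, 3, 20, 25, 25],
    "Track2Act": [22, 19, 14, 10, 12, 22, 19],
    "MV-VDP": [25, 25, 25, 24, 8, 24, 25],
}

# Pooled success rate in percent: total successes over total trials.
averages = {
    method: 100.0 * sum(counts) / (TRIALS_PER_TASK * len(counts))
    for method, counts in successes.items()
}

for method, avg in averages.items():
    print(f"{method:<12} {avg:5.1f}%")
```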

[Videos: three rollout episodes each for Button Press, Button Press Topdown, Door Close, Door Open, Faucet Close, Faucet Open, and Handle Press.]

Real-World Base Tasks

The real-world evaluation involves a Franka Research 3 robot performing three foundational tasks: Put Lion (pick-and-place), Push-T (complex pushing), and Scoop Tortilla (contact-rich manipulation). With only ten expert demonstrations, MV-VDP reaches a 100% success rate on the Put Lion task and handles the continuous dynamics of pushing and scooping, where prior key-pose-based methods such as BridgeVLA often fail. By predicting continuous action chunks rather than isolated waypoints, the model maintains the precise temporal coordination required for high-dexterity tasks.

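Basic tasks: Put Lion, Push-T, Scoop Tort. Unseen tasks: Put-B, Put-H, Push-L, Scoop-C.

| Method | Put Lion | Push-T | Scoop Tort. | Put-B | Put-H | Push-L | Scoop-C | Avg. Succ. (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| DP3 (Ze et al., 2024) | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0.00 |
| π0.5 (Black et al., 2025a) | 1/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 1.40 |
| UVA (Li et al., 2025d) | 2/10 | 0/10 | 0/10 | 1/10 | 1/10 | 0/10 | 0/10 | 5.70 |
| BridgeVLA (Li et al., 2025b) | 9/10 | 0/10 | 4/10 | 8/10 | 7/10 | 0/10 | 1/10 | 41.42 |
| MV-VDP (Ours) | 10/10 | 4/10 | 7/10 | 5/10 | 6/10 | 3/10 | 5/10 | 57.10 |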

Table 2: Real-world manipulation results under limited demonstrations. We report success rates over 10 trials per task on three basic tasks and four unseen tasks. All methods are trained with 10 expert trajectories.

[Videos: three real-world trials each for "Scoop the tortilla into the plastic plate", "Push the T Block into the target region", and "Put the lion on the shelf".]

Generalization to Unseen Scenarios

To test robustness, the model is deployed in four "unseen" settings: novel backgrounds (Put-B), increased object height (Put-H), dark lighting (Push-L), and an entirely new object category (Scoop-C, using plastic noodles). MV-VDP generalizes well, with the largest margins under lighting and category variations (3/10 vs. 0/10 on Push-L and 5/10 vs. 1/10 on Scoop-C against BridgeVLA), while BridgeVLA remains ahead on Put-B and Put-H. Averaged over all seven real-world tasks, MV-VDP achieves a success rate of 57.1%, compared to 41.4% for BridgeVLA, the strongest vision-language-action baseline. This suggests that the 3D-aware multi-view projections provide a more resilient representation of the environment than standard 2D or MLP-based 3D encoders.

[Videos: two trials each for Background Variation, Category Variation, Height Variation, and Lighting Variation.]

Video Generation Showcase

[Generated videos. Meta-World: button-press, door-open, faucet-close. Real-World: "Scoop the tortilla into the plastic plate", "Push the T Block into the target region", "Put the lion on the shelf".]

Robustness Analysis

We vary the number of inference denoising steps N from 1 to 50 and present the results in Fig. 3. Surprisingly, whereas typical video diffusion processes require around 50 denoising steps, MV-VDP achieves a comparable success rate with just a single denoising step. We believe the main reason is that heatmaps have relatively simple distribution modes and lack the high-frequency details present in regular images. Furthermore, our action prediction relies solely on the peak locations of the heatmaps, so the overall quality of the heatmaps is less critical and fewer denoising steps are required.

Figure 3: Average success rates for different inference denoising steps. MV-VDP demonstrates high robustness to varying diffusion steps, achieving strong performance even when the denoising step is set to 1.

For better visualization, we present example prediction results with different inference denoising steps in Fig. 4. As shown, when the denoising step is set to 1, the visual quality of the RGB prediction is somewhat lower, but the predicted heatmaps still provide meaningful information.

Ablation Studies

We conduct ablation studies on the Meta-World benchmark to evaluate the key design choices in MV-VDP, as summarized in Tab. 3. Model #2 replaces LoRA with full fine-tuning and achieves comparable performance (87.4%), but with significantly higher computational and memory costs. Therefore, we adopt LoRA fine-tuning in all experiments unless otherwise specified.

Model #3 concatenates heatmap and image sequences along the channel dimension rather than the view dimension, requiring additional convolutional adaptation for DiT. This introduces an information bottleneck and degrades performance (89.1% → 81.1%), indicating that view-wise concatenation better preserves multi-view information.
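
The difference between the two concatenation schemes is easiest to see in terms of tensor shapes. The sketch below uses a hypothetical latent layout (views, frames, height, width, channels) with made-up sizes; the actual layout used by the model is not given on this page.

```python
import numpy as np

V, T, H, W, C = 3, 4, 8, 8, 16  # hypothetical latent sizes

rgb_latents = np.zeros((V, T, H, W, C))   # latents of the multi-view RGB videos
heat_latents = np.ones((V, T, H, W, C))   # latents of the multi-view heatmap videos

# View-wise concatenation: heatmap videos become extra "views".
# The channel width stays C, so the backbone's input layer sees the usual shape.
view_concat = np.concatenate([rgb_latents, heat_latents], axis=0)

# Channel-wise concatenation: every token now carries 2C channels,
# so the input and output projections must be adapted to the new width.
channel_concat = np.concatenate([rgb_latents, heat_latents], axis=-1)

print(view_concat.shape)     # (6, 4, 8, 8, 16)
print(channel_concat.shape)  # (3, 4, 8, 8, 32)
```

Under this reading, view-wise concatenation leaves the per-token format of the pretrained model untouched, whereas channel-wise concatenation forces the new adaptation layers to squeeze two modalities through a projection that was never pretrained.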

Model #4 predicts only heatmap sequences without joint video prediction, resulting in a substantial performance drop (89.1% → 61.1%), highlighting the importance of modeling temporal dynamics through video prediction. Finally, Model #5 removes pretrained video foundation weights and fails to fit the training data, demonstrating the critical role of large-scale video pretraining for data-efficient manipulation.

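| # | Video Pred | View Concat | Initial Weights | LoRA | Avg (%) |
|---|---|---|---|---|---|
| 1 | ✓ | ✓ | ✓ | ✓ | 89.1 |
| 2 | ✓ | ✓ | ✓ | - | 87.4 |
| 3 | ✓ | - | ✓ | ✓ | 81.1 |
| 4 | - | ✓ | ✓ | ✓ | 61.1 |
| 5 | ✓ | ✓ | - | ✓ | 4.6 |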

Table 3: Ablation study of MV-VDP design choices on Meta-World. Columns indicate whether the model applies LoRA fine-tuning (LoRA), concatenates heatmap and image sequences along the view dimension (View Concat), predicts future video frames (Video Pred) and uses pretrained initialization (Initial Weights). The last column reports the average success rate (%) across seven tasks.

Safer Deployment and Enhanced Explainability

Deploying manipulation policies in the real world poses safety challenges: it's difficult to assess whether a predicted action sequence is reasonable or safe from its raw numerical representation. In practice, reliable verification often requires actual action execution, which can be unsafe and may damage the robot or surroundings.

In contrast, MV-VDP generates realistic, temporally consistent multi-view video and heatmap sequences, allowing users to visually inspect predicted rollouts before execution. This provides a safer, more interpretable action validation mechanism.

To quantify this benefit, we conducted a study with four evaluators, each performing 35 rollouts (140 in total). The evaluators reviewed the generated videos before execution and reran any rollout that appeared unsafe (e.g., potential collisions). Tab. 4 shows that video-based inspection reduced collision events from 6 to 0 over 140 rollouts, suggesting that MV-VDP's predicted RGB videos make action outputs easier to interpret and provide a practical tool for safer deployment.

| | With video checking | Without video checking |
|---|---|---|
| Collisions | 0 / 140 | 6 / 140 |

Table 4: Number of collision events with and without video-based action checking.

Citation


@article{li2026multi,
  title={Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model},
  author={Li, Peiyan and Chen, Yixiang and Xu, Yuan and Yang, Jiabing and Wu, Xiangnan and Guo, Jun and Sun, Nan and Qian, Long and Li, Xinghang and Xiao, Xin and others},
  journal={arXiv preprint arXiv:2604.03181},
  year={2026}
}