MV-VDP

Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image–text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only the actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction–based, 3D-based, and vision–language–action models, establishing a new state of the art in data-efficient multi-task manipulation.

Figure 1: Overview of MV-VDP

As illustrated in Fig. 2, MV-VDP employs a unified framework for joint spatio-temporal modeling and action prediction. The overall framework can be decomposed into three main components: multi-view representation and projection, joint video and heatmap diffusion, and hierarchical action decoding.

Multi-View Representation and Projection

MV-VDP transforms input colored point clouds and robot end-effector poses into a format compatible with video foundation models. Specifically, it uses orthographic projection to generate multi-view RGB images from the point cloud and represents robot states as multi-view Gaussian heatmaps. This design implicitly encodes 3D spatial structure while aligning with the representation used in video pre-training.
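The projection step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the MV-VDP implementation: the three axis-aligned views, the cubic workspace bounds `lo`/`hi`, the 33-pixel image size, the z-buffer rule, and the Gaussian width `sigma` are all assumptions chosen for readability.

```python
import math

# Hypothetical views: (world axis for image u, world axis for image v, depth axis).
VIEWS = {"top": (0, 1, 2), "front": (0, 2, 1), "side": (1, 2, 0)}


def to_pixel(point, view, lo=-0.5, hi=0.5, size=33):
    """Orthographic projection: drop the depth axis, scale the other two to pixels."""
    u_axis, v_axis, _ = VIEWS[view]
    scale = (size - 1) / (hi - lo)
    return round((point[u_axis] - lo) * scale), round((point[v_axis] - lo) * scale)


def render_rgb(points, colors, view, lo=-0.5, hi=0.5, size=33):
    """Render a colored point cloud into one view, keeping the point with the
    largest depth coordinate at each pixel (an assumed visibility rule)."""
    depth_axis = VIEWS[view][2]
    image = [[(0, 0, 0)] * size for _ in range(size)]
    zbuf = [[-math.inf] * size for _ in range(size)]
    for point, color in zip(points, colors):
        u, v = to_pixel(point, view, lo, hi, size)
        if 0 <= u < size and 0 <= v < size and point[depth_axis] > zbuf[v][u]:
            zbuf[v][u] = point[depth_axis]
            image[v][u] = color
    return image


def gaussian_heatmap(position, view, sigma=1.5, lo=-0.5, hi=0.5, size=33):
    """Encode an end-effector position as a 2D Gaussian centred on its projection."""
    u0, v0 = to_pixel(position, view, lo, hi, size)
    return [
        [math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2 * sigma ** 2)) for u in range(size)]
        for v in range(size)
    ]


# Usage: one end-effector position rendered as a heatmap in every view.
end_effector = (0.25, -0.25, 0.0)
heatmaps = {view: gaussian_heatmap(end_effector, view) for view in VIEWS}
print(to_pixel(end_effector, "top"))    # (24, 8)
print(to_pixel(end_effector, "front"))  # (24, 16)
```

Because every view is a regular image, both the RGB renderings and the heatmaps can be fed to a model that expects video frames.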

Joint Video and Heatmap Diffusion

The model leverages a 5B-parameter video foundation model (Wan2.2) augmented with view-attention modules to ensure multi-view consistency. It is trained to jointly predict future multi-view RGB video sequences and corresponding heatmap sequences. By modeling how the environment evolves as a consequence of actions, the model captures environmental dynamics and future end-effector trajectories within a shared latent space.

Hierarchical Action Decoding

To recover executable actions, MV-VDP employs a two-stream decoding process. The predicted heatmap peaks are back-projected into 3D space using camera parameters to recover a continuous 3D end-effector trajectory. Simultaneously, a lightweight rotation and gripper predictor takes the denoised video latents and localized heatmap features as input to estimate the end-effector's rotation (Euler angles) and gripper states.
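The position stream of this decoding step can be illustrated as follows. The sketch assumes three axis-aligned orthographic views over a cubic workspace (`VIEW_AXES`, `lo`, and `hi` are hypothetical) and simply averages the per-view estimates of each world axis; the actual camera parameters and fusion rule used by MV-VDP are not specified here, and the rotation and gripper predictor is not covered.

```python
# Hypothetical convention: each orthographic view observes two world axes (u, v).
VIEW_AXES = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}


def peak(heatmap):
    """Return the (u, v) pixel of the largest heatmap value."""
    _, v, u = max((val, v, u) for v, row in enumerate(heatmap) for u, val in enumerate(row))
    return u, v


def back_project(heatmaps, lo=-0.5, hi=0.5):
    """Average the per-view estimates of each world axis into one 3D point."""
    sums, counts = [0.0] * 3, [0] * 3
    for view, heatmap in heatmaps.items():
        step = (hi - lo) / (len(heatmap) - 1)  # world units per pixel
        for axis, pixel in zip(VIEW_AXES[view], peak(heatmap)):
            sums[axis] += lo + pixel * step
            counts[axis] += 1
    if 0 in counts:
        raise ValueError("the views must jointly observe all three axes")
    return tuple(s / c for s, c in zip(sums, counts))


def decode_trajectory(heatmap_video, lo=-0.5, hi=0.5):
    """Decode one 3D position per predicted frame (each frame maps view -> heatmap)."""
    return [back_project(frame, lo, hi) for frame in heatmap_video]
```

Applying `decode_trajectory` to the predicted heatmap video yields one 3D position per frame, which is what makes the recovered trajectory continuous rather than a set of isolated key poses.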

Figure 2: Network architecture of MV-VDP

Meta-World Simulation

The simulation experiments focus on seven distinct Meta-World tasks, including Door-Open, Button-Press, and Faucet-Close, using only five demonstration trajectories per task to evaluate learning efficiency. MV-VDP achieves a state-of-the-art average success rate of 89.1%, outperforming the next-best video-prediction baseline, Track2Act (67.4%), by 21.7 percentage points, as well as standard behavioral cloning methods, which struggle under such limited data. These results demonstrate that aligning action fine-tuning with video foundation model pretraining effectively reduces the gap between perception and control.

Method	Door-Open	Door-Close	Button-Press	Button-Press-Topdown	Faucet-Close	Faucet-Open	Handle-Press	Avg. Succ. (%) ↑
UniPi (Du et al., 2023)	0/25	9/25	3/25	0/25	1/25	3/25	4/25	11.40
BC-Scratch (Nair et al., 2022)	6/25	9/25	9/25	3/25	5/25	5/25	9/25	26.20
BC-R3M (Nair et al., 2022)	1/25	15/25	9/25	1/25	6/25	17/25	13/25	35.40
DP (Chi et al., 2025)	12/25	12/25	10/25	5/25	6/25	15/25	6/25	37.70
AVDC (Ko et al., 2023)	18/25	23/25	15/25	6/25	14/25	6/25	21/25	58.90
DreamZero (Ye et al., 2026)	0/25	11/25	23/25	3/25	20/25	25/25	25/25	61.10
Track2Act (Bharadhwaj et al., 2024b)	22/25	19/25	14/25	10/25	12/25	22/25	19/25	67.40
MV-VDP (Ours)	25/25	25/25	25/25	24/25	8/25	24/25	25/25	89.10
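The averages in the last column can be reproduced from the per-task counts. The short check below assumes the average is the pooled success rate over all 7 × 25 trials, which equals the mean of the per-task rates because every task has the same number of trials; the helper name `average_success` is ours.

```python
def average_success(successes, trials_per_task=25):
    """Pooled success rate in percent over all tasks."""
    return 100.0 * sum(successes) / (trials_per_task * len(successes))


# Per-task success counts copied from the table above.
mv_vdp = [25, 25, 25, 24, 8, 24, 25]
track2act = [22, 19, 14, 10, 12, 22, 19]

print(f"{average_success(mv_vdp):.1f}")     # 89.1
print(f"{average_success(track2act):.1f}")  # 67.4
```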

Rollout videos: Handle Press (Episodes 1 to 3).

Real-World Base Tasks

The real-world evaluation involves a Franka Research 3 robot performing three foundational tasks: Put Lion (pick-and-place), Push-T (complex pushing), and Scoop Tortilla (contact-rich manipulation). With only ten expert demonstrations, MV-VDP reaches a 100% success rate on the Put Lion task and successfully manages the continuous dynamics of pushing and scooping, where prior key-pose-based methods such as BridgeVLA often fail. By predicting continuous action chunks rather than isolated waypoints, the model maintains the precise temporal coordination required for high-dexterity tasks.

Method	Put Lion (basic)	Push-T (basic)	Scoop Tort. (basic)	Put-B (unseen)	Put-H (unseen)	Push-L (unseen)	Scoop-C (unseen)	Avg. Succ. (%) ↑
DP3 (Ze et al., 2024)	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0.00
π_0.5 (Black et al., 2025a)	1/10	0/10	0/10	0/10	0/10	0/10	0/10	1.40
UVA (Li et al., 2025d)	2/10	0/10	0/10	1/10	1/10	0/10	0/10	5.70
BridgeVLA (Li et al., 2025b)	9/10	0/10	4/10	8/10	7/10	0/10	1/10	41.42
MV-VDP (Ours)	10/10	4/10	7/10	5/10	6/10	3/10	5/10	57.10

Table 2: Real-world manipulation results under limited demonstrations. We report success rates over 10 trials per task on three basic tasks and four unseen tasks. All methods are trained with 10 expert trajectories.

Rollout videos (multiple trials per task): "Scoop the tortilla into the plastic plate", "Push the T Block into the target region", and "Put the lion on the shelf".

Video Generation Showcase

Predicted videos on Meta-World (button-press, door-open, faucet-close; Episodes 1 to 3 each) and on the real-world tasks ("Scoop the tortilla into the plastic plate", "Push the T Block into the target region", "Put the lion on the shelf"; Trials 1 to 3 each).

Robustness Analysis

We vary the number of inference denoising steps N from 1 to 50; the results are presented in Fig. 3. Surprisingly, whereas typical video diffusion processes require around 50 denoising steps, MV-VDP achieves a comparable success rate with just a single denoising step. We believe the main reason is that heatmaps have relatively simple distribution modes and lack the high-frequency details present in regular images. Furthermore, our action prediction relies solely on the peak locations of the heatmaps, so the overall quality of the heatmaps is less critical and fewer denoising steps are required.
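The point about peak locations can be illustrated with a toy example. The sketch below is not a model of diffusion sampling: it perturbs a clean Gaussian heatmap with bounded uniform noise (the amplitude, `sigma`, and image size are assumptions) and shows that the decoded peak is unchanged even though every pixel value is corrupted.

```python
import math
import random


def gaussian_heatmap(u0, v0, size=33, sigma=1.5):
    """Clean heatmap: a 2D Gaussian with its maximum (1.0) at pixel (u0, v0)."""
    return [
        [math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2 * sigma ** 2)) for u in range(size)]
        for v in range(size)
    ]


def peak(heatmap):
    """Return the (u, v) pixel of the largest heatmap value."""
    _, v, u = max((val, v, u) for v, row in enumerate(heatmap) for u, val in enumerate(row))
    return u, v


def perturb(heatmap, amplitude, rng):
    """Add independent uniform noise in [-amplitude, amplitude] to every pixel."""
    return [[value + rng.uniform(-amplitude, amplitude) for value in row] for row in heatmap]


rng = random.Random(0)
clean = gaussian_heatmap(24, 8)
noisy = perturb(clean, amplitude=0.05, rng=rng)

# With sigma = 1.5 the nearest neighbours of the peak are about 0.80, so noise
# bounded by 0.05 cannot lift any other pixel above the peak.
print(peak(clean), peak(noisy))  # (24, 8) (24, 8)
```

Larger or more structured errors can of course move the peak; the example only shows why pixel-level fidelity and peak accuracy are different requirements.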

Figure 3: Average success rates for different inference denoising steps. MV-VDP demonstrates high robustness to varying diffusion steps, achieving strong performance even when the denoising step is set to 1.

For better visualization, we present example prediction results with different inference denoising steps in Fig. 4. As shown, when the denoising step is set to 1, the visual quality of the RGB prediction is somewhat lower, but the predicted heatmaps still provide meaningful information.

Figure 4: Visualization of predicted RGB videos and heatmap videos under different denoising step settings. Fewer denoising steps lead to visibly lower RGB video quality, while the predicted heatmaps remain relatively stable.

Ablation Studies

We conduct ablation studies on the Meta-World benchmark to evaluate the key design choices in MV-VDP, as summarized in Tab. 3. Model #2 replaces LoRA with full fine-tuning and achieves comparable performance (87.4%), but with significantly higher computational and memory costs. Therefore, we adopt LoRA fine-tuning in all experiments unless otherwise specified.

Model #3 concatenates heatmap and image sequences along the channel dimension rather than the view dimension, which requires additional convolutional adaptation for the DiT. This introduces an information bottleneck and degrades performance (89.1% → 81.1%), indicating that view-wise concatenation better preserves multi-view information.
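The difference between the two concatenation schemes shows up directly in the tensor shapes. The snippet below uses a hypothetical latent layout (views, channels, frames, height, width) with placeholder sizes; it only tracks shapes and does not reflect the actual latent dimensions of the model.

```python
def concat_shape(rgb_shape, heatmap_shape, axis):
    """Shape after concatenating two latent tensors along `axis`."""
    if len(rgb_shape) != len(heatmap_shape):
        raise ValueError("both tensors need the same number of dimensions")
    for i, (a, b) in enumerate(zip(rgb_shape, heatmap_shape)):
        if i != axis and a != b:
            raise ValueError(f"dimension {i} differs: {a} vs {b}")
    out = list(rgb_shape)
    out[axis] += heatmap_shape[axis]
    return tuple(out)


# Placeholder sizes: 3 views, 16 latent channels, 6 frames, 16 x 16 latent pixels.
rgb = (3, 16, 6, 16, 16)
heatmap = (3, 16, 6, 16, 16)

view_wise = concat_shape(rgb, heatmap, axis=0)
channel_wise = concat_shape(rgb, heatmap, axis=1)
print(view_wise)     # (6, 16, 6, 16, 16)
print(channel_wise)  # (3, 32, 6, 16, 16)
```

View-wise concatenation leaves the channel count untouched, so heatmaps enter the network as additional views in the same format as RGB frames. Channel-wise concatenation doubles the channel count, which is why the input layer needs the extra convolutional adaptation described above.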

Model #4 predicts only heatmap sequences without joint video prediction, resulting in a substantial performance drop (89.1% → 61.1%), highlighting the importance of modeling temporal dynamics through video prediction. Finally, Model #5 removes pretrained video foundation weights and fails to fit the training data, demonstrating the critical role of large-scale video pretraining for data-efficient manipulation.

#	Video Pred	View Concat	Initial Weights	LoRA	Avg (%)
1	✓	✓	✓	✓	89.1
2	✓	✓	✓	-	87.4
3	✓	-	✓	✓	81.1
4	-	✓	✓	✓	61.1
5	✓	✓	-	✓	4.6

Table 3: Ablation study of MV-VDP design choices on Meta-World. Columns indicate whether the model applies LoRA fine-tuning (LoRA), concatenates heatmap and image sequences along the view dimension (View Concat), predicts future video frames (Video Pred) and uses pretrained initialization (Initial Weights). The last column reports the average success rate (%) across seven tasks.

Safer Deployment and Enhanced Explainability

Deploying manipulation policies in the real world poses safety challenges: it is difficult to assess whether a predicted action sequence is reasonable or safe from its raw numerical representation. In practice, reliable verification often requires actual action execution, which can be unsafe and may damage the robot or surroundings.

In contrast, MV-VDP generates realistic, temporally consistent multi-view video and heatmap sequences, allowing users to visually inspect predicted rollouts before execution. This provides a safer, more interpretable action validation mechanism.

To quantify this benefit, we conducted a study with four evaluators, each performing 35 rollouts (140 in total). The evaluators reviewed the generated videos before execution and reran any rollout that appeared unsafe (e.g., potential collisions). Tab. 4 shows that video-based inspection reduces collision events from 6 to 0 out of 140 rollouts, demonstrating that MV-VDP's predicted RGB videos enhance the interpretability of action outputs and provide a practical tool for safer deployment.

	With video checking	Without video checking
Collisions	0 / 140	6 / 140

Table 4: Number of collision events with and without video-based action checking.


@article{li2026multi,
  title={Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model},
  author={Li, Peiyan and Chen, Yixiang and Xu, Yuan and Yang, Jiabing and Wu, Xiangnan and Guo, Jun and Sun, Nan and Qian, Long and Li, Xinghang and Xiao, Xin and others},
  journal={arXiv preprint arXiv:2604.03181},
  year={2026}
}
