Driving world models can forecast future observations from large-scale logs, while recent vision–language–action (VLA) planners leverage large language models for richer reasoning and instruction following. Bridging these directions, world–action models aim to couple future world generation with motion planning. However, existing approaches often focus on 2D appearance or latent representations, leaving the role of explicit 3D structure underexplored.
We present DriveDreamer-Policy, a unified driving world–action model that integrates depth generation, future video imagination, and motion planning within a single modular architecture. The model employs a large language model that processes language instructions, multi-view images, and action context together with a fixed set of learnable world and action queries. These queries produce compact world embeddings that condition three lightweight generative experts via cross-attention: a pixel-space depth generator, a latent-space video generator, and a diffusion-based action generator.
A structured query ordering enforces a 3D → 2D → 1D information flow, allowing video imagination to benefit from geometric cues and planning to leverage both geometry and predicted future dynamics while keeping inference modular and efficient.
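The 3D → 2D → 1D ordering can be realized as a block-structured attention mask over the LLM sequence. A minimal sketch (an assumption about the implementation, not the released code; the real LLM would additionally apply its own causal mask within the context block):

```python
import numpy as np

def query_order_mask(n_ctx, n_depth, n_video, n_action):
    """Boolean attention mask (True = attention allowed) enforcing the
    3D -> 2D -> 1D flow: depth queries read only the multimodal context,
    video queries additionally read depth queries, and action queries
    read everything. Context self-attention is simplified to full here."""
    n = n_ctx + n_depth + n_video + n_action
    mask = np.zeros((n, n), dtype=bool)
    d0 = n_ctx                       # start of depth-query block
    v0 = n_ctx + n_depth             # start of video-query block
    a0 = n_ctx + n_depth + n_video   # start of action-query block
    mask[:, :n_ctx] = True           # every token attends to the context
    mask[d0:, d0:v0] = True          # depth/video/action rows see depth queries
    mask[v0:, v0:a0] = True          # video/action rows see video queries
    mask[a0:, a0:] = True            # action rows see action queries
    return mask
```

Because depth rows cannot attend to video or action columns, geometric information flows strictly downstream, matching the stated ordering.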
Keywords: World Action Model · Autonomous Driving · Video Generation · Depth Estimation · Motion Planning
DriveDreamer-Policy couples a large language model (Qwen3-VL-2B) with three lightweight generative experts, connected through a fixed-size query bottleneck. The LLM processes multi-view images, language instructions, and current action context alongside learnable depth, video, and action queries. The resulting embeddings condition three modular experts for depth, video, and action generation.
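The query bottleneck means each expert never sees the raw LLM sequence, only the fixed-size world embeddings, which it reads via cross-attention. A single-head numpy sketch of that conditioning path (projection names `Wq`/`Wk`/`Wv` are hypothetical, for illustration only):

```python
import numpy as np

def cross_attend(expert_tokens, world_emb, Wq, Wk, Wv):
    """Single-head cross-attention: expert tokens (e.g. depth-generator
    latents) query the fixed-size world embeddings produced by the LLM.
    Shapes: expert_tokens (T, d), world_emb (Q, d), Wq/Wk/Wv (d, d)."""
    q = expert_tokens @ Wq                           # (T, d)
    k = world_emb @ Wk                               # (Q, d)
    v = world_emb @ Wv                               # (Q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, Q)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over Q slots
    return attn @ v                                  # (T, d)
```

Keeping Q small (tens of tokens, per the ablation below) is what makes the experts cheap to condition and the overall inference modular.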
Pixel-space diffusion transformer trained with flow matching. Generates monocular depth as an explicit 3D scaffold, conditioned on world depth embeddings via cross-attention. Operates directly in pixel space for boundary fidelity.
Latent-space text-image-to-video diffusion transformer initialized from Wan-2.1. Conditioned on world video embeddings (which incorporate upstream depth cues) and a CLIP visual condition for appearance grounding. Generates 9-frame future videos at 144×256.
Standalone flow-matching diffusion transformer mapping noise to feasible future trajectories. Waypoints are parameterized as (x, y, cos θ, sin θ) for smooth turn dynamics. Can also run independently in a planning-only mode.
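The (x, y, cos θ, sin θ) parameterization can be sketched as a simple encode/decode pair; the point of the unit-circle heading is that it has no ±π wrap-around discontinuity during turns (function names are illustrative, not from the paper):

```python
import numpy as np

def encode_traj(xy, heading):
    """Pack waypoints into (x, y, cos th, sin th): representing heading on
    the unit circle avoids the 2*pi wrap-around discontinuity that a raw
    angle target has when the vehicle turns through +-pi."""
    return np.stack([xy[:, 0], xy[:, 1],
                     np.cos(heading), np.sin(heading)], axis=-1)

def decode_traj(traj):
    """Recover (x, y) and the heading angle. arctan2 implicitly renormalizes
    the (cos, sin) pair, so denoised outputs slightly off the unit circle
    still decode to a valid heading."""
    heading = np.arctan2(traj[:, 3], traj[:, 2])
    return traj[:, :2], heading
```

Encode is applied to ground-truth trajectories for training targets; decode maps the diffusion output back to waypoints and headings for execution.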
DriveDreamer-Policy extends beyond existing paradigms by jointly producing depth, video, and actions. Vision-based and VLA planners map observations directly to actions without predicting the future. World models generate future observations but rely on external action signals. Recent world–action models unify generation and planning but typically operate only on 2D image/video representations. DriveDreamer-Policy adds explicit 3D depth generation alongside video and actions, enabling geometry-grounded imagination and planning.
| Method | LPIPS ↓ | PSNR ↑ | FVD ↓ |
|---|---|---|---|
| PWM | 0.23 | 21.57 | 85.95 |
| Ours | 0.20 | 21.05 | 53.59 |
| Method | AbsRel ↓ | δ₁ ↑ | δ₂ ↑ | δ₃ ↑ |
|---|---|---|---|---|
| PPD (zero-shot) | 18.5 | 80.4 | 94.0 | 97.2 |
| PPD (fine-tuned) | 9.3 | 91.4 | 98.3 | 99.5 |
| Ours | 8.1 | 92.8 | 98.6 | 99.5 |
All world learning strategies improve planning. Joint depth+video training achieves the largest gains by enabling more generalizable and robust features.
| Strategy | Depth | Video | NC ↑ | DAC ↑ | TTC ↑ | C ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Without World Learning | ✗ | ✗ | 98.0 | 96.3 | 94.4 | 100.0 | 82.5 | 88.0 |
| Depth only | ✓ | ✗ | 98.1 | 96.7 | 94.9 | 100.0 | 82.8 | 88.5 |
| Video only | ✗ | ✓ | 98.1 | 97.0 | 95.0 | 100.0 | 83.1 | 88.9 |
| Full (Depth + Video) | ✓ | ✓ | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
Joint learning with depth improves video generation accuracy — depth provides an effective 3D scaffold for coherent future prediction.
| Strategy | Depth | LPIPS ↓ | PSNR ↑ | FVD ↓ |
|---|---|---|---|---|
| Without Depth | ✗ | 0.22 | 19.89 | 65.82 |
| With Depth | ✓ | 0.20 | 21.05 | 53.59 |
More query tokens provide higher-capacity slots for storing relevant context, enhancing both generation and planning performance.
| D / V / A Queries | AbsRel ↓ | δ₁ ↑ | FVD ↓ | NC ↑ | DAC ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|
| 32 / 32 / 4 | 9.7 | 90.2 | 57.97 | 98.2 | 97.0 | 83.2 | 88.9 |
| 64 / 64 / 8 | 8.1 | 92.8 | 53.59 | 98.4 | 97.1 | 83.5 | 89.2 |
Representative NAVSIM scenarios showing predicted trajectories on BEV renderings alongside generated depth maps and future frames. Depth-conditioned imagination reduces common failure modes such as short-horizon collision risk and off-road drift, and improves interpretability by exposing predicted geometry and future scene evolution.
Our model generates action-conditioned future videos across diverse driving scenarios. Below we showcase representative results demonstrating realistic scene dynamics, geometric consistency, and plausible motion of surrounding agents.
We compare four ablation variants — Action-Only, Depth-Action, Video-Action, and Depth-Video-Action — to show how world representations improve trajectory quality. Across different scenarios, adding depth and video cues consistently yields safer, more human-like trajectories.
@inproceedings{drivedreamer-policy2026,
title = {DriveDreamer-Policy: A Geometry-Grounded Unified
Driving World-Action Model for World Generation
and Planning},
author = {Authors},
booktitle = {Preprint},
year = {2026}
}