DriveDreamer-Policy

A Geometry-Grounded World-Action Model for Unified Generation and Planning

1 GigaAI

2 University of Toronto

3 CUHK MMLab

* Corresponding Author

This work was conducted during Yang's internship at GigaAI.

State-of-the-Art on Navsim Planning & Generation
- 89.2 PDMS on Navsim v1
- 88.7 EPDMS on Navsim v2
- 53.59 FVD (a 38% reduction vs. prior)
- 8.1 AbsRel depth error (vs. DA3 pseudo ground truth)

Bridging World Models and Planning through Geometry

Driving world models can forecast future observations from large-scale logs, while recent vision–language–action (VLA) planners leverage large language models for richer reasoning and instruction following. Bridging these directions, world–action models aim to couple future world generation with motion planning. However, existing approaches often focus on 2D appearance or latent representations, leaving the role of explicit 3D structure under-explored.

We present DriveDreamer-Policy, a unified driving world–action model that integrates depth generation, future video imagination, and motion planning within a single modular architecture. The model employs a large language model that processes language instructions, multi-view images, and action context together with a fixed set of learnable world and action queries. These queries produce compact world embeddings that condition three lightweight generative experts via cross-attention: a pixel-space depth generator, a latent-space video generator, and a diffusion-based action generator.

A structured query ordering enforces a 3D → 2D → 1D information flow, allowing video imagination to benefit from geometric cues and planning to leverage both geometry and predicted future dynamics while keeping inference modular and efficient.
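The 3D → 2D → 1D ordering can be sketched as a block-causal attention mask over the learnable query tokens: depth (3D) queries come first, video (2D) queries may additionally attend to depth queries, and action (1D) queries may attend to both. This is a minimal illustration under our own naming, not the released implementation; the 64 / 64 / 8 sizes follow the query configuration reported in the ablations.

```python
import numpy as np

def build_query_order_mask(n_depth=64, n_video=64, n_action=8):
    """Block-causal mask enforcing the 3D -> 2D -> 1D information flow.

    True means attention is allowed. Each query group attends within
    itself and to all earlier groups, so geometry (depth) conditions
    video, and both condition action. Names and layout are ours.
    """
    n = n_depth + n_video + n_action
    mask = np.zeros((n, n), dtype=bool)
    bounds = [0, n_depth, n_depth + n_video, n]
    for g in range(3):
        lo, hi = bounds[g], bounds[g + 1]
        mask[lo:hi, :hi] = True  # attend to self-group and earlier groups
    return mask

mask = build_query_order_mask()
# Depth queries never see video or action queries; action queries see all.
```

In a transformer this mask would be passed to the attention layers over the query tokens, so stricter ordering costs nothing at inference and keeps the three experts detachable.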

Keywords: World Action Model · Autonomous Driving · Video Generation · Depth Estimation · Motion Planning


Architecture Overview

DriveDreamer-Policy couples a large language model (Qwen3-VL-2B) with three lightweight generative experts, connected through a fixed-size query bottleneck. The LLM processes multi-view images, language instructions, and current action context alongside learnable depth, video, and action queries. The resulting embeddings condition three modular experts for depth, video, and action generation.

Figure 2: Overview of the DriveDreamer-Policy pipeline (3D depth → 2D video → 1D action). The large language model takes the language instruction, multi-view images, and current action, together with learnable queries, and produces world and action embeddings. These embeddings condition the three generative experts via cross-attention to generate depth, future video, and future actions.

Depth Generator

A pixel-space diffusion transformer trained with flow matching. It generates monocular depth as an explicit 3D scaffold, conditioned on the world depth embeddings via cross-attention, and operates directly in pixel space to preserve boundary fidelity.
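As a rough sketch of what a flow-matching training step looks like for such an expert, the snippet below uses the standard rectified-flow objective: interpolate between Gaussian noise and the clean target, and regress the constant velocity along that path. The `model` callable, the conditioning interface, and the uniform time sampling are placeholders; the paper's exact schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, depth_gt, cond):
    """One rectified-flow training step (sketch, not the paper's code).

    model(x_t, t, cond) is assumed to predict the velocity field;
    cond stands in for the world depth embeddings.
    """
    x1 = np.asarray(depth_gt, dtype=float)   # clean depth map
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise sample
    t = rng.uniform()                        # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
    v_target = x1 - x0                       # constant velocity along path
    v_pred = model(x_t, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))
```

At inference the expert would integrate the learned velocity field from noise to a depth map, conditioned on the world embeddings.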

Video Generator

A latent-space text-image-to-video diffusion transformer initialized from Wan-2.1. It is conditioned on the world video embeddings (which already incorporate upstream depth cues) and a CLIP visual condition for appearance grounding, and generates 9-frame future videos at 144×256 resolution.

Action Generator

A standalone flow-matching diffusion transformer that maps noise to feasible future trajectories. Waypoints are parameterized as (x, y, cos θ, sin θ) for smooth turn dynamics, and the expert can run independently in a planning-only mode.


Comparison with Existing Models

DriveDreamer-Policy extends beyond existing paradigms by jointly producing depth, video, and actions. Vision-based and VLA planners map observations directly to actions without predicting the future. World models generate future observations but rely on external action signals. Recent world–action models unify generation and planning but typically operate only on 2D image/video representations. DriveDreamer-Policy adds explicit 3D depth generation alongside video and actions, enabling geometry-grounded imagination and planning.

Figure 1: Comparison of DriveDreamer-Policy with existing models. Items with dashed lines are optional. DriveDreamer-Policy extends world–action models by explicitly generating depth alongside video and actions, enabling geometry-grounded imagination and planning within a unified model.

Key Contributions
  1. Unified, modular world-action architecture — combines an LLM with generative experts connected through a fixed-size query interface, enabling practical compute control and flexible operating modes (planning-only, imagination-enabled planning, or full generation).
  2. Explicit 3D depth generation with causal conditioning — incorporates a depth generation module and utilizes a causal 3D→2D→1D conditioning pathway, allowing geometry to directly scaffold future video generation and motion planning.
  3. Comprehensive evaluation and state-of-the-art results — achieves 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video (FVD 53.59) and depth predictions.

Quantitative Results
Table 3a: Video Generation
Comparison with generative world-model methods on Navsim.
Method | LPIPS ↓ | PSNR ↑ | FVD ↓
PWM    | 0.23    | 21.57  | 85.95
Ours   | 0.20    | 21.05  | 53.59
Table 3b: Depth Generation
Comparison among depth model variants on Navsim. Depth ground truth is generated by Depth Anything 3 (DA3) as pseudo labels.
Method           | AbsRel ↓ | δ₁ ↑ | δ₂ ↑ | δ₃ ↑
PPD (zero-shot)  | 18.5     | 80.4 | 94.0 | 97.2
PPD (fine-tuned) | 9.3      | 91.4 | 98.3 | 99.5
Ours             | 8.1      | 92.8 | 98.6 | 99.5
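For readers unfamiliar with the depth metrics above, AbsRel and the δ thresholds are conventionally defined as follows: AbsRel is the mean relative error, and δₖ is the fraction of pixels whose prediction/ground-truth ratio is within 1.25ᵏ. This sketch uses the standard definitions; the paper's exact validity masking or scale alignment against the DA3 pseudo labels may differ.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard monocular depth metrics: AbsRel and delta_1..3.

    Pixels with non-positive ground truth are masked out.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]
    absrel = float(np.mean(np.abs(pred - gt) / gt))
    ratio = np.maximum(pred / gt, gt / pred)  # symmetric ratio >= 1
    deltas = {f"d{k}": float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return absrel, deltas
```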

Understanding the Design Choices

All world learning strategies improve planning. Joint depth+video training achieves the largest gains by enabling more generalizable and robust features.

Strategy               | Depth | Video | NC ↑ | DAC ↑ | TTC ↑ | C ↑   | EP ↑ | PDMS ↑
Without World Learning | ✗     | ✗     | 98.0 | 96.3  | 94.4  | 100.0 | 82.5 | 88.0
Depth only             | ✓     | ✗     | 98.1 | 96.7  | 94.9  | 100.0 | 82.8 | 88.5
Video only             | ✗     | ✓     | 98.1 | 97.0  | 95.0  | 100.0 | 83.1 | 88.9
Full (Depth + Video)   | ✓     | ✓     | 98.4 | 97.1  | 95.1  | 100.0 | 83.5 | 89.2

Joint learning with depth improves video generation accuracy — depth provides an effective 3D scaffold for coherent future prediction.

Strategy      | Depth | LPIPS ↓ | PSNR ↑ | FVD ↓
Without Depth | ✗     | 0.22    | 19.89  | 65.82
With Depth    | ✓     | 0.20    | 21.05  | 53.59

More query tokens provide higher-capacity slots for storing relevant context, enhancing both generation and planning performance.

D / V / A Queries | AbsRel ↓ | δ₁ ↑ | FVD ↓ | NC ↑ | DAC ↑ | EP ↑ | PDMS ↑
32 / 32 / 4       | 9.7      | 90.2 | 57.97 | 98.2 | 97.0  | 83.2 | 88.9
64 / 64 / 8       | 8.1      | 92.8 | 53.59 | 98.4 | 97.1  | 83.5 | 89.2

Visualization of Generation and Planning

Representative Navsim scenarios showing predicted trajectories on BEV renderings alongside generated depth maps and future frames. Depth-conditioned imagination reduces common failure modes such as short-horizon collision risk and off-road drift, and improves interpretability by exposing predicted geometry and future scene evolution.

Figure 3: Visualization of generated depth, video, and actions. Depth is clipped at 80 meters for visualization. The planned trajectory aligns well with the human trajectory (top) and decelerates more effectively (bottom).

Generated Future Videos

Our model generates action-conditioned future videos across diverse driving scenarios. Below we showcase representative results demonstrating realistic scene dynamics, geometric consistency, and plausible motion of surrounding agents.

Effect of World Learning on Planning

We compare four ablation variants — Action-Only, Depth-Action, Video-Action, and Depth-Video-Action — to show how world representations improve trajectory quality. Across different scenarios, adding depth and video cues consistently yields safer, more human-like trajectories.

Figure 4: Visualization of world learning for planning. Green denotes the human trajectory and red denotes the predicted trajectory. Depth and video provide complementary world cues that improve safety margins and trajectory consistency.

BibTeX
@inproceedings{drivedreamer-policy2026,
  title     = {DriveDreamer-Policy: A Geometry-Grounded Unified
               Driving World-Action Model for World Generation
               and Planning},
  author    = {Authors},
  booktitle = {Preprint},
  year      = {2026}
}