Driving world models can forecast future observations from large-scale logs, while recent vision–language–action (VLA) planners leverage large language models for richer reasoning and instruction following. Bridging these directions, world–action models aim to couple future world generation with motion planning. However, existing approaches often focus on 2D appearance or latent representations, leaving the role of explicit 3D structure underexplored.
We present DriveDreamer-Policy, a unified driving world–action model that integrates depth generation, future video imagination, and motion planning within a single modular architecture. The model employs a large language model that processes language instructions, multi-view images, and action context together with a fixed set of learnable world and action queries. These queries produce compact world embeddings that condition three lightweight generative experts via cross-attention: a pixel-space depth generator, a latent-space video generator, and a diffusion-based action generator.
A structured query ordering enforces a 3D → 2D → 1D information flow, allowing video imagination to benefit from geometric cues and planning to leverage both geometry and predicted future dynamics while keeping inference modular and efficient.
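The 3D → 2D → 1D ordering can be realized as a block-structured attention mask over the LLM sequence. A minimal sketch (an assumption about the implementation, not the released code; the real LLM would additionally apply its own causal mask within the context block):

```python
import numpy as np

def query_order_mask(n_ctx, n_depth, n_video, n_action):
    """Boolean attention mask (True = attention allowed) enforcing the
    3D -> 2D -> 1D flow: depth queries read only the multimodal context,
    video queries additionally read depth queries, and action queries
    read everything. Context self-attention is simplified to full here."""
    n = n_ctx + n_depth + n_video + n_action
    mask = np.zeros((n, n), dtype=bool)
    d0 = n_ctx                       # start of depth-query block
    v0 = n_ctx + n_depth             # start of video-query block
    a0 = n_ctx + n_depth + n_video   # start of action-query block
    mask[:, :n_ctx] = True           # every token attends to the context
    mask[d0:, d0:v0] = True          # depth/video/action rows see depth queries
    mask[v0:, v0:a0] = True          # video/action rows see video queries
    mask[a0:, a0:] = True            # action rows see action queries
    return mask
```

Because depth rows cannot attend to video or action columns, geometric information flows strictly downstream, matching the stated ordering.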
Keywords: World Action Model · Autonomous Driving · Video Generation · Depth Estimation · Motion Planning
DriveDreamer-Policy couples a large language model (Qwen3-VL-2B) with three lightweight generative experts, connected through a fixed-size query bottleneck. The LLM processes multi-view images, language instructions, and current action context alongside learnable depth, video, and action queries. The resulting embeddings condition three modular experts for depth, video, and action generation.
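The query bottleneck means each expert never sees the raw LLM sequence, only the fixed-size world embeddings, which it reads via cross-attention. A single-head numpy sketch of that conditioning path (projection names `Wq`/`Wk`/`Wv` are hypothetical, for illustration only):

```python
import numpy as np

def cross_attend(expert_tokens, world_emb, Wq, Wk, Wv):
    """Single-head cross-attention: expert tokens (e.g. depth-generator
    latents) query the fixed-size world embeddings produced by the LLM.
    Shapes: expert_tokens (T, d), world_emb (Q, d), Wq/Wk/Wv (d, d)."""
    q = expert_tokens @ Wq                           # (T, d)
    k = world_emb @ Wk                               # (Q, d)
    v = world_emb @ Wv                               # (Q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, Q)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over Q slots
    return attn @ v                                  # (T, d)
```

Keeping Q small (tens of tokens, per the ablation below) is what makes the experts cheap to condition and the overall inference modular.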
Pixel-space diffusion transformer trained with flow matching. Generates monocular depth as an explicit 3D scaffold, conditioned on world depth embeddings via cross-attention. Operates directly in pixel space for boundary fidelity.
Latent-space text-image-to-video diffusion transformer initialized from Wan-2.1. Conditioned on world video embeddings (which incorporate upstream depth cues) and a CLIP visual condition for appearance grounding. Generates 9-frame future videos at 144×256.
Standalone flow-matching diffusion transformer mapping noise to feasible future trajectories. Waypoints are parameterized as (x, y, cos θ, sin θ) for smooth turn dynamics. Can also run independently in a planning-only mode.
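The (x, y, cos θ, sin θ) parameterization can be sketched as a simple encode/decode pair; the point of the unit-circle heading is that it has no ±π wrap-around discontinuity during turns (function names are illustrative, not from the paper):

```python
import numpy as np

def encode_traj(xy, heading):
    """Pack waypoints into (x, y, cos th, sin th): representing heading on
    the unit circle avoids the 2*pi wrap-around discontinuity that a raw
    angle target has when the vehicle turns through +-pi."""
    return np.stack([xy[:, 0], xy[:, 1],
                     np.cos(heading), np.sin(heading)], axis=-1)

def decode_traj(traj):
    """Recover (x, y) and the heading angle. arctan2 implicitly renormalizes
    the (cos, sin) pair, so denoised outputs slightly off the unit circle
    still decode to a valid heading."""
    heading = np.arctan2(traj[:, 3], traj[:, 2])
    return traj[:, :2], heading
```

Encode is applied to ground-truth trajectories for training targets; decode maps the diffusion output back to waypoints and headings for execution.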
DriveDreamer-Policy extends beyond existing paradigms by jointly producing depth, video, and actions. Vision-based and VLA planners map observations directly to actions without predicting the future. World models generate future observations but rely on external action signals. Recent world–action models unify generation and planning but typically operate only on 2D image/video representations. DriveDreamer-Policy adds explicit 3D depth generation alongside video and actions, enabling geometry-grounded imagination and planning.
| Method | LPIPS ↓ | PSNR ↑ | FVD ↓ |
|---|---|---|---|
| PWM | 0.23 | 21.57 | 85.95 |
| Ours | 0.20 | 21.05 | 53.59 |
| Method | AbsRel ↓ | δ₁ ↑ | δ₂ ↑ | δ₃ ↑ |
|---|---|---|---|---|
| PPD (zero-shot) | 18.5 | 80.4 | 94.0 | 97.2 |
| PPD (fine-tuned) | 9.3 | 91.4 | 98.3 | 99.5 |
| Ours | 8.1 | 92.8 | 98.6 | 99.5 |
All world learning strategies improve planning. Joint depth+video training achieves the largest gains by enabling more generalizable and robust features.
| Strategy | Depth | Video | NC ↑ | DAC ↑ | TTC ↑ | C ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Without World Learning | ✗ | ✗ | 98.0 | 96.3 | 94.4 | 100.0 | 82.5 | 88.0 |
| Depth only | ✓ | ✗ | 98.1 | 96.7 | 94.9 | 100.0 | 82.8 | 88.5 |
| Video only | ✗ | ✓ | 98.1 | 97.0 | 95.0 | 100.0 | 83.1 | 88.9 |
| Full (Depth + Video) | ✓ | ✓ | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
Joint learning with depth improves video generation accuracy — depth provides an effective 3D scaffold for coherent future prediction.
| Strategy | Depth | LPIPS ↓ | PSNR ↑ | FVD ↓ |
|---|---|---|---|---|
| Without Depth | ✗ | 0.22 | 19.89 | 65.82 |
| With Depth | ✓ | 0.20 | 21.05 | 53.59 |
More query tokens provide higher-capacity slots for storing relevant context, enhancing both generation and planning performance.
| D / V / A Queries | AbsRel ↓ | δ₁ ↑ | FVD ↓ | NC ↑ | DAC ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|
| 32 / 32 / 4 | 9.7 | 90.2 | 57.97 | 98.2 | 97.0 | 83.2 | 88.9 |
| 64 / 64 / 8 | 8.1 | 92.8 | 53.59 | 98.4 | 97.1 | 83.5 | 89.2 |
Representative NAVSIM scenarios showing predicted trajectories on BEV renderings alongside generated depth maps and future frames. Depth-conditioned imagination reduces common failure modes such as short-horizon collision risk and off-road drift, and improves interpretability by exposing predicted geometry and future scene evolution.
Our model generates action-conditioned future videos across diverse driving scenarios. Below we showcase representative results demonstrating realistic scene dynamics, geometric consistency, and plausible motion of surrounding agents.
We compare four ablation variants — Action-Only, Depth-Action, Video-Action, and Depth-Video-Action — to show how world representations improve trajectory quality. Across different scenarios, adding depth and video cues consistently yields safer, more human-like trajectories.
@inproceedings{drivedreamer-policy2026,
title = {DriveDreamer-Policy: A Geometry-Grounded Unified
Driving World-Action Model for World Generation
and Planning},
author = {Authors},
booktitle = {Preprint},
year = {2026}
}