WPT: World-to-Policy Transfer via Online World Model Distillation

1USTC 2CUHK-SZ 3HKUST 4Huawei Foundation Model Dept

Abstract

Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, which incurs substantial inference overhead or hinders end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. We then propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: a 0.11 collision rate in open-loop evaluation and a 79.23 driving score in closed-loop evaluation, surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student runs up to 4.9× faster at inference while retaining most of the teacher's gains.
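To make the reward-model idea concrete, below is a minimal sketch of how candidate trajectories could be scored against the future predicted by the world model. Note that the paper's reward model is trainable; the hand-crafted free-space score used here, along with the function name world_reward, all tensor shapes, and the ego-centred BEV assumption, are illustrative stand-ins rather than the actual implementation.

import torch

def world_reward(trajectories, future_occ, bev_range=50.0):
    # trajectories: (M, T, 2) candidate ego trajectories in metres, M modes, T future steps.
    # future_occ:   (T, H, W) predicted occupancy in [0, 1] for the same T steps.
    # Returns an (M,) reward: higher when a trajectory stays in predicted free space.
    M, T, _ = trajectories.shape
    _, H, W = future_occ.shape
    # Map metric (x, y) waypoints into indices of an assumed ego-centred BEV grid.
    xs = ((trajectories[..., 0] + bev_range) / (2 * bev_range) * (W - 1)).long().clamp(0, W - 1)
    ys = ((trajectories[..., 1] + bev_range) / (2 * bev_range) * (H - 1)).long().clamp(0, H - 1)
    t_idx = torch.arange(T).expand(M, T)
    occ_along_traj = future_occ[t_idx, ys, xs]   # (M, T) predicted occupancy under each waypoint
    return -occ_along_traj.mean(dim=-1)          # penalise trajectories that drive into occupied space

# Toy usage: 6 candidate modes, 8 future steps, a 200x200 predicted BEV occupancy grid.
rewards = world_reward(torch.randn(6, 8, 2) * 10.0, torch.rand(8, 200, 200))
best_mode = rewards.argmax()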

Overview

[Figure: Overview pipeline]

Overview of the WPT framework. During training (top), the pretrained world model predicts the future world conditioned on the given actions, and the teacher AD policy (T) generates multi-modal trajectories. The reward model evaluates these trajectories against the predicted future to produce world rewards. During distillation (bottom), the student AD policy (S) learns from the teacher through two mechanisms: (1) policy distillation, which aligns the planning representations of teacher and student; and (2) world reward distillation, which encourages the student to match the teacher's highest-reward trajectory in the predicted future world.
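As a rough illustration of these two mechanisms, the sketch below writes them as PyTorch losses. The tensor shapes, the MSE and cross-entropy choices, and the 0.5 weighting are assumptions made for the example, not details taken from the paper.

import torch
import torch.nn.functional as F

def policy_distillation_loss(student_plan_feat, teacher_plan_feat):
    # (1) Policy distillation: align the student's planning representation with the
    # teacher's (both assumed to be (B, D) planning-query features); teacher is frozen.
    return F.mse_loss(student_plan_feat, teacher_plan_feat.detach())

def world_reward_distillation_loss(student_mode_logits, teacher_world_rewards):
    # (2) World reward distillation: push the student's mode scores toward the
    # trajectory that the teacher-side world reward ranks highest.
    # student_mode_logits:   (B, M) student scores over M candidate trajectories.
    # teacher_world_rewards: (B, M) rewards from the teacher's reward model.
    best_mode = teacher_world_rewards.argmax(dim=-1)   # (B,) index of the highest-reward trajectory
    return F.cross_entropy(student_mode_logits, best_mode)

# Toy usage with batch 4, feature dim 256, and 6 trajectory modes.
loss = (policy_distillation_loss(torch.randn(4, 256), torch.randn(4, 256))
        + 0.5 * world_reward_distillation_loss(torch.randn(4, 6), torch.rand(4, 6)))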

Closed-loop Driving Examples

Main Results

NuScenes

[Figure: Results]

End-to-end planning performance on the nuScenes validation set. The ego state is not used in the planning module.

Bench2Drive

World Model & Planning Construction

World Model

Planning Network

[Figure: Results]

Illustration of our occupancy-based (Occ-based) and instance-based baseline models. The top part shows the occupancy-based baseline, while the bottom part shows the instance-based baseline. Both use a BEV decoder but differ in how the planning query interacts with the decoded features.
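The sketch below illustrates that difference under assumed dimensions; the module names OccBasedPlanner and InstanceBasedPlanner are hypothetical. In the occupancy-based baseline the planning query cross-attends to the dense BEV feature grid, while in the instance-based baseline it cross-attends to sparse instance queries.

import torch
import torch.nn as nn

class OccBasedPlanner(nn.Module):
    # Planning query cross-attends to dense BEV (occupancy) features.
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, plan_query, bev_feat):
        # plan_query: (B, 1, D); bev_feat: (B, H*W, D) flattened BEV grid.
        out, _ = self.attn(plan_query, bev_feat, bev_feat)
        return out

class InstanceBasedPlanner(nn.Module):
    # Planning query cross-attends to sparse instance (agent/map) queries.
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, plan_query, instance_queries):
        # instance_queries: (B, N, D), e.g. from detection and map heads.
        out, _ = self.attn(plan_query, instance_queries, instance_queries)
        return out

# Toy usage: batch 2, 256-d features, a 40x40 BEV grid vs. 50 instance queries.
q = torch.randn(2, 1, 256)
occ_plan = OccBasedPlanner()(q, torch.randn(2, 1600, 256))
ins_plan = InstanceBasedPlanner()(q, torch.randn(2, 50, 256))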

BibTeX

@inproceedings{Jiang2025WPT,
  title={WPT: World-to-Policy Transfer via Online World Model Distillation},
  author={Guangfeng Jiang and Yueru Luo and Jun Liu and Yi Huang and Yiyao Zhu and Zhan Qu and Dave Zhenyu Chen and Bingbing Liu and Xu Yan},
  year={2025}
}