4D Driving Scene Generation With Stereo Forcing

1Hong Kong University of Science and Technology (Guangzhou) 2PhiGent Robotics 3University of Science and Technology of China 4Shanghai Jiao Tong University 5University of California, Berkeley
*Equal Contribution Corresponding Author

Abstract

Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization; bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometry-guided video diffusion model that uses rendered historical 4D scenes as priors to generate future views conditioned on target trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Experiments demonstrate that our method achieves state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis (NVS), while also delivering competitive results in downstream evaluations.

Overview

Overview pipeline

Overall framework of the proposed PhiGenesis. (1) Stage 1 trains a 4D reconstruction generalist. Multi-view images are first fed into a fixed, pre-trained video VAE; the multi-scale features extracted from the VAE decoder are then passed through a range-view adapter to reconstruct the complete 4D scene (including optical flow, etc.). (2) Stage 2 enhances geometrically consistent generation. The 4D scene reconstructed from history is projected onto the viewpoints of the future trajectory. Stereo Forcing perturbs the rendered video according to its geometric uncertainty, and the result is sent to the pre-trained encoder to obtain the rendered multi-view latent. The rendered multi-view latent and a noise map are fed into the multi-view video diffusion model to generate the latent of the multi-view video along the target trajectory. Finally, this latent is passed to the pre-trained video decoder and the range-view GS adapter to produce the 4D scene corresponding to the target trajectory.
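
The two-stage data flow in this caption can be summarized in code. The sketch below is a minimal structural outline under our own naming assumptions: RangeViewAdapter, stage1_reconstruct, stage2_generate, and all stand-in modules are hypothetical placeholders that only mirror the described pipeline, not the released implementation.

```python
import torch
import torch.nn as nn


class RangeViewAdapter(nn.Module):
    """Hypothetical stand-in for the range-view adapter: maps video-VAE decoder
    features to per-pixel 4D scene attributes (geometry, semantics, flow)."""

    def __init__(self, feat_dim: int = 64, attr_dim: int = 16):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, attr_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)


def stage1_reconstruct(video_vae: nn.Module, adapter: nn.Module,
                       images: torch.Tensor) -> torch.Tensor:
    """Stage 1: feed-forward 4D reconstruction from multi-view images."""
    with torch.no_grad():              # the pre-trained video VAE stays fixed
        feats = video_vae(images)      # multi-scale decoder features (one scale shown here)
    return adapter(feats)              # complete 4D scene attributes per pixel


def stage2_generate(render_fn, stereo_forcing_fn, encoder, diffusion,
                    decoder, gs_adapter, history_scene, target_trajectory, noise):
    """Stage 2: geometry-guided generation along a target trajectory."""
    rendered = render_fn(history_scene, target_trajectory)   # project history 4D scene to future views
    conditioned = stereo_forcing_fn(rendered)                 # uncertainty-aware perturbation (see Stereo Forcing below)
    cond_latent = encoder(conditioned)                        # rendered multi-view latent
    latent = diffusion(cond_latent, noise)                    # multi-view video latent for the trajectory
    video = decoder(latent)                                   # pre-trained video decoder
    return gs_adapter(video)                                  # 4D scene for the target trajectory


if __name__ == "__main__":
    # Smoke test of Stage 1 with toy stand-ins (the real components are the frozen
    # video VAE and the trained range-view adapter).
    vae = nn.Conv2d(3, 64, kernel_size=3, padding=1)
    adapter = RangeViewAdapter()
    images = torch.randn(2, 3, 64, 64)           # (views, C, H, W); temporal axis omitted for brevity
    scene = stage1_reconstruct(vae, adapter, images)
    print(scene.shape)                            # torch.Size([2, 16, 64, 64])
```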

Temporal Multi-View Video Generation

Novel View Synthesis

Novel View Synthesis

On the Waymo dataset, we visualize three camera views generated from long video sequences. Even in long-horizon generation, our method maintains high quality after shifting the observation viewpoint.

Scene Editing

Scene Editing

Scene editing results on long video sequences from the Waymo dataset.

Stereo Forcing

Stereo Forcing

Stereo Forcing effectively reduces generative degradation and improves geometric consistency.
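
As one possible reading of this mechanism, the sketch below perturbs the rendered conditioning views with a per-pixel noise level driven by geometric uncertainty, so that well-constrained regions keep the rendered prior while uncertain regions rely more on generation. The function name, the [sigma_min, sigma_max] range, and the variance-preserving mix are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def stereo_forcing(rendered: torch.Tensor, uncertainty: torch.Tensor,
                   sigma_min: float = 0.1, sigma_max: float = 0.9) -> torch.Tensor:
    """Illustrative uncertainty-aware perturbation of rendered conditioning views.

    rendered:    (B, C, H, W) views rendered from the historical 4D scene
    uncertainty: (B, 1, H, W) per-pixel geometric uncertainty, assumed in [0, 1]
    """
    # Map uncertainty to a per-pixel noise level in [sigma_min, sigma_max]:
    # higher uncertainty -> stronger perturbation -> more generative freedom.
    sigma = sigma_min + (sigma_max - sigma_min) * uncertainty.clamp(0.0, 1.0)
    noise = torch.randn_like(rendered)
    # Variance-preserving mix of the rendered prior and Gaussian noise
    # (one plausible choice of perturbation, not the paper's exact schedule).
    return torch.sqrt(1.0 - sigma ** 2) * rendered + sigma * noise
```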

Main Results

Results

Comparison with state-of-the-art methods on the Waymo and nuScenes datasets. Metrics reported include Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), Depth RMSE (D-RMSE), Learned Perceptual Image Patch Similarity (LPIPS), and Pearson Correlation Coefficient (PCC). Higher values are better for PSNR, SSIM, and PCC; lower values are better for D-RMSE and LPIPS.
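
For reference, the snippet below sketches how PSNR, depth RMSE, and the Pearson correlation coefficient are typically computed; SSIM and LPIPS are usually taken from standard implementations (e.g., the torchmetrics and lpips packages). The function names and the valid-depth mask are our own illustrative choices, and the exact evaluation protocol (resolution, masking, averaging) follows the respective benchmarks rather than this sketch.

```python
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak Signal-to-Noise Ratio (higher is better)."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)


def depth_rmse(pred_depth: torch.Tensor, gt_depth: torch.Tensor) -> torch.Tensor:
    """Depth RMSE (lower is better), evaluated on pixels with valid ground-truth depth."""
    valid = gt_depth > 0
    return torch.sqrt(torch.mean((pred_depth[valid] - gt_depth[valid]) ** 2))


def pearson_cc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pearson Correlation Coefficient (higher is better)."""
    p = pred.flatten() - pred.mean()
    t = target.flatten() - target.mean()
    return (p * t).sum() / (p.norm() * t.norm() + 1e-8)
```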