Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

CVPR 2024 Highlight

From a single satellite image covering urban streets, Sat2Scene generates videos with photorealistic textures that remain consistent across different views.

Abstract

Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from the significant view change and scene scale. Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either operate at the object level or struggle to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach first generates texture colors at the point level for a given geometry using a 3D diffusion model, which are then transformed into a scene representation in a feed-forward manner. This representation can be used to render arbitrary views that excel in both single-frame quality and inter-frame consistency. Experiments on two city-scale datasets show that our model demonstrates proficiency in generating photorealistic street-view image sequences and cross-view urban scenes from satellite imagery.
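
As an illustration of the point-level color generation mentioned above, the following is a minimal, hypothetical sketch of a DDPM-style sampling loop that denoises per-point RGB colors on a fixed geometry. The eps_model callable stands in for the paper's sparse-convolutional 3D network; its interface and the noise schedule are assumptions for illustration, not the released code.

    import torch

    def sample_point_colors(eps_model, xyz, T=1000, device="cpu"):
        """Sample (N, 3) per-point colors for fixed geometry xyz of shape (N, 3)."""
        betas = torch.linspace(1e-4, 0.02, T, device=device)    # linear noise schedule (assumed)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        colors = torch.randn(xyz.shape[0], 3, device=device)    # start from pure noise
        for t in reversed(range(T)):
            eps = eps_model(xyz, colors, t)                      # predicted noise at each point
            coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
            mean = (colors - coef * eps) / torch.sqrt(alphas[t])
            noise = torch.randn_like(colors) if t > 0 else torch.zeros_like(colors)
            colors = mean + torch.sqrt(betas[t]) * noise         # ancestral sampling step
        return colors

    # Toy usage with a dummy denoiser and random geometry (illustration only):
    colors = sample_point_colors(lambda p, x, t: torch.zeros_like(x), torch.rand(1024, 3), T=50)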


Pipeline overview of our method. The full pipeline consists of three steps that generate the scene representation and render street views from satellite-inferred geometries. The generation step assigns colors to the foreground point cloud using a 3D diffusion model with sparse convolutions, and synthesizes the background panorama with a 2D diffusion model. The feature extraction step extracts scene features tightly anchored to the point cloud. The final rendering step produces images from arbitrary views through neural rendering.
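
To make the three-step structure concrete, the sketch below wires the steps together at a high level. All interfaces (sample_colors, sample_panorama, extract_features, render_view) are hypothetical placeholders for the learned components and do not reflect the actual implementation.

    import torch

    def sat2scene_render(xyz, cameras, sample_colors, sample_panorama,
                         extract_features, render_view):
        # 1) Generation: per-point foreground colors via 3D diffusion with sparse
        #    convolutions, plus a background panorama via a 2D diffusion model.
        colors = sample_colors(xyz)                      # (N, 3)
        panorama = sample_panorama()                     # (H, W, 3)

        # 2) Feature extraction: scene features anchored to the point cloud (feed-forward).
        features = extract_features(xyz, colors)         # (N, C)

        # 3) Rendering: neural rendering of arbitrary camera views.
        return [render_view(xyz, features, panorama, cam) for cam in cameras]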

Video panels (left to right): GT, Sat2Vid, MVDiffusion, Ours.

Street-view videos generated on the HoliCity dataset (with baseline comparison). Our method produces higher-quality videos with better temporal consistency than the baselines.

Video Presentation

Poster

BibTeX

@InProceedings{li2024sat2scene,
    author    = {Li, Zuoyue and Li, Zhenqiang and Cui, Zhaopeng and Pollefeys, Marc and Oswald, Martin R.},
    title     = {Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {7141-7150}
}