Creating large-scale interactive 3D environments is essential for research in Robotics and Embodied AI. Because pre-trained 2D image generative models capture scene and object configurations better than LLMs, we introduce Architect, a generative framework that creates complex and realistic 3D embodied environments by leveraging diffusion-based 2D image inpainting. We use foundation visual perception models to extract each generated object from the image, and pre-trained depth estimation models to lift the generated 2D image into 3D space. A remaining challenge is that a generated image comes with neither camera parameters nor a depth scale; we address this by "controlling" the diffusion model through hierarchical inpainting. Specifically, since ground-truth depth and camera parameters are available in simulation, we first render a photo-realistic image that contains only the background. We then inpaint the foreground into this image, so that the geometric cues in the background are passed to the inpainting model and constrain the camera parameters. This process effectively fixes the camera parameters and depth scale of the generated image, which makes it possible to back-project the 2D image into a 3D point cloud. We further extend the pipeline into a hierarchical and iterative inpainting process that successively generates placements of large furniture and small objects to enrich the scene.
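The back-projection step described above can be illustrated with a minimal sketch. It assumes a standard pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy` and a least-squares scale-and-shift fit of the estimated depth to the rendered background depth. The function names, the fitting procedure, and the camera convention are assumptions made for illustration; they are not taken from the Architect implementation.

```python
import numpy as np


def align_depth(est_depth, gt_depth, background_mask):
    """Fit scale s and shift t so that s * est_depth + t matches gt_depth
    on background pixels (hypothetical helper, least-squares assumption)."""
    est = est_depth[background_mask].ravel()
    gt = gt_depth[background_mask].ravel()
    # Solve [est, 1] @ [s, t]^T ~= gt in the least-squares sense.
    design = np.stack([est, np.ones_like(est)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, gt, rcond=None)
    return s * est_depth + t, s, t


def backproject(depth, fx, fy, cx, cy, mask=None):
    """Lift a depth map of shape (H, W) to an (N, 3) point cloud in the
    camera frame, assuming a pinhole model with z along the optical axis."""
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)
    valid = depth > 0
    if mask is not None:
        valid &= mask
    return points[valid]
```

In this sketch, `background_mask` marks the pixels rendered from the simulator, where depth is known, and the `mask` argument of `backproject` can be an object segmentation mask so that only the points of one inpainted object are lifted.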
@inproceedings{wangarchitect,
title={Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting},
author={Wang, Yian and Qiu, Xiaowen and Liu, Jiageng and Chen, Zhehuan and Cai, Jiting and Wang, Yufei and Wang, Tsun-Hsuan and Xian, Zhou and Gan, Chuang},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}
}