Creating large-scale interactive 3D environments is essential for Robotics and Embodied AI research, yet remains challenging. Since pre-trained 2D image generative models capture scene and object configurations better than LLMs, we address this challenge by introducing Architect, a generative framework that creates complex and realistic 3D embodied environments by leveraging diffusion-based 2D image inpainting. We use foundation visual perception models to extract each generated object from the image, and pre-trained depth estimation models to lift the generated 2D image into 3D space. A remaining difficulty is that the camera parameters and the scale of the depth are absent from the generated image; we address this by "controlling" the diffusion model through hierarchical inpainting. Specifically, with access to ground-truth depth and camera parameters in simulation, we first render a photo-realistic image containing only the background. We then inpaint the foreground of this image, so that the geometric cues of the background are passed to the inpainting model and implicitly convey the camera parameters. This process effectively fixes the camera parameters and depth scale of the generated image, facilitating back-projection from the 2D image to a 3D point cloud. The pipeline further extends to a hierarchical, iterative inpainting process that continuously generates placements of large furniture and small objects to enrich the scene.
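To make the back-projection step concrete, below is a minimal sketch (not the authors' released code; the function and variable names are our own) of lifting an image with per-pixel metric depth into a world-frame point cloud, assuming a standard pinhole camera with known intrinsics K and camera-to-world pose T, both of which are available here because the background is rendered in simulation.

import numpy as np

def backproject_to_world(depth, K, T_cam2world):
    # depth: (H, W) metric depth; K: (3, 3) intrinsics; T_cam2world: (4, 4) pose.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))              # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T                             # unit-depth camera rays
    pts_cam = rays * depth[..., None]                           # scale rays by depth
    ones = np.ones((H, W, 1))
    pts_world = np.concatenate([pts_cam, ones], -1).reshape(-1, 4) @ T_cam2world.T
    return pts_world[:, :3]                                     # (H*W, 3) world-frame points

# Toy example: a 4x4 image at a constant 2 m depth with an identity camera pose.
K = np.array([[200.0, 0.0, 2.0], [0.0, 200.0, 2.0], [0.0, 0.0, 1.0]])
points = backproject_to_world(np.full((4, 4), 2.0), K, np.eye(4))
print(points.shape)  # (16, 3)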
Our pipeline begins by rendering a photo-realistic image of a simulated empty scene containing only the static background (walls and floors). We then use this image as a template for inpainting the foreground with 2D diffusion models, where the inpainting mask covers the majority of the image. Crucially, at this stage we have access to the ground-truth depth and camera parameters. Next, we apply visual recognition models to segment the 2D image, grounding the semantics and geometry of each generated object. We then run a depth estimation model on the inpainted image and, to improve its accuracy, align the predicted depth on the unmasked regions with the ground-truth depth obtained from simulation. Finally, we lift the aligned depth into 3D space and place large furniture according to each object's 3D bounding box, either retrieving assets from large-scale asset databases or generating them with image-to-3D generative models.
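The depth-alignment and object-placement steps can be sketched as follows; this is an illustrative implementation under our own assumptions (a global least-squares scale-and-shift fit on the unmasked background pixels, and axis-aligned bounding boxes taken from segmentation masks), not necessarily the exact procedure used in the paper.

import numpy as np

def align_depth(pred_depth, gt_depth, unmasked):
    # Fit a scale s and shift t so that s * pred + t matches the simulator's
    # ground-truth depth on the unmasked (background) pixels.
    p = pred_depth[unmasked].ravel()
    g = gt_depth[unmasked].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred_depth + t                                   # metric-scale depth map

def object_bbox(points_world, obj_mask):
    # Axis-aligned 3D bounding box of the lifted points covered by one object's
    # segmentation mask; this box then guides where the asset is placed.
    pts = points_world.reshape(*obj_mask.shape, 3)[obj_mask]
    return pts.min(axis=0), pts.max(axis=0)

In this sketch, the aligned depth from align_depth feeds the back-projection shown earlier, and each object's bounding box determines the position and scale of the retrieved or generated 3D asset.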
Since all large and small objects are placed in the scene as separate assets, they are naturally interactive when powered by a physics engine, opening up opportunities to collect large-scale robot manipulation data in complex scenarios.
@inproceedings{wangarchitect,
  title={Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting},
  author={Wang, Yian and Qiu, Xiaowen and Liu, Jiageng and Chen, Zhehuan and Cai, Jiting and Wang, Yufei and Wang, Tsun-Hsuan and Xian, Zhou and Gan, Chuang},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}
}