Here we show the demo videos of our LuciBot method.
We introduce LuciBot, a pipeline that utilizes a general-purpose video generation model to autonomously generate supervision signals for complex embodied tasks.
Automatically generating training supervision for embodied tasks is essential for collecting large-scale data in simulators.
While prior works utilize large language models (LLMs) to generate reward code or leverage vision-language models (VLMs) as supervision, these approaches are largely limited to simple tasks with well-defined rewards, such as pick-and-place. This limitation arises because LLMs struggle to describe complex shapes in code, and VLM-based rewards, such as those derived from CLIP, tend to be less precise.
To address these challenges, we propose leveraging the imagination capability of off-the-shelf video generation models. Given an initial simulation frame and a textual task description, a video generation model produces a video demonstrating task completion with correct semantics. We then extract rich supervisory signals from the generated video, including 6D pose sequences of objects, 2D segmentations, and estimated depth, to facilitate task learning within the simulation.
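To illustrate the extraction step, the sketch below lifts a segmented, depth-estimated frame of the generated video into a camera-frame point cloud and recovers a per-frame 6D pose by rigid alignment against the first frame. The interfaces (intrinsics K, object masks, metric depth maps, and pixel correspondences from tracking) are assumptions made for the example, not LuciBot's actual implementation.

```python
# Minimal sketch: masked depth -> point cloud -> per-frame 6D pose via rigid alignment.
# Masks, depth, and cross-frame correspondences are assumed to come from
# off-the-shelf segmentation/tracking and depth models; names are illustrative.
import numpy as np

def backproject(depth, mask, K):
    """Lift masked pixels of a depth map into a camera-frame point cloud (N, 3)."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def kabsch(src, dst):
    """Best-fit rigid transform (R, t) mapping src onto dst (points in correspondence)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def pose_sequence(points_per_frame):
    """6D pose of the object in every frame relative to the first frame,
    assuming the per-frame point clouds share the same pixel tracks."""
    return [kabsch(points_per_frame[0], pts) for pts in points_per_frame]
```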
Our approach significantly enhances supervision quality for complex embodied tasks, expanding the potential for large-scale training in simulators.
Given a scene configuration in the simulation and a textual task description, we generate a video based on the rendered image and text description, serving as an imagined execution process for completing the task. Supervisory signals are then extracted from this generated video to optimize an action trajectory for task execution.
LuciBot utilizes a general-purpose video generation model to generate semantically consistent reference videos and extracts rich supervisory signals from them to facilitate complex embodied tasks. Here we show the generated videos, including deformable, articulated, and rigid tasks.
We extract supervisory signals from the generated video and then optimize the motion trajectory of the actuator directly in the simulator; a sketch of this optimization step follows, and below it we show videos of the optimized actuator trajectories.
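A minimal sketch of how such an optimization could look, assuming a cross-entropy-style search over actuator waypoints and a cost that compares simulated object positions with those extracted from the generated video; `rollout_fn`, the cost, and the hyperparameters are placeholders rather than LuciBot's actual optimizer.

```python
# Cross-entropy-style trajectory optimization against the extracted reference.
# rollout_fn wraps the simulator and is supplied by the caller.
import numpy as np

def optimize_trajectory(rollout_fn, ref_positions, n_waypoints=10,
                        pop=64, elites=8, iters=20, seed=0):
    """rollout_fn(waypoints) -> simulated object positions, shape (T, 3).
    ref_positions: object positions extracted from the generated video, shape (T, 3)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((n_waypoints, 3))
    std = 0.05 * np.ones_like(mean)              # metres of actuator motion (assumed scale)
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, n_waypoints, 3))
        costs = []
        for w in samples:
            sim_positions = rollout_fn(w)        # roll the waypoints out in the simulator
            costs.append(np.mean(np.linalg.norm(sim_positions - ref_positions, axis=-1)))
        best = samples[np.argsort(costs)[:elites]]
        mean, std = best.mean(0), best.std(0) + 1e-4
    return mean                                   # optimized waypoint sequence
```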
Given the extracted 6-DoF signals, we sample grasping poses based on grasping affordance and employ inverse kinematics and path-planning algorithms to control the robot arm to grasp the actuator and follow the optimized trajectory; the sketch after this paragraph illustrates this execution stage, followed by demo videos of the robot manipulation.
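The sketch below shows the execution stage in simplified form: pick the highest-scoring reachable grasp, then chain IK solutions and planned joint paths along the optimized end-effector waypoints. `ik_fn`, `plan_fn`, and the grasp scoring are hypothetical placeholders for whatever grasp sampler, IK solver, and motion planner the simulator provides.

```python
# Illustrative glue for the execution stage; all interfaces here are assumed.
import numpy as np

def build_execution_plan(q_start, grasp_candidates, grasp_scores,
                         waypoints, ik_fn, plan_fn):
    """grasp_candidates: list of 4x4 end-effector grasp poses; grasp_scores: affordance scores.
    waypoints: sequence of 4x4 end-effector poses from the optimized trajectory.
    ik_fn(pose) -> joint configuration or None; plan_fn(q_from, q_to) -> joint path."""
    # 1. Pick the highest-affordance grasp that is kinematically reachable.
    order = np.argsort(grasp_scores)[::-1]
    q_grasp = next((q for i in order
                    if (q := ik_fn(grasp_candidates[i])) is not None), None)
    if q_grasp is None:
        raise RuntimeError("no reachable grasp among the sampled candidates")

    # 2. Plan an approach to the grasp, then track the optimized trajectory.
    path = list(plan_fn(q_start, q_grasp))
    q_prev = q_grasp
    for pose in waypoints:
        q = ik_fn(pose)
        if q is None:
            continue                              # skip waypoints with no IK solution
        path.extend(plan_fn(q_prev, q))
        q_prev = q
    return path                                   # joint-space path for the arm to follow
```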
Beyond rendered images, our method can also be applied to real-world scenarios. Specifically, we first set up a real-world scene, capture an image of it, and generate a video conditioned on the image and task description. We choose two challenging tasks, Stack cups and Pour water, and conduct real-world experiments. For stacking cups, the robot successfully stacks the white cup inside the red cup. The videos from left to right show the generated reference video, the simulation video, and the real-world manipulation video.
For pouring water, the robot successfully pours water from the cola can into the bowl. The videos from left to right show the generated reference video, the simulation video, and the real-world manipulation video.