- Haoyu Zhen 1*
- Qiao Sun 1*
- Hongxin Zhang 1
- Junyan Li 1
- Siyuan Zhou 2
- Yilun Du 3
- Chuang Gan 1
Now, let's enter the TesserAct
We are excited to introduce TesserAct: 4D Embodied World Models, a novel approach to learning 4D world models that predict how 3D scenes evolve over time in response to an input instruction, ensuring both spatial and temporal consistency. We achieve this by extending a video generation model with depth and normal channels, so that it takes an input image and a text instruction and generates RGB, depth, and normal videos. We then reconstruct the 4D scene from the generated RGB, depth, and normal videos with an efficient algorithm. Finally, we predict actions based on the reconstructed 4D scene.
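To make the three stages concrete, here is a minimal conceptual sketch of the pipeline. The function names are hypothetical placeholders for illustration only, not the actual API in our repository.

```python
# Conceptual sketch of the TesserAct inference pipeline described above.
# All function names (generate_rgbdn, reconstruct_4d, predict_actions) are
# illustrative placeholders, not the actual repository API.

def generate_rgbdn(image, instruction):
    """Extended video generation model: produces RGB, depth, and normal
    videos conditioned on an input image and a text instruction."""
    raise NotImplementedError("placeholder for the RGB-DN video model")

def reconstruct_4d(rgb_video, depth_video, normal_video):
    """Fuse the generated RGB, depth, and normal videos into a temporally
    consistent 4D (3D + time) scene."""
    raise NotImplementedError("placeholder for the 4D reconstruction step")

def predict_actions(scene_4d, instruction):
    """Predict robot actions from the reconstructed 4D scene."""
    raise NotImplementedError("placeholder for the action prediction step")

def tesseract_pipeline(image, instruction):
    rgb, depth, normal = generate_rgbdn(image, instruction)
    scene_4d = reconstruct_4d(rgb, depth, normal)
    return predict_actions(scene_4d, instruction)
```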
The key features of TesserAct include:
- 🎥 RGB-DN Learning: Train with RGB, Depth, and Normal data for richer 4D scene generation and understanding.
- 📊 Extended Datasets: Enhance robotic datasets with depth and normal data.
- 🌌 4D Scene Generation: Create high-quality, temporally coherent scenes.
- 🤖 Better Policy Learning: Boost agent performance with advanced 4D information.
Gallery
Below, we showcase a selection of videos generated by TesserAct. All current samples are produced using our experimental model. These videos are created from a variety of input images and prompts. Even though TesserAct has never been trained on data such as Van Gogh paintings or Ghibli-style films, it is still able to generate impressive results. We will continue to update the gallery to explore the boundaries of TesserAct.
Action-Rich Generations
Our model can generate action-rich videos. The results below show videos generated from the same input image with different prompts.
Multi-Embodiment
Our model supports conditioning on multiple robot arms: Franka Emika Panda, Google Robot, and Trossen WidowX 250. These results also imply that our model has the potential to generalize to more diverse robot arms. Below, we show two cases of the coke-can pick-up task in two different styles, Ghibli and real. Both cases are generated with the same task instruction, "pick up coke can", but conditioned on different robot arms.
Generation Diversity
Given the same input, TesserAct can generate diverse manipulation paths, covering a wide range of possible actions. Here, we show videos generated with different random seeds during inference.
Limitations & Failure Cases ⚠️
World models are not perfect, and TesserAct is no exception. The generated videos still suffer from several limitations, including visual inconsistencies (e.g., object disappearance), incorrect functional understanding (e.g., wrong affordances), and limited generalization to unseen objects. For more details, please refer to the Usage Documentation in our GitHub repository.
For most seen tasks (e.g., pick up the cup) and unseen environments (e.g., different backgrounds), TesserAct can generate reasonable results. To obtain a high-quality video, we recommend running inference with around 5 random seeds and selecting the best result.
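As a rough illustration of this seed sweep, the sketch below assumes a hypothetical `pipeline` callable standing in for the actual inference entry point, which is documented in the repository.

```python
# Hypothetical sketch: sample several candidate videos with different random
# seeds, then inspect them and keep the best one. `pipeline` stands in for
# the actual TesserAct inference entry point.
import torch

def sample_candidates(pipeline, image, instruction, num_seeds=5):
    candidates = []
    for seed in range(num_seeds):
        generator = torch.Generator().manual_seed(seed)
        video = pipeline(image=image, prompt=instruction, generator=generator)
        candidates.append((seed, video))
    # Return all candidates so the best video can be selected manually.
    return candidates
```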