- Haoyu Zhen 1*
- Qiao Sun 1*
- Hongxin Zhang 1
- Junyan Li 1
- Siyuan Zhou 2
- Yilun Du 3
- Chuang Gan 1
Now, let's enter the TesserAct
We are excited to introduce TesserAct: 4D Embodied World Models, a novel approach to learning 4D world models that predict how 3D scenes evolve over time in response to an input instruction, ensuring both spatial and temporal consistency. We achieve this by extending a video generation model with depth and normal channels, so that it takes an input image and a text instruction and generates RGB, depth, and normal videos. We then reconstruct the 4D scene from the generated RGB, depth, and normal videos with an efficient algorithm. Finally, we predict actions based on the reconstructed 4D scene.
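To make the three stages concrete, here is a minimal conceptual sketch of the pipeline. The function names are hypothetical placeholders for illustration only, not the actual API in our repository.

```python
# Conceptual sketch of the TesserAct inference pipeline described above.
# All function names (generate_rgbdn, reconstruct_4d, predict_actions) are
# illustrative placeholders, not the actual repository API.

def generate_rgbdn(image, instruction):
    """Extended video generation model: produces RGB, depth, and normal
    videos conditioned on an input image and a text instruction."""
    raise NotImplementedError("placeholder for the RGB-DN video model")

def reconstruct_4d(rgb_video, depth_video, normal_video):
    """Fuse the generated RGB, depth, and normal videos into a temporally
    consistent 4D (3D + time) scene."""
    raise NotImplementedError("placeholder for the 4D reconstruction step")

def predict_actions(scene_4d, instruction):
    """Predict robot actions from the reconstructed 4D scene."""
    raise NotImplementedError("placeholder for the action prediction step")

def tesseract_pipeline(image, instruction):
    rgb, depth, normal = generate_rgbdn(image, instruction)
    scene_4d = reconstruct_4d(rgb, depth, normal)
    return predict_actions(scene_4d, instruction)
```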
The key features of TesserAct include:
- 🎥 RGB-DN Learning: Train with RGB, Depth, and Normal data for richer 4D scene generation and understanding.
- 📊 Extended Datasets: Enhance robotic datasets with depth and normal data.
- 🌌 4D Scene Generation: Create high-quality, temporally coherent scenes.
- 🤖 Better Policy Learning: Boost agent performance with advanced 4D information.
Gallery
Below, we showcase a selection of videos generated by TesserAct. All current samples are produced using our experimental model. These videos are created from a variety of input images and prompts. Even though TesserAct has never been trained on data such as Van Gogh paintings or Ghibli-style films, it is still able to generate impressive results. We will continue to update the gallery to explore the boundaries of TesserAct.
Action-Rich Generations
Our model can generate action-rich videos. The results below show videos generated from the same input image with different prompts.
Multi-Embodiment
Our model supports conditioning on multiple robot arms: Franka Emika Panda, Google Robot, and Trossen WidowX 250. These results also imply that our model has the potential to generalize to more diverse robot arms. Below, we show two cases of the coke-can pick-up task in two different styles, Ghibli and real. Both cases are generated with the same task instruction, "pick up coke can", but conditioned on different robot arms.
Generation Diversity
Given the same input, TesserAct can generate diverse manipulation paths, covering a wide range of possible actions. Here, we show videos generated with different random seeds during inference.
Limitations & Failure Cases ⚠️
World models are not perfect, and TesserAct is no exception. The generated videos still suffer from several limitations, including visual inconsistencies (e.g., object disappearance), incorrect functional understanding (e.g., wrong affordances), and limited generalization to unseen objects. For more details, please refer to the Usage Documentation in our GitHub repository.
For most seen tasks (e.g., pick up the cup) and unseen environments (e.g., different backgrounds), TesserAct can generate reasonable results. To obtain a high-quality video, we recommend running inference with around 5 random seeds and selecting the best result.
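As a rough illustration of this seed sweep, the sketch below assumes a hypothetical `pipeline` callable standing in for the actual inference entry point, which is documented in the repository.

```python
# Hypothetical sketch: sample several candidate videos with different random
# seeds, then inspect them and keep the best one. `pipeline` stands in for
# the actual TesserAct inference entry point.
import torch

def sample_candidates(pipeline, image, instruction, num_seeds=5):
    candidates = []
    for seed in range(num_seeds):
        generator = torch.Generator().manual_seed(seed)
        video = pipeline(image=image, prompt=instruction, generator=generator)
        candidates.append((seed, video))
    # Return all candidates so the best video can be selected manually.
    return candidates
```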