ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Jiraphon Yenphraphai1,2 Ashkan Mirzaei1 Jianqi Chen3 Jiaxu Zou1
Sergey Tulyakov1 Raymond A. Yeh2 Peter Wonka1,3 Chaoyang Wang1
1 Snap 2 Purdue University 3 KAUST

[ArXiv]



ShapeGen4D generates high-quality mesh sequences from monocular input videos.

Abstract

We introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the input video. The framework builds three key components on top of large-scale pre-trained 3D models:

  1. A temporal attention mechanism that conditions generation on all input frames while producing a time-indexed dynamic representation.
  2. Time-aware point sampling and 4D latent anchoring, which promote temporally consistent geometry and texture.
  3. Noise sharing across frames to enhance temporal stability (sketched after this list).
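
To make the third component concrete, here is a minimal sketch, assuming a PyTorch implementation, of what sharing a single noise draw across frames can look like when initializing the sampler; the tensor sizes and variable names are illustrative assumptions, not the authors' code:

    import torch

    # Illustrative sizes (assumptions): T frames, N latent tokens, C channels.
    T, N, C = 16, 4096, 64

    # Draw one Gaussian noise sample and reuse it as the initial latent of
    # every frame, so all frames start the reverse process from the same point
    # and initialization no longer contributes frame-to-frame variance.
    shared_eps = torch.randn(1, N, C)
    x_init = shared_eps.expand(T, N, C).clone()  # (T, N, C), identical per frame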

Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, it improves robustness and perceptual fidelity and reduces failure modes compared with existing baselines.

Method


We present a flow-based latent diffusion model that generates mesh sequences capturing dynamic object motion, conditioned on a monocular video. Specifically, we extract temporally aligned latents by querying each frame at the same surface locations, and we introduce a spatiotemporal transformer that processes the full sequence of frames (a minimal sketch of one possible block follows).
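
As a reading aid, the following is a minimal sketch of one plausible spatiotemporal block, factored into spatial self-attention within each frame followed by temporal attention across frames at each token index; the class name, layer layout, and shapes are our assumptions, not the paper's exact architecture:

    import torch
    import torch.nn as nn

    class SpatioTemporalBlock(nn.Module):
        """Hypothetical block (assumption): spatial attention per frame,
        then temporal attention per token; a common way to factor 4D attention."""
        def __init__(self, dim, heads=8):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x):                # x: (T, N, C) = frames, tokens, channels
            h = self.norm1(x)
            x = x + self.spatial(h, h, h, need_weights=False)[0]   # attend within each frame
            h = self.norm2(x).transpose(0, 1)                      # (N, T, C): per-token time series
            x = x + self.temporal(h, h, h, need_weights=False)[0].transpose(0, 1)
            return x

    block = SpatioTemporalBlock(dim=64)
    y = block(torch.randn(8, 256, 64))       # (frames, tokens, channels) -> same shape

Factoring attention this way costs roughly O(T·N² + N·T²) rather than O((T·N)²) for joint attention over all frames and tokens; whether the paper uses factored or full spatiotemporal attention is not specified here, so treat this as one plausible instantiation.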

Comparisons

Our method produces high-quality meshes, maintains consistent poses, and exhibits substantially less temporal jitter than the baselines.


Citation


  @article{ShapeGen4D,
      title={ShapeGen4D: Towards High Quality 4D Shape Generation from Videos},
      author={Jiraphon Yenphraphai and Ashkan Mirzaei and Jianqi Chen and Jiaxu Zou
          and Sergey Tulyakov and Raymond A. Yeh and Peter Wonka and Chaoyang Wang},
      year={2025},
      journal={arXiv preprint},
  }