ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Zhengwentai Sun1,2, Keru Zheng1, Chenghong Li1,2, Hongjie Liao1, Xihe Yang1, Heyuan Li1,
Yihao Zhi1,2, Shuliang Ning1, Shuguang Cui1,2, Xiaoguang Han†,1,2

1The Chinese University of Hong Kong, Shenzhen    2Future Network of Intelligence Institute, CUHK-Shenzhen

† Corresponding author

Pose- and Viewpoint-Controllable Human Synthesis

High-Quality Human Video Generation

Try the ReImagine interactive demo (Hugging Face Space): taited-if-human.hf.space

Abstract

Human video generation remains challenging, as it requires jointly modeling appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often treat these factors separately, leading to limited controllability or reduced visual quality.

We revisit this problem from an image-first perspective: high-quality appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency.

We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model.

Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis.

Viewpoint rotation

Our synthesis results are smooth and temporally consistent, even under large viewpoint changes.

In-the-wild appearance

Our method synthesizes humans with in-the-wild appearance and motion.

Qualitative results with in-the-wild reference appearance and SMPL-X motion

Compositional generation

We additionally train a compositional human image synthesis model on our canonical human dataset.

Compositional human synthesis: identity, clothing, and pose under varying views

Method

Our image-first pipeline splits the problem into two steps: (1) per-frame pose- and view-conditioned synthesis, then (2) training-free temporal alignment. Given canonical front/back appearance, an SMPL‑X motion sequence, and target camera views, we generate consistent frames and refine them into a coherent video.
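The decomposition can be summarized in a few lines. The sketch below is illustrative only; the function names, signatures, and argument layout are placeholders for the two modules described next, not the released API.

from typing import Callable, Sequence

def generate_human_video(
    canonical_appearance,        # canonical front/back reference images
    smplx_motion: Sequence,      # per-frame SMPL-X poses
    camera_views: Sequence,      # per-frame target viewpoints
    synthesize_frame: Callable,  # Step 1: pose- and view-guided image synthesis
    refine_video: Callable,      # Step 2: training-free temporal refinement
):
    # Step 1: generate every frame independently from appearance, pose, and view.
    frames = [
        synthesize_frame(canonical_appearance, pose, view)
        for pose, view in zip(smplx_motion, camera_views)
    ]
    # Step 2: align the frames into a coherent video at inference time only.
    return refine_video(frames)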

Pose- and View-Guided Image Synthesis Module

A pretrained image backbone (e.g., Flux/Kontext), fine-tuned with LoRA, renders each frame from SMPL‑X normal cues while keeping identity and clothing consistent across poses and viewpoints.

Pose- and View-Guided Image Synthesis module
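For intuition, the sketch below shows one common way such normal-map conditioning can be wired into a latent denoiser: the SMPL‑X normal-map latent is stacked with the noisy frame latent along the channel axis. It is a toy stand-in (module names, channel counts, and the concatenation scheme are assumptions), not the fine-tuned Flux/Kontext backbone itself.

import torch
import torch.nn as nn

class PoseViewConditionedDenoiser(nn.Module):
    """Toy denoiser conditioned on an SMPL-X normal-map latent (illustrative only)."""
    def __init__(self, latent_channels=4, hidden=64):
        super().__init__()
        # The noisy frame latent and the normal-map latent are stacked channel-wise.
        self.net = nn.Sequential(
            nn.Conv2d(2 * latent_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, 3, padding=1),
        )

    def forward(self, noisy_latent, normal_latent):
        # noisy_latent:  (B, C, H, W) latent of the frame being denoised
        # normal_latent: (B, C, H, W) latent of the SMPL-X normal render for the
        #                target pose and camera view
        return self.net(torch.cat([noisy_latent, normal_latent], dim=1))

# Toy usage with random tensors standing in for real latents.
denoiser = PoseViewConditionedDenoiser()
noise_pred = denoiser(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))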

Training-Free Temporal Consistency Module

A pretrained video diffusion model refines the frame sequence at inference time only: latent blending and feature propagation reduce flicker, stabilize motion, and preserve identity, without any additional temporal training.

Training-Free Temporal Consistency module
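The sketch below illustrates the latent-blending idea in its simplest form: at a denoising step, each frame's latent is mixed with its temporal neighbours. The window, weights, and schedule here are assumptions for illustration, not the exact refinement used in ReImagine.

import torch

def blend_latents(latents: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """Temporally smooth per-frame latents of shape (T, C, H, W) at one denoising step."""
    blended = latents.clone()
    # Mix each interior frame with the average of its two neighbours.
    blended[1:-1] = (1 - alpha) * latents[1:-1] + alpha * 0.5 * (latents[:-2] + latents[2:])
    return blended

# Toy usage: 16 frames of 4x64x64 latents.
smoothed = blend_latents(torch.randn(16, 4, 64, 64))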

Extended video materials

Longer walk-through with additional results.

BibTeX

If you find this work useful, please cite it using the BibTeX entry below.


@article{sun2025rethinking,
  title={ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
  author={Sun, Zhengwentai and Zheng, Keru and Li, Chenghong and Liao, Hongjie and Yang, Xihe and Li, Heyuan and Zhi, Yihao and Ning, Shuliang and Cui, Shuguang and Han, Xiaoguang},
  journal={arXiv preprint arXiv:2604.19720},
  year={2026},
  url={https://arxiv.org/abs/2604.19720v1}
}