ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Zhengwentai Sun1,2, Keru Zheng1, Chenghong Li1,2, Hongjie Liao1, Xihe Yang1, Heyuan Li1,
Yihao Zhi1,2, Shuliang Ning1, Shuguang Cui1,2, Xiaoguang Han†,1,2

1The Chinese University of Hong Kong, Shenzhen    2Future Network of Intelligence Institute, CUHK-Shenzhen

† Corresponding author

Pose- and Viewpoint-Controllable Human Synthesis

High-Quality Human Video Generation

Try the ReImagine interactive demo (Hugging Face Space): taited-if-human.hf.space

Abstract

Human video generation remains challenging, as it requires jointly modeling appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often treat these factors separately, leading to limited controllability or reduced visual quality.

We revisit this problem from an image-first perspective: high-quality appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency.

We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model.

Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis.

Viewpoint rotation

Our synthesis results are smooth and temporally consistent, even under large viewpoint changes.

In-the-wild appearance

Our method synthesizes humans with in-the-wild appearance and motion.

Qualitative results with in-the-wild reference appearance and SMPL-X motion

Compositional generation

We additionally train a compositional human image synthesis model on our canonical human dataset.

Compositional human synthesis: identity, clothing, and pose under varying views

Method

Our image-first pipeline splits the problem into two steps: (1) per-frame pose- and view-conditioned synthesis, then (2) training-free temporal alignment. Given canonical front/back appearance, an SMPL‑X motion sequence, and target camera views, we generate consistent frames and refine them into a coherent video.
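The decomposition can be summarized in a few lines. The sketch below is illustrative only; the function names, signatures, and argument layout are placeholders for the two modules described next, not the released API.

from typing import Callable, Sequence

def generate_human_video(
    canonical_appearance,        # canonical front/back reference images
    smplx_motion: Sequence,      # per-frame SMPL-X poses
    camera_views: Sequence,      # per-frame target viewpoints
    synthesize_frame: Callable,  # Step 1: pose- and view-guided image synthesis
    refine_video: Callable,      # Step 2: training-free temporal refinement
):
    # Step 1: generate every frame independently from appearance, pose, and view.
    frames = [
        synthesize_frame(canonical_appearance, pose, view)
        for pose, view in zip(smplx_motion, camera_views)
    ]
    # Step 2: align the frames into a coherent video at inference time only.
    return refine_video(frames)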

Pose- and View-Guided Image Synthesis Module

A pretrained image backbone (e.g., Flux/Kontext), fine-tuned with LoRA, renders each frame from SMPL‑X normal cues while keeping identity and clothing consistent across poses and viewpoints.

Pose- and View-Guided Image Synthesis module
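For intuition, the sketch below shows one common way such normal-map conditioning can be wired into a latent denoiser: the SMPL‑X normal-map latent is stacked with the noisy frame latent along the channel axis. It is a toy stand-in (module names, channel counts, and the concatenation scheme are assumptions), not the fine-tuned Flux/Kontext backbone itself.

import torch
import torch.nn as nn

class PoseViewConditionedDenoiser(nn.Module):
    """Toy denoiser conditioned on an SMPL-X normal-map latent (illustrative only)."""
    def __init__(self, latent_channels=4, hidden=64):
        super().__init__()
        # The noisy frame latent and the normal-map latent are stacked channel-wise.
        self.net = nn.Sequential(
            nn.Conv2d(2 * latent_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, 3, padding=1),
        )

    def forward(self, noisy_latent, normal_latent):
        # noisy_latent:  (B, C, H, W) latent of the frame being denoised
        # normal_latent: (B, C, H, W) latent of the SMPL-X normal render for the
        #                target pose and camera view
        return self.net(torch.cat([noisy_latent, normal_latent], dim=1))

# Toy usage with random tensors standing in for real latents.
denoiser = PoseViewConditionedDenoiser()
noise_pred = denoiser(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))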

Training-Free Temporal Consistency Module

A pretrained video diffusion model refines the frame sequence at inference time only: latent blending and feature propagation reduce flicker, stabilize motion, and preserve identity, without any additional temporal training.

Training-Free Temporal Consistency module
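The sketch below illustrates the latent-blending idea in its simplest form: at a denoising step, each frame's latent is mixed with its temporal neighbours. The window, weights, and schedule here are assumptions for illustration, not the exact refinement used in ReImagine.

import torch

def blend_latents(latents: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """Temporally smooth per-frame latents of shape (T, C, H, W) at one denoising step."""
    blended = latents.clone()
    # Mix each interior frame with the average of its two neighbours.
    blended[1:-1] = (1 - alpha) * latents[1:-1] + alpha * 0.5 * (latents[:-2] + latents[2:])
    return blended

# Toy usage: 16 frames of 4x64x64 latents.
smoothed = blend_latents(torch.randn(16, 4, 64, 64))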

Extended video materials

Longer walk-through with additional results.

BibTeX

If you find this work useful, please cite it using the BibTeX entry below.


@article{sun2025rethinking,
  title={ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
  author={Sun, Zhengwentai and Zheng, Keru and Li, Chenghong and Liao, Hongjie and Yang, Xihe and Li, Heyuan and Zhi, Yihao and Ning, Shuliang and Cui, Shuguang and Han, Xiaoguang},
  journal={arXiv preprint arXiv:2604.19720},
  year={2026},
  url={https://arxiv.org/abs/2604.19720v1}
}