We present HuGDiffusion, a generalizable 3D Gaussian splatting (3DGS) learning pipeline for novel view synthesis (NVS) of human characters from single-view input images. Existing approaches typically require monocular videos or calibrated multi-view images as inputs, which limits their applicability in real-world scenarios with arbitrary and/or unknown camera poses. In this paper, we aim to generate the set of 3DGS attributes via a diffusion-based framework conditioned on human priors extracted from a single image. Specifically, we begin with a carefully integrated human-centric feature extraction procedure to derive informative conditioning signals. Based on our empirical observation that jointly learning all 3DGS attributes is difficult to optimize, we design a multi-stage generation strategy that produces different types of 3DGS attributes in separate stages. To facilitate the training process, we construct proxy ground-truth 3D Gaussian attributes as high-quality attribute-level supervision signals. Extensive experiments show that HuGDiffusion achieves significant performance improvements over state-of-the-art methods.
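To make the described pipeline concrete, the following is a minimal sketch of a multi-stage conditional diffusion over per-Gaussian attribute groups, trained against proxy ground-truth attributes with an epsilon-prediction objective. All module names, dimensions, and the geometry/appearance attribute split are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: multi-stage conditional diffusion over 3DGS attributes.
# The attribute split and all hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn

N_GAUSSIANS = 4096          # number of Gaussians per human character (assumed)
GEOM_DIM = 3                # stage 1: per-Gaussian position offsets
APP_DIM = 3 + 4 + 1 + 3     # stage 2: scale, rotation (quaternion), opacity, SH color
COND_DIM = 256              # dimensionality of the human-prior conditioning feature


class AttributeDenoiser(nn.Module):
    """Minimal per-Gaussian denoiser: predicts the noise added to one attribute
    group, conditioned on image-derived human-prior features."""

    def __init__(self, attr_dim, cond_dim=COND_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, attr_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t: (B, N, attr_dim) noisy attributes; cond: (B, cond_dim); t: (B,)
        B, N, _ = x_t.shape
        cond = cond[:, None, :].expand(B, N, -1)
        t_emb = t.float()[:, None, None].expand(B, N, 1) / 1000.0
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))


def diffusion_loss(denoiser, x0, cond, alphas_cumprod):
    """Standard DDPM epsilon-prediction loss against proxy ground-truth
    attributes x0 (the attribute-level supervision described in the abstract)."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(denoiser(x_t, cond, t), noise)


if __name__ == "__main__":
    betas = torch.linspace(1e-4, 2e-2, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    geom_denoiser = AttributeDenoiser(GEOM_DIM)      # stage 1: geometry attributes
    app_denoiser = AttributeDenoiser(APP_DIM)        # stage 2: appearance attributes

    cond = torch.randn(2, COND_DIM)                  # human-prior features (placeholder)
    gt_geom = torch.randn(2, N_GAUSSIANS, GEOM_DIM)  # proxy GT geometry attributes
    gt_app = torch.randn(2, N_GAUSSIANS, APP_DIM)    # proxy GT appearance attributes

    # Attribute groups are supervised in separate stages rather than jointly.
    loss = (diffusion_loss(geom_denoiser, gt_geom, cond, alphas_cumprod)
            + diffusion_loss(app_denoiser, gt_app, cond, alphas_cumprod))
    loss.backward()
    print(float(loss))
```

In this sketch, splitting the attributes into separately denoised groups mirrors the multi-stage strategy motivated by the difficulty of jointly optimizing all 3DGS attributes; the actual staging, conditioning, and supervision details follow the paper.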