Our model was trained on the following datasets (~1.5M samples in total); the top and bottom 5% of each file were excluded from training and held out for validation/testing:
- skeleton_trainv8_flux_0.txt_new.txt # Flux-generated data
- skeleton_trainv8_flux_1.txt_new.txt # Flux-generated data
- skeleton_trainv8_pinterest.txt_new.txt # Web-crawled data
- skeleton_trainv8_reelshort.txt_new.txt # Vertical short-video data
- skeleton_trainv8_vcg_0.txt_new.txt # Web-crawled data
- skeleton_trainv8_vcg_1.txt_new.txt # Web-crawled data
- skeleton_trainv8_vcg_2.txt_new.txt # Web-crawled data
- skeleton_trainv8_vcg_3.txt_new.txt # Web-crawled data
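The hold-out scheme above could be sketched as follows. This is an illustrative sketch, not the project's actual code, and it assumes "top and bottom 5%" refers to position within each file; the project may rank samples by another criterion.

```python
def split_holdout(samples, frac=0.05):
    """Drop the first and last `frac` of a dataset file from training
    and keep them for validation/testing (illustrative helper)."""
    n = len(samples)
    k = int(n * frac)
    heldout = samples[:k] + samples[n - k:]   # top and bottom 5%
    train = samples[k:n - k]                  # middle 90% used for training
    return train, heldout

train, heldout = split_holdout(list(range(100)))
# train keeps the middle 90 samples; heldout gets the first and last 5
```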
We additionally incorporated skeleton data from the COYO open-source dataset, though these were not used in final model iterations due to project adjustments:
- coyo_two_people_0.txt
- coyo_two_people_1.txt
- coyo_two_people_2.txt
- coyo_two_people_3.txt
- coyo_two_people_4.txt
- coyo274w_0.txt
- coyo274w_1.txt
- coyo274w_2.txt
- coyo274w_3.txt
- coyo274w_4.txt
- coyo274w_5.txt
- coyo274w_6.txt
- coyo274w_7.txt
- coyo274w_8.txt
- coyo274w_9.txt
- Dataset labels were generated using complex prompts + GPT-4o, with multi-dimensional annotations (see script: tools/gen_caption.py)
- During training, descriptions from different dimensions are randomly combined to create a richer text distribution
- We also provide simple single-sentence prompts generated via gen_prompt_simple in tools/gen_caption.py
- These simple prompts were used to train our evaluation model: RHM2DGen_eval
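The random combination of per-dimension descriptions could look like the sketch below. This is an assumption-laden illustration, not the repo's actual code; the dimension names (`pose`, `clothing`, `background`) are hypothetical examples.

```python
import random

def combine_caption(annotations, rng=random):
    """Randomly pick a subset of caption dimensions and join their
    descriptions into one training prompt (illustrative sketch)."""
    dims = list(annotations)
    k = rng.randint(1, len(dims))     # how many dimensions to use
    chosen = rng.sample(dims, k)      # which dimensions, without repeats
    return " ".join(annotations[d] for d in chosen)

caption = combine_caption({
    "pose": "two people standing side by side",
    "clothing": "subject0 wears a red coat",
    "background": "a snowy street at dusk",
})
```

Resampling the combination every epoch exposes the model to many phrasings of the same image, which is the "richer text distribution" mentioned above.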
- Pipeline: Face/human detection → Region extraction using detection boxes + SAM (resolving overlaps in multi-person cases) → Skeleton extraction and subsequent labeling using SAM masks
- Note: to ensure GPT-4o recognizes character relationships and keeps descriptions consistent in multi-person scenes, we supply each person's SAM-segmented region to define character names (subject0/subject1), while keeping the whole image as context for the detail descriptions
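The pipeline above can be sketched as the following orchestration. Every function here is a placeholder stub standing in for the real component (a face/human detector, SAM, a pose estimator, GPT-4o); none of this is the repo's actual API.

```python
def detect_humans(image):            # stub for the face/human detector
    return [(0, 0, 50, 100), (60, 0, 110, 100)]   # two fake boxes

def segment_with_sam(image, box):    # stub for SAM; a real mask would be pixels
    return {"box": box}

def resolve_overlaps(masks):         # stub: real code disentangles overlapping
    return masks                     # regions in multi-person cases

def extract_skeleton(image, mask):   # stub for the pose estimator
    return []                        # real output: skeleton keypoints

def gpt4o_caption(image, regions):   # stub: each SAM region defines a name
    return ", ".join(f"subject{i}" for i in range(len(regions)))

def label_image(image):
    boxes = detect_humans(image)                          # 1. detection
    masks = [segment_with_sam(image, b) for b in boxes]   # 2. boxes + SAM
    masks = resolve_overlaps(masks)                       #    overlap handling
    skeletons = [extract_skeleton(image, m) for m in masks]  # 3. skeletons
    caption = gpt4o_caption(image, regions=masks)         # 4. region-named caption
    return skeletons, caption

skeletons, caption = label_image(None)
```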
We provide original images, JSON files, and test prompts for the evaluation sets:
- eval_single_1k
- eval_double_1k
- For image data requests (non-commercial use only), please email: wxktongji@163.com
- Download link: Baidu Drive (Code: 4ujq)
- Environment: environment.yml
- Training: python3 train.py
- Inference: python3 infer.py
- Download link: Baidu Drive (Code: rvgs)
- Code is located in RHM2DGen_eval/, developed on top of MDM's evaluation framework (modified for skeleton points)
- Environment: RHM2DGen_eval/environment.yml
- Prompt: tools/gen_caption.py, using simple prompts (gen_prompt_simple)
- Training script: RHM2DGen_eval/train.py
- Evaluation script: RHM2DGen_eval/eval.py
- Double-character eval model: https://pan.baidu.com/s/13AQyPMiAyv56-xVDBJ165A?pwd=y2wz (extraction code: y2wz)
- Single-character eval model: https://pan.baidu.com/s/1_FOAq0STq74r4UVOt8w2gw?pwd=dsuf (extraction code: dsuf)
A detailed technical report will be released later.
Primary contributors: Xuekuan Wang, Haoyu Yin, Haoyu Zheng, Yuqiu Huang, Keqiang Sun, Feng Qiu, Yunhao Shui, Junru Qiu

