LLM-Generated Dataset for Speech-Driven 3D Facial Animation Models with Text-Controlled Expressivity
This repository contains the implementation and datasets for generating synthetic training data for speech-driven 3D facial animation models, using Large Language Models (LLMs) with text-controlled expressivity.
├── dataframes/ # Processed emotion datasets
├── gen_data/ # Generated synthetic datasets
├── raw_data/ # Original emotion datasets
├── scripts/ # Main implementation scripts
│ ├── clip_module/ # CLIP-based model training
│ ├── dataset_generation/ # LLM-based data generation
│ └── evaluation/ # Model evaluation and visualization
├── environment.yml # Conda environment configuration
└── requirements.txt # Python dependencies
This project focuses on creating high-quality synthetic datasets for training speech-driven 3D facial animation models. The approach combines:
- Multi-source emotion datasets (GoEmotions, Tweet Intensity, ISEAR)
- LLM-generated facial descriptions using Llama 3.3 70B
- CLIP-based multimodal alignment between text and facial blendshapes
- Action Unit (AU) mapping based on FACS (Facial Action Coding System)
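The AU mapping step translates the FACS Action Units referenced in the LLM outputs into blendshape channels. The actual table used by the pipeline lives in this repository; the snippet below is only an illustrative sketch, assuming ARKit-style blendshape names and a handful of well-known AU correspondences:
# Illustrative sketch of a FACS AU -> blendshape mapping (not the repository's table).
AU_TO_BLENDSHAPES = {
    "AU1":  ["browInnerUp"],                         # inner brow raiser
    "AU4":  ["browDownLeft", "browDownRight"],       # brow lowerer
    "AU6":  ["cheekSquintLeft", "cheekSquintRight"], # cheek raiser
    "AU12": ["mouthSmileLeft", "mouthSmileRight"],   # lip corner puller
    "AU15": ["mouthFrownLeft", "mouthFrownRight"],   # lip corner depressor
    "AU26": ["jawOpen"],                             # jaw drop
}

def aus_to_blendshape_weights(active_aus, intensity=1.0):
    """Turn a list of active AUs into a sparse blendshape weight dict."""
    weights = {}
    for au in active_aus:
        for bs in AU_TO_BLENDSHAPES.get(au, []):
            weights[bs] = max(weights.get(bs, 0.0), intensity)
    return weights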
- CUDA-compatible GPU (recommended)
- Conda (recommended) or Python 3.9+ with pip
- Git LFS for model weights
- Clone the repository:
git clone https://github.com/AI-Unicamp/LLM-Generated-Dataset.git
cd LLM-Generated-Dataset
- Install Git LFS (required for model weights):
sudo apt install git-lfs   # Debian/Ubuntu
conda install git-lfs      # or via Conda
git lfs install            # initialize Git LFS hooks
git lfs pull
- Set up environment:
# Using Conda (recommended)
conda env create -f environment.yml
conda activate llm_generated_dataset
# Or using pip
pip install -r requirements.txt
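Since a CUDA-compatible GPU is recommended, it is worth checking that PyTorch (assumed to be among the installed dependencies) can see the GPU before running the training scripts:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"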
- GoEmotions: 58k Reddit comments labeled with 27 emotion categories plus neutral
- Tweet Intensity: Tweets annotated for emotion intensity (anger, fear, joy, sadness)
- ISEAR: International Survey on Emotion Antecedents and Reactions
- Final synthetic dataset: Text + emotions + descriptions + blendshapes
- LLM outputs: Llama 3.3 70B generated emotional descriptions and action units
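As an illustration of how the final synthetic dataset can be consumed (the file name and column names below are assumptions, not the repository's actual schema):
import pandas as pd

# Hypothetical path and column names -- check gen_data/ for the real files.
df = pd.read_csv("gen_data/final_dataset.csv")
print(df.columns)  # e.g. text, emotion, description, blendshapes (51 weights per row)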
The core training pipeline includes:
- BlendshapeEncoder: Encodes 51D blendshape vectors to latent space
- TextProjector: Projects CLIP text embeddings to shared latent space
- BlendshapeDecoder: Reconstructs blendshapes from latent representations
- ClipEncoderModule: Frozen CLIP model for text encoding
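The concrete architectures and layer sizes are defined in scripts/clip_module/. Purely as an illustrative sketch, assuming simple MLPs, a 512-dim CLIP text embedding (ViT-B/32), and a 51-dim blendshape input, the components could look like:
import torch.nn as nn

LATENT_DIM = 128  # assumed latent size, not taken from the repository

class BlendshapeEncoder(nn.Module):
    """51D blendshape vector -> shared latent space."""
    def __init__(self, in_dim=51, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x):
        return self.net(x)

class TextProjector(nn.Module):
    """CLIP text embedding -> shared latent space."""
    def __init__(self, clip_dim=512, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Linear(clip_dim, latent_dim)
    def forward(self, x):
        return self.net(x)

class BlendshapeDecoder(nn.Module):
    """Shared latent space -> reconstructed 51D blendshapes."""
    def __init__(self, latent_dim=LATENT_DIM, out_dim=51):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
    def forward(self, z):
        return self.net(z)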
# Model initialization (the actual class definitions live under scripts/clip_module/)
encoder = BlendshapeEncoder()
decoder = BlendshapeDecoder()
projector = TextProjector()
clip_encoder = ClipEncoderModule()
# Training with multimodal alignment
trainer = Trainer(
    encoder=encoder,
    decoder=decoder,
    projector=projector,
    clip_encoder=clip_encoder,
    dataset=dataset,
    batch_size=256,
    learning_rate=1e-5,
    epochs=100
)
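The exact training objective is implemented in scripts/clip_module/. Conceptually, multimodal alignment pulls the projected text embedding and the blendshape latent together while the decoder keeps the latent reconstructable; a minimal sketch of one such step, with assumed cosine alignment, L2 reconstruction, and loss weights:
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, projector, clip_encoder,
                  blendshapes, texts, align_weight=1.0, recon_weight=1.0):
    """One step of the assumed alignment + reconstruction objective."""
    z_bs = encoder(blendshapes)                 # blendshape latent
    with torch.no_grad():                       # CLIP stays frozen
        text_emb = clip_encoder(texts)
    z_txt = projector(text_emb)                 # projected text latent

    # Pull paired text / blendshape latents together in the shared space.
    align_loss = 1.0 - F.cosine_similarity(z_bs, z_txt, dim=-1).mean()

    # Keep the latent informative enough to rebuild the original blendshapes.
    recon_loss = F.mse_loss(decoder(z_bs), blendshapes)

    return align_weight * align_loss + recon_weight * recon_loss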
Generate emotion datasets from raw sources:
cd scripts/dataset_generation/
python gen_dataframe_goemo.py
python gen_dataframe_tweet.py
python gen_dataframe_isear.py
python gen_dataframe_final.py
Generate facial descriptions using Llama 3.3:
# Configure your HuggingFace token in get_token.py
python gen_dataset_llama33_4bit.py
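The script handles prompting and parsing of the LLM outputs; the 4-bit loading itself typically follows the standard transformers + bitsandbytes pattern sketched below (the checkpoint ID, prompt, and generation settings here are assumptions, and access to the gated model requires the HuggingFace token configured in get_token.py):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # assumed checkpoint ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

prompt = "Describe the facial expression of someone feeling intense joy."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))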
Train the CLIP-based alignment model:
cd scripts/clip_module/
python main.py
Generate t-SNE visualizations:
cd scripts/evaluation/
python tsne_plot.py
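tsne_plot.py produces the repository's figures. A minimal, generic version of the same idea, projecting latent embeddings to 2D with scikit-learn and coloring points by emotion label (input file names and shapes are assumptions), looks like:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed inputs: latent embeddings and integer emotion labels.
embeddings = np.load("embeddings.npy")   # shape (N, latent_dim)
labels = np.load("labels.npy")           # shape (N,)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of latent embeddings by emotion")
plt.savefig("tsne.png", dpi=200)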
If you use this code or dataset in your research, please cite:
TBD
For questions or collaboration opportunities, please reach out through:
- GitHub Issues
- Email: p243236@dac.unicamp.br
- Institution: AIMS-Unicamp