We want to build a model that translates continuous American Sign Language (ASL) signing into English text. For this use case, we conducted experiments on fine-tuning large vision-language models, specifically:
- LLaVA-NeXT-Video
- Video-LLaVA
We use the How2Sign dataset, which consists of ASL video footage aligned with English sentences. It includes RGB videos, green-screen frontal and side views, and 3D keypoints (hand, body, face). To manage computational constraints, we focused on the RGB frontal-view video data for fine-tuning.
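Conceptually, each training example is a (video clip, English sentence) pair. The snippet below is a minimal sketch of loading such pairs from the cleaned CSV; the file path and the column names (`video_id`, `sentence`) are assumptions for illustration and may differ from the actual files in `data`.

```python
import pandas as pd

# Minimal sketch: pair each frontal-view RGB clip with its aligned English
# sentence. The CSV path and column names below are assumptions.
df = pd.read_csv("data/valid_clips.csv")

samples = [
    {"video": f"clips/{row.video_id}.mp4", "text": row.sentence}
    for row in df.itertuples()
]
print(f"{len(samples)} clip/sentence pairs; first sample: {samples[0]}")
```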
The repository is organized as follows:
- `data`: Contains the cleaned CSV file used as the source dataset.
- `data_profiling`: Includes the code for data cleaning.
- `llava-next-video`: Scripts for fine-tuning the LLaVA-NeXT-Video model on the How2Sign dataset, along with quantitative analysis of the trained model.
- `video-llava`: Scripts for fine-tuning the Video-LLaVA model on the How2Sign dataset, as well as inference scripts for the trained model.
pip install -r requirements.txt
Navigate to the `huggingface_trainer` directory within `llava-next-video` and execute the following command:
cd llava-next-video/huggingface_trainer
sbatch train.sh
We used Slurm to submit the training jobs.
- A `logs` folder will be created to store training logs.
- An `output` directory will be generated to store checkpoints from training.
- A `generated_texts.csv` file will be created for validation purposes.
The `generated_texts.csv` file contains the following columns:
- `id`: Incremental ID for each data item.
- `video_id`: Unique identifier for the video clip, also present in `valid_clips.csv`.
- `generated`: The text generated by the model for the specific clip.
- `true`: The expected (reference) text for the specific clip.
- `epoch`: The epoch at which the evaluation occurred.
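For a quick qualitative check, the per-epoch outputs can be inspected with pandas. This is a minimal sketch and assumes `generated_texts.csv` sits in the current working directory.

```python
import pandas as pd

# Minimal sketch: compare generated vs. reference sentences from the most
# recent epoch recorded in generated_texts.csv.
df = pd.read_csv("generated_texts.csv")
last_epoch = df["epoch"].max()

for _, row in df[df["epoch"] == last_epoch].head(5).iterrows():
    print(f"[{row['video_id']}]")
    print(f"  generated: {row['generated']}")
    print(f"  true:      {row['true']}")
```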
Run the evaluation script to calculate validation scores:
cd llava-next-video/huggingface_trainer
python llava_next_video_eval.py
- A `validation_scores.csv` file will be generated containing the following metrics after every epoch:
  - ROUGE-1
  - ROUGE-2
  - ROUGE-L
  - BLEU
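For reference, the sketch below shows how such per-epoch corpus scores could be recomputed from `generated_texts.csv` with the Hugging Face `evaluate` library; it illustrates the metrics and is not necessarily the exact logic of `llava_next_video_eval.py`.

```python
import evaluate
import pandas as pd

# Minimal sketch: recompute per-epoch ROUGE/BLEU from generated_texts.csv.
df = pd.read_csv("generated_texts.csv")
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

for epoch, group in df.groupby("epoch"):
    preds = group["generated"].astype(str).tolist()
    refs = group["true"].astype(str).tolist()
    rouge_scores = rouge.compute(predictions=preds, references=refs)
    bleu_scores = bleu.compute(predictions=preds, references=[[r] for r in refs])
    print({
        "epoch": epoch,
        "ROUGE-1": rouge_scores["rouge1"],
        "ROUGE-2": rouge_scores["rouge2"],
        "ROUGE-L": rouge_scores["rougeL"],
        "BLEU": bleu_scores["bleu"],
    })
```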
Navigate to the `video-llava` directory and execute the following command:
cd video-llava
sbatch train.sh
Inference for the Video-LLaVA model can be performed using the Jupyter notebook located at `video-llava/inference.ipynb`.
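If you prefer a script over the notebook, the sketch below shows one way to run Video-LLaVA inference with the `transformers` classes. The base checkpoint name, prompt wording, and clip path are assumptions; in practice you would load the fine-tuned weights from `output` instead of the base model.

```python
import av
import numpy as np
import torch
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Assumed base checkpoint; swap in the fine-tuned weights from output/ as needed.
MODEL_ID = "LanguageBind/Video-LLaVA-7B-hf"

def read_frames(path, num_frames=8):
    """Uniformly sample RGB frames from a clip (assumes it has >= num_frames frames)."""
    container = av.open(path)
    total = container.streams.video[0].frames
    wanted = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = [
        frame.to_ndarray(format="rgb24")
        for i, frame in enumerate(container.decode(video=0))
        if i in wanted
    ]
    return np.stack(frames)

processor = VideoLlavaProcessor.from_pretrained(MODEL_ID)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

clip = read_frames("example_clip.mp4")  # hypothetical clip path
prompt = "USER: <video>\nTranslate the ASL signing in this video into English. ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```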