- Introduction
- Citation
- OpenAQA (LTU) and OpenASQA (LTU-AS) Dataset
- Set the Virtual Environment
- Inference
- Finetune LTU and LTU-AS
- Reproduce LTU and LTU-AS Training
- Pretrained Models
- Contact
This repository contains the official implementation (in PyTorch), pretrained checkpoints, and datasets of LTU and LTU-AS. LTU and LTU-AS are the first generation of audio and speech large language models that bridge audio/speech perception with understanding. They not only achieve SOTA on multiple closed-ended audio and speech tasks, but can also answer arbitrary open-ended questions about the given audio. Please try the interactive demos to see how well they work!
LTU-AS (Second Generation, Supports Speech and Audio):
LTU-AS was accepted at ASRU 2023. See you in Taipei!
[Paper] [HuggingFace Space] [ASRU Peer Review] [Compare LTU-1 and LTU-AS]
Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)
@inproceedings{gong_ltuas,
title={Joint Audio and Speech Understanding},
author={Gong, Yuan and Liu, Alexander H and Luo, Hongyin and Karlinsky, Leonid and Glass, James},
year={2023},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
}
LTU (First Generation, Only Supports Audio):
Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)
@article{gong2023listen,
title={Listen, Think, and Understand},
author={Gong, Yuan and Luo, Hongyin and Liu, Alexander H and Karlinsky, Leonid and Glass, James},
journal={arXiv preprint arXiv:2305.10790},
year={2023}
}
We release the training data for LTU (OpenAQA) and LTU-AS (OpenASQA). Specifically, we release (`question`, `answer`, `audio_id`) tuples.
The actual audio wav files come from existing public datasets and need to be downloaded by the users.
We provide the full dataset (including all AQAs) as well as breakdowns (closed-ended and open-ended subsets, subsets of each original dataset, etc.). All links are hosted on Dropbox and support `wget`.
- Toy Set (Contains Raw Audio Files, for Testing Purposes Only):
- OpenAQA Training (Only Audio Datasets, 5.6M AQAs in Total):
  - Full Dataset (2.3GB): [Download]
  - Breakdown Subsets: [Download]
- LTU Evaluation Data: [Download]
- Toy Set (Contains Raw Audio Files, for Testing Purposes Only):
  - For LTU-AS: [Meta] [Audio and Whisper Feature]
- OpenASQA Training (Audio and Speech Datasets, 10.2M AQAs in Total):
  - Full Dataset (4.6GB): [Download]
  - Breakdown Subsets: [Download]
- LTU-AS Evaluation Data: [Download]
When preparing audio files, please make sure all audio files use the same sampling rate of 16kHz.
The dataset is a JSON file containing a list of dicts, in the following format:
[
{
"instruction": "What is the significance of the sound of crying in this audio clip?", % the question
"input": "I am so sad...", % the speech content
"audio_id": "/data/sls/audioset/dave_version/audio/LZq4Neh-oWU.flac", % the audio id
"dataset": "as_strong_train", % the original dataset
"task": "open-ended question", % question type
"output": "The sound of crying suggests that there is a sad or emotional situation happening in the audio clip." % the answer
},
...
]
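If you build a JSON file in this format for your own data, a minimal sanity check like the sketch below can catch missing fields and wrong sampling rates before training. This is only an illustrative snippet, not part of the official pipeline; the file name my_data.json is hypothetical, and it assumes torchaudio is available in your venv.

import json
import torchaudio  # used here only to read the sampling rate; any audio library works

REQUIRED_KEYS = {"instruction", "input", "audio_id", "dataset", "task", "output"}

# "my_data.json" is a hypothetical file in the list-of-dict format shown above
with open("my_data.json") as f:
    data = json.load(f)

for entry in data:
    missing = REQUIRED_KEYS - entry.keys()
    assert not missing, f"{entry.get('audio_id')} is missing keys: {missing}"
    # audio_id is the path to the audio file; LTU/LTU-AS expect 16kHz audio
    info = torchaudio.info(entry["audio_id"])
    assert info.sample_rate == 16000, f"{entry['audio_id']} is {info.sample_rate} Hz, expected 16000 Hz"

print(f"Checked {len(data)} AQA entries.")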
For almost all usages, you will need to set up a virtual environment.
Note that LTU and LTU-AS need different environments: their `hf-dev` and `peft-main` are different, so please do not mix the venvs of LTU and LTU-AS.
Clone or download this repository as `ltu-main`, then:
For LTU:
cd /ltu-main/src/ltu
conda create --name venv_ltu python=3.10
conda activate venv_ltu
pip install -r requirements.txt
# install the customized huggingface transformers; the original transformers won't work
pip install -e hf-dev/transformers-main
# install the customized huggingface peft; the original peft won't work
pip install -e peft-main
For LTU-AS:
cd /ltu-main/src/ltu_as
conda create --name venv_ltu_as python=3.10
conda activate venv_ltu_as
# install the customized huggingface transformers; the original transformers won't work
pip install -e hf-dev/transformers-main
# install the customized huggingface peft; the original peft won't work
pip install -e peft-main/
# install the customized openai-whisper; the original whisper won't work
pip install -e whisper/
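After either setup, a quick sanity check (a minimal sketch, not an official script) is to confirm that Python resolves the customized packages to the editable installs inside ltu-main rather than to stock PyPI releases:

# run inside the activated venv (venv_ltu or venv_ltu_as)
import transformers
import peft

# the printed paths should point into ltu-main/src/.../hf-dev and .../peft-main
print("transformers:", transformers.__version__, "->", transformers.__file__)
print("peft:", peft.__version__, "->", peft.__file__)

# venv_ltu_as only: the customized openai-whisper should also resolve locally
import whisper
print("whisper:", whisper.__file__)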
We provide code-free API inference at [LTU Inference] and [LTU-AS Inference]. Both support batch inference with an API (see the button at the bottom of the page).
For local use, we suggest that all users, even those only interested in training/finetuning, start by running inference; this helps with debugging.
The bash scripts will automatically download the default LTU/LTU-AS models; you do not need to do it yourself.
`inference_gradio.py` can be run on CPU or GPU.
For LTU:
conda activate venv_ltu
cd ltu/src/ltu
python ./inference_gradio.py
The script may output some warnings, which can be ignored. After the script finishes, it will provide a Gradio link for inference, which can be opened in a browser on any machine. You can also modify the script to run it in a local terminal.
We also provide a batch inference script, `inference_batch.py`.
For LTU-AS:
conda activate venv_ltu_as
cd ltu/src/ltu_as
python ./inference_gradio.py
The script may output some warnings, which can be ignored. After the script finishes, it will provide a Gradio link for inference, which can be opened in a browser on any machine.
We also provide a batch inference script, `inference_batch.py`; note that this script loads pre-extracted Whisper features rather than raw wav files. If you want to use raw audio files, please use `inference_gradio.py`.
*GPU Issue: We find that OpenAI Whisper features differ across GPU generations, which impacts the performance of LTU-AS since it takes Whisper features as input. In the paper, we always use features generated by older GPUs (Titan-X). We also release a checkpoint that uses features generated by newer GPUs (A5000/A6000); please manually switch checkpoints if you are running on newer GPUs. A mismatch between the training and inference GPU does not completely break the model, but it does cause a performance drop.
We do not provide raw audio files for OpenAQA and OpenASQA due to copyright reasons. However, for easy reproduction, we provide audio files and Whisper audio features for a small sample set, together with a very simple, almost one-click script to finetune the model. Once it succeeds, you can replace the toy data with your own data.
For both scripts:
- You do not need to download the toy data; `prep_train.sh` will do this for you.
- You do not need to download the pretrained model; `prep_train.sh` will download the default pretrained model. However, you can change which pretrained model to use in `finetune_toy.sh`.
For LTU:
conda activate venv_ltu
# this path matters, as much of the code uses relative paths
cd ltu/src/ltu/train_script
# make the scripts executable
chmod 777 *
# prepare toy data and pretrained models
./prep_train.sh
# run finetuning on the data
./finetune_toy.sh
For LTU-AS:
conda activate venv_ltu_as
# this path matters, as much of the code uses relative paths
cd ltu/src/ltu_as/train_script
# make the scripts executable
chmod 777 *
# prepare toy data and pretrained models
./prep_train.sh
# run finetuning on the data
./finetune_toy.sh
For LTU, it is simple: just replace `--data_path '../../../openaqa/data/openaqa_toy_relative.json'` in `finetune_toy.sh` with your own data. Please make sure your own audio files are sampled at 16kHz. Absolute paths are encouraged; we use relative paths only to keep the one-click sample simple.
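If your audio is not already at 16kHz, a small resampling sketch like the one below can be used first. It is only an illustration and assumes torchaudio is installed in the venv; the file names are hypothetical, and any other resampler (ffmpeg, sox, librosa) works equally well.

import torchaudio

def resample_to_16k(in_path, out_path):
    # load the original audio and resample it to the 16kHz expected by LTU/LTU-AS
    wav, sr = torchaudio.load(in_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
    torchaudio.save(out_path, wav, 16000)

# hypothetical file names, for illustration only
resample_to_16k("my_clip_44k.wav", "my_clip_16k.wav")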
For LTU-AS, it is a bit more complex: our script does not load raw audio but pre-extracted Whisper features, so you would also need to extract Whisper features for your own audio first, and then change the code in the HF transformers package to point to your Whisper feature directory.
We suggest you try the toy data finetuning first, then do this.
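For orientation, the sketch below shows one way to pre-extract Whisper encoder features with the stock openai-whisper API. The exact Whisper variant, feature layer, and on-disk format expected by the LTU-AS dataloader are defined by the customized whisper/ package and training code in this repo, so treat this only as a starting point and compare the result against the provided toy features; the model name and file paths here are assumptions.

import torch
import whisper  # the openai-whisper installed in venv_ltu_as

# note the GPU issue above: Whisper features extracted on different GPU generations differ slightly
model = whisper.load_model("large-v1", device="cuda" if torch.cuda.is_available() else "cpu")

def extract_whisper_feature(wav_path, out_path):
    # load 16kHz audio, pad/trim to Whisper's 30-second window, and compute the log-mel spectrogram
    audio = whisper.pad_or_trim(whisper.load_audio(wav_path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    with torch.no_grad():
        feat = model.encoder(mel.unsqueeze(0))  # encoder representation, shape (1, 1500, d_model)
    torch.save(feat.squeeze(0).cpu(), out_path)

# hypothetical paths, for illustration only
extract_whisper_feature("my_clip_16k.wav", "my_clip_16k.pt")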
This is similar to finetuning; the difference is that both LTU and LTU-AS are trained with a multi-stage curriculum, so you need to start from stage 1, then stage 2, and so on.
From stage 2 onwards, change `--base_model 'your_path_to_mdl/pytorch_model.bin'` to the checkpoint of the model trained in the previous stage.
For LTU:
conda activate venv_ltu
# this path matters, as much of the code uses relative paths
cd ltu/src/ltu/train_script
# make the scripts executable
chmod 777 *
# prepare toy data and pretrained models
./prep_train.sh
# run the training stages in order
./stage1_proj_cla.sh
./stage2_all_cla.sh
./stage3_all_close.sh
./stage4_all_mix.sh
For LTU-AS:
conda activate venv_ltu_as
# this path matters, as much of the code uses relative paths
cd ltu/src/ltu_as/train_script
# make the scripts executable
chmod 777 *
# prepare toy data and pretrained models
./prep_train.sh
# run the training stages in order
./stage1_proj_cla.sh
./stage2_all_cla.sh
./stage4_all_mix_v2.sh
For most of the above applications, our scripts handle the model download (so you do not need to do it yourself), but we also provide additional checkpoints below.
Other models mentioned in the paper may be provided upon request; please create a GitHub issue to ask.
| LTU Model | Size | Train Seq Length | Train Steps | Whisper Feature GPU | Not Answerable Questions | Link |
|---|---|---|---|---|---|---|
| Original in Paper (Default) | 370M | 108 | 20000 | - | Included | Download |
| Full-Finetuned (include LLM Parameters) | 27G | 108 | 20000 | - | Included | Download |
| LTU-AS Model | Size | Train Seq Length | Train Steps | Whisper Feature GPU | Not Answerable Questions | Link |
|---|---|---|---|---|---|---|
| Original in Paper | 200M | 108 | 40000 | Old GPUs (Titan) | Included | Download |
| Long_sequence_exclude_noqa_old_gpu | 200M | 192 | 40000 | Old GPUs (Titan) | Excluded | Download |
| Long_sequence_exclude_noqa_new_gpu (Default) | 200M | 192 | 40000 | New GPUs (A5000/A6000) | Excluded | Download |
| Full-Finetuned (include LLM Parameters) | 27G | 192 | 40000 | Old GPUs (Titan) | Excluded | Download |
If you have a question, please open a GitHub issue; I usually respond promptly. If I am delayed, please ping me. For more personal or confidential requests, please email me at yuangong@mit.edu.