This is the repository for the code to fine-tune transformers on the ADReSSo dataset. The dataset has been converted into a Hugging Face dataset.
Our goal is to solve the first task (classification) of the ADReSSo Challenge at INTERSPEECH 2021.
- Transformers 4.35.1
- Pytorch 2.1.0+cu121
- Datasets 2.14.6
- Tokenizers 0.14.1
You can install these packages with the following command:

```shell
pip install transformers datasets evaluate accelerate
```
For more information, refer to requirements.txt and environment.yaml.
Our datasets are retrieved from DementiaBank. Due to licensing restrictions, they must be kept private.
If you want to train our model on different datasets, here are the requirements.
You need to use a Hugging Face dataset and split it into train and test sets.
Each dataset needs to have the following features:
- audio: the audio loaded with librosa
- label: the label of the data (1 means control group and 0 means Alzheimer's Dementia)
- mmse (optional): Mini-Mental State Examination score
Here is an example.

```
DatasetDict({
    train: Dataset({
        features: ['audio', 'label', 'mmse'],
        num_rows: 237
    })
    test: Dataset({
        features: ['audio', 'label', 'mmse'],
        num_rows: 46
    })
})
```
- -m: the model you want to train (must be on the Hugging Face Hub)
- -d: sample duration (in seconds)
- -b: training batch size
- -g: gradient accumulation steps
- -hp: enable half precision
Argument | Type | Default value |
---|---|---|
`-m` | string | `facebook/wav2vec2-base` |
`-d` | integer | 30 |
`-b` | integer | 8 |
`-g` | integer | 4 |
`-hp` | boolean | False |
So if the model fails to fit in the GPU, i.e., CUDA out of memory, try decreasing the batch size (-b) or sample duration (-d), increasing the gradient accumulation steps (-g), or enabling half precision (-hp).
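The tradeoff behind this advice: gradient accumulation sums gradients over several small mini-batches before each optimizer step, so you can lower per-step memory without changing the effective batch size. A minimal sketch (the helper name is ours, not from this repo):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int) -> int:
    """Gradients from grad_accum_steps mini-batches are accumulated before
    each optimizer step, so the optimizer effectively sees this many samples."""
    return per_device_batch * grad_accum_steps

# Defaults from the table above: -b 8, -g 4
print(effective_batch_size(8, 4))  # 32
# Same effective batch size with half the per-step memory: -b 4, -g 8
print(effective_batch_size(4, 8))  # 32
```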
You can refer to this repository
- You need to modify the shebang of the .py files.
- You need a Hugging Face dataset to support training.
- You need to create your own Hugging Face repository to save your model.
Acoustics
Model Variant | Accuracy | F1 |
---|---|---|
distil-whisper-large-v2 | 0.8451 | 0.8607 |
whisper-large-v3 | 0.8451 | 0.8406 |
distil-whisper-medium.en | 0.8169 | 0.7936 |
whisper-medium | 0.7606 | 0.7792 |
whisper-medium.en | 0.7324 | 0.7324 |
Linguistic
Model Variant | Accuracy | F1 |
---|---|---|
roberta-large | 0.8310 | 0.8421 |
bart-large-mnli | 0.8028 | 0.8000 |
bert-large | 0.7746 | 0.7500 |
bart-large | 0.7465 | 0.7500 |
flan-t5-large | 0.7465 | 0.7273 |
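For reference, the accuracy and binary F1 scores reported above can be computed as follows; treating the Alzheimer's Dementia class (label 0) as the positive class for F1 is our assumption, not stated in this README:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_f1(y_true, y_pred, positive=0):
    # Assumption: label 0 (Alzheimer's Dementia) is the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))   # 0.75
print(binary_f1([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.666...
```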
Model Variant | RMSE |
---|---|
whisper-medium.en | 4.5335 |
whisper-large-v3 | 4.5682 |
distil-whisper-large-v2 | 4.7742 |
whisper-medium | 4.8297 |
distil-whisper-medium.en | 4.9445 |
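RMSE here is presumably the error of the predicted MMSE scores, measured in MMSE points (scale 0-30). A stdlib sketch of the metric, with placeholder scores:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Placeholder MMSE scores, not values from this repo.
print(rmse([20.0, 25.0, 30.0], [22.0, 24.0, 29.0]))  # 1.414...
```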