Automated Audio Captioning training code on the Clotho dataset, with Listen-Attend-Spell and CNN-Tell models.
```bash
git clone https://github.com/Labbeti/dcase2021task6
cd AAC
conda create -n env_aac python=3.9 pip
conda activate env_aac
pip install -e .
cd standalone
python download.py
cd ../slurm
./dcase.sh
```
Requirements:
- Anaconda >= 4.8,
- Java >= 1.8.0 for the SPICE metric,
- Python dependencies, installable with `setup.py` (if you use only `requirements.txt`, you must run the shell script `post_setup.sh` manually).
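As a quick sanity check (output format varies by system), you can confirm the required tools are available before installing:

```bash
conda --version   # should report >= 4.8
java -version     # should report 1.8.0 or newer (required for SPICE)
```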
Installation:

```bash
git clone https://github.com/Labbeti/dcase2021task6
cd AAC
conda create -n env_aac python=3.9 pip
conda activate env_aac
pip install -e .
```
This repository requires Java >= 1.8.0 and Stanford-CoreNLP to compute the 'Cider' and 'Spider' metrics. On Ubuntu, Java can be installed with the following command:

```bash
sudo apt install default-jre
```
You can install the datasets with the script `standalone/download.py`. The default root path is `data`. You can choose a dataset with the option `data=DATASET`. This script also installs language models for NLTK, spaCy and LanguageTool used to process captions, as well as the pre-trained "Wavegram" model from PANN.

Example (download Clotho v2.1):

```bash
python download.py data=clotho
```
After installing the environment and the dataset, just run the script `dcase.sh`:

```bash
cd slurm
./dcase.sh
```
Just run in the `standalone` directory:

```bash
python train.py expt=lat data=clotho epochs=60
```

This trains the Listen-Attend-Tell model on the Clotho dataset for 60 epochs.
Testing is automatically done at the end of the training, but it can be turned off with `test=false`.
This project uses Hydra to parse command-line parameters. The syntax is `param_name=VALUE` instead of `--param_name VALUE`.
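Several of the options shown in this README (`expt`, `data`, `epochs`, `test`, `tag`) can be combined in a single command; the values below are illustrative:

```bash
# Hydra-style overrides: plain key=value pairs, no leading dashes
python train.py expt=lat data=clotho epochs=60 test=false tag=my_run
```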
- `expt=EXPERIMENT` (see the example after this list)
  - `lat` (ListenAttendTell, a recurrent model based on Listen-Attend-Spell by Thomas Pellegrini)
  - `cnnt` (CNN-Tell, a convolutional recurrent model with a pre-trained encoder and the same decoder as LAT)
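For example, to train the CNN-Tell model instead of LAT (the epoch count here is arbitrary):

```bash
python train.py expt=cnnt data=clotho epochs=60
```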
The model and result data are saved in the `logs/Clotho/train_ListenAttendTell/{DATETIME}_{TAG}/` directory, where `DATETIME` is the date when the run started and `TAG` is the value of the `tag` option.
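A quick way to locate the most recent run directory, as a sketch assuming the default layout above:

```bash
# List run directories newest-first and keep the first one
ls -td logs/Clotho/train_ListenAttendTell/*/ | head -n 1
```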
The results directory contains:
- a `hydra` directory which stores the Hydra parameters,
- a `checkpoint` directory which stores the best and the last models from training,
- an `events.out.tfevents.ID` file which contains the TensorBoard logs,
- a `hparams.yaml` file which stores the experiment model hyper-parameters,
- a `metrics.yaml` file which stores the metric results computed by the Evaluator callback,
- a `result_SUBSET.csv` file for each test dataset SUBSET, which stores the output of the model for each sample,
- a `vocabulary.json` file containing the ordered list of words used, with the frequency of each word in the training dataset(s).
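Since the run directory contains TensorBoard event files, training curves can be inspected with the standard TensorBoard CLI (assuming TensorBoard is installed in the environment):

```bash
tensorboard --logdir logs/Clotho
```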
- Thomas Pellegrini for the Listen-Attend-Spell model
- Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley for the Cnn14_DecisionLevelAtt model from PANN
  - source code
- Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley. "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition." arXiv preprint arXiv:1912.10211 (2019).