Commit: Version 0.2.0
Labbeti committed Jan 12, 2024
1 parent 0068a66 commit 633e7fb
Showing 172 changed files with 21,578 additions and 99 deletions.
@@ -1,6 +1,6 @@
# Template: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

-name: CoNeTTE test
+name: CoNeTTE inference

on:
push:
@@ -51,7 +51,11 @@ jobs:
run: |
python -m pip install -e .[dev]
# --- TESTS ---
- name: Check format with Black
run: |
python -m black --check --diff src
- name: Print install info
run: |
conette-info
98 changes: 98 additions & 0 deletions .github/workflows/training.yaml
@@ -0,0 +1,98 @@
# Template: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: CoNeTTE training

on:
  push:
    branches: [ main, dev ]
  pull_request:
    branches: [ main, dev ]

env:
  CACHE_NUMBER: 0 # increase to reset cache manually
  DATAROOT: "$HOME/.cache/data"
  LOGROOT: "logs"

# Cancel workflow if a new push occurs
concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: ["3.10"]

    defaults:
      run:
        shell: bash -el {0}

    steps:
      # --- INSTALLATIONS ---
      - name: Checkout repository and submodules
        uses: actions/checkout@v2
        with:
          submodules: recursive

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Install soundfile
        run: |
          # For soundfile dep
          sudo apt-get install libsndfile1

      - name: Install local packages
        run: |
          python -m pip install -e .[train]

      - name: Print install info
        run: |
          conette-info

      - name: Prepare spaCy models
        run: |
          conette-prepare data=none default=false verbose=2 spacy=true

      - name: Load prepare cache
        uses: actions/cache@v3
        id: cache_preparation
        with:
          path: |
            ~/.cache/aac-metrics
            ~/.cache/conette
            ~/.cache/data/HDF
            ~/.cache/huggingface
            ~/.cache/torch
            ~/nltk_data
          key: ${{ runner.os }}-cache_preparation-${{ hashFiles('src/conette/prepare.py') }}
          restore-keys: |
            ${{ runner.os }}-cache_preparation

      - name: Prepare data and other models if necessary
        if: ${{ steps.cache_preparation.outputs.cache-hit != 'true' }}
        run: |
          echo "Prepare data in dataroot '$DATAROOT'"
          cnext_bl_path="$HOME/.cache/torch/hub/checkpoints/convnext_tiny_465mAP_BL_AC_70kit.pth"
          conette-prepare data=clotho default=true pann=false pack_to_hdf=true data.clean_archives=true data.subsets=[val] audio_t.src_sr=44100 audio_t.pretrain_path=${cnext_bl_path} post_hdf_name=bl pretag=cnext_bl csum_in_hdf_name=false path.data=$DATAROOT verbose=2

      # --- TESTS ---
      - name: Train a model
        run: |
          target_hdf="clotho_val_resample_mean_convnext_ident_bl.hdf"
          conette-train pl=conette expt=[clotho_cnext_bl,task_ds_src_camw] dm.train_hdfs=${target_hdf} dm.val_hdfs=${target_hdf} dm.test_hdfs=${target_hdf} dm.predict_hdfs=[] trainer.accelerator=cpu enable_dspeed=false path.data=$DATAROOT verbose=2 trainer=lim2 dm.bsize=3 trainer.max_epochs=1 path.log_root=$LOGROOT

      - name: Run CoNeTTE predict with trained model
        run: |
          latest_parent_logdir=`ls -Art "$LOGROOT" | grep train | tail -n 1`
          latest_logdir=`ls -Art "$LOGROOT/$latest_parent_logdir" | tail -n 1`
          model_path=$LOGROOT/$latest_parent_logdir/$latest_logdir
          echo "Predict with $model_path..."
          conette-predict --audio src/conette/data/sample.wav --model_path "$model_path"
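The final workflow step resolves the newest training log directory with an `ls -Art ... | tail -n 1` chain. A standalone sketch of that pattern, using made-up directory names for illustration:

```shell
# Simulate a log root containing two training runs (names are made up).
LOGROOT=$(mktemp -d)
mkdir -p "$LOGROOT/train_2024.01.12/version_0"
sleep 1
mkdir -p "$LOGROOT/train_2024.01.13/version_0"

# ls -t sorts by modification time (newest first); -r reverses it,
# so `tail -n 1` picks the most recently modified entry.
latest_parent_logdir=$(ls -Art "$LOGROOT" | grep train | tail -n 1)
latest_logdir=$(ls -Art "$LOGROOT/$latest_parent_logdir" | tail -n 1)
model_path=$LOGROOT/$latest_parent_logdir/$latest_logdir
echo "$model_path"
```

This keeps the workflow independent of the exact run name, which the trainer generates from the date.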
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,5 +2,7 @@
__pycache__/
*.egg-info/
Labbeti/conette/
-tmp/
+*tmp/
dist/
logs/
data/
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

All notable changes to this project will be documented in this file.

## [0.2.0] 2024-01-12
### Added
- CoNeTTE training source code, with entire data processing.
- ConvNeXt-trans baseline training source code, with entire data processing.
- ConvNeXt tag logits to CoNeTTE model outputs during inference.

## [0.1.4] 2023-11-20
### Fixed
- Fix forbid repetition mode argument.
69 changes: 61 additions & 8 deletions README.md
@@ -9,14 +9,16 @@

</div>

-CoNeTTE is an audio captioning system, which generate a short textual description of the sound events in any audio file. The architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf). The model has been developped by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD.
+CoNeTTE is an audio captioning system, which generates a short textual description of the sound events in any audio file. The architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf). The model was developed by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD. A simple interface to test CoNeTTE is available on the [HuggingFace website](https://huggingface.co/spaces/Labbeti/conette).

-## Installation
+## Inference

### Installation
```bash
python -m pip install conette
```

-## Usage with python
+### Usage with python
```py
from conette import CoNeTTEConfig, CoNeTTEModel

@@ -57,26 +59,77 @@
candidate = outputs["cands"][0]
print(candidate)
```

-## Usage with command line
+### Usage with command line
Simply use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export results to a CSV file using `--csv_export PATH`.

```bash
conette-predict --audio "/your/path/to/audio.wav"
```
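The `--csv_export PATH` option writes the predictions to a CSV file, which can then be post-processed with standard tooling. A minimal sketch using Python's stdlib `csv` module; the column names `fname` and `cands` below are assumptions for illustration, not the documented export schema:

```python
import csv
import io

# Stand-in for a file produced by --csv_export (column names are assumed).
fake_export = "fname,cands\nsample.wav,rain falls while thunder rumbles in the distance\n"

with io.StringIO(fake_export) as f:
    rows = list(csv.DictReader(f))

# Each row maps a processed audio file to its generated caption.
for row in rows:
    print(f"{row['fname']}: {row['cands']}")
```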

-## Performance
+### Performance
The model has been trained on AudioCaps (AC), Clotho (CL), MACS (MA) and WavCaps (WC). Its performance on the test subsets is as follows:

| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| AC-test | 44.14 | 43.98 | 60.81 | 309 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/outputs_audiocaps_test.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/scores_audiocaps_test.yaml) |
| CL-eval | 30.97 | 30.87 | 51.72 | 636 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/outputs_clotho_eval.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/scores_clotho_eval.yaml) |

-This model checkpoint has been trained for the Clotho dataset, but it can also reach a good performance on AudioCaps with the "audiocaps" task.
+This model checkpoint has been trained with a focus on the Clotho dataset, but it can also reach good performance on AudioCaps with the "audiocaps" task.

-## Limitations
+### Limitations
- The model expects audio sampled at **32 kHz**. It automatically resamples the input audio files up or down, but this might give worse results, especially for audio with lower sampling rates.
- The model has been trained on audio lasting from **1 to 30 seconds**. It can handle longer audio files, but this might require more memory and give worse results.
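To make the first bullet concrete: resampling a 44.1 kHz clip to 32 kHz changes the number of samples while preserving duration. A minimal sketch using plain linear interpolation with NumPy; the `resample_linear` helper is invented here for illustration, and CoNeTTE's internal resampler is a different, higher-quality implementation:

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_sr: int, dst_sr: int = 32000) -> np.ndarray:
    """Naive linear-interpolation resampling (illustration only)."""
    duration = len(audio) / src_sr
    n_out = int(round(duration * dst_sr))
    src_times = np.arange(len(audio)) / src_sr
    dst_times = np.arange(n_out) / dst_sr
    return np.interp(dst_times, src_times, audio)

# A 1-second sine wave at 44.1 kHz becomes 32000 samples at 32 kHz.
x = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
y = resample_linear(x, src_sr=44100, dst_sr=32000)
print(len(y))  # 32000
```

Downsampling discards high-frequency content, which is one reason low-sample-rate inputs can degrade caption quality.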

## Train a model
### Requirements
- Intended for Ubuntu 20.04 only. Requires the **java** < 1.13, **ffmpeg**, **yt-dlp**, and **zip** commands.
- Minimum recommended GPU: V100-32G.
- The WavCaps dataset might require more than 2 TB of disk storage.

### Installation
By default, **only the inference requirements are installed for conette**. To install the training requirements, use the following command:
```bash
python -m pip install conette[train]
```
If you have already installed conette for inference, it is **highly recommended to create another environment** before installing conette for training.

### Download external models and data
These steps might take a while (a few hours to download and prepare everything, depending on your CPU, GPU and SSD/HDD).

First, download the ConvNeXt, NLTK and spaCy models:
```bash
conette-prepare data=none default=true pack_to_hdf=false csum_in_hdf_name=false pann=false
```

Then download the 4 datasets used to train CoNeTTE:
```bash
cnext_bl_path="$HOME/.cache/torch/hub/checkpoints/convnext_tiny_465mAP_BL_AC.pth"
common_args="data.download=true pack_to_hdf=true audio_t=resample_mean_convnext audio_t.pretrain_path=${cnext_bl_path} post_hdf_name=bl pretag=cnext_bl"

conette-prepare data=audiocaps audio_t.src_sr=32000 ${common_args}
conette-prepare data=clotho audio_t.src_sr=44100 ${common_args}
conette-prepare data=macs audio_t.src_sr=48000 ${common_args}
conette-prepare data=wavcaps audio_t.src_sr=32000 ${common_args} datafilter.min_audio_size=0.1 datafilter.max_audio_size=30.0 datafilter.sr=32000
```
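These commands factor the shared options into a `common_args` shell variable and expand it unquoted, so the shell splits it back into separate arguments. A standalone sketch of that expansion behavior:

```shell
# Shared options collected in one variable, as in the commands above.
common_args="data.download=true pack_to_hdf=true"

# Unquoted expansion undergoes word splitting: each option becomes its own
# argument (quoting "$common_args" would instead pass one single word).
set -- data=clotho ${common_args}
echo "$# arguments: $@"
```

This is why `${common_args}` is deliberately left unquoted in the prepare commands.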

### Train a model
CNext-trans (baseline) on CL only (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[clotho_cnext_bl] pl=baseline
```

CoNeTTE on AC+CL+MA+WC, specialized for CL (~4 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_c,task_ds_src_camw] pl=conette
```

CoNeTTE on AC+CL+MA+WC, specialized for AC (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_a,task_ds_src_camw] pl=conette
```

**About reproducibility**: any training with AC data cannot be reproduced, because part of this data has been deleted from the YouTube source, and I cannot share my own audio files.

## Citation
The preprint version of the paper describing CoNeTTE is available on arxiv: https://arxiv.org/pdf/2309.00454.pdf

@@ -96,7 +149,7 @@
## Additional information
- CoNeTTE stands for **Co**nv**Ne**Xt-**T**ransformer with **T**ask **E**mbedding.
- Model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette
-- The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://zenodo.org/record/8020843 under the filename "convnext_tiny_465mAP_BL_AC_70kit.pth".
+- The weights of the encoder part of the architecture come from a ConvNeXt model for audio classification, available here: https://zenodo.org/record/8020843 under the filename "convnext_tiny_465mAP_BL_AC_70kit.pth".

## Contact
Maintainer: