Commit: Version 0.2.0
Labbeti committed Jan 12, 2024
1 parent 0068a66 commit 633e7fb
Showing 172 changed files with 21,578 additions and 99 deletions.
@@ -1,6 +1,6 @@
# Template: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

-name: CoNeTTE test
+name: CoNeTTE inference

on:
push:
@@ -51,7 +51,11 @@ jobs:
run: |
python -m pip install -e .[dev]
# --- TESTS ---
- name: Check format with Black
run: |
python -m black --check --diff src
- name: Print install info
run: |
conette-info
98 changes: 98 additions & 0 deletions .github/workflows/training.yaml
@@ -0,0 +1,98 @@
# Template: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: CoNeTTE training

on:
  push:
    branches: [ main, dev ]
  pull_request:
    branches: [ main, dev ]

env:
  CACHE_NUMBER: 0 # increase to reset cache manually
  DATAROOT: "$HOME/.cache/data"
  LOGROOT: "logs"

# Cancel workflow if a new push occurs
concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: ["3.10"]

    defaults:
      run:
        shell: bash -el {0}

    steps:
      # --- INSTALLATIONS ---
      - name: Checkout repository and submodules
        uses: actions/checkout@v2
        with:
          submodules: recursive

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Install soundfile
        run: |
          # For soundfile dep
          sudo apt-get install libsndfile1

      - name: Install local packages
        run: |
          python -m pip install -e .[train]

      - name: Print install info
        run: |
          conette-info

      - name: Prepare spaCy models
        run: |
          conette-prepare data=none default=false verbose=2 spacy=true

      - name: Load prepare cache
        uses: actions/cache@v3
        id: cache_preparation
        with:
          path: |
            ~/.cache/aac-metrics
            ~/.cache/conette
            ~/.cache/data/HDF
            ~/.cache/huggingface
            ~/.cache/torch
            ~/nltk_data
          key: ${{ runner.os }}-cache_preparation-${{ hashFiles('src/conette/prepare.py') }}
          restore-keys: |
            ${{ runner.os }}-cache_preparation

      - name: Prepare data and other models if necessary
        if: ${{ steps.cache_preparation.outputs.cache-hit != 'true' }}
        run: |
          echo "Prepare data in dataroot '$DATAROOT'"
          cnext_bl_path="$HOME/.cache/torch/hub/checkpoints/convnext_tiny_465mAP_BL_AC_70kit.pth"
          conette-prepare data=clotho default=true pann=false pack_to_hdf=true data.clean_archives=true data.subsets=[val] audio_t.src_sr=44100 audio_t.pretrain_path=${cnext_bl_path} post_hdf_name=bl pretag=cnext_bl csum_in_hdf_name=false path.data=$DATAROOT verbose=2

      # --- TESTS ---
      - name: Train a model
        run: |
          target_hdf="clotho_val_resample_mean_convnext_ident_bl.hdf"
          conette-train pl=conette expt=[clotho_cnext_bl,task_ds_src_camw] dm.train_hdfs=${target_hdf} dm.val_hdfs=${target_hdf} dm.test_hdfs=${target_hdf} dm.predict_hdfs=[] trainer.accelerator=cpu enable_dspeed=false path.data=$DATAROOT verbose=2 trainer=lim2 dm.bsize=3 trainer.max_epochs=1 path.log_root=$LOGROOT

      - name: Run CoNeTTE predict with trained model
        run: |
          latest_parent_logdir=`ls -Art "$LOGROOT" | grep train | tail -n 1`
          latest_logdir=`ls -Art "$LOGROOT/$latest_parent_logdir" | tail -n 1`
          model_path=$LOGROOT/$latest_parent_logdir/$latest_logdir
          echo "Predict with $model_path..."
          conette-predict --audio src/conette/data/sample.wav --model_path "$model_path"
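The final workflow step resolves the newest training log directory with an `ls -Art ... | tail -n 1` chain. A standalone sketch of that pattern, using made-up directory names for illustration:

```shell
# Simulate a log root containing two training runs (names are made up).
LOGROOT=$(mktemp -d)
mkdir -p "$LOGROOT/train_2024.01.12/version_0"
sleep 1
mkdir -p "$LOGROOT/train_2024.01.13/version_0"

# ls -t sorts by modification time (newest first); -r reverses it,
# so `tail -n 1` picks the most recently modified entry.
latest_parent_logdir=$(ls -Art "$LOGROOT" | grep train | tail -n 1)
latest_logdir=$(ls -Art "$LOGROOT/$latest_parent_logdir" | tail -n 1)
model_path=$LOGROOT/$latest_parent_logdir/$latest_logdir
echo "$model_path"
```

This keeps the workflow independent of the exact run name, which the trainer generates from the date.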
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,5 +2,7 @@
__pycache__/
*.egg-info/
Labbeti/conette/
-tmp/
+*tmp/
dist/
logs/
data/
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

All notable changes to this project will be documented in this file.

## [0.2.0] 2024-01-12
### Added
- CoNeTTE training source code, with entire data processing.
- ConvNeXt-trans baseline training source code, with entire data processing.
- ConvNeXt tag logits to CoNeTTE model outputs during inference.

## [0.1.4] 2023-11-20
### Fixed
- Fix forbid repetition mode argument.
69 changes: 61 additions & 8 deletions README.md
@@ -9,14 +9,16 @@

</div>

-CoNeTTE is an audio captioning system, which generate a short textual description of the sound events in any audio file. The architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf). The model has been developped by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD.
+CoNeTTE is an audio captioning system, which generates a short textual description of the sound events in any audio file. The architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf). The model was developed by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD. A simple interface to test CoNeTTE is available on the [HuggingFace website](https://huggingface.co/spaces/Labbeti/conette).

-## Installation
+## Inference

### Installation
```bash
python -m pip install conette
```

-## Usage with python
+### Usage with python
```py
from conette import CoNeTTEConfig, CoNeTTEModel

@@ -57,26 +59,77 @@
candidate = outputs["cands"][0]
print(candidate)
```

-## Usage with command line
+### Usage with command line
Simply use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export results to a CSV file using `--csv_export PATH`.

```bash
conette-predict --audio "/your/path/to/audio.wav"
```
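The `--csv_export PATH` option writes the predictions to a CSV file, which can then be post-processed with standard tooling. A minimal sketch using Python's stdlib `csv` module; the column names `fname` and `cands` below are assumptions for illustration, not the documented export schema:

```python
import csv
import io

# Stand-in for a file produced by --csv_export (column names are assumed).
fake_export = "fname,cands\nsample.wav,rain falls while thunder rumbles in the distance\n"

with io.StringIO(fake_export) as f:
    rows = list(csv.DictReader(f))

# Each row maps a processed audio file to its generated caption.
for row in rows:
    print(f"{row['fname']}: {row['cands']}")
```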

-## Performance
+### Performance
The model has been trained on AudioCaps (AC), Clotho (CL), MACS (MA) and WavCaps (WC). Its performance on the test subsets is as follows:

| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| AC-test | 44.14 | 43.98 | 60.81 | 309 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/outputs_audiocaps_test.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/scores_audiocaps_test.yaml) |
| CL-eval | 30.97 | 30.87 | 51.72 | 636 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/outputs_clotho_eval.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/scores_clotho_eval.yaml) |

-This model checkpoint has been trained for the Clotho dataset, but it can also reach a good performance on AudioCaps with the "audiocaps" task.
+This model checkpoint has been trained with a focus on the Clotho dataset, but it can also reach good performance on AudioCaps with the "audiocaps" task.

-## Limitations
+### Limitations
- The model expects audio sampled at **32 kHz**. It automatically resamples the input audio files up or down, but this might give worse results, especially for audio with lower sampling rates.
- The model has been trained on audio lasting from **1 to 30 seconds**. It can handle longer audio files, but this might require more memory and give worse results.
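To make the first bullet concrete: resampling a 44.1 kHz clip to 32 kHz changes the number of samples while preserving duration. A minimal sketch using plain linear interpolation with NumPy; the `resample_linear` helper is invented here for illustration, and CoNeTTE's internal resampler is a different, higher-quality implementation:

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_sr: int, dst_sr: int = 32000) -> np.ndarray:
    """Naive linear-interpolation resampling (illustration only)."""
    duration = len(audio) / src_sr
    n_out = int(round(duration * dst_sr))
    src_times = np.arange(len(audio)) / src_sr
    dst_times = np.arange(n_out) / dst_sr
    return np.interp(dst_times, src_times, audio)

# A 1-second sine wave at 44.1 kHz becomes 32000 samples at 32 kHz.
x = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
y = resample_linear(x, src_sr=44100, dst_sr=32000)
print(len(y))  # 32000
```

Downsampling discards high-frequency content, which is one reason low-sample-rate inputs can degrade caption quality.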

## Train a model
### Requirements
- Intended for Ubuntu 20.04 only. Requires the **java** < 1.13, **ffmpeg**, **yt-dlp**, and **zip** commands.
- Minimum recommended GPU: V100-32G.
- The WavCaps dataset might require more than 2 TB of disk storage.

### Installation
By default, **only the inference requirements are installed for conette**. To install the training requirements, use the following command:
```bash
python -m pip install conette[train]
```
If you have already installed conette for inference, it is **highly recommended to create another environment** before installing conette for training.

### Download external models and data
These steps might take a while (a few hours to download and prepare everything, depending on your CPU, GPU and SSD/HDD).

First, download the ConvNeXt, NLTK and spaCy models:
```bash
conette-prepare data=none default=true pack_to_hdf=false csum_in_hdf_name=false pann=false
```

Then download the 4 datasets used to train CoNeTTE:
```bash
cnext_bl_path="$HOME/.cache/torch/hub/checkpoints/convnext_tiny_465mAP_BL_AC.pth"
common_args="data.download=true pack_to_hdf=true audio_t=resample_mean_convnext audio_t.pretrain_path=${cnext_bl_path} post_hdf_name=bl pretag=cnext_bl"

conette-prepare data=audiocaps audio_t.src_sr=32000 ${common_args}
conette-prepare data=clotho audio_t.src_sr=44100 ${common_args}
conette-prepare data=macs audio_t.src_sr=48000 ${common_args}
conette-prepare data=wavcaps audio_t.src_sr=32000 ${common_args} datafilter.min_audio_size=0.1 datafilter.max_audio_size=30.0 datafilter.sr=32000
```
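These commands factor the shared options into a `common_args` shell variable and expand it unquoted, so the shell splits it back into separate arguments. A standalone sketch of that expansion behavior:

```shell
# Shared options collected in one variable, as in the commands above.
common_args="data.download=true pack_to_hdf=true"

# Unquoted expansion undergoes word splitting: each option becomes its own
# argument (quoting "$common_args" would instead pass one single word).
set -- data=clotho ${common_args}
echo "$# arguments: $@"
```

This is why `${common_args}` is deliberately left unquoted in the prepare commands.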

### Train a model
CNext-trans (baseline) on CL only (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[clotho_cnext_bl] pl=baseline
```

CoNeTTE on AC+CL+MA+WC, specialized for CL (~4 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_c,task_ds_src_camw] pl=conette
```

CoNeTTE on AC+CL+MA+WC, specialized for AC (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_a,task_ds_src_camw] pl=conette
```

**About reproducibility**: any training with AC data cannot be reproduced, because part of this data has been deleted from the YouTube source, and I cannot share my own audio files.

## Citation
The preprint version of the paper describing CoNeTTE is available on arxiv: https://arxiv.org/pdf/2309.00454.pdf

@@ -96,7 +149,7 @@
## Additional information
- CoNeTTE stands for **Co**nv**Ne**Xt-**T**ransformer with **T**ask **E**mbedding.
- Model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette
-- The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://zenodo.org/record/8020843 under the filename "convnext_tiny_465mAP_BL_AC_70kit.pth".
+- The weights of the encoder part of the architecture come from a ConvNeXt model for audio classification, available here: https://zenodo.org/record/8020843 under the filename "convnext_tiny_465mAP_BL_AC_70kit.pth".

## Contact
Maintainer: