PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, and Hanan Aldarmaki
Zero-shot inference involves matching the embedding of an audio waveform with the text-prompt embeddings of each class; the class with the highest matching score is assigned to the audio. Prompt learning, as explored by Gu et al. (2023), automates this by learning text prompts from training data in a few-shot setup. The first notable method, COOP, learns the context of text prompts in the token-embedding space. Our method, PALM, operates in the feature (output) space of the text encoder: it requires only class names at the input of the text encoder and optimizes the feature space by adding learnable context embeddings to the text feature vectors. PALM not only outperforms COOP but is also more computationally efficient, since, unlike COOP, it does not require gradients to flow through the text encoder.
Abstract
Audio-Language Models (ALMs), inspired by advances in Vision-Language Models (VLMs), have recently achieved remarkable success in zero-shot audio recognition tasks, in which features of audio waveforms are matched with class-specific text-prompt features. Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding.
TLDR: We adapt vision-language prompt learning methods for audio-language models and introduce PALM, a new method that is computationally efficient and outperforms or matches baselines in audio classification across 11 datasets.
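For intuition, the snippet below is a minimal, self-contained PyTorch sketch of this idea, not the actual PALM implementation (that lives in the `palm` folder of this repo); the encoders, dimensions, and per-class context are stand-ins chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conceptual sketch only -- NOT the actual PALM code. The encoders are random
# stand-ins, the dimensions are made up, and a per-class context is assumed
# purely for illustration.

embed_dim, num_classes, batch_size = 512, 10, 4

text_encoder = nn.Linear(300, embed_dim)     # stand-in for the frozen text encoder
audio_encoder = nn.Linear(16000, embed_dim)  # stand-in for the frozen audio encoder
for p in list(text_encoder.parameters()) + list(audio_encoder.parameters()):
    p.requires_grad_(False)                  # both encoders stay frozen

class_name_tokens = torch.randn(num_classes, 300)    # stand-in for tokenized class names
with torch.no_grad():
    text_features = text_encoder(class_name_tokens)  # (num_classes, embed_dim)

# The only learnable parameters: context embeddings added in the *feature*
# (output) space of the text encoder, so no gradients flow through the encoder.
context = nn.Parameter(torch.zeros(num_classes, embed_dim))

def logits(audio_batch):
    with torch.no_grad():
        audio_features = audio_encoder(audio_batch)   # (batch, embed_dim)
    prompts = F.normalize(text_features + context, dim=-1)
    return F.normalize(audio_features, dim=-1) @ prompts.t()  # cosine-similarity logits

# Few-shot training step: cross-entropy on the logits, updating only `context`.
optimizer = torch.optim.Adam([context], lr=1e-3)
audio_batch = torch.randn(batch_size, 16000)
labels = torch.randint(0, num_classes, (batch_size,))
loss = F.cross_entropy(logits(audio_batch), labels)
loss.backward()
optimizer.step()
```

Because only `context` receives gradients, training touches neither encoder, which is the source of the efficiency gain mentioned above.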
- Sep 20, 2024 : Accepted in EMNLP (Main) 2024 🎊 🎉
- Sep 25, 2024 : Released code for PALM
- Sep 28, 2024 : Released instructions for preparing datasets
- Create a conda environment
conda create --name palm python=3.8
conda activate palm
- Install PyTorch and other dependencies
git clone https://github.com/asif-hanif/palm
cd palm
pip install -r requirements.txt
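Optionally, you can verify the environment is set up correctly (this assumes `requirements.txt` installs PyTorch, as noted above):

```python
# Quick sanity check: PyTorch is importable and (optionally) a GPU is visible.
import torch
print(torch.__version__, torch.cuda.is_available())
```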
We have shown the efficacy of PALM and the other baselines (ZERO-SHOT, COOP, COCOOP) using the PENGI model. Download the pre-trained PENGI model using the link provided below and place the checkpoint file at the path `pengi/configs` (after cloning the repo).
Model | Link | Size |
---|---|---|
PENGI | Download | 2.2 GB |
The PENGI checkpoint can also be downloaded with the following command:
# downloads base.pth directly into pengi/configs (run from the repo root)
wget https://zenodo.org/records/8387083/files/base.pth -P pengi/configs
We have performed experiments on 11 audio classification datasets. Instructions for downloading and processing the datasets used in our experiments are provided in DATASETS.md. All of the datasets have been uploaded to the HuggingFace Datasets Hub 🤗 for easy access (a minimal loading sketch is shown after the table below). We have also provided a Jupyter Notebook to download all datasets in one go. It might take some time to download all datasets, so we recommend running the notebook on a cloud instance or a machine with a fast internet connection.
Dataset | Type | Classes | Size | Link |
---|---|---|---|---|
Beijing-Opera | Instrument Classification | 4 | 69 MB | Instructions |
CREMA-D | Emotion Recognition | 6 | 606 MB | Instructions |
ESC50 | Sound Event Classification | 50 | 881 MB | Instructions |
ESC50-Actions | Sound Event Classification | 10 | 881 MB | Instructions |
GT-Music-Genre | Music Analysis | 10 | 1.3 GB | Instructions |
NS-Instruments | Instrument Classification | 10 | 18.5 GB | Instructions |
RAVDESS | Emotion Recognition | 8 | 1.1 GB | Instructions |
SESA | Surveillance Sound Classification | 4 | 70 MB | Instructions |
TUT2017 | Acoustic Scene Classification | 15 | 12.3 GB | Instructions |
UrbanSound8K | Sound Event Classification | 10 | 6.8 GB | Instructions |
VocalSound | Vocal Sound Classification | 6 | 8.2 GB | Instructions |
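For reference, loading one of these datasets from the HuggingFace Hub looks roughly like the snippet below; the repository id is a placeholder, so use the actual ids given in DATASETS.md and the download notebook.

```python
from datasets import load_dataset

# "<hf-username>/ESC50" is a placeholder id for illustration only; the real
# dataset ids are listed in DATASETS.md / the provided notebook.
ds = load_dataset("<hf-username>/ESC50")
print(ds)  # shows the available splits and their features
```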
All datasets should be placed in a directory named `Audio-Datasets`, and the path of this directory should be specified in the `DATASET_ROOT` variable in the shell scripts. Once all datasets have been downloaded, the directory structure should look as follows:
Audio-Datasets/
├── Beijing-Opera/
├── CREMA-D/
├── ESC50/
├── ESC50-Actions/
├── GT-Music-Genre/
├── NS-Instruments/
├── RAVDESS/
├── SESA/
├── TUT2017/
├── UrbanSound8K/
├── VocalSound/
There are three main folders in this repo: `pengi`, `palm`, and `utils`. Code in the `pengi` folder is taken from the PENGI repo for model instantiation. Implementations of the baselines (`zeroshot`, `coop`, `cocoop`) and our method `palm` are in the `palm` folder. Class definitions of the audio and text encoders of the PENGI model can be found in the `palm/encoders.py` file. Training- and dataset-related code is in the `utils` folder.
We performed all experiments on an `NVIDIA A100-SXM4-40GB` GPU. Shell scripts to run the experiments can be found in the `scripts` folder.
## General Command Structure
bash <SHELL_SCRIPT> <METHOD_NAME>
The following methods (including `palm`) are supported in this repository:
- `zeroshot`
- `coop`
- `cocoop`
- `palm`

Examples of running the `palm` method on different audio classification datasets are provided below:
bash scripts/beijing_opera.sh palm
bash scripts/crema_d.sh palm
bash scripts/esc50_actions.sh palm
bash scripts/esc50.sh palm
bash scripts/gt_music_genre.sh palm
bash scripts/ns_instruments.sh palm
bash scripts/ravdess.sh palm
bash scripts/sesa.sh palm
bash scripts/tut.sh palm
bash scripts/urban_sound.sh palm
bash scripts/vocal_sound.sh palm
Results are saved in `json` format in the `logs` directory. To process the results (i.e., average across all folds/seeds and print them), run the following command after running all experiments:
cd logs
bash results.sh
Note: For multi-fold datasets, we run experiments using cross-validation and then report the average results for each seed.
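As a rough illustration of that averaging (the real logic is in `logs/results.sh`, and the actual log file layout may differ from the hypothetical one assumed here):

```python
import glob
import json
import statistics

# Hypothetical layout: one json file per (seed, fold), each with an "accuracy"
# field, e.g. logs/esc50/palm_seed0_fold1.json
per_seed = {}
for path in glob.glob("logs/esc50/palm_seed*_fold*.json"):
    seed = path.split("seed")[1].split("_")[0]
    with open(path) as f:
        per_seed.setdefault(seed, []).append(json.load(f)["accuracy"])

# Average over folds within each seed, then report per-seed and overall means.
seed_means = {seed: statistics.mean(accs) for seed, accs in per_seed.items()}
print(seed_means)
print("overall:", statistics.mean(seed_means.values()))
```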
Comparison of PALM with Baselines: The accuracy scores of the baselines (ZERO-SHOT, COOP, COCOOP) and our proposed method PALM across 11 datasets are presented. For each method except ZERO-SHOT, experiments were performed using three different seeds; the accuracy scores for all seeds are reported, along with the average score. Bold values indicate the best average score in each row. Compared to the baselines, our proposed method achieves favorable results, with an average improvement of 5.5% over COOP and 3.1% over COCOOP. Note that both COOP and COCOOP are computationally expensive, as they require loss gradients to flow through the text encoder; COCOOP additionally has a feedback loop from the audio features to the input space of the text encoder, making it even more expensive. PALM, in contrast, is considerably less computationally demanding.
Comparison of PALM† and PALM: Here, PALM† refers to the setting in which the learnable context embeddings are removed from the feature space of the text encoder. Removing the context embeddings drastically degrades performance, highlighting their importance.
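In terms of the conceptual sketch near the top of this README, PALM† corresponds to dropping the learnable `context` term, i.e., using `F.normalize(text_features, dim=-1)` directly as the prompt features.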
If you find our work, this repository, or the pre-trained models useful, please consider giving a star ⭐ and citing our work.
@article{hanif2024palm,
title={PALM: Few-Shot Prompt Learning for Audio Language Models},
author={Hanif, Asif and Agro, Maha Tufail and Qazi, Mohammad Areeb and Aldarmaki, Hanan},
journal={arXiv preprint arXiv:2409.19806},
year={2024}
}
Should you have any questions, please create an issue on this repository or contact us at asif.hanif@mbzuai.ac.ae
We used PENGI for model instantiation and borrowed part of the code from COOP/COCOOP to implement the baselines. We thank the respective authors for releasing their code.