[EMNLP 2024] Official code repository of paper titled "PALM: Few-Shot Prompt Learning for Audio Language Models" accepted in EMNLP 2024 conference.


asif-hanif/palm


PALM: Few-Shot Prompt Learning for Audio Language Models (EMNLP'24)

Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, and Hanan Aldarmaki

[Figure: main figure]

Zero-Shot inference involves matching the embedding of the audio waveform with the embeddings of text prompts for each class. The class with the highest matching score is then assigned to the audio. Prompt Learning, as explored by Gu et al. 2023, automates prompt design by learning text prompts from training data in a few-shot setup. The first notable method, COOP, learns the context of text prompts in the token-embedding space. Our method, PALM, operates in the feature (output) space of the text encoder. It requires only class names at the input of the text encoder and optimizes the feature space by adding learnable context embeddings to text feature vectors. PALM not only outperforms COOP but is also more computationally efficient because, unlike COOP, it does not require gradients to flow through the text encoder.
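The matching step described above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the function names, the toy feature vectors, and the plain element-wise addition of context vectors are assumptions made for illustration, and the exact way PALM combines context embeddings with text features may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(audio_feature, text_features, context=None):
    """Return (predicted class index, per-class scores).

    text_features: one vector per class (output of a frozen text encoder).
    context: optional per-class vectors added in feature space (hypothetical
             stand-in for learned context); None gives plain zero-shot matching.
    """
    if context is not None:
        text_features = [
            [t + c for t, c in zip(feat, ctx)]
            for feat, ctx in zip(text_features, context)
        ]
    scores = [cosine(audio_feature, feat) for feat in text_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy 3-dimensional features for two classes (made-up numbers).
audio = [1.0, 0.0, 0.0]
text = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
zero_shot_pred, _ = classify(audio, text)

# In practice the context would be learned from few-shot data; here it is hand-set.
ctx = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
palm_pred, _ = classify(audio, text, context=ctx)
print(zero_shot_pred, palm_pred)
```

Because the context is added to the encoder's output rather than its input, only the context vectors would need gradients during training, which is the efficiency argument made above.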




Abstract

Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding.

TLDR: We adapt vision-language prompt learning methods for audio-language models and introduce PALM, a new method that is computationally efficient and outperforms or matches baselines in audio classification across 11 datasets.



Updates 🚀

  • Sep 20, 2024 : Accepted in EMNLP (Main) 2024    🎊 🎉
  • Sep 25, 2024 : Released code for PALM
  • Sep 28, 2024 : Released instructions for preparing datasets





  1. Create a conda environment
     conda create --name palm python=3.8
     conda activate palm
  2. Install PyTorch and other dependencies
     git clone https://github.com/asif-hanif/palm
     cd palm
     pip install -r requirements.txt

We have shown the efficacy of PALM and other baselines (ZERO-SHOT, COOP, COCOOP) using the PENGI model.

After cloning the repo, download the pre-trained PENGI model using the link provided below and place the checkpoint file in pengi/configs.

| Model | Link     | Size   |
|-------|----------|--------|
| PENGI | Download | 2.2 GB |

The PENGI checkpoint can also be downloaded with the following command:

wget https://zenodo.org/records/8387083/files/base.pth

We have performed experiments on 11 audio classification datasets. Instructions for downloading and processing the datasets used by our method are provided in DATASETS.md. All datasets have been uploaded to the HuggingFace Datasets Hub 🤗 for easy access. We have also provided a Jupyter Notebook that downloads all datasets in one go. Downloading all datasets might take some time, so we recommend running the notebook on a cloud instance or a machine with a fast internet connection.

| Dataset        | Type                              | Classes | Size    | Link         |
|----------------|-----------------------------------|---------|---------|--------------|
| Beijing-Opera  | Instrument Classification         | 4       | 69 MB   | Instructions |
| CREMA-D        | Emotion Recognition               | 6       | 606 MB  | Instructions |
| ESC50          | Sound Event Classification        | 50      | 881 MB  | Instructions |
| ESC50-Actions  | Sound Event Classification        | 10      | 881 MB  | Instructions |
| GT-Music-Genre | Music Analysis                    | 10      | 1.3 GB  | Instructions |
| NS-Instruments | Instrument Classification         | 10      | 18.5 GB | Instructions |
| RAVDESS        | Emotion Recognition               | 8       | 1.1 GB  | Instructions |
| SESA           | Surveillance Sound Classification | 4       | 70 MB   | Instructions |
| TUT2017        | Acoustic Scene Classification     | 15      | 12.3 GB | Instructions |
| UrbanSound8K   | Sound Event Classification        | 10      | 6.8 GB  | Instructions |
| VocalSound     | Vocal Sound Classification        | 6       | 8.2 GB  | Instructions |


All datasets should be placed in a directory named Audio-Datasets, and the path of this directory should be specified in the variable DATASET_ROOT in the shell scripts. Once all datasets have been downloaded, the directory structure should look as follows:

Audio-Datasets/
    ├── Beijing-Opera/
    ├── CREMA-D/
    ├── ESC50/ 
    ├── ESC50-Actions/
    ├── GT-Music-Genre/
    ├── NS-Instruments/
    ├── RAVDESS/
    ├── SESA/
    ├── TUT2017/
    ├── UrbanSound8K/
    ├── VocalSound/
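Before launching experiments, it can help to confirm that every expected dataset folder is present. The following is a small sketch, not part of the repository: the helper name missing_datasets and the example path "Audio-Datasets" are assumptions, and the path should be replaced with whatever DATASET_ROOT points to on your machine.

```python
from pathlib import Path

# Folder names copied from the directory structure listed above.
EXPECTED_DATASETS = (
    "Beijing-Opera", "CREMA-D", "ESC50", "ESC50-Actions", "GT-Music-Genre",
    "NS-Instruments", "RAVDESS", "SESA", "TUT2017", "UrbanSound8K", "VocalSound",
)

def missing_datasets(dataset_root):
    """Return the expected dataset folders that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_DATASETS if not (root / name).is_dir()]

# Hypothetical location; use the same value as DATASET_ROOT in the shell scripts.
missing = missing_datasets("Audio-Datasets")
print("missing:", missing if missing else "none")
```

The check only verifies that the top-level folders exist; it does not validate their contents.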

There are three main folders in this repo: pengi, palm, and utils. Code in the pengi folder is taken from the PENGI repo for model instantiation. The baselines (zeroshot, coop, cocoop) and our method palm are implemented in the palm folder. Class definitions of the audio and text encoders of the PENGI model can be found in palm/encoders.py. Training and dataset-related code is in the utils folder.


We have performed all experiments on a single NVIDIA A100-SXM4-40GB GPU. Shell scripts to run experiments can be found in the scripts folder.

## General Command Structure
bash  <SHELL_SCRIPT>  <METHOD_NAME>

The following methods (including palm) are supported in this repository:

zeroshot coop cocoop palm

Examples of running the palm method on different audio classification datasets are provided below:

bash scripts/beijing_opera.sh palm
bash scripts/crema_d.sh palm
bash scripts/esc50_actions.sh palm
bash scripts/esc50.sh palm
bash scripts/gt_music_genre.sh palm
bash scripts/ns_instruments.sh palm
bash scripts/ravdess.sh palm
bash scripts/sesa.sh palm
bash scripts/tut.sh palm
bash scripts/urban_sound.sh palm
bash scripts/vocal_sound.sh palm
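To run one method across all datasets without typing each command, the list above can be generated programmatically. This is an illustrative sketch rather than a script shipped with the repository: build_commands and run_all are hypothetical helper names, and the script names are copied from the examples above.

```python
import subprocess

METHODS = ("zeroshot", "coop", "cocoop", "palm")
SCRIPTS = (
    "beijing_opera", "crema_d", "esc50_actions", "esc50", "gt_music_genre",
    "ns_instruments", "ravdess", "sesa", "tut", "urban_sound", "vocal_sound",
)

def build_commands(method):
    """Return one `bash scripts/<dataset>.sh <method>` command per dataset."""
    if method not in METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    return [["bash", f"scripts/{name}.sh", method] for name in SCRIPTS]

def run_all(method, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(method):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing script

run_all("coop")  # dry run: prints the 11 commands without executing them
```

Calling run_all("coop", dry_run=False) from the repository root would execute the scripts one after another.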

Results are saved in JSON format in the logs directory. To process the results (average across all folds/seeds and print), run the following commands after all experiments have finished:

cd logs
bash results.sh
Sample output:

[Figure: sample output]

Note: For multi-fold datasets, we run experiments using cross-validation and report, for each seed, the average result across folds.
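The averaging described in the note can be sketched as follows. This does not reproduce results.sh or the layout of the JSON logs, neither of which is shown here: the input structure (a mapping from seed to a list of per-fold accuracies) and the function name summarize are assumptions, and the numbers are made up.

```python
from statistics import mean

def summarize(results):
    """Average fold accuracies within each seed, then average across seeds.

    results: dict mapping seed -> list of per-fold accuracies
             (a single-element list for datasets without folds).
    Returns (per_seed_average, overall_average).
    """
    per_seed = {seed: mean(folds) for seed, folds in results.items()}
    overall = mean(per_seed.values())
    return per_seed, overall

# Made-up accuracies for three seeds and two folds, for illustration only.
example = {0: [0.80, 0.90], 1: [0.70, 0.90], 2: [0.85, 0.95]}
per_seed, overall = summarize(example)
print(per_seed, round(overall, 4))
```

Averaging within each seed first keeps the per-seed scores directly comparable between single-fold and multi-fold datasets.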


Comparison of PALM with Baselines: The accuracy scores of the baselines (ZERO-SHOT, COOP, and COCOOP) and our proposed method PALM across 11 datasets are presented. For each method (except ZERO-SHOT), experiments were performed using three different seeds. The accuracy scores for all seeds are reported, along with the average score. Bold values indicate the best average score in each row. Compared to the baselines, our proposed method achieves favorable results, with an average improvement of 5.5% over COOP and 3.1% over COCOOP. Both COOP and COCOOP are computationally expensive because they require loss gradients to flow through the text encoder. Additionally, COCOOP has a feedback loop from audio features to the input space of the text encoder, making it even more expensive. PALM, by contrast, does not require gradients to flow through the text encoder and is therefore less computationally expensive.

[Figure: accuracy scores of ZERO-SHOT, COOP, COCOOP, and PALM across 11 datasets]



Comparison of PALM with and without learnable context embeddings: Here, the ablated variant of PALM refers to the setting in which the learnable context embeddings have been removed from the feature space of the text encoder. Removing the context embeddings drastically degrades performance, highlighting their importance.

[Figure: PALM with and without learnable context embeddings]


If you find our work, this repository, or the pretrained models useful, please consider giving the repository a star ⭐ and citing our paper.

@article{hanif2024palm,
  title={PALM: Few-Shot Prompt Learning for Audio Language Models},
  author={Hanif, Asif and Agro, Maha Tufail and Qazi, Mohammad Areeb and Aldarmaki, Hanan},
  journal={arXiv preprint arXiv:2409.19806},
  year={2024}
}

Should you have any questions, please create an issue on this repository or contact us at asif.hanif@mbzuai.ac.ae.


We used PENGI for model instantiation and borrowed part of the code from COOP/COCOOP to implement the baselines. We thank the respective authors for releasing their code.

