PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, and Hanan Aldarmaki
Zero-shot inference involves matching the embedding of an audio waveform with the text-prompt embeddings of each class; the class with the highest matching score is assigned to the audio. Prompt learning, as explored by Gu et al. (2023), automates this by learning text prompts from training data in a few-shot setup. The first notable method, COOP, learns the context of text prompts in the token-embedding space. Our method, PALM, operates in the feature (output) space of the text encoder: it requires only class names at the input of the text encoder and optimizes the feature space by adding learnable context embeddings to the text feature vectors. PALM not only outperforms COOP but is also more computationally efficient, since, unlike COOP, it does not require gradients to flow through the text encoder.
Abstract
Audio-Language Models (ALMs), inspired by advances in Vision-Language Models (VLMs), have recently achieved remarkable success in zero-shot audio recognition tasks, in which features of audio waveforms are matched with class-specific text-prompt features. Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding.
TLDR: We adapt vision-language prompt learning methods for audio-language models and introduce PALM, a new method that is computationally efficient and outperforms or matches baselines in audio classification across 11 datasets.
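For intuition, the snippet below is a minimal, self-contained PyTorch sketch of this idea, not the actual PALM implementation (that lives in the `palm` folder of this repo); the encoders, dimensions, and per-class context are stand-ins chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conceptual sketch only -- NOT the actual PALM code. The encoders are random
# stand-ins, the dimensions are made up, and a per-class context is assumed
# purely for illustration.

embed_dim, num_classes, batch_size = 512, 10, 4

text_encoder = nn.Linear(300, embed_dim)     # stand-in for the frozen text encoder
audio_encoder = nn.Linear(16000, embed_dim)  # stand-in for the frozen audio encoder
for p in list(text_encoder.parameters()) + list(audio_encoder.parameters()):
    p.requires_grad_(False)                  # both encoders stay frozen

class_name_tokens = torch.randn(num_classes, 300)    # stand-in for tokenized class names
with torch.no_grad():
    text_features = text_encoder(class_name_tokens)  # (num_classes, embed_dim)

# The only learnable parameters: context embeddings added in the *feature*
# (output) space of the text encoder, so no gradients flow through the encoder.
context = nn.Parameter(torch.zeros(num_classes, embed_dim))

def logits(audio_batch):
    with torch.no_grad():
        audio_features = audio_encoder(audio_batch)   # (batch, embed_dim)
    prompts = F.normalize(text_features + context, dim=-1)
    return F.normalize(audio_features, dim=-1) @ prompts.t()  # cosine-similarity logits

# Few-shot training step: cross-entropy on the logits, updating only `context`.
optimizer = torch.optim.Adam([context], lr=1e-3)
audio_batch = torch.randn(batch_size, 16000)
labels = torch.randint(0, num_classes, (batch_size,))
loss = F.cross_entropy(logits(audio_batch), labels)
loss.backward()
optimizer.step()
```

Because only `context` receives gradients, training touches neither encoder, which is the source of the efficiency gain mentioned above.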
- Sep 20, 2024 : Accepted in EMNLP (Main) 2024 🎊 🎉
- Sep 25, 2024 : Released code for PALM
- Sep 28, 2024 : Released instructions for preparing datasets
- Create a conda environment
conda create --name palm python=3.8
conda activate palm
- Install PyTorch and other dependencies
git clone https://github.com/asif-hanif/palm
cd palm
pip install -r requirements.txt
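Optionally, you can verify the environment is set up correctly (this assumes `requirements.txt` installs PyTorch, as noted above):

```python
# Quick sanity check: PyTorch is importable and (optionally) a GPU is visible.
import torch
print(torch.__version__, torch.cuda.is_available())
```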
We have shown the efficacy of PALM and the other baselines (ZERO-SHOT, COOP, COCOOP) using the PENGI model. Download the pre-trained PENGI model using the link provided below and place the checkpoint file at the path `pengi/configs` (after cloning the repo).
Model | Link | Size |
---|---|---|
PENGI | Download | 2.2 GB |
The PENGI checkpoint can also be downloaded with the following command:
# downloads base.pth directly into pengi/configs (run from the repo root)
wget https://zenodo.org/records/8387083/files/base.pth -P pengi/configs
We have performed experiments on 11 audio classification datasets. Instructions for downloading and processing the datasets used in our experiments are provided in DATASETS.md. All of the datasets have been uploaded to the HuggingFace Datasets Hub 🤗 for easy access (a minimal loading sketch is shown after the table below). We have also provided a Jupyter Notebook to download all datasets in one go. It might take some time to download all datasets, so we recommend running the notebook on a cloud instance or a machine with a fast internet connection.
Dataset | Type | Classes | Size | Link |
---|---|---|---|---|
Beijing-Opera | Instrument Classification | 4 | 69 MB | Instructions |
CREMA-D | Emotion Recognition | 6 | 606 MB | Instructions |
ESC50 | Sound Event Classification | 50 | 881 MB | Instructions |
ESC50-Actions | Sound Event Classification | 10 | 881 MB | Instructions |
GT-Music-Genre | Music Analysis | 10 | 1.3 GB | Instructions |
NS-Instruments | Instrument Classification | 10 | 18.5 GB | Instructions |
RAVDESS | Emotion Recognition | 8 | 1.1 GB | Instructions |
SESA | Surveillance Sound Classification | 4 | 70 MB | Instructions |
TUT2017 | Acoustic Scene Classification | 15 | 12.3 GB | Instructions |
UrbanSound8K | Sound Event Classification | 10 | 6.8 GB | Instructions |
VocalSound | Vocal Sound Classification | 6 | 8.2 GB | Instructions |
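For reference, loading one of these datasets from the HuggingFace Hub looks roughly like the snippet below; the repository id is a placeholder, so use the actual ids given in DATASETS.md and the download notebook.

```python
from datasets import load_dataset

# "<hf-username>/ESC50" is a placeholder id for illustration only; the real
# dataset ids are listed in DATASETS.md / the provided notebook.
ds = load_dataset("<hf-username>/ESC50")
print(ds)  # shows the available splits and their features
```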
All datasets should be placed in a directory named `Audio-Datasets`, and the path of this directory should be specified in the `DATASET_ROOT` variable in the shell scripts. Once all datasets have been downloaded, the directory structure should look as follows:
Audio-Datasets/
├── Beijing-Opera/
├── CREMA-D/
├── ESC50/
├── ESC50-Actions/
├── GT-Music-Genre/
├── NS-Instruments/
├── RAVDESS/
├── SESA/
├── TUT2017/
├── UrbanSound8K/
├── VocalSound/
There are three main folders in this repo: `pengi`, `palm`, and `utils`. Code in the `pengi` folder is taken from the PENGI repo for model instantiation. Implementations of the baselines (`zeroshot`, `coop`, `cocoop`) and our method `palm` are in the `palm` folder. Class definitions of the audio and text encoders of the PENGI model can be found in the `palm/encoders.py` file. Training- and dataset-related code is in the `utils` folder.
We performed all experiments on an `NVIDIA A100-SXM4-40GB` GPU. Shell scripts to run the experiments can be found in the `scripts` folder.
## General Command Structure
bash <SHELL_SCRIPT> <METHOD_NAME>
The following methods (including `palm`) are supported in this repository:
- `zeroshot`
- `coop`
- `cocoop`
- `palm`

Examples of running the `palm` method on different audio classification datasets are provided below:
bash scripts/beijing_opera.sh palm
bash scripts/crema_d.sh palm
bash scripts/esc50_actions.sh palm
bash scripts/esc50.sh palm
bash scripts/gt_music_genre.sh palm
bash scripts/ns_instruments.sh palm
bash scripts/ravdess.sh palm
bash scripts/sesa.sh palm
bash scripts/tut.sh palm
bash scripts/urban_sound.sh palm
bash scripts/vocal_sound.sh palm
Results are saved in `json` format in the `logs` directory. To process the results (i.e., average across all folds/seeds and print them), run the following command after running all experiments:
cd logs
bash results.sh
Note: For multi-fold datasets, we run experiments using cross-validation and then report the average results for each seed.
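As a rough illustration of that averaging (the real logic is in `logs/results.sh`, and the actual log file layout may differ from the hypothetical one assumed here):

```python
import glob
import json
import statistics

# Hypothetical layout: one json file per (seed, fold), each with an "accuracy"
# field, e.g. logs/esc50/palm_seed0_fold1.json
per_seed = {}
for path in glob.glob("logs/esc50/palm_seed*_fold*.json"):
    seed = path.split("seed")[1].split("_")[0]
    with open(path) as f:
        per_seed.setdefault(seed, []).append(json.load(f)["accuracy"])

# Average over folds within each seed, then report per-seed and overall means.
seed_means = {seed: statistics.mean(accs) for seed, accs in per_seed.items()}
print(seed_means)
print("overall:", statistics.mean(seed_means.values()))
```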
Comparison of PALM with Baselines: The accuracy scores of the baselines (ZERO-SHOT, COOP, COCOOP) and our proposed method PALM across 11 datasets are presented. For each method except ZERO-SHOT, experiments were performed using three different seeds; the accuracy scores for all seeds are reported, along with the average score. Bold values indicate the best average score in each row. Compared to the baselines, our proposed method achieves favorable results, with an average improvement of 5.5% over COOP and 3.1% over COCOOP. Note that both COOP and COCOOP are computationally expensive, as they require loss gradients to flow through the text encoder; COCOOP additionally has a feedback loop from the audio features to the input space of the text encoder, making it even more expensive. PALM, in contrast, is considerably less computationally demanding.
Comparison of PALM† and PALM: Here, PALM† refers to the setting in which the learnable context embeddings are removed from the feature space of the text encoder. Removing the context embeddings drastically degrades performance, highlighting their importance.
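In terms of the conceptual sketch near the top of this README, PALM† corresponds to dropping the learnable `context` term, i.e., using `F.normalize(text_features, dim=-1)` directly as the prompt features.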
If you find our work, this repository, or the pre-trained models useful, please consider giving a star ⭐ and citing our work.
@article{hanif2024palm,
title={PALM: Few-Shot Prompt Learning for Audio Language Models},
author={Hanif, Asif and Agro, Maha Tufail and Qazi, Mohammad Areeb and Aldarmaki, Hanan},
journal={arXiv preprint arXiv:2409.19806},
year={2024}
}
Should you have any questions, please create an issue on this repository or contact us at asif.hanif@mbzuai.ac.ae
We used PENGI for model instantiation and borrowed part of the code from COOP/COCOOP to implement the baselines. We thank the respective authors for releasing their code.