
Semantic-Aware Confidence Calibration for Audio Captioning

Python 3.8+ · PyTorch

Overview

This project addresses a critical limitation in automated audio captioning: models produce overconfident predictions regardless of semantic accuracy. We present a framework that integrates confidence prediction into audio captioning and redefines correctness through semantic similarity.

Key Features

  • Confidence Prediction Head: A learned neural network that estimates uncertainty from decoder hidden states
  • Semantic Correctness: Caption quality defined via CLAP and FENSE embeddings rather than n-gram overlap
  • Confidence-Guided Beam Search: Decoding that jointly optimizes likelihood and predicted confidence
  • Temperature Scaling: Post-hoc calibration of output probabilities
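
To make the calibration step concrete, here is a minimal sketch of learnable temperature scaling, fit on held-out validation logits. The class name matches the component listed above, but the internals and the hypothetical fit_temperature helper are illustrative, not the notebook's exact implementation.

import torch
import torch.nn as nn

class TemperatureScaling(nn.Module):
    """Post-hoc calibration: divide logits by a single learned temperature T."""

    def __init__(self):
        super().__init__()
        self.log_temperature = nn.Parameter(torch.zeros(1))  # T = exp(0) = 1 at init

    def forward(self, logits):
        return logits / self.log_temperature.exp()  # T > 1 softens overconfident outputs

def fit_temperature(scaler, val_logits, val_labels):
    """Fit T by minimizing NLL on validation data; model weights stay frozen."""
    optimizer = torch.optim.LBFGS([scaler.log_temperature], lr=0.01, max_iter=100)
    criterion = nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = criterion(scaler(val_logits), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return scaler.log_temperature.exp().item()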

Results Summary

Metric            Greedy   Beam Search
BLEU-4            0.066    0.115
CIDEr             0.150    0.290
CLAP Similarity   0.596    0.685
CLAP ECE ↓        0.488    0.071

Our approach achieves an 85% reduction in calibration error (CLAP ECE 0.488 → 0.071) while simultaneously improving caption quality.

Project Structure

├── CS7180_Final_Project.ipynb    # Main notebook with all code
├── ConvNext-Version.ipynb        # First draft of codebase with a different encoder & decoder
├── CS7180_Final_Project.tex      # LaTeX paper
├── results/
│   ├── final_summary.json        # Complete results
│   ├── method_comparison.json    # Greedy vs Beam Search comparison
│   ├── metrics.json              # Evaluation metrics
│   ├── calibration_curves.png    # Reliability diagrams
│   └── confidence_distribution.png
└── README.md

Requirements

Hardware

  • GPU with at least 8GB VRAM (tested on NVIDIA T4/V100)
  • ~10GB disk space for models and data

Software Dependencies

# Core dependencies
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install accelerate

# Audio processing
pip install librosa
pip install soundfile

# Semantic evaluation
pip install laion-clap
pip install sentence-transformers

# Captioning metrics
pip install pycocoevalcap
pip install nltk
pip install torchmetrics

# Utilities
pip install numpy==1.26.4
pip install pandas
pip install matplotlib
pip install tqdm

Or install all at once:

pip install torch transformers accelerate librosa soundfile laion-clap sentence-transformers pycocoevalcap nltk torchmetrics numpy==1.26.4 pandas matplotlib tqdm

NLTK Data

import nltk
nltk.download('punkt')
nltk.download('wordnet')

Dataset Setup

Clotho v2

  1. Download Clotho v2 from Zenodo:

    • clotho_audio_development.7z
    • clotho_audio_validation.7z
    • clotho_audio_evaluation.7z
    • clotho_captions_development.csv
    • clotho_captions_validation.csv
    • clotho_captions_evaluation.csv
  2. Extract and organize:

clotho/
├── development/          # 3,839 audio files
├── validation/           # 1,045 audio files
├── evaluation/           # 1,045 audio files
├── clotho_captions_development.csv
├── clotho_captions_validation.csv
└── clotho_captions_evaluation.csv
  3. Update the path in the notebook:
class Config:
    CLOTHO_ROOT = "/path/to/your/clotho"  # Update this

Running the Code

Option 1: Google Colab (Recommended)

  1. Upload CS7180_Final_Project.ipynb to Google Colab
  2. Mount Google Drive containing the Clotho dataset:
from google.colab import drive
drive.mount('/content/drive')
  3. Update CLOTHO_ROOT to point to your Drive location
  4. Run cells sequentially (restart the runtime after the installation cell)

Option 2: Local Execution

# Clone/download the project
cd CS7180_Final_Project

# Install dependencies (see Software Dependencies above)
pip install torch transformers accelerate librosa soundfile laion-clap sentence-transformers pycocoevalcap nltk torchmetrics numpy==1.26.4 pandas matplotlib tqdm

# Run as Jupyter notebook
jupyter notebook CS7180_Final_Project.ipynb

# Or convert to Python script and run
jupyter nbconvert --to script CS7180_Final_Project.ipynb
python CS7180_Final_Project.py

Code Structure

Our primary notebook, CS7180_Final_Project.ipynb, is organized into five parts. We've also included the first draft of our pipeline, ConvNext-Version.ipynb, which uses a different encoder and decoder; because it is not relevant to the final results in our paper, the draft notebook has sparser documentation and weaker metrics.

Part 1: Setup and Data Loading

  • Configuration class with hyperparameters
  • Clotho dataset loader with audio preprocessing (see the sketch after this list)
  • Semantic evaluators (CLAP and FENSE)
  • DataLoader construction
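
For reference, a minimal sketch of the audio preprocessing a loader like this typically performs: resample to 16 kHz mono and pad/trim to 30 seconds, per the SAMPLE_RATE and MAX_AUDIO_LENGTH values in the Config section below. The function name load_audio_file appears in the Example Usage section; this body is an assumption, not the notebook's exact code.

import librosa
import numpy as np

SAMPLE_RATE = 16000      # Whisper models expect 16 kHz mono input
MAX_AUDIO_LENGTH = 30    # seconds, per Config below

def load_audio_file(path):
    """Load a Clotho clip, resample, and pad/trim to a fixed length."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    target_len = SAMPLE_RATE * MAX_AUDIO_LENGTH
    if len(audio) < target_len:
        audio = np.pad(audio, (0, target_len - len(audio)))
    return audio[:target_len]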

Part 2: Model Architecture

  • WhisperForAudioCaptioning: Base Whisper model with style prefix support
  • ConfidencePredictionHead: 3-layer MLP for uncertainty estimation (sketched after this list)
  • TemperatureScaling: Learnable calibration parameter
  • WhisperWithConfidence: Combined model wrapper
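
The confidence head is described above as a 3-layer MLP over decoder hidden states; here is a sketch consistent with that description. The layer widths and the mean-pooling over the sequence are assumptions (768 is the whisper-small hidden size).

import torch.nn as nn

class ConfidencePredictionHead(nn.Module):
    """3-layer MLP mapping decoder hidden states to a confidence in [0, 1]."""

    def __init__(self, hidden_size=768, mlp_dim=256):  # widths are assumptions
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim), nn.ReLU(),
            nn.Linear(mlp_dim, mlp_dim), nn.ReLU(),
            nn.Linear(mlp_dim, 1), nn.Sigmoid(),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the Whisper decoder;
        # mean-pool over the sequence before predicting a scalar confidence.
        return self.mlp(hidden_states.mean(dim=1)).squeeze(-1)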

Part 3: Training

  • CaptioningLoss: Combined cross-entropy and confidence loss (sketched after this list)
  • Trainer: Training loop with semantic supervision
  • TemperatureCalibrator: Post-hoc temperature optimization
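
A sketch of how the combined objective might look, assuming a BCE confidence term against a binary semantic-correctness target (CLAP similarity ≥ τ) weighted by λ = CONFIDENCE_WEIGHT from the Config section; the notebook's exact target construction may differ.

import torch.nn as nn

class CaptioningLoss(nn.Module):
    """L = CE(logits, tokens) + λ · BCE(predicted confidence, correctness)."""

    def __init__(self, confidence_weight=0.15):  # λ, per Config below
        super().__init__()
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)  # skip padded label positions
        self.bce = nn.BCELoss()
        self.confidence_weight = confidence_weight

    def forward(self, logits, labels, confidence, semantic_correct):
        # semantic_correct: 1.0 if CLAP similarity ≥ τ (SEMANTIC_THRESHOLD), else 0.0
        ce_loss = self.ce(logits.view(-1, logits.size(-1)), labels.view(-1))
        conf_loss = self.bce(confidence, semantic_correct.float())
        return ce_loss + self.confidence_weight * conf_loss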

Part 4: Inference

  • BeamCandidate: Data structure for beam search hypotheses
  • ConfidenceGuidedBeamSearch: Beam search with confidence reranking (scoring sketched after this list)
  • CaptionGenerator: High-level generation interface
  • greedy_decode: Baseline greedy decoding
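
A sketch of the confidence-guided reranking step: finished beams are scored by length-normalized log-likelihood plus β times predicted confidence, with β = CONFIDENCE_RERANK_WEIGHT from the Config section. BeamCandidate is named in the list above; its fields, the normalization, and the helper functions here are assumptions.

from dataclasses import dataclass, field

@dataclass
class BeamCandidate:
    tokens: list = field(default_factory=list)
    log_prob: float = 0.0     # cumulative token log-likelihood
    confidence: float = 0.0   # predicted by the confidence head

def rerank_score(beam, beta=0.3):
    """score = length-normalized log-likelihood + β · predicted confidence."""
    avg_log_prob = beam.log_prob / max(len(beam.tokens), 1)
    return avg_log_prob + beta * beam.confidence

def pick_best(beams, beta=0.3):
    """Return the finished candidate with the highest combined score."""
    return max(beams, key=lambda b: rerank_score(b, beta))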

Part 5: Evaluation

  • CaptioningMetrics: BLEU, CIDEr, METEOR, SPICE
  • CalibrationMetrics: ECE, MCE, Brier score
  • ComprehensiveEvaluator: Full evaluation pipeline with visualizations

Configuration

Key hyperparameters in Config class:

# Model
MODEL_NAME = "MU-NLPC/whisper-small-audio-captioning"

# Training
BATCH_SIZE = 16
LR = 1e-4
NUM_EPOCHS = 5
GRADIENT_ACCUMULATION_STEPS = 2

# Confidence
CONFIDENCE_WEIGHT = 0.15      # λ in loss function
SEMANTIC_THRESHOLD = 0.6      # τ for correctness

# Beam Search
BEAM_SIZE = 5
CONFIDENCE_RERANK_WEIGHT = 0.3  # β in scoring

# Audio
SAMPLE_RATE = 16000
MAX_AUDIO_LENGTH = 30  # seconds

Evaluation Metrics

Caption Quality

  • BLEU-1/2/3/4: N-gram precision
  • CIDEr: TF-IDF weighted n-gram similarity
  • METEOR: Alignment-based metric with synonyms
  • SPICE: Scene graph semantic similarity

Semantic Similarity

  • CLAP Similarity: Cosine similarity in LAION-CLAP embedding space (see the sketch after this list)
  • FENSE Similarity: Sentence transformer (all-MiniLM-L6-v2) similarity
  • CLAP/FENSE Accuracy: Proportion exceeding threshold (0.6)
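
As a reference point, CLAP similarity here means cosine similarity between LAION-CLAP audio and text embeddings. A sketch using the laion_clap package follows; the clap_similarity helper is hypothetical, and the API calls should be verified against your installed version.

import numpy as np
import laion_clap

clap = laion_clap.CLAP_Module(enable_fusion=False)  # downloads weights on first load
clap.load_ckpt()

def clap_similarity(audio_path, caption):
    """Cosine similarity between CLAP audio and text embeddings."""
    a = clap.get_audio_embedding_from_filelist(x=[audio_path])[0]
    t = clap.get_text_embedding([caption])[0]
    return float(np.dot(a, t) / (np.linalg.norm(a) * np.linalg.norm(t)))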

Calibration

  • ECE: Expected Calibration Error (lower is better; sketched after this list)
  • Brier Score: Mean squared error of confidence vs. correctness
  • Reliability Diagrams: Visual calibration assessment
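
A minimal sketch of ECE over (confidence, correctness) pairs with 10 equal-width bins; the function name and the notebook's exact binning are assumptions, but the formula is the standard one.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = Σ_b (|B_b| / N) · |mean correctness − mean confidence| over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece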

Example Usage

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load trained model
model, processor = build_model_and_processor(config)
model.load_state_dict(torch.load("checkpoints/final_calibrated.pt")['model_state_dict'])
model = model.to(device).eval()

# Initialize generator
generator = CaptionGenerator(model, processor, config, use_beam_search=True)

# Generate caption for single audio
audio = load_audio_file("path/to/audio.wav")
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
result = generator.generate(input_features.to(device))

print(f"Caption: {result['caption']}")
print(f"Confidence: {result['confidence']:.3f}")

Citation

If you use this code in your research, please cite:

@article{dunker2025semantic,
  title={Semantic-Aware Confidence Calibration for Automated Audio Captioning},
  author={Dunker, Lucas and Menta, Sai Akshay and Addepalli, Snigdha Mohana and Garapati, Venkata Krishna Rayalu},
  journal={CS7180 Final Project, Northeastern University},
  year={2025}
}

Acknowledgments

Contact

For questions or issues, please contact:
