This project addresses a critical limitation in automated audio captioning: models produce overconfident predictions regardless of semantic accuracy. We present a framework that integrates confidence prediction into audio captioning and redefines correctness through semantic similarity.
- Confidence Prediction Head: A learned neural network that estimates uncertainty from decoder hidden states
- Semantic Correctness: Caption quality defined via CLAP and FENSE embeddings rather than n-gram overlap
- Confidence-Guided Beam Search: Decoding that jointly optimizes likelihood and predicted confidence
- Temperature Scaling: Post-hoc calibration of output probabilities
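The temperature-scaling component above can be sketched as fitting a single scalar T on held-out logits by minimizing NLL; the function name, optimizer, and step counts here are illustrative, not the notebook's exact implementation:

```python
import torch

def fit_temperature(logits, labels, steps=500, lr=0.05):
    """Illustrative post-hoc temperature scaling: optimize log(T) so that
    cross-entropy of (logits / T) against labels is minimized."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()
```

At inference, calibrated probabilities are `softmax(logits / T)`; only confidence changes, never the argmax prediction.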
| Metric | Greedy | Beam Search |
|---|---|---|
| BLEU-4 | 0.066 | 0.115 |
| CIDEr | 0.150 | 0.290 |
| CLAP Similarity | 0.596 | 0.685 |
| CLAP ECE ↓ | 0.488 | 0.071 |
Our approach achieves an 85% reduction in calibration error (CLAP ECE 0.488 → 0.071) while simultaneously improving caption quality.
```
├── CS7180_Final_Project.ipynb   # Main notebook with all code
├── ConvNext-Version.ipynb       # First draft of codebase with a different encoder & decoder
├── CS7180_Final_Project.tex     # LaTeX paper
├── results/
│   ├── final_summary.json           # Complete results
│   ├── method_comparison.json       # Greedy vs. beam search comparison
│   ├── metrics.json                 # Evaluation metrics
│   ├── calibration_curves.png       # Reliability diagrams
│   └── confidence_distribution.png
└── README.md
```
- GPU with at least 8GB VRAM (tested on NVIDIA T4/V100)
- ~10GB disk space for models and data
```bash
# Core dependencies
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install accelerate

# Audio processing
pip install librosa
pip install soundfile

# Semantic evaluation
pip install laion-clap
pip install sentence-transformers

# Captioning metrics
pip install pycocoevalcap
pip install nltk
pip install torchmetrics

# Utilities
pip install numpy==1.26.4
pip install pandas
pip install matplotlib
pip install tqdm
```

Or install all at once:

```bash
pip install torch transformers accelerate librosa soundfile laion-clap sentence-transformers pycocoevalcap nltk torchmetrics numpy==1.26.4 pandas matplotlib tqdm
```

Then download the NLTK data required by the captioning metrics:

```python
import nltk
nltk.download('punkt')
nltk.download('wordnet')
```
- Download Clotho v2 from Zenodo:
  - clotho_audio_development.7z
  - clotho_audio_validation.7z
  - clotho_audio_evaluation.7z
  - clotho_captions_development.csv
  - clotho_captions_validation.csv
  - clotho_captions_evaluation.csv
- Extract and organize:

```
clotho/
├── development/   # 3,839 audio files
├── validation/    # 1,045 audio files
├── evaluation/    # 1,045 audio files
├── clotho_captions_development.csv
├── clotho_captions_validation.csv
└── clotho_captions_evaluation.csv
```
- Update the path in the notebook:

```python
class Config:
    CLOTHO_ROOT = "/path/to/your/clotho"  # Update this
```

To run on Google Colab:

- Upload `CS7180_Final_Project.ipynb` to Google Colab
- Mount Google Drive containing the Clotho dataset:

```python
from google.colab import drive
drive.mount('/content/drive')
```

- Update `CLOTHO_ROOT` to point to your Drive location
- Run cells sequentially (restart the runtime after the installation cell)
To run locally:

```bash
# Clone/download the project
cd CS7180_Final_Project

# Install dependencies
pip install -r requirements.txt

# Run as a Jupyter notebook
jupyter notebook CS7180_Final_Project.ipynb

# Or convert to a Python script and run
jupyter nbconvert --to script CS7180_Final_Project.ipynb
python CS7180_Final_Project.py
```

Our primary notebook, `CS7180_Final_Project.ipynb`, is organized into five parts. We have also included the first draft of our pipeline, `ConvNext-Version.ipynb`, which uses a different encoder and decoder. Since it is not relevant to the final results in our paper, the draft notebook has sparser documentation and weaker metrics.
- Configuration class with hyperparameters
- Clotho dataset loader with audio preprocessing
- Semantic evaluators (CLAP and FENSE)
- DataLoader construction
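The audio preprocessing in the loader can be sketched as loading each clip at the configured sample rate and padding or trimming it to a fixed length; `pad_or_trim` is an illustrative helper name, and in the notebook the waveform would come from `librosa.load(path, sr=sample_rate)`:

```python
import numpy as np

def pad_or_trim(audio, sample_rate=16000, max_seconds=30):
    """Pad with zeros or trim so every clip is exactly max_seconds long,
    matching SAMPLE_RATE / MAX_AUDIO_LENGTH in the config."""
    target = sample_rate * max_seconds
    if len(audio) >= target:
        return audio[:target]
    return np.pad(audio, (0, target - len(audio)))
```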
- `WhisperForAudioCaptioning`: Base Whisper model with style prefix support
- `ConfidencePredictionHead`: 3-layer MLP for uncertainty estimation
- `TemperatureScaling`: Learnable calibration parameter
- `WhisperWithConfidence`: Combined model wrapper
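The confidence prediction head can be sketched as a small MLP over decoder hidden states with a sigmoid output; the hidden sizes and class name here are assumptions for illustration, not the notebook's exact architecture:

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Illustrative 3-layer MLP mapping a decoder hidden state to a
    scalar confidence in [0, 1]; layer widths are assumptions."""
    def __init__(self, d_model=768, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden // 2), nn.ReLU(),
            nn.Linear(d_hidden // 2, 1),
        )

    def forward(self, hidden):
        # hidden: (batch, d_model) -> confidence: (batch,)
        return torch.sigmoid(self.net(hidden)).squeeze(-1)
```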
- `CaptioningLoss`: Combined cross-entropy and confidence loss
- `Trainer`: Training loop with semantic supervision
- `TemperatureCalibrator`: Post-hoc temperature optimization
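A minimal sketch of the combined objective: token-level cross-entropy plus a λ-weighted confidence term, where λ corresponds to `CONFIDENCE_WEIGHT` in the config. The BCE form of the confidence term and the tensor shapes are assumptions; the notebook's loss may differ in detail:

```python
import torch
import torch.nn.functional as F

def captioning_loss(logits, targets, confidence, is_correct, lam=0.15):
    """Cross-entropy on tokens plus a BCE term pushing predicted
    confidence toward semantic correctness (illustrative sketch).
    logits: (batch, seq, vocab); targets: (batch, seq);
    confidence, is_correct: (batch,)."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    conf = F.binary_cross_entropy(confidence, is_correct.float())
    return ce + lam * conf
```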
- `BeamCandidate`: Data structure for beam search hypotheses
- `ConfidenceGuidedBeamSearch`: Beam search with confidence reranking
- `CaptionGenerator`: High-level generation interface
- `greedy_decode`: Baseline greedy decoding
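The reranking step can be sketched as a weighted sum of length-normalized log-likelihood and predicted confidence, with β playing the role of `CONFIDENCE_RERANK_WEIGHT`; the exact scoring in the notebook may differ:

```python
def rerank(candidates, beta=0.3):
    """candidates: list of (tokens, sum_logprob, confidence) tuples.
    Score = length-normalized log-prob + beta * confidence; sketch only."""
    def score(cand):
        tokens, logprob, conf = cand
        return logprob / max(len(tokens), 1) + beta * conf
    return sorted(candidates, key=score, reverse=True)
```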
- `CaptioningMetrics`: BLEU, CIDEr, METEOR, SPICE
- `CalibrationMetrics`: ECE, MCE, Brier score
- `ComprehensiveEvaluator`: Full evaluation pipeline with visualizations
Key hyperparameters in the `Config` class:

```python
# Model
MODEL_NAME = "MU-NLPC/whisper-small-audio-captioning"

# Training
BATCH_SIZE = 16
LR = 1e-4
NUM_EPOCHS = 5
GRADIENT_ACCUMULATION_STEPS = 2

# Confidence
CONFIDENCE_WEIGHT = 0.15        # λ in loss function
SEMANTIC_THRESHOLD = 0.6        # τ for correctness

# Beam Search
BEAM_SIZE = 5
CONFIDENCE_RERANK_WEIGHT = 0.3  # β in scoring

# Audio
SAMPLE_RATE = 16000
MAX_AUDIO_LENGTH = 30           # seconds
```

- BLEU-1/2/3/4: N-gram precision
- CIDEr: TF-IDF weighted n-gram similarity
- METEOR: Alignment-based metric with synonyms
- SPICE: Scene graph semantic similarity
- CLAP Similarity: Cosine similarity in LAION-CLAP embedding space
- FENSE Similarity: Sentence transformer (all-MiniLM-L6-v2) similarity
- CLAP/FENSE Accuracy: Proportion exceeding threshold (0.6)
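The threshold rule behind CLAP/FENSE accuracy can be sketched as plain cosine similarity over embeddings, with τ matching `SEMANTIC_THRESHOLD`; the embeddings themselves would come from LAION-CLAP or the sentence transformer, and `semantic_correct` is an illustrative helper name:

```python
import numpy as np

def semantic_correct(emb_pred, emb_ref, tau=0.6):
    """A caption counts as correct when the cosine similarity between
    its embedding and a reference embedding meets the threshold tau."""
    cos = np.dot(emb_pred, emb_ref) / (
        np.linalg.norm(emb_pred) * np.linalg.norm(emb_ref))
    return cos >= tau, float(cos)
```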
- ECE: Expected Calibration Error (lower is better)
- Brier Score: Mean squared error of confidence vs. correctness
- Reliability Diagrams: Visual calibration assessment
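For reference, ECE bins predictions by confidence and averages the gap between mean confidence and accuracy across bins; a minimal numpy sketch (the notebook's implementation may differ in binning details):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin fraction) * |mean confidence - accuracy|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```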
```python
# Load trained model
model, processor = build_model_and_processor(config)
model.load_state_dict(torch.load("checkpoints/final_calibrated.pt")['model_state_dict'])

# Initialize generator
generator = CaptionGenerator(model, processor, config, use_beam_search=True)

# Generate caption for single audio
audio = load_audio_file("path/to/audio.wav")
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
result = generator.generate(input_features.to(device))

print(f"Caption: {result['caption']}")
print(f"Confidence: {result['confidence']:.3f}")
```

If you use this code in your research, please cite:
```bibtex
@article{dunker2024semantic,
  title={Semantic-Aware Confidence Calibration for Automated Audio Captioning},
  author={Dunker, Lucas and Menta, Sai Akshay and Addepalli, Snigdha Mohana and Garapati, Venkata Krishna Rayalu},
  journal={CS7180 Final Project, Northeastern University},
  year={2025}
}
```

- Base model: MU-NLPC/whisper-small-audio-captioning
- Dataset: Clotho v2
- CLAP: LAION-CLAP
- Sentence Transformers: all-MiniLM-L6-v2
For questions or issues, please contact: