Hi, thank you for the excellent work on MokA! I've been trying to reproduce the results reported in your paper, but I'm encountering a significant gap between my results and the reported numbers. I would greatly appreciate your help in identifying what might be causing this discrepancy.
Environment Setup
- Python: 3.9
- PyTorch: 2.1.0
- Transformers: 4.37.2
- DeepSpeed: 0.12.6
- Hardware: 6× A100 80GB GPUs
Pre-trained Weights Used
I used the pre-trained projector weights provided in your repository:
- ✅ Audio projector: pre-trained/av_unified/audio-pretrain/non_lora_trainables.bin
- ✅ Visual projector: pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin
- ✅ LLaMA-2-7B-Chat-HF
- ✅ CLIP-ViT-L/14
- ✅ BEATs (Fine-tuned BEATs_iter3+ AS2M)
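For completeness, this is roughly how I restore the released projector weights before fine-tuning. It is a minimal sketch following the LLaVA-style checkpoint convention; the key prefix and sub-module names below are my own assumptions, not taken from the MokA code:

```python
import torch

# Minimal sketch of how I load the released projector weights (assumed
# LLaVA-style key layout; adjust the prefix to MokA's actual checkpoint keys).
def load_projector_weights(projector, ckpt_path, prefix):
    state = torch.load(ckpt_path, map_location="cpu")
    # Keep only tensors belonging to this projector and strip the key prefix.
    sub_state = {k[len(prefix) + 1:]: v
                 for k, v in state.items() if k.startswith(prefix + ".")}
    missing, unexpected = projector.load_state_dict(sub_state, strict=False)
    print(f"{ckpt_path}: {len(sub_state)} tensors loaded, "
          f"{len(missing)} missing, {len(unexpected)} unexpected")

# Hypothetical usage (attribute and prefix names are assumptions on my part):
# load_projector_weights(model.get_model().mm_projector,
#     "pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin",
#     "model.mm_projector")
```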
Training Configuration
I followed the configuration in scripts/finetune/ft.sh:
| Hyperparameter | My Setting | From ft.sh |
|---|---|---|
| LLM Backbone | LLaMA-2-7B-Chat | LLaMA-2-7B-Chat |
| LoRA Rank | 444 (4×3 modalities) | 444 |
| LoRA Alpha | 16 | 16 |
| LoRA Dropout | 0.05 | 0.05 |
| Learning Rate | 1e-4 | 1e-4 |
| Weight Decay | 0.0 | 0.0 |
| Warmup Ratio | 0.03 | 0.03 |
| LR Scheduler | Cosine | Cosine |
| Epochs | 3 | 3 |
| Per-device Batch Size | 4 | 4 |
| Gradient Accumulation | 1 | 1 |
| GPUs | 6 | 16 (mentioned in README) |
| Global Batch Size | 24 | 64 (16 GPUs × 4) |
| BF16 | False | True |
| Visual Query Tokens | 32 | 32 |
| Audio Query Tokens | 32 | 32 |
| Video Frames | 10 | 10 |
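For context, here is the effective global batch size arithmetic for my setup versus the one implied by the README, plus the gradient-accumulation value that would roughly compensate (numbers only; I have not re-run with it yet):

```python
import math

per_device_batch = 4            # per-device batch size from ft.sh
readme_gpus, my_gpus = 16, 6    # GPUs per the README vs. my machine

readme_global = readme_gpus * per_device_batch * 1   # 16 * 4 * 1 = 64
my_global = my_gpus * per_device_batch * 1           # 6 * 4 * 1 = 24

# Gradient accumulation steps needed to roughly match the README's global batch:
accum = math.ceil(readme_global / (my_gpus * per_device_batch))  # ceil(64/24) = 3 -> 72
print(readme_global, my_global, accum)
```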
My Reproduction Results
MUSIC-AVQA Dataset
| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Overall Accuracy | 70.23% | 75.71% | -5.48% |
Evaluated on 9,185 test samples.
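For transparency, the MUSIC-AVQA number above is a plain exact-match accuracy over my prediction dump; the field names below come from my own output format, not from the repo's evaluation script:

```python
import json

def musicavqa_accuracy(pred_path):
    """Exact-match accuracy over MUSIC-AVQA test predictions (one JSON object per line)."""
    with open(pred_path) as f:
        preds = [json.loads(line) for line in f]
    correct = sum(p["prediction"].strip().lower() == p["answer"].strip().lower()
                  for p in preds)
    return correct / len(preds)

# print(musicavqa_accuracy("musicavqa_test_predictions.jsonl"))  # 9,185 samples
```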
AVE Dataset
| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Event Classification | 94.78% | - | - |
| Temporal Localization (±1s) | 68.16% | - | - |
| Joint Accuracy | 64.68% | 74.68% | -10.00% |
Evaluated on all 402 test samples.
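Since the metric definition is exactly what I am unsure about (see the questions below), here is how I compute the three AVE numbers above. This is my own sketch with hypothetical field names; "joint accuracy" requires both the event class and the ±1 s localization to be correct on the same sample:

```python
def ave_metrics(samples):
    """samples: dicts with predicted/ground-truth event labels and boundaries in seconds."""
    event_ok = temporal_ok = joint_ok = 0
    for s in samples:
        cls_correct = s["pred_event"] == s["gt_event"]
        # Localization counts as correct if both boundaries fall within 1 s of the GT.
        loc_correct = (abs(s["pred_start"] - s["gt_start"]) <= 1.0
                       and abs(s["pred_end"] - s["gt_end"]) <= 1.0)
        event_ok += cls_correct
        temporal_ok += loc_correct
        joint_ok += cls_correct and loc_correct
    n = len(samples)
    return event_ok / n, temporal_ok / n, joint_ok / n
```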
Questions
- Global Batch Size: The README mentions using 16 A100 GPUs for fine-tuning, which would give a global batch size of 64 (16 × 4). I only have 6 GPUs available, resulting in a global batch size of 24. Could this difference significantly impact the final performance?
- AVE Evaluation Metric: The paper reports a single number (74.68%) for AVE. Is this the joint accuracy (both event classification and temporal localization correct), or is it a different metric?
- Fine-tuned Checkpoints: Would it be possible to release the fine-tuned model checkpoints for MUSIC-AVQA and AVE? This would help verify whether the gap is due to training configuration differences or evaluation methodology.
- Data Splits: Are you using the official train/test splits for MUSIC-AVQA and AVE? I'm using:
  - MUSIC-AVQA: 9,185 test samples
  - AVE: 402 test samples
- Any Other Critical Settings: Are there any other hyperparameters or settings not mentioned in the scripts that might be crucial for reproduction?
Thank you for your time and assistance!