
[Reproduction] Unable to reproduce reported results on MUSIC-AVQA and AVE datasets #9

@ydchen0806

Description


Hi, thank you for the excellent work on MokA! I've been trying to reproduce the results reported in your paper, but I'm encountering a significant gap between my results and the reported numbers. I would greatly appreciate your help in identifying what might be causing this discrepancy.

Environment Setup

  • Python: 3.9
  • PyTorch: 2.1.0
  • Transformers: 4.37.2
  • DeepSpeed: 0.12.6
  • Hardware: 6× A100 80GB GPUs

Pre-trained Weights Used

I used the pre-trained projector weights provided in your repository:

  • ✅ Audio projector: pre-trained/av_unified/audio-pretrain/non_lora_trainables.bin
  • ✅ Visual projector: pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin
  • ✅ LLaMA-2-7B-Chat-HF
  • ✅ CLIP-ViT-L/14
  • ✅ BEATs (Fine-tuned BEATs_iter3+ AS2M)
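For reference, this is roughly how I load the provided projector weights before fine-tuning. This is a minimal sketch of my own setup, not the repo's official loading path: `model` stands for the assembled MokA model built by the training code, and the non-strict merge/load below is my choice.

```python
import torch

# Paths from the list above.
audio_proj = torch.load(
    "pre-trained/av_unified/audio-pretrain/non_lora_trainables.bin",
    map_location="cpu",
)
visual_proj = torch.load(
    "pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin",
    map_location="cpu",
)

# Merge the two projector state dicts and load them non-strictly so that any
# missing/unexpected keys are reported rather than silently dropped.
merged = {**audio_proj, **visual_proj}
missing, unexpected = model.load_state_dict(merged, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```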

Training Configuration

I followed the configuration in scripts/finetune/ft.sh:

| Hyperparameter | My Setting | From ft.sh |
|---|---|---|
| LLM Backbone | LLaMA-2-7B-Chat | LLaMA-2-7B-Chat |
| LoRA Rank | 444 (4×3 modalities) | 444 |
| LoRA Alpha | 16 | 16 |
| LoRA Dropout | 0.05 | 0.05 |
| Learning Rate | 1e-4 | 1e-4 |
| Weight Decay | 0.0 | 0.0 |
| Warmup Ratio | 0.03 | 0.03 |
| LR Scheduler | Cosine | Cosine |
| Epochs | 3 | 3 |
| Per-device Batch Size | 4 | 4 |
| Gradient Accumulation | 1 | 1 |
| GPUs | 6 | 16 (mentioned in README) |
| Global Batch Size | 24 | 64 (16 GPUs × 4) |
| BF16 | False | True |
| Visual Query Tokens | 32 | 32 |
| Audio Query Tokens | 32 | 32 |
| Video Frames | 10 | 10 |
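To make the batch-size gap in the table concrete, this is the arithmetic I'm using; the gradient-accumulation value at the end is just my own back-of-the-envelope workaround, not something taken from the repo's scripts.

```python
# Effective global batch size = num_gpus * per_device_batch * grad_accum_steps
reference = 16 * 4 * 1   # README setup: 16 GPUs -> 64
mine      = 6 * 4 * 1    # my setup:     6 GPUs -> 24

# Closest I can get to 64 on 6 GPUs without changing the per-device batch:
# grad_accum = 3 gives 6 * 4 * 3 = 72, slightly above the reference 64.
candidate = 6 * 4 * 3

print(reference, mine, candidate)  # 64 24 72
```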

My Reproduction Results

MUSIC-AVQA Dataset

| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Overall Accuracy | 70.23% | 75.71% | -5.48% |

Evaluation on 9185 test samples.
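For transparency on the evaluation side, this is roughly how I score MUSIC-AVQA answers: my own exact-match scoring after light normalization. I'm not certain this matches the paper's protocol, which is part of what I'd like to confirm.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def musicavqa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy over the test split (9185 samples in my run)."""
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)
```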

AVE Dataset

| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Event Classification | 94.78% | - | - |
| Temporal Localization (±1s) | 68.16% | - | - |
| Joint Accuracy | 64.68% | 74.68% | -10.00% |

Evaluation on 402/402 test samples.
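Related to question 2 below, this is how I currently compute the joint accuracy. The definition is my own (a sample counts only if the event class is correct and both predicted boundaries fall within ±1 s of the ground truth), and the field names are hypothetical.

```python
def ave_joint_accuracy(samples: list[dict], tolerance: float = 1.0) -> float:
    """Fraction of samples where the event class is correct AND the predicted
    start/end are within `tolerance` seconds of the ground-truth boundaries.

    Each sample is a dict with keys: pred_event, gt_event,
    pred_start, pred_end, gt_start, gt_end (times in seconds).
    """
    correct = 0
    for s in samples:
        event_ok = s["pred_event"] == s["gt_event"]
        temporal_ok = (abs(s["pred_start"] - s["gt_start"]) <= tolerance
                       and abs(s["pred_end"] - s["gt_end"]) <= tolerance)
        if event_ok and temporal_ok:
            correct += 1
    return correct / len(samples)
```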

Questions

  1. Global Batch Size: The README mentions using 16 A100 GPUs for fine-tuning, which would give a global batch size of 64 (16 × 4). I only have 6 GPUs available, resulting in a global batch size of 24. Could this difference significantly impact the final performance?

  2. AVE Evaluation Metric: The paper reports a single number (74.68%) for AVE. Is this the joint accuracy (both event classification and temporal localization correct)? Or is it a different metric?

  3. Fine-tuned Checkpoints: Would it be possible to release the fine-tuned model checkpoints for MUSIC-AVQA and AVE? This would help verify whether the gap is due to training configuration differences or evaluation methodology.

  4. Data Splits: Are you using the official train/test splits for MUSIC-AVQA and AVE? I'm using:

    • MUSIC-AVQA: 9185 test samples
    • AVE: 402 test samples
  5. Any Other Critical Settings: Are there any other hyperparameters or settings not mentioned in the scripts that might be crucial for reproduction?

Thank you for your time and assistance!
