Hi, thank you for the excellent work on MokA! I've been trying to reproduce the results reported in your paper, but I'm encountering a significant gap between my results and the reported numbers. I would greatly appreciate your help in identifying what might be causing this discrepancy.
Environment Setup
- Python: 3.9
- PyTorch: 2.1.0
- Transformers: 4.37.2
- DeepSpeed: 0.12.6
- Hardware: 6× A100 80GB GPUs
Pre-trained Weights Used
I used the pre-trained projector weights provided in your repository:
- ✅ Audio projector: pre-trained/av_unified/audio-pretrain/non_lora_trainables.bin
- ✅ Visual projector: pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin
- ✅ LLaMA-2-7B-Chat-HF
- ✅ CLIP-ViT-L/14
- ✅ BEATs (Fine-tuned BEATs_iter3+ AS2M)
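For completeness, this is roughly how I restore the released projector weights before fine-tuning. It is a minimal sketch following the LLaVA-style checkpoint convention; the key prefix and sub-module names below are my own assumptions, not taken from the MokA code:

```python
import torch

# Minimal sketch of how I load the released projector weights (assumed
# LLaVA-style key layout; adjust the prefix to MokA's actual checkpoint keys).
def load_projector_weights(projector, ckpt_path, prefix):
    state = torch.load(ckpt_path, map_location="cpu")
    # Keep only tensors belonging to this projector and strip the key prefix.
    sub_state = {k[len(prefix) + 1:]: v
                 for k, v in state.items() if k.startswith(prefix + ".")}
    missing, unexpected = projector.load_state_dict(sub_state, strict=False)
    print(f"{ckpt_path}: {len(sub_state)} tensors loaded, "
          f"{len(missing)} missing, {len(unexpected)} unexpected")

# Hypothetical usage (attribute and prefix names are assumptions on my part):
# load_projector_weights(model.get_model().mm_projector,
#     "pre-trained/av_unified/visual-pretrain/non_lora_trainables.bin",
#     "model.mm_projector")
```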
Training Configuration
I followed the configuration in scripts/finetune/ft.sh:
| Hyperparameter | My Setting | From ft.sh |
|---|---|---|
| LLM Backbone | LLaMA-2-7B-Chat | LLaMA-2-7B-Chat |
| LoRA Rank | 444 (4×3 modalities) | 444 |
| LoRA Alpha | 16 | 16 |
| LoRA Dropout | 0.05 | 0.05 |
| Learning Rate | 1e-4 | 1e-4 |
| Weight Decay | 0.0 | 0.0 |
| Warmup Ratio | 0.03 | 0.03 |
| LR Scheduler | Cosine | Cosine |
| Epochs | 3 | 3 |
| Per-device Batch Size | 4 | 4 |
| Gradient Accumulation | 1 | 1 |
| GPUs | 6 | 16 (mentioned in README) |
| Global Batch Size | 24 | 64 (16 GPUs × 4) |
| BF16 | False | True |
| Visual Query Tokens | 32 | 32 |
| Audio Query Tokens | 32 | 32 |
| Video Frames | 10 | 10 |
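For context, here is the effective global batch size arithmetic for my setup versus the one implied by the README, plus the gradient-accumulation value that would roughly compensate (numbers only; I have not re-run with it yet):

```python
import math

per_device_batch = 4            # per-device batch size from ft.sh
readme_gpus, my_gpus = 16, 6    # GPUs per the README vs. my machine

readme_global = readme_gpus * per_device_batch * 1   # 16 * 4 * 1 = 64
my_global = my_gpus * per_device_batch * 1           # 6 * 4 * 1 = 24

# Gradient accumulation steps needed to roughly match the README's global batch:
accum = math.ceil(readme_global / (my_gpus * per_device_batch))  # ceil(64/24) = 3 -> 72
print(readme_global, my_global, accum)
```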
My Reproduction Results
MUSIC-AVQA Dataset
| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Overall Accuracy | 70.23% | 75.71% | -5.48% |
Evaluated on 9,185 test samples.
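For transparency, the MUSIC-AVQA number above is a plain exact-match accuracy over my prediction dump; the field names below come from my own output format, not from the repo's evaluation script:

```python
import json

def musicavqa_accuracy(pred_path):
    """Exact-match accuracy over MUSIC-AVQA test predictions (one JSON object per line)."""
    with open(pred_path) as f:
        preds = [json.loads(line) for line in f]
    correct = sum(p["prediction"].strip().lower() == p["answer"].strip().lower()
                  for p in preds)
    return correct / len(preds)

# print(musicavqa_accuracy("musicavqa_test_predictions.jsonl"))  # 9,185 samples
```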
AVE Dataset
| Metric | My Result | Paper (Table 1, LLaMA2) | Gap |
|---|---|---|---|
| Event Classification | 94.78% | - | - |
| Temporal Localization (±1s) | 68.16% | - | - |
| Joint Accuracy | 64.68% | 74.68% | -10.00% |
Evaluated on all 402 test samples.
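Since the metric definition is exactly what I am unsure about (see the questions below), here is how I compute the three AVE numbers above. This is my own sketch with hypothetical field names; "joint accuracy" requires both the event class and the ±1 s localization to be correct on the same sample:

```python
def ave_metrics(samples):
    """samples: dicts with predicted/ground-truth event labels and boundaries in seconds."""
    event_ok = temporal_ok = joint_ok = 0
    for s in samples:
        cls_correct = s["pred_event"] == s["gt_event"]
        # Localization counts as correct if both boundaries fall within 1 s of the GT.
        loc_correct = (abs(s["pred_start"] - s["gt_start"]) <= 1.0
                       and abs(s["pred_end"] - s["gt_end"]) <= 1.0)
        event_ok += cls_correct
        temporal_ok += loc_correct
        joint_ok += cls_correct and loc_correct
    n = len(samples)
    return event_ok / n, temporal_ok / n, joint_ok / n
```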
Questions
- Global Batch Size: The README mentions using 16 A100 GPUs for fine-tuning, which would give a global batch size of 64 (16 × 4). I only have 6 GPUs available, resulting in a global batch size of 24. Could this difference significantly impact the final performance?
- AVE Evaluation Metric: The paper reports a single number (74.68%) for AVE. Is this the joint accuracy (both event classification and temporal localization correct), or is it a different metric?
- Fine-tuned Checkpoints: Would it be possible to release the fine-tuned model checkpoints for MUSIC-AVQA and AVE? This would help verify whether the gap is due to training configuration differences or evaluation methodology.
- Data Splits: Are you using the official train/test splits for MUSIC-AVQA and AVE? I'm using:
  - MUSIC-AVQA: 9,185 test samples
  - AVE: 402 test samples
- Any Other Critical Settings: Are there any other hyperparameters or settings not mentioned in the scripts that might be crucial for reproduction?
Thank you for your time and assistance!