Hi,
Excellent work and thank you for making the resources publicly available.
While conducting evaluation experiments on your dataset, I encountered some issues.
In the published paper, you stated:
> Evaluation. We use answer prediction accuracy as the metric and evaluate model performance on answering different types of questions. The answer vocabulary consists of 42 possible answers (22 objects, 12 counting choices, 6 location types, and yes/no) to different types of questions in the dataset. For training, we use one single model to handle all questions without training separated models for each type. So the accuracy with random choice is 1/42 ≈ 2.4%. Additionally, all models are trained on our AVQA dataset using the same features for a fair comparison.
It seems that the task is framed as multiple-choice, i.e., a 42-way classification over a closed answer set, right? Could you provide the list of the 42 possible answers?
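For reference, below is a minimal sketch of how I am currently computing accuracy, assuming the task is 42-way classification over a fixed answer vocabulary with per-type and overall accuracy reported. All names here (`answer_vocab`, `samples`, the field names) are my own placeholders, not your actual code or data format, so please correct me if this interpretation is wrong.

```python
from collections import defaultdict

# Hypothetical closed answer vocabulary: 22 objects + 12 counting choices
# + 6 location types + yes/no = 42 entries. The full list is what I am asking for;
# this short list is only a placeholder.
answer_vocab = ["yes", "no", "one", "two", "three"]

def evaluate(samples):
    """samples: iterable of dicts with 'question_type', 'gt_answer', 'pred_scores'.

    'pred_scores' is assumed to be a list of model scores, one per vocabulary entry.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        # Prediction = argmax over the answer classes in the vocabulary.
        pred_idx = max(range(len(answer_vocab)), key=lambda i: s["pred_scores"][i])
        pred_answer = answer_vocab[pred_idx]
        qtype = s["question_type"]
        total[qtype] += 1
        correct[qtype] += int(pred_answer == s["gt_answer"])
    # Per-question-type accuracy plus overall accuracy across all questions.
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_type, overall
```

Is this the intended evaluation protocol, or is there an official evaluation script I should use instead?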
Any additional details regarding how to perform the evaluation would be helpful. Thank you.
@DTaoo @ayameyao