
About the Emu Edit Benchmark metrics. #18

Open

hanzhn opened this issue Sep 6, 2024 · 2 comments

Comments


hanzhn commented Sep 6, 2024

I appreciate your excellent work on instruction-based editing. Thanks for your efforts!

I have some questions for you about the Emu Edit Benchmark metrics.

  1. Which specific versions of CLIP and DINO are used to calculate these metrics? I can't find any clue in the Emu Edit paper or in your report.
  2. Have you noticed that the emu_edit_test_set_generations dataset Emu Edit provides on HuggingFace may have mistakenly swapped the test and validation splits? Its record counts do not align with the paper or with their other HuggingFace repo, emu_edit_test_set (a quick comparison is sketched below). If this is true, which split should I use for the metrics calculation, and which split did you use for your reported numbers?
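For reference, this is roughly how I compared the split sizes; the "facebook/..." repo IDs below are my assumption of the HuggingFace dataset paths.

```python
# Quick check of the split sizes; the "facebook/..." repo IDs are assumed
# HuggingFace paths for the two datasets mentioned above.
from datasets import load_dataset

generations = load_dataset("facebook/emu_edit_test_set_generations")
references = load_dataset("facebook/emu_edit_test_set")

for name, ds in [("generations", generations), ("test_set", references)]:
    for split, data in ds.items():
        print(f"{name:12s} {split:12s} {len(data):6d} records")
```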

hanzhn commented Sep 6, 2024

Alternatively, could you point me to authoritative code for these calculations? That would be helpful.

@HaozheZhao
Owner

Hi,

Thank you for bringing these issues to our attention.

Versioning for Metrics Calculation

We've noticed that the original Emu Edit paper and dataset do not specify the versions of CLIP and DINO used. To align with other benchmarks, we adopted the settings used by MagicBrush (GitHub Repository): "ViT-B/32" for CLIP and "dino_vits16" for the DINO embeddings. For consistency, we reran all of the Emu Edit benchmark results reported in our paper with these settings.
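For reference, a minimal sketch of loading these two models, assuming the OpenAI `clip` package and the official facebookresearch/dino torch.hub entry point:

```python
# Minimal sketch of loading the two embedding models named above
# (assumes the OpenAI `clip` package and the official DINO torch.hub entry).
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/32: returns the model and its image preprocessing transform
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# DINO ViT-S/16 from the facebookresearch/dino hub
dino_model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino_model = dino_model.to(device).eval()
```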

Dataset Splits and Inconsistencies

Regarding the dataset split issue: we used the test split of emu_edit_test_set for our evaluations. Because the splits in emu_edit_test_set_generations were mistakenly swapped, our reported results were based on the validation split of that repo.

Also, there are known quality issues with the benchmark, as discussed in this thread: some image-caption pairs seem incorrect, with placeholder captions (e.g., 'a train station in city') or identical source and target captions.

Evaluation Code

For the metrics evaluation, we adhered closely to the MagicBrush evaluation script (GitHub Link) for both benchmarks, with no major modifications. We plan to share our refined evaluation code soon; in the meantime, you can refer to the MagicBrush script directly. A rough sketch of the image-similarity metrics is included below.
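As a rough illustration (not the MagicBrush script itself), the image-similarity metrics reduce to cosine similarity of CLIP and DINO embeddings between each edited image and its ground truth. The sketch below reuses the models loaded above; the DINO preprocessing shown is a standard ImageNet transform and is an assumption, so please check the MagicBrush script for the exact transform it uses.

```python
# Rough sketch of the image-similarity metrics (cosine similarity of CLIP and
# DINO embeddings between an edited image and its ground truth), reusing
# clip_model, clip_preprocess, dino_model, and device from the snippet above.
# This approximates the MagicBrush evaluation; it is not the script itself.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Standard ImageNet preprocessing for DINO (an assumption; see the MagicBrush
# script for the exact transform).
dino_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def clip_image_sim(img_a: Image.Image, img_b: Image.Image) -> float:
    a = clip_model.encode_image(clip_preprocess(img_a).unsqueeze(0).to(device))
    b = clip_model.encode_image(clip_preprocess(img_b).unsqueeze(0).to(device))
    return F.cosine_similarity(a, b).item()

@torch.no_grad()
def dino_image_sim(img_a: Image.Image, img_b: Image.Image) -> float:
    a = dino_model(dino_preprocess(img_a).unsqueeze(0).to(device))
    b = dino_model(dino_preprocess(img_b).unsqueeze(0).to(device))
    return F.cosine_similarity(a, b).item()
```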
