
Commit e734e22

Updated paper and README.
1 parent ace5271 commit e734e22

File tree: 5 files changed (+43 −15 lines)

README.md

Lines changed: 6 additions & 3 deletions
@@ -4,7 +4,7 @@
[![CI](https://github.com/souradipp76/MM-PoE/actions/workflows/main.yml/badge.svg)](https://github.com/souradipp76/MM-PoE/actions/workflows/main.yml)


-**Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models**
+**Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models**


## What is MM-PoE?
@@ -15,7 +15,7 @@ Multi-Modal Process of Elimination (MM-PoE) is a method to enhance vision langua

Large language models (LLMs) excel at in-context learning for multiple choice reasoning tasks but often treat all options equally, unlike humans, who typically eliminate incorrect choices before selecting the correct answer. The same is true for vision language models (VLMs) in the case of visual question answering tasks with multiple choices. This discrepancy can limit the effectiveness of VLMs in accurately solving such tasks. To address this, we introduce Multi-Modal Process of Elimination (MM-PoE), a two-step scoring method designed to enhance VLM performance by mimicking human reasoning strategies in multi-modal settings.

-In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios . Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
+In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios. Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).

By implementing MM-PoE, researchers and practitioners can experiment with and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.
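To make the two-step procedure concrete, here is a minimal, model-agnostic sketch. It is not code from this package: `score_option` is a hypothetical stand-in for a VLM's score for an option given the image and question, and the below-average elimination rule is one plausible criterion in the spirit of process of elimination; the repository's exact scoring may differ.

```python
# Illustrative sketch of MM-PoE's two steps; not the package's actual code.
# score_option is a hypothetical callable returning a VLM's score for one
# option given the image, the question, and the full (possibly masked) list.
def mm_poe(image, question, options, score_option, mask="[MASK]"):
    # Step 1: score every option and eliminate those scoring below the mean.
    scores = [score_option(image, question, options, opt) for opt in options]
    mean = sum(scores) / len(scores)
    keep = [s >= mean for s in scores]  # the top-scoring option is always kept

    # Step 2: hide eliminated options behind a mask token, then re-score
    # only the surviving options against the masked choice list.
    masked = [opt if k else mask for opt, k in zip(options, keep)]
    surviving = {
        i: score_option(image, question, masked, options[i])
        for i, k in enumerate(keep)
        if k
    }
    return max(surviving, key=surviving.get)  # index of the predicted option
```

The key property of step two is that eliminated options remain visible only as mask tokens, so the final scores are conditioned on a reduced choice set.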

@@ -65,7 +65,10 @@ $ python -m mm_poe
$ mm_poe
```

-The application will prompt the user to provide relevant inputs for a multiple choice question e.g a question, multiple answer choices for the question and the path to the image relevant the question context. Once the inputs are provided, the predicted answer will be displayed based on the selections. Note that this application runs inference for only a single sample at a time.
+The application will prompt the user to provide relevant inputs for a multiple choice question, e.g., a question, multiple answer choices for the question, and the path to the image relevant to the question context. Once the inputs are provided, the predicted answer will be displayed based on the prompt outputs. Note that this application runs inference for only a single sample at a time.
+
+
+<img src="paper/figures/cli.png" alt="Example" width="500">

### Running Experiments

mm_poe/cli.py

Lines changed: 4 additions & 3 deletions
@@ -65,7 +65,7 @@ def main():
    ).ask()

    args.loading_precision = questionary.select(
-        message="Select model checkpoint?",
+        message="Select model precision?",
        choices=["FP32", "FP16", "BF16", "INT8", "INT4"],
        default="FP32",
    ).ask()
@@ -116,7 +116,8 @@ def main():
        "Image Path?", default="./images/image.png"
    ).ask()
    args.label = questionary.select(
-        message="Answer:", choices=[str(x) for x in range(args.num_options)]
+        message="Ground Truth Option:",
+        choices=[str(x) for x in range(args.num_options)],
    ).ask()
    args.label = int(args.label)
    args.method = "process_of_elimination"
@@ -394,4 +395,4 @@ def main():
        )
    )
    option = int(lm_predictions.numpy()[0])
-    logger.info(f"Answer: {option}")
+    logger.info(f"Predicted Option: {option}. Answer: {args.choices[option]}")
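For readers unfamiliar with the prompts above: they use the `questionary` library's select menus. Below is a self-contained sketch of the pattern; the four-option count is an arbitrary example, not the CLI's full flow.

```python
# Standalone sketch of the prompt pattern used in mm_poe/cli.py:
# questionary.select renders an arrow-key menu and .ask() returns the
# chosen entry as a string.
import questionary

precision = questionary.select(
    message="Select model precision?",
    choices=["FP32", "FP16", "BF16", "INT8", "INT4"],
    default="FP32",
).ask()

label = questionary.select(
    message="Ground Truth Option:",
    choices=[str(x) for x in range(4)],  # four answer options, as an example
).ask()
label = int(label)  # questionary returns strings, so cast before indexing
print(precision, label)
```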

paper/figures/cli.png

35.8 KB

paper/paper.bib

Lines changed: 23 additions & 0 deletions
@@ -285,3 +285,26 @@ @conj{Idefics2
    version = {8b},
    howpublished = {\url{https://huggingface.co/HuggingFaceM4/idefics2-8b}}
}
+
+@inproceedings{VQA,
+    author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
+    title = {{VQA}: Visual Question Answering},
+    booktitle = {International Conference on Computer Vision (ICCV)},
+    year = {2015}
+}
+
+@article{Kembhavi2016ADI,
+    title = {A Diagram is Worth a Dozen Images},
+    author = {Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
+    journal = {ArXiv},
+    year = {2016},
+    volume = {abs/1603.07396},
+    url = {https://api.semanticscholar.org/CorpusID:2682274}
+}
+
+@inproceedings{lu2022learn,
+    title = {Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
+    author = {Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
+    booktitle = {The 36th Conference on Neural Information Processing Systems (NeurIPS)},
+    year = {2022}
+}

paper/paper.md

Lines changed: 10 additions & 9 deletions
@@ -24,13 +24,13 @@ bibliography: paper.bib

# Summary

-This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models, also know as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks. This method addresses one of the main limitations of the paper [@ma2023poe] by extending to tasks involving multi-modalities and also includes experimentation techniques for few-shot settings.
+This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models, also known as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning tasks by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks. This method addresses one of the key limitations of [@ma2023poe] by extending it to multi-modal tasks and also includes experimentation techniques for few-shot settings.

# Statement of Need

Large language models (LLMs) excel at in-context learning for multiple choice reasoning tasks but often treat all options equally, unlike humans, who typically eliminate incorrect choices before selecting the correct answer. The same is true for vision language models (VLMs) in the case of visual question answering tasks with multiple choices. This discrepancy can limit the effectiveness of VLMs in accurately solving such tasks. To address this, we introduce Multi-Modal Process of Elimination (MM-PoE), a two-step scoring method designed to enhance VLM performance by mimicking human reasoning strategies in multi-modal settings.

-In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios . Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
+In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios. Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).

By implementing MM-PoE, researchers and practitioners can experiment with and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.

@@ -106,13 +106,13 @@ To further explore the versatility of MM-PoE, we also examined its performance i

## Data

-Our experiments were conducted on three different multiple-choice visual reasoning datasets, selected to cover a broad spectrum of reasoning types and complexities. These tasks include both traditional reasoning tasks and more specialized ones designed to test specific reasoning skills. To ensure a comprehensive evaluation, we used train sets from established benchmarks when available; otherwise, we utilized development sets.
+Our experiments were conducted on three different multiple-choice visual reasoning datasets - Visual Question Answering (VQA) [@VQA], ScienceQA [@lu2022learn], and Diagram Understanding (AI2D) [@Kembhavi2016ADI] - selected to cover a broad spectrum of reasoning types and complexities. These tasks include both traditional visual reasoning tasks and more specialized ones designed to test specific reasoning skills. To ensure a comprehensive evaluation, we used train sets from established benchmarks when available; otherwise, we utilized development sets. Because the number of options in the multiple-choice answers varies for the ScienceQA and AI2D datasets, we filtered for questions containing an image context and exactly four options.

| Dataset | #Options | Train | Dev | Test |
|----|------|------|------|-----------|
-|VQA v1.0| 18 | 248,349 | 121,512 | 244,302 |
-|ScienceQA | 4 | 2221 | | |
-| AI2D | 4 | | | |
+| VQA | 18 | 248,349 | 121,512 | 244,302 |
+| ScienceQA | 4 | 12726 | 4241 | 4241 |
+| AI2D | 4 | 3921 | 982 | - |
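The four-option filtering could be reproduced along the lines of the sketch below. This is an assumption, not the repository's preprocessing: it uses the `derek-thomas/ScienceQA` mirror on the Hugging Face Hub and its `image` and `choices` fields.

```python
# Hypothetical sketch, not the repository's actual code: keep ScienceQA
# questions that have an image context and exactly four answer choices.
from datasets import load_dataset

ds = load_dataset("derek-thomas/ScienceQA", split="train")
filtered = ds.filter(
    lambda ex: ex["image"] is not None and len(ex["choices"]) == 4
)
print(len(filtered))  # should land near the ~12.7k train rows in the table
```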

## Model

@@ -145,7 +145,7 @@ MM-PoE consistently outperformed or matched the best-performing baselines across

| Model | Dataset | LM | AVG | Calibration | Channel | MCP | PoE |
|----|------|------|------|-----------|---|---|---|
-|microsoft/git-base-vqav2| VQA | | | | | | | |
+|microsoft/git-base-vqav2| VQA | 45 | 43 | 38 | 14 | 2 | |
|microsoft/git-base-vqav2| ScienceQA | 27.4 | | 23.2 | 24.6 | 25.8 | 27.2 |
|microsoft/git-base-vqav2| AI2D | 25.4 | | 26.4 | 25.4 | 25.3 | 26.5 |
|microsoft/git-base-textvqa| VQA | | | | | | |
@@ -156,11 +156,12 @@ MM-PoE consistently outperformed or matched the best-performing baselines across

## Example

-<img src="figures/image.png" alt="Example" width="300">
+<img src="figures/image.png" alt="Example" width="500">

**Question**: Which of these states is farthest north?<br>
**Choices**: West Virginia, Louisiana, Arizona, Oklahoma<br>
-**Predicted**: 0
+**Masked Choices**: West Virginia, Louisiana, [MASK], [MASK]<br>
+**Predicted**: West Virginia
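The masked choice list in this example follows mechanically from step one; a toy illustration, assuming options 2 and 3 were the ones eliminated:

```python
# Toy illustration of the masking step shown in the example above.
choices = ["West Virginia", "Louisiana", "Arizona", "Oklahoma"]
eliminated = {2, 3}  # option indices scored lowest in step one
masked = [c if i not in eliminated else "[MASK]" for i, c in enumerate(choices)]
print(masked)  # ['West Virginia', 'Louisiana', '[MASK]', '[MASK]']
```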

# Conclusion
