
Commit e734e22

Updated paper and README.
1 parent ace5271 commit e734e22

File tree: 5 files changed (+43 −15 lines)

README.md

Lines changed: 6 additions & 3 deletions
@@ -4,7 +4,7 @@
[![CI](https://github.com/souradipp76/MM-PoE/actions/workflows/main.yml/badge.svg)](https://github.com/souradipp76/MM-PoE/actions/workflows/main.yml)


-**Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models**
+**Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models**


## What is MM-PoE?
@@ -15,7 +15,7 @@ Multi-Modal Process of Elimination (MM-PoE) is a method to enhance vision langua

Large language models (LLMs) excel at in-context learning for multiple choice reasoning tasks but often treat all options equally, unlike humans, who typically eliminate incorrect choices before selecting the correct answer. The same is true for vision language models (VLMs) in the case of visual question answering tasks with multiple choices. This discrepancy can limit the effectiveness of VLMs in accurately solving such tasks. To address this, we introduce Multi-Modal Process of Elimination (MM-PoE), a two-step scoring method designed to enhance VLM performance by mimicking human reasoning strategies in multi-modal settings.

-In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios . Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
+In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios. Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).

By implementing MM-PoE, researchers and practitioners can experiment with and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.
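To make the two-step procedure concrete, here is a minimal, model-agnostic sketch. It is not code from this package: `score_option` is a hypothetical stand-in for a VLM's score for an option given the image and question, and the below-average elimination rule is one plausible criterion in the spirit of process of elimination; the repository's exact scoring may differ.

```python
# Illustrative sketch of MM-PoE's two steps; not the package's actual code.
# score_option is a hypothetical callable returning a VLM's score for one
# option given the image, the question, and the full (possibly masked) list.
def mm_poe(image, question, options, score_option, mask="[MASK]"):
    # Step 1: score every option and eliminate those scoring below the mean.
    scores = [score_option(image, question, options, opt) for opt in options]
    mean = sum(scores) / len(scores)
    keep = [s >= mean for s in scores]  # the top-scoring option is always kept

    # Step 2: hide eliminated options behind a mask token, then re-score
    # only the surviving options against the masked choice list.
    masked = [opt if k else mask for opt, k in zip(options, keep)]
    surviving = {
        i: score_option(image, question, masked, options[i])
        for i, k in enumerate(keep)
        if k
    }
    return max(surviving, key=surviving.get)  # index of the predicted option
```

The key property of step two is that eliminated options remain visible only as mask tokens, so the final scores are conditioned on a reduced choice set.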

@@ -65,7 +65,10 @@ $ python -m mm_poe
$ mm_poe
```

-The application will prompt the user to provide relevant inputs for a multiple choice question e.g a question, multiple answer choices for the question and the path to the image relevant the question context. Once the inputs are provided, the predicted answer will be displayed based on the selections. Note that this application runs inference for only a single sample at a time.
+The application will prompt the user to provide relevant inputs for a multiple choice question, e.g., a question, multiple answer choices for the question, and the path to the image relevant to the question context. Once the inputs are provided, the predicted answer will be displayed based on the prompt outputs. Note that this application runs inference for only a single sample at a time.
+
+
+<img src="paper/figures/cli.png" alt="Example" width="500">

### Running Experiments

mm_poe/cli.py

Lines changed: 4 additions & 3 deletions
@@ -65,7 +65,7 @@ def main():
    ).ask()

    args.loading_precision = questionary.select(
-        message="Select model checkpoint?",
+        message="Select model precision?",
        choices=["FP32", "FP16", "BF16", "INT8", "INT4"],
        default="FP32",
    ).ask()
@@ -116,7 +116,8 @@ def main():
        "Image Path?", default="./images/image.png"
    ).ask()
    args.label = questionary.select(
-        message="Answer:", choices=[str(x) for x in range(args.num_options)]
+        message="Ground Truth Option:",
+        choices=[str(x) for x in range(args.num_options)],
    ).ask()
    args.label = int(args.label)
    args.method = "process_of_elimination"
@@ -394,4 +395,4 @@ def main():
        )
    )
    option = int(lm_predictions.numpy()[0])
-    logger.info(f"Answer: {option}")
+    logger.info(f"Predicted Option: {option}. Answer: {args.choices[option]}")
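For readers unfamiliar with the prompts above: they use the `questionary` library's select menus. Below is a self-contained sketch of the pattern; the four-option count is an arbitrary example, not the CLI's full flow.

```python
# Standalone sketch of the prompt pattern used in mm_poe/cli.py:
# questionary.select renders an arrow-key menu and .ask() returns the
# chosen entry as a string.
import questionary

precision = questionary.select(
    message="Select model precision?",
    choices=["FP32", "FP16", "BF16", "INT8", "INT4"],
    default="FP32",
).ask()

label = questionary.select(
    message="Ground Truth Option:",
    choices=[str(x) for x in range(4)],  # four answer options, as an example
).ask()
label = int(label)  # questionary returns strings, so cast before indexing
print(precision, label)
```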

paper/figures/cli.png

35.8 KB

paper/paper.bib

Lines changed: 23 additions & 0 deletions
@@ -285,3 +285,26 @@ @conj{Idefics2
    version = {8b},
    howpublished = {\url{https://huggingface.co/HuggingFaceM4/idefics2-8b}}
}
+
+@inproceedings{VQA,
+    author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
+    title = {{VQA}: Visual Question Answering},
+    booktitle = {International Conference on Computer Vision (ICCV)},
+    year = {2015}
+}
+
+@article{Kembhavi2016ADI,
+    title = {A Diagram is Worth a Dozen Images},
+    author = {Aniruddha Kembhavi and Michael Salvato and Eric Kolve and Minjoon Seo and Hannaneh Hajishirzi and Ali Farhadi},
+    journal = {ArXiv},
+    year = {2016},
+    volume = {abs/1603.07396},
+    url = {https://api.semanticscholar.org/CorpusID:2682274}
+}
+
+@inproceedings{lu2022learn,
+    title = {Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
+    author = {Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
+    booktitle = {The 36th Conference on Neural Information Processing Systems (NeurIPS)},
+    year = {2022}
+}

paper/paper.md

Lines changed: 10 additions & 9 deletions
@@ -24,13 +24,13 @@ bibliography: paper.bib

# Summary

-This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models, also know as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks. This method addresses one of the main limitations of the paper [@ma2023poe] by extending to tasks involving multi-modalities and also includes experimentation techniques for few-shot settings.
+This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models, also known as Multi-Modal Process of Elimination (MM-PoE), a method to enhance vision language models' performance on multiple-choice visual reasoning tasks by employing a two-step scoring system that first eliminates incorrect options and then predicts from the remaining ones. Our experiments across three question answering datasets show the method's effectiveness, particularly in visual reasoning tasks. This method addresses one of the key limitations of [@ma2023poe] by extending it to multi-modal tasks and also includes experimentation techniques for few-shot settings.

# Statement of Need

Large language models (LLMs) excel at in-context learning for multiple choice reasoning tasks but often treat all options equally, unlike humans, who typically eliminate incorrect choices before selecting the correct answer. The same is true for vision language models (VLMs) in the case of visual question answering tasks with multiple choices. This discrepancy can limit the effectiveness of VLMs in accurately solving such tasks. To address this, we introduce Multi-Modal Process of Elimination (MM-PoE), a two-step scoring method designed to enhance VLM performance by mimicking human reasoning strategies in multi-modal settings.

-In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios . Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).
+In the first step, the method evaluates and scores each option, systematically eliminating those that appear incorrect. The second step involves masking these eliminated options, allowing the VLM to focus solely on the remaining viable choices to make a final prediction. Our zero-shot experiments across three datasets demonstrate MM-PoE's effectiveness, particularly excelling in logical reasoning scenarios. Additionally, MM-PoE proves adaptable to few-shot settings and is compatible with the current state-of-the-art vision language models (VLMs).

By implementing MM-PoE, researchers and practitioners can experiment with and significantly improve the accuracy and reliability of VLMs in multiple choice reasoning tasks, making it a valuable tool for advancing machine learning models for visual reasoning.

@@ -106,13 +106,13 @@ To further explore the versatility of MM-PoE, we also examined its performance i

## Data

-Our experiments were conducted on three different multiple-choice visual reasoning datasets, selected to cover a broad spectrum of reasoning types and complexities. These tasks include both traditional reasoning tasks and more specialized ones designed to test specific reasoning skills. To ensure a comprehensive evaluation, we used train sets from established benchmarks when available; otherwise, we utilized development sets.
+Our experiments were conducted on three different multiple-choice visual reasoning datasets - Visual Question Answering (VQA) [@VQA], ScienceQA [@lu2022learn], and Diagram Understanding (AI2D) [@Kembhavi2016ADI] - selected to cover a broad spectrum of reasoning types and complexities. These tasks include both traditional visual reasoning tasks and more specialized ones designed to test specific reasoning skills. To ensure a comprehensive evaluation, we used train sets from established benchmarks when available; otherwise, we utilized development sets. Because the number of options in the multiple-choice answers varies for the ScienceQA and AI2D datasets, we filtered for questions containing an image context and exactly four options.

| Dataset | #Options | Train | Dev | Test |
|----|------|------|------|-----------|
-|VQA v1.0| 18 | 248,349 | 121,512 | 244,302 |
-|ScienceQA | 4 | 2221 | | |
-| AI2D | 4 | | | |
+| VQA | 18 | 248,349 | 121,512 | 244,302 |
+| ScienceQA | 4 | 12726 | 4241 | 4241 |
+| AI2D | 4 | 3921 | 982 | - |
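The four-option filtering could be reproduced along the lines of the sketch below. This is an assumption, not the repository's preprocessing: it uses the `derek-thomas/ScienceQA` mirror on the Hugging Face Hub and its `image` and `choices` fields.

```python
# Hypothetical sketch, not the repository's actual code: keep ScienceQA
# questions that have an image context and exactly four answer choices.
from datasets import load_dataset

ds = load_dataset("derek-thomas/ScienceQA", split="train")
filtered = ds.filter(
    lambda ex: ex["image"] is not None and len(ex["choices"]) == 4
)
print(len(filtered))  # should land near the ~12.7k train rows in the table
```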

## Model

@@ -145,7 +145,7 @@ MM-PoE consistently outperformed or matched the best-performing baselines across

| Model | Dataset | LM | AVG | Calibration | Channel | MCP | PoE |
|----|------|------|------|-----------|---|---|---|
-|microsoft/git-base-vqav2| VQA | | | | | | | |
+|microsoft/git-base-vqav2| VQA | 45 | 43 | 38 | 14 | 2 | |
|microsoft/git-base-vqav2| ScienceQA | 27.4 | | 23.2 | 24.6 | 25.8 | 27.2 |
|microsoft/git-base-vqav2| AI2D | 25.4 | | 26.4 | 25.4 | 25.3 | 26.5 |
|microsoft/git-base-textvqa| VQA | | | | | | |
@@ -156,11 +156,12 @@ MM-PoE consistently outperformed or matched the best-performing baselines across

## Example

-<img src="figures/image.png" alt="Example" width="300">
+<img src="figures/image.png" alt="Example" width="500">

**Question**: Which of these states is farthest north?<br>
**Choices**: West Virginia, Louisiana, Arizona, Oklahoma<br>
-**Predicted**: 0
+**Masked Choices**: West Virginia, Louisiana, [MASK], [MASK]<br>
+**Predicted**: West Virginia
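The masked choice list in this example follows mechanically from step one; a toy illustration, assuming options 2 and 3 were the ones eliminated:

```python
# Toy illustration of the masking step shown in the example above.
choices = ["West Virginia", "Louisiana", "Arizona", "Oklahoma"]
eliminated = {2, 3}  # option indices scored lowest in step one
masked = [c if i not in eliminated else "[MASK]" for i, c in enumerate(choices)]
print(masked)  # ['West Virginia', 'Louisiana', '[MASK]', '[MASK]']
```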

# Conclusion
