- Overview
- Features
- VQA Shortcomings
- Project Structure
- Getting Started
- Model Architecture
- Performance
- Loss
- Examples
- Troubleshooting
- Acknowledgments
- Contributing
- License
- Contact
MediVisor is a Visual Question Answering (VQA) system tailored for the medical domain. It leverages state-of-the-art transformer models, specifically the ViLT (Vision-and-Language Transformer) model fine-tuned for medical images, to answer questions based on visual content. The project aims to aid medical professionals and researchers by providing automated insights and answers to medical image-related queries.
- Real-Time Answer Prediction: Given a medical image and a question, the model predicts the most likely answer.
- Performance Evaluation: The model's performance can be evaluated using BLEU and METEOR scores, which are common metrics for assessing natural language processing tasks.
- Custom Dataset Integration: The system is flexible and can be adapted to different datasets within the medical domain.
- Transfer Learning: Uses pre-trained models fine-tuned on the PathVQA dataset, making it suitable for medical applications.
- Dataset Bias: VQA systems are trained on large datasets of images and text, and biases in those datasets can carry over into biased answers.
- Open-Ended Questions: VQA systems can struggle with open-ended questions, which often require a deep understanding of context and the ability to reason.
- Limited Commonsense Reasoning: VQA systems often lack commonsense reasoning, which can lead to inaccurate or misleading answers.
- Lack of Robustness: VQA models can be sensitive to slight changes in input, such as variations in image quality or the phrasing of a question, resulting in inconsistent performance across scenarios.
- Answer Ambiguity: Some questions have multiple valid answers, and VQA systems can struggle to choose the most appropriate one, leading to confusing or incorrect responses.
- Multimodal Alignment: Integrating visual and textual data effectively is challenging, and VQA models may fail to align information from the two modalities correctly, producing errors in understanding and answering questions.
Before you begin, ensure you have met the following requirements:
- Python 3.7+ installed
- Pip package manager installed
- PyTorch, transformers, and datasets libraries installed
- nltk library installed (for BLEU and METEOR scores)
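The METEOR metric in nltk additionally relies on the WordNet corpus, so a one-time download is needed. A minimal sketch, assuming a standard nltk installation:

```python
import nltk

# One-time downloads: WordNet is required by nltk's METEOR implementation,
# and punkt provides the default word tokenizer.
nltk.download('wordnet')
nltk.download('punkt')
```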
- Clone the Repository:

  ```bash
  git clone https://github.com/your-username/medivisor.git
  cd medivisor
  ```

- Install Dependencies:

  Install the required Python packages using pip:

  ```bash
  pip install -r requirements.txt
  ```

- Download the Dataset:

  The project uses the PathVQA dataset. You can load it directly using the `datasets` library:

  ```python
  from datasets import load_dataset

  ds = load_dataset("flaviagiammarino/path-vqa")
  ```

- Set Up the Environment:

  If you’re using Google Colab, ensure that your runtime has GPU enabled for faster processing.
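As a quick sanity check after these steps, you can confirm that PyTorch sees the GPU and that the dataset loaded correctly. A minimal sketch, assuming the `ds` object from the step above:

```python
import torch

# Prefer the GPU when one is available (e.g., a Colab GPU runtime)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Each PathVQA record contains an image, a question, and an answer
example = ds['train'][0]
print(example['question'], '->', example['answer'])
```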
Use the `predict_answer` function to predict the answer to a question based on a medical image:

```python
from src.predict import predict_answer

# Example usage
image_path = '/path/to/your/image.jpeg'
question = "What are the blue dots?"

predicted_answer = predict_answer(model, processor, image_path, question, answer_vocab)
print(f"Question: {question}")
print(f"Predicted Answer: {predicted_answer}")
```
Evaluate the model's performance using BLEU and METEOR scores with the `calculate_bleu_meteor` function:

```python
from src.evaluate import calculate_bleu_meteor

# Example usage: evaluate on a small sample of the training split
sampled_dataset = [
    {'image': ann['image'], 'question': ann['question'], 'answer': ann['answer']}
    for ann in ds['train'].select(range(10))
]

avg_bleu, avg_meteor, references, hypotheses = calculate_bleu_meteor(model, processor, sampled_dataset, answer_vocab)
print(f"Average BLEU score: {avg_bleu}")
print(f"Average METEOR score: {avg_meteor}")
```
- Dataset: If you have a custom medical dataset, ensure it follows the structure expected by the `PathVQADataset` class. Modify the keys and processing steps as necessary.
- Model Fine-Tuning: You can fine-tune the ViLT model further on your custom dataset using the provided training scripts and the `Trainer` class from Hugging Face's `transformers` library (a minimal sketch follows below).
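As a rough outline of what a `Trainer`-based fine-tuning run could look like (hyperparameters are placeholders, and `train_dataset`/`eval_dataset` are assumed to be `PathVQADataset`-style torch datasets; this is not the project's exact training script):

```python
from transformers import Trainer, TrainingArguments

# Placeholder hyperparameters; tune them for your dataset and hardware
training_args = TrainingArguments(
    output_dir="./vilt-pathvqa-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,                  # the ViLT model loaded earlier
    args=training_args,
    train_dataset=train_dataset,  # assumed PathVQADataset-style dataset
    eval_dataset=eval_dataset,
)
trainer.train()
```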
- CUDA Errors: Ensure that your GPU drivers are correctly installed and that PyTorch is configured to use CUDA (see the diagnostic snippet after this list).
- Model Prediction Issues: Check if the image and question are being correctly preprocessed before being fed into the model.
- BLEU/METEOR Calculation: Ensure that the references and hypotheses are correctly tokenized before calculating the scores.
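For the CUDA item above, these standard PyTorch calls report what the runtime actually sees (purely diagnostic, no project-specific assumptions):

```python
import torch

# Report the installed PyTorch version and whether CUDA is usable
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```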
This project is built upon the work of many open-source projects and researchers. Special thanks to:
- The authors of the ViLT model.
- The creators of the PathVQA dataset.
- The Hugging Face team for their incredible transformers library.
Contributions are welcome! If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcome.
This project is licensed under the MIT License - see the LICENSE file for details.
For further queries, feel free to reach out:
- Gmail - samama4200@gmail.com
- LinkedIn - www.linkedin.com/in/samama-