NOTE: Users with the appropriate PhysioNet credentials (signed MIMIC-III and MIMIC-CXR data use agreements are required) can download the preprocessed datasets directly from Vilmedic Datasets.
Alternatively, follow the steps below to obtain each dataset.
- Download: Access the RadGraph dataset on PhysioNet.
- Preprocessing: Run the `convert_radgraph_to_dygiepp.py` script, specifying the location where the RadGraph dataset is saved.
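For orientation, the conversion step above maps RadGraph's entity/relation JSON onto DyGIE++'s jsonl schema. The sketch below illustrates that mapping under the assumption that the input follows RadGraph's published fields (`text`, plus `entities` with `start_ix`, `end_ix`, `label`, and `relations`); the function name and exact field handling here are illustrative and may differ from the actual script:

```python
import json

def radgraph_to_dygiepp(doc_id, record):
    """Map one RadGraph-style record onto a DyGIE++-style dict.

    Assumes RadGraph's published schema: whitespace-tokenized `text`,
    and `entities` keyed by string ids with token-level start/end indices.
    """
    tokens = record["text"].split()
    ner, relations = [], []
    for ent_id, ent in record["entities"].items():
        ner.append([ent["start_ix"], ent["end_ix"], ent["label"]])
        for rel_label, target_id in ent.get("relations", []):
            tgt = record["entities"][target_id]
            relations.append([ent["start_ix"], ent["end_ix"],
                              tgt["start_ix"], tgt["end_ix"], rel_label])
    # DyGIE++ expects one token list per sentence plus parallel label lists.
    return {"doc_key": doc_id,
            "sentences": [tokens],
            "ner": [ner],
            "relations": [relations]}

# Toy record in the assumed RadGraph layout (labels are illustrative).
record = {
    "text": "mild cardiomegaly noted",
    "entities": {
        "1": {"tokens": "cardiomegaly", "label": "OBS-DP",
              "start_ix": 1, "end_ix": 1, "relations": []},
        "2": {"tokens": "mild", "label": "OBS-DP",
              "start_ix": 0, "end_ix": 0,
              "relations": [["modify", "1"]]},
    },
}
print(json.dumps(radgraph_to_dygiepp("example_0", record)))
```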
- Download: Get the `Rad-SpRL.xml` file from Mendeley.
- Preprocessing: Run the `convert_radsprl_to_dygiepp.py` script, specifying the location where the RadSpRL dataset is saved.
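The RadSpRL conversion starts from XML. As a rough illustration only — the element and attribute names below are invented, since the actual `Rad-SpRL.xml` schema may differ — parsing annotations with the standard library might look like:

```python
import xml.etree.ElementTree as ET

# Hypothetical RadSpRL-style XML; tag and attribute names are
# illustrative, not the real Rad-SpRL.xml schema.
SAMPLE = """
<reports>
  <report id="r1">
    <sentence>Opacity in the right lower lobe.</sentence>
    <annotation type="SPATIAL_INDICATOR" text="in"/>
    <annotation type="LANDMARK" text="right lower lobe"/>
  </report>
</reports>
"""

def extract_annotations(xml_text):
    """Collect (report_id, annotation_type, annotation_text) triples."""
    root = ET.fromstring(xml_text)
    rows = []
    for report in root.iter("report"):
        for ann in report.iter("annotation"):
            rows.append((report.get("id"), ann.get("type"), ann.get("text")))
    return rows

for row in extract_annotations(SAMPLE):
    print(row)
```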
- Download: Access the MIMIC-III dataset.
- Preprocessing: In the `mimiciii_procedure_selection` folder, run the `create_dataset.py` script followed by the `create_train_dev_test_split.py` script, specifying the location where the MIMIC-III dataset is saved.
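Splits of clinical datasets are usually made at the patient level so that no patient's reports leak across train, dev, and test. A hedged sketch of what a script like `create_train_dev_test_split.py` might do — the function signature, argument names, and split fractions are illustrative, not taken from the actual script:

```python
import random

def patient_level_split(report_ids_by_patient, seed=0,
                        dev_frac=0.1, test_frac=0.1):
    """Split at the patient level so no patient's reports appear
    in more than one split. Maps patient id -> list of report ids.
    (Argument names and fractions are illustrative assumptions.)
    """
    patients = sorted(report_ids_by_patient)
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test_p = set(patients[:n_test])
    dev_p = set(patients[n_test:n_test + n_dev])
    splits = {"train": [], "dev": [], "test": []}
    for pid, reports in report_ids_by_patient.items():
        if pid in test_p:
            splits["test"].extend(reports)
        elif pid in dev_p:
            splits["dev"].extend(reports)
        else:
            splits["train"].extend(reports)
    return splits

# Toy example: 10 patients with 2 reports each.
demo = {f"p{i}": [f"p{i}_r0", f"p{i}_r1"] for i in range(10)}
splits = patient_level_split(demo, seed=0)
print({name: len(ids) for name, ids in splits.items()})
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when benchmark numbers are compared between papers.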
A link for accessing this dataset will be provided upon institutional review board approval; no further preprocessing is required.
- Download: Follow the instructions for dataset download on GitHub.
- Download: You must create a PhysioNet account with permission to download the MIMIC-CXR Database. Get `mimic-cxr-reports.zip` and related files from PhysioNet and organize them as described.
- Preprocessing: Navigate to `datasets/rrg_rrs/mimic-cxr` and run the provided commands to preprocess the dataset.
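Preprocessing for report generation and summarization typically involves pulling the FINDINGS and IMPRESSION sections out of the free-text reports. A minimal sketch of that extraction — the uppercase-header-plus-colon convention matches the public MIMIC-CXR reports, but this regex is an assumption, not the repository's actual commands:

```python
import re

def extract_section(report, section):
    """Pull one section (e.g. FINDINGS or IMPRESSION) from a
    MIMIC-CXR-style report and collapse its internal whitespace.
    The header regex is a sketch, not the repo's real preprocessing."""
    pattern = rf"{section}:\s*(.*?)(?=\n\s*[A-Z ]+:|\Z)"
    m = re.search(pattern, report, flags=re.S)
    return " ".join(m.group(1).split()) if m else None

report = """EXAMINATION: Chest radiograph

FINDINGS: The lungs are clear. No pleural effusion.

IMPRESSION: No acute cardiopulmonary process.
"""
print(extract_section(report, "IMPRESSION"))
```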
We encourage the community to help expand the RaLEs benchmark by contributing new datasets. If you have a dataset that could be valuable for radiology language tasks, please follow the instructions below:
- Prepare Your Dataset Details: Ensure you have all the details about your dataset ready as per the guidelines below.
- Pull Request: Submit a pull request to this repository with the dataset details.
- Review: Your dataset will be reviewed based on the provided information. The review process typically takes up to 10 business days. If your dataset is accepted, it will be added to the RaLEs benchmark.
- Name: The official name of the dataset.
- Description: A brief description of the dataset.
- Task: Specify the NLP task (e.g., NLU, NLG, Classification, Relation Extraction, Summarization).
- Clinical/Scientific Relevance: Explain the clinical or scientific relevance of the dataset.
- Format: Describe the format of the dataset (e.g., CSV, JSON).
- Size: Provide the size details such as the number of patients, reports, and images.
- Download Instructions: Step-by-step instructions or links for downloading the dataset.
- Preparation Scripts: If applicable, provide scripts or instructions for preprocessing or preparing the dataset.
- Labeling Method: Explain how the dataset was labeled (e.g., manual annotation, expert-reviewed, etc.).
- Models Evaluated: List the models that have been evaluated on the dataset and their performance metrics.
- Preferred Metric: Specify the main metric that should be used for evaluating models on this dataset.
- Additional Relevant Metrics: Mention any other metrics that can be used for evaluation.
- Potential Biases: Discuss any known biases in the dataset.
- Related Existing Datasets: If applicable, mention any datasets that are related or similar to yours.
If your dataset addresses a task already present in the RaLEs benchmark, you should benchmark the best-performing RaLEs models for that task. For example, if your dataset is for procedure selection, evaluate it using the best-performing RaLEs procedure selection model.
## Dataset Submission for RaLEs Benchmark
- **Name**: [Your Dataset Name]
- **Description**: [Brief Description]
- **Task**: [NLP Task]
- **Clinical/Scientific Relevance**: [Relevance Explanation]
- **Format**: [Dataset Format]
- **Size**:
- **Patients**: [Number]
- **Reports**: [Number]
- **Images**: [Number]
- **Download Instructions**: [Instructions or Links]
- **Preparation Scripts**: [Scripts or Instructions]
- **Labeling Method**: [Method Description]
- **Models Evaluated**: [Models and Performance Metrics]
- **Preferred Metric**: [Metric Name]
- **Additional Relevant Metrics**: [Other Metrics]
- **Potential Biases**: [Known Biases]
- **Related Existing Datasets**: [Related Datasets]
[Any additional information or notes]