COVIDx-US: An open-access benchmark dataset of ultrasound imaging data for AI-driven COVID-19 analytics
The COVID-19 pandemic continues to have a devastating effect on the health and well-being of the global population. Apart from the global health crisis, the pandemic has also caused significant economic and financial difficulties and socio-physiological implications. Effective screening, prognosis, and treatment planning play a key role in controlling the pandemic. Several recent studies have highlighted the role of point-of-care ultrasound imaging for COVID-19 screening and prognosis, particularly because it is non-invasive, widely accessible and available around the world, and easy to sanitize. Motivated by this and by the promise of artificial intelligence tools to aid clinicians, and as part of a large open-source initiative, the COVID-Net initiative, we introduce COVIDx-US, an open-access benchmark dataset of COVID-19 related ultrasound imaging data that is the largest of its kind. The COVIDx-US dataset was curated from multiple sources and consists of 242 lung ultrasound videos and 29,651 processed images of patients with COVID-19 infection, patients with non-COVID-19 infection, normal cases, and patients with other lung diseases/conditions. It also contains a standardized and unified lung ultrasound score per video file, which aids interpretation and enables other research avenues such as severity assessment. The dataset was systematically processed and validated specifically for the purpose of building and evaluating artificial intelligence algorithms and models.
Update 05/30/2022: COVIDx-US v1.5 is released. The dataset now contains a unified and standardized human "gold standard" lung ultrasound score (LUSS) per video file!
Update 07/13/2021: COVIDx-US v1.4 is released. We added three new data sources. The dataset now comprises 242 ultrasound videos and 29,651 processed ultrasound images.
Update 04/29/2021: COVIDx-US v1.3 is released. We added two new data sources (Radiopaedia and CoreUltrasound). The dataset now comprises 173 ultrasound videos and 16,822 processed ultrasound images.
Update 04/12/2021: Data dictionary added. This Excel file contains detailed information about the variables/features in the metadata files.
Update 04/07/2021: COVIDx-US v1.2 is released. We added 41 new ultrasound videos. The dataset now comprises 150 ultrasound videos and 12,493 processed ultrasound images. In addition, three labelling metadata files were released (located under the labels folder) to make it easier to formulate data science problems built on COVIDx-US as binary, 3-class, or 4-class classification problems.
Update 04/01/2021: COVIDx-US v1.1 is released. We added 16 new ultrasound videos. The dataset now comprises 109 ultrasound videos and 11,307 processed ultrasound images.
Update 03/18/2021: For a detailed description of the COVIDx-US dataset, please see our paper.
Update 03/17/2021: COVIDx-US v1.0 is released. The dataset comprises 93 ultrasound videos and 10,774 processed ultrasound images.
The current COVIDx-US dataset is constructed from the following datasets:
- ButterflyNetwork
- GrepMed
- The POCUS Atlas
- LITFL
- Radiopaedia
- CoreUltrasound
- University of Florida
- Scientific publications
- Clarius
Our goal is to encourage broad adoption and contribution to this project. The COVID-US project is an open-source open-access initiative under the terms of the GNU Affero General Public License 3.0. Please review the LICENCE document for terms. Contact the team if you wish to licence COVID-US under different terms.
Data sources with Creative Commons (CC) license:
- The POCUS Atlas - CC BY-NC 4.0
- LITFL - CC BY-NC-SA 4.0
- Radiopaedia - CC BY-NC-SA 3.0
Data sources without license information (no data usage license is mentioned on their websites):
- ButterflyNetwork
- GrepMed
- CoreUltrasound
- Clarius
Notes
- The above data sources are all public sources.
- We do not host any data on the COVIDx-US repository.
- Users are responsible for verifying with the unlicensed data sources whether their intended usage is allowed. We take no responsibility for any data use by users.
- For the licensed data sources, users are responsible for verifying that their usage is allowed under the license.
Figure: Conceptual flow of the data collection and processing.
Figure: US video of a COVID-19 patient, with panels for the cropped video, the first frame, the first frame mask, Frame-67, and the Frame-67 mask.
- National Research Council Canada
  - Ashkan Ebadi (ashkan.ebadi@nrc-cnrc.gc.ca)
  - Pengcheng Xi
  - Stephane Tremblay
- Vision and Image Processing Research Group, University of Waterloo, Canada
  - Alexander Wong (alexander.wong@uwaterloo.ca)
  - Alexander MacLean
- St. Mary’s Hospital, McGill University, Canada
  - Adrian Florea
Requirements to generate the COVIDx-US dataset:
- Python >=3.6
- Pandas >=1.1.3
- BeautifulSoup
- selenium >=3.141.0
- requests >=2.24.0
- vimeo-downloader >=0.2.4
- zipfile (part of the Python standard library)
- Jupyter
Steps to generate the COVIDx-US dataset:
- Use create_COVIDxUS.ipynb to extract the ultrasound videos from multiple sources and integrate them into the COVIDx-US dataset.
  - Note 1: Make sure to modify the file paths in the code to your own paths, if required.
  - Note 2: See the data dictionary file for details about variables/features in the metadata files.
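Before running the notebook, it may help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch, not part of the project scripts. It assumes Python 3.8 or newer (for `importlib.metadata`) and assumes the distribution names `beautifulsoup4` and `jupyter`; the helper names `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical.

```python
import re
from importlib import metadata  # available from Python 3.8 onward

# Minimum versions copied from the requirements list; None means none is stated.
MINIMUMS = {
    "pandas": "1.1.3",
    "beautifulsoup4": None,  # assumed distribution name for BeautifulSoup
    "selenium": "3.141.0",
    "requests": "2.24.0",
    "vimeo-downloader": "0.2.4",
    "jupyter": None,  # assumed distribution name for Jupyter
}


def version_tuple(version):
    """Turn '3.141.0' into (3, 141, 0); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    if minimum is None:
        return True
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

The comparison only looks at the leading numeric components of each version, so pre-release suffixes such as `rc1` are ignored. That is a simplification rather than full version-specifier handling.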
Ultrasound videos distribution per label and probe type
Class | Convex | Linear | Total |
---|---|---|---|
COVID-19 | 63 | 8 | 71 |
Pneumonia | 40 | 9 | 49 |
Normal | 19 | 9 | 28 |
Other | 68 | 26 | 94 |
Ultrasound videos distribution per label and data source
Class | ButterflyNetwork | PocusAtlas | GrepMed | LITFL | Radiopaedia | CoreUltrasound | Papers | UF | Clarius | Total |
---|---|---|---|---|---|---|---|---|---|---|
COVID-19 | 33 | 18 | 8 | 0 | 0 | 1 | 7 | 0 | 4 | 71 |
Pneumonia | 0 | 9 | 9 | 19 | 1 | 3 | 0 | 1 | 7 | 49 |
Normal | 2 | 5 | 3 | 3 | 1 | 1 | 4 | 6 | 3 | 28 |
Other | 0 | 0 | 0 | 41 | 3 | 13 | 11 | 17 | 9 | 94 |
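The two tables above show that the classes are imbalanced at the video level, which matters when training a classifier. The sketch below transcribes the per-source counts, recomputes the class and source totals, and derives inverse-frequency class weights. The weighting formula (total / (number of classes × class count)) is a common heuristic chosen here for illustration, not a method prescribed by the COVIDx-US authors, and the function names are hypothetical. Counts are per video; per-frame counts may be distributed differently.

```python
# Per-source video counts, transcribed from the table above.
SOURCES = ["ButterflyNetwork", "PocusAtlas", "GrepMed", "LITFL", "Radiopaedia",
           "CoreUltrasound", "Papers", "UF", "Clarius"]
COUNTS = {
    "COVID-19":  [33, 18, 8, 0, 0, 1, 7, 0, 4],
    "Pneumonia": [0, 9, 9, 19, 1, 3, 0, 1, 7],
    "Normal":    [2, 5, 3, 3, 1, 1, 4, 6, 3],
    "Other":     [0, 0, 0, 41, 3, 13, 11, 17, 9],
}


def class_totals(counts):
    """Sum each class row across all data sources."""
    return {label: sum(row) for label, row in counts.items()}


def source_totals(counts, sources):
    """Sum each data-source column across all classes."""
    columns = zip(*counts.values())
    return {name: sum(column) for name, column in zip(sources, columns)}


def inverse_frequency_weights(totals):
    """Heuristic class weights: total / (number of classes * class count)."""
    n_videos = sum(totals.values())
    n_classes = len(totals)
    return {label: n_videos / (n_classes * count)
            for label, count in totals.items()}


totals = class_totals(COUNTS)
per_source = source_totals(COUNTS, SOURCES)
weights = inverse_frequency_weights(totals)

for label, count in totals.items():
    share = count / sum(totals.values())
    print(f"{label:<10} videos={count:>3} share={share:.1%} "
          f"weight={weights[label]:.2f}")
```

Rarer classes such as Normal receive larger weights under this scheme. Whether to weight at the video level or the frame level depends on how the training samples are drawn.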
Please consider citing the following paper when using the COVIDx-US dataset/scripts:
@article{COVIDxUS2021,
title={COVIDx-US - An Open-Access Benchmark Dataset of Ultrasound Imaging Data for AI-Driven COVID-19 Analytics},
author={Ebadi, Ashkan and Xi, Pengcheng and MacLean, Alexander and Tremblay, Stéphane and Kohli, Sonny and Wong, Alexander},
journal={arXiv:2103.10003},
year={2021}
}
After reading the README and past/current issues, use the issue tracker to report genuine bugs, mistakes, or even small typos in the COVID-US project files. The tracker lets you browse and search all documented issues, comment on open issues, and track their progress. Note that issues are not meant for technical support; open an issue only for an error that is precise and reproducible.
You can contribute to the COVID-US initiative by providing more data or data sources, implementing new features and functionalities in the scripts, correcting errors, or improving the documentation. Feel free to submit small corrections and contributions as issues in the issue tracker. For more extensive contributions, familiarize yourself with Git and GitHub, work on your own COVID-US fork, and submit your changes via a pull request.