- If you have `git clone`d the repo and have `uv` installed, just `cd` into the repo and run `uv sync`.
- To run a script, just do `uv run $PATH_TO_FILE`. There are examples in `scripts/`.
We regard base datasets as datasets that are used solely for the later mapping of external datasets. Note that `DATA_DIR` in `src/configs/constants.py` is the path to your data folder.
The data folder should have the following structure:
```
data
├── csn
│   ├── preprocessed_1250
│   ├── preprocessed_500
│   └── preprocessed_2500
├── cpsc
│   └── ...
├── ptb_xl
│   └── ...
├── mimic_iv
│   └── ...
└── code15
    └── ...
```
These base datasets are sufficient if you only want to use the ECG datasets for pretraining with https://github.com/ELM-Research/ecg_nn or for finetuning an ELM with https://github.com/ELM-Research/ELM. The output of the base dataset preprocessing pipeline is a folder of `.npy` files, each containing the ECG signal matrix and a textual report if available.
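As a rough sketch of what a preprocessed sample might look like, the snippet below writes and reads back one `.npy` file. The dict keys, lead count, and sequence length here are illustrative assumptions, not the pipeline's documented schema:

```python
import numpy as np

# Hypothetical sample layout: the keys "ecg" and "report" and the
# (12, 2500) shape are assumptions for illustration, not the pipeline's
# documented schema.
sample = {
    "ecg": np.zeros((12, 2500), dtype=np.float32),  # 12 leads x 2500 timesteps
    "report": "sinus rhythm, no acute abnormalities",
}
np.save("example_sample.npy", sample)

# allow_pickle=True is required to load a dict saved with np.save;
# .item() unwraps the 0-d object array back into the dict.
loaded = np.load("example_sample.npy", allow_pickle=True).item()
print(loaded["ecg"].shape)  # (12, 2500)
```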
- Please download the PTB-XL dataset through this link.
- Create a `data` folder, unzip the zip file inside the `data` folder, and rename the unzipped folder to `ptb_xl`.
- Please download the MIMIC-IV ECG dataset through this link.
- Unzip the zip file inside the `data` directory and rename the unzipped directory to `mimic_iv`.
- First, create a `code15` folder inside the `data` directory.
- Then, inside `data/code15`, execute the following bash script to download and unzip the data:
```bash
#!/bin/bash
for i in {0..17}; do
    echo "Downloading part ${i}..."
    if wget -O "exams_part${i}.zip" "https://zenodo.org/records/4916206/files/exams_part${i}.zip?download=1"; then
        echo "Successfully downloaded part ${i}"
        echo "Extracting part ${i}..."
        if unzip -q "exams_part${i}.zip"; then
            echo "Successfully extracted part ${i}"
            rm "exams_part${i}.zip"
        else
            echo "Error extracting part ${i}"
        fi
    else
        echo "Error downloading part ${i}"
    fi
done
echo "All downloads and extractions completed"
```
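After the loop finishes, a quick sanity check can confirm every part was extracted. The filename pattern `exams_part${i}.hdf5` is an assumption about what each zip extracts to; adjust it if your archives differ:

```shell
# Report any of the 18 parts whose extracted file is missing.
# exams_part${i}.hdf5 is an assumed filename pattern; adjust as needed.
check_code15() {
  local missing=0
  for i in {0..17}; do
    if [ ! -f "exams_part${i}.hdf5" ]; then
      echo "Missing part ${i}"
      missing=$((missing + 1))
    fi
  done
  echo "${missing} part(s) missing"
}
check_code15
```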
- Create a `csn` folder inside the `data` directory.
- Inside `data/csn`, execute the following command in the terminal:
wget https://physionet.org/static/published-projects/ecg-arrhythmia/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0.zip
- Unzip the file, and inside `data/csn/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0`, move all of the contents out to `data/csn`. Then you may delete the `a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0` folder.
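The unzip, move, and clean-up steps above can be sketched as a few commands run from inside `data/csn` (guarded so nothing happens unless the zip is actually present):

```shell
# Run from inside data/csn after the wget above completes.
CSN_ZIP="a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0"
if [ -f "${CSN_ZIP}.zip" ]; then
  unzip -q "${CSN_ZIP}.zip"       # extracts into a folder named ${CSN_ZIP}
  mv "${CSN_ZIP}"/* .             # move the contents up into data/csn
  rm -r "${CSN_ZIP}"              # delete the now-empty folder
else
  echo "${CSN_ZIP}.zip not found in the current directory"
fi
```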
- Create a `cpsc` folder inside the `data` directory.
- Inside `data/cpsc`, execute the following command in the terminal:
wget https://physionet.org/static/published-projects/challenge-2020/classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2.zip
- Unzip the file, and from `data/cpsc/classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2/training`, move the `cpsc_2018` and `cpsc_2018_extra` folders into the `data/cpsc` directory. Then delete the `classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2` folder.
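As with CSN, the steps above can be sketched as commands run from inside `data/cpsc` (guarded so nothing happens unless the zip is present):

```shell
# Run from inside data/cpsc after the wget above completes.
CPSC_DIR="classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2"
if [ -f "${CPSC_DIR}.zip" ]; then
  unzip -q "${CPSC_DIR}.zip"
  # pull the two training subfolders up into data/cpsc
  mv "${CPSC_DIR}/training/cpsc_2018" "${CPSC_DIR}/training/cpsc_2018_extra" .
  rm -r "${CPSC_DIR}"             # delete the leftover folder
else
  echo "${CPSC_DIR}.zip not found in the current directory"
fi
```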
Mapping datasets are datasets that are mapped onto the base datasets. Create these datasets if you want to transform your custom dataset into a format compatible with https://github.com/ELM-Research/ELM. We provide several examples below, as well as the ability to upload the mapped dataset to HuggingFace. These steps are not required, as we have already uploaded most datasets to HuggingFace. Here are the currently supported datasets.
ECG-QA dataset curated by ECG-QA, Oh et al.
- We exactly follow the instructions in this section of the repository for mapping the PTB-XL and MIMIC-IV ECG datasets to the questions and answers. `cd` into `ecg-qa` and execute the commands in the terminal there to prepare the ECG-QA dataset.
- To map the ECG-QA dataset to MIMIC-IV and PTB-XL, execute the following scripts respectively:
```bash
uv run src/datasets/map/ecg_qa/mapping_ptbxl_samples.py src/datasets/map/ecg_qa/ecgqa/ptbxl/ --ptbxl-data-dir ../data/ptb_xl
uv run src/datasets/map/ecg_qa/mapping_mimic_iv_ecg_samples.py src/datasets/map/ecg_qa/ecgqa/mimic-iv-ecg --mimic-iv-ecg-data-dir ../data/mimic
```
- After mapping the datasets, you should have an output folder in the `data/ecg-qa` folder with the mapped `paraphrased` and `template` questions and answers.
Pretrain MIMIC dataset curated by ECG-Chat, Zhao et al.
- Download the `pretrain_mimic.json` file from this dropbox link and place it in the corresponding folder `src/datasets/map/pretrain_mimic/`.
Instruct 45k MIMIC dataset curated by ECG-Chat, Zhao et al.
- Download the `ecg_instruct_45k.json` file from this link and place it in the corresponding folder `src/datasets/map/ecg_intruct_45k/`.
ECG Instruct Pulse dataset curated by PULSE, Liu et al.
- Download the `ECGInstruct.json` file from this link. Rename it to `ecg_instruct_pulse.json` and place it in the corresponding folder `src/datasets/map/ecg_instruct_pulse`.
ECG Bench Pulse dataset curated by PULSE, Liu et al.
- The ECG Bench Pulse dataset is available exclusively on HuggingFace as `.parquet` files; therefore, we utilize the `datasets` library directly to download the dataset.
ECG Grounding Datasets curated by GEM, Lan et al.
- Download `ECG_Grounding_30k.json`, `ecg-grounding-test.json`, and `grounding_train_30k.json` from this link and place them in the corresponding folder `src/datasets/map/ecg_grounding`. A quick note: `grounding_train_30k.json` is a subset of `ECG_Grounding_30k.json`, where `ECG_Grounding_30k.json` contains all 30k ECG grounding samples found in `grounding_train_30k.json`, plus additional ECG conversational data from the ECG Instruct Pulse dataset.
We also implement training the BPE algorithm from ECG-Byte. The tokenizer should be trained only after preprocessing the MIMIC-IV base dataset.
Please execute `bash scripts/train_ecg_byte.sh`.
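For intuition only, the core BPE training step repeatedly replaces the most frequent adjacent symbol pair with a new merged symbol. The toy sketch below is a generic illustration of that merge step, not ECG-Byte's implementation (which operates on quantized ECG symbol sequences):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most frequent adjacent symbol pair across all sequences."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus of discrete symbols standing in for quantized ECG tokens.
corpus = [list("aabab"), list("aab")]
top = most_frequent_pair(corpus)              # ('a', 'b') occurs 3 times
merged = [merge_pair(seq, top, "Z") for seq in corpus]
print(top, merged)  # ('a', 'b') [['a', 'Z', 'Z'], ['a', 'Z']]
```

A real tokenizer iterates this loop until a target vocabulary size is reached, recording each merge in order.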
We have also released the code for uploading the preprocessed, mapped datasets to HuggingFace datasets. Please view `scripts/upload_hf.sh` for the script!