ELM-Research/ecg_preprocess


ECG Preprocessing

Installation

  1. Clone the repository with git and install uv, then cd into the repo and run uv sync.

  2. To run a script, execute uv run $PATH_TO_FILE. Example scripts are in scripts/.

Base Datasets

We regard base datasets as datasets that are used solely for the later mapping of external datasets. Note that DATA_DIR in src/configs/constants.py is the path to your data folder. The data folder should have the following structure:

data
├── csn
│   ├── preprocessed_1250
│   ├── preprocessed_500
│   └── preprocessed_2500
├── cpsc
│   └── ...
├── ptb_xl
│   └── ...
├── mimic_iv
│   └── ...
└── code15
    └── ...

These base datasets are enough if you only want to use the ECG datasets for pretraining with https://github.com/ELM-Research/ecg_nn or for finetuning an ELM with https://github.com/ELM-Research/ELM. The base dataset preprocessing pipeline outputs a folder of .npy files containing the ECG signal matrix and, where available, a textual report.
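As a sketch of what consuming the pipeline's output might look like, the snippet below saves and reloads a dummy ECG matrix with NumPy. The file name and the 12 x 2500 shape here are illustrative assumptions, not the pipeline's actual output layout.

```python
# Sketch: round-tripping one record as a .npy file.
# The file name and shape are illustrative assumptions.
import numpy as np

# Stand-in for a preprocessed record: 12 leads x 2500 samples.
signal = np.random.randn(12, 2500).astype(np.float32)
np.save("example_record.npy", signal)

# Reload and inspect, as a downstream consumer would.
loaded = np.load("example_record.npy")
print(loaded.shape)  # (12, 2500)
```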

PTB-XL

  1. Please download the PTB-XL dataset through this link.

  2. Create a data folder, unzip the archive inside it, and rename the extracted folder to ptb_xl.

MIMIC

  1. Please download the MIMIC-IV ECG dataset through this link.

  2. Unzip the zip file inside the data directory and rename the unzipped directory to mimic_iv.

Code-15

  1. First create a code15 folder inside the data directory.

  2. Then inside data/code15 execute the following bash script to download the data and unzip it:

#!/bin/bash

for i in {0..17}; do
    echo "Downloading part ${i}..."
    if wget -O "exams_part${i}.zip" "https://zenodo.org/records/4916206/files/exams_part${i}.zip?download=1"; then
        echo "Successfully downloaded part ${i}"
        echo "Extracting part ${i}..."
        if unzip -q "exams_part${i}.zip"; then
            echo "Successfully extracted part ${i}"
            rm "exams_part${i}.zip"
        else
            echo "Error extracting part ${i}"
        fi
    else
        echo "Error downloading part ${i}"
    fi
done

echo "All downloads and extractions completed"

CSN

  1. Create a csn folder inside the data directory.

  2. Inside data/csn execute the following command in the terminal:

wget https://physionet.org/static/published-projects/ecg-arrhythmia/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0.zip
  3. Unzip the file and move all of the contents of data/csn/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0 into data/csn. You may then delete the empty a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0 folder.
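The unzip-and-flatten step above can be sketched in Python as follows. flatten_extract is a hypothetical helper, not part of this repo, and it assumes the destination directory contains nothing but the freshly extracted top-level folder.

```python
# Sketch: extract a zip, then hoist the single top-level folder's
# contents up into the destination directory (hypothetical helper;
# assumes the destination is otherwise empty).
import shutil
import zipfile
from pathlib import Path


def flatten_extract(zip_path: str, dest: str) -> None:
    dest_dir = Path(dest)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    # Snapshot the extracted top-level folder(s), move their contents
    # directly into dest, then remove the now-empty folder(s).
    for top in [p for p in dest_dir.iterdir() if p.is_dir()]:
        for item in top.iterdir():
            shutil.move(str(item), dest_dir / item.name)
        top.rmdir()
```

The same pattern applies to the CPSC archive below, except that only the two cpsc_2018 folders are kept.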

CPSC

  1. Create a cpsc folder inside the data directory.

  2. Inside data/cpsc execute the following command in the terminal:

wget https://physionet.org/static/published-projects/challenge-2020/classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2.zip
  3. Unzip the file and move the cpsc_2018 and cpsc_2018_extra folders from data/cpsc/classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2/training into the data/cpsc directory. Then delete the classification-of-12-lead-ecgs-the-physionetcomputing-in-cardiology-challenge-2020-1.0.2 folder.

Mapping Datasets

Mapping datasets are datasets that are mapped onto the base datasets. Create these if you want to transform your custom dataset into a format compatible with https://github.com/ELM-Research/ELM. We provide several examples below, as well as the ability to upload the mapped dataset to Hugging Face. These steps are not required, since we have already uploaded most datasets to Hugging Face. The currently supported datasets are listed below.

| Dataset | Link |
| --- | --- |
| ecg-qa-ptbxl-250-2500 | willxxy/ecg-qa-ptbxl-250-2500 |
| ecg-qa-mimic-iv-ecg-250-2500 | willxxy/ecg-qa-mimic-iv-ecg-250-2500 |
| pretrain-mimic-250-2500 | willxxy/pretrain-mimic-250-2500 |
| ecg-grounding-250-2500 | willxxy/ecg-grounding-250-2500 |
| ecg-instruct-pulse-250-2500 | willxxy/ecg-instruct-pulse-250-2500 |
| ecg-bench-pulse-250-2500 | willxxy/ecg-bench-pulse-250-2500 |
| ecg-instruct-45k-250-2500 | willxxy/ecg-instruct-45k-250-2500 |

ECG-QA dataset curated by ECG-QA, Oh et al.

  1. We exactly follow the instructions in this section of the repository for mapping the PTB-XL and MIMIC-IV ECG datasets to the questions and answers. cd into ecg-qa and execute the commands there in the terminal to prepare the ECG-QA dataset.

  2. To map the ECG-QA dataset to mimic and ptb, execute the following scripts respectively.

uv run src/datasets/map/ecg_qa/mapping_ptbxl_samples.py src/datasets/map/ecg_qa/ecgqa/ptbxl/ --ptbxl-data-dir ../data/ptb_xl
uv run src/datasets/map/ecg_qa/mapping_mimic_iv_ecg_samples.py src/datasets/map/ecg_qa/ecgqa/mimic-iv-ecg --mimic-iv-ecg-data-dir ../data/mimic
  3. After mapping the datasets, you should have an output folder inside data/ecg-qa containing the mapped paraphrased and template question-answer pairs.

Pretrain MIMIC dataset curated by ECG-Chat, Zhao et al.

  1. Download the pretrain_mimic.json file from this dropbox link and place it in the corresponding folder src/datasets/map/pretrain_mimic/.

Instruct 45k MIMIC dataset curated by ECG-Chat, Zhao et al.

  1. Download the ecg_instruct_45k.json file from this link and place it in the corresponding folder src/datasets/map/ecg_intruct_45k/.

ECG Instruct Pulse dataset curated by PULSE, Liu et al.

  1. Download the ECGInstruct.json file from this link. Rename it to ecg_instruct_pulse.json and place it in the corresponding folder src/datasets/map/ecg_instruct_pulse.

ECG Bench Pulse dataset curated by PULSE, Liu et al.

  1. The ECG Bench Pulse dataset is available exclusively on Hugging Face as .parquet files, so we use the datasets library directly to download it.
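As a sketch, any of the released datasets in the table above can be pulled with the Hugging Face datasets library. load_mapped_dataset is a hypothetical helper, and the "train" split name is an assumption.

```python
# Sketch: loading released datasets via the Hugging Face `datasets`
# library. Repo ids come from the table above; the split name is an
# assumption.
HF_REPOS = {
    "ecg-qa-ptbxl-250-2500": "willxxy/ecg-qa-ptbxl-250-2500",
    "ecg-bench-pulse-250-2500": "willxxy/ecg-bench-pulse-250-2500",
}


def load_mapped_dataset(name: str):
    # Deferred import: requires `pip install datasets` and network access.
    from datasets import load_dataset
    return load_dataset(HF_REPOS[name], split="train")


print(HF_REPOS["ecg-bench-pulse-250-2500"])
```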

ECG Grounding Datasets curated by GEM, Lan et al.

  1. Download the ECG_Grounding_30k.json, ecg-grounding-test.json, and grounding_train_30k.json files from this link and place them in the corresponding folder src/datasets/map/ecg_grounding. Note that grounding_train_30k.json is a subset of ECG_Grounding_30k.json: ECG_Grounding_30k.json contains all 30k ECG grounding samples found in grounding_train_30k.json, plus additional ECG conversational data from the ECG Instruct PULSE dataset.

ECG Byte Training

We also implement training of the BPE algorithm from ECG-Byte. Train it only after preprocessing the MIMIC-IV base dataset. To run it, execute bash scripts/train_ecg_byte.sh.
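ECG-Byte's actual tokenizer lives in its own repository; as a rough illustration of the underlying idea only, the sketch below runs a generic byte-pair-encoding merge loop over an already-discretized symbol sequence.

```python
# Sketch: a generic BPE training loop over a symbol sequence.
# This is NOT ECG-Byte's implementation, just the core idea:
# repeatedly merge the most frequent adjacent pair of tokens.
from collections import Counter


def train_bpe(seq, num_merges):
    seq = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Rewrite the sequence with the chosen pair merged.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges


tokens, merges = train_bpe("ababcab", 2)
print(tokens, merges)
```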

Hugging Face upload

We have also released the code for uploading the preprocessed, mapped datasets to Hugging Face datasets. Please see scripts/upload_hf.sh for the script.

About

Simple, efficient preprocessing pipelines for publicly available ECG datasets
