The following tables report ROC-AUC × 100 for ChordMixer with and without pretraining, compared against other models.
| Model/Dataset | GRCh38 |
|---|---|
| FineTunedChordMixer | 89.87 |
| ProbedChordMixer | 86.28 |
| ChordMixer | 84.90 |
| KeGRU | 70.16 |
| DeeperDeepSEA | 86.93 |
| Transformer | 68.69 |
| Nyströmformer | 82.58 |
| Poolformer | 76.00 |
| Linformer | 83.22 |
| Model/Dataset | A. Thaliana | B. Distachyon | O. Sativa MH | O. Sativa ZS | S. Italica | S. Bicolor | Z. Mays |
|---|---|---|---|---|---|---|---|
| FineTunedChordMixer | 93.01 | 93.50 | 93.99 | 93.52 | 94.50 | 96.59 | 96.69 |
| ProbedChordMixer | 91.51 | 93.25 | 92.73 | 92.40 | 93.18 | 95.53 | 96.98 |
| ChordMixer | 89.53 | 91.14 | 90.95 | 90.79 | 91.93 | 94.52 | 92.98 |
| KeGRU | 90.83 | 92.35 | 92.15 | 92.14 | 92.99 | 95.84 | 94.64 |
| DeeperDeepSEA | 90.12 | 90.05 | 91.02 | 89.70 | 91.92 | 94.54 | 94.84 |
| Transformer | 62.51 | 75.98 | 71.40 | 75.01 | 82.31 | 82.47 | 60.96 |
| Nyströmformer | 73.54 | 81.21 | 77.89 | 76.85 | 83.55 | 87.17 | 76.95 |
| Poolformer | 74.81 | 79.64 | 74.56 | 76.20 | 81.22 | 83.69 | 74.21 |
| Linformer | 60.33 | 72.75 | 62.39 | 71.04 | 70.98 | 78.05 | 53.14 |
| Model/Dataset | Carassius vs. Labeo | Sus vs. Bos | Danio vs. Cyprinus |
|---|---|---|---|
| FineTunedChordMixer | 97.35 | 96.59 | 98.67 |
| ProbedChordMixer | 97.55 | 96.67 | 98.62 |
| ChordMixer | 97.19 | 95.70 | 98.93 |
| KeGRU | 97.02 | 94.36 | 98.74 |
| DeeperDeepSEA | 97.49 | 96.53 | 99.12 |
| Transformer | 92.19 | 86.70 | 90.06 |
| Nyströmformer | 89.85 | 91.62 | 94.31 |
| Poolformer | 90.72 | 86.55 | 92.81 |
| Linformer | 86.44 | 87.88 | 87.00 |
All of the requirements and their versions are listed in `requirements.txt` and can be installed by running:

```bash
pip install -r requirements.txt
```
The configurations and hyperparameters can be found in the `configs` folder. Since they are managed by Hydra, they can be modified or overridden either directly in the config files or by passing overrides as arguments to the training script:

```bash
python train.py --config-name=MAIN_CONFIG_NAME dataset=DATASET_CONFIG_NAME parameter=new_value
```
Note that `--config-name` and `dataset` always have to be passed to `train.py` as arguments. For a list of available config names and datasets, as well as other configuration parameters, run:

```bash
python train.py --help
```
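For example, a fine-tuning run with an overridden hyperparameter might look like the command below. The `optimizer.lr` key is only an illustration of the override syntax; check `python train.py --help` and the files in `configs` for the actual parameter names.

```bash
# Illustrative Hydra override; the optimizer.lr key is an assumed example, not a documented parameter.
python train.py --config-name=chordmixer_finetuning dataset=DATASET_CONFIG_NAME optimizer.lr=1e-4
```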
Three pretrained models are available for fine-tuning and probing. They can be found in the `models` directory:

- `pcm-cl-1000-human.pt`: trained on the human reference genome GRCh38
- `pcm-cl-1000-plant.pt`: trained on plant DNA
- `pcm-vl.pt`: trained on a dataset containing DNA sequences of varying lengths
Pretrained models can be instantiated and fine-tuned as follows:
```python
import torch.nn as nn

# ChordMixerEncoder is provided by this repository.
class FineTunedChordMixer(nn.Module):
    def __init__(self, model_path, freeze, variable_length, n_class):
        super(FineTunedChordMixer, self).__init__()
        # Load the pretrained encoder; `freeze` keeps its weights fixed for probing.
        self.encoder = ChordMixerEncoder.from_pretrained(
            model_path=model_path,
            freeze=freeze,
            variable_length=variable_length
        )
        self.classifier = ...  # task-specific classification head

    def forward(self, batch):
        ...
```
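For instance, the human checkpoint could be instantiated for a binary classification task roughly as follows. The checkpoint path and argument values are illustrative assumptions, not prescribed settings:

```python
# Illustrative instantiation; path and argument values are assumptions.
model = FineTunedChordMixer(
    model_path="models/pcm-cl-1000-human.pt",
    freeze=False,            # set True to probe the frozen encoder instead of fine-tuning it
    variable_length=False,   # assumption: the cl-1000 checkpoints use fixed-length inputs
    n_class=2,
)
```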
The model expects one-hot-encoded DNA sequences as input. Run fine-tuning:
```bash
python train.py --config-name=chordmixer_finetuning dataset=DATASET_CONFIG_NAME
```
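For reference, a minimal sketch of one-hot encoding a DNA sequence is shown below. The A/C/G/T channel order is an assumption; the repository's data loaders define the actual encoding used during training.

```python
import numpy as np

# Minimal sketch of one-hot encoding; the A/C/G/T channel order is an assumption.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return an (L, 4) float array with one channel per base."""
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base in BASE_TO_INDEX:  # unknown bases (e.g. N) stay all-zero
            encoded[position, BASE_TO_INDEX[base]] = 1.0
    return encoded

print(one_hot_encode("ACGTN"))
```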
Pretraining can be initiated using the following command:

```bash
python train.py --config-name=CONFIG_NAME
```