Skip to content

Latest commit

 

History

History
27 lines (21 loc) · 1.33 KB

TODO.md

File metadata and controls

27 lines (21 loc) · 1.33 KB

TODO

Model

  • Technically the weighted average of proteins has an additional division operation
    • Not sure if this matters that much since the embedding values are constantly being normalized
    • Would also require retraining the model
  • Could train a final model with both training and test datasets for true downstream use
  • Should we consider averaging over scaffolds instead of the entire genome?
    • Technically the protein IDs are specific to the scaffold (for multi-scaffold viruses), and not contiguously numbered for the entire genome
    • The easiest fix is keep track of both the scaffold as well as the entire genome
      • This probably requires to different index pointers for each
    • Then proteins can attend only within each scaffold
    • Then proteins can be aggregated over each scaffold and then again over all scaffolds in each genome
  • Huggingface integration

Package

  • This would be so much easier to use as a pypi package, but pypi does not like that I have my custom fork of pydantic-argparse as a requirement
    • Need to resolve this, perhaps using a submodule

User experience

  • Should add the following utilities:
    • Pipeline to go directly from FASTA files to PST outputs
      • This will be challenging since each requires different python/pytorch versions
    • Script to reformat protein embeddings into the graph format