- Technically the weighted average of proteins has an additional division operation
- Not sure if this matters that much since the embedding values are constantly being normalized
- Would also require retraining the model
- Could train a final model with both training and test datasets for true downstream use
- Should we consider averaging over scaffolds instead of the entire genome?
- Technically the protein IDs are specific to the scaffold (for multi-scaffold viruses), and not contiguously numbered for the entire genome
- The easiest fix is to keep track of both the scaffold and the entire genome
- This probably requires two different index pointers, one for each
- Then proteins can attend only within each scaffold
- Then proteins can be aggregated over each scaffold and then again over all scaffolds in each genome
- Huggingface integration
- This would be so much easier to use as a PyPI package, but PyPI does not like that I have my custom fork of `pydantic-argparse` as a requirement
- Need to resolve this, perhaps using a submodule
- Should add the following utilities:
- Pipeline to go directly from FASTA files to PST outputs
- This will be challenging since each requires different python/pytorch versions
- Script to reformat protein embeddings into the graph format
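
The scaffold-level idea above (separate index pointers for scaffolds and genomes, attention restricted to each scaffold, then two-stage aggregation) could look roughly like the sketch below. This is only an illustration in NumPy: the names `scaffold_ptr`, `genome_ptr`, and `segment_mean` are hypothetical and not part of the current codebase, and plain mean pooling stands in for whatever pooling the model actually uses.

```python
import numpy as np


def segment_mean(x: np.ndarray, ptr: np.ndarray) -> np.ndarray:
    """Average the rows x[ptr[i]:ptr[i + 1]] for every segment i.

    Assumes every segment is non-empty (ptr is strictly increasing).
    """
    return np.stack(
        [x[start:end].mean(axis=0) for start, end in zip(ptr[:-1], ptr[1:])]
    )


# Toy data (made up): 5 proteins with 2-dimensional embeddings.
protein_emb = np.arange(10, dtype=float).reshape(5, 2)

# Hypothetical pointer 1: protein offsets per scaffold.
# Scaffold 0 holds proteins 0-1, scaffold 1 holds protein 2,
# scaffold 2 holds proteins 3-4.
scaffold_ptr = np.array([0, 2, 3, 5])

# Hypothetical pointer 2: scaffold offsets per genome.
# Genome 0 holds scaffolds 0-1, genome 1 holds scaffold 2.
genome_ptr = np.array([0, 2, 3])

# Boolean mask so that proteins only attend within their own scaffold.
n_scaffolds = len(scaffold_ptr) - 1
scaffold_id = np.repeat(np.arange(n_scaffolds), np.diff(scaffold_ptr))
attn_mask = scaffold_id[:, None] == scaffold_id[None, :]

# Stage 1: pool proteins over each scaffold.
scaffold_emb = segment_mean(protein_emb, scaffold_ptr)

# Stage 2: pool scaffolds over each genome.
genome_emb = segment_mean(scaffold_emb, genome_ptr)
```

Note that averaging scaffold averages weights every scaffold equally, so the result generally differs from averaging all proteins in the genome directly. In the toy data, genome 0 gets `[2.5, 3.5]` from the two-stage average but `[2.0, 3.0]` from a flat average over its three proteins. Whether that is desirable is part of the open question above.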