LucaVirus: Modeling the Evolutionary and Functional Landscape of Viruses with a Unified Genome-Protein Language Model
Huggingface
https://huggingface.co/LucaGroup
On June 16, 2025, the preprint version was released and LucaVirus and LucaVirusTasks were open-sourced.
Fig. 1 The workflow of LucaVirus.
OpenVirus
We curated OpenVirus, a comprehensive, large-scale data set of viral sequences used to train the LucaVirus model.
This data set comprises 15.7 million viral sequences, totaling 25.4 billion tokens—including 23.7 billion nucleotide tokens from 10.4 million sequences and 1.6 billion amino acid tokens from 5.2 million protein sequences.
Nucleotide sequences were primarily sourced from the NCBI Virus database and seven independent viral diversity studies (9, 20-24), ensuring inclusion of sequences not available in NCBI.
Protein sequences were obtained from the UniProtKB and MGnify databases.
The OpenVirus data set covers all known viral taxa.
The major groups include: double-strand (ds) DNA viruses (27% of sequences), RNA viruses (26%), reverse-transcribing viruses (20%), single-strand (ss) DNA viruses and others (6%), and unclassified viruses (21%).
The four largest of these groups (dsDNA, RNA, reverse-transcribing, and unclassified viruses) collectively account for 94% of the total sequence count.
The data set includes viruses infecting hosts from all three domains and six kingdoms of cellular life: animals (48%), bacteria (25%), plants (12%), protists (2%), fungi (2%), and archaea (1%); a further 22% of sequences come from viruses with unknown hosts.
LucaVirus employs a semi-supervised pre-training strategy, building on the framework established by LucaOne.
The model initializes its corresponding layers with weights derived from LucaOne’s latest training checkpoint at step 1,760,000.
The pre-training process integrates self-supervised masked language modeling (MLM) with seven biologically relevant supervised tasks to enhance the model’s ability to capture diverse biological features.
These tasks are categorized as follows (a minimal multi-task loss sketch is given after the list):
Sequence-level classification tasks:
(i) Order taxonomy prediction for nucleotide sequences;
(ii) Order taxonomy prediction for protein sequences;
and (iii) UniProt functional keyword prediction for protein sequences.
Token-level classification tasks:
(i) Gene prediction for nucleotide sequences;
(ii) Protein homologous superfamily annotation;
(iii) Protein conserved domain annotation;
and (iv) Protein active site prediction.
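As referenced above, the following is a minimal sketch of how a masked language modeling objective can be combined with sequence-level and token-level classification heads in a single multi-task loss. It is illustrative only: the hidden size, vocabulary size, label counts, pooling strategy, and the unweighted sum of losses are assumptions, not the actual LucaVirus architecture or training configuration.

# Illustrative multi-task pre-training sketch; all dimensions and the unweighted
# loss sum are assumptions, not the LucaVirus implementation.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_size=2560, vocab_size=39,
                 n_orders=64, n_keywords=1000, n_token_labels=16):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)        # masked language modeling
        self.order_head = nn.Linear(hidden_size, n_orders)        # sequence-level order taxonomy
        self.keyword_head = nn.Linear(hidden_size, n_keywords)    # UniProt keywords (multi-label)
        self.token_head = nn.Linear(hidden_size, n_token_labels)  # token-level labels (gene/domain/site)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the shared encoder
        pooled = hidden_states.mean(dim=1)                         # simple mean pooling (assumption)
        return {
            "mlm": self.mlm_head(hidden_states),                   # per-token vocabulary logits
            "order": self.order_head(pooled),                      # one label per sequence
            "keyword": self.keyword_head(pooled),                  # multi-label per sequence
            "token": self.token_head(hidden_states),               # one label per token
        }

def multitask_loss(outputs, targets):
    # Sum the self-supervised MLM loss with the supervised task losses.
    ce = nn.CrossEntropyLoss(ignore_index=-100)                    # -100 marks positions without a label
    bce = nn.BCEWithLogitsLoss()
    loss = ce(outputs["mlm"].flatten(0, 1), targets["mlm"].flatten())
    loss = loss + ce(outputs["order"], targets["order"])
    loss = loss + bce(outputs["keyword"], targets["keyword"].float())
    loss = loss + ce(outputs["token"].flatten(0, 1), targets["token"].flatten())
    return loss

In practice each supervised head would only contribute a loss for samples that actually carry that label type (for example, UniProt keyword labels exist only for protein sequences), and the task losses may be weighted rather than summed equally.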
Fig. 2 LucaVirus learns interpretable representations of viral sequences that reflect genetic divergence.
2) Exploring the hidden diversity and functional proteins of viruses
Fig. 3 Exploring the hidden diversity and functional proteins of viruses.
Fig. 4 Fitting and predicting the fitness landscapes of a viral protein.
Fig. 5 Performance of LucaVirus in antibody-antigen binding prediction.
# Install git on CentOS / RHEL:
sudo yum update
sudo yum install git-all
# Install git on Ubuntu / Debian:
sudo apt-get update
sudo apt install git-all
# Download and install Anaconda, then reload the shell configuration:
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
sh Anaconda3-2022.05-Linux-x86_64.sh
source ~/.bashrc
# Create and activate the conda environment, then install the Python dependencies:
conda create -n lucavirus python=3.9.13
conda activate lucavirus
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
# (Optional) Register the environment as a Jupyter kernel:
conda activate lucavirus
conda install ipykernel
python -m ipykernel install --user --name lucavirus --display-name "Python(LucaVirus)"
# List registered kernels:
jupyter kernelspec list
# Remove the kernel when it is no longer needed:
jupyter kernelspec uninstall lucavirus
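After activating the environment, a quick sanity check can confirm that the pinned dependencies import cleanly. This assumes PyTorch is listed in requirements.txt (typical for LucaOne-derived projects); adjust the imports to whatever the file actually pins.

# Environment sanity check (assumes torch is in requirements.txt):
import sys
import torch

print("python:", sys.version.split()[0])        # expect 3.9.13
print("torch :", torch.__version__)
print("cuda  :", torch.cuda.is_available())     # True only with a GPU build and a visible GPU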
TrainedCheckPoints
The project automatically downloads the LucaVirus trained checkpoint from FTP when embedding inference is run using src/get_embedding.py or src/embedding/get_embedding.py.
For usage information, refer to src/embedding/README.md or src/get_embedding_guidance.md.
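Below is a minimal sketch of consuming an embedding produced by the inference script, assuming the script saves one PyTorch tensor per sequence with shape (sequence_length, hidden_size); the output path, tensor layout, and mean pooling here are assumptions, so see src/embedding/README.md for the actual output format and options.

# Illustrative only: load a saved per-sequence embedding and pool it into a
# fixed-length vector for downstream use. The path and layout are assumptions.
import torch

emb = torch.load("embeddings/example_protein.pt", map_location="cpu")
if isinstance(emb, dict):                 # some pipelines wrap the tensor in a dict
    emb = next(iter(emb.values()))

per_token = torch.as_tensor(emb)          # (seq_len, hidden_size)
sequence_vector = per_token.mean(dim=0)   # fixed-length sequence representation
print(per_token.shape, "->", sequence_vector.shape)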
run_multi_v1.0.sh
uses LucaOne's checkpoint (step=17600000 or 36000000) for LucaVirus training.
run_multi_v1.0_continue.sh
continues training after an interruption.
run_multi_mask_v1.0.sh
trains LucaVirus using only the masked language modeling (MLM) pre-training task.
run_multi_v1.0_gene.sh
trains LucaVirus using only viral gene (DNA + RNA) data.
run_multi_v1.0_prot.sh
trains LucaVirus using only viral protein data.
run_multi_v1.0_single.sh
trains LucaVirus using only a single GPU.
# Monitor training progress with TensorBoard:
tensorboard --logdir tb-logs --bind_all --port 8008
The pre-training data will be released soon.
The downstream task datasets and checkpoints can be accessed at: LucaVirus
or at: Zenodo
Foundation Model: LucaVirus
Downstream Tasks: LucaVirusTasks
Yong He,
Yuan-Fei Pan,
Zhaorong Li,
Mang Shi,
Yuqi Liu
@article{LucaVirus,
author = {Pan, Yuan-Fei* and He, Yong* and Liu, Yu-Qi and Shan, Yong-Tao and Liu, Shu-Ning and Liu, Xue and Pan, Xiaoyun and Bai, Yinqi and Xu, Zan and Wang, Zheng and Ye, Jieping and Holmes, Edward C. and Li, Bo and Chen, Yao-Qing and Li, Zhao-Rong and Shi, Mang},
title = {Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus},
elocation-id = {2025.06.14.659722},
year = {2025},
doi = {10.1101/2025.06.14.659722},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722},
eprint = {https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722.full.pdf},
journal = {bioRxiv}
}