NCBI Nucleotide database (NT) without viruses

The following steps exclude virus sequences from the NT database, starting from its sequence file in FASTA format (NT fasta).

1. Remove the viral sequences listed in NCBI Virus.

We have pre-collected the IDs of all virus nucleic acid sequences available on the NCBI Virus website (as of 18 June 2024).

Example code

seqkit grep -f NCBI_Virus_DNA_viruses_AccessionList_20240617.tsv -v nt.fasta -o NT_noViruses_0.fas
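To see what this step does without running it on the full NT file, the same filtering can be sketched with plain awk on a toy FASTA. This is only an illustration under stated assumptions: the accessions and file names below are made up, and the sequence ID is taken to be the first whitespace-delimited token of each header line. For real data, seqkit remains the practical choice.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical toy accession list (made-up IDs, one per line).
cat > "$workdir/virus_ids.tsv" <<'EOF'
VIR0001.1
VIR0002.1
EOF

# Hypothetical toy FASTA with two "viral" records and one other record.
cat > "$workdir/toy.fasta" <<'EOF'
>VIR0001.1 toy virus genome
ACGTACGT
>BAC0001.1 toy bacterial contig
GGGGCCCC
TTTTAAAA
>VIR0002.1 another toy virus
ACGT
EOF

# First file: remember the accessions to drop.
# Second file: on each header line, decide whether the record is kept;
# every line (header or sequence) is printed only while keep is true.
awk 'NR == FNR { drop[$1]; next }
     /^>/      { id = substr($1, 2); keep = !(id in drop) }
     keep' "$workdir/virus_ids.tsv" "$workdir/toy.fasta" > "$workdir/toy_noViruses.fas"

cat "$workdir/toy_noViruses.fas"
```

Only the `BAC0001.1` record, with both of its sequence lines, should remain in the output file.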

2. Remove the viral sequences based on their taxonomic IDs (taxids).

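Required files from NCBI Taxonomy:

nucl_gb.accession2taxid (maps nucleotide accessions to taxids)

taxdump (taxonomy dump, unpacked into ./taxdump for taxonkit)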

seqkit seq -n -i NT_noViruses_0.fas > NT_noViruses_0.id

grep -F -f NT_noViruses_0.id nucl_gb.accession2taxid > NT_noViruses_0.id2taxid

sed -i '/^$/d' NT_noViruses_0.id2taxid
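Fixed-string matching with a pattern file looks for each ID anywhere in a line, so an ID such as `AB0001.1` can also hit `AB0001.10` or `XAB0001.1`. A possible alternative is an exact join on the second column of the mapping file. The sketch below assumes the usual four-column layout of accession2taxid files (accession, accession.version, taxid, gi, with a header line) and uses made-up accessions and hypothetical file names.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical toy ID list (made-up accession.version values).
printf '%s\n' 'AB0001.1' 'CD0002.2' > "$workdir/toy.id"

# Toy mapping file in the assumed accession2taxid layout.
{
  printf 'accession\taccession.version\ttaxid\tgi\n'
  printf 'AB0001\tAB0001.1\t562\t111\n'
  printf 'AB0001\tAB0001.10\t9606\t112\n'    # would match as a substring
  printf 'XAB0001\tXAB0001.1\t10090\t113\n'  # would match as a substring
  printf 'CD0002\tCD0002.2\t4932\t114\n'
} > "$workdir/toy.accession2taxid"

# First file: load the wanted IDs.
# Second file: keep rows whose accession.version (column 2) is exactly one of them.
# The header row is dropped automatically because it is not a wanted ID.
awk -F'\t' 'NR == FNR { want[$1]; next }
            ($2 in want)' "$workdir/toy.id" "$workdir/toy.accession2taxid" \
  > "$workdir/toy.id2taxid"

cat "$workdir/toy.id2taxid"
```

On the toy data this keeps only the `AB0001.1` and `CD0002.2` rows. The join holds the ID list in memory, so memory use grows with the number of IDs.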


cut -f3 NT_noViruses_0.id2taxid \
| sort -u \
| taxonkit --data-dir ./taxdump lineage \
| awk '$2>0' \
| taxonkit --data-dir ./taxdump reformat -f "{k}\t{kID}\t{p}\t{pID}\t{c}\t{cID}\t{o}\t{oID}\t{f}\t{fID}\t{g}\t{gID}\t{s}\t{sID}" -F \
| cut -f1,2,4,6,8,10,12,14 \
> NT_noViruses_0.tax

awk -F'\t' '$3 != 10239' NT_noViruses_0.tax > NT_noViruses_1.tax
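The last filter only has an effect if it tests the column that actually holds the superkingdom taxid (10239 is the taxid of Viruses). A quick way to confirm the column before filtering is sketched below. The three-column toy table (taxid, lineage, superkingdom taxid) and its file names are assumptions made for illustration and do not reproduce the real layout of `NT_noViruses_0.tax`.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical toy table: taxid, lineage, superkingdom taxid.
{
  printf '11320\tViruses;Riboviria\t10239\n'
  printf '562\tcellular organisms;Bacteria\t2\n'
  printf '9606\tcellular organisms;Eukaryota\t2759\n'
} > "$workdir/toy.tax"

# Find the first column in which the value 10239 occurs.
col=$(awk -F'\t' '{ for (i = 1; i <= NF; i++) if ($i == "10239") { print i; exit } }' \
      "$workdir/toy.tax")

# Stop if no column contains 10239; filtering would then be meaningless.
[ -n "$col" ] || { echo "no column contains taxid 10239" >&2; exit 1; }
echo "taxid 10239 found in column $col"

# Keep only rows whose value in that column is not 10239.
awk -F'\t' -v c="$col" '$c != "10239"' "$workdir/toy.tax" > "$workdir/toy_noViruses.tax"

cat "$workdir/toy_noViruses.tax"
```

On the toy table the check reports column 3, and the filtered file keeps the two non-viral rows. Comparing line counts before and after the filter on the real table is a cheap way to verify that viral rows were actually removed.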