Extract COG Sequences

This script extracts nucleotide and amino acid sequences corresponding to a list of COGs (Clusters of Orthologous Groups) from annotated microbial genomes. It works with Prokka-annotated genomes, eggNOG-mapper output, and genome FASTA files, and organizes the extracted sequences into per-COG folders.

Features

- Parses .emapper.annotations files from eggNOG-mapper to identify gene-to-COG mappings.
- Matches genes to coordinates in Prokka GFF files and extracts nucleotide and amino acid sequences.
- Organizes outputs into folders per COG (with gene name included).
- Supports different overwrite policies:
  - ask: prompts for each existing file
  - always: always overwrite existing files
  - never: never overwrite
  - dry-run: simulate what would be done, without writing anything
- Logs all steps to cog_extraction_log.txt and prints progress to terminal.
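
As an illustration of how the four overwrite policies differ, here is a minimal sketch of a single decision function. The name `should_write` and its `prompt` parameter are hypothetical and are not taken from the script; the actual implementation may be organized differently.

```python
import os

def should_write(path, policy, prompt=input):
    """Decide whether `path` may be written under the given overwrite policy."""
    if policy == "dry-run":
        return False  # simulate only: nothing is ever written
    if not os.path.exists(path):
        return True  # no existing file, so there is nothing to overwrite
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "ask":
        # Assumed prompt wording; only "y" or "yes" counts as consent
        answer = prompt(f"{path} exists. Overwrite? [y/N] ")
        return answer.strip().lower() in ("y", "yes")
    raise ValueError(f"unknown overwrite policy: {policy!r}")
```

Passing the prompt function as a parameter keeps the sketch testable without a terminal.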

Required Input Files

Genome annotations (Prokka):

./prokka/<genome_id>/<genome_id>.gff
./prokka/<genome_id>/<genome_id>.faa

eggNOG-mapper outputs:
```
./eggnog/*.emapper.annotations
```
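
The gene-to-COG mapping step could look roughly like the sketch below. It assumes that comment and header lines in the annotations file start with `#`, that the first tab-separated column is the query gene ID, and that COG IDs appear somewhere in the remaining columns (for example as `COG0605@1|root`). Column layouts vary between eggNOG-mapper versions, so the sketch searches the whole line instead of relying on a fixed column index. `parse_emapper_lines` is a hypothetical name, not the script's own function.

```python
import re

COG_PATTERN = re.compile(r"\bCOG\d{4}\b")

def parse_emapper_lines(lines, wanted_cogs):
    """Map each query gene ID to the wanted COG IDs found on its annotation line."""
    gene_to_cogs = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and the header row
        fields = line.rstrip("\n").split("\t")
        gene_id = fields[0]
        # Look for COG IDs in every column after the query ID
        hits = set(COG_PATTERN.findall("\t".join(fields[1:]))) & set(wanted_cogs)
        if hits:
            gene_to_cogs[gene_id] = sorted(hits)
    return gene_to_cogs
```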
Genome nucleotide FASTA files (.fna):
```
./genomes_selected/*.fna
```
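
Given a contig sequence from one of these .fna files and a CDS line from the matching Prokka GFF, coordinate-based extraction can be sketched as follows. GFF3 coordinates are 1-based and inclusive, and genes on the minus strand need to be reverse-complemented. The sketch assumes the gene ID sits in the `ID` attribute of the ninth column and uses plain Python to stay self-contained; the script itself depends on Biopython and may handle this differently. Both function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def parse_gff_cds(line):
    """Return (gene_id, contig, start, end, strand) for a CDS line, else None."""
    if line.startswith("#"):
        return None  # directive or comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "CDS":
        return None  # not a feature line, or not a coding sequence
    attributes = dict(
        item.split("=", 1) for item in fields[8].split(";") if "=" in item
    )
    return attributes.get("ID"), fields[0], int(fields[3]), int(fields[4]), fields[6]

def extract_nt(contig_seq, start, end, strand):
    """Slice a 1-based inclusive interval; reverse-complement minus-strand genes."""
    seq = contig_seq[start - 1:end]
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq
```

Ambiguity codes such as `N` pass through `translate` unchanged, which is sufficient for a sketch but not a full IUPAC complement.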
COG list (cog_gene_list.txt):
A tab-delimited file with a header row and two columns:
- cog (required): the COG ID
- gene (optional): a gene name used for folder naming
Example:
```
cog	gene
COG0605	Superoxide_dismutase
COG2032	CU/Zn_SOD
COG2077
```
- If gene is missing or empty, the COG ID will be used as the gene name.
- This file must exist in the same directory as the script.
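
A minimal sketch of reading this file and deriving folder names is shown below. The `<cog>_<gene>` folder pattern follows the output paths in this README. Replacing `/` in gene names (as in `CU/Zn_SOD`) with `_` is an assumption made here so that the name stays a single folder; how the script treats such names is not documented. `read_cog_list` is a hypothetical name.

```python
import csv

def read_cog_list(lines):
    """Return {cog_id: folder_name} from tab-delimited lines with a 'cog' column."""
    folders = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        cog = (row.get("cog") or "").strip()
        if not cog:
            continue  # ignore rows without a COG ID
        # A missing or empty gene column falls back to the COG ID
        gene = (row.get("gene") or "").strip() or cog
        # Assumption: keep the folder name free of path separators
        folders[cog] = f"{cog}_{gene.replace('/', '_')}"
    return folders
```

It accepts any iterable of lines, so an open file handle works: `read_cog_list(open("cog_gene_list.txt", newline=""))`.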
(Optional) Genome list (genome_list.txt)
If present, this file restricts the analysis to the listed genomes.
- Format: plain text, one genome ID per line.
- Example:
```
GCA_00012345.1
GCA_00067890.1
```
- If this file is not present, the script will use all genome folders in the Prokka directory.
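
The fallback between the genome list and the Prokka directory can be sketched like this. The function name and its default arguments are hypothetical; the defaults only mirror the paths described in this README.

```python
import os

def select_genomes(prokka_dir="prokka", list_file="genome_list.txt"):
    """Return genome IDs from list_file if it exists, else every folder in prokka_dir."""
    if os.path.isfile(list_file):
        with open(list_file) as handle:
            # One genome ID per line; blank lines are ignored
            return [line.strip() for line in handle if line.strip()]
    return sorted(
        name for name in os.listdir(prokka_dir)
        if os.path.isdir(os.path.join(prokka_dir, name))
    )
```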

Output

Nucleotide sequences:

./cog_sequences_nt/COG0605_Superoxide_dismutase/GCA_00012345_COG0605.fna

Amino acid sequences:

./cog_sequences_aa/COG0605_Superoxide_dismutase/GCA_00012345_COG0605.faa

Log file:
```
cog_extraction_log.txt
```
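
For reference, the output layout above can be expressed as a small path-building helper. This is only a sketch: `output_paths` is a hypothetical name, and because this README does not state how the `GCA_00012345` file prefix is derived from the genome ID, the prefix is passed in as an argument.

```python
import os

def output_paths(genome_prefix, cog, gene):
    """Return (nucleotide_path, amino_acid_path) for one genome and one COG."""
    folder = f"{cog}_{gene}"
    nt_path = os.path.join("cog_sequences_nt", folder, f"{genome_prefix}_{cog}.fna")
    aa_path = os.path.join("cog_sequences_aa", folder, f"{genome_prefix}_{cog}.faa")
    return nt_path, aa_path
```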

Usage

python Extract_COG_sequences.py [ask|always|never|dry-run]

If no argument is passed, the default is ask.
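
The argument handling described above might be implemented along these lines. `parse_policy` and `VALID_POLICIES` are hypothetical names, and rejecting unknown values with a usage message is an assumption, since the script's behavior for invalid arguments is not described here.

```python
VALID_POLICIES = ("ask", "always", "never", "dry-run")

def parse_policy(argv):
    """Return the overwrite policy from an argv-style list, defaulting to 'ask'."""
    if len(argv) < 2:
        return "ask"  # no argument passed
    policy = argv[1]
    if policy not in VALID_POLICIES:
        options = "|".join(VALID_POLICIES)
        raise SystemExit(f"Usage: python Extract_COG_sequences.py [{options}]")
    return policy
```

In a script this would typically be called as `parse_policy(sys.argv)`.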

Examples

Run using all genome folders (default behavior):

python Extract_COG_sequences.py always

Run using a genome ID list file:

Create a genome_list.txt file in the same directory as the script:
```
GCA_00012345.1
GCA_00067890.1
```
Then run the script:
```
python Extract_COG_sequences.py ask
```

Perform a dry run to preview actions:

python Extract_COG_sequences.py dry-run

Dependencies

Python 3.6+
Biopython (pip install biopython)
A properly structured input directory tree as described above

Date: July 2025
Institution: Okinawa Institute of Science and Technology (OIST)
Author: Fatima Li-Hau (with assistance from ChatGPT)

Citation

If you use this code in your research, please cite it as:

Li-Hau, F. (2025). Extract COG Sequences Tool (Version 1.0) [Computer software]. Okinawa Institute of Science and Technology. https://github.com/microfafa-gh/Utilities/

Please also consider citing the tools used in this workflow, such as:

Cock et al., 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics.
Seemann, T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics.
Huerta-Cepas et al., 2017. eggNOG-mapper: functional annotation of orthologs. Molecular Biology and Evolution.
