Anand Maurya edited this page Dec 21, 2022 · 36 revisions

Requirements:

Python 3.x


Installation:

You can install the dependencies in the base environment or set up a new virtual environment.

Download or clone the repository and use the requirements.txt provided in the package to install all the dependencies.

pip3 install -r requirements.txt

Usage:

sra-annotator [-q] [-o OUTPUT] [-m {quick,full}] [-f] [-d DICTIONARY] [-v] [-h] query
  • A query is required: supply either a search term or a plain-text file listing SRA accessions with -q.
  • Arguments shown in [] are optional.

Examples:

sra-annotator -q PRJNA868738
sra-annotator -q PRJNA868738 -m full
sra-annotator -q PRJNA868738 -m full -d example/keyword-dict.json
sra-annotator -q example/sra-accn-list.txt -m full -f
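
The last example reads its accessions from a plain-text file, which should hold one accession per line. A minimal Python sketch for producing such a file is shown below. The helper name `write_accession_list` is hypothetical and not part of sra-annotator; it only strips whitespace, drops blanks and duplicates, and writes one accession per line.

```python
from pathlib import Path


def write_accession_list(accessions, path):
    """Write accessions to `path`, one per line (hypothetical helper)."""
    cleaned = []
    for acc in accessions:
        acc = acc.strip()
        # Skip blank entries and duplicates while preserving input order.
        if acc and acc not in cleaned:
            cleaned.append(acc)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

Calling `write_accession_list(["PRJNA868738", "SRR15736787"], "my-accn-list.txt")` would create a file that could then be passed as `sra-annotator -q my-accn-list.txt -m full`. The file name here is only an illustration.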

Input options

Short Long Default Details
-q --query required Query string to search the SRA. Enclose the query in single quotes if it contains multiple words, e.g.:

PRJNA868738
SRR15736787
'PRJNA761299 OR SRR15736787'
'trna[Text Word] AND "Danio rerio"[Organism]'
'(2008[Publication Date] : 2009[Publication Date]) AND "arabidopsis thaliana"[Organism]'
'petals[Text Word] AND Arabidopsis thaliana[Organism]'

Please refer to https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/ to learn more about basic and advanced search in NCBI SRA.

This option can also accept a list of SRA accessions in a plain text file.

Please make sure that the file contains one accession per line.

An example file containing a list of accessions is provided in the example/ directory.
-o --output pwd Output directory to store the results.
-m --mode quick quick mode dumps the run-level annotation, whereas full mode attempts to retrieve both the run-level and sample-level annotation.

full mode also converts the annotation from JSON to CSV for each run accession.

full mode might not work with all complex queries.
-f --fastq optional Locate the web address of the raw data and generate a script to download the fastq file(s).

Depending on the number of fastq files to be searched, this can take some time.
-d --dictionary optional This option takes a JSON file as input and uses the designated keywords from the file to identify the samples.

Example: { "tissue": ["flower", "petals"], "reagent": "trizol" }.

In this example, the tool searches the metadata of each sample for the keywords flower, petals, and trizol. It then produces a report with the columns tissue and reagent, listing all the accessions that match the keywords.

When the dictionary contains several terms, full mode is recommended because it also retrieves the sample-level annotation, which increases the likelihood of matches.

An example JSON file is provided in the example/ directory.
-h --help Show the usage instructions.
-v --version Show the version.
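
The file passed to -d has to be valid JSON, so every keyword must be a quoted string, including those inside lists. One way to avoid syntax slips is to build the dictionary in Python and let the standard `json` module write it. The sketch below assumes, based on the example above, that each value is either a single string or a list of strings; the helper name `write_keyword_dict` is hypothetical and not part of sra-annotator.

```python
import json


def write_keyword_dict(keywords, path):
    """Write a keyword dictionary as valid JSON (hypothetical helper)."""
    for key, value in keywords.items():
        # Assumption: a value is one keyword or a list of keywords.
        terms = value if isinstance(value, list) else [value]
        if not all(isinstance(t, str) and t for t in terms):
            raise ValueError(f"keywords for {key!r} must be non-empty strings")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(keywords, handle, indent=2)
```

For instance, `write_keyword_dict({"tissue": ["flower", "petals"], "reagent": "trizol"}, "my-keyword-dict.json")` writes a file equivalent to the example above; the output file name is only an illustration.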

Output

The tool organizes the output into three folders and, when -d is used, one CSV file:

  • sra_data/fastq_source contains the bash script to download the raw fastq files (paired-end or single-end). These files are generated when the -f option is used.
  • sra_data/annotation_json contains the run-level annotation of each SRA run accession in .json format. These files are generated in both -m quick and -m full modes.
  • sra_data/annotation_text contains the run-level and sample-level annotation of each SRA run accession in .csv format. These files are generated only in -m full mode, which takes longer than -m quick.
  • sra_data/keyword_hits.csv is created when the option -d is used. SRA entries that match a specific keyword are listed beneath the corresponding header in the file. The keys provided in the JSON file are used as headers in this output file.
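
To work with the keyword report programmatically, the columns can be collected into a mapping from header to matching accessions. This is a minimal sketch that assumes keyword_hits.csv is a regular comma-separated file whose first row holds the dictionary keys and whose columns may differ in length; the exact layout is not specified on this page, and `read_keyword_hits` is a hypothetical helper.

```python
import csv


def read_keyword_hits(path):
    """Return {header: [accessions]} from a keyword_hits.csv-style file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        hits = {name: [] for name in (reader.fieldnames or [])}
        for row in reader:
            for header, value in row.items():
                # Columns can be of unequal length, so skip empty cells.
                if header in hits and isinstance(value, str) and value.strip():
                    hits[header].append(value.strip())
    return hits
```

With the dictionary from the -d example, the result would have the keys tissue and reagent, each mapped to the accessions listed under that header.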

Troubleshooting

  • Check your internet connection.

  • Try simplifying the search term.

  • Find the failed_to_parse.txt file in the output directory. It contains a list of run accessions that most likely have incorrect or missing annotation in SRA.
