Cite us: https://doi.org/10.1101/2024.05.16.594538
iSeq is a Bash script for downloading sequencing data and metadata from the GSA, SRA, ENA, and DDBJ databases. See Detail Pipeline for the full iSeq pipeline.
Important
To use iSeq, your system must be connected to the network and support the FTP, HTTP, and HTTPS protocols.
- Multiple Database Support: Supports multiple bioinformatics databases (GSA/SRA/ENA/DDBJ/GEO).
- Multiple Input Formats: Supports multiple accessions (Project, Study, Sample, Experiment, or Run accession).
- Metadata Download: Supports downloading sample metadata for each accession.
More features
- File Format Selection: Users can choose to directly download gzip-formatted FASTQ files or download SRA files and convert them to FASTQ format.
- Multi-threading Support: Supports the use of multi-threading to accelerate the conversion of SRA to FASTQ files or the compression of FASTQ files.
- File Merging: For experiment-level accession, the script can merge multiple FASTQ files into one.
- Parallel Download: Supports parallel download connections; you can specify the number of connections to speed up downloads.
- Support for Aspera High-speed Download: For GSA/ENA databases, the script supports high-speed data transfer using Aspera.
- Automatic Retry Mechanism: If a download or verification fails, the script will automatically retry until a set number of attempts have been reached.
- Automated File Verification: After the download is complete, the script will automatically verify the integrity of the files, including checking file sizes and MD5 checksums.
- Error Handling: The script provides error messages and suggestions for solutions when encountering errors.
conda create -n iseq -c conda-forge -c bioconda iseq
conda activate iseq
2. The latest version of iSeq (v1.0) can also be installed from source, see INSTALL
# Use the following command to check whether dependent software is installed
iseq --version
Usage (Chinese tutorial ✨)
$ iseq --help
Usage:
iseq -i accession [options]
Options:
Required parameter:
-i, --input TEXT accession (Project, Study, Sample, Experiment, or Run)
Optional parameters:
-m, --metadata Skip the sequencing data downloads and only fetch the metadata for the accession.
-g, --gzip Download FASTQ files in gzip format directly (*.fastq.gz).
note: if *.fastq.gz files are not available, SRA files will be downloaded and converted to *.fastq.gz files.
-q, --fastq Convert SRA files to FASTQ format.
-t, --threads INT The number of threads to use for converting SRA to FASTQ files or compressing FASTQ files (default: 8).
-e, --merge Merge multiple fastq files into one fastq file for each Experiment, the accession can't be the Run ID.
-d, --database [ena|sra] The database to download SRA files from (default: auto-detect),
note: some SRA files may not be available in the ENA database, even if you specify "ena".
-p, --parallel INT Download sequencing data in parallel, the number of connections needs to be specified, such as -p 10.
note: breakpoint continuation cannot be shared between different numbers of connections.
-a, --aspera Use Aspera to download sequencing data, only support GSA/ENA database.
-h, --help Show the help information.
-v, --version Show the script version.
Input the accession you want to download.
iseq -i PRJNA211801
Firstly, iSeq will retrieve the metadata of the accession, then proceed to download each Run contained within.
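The two-step flow above (metadata first, then one download per Run) can be sketched as follows. This is only an illustration, not iSeq's actual code: `list_runs` is a hypothetical helper, and it assumes the metadata file is a tab-separated table whose header contains a `run_accession` column, as ENA-style reports do.

```bash
#!/usr/bin/env bash
# list_runs: print the Run accessions found in a metadata TSV.
# Assumption: the header row contains a column named "run_accession".
list_runs() {
    awk -F'\t' '
        NR == 1 {
            # Locate the run_accession column in the header row.
            for (i = 1; i <= NF; i++) if ($i == "run_accession") col = i
            if (!col) exit 1          # column not found
            next
        }
        $col != "" { print $col }
    ' "$1"
}
```

Running `list_runs PRJNA211801.metadata.tsv` would print one Run accession per line, which is the list of Runs to download.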
Currently supports 6 accession formats from the following 5 databases, with supported accession prefixes as follows:
Databases | BioProject | Study | BioSample | Sample | Experiment | Run |
---|---|---|---|---|---|---|
GSA | PRJC | CRA | SAMC | \ | CRX | CRR |
SRA | PRJNA | SRP | SAMN | SRS | SRX | SRR |
ENA | PRJEB | ERP | SAME | ERS | ERX | ERR |
DDBJ | PRJDB | DRP | SAMD | DRS | DRX | DRR |
GEO | GSE | \ | GSM | \ | \ | \ |
Additionally, for the two data formats (`GSE`/`GSM`) from the GEO database, it will directly retrieve the associated `PRJNA`/`SAMN`, then proceed to obtain the contained Runs and download the sequencing data. Therefore, essentially, it still downloads sequencing data from the SRA database.
Here are some examples:
Accession Type | Prefixes | Example |
---|---|---|
BioProject | PRJEB, PRJNA, PRJDB, PRJC, GSE | PRJEB42779, PRJNA480016, PRJDB14838, PRJCA000613, GSE122139 |
Study | ERP, DRP, SRP, CRA | ERP126685, DRP009283, SRP158268, CRA000553 |
BioSample | SAMD, SAME, SAMN, SAMC | SAMD00258402, SAMEA7997453, SAMN06479985, SAMC017083 |
Sample | ERS, DRS, SRS, GSM | ERS5684710, DRS259711, SRS2024210, GSM7417667 |
Experiment | ERX, DRX, SRX, CRX | ERX5050800, DRX406443, SRX4563689, CRX020217 |
Run | ERR, DRR, SRR, CRR | ERR5260405, DRR421224, SRR7706354, CRR311377 |
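As a rough illustration of how the prefixes map to databases and accession types, the sketch below classifies an accession with a `case` statement. `classify_accession` is a hypothetical helper, not an iSeq function; it follows the database table above, where `GSE` and `GSM` are listed under BioProject and BioSample.

```bash
#!/usr/bin/env bash
# classify_accession: print "<database> <type>" for an accession, by prefix.
# Hypothetical helper for illustration; iSeq's own detection may differ.
classify_accession() {
    case "$1" in
        PRJC*)  echo "GSA BioProject" ;;
        CRA*)   echo "GSA Study" ;;
        SAMC*)  echo "GSA BioSample" ;;
        CRX*)   echo "GSA Experiment" ;;
        CRR*)   echo "GSA Run" ;;
        PRJNA*) echo "SRA BioProject" ;;
        SRP*)   echo "SRA Study" ;;
        SAMN*)  echo "SRA BioSample" ;;
        SRS*)   echo "SRA Sample" ;;
        SRX*)   echo "SRA Experiment" ;;
        SRR*)   echo "SRA Run" ;;
        PRJEB*) echo "ENA BioProject" ;;
        ERP*)   echo "ENA Study" ;;
        SAME*)  echo "ENA BioSample" ;;
        ERS*)   echo "ENA Sample" ;;
        ERX*)   echo "ENA Experiment" ;;
        ERR*)   echo "ENA Run" ;;
        PRJDB*) echo "DDBJ BioProject" ;;
        DRP*)   echo "DDBJ Study" ;;
        SAMD*)  echo "DDBJ BioSample" ;;
        DRS*)   echo "DDBJ Sample" ;;
        DRX*)   echo "DDBJ Experiment" ;;
        DRR*)   echo "DDBJ Run" ;;
        GSE*)   echo "GEO BioProject" ;;
        GSM*)   echo "GEO BioSample" ;;
        *)      echo "unknown"; return 1 ;;
    esac
}
```

For instance, `classify_accession CRR311377` prints `GSA Run`, and an unrecognized prefix returns a non-zero status.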
In summary, whichever of the six accession formats you provide, iSeq eventually downloads each contained Run and checks its MD5 value. If the MD5 value does not match the one in the public database, iSeq re-downloads the file, for a maximum of three rounds. If a download passes verification within those attempts, the file name is stored in `success.log`; otherwise, the file name is stored in `fail.log`.
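The verify-and-retry behaviour can be sketched roughly as below. This is not iSeq's implementation: `verify_with_retry` is a hypothetical helper, the download step is whatever command you pass in, `md5sum` from GNU coreutils is assumed to be available, and counting three attempts in total is one reading of the rule above.

```bash
#!/usr/bin/env bash
# verify_with_retry FILE EXPECTED_MD5 COMMAND [ARGS...]
# Runs COMMAND (assumed to create or refresh FILE), then compares the MD5.
# Assumption: at most three attempts in total.
verify_with_retry() {
    local file="$1" expected="$2" attempt=1 actual
    shift 2
    while [ "$attempt" -le 3 ]; do
        if "$@"; then
            actual=$(md5sum "$file" 2>/dev/null | awk '{print $1}')
            if [ "$actual" = "$expected" ]; then
                echo "$file" >> success.log   # verified download
                return 0
            fi
        fi
        attempt=$((attempt + 1))
    done
    echo "$file" >> fail.log                  # all attempts failed
    return 1
}
```

A call might look like `verify_with_retry SRR1178105.sra "$md5" wget -c "$url"`, where `$md5` and `$url` come from the metadata.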
Download only the sample information of the accession and skip the download of sequencing data.
iseq -i PRJNA211801 -m
iseq -i CRR343031 -m
The sample information of the accession is obtained regardless of whether the `-m` parameter is used. If metadata cannot be retrieved, iSeq exits without proceeding to the subsequent download.
Note
Note 1: If the retrieved accession is in the SRA/ENA/DDBJ/GEO databases, iSeq will first search in the ENA database. If sample information can be retrieved, it will download metadata in `TSV` format via the ENA API, typically containing 191 columns. However, some recently released data in the SRA database may not be promptly synchronized to the ENA database. Therefore, if metadata cannot be obtained from the ENA database, iSeq will directly download metadata in `CSV` format via the SRA Database Backend, typically containing 30 columns. To maintain consistency with the TSV format, it will be converted to TSV format using `sed -i 's/,/\t/g'`. However, if a single field contains a comma, it may cause column disorder. Ultimately, you will obtain sample information named `${accession}.metadata.tsv`.
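Because the comma-to-tab conversion is purely textual, a quick column-count check can reveal the column disorder case described in Note 1. The sketch below is a hypothetical helper (`check_tsv_columns` is not part of iSeq) that reports every row whose number of tab-separated fields differs from the header.

```bash
#!/usr/bin/env bash
# check_tsv_columns FILE: succeed only if every row has as many
# tab-separated fields as the header row.
check_tsv_columns() {
    awk -F'\t' '
        NR == 1 { n = NF; next }
        NF != n {
            printf "line %d: %d columns, expected %d\n", NR, NF, n
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

If `check_tsv_columns "${accession}.metadata.tsv"` reports mismatched lines, a field in the original CSV most likely contained a comma.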
Note
Note 2: If the retrieved accession is in the GSA database, iSeq will obtain sample information via GSA's getRunInfo interface, downloading metadata in `CSV` format, typically containing 25 columns. The metadata obtained above will be saved as `${accession}.metadata.csv`. To supplement more detailed metadata information, iSeq will automatically obtain metadata information for the Project to which the accession belongs via GSA's exportExcelFile interface, downloading metadata in `XLSX` format, typically with 3 sheets: `Sample`, `Experiment`, `Run`. The final metadata information will be saved as `${accession}.metadata.xlsx`. In summary, you will ultimately obtain sample information named `${accession}.metadata.csv` and `CRA*.metadata.xlsx`.
Directly download FASTQ files in gzip format. If direct download is not possible, SRA files will be downloaded and converted to gzip format using multi-threading for decomposition and compression.
iseq -i SRR1178105 -g
Since the majority of data formats stored directly in the GSA database are in gzip format, if the accession being searched for is from the GSA database, whether the `-g` parameter is used or not, you can directly download FASTQ files in gzip format.
If the accession is from the SRA/ENA/DDBJ/GEO databases, iSeq will first attempt to access the ENA database. If it can directly download FASTQ files in gzip format, it will do so; otherwise, it will download SRA files and convert them to FASTQ format using the `fasterq-dump` tool, then compress the FASTQ files using the `pigz` tool, ultimately obtaining FASTQ files in gzip format.
Tip
parallel-fastq-dump can also convert SRA to gzip-compressed FASTQ files, typically 2-3 times faster than `fasterq-dump + pigz`. However, considering IO limitations, `iSeq` currently does not support `parallel-fastq-dump`.
After downloading the SRA files, they will be decomposed into multiple uncompressed FASTQ files.
iseq -i SRR1178105 -q
This parameter is only effective when the accession is from the SRA/ENA/DDBJ/GEO databases and the downloaded files are SRA files. After downloading the SRA files, iSeq will use the `fasterq-dump` tool to convert them into FASTQ files. Additionally, you can specify the number of threads for conversion using the `-t` parameter.
Note
Note 1: `-q` is particularly useful for downloading single-cell data, especially for scATAC-Seq data, as it can effectively decompose the files into four parts: `I1`, `R1`, `R2`, `R3`. However, if FASTQ files are directly downloaded via the `-g` parameter, only `R1` and `R3` files will be obtained (e.g., SRR13450125), which may cause issues during subsequent data analysis.
Note
Note 2: When `-q` and `-g` are used together, the SRA file will first be downloaded, then converted to `FASTQ` files using the `fasterq-dump` tool, and finally compressed into gzip format using `pigz`. It does not directly download `FASTQ` files in gzip format, which is very useful for obtaining comprehensive single-cell data.
Specifies the number of threads to use for decompressing SRA files into FASTQ files or compressing FASTQ files. The default value is `8`.
iseq -i SRR1178105 -q -t 10
Considering that sequencing data files are generally large, you can specify the number of threads for decomposition using the `-t` parameter. However, more threads does not necessarily mean better performance because excessive threads can lead to high CPU or IO loads, especially since `fasterq-dump` consumes a considerable amount of IO, potentially impacting the execution of other tasks.
Merge multiple FASTQ files from an Experiment into one FASTQ file.
iseq -i SRX003906 -e -g
Although in most cases, an Experiment contains only one Run, some sequencing data may have multiple Runs within an Experiment (e.g., SRX003906, CRX020217). Hence, you can use the `-e` parameter to merge multiple FASTQ files from an Experiment into one. Considering paired-end sequencing, where `fastq_1` and `fastq_2` files need to be merged simultaneously and the sequence names in corresponding lines need to remain consistent, iSeq will merge multiple FASTQ files in the same order. Ultimately, for single-end sequencing data, a single file `SRX*.fastq.gz` will be generated, and for paired-end sequencing data, two files `SRX*_1.fastq.gz` and `SRX*_2.fastq.gz` will be generated.
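The same-order merge can be illustrated with plain concatenation, since concatenated gzip files form a valid multi-member gzip stream. The sketch below is an assumption-laden illustration rather than iSeq's code: `merge_experiment` is a hypothetical helper, and it expects per-Run files named `<RUN>_1.fastq.gz`/`<RUN>_2.fastq.gz` (paired-end) or `<RUN>.fastq.gz` (single-end) in the current directory.

```bash
#!/usr/bin/env bash
# merge_experiment EXP RUN [RUN...]
# Concatenates per-Run FASTQ files into per-Experiment files, visiting the
# Runs in the same order for both mates so read pairs stay aligned.
merge_experiment() {
    local exp="$1" run mate
    shift
    if [ -f "${1}_1.fastq.gz" ]; then
        # Paired-end: build EXP_1.fastq.gz and EXP_2.fastq.gz.
        for mate in 1 2; do
            : > "${exp}_${mate}.fastq.gz"
            for run in "$@"; do
                cat "${run}_${mate}.fastq.gz" >> "${exp}_${mate}.fastq.gz" || return 1
            done
        done
    else
        # Single-end: build EXP.fastq.gz.
        : > "${exp}.fastq.gz"
        for run in "$@"; do
            cat "${run}.fastq.gz" >> "${exp}.fastq.gz" || return 1
        done
    fi
}
```

Because both mate files are built by iterating over the Runs in one fixed order, line N of the `_1` file and line N of the `_2` file still belong to the same read pair.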
Note
Note 1: If the accession is a Run ID, the `-e` parameter cannot be used. Currently, iSeq supports merging both gzip-compressed and uncompressed FASTQ files, but does not support merging files such as BAM files and tar.gz files.
Note
Note 2: Normally, when an Experiment contains only one Run, the files of that Run share the same prefix. For example, `SRR52991314_1.fq.gz` and `SRR52991314_2.fq.gz` have the same prefix `SRR52991314`. In this case, iSeq will directly rename them to `SRX*_1.fastq.gz` and `SRX*_2.fastq.gz`. However, there are exceptions, such as CRX006713, where the Run `CRR007192` contains files with different prefixes. In such cases, iSeq will rename them as `SRX*_original_filename`; for example, `CRX006713_CRD015671.gz` and `CRX006713_CRD015672.gz`.
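The renaming rule in Note 2 can be expressed as a small function. This is a sketch of the rule as described, not iSeq's code; `experiment_filename` is a hypothetical name, and normalizing `.fq.gz` to `.fastq.gz` is inferred from the example file names.

```bash
#!/usr/bin/env bash
# experiment_filename EXP RUN FILE: print the per-Experiment name for FILE.
experiment_filename() {
    local exp="$1" run="$2" file="$3" suffix
    case "$file" in
        "$run"*)
            # File starts with the Run ID: swap the Run prefix for the Experiment ID.
            suffix=${file#"$run"}                 # e.g. "_1.fq.gz"
            case "$suffix" in
                *.fq.gz) suffix="${suffix%.fq.gz}.fastq.gz" ;;
            esac
            echo "${exp}${suffix}"
            ;;
        *)
            # Different prefix: keep the original name behind the Experiment ID.
            echo "${exp}_${file}"
            ;;
    esac
}
```

For example, `experiment_filename CRX006713 CRR007192 CRD015671.gz` prints `CRX006713_CRD015671.gz`.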
Specifies the database for downloading SRA files, supporting ENA and SRA databases.
iseq -i SRR1178105 -d sra
By default, iSeq will automatically detect available databases, so specifying the `-d` parameter is usually unnecessary. However, some SRA files may download slowly from the ENA database. In such cases, you can force downloading from the SRA database by specifying `-d sra`.
Note
Note: If the corresponding SRA file is not found in the ENA database, even if the `-d ena` parameter is specified, iSeq will still automatically switch to downloading from the SRA database.
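The fallback described in this note can be sketched as a simple ordered attempt. Everything here is illustrative: `fetch_sra_file`, `fetch_from_ena`, and `fetch_from_sra` are hypothetical names, the two `fetch_from_*` functions must be supplied by the caller, and trying ENA before SRA by default is an assumption.

```bash
#!/usr/bin/env bash
# fetch_sra_file RUN [ena|sra]
# Tries the preferred database first; with "ena" (the assumed default order)
# it falls back to SRA when ENA does not have the file.
# Assumption: fetch_from_ena and fetch_from_sra are defined by the caller and
# return 0 on success, non-zero when the file is unavailable.
fetch_sra_file() {
    local run="$1" preferred="${2:-ena}" order db
    if [ "$preferred" = sra ]; then
        order="sra"
    else
        order="ena sra"
    fi
    for db in $order; do
        if "fetch_from_${db}" "$run"; then
            echo "$db"        # report which database served the file
            return 0
        fi
    done
    return 1
}
```

With this shape, forcing `sra` skips ENA entirely, while `ena` still ends up at SRA when the file is missing from ENA.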
Enables multi-threaded downloading and requires specifying the number of threads.
iseq -i PRJNA211801 -p 10
Considering that `wget` may be slow in some cases, you can use the `-p` parameter to let iSeq utilize the `axel` tool for multi-threaded downloading.
Note
Note 1: Resumable downloading only works when the number of connections stays the same. That is, if `-p 10` is used for the first download, `-p 10` must also be used for the second download to resume it.
Note
Note 2: With `-p 10`, iSeq maintains 10 connections throughout the download process. You will therefore see the same `Connection * finished` message appear multiple times, because a connection is released as soon as it finishes its part and a new connection is established to continue downloading.
Use Aspera for downloading.
iseq -i PRJNA211801 -a -g
As Aspera offers faster download speeds, you can use the `-a` parameter to instruct iSeq to use the `ascp` tool for downloading. Unfortunately, Aspera downloading is currently supported only by the GSA and ENA databases. The NCBI SRA database cannot be downloaded via Aspera, mainly because it predominantly relies on Google Cloud and AWS Cloud, among other reasons; see Avoid-using-ascp.
Note
Note 1: When accessing the GSA database, if download links from Huawei Cloud are available, iSeq will prioritize downloading through Huawei Cloud, even if the `-a` parameter is used. This is because Huawei Cloud offers faster and more stable download speeds. Therefore, when downloading GSA data, it's recommended to use the `-a` parameter. This way, if access to Huawei Cloud is unavailable, downloading through the Aspera channel is still relatively fast. Otherwise, you'll have to resort to downloading via `wget` or `axel`, which are slower methods.
Note
Note 2: Since Aspera requires a key file, iSeq will automatically search for the key file in the `conda` environment or the `~/.aspera` directory. If the key file is not found, downloading will not be possible.
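To check in advance whether a key file is present, a search along the lines below may help. It is only a sketch under stated assumptions: `find_aspera_key` is a hypothetical helper, the key is assumed to have an `.openssh` extension, and the conda environment is assumed to be the one pointed to by `$CONDA_PREFIX`; iSeq's own lookup may differ.

```bash
#!/usr/bin/env bash
# find_aspera_key: print the first *.openssh file found under the active
# conda environment or ~/.aspera; fail if none exists.
# Assumptions: ".openssh" extension and $CONDA_PREFIX as the conda location.
find_aspera_key() {
    local dir key
    for dir in "${CONDA_PREFIX:-}" "$HOME/.aspera"; do
        [ -n "$dir" ] && [ -d "$dir" ] || continue
        key=$(find "$dir" -type f -name '*.openssh' 2>/dev/null | head -n 1)
        if [ -n "$key" ]; then
            echo "$key"
            return 0
        fi
    done
    echo "Aspera key file not found" >&2
    return 1
}
```

Running `find_aspera_key` before a large `-a` download gives an early warning if no key is available.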
- If the query accession is in the SRA/ENA/DDBJ/GEO databases, the following files will be generated:
Output | Description |
---|---|
SRA files | Can be converted to FASTQ files using -q option |
.metadata.tsv | Metadata for query accession |
success.log | Names of the SRA files that were downloaded successfully |
fail.log | Names of the SRA files that failed to download |
- If the query accession is in the GSA database, the following files will be generated:
Output | Description |
---|---|
GSA files | Mostly in *.gz format, and a few are bam/tar/bz2 format |
.metadata.csv | Metadata for query accession |
.metadata.xlsx | Metadata for Project including query accession in xlsx format |
success.log | Names of the GSA files that were downloaded successfully |
fail.log | Names of the GSA files that failed to download |
Example See more
- Download all Run sequencing data and metadata associated with an accession.
iseq -i PRJNA211801
- Batch download by Aspera with `-a` to directly download gzip-formatted FASTQ files with `-g`.
# Read one Run accession per line; -r keeps backslashes literal
while read -r Run; do
iseq -i "$Run" -a -g
done < SRR_Acc_List.txt
iSeq was inspired by fastq-dl, fetchngs, pysradb, and Kingfisher. These excellent tools may also be very helpful. Below is a comparison of the different tools:
Software name | Programming language | Supported databases | Supported accessions | Supported formats | Supported methods | Fetch metadata | MD5 check | Resumable download | Parallel download | Merge FASTQ | Skip downloaded | Conda installable | URL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
iSeq | Shell | GSA, SRA, ENA, DDBJ, GEO | All | fq, fq.gz, sra, bam | wget, axel, aspera | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 🔗 |
edgeturbo | C | GSA | All denied | fq, fq.gz, bam | edgeturbo download | ❌ | ❌ | ✔ | ❌ | ❌ | ❌ | ❌ | 🔗 |
SRA Toolkit | C | SRA, ENA, DDBJ | All denied except Run ID | fq, fq.gz, sra | prefetch | ❌ | ✔ | ✔ | ❌ | ❌ | ✔ | ✔ | 🔗 |
enaBrowserTools | Python | SRA, ENA, DDBJ | All except GSA/GEO ID | fq, fq.gz, sra | urllib, aspera | ✔ | ✔ | ✔ | ❌ | ❌ | ✔ | ✔ | 🔗 |
fastq-dl | Python | SRA, ENA, DDBJ | All except GSA/GEO ID | fq, fq.gz, sra, sra.lite | wget | ✔ | ✔ | ❌ | ❌ | ✔ | ✔ | ✔ | 🔗 |
fetchngs | Python | SRA, ENA, DDBJ, GEO | All except GSA ID | fq, fq.gz | wget, aspera, prefetch | ✔ | ✔ | ✔ | ❌ | ❌ | ✔ | ❌ | 🔗 |
pysradb | Python | SRA, ENA, DDBJ, GEO | All except GSA ID | fq, fq.gz, sra, bam | requests, aspera | ✔ | ✔ | ✔ | ❌ | ❌ | ✔ | ✔ | 🔗 |
Kingfisher | Python | SRA, ENA, DDBJ | All except GSA/GEO ID | fq, fq.gz, sra | curl, aria2c, aspera | ✔ | ✔ | ❌ | ✔ | ❌ | ✔ | ✔ | 🔗 |
Contributions to iSeq are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on the project's GitHub repository. If you would like to contribute code, please fork the repository, make your changes, and submit a pull request.
This project is licensed under the MIT License.