Merge pull request #54 from nf-core/dev

Dev -> Master for 1.4 release
nf-core · Nov 9, 2021 · 0c43cc7 · 0c43cc7
2 parents 2d593fb + 8b0cbb9
commit 0c43cc7
Show file tree

Hide file tree

Showing 46 changed files with 1,888 additions and 767 deletions.
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -61,21 +61,16 @@ For further information/help, please consult the [nf-core/fetchngs documentation
 
 To make the nf-core/fetchngs code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.
 
-### Adding a new step
-
-If you wish to contribute a new step, please use the following coding standards:
-
-1. Define the corresponding input channel into your new process from the expected previous process channel
-2. Write the process block (see below).
-3. Define the output channel if needed (see below).
-4. Add any new flags/options to `nextflow.config` with a default (see below).
-5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build`).
-6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter).
-7. Add sanity checks for all relevant parameters.
-8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
-9. Do local tests that the new code works properly and as expected.
-10. Add a new test command in `.github/workflow/ci.yml`.
-11. Add any descriptions of output files to `docs/output.md`.
+### Adding a new step or module
+
+If you wish to contribute a new step or module please see the [official guidelines](https://nf-co.re/developers/adding_modules#new-module-guidelines-and-pr-review-checklist) and use the following coding standards:
+
+1. Add any new flags/options to `nextflow.config` with a default (see section below).
+2. Add any new flags/options to `nextflow_schema.json` with help text via `nf-core schema build`.
+3. Add sanity checks for all relevant parameters.
+4. Perform local tests to validate that the new code works as expected.
+5. If applicable, add a new test command in `.github/workflow/ci.yml`.
+6. Add any descriptions of output files to `docs/output.md`.
 
 ### Default values
 
@@ -87,40 +82,17 @@ Once there, use `nf-core schema build` to add to `nextflow_schema.json`.
 
 Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.
 
-The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block.
-
-### Naming schemes
+### Channel naming convention
 
 Please use the following naming schemes, to make it easy to understand what is going where.
 
-* initial process channel: `ch_output_from_<process>`
-* intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`
+* Initial process channel: `ch_output_from_<process>`
+* Intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`
 
 ### Nextflow version bumping
 
 If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`
 
-### Software version reporting
-
-If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.
-
-Add to the script block of the process, something like the following:
-
-```bash
-<YOUR_TOOL> --version &> v_<YOUR_TOOL>.txt 2>&1 || true
-```
-
-or
-
-```bash
-<YOUR_TOOL> --help | head -n 1 &> v_<YOUR_TOOL>.txt 2>&1 || true
-```
-
-You then need to edit the script `bin/scrape_software_versions.py` to:
-
-1. Add a Python regex for your tool's `--version` output (as in stored in the `v_<YOUR_TOOL>.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1`
-2. Add a HTML entry to the `OrderedDict` for formatting in MultiQC.
-
 ### Images and figures
 
 For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -18,12 +18,11 @@ jobs:
     if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/fetchngs') }}
     runs-on: ubuntu-latest
     env:
-      NXF_VER: ${{ matrix.nxf_ver }}
       NXF_ANSI_LOG: false
     strategy:
       matrix:
-        # Nextflow versions: check pipeline minimum and current latest
-        nxf_ver: ["21.04.0", ""]
+        # Nextflow versions: check pipeline minimum and latest edge version
+        nxf_ver: ["NXF_VER=21.04.0", "NXF_EDGE=1"]
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v2
@@ -34,8 +33,10 @@ jobs:
         run: |
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
+          export ${{ matrix.nxf_ver }}
+          nextflow self-update
 
-      - name: Run pipeline with test data
+      - name: Run pipeline with SRA test data
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test,docker
 
@@ -53,6 +54,7 @@ jobs:
             "--nf_core_pipeline rnaseq",
             "--ena_metadata_fields run_accession,experiment_accession,library_layout,fastq_ftp,fastq_md5 --sample_mapping_fields run_accession,library_layout",
             --skip_fastq_download,
+            --force_sratools_download
           ]
     steps:
       - name: Check out pipeline code

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -2,4 +2,9 @@ lint:
   files_unchanged:
     - .github/CONTRIBUTING.md
     - assets/sendmail_template.txt
+    - lib/NfcoreSchema.groovy
     - lib/NfcoreTemplate.groovy
+  files_exist:
+    - bin/scrape_software_versions.py
+    - modules/local/get_software_versions.nf
+  actions_ci: False
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,30 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[1.4](https://github.com/nf-core/fetchngs/releases/tag/1.4)] - 2021-11-09
+
+### Enhancements & fixes
+
+* Convert pipeline to updated Nextflow DSL2 syntax for future adoption across nf-core
+* Added a workflow to download FastQ files and to create samplesheets for ids from the [Synapse platform](https://www.synapse.org/) hosted by [Sage Bionetworks](https://sagebionetworks.org/).
+* SRA identifiers not available for direct download via the ENA FTP will now be downloaded via [`sra-tools`](https://github.com/ncbi/sra-tools).
+* Added `--force_sratools_download` parameter to preferentially download all FastQ files via `sra-tools` instead of ENA FTP.
+* Correctly handle errors from SRA identifiers that do **not** return metadata, for example, due to being private.
+* Retry an error in prefetch via bash script in order to allow it to resume interrupted downloads.
+* Name output FastQ files by `{EXP_ACC}_{RUN_ACC}*fastq.gz` instead of `{EXP_ACC}_{T*}*fastq.gz` for run id provenance
+* [[#46](https://github.com/nf-core/fetchngs/issues/46)] - Bug in sra_ids_to_runinfo.py
+* Added support for [DDBJ ids](https://www.ddbj.nig.ac.jp/index-e.html). See examples below:
+
+| `DDBJ`        |
+|---------------|
+| PRJDB4176     |
+| SAMD00114846  |
+| DRA008156     |
+| DRP004793     |
+| DRR171822     |
+| DRS090921     |
+| DRX162434     |
+
 ## [[1.3](https://github.com/nf-core/fetchngs/releases/tag/1.3)] - 2021-09-15
 
 ### Enhancements & fixes

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -14,6 +14,8 @@
 
 * [Requests](https://docs.python-requests.org/)
 
+* [sra-tools](https://github.com/ncbi/sra-tools)
+
 ## Pipeline resources
 
 * [ENA](https://pubmed.ncbi.nlm.nih.gov/33175160/)
@@ -22,9 +24,15 @@
 * [SRA](https://pubmed.ncbi.nlm.nih.gov/21062823/)
     > Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 2011 Jan;39 (Database issue):D19-21. doi: 10.1093/nar/gkq1019. Epub 2010 Nov 9. PubMed PMID: 21062823; PubMed Central PMCID: PMC3013647.
 
+* [DDBJ](https://pubmed.ncbi.nlm.nih.gov/33156332/)
+    > Fukuda A, Kodama Y, Mashima J, Fujisawa T, Ogasawara O. DDBJ update: streamlining submission and access of human data. Nucleic Acids Res. 2021 Jan 8;49(D1):D71-D75. doi: 10.1093/nar/gkaa982. PubMed PMID: 33156332; PubMed Central PMCID: PMC7779041.
+
 * [GEO](https://pubmed.ncbi.nlm.nih.gov/23193258/)
     > Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5. doi: 10.1093/nar/gks1193. Epub 2012 Nov 27. PubMed PMID: 23193258; PubMed Central PMCID: PMC3531084.
 
+* [Synapse](https://pubmed.ncbi.nlm.nih.gov/24071850/)
+    > Omberg L, Ellrott K, Yuan Y, Kandoth C, Wong C, Kellen MR, Friend SH, Stuart J, Liang H, Margolin AA. Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas. Nat Genet. 2013 Oct;45(10):1121-6. doi: 10.1038/ng.2761. PMID: 24071850; PMCID: PMC3950337.
+
 ## Software packaging/containerisation tools
 
 * [Anaconda](https://anaconda.com)

diff --git a/README.md b/README.md
@@ -16,21 +16,34 @@
 
 ## Introduction
 
-**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from public databases. At present, the pipeline supports SRA / ENA / GEO ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
+**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
 
 On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/fetchngs/results).
 
 ## Pipeline summary
 
-Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/public_database_ids.txt)) the pipeline performs the following steps:
+Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/fetchngs/sra_ids_test.txt)) the pipeline performs the following steps:
+
+### SRA / ENA / DDBJ / GEO ids
 
 1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
-2. Fetch extensive id metadata including direct download links to FastQ files via ENA API
-3. Download FastQ files in parallel via `curl` and perform `md5sum` check
+2. Fetch extensive id metadata via ENA API
+3. Download FastQ files:
+    - If direct download links are available from the ENA API, fetch in parallel via `curl` and perform `md5sum` check
+    - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
 4. Collate id metadata and paths to FastQ files in a single samplesheet
 
+### Synapse ids
+
+1. Resolve Synapse directory ids to their corresponding FastQ files ids via the `synapse list` command.
+2. Retrieve FastQ file metadata including FastQ file names, md5sums, etags, annotations and other data provenance via the `synapse show` command.
+3. Download FastQ files in parallel via `synapse get`
+4. Collate paths to FastQ files in a single samplesheet
+
+### Samplesheet format
+
 The columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input) and the Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format). You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core.
 
 ## Quick Start
@@ -45,9 +58,9 @@ The columns in the auto-created samplesheet can be tailored to be accepted out-o
     nextflow run nf-core/fetchngs -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
     ```
 
-    > * Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
-    > * If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
-    > * If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
+    > - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
+    > - If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
+    > - If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
 
 4. Start running your own analysis!
 
@@ -61,7 +74,7 @@ The nf-core/fetchngs pipeline comes with documentation about the pipeline [usage
 
 ## Credits
 
-nf-core/fetchngs was originally written by Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [The Bioinformatics & Biostatistics Group](https://www.crick.ac.uk/research/science-technology-platforms/bioinformatics-and-biostatistics/) at [The Francis Crick Institute, London](https://www.crick.ac.uk/) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/).
+nf-core/fetchngs was originally written by Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [Seqera Labs, Spain](https://seqera.io/) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/). Support for download of sequencing reads without FTP links via sra-tools was added by Moritz E. Beber ([@Midnighter](https://github.com/Midnighter)) from [Unseen Bio ApS, Denmark](https://unseenbio.com). The Synapse workflow was added by Daisy Han [@daisyhan97](https://github.com/daisyhan97) and Bruno Grande [@BrunoGrandePhD](https://github.com/BrunoGrandePhD) from [Sage Bionetworks, Seattle](https://sagebionetworks.org/).
 
 ## Contributions and Support
 

diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -8,8 +8,8 @@
         "type": "array",
         "items": {
             "type": "string",
-            "pattern": "^[SEPG][RAS][RXSMPAJXE][EN]?[AB]?\\d{4,9}$",
-            "errorMessage": "Please provide a valid SRA, GEO or ENA identifier"
+            "pattern":"^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\\d+)$",
+            "errorMessage": "Please provide a valid SRA, ENA, DDBJ or GEO identifier"
         }
     }
 }