Skip to content

Nextflow pipeline to transfer data from Synapse to a CDS AWS bucket

Notifications You must be signed in to change notification settings

ncihtan/nf-cdstransfer

Repository files navigation

nf-cdstransfer

Overview

This Nextflow workflow is designed to process a sample sheet (samplesheet.csv), retrieve files from Synapse based on entityId, and upload them to an AWS S3 bucket.

[!NOTE]
The workflow consists of two main steps:

  1. synapse_get: Downloads the files from Synapse using the entityId from the sample sheet.
  2. cds_upload: Uploads the downloaded files to a specified AWS S3 bucket.

Example Usage

nextflow run ncihtan/nf-cdstransfer --input samplesheet.csv

Input

Parameters

Parameter Type Description
input str Path to input samplesheet CSV file containing entityId and aws_uri columns. Required.
take_n int Number of samples to process from the samplesheet. Use -1 to process all samples. Default: -1
dryrun bool If true, adds --dryrun flag to AWS copy commands for testing without actual file transfer. Default: false
aws_secret_prefix str Prefix for AWS credential environment variables. Used to construct variable names like ${aws_secret_prefix}_AWS_ACCESS_KEY_ID. Useful for managing multiple AWS credential sets. Default: ""

Workflow Details

Overview

The workflow transfers files from Synapse to CDS (Cloud Data Service) in three main steps:

  1. Read and parse input samplesheet
  2. Download files from Synapse
  3. Upload files to CDS S3 bucket
  4. Generate transfer report

Secrets

Nextflow secrets are used to ensure that tokens, keys and secrets are not exposed

The following Nextflow secrets should be set:

  • SYNAPSE_AUTH_TOKEN: Synapse authentication token.
  • <params.aws_secret_prefix>_AWS_ACCESS_KEY_ID: AWS access key ID. eg CDS_AWS_ACCESS_KEY_ID
  • <params.aws_secret_prefix>: AWS secret access key. eg CDS_AWS_SECRET_ACCESS_KEY
nextflow secrets set SYNAPSE_AUTH_TOKEN <SUPER_SECRET_THING>

Input Samplesheet Requirements

Field Required Pattern Description Example
entityId Yes ^syn\d+$ Synapse entity ID starting with 'syn' followed by numbers syn123456
file_url_in_cds Yes ^s3://.+ URL to the file location in AWS S3, must start with 's3://' s3://mybucket/path/to/file

Notes:

  • Additional columns are allowed but not validated
  • Both fields are mapped internally:
    • entityIdentityid
    • file_url_in_cdsaws_uri

Configuration

Plugins Used

The workflow uses the following plugins:

  • nf-schema: For parameter validation and schema management
  • nf-boost: For enhanced functionality and utilities

Default settings

The included nextflow.config file specifies the following default options. These are used if not overridden by a custom config or profile.

  • docker.enabled = true

Profiles

The nextflow.config file defines several profiles to customize the workflow execution. Below are the available profiles and the parameters/settings they configure:

Setting / Profile test CDS local docker tower
params.input $projectDir/samplesheet.csv - - - -
params.aws_secret_prefix TEST CDS - - -
params.dryrun true - - - -
docker.enabled true true true true true
process.executor local - local - -
process.cpus - - - 1 * task.attempt
process.memory - - - 1.GB * task.attempt
process.maxRetries - - - 3
process.errorStrategy - - - retrys

Process 1: synapse_get

Downloads files from Synapse using entityIds.

Input

  • meta: Object containing entityId and aws_uri

Output

  • Tuple of (meta, downloaded file path)

Dependencies

  • Requires SYNAPSE_AUTH_TOKEN secret
  • Uses synapsepythonclient container

Process 2: cds_upload

Uploads downloaded files to CDS S3 bucket.

Input

  • Tuple of (meta, file path) from synapse_get

Output

  • Tuple of (meta, upload success boolean)

Dependencies

  • Requires AWS credentials:
    • ${aws_secret_prefix}_AWS_ACCESS_KEY_ID
    • ${aws_secret_prefix}_AWS_SECRET_ACCESS_KEY
  • Uses AWS CLI container

Output

No specific outputs are generated by the workflow.
By default a trace file is saved to reports/trace.csv

Running the Workflow

Prerequisites

  • Ensure Nextflow is installed.
  • Ensure you have access to the necessary containers (synapseclient, awscli).
  • Ensure you have the appropriate credentials for Synapse and AWS.

Example usage

Run the workflow with the following command:

nextflow run ncihtan/nf-cdstransfer --input path/to/samplesheet.csv

Using the test profile will use samplesheet.csv when stored in your projectDir. Please generate your own samplesheet and use aws_secret_prefix TEST when setting your relevent AWS Nextflow secrets

nextflow run ncihtan/nf-cdstransfer -profile test

To avoid having to reset secrets when moving between destination accounts you can set your secrets using a prefix

nextflow secrets set MYCREDS_AWS_ACCESS_KEY_ID
nextflow secrets set MYCREDS_AWS_SECRET_ACCESS_KEY
nextflow run ncihtan/nf-cdstransfer --aws_secret_prefix MYCREDS

or use a configured profile in which params.aws_secret_prefix is set

nextflow run ncihtan/nf-cdstransfer -profile CDS --input samplesheet.csv

About

Nextflow pipeline to transfer data from Synapse to a CDS AWS bucket

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published