
Commit

Merge branch 'rnacentre:main' into main
OOAAHH authored Oct 30, 2024
2 parents 14ba13a + cb1caea commit 6f7ab02
Showing 16,225 changed files with 2,797,312 additions and 0 deletions.
57 changes: 57 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,57 @@
name: deploy

on:
  # Deployment is triggered whenever a push is made to the main branch.
  push:
    branches: main
  # Manually trigger deployment
  workflow_dispatch:

jobs:
  docs:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          # "Last updated time" and other git-log-derived information
          # require fetching the full commit history.
          fetch-depth: 0

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          # Node version to use
          node-version: 20

      # Install dependencies
      - name: Install Dependencies
        run: |
          cd docs
          npm ci

      # Run the build script
      - name: Build VuePress site
        run: |
          cd docs
          npm run docs:build

      # See the action's documentation for more information
      # @see https://github.com/crazy-max/ghaction-github-pages
      - name: Deploy to GitHub Pages
        uses: crazy-max/ghaction-github-pages@v4
        with:
          # Deploy to the gh-pages branch
          target_branch: gh-pages
          # The build directory is VuePress's default output directory
          build_dir: docs/docs/.vuepress/dist
        env:
          # @see https://docs.github.com/cn/actions/reference/authentication-in-a-workflow#about-the-github_token-secret
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
96 changes: 96 additions & 0 deletions README.md
@@ -0,0 +1,96 @@
# SRA_Analysis_WDL
- An Auto Pipeline
- Robustness
- Designed for Bio-os

Here we need a script, a program, or some other tool that meets our needs.

[![Typing SVG](https://readme-typing-svg.herokuapp.com?font=Courier+New&pause=1000&color=6B4DF7&multiline=true&random=false&width=435&height=80&lines=%E7%AB%99++%E5%9C%A8++%E5%B7%A8++%E4%BA%BA++%E7%9A%84++%E8%82%A9++%E8%86%80++;Stand+on+the+shoulders+of+giants)](https://git.io/typing-svg)

What do we need?
-----------------------
We have a platform built to fetch raw sequencing data into our system. Once the data is under our control, we set our pipeline to work. Given the complexity of the situation, our tools should be packaged as stand-alone toolkits, or they should take advantage of infrastructure that is readily available.


What we have now
-----------------------
- 10X Cellranger count WDL
- 10X Cellranger ATAC count WDL
- 10X Cellranger VDJ WDL
- 10X Spaceranger WDL
- 10X Cellranger multi WDL (for GEX + VDJ-T/VDJ-B or both of them)
- SeqWell & Drop-seq & BD WDL (STARsolo)
- SMART-seq WDL (STARsolo, too)
`Praise the god of STAR`

- Docker images are here: `https://hub.docker.com/repositories/ooaahhdocker`


Update
-----------------------

### 2024.10.10 : let's have a try
Try using Rust as an encapsulation for the command part of the WDL, replacing Python.
The results are here: `_SRAtoFastqgz/2.0_rust`.

I'm looking forward to this being a start: a starting point for judging the type of VDJ files (or any other files) quickly.
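The repo's version is in Rust under `_SRAtoFastqgz/2.0_rust`; as a rough sketch of the "judge the read type quickly" idea (in Python here for brevity, with a hypothetical helper name and length threshold, not the repo's actual logic), one can peek at only the first record of a FASTQ and use the read length as a signal:

```python
import gzip

def peek_read_length(fastq_gz):
    # Read only the first record (@header line, then sequence line),
    # so even multi-GB files are judged almost instantly.
    with gzip.open(fastq_gz, "rt") as fh:
        fh.readline()                      # @header
        return len(fh.readline().strip())  # sequence

def looks_like_barcode_read(fastq_gz, max_len=30):
    # 10X barcode+UMI reads are short (16 bp barcode + 10-12 bp UMI);
    # cDNA reads are usually much longer. The 30 bp cutoff is an assumption.
    return peek_read_length(fastq_gz) <= max_len
```

Checking one record rather than scanning the whole file is what makes the classification fast.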

### 2024.10.9 : ATTENTION!
All of these image names should be replaced with the following prefix.
- **registry-vpc.miracle.ac.cn/gznl/**
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/ooaahhdocker/python_pigz:1.0
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/py39_scanpy1-10-1
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/starsolo2:3.0
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/starsolo2:2.0
- registry-vpc.miracle.ac.cn/gznl/python:3.9.19-slim-bullseye

### 2024.5.14 : Function added
- For the cellranger count WDL, updated the naming convention of h5ad files.
- `filtered_feature_bc_matrix.h5ad` is renamed to `~{sample}_filtered_feature_bc_matrix.h5ad`

### 2024.5.11 : Refining Code Logic
- When extracting SRA files to fastq files, we hit a file-attribution issue: it was difficult to determine the correct naming of the extracted fastq files. Solution: the R1 data consists largely of duplicate sequences (barcodes and UMIs), so after high-ratio compression the R1 file should be smaller than the R2 file.
- So we simply reversed the order of the compression and renaming steps.
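A minimal sketch of that size heuristic (hypothetical helper name, not the WDL's actual code): after compressing both files, assign R1/R2 by comparing file sizes:

```python
import os

def assign_r1_r2(path_a, path_b):
    # R1 (barcode + UMI) is highly repetitive, so after high-ratio
    # compression it should be the smaller file; the larger is R2 (cDNA).
    # Returns the pair as (r1_path, r2_path).
    if os.path.getsize(path_a) <= os.path.getsize(path_b):
        return path_a, path_b
    return path_b, path_a
```

This is why compression has to happen before renaming: the sizes are only informative once both files are compressed.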

### 2024.5.9 : multi needs NA set as `[]`
- `Array[File]` cannot accept an empty string as input; use `[]` instead.
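For example, in a Cromwell inputs JSON (the workflow and input names below are hypothetical), an absent optional `Array[File]` should be an empty array, never an empty string:

```json
{
  "cellranger_multi_workflow.vdj_b_fastqs": [],
  "cellranger_multi_workflow.gex_fastqs": ["s3://bucket/sample_R1.fastq.gz"]
}
```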


### 2024.5.4 : Updated naming logic for files
- Scope of the change: the "SRA > fastq.gz" step.

### 2024.4.28 : Added unplanned WDL files
- 10X Cellranger multi WDL

### 2024.4.28 : Bugs fix
- For VDJ files (SRA), we have to use the parameters `--split-files` combined with `--include-technical`.
- P.S. For SpaceRanger, we need the parameter `--split-3`. Therefore, in the case of 10X, we need to choose the appropriate workflow for the specific situation.
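A sketch of how those flag choices could be encoded (hypothetical helper; the flag spellings follow the sra-tools fasterq-dump CLI):

```python
def fasterq_dump_args(sra_path, library_type):
    # VDJ runs must keep every read, including technical reads
    # (barcodes/UMIs); SpaceRanger-bound runs use --split-3 instead.
    args = ["fasterq-dump", sra_path]
    if library_type == "vdj":
        args += ["--split-files", "--include-technical"]
    else:
        args += ["--split-3"]
    return args
```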

### 2024.4.26 : Function added
- For local fastq files, I added `cellranger_singleFile.wdl`.

### 2024.4.23 : Function added
- Output h5ad and BAM files wherever possible.

### 2024.4.22 : Added STARsolo WDL files, usable for BD, SeqWell, and Drop-seq, without umitools
- P.S. Set `--soloBarcodeReadLength=0` to skip the barcode and UMI length checks.
- Docker pull: `ooaahhdocker/starsolo2:3.0`, with python3.9/scanpy1.10.1/star2.7.11 inside.
- Attention!
- To bring STARsolo into even closer agreement with CellRanger, you can add

`args_dict['--genomeSAsparseD'] = ['3']`

- CellRanger 3.0.0 uses advanced filtering based on the EmptyDrops algorithm developed by Lun et al. This algorithm calls extra cells compared to the knee filtering, allowing for cells that have relatively fewer UMIs but are transcriptionally distinct from the ambient RNA. In STARsolo, this filtering can be activated by:

`args_dict['--soloCellFilter'] =['EmptyDrops_CR']`
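Putting the two tweaks together, a sketch of extending the `args_dict` used in the wrapper (the dict-to-command flattening below is an assumption about the wrapper's shape, not its actual code):

```python
def starsolo_extra_args(match_cellranger=True):
    args_dict = {}
    # Skip the barcode/UMI read-length check.
    args_dict['--soloBarcodeReadLength'] = ['0']
    if match_cellranger:
        # Match CellRanger's sparse suffix-array setting.
        args_dict['--genomeSAsparseD'] = ['3']
        # CellRanger-3.x-style EmptyDrops cell filtering.
        args_dict['--soloCellFilter'] = ['EmptyDrops_CR']
    # Flatten {flag: [values]} into a command-line fragment.
    flat = []
    for flag, values in args_dict.items():
        flat.append(flag)
        flat.extend(values)
    return flat
```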

### 2024.4.16 : Must come with full image information, slide number, etc.
- For spaceranger, complete image information is a must, and the data provided by some authors is incomplete.

### 2024.4.12 : Technical roadmap updated; SRA files are now processed with fasterq-dump
- Docker pull: `ooaahhdocker/python_pigz:1.0`, with python3.9 and pigz inside, for fast compression of the extracted fastq files.
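The compression step can be sketched like this (hypothetical wrapper; `pigz` is a parallel gzip that spreads compression across cores):

```python
def pigz_command(fastq_path, threads=8):
    # pigz compresses in place, replacing fastq_path with fastq_path + ".gz".
    # -p sets the number of compression threads.
    return ["pigz", "-p", str(threads), fastq_path]
```

In practice the list would be handed to `subprocess.check_call` inside the WDL task.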

### 2024.4.11 : Resolving compatibility issues
- Lower versions of cellranger (2.9.6) are unable to handle newer 10X scRNA-seq data.
- Added a way to import the cellranger package externally.

171 changes: 171 additions & 0 deletions _CellRanger_ATAC/cellranger_atac_count.wdl
@@ -0,0 +1,171 @@
version 1.0
##############################################
# This is a WDL file                         #
# This file was written for CellRanger ATAC  #
# Author: Sun Hao                            #
# Date: 2024/04/08                           #
##############################################
workflow cellranger_count_workflow {
    input {
        # An array of FASTQ file paths
        Array[File] fastq_file_paths
        # Reference genome in tar.gz format
        File reference_genome_tar_gz
        # cellranger-atac package in tar.gz format
        File cellranger_atac_tar_gz
        # Sample/Run ID. For this WDL, each lane's output files are
        # defined as one run: one CellRanger job per run.
        String run_id
        # Required. Sample name as specified in the sample sheet supplied to cellranger mkfastq.
        String sample
        # Memory string, e.g. "225 GB"
        String memory = "225 GB"
        # Disk space in GB
        String disk_space = "500 GB"
        # Number of CPUs per cellranger job
        Int cpu = 32
        # Chemistry of the channel
        String chemistry = "auto"

        String? no_bam
        String? secondary

        Int? force_cells
        String? dim_reduce
        File? peaks
    }

    call run_cellranger_count {
        input:
            fastq_file_paths = fastq_file_paths,
            reference_genome_tar_gz = reference_genome_tar_gz,
            cellranger_atac_tar_gz = cellranger_atac_tar_gz,
            run_id = run_id,
            sample = sample,
            memory = memory,
            cpu = cpu,
            disk_space = disk_space,
            no_bam = no_bam,
            chemistry = chemistry,
            secondary = secondary,
            force_cells = force_cells,
            dim_reduce = dim_reduce,
            peaks = peaks,
    }
}

task run_cellranger_count {
    input {
        # An array of FASTQ file paths
        Array[File] fastq_file_paths
        # Reference genome in tar.gz format
        File reference_genome_tar_gz
        # cellranger-atac package in tar.gz format
        File cellranger_atac_tar_gz
        # Sample/Run ID. For this WDL, each lane's output files are
        # defined as one run: one CellRanger job per run.
        String run_id
        # Required. Sample name as specified in the sample sheet supplied to cellranger mkfastq.
        String sample
        # Memory string, e.g. "225 GB"
        String memory
        # Disk space in GB
        String disk_space
        # Number of CPUs per cellranger job
        Int cpu
        # Chemistry of the channel
        String chemistry

        String? no_bam
        String? secondary

        # Force the pipeline to use this number of cells, bypassing the cell detection algorithm
        Int? force_cells
        # Algorithm for dimensionality reduction prior to clustering and tsne: 'lsa' (default), 'plsa', or 'pca'
        String? dim_reduce
        # A BED file to override the peak caller
        File? peaks
    }

    parameter_meta {
        run_id: "Required. A unique run ID string. The name is arbitrary and will be used to name the directory containing all pipeline-generated files and outputs. Only letters, numbers, underscores, and hyphens are allowed (maximum of 64 characters)."
        sample: "Required. Sample name as specified in the sample sheet supplied to cellranger mkfastq. Can take multiple comma-separated values, which is helpful if the same library was sequenced on multiple flow cells with different sample names and therefore different FASTQ file prefixes. Doing this treats all reads from the library, across flow cells, as one sample."
        fastq_file_paths: "Required. Array of fastq files"
        reference_genome_tar_gz: "Required. CellRanger-compatible reference (in tar.gz; can be generated with cellranger mkref)"
        memory: "Required. The minimum amount of RAM to use for the Cromwell VM"
        disk_space: "Required. Amount of disk space (GB) to allocate to the Cromwell VM"
        cpu: "Required. The minimum number of cores to use for the Cromwell VM"
        chemistry: "Optional. The chemistry of the channel, e.g. V2 or V3; 'auto' is also accepted."
    }

    command {
        set -e
        run_id=$(echo "~{run_id}" | sed 's/\./_/g')
        sample=$(echo "~{sample}" | sed 's/\./_/g')

        mkdir cellranger_atac
        tar -zxf ~{cellranger_atac_tar_gz} -C cellranger_atac --strip-components 1
        # Add the cellranger-atac binaries to PATH
        export PATH=$(pwd)/cellranger_atac:$PATH

        # Unpack the reference genome
        mkdir transcriptome_dir
        tar xf ~{reference_genome_tar_gz} -C transcriptome_dir --strip-components 1

        python <<CODE
        import os
        from subprocess import check_call
        # Convert the WDL Array[File] input to a Python list
        fastq_file_paths = ["${sep='","' fastq_file_paths}"]
        fastq_dirs = set([os.path.dirname(f) for f in fastq_file_paths])
        print(fastq_dirs)
        call_args = ['cellranger-atac']
        call_args.append('count')
        call_args.append('--jobmode=local')
        call_args.append('--reference=transcriptome_dir')
        call_args.append('--id=' + "~{run_id}")
        call_args.append('--fastqs=' + ','.join(list(fastq_dirs)))
        call_args.append('--sample=' + "~{sample}")
        if '~{force_cells}' != '':
            call_args.append('--force-cells=~{force_cells}')
        if '~{dim_reduce}' != '':
            call_args.append('--dim-reduce=~{dim_reduce}')
        if '~{peaks}' != '':
            call_args.append('--peaks=~{peaks}')
        if "~{chemistry}" != 'auto':
            call_args.append('--chemistry=' + "~{chemistry}")
        if '~{no_bam}' == 'True':
            call_args.append('--no-bam')
        else:
            print('BAM files will be kept in the output directory')
        if '~{secondary}' == 'True':
            call_args.append('--nosecondary')
        else:
            print('Secondary analysis will be run')
        call_args.append('--disable-ui')
        print('Executing:', ' '.join(call_args))
        check_call(call_args)
        CODE
        tar -czvf ~{run_id}_outs.tar.gz ~{run_id}/outs
    }
    output {
        File output_count_directory = "~{run_id}_outs.tar.gz"
        File output_metrics_summary = "~{run_id}/outs/metrics_summary.csv"
        File output_web_summary = "~{run_id}/outs/web_summary.html"
    }
    runtime {
        docker: "python:3.9.19-slim-bullseye"
        cpu: cpu
        memory: memory
        disk: disk_space
    }
}
