
Commit

Merge branch 'rnacentre:main' into main
OOAAHH authored Oct 30, 2024
2 parents 14ba13a + cb1caea commit 6f7ab02
Showing 16,225 changed files with 2,797,312 additions and 0 deletions.
57 changes: 57 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,57 @@
name: deploy

on:
  # Deployment is triggered whenever a push is made to the main branch.
  push:
    branches: main
  # Manually trigger deployment
  workflow_dispatch:

jobs:
  docs:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          # "Last updated time" and other git-log-derived information
          # require fetching the full commit history.
          fetch-depth: 0

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          # Node version to use
          node-version: 20

      # Install dependencies
      - name: Install Dependencies
        run: |
          cd docs
          npm ci

      # Run the build script
      - name: Build VuePress site
        run: |
          cd docs
          npm run docs:build

      # See the action's documentation for more information
      # @see https://github.com/crazy-max/ghaction-github-pages
      - name: Deploy to GitHub Pages
        uses: crazy-max/ghaction-github-pages@v4
        with:
          # Deploy to the gh-pages branch
          target_branch: gh-pages
          # The build directory is VuePress's default output directory
          build_dir: docs/docs/.vuepress/dist
        env:
          # @see https://docs.github.com/cn/actions/reference/authentication-in-a-workflow#about-the-github_token-secret
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
96 changes: 96 additions & 0 deletions README.md
@@ -0,0 +1,96 @@
# SRA_Analysis_WDL
- An Auto Pipeline
- Robustness
- Designed for Bio-os

Here we need a script, a program, or some other tool that meets our needs.

[![Typing SVG](https://readme-typing-svg.herokuapp.com?font=Courier+New&pause=1000&color=6B4DF7&multiline=true&random=false&width=435&height=80&lines=%E7%AB%99++%E5%9C%A8++%E5%B7%A8++%E4%BA%BA++%E7%9A%84++%E8%82%A9++%E8%86%80++;Stand+on+the+shoulders+of+giants)](https://git.io/typing-svg)

What do we need?
-----------------------
We have a platform built to fetch raw sequencing data into our system. Once the data is under our control, we set our pipeline to work. Given the complexity of the situation, our tools should be packaged as stand-alone toolkits, or they should take advantage of infrastructure that is readily available.


What we have now
-----------------------
- 10X Cellranger count WDL
- 10X Cellranger ATAC count WDL
- 10X Cellranger VDJ WDL
- 10X Spaceranger WDL
- 10X Cellranger multi WDL (for GEX + VDJ-T/VDJ-B or both of them)
- SeqWell & Drop-seq & BD WDL (STARsolo)
- SMART-seq WDL (STARsolo, too)
`Praise the god of STAR`

- Docker images are here: `https://hub.docker.com/repositories/ooaahhdocker`


Update
-----------------------

### 2024.10.10 : let's have a try
Try using Rust as an encapsulation for the command part of the WDL, replacing Python.
The results are here: `_SRAtoFastqgz/2.0_rust`.

I'm looking forward to this being a start: a starting point for judging the type of VDJ files (or any other files) quickly.
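The repo's version is in Rust under `_SRAtoFastqgz/2.0_rust`; as a rough sketch of the "judge the read type quickly" idea (in Python here for brevity, with a hypothetical helper name and length threshold, not the repo's actual logic), one can peek at only the first record of a FASTQ and use the read length as a signal:

```python
import gzip

def peek_read_length(fastq_gz):
    # Read only the first record (@header line, then sequence line),
    # so even multi-GB files are judged almost instantly.
    with gzip.open(fastq_gz, "rt") as fh:
        fh.readline()                      # @header
        return len(fh.readline().strip())  # sequence

def looks_like_barcode_read(fastq_gz, max_len=30):
    # 10X barcode+UMI reads are short (16 bp barcode + 10-12 bp UMI);
    # cDNA reads are usually much longer. The 30 bp cutoff is an assumption.
    return peek_read_length(fastq_gz) <= max_len
```

Checking one record rather than scanning the whole file is what makes the classification fast.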

### 2024.10.9 : ATTENTION!
All of these image names should be replaced with the following prefix.
- **registry-vpc.miracle.ac.cn/gznl/**
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/ooaahhdocker/python_pigz:1.0
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/py39_scanpy1-10-1
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/starsolo2:3.0
- registry-vpc.miracle.ac.cn/gznl/ooaahhdocker/starsolo2:2.0
- registry-vpc.miracle.ac.cn/gznl/python:3.9.19-slim-bullseye

### 2024.5.14 : Function added
- For the cellranger count WDL, updated the naming convention of h5ad files.
- `filtered_feature_bc_matrix.h5ad` is renamed to `~{sample}_filtered_feature_bc_matrix.h5ad`

### 2024.5.11 : Refining Code Logic
- When extracting SRA files to fastq files, we hit a file-attribution issue: it was difficult to determine the correct naming of the extracted fastq files. Solution: the R1 data consists largely of duplicate sequences (barcodes and UMIs), so after high-ratio compression the R1 file should be smaller than the R2 file.
- So we simply reversed the order of the compression and renaming steps.
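A minimal sketch of that size heuristic (hypothetical helper name, not the WDL's actual code): after compressing both files, assign R1/R2 by comparing file sizes:

```python
import os

def assign_r1_r2(path_a, path_b):
    # R1 (barcode + UMI) is highly repetitive, so after high-ratio
    # compression it should be the smaller file; the larger is R2 (cDNA).
    # Returns the pair as (r1_path, r2_path).
    if os.path.getsize(path_a) <= os.path.getsize(path_b):
        return path_a, path_b
    return path_b, path_a
```

This is why compression has to happen before renaming: the sizes are only informative once both files are compressed.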

### 2024.5.9 : multi needs NA set as `[]`
- `Array[File]` cannot accept an empty string as input; use `[]` instead.
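For example, in a Cromwell inputs JSON (the workflow and input names below are hypothetical), an absent optional `Array[File]` should be an empty array, never an empty string:

```json
{
  "cellranger_multi_workflow.vdj_b_fastqs": [],
  "cellranger_multi_workflow.gex_fastqs": ["s3://bucket/sample_R1.fastq.gz"]
}
```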


### 2024.5.4 : Updated naming logic for files
- Scope of the change: the "SRA > fastq.gz" step.

### 2024.4.28 : Added unplanned WDL files
- 10X Cellranger multi WDL

### 2024.4.28 : Bugs fix
- For VDJ files (SRA), we have to use the parameters `--split-files` combined with `--include-technical`.
- P.S. For SpaceRanger, we need the parameter `--split-3`. Therefore, in the case of 10X, we need to choose the appropriate workflow for the specific situation.
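A sketch of how those flag choices could be encoded (hypothetical helper; the flag spellings follow the sra-tools fasterq-dump CLI):

```python
def fasterq_dump_args(sra_path, library_type):
    # VDJ runs must keep every read, including technical reads
    # (barcodes/UMIs); SpaceRanger-bound runs use --split-3 instead.
    args = ["fasterq-dump", sra_path]
    if library_type == "vdj":
        args += ["--split-files", "--include-technical"]
    else:
        args += ["--split-3"]
    return args
```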

### 2024.4.26 : Function added
- For local fastq files, I added `cellranger_singleFile.wdl`.

### 2024.4.23 : Function added
- Output h5ad and BAM files wherever possible.

### 2024.4.22 : Added STARsolo WDL files, usable for BD, SeqWell, and Drop-seq, without umitools
- P.S. Set `--soloBarcodeReadLength=0` to skip the barcode and UMI length checks.
- Docker pull: `ooaahhdocker/starsolo2:3.0`, with python3.9/scanpy1.10.1/star2.7.11 inside.
- Attention!
- To bring STARsolo into even closer agreement with CellRanger, you can add

`args_dict['--genomeSAsparseD'] = ['3']`

- CellRanger 3.0.0 uses advanced filtering based on the EmptyDrops algorithm developed by Lun et al. This algorithm calls extra cells compared to the knee filtering, allowing for cells that have relatively fewer UMIs but are transcriptionally distinct from the ambient RNA. In STARsolo, this filtering can be activated by:

`args_dict['--soloCellFilter'] =['EmptyDrops_CR']`
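Putting the two tweaks together, a sketch of extending the `args_dict` used in the wrapper (the dict-to-command flattening below is an assumption about the wrapper's shape, not its actual code):

```python
def starsolo_extra_args(match_cellranger=True):
    args_dict = {}
    # Skip the barcode/UMI read-length check.
    args_dict['--soloBarcodeReadLength'] = ['0']
    if match_cellranger:
        # Match CellRanger's sparse suffix-array setting.
        args_dict['--genomeSAsparseD'] = ['3']
        # CellRanger-3.x-style EmptyDrops cell filtering.
        args_dict['--soloCellFilter'] = ['EmptyDrops_CR']
    # Flatten {flag: [values]} into a command-line fragment.
    flat = []
    for flag, values in args_dict.items():
        flat.append(flag)
        flat.extend(values)
    return flat
```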

### 2024.4.16 : Must come with full image information, slide number, etc.
- For spaceranger, complete image information is a must, and the data provided by some authors is incomplete.

### 2024.4.12 : Technical roadmap updated; SRA files are now processed with fasterq-dump
- Docker pull: `ooaahhdocker/python_pigz:1.0`, with python3.9 and pigz inside, for fast compression of the extracted fastq files.
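The compression step can be sketched like this (hypothetical wrapper; `pigz` is a parallel gzip that spreads compression across cores):

```python
def pigz_command(fastq_path, threads=8):
    # pigz compresses in place, replacing fastq_path with fastq_path + ".gz".
    # -p sets the number of compression threads.
    return ["pigz", "-p", str(threads), fastq_path]
```

In practice the list would be handed to `subprocess.check_call` inside the WDL task.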

### 2024.4.11 : Resolving compatibility issues
- Lower versions of cellranger (2.9.6) are unable to handle newer 10X scRNA-seq data.
- Added a way to import the cellranger package externally.

171 changes: 171 additions & 0 deletions _CellRanger_ATAC/cellranger_atac_count.wdl
@@ -0,0 +1,171 @@
version 1.0
##############################################
# This is a WDL file                         #
# This file was written for CellRanger ATAC  #
# Author: Sun Hao                            #
# Date: 2024/04/08                           #
##############################################
workflow cellranger_count_workflow {
    input {
        # An array of FASTQ file paths
        Array[File] fastq_file_paths
        # Reference genome in tar.gz format
        File reference_genome_tar_gz
        # cellranger-atac package in tar.gz format
        File cellranger_atac_tar_gz
        # Sample/Run ID. For this WDL, each lane's output files are
        # defined as one run: one CellRanger job per run.
        String run_id
        # Required. Sample name as specified in the sample sheet supplied to cellranger mkfastq.
        String sample
        # Memory string, e.g. "225 GB"
        String memory = "225 GB"
        # Disk space in GB
        String disk_space = "500 GB"
        # Number of CPUs per cellranger job
        Int cpu = 32
        # Chemistry of the channel
        String chemistry = "auto"

        String? no_bam
        String? secondary

        Int? force_cells
        String? dim_reduce
        File? peaks
    }

    call run_cellranger_count {
        input:
            fastq_file_paths = fastq_file_paths,
            reference_genome_tar_gz = reference_genome_tar_gz,
            cellranger_atac_tar_gz = cellranger_atac_tar_gz,
            run_id = run_id,
            sample = sample,
            memory = memory,
            cpu = cpu,
            disk_space = disk_space,
            no_bam = no_bam,
            chemistry = chemistry,
            secondary = secondary,
            force_cells = force_cells,
            dim_reduce = dim_reduce,
            peaks = peaks,
    }
}

task run_cellranger_count {
    input {
        # An array of FASTQ file paths
        Array[File] fastq_file_paths
        # Reference genome in tar.gz format
        File reference_genome_tar_gz
        # cellranger-atac package in tar.gz format
        File cellranger_atac_tar_gz
        # Sample/Run ID. For this WDL, each lane's output files are
        # defined as one run: one CellRanger job per run.
        String run_id
        # Required. Sample name as specified in the sample sheet supplied to cellranger mkfastq.
        String sample
        # Memory string, e.g. "225 GB"
        String memory
        # Disk space in GB
        String disk_space
        # Number of CPUs per cellranger job
        Int cpu
        # Chemistry of the channel
        String chemistry

        String? no_bam
        String? secondary

        # Force the pipeline to use this number of cells, bypassing the cell detection algorithm
        Int? force_cells
        # Algorithm for dimensionality reduction prior to clustering and tsne: 'lsa' (default), 'plsa', or 'pca'
        String? dim_reduce
        # A BED file to override the peak caller
        File? peaks
    }

    parameter_meta {
        run_id: "Required. A unique run ID string. The name is arbitrary and will be used to name the directory containing all pipeline-generated files and outputs. Only letters, numbers, underscores, and hyphens are allowed (maximum of 64 characters)."
        sample: "Required. Sample name as specified in the sample sheet supplied to cellranger mkfastq. Can take multiple comma-separated values, which is helpful if the same library was sequenced on multiple flow cells with different sample names and therefore different FASTQ file prefixes. Doing this treats all reads from the library, across flow cells, as one sample."
        fastq_file_paths: "Required. Array of fastq files"
        reference_genome_tar_gz: "Required. CellRanger-compatible reference (in tar.gz; can be generated with cellranger mkref)"
        memory: "Required. The minimum amount of RAM to use for the Cromwell VM"
        disk_space: "Required. Amount of disk space (GB) to allocate to the Cromwell VM"
        cpu: "Required. The minimum number of cores to use for the Cromwell VM"
        chemistry: "Optional. The chemistry of the channel, e.g. V2 or V3; 'auto' is also accepted."
    }

    command {
        set -e
        run_id=$(echo "~{run_id}" | sed 's/\./_/g')
        sample=$(echo "~{sample}" | sed 's/\./_/g')

        mkdir cellranger_atac
        tar -zxf ~{cellranger_atac_tar_gz} -C cellranger_atac --strip-components 1
        # Add the cellranger-atac binaries to PATH
        export PATH=$(pwd)/cellranger_atac:$PATH

        # Unpack the reference genome
        mkdir transcriptome_dir
        tar xf ~{reference_genome_tar_gz} -C transcriptome_dir --strip-components 1

        python <<CODE
        import os
        from subprocess import check_call
        # Convert the WDL Array[File] input to a Python list
        fastq_file_paths = ["${sep='","' fastq_file_paths}"]
        fastq_dirs = set([os.path.dirname(f) for f in fastq_file_paths])
        print(fastq_dirs)
        call_args = ['cellranger-atac']
        call_args.append('count')
        call_args.append('--jobmode=local')
        call_args.append('--reference=transcriptome_dir')
        call_args.append('--id=' + "~{run_id}")
        call_args.append('--fastqs=' + ','.join(list(fastq_dirs)))
        call_args.append('--sample=' + "~{sample}")
        if '~{force_cells}' != '':
            call_args.append('--force-cells=~{force_cells}')
        if '~{dim_reduce}' != '':
            call_args.append('--dim-reduce=~{dim_reduce}')
        if '~{peaks}' != '':
            call_args.append('--peaks=~{peaks}')
        if "~{chemistry}" != 'auto':
            call_args.append('--chemistry=' + "~{chemistry}")
        if '~{no_bam}' == 'True':
            call_args.append('--no-bam')
        else:
            print('BAM files will be kept in the output directory')
        if '~{secondary}' == 'True':
            call_args.append('--nosecondary')
        else:
            print('Secondary analysis will be run')
        call_args.append('--disable-ui')
        print('Executing:', ' '.join(call_args))
        check_call(call_args)
        CODE
        tar -czvf ~{run_id}_outs.tar.gz ~{run_id}/outs
    }
    output {
        File output_count_directory = "~{run_id}_outs.tar.gz"
        File output_metrics_summary = "~{run_id}/outs/metrics_summary.csv"
        File output_web_summary = "~{run_id}/outs/web_summary.html"
    }
    runtime {
        docker: "python:3.9.19-slim-bullseye"
        cpu: cpu
        memory: memory
        disk: disk_space
    }
}
