Use sambamba for duplicate marking. #1082

Status: Open (wants to merge 1 commit into base: master)
2 changes: 0 additions & 2 deletions definitions/pipelines/rnaseq.cwl
@@ -150,8 +150,6 @@ steps:
         run: ../tools/mark_duplicates_and_sort.cwl
         in:
             bam: index_bam/indexed_bam
-            input_sort_order:
-                default: "coordinate"
         out:
             [sorted_bam, metrics_file]
     stringtie:
2 changes: 0 additions & 2 deletions definitions/pipelines/rnaseq_star_fusion.cwl
@@ -238,8 +238,6 @@ steps:
         run: ../tools/mark_duplicates_and_sort.cwl
         in:
             bam: sort_bam/sorted_bam
-            input_sort_order:
-                default: "coordinate"
         out:
             [sorted_bam, metrics_file]
     index_bam:
2 changes: 0 additions & 2 deletions definitions/pipelines/rnaseq_star_fusion_with_xenosplit.cwl
@@ -257,8 +257,6 @@ steps:
         run: ../tools/mark_duplicates_and_sort.cwl
         in:
             bam: sort_bam/sorted_bam
-            input_sort_order:
-                default: "coordinate"
         out:
             [sorted_bam, metrics_file]
     index_bam:
24 changes: 20 additions & 4 deletions definitions/tools/generate_fda_tables.cwl
@@ -4,7 +4,7 @@ class: CommandLineTool
 label: "Script to create FDA-requested summary tables"
 requirements:
     - class: DockerRequirement
-      dockerPull: "python:3.7.4-slim-buster"
+      dockerPull: "python:3.10.8-slim-buster"
     - class: ResourceRequirement
       ramMin: 8000
     - class: InitialWorkDirRequirement
@@ -232,9 +232,25 @@ requirements:
 
             def parse_duplication_metrics(duplication_metrics):
                 with open(duplication_metrics) as f:
-                    raw_chunk = f.read().split('\n\n')[1]
-                    pct_dup = raw_chunk.splitlines()[2].split('\t')[8]
-                    return {'PERCENT_DUPLICATION': pct_dup}
+                    pairs = None
+                    singles = None
+                    duplicates = None
+                    lines = f.read().splitlines()
+                    for line in lines:
+                        if match_pairs := re.search(r'sorted (\d+) end pairs', line):
+                            pairs = match_pairs.group(1)
+                        elif match_singles := re.search(r'and (\d+) single ends', line):
+                            singles = match_singles.group(1)
+                        elif match_duplicates := re.search(r'found (\d+) duplicates', line):
+                            duplicates = match_duplicates.group(1)
+                    if pairs is None:
+                        raise ValueError('Failed to parse number of end pairs')
+                    if singles is None:
+                        raise ValueError('Failed to parse number of single ends')
+                    if duplicates is None:
+                        raise ValueError('Failed to parse number of duplicates')
+
+                    return {'PERCENT_DUPLICATION': str(float(duplicates)/(2.0*float(pairs) + float(singles))*100.0)}
 
             def parse_insert_size_metrics(insert_size_metrics):
                 with open(insert_size_metrics) as f:
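Review note: the new duplication arithmetic can be checked outside the CWL wrapper with a minimal standalone sketch like the one below. It is not the script from this PR. The function name `percent_duplication` is hypothetical, and the sample log text is an assumption modeled on the three regular expressions in the hunk above, not captured `sambamba markdup` output.

```python
import re

# Hypothetical sample log: the wording is assumed from the regexes in the
# diff above and is NOT real sambamba output.
SAMPLE_LOG = """\
finding positions of the duplicate reads in the file...
  sorted 100 end pairs
     and 50 single ends (among them 0 unmatched pairs)
  found 25 duplicates
"""

# Same three patterns as the hunk above, keyed by the count they capture.
PATTERNS = {
    "pairs": re.compile(r"sorted (\d+) end pairs"),
    "singles": re.compile(r"and (\d+) single ends"),
    "duplicates": re.compile(r"found (\d+) duplicates"),
}


def percent_duplication(log_text):
    """Return duplicates / (2 * pairs + singles) * 100 from a markdup log."""
    counts = {}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] = int(match.group(1))

    # Fail loudly if any count is missing, as the PR code does.
    missing = [name for name in PATTERNS if name not in counts]
    if missing:
        raise ValueError("Failed to parse: " + ", ".join(missing))

    # Each end pair contributes two reads, each single end contributes one.
    total_reads = 2 * counts["pairs"] + counts["singles"]
    return counts["duplicates"] / total_reads * 100.0


if __name__ == "__main__":
    print(percent_duplication(SAMPLE_LOG))
```

Unlike the PR code, this sketch returns a float rather than a string inside a dict, and it does not guard against a log that reports zero reads, which would raise `ZeroDivisionError` in both versions.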
21 changes: 7 additions & 14 deletions definitions/tools/mark_duplicates_and_sort.cwl
@@ -7,24 +7,22 @@ label: "Mark duplicates and Sort"
 baseCommand: ["/bin/bash", "markduplicates_helper.sh"]
 requirements:
     - class: ResourceRequirement
-      coresMin: 8
+      coresMin: 16
       ramMin: 40000
     - class: DockerRequirement
-      dockerPull: "mgibio/mark_duplicates-cwl:1.0.1"
+      dockerPull: "quay.io/biocontainers/sambamba:0.8.2--h98b6b92_2"
     - class: InitialWorkDirRequirement
       listing:
       - entryname: 'markduplicates_helper.sh'
         entry: |
             set -o pipefail
             set -o errexit
 
-            declare MD_BARCODE_TAG
-            if [ ! -z "$6" ]; then
-            MD_BARCODE_TAG="BARCODE_TAG=$6"
-            /usr/bin/java -Xmx16g -jar /opt/picard/picard.jar MarkDuplicates I=$1 O=/dev/stdout ASSUME_SORT_ORDER=$5 METRICS_FILE=$4 QUIET=true COMPRESSION_LEVEL=0 VALIDATION_STRINGENCY=LENIENT "$MD_BARCODE_TAG" | /usr/bin/sambamba sort -t $2 -m 18G -o $3 /dev/stdin
-            else
-            /usr/bin/java -Xmx16g -jar /opt/picard/picard.jar MarkDuplicates I=$1 O=/dev/stdout ASSUME_SORT_ORDER=$5 METRICS_FILE=$4 QUIET=true COMPRESSION_LEVEL=0 VALIDATION_STRINGENCY=LENIENT | /usr/bin/sambamba sort -t $2 -m 18G -o $3 /dev/stdin
-            fi
+            CORES="$2"
+            CORES_PER_JOB=`perl -E 'my $x = int($ARGV[0]/2); say($x < 1? 1 : $x)' "$CORES"`
+
+            sambamba markdup -l 0 -t $CORES_PER_JOB "$1" /dev/stdout 2> "$4" \
+                | sambamba sort -t $CORES_PER_JOB -m 16G -o "$3" /dev/stdin
 arguments:
     - position: 2
       valueFrom: "$(runtime.cores)"
@@ -35,11 +33,6 @@ inputs:
         type: File
         inputBinding:
             position: 1
-    input_sort_order:
-        type: string
-        default: "queryname"
-        inputBinding:
-            position: 5
     output_name:
         type: string?
         default: 'MarkedSorted.bam'
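Review note: in the new helper script, `sambamba markdup` and `sambamba sort` run concurrently on either side of a pipe, so each is given half of `runtime.cores`, with a floor of one thread. The Perl one-liner's arithmetic can be sketched in Python as follows. The function name `cores_per_job` is hypothetical and exists only for illustration.

```python
def cores_per_job(total_cores):
    """Split the allocated cores between two piped jobs, never below one.

    Perl's int() truncates toward zero; for non-negative inputs that is
    the same as Python's floor division.
    """
    return max(1, total_cores // 2)


if __name__ == "__main__":
    # Print the thread count each job would receive for a few allocations.
    for total in (1, 2, 8, 16):
        print(total, cores_per_job(total))
```

With the new `coresMin: 16`, each of the two sambamba processes would therefore be started with `-t 8`.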