Not all of the samples requested have provided input #40

Open
umasstr opened this issue Sep 9, 2021 · 6 comments

umasstr commented Sep 9, 2021

After preparing the required input, the pipeline can't seem to find the specified files or output directory. I don't see in the log files whether or not my sample file is recognized. I am hoping that there is an obvious issue with my file paths or the config, but I'm just not seeing it. Any help would be much appreciated.

Also, sample data ran to completion.

Log file:
(snakemake) root@c844f1072fc5:/varCA# cat out/*

/varCA/Snakefile:51: UserWarning: Not all of the samples requested have provided input. Proceeding with as many samples as is possible...
rule all:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Singularity containers: ignored
Job counts:
count jobs
1 all
1
[Thu Sep 9 11:59:06 2021]
localrule all:
jobid: 0
[Thu Sep 9 11:59:06 2021]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /varCA/.snakemake/log/2021-09-09T115906.454263.snakemake.log

My config (note that the output directory I specify is ignored):

(snakemake) root@c844f1072fc5:/varCA# grep -vF '#' configs/config.yaml | sed '/^$/d'

sample_file: DATA/data/samples.tsv
SAMP_NAMES: [2294, 2296]
genome: DATA/bwa/genome.fa
out: DATA/out_01
snp_callers: [gatk-snp, varscan-snp, vardict-snp]
indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strelka]
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']
snp_model: DATA/data/snp.rda
indel_model: DATA/data/indel.rda

Sample file:

(snakemake) root@c844f1072fc5:/varCA# cat DATA/data/samples.tsv

2294 DATA/2294.dup.fix.bam DATA/2294.bed
2296 DATA/2296.dup.fix.bam DATA/2296.bed

BAM and bed files referenced in samples.tsv are present:

(snakemake) root@c844f1072fc5:/varCA# ls DATA/229* | xargs -n 1 basename

2294.bam
2294.bam.bai
2294.bed
2294.dup.bam
2294.dup.bam.bai
2294.dup.fix.bam
2294.dup.fix.bam.bai
2294_peaks.narrowPeak
2296.bam
2296.bam.bai
2296.bed
2296.dup.bam
2296.dup.bam.bai
2296.dup.fix.bam
2296.dup.fix.bam.bai
2296_peaks.narrowPeak

As are indexes:

root@c844f1072fc5:/varCA# ls DATA/bwa | xargs -n 1 basename

genome.dict
genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.fai
genome.fa.pac
genome.fa.sa

And models:

(snakemake) root@c844f1072fc5:/varCA# ls DATA/data | xargs -n 1 basename

README.md
indel.rda
indel.tsv.gz
samples.tsv
snp.rda
snp.tsv.gz

aryarm (Owner) commented Sep 9, 2021

hmm... yeah, I can't really seem to find anything wrong with what you have, either
This might sound obvious, but I can't think of anything else: could you try checking that the samples file is tab-separated?

cat -t DATA/data/samples.tsv

Also, it might help debug things if you comment out the SAMP_NAMES config option. Then you'll be able to see what it's parsing the samples as.
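
The tab check can also be scripted. Here is a minimal sketch that flags any line of the samples file that does not have exactly the three tab-separated fields shown in this thread (sample name, BAM path, BED path); the path and column count are assumptions based on the config above:

```python
# Sketch: verify each line of samples.tsv is tab-separated with the
# three expected fields (name, BAM path, BED path). Lines split on
# spaces instead of tabs will show up with the wrong field count.
from pathlib import Path

def check_samples_file(path="DATA/data/samples.tsv"):
    problems = []
    for i, line in enumerate(Path(path).read_text().splitlines(), start=1):
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append((i, len(fields), line))
    return problems  # empty list => every line had exactly three tab-separated fields
```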

Also, regarding the output directory being ignored: It's because I've set up the run.bash script to override the output directory that you specify in the config file. So you should specify the output directory there rather than in the config.
I know this is confusing and unintuitive, but at the time, it was the only way to put the log files in the same directory as the output. I've already changed the way this is handled in a future release coming soon!

umasstr (Author) commented Sep 9, 2021

Thanks for the quick reply.
Tabs look okay to me:

2294^IDATA/2294.dup.fix.bam^IDATA/2294.bed
2296^IDATA/2296.dup.fix.bam^IDATA/2296.bed

I'm not very familiar with Snakemake, so trying to execute the Snakefile line by line to debug provided only limited insight:

  1. There might be an issue string splitting paths containing nested directories or filenames with multiple "."
  2. line 69 in snakefile may not tolerate my use of numerical samples (i.e. adding # to characters)
  3. object "config" may not be defined at the sample indexing step in snakefile

FWIW, the pipeline ran after I moved all required files into the same varCA subdirectory, "data" ("DATA" prompted a similar error message), made the sample names non-numerical, and removed any "." not immediately preceding the required extension (i.e. name.extension).

Unfortunately, the run failed, but I'm guessing this isn't related?

[Thu Sep 9 13:58:39 2021]
rule vcf2tsv:
input: out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz, out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.tbi
output: out/callers/Two/illumina-strelka/illumina-strelka.cd856d8a.tsv
jobid: 81
wildcards: sample=Two, caller=illumina-strelka, hash=cd856d8a
Activating conda environment: /varCA/.snakemake/conda/e6058793
Removing temporary output file out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.
Removing temporary output file out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.tbi.
[Thu Sep 9 13:58:43 2021]
Finished job 81.
48 of 157 steps (31%) done
Removing temporary output file out/callers/Two/varscan-indel/varscan-indel.vcf.gz.tbi.
Removing temporary output file out/callers/Two/varscan-indel/varscan-indel.vcf.gz.
[Thu Sep 9 13:58:48 2021]
Finished job 75.
49 of 157 steps (31%) done
Removing temporary output file out/callers/Two/varscan-snp/varscan-snp.vcf.gz.tbi.
Removing temporary output file out/callers/Two/varscan-snp/varscan-snp.vcf.gz.
[Thu Sep 9 13:59:16 2021]
Finished job 67.
50 of 157 steps (32%) done
Exiting because a job execution failed. Look above for error message

aryarm (Owner) commented Sep 9, 2021

ah, I just thought of something! When you write 2294 in a YAML file, it automatically gets parsed as an integer. But 2294 is getting parsed as a string when it's being read from the samples file. So you just needed to put quotes around the sample names in the config file:

SAMP_NAMES: ['2294', '2296']

instead of

SAMP_NAMES: [2294, 2296]

It probably would have also worked if you had commented out the SAMP_NAMES config option completely:

# SAMP_NAMES: [2294, 2296]
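
The mismatch is easy to demonstrate without the pipeline. A short sketch (pure Python; the config list below mimics what a YAML parser produces for the unquoted form):

```python
# An unquoted 2294 in YAML is parsed as the integer 2294, while the
# first column of samples.tsv is read as text, so the names never match.
config_names = [2294, 2296]  # what SAMP_NAMES: [2294, 2296] parses to
tsv_line = "2294\tDATA/2294.dup.fix.bam\tDATA/2294.bed"
sample_name = tsv_line.split("\t")[0]  # '2294' (a string)

assert sample_name not in config_names       # the mismatch behind the warning
assert sample_name in ['2294', '2296']       # quoting the config names fixes it
```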

I moved all required files into the same varCA subdirectory

I'm not sure why it helped that you moved all of the files into the same subdirectory. The paths in the samples file are supposed to be interpreted relative to the directory that you execute the pipeline from.

There might be an issue string splitting paths containing nested directories or filenames with multiple "."

removed "." not preceding required extensions (i.e. name.extension)

I'm also not sure why it's having trouble with the multiple dots "." in your file names. I've looked through the code again, and there's nothing I can think of that would affect that. In fact, if anything, I think it should fail if you're missing the dot before the extension:

lambda wildcards: SAMP[wildcards.sample][0] if SAMP[wildcards.sample][0].endswith('.bam') else rules.align.output

If you have DATA/2294.dup.fixbam instead of DATA/2294.dup.fix.bam in your samples file, it will treat the file as a FASTQ file instead of a BAM file. I'll keep thinking about it and let you know if I come up with something.
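
The quoted lambda's branching can be sketched in plain Python. The structure of SAMP below is an assumption (a dict mapping sample name to its row from samples.tsv), and chosen_input is a hypothetical stand-in for the rule's input function:

```python
# Sketch of the input-selection logic quoted above: if a sample's first
# entry ends in '.bam' it is used directly; otherwise the pipeline
# assumes FASTQ input and falls back to the aligner's output.
SAMP = {
    "2294": ["DATA/2294.dup.fix.bam", "DATA/2294.bed"],
    "broken": ["DATA/2294.dup.fixbam", "DATA/2294.bed"],  # missing dot before 'bam'
}

def chosen_input(sample, align_output="out/aligned/{}.bam"):
    first = SAMP[sample][0]
    # mirrors: ... if SAMP[wildcards.sample][0].endswith('.bam') else rules.align.output
    return first if first.endswith(".bam") else align_output.format(sample)

assert chosen_input("2294") == "DATA/2294.dup.fix.bam"
assert chosen_input("broken") == "out/aligned/broken.bam"  # treated as FASTQ input
```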

object "config" may not be defined at the sample indexing step in snakefile

The config dictionary should be defined as early as this line:

configfile: "configs/config.yaml"

Unfortunately, the run failed, but I'm guessing this isn't related?

Without more info, I can't be sure what the cause is. Are there any error messages in the qlog file?

umasstr (Author) commented Sep 9, 2021

Regarding the formatting issues, I will try to re-run using your suggestions after troubleshooting the issues below. For now, the below samples worked well:

Two data/two.bam data/two.bed
Three data/three.bam data/three.bed

Regarding the run failure, I don't see any obvious errors in qlog. I ran twice and the same step (51) failed both times.
(snakemake) root@9fadc1b19969:/varCA# tail -50 out/qlog

sorting RP complete.
Reads_RP.size(): 100
sorting read-pair
sorting read-pair finished.
Modify RP complete.
adding BD from RP.
modify and summarize interchr RP.
adding BD from interChr RP.
summarize BP as BD complete. Now start sorting BD...
sorting BD... done.
external BD events: 0 Added BreakDancer-like events: 0
The number of one end mapped read: 41630
The number of one end mapped read: 42503
The number of one end mapped read: 42503
There are 268098 reads supporting the reference allele.
There are 1 samples.
SampleName2Index done
declaring g_RefCoverageRegion for 1 samples and 20568 positions.
There are 0 split-reads mapped by aligner.
Added: Sorting and outputing deletions with non-template sequences ...
Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options:
/varCA/out/callers/Three/illumina-strelka/runWorkflow.py
Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options:
/varCA/out/callers/Two/illumina-strelka/runWorkflow.py

aryarm (Owner) commented Sep 9, 2021

hmm... yeah, I can't tell which job is failing from the log output that you included.

Here are some debugging strategies:

  1. Check to see if any of the output files have weird file sizes. This can be an indication that a job failed early on but Snakemake just didn't catch it. If that is the case, it can be useful to delete the files and rerun Snakemake after fixing the issue. Otherwise, Snakemake will continue to restart the workflow from the corrupted files.
    Every time you run the workflow, it will start from the last step that failed (exited with a non-zero exit code). But the issue might have been earlier.
  2. You can search for errors in the log files by grepping for rror, or by opening them in less and searching with the forward slash. (I recommend searching for rror rather than Error, in case the error message uses a lowercase e.)
  3. Once you figure out which job failed, you can execute run.bash with the -npr options to see what shell code it is running. Then, you can activate the conda environment that it is using and run the shell code manually, instead of letting Snakemake do it for you. That way, it will be easier to debug the code and catch any errors or warnings.
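
Strategy 1 can be automated with a short script. A minimal sketch; the "out" directory name and the 64-byte threshold are assumptions you may need to adjust:

```python
# Sketch: flag output files with suspiciously small sizes, which can
# indicate a job died early without Snakemake catching it.
import os

def find_suspect_files(outdir="out", min_bytes=64):
    suspects = []
    for root, _dirs, files in os.walk(outdir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) < min_bytes:
                suspects.append(path)
    return sorted(suspects)
```

Deleting the flagged files before rerunning prevents Snakemake from resuming the workflow from corrupted intermediates.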

Also, one of the requirements for the BAM files is that they have read-group information. Do yours?

umasstr (Author) commented Oct 17, 2022

Hi @aryarm,

Just getting back to this. The pipeline runs past the previously mentioned errors by adding read groups. Thanks for the tips!

Unfortunately, I'm confronted with a more cryptic error later on. Log here

It looks like bcftools may not be installed as a dependency? Total runtime is a few hours per sample, so I wanted to check in to see if there was anything else I'm missing in the log.

Thanks!
