simplify how samples are specified #72

aryarm · 2021-02-04T23:43:07Z

Come up with an intuitive data structure for internal use, and then parse into it from the following options:

Specify as a samples.tsv file
Specify as a custom user-provided py script
Specify as yaml containing strings as paths with wildcards in them

You could create a parser script for this.

aryarm · 2021-03-15T17:07:00Z

There are a total of 8 different ways to configure this depending on config options like asoc, unpaired, rna_only, and interleaved! We need to simplify this.

aryarm · 2021-06-07T23:11:32Z

@AaronJeeHo and @zrcjessica

Going forward, I think the easiest way to start would be to implement the third option above (ie specify as yaml containing strings as paths with wildcards in them), and then we can think about including the other two options depending on how we feel...

Basically, I'm thinking we could add a config option called data that gets parsed into a dictionary. Each entry in data could be a string for the path to the FASTQ or BAM files. And then we could have a wildcard within that string for the sample name. We could label the keys in the data dictionary by the names that we would have in the samples file (ex: dna_bam_path), and we could use those keys to determine which of the 8 config options we're dealing with. I'm not sure how we would map between the sample ID and the VCF sample ID, but one idea could be to have a config option that specifies the path to a file with those two things as the two columns in the file (like the original samples file)?

aryarm · 2021-06-13T17:03:11Z

Here are some examples for the possible contents of a data config option in a new samples.yml file. I'm thinking any of these would be considered valid samples.yml files. In each case, the sample ID is inferred by matching the {sample} wildcard.

ex1

data:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam # this is the dna bam
    rna:
        path: path/to/the/rna/files/{sample}.other.things.fastq.gz
        unpaired: True
    vcf_id: path/to/sample/id/mappings.tsv

ex2

# Since the unpaired config option isn't provided here, we could just assume it is paired by default (ie unpaired = False by default)
# Also, aln_cmd is a required input if the RNA or ATAC data is in BAM format. We could also try to parse this out of the contents of the file, itself.
data:
    rna:
        path: path/to/the/rna/files/{sample}.other.things.bam
        aln_cmd: "bwa mem -M {input.ref} {input.fastq} > {output}"
    vcf_id: path/to/sample/id/mappings.tsv

ex3

# In this case, we don't have any other config parameters (ie aln_cmd or unpaired), so we just set the entire value of the rna dict to a single string (ie the path)
# Since unpaired defaults to False, this situation would imply interleaved is True
data:
    rna: path/to/the/rna/files/{sample}.other.things.fq.gz
    vcf_id: path/to/sample/id/mappings.tsv

ex4

# In this case, the VCF ID is the same as the sample ID except that the VCF ID doesn't have "ID-" preceding it, so we just specify a regex expression to extract that
data:
    rna: path/to/the/rna/files/{sample}.other.things.fq
    vcf_re: '(?<=ID-).*$'

ex5

data:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam
    atac: [path/to/the/atac/files/{sample}.other.things.1.fq.gz, path/to/the/atac/files/{sample}.other.things.2.fq.gz]
    peak: path/to/the/atac/files/{sample}.other.things.bed.gz
    vcf_id: path/to/sample/id/mappings.tsv

ex6

# in this situation, some of our samples have a corresponding dna bam but others don't
data1:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam
    atac: [path/to/the/wgs-atac/files/{sample}.other.things.1.fq.gz, path/to/the/wgs-atac/files/{sample}.other.things.2.fq.gz]
    peak: path/to/the/wgs-peak/files/{sample}.other.things.narrowPeak
    vcf_id: path/to/sample/id/mappings1.tsv
data2:
    atac: [path/to/the/atac/files/{sample}.other.things.1.fq.gz, path/to/the/atac/files/{sample}.other.things.2.fq.gz]
    peak: path/to/the/peak/files/{sample}.other.things.narrowPeak.gz
    vcf_id: path/to/sample/id/mappings2.tsv

aryarm · 2021-06-22T01:55:57Z

We can use glob_wildcards() to infer the sample IDs from the paths. For example, if you have a config file loaded into a dictionary samples_yml with the contents from ex1 above,

samples = glob_wildcards(samples_yml['data']['wgs']).sample

aryarm added the enhancement New feature or request label Feb 4, 2021

aryarm mentioned this issue Mar 15, 2021

priorities #75

Open

12 tasks

aryarm mentioned this issue Jul 4, 2021

snakemake validation aryarm/varCA#18

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simplify how samples are specified #72

simplify how samples are specified #72

aryarm commented Feb 4, 2021 •

edited

Loading

aryarm commented Mar 15, 2021 •

edited

Loading

aryarm commented Jun 7, 2021 •

edited

Loading

aryarm commented Jun 13, 2021 •

edited

Loading

aryarm commented Jun 22, 2021 •

edited

Loading

simplify how samples are specified #72

simplify how samples are specified #72

Comments

aryarm commented Feb 4, 2021 • edited Loading

aryarm commented Mar 15, 2021 • edited Loading

aryarm commented Jun 7, 2021 • edited Loading

aryarm commented Jun 13, 2021 • edited Loading

ex1

ex2

ex3

ex4

ex5

ex6

aryarm commented Jun 22, 2021 • edited Loading

aryarm commented Feb 4, 2021 •

edited

Loading

aryarm commented Mar 15, 2021 •

edited

Loading

aryarm commented Jun 7, 2021 •

edited

Loading

aryarm commented Jun 13, 2021 •

edited

Loading

aryarm commented Jun 22, 2021 •

edited

Loading