simplify how samples are specified #72

Open
aryarm opened this issue Feb 4, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

Owner

aryarm commented Feb 4, 2021

Come up with an intuitive data structure for internal use, and then parse into it from the following options:

  1. Specify as a samples.tsv file
  2. Specify as a custom, user-provided Python script
  3. Specify as YAML containing path strings with wildcards in them

You could create a parser script for this.

@aryarm aryarm added the enhancement New feature or request label Feb 4, 2021
Owner Author

aryarm commented Mar 15, 2021

There are a total of 8 different ways to configure this depending on config options like asoc, unpaired, rna_only, and interleaved! We need to simplify this.

Owner Author

aryarm commented Jun 7, 2021

@AaronJeeHo and @zrcjessica

Going forward, I think the easiest way to start would be to implement the third option above (i.e. specify as YAML containing path strings with wildcards in them), and then we can think about including the other two options depending on how we feel...

Basically, I'm thinking we could add a config option called data that gets parsed into a dictionary. Each entry in data would be a string giving the path to the FASTQ or BAM files, with a wildcard in that string for the sample name. We could label the keys in the data dictionary with the column names we would have used in the samples file (e.g. dna_bam_path), and then use those keys to determine which of the 8 configurations we're dealing with. I'm not sure how we would map between the sample ID and the VCF sample ID, but one idea is a config option that gives the path to a two-column file holding the sample ID and the VCF sample ID (like the original samples file).
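
A rough sketch of how the two-column mapping file could be read is shown here. It assumes a tab-separated file with no header, the sample ID in the first column and the VCF sample ID in the second; the function name read_vcf_id_map and the sample names are hypothetical and not part of the existing pipeline.

```python
import csv
import io


def read_vcf_id_map(handle):
    """Read a two-column TSV (sample ID, VCF sample ID) into a dict.

    Assumption: no header row; blank lines and lines starting with '#' are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 2:
            raise ValueError(f"expected two columns, got: {row!r}")
        sample_id, vcf_id = row[0], row[1]
        mapping[sample_id] = vcf_id
    return mapping


# Hypothetical file contents, held in memory for the sake of the example
example = io.StringIO("ID-NA12878\tNA12878\nID-NA12891\tNA12891\n")
vcf_ids = read_vcf_id_map(example)
```

In practice the handle would come from opening whatever path the config option points to.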

Owner Author

aryarm commented Jun 13, 2021

Here are some examples for the possible contents of a data config option in a new samples.yml file. I'm thinking any of these would be considered valid samples.yml files. In each case, the sample ID is inferred by matching the {sample} wildcard.

ex1

data:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam # this is the dna bam
    rna:
        path: path/to/the/rna/files/{sample}.other.things.fastq.gz
        unpaired: True
    vcf_id: path/to/sample/id/mappings.tsv

ex2

# Since the unpaired config option isn't provided here, we could just assume it is paired by default (ie unpaired = False by default)
# Also, aln_cmd is a required input if the RNA or ATAC data is in BAM format. We could also try to parse this out of the contents of the file, itself.
data:
    rna:
        path: path/to/the/rna/files/{sample}.other.things.bam
        aln_cmd: "bwa mem -M {input.ref} {input.fastq} > {output}"
    vcf_id: path/to/sample/id/mappings.tsv

ex3

# In this case, we don't have any other config parameters (ie aln_cmd or unpaired), so we just set the entire value of the rna dict to a single string (ie the path)
# Since unpaired defaults to False, this situation would imply interleaved is True
data:
    rna: path/to/the/rna/files/{sample}.other.things.fq.gz
    vcf_id: path/to/sample/id/mappings.tsv

ex4

# In this case, the VCF ID is the same as the sample ID except that the VCF ID doesn't have "ID-" preceding it, so we just specify a regular expression to extract it
data:
    rna: path/to/the/rna/files/{sample}.other.things.fq
    vcf_re: '(?<=ID-).*$'
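
To illustrate how the vcf_re value in ex4 might be applied, here is a minimal sketch using Python's re module. The helper name vcf_id_from_sample and the sample ID "ID-NA12878" are made up for the example; how the pipeline would actually consume vcf_re is still undecided in this thread.

```python
import re

# The vcf_re value from ex4: everything after a leading "ID-"
VCF_RE = r"(?<=ID-).*$"


def vcf_id_from_sample(sample_id, pattern=VCF_RE):
    """Extract the VCF sample ID from a sample ID using the given regex."""
    match = re.search(pattern, sample_id)
    if match is None:
        raise ValueError(f"{sample_id!r} does not match {pattern!r}")
    return match.group(0)


vcf_id = vcf_id_from_sample("ID-NA12878")  # hypothetical sample ID
```

Raising an error on a non-matching sample ID is a design choice of this sketch; falling back to the unmodified sample ID would be another reasonable option.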

ex5

data:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam
    atac: ["path/to/the/atac/files/{sample}.other.things.1.fq.gz", "path/to/the/atac/files/{sample}.other.things.2.fq.gz"]
    peak: path/to/the/atac/files/{sample}.other.things.bed.gz
    vcf_id: path/to/sample/id/mappings.tsv

ex6

# in this situation, some of our samples have a corresponding dna bam but others don't
data1:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam
    atac: ["path/to/the/wgs-atac/files/{sample}.other.things.1.fq.gz", "path/to/the/wgs-atac/files/{sample}.other.things.2.fq.gz"]
    peak: path/to/the/wgs-peak/files/{sample}.other.things.narrowPeak
    vcf_id: path/to/sample/id/mappings1.tsv
data2:
    atac: ["path/to/the/atac/files/{sample}.other.things.1.fq.gz", "path/to/the/atac/files/{sample}.other.things.2.fq.gz"]
    peak: path/to/the/peak/files/{sample}.other.things.narrowPeak.gz
    vcf_id: path/to/sample/id/mappings2.tsv
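
The examples above allow an entry such as rna or atac to be a single string, a list of paths, or a dictionary with extra options. One possible way to normalize all three forms into a single internal structure is sketched below. The function name normalize_entry, the returned keys, and the list of FASTQ suffixes are assumptions made for illustration; the defaults (unpaired is False, and a single paired FASTQ path implies interleaved) follow the comments in ex2 and ex3.

```python
# Assumed set of FASTQ file endings; adjust as needed
FASTQ_SUFFIXES = (".fq", ".fastq", ".fq.gz", ".fastq.gz")


def normalize_entry(entry):
    """Normalize a str, list, or dict entry into a dict with fixed keys."""
    if isinstance(entry, (str, list)):
        # A bare string or list is shorthand for {"path": ...}
        entry = {"path": entry}
    elif isinstance(entry, dict):
        entry = dict(entry)  # copy so the caller's config is not modified
    else:
        raise TypeError(f"unsupported entry type: {type(entry).__name__}")

    paths = entry["path"]
    if isinstance(paths, str):
        paths = [paths]

    unpaired = bool(entry.get("unpaired", False))  # paired by default (ex2)
    is_fastq = all(p.endswith(FASTQ_SUFFIXES) for p in paths)
    # One FASTQ path holding paired reads implies interleaved reads (ex3)
    interleaved = is_fastq and not unpaired and len(paths) == 1

    return {
        "paths": list(paths),
        "unpaired": unpaired,
        "interleaved": interleaved,
        "aln_cmd": entry.get("aln_cmd"),
    }


# The rna entry from ex3 and the atac entry from ex5
rna = normalize_entry("path/to/the/rna/files/{sample}.other.things.fq.gz")
atac = normalize_entry([
    "path/to/the/atac/files/{sample}.other.things.1.fq.gz",
    "path/to/the/atac/files/{sample}.other.things.2.fq.gz",
])
```

With every entry in the same shape, the downstream rules would only need to inspect the unpaired and interleaved flags instead of branching on the type of each config value.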

Owner Author

aryarm commented Jun 22, 2021

We can use glob_wildcards() to infer the sample IDs from the paths. For example, if you have a config file loaded into a dictionary samples_yml with the contents from ex1 above, you could write:

samples = glob_wildcards(samples_yml['data']['wgs']).sample
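
For anyone who wants to see the idea without Snakemake, the following standard-library sketch approximates the matching step. It is not Snakemake's implementation: it takes an explicit list of paths instead of walking the filesystem, and it assumes the {sample} wildcard appears exactly once in the pattern. The function name infer_samples and the file names are hypothetical.

```python
import re


def infer_samples(pattern, paths, wildcard="sample"):
    """Return the sorted wildcard values of the paths that match the pattern."""
    placeholder = "{" + wildcard + "}"
    prefix, found, suffix = pattern.partition(placeholder)
    if not found:
        raise ValueError(f"pattern has no {placeholder} wildcard")
    # Escape the literal parts and capture the wildcard with '.+'
    regex = re.compile(
        re.escape(prefix) + "(?P<" + wildcard + ">.+)" + re.escape(suffix)
    )
    values = set()
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            values.add(match.group(wildcard))
    return sorted(values)


# The wgs pattern from ex1, matched against hypothetical file names
wgs_pattern = "path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam"
files = [
    "path/to/the/wgs/files/NA12878.other.things.in.the.fname.bam",
    "path/to/the/wgs/files/NA12891.other.things.in.the.fname.bam",
    "path/to/the/wgs/files/README.txt",
]
samples = infer_samples(wgs_pattern, files)
```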
