Create genome reference files (#78)

* Updated .gitignore to remove cache-based items
* Updated snakemake installation
* Updated documentation to include snakemake generic cluster commands
* `ruff` formatting changes
* Updated cpu request command
* Ignore vscode
* Use `threads` instead of `resources.threads`
* Remove dependency of external FTP servers for fastq contaminant genomes
* Update zip_url for get_contaminant_genomes rule
* Format using `snakefmt`
* Remove `groups`, add additional details on what changing settings will do
* Include information on why tissue name was left blank
* Do not provide a tissue name for default resources. Require all rules to define one, or set it empty (`tissue_name=""`)
* Perform `prefetch` and `dump fastq` by default
* Format using ruff
* Use inline if-else statements for (i think?) cleaner/clearer requirements for `rule all`
* Fix formatting changes from `snakefmt` because I don't like them. Change the get_contaminant_genomes url because I made a mistake making the initial zip archive by excluding the `Adapters` component
* Mark root output as "directory" in rule `get_contaminant_genomes`
* Properly indent `zip` and closing parenthesis
* Fix comment
* Reduce complexity of prefetch, also fix downloading to scratch directory
* Expand nested list to create a single list
* Ignore any control files
* Change profile to a default of `cluster`
* Ignore `master_control.csv`
* Added `ruff.toml` to modify settings for ruff
* Updated environment requirements for genome generation
* Added functions to generate the required genome-related files automatically
* Moved genome generation to its own rule, modified star genome indexing to a rule with a better name. Fixed rules that relied on `rule generate_genome`, as this rule is different than it was previously
* Updated config values based on auto-generating genome-related files
* Update documentation to install latest version of python
* Remove master_control.csv from git tracking
* Re-add master_control.csv
* Add blank master control
* re-ignore master control
* Added environment for genome generation
* Use delimiter directly
* Rename genome values
* Remove genome validation because the genome is created automatically now
* Remove genome validation, it is no longer required
* Hide ruff.toml so it is not shown to users
* Ignore all control files
* Custom cache for getting latest ensembl release as the traditional `@cache` method requires function arguments
* Fix genome sizes filename
* Fix missing commas
* Reorder output files
* Install python in the `generate_genome` environment
* Use private URL variable names
* Fix comment
* Fix snakemake's expected output location of genome-related files
* Add `rich` to environment
* Remove un-necessary imports
HelikarLab · Apr 30, 2024 · adbd512
1 parent f2f84f6
commit adbd512
Showing 10 changed files with 876 additions and 197 deletions.
diff --git a/.gitignore b/.gitignore
@@ -13,4 +13,5 @@
 /genome
 /logs
 /results
-/venv
+/venv
+/controls
diff --git a/.ruff.toml b/.ruff.toml
@@ -0,0 +1 @@
+line-length = 120
diff --git a/Snakefile b/Snakefile
diff --git a/config.yaml b/config.yaml
@@ -59,19 +59,32 @@ BYPASS_GENOME_VALIDATION: False
 BENCHMARK_TIMES: 1
 
 # The following items dictate the location for the reference genome
-GENERATE_GENOME:
-  # The full path where the genome generation data should be saved (i.e., the output of `STAR --runMode genomeGenerate`)
-  GENOME_SAVE_DIR: "genome/star"
-
-  # The full input path of the genome fasta file and the GTF file
-  GENOME_FASTA_FILE: "genome/Mus_musculus.GRCm39.dna.primary_assembly.fa"
-  GTF_FILE: "genome/Mus_musculus.GRCm39.111.gtf"
-
-  # RefFlat0 file location built from the GTF file for generating RNAseq metrics option
-  REF_FLAT_FILE: "genome/refFlat_GRCh38.111.txt"
-
-  # rRNA interval list file location for generating RNAseq metrics
-  RRNA_INTERVAL_LIST: "genome/GRCh38.p5.rRNA.interval_list"
-
-  # The reference BED file built from the GTF for RSEQC option
-  BED_FILE: "genome/mm10_RefSeq.bed"
+GENOME:
+  # The following files will be downloaded or created
+  #   - Genome FASTA file (i.e., primary assembly, downloaded from Ensembl)
+  #   - GTF file (i.e., gene annotation, downloaded from Ensembl)
+  #   - UCSC-style refFlat file (downloaded from UCSC)
+  #   - BED file (created from refFlat file)
+  #   - rRNA Interval List (created from GTF file)
+  #   - STAR index (created from FASTA file)
+
+  # The full path where the genome generation data should be saved
+  SAVE_DIR: "genome"
+
+  # The species of the genome to download
+  # Get your species from https://www.ncbi.nlm.nih.gov/Taxonomy
+  # Examples are:
+  #   Homo Sapiens: 9606
+  #   Mus Musculus: 10090
+  #   Macaca Mulatta: 9544
+  TAXONOMY_ID: 9606
+
+  # The version of the genome to download
+  # "latest" will get the most recent version
+  # If you want to use a specific version, find it from: https://ftp.ensembl.org/pub/
+  # Examples are: "latest", "release-112", "release-111", etc.
+  VERSION: "latest"
+
+  # Should the download progress be shown?
+  # If False, no progress is shown.
+  SHOW_PROGRESS: True
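
The new `GENOME` block in the `config.yaml` hunk above replaces six hand-maintained file paths with four settings. The Snakefile diff is not shown here, so the following is only a minimal sketch of how those four keys could be read and checked before any download starts. The function name `parse_genome_config`, the returned key names, and the validation rules are assumptions made for illustration; they are not taken from the commit.

```python
import re

# Hypothetical helper (not part of this commit): normalises the GENOME block
# of config.yaml into plain Python values and rejects malformed settings.

# Accepts values such as "release-111" or "release-112", as listed in the config comments.
_RELEASE_PATTERN = re.compile(r"^release-(\d+)$")


def parse_genome_config(genome: dict) -> dict:
    """Return validated settings; `release` is None when VERSION is "latest"."""
    taxonomy_id = int(genome["TAXONOMY_ID"])
    if taxonomy_id <= 0:
        raise ValueError(f"TAXONOMY_ID must be a positive integer, got {taxonomy_id}")

    version = str(genome["VERSION"])
    if version == "latest":
        release = None  # assumed to be resolved against Ensembl at download time
    else:
        match = _RELEASE_PATTERN.match(version)
        if match is None:
            raise ValueError(f'VERSION must be "latest" or "release-<N>", got {version!r}')
        release = int(match.group(1))

    return {
        "save_dir": str(genome["SAVE_DIR"]),
        "taxonomy_id": taxonomy_id,
        "release": release,
        # Assumption: progress display defaults to on when the key is absent.
        "show_progress": bool(genome.get("SHOW_PROGRESS", True)),
    }


if __name__ == "__main__":
    defaults = {"SAVE_DIR": "genome", "TAXONOMY_ID": 9606, "VERSION": "latest", "SHOW_PROGRESS": True}
    print(parse_genome_config(defaults))
```

Failing early on a malformed `VERSION` string is a design choice of this sketch: it surfaces a typo before any network request is made, rather than as a failed download later in the workflow.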
diff --git a/controls/master_control.csv b/controls/master_control.csv
diff --git a/docs/pages/fastq/fastq_setup_conda.md b/docs/pages/fastq/fastq_setup_conda.md
@@ -85,7 +85,7 @@ module load mamba
 ### Install Snakemake and Benchmarking Requirements
 Snakemake is required to run the pipeline.
 ```bash
-mamba install --name snakemake --channel conda-forge --channel bioconda snakemake
+mamba install --name snakemake --channel conda-forge --channel bioconda snakemake python
 pip install snakemake-executor-plugin-cluster-generic
 ```
 
@@ -99,6 +99,7 @@ We must install tabulate version `0.8.10` as anything under the `0.9.*` release
 | `--channel conda-forge` |         The channel to install software from          |
 |  `--channel bioconda`   |         The channel to install software from          |
 |       `snakemake`       |  The software to install, defaults to latest version  |
+|        `python`         |             The latest version of python              |
 
 |                  Component                  |                        Description                         |
 |:-------------------------------------------:|:----------------------------------------------------------:|

diff --git a/envs/generate_genome.yaml b/envs/generate_genome.yaml
@@ -0,0 +1,7 @@
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - conda-forge::python
+  - conda-forge::rich<14.0.0
+  - bioconda::samtools<2.0
diff --git a/envs/star.yaml b/envs/star.yaml
@@ -2,3 +2,5 @@ channels:
   - bioconda
 dependencies:
   - star=2.7.9a
+  - httpx<1.0.0
+  - rich<14.0.0