Create genome reference files (#78)
* Updated .gitignore to remove cache-based items

* Updated snakemake installation

* Updated documentation to include snakemake generic cluster commands

* `ruff` formatting changes

* Updated cpu request command

* Ignore vscode

* Use `threads` instead of `resources.threads`

* Remove dependency of external FTP servers for fastq contaminant genomes

* Update zip_url for get_contaminant_genomes rule

* Format using `snakefmt`

* Remove `groups`, add additional details on what changing settings will do

* Include information on why tissue name was left blank

* Do not provide a tissue name for default resources. Require all rules to define one, or set it empty (`tissue_name=""`)

* Perform `prefetch` and `dump fastq` by default

* Format using ruff

* Use inline if-else statements for (I think?) cleaner/clearer requirements for `rule all`

* Fix formatting changes from `snakefmt` because I don't like them.

Change the get_contaminant_genomes url because I made a mistake making the initial zip archive by excluding the `Adapters` component

* Mark root output as "directory" in rule `get_contaminant_genomes`

* Properly indent `zip` and closing parenthesis

* Fix comment

* Reduce complexity of prefetch, also fix downloading to scratch directory

* Expand nested list to create a single list

* Ignore any control files

* Change profile to a default of `cluster`

* Ignore `master_control.csv`

* Added `ruff.toml` to modify settings for ruff

* Updated environment requirements for genome generation

* Added functions to generate the required genome-related files automatically

* Moved genome generation to its own rule, modified star genome indexing to a rule with a better name

Fixed rules that relied on `rule generate_genome`, as this rule is different than it was previously

* Updated config values based on auto-generating genome-related files

* Update documentation to install latest version of python

* Remove master_control.csv from git tracking

* Re-add master_control.csv

* Add blank master control

* re-ignore master control

* Added environment for genome generation

* Use delimiter directly

* Rename genome values

* Remove genome validation because the genome is created automatically now

* Remove genome validation, it is no longer required

* Hide ruff.toml so it is not shown to users

* Ignore all control files

* Custom cache for getting latest ensembl release as the traditional `@cache` method requires function arguments

* Fix genome sizes filename

* Fix missing commas

* Reorder output files

* Install python in the `generate_genome` environment

* Use private URL variable names

* Fix comment

* Fix snakemake's expected output location of genome-related files

* Add `rich` to environment

* Remove unnecessary imports
Josh Loecker authored Apr 30, 2024
1 parent f2f84f6 commit adbd512
Showing 10 changed files with 876 additions and 197 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -13,4 +13,5 @@
 /genome
 /logs
 /results
-/venv
+/venv
+/controls
1 change: 1 addition & 0 deletions .ruff.toml
@@ -0,0 +1 @@
+line-length = 120
384 changes: 355 additions & 29 deletions Snakefile

Large diffs are not rendered by default.
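
The Snakefile diff is not rendered here. To illustrate the "custom cache" item from the commit message, here is a minimal sketch (not the commit's actual code) of caching a value in a module-level variable so the lookup runs only once. The names `latest_ensembl_release` and `fetch` are hypothetical, and the fetcher is passed in so the example runs without network access.

```python
from typing import Callable, Optional

# Module-level slot holding the cached value; None means "not fetched yet".
_latest_release: Optional[int] = None


def latest_ensembl_release(fetch: Callable[[], int]) -> int:
    """Return the latest release number, calling `fetch` at most once.

    `fetch` is a hypothetical stand-in for whatever performs the real
    lookup (for example, an HTTP request); it is an assumption of this sketch.
    """
    global _latest_release
    if _latest_release is None:
        _latest_release = fetch()
    return _latest_release
```

Calling `latest_ensembl_release(fetch)` repeatedly returns the stored value after the first call, so later calls do not trigger `fetch` again.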

45 changes: 29 additions & 16 deletions config.yaml
@@ -59,19 +59,32 @@ BYPASS_GENOME_VALIDATION: False
 BENCHMARK_TIMES: 1

 # The following items dictate the location for the reference genome
-GENERATE_GENOME:
-  # The full path where the genome generation data should be saved (i.e., the output of `STAR --runMode genomeGenerate`)
-  GENOME_SAVE_DIR: "genome/star"
-
-  # The full input path of the genome fasta file and the GTF file
-  GENOME_FASTA_FILE: "genome/Mus_musculus.GRCm39.dna.primary_assembly.fa"
-  GTF_FILE: "genome/Mus_musculus.GRCm39.111.gtf"
-
-  # RefFlat0 file location built from the GTF file for generating RNAseq metrics option
-  REF_FLAT_FILE: "genome/refFlat_GRCh38.111.txt"
-
-  # rRNA interval list file location for generating RNAseq metrics
-  RRNA_INTERVAL_LIST: "genome/GRCh38.p5.rRNA.interval_list"
-
-  # The reference BED file built from the GTF for RSEQC option
-  BED_FILE: "genome/mm10_RefSeq.bed"
+GENOME:
+  # The following files will be downloaded or created
+  # - Genome FASTA file (i.e., primary assembly, downloaded from Ensembl)
+  # - GTF file (i.e., gene annotation, downloaded from Ensembl)
+  # - UCSC-style refFlat file (downloaded from UCSC)
+  # - BED file (created from refFlat file)
+  # - rRNA Interval List (created from GTF file)
+  # - STAR index (created from FASTA file)
+
+  # The full path where the genome generation data should be saved
+  SAVE_DIR: "genome"
+
+  # The species of the genome to download
+  # Get your species from https://www.ncbi.nlm.nih.gov/Taxonomy
+  # Examples are:
+  # Homo Sapiens: 9606
+  # Mus Musculus: 10090
+  # Macaca Mulatta: 9544
+  TAXONOMY_ID: 9606
+
+  # The version of the genome to download
+  # "latest" will get the most recent version
+  # If you want to use a specific version, find it from: https://ftp.ensembl.org/pub/
+  # Examples are: "latest", "release-112", "release-111", etc.
+  VERSION: "latest"
+
+  # Should the download progress be shown?
+  # If False, no progress is shown.
+  SHOW_PROGRESS: True
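
The new `TAXONOMY_ID` and `VERSION` settings replace the hard-coded file paths. As a rough sketch of how such settings could be turned into download locations, the example below builds FASTA and GTF URLs that follow the public Ensembl FTP layout. It is not the pipeline's implementation: `SPECIES_BY_TAXON`, `parse_release`, and `genome_urls` are hypothetical names, the taxonomy-to-assembly table is hard-coded for illustration, and the caller supplies the number that `"latest"` resolves to.

```python
ENSEMBL_FTP = "https://ftp.ensembl.org/pub"

# Hypothetical lookup table: taxonomy ID -> (Ensembl species directory, assembly name).
# The assembly names are assumptions that hold for recent releases only.
SPECIES_BY_TAXON = {
    9606: ("homo_sapiens", "GRCh38"),
    10090: ("mus_musculus", "GRCm39"),
}


def parse_release(version: str, latest: int) -> int:
    """Turn "latest" or a string such as "release-112" into a release number."""
    if version == "latest":
        return latest
    prefix = "release-"
    number = version[len(prefix):]
    if not version.startswith(prefix) or not number.isdigit():
        raise ValueError(f"Unrecognized VERSION value: {version!r}")
    return int(number)


def genome_urls(taxonomy_id: int, version: str, latest: int) -> dict:
    """Build FASTA and GTF download URLs following the Ensembl FTP layout."""
    species, assembly = SPECIES_BY_TAXON[taxonomy_id]
    release = parse_release(version, latest)
    stem = f"{species.capitalize()}.{assembly}"  # e.g. "Homo_sapiens.GRCh38"
    base = f"{ENSEMBL_FTP}/release-{release}"
    return {
        "fasta": f"{base}/fasta/{species}/dna/{stem}.dna.primary_assembly.fa.gz",
        "gtf": f"{base}/gtf/{species}/{stem}.{release}.gtf.gz",
    }
```

For example, `genome_urls(10090, "release-111", latest=112)["gtf"]` yields a URL ending in `Mus_musculus.GRCm39.111.gtf.gz`, the same file name the removed `GTF_FILE` setting pointed to.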
Empty file modified controls/master_control.csv
100755 → 100644
Empty file.
3 changes: 2 additions & 1 deletion docs/pages/fastq/fastq_setup_conda.md
@@ -85,7 +85,7 @@ module load mamba
 ### Install Snakemake and Benchmarking Requirements
 Snakemake is required to run the pipeline.
 ```bash
-mamba install --name snakemake --channel conda-forge --channel bioconda snakemake
+mamba install --name snakemake --channel conda-forge --channel bioconda snakemake python
 pip install snakemake-executor-plugin-cluster-generic
 ```

@@ -99,6 +99,7 @@ We must install tabulate version `0.8.10` as anything under the `0.9.*` release
 | `--channel conda-forge` | The channel to install software from |
 | `--channel bioconda` | The channel to install software from |
 | `snakemake` | The software to install, defaults to latest version |
+| `python` | The latest version of python |

 | Component | Description |
 |:-------------------------------------------:|:----------------------------------------------------------:|
7 changes: 7 additions & 0 deletions envs/generate_genome.yaml
@@ -0,0 +1,7 @@
+channels:
+- conda-forge
+- bioconda
+dependencies:
+- conda-forge::python
+- conda-forge::rich<14.0.0
+- bioconda::samtools<2.0
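
The config comments list an rRNA interval list "created from GTF file" among the generated outputs. The following is a minimal sketch of that conversion under stated assumptions, not the code from this commit: it assumes an Ensembl-style GTF where gene rows carry a `gene_biotype "rRNA"` attribute, and it produces only the tab-separated body rows (sequence, start, end, strand, name). A complete Picard-style interval list also needs a sequence-dictionary header, which is left out here. The function name `rrna_intervals` is hypothetical.

```python
import re
from typing import Iterable, Iterator


def rrna_intervals(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield interval-list body rows for rRNA genes found in GTF lines."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # keep only well-formed gene rows
        # Parse attributes of the form: key "value"; key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_biotype") != "rRNA":
            continue
        # GTF and interval lists both use 1-based inclusive coordinates,
        # so start and end are copied unchanged.
        yield "\t".join([fields[0], fields[3], fields[4], fields[6], attrs.get("gene_id", ".")])
```

Passing an open GTF file handle to `rrna_intervals` and writing each yielded row after the header would give the body of the interval list.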
2 changes: 2 additions & 0 deletions envs/star.yaml
@@ -2,3 +2,5 @@ channels:
 - bioconda
 dependencies:
 - star=2.7.9a
+- httpx<1.0.0
+- rich<14.0.0