Commit 872c044

Documentation update + a lot more (#76)

Author: Josh Loecker

* Updated .gitignore to remove cache-based items
* Updated snakemake installation
* Updated documentation to include snakemake generic cluster commands
* `ruff` formatting changes
* Updated cpu request command
* Ignore vscode
* Use `threads` instead of `resources.threads`
* Remove dependency of external FTP servers for fastq contaminant genomes
* Update zip_url for get_contaminant_genomes rule
* Format using `snakefmt`
* Remove `groups`, add additional details on what changing settings will do
* Include information on why tissue name was left blank
* Do not provide a tissue name for default resources. Require all rules to define one, or set it empty (`tissue_name=""`)
* Perform `prefetch` and `dump fastq` by default
* Format using ruff
* Use inline if-else statements for (i think?) cleaner/clearer requirements for `rule all`
* Fix formatting changes from `snakefmt` because I don't like them. Change the get_contaminant_genomes url because I made a mistake making the initial zip archive by excluding the `Adapters` component
* Mark root output as "directory" in rule `get_contaminant_genomes`
* Properly indent `zip` and closing parenthesis
* Fix comment
* Reduce complexity of prefetch, also fix downloading to scratch directory
* Expand nested list to create a single list
* Ignore any control files
* Change profile to a default of `cluster`
1 parent 162673c commit 872c044

File tree

12 files changed

+500
-724
lines changed

.gitignore

Lines changed: 11 additions & 5 deletions
@@ -1,10 +1,16 @@
+**/.DS_Store
 **/.snakemake/
-/venv
-/.idea
+**/*cache*
+
+.vscode
+.idea
+
+/controls
+/benchmarks
+/COMO_input
+/controls
 /docs/_site/
 /genome
 /logs
 /results
-/controls
-/benchmarks
-/COMO_input
+/venv

Snakefile

Lines changed: 345 additions & 638 deletions
Large diffs are not rendered by default.

cluster/config.v8+.yaml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# This file configures various settings for snakemake to execute jobs to a SLURM cluster.
+# FROM: https://github.com/jdblischak/smk-simple-slurm
+
+executor: cluster-generic
+cluster-generic-submit-cmd:
+  mkdir -p logs/{rule}/{resources.tissue_name} &&
+  sbatch
+    --job-name=smk-{rule}-{wildcards}
+    --account=helikarlab
+    --cpus-per-task={threads}
+    --output=logs/{rule}/{resources.tissue_name}/{rule}-{wildcards}.out
+    --mem={resources.mem_mb}
+    --time={resources.runtime}
+    --parsable
+
+# Define tissue name
+default-resources:
+  - mem_mb=2048
+
+# Job submittion
+cores: 16 # max cores used in snakefile
+cluster-generic-cancel-cmd: scancel
+cluster-generic-cancel-nargs: 50
+restart-times: 0
+max-jobs-per-second: 10
+max-status-checks-per-second: 5
+latency-wait: 60
+jobs: 100
+
+# Other settings
+printshellcmds: True
+rerun-incomplete: True
+
+# Do not change these settings. This pipeline will fail to execute or execute extremely slowly if they are changed
+use-conda: True
+conda-frontend: mamba
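
The submit command in this profile is a template whose `{rule}`, `{wildcards}`, `{threads}` and `{resources.*}` placeholders are filled in per job. The following is a minimal sketch of that substitution using plain Python `str.format`; it is not Snakemake's own implementation, and the rule name, wildcard string and resource values are hypothetical values chosen for illustration.

```python
from types import SimpleNamespace

# Template text copied from the submit command in the profile above,
# joined onto one logical line.
SUBMIT_TEMPLATE = (
    "mkdir -p logs/{rule}/{resources.tissue_name} && "
    "sbatch "
    "--job-name=smk-{rule}-{wildcards} "
    "--account=helikarlab "
    "--cpus-per-task={threads} "
    "--output=logs/{rule}/{resources.tissue_name}/{rule}-{wildcards}.out "
    "--mem={resources.mem_mb} "
    "--time={resources.runtime} "
    "--parsable"
)


def render_submit_command(rule, wildcards, threads, resources):
    """Fill the template with per-job values using plain str.format."""
    return SUBMIT_TEMPLATE.format(
        rule=rule,
        wildcards=wildcards,
        threads=threads,
        # SimpleNamespace lets "{resources.mem_mb}" resolve by attribute access.
        resources=SimpleNamespace(**resources),
    )


# Hypothetical job values, for illustration only.
example_command = render_submit_command(
    rule="fastqc",
    wildcards="tissue_name=liver,tag=S1R1",
    threads=4,
    resources={"tissue_name": "liver", "mem_mb": 2048, "runtime": 30},
)
print(example_command)
```

In this sketch, leaving `tissue_name` out of `resources` raises an `AttributeError`, because the template refers to it in both the log directory and the `--output` path. That is consistent with the commit note that every rule should define `tissue_name` or set it to an empty string once it is no longer a default resource; how Snakemake itself reports a missing resource may differ.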

cluster/config.yaml

Lines changed: 0 additions & 1 deletion
@@ -13,7 +13,6 @@ cluster:
     --parsable

 default-resources:
-  - tissue_name=""
   - mem_mb=2048

 # Job submittion

config.yaml

Lines changed: 2 additions & 2 deletions
@@ -31,11 +31,11 @@ PERFORM_GET_RNASEQ_METRICS: True

 # Prefetch SRA files before writing to FastQ (default: True)
 # If you are processing local FastQ items, set to False
-PERFORM_PREFETCH: False
+PERFORM_PREFETCH: True

 # Get RNA information from SRA files (default: True)
 # If you are processing local FastQ items, set to False
-PERFORM_DUMP_FASTQ: False
+PERFORM_DUMP_FASTQ: True

 # Determine insert sizes for fragment length calculation (default: True)
 PERFORM_GET_INSERT_SIZE: True

docs/pages/fastq/fastq_running.md

Lines changed: 2 additions & 2 deletions
@@ -74,12 +74,12 @@ snakemake --profile cluster --dry-run
 ```

 {% include note.html content="If you did renamed the `cluster` directory to something else, replace the `--profile cluster` with the name of your directory" %}
-{% include note.html content="If you receive an error when running `snakemake --profile cluster --dry-run`, replcae `slurm` with `./cluster`" %}
+{% include note.html content="If you receive an error when running `snakemake --profile cluster --dry-run`, replcae `cluster` with `./cluster`" %}

 After several seconds, many lines should move through the terminal.<br>
 It should end with `This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.`

-If this is not the case, an error has occured, and it will need to be investigated before continuing. If you are having troubles, please [Open an Issue](https://github.com/HelikarLab/FastqToGeneCounts/issues)
+If this is not the case, an error has occured, and it will need to be investigated before continuing. If you are having troubles, please [Open an Issue](https://github.com/HelikarLab/FastqToGeneCounts/issues/new)

 ## Execution
 Once you have confirmed that a dry-run will execute successfully, it is time to start a real run of the workflow.<br>

docs/pages/fastq/fastq_setup_conda.md

Lines changed: 19 additions & 13 deletions
@@ -17,7 +17,7 @@ In most cluster environments (i.e., HCC), you must activate the `mamba` module.
 module load mamba
 ```

-| Component | Description |
+| Component | Description |
 |:---------:|:-------------------------------------------------------------:|
 | `module` | The module command is used to load, unload, and list modules. |
 | `load` | The load command is used to load a module. |
@@ -85,24 +85,30 @@ module load mamba
 ### Install Snakemake and Benchmarking Requirements
 Snakemake is required to run the pipeline.
 ```bash
-mamba install --name snakemake --channel bioconda --channel plotly snakemake==6.15.5 tabulate==0.8.10 plotly==5.11.0
+mamba install --name snakemake --channel conda-forge --channel bioconda snakemake
+pip install snakemake-executor-plugin-cluster-generic
 ```

 We must install tabulate version `0.8.10` as anything under the `0.9.*` release causes issues for our current version of Snakemake

-| Component | Description |
-|:--------------------:|:-----------------------------------------------------:|
-| `mamba` | Use mamba to install additional software more quickly |
-| `install` | The mamba command to install software |
-| `--name snakemake` | The environment to install software into |
-| `--channel bioconda` | The channel to install software from |
-| `--channel plotly` | Install from the plotly channel |
-| `snakemake==6.15.5` | The software and version to install |
-| `tabulate==0.8.10` | The software and version to install |
-| `plotly==5.11.0` | Install plotly for graph creation |
+| Component | Description |
+|:-----------------------:|:-----------------------------------------------------:|
+| `mamba` | Use mamba to install additional software more quickly |
+| `install` | The mamba command to install software |
+| `--name snakemake` | The environment to install software into |
+| `--channel conda-forge` | The channel to install software from |
+| `--channel bioconda` | The channel to install software from |
+| `snakemake` | The software to install, defaults to latest version |
+
+| Component | Description |
+|:-------------------------------------------:|:----------------------------------------------------------:|
+| `pip` | Use pip to install python-only dependencies |
+| `install` | The pip command to install software |
+| `snakemake-executor-plugin-cluster-generic` | The package required to use profiles in `snakemake>=8.0.0` |

 ## Test Installations
+The following command should return a valid number, ideally greater than 7.x. If this is not the case, investigate why a lower version was installed, or [open an issue](https://github.com/HelikarLab/FastqToGeneCounts/issues/new) on our GitHub page.
+
 ```bash
 snakemake --version
-# Returns "7.25.0"
 ```
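
The updated docs ask the reader to confirm that `snakemake --version` reports a release newer than 7.x, and the table above ties the `cluster-generic` plugin to `snakemake>=8.0.0`. A small sketch of that comparison follows. The function names are hypothetical, and it assumes the version string is purely numeric (for example `8.4.2`), which may not hold for development builds.

```python
def parse_version(version_text):
    """Turn a string such as '8.4.2' into a 3-tuple of ints, e.g. (8, 4, 2)."""
    parts = [int(part) for part in version_text.strip().split(".")[:3]]
    # Pad short versions such as '8' or '8.4' so tuple comparison is well defined.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def supports_executor_plugins(version_text, minimum=(8, 0, 0)):
    """Return True when the reported version is at least `minimum`."""
    return parse_version(version_text) >= minimum


# "7.25.0" is the value the previous docs listed as the expected output.
print(supports_executor_plugins("7.25.0"))
print(supports_executor_plugins("8.0.0"))
```

Comparing integer tuples avoids the string-comparison trap where `"7.25.0"` would sort after `"10.0.0"`.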

docs/pages/fastq/fastq_setup_genome.md

Lines changed: 3 additions & 3 deletions
@@ -18,8 +18,6 @@ assembly_release=109
 GRCh_version=38
 ```

-```bash
-
 ## Genome FASTA File
 ```bash
 # Change directories into your `genome` directory
@@ -72,7 +70,7 @@ rm refFlat.tmp.txt
 ```bash
 # Execute this in the `genome` directory!

--# Download the BED file, then set the file name
+# Download the BED file, then set the file name
 wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v${GRCh_version}.bed.gz/download
 mv download hg38_GENCODE.v${GRCh_version}.bed.gz

@@ -95,6 +93,8 @@ To run the riboInt.sh file, perform the following:

 ```bash
 # Change directories to the location the pipeline was downloaded; for example:
+
+# WARNING: Change the next line to your own installation directory!
 cd /work/helikarlab/joshl/FastqToGeneCounts
 sh riboInt.sh
 ```

docs/pages/fastq/fastq_setup_profile.md

Lines changed: 1 addition & 3 deletions
@@ -19,13 +19,11 @@ Snakemake Profiles have multiple benefits:
 Perhaps the most important part of Profiles is Point 4. Instead of requesting, for example, 40 cores, 50GB of RAM, and multiple hours for the entire Snakemake workflow (as this is the maximum resources we require), Profiles will request small amounts of resources for rules that do not require them. This is especially important for rules that are not CPU intensive, as they take far less time to run. This means we can run more jobs at once, and our jobs will finish faster.

 ## Setup
-A default profile was downloaded with this repository under the `cluster` directory. If you are not a part of our wonderful Helikar Lab, you must edit the `slurm_account` line to your slurm account. This information can be found with the following command:
+A default profile was downloaded with this repository under the `cluster` directory. If you are not a part of our wonderful Helikar Lab, you must edit the `--account=` line to match your slurm account. This information can be found with the command listed below. In this example, the value listed under `Def Acct` should be included, like this: `--account=helikarlab`
 ```bash
 > sacctmgr show user $USER accounts

 User Def Acct Def WCKey Admin
 ---------- ---------- ---------- ---------
 joshl helikarlab None
 ```
-
-The account name is the second column; in this case, `helikarlab`. This should be entered into the `slurm_account` line in the `cluster/config.yaml` file.
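
The setup text says the value under `Def Acct` is what belongs after `--account=`. As a rough sketch of that lookup, the snippet below takes the second whitespace-separated field of the first row after the dashed separator. It assumes the output has the same layout as the example in the docs; the sample text is retyped with guessed column padding, and `default_account` is a hypothetical helper name.

```python
# Sample text modelled on the `sacctmgr show user $USER accounts` example in
# the docs; the exact column widths here are an assumption.
SAMPLE_OUTPUT = """\
      User   Def Acct  Def WCKey     Admin
---------- ---------- ---------- ---------
     joshl helikarlab                 None
"""


def default_account(sacctmgr_output):
    """Return the `Def Acct` value from the first data row."""
    lines = [line for line in sacctmgr_output.splitlines() if line.strip()]
    for position, line in enumerate(lines):
        # The separator row consists only of dashes and spaces.
        is_separator = set(line.strip()) <= {"-", " "}
        if is_separator and position + 1 < len(lines):
            fields = lines[position + 1].split()
            # Field 0 is the user name, field 1 is the default account.
            return fields[1]
    raise ValueError("No data row found after the separator line")


print(f"--account={default_account(SAMPLE_OUTPUT)}")
```

The header row is deliberately not split, because `Def Acct` itself contains a space and would throw off a column count.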

execute.sh

Lines changed: 2 additions & 2 deletions
@@ -3,9 +3,9 @@
 execution_successful=false
 profile=$1

-# If `profile` is empty, set `profile` to "slurm"
+# If `profile` is empty, set `profile` to "cluster"
 if [ -z "$profile" ]; then
-    profile="slurm"
+    profile="cluster"
 fi


utils/get.py

Lines changed: 41 additions & 24 deletions
@@ -5,7 +5,7 @@
 from typing import Union

 import snakemake
-from snakemake import io
+from snakemake import io as snakemake_io

 from utils import perform
 from utils.constants import Layout
@@ -15,59 +15,64 @@ def from_master_config(config: dict, attribute: str) -> list[str]:
     valid_inputs = ["SRR", "tissue", "tag", "PE_SE"]
     sub_list = ["tissue", "tag"]
     if attribute not in valid_inputs:
-        sys.exit(f"\nInvalid attribute input. '{attribute}' is not one of: {valid_inputs}\n")
-
+        sys.exit(
+            f"\nInvalid attribute input. '{attribute}' is not one of: {valid_inputs}\n"
+        )
+
     collect_attributes = []
     index_value = valid_inputs.index(attribute)
-
+
     # We have to subtract one because "tissue" and "tag" are in the same index, thus the index value in valid_inputs is increased by one
     if index_value >= 2:
         index_value -= 1
-
+
     control_lines = open(config["MASTER_CONTROL"], "r")
     dialect = csv.Sniffer().sniff(control_lines.read(1024))
     control_lines.seek(0)
     reader = csv.reader(control_lines, delimiter=str(dialect.delimiter))
-
+
     for line in reader:
-
         # Get the column from master_control we are interested in
         column_value = line[index_value]
         PE_SE_value = Layout[line[2]] # PE, SE, or SLC
-
+        target_attribute = []
+
         # test if we are looking for "tissue" or "tag", as these two values are located at master_control index 1
         if attribute in sub_list:
             sub_index = sub_list.index(attribute)
             split_list = str(line[index_value]).split("_")
-
+
             # We must append the target attribute twice if it is paired end, once if it is single end
             if PE_SE_value in [Layout.PE, Layout.SLC]:
                 target_attribute = [split_list[sub_index], split_list[sub_index]]
             elif PE_SE_value == Layout.SE:
                 target_attribute = [split_list[sub_index]]
-
+
         # Test if we are gathering the ended-ness (PE, SE, SLC)
         elif attribute == "PE_SE":
             # We must append the target attribute twice if it is paired end, once if it is single end
-            if column_value in [Layout.PE.name, Layout.SLC.name]: # paired end or single cell
+            if column_value in [
+                Layout.PE.name,
+                Layout.SLC.name,
+            ]: # paired end or single cell
                 target_attribute = ["1", "2"]
             elif column_value == Layout.SE.name: # Single end
                 target_attribute = ["S"]
-
+
         # If we are doing anything else, simply append the column value the appropriate number of times
         else:
             if PE_SE_value in [Layout.PE, Layout.SLC]:
                 target_attribute = [line[index_value], line[index_value]]
             elif PE_SE_value == Layout.SE:
                 target_attribute = [line[index_value]]
-
+
         collect_attributes += target_attribute
-
+
     control_lines.close()
     return collect_attributes


-def srr_code(config: dict) -> list[str]:
+def srr_code(config: dict) -> list[str] | None:
     """
     Only should be getting SRR values if we are performing prefetch
     """
@@ -79,26 +84,35 @@ def tissue_name(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         return from_master_config(config=config, attribute="tissue")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         return fastq_input.tissue_name


 def tags(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         return from_master_config(config=config, attribute="tag")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         return fastq_input.tag


 def PE_SE(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         return from_master_config(config=config, attribute="PE_SE")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         return fastq_input.PE_SE


@@ -120,10 +134,13 @@ def sample(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         tag = from_master_config(config=config, attribute="tag")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         tag = fastq_input.tag
-
+
     sample = []
     for t in tag:
         sample.append(t.split("R")[0])
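
To make the behaviour of `from_master_config` and `sample` easier to follow, here is a condensed, self-contained sketch of the same idea: each control-file row contributes its value twice for paired-end or single-cell layouts and once for single-end. This is an illustrative re-implementation, not the committed code. The `Layout` enum is a stand-in for `utils.constants.Layout` (whose definition is not shown in this diff), the control-file rows are invented, and the sniffer is restricted to comma and tab delimiters so the tiny sample is unambiguous.

```python
import csv
import io
from enum import Enum


class Layout(Enum):
    # Stand-in for utils.constants.Layout; member values are assumptions.
    PE = "PE"
    SE = "SE"
    SLC = "SLC"


# Hypothetical control file rows: SRR code, tissue_tag, layout.
CONTROL_TEXT = "SRR0000001,liver_S1R1,PE\nSRR0000002,liver_S1R2,SE\n"

# Control-file column that holds each attribute ("tissue" and "tag" share one).
COLUMNS = {"SRR": 0, "tissue": 1, "tag": 1, "PE_SE": 2}


def collect(attribute, control_text=CONTROL_TEXT):
    """Return one entry per expected FastQ file for the requested attribute."""
    if attribute not in COLUMNS:
        raise ValueError(f"'{attribute}' is not one of: {list(COLUMNS)}")

    dialect = csv.Sniffer().sniff(control_text[:1024], delimiters=",\t")
    reader = csv.reader(io.StringIO(control_text), delimiter=dialect.delimiter)

    collected = []
    for line in reader:
        paired = Layout[line[2]] in (Layout.PE, Layout.SLC)
        if attribute == "PE_SE":
            # Paired layouts yield read files 1 and 2; single-end yields "S".
            values = ["1", "2"] if paired else ["S"]
        else:
            value = line[COLUMNS[attribute]]
            if attribute in ("tissue", "tag"):
                # "liver_S1R1" -> tissue "liver", tag "S1R1"
                value = value.split("_")[("tissue", "tag").index(attribute)]
            values = [value, value] if paired else [value]
        collected += values
    return collected


tags = collect("tag")
# Same transformation as in `sample`: keep the part of the tag before "R".
samples = [tag.split("R")[0] for tag in tags]
print(collect("SRR"), collect("tissue"), tags, collect("PE_SE"), samples)
```

Because every attribute is expanded with the same paired/single rule, the returned lists stay the same length and can be zipped together position by position.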
