Documentation update + a lot more (#76)

Josh Loecker · web-flow · commit 872c0449fe5f · 2024-04-25T14:42:12.000-05:00
* Updated .gitignore to remove cache-based items

* Updated snakemake installation

* Updated documentation to include snakemake generic cluster commands

* `ruff` formatting changes

* Updated cpu request command

* Ignore vscode

* Use `threads` instead of `resources.threads`

* Remove dependency on external FTP servers for fastq contaminant genomes

* Update zip_url for get_contaminant_genomes rule

* Format using `snakefmt`

* Remove `groups`, add additional details on what changing settings will do

* Include information on why tissue name was left blank

* Do not provide a tissue name for default resources. Require all rules to define one, or set it empty (`tissue_name=""`)

* Perform `prefetch` and `dump fastq` by default

* Format using ruff

* Use inline if-else statements for (I think?) cleaner/clearer requirements for `rule all`

* Fix formatting changes from `snakefmt` because I don't like them.

Change the `get_contaminant_genomes` URL because the initial zip archive mistakenly excluded the `Adapters` component

* Mark root output as "directory" in rule `get_contaminant_genomes`

* Properly indent `zip` and closing parenthesis

* Fix comment

* Reduce complexity of prefetch, also fix downloading to scratch directory

* Expand nested list to create a single list

* Ignore any control files

* Change profile to a default of `cluster`
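
The "expand nested list to create a single list" item above amounts to flattening one level of nesting. The following is a minimal sketch of that idea, not the code from this PR; the helper name `flatten_once` and the sample file names are hypothetical and chosen only for illustration.

```python
from itertools import chain


def flatten_once(nested: list[list[str]]) -> list[str]:
    """Flatten exactly one level of nesting, preserving order."""
    return list(chain.from_iterable(nested))


# Hypothetical per-sample outputs: one inner list per sample
per_sample = [["a_1.fastq.gz", "a_2.fastq.gz"], ["b_S.fastq.gz"]]
flat = flatten_once(per_sample)
print(flat)
```

A flat list is easier to hand to something that expects a plain sequence of paths; deeper nesting would need a recursive approach instead.
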
diff --git a/.gitignore b/.gitignore
@@ -1,10 +1,16 @@
+**/.DS_Store
 **/.snakemake/
-/venv
-/.idea
+**/*cache*
+
+.vscode
+.idea
+
+/controls
+/benchmarks
+/COMO_input
+/controls
 /docs/_site/
 /genome
 /logs
 /results
-/controls
-/benchmarks
-/COMO_input
+/venv
diff --git a/Snakefile b/Snakefile
diff --git a/cluster/config.v8+.yaml b/cluster/config.v8+.yaml
@@ -0,0 +1,36 @@
+# This file configures various settings for snakemake to execute jobs to a SLURM cluster.
+# FROM: https://github.com/jdblischak/smk-simple-slurm
+
+executor: cluster-generic
+cluster-generic-submit-cmd:
+  mkdir -p logs/{rule}/{resources.tissue_name} &&
+  sbatch
+    --job-name=smk-{rule}-{wildcards}
+    --account=helikarlab
+    --cpus-per-task={threads}
+    --output=logs/{rule}/{resources.tissue_name}/{rule}-{wildcards}.out
+    --mem={resources.mem_mb}
+    --time={resources.runtime}
+    --parsable
+
+# Define tissue name
+default-resources:
+  - mem_mb=2048
+
+# Job submission
+cores: 16  # max cores used in snakefile
+cluster-generic-cancel-cmd: scancel
+cluster-generic-cancel-nargs: 50
+restart-times: 0
+max-jobs-per-second: 10
+max-status-checks-per-second: 5
+latency-wait: 60
+jobs: 100
+
+# Other settings
+printshellcmds: True
+rerun-incomplete: True
+
+# Do not change these settings. This pipeline will fail to execute or execute extremely slowly if they are changed
+use-conda: True
+conda-frontend: mamba
diff --git a/cluster/config.yaml b/cluster/config.yaml
@@ -13,7 +13,6 @@ cluster:
     --parsable
 
 default-resources:
-  - tissue_name=""
   - mem_mb=2048
 
 # Job submittion
diff --git a/config.yaml b/config.yaml
@@ -31,11 +31,11 @@ PERFORM_GET_RNASEQ_METRICS: True
 
 # Prefetch SRA files before writing to FastQ (default: True)
 # If you are processing local FastQ items, set to False
-PERFORM_PREFETCH: False
+PERFORM_PREFETCH: True
 
 # Get RNA information from SRA files (default: True)
 # If you are processing local FastQ items, set to False
-PERFORM_DUMP_FASTQ: False
+PERFORM_DUMP_FASTQ: True
 
 # Determine insert sizes for fragment length calculation (default: True)
 PERFORM_GET_INSERT_SIZE: True
diff --git a/docs/pages/fastq/fastq_running.md b/docs/pages/fastq/fastq_running.md
@@ -74,12 +74,12 @@ snakemake --profile cluster --dry-run
 ```
 
 {% include note.html content="If you did renamed the `cluster` directory to something else, replace the `--profile cluster` with the name of your directory" %}
-{% include note.html content="If you receive an error when running `snakemake --profile cluster --dry-run`, replcae `slurm` with `./cluster`" %}
+{% include note.html content="If you receive an error when running `snakemake --profile cluster --dry-run`, replace `cluster` with `./cluster`" %}
 
 After several seconds, many lines should move through the terminal.<br>
 It should end with `This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.`
 
-If this is not the case, an error has occured, and it will need to be investigated before continuing. If you are having troubles, please [Open an Issue](https://github.com/HelikarLab/FastqToGeneCounts/issues)
+If this is not the case, an error has occurred, and it will need to be investigated before continuing. If you are having trouble, please [Open an Issue](https://github.com/HelikarLab/FastqToGeneCounts/issues/new)
 
 ## Execution
 Once you have confirmed that a dry-run will execute successfully, it is time to start a real run of the workflow.<br>
diff --git a/docs/pages/fastq/fastq_setup_conda.md b/docs/pages/fastq/fastq_setup_conda.md
@@ -17,7 +17,7 @@ In most cluster environments (i.e., HCC), you must activate the `mamba` module.
 module load mamba
 ```
 
-| Component |                          Description                          | 
+| Component |                          Description                          |
 |:---------:|:-------------------------------------------------------------:|
 | `module`  | The module command is used to load, unload, and list modules. |
 |  `load`   |          The load command is used to load a module.           |
@@ -85,24 +85,30 @@ module load mamba
 ### Install Snakemake and Benchmarking Requirements
 Snakemake is required to run the pipeline.
 ```bash
-mamba install --name snakemake --channel bioconda --channel plotly snakemake==6.15.5 tabulate==0.8.10 plotly==5.11.0
+mamba install --name snakemake --channel conda-forge --channel bioconda snakemake
+pip install snakemake-executor-plugin-cluster-generic
 ```
 
 We must install tabulate version `0.8.10` as anything under the `0.9.*` release causes issues for our current version of Snakemake
 
-|      Component       |                      Description                      |
-|:--------------------:|:-----------------------------------------------------:|
-|       `mamba`        | Use mamba to install additional software more quickly |
-|      `install`       |         The mamba command to install software         |
-|  `--name snakemake`  |       The environment to install software into        |
-| `--channel bioconda` |         The channel to install software from          |
-|  `--channel plotly`  |            Install from the plotly channel            |
-| `snakemake==6.15.5`  |          The software and version to install          |
-|  `tabulate==0.8.10`  |          The software and version to install          |
-|   `plotly==5.11.0`   |           Install plotly for graph creation           |
+|        Component        |                      Description                      |
+|:-----------------------:|:-----------------------------------------------------:|
+|         `mamba`         | Use mamba to install additional software more quickly |
+|        `install`        |         The mamba command to install software         |
+|   `--name snakemake`    |       The environment to install software into        |
+| `--channel conda-forge` |         The channel to install software from          |
+|  `--channel bioconda`   |         The channel to install software from          |
+|       `snakemake`       |  The software to install, defaults to latest version  |
+
+|                  Component                  |                        Description                         |
+|:-------------------------------------------:|:----------------------------------------------------------:|
+|                    `pip`                    |        Use pip to install python-only dependencies         |
+|                  `install`                  |            The pip command to install software             |
+| `snakemake-executor-plugin-cluster-generic` | The package required to use profiles in `snakemake>=8.0.0` |
 
 ## Test Installations
+The following command should return a valid version number, ideally 8.0.0 or later. If this is not the case, investigate why a lower version was installed, or [open an issue](https://github.com/HelikarLab/FastqToGeneCounts/issues/new) on our GitHub page.
+
 ```bash
 snakemake --version
-# Returns "7.25.0"
 ```
diff --git a/docs/pages/fastq/fastq_setup_genome.md b/docs/pages/fastq/fastq_setup_genome.md
@@ -18,8 +18,6 @@ assembly_release=109
 GRCh_version=38
 ```
 
-```bash
-
 ## Genome FASTA File
 ```bash
 # Change directories into your `genome` directory
@@ -72,7 +70,7 @@ rm refFlat.tmp.txt
 ```bash
 # Execute this in the `genome` directory!
 
- -# Download the BED file, then set the file name
+# Download the BED file, then set the file name
 wget https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/hg38_GENCODE.v${GRCh_version}.bed.gz/download
 mv download hg38_GENCODE.v${GRCh_version}.bed.gz
 
@@ -95,6 +93,8 @@ To run the riboInt.sh file, perform the following:
 
 ```bash
 # Change directories to the location the pipeline was downloaded; for example:
+
+# WARNING: Change the next line to your own installation directory!
 cd /work/helikarlab/joshl/FastqToGeneCounts
 sh riboInt.sh
 ```
diff --git a/docs/pages/fastq/fastq_setup_profile.md b/docs/pages/fastq/fastq_setup_profile.md
@@ -19,13 +19,11 @@ Snakemake Profiles have multiple benefits:
 Perhaps the most important part of Profiles is Point 4. Instead of requesting, for example, 40 cores, 50GB of RAM, and multiple hours for the entire Snakemake workflow (as this is the maximum resources we require), Profiles will request small amounts of resources for rules that do not require them. This is especially important for rules that are not CPU intensive, as they take far less time to run. This means we can run more jobs at once, and our jobs will finish faster.
 
 ## Setup
-A default profile was downloaded with this repository under the `cluster` directory. If you are not a part of our wonderful Helikar Lab, you must edit the `slurm_account` line to your slurm account. This information can be found with the following command:
+A default profile was downloaded with this repository under the `cluster` directory. If you are not a part of our wonderful Helikar Lab, you must edit the `--account=` line to match your slurm account. This information can be found with the command listed below. In this example, the value listed under `Def Acct` should be included, like this: `--account=helikarlab`
 ```bash
 > sacctmgr show user $USER accounts
 
       User   Def Acct  Def WCKey     Admin
 ---------- ---------- ---------- ---------
      joshl helikarlab                 None
 ```
-
-The account name is the second column; in this case, `helikarlab`. This should be entered into the `slurm_account` line in the `cluster/config.yaml` file.
diff --git a/execute.sh b/execute.sh
@@ -3,9 +3,9 @@
 execution_successful=false
 profile=$1
 
-# If `profile` is empty, set `profile` to "slurm"
+# If `profile` is empty, set `profile` to "cluster"
 if [ -z "$profile" ]; then
-    profile="slurm"
+    profile="cluster"
 fi
 
 
diff --git a/utils/get.py b/utils/get.py
@@ -5,7 +5,7 @@
 from typing import Union
 
 import snakemake
-from snakemake import io
+from snakemake import io as snakemake_io
 
 from utils import perform
 from utils.constants import Layout
@@ -15,59 +15,64 @@ def from_master_config(config: dict, attribute: str) -> list[str]:
     valid_inputs = ["SRR", "tissue", "tag", "PE_SE"]
     sub_list = ["tissue", "tag"]
     if attribute not in valid_inputs:
-        sys.exit(f"\nInvalid attribute input. '{attribute}' is not one of: {valid_inputs}\n")
-    
+        sys.exit(
+            f"\nInvalid attribute input. '{attribute}' is not one of: {valid_inputs}\n"
+        )
+
     collect_attributes = []
     index_value = valid_inputs.index(attribute)
-    
+
     # We have to subtract one because "tissue" and "tag" are in the same index, thus the index value in valid_inputs is increased by one
     if index_value >= 2:
         index_value -= 1
-    
+
     control_lines = open(config["MASTER_CONTROL"], "r")
     dialect = csv.Sniffer().sniff(control_lines.read(1024))
     control_lines.seek(0)
     reader = csv.reader(control_lines, delimiter=str(dialect.delimiter))
-    
+
     for line in reader:
-        
         # Get the column from master_control we are interested in
         column_value = line[index_value]
         PE_SE_value = Layout[line[2]]  # PE, SE, or SLC
-        
+        target_attribute = []
+
         # test if we are looking for "tissue" or "tag", as these two values are located at master_control index 1
         if attribute in sub_list:
             sub_index = sub_list.index(attribute)
             split_list = str(line[index_value]).split("_")
-            
+
             # We must append the target attribute twice if it is paired end, once if it is single end
             if PE_SE_value in [Layout.PE, Layout.SLC]:
                 target_attribute = [split_list[sub_index], split_list[sub_index]]
             elif PE_SE_value == Layout.SE:
                 target_attribute = [split_list[sub_index]]
-        
+
         # Test if we are gathering the ended-ness (PE, SE, SLC)
         elif attribute == "PE_SE":
             # We must append the target attribute twice if it is paired end, once if it is single end
-            if column_value in [Layout.PE.name, Layout.SLC.name]:  # paired end or single cell
+            if column_value in [
+                Layout.PE.name,
+                Layout.SLC.name,
+            ]:  # paired end or single cell
                 target_attribute = ["1", "2"]
             elif column_value == Layout.SE.name:  # Single end
                 target_attribute = ["S"]
-        
+
         # If we are doing anything else, simply append the column value the appropriate number of times
         else:
             if PE_SE_value in [Layout.PE, Layout.SLC]:
                 target_attribute = [line[index_value], line[index_value]]
             elif PE_SE_value == Layout.SE:
                 target_attribute = [line[index_value]]
-        
+
         collect_attributes += target_attribute
-    
+
     control_lines.close()
     return collect_attributes
 
 
-def srr_code(config: dict) -> list[str]:
+def srr_code(config: dict) -> list[str] | None:
     """
     Only should be getting SRR values if we are performing prefetch
     """
@@ -79,26 +84,35 @@ def tissue_name(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         return from_master_config(config=config, attribute="tissue")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         return fastq_input.tissue_name
 
 
 def tags(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         return from_master_config(config=config, attribute="tag")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         return fastq_input.tag
 
 
 def PE_SE(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         return from_master_config(config=config, attribute="PE_SE")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         return fastq_input.PE_SE
 
 
@@ -120,10 +134,13 @@ def sample(config: dict) -> list[str]:
     if perform.prefetch(config=config):
         tag = from_master_config(config=config, attribute="tag")
     else:
-        fastq_input = snakemake.io.glob_wildcards(
-            os.path.join(config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"))
+        fastq_input = snakemake_io.glob_wildcards(
+            os.path.join(
+                config["LOCAL_FASTQ_FILES"], "{tissue_name}_{tag}_{PE_SE}.fastq.gz"
+            )
+        )
         tag = fastq_input.tag
-    
+
     sample = []
     for t in tag:
         sample.append(t.split("R")[0])
diff --git a/utils/validate.py b/utils/validate.py