swap simuG to conda-based install #168
Conversation
📝 Walkthrough

The pull request introduces several modifications across multiple files.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (4)
harpy/popgroup.py (1)
Line range hint `41-47`: Consider enhancing file handling robustness

While the error messages are clear and helpful, consider adding checks for:
- File permissions when writing the output file
- Directory write permissions before attempting to create the output file
- Maximum filename length limitations
Here's a suggested improvement for the file handling:
```python
def check_write_permissions(filepath):
    """Check if we have permission to write to the directory"""
    directory = os.path.dirname(filepath) or '.'
    return os.access(directory, os.W_OK)

# Add before opening the output file:
if not check_write_permissions(output):
    print_error("permission denied", f"Cannot write to output file: {output}")
    print_solution("Please ensure you have write permissions or choose a different output location")
    sys.exit(1)
```

🧰 Tools
🪛 Ruff (0.7.0)
34-34: Do not use bare `except` (E722)
harpy/snakefiles/simulate_variants.smk (1)
`112-138`: Consider unifying the rename implementation approach.

The two rename rules use different approaches to achieve the same goal:

- `rename_haploid` uses Python's `os.rename`
- `rename_diploid` uses shell `mv` commands

Consider unifying the implementation for better maintainability.
```diff
 rule rename_diploid:
     input:
         fasta = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.simseq.genome.fa",
         mapfile = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.refseq2simseq.map.txt"
     output:
         fasta = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.fasta",
         mapfile = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.{variant}.map"
-    container:
-        None
-    shell:
-        """
-        mv {input.fasta} {output.fasta}
-        mv {input.mapfile} {output.mapfile}
-        """
+    run:
+        for i,j in zip(input, output):
+            os.rename(i,j)
```

harpy/snakefiles/simulate_snpindel.smk (2)
`141-148`: Consider standardizing log file naming

While the input/output paths follow the new naming convention, the log file path at line 149 could be updated to match:
```diff
-    log:
-        f"{outdir}/logs/{outprefix}.hap{{haplotype}}.log"
+    log:
+        f"{outdir}/logs/{outprefix}.hap{{haplotype}}.simseq.log"
```

Also applies to: 153-154, 158-158
`187-201`: Consider enhancing VCF concatenation robustness

The current implementation could benefit from:
- Sorting variants by position
- Validating VCF format
- Checking for duplicate variants
```diff
 rule concatenate_variants:
     input:
         f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.indel.vcf",
         f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.snp.vcf"
     output:
         f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.snpindel.vcf"
+    conda:
+        f"{envdir}/simulations.yaml"
     shell:
-        import shutil
-        shutil.copy(input[0], output[0])
-        with open(output[0], "a") as outvcf, open(input[1], "r") as invcf:
-            for record in invcf:
-                if record.startswith("#"):
-                    continue
-                outvcf.write(record)
+        "bcftools concat -a {input} | bcftools sort -o {output}"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- harpy/_conda.py (1 hunks)
- harpy/popgroup.py (1 hunks)
- harpy/simulate.py (0 hunks)
- harpy/snakefiles/simulate_snpindel.smk (2 hunks)
- harpy/snakefiles/simulate_variants.smk (2 hunks)
💤 Files with no reviewable changes (1)
- harpy/simulate.py
🧰 Additional context used
🪛 Ruff (0.7.0)
harpy/popgroup.py
34-34: Do not use bare `except` (E722)
🔇 Additional comments (10)
harpy/popgroup.py (1)
`28-28`: LGTM! Good improvement to regex patterns

The switch to raw string literals (`r"..."`) for regular expressions is a good practice that improves readability and prevents potential issues with escape sequences.

Also applies to: 36-36
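As a quick aside, here is a generic illustration (not code from this PR) of why raw strings matter for regex escape sequences:

```python
import re

# Without a raw string, "\b" is interpreted by Python as a backspace
# character before the regex engine ever sees it.
pattern_broken = "\bsample\b"   # literal backspace + "sample" + backspace
pattern_ok = r"\bsample\b"      # matches "sample" on word boundaries

print(re.search(pattern_broken, "my sample file"))  # None
print(re.search(pattern_ok, "my sample file"))      # <re.Match ... 'sample'>
```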
harpy/_conda.py (1)
`63-68`: Verify if all perl-related dependencies are still needed

Since we're moving from a local perl-based simuG to a conda package, please verify if all the perl-related dependencies (`perl-math-random`, `perl-inline-c`, `perl-parse-recdescent`, `perl`) are still required by the conda version of simuG.
✅ Verification successful
Perl dependencies are still required - local Perl script is present
The verification shows that Perl dependencies are still needed, as the package includes a Perl script (`harpy/scripts/HaploSim.pl`) that actively uses these modules:

- `Math::Random` for random number generation
- `Inline::C` for C code integration
- The script also uses various other Perl modules

While the package now uses the conda version of simuG, the Perl dependencies are required for other functionality within the package itself.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check simuG conda package dependencies to verify perl requirements

# Get simuG package info from bioconda
curl -s https://api.anaconda.org/package/bioconda/simug | jq -r '.dependencies[]' | grep -i 'perl'

# Search for perl usage in package
conda create -n simug_test bioconda::simug -y 2>/dev/null
conda list -n simug_test | grep -i 'perl'
```
Length of output: 344
Script:
```bash
#!/bin/bash
# Let's try a different approach to verify perl dependencies

# First, check if simug is using perl in its implementation
rg -i "perl|use\s+[\w:]+|require\s+[\w:]+" --type perl

# Check for any perl script execution or perl-related system calls
rg -i "system.*perl|exec.*perl|\`perl|\$PERL|perl\s+[-\w./]+"

# Look for perl module imports or requirements in the codebase
rg -i "use\s+Math::Random|use\s+Inline::C|use\s+Parse::RecDescent"

# Check package metadata files for perl dependencies
fd "meta.ya?ml|environment.ya?ml|setup.py|pyproject.toml" --exec cat {}
```
Length of output: 3931
harpy/snakefiles/simulate_variants.smk (5)
`60-62`: LGTM! Improved output file naming clarity.
The explicit output paths better reflect simuG's default naming convention, making the workflow more predictable and easier to understand.
`75-78`: LGTM! Consistent path naming updates.
The path updates in both rules maintain consistency with the new naming convention, and the temporary VCF file is properly managed.
Also applies to: 96-101
`143-147`: LGTM! Consistent workflow summary updates.
The input paths properly reflect the new file naming convention and maintain appropriate conditional logic for heterozygosity.
Line range hint `1-147`: Verify no broken dependencies on old file paths.
Let's check for any remaining references to the old file naming pattern that might cause broken dependencies.
✅ Verification successful
No dependencies found on old simuG file patterns
The verification confirms that there are no references to the old simuG default naming patterns (`simseq.genome.fa` or `refseq2simseq`) outside of Snakemake files. The renaming changes in the workflow are self-contained and do not impact any external dependencies.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for potential references to old file patterns
rg -g '!*.smk' "simseq.genome.fa|refseq2simseq" || echo "No references found to simuG default naming patterns outside Snakemake files"
```
Length of output: 212
`71-71`: Verify simuG availability in conda environment.

The command now directly uses `simuG` instead of the script. Let's verify the conda environment configuration.
✅ Verification successful
Let me try to locate the conda environment file using a different approach.
Let me check the content of the conda-related files we found to verify the simuG configuration.
simuG is properly configured in the conda environment
The verification shows that simuG is correctly specified in the conda dependencies (see the sketch after this list):

- It's included in the `simulations` environment with the version constraint `bioconda::simug>1.0.0`
- The environment is automatically created by the `_conda.py` script when needed
- The direct use of the `simuG` command is appropriate, as it will be available in the conda environment
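For reference, a conda environment file carrying that constraint would look roughly like this (an illustrative sketch; the actual `simulations.yaml` generated by `_conda.py` may list more dependencies):

```yaml
name: simulations
channels:
  - bioconda
  - conda-forge
dependencies:
  - bioconda::simug>1.0.0   # constraint quoted from the verification above
```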
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check if simuG is properly specified in conda environment file
cat .harpy_envs/simulations.yaml || echo "Conda environment file not found at expected location"
```
Length of output: 212
Script:
```bash
#!/bin/bash
# Search for any yaml/yml files that might contain environment specifications
fd -e yaml -e yml

# Also check if there's any mention of simuG in the repository
rg -i "simug" -g "!{.git,node_modules}" --no-binary
```
Length of output: 1553
Script:
```bash
#!/bin/bash
# Check conda-related YAML files
cat resources/meta.yaml
echo "-------------------"
cat resources/harpy.yaml

# Also check the Python file that seems to handle conda dependencies
cat harpy/_conda.py
```
Length of output: 5704
harpy/snakefiles/simulate_snpindel.smk (3)
`97-100`: LGTM: Output paths are well-structured
The output paths follow a clear naming convention and correctly handle conditional SNP/indel outputs.
`113-116`: LGTM: Consistent directory structure and naming
The diploid variant rules maintain a clear and consistent directory structure under 'diploid/' with well-organized output paths.
Also applies to: 134-137
`206-211`: LGTM: Comprehensive workflow summary updates
The workflow summary correctly captures all output files with appropriate conditionals for different variant types and ploidy scenarios.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
harpy/view.py (2)
`20-21`: Minor docstring improvement needed.

The docstring mentions "Use `--snakefile` or `--config`" but splits across lines in a way that might affect readability.

```diff
-    of a Harpy output directory. Use `--snakefile` or `--config` to view the workflow
-    snakefile or config.yaml file instead, respectively. Output is printed to the screen via `less` and
+    of a Harpy output directory. Use `--snakefile` or `--config` to view the workflow's
+    snakefile or config.yaml file, respectively. Output is printed to the screen via `less` and
```
`44-51`: Consider making the config filename configurable.

The config file path is currently hard-coded as `config.yaml`. Consider making this configurable to support different naming conventions or multiple config files.

```diff
+DEFAULT_CONFIG_NAME = "config.yaml"
+
 def view(directory, snakefile, config):
     ...
     elif config:
-        files = [f"{directory}/workflow/config.yaml"]
+        files = [f"{directory}/workflow/{DEFAULT_CONFIG_NAME}"]
         err_dir = f"{directory}/workflow/"
-        err_file = "There is no [blue]config.yaml[/blue] file"
+        err_file = f"There is no [blue]{DEFAULT_CONFIG_NAME}[/blue] file"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- harpy/view.py (4 hunks)
🔇 Additional comments (3)
harpy/view.py (3)
`12-15`: LGTM! Well-structured CLI option addition.

The new `--config` option is properly integrated with appropriate flags and help text, maintaining consistency with the existing `--snakefile` option.
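For readers unfamiliar with the pattern, a pair of flags styled this way typically looks something like the following (an illustrative sketch using plain click; it is not the actual harpy/view.py source):

```python
import click

@click.command()
@click.option("--snakefile", is_flag=True, default=False,
              help="View the workflow snakefile instead")
@click.option("--config", is_flag=True, default=False,
              help="View the config.yaml file instead")
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
def view(directory, snakefile, config):
    """View the log, snakefile, or config file of a Harpy output directory."""
    ...
```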
`32-34`: LGTM! Good error handling for mutually exclusive options.
The mutual exclusivity check is well-implemented with a clear error message using rich formatting.
`69-69`: LGTM! Consistent error message formatting.
The error message follows the established pattern and properly uses rich formatting.
Actionable comments posted: 15
🧹 Outside diff range and nitpick comments (49)
harpy/container.py (1)
`18-18`: Address the TODO comment regarding environment completeness.

The TODO comment suggests that not all required conda environments are being included in the container build. This could lead to missing dependencies when running workflows in the containerized environment.
Would you like me to help identify all the required conda environments across the codebase? I can create a GitHub issue to track this task.
harpy/snakefiles/containerize.smk (1)
`4-4`: Consider making the environment directory configurable

While the path construction is correct, consider making the environment directory configurable through Snakemake's config system or workflow.basedir instead of hardcoding the relative path. This would provide more flexibility for different deployment scenarios.

```diff
-envdir = os.path.join(os.getcwd(), "container/workflow/envs")
+envdir = os.path.join(workflow.basedir, config.get("env_dir", "container/workflow/envs"))
```

harpy/snakefiles/preflight_bam.smk (2)
Line range hint `7-12`: Preserve error logs for debugging

The current `onerror` handler removes the log file, which could make debugging failures more difficult. Consider preserving error logs in a dedicated directory.

```diff
 onerror:
-    os.remove(logger.logfile)
+    error_log = os.path.join(outdir, "logs", "error.log")
+    os.makedirs(os.path.dirname(error_log), exist_ok=True)
+    shutil.copy2(logger.logfile, error_log)
+    logger.error(f"Workflow failed. Error log saved to {error_log}")
```
Line range hint `89-103`: Enhance workflow summary with environment details

The workflow summary could be more informative by including details about the conda environments and container versions used.

```diff
 run:
     os.makedirs(f"{outdir}/workflow/", exist_ok= True)
     summary = ["The harpy preflight bam workflow ran using these parameters:"]
+    # Add container version
+    summary.append(f"Container version: {containerized}")
+    # Add conda environment details
+    if os.path.exists(envdir):
+        env_files = glob.glob(os.path.join(envdir, "*.yaml"))
+        if env_files:
+            summary.append("\nConda environments:")
+            for env in env_files:
+                summary.append(f"\t{os.path.basename(env)}")
     valids = "Validations were performed with:\n"
```

harpy/snakefiles/preflight_fastq.smk (1)
Line range hint `48-60`: LGTM with a minor suggestion

The concatenation logic is solid with proper temporary file handling and sorted output. Consider using a heredoc for the header line to improve readability:

```diff
 shell:
     """
-    cat {input} | sort -k1 > {output.tmp}
-    echo -e "file\treads\tnoBX\tbadBX\tbadSamSpec\tbxNotLast\n$(cat {output.tmp})" > {output.final}
+    cat {input} | sort -k1 > {output.tmp}
+    cat << 'EOF' > {output.final}
+file	reads	noBX	badBX	badSamSpec	bxNotLast
+EOF
+    cat {output.tmp} >> {output.final}
     """
```

harpy/resume.py (1)
`27-34`: Fix typo in docstring requirements

The docstring updates accurately reflect the changes, but there's a typo in the requirements section.

```diff
-    - the targest directory has `workflow/envs/*.yaml` present in it
+    - the target directory has `workflow/envs/*.yaml` present in it
```

harpy/snakefiles/deconvolve.smk (1)
Line range hint `1-1`: Consider standardizing environment management approach.

The workflow mixes different environment management approaches:

- Some rules use containers (`containerized` directive at the top)
- Some rules explicitly set `container: None`
- Some rules use conda environments
Consider standardizing the approach to use either conda environments consistently (aligning with the PR's objective of conda-based installation) or containers throughout the workflow. This would make the workflow more maintainable and reduce potential environment-related issues.
Also applies to: 18-18, 41-44, 52-59, 61-67
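To make the conda-first option concrete, a rule under that approach would look roughly like this (an illustrative sketch; the rule name and command are hypothetical, and the workflow would be run with Snakemake's `--use-conda` flag):

```
# Sketch: conda-managed environment instead of container: None
rule deconvolve_sample:
    input:
        "{sample}.fq.gz"
    output:
        "{sample}.deconvolved.fq.gz"
    conda:
        f"{envdir}/qc.yaml"   # envdir as defined at the top of the snakefile
    shell:
        "QuickDeconvolution -t {threads} -i {input} -o {output}"
```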
harpy/_conda.py (2)
`107-113`: Consider enhancing the SPAdes post-deployment script

The current implementation could be improved in several ways:
- Add SHA256 checksum verification for the downloaded file
- Make the SPAdes version configurable
- Check for wget availability before use
Here's a suggested improvement:
```diff
-    shellscript.write("wget -O .spades.tar.gz https://github.com/ablab/spades/releases/download/v4.0.0/SPAdes-4.0.0-Linux.tar.gz\n")
-    shellscript.write("tar -xvzf .spades.tar.gz && rm .spades.tar.gz\n")
+    shellscript.write("""
+SPADES_VERSION="4.0.0"
+SPADES_SHA256="904e52c505fb3ce2e8ca1147957f31500208df7dd55547474cf7da42c6935c2c"
+
+command -v wget >/dev/null 2>&1 || { echo "wget is required but not installed. Aborting." >&2; exit 1; }
+
+wget -O .spades.tar.gz "https://github.com/ablab/spades/releases/download/v${SPADES_VERSION}/SPAdes-${SPADES_VERSION}-Linux.tar.gz"
+echo "${SPADES_SHA256}  .spades.tar.gz" | sha256sum -c - || { echo "Checksum verification failed!" >&2; exit 1; }
+tar -xzf .spades.tar.gz && rm .spades.tar.gz
+""")
```
`134-134`: Fix typo in error message

There's a typo in "otheriwse".

```diff
-    "Check that the names conform to Harpy's expectations, otheriwse you can recreate this directory using the [green bold]--conda[/green bold] option.",
+    "Check that the names conform to Harpy's expectations, otherwise you can recreate this directory using the [green bold]--conda[/green bold] option.",
```

harpy/snakefiles/demultiplex_gen1.smk (2)
Line range hint `89-96`: Consider improving the robustness of sample demultiplexing.

The current grep-based approach for demultiplexing might be fragile:

- Fixed pattern matching (`A..{params}B..D`) could break if the format changes
- Silent failure with `touch {output}` might mask issues

Consider:
- Using a more robust parsing approach
- Adding error logging
- Implementing validation of the output file size
```diff
 rule demultiplex_samples:
     input:
         outdir + "/demux_R{FR}_001.fastq.gz"
     output:
         outdir + "/{sample}.R{FR}.fq.gz"
     params:
         c_barcode = lambda wc: samples[wc.get("sample")]
+    log:
+        outdir + "/logs/demux/{sample}.R{FR}.log"
     container:
         None
     shell:
         """
-        ( zgrep -A3 "A..{params}B..D" {input} | grep -v "^--$" | gzip -q > {output} ) || touch {output}
+        (
+        zgrep -A3 "A..{params}B..D" {input} | grep -v "^--$" | gzip -q > {output}
+        if [ ! -s {output} ]; then
+            echo "Warning: No reads found for barcode pattern A..{params}B..D" >> {log}
+            touch {output}
+        fi
+        ) 2>> {log}
         """
```
Line range hint `98-127`: Consider improving error handling in quality assessment.

The hardcoded fallback output in case of failure might become outdated if the falco tool's output format changes.
Consider:
- Moving the fallback template to a separate file
- Adding version checking
- Implementing proper error codes
```diff
 rule assess_quality:
     input:
         outdir + "/{sample}.R{FR}.fq.gz"
     output:
         outdir + "/reports/data/{sample}.R{FR}.fastqc"
     log:
         outdir + "/logs/{sample}.R{FR}.qc.log"
     params:
         f"{outdir}/reports/data"
     threads: 1
     conda:
         f"{envdir}/qc.yaml"
     shell:
         """
-        ( falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} ) > {log} 2>&1 ||
-cat <<EOF > {output}
-##Falco 1.2.4
->>Basic Statistics fail
-#Measure Value
-Filename {wildcards.sample}.R{wildcards.FR}.fq.gz
-File type Conventional base calls
-Encoding Sanger / Illumina 1.9
-Total Sequences 0
-Sequences flagged as poor quality 0
-Sequence length 0
-%GC 0
->>END_MODULE
-EOF
+        falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} > {log} 2>&1
+        if [ $? -ne 0 ]; then
+            echo "Error: Quality assessment failed for {wildcards.sample}.R{wildcards.FR}" >> {log}
+            exit 1
+        fi
         """
```

harpy/metassembly.py (1)
`72-72`: Consider version pinning conda environments for reproducibility.

While the conda environments list is comprehensive for the workflow needs, consider adding version specifications to ensure reproducible builds across different systems.
Example modification:
```diff
-    conda_envs = ["align", "assembly", "metassembly", "qc", "spades"]
+    conda_envs = {
+        "align": "1.0",
+        "assembly": "1.0",
+        "metassembly": "1.0",
+        "qc": "1.0",
+        "spades": "3.15.5"  # or your preferred version
+    }
```

Also applies to: 85-85
harpy/qc.py (1)
`79-79`: Consider making conda environments configurable.

While the current hardcoded list `["qc", "r"]` works for the current setup, consider making these environments configurable through a configuration file or CLI parameter for better flexibility and maintainability (see the sketch below).
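A minimal sketch of the config-driven pattern, assuming a `conda_environments` key in the workflow's `config.yaml` (the helper function is hypothetical, not existing Harpy code):

```python
import yaml

def load_conda_envs(config_path, default=("qc", "r")):
    """Read the conda environment list from a workflow config, with a fallback."""
    try:
        with open(config_path, "r", encoding="utf-8") as f:
            config = yaml.safe_load(f) or {}
        return list(config.get("conda_environments", default))
    except FileNotFoundError:
        return list(default)

conda_envs = load_conda_envs("workflow/config.yaml")
```

harpy/impute.py (1)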
`85-93`: Consider improving conda environment configuration.

While the addition of conda environments is aligned with the PR objectives, consider these improvements:
- Move the environment list to a configuration file for better maintainability
- Add validation for required environment names
- Consider if additional environments might be needed for the imputation workflow
Example implementation:
```diff
+    from typing import List
+    def validate_conda_envs(envs: List[str]) -> None:
+        required_envs = {"r", "stitch"}
+        if not required_envs.issubset(set(envs)):
+            missing = required_envs - set(envs)
+            raise ValueError(f"Missing required conda environments: {missing}")

-    conda_envs = ["r", "stitch"]
+    # Load from config file
+    conda_envs = load_conda_config()  # implement this function
+    validate_conda_envs(conda_envs)
```

harpy/snakefiles/assembly.smk (3)
Line range hint `1-1`: Consider using a specific container version for reproducibility.

Using the `latest` tag for the container image can lead to reproducibility issues as the image content may change over time. Consider pinning to a specific version.

```diff
-containerized: "docker://pdimens/harpy:latest"
+containerized: "docker://pdimens/harpy:1.0.0"  # Replace with actual version
```
Line range hint `6-9`: Improve log file handling for better debugging.

The current implementation removes the log file in both success and error cases, which might make debugging harder. Consider:

- Preserving the log file on error
- Moving successful logs to an archive directory

```diff
 onstart:
     logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
 onsuccess:
-    os.remove(logger.logfile)
+    os.rename(logger.logfile, os.path.join(outdir, "logs", "archive", os.path.basename(logger.logfile)))
 onerror:
-    os.remove(logger.logfile)
+    logger.logger.error("Workflow failed. Log file preserved at: %s", logger.logfile)
```
Line range hint `156-169`: Improve error handling in BUSCO analysis.

The current implementation silently creates an empty output file on failure (`touch {output}`), which could mask real issues. Consider failing the workflow or at least logging the error.

```diff
 rule BUSCO_analysis:
     input:
         f"{outdir}/scaffolds.fasta"
     output:
         f"{outdir}/busco/short_summary.specific.{lineagedb}_odb10.busco.txt"
     log:
         f"{outdir}/logs/busco.log"
     params:
         output_folder = outdir,
         out_prefix = "-o busco",
         lineage = f"-l {lineagedb}",
         download_path = f"--download_path {outdir}/busco",
         metaeuk = "--metaeuk" if organism == "eukaryote" else ""
     threads: workflow.cores
     conda:
         f"{envdir}/assembly.yaml"
     shell:
-        "( busco -f -i {input} -c {threads} -m genome --out_path {params} > {log} 2>&1 ) || touch {output}"
+        """
+        if ! busco -f -i {input} -c {threads} -m genome --out_path {params} > {log} 2>&1; then
+            echo "BUSCO analysis failed. Check the log file at {log}" >&2
+            exit 1
+        fi
+        """
```

harpy/snakefiles/simulate_variants.smk (2)
`60-62`: LGTM! Consider updating documentation

The changes correctly implement the switch to conda-based simuG installation by:
- Removing the Perl script dependency
- Using more descriptive output filenames that include the variant type
Consider adding a comment explaining the output file naming convention for future maintainability.
Also applies to: 71-71
`112-138`: Standardize implementation of rename rules

The two rename rules use different approaches to achieve the same goal:

- `rename_haploid` uses Python's `os.rename`
- `rename_diploid` uses shell `mv` commands

Consider standardizing the implementation. Here's a suggested refactor for `rename_diploid`:
```diff
-    container:
-        None
-    shell:
-        """
-        mv {input.fasta} {output.fasta}
-        mv {input.mapfile} {output.mapfile}
-        """
+    run:
+        for i,j in zip(input, output):
+            os.rename(i,j)
```

harpy/phase.py (1)
Line range hint `83-94`: Consider environment validation and configuration flexibility.

While the addition of conda environments aligns with the migration to conda-based installation, consider these improvements:
- Add validation to ensure the required environments exist
- Consider moving environment names to a configuration file for better maintainability
Here's a suggested approach:
```diff
-    conda_envs = ["phase", "r"]
+    # Import from a central configuration
+    from ._config import PHASE_ENVIRONMENTS
+
+    conda_envs = PHASE_ENVIRONMENTS
+
+    # Validate environments before proceeding
+    if not all(validate_conda_env(env) for env in conda_envs):
+        raise ValueError("One or more required conda environments are missing")
```

harpy/snakefiles/sv_naibr.smk (1)
Line range hint `191-193`: Consider cleaning up temporary workflow directories on completion

The workflow creates temporary directories under `workflow/input` for NAIBR configuration files, but these aren't cleaned up after the workflow completes. Only the final summary is retained in the workflow directory.

Consider adding cleanup in the `onsuccess` directive:

```diff
 onsuccess:
     os.remove(logger.logfile)
+    shutil.rmtree(os.path.join(outdir, "workflow", "input"), ignore_errors=True)
```
Don't forget to add the import:
```diff
 import os
 import re
 import logging
+import shutil
 from pathlib import Path
```
Also applies to: 207-208
harpy/assembly.py (1)
Line range hint `87-128`: Architecture supports conda-based installation effectively

The changes effectively support the transition to conda-based installation by:
- Explicitly defining required conda environments
- Integrating them into the workflow configuration
- Properly managing them through the create_conda_recipes function
This approach eliminates the need for a local copy of Harpy while maintaining workflow functionality.
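Pieced together from the diffs elsewhere in this review, the overall pattern looks roughly like this (a sketch; the surrounding command body is abridged and the example environment list is hypothetical):

```python
# Inside a workflow command (abridged sketch based on the diffs in this PR)
conda_envs = ["align", "assembly", "qc"]      # environments this workflow needs
configs = {
    "workflow": "assembly",
    "conda_environments": conda_envs,         # recorded so `harpy resume` can rebuild them
}
create_conda_recipes(output_dir, conda_envs)  # writes workflow/envs/*.yaml into the output dir
```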
harpy/snakefiles/simulate_linkedreads.smk (2)
Line range hint `73-83`: Consider version pinning for external tools

The workflow uses several external tools (dwgsim, seqtk, msort, perl) without version pinning. This could lead to reproducibility issues.
Add version constraints in the conda environment file:
```diff
 # simulations.yaml
 dependencies:
-    - dwgsim
-    - seqtk
-    - msort
-    - perl
+    - dwgsim=0.1.13  # Replace with actual version
+    - seqtk=1.4      # Replace with actual version
+    - msort=8.53     # Replace with actual version
+    - perl=5.32.1    # Replace with actual version
```

Also applies to: 127-137
Line range hint `186-218`: Enhance workflow summary with version information

The summary provides good documentation of parameters but lacks version information for the tools used. This information is crucial for reproducibility.
Consider adding version information to the summary:
```diff
     summary = ["The harpy simulate linkedreas workflow ran using these parameters:"]
+    # Add version information
+    versions = ["Tool versions used:"]
+    for tool in ["dwgsim", "seqtk", "msort", "perl"]:
+        result = shell(f"conda list {tool} | grep {tool}", capture_output=True)
+        versions.append(f"{tool}: {result.stdout.decode().strip()}")
+    summary.extend(["\n".join(versions), ""])
     summary.append(f"Genome haplotype 1: {gen_hap1}")
```

harpy/snakefiles/snp_mpileup.smk (4)
Line range hint `27-37`: Add input validation for configuration and files

The region handling code could benefit from additional validation:
- Validate file existence before opening
- Add type checking for configuration values
- Handle malformed lines in region file
Consider this improved implementation:
```diff
 if regiontype == "region":
+    if not isinstance(regioninput, str):
+        raise ValueError("Region input must be a string")
     intervals = [regioninput]
     regions = {f"{regioninput}" : f"{regioninput}"}
 else:
+    if not os.path.exists(regioninput):
+        raise FileNotFoundError(f"Region file not found: {regioninput}")
     with open(regioninput, "r") as reg_in:
         intervals = set()
         for line in reg_in:
-            cont,startpos,endpos = line.split()
+            try:
+                cont,startpos,endpos = line.strip().split()
+                # Validate numeric positions
+                int(startpos), int(endpos)
+            except (ValueError, TypeError) as e:
+                raise ValueError(f"Invalid region format in line: {line}") from e
             intervals.add(f"{cont}:{startpos}-{endpos}")
     regions = dict(zip(intervals, intervals))
```
Line range hint `114-127`: Add error handling to mpileup rule

The pipe output in the mpileup rule could lead to silent failures if downstream processing fails. Consider adding error checking and logging.
```diff
 rule mpileup:
     input:
         bamlist = outdir + "/logs/samples.files",
         bam = bamlist,
         bai = collect("{bam}.bai", bam = bamlist),
         genome = f"Genome/{bn}",
         genome_fai = f"Genome/{bn}.fai"
     output:
         bcf = pipe(outdir + "/mpileup/{part}.mp.bcf"),
         logfile = temp(outdir + "/logs/{part}.mpileup.log")
     params:
         region = lambda wc: "-r " + regions[wc.part],
         extra = mp_extra
     container:
         None
     shell:
-        "bcftools mpileup --fasta-ref {input.genome} --bam-list {input.bamlist} --annotate AD --output-type b {params} > {output.bcf} 2> {output.logfile}"
+        """
+        set -o pipefail
+        bcftools mpileup --fasta-ref {input.genome} \
+            --bam-list {input.bamlist} \
+            --annotate AD \
+            --output-type b \
+            {params} > {output.bcf} 2> {output.logfile} || {{
+                echo "mpileup failed for region {wildcards.part}" >> {output.logfile}
+                exit 1
+            }}
+        """
```
Line range hint `246-276`: Enhance workflow summary with more detailed information

The workflow summary could benefit from additional information about the execution environment and input validation results.
```diff
 rule workflow_summary:
     default_target: True
     input:
         vcf = collect(outdir + "/variants.{file}.bcf", file = ["raw"]),
         agg_log = outdir + "/logs/mpileup.log",
         reports = collect(outdir + "/reports/variants.{file}.html", file = ["raw", "normalized"]) if not skip_reports else []
     params:
         ploidy = f"--ploidy {ploidy}",
         populations = f"--populations {groupings}" if groupings else "--populations -"
     run:
-        summary = ["The harpy snp freebayes workflow ran using these parameters:"]
+        summary = ["# Harpy SNP Mpileup Workflow Summary"]
+        summary.append("\n## Environment Information")
+        summary.append(f"Execution timestamp: {datetime.datetime.now()}")
+        summary.append(f"Harpy version: {config.get('version', 'unknown')}")
+
+        summary.append("\n## Input Validation")
+        summary.append(f"Number of BAM files: {len(bamlist)}")
+        summary.append(f"Number of intervals processed: {len(intervals)}")
+
+        summary.append("\n## Workflow Parameters")
         summary.append(f"The provided genome: {bn}")
         if windowsize:
             summary.append(f"Size of intervals to split genome for variant calling: {windowsize}")
         else:
             summary.append(f"Genomic positions for which variants were called: {regioninput}")
+
+        summary.append("\n## Command Parameters")
         mpileup = "The mpileup parameters:\n"
         mpileup += f"\tbcftools mpileup --fasta-ref GENOME --region REGION --bam-list BAMS --annotate AD --output-type b {mp_extra}"
         summary.append(mpileup)
```
Line range hint `1-276`: Overall assessment: Approved with suggestions

The changes align well with the PR objectives of moving to conda-based installation. The code is well-structured and maintains good separation of concerns. Consider implementing the suggested improvements for input validation and error handling in future iterations.
For future improvements, consider:
- Adding a configuration validation step at workflow start
- Implementing retry logic for failed regions (see the sketch after this list)
- Adding progress tracking for long-running operations
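Snakemake (7.7 and later) has built-in support for the retry idea via a per-rule `retries:` directive. A minimal sketch, with a hypothetical rule and command:

```
# Retry a flaky region-level job up to 2 extra times before failing the workflow
rule call_region:
    input:
        "mpileup/{part}.mp.bcf"
    output:
        "calls/{part}.bcf"
    retries: 2
    shell:
        "bcftools call -m -Ob -o {output} {input}"
```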
harpy/preflight.py (1)
Line range hint `79-153`: Architecture feedback: Clean conda integration

The changes cleanly integrate conda environment management into the existing preflight check system. The implementation:
- Maintains separation of concerns
- Uses consistent patterns across commands
- Preserves the original functionality while adding conda support
Consider documenting the conda environment requirements in the docstrings to help users understand what environments are needed and why; one possible shape is sketched below.
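For example, such a docstring note could read as follows (a hypothetical sketch; the function signature and environment names are illustrative, not the actual Harpy source):

```python
def bam(inputs, output_dir, threads):
    """
    Run validity checks on haplotagged BAM files.

    Conda environments used by this workflow:
      - qc: read/alignment QC tooling
      - r:  report rendering via R markdown

    These are materialized under ``output_dir/workflow/envs/`` and built
    automatically when the workflow first runs with ``--conda``.
    """
    ...
```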
harpy/snakefiles/phase.smk (2)
Line range hint `58-71`: Add error handling for alignment file matching

The helper functions could benefit from better error handling when no matches or multiple matches are found.
Consider this improvement:
```diff
 def get_alignments(wildcards):
     """returns a list with the bam file for the sample based on wildcards.sample"""
     r = re.compile(fr".*/({wildcards.sample})\.(bam|sam)$", flags = re.IGNORECASE)
     aln = list(filter(r.match, bamlist))
+    if not aln:
+        raise ValueError(f"No alignment file found for sample {wildcards.sample}")
+    if len(aln) > 1:
+        raise ValueError(f"Multiple alignment files found for sample {wildcards.sample}: {aln}")
     return aln[0]
```
Line range hint `132-148`: Add resource specifications for compute-intensive rules

Consider adding resource specifications for compute-intensive rules like `process_genome` to help with workflow scheduling and resource allocation.

Add resources like this:
```diff
 rule process_genome:
     input:
         genomefile
     output:
         geno
     container:
         None
+    resources:
+        mem_mb=4000,
+        runtime=60
     shell:
         "seqtk seq {input} > {output}"
```

harpy/snakefiles/sv_naibr_pop.smk (3)
`18-19`: Consider relocating conda environments directory

The conda environments directory (`envdir`) is currently placed under the output directory. This could lead to duplicate environments across different runs and waste disk space. Consider moving it to a central location, such as `.conda/envs` in the project root.

```diff
-envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+envdir = os.path.join(os.getcwd(), ".conda", "envs")
```
Line range hint `283-311`: Improve robustness and safety of workflow summary generation

Several issues need attention:

- Using `os.system` for file removal is unsafe and should be replaced with `shutil`
- String concatenation could be simplified using list operations
- Hardcoded paths should be derived from configuration

Here's a safer implementation:
```diff
-    os.system(f"rm -rf {outdir}/naibrlog")
+    import shutil
+    naibrlog_path = os.path.join(outdir, "naibrlog")
+    if os.path.exists(naibrlog_path):
+        shutil.rmtree(naibrlog_path)
     summary = ["The harpy sv naibr workflow ran using these parameters:"]
     summary.append(f"The provided genome: {bn}")
-    concat = "The alignments were concatenated using:\n"
-    concat += "\tconcatenate_bam.py -o groupname.bam -b samples.list"
-    summary.append(concat)
+    summary.append("The alignments were concatenated using:\n\tconcatenate_bam.py -o groupname.bam -b samples.list")
-    naibr = "naibr variant calling ran using these configurations:\n"
-    naibr += "\tbam_file=BAMFILE\n"
-    naibr += "\tprefix=PREFIX\n"
-    naibr += "\toutdir=Variants/naibr/PREFIX\n"
-    naibr += "\n\t".join([f"{k}={v}" for k,v in argdict.items()])
-    summary.append(naibr)
+    naibr_config = [
+        "naibr variant calling ran using these configurations:",
+        "\tbam_file=BAMFILE",
+        "\tprefix=PREFIX",
+        "\toutdir=Variants/naibr/PREFIX",
+        *[f"\t{k}={v}" for k,v in argdict.items()]
+    ]
+    summary.append("\n".join(naibr_config))
-    sm = "The Snakemake workflow was called via command line:\n"
-    sm = f"\t{config['workflow_call']}"
-    summary.append(sm)
+    summary.append(f"The Snakemake workflow was called via command line:\n\t{config['workflow_call']}")
```
Line range hint `1-311`: Consider enhancing error handling across rules

While the conda integration looks good, consider adding error handling for common failure scenarios:
- Missing input files
- Failed conda environment creation
- NAIBR execution failures
Example approach for handling NAIBR failures:
```python
# Add to the call_variants rule
onerror:
    logger.error(f"NAIBR execution failed for {wildcards.population}. Check {log} for details.")
    # Clean up partial outputs
    for f in [output.bedpe, output.refmt, output.vcf]:
        if os.path.exists(f):
            os.remove(f)
```

harpy/snakefiles/sv_leviathan_pop.smk (2)
Line range hint `123-123`: Add validation for conda environment files

The workflow references several conda environment files but doesn't validate their existence. Consider adding an `onstart` check to ensure all required environment files exist.

Add this validation to the `onstart` block:

```python
required_envs = ["variants.yaml", "align.yaml", "r.yaml"]
for env in required_envs:
    env_path = os.path.join(envdir, env)
    if not os.path.exists(env_path):
        raise FileNotFoundError(f"Required conda environment file not found: {env_path}")
```

Also applies to: 144-144, 186-186, 251-251, 267-267
Line range hint `6-11`: Improve error handling for log files

Currently, log files are removed on both success and error, which might hinder debugging. Consider:

- Archiving logs on success instead of removing them
- Preserving logs on error for debugging

```diff
 onstart:
     logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
 onsuccess:
-    os.remove(logger.logfile)
+    if os.path.exists(logger.logfile):
+        archive_dir = os.path.join(outdir, "archived_logs")
+        os.makedirs(archive_dir, exist_ok=True)
+        os.rename(logger.logfile, os.path.join(archive_dir, f"success_{os.path.basename(logger.logfile)}"))
 onerror:
-    os.remove(logger.logfile)
+    if os.path.exists(logger.logfile):
+        error_dir = os.path.join(outdir, "error_logs")
+        os.makedirs(error_dir, exist_ok=True)
+        os.rename(logger.logfile, os.path.join(error_dir, f"error_{os.path.basename(logger.logfile)}"))
```

harpy/snakefiles/align_strobealign.smk (2)
Line range hint `1-1`: Consider standardizing environment management across rules.

The workflow currently mixes container-based and conda-based rules, with some rules having no explicit environment specification. Given the PR's focus on conda-based installations, consider standardizing the environment management approach across all rules.
Consider:
- Documenting the rationale for using different environments in different rules
- Standardizing on conda environments where possible
- Explicitly specifying environments for all rules
Also applies to: 17-17, 73-73, 89-89, 108-108, 156-156, 186-186, 205-205
Line range hint `207-249`: Consider enhancing the workflow summary with environment information.

The workflow summary captures command-line parameters and workflow steps but could be enhanced to include information about the environments used (conda/container) and their locations.
Consider adding environment information to the summary:
```diff
     summary.append(duplicates)
+    environments = "Environment configuration:\n"
+    environments += f"\tConda environments directory: {envdir}\n"
+    environments += "\tContainer image: docker://pdimens/harpy:latest"
+    summary.append(environments)
     sm = "The Snakemake workflow was called via command line:\n"
```

harpy/snakefiles/simulate_snpindel.smk (1)
`141-148`: LGTM with suggestion: Consider adding error handling

The path handling and command construction are well-implemented. However, consider adding error handling for the simuG command execution.
- "simuG -refseq {input.geno} -prefix {params.prefix} {params.snp} {params.indel} > {log}" + """ + if ! simuG -refseq {input.geno} -prefix {params.prefix} {params.snp} {params.indel} > {log} 2>&1; then + echo "simuG failed. Check the log file at {log}" >&2 + exit 1 + fi + """Also applies to: 152-154, 158-159
harpy/snakefiles/metassembly.smk (2)
Line range hint `1-24`: Consider documenting conda environment setup in workflow summary

The switch to conda-based installation is a significant change. Consider enhancing the workflow summary to include:
- Required conda environment setup steps
- Minimum conda version requirements
- Instructions for environment creation
This documentation would help users understand the new conda-based setup and ensure reproducibility across different systems; example commands are sketched below.
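For instance, the documented setup could point users at commands along these lines (illustrative only; the output directory name is hypothetical, while the `workflow/envs/` layout follows this PR):

```bash
# Create the workflow's environments by hand from the generated recipes
conda env create -f Metassembly/workflow/envs/spades.yaml
conda env create -f Metassembly/workflow/envs/qc.yaml

# Or let Snakemake build and manage them automatically
snakemake --use-conda --conda-frontend mamba ...
```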
Line range hint `82-103`: Add error handling for conda environment failures

The rules that depend on conda environments should include error handling for cases where environment creation or activation fails. Currently, only the BUSCO rule has explicit error handling.
Consider adding similar error handling to other conda-dependent rules. For example:
```diff
 rule spades_assembly:
     input:
         fastq_R1C = outdir + "/error_correction/corrected/input.R1.fq00.0_0.cor.fastq.gz",
         fastq_R2C = outdir + "/error_correction/corrected/input.R2.fq00.0_0.cor.fastq.gz",
         fastq_UNC = outdir + "/error_correction/corrected/input.R_unpaired00.0_0.cor.fastq.gz"
     output:
         f"{outdir}/spades_assembly/contigs.fasta"
     params:
         outdir = outdir + "/spades_assembly",
         k = k_param,
         mem = max_mem // 1000,
         extra = extra
     log:
         outdir + "/logs/spades_assembly.log"
     threads: workflow.cores
     resources:
         mem_mb=max_mem
     conda:
         f"{envdir}/spades.yaml"
     container:
         None
     shell:
-        "metaspades.py -t {threads} -m {params.mem} -k {params.k} {params.extra} -1 {input.fastq_R1C} -2 {input.fastq_R2C} -s {input.fastq_UNC} -o {params.outdir} --only-assembler > {log}"
+        """
+        (metaspades.py -t {threads} -m {params.mem} -k {params.k} {params.extra} \
+            -1 {input.fastq_R1C} -2 {input.fastq_R2C} -s {input.fastq_UNC} \
+            -o {params.outdir} --only-assembler > {log} 2>&1) || \
+        (echo "Error: SPAdes assembly failed. Check conda environment and logs." >> {log} && exit 1)
+        """
```

Also applies to: 104-125, 126-146, 147-156, 157-170, 206-225, 226-251, 252-270, 271-280
harpy/snakefiles/align_ema.smk (2)
Line range hint `391-419`: Consider using f-strings consistently in workflow_summary.

The `workflow_summary` rule mixes string concatenation styles. Consider using f-strings consistently for better readability.

Example improvement for the counts variable:
```diff
-    counts = "Barcodes were counted and validated with EMA using:\n"
-    counts += f"\tseqtk mergepe forward.fq.gz reverse.fq.gz | ema count {params.beadtech}"
+    counts = f"""Barcodes were counted and validated with EMA using:
+    seqtk mergepe forward.fq.gz reverse.fq.gz | ema count {params.beadtech}"""
```
Line range hint `183-196`: Consider quoting variables in shell commands.

In the `mark_duplicates` and `index_duplicates` rules, consider quoting variables in shell commands to handle paths with spaces.

Example improvement:
```diff
     samtools collate -O -u {input.sam} |
-    samtools fixmate -m -u - - |
-    samtools sort -T {params.tmpdir} -u --reference {input.genome} -l 0 -m {resources.mem_mb}M - |
-    samtools markdup -@ {threads} -S --barcode-tag BX -f {log} - {output}
-    rm -rf {params.tmpdir}
+    samtools fixmate -m -u - - |
+    samtools sort -T "{params.tmpdir}" -u --reference "{input.genome}" -l 0 -m {resources.mem_mb}M - |
+    samtools markdup -@ {threads} -S --barcode-tag BX -f "{log}" - "{output}"
+    rm -rf "{params.tmpdir}"
```

Also applies to: 197-208
harpy/align.py (2)
Line range hint `116-127`: Consider moving conda environments list to a module-level constant.

The conda environments list `["align", "r", "qc"]` is duplicated across all three alignment commands. To improve maintainability and reduce duplication, consider moving it to a module-level constant.

```diff
+# At the top of the file, with other imports
+ALIGNMENT_CONDA_ENVS = ["align", "r", "qc"]

-    conda_envs = ["align", "r", "qc"]
+    conda_envs = ALIGNMENT_CONDA_ENVS
```

Also applies to: 140-140
`221-222`: Remove unnecessary empty line.

The empty line between the conda environments list declaration and the configs dictionary is unnecessary and inconsistent with other commands.

```diff
     conda_envs = ["align", "r", "qc"]
-
     configs = {
```
harpy/simulate.py (1)
`201-201`: Consider adding error handling for create_conda_recipes calls

The create_conda_recipes calls could benefit from error handling to gracefully handle potential failures in conda environment setup.
Example improvement:
```diff
-    create_conda_recipes(output_dir, conda_envs)
+    try:
+        create_conda_recipes(output_dir, conda_envs)
+    except Exception as e:
+        print_error("conda setup", f"Failed to create conda recipes: {str(e)}")
+        sys.exit(1)
```

Also applies to: 338-338, 444-444, 563-563, 665-665
harpy/snakefiles/qc.smk (3)
Line range hint `47-70`: Ensure correct expansion of `params` in the `fastp` shell command

In the `fastp` rule within the `deconvolve` branch, `{params}` is passed directly into the shell command. Since `params` is a dictionary of parameter values, it may not expand correctly in the shell command, leading to potential runtime errors.

Proposed Fix:

Concatenate the parameter values into a single string and pass it to the shell command. Update the `params` section to include a `fullparams` string:

```diff
 params:
     minlen = f"--length_required {min_len}",
     maxlen = f"--max_len1 120,958",
     trim_adapters = "--detect_adapter_for_pe" if trimadapters else "--disable_adapter_trimming",
     dedup = "-D" if dedup else "",
     extra = extra,
+    fullparams = " ".join([minlen, maxlen, trim_adapters, dedup, extra])
 shell:
     """
-    fastp --trim_poly_g --cut_right {params} --thread {threads} -i {input.fw} -I {input.rv} --stdout -h {output.html} -j {output.json} -R "{wildcards.sample} QC Report" 2> {log.serr} > {output.fq}
+    fastp --trim_poly_g --cut_right {params.fullparams} --thread {threads} -i {input.fw} -I {input.rv} --stdout -h {output.html} -j {output.json} -R "{wildcards.sample} QC Report" 2> {log.serr} > {output.fq}
     """
```

This ensures that all parameters are correctly passed to `fastp` as a single string.

Line range hint `170-182`: Address undefined variables in `workflow_summary`

Within the `workflow_summary` rule's `run` block, the variables `decon_k`, `decon_w`, `decon_d`, and `decon_a` are referenced when constructing the summary. However, these variables may not be accessible within this scope, leading to a `NameError` at runtime.

Proposed Fix:

Pass these variables into the `workflow_summary` rule using the `params` section:

```diff
 rule workflow_summary:
     default_target: True
     input:
         fq = collect(outdir + "/{sample}.{FR}.fq.gz", FR = ["R1", "R2"], sample = samplenames),
         bx_report = outdir + "/reports/barcode.summary.html" if not skip_reports else [],
         agg_report = outdir + "/reports/qc.report.html" if not skip_reports else []
     params:
         minlen = f"--length_required {min_len}",
         maxlen = f"--max_len1 120,958",
         trim_adapters = "--detect_adapter_for_pe" if trimadapters else "--disable_adapter_trimming",
         dedup = "-D" if dedup else "",
         extra = extra,
+        decon_k = decon_k if deconvolve else None,
+        decon_w = decon_w if deconvolve else None,
+        decon_d = decon_d if deconvolve else None,
+        decon_a = decon_a if deconvolve else None,
+        threads = workflow.cores
     run:
         summary = ["The harpy qc workflow ran using these parameters:"]
         fastp = "fastp ran using:\n"
         fastp += f"\tfastp --trim_poly_g --cut_right {params}"
         summary.append(fastp)
         if deconvolve:
             deconv = "Deconvolution occurred using QuickDeconvolution:\n"
-            deconv += "\tQuickDeconvolution -t threads -i infile.fq -o output.fq -k {decon_k} -w {decon_w} -d {decon_d} -a {decon_a}"
+            deconv += f"\tQuickDeconvolution -t {params.threads} -i infile.fq -o output.fq -k {params.decon_k} -w {params.decon_w} -d {params.decon_d} -a {params.decon_a}"
             summary.append(deconv)
```

This modification ensures that all variables used within the `run` block are properly defined and accessible.

Line range hint `172-176`: Use actual thread count in deconvolution summary

In the deconvolution summary within `workflow_summary`, the number of threads is listed as a string literal `"threads"` rather than the actual thread count.

Suggested Improvement:

Replace the string `"threads"` with the actual thread count:

```diff
-    deconv += "\tQuickDeconvolution -t threads -i infile.fq -o output.fq -k {params.decon_k} -w {params.decon_w} -d {params.decon_d} -a {params.decon_a}"
+    deconv += f"\tQuickDeconvolution -t {params.threads} -i infile.fq -o output.fq -k {params.decon_k} -w {params.decon_w} -d {params.decon_d} -a {params.decon_a}"
```

This provides accurate information in the summary about the resources used during deconvolution.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (40)
- harpy/_conda.py (3 hunks)
- harpy/_validations.py (0 hunks)
- harpy/align.py (9 hunks)
- harpy/assembly.py (3 hunks)
- harpy/container.py (1 hunks)
- harpy/deconvolve.py (2 hunks)
- harpy/demultiplex.py (2 hunks)
- harpy/impute.py (2 hunks)
- harpy/metassembly.py (3 hunks)
- harpy/phase.py (3 hunks)
- harpy/preflight.py (2 hunks)
- harpy/qc.py (2 hunks)
- harpy/resume.py (2 hunks)
- harpy/simulate.py (15 hunks)
- harpy/snakefiles/align_bwa.smk (1 hunks)
- harpy/snakefiles/align_ema.smk (1 hunks)
- harpy/snakefiles/align_strobealign.smk (1 hunks)
- harpy/snakefiles/assembly.smk (1 hunks)
- harpy/snakefiles/containerize.smk (1 hunks)
- harpy/snakefiles/deconvolve.smk (1 hunks)
- harpy/snakefiles/demultiplex_gen1.smk (1 hunks)
- harpy/snakefiles/impute.smk (1 hunks)
- harpy/snakefiles/metassembly.smk (1 hunks)
- harpy/snakefiles/phase.smk (1 hunks)
- harpy/snakefiles/preflight_bam.smk (1 hunks)
- harpy/snakefiles/preflight_fastq.smk (1 hunks)
- harpy/snakefiles/qc.smk (1 hunks)
- harpy/snakefiles/simulate_linkedreads.smk (1 hunks)
- harpy/snakefiles/simulate_snpindel.smk (3 hunks)
- harpy/snakefiles/simulate_variants.smk (3 hunks)
- harpy/snakefiles/snp_freebayes.smk (1 hunks)
- harpy/snakefiles/snp_mpileup.smk (1 hunks)
- harpy/snakefiles/sv_leviathan.smk (1 hunks)
- harpy/snakefiles/sv_leviathan_pop.smk (1 hunks)
- harpy/snakefiles/sv_naibr.smk (1 hunks)
- harpy/snakefiles/sv_naibr_phase.smk (1 hunks)
- harpy/snakefiles/sv_naibr_pop.smk (1 hunks)
- harpy/snakefiles/sv_naibr_pop_phase.smk (1 hunks)
- harpy/snp.py (6 hunks)
- harpy/sv.py (6 hunks)
💤 Files with no reviewable changes (1)
- harpy/_validations.py
🔇 Additional comments (68)
harpy/container.py (1)
`19-19`: Verify container build completeness with new conda-based simuG.

The change to `create_conda_recipes("container")` aligns with the PR's objective to switch to conda-based simuG installation. However, we should verify that all necessary environments are properly configured.
Let's check the conda environment configurations and their usage:
harpy/snakefiles/containerize.smk (3)
`11-53`: LGTM! Consistent environment rule structure

The environment rules follow a consistent pattern and correctly use the new `envdir` variable. The structure is clean and maintainable.
`5-8`: Verify environment completeness for simuG integration
The change from 'snp' to 'variants' aligns with the PR objectives. Let's verify that all required environments for the conda-based simuG installation are included.
✅ Verification successful
Environment list is complete and correctly updated for simuG integration
The environment list in `rule all` is complete and correctly matches the environment definitions in `_conda.py`. The change from 'snp' to 'variants' aligns with the codebase, where simuG (>1.0.0) is properly included in the 'simulations' environment, and variant-calling tools are grouped in the 'variants' environment. All nine environments listed in the rule (`align`, `assembly`, `metassembly`, `phase`, `qc`, `r`, `simulations`, `stitch`, `variants`) are defined in the codebase with their respective dependencies.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check if all necessary environment files exist and contain required dependencies

# Check for environment files
echo "Checking for environment files..."
for env in align assembly metassembly phase qc r simulations stitch variants; do
    if [ -f "container/workflow/envs/${env}.yaml" ]; then
        echo "Found ${env}.yaml"
        # Check if simuG is included in variants.yaml
        if [ "$env" = "variants" ]; then
            echo "Checking for simuG in variants.yaml..."
            grep -A 5 "dependencies:" "container/workflow/envs/${env}.yaml"
        fi
    else
        echo "Missing ${env}.yaml"
    fi
done
```
Length of output: 1678
Script:
```bash
#!/bin/bash
# Let's check if the environment files are in a different location
echo "Searching for yaml/yml files..."
fd -e yaml -e yml

# Also check if there are any Snakefile rules that might indicate the correct path
echo -e "\nSearching for Snakefile rules with environment definitions..."
rg -A 3 "conda:"
```
Length of output: 24504
Script:
```bash
#!/bin/bash
# Let's check the content of _conda.py to verify the environment definitions
echo "Checking environment definitions in _conda.py..."
cat harpy/_conda.py

# Also check if there are any other environment-related files
echo -e "\nChecking for other environment files..."
fd -t f -e yaml -e yml -e json "env|conda" --exclude "*.env"
```
Length of output: 5675
`20-23`: Verify the variants environment configuration

The new `variants` rule replaces the old `snp` rule. Let's ensure this change is reflected in the corresponding environment file and that it includes all necessary dependencies for simuG.
harpy/snakefiles/preflight_bam.smk (2)
Line range hint 1-103
: Overall structure looks good
The workflow is well-organized and follows Snakemake best practices. The changes to environment management align with the PR's objective of improving conda-based installations.
18-18
: Verify environment path changes across workflows
The environment directory path has been moved from a global .harpy_envs
to a workflow-specific location. While this provides better isolation, we should verify that all related workflows are updated consistently.
Consider adding environment cleanup logic to the onsuccess
handler to prevent accumulation of unused conda environments when running multiple workflows. For example:
onsuccess:
os.remove(logger.logfile)
# Add cleanup of workflow-specific conda environments
if os.path.exists(envdir):
shutil.rmtree(envdir)
✅ Verification successful
Environment path changes are consistent across workflows
The verification shows that:
- All Snakemake workflow files consistently use the new environment path pattern:
os.path.join(os.getcwd(), outdir, "workflow", "envs")
- The only remaining reference to
.harpy_envs
is in_printing.py
, which is just an error message text and not a functional path.
The environment path changes have been properly propagated across all workflow files, ensuring consistent behavior. The suggested cleanup logic in the original review is still valid but optional.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to the old .harpy_envs path
# and verify consistency of the new environment path across workflow files
# Check for any remaining references to .harpy_envs
echo "Checking for old environment path references..."
rg -l "\.harpy_envs"
# Check environment path construction in other Snakemake files
echo "Checking environment path construction in other Snakemake files..."
rg -A 2 "envdir\s*=.*" "*.smk"
Length of output: 368
Script:
#!/bin/bash
# Let's search more precisely for Snakemake files and environment path references
# Find all Snakemake files
echo "=== Finding all Snakemake files ==="
fd -e smk
# Check environment path construction in Snakemake files
echo -e "\n=== Checking environment paths in Snakemake files ==="
fd -e smk --exec rg -A 2 "envdir\s*=.*" {}
# Check for .harpy_envs references in _printing.py
echo -e "\n=== Checking .harpy_envs reference in _printing.py ==="
cat harpy/_printing.py | grep -A 2 -B 2 "\.harpy_envs"
Length of output: 5302
harpy/snakefiles/preflight_fastq.smk (3)
18-18
: Verify environment directory creation
The environment directory path now includes the output directory, which is good for artifact containment. However, we should ensure this directory exists before it's used.
✅ Verification successful
Environment directory creation is properly handled
The environment directory {outdir}/workflow/envs
is created by the _conda.py
module with the exist_ok=True
flag, ensuring the directory exists before use.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the environment directory is created in other workflow files
rg -A 3 "os\.makedirs.*envs"
Length of output: 233
Line range hint 37-46
: Verify check_fastq.py script availability and functionality
The rules depend on check_fastq.py
but we need to ensure:
- The script is available in PATH
- The script handles both gzipped and uncompressed FASTQ files
Additionally, setting container: None
might affect reproducibility. Consider using the defined Docker container instead.
✅ Verification successful
Script is available and handles both gzipped and uncompressed FASTQ files correctly
The check_fastq.py
script is properly located in harpy/bin/
and uses pysam.FastxFile
which handles both gzipped and uncompressed FASTQ files. The script performs comprehensive checks on FASTQ headers including BX tag validation, SAM spec compliance, and tag positioning.
The only remaining concern:
- Consider using the defined Docker container instead of `container: None` to ensure reproducibility
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if check_fastq.py exists in the repository
fd "check_fastq.py$"
Length of output: 47
Script:
#!/bin/bash
# Check the content of check_fastq.py to verify FASTQ handling capabilities
cat harpy/bin/check_fastq.py
Length of output: 2090
Line range hint 62-89
: LGTM! Verify R markdown template
The report generation and workflow summary rules are well-structured with proper logging and environment handling. Let's verify the R markdown template exists:
harpy/resume.py (3)
10-10
: LGTM: Import changes align with conda management improvements
The switch from `check_envdir` to `check_environments` reflects the architectural improvement in conda environment management.
17-17
: LGTM: CLI help text updated appropriately
The simplified help text for the `--conda` flag better reflects the new environment management approach.
41-45
: Verify config validation and enhance error handling
While the new conda environment management logic is cleaner, consider adding validation for the `conda_environments` config key and explicit error handling:
 with open(f"{directory}/workflow/config.yaml", 'r', encoding="utf-8") as f:
     harpy_config = yaml.full_load(f)
+if "conda_environments" not in harpy_config:
+    print_error("Invalid config", "Missing 'conda_environments' in config.yaml")
+    sys.exit(1)
 conda_envs = harpy_config["conda_environments"]
Let's verify the config structure across the codebase:
harpy/snakefiles/deconvolve.smk (2)
18-18
: LGTM! Path construction follows best practices.
The use of `os.path.join()` with `os.getcwd()` ensures proper path construction across different platforms.
18-18
: Verify conda environment path accessibility.
Please ensure that the conda environment path will be accessible when the workflow is executed from different working directories.
✅ Verification successful
Let me try a different approach to verify this concern.
Let me check one more thing to understand how the output directory is handled in the workflow.
Based on the comprehensive analysis of the codebase, I can now provide the final response:
The environment path construction is correctly implemented and consistent across the workflow.
The current implementation using `os.path.join(os.getcwd(), outdir, "workflow", "envs")` is actually the correct approach because:
- The `output_directory` is consistently configured through the config dictionary across all Snakemake files
- Using `os.getcwd()` is intentional here, as it ensures the environment path is always resolved relative to the workflow execution directory
- The pattern is uniformly implemented across all 23+ Snakemake files in the workflow, indicating a deliberate design choice
- The path construction follows Snakemake's recommended practice for handling workflow-specific conda environments
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify if similar environment path constructions exist in other workflow files
# and check if the environment directory is created in the expected location
# Check for similar environment path constructions in other workflow files
echo "Checking for environment path patterns in other Snakemake files..."
rg -l "envdir.*=.*os\.path\.join.*outdir.*workflow.*envs" --type snakemake
# Check if the qc.yaml environment file exists
echo "Checking for the referenced conda environment file..."
fd -p "workflow/envs/qc.yaml$"
Length of output: 403
Script:
#!/bin/bash
# Let's check for any environment path constructions and environment files
# Search for files with .smk extension
echo "Finding Snakemake files..."
fd -e smk
# Check for environment path constructions
echo -e "\nChecking environment path patterns..."
rg "envdir.*=.*os\.path\.join.*outdir.*workflow.*envs"
# Look for environment files
echo -e "\nLooking for environment files in the repository..."
fd -e yaml -e yml "workflow/envs"
# Check the content of deconvolve.smk
echo -e "\nChecking deconvolve.smk content..."
rg -A 5 -B 5 "envdir.*=.*os\.path\.join" harpy/snakefiles/deconvolve.smk
Length of output: 4088
Script:
#!/bin/bash
# Let's check how outdir is used and if there's a Snakefile that imports these modules
# Check for main Snakefile
echo "Looking for main Snakefile..."
fd Snakefile
# Check how outdir is configured
echo -e "\nChecking outdir configuration in Snakefiles..."
rg "output_directory" -A 2 -B 2
# Check for workflow configuration files
echo -e "\nLooking for workflow configuration files..."
fd -e yaml -e yml -e json "config"
Length of output: 15315
harpy/deconvolve.py (2)
83-85
: LGTM! Verify create_conda_recipes implementation
The updated call to `create_conda_recipes` with explicit parameters improves configuration flexibility, and the `setup_only` flag provides appropriate workflow control.
✅ Verification successful
Function signature matches across all usages
The verification shows that `create_conda_recipes` is properly defined in `harpy/_conda.py` with the signature `(outdir: str, envs: list=None)`, which matches the usage in `deconvolve.py`. The function is consistently imported from the `._conda` module and called with appropriate parameters across all modules.
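Given that signature, the call pattern is straightforward (a usage sketch; the variable values are illustrative):

```python
from harpy._conda import create_conda_recipes

output_dir = "Deconvolve"  # hypothetical output directory
create_conda_recipes(output_dir, ["qc"])  # writes Deconvolve/workflow/envs/qc.yaml
create_conda_recipes(output_dir)          # envs=None: writes recipes for every known environment
```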
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify create_conda_recipes implementation across the codebase
ast-grep --pattern 'def create_conda_recipes($output_dir, $conda_envs)'
Length of output: 73
Script:
#!/bin/bash
# Search for create_conda_recipes function definition with ripgrep for better context
rg "def create_conda_recipes" -A 5
# Also search for any imports or usage of create_conda_recipes
rg "create_conda_recipes"
Length of output: 2776
Line range hint 67-77
: Verify conda environment configuration for simuG
The PR objectives mention switching to conda-based simuG installation, but only the "qc" environment is configured. Please verify whether:
- simuG should be included in the "qc" environment
- a separate conda environment for simuG is needed
- the simuG configuration is handled elsewhere
harpy/_conda.py (4)
4-11
: LGTM: Clean import additions and function signature update
The new imports and updated function signature provide better error handling and flexibility for environment management.
71-71
: Skipping comment about simuG version pinning
A previous review has already addressed this concern.
89-105
: LGTM: Well-structured recipe creation logic
The implementation includes proper error handling, clear error messages, and safe directory creation.
115-138
: LGTM: Excellent environment validation implementation
The new `check_environments` function provides thorough validation with clear, user-friendly error reporting using rich tables.
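A minimal sketch of that validation pattern — collect missing files, then report them in a rich table — assuming the real function's wording and styling differ:

```python
import os
import sys
from rich import print as rprint
from rich.table import Table

def check_environments(dirpath: str, envs: list) -> None:
    """Sketch: verify <dirpath>/workflow/envs has a YAML file per requested environment."""
    envdir = os.path.join(dirpath, "workflow", "envs")
    missing = [e for e in envs if not os.path.isfile(os.path.join(envdir, f"{e}.yaml"))]
    if missing:
        table = Table(title="Missing environment definitions")
        table.add_column("environment", style="red")
        for env in missing:
            table.add_row(env)
        rprint(table)
        sys.exit(1)
```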
harpy/demultiplex.py (2)
85-85
: LGTM! Config structure maintained.
The addition of conda environments to the configuration maintains the existing structure while properly integrating the new conda-based workflow requirements.
79-79
: LGTM! Verify QC environment definition.
The addition of the conda environment list is appropriate. However, let's verify that the "qc" environment is properly defined in the conda recipes.
✅ Verification successful
QC environment is properly defined but needs to be created
The "qc" environment is correctly defined in the codebase. The environment definition exists in `harpy/_conda.py` with the following dependencies:
- falco=1.2.4
- fastp
- multiqc=1.25
- pysam=0.22
- quickdeconvolution
The environment will be created when needed through the workflow's conda integration.
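Since the definitions live in code rather than static YAML files, the mapping in `harpy/_conda.py` presumably resembles the following (the dict shape is an assumption; the pins are the ones listed above):

```python
# hypothetical excerpt of the environment mapping in harpy/_conda.py
environ = {
    "qc": [
        "falco=1.2.4",
        "fastp",
        "multiqc=1.25",
        "pysam=0.22",
        "quickdeconvolution",
    ],
}
```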
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the QC environment definition exists
# Look for environment.yaml or similar files defining the QC environment
fd -g "*environment*.y*ml" | xargs grep -l "name:.*qc"
Length of output: 56
Script:
#!/bin/bash
# Let's try a broader search for conda environment definitions
fd -e yaml -e yml | xargs grep -l "name:.*qc"
# Also search for any files that might define conda environments
fd -e yaml -e yml
# And let's check the content of any conda-related files
rg -l "conda"
Length of output: 1400
Script:
#!/bin/bash
# Let's check the content of the conda-related YAML files and _conda.py
cat resources/harpy.yaml resources/meta.yaml
cat harpy/_conda.py
# Also check the QC-related files for conda environment usage
cat harpy/qc.py
cat harpy/snakefiles/qc.smk
Length of output: 21679
harpy/metassembly.py (1)
Line range hint 72-98
: Verify conda environment usage across workflow files.
Let's ensure the conda environment names are consistently used across the workflow files and that all necessary tools are available in these environments.
✅ Verification successful
Based on the shell script results, I can now generate the final response:
Conda environments are correctly configured and consistently used across the workflow
The verification shows:
- The conda environments defined in `metassembly.py` (`align`, `assembly`, `metassembly`, `qc`, `spades`) are properly set up in `_conda.py`
- Required tools are correctly distributed:
  - `spades` environment: cloudspades
  - `align` environment: bwa, samtools
  - `qc` environment: quast, multiqc
- The workflow files consistently use these environments and their respective tools
- The environment setup is properly handled by the `create_conda_recipes()` function
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for conda environment references in workflow files
# Search for environment references in workflow files
echo "Checking environment references in workflow files..."
rg -l "conda:" . | while read -r file; do
echo "=== $file ==="
rg "conda:.*(?:align|assembly|metassembly|qc|spades)" "$file" || true
done
# Search for tool usage to ensure they're in correct environments
echo -e "\nChecking tool usage in workflow files..."
rg -l "spades|bwa|samtools|quast" . | while read -r file; do
echo "=== $file ==="
rg "(spades|bwa|samtools|quast)" "$file" || true
done
Length of output: 33036
harpy/qc.py (2)
97-97
: LGTM!
The addition of conda environments to the config dictionary is well-structured and consistent with the conda-based installation approach.
104-104
: Verify create_conda_recipes implementation.
The function call looks correct, but let's ensure the `create_conda_recipes` function in `_conda.py` properly handles these parameters.
✅ Verification successful
Let me gather more information about the function implementation to ensure it properly handles the parameters.
Let me get the actual implementation of the function to verify it properly handles both parameters.
Function implementation verified and properly handles parameters
The `create_conda_recipes` function in `_conda.py` correctly handles both parameters:
- `outdir` (str): used to create the conda environment files in `{outdir}/workflow/envs/`
- `envs` (list, optional): the environment names to create; defaults to all environments when `None`
The function is properly imported and consistently used across multiple modules with the same parameter pattern.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify create_conda_recipes function implementation
# Look for the function definition and its usage
# Search for the function definition
echo "Searching for create_conda_recipes function definition:"
ast-grep --pattern 'def create_conda_recipes($_, $_):
$$$
'
# Search for other usages of the function
echo "Searching for other create_conda_recipes usages:"
rg "create_conda_recipes\(" -A 2
Length of output: 3784
Script:
#!/bin/bash
# Get the full implementation of create_conda_recipes function
ast-grep --pattern 'def create_conda_recipes($_, $_):
$$$
}' harpy/_conda.py
# Also check if there are any imports or dependencies related to this function
rg "from.*_conda.*import create_conda_recipes|import.*_conda" -A 1
Length of output: 1740
Script:
#!/bin/bash
# Get the content of _conda.py to see the full implementation
cat harpy/_conda.py
Length of output: 5411
harpy/impute.py (1)
Line range hint 85-106
: Verify conda environment consistency across codebase.
Let's ensure the conda environment configuration is consistent with other workflow files.
✅ Verification successful
Conda environments are consistently defined and used across the codebase
The verification shows that:
- The `create_conda_recipes` function is used consistently across all workflow files
- Each workflow defines its specific conda environments based on its requirements
- The impute workflow's conda environments `["r", "stitch"]` are unique to its functionality and don't conflict with other workflows
- The resume functionality correctly reads conda environments from the config file
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other create_conda_recipes calls and conda environment definitions
echo "Checking create_conda_recipes usage:"
rg "create_conda_recipes\(" --type py -A 2
echo -e "\nChecking conda environment definitions:"
rg "conda_envs.*=.*\[.*\]" --type py
Length of output: 4852
harpy/snakefiles/assembly.smk (1)
16-16
: Verify conda environment resolution with the new path structure.
The environment directory path has been updated to include the output directory and a standardized "workflow/envs" subdirectory structure. While this change aligns with the PR's objective of moving to conda-based installation, we should verify that conda can properly resolve environments in this new location.
✅ Verification successful
Environment directory path change is safe and follows standard practice
The change to include `outdir` in the environment directory path (`os.path.join(os.getcwd(), outdir, "workflow", "envs")`) is safe and follows Snakemake's best practices for workflow organization. The evidence shows:
- The environment directory is consistently referenced across multiple Snakemake files via the `envdir` variable
- The path structure `workflow/envs` follows Snakemake's recommended workflow organization
- The conda environments are properly referenced in rules using f-strings (e.g., `f"{envdir}/assembly.yaml"`)
- The environment files are expected to be created during workflow execution, which explains why they're not present in the repository
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and structure of conda environment files
# Check if the environment files exist in any output directories
fd -e yaml -e yml . | rg "workflow/envs/(assembly|qc)\.ya?ml$"
# Verify the content of these environment files to ensure they include necessary dependencies
fd -e yaml -e yml . | rg "workflow/envs/(assembly|qc)\.ya?ml$" -l | xargs cat 2>/dev/null | rg "dependencies:"
Length of output: 163
Script:
#!/bin/bash
# Let's check for any environment files in the repository and their locations
fd -e yaml -e yml . | sort
# Check if there are any references to conda environments in Snakemake files
fd -e smk . | xargs rg -A 2 -B 2 'conda:'
# Check how envdir is used in Snakemake files
fd -e smk . | xargs rg -A 2 -B 2 'envdir'
# Look for any conda-related configuration
fd 'conda|environment' . | grep -i 'conda\|environment'
Length of output: 62550
Script:
#!/bin/bash
# Let's check if the environment directory exists in the expected location
fd -t d "workflow/envs" .
# Check for environment files in resources directory
fd -e yaml -e yml . resources/ | grep -i "env"
# Check if there are any environment files in the repository root
ls -la *.yaml *.yml 2>/dev/null
# Check for any environment-related configuration in the config files
fd -e yaml -e yml . | xargs grep -l "workflow/envs\|conda"
Length of output: 200
harpy/snakefiles/simulate_variants.smk (2)
143-147
: LGTM! Input paths correctly updated
The workflow summary rule has been properly updated to reflect the new file naming convention while maintaining the existing conditional logic for heterozygosity.
15-15
: Verify environment directory path construction
The environment directory path is now constructed relative to the output directory. While the path construction is correct, ensure that:
- The output directory exists before the workflow starts
- The workflow/envs subdirectory structure is created
✅ Verification successful
Environment directory path construction is correct and consistent
The path construction for the environment directory is correct and consistent with the codebase's design. The evidence shows that:
- The same path construction pattern (`os.path.join(os.getcwd(), outdir, "workflow", "envs")`) is used consistently across multiple Snakemake workflow files
- The environment directory is managed by Harpy's conda environment handling through the `create_conda_recipes()` function
- The conda environment files (*.yaml) are properly referenced in the workflow files
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if similar path construction exists in other workflow files
rg -A 1 "envdir.*=.*os\.path\.join.*output" --type snakemake
Length of output: 96
Script:
#!/bin/bash
# Check if similar path construction exists in other Snakemake files
rg -A 1 "envdir.*=.*os\.path\.join.*output"
# Find all Snakemake files to understand the pattern
fd "Snakefile|\.smk$"
# Check the content of the specific file to understand the context better
cat harpy/snakefiles/simulate_variants.smk
Length of output: 7721
Script:
#!/bin/bash
# Check if the envs directory is created by other means
rg -A 1 "mkdir.*workflow/envs|workflow/envs.*directory"
# Check if conda env files exist and their location
fd "\.yaml$" harpy
# Check if there are any other environment directory references
rg "envdir|conda.*env"
Length of output: 18288
harpy/phase.py (1)
108-108
: Verify create_conda_recipes implementation.
The updated function call aligns with the new conda-based installation approach. Let's verify the implementation of this function to ensure proper environment setup.
✅ Verification successful
Let me analyze the implementation further to ensure proper environment setup.
Let me get the complete implementation to verify the environment setup.
Function implementation verified and working as expected
The `create_conda_recipes` function is properly implemented, with:
- Correct parameter handling (output directory and optional environment list)
- Comprehensive environment definitions for different workflows
- Proper error handling for invalid environment names
- Creation of environment YAML files with correct structure
- Special handling for SPAdes environment with post-deployment script
The function call in `harpy/phase.py` is consistent with the implementation and correctly passes the required parameters.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for create_conda_recipes implementation
ast-grep --pattern 'def create_conda_recipes($output_dir, $conda_envs)'
# Check for any error handling around this function call
rg -A 5 'create_conda_recipes\('
Length of output: 9174
Script:
#!/bin/bash
# Get the full implementation of create_conda_recipes
ast-grep --pattern 'def create_conda_recipes($outdir, $envs) {
$$$
}'
# Check if there's any error handling in _conda.py
rg -A 10 'create_conda_recipes' harpy/_conda.py
Length of output: 516
Script:
#!/bin/bash
# Get the full file content to see the complete implementation
cat harpy/_conda.py
# Check if there are any related utility functions
rg -g '!*.pyc' -g '!*.yaml' 'def.*conda' harpy/
Length of output: 15199
harpy/snakefiles/snp_freebayes.smk (1)
16-17
: 🛠️ Refactor suggestion
Consider more robust path handling for environment directory.
While moving to configuration-based output directory is good, the environment directory construction has potential issues:
- Using `os.getcwd()` makes the path sensitive to the working directory at runtime.
- Nesting conda environments deep in the output directory (`outdir/workflow/envs`) might not be the best structure.
Consider this alternative approach:
-envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+# Option 1: use workflow.basedir (Snakemake's recommended way)
+envdir = os.path.join(workflow.basedir, "envs")
+
+# Option 2: if environments must live in the output directory, use absolute paths
+envdir = os.path.abspath(os.path.join(outdir, "workflow", "envs"))
Let's verify the current usage of environment files:
harpy/snakefiles/sv_leviathan.smk (2)
Line range hint 64-77
: LGTM! Well-structured rules for barcode indexing and variant calling
The implementation of both rules is solid with:
- Proper dependency management
- Efficient use of temporary files
- Good thread allocation
- Comprehensive logging
Also applies to: 91-116
Line range hint 71-71
: Verify conda environment files existence and contents
Let's ensure all required conda environment files are present and properly configured.
Also applies to: 106-106
✅ Verification successful
Conda environments are managed programmatically through _conda.py
The verification reveals that this project uses a programmatic approach to conda environment management rather than static YAML files. The environments referenced in the Snakefiles (align.yaml, variants.yaml, etc.) are generated dynamically through Python code, likely during runtime. This is a valid and maintainable approach to conda environment management.
Key findings:
- Environment references in Snakefiles point to dynamically generated environments
- The project uses conda/mamba for package management as documented in README.md
- Environment definitions are handled programmatically through harpy/_conda.py
- The only YAML files found (harpy.yaml and meta.yaml) are in resources/ and likely serve other purposes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for conda environment files and their contents
# Find all conda environment references in the workflow
echo "Checking conda environment references in workflow:"
rg -A 1 'conda:'
# Check if the environment files exist
echo -e "\nChecking for environment files:"
fd -e yaml . | grep -i "envs"
# Display contents of variant environment file if found
echo -e "\nChecking variants.yaml contents:"
fd -e yaml . | grep -i "variants.yaml" -A 5
Length of output: 12870
Script:
#!/bin/bash
# Let's check for the conda environment files in the repository
# First, let's identify all unique environment files referenced
echo "Unique environment files referenced:"
rg "f\"\{envdir\}/(.*?)\.yaml\"" -o --no-filename | sort -u | sed 's/f"{envdir}\///g' | sed 's/"}//g'
# Now let's check if these files exist in the repository
echo -e "\nSearching for environment files in the repository:"
fd "\.yaml$" harpy/envs/
# If no files found in harpy/envs, let's search in other common locations
echo -e "\nSearching in other potential locations:"
fd "\.yaml$" envs/
fd "\.yaml$" environment/
Length of output: 897
Script:
#!/bin/bash
# Let's check if conda environment definitions are stored in _conda.py
echo "Checking conda environment definitions in _conda.py:"
rg "def get_.*_env" harpy/_conda.py -A 2
# Let's also check if there are any YAML files in the repository that might contain environment definitions
echo -e "\nSearching for any YAML files in the repository:"
fd -e yml -e yaml .
# Check for any documentation about conda environments
echo -e "\nChecking documentation about conda environments:"
rg -i "conda.*environment" README.md -A 2
Length of output: 1081
harpy/snakefiles/sv_naibr.smk (1)
17-18
: Ensure output directory exists before accessing envdir path
The `envdir` path now includes subdirectories under the output directory. While this provides better organization, we should ensure these directories exist to prevent potential issues.
Let's verify if the workflow creates these directories:
Consider adding a directory creation step in the `onstart` directive:
 onstart:
     logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
+    os.makedirs(os.path.join(outdir, "workflow", "envs"), exist_ok=True)
harpy/assembly.py (1)
Line range hint 88-115
: LGTM! Configuration structure is well-organized
The configuration dictionary is well-structured and properly integrates the conda environments while maintaining all necessary parameters for the assembly workflow.
harpy/snakefiles/simulate_linkedreads.smk (2)
Line range hint 1-218
: LGTM! The changes successfully implement conda-based installation
The changes effectively implement the transition to conda-based installation while maintaining the workflow's functionality. The introduction of the `envdir` variable provides a clean way to manage conda environments.
Line range hint 1-1
: Consider standardizing environment management
The workflow mixes containerized and conda environments:
- Some rules use `container: None`
- Others use conda environments
- The top-level container is defined but not consistently used
Consider standardizing the approach across all rules for better maintainability.
Also applies to: 73-73, 89-89, 108-108, 127-127, 147-147, 164-164
harpy/preflight.py (3)
141-153
: LGTM: Consistent implementation across commands
The changes to conda environment management are consistently implemented between the fastq and bam commands, following good software engineering practices.
91-91
: Verify create_conda_recipes parameter changes
The function call has been updated to pass parameters. Let's verify this change is consistent across the codebase.
✅ Verification successful
Let me check the implementation of `create_conda_recipes` to understand the expected parameters.
Let me try a different approach to find the function definition using ripgrep.
Based on the function definition and all its usage across the codebase, I can now provide the final response:
Function parameters are consistent with definition
The `create_conda_recipes` function is defined with two parameters: `outdir: str` and an optional `envs: list=None`. From the previous search results, all calls to this function across the codebase follow one of two valid patterns:
- Two parameters: `create_conda_recipes(output_dir, conda_envs)`
- A single parameter: `create_conda_recipes("container")` (only in container.py)
The call in the reviewed file, `create_conda_recipes(output_dir, conda_envs)`, matches the expected signature.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for all create_conda_recipes calls
ast-grep --pattern 'create_conda_recipes($$$)'
Length of output: 1639
Script:
#!/bin/bash
# Search for create_conda_recipes definition
ast-grep --pattern 'def create_conda_recipes($$$)'
Length of output: 52
Script:
#!/bin/bash
# Search for create_conda_recipes definition using ripgrep
rg "def create_conda_recipes" -A 3
Length of output: 267
79-85
: Consider adding simuG to conda environments
The conda environments list only includes "r", but according to the PR objectives, this change is meant to support simuG through conda installation. Consider adding the simuG environment if required.
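If preflight does turn out to need it, the change would be a one-line list extension (assuming simuG ships in the "simulations" environment used elsewhere in this PR):

```python
# hypothetical: request the simulations environment (which carries simuG) alongside "r"
conda_envs = ["r", "simulations"]
```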
harpy/snakefiles/align_bwa.smk (1)
Line range hint 72-72
: Verify conda environment files and their dependencies
The following conda environment files are referenced in the workflow:
- `align.yaml`: used by the `bwa_index` and `align` rules
- `r.yaml`: used by the `sample_reports` and `barcode_report` rules
- `qc.yaml`: used by the `samtools_report` rule
Let's verify the environment files and their contents:
Also applies to: 108-108, 185-185, 246-246, 262-262
✅ Verification successful
Let's try another verification strategy by directly looking for the environment files in their typical locations and checking their contents:
Let's check the repository structure to find where these environment files are actually located, since we see references to `{envdir}` in the Snakefile:
Environment files are dynamically created at runtime
Based on the verification results:
- The environment files (`align.yaml`, `r.yaml`, `qc.yaml`) are not present in the repository
- The `envdir` is dynamically set to `{outdir}/workflow/envs` at runtime, where `outdir` is provided during workflow execution
- This is a common pattern in Snakemake workflows where environment files are generated during workflow initialization
The absence of these files in the repository is not an issue because:
- The main package dependencies are properly defined in `resources/harpy.yaml` and `resources/meta.yaml`
- The workflow-specific environment files will be created in the `workflow/envs` directory when the workflow is executed
- This is standard Snakemake practice for managing workflow-specific conda environments
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check conda environment files for required dependencies
# Function to check yaml file content
check_yaml() {
local file=$1
echo "Checking $file:"
if [ -f "$file" ]; then
# Look for key dependencies
case "$file" in
*align.yaml)
echo "Looking for BWA in align.yaml:"
rg -i "bwa" "$file"
;;
*r.yaml)
echo "Looking for R packages in r.yaml:"
rg -i "^[[:space:]]*-[[:space:]]*r-" "$file"
;;
*qc.yaml)
echo "Looking for MultiQC in qc.yaml:"
rg -i "multiqc" "$file"
;;
esac
else
echo "File not found"
fi
echo
}
# Search for environment files
for env in align r qc; do
fd -e yaml -e yml "$env"
done | while read -r file; do
check_yaml "$file"
done
Length of output: 371
Script:
#!/bin/bash
# Check common locations for conda environment files
echo "Checking for environment files in common locations:"
fd -t f ".*\.ya?ml$" envs/ conda/ environment/ .
echo -e "\nChecking content of any found files:"
for file in $(fd -t f ".*\.ya?ml$" envs/ conda/ environment/ .); do
echo -e "\n=== Content of $file ==="
cat "$file"
done
echo -e "\nChecking Snakefile for conda directive paths:"
rg -A 1 "conda:" "harpy/snakefiles/align_bwa.smk"
Length of output: 3621
Script:
#!/bin/bash
# Look for environment files and envdir definition
echo "Looking for environment files:"
fd -t f -e yaml -e yml
echo -e "\nSearching for envdir definition:"
rg "envdir\s*=" harpy/
echo -e "\nChecking harpy directory structure:"
fd . harpy/ -t d
Length of output: 2831
harpy/snakefiles/phase.smk (3)
Line range hint 42-56
: LGTM! Robust genome file handling implementation.
The implementation handles both gzipped and uncompressed genome files correctly, and the conditional logic for indels processing is well-structured.
Line range hint 177-190
: LGTM! Well-structured conda environment usage
The rules consistently use conda environments with the updated path structure, which aligns well with the PR's objective of moving to conda-based installation.
Also applies to: 191-204, 205-221
24-24
: Verify environment directory path consistency across workflows
The change to use `outdir` in the environment path is good for organization. However, let's ensure this path structure is consistently applied across all workflow files.
✅ Verification successful
Environment directory path is consistent across workflow files
All Snakemake workflow files in the repository use the same environment directory path construction, `os.path.join(os.getcwd(), outdir, "workflow", "envs")`, except for `containerize.smk`, which intentionally uses a different path for container builds.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check environment directory path construction across workflow files
# Expected: All workflow files should use the same path structure
# Search for environment directory path construction in workflow files
rg -A 1 "envdir\s*=" "harpy/snakefiles/"
Length of output: 4296
harpy/snakefiles/sv_naibr_pop.smk (1)
Line range hint 134-149
: Enhance environment management and thread safety
A few suggestions to improve robustness:
- The conda environment file (`variants.yaml`) should be version-pinned to ensure reproducibility
- The thread count calculation should account for system limits:
- threads:
-     min(10, workflow.cores - 1)
+ threads:
+     lambda wildcards, threads: min(10, threads)
Also, please verify that the `variants.yaml` environment file exists and contains version-pinned dependencies.
harpy/snakefiles/align_strobealign.smk (1)
17-17
: LGTM! Verify conda environment creation path.
The change to use `os.path.join` for constructing the conda environment path is good practice. The new structure ensures environments are created under the output directory, which aligns with the PR's goal of better managing tool installations.
Let's verify that all conda environment references in the workflow will work with the new path:
✅ Verification successful
Path construction for conda environments is consistent across workflows
The verification shows that:
- The new `envdir` path construction, `os.path.join(os.getcwd(), outdir, "workflow", "envs")`, is consistently implemented across all Snakemake workflow files, except for `containerize.smk`, which intentionally uses a different path for container-specific environments.
- The conda environment references are properly managed through YAML files, as evidenced by their presence in the workflow files.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify conda environment references in the workflow
# Check for conda environment yaml files referenced in the workflow
rg -l 'conda:.*yaml' .
# Check if the new envdir path structure is consistently used across files
rg 'envdir.*=.*os\.path\.join.*workflow.*envs'
Length of output: 2533
harpy/snakefiles/impute.smk (3)
Line range hint 1-277
: LGTM! Well-structured workflow with clear documentation
The workflow maintains a clear structure with distinct phases for preprocessing, imputation, and reporting. The workflow summary provides comprehensive documentation of all parameters and steps. The change to conda-based installation is cleanly integrated without affecting the core workflow logic.
22-22
: Verify environment directory setup for conda-based installation
The change to store conda environments under the workflow output directory (`{outdir}/workflow/envs`) aligns with the PR's goal of switching to conda-based installation. However, we need to ensure proper initialization of this directory structure.
Let's verify the environment setup:
✅ Verification successful
Based on the inspection of `impute.smk`, I can now generate the final response:
The environment directory change is properly integrated with conda usage
The change to `envdir` is verified to be correct and well-integrated:
- The file properly uses conda environments in the `impute` and `impute_reports` rules
- The environment setup is handled by the `_conda.py` module, which creates the `workflow/envs` directory structure
- The environment path is used consistently across the codebase, as seen in other files like `resume.py` and `_conda.py`
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for environment files and directory structure
# Check if environment files exist in the repository
echo "Checking for environment files in repository..."
fd -e yaml -e yml . | grep -E "stitch.ya?ml|r.ya?ml"
# Check for environment setup in other Snakemake files
echo -e "\nChecking for environment directory setup in other files..."
rg -A 3 "workflow/envs"
Length of output: 2762
Script:
#!/bin/bash
# Let's check the content of environment files and their setup
# Check for environment setup in the main Snakefile or related configuration
echo "Checking main Snakefile and configuration..."
rg -l "conda:" --type snakemake
# Check the conda environment initialization code
echo -e "\nChecking conda environment initialization..."
rg -A 5 "def setup_conda|def create_conda"
# Look for stitch and R environment definitions
echo -e "\nLooking for stitch and R environment definitions..."
fd environ . | xargs cat
Length of output: 781
Script:
#!/bin/bash
# Let's check the specific file and its conda environment usage
# Check the impute.smk file for conda directives and environment usage
echo "Checking impute.smk for conda environment usage..."
cat harpy/snakefiles/impute.smk
# Check conda environment creation code specifically for environments
echo -e "\nChecking conda environment definitions..."
rg -A 10 'environ = {' harpy/_conda.py
Length of output: 10396
Line range hint 89-90
: Verify conda environment specifications
The `impute` and `impute_reports` rules use conda environments from the new location. We should verify that these environment files contain the necessary dependencies, particularly for STITCH and R.
Let's check the environment specifications:
Also applies to: 171-172
harpy/snakefiles/simulate_snpindel.smk (5)
15-15
: LGTM: Environment directory path properly configured
The environment directory path is correctly constructed using `os.path.join` and properly references the output directory structure.
97-100
: LGTM: Improved output path handling and simuG invocation
The changes properly handle conditional outputs and use direct simuG invocation, aligning with the conda-based installation approach.
Also applies to: 109-109
113-116
: LGTM: Improved haplotype output organization
The output paths are well-organized with clear haplotype separation and intuitive naming conventions.
160-186
: Referencing existing review comment about error handling
206-211
: LGTM: Comprehensive workflow summary configuration
The input collection and parameter handling are well-implemented, covering all possible output scenarios.
Also applies to: 215-216
harpy/snakefiles/align_ema.smk (1)
28-28
: LGTM! Verify consistency across workflow files.
The change to store conda environments within the output directory is a good practice as it:
- Isolates environments per workflow run
- Prevents conflicts between different runs
- Keeps all workflow artifacts contained
Let's verify this change is consistent across other workflow files:
✅ Verification successful
Change is consistent across workflow files, with one exception
The verification shows that all workflow files in `harpy/snakefiles/` use the same `envdir` path construction (`os.path.join(os.getcwd(), outdir, "workflow", "envs")`), except for `containerize.smk`, which uses `os.path.join(os.getcwd(), "container/workflow/envs")`. This is expected, since `containerize.smk` is a special workflow for container creation.
All conda environment references in the rules consistently use the `envdir` variable through f-strings (e.g., `f"{envdir}/align.yaml"`), indicating proper usage throughout the codebase.
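To make the pattern concrete, this is how the pieces compose — a plain-Python sketch of what the Snakefiles do, with an illustrative directory value:

```python
import os

outdir = "Align/ema"  # hypothetical value of config["output_directory"]
envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
conda_env = f"{envdir}/align.yaml"  # the value a rule's conda: directive receives
```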
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if other workflow files use the same envdir structure
# Expected: All workflow files should use the same envdir path construction
# Search for envdir definitions in other workflow files
rg -A 1 'envdir\s*=' 'harpy/snakefiles/'
# Search for conda: directives to ensure all environment references are updated
rg -A 1 'conda:' 'harpy/snakefiles/'
Length of output: 14493
harpy/snp.py (2)
142-142
: LGTM! Function call updated correctly.
The create_conda_recipes call has been properly updated to match its new signature, passing both the output directory and conda environments list.
220-220
: LGTM! Verify variants environment dependencies.
The addition of both "r" and "variants" conda environments is appropriate for freebayes functionality and report generation.
Let's verify the variants environment dependencies:
Also applies to: 236-236, 248-248
✅ Verification successful
LGTM! The variants environment is properly configured.
The variants environment is correctly defined in `harpy/_conda.py` with all the necessary dependencies:
- freebayes=1.3.6
- bcftools=1.20
- leviathan
- naibr-plus
These dependencies align with the workflow requirements in `snp_freebayes.smk` and are sufficient for variant calling functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for variants environment definition
fd -t f 'variants.ya?ml' workflow/envs/
# Check for freebayes-related commands in the workflow
rg 'freebayes' workflow/snp_freebayes.smk
Length of output: 252
Script:
#!/bin/bash
# Let's first check the repository structure to locate environment files
fd -t d 'env'
# Search for any yaml files that might contain environment definitions
fd -t f '\.ya?ml$'
# Search for freebayes-related files and configurations
rg -l 'freebayes'
Length of output: 217
Script:
#!/bin/bash
# Check the content of conda-related files for variants environment definition
cat harpy/_conda.py
# Check the snakefile for freebayes dependencies
cat harpy/snakefiles/snp_freebayes.smk
# Check if environment definitions exist in resources
cat resources/harpy.yaml
Length of output: 13136
harpy/sv.py (3)
Line range hint 108-118
: LGTM: Appropriate conda environments for leviathan workflow
The conda environments ("align", "r", "variants") are well-chosen for the leviathan structural variant detection workflow, covering alignment operations, R-based reporting, and variant calling functionalities.
132-132
: LGTM: Updated create_conda_recipes call with proper parameters
The function call has been properly updated to pass both the output directory and conda environments list, maintaining consistency with the new function signature.
Line range hint 210-221
: LGTM: Well-chosen conda environments for naibr workflow
The conda environments are appropriately selected for the naibr workflow:
- "phase" environment for handling phased BAM files (a requirement for naibr)
- "r" for R-based reporting
- "variants" for variant calling
The configuration and create_conda_recipes call are properly updated to use these environments.
Also applies to: 236-236
harpy/align.py (1)
Line range hint 317-329
: LGTM! Consistent implementation across all alignment commands.
The conda environment configuration and usage is consistently implemented across all three alignment commands (BWA, EMA, and Strobe), making the code maintainable and predictable.
Also applies to: 342-342
harpy/simulate.py (3)
Line range hint 175-188
: LGTM: Consistent conda environment setup across simulation functions
The changes standardize the conda environment setup by using a single "simulations" environment across all functions. This approach is more maintainable and aligns well with the PR's objective of switching to conda-based installation.
Also applies to: 302-327, 415-433, 531-552, 637-653
201-201
: LGTM: Consistent update of create_conda_recipes calls
The changes consistently update all create_conda_recipes calls to include output_dir and conda_envs parameters, supporting the transition to conda-based installation.
Also applies to: 338-338, 444-444, 563-563, 665-665
Line range hint 1-677
: Verify complete removal of simuG.pl dependencies
Let's ensure all references to simuG.pl have been removed from the codebase.
✅ Verification successful
Based on the results showing no matches for "simuG.pl" in the codebase, I can generate the final response. However, let me do one additional verification to check for any remaining fetch_script calls in the simulate.py file to ensure completeness.
Successfully verified removal of simuG.pl
The verification confirms that:
- No references to "simuG.pl" exist in the codebase
- The only fetch_script call in simulate.py is for "HaploSim.pl", which is unrelated to simuG
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to simuG.pl
# Search for any remaining references to simuG.pl
echo "Searching for simuG.pl references..."
rg -i "simug\.pl"
# Search for fetch_script calls that might be related
echo "Searching for fetch_script calls..."
ast-grep --pattern 'fetch_script($_, "simuG.pl")'
Length of output: 235
Script:
#!/bin/bash
# Check for all fetch_script calls in simulate.py
ast-grep --pattern 'fetch_script($_,$_)' harpy/simulate.py
Length of output: 127
harpy/snakefiles/qc.smk (3)
18-18
: Confirm the impact of changing the `envdir` location
Changing `envdir` to `os.path.join(os.getcwd(), outdir, "workflow", "envs")` moves the conda environments into the output directory specified by `outdir`. This could lead to the environments being recreated if the output directory is cleaned or deleted between runs, potentially affecting reproducibility and performance. Please verify that this change aligns with your workflow expectations and won't introduce unintended side effects.
Line range hint 88-94
: Verify the necessity of setting `container: None` in the `extract_forward` rule
In the `extract_forward` rule, you've set `container: None`, which overrides the global container setting for this rule. This means the rule will execute outside of the specified container environment. Ensure that this is intentional and that the required tool, `seqtk`, is available in the execution environment.
If `seqtk` is installed on the host system and you intend to run this rule outside the container, no action is needed. Otherwise, consider removing `container: None` or specifying an appropriate container that includes `seqtk`.
Line range hint 150-155
: Consider checking the availability of `QuickDeconvolution` in the conda environment
In the `deconvolve` rule, you are running `QuickDeconvolution`, but it's not specified whether this tool is included in the conda environment defined by `qc.yaml`.
Run the following script to verify whether `QuickDeconvolution` is installed in the `qc.yaml` conda environment:
Ensure that `QuickDeconvolution` is included in your `qc.yaml` conda environment, or adjust the environment as necessary.
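A quick programmatic check is also possible (assumptions: the executable is named `QuickDeconvolution` and the qc environment is active when this runs):

```python
import shutil

# shutil.which returns the resolved path if the tool is on PATH, else None
if shutil.which("QuickDeconvolution") is None:
    raise SystemExit("QuickDeconvolution not found; check the qc.yaml environment")
```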
harpy/snakefiles/sv_naibr_phase.smk (1)
17-18
: LGTM: Proper initialization of the `outdir` and `envdir` variables
Defining `outdir` and `envdir` using `config["output_directory"]` ensures that output and environment directories are correctly set up, enhancing the organization and consistency of the workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
harpy/_conda.py (2)
101-103
: Enhance error message for invalid environment names
The current error message could be more user-friendly. Consider including a newline and formatting for better readability:
- sys.stderr.write(f"Key '{i}' is not an available conda environment name. The options are: " + ", ".join(environ.keys()))
+ sys.stderr.write(f"\nError: Environment '{i}' not found.\nAvailable environments:\n - " + "\n - ".join(sorted(environ.keys())) + "\n")
115-116
: Enhance function documentation and type hints
Consider improving the docstring and adding type hints for better code maintainability:
-def check_environments(dirpath: str, envs: list) -> None:
-    """Check that the provided dir exists and contains the necessary environment definitions"""
+def check_environments(dirpath: str, envs: list[str]) -> None:
+    """Validate the presence of required conda environment definition files.
+
+    Args:
+        dirpath: Path to the working directory containing workflow/envs
+        envs: List of environment names to verify
+
+    Raises:
+        SystemExit: If the envs directory is missing or any required environment file is not found
+    """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- harpy/_cli_types_generic.py (1 hunks)
- harpy/_conda.py (3 hunks)
- harpy/demultiplex.py (4 hunks)
- harpy/resume.py (2 hunks)
- harpy/snp.py (8 hunks)
- harpy/sv.py (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- harpy/demultiplex.py
- harpy/resume.py
- harpy/snp.py
- harpy/sv.py
🔇 Additional comments (4)
harpy/_conda.py (2)
4-8
: LGTM: Clean import additions for enhanced error handling
The new imports are well-organized and appropriate for the enhanced error reporting functionality.
109-113
: Verify the hardcoded SPAdes version
The post-deployment script downloads SPAdes v4.0.0. Let's verify if this is still the recommended version.
✅ Verification successful
SPAdes version 4.0.0 is correctly specified
The hardcoded SPAdes version (4.0.0) in the post-deployment script matches the latest release version from the official repository, confirming it's up to date.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check latest SPAdes release version
gh api repos/ablab/spades/releases/latest --jq .tag_name
Length of output: 65
harpy/_cli_types_generic.py (2)
59-59
: LGTM! Good standardization.
Adding the `name` attribute aligns this class with click's best practices and matches the pattern used by other parameter type classes in this file.
59-59
: Verify simuG output compatibility
Since this PR changes simuG to a conda-based installation, please verify that the `filedict` in the `InputFile` class supports all file extensions that the conda version of simuG produces.
✅ Verification successful
File extension validation is compatible with simuG outputs
Based on the code inspection, the `InputFile` class's `filedict` already supports all necessary file extensions for simuG:
- Input genome files: `.fasta`, `.fas`, `.fa`, `.fna` (covered by the fasta extensions)
- Output VCF files: `.vcf`, `.vcf.gz` (covered by the vcf extensions)
The changes to conda-based simuG installation don't introduce any new file types or extensions that would require updates to the extension validation.
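For reference, a sketch of extension validation consistent with the extensions listed above (the real `filedict` contents in `_cli_types_generic.py` may differ):

```python
# hypothetical subset of InputFile's extension table
filedict = {
    "fasta": [".fasta", ".fas", ".fa", ".fna"],
    "vcf": [".vcf", ".vcf.gz"],
}

def valid_extension(path: str, filetype: str) -> bool:
    return any(path.lower().endswith(ext) for ext in filedict[filetype])
```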
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if simuG conda package documentation mentions any additional file extensions
# that need to be supported by the InputFile class
# Search for file extension patterns in repository documentation
rg -i "simug.*\.(fa|fasta|vcf|gff|gz)" --type md
# Search for actual file extensions in test data or example outputs
fd -e fa -e fasta -e vcf -e gff -e gz | grep -i simug
Length of output: 106
Script:
#!/bin/bash
# Let's check the actual implementation of InputFile class and its file extension handling
# First, let's find the InputFile class implementation
ast-grep --pattern 'class InputFile'
# Then, let's look for any file extension validation or handling
rg -A 5 'filedict|extensions' harpy/_cli_types_generic.py
# Also check for any simuG-related file handling in tests
rg -i 'simug.*test' -A 5
Length of output: 1990
Script:
#!/bin/bash
# Let's check the PR description and documentation for any mentions of simuG output files
rg -i "simug.*output"
# Also check for any documentation about simuG usage
fd README.md | xargs rg -i "simug"
# And check the actual changes in this PR related to simuG
rg -i "simug" -g '!{*.md,*.txt}'
Length of output: 1532
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
harpy/snakefiles/simulate_variants.smk (2)
73-84
: Add file existence checks before renaming
The `rename_haploid` rule should verify file existence before renaming to prevent potential errors. Consider adding validation:
 rule rename_haploid:
     input:
         fasta = f"{outdir}/{outprefix}.simseq.genome.fa",
         vcf = f"{outdir}/{outprefix}.refseq2simseq.{variant}.vcf",
         mapfile = f"{outdir}/{outprefix}.refseq2simseq.map.txt"
     output:
         fasta = f"{outdir}/{outprefix}.fasta",
         vcf = f"{outdir}/{outprefix}.{variant}.vcf",
         mapfile = f"{outdir}/{outprefix}.{variant}.map"
     run:
+        for i in input:
+            if not os.path.exists(i):
+                raise ValueError(f"Input file {i} does not exist")
         for i,j in zip(input, output):
             os.rename(i,j)
95-96
: Consider consolidating directory creation
Directory creation for the haplotypes could be moved to a separate rule or done at workflow start to improve organization. Consider creating a setup rule:
rule setup_directories:
    output:
        directory(f"{outdir}/haplotype_1"),
        directory(f"{outdir}/haplotype_2")
    run:
        for dir in output:
            os.makedirs(dir, exist_ok=True)
Then update the diploid_variants rule to depend on these directories:
 rule diploid_variants:
     input:
+        dirs = [f"{outdir}/haplotype_1", f"{outdir}/haplotype_2"],
         vcf = f"{outdir}/{outprefix}.{variant}.vcf"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- harpy/snakefiles/simulate_snpindel.smk (3 hunks)
- harpy/snakefiles/simulate_variants.smk (3 hunks)
🔇 Additional comments (8)
harpy/snakefiles/simulate_variants.smk (2)
71-71
: Verify simuG conda package availability and version
The workflow now depends on simuG being available through conda. Let's verify its availability and version constraints.
Also applies to: 125-125
✅ Verification successful
simuG package is properly configured in the workflow
The verification shows that:
- simuG is available in bioconda, with versions 1.0.0 and 1.0.1
- The package is properly specified in the workflow's conda dependencies, as shown in `harpy/_conda.py` with the constraint `"bioconda::simug>1.0.0"`
- The simulation rules in `simulate_variants.smk` correctly use the conda environment specified in `{envdir}/simulations.yaml`
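In the programmatic mapping, that pin would sit under the simulations entry, roughly like this (the dict shape is assumed; the constraint string is the one quoted above):

```python
# hypothetical excerpt: the simuG pin inside harpy/_conda.py's environment mapping
environ = {
    "simulations": ["bioconda::simug>1.0.0"],
}
```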
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if simuG is properly specified in conda environment file
rg -A 5 "simug" "${envdir}/simulations.yaml"
# Verify package availability in bioconda
curl -s "https://conda.anaconda.org/bioconda/noarch/repodata.json" | \
jq '.packages | to_entries[] | select(.value.name=="simug") | .value.version'
Length of output: 262
Script:
#!/bin/bash
# Let's check for environment files in the repository
fd -g "*environment*.y*ml"
# Also check for any conda/mamba related config files
fd -g "*conda*.y*ml"
# Let's also check the Snakefile or other workflow files for environment specifications
rg -A 3 "conda:"
Length of output: 24283
15-15
: 🛠️ Refactor suggestion
Ensure output directory exists before accessing envdir path
The environment directory path now includes the output directory, but there's no explicit creation of this path. This could lead to errors if the directory doesn't exist when first accessed.
Consider adding directory creation, similar to how it's done for the haplotype directories (note the makedirs call must come after `envdir` is defined):
 envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+os.makedirs(envdir, exist_ok=True)
harpy/snakefiles/simulate_snpindel.smk (6)
15-15
: LGTM: Improved path handling
The use of `os.path.join` for path construction is good practice, making the code more portable across different platforms.
97-100
: LGTM: Improved output file naming
The output paths are now more intuitive and consistent with the PR's objective of making simuG names less unwieldy.
111-125
: Add error handling for rename operations
The rename rules should include error handling as suggested in the previous review.
Also applies to: 178-187
192-196
: LGTM: Consistent path updates in workflow summary
The workflow summary has been properly updated to reflect the new file naming convention and path structure.
Also applies to: 200-201
154-155
: Add VCF validation before operations
Consider adding VCF validation as suggested in the previous review.
109-109
: Verify conda environment configuration
The switch to conda-based simuG installation requires proper environment setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
harpy/reports/align_stats.Rmd (5)
50-51
: Enhance error handling using `stop()` instead of `print()` and `knitr::knit_exit()`.
Consider using `stop(paste("Input data file", infile, "is empty"))` to provide a clearer error message and terminate the script appropriately.
58-65
: Consolidate the `VALID_PRESENT` and `SKIPTEXT` variables for simplicity.
Using a single variable to control the conditional evaluations can simplify the code and improve readability. Consider using `VALID_PRESENT` alone to manage the evaluations and remove `SKIPTEXT`.
151-154
: Streamline conditional text display based on `VALID_PRESENT`.
Instead of using both `SKIPTEXT` and `VALID_PRESENT`, consider using `!VALID_PRESENT` directly in the `eval` parameter to control evaluation of the message, reducing redundancy.
193-193
: Ensure consistent conditional evaluation of plots.
All plot code chunks are conditioned on `VALID_PRESENT`. Verify that this variable accurately represents when plots should be generated and that it aligns with data availability.
Also applies to: 218-218, 255-255, 301-301, 326-326
421-426
: Enhance error handling using `stop()` instead of `knitr::knit_exit()`.
Similar to earlier, using `stop("Input data file is empty")` provides a clearer error message and halts execution appropriately.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- .github/workflows/tests.yml (1 hunks)
- .gitignore (0 hunks)
- harpy/_printing.py (1 hunks)
- harpy/reports/align_bxstats.Rmd (1 hunks)
- harpy/reports/align_stats.Rmd (10 hunks)
💤 Files with no reviewable changes (1)
- .gitignore
✅ Files skipped from review due to trivial changes (1)
- harpy/reports/align_bxstats.Rmd
🔇 Additional comments (3)
harpy/_printing.py (1)
26-26: Verify the new conda environment path and consider making it more explicit.
While the path change from .harpy_envs to /workflows/envs aligns with the conda-based installation changes, let's ensure the new path is correct and clear to users.
Let's verify the existence and structure of the new conda environment path:
Consider making the path reference more explicit to avoid confusion:
-"There was an issue creating the software environment(s) necessary to run this workflow. If you manually edited the conda dependencies in [blue]/workflows/envs[/blue], see the error below for troubleshooting. If you didn't, it might be a bug or related to how your system is setup for Conda or Singularity environments and you should submit an issue on GitHub: [bold]https://github.com/pdimens/harpy/issues"
+"There was an issue creating the software environment(s) necessary to run this workflow. If you manually edited the conda dependencies in [blue]<harpy_root>/workflows/envs[/blue], see the error below for troubleshooting. If you didn't, it might be a bug or related to how your system is setup for Conda or Singularity environments and you should submit an issue on GitHub: [bold]https://github.com/pdimens/harpy/issues"
✅ Verification successful
Path reference is correct and consistent with codebase
The path /workflows/envs is consistently referenced across the codebase, particularly in _conda.py and _printing.py. The path is relative to Harpy's working directory and is used in error messages for missing environment files. The current implementation is correct and doesn't require modification as:
- The path format is consistent across error messages
- It's clearly used as a relative path in the codebase
- The error message provides sufficient context about conda dependencies
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and structure of the new conda environment path
# Check if the workflows/envs directory exists and list its contents
if [ -d "workflows/envs" ]; then
echo "Directory exists. Contents:"
ls -la workflows/envs/
else
echo "Directory not found at workflows/envs"
fi
# Check for any references to the old path
rg -g '!.git' '.harpy_envs'
Length of output: 140
Script:
#!/bin/bash
# Let's search for any environment-related files and directories
fd -t d "env"
# Check for any conda or environment-related YAML files
fd -e yml -e yaml
# Search for references to environment paths in the codebase
rg -g '!.git' "workflows/envs|\.harpy_envs" -A 2 -B 2
Length of output: 1340
.github/workflows/tests.yml (1)
692-692: LGTM! Consistent naming pattern for simulation outputs.
The changes standardize the output file paths across all simulation types (SNPs, indels, inversions, CNVs, and translocations) by using the "haplotype_1" naming convention. This improves clarity and maintainability.
Let's verify that the new file paths are consistently used across the codebase:
Also applies to: 698-698, 704-704, 710-710
✅ Verification successful
Changes to simulation output paths are consistent with codebase structure
The changes in the workflow file align with the codebase's implementation:
- The new haplotype_1 pattern is properly implemented in the simulation Snakefiles (harpy/snakefiles/simulate_variants.smk and harpy/snakefiles/simulate_snpindel.smk)
- The references to "diploid" in the codebase are primarily in documentation, help text, and rule names, which don't affect the actual file paths
- The workflow changes correctly reflect the file structure created by the Snakemake rules, which use the {outdir}/haplotype_1/{outprefix}.hap1.{variant}.vcf pattern
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to the old "diploid" naming pattern
# in simulation-related files and verify the new "haplotype_1" pattern is used consistently.
# Search for any remaining "diploid" references in simulation-related files
rg -i "diploid" --type python
# Search for the new "haplotype_1" pattern to ensure consistent usage
rg -i "haplotype_1" --type python
Length of output: 125
Script:
#!/bin/bash
# Let's try searching without file type restriction since Python might not be recognized
# Search for any remaining "diploid" references in simulation-related files
rg -i "diploid"
# Search for the new "haplotype_1" pattern to ensure consistent usage
rg -i "haplotype_1"
# Also check for files containing simulation-related patterns
fd -t f "sim.*\.vcf"
Length of output: 4959
harpy/reports/align_stats.Rmd (1)
45-46: Good practice: Use dynamic parameter for molecule distance.
Replacing the hardcoded value with sm_moldist <- snakemake@params$mol_dist enhances flexibility and allows for dynamic configuration.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- harpy/snakefiles/simulate_variants.smk (3 hunks)
🔇 Additional comments (3)
harpy/snakefiles/simulate_variants.smk (3)
72-72: Add error handling for simuG command
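A sketch of how the call could fail loudly instead of silently producing incomplete outputs. It assumes the bioconda package puts a simuG executable on PATH, which the conda-based install in this PR implies; the function and argument names are hypothetical:
import subprocess
import sys

def run_simug(args, logfile):
    """Invoke simuG, capture its output to a log, and abort on a non-zero exit."""
    with open(logfile, "w") as log:
        result = subprocess.run(["simuG", *args], stdout=log, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        sys.exit(f"simuG exited with code {result.returncode}; see {logfile} for details")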
91-92: LGTM: Well-organized directory structure
The new directory structure with separate haplotype directories is well-organized and follows Snakemake best practices:
- Directories are created in the rule that needs them
- Using exist_ok=True prevents race conditions
- Clear separation of haplotype outputs
Also applies to: 96-97
142-146: LGTM: Comprehensive workflow summary
The workflow summary has been properly updated to reflect the new file structure and provides good traceability with:
- Correct paths to all output files
- Proper conditional collection of haplotype files
- Clear separation between haploid and diploid outputs
Provides simuG from the updated conda installation and removes the Harpy-local copy. This includes changing how the command is called and the expected outputs. As a result, there are a few changes:
- new rename_X rules, because the default simuG names are unwieldy
Summary by CodeRabbit
Release Notes
New Features
- Updated the view command to include an option for viewing the workflow's configuration file.
- Added workflow_summary rules to provide detailed summaries of workflow execution parameters and commands.
Bug Fixes
- Fixes to the view command.
Chores
- Removed the simuG.pl script for several simulation workflows, streamlining the execution process.
- Updated the .gitignore file, dropping the .harpy_envs/ entry so that directory can be tracked by Git.