swap simuG to conda-based install #168

Merged
merged 23 commits into from
Nov 26, 2024

Conversation

pdimens
Owner

@pdimens pdimens commented Nov 22, 2024

Provides simuG from the updated conda installation and removes the Harpy-local copy. This changes how the command is called and what outputs it produces. As a result, there are a few changes (a sketch of the new invocation appears after this list):

  1. extra rename_X rules, because the default simuG names are unwieldy
  2. an extra rule to concatenate the snpindel VCFs if both SNPs and indels are present; purely a convenience feature
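
A minimal sketch of the new invocation pattern, assuming the conda environment puts simuG on PATH (the paths, counts, and env file name here are illustrative; the real rules are in the snakefiles changed by this PR):

rule simulate_snpindel:
    input:
        geno = "genome.fasta"
    output:
        "sim/myprefix.simseq.genome.fa"
    log:
        "logs/myprefix.log"
    conda:
        "envs/simulations.yaml"
    shell:
        "simuG -refseq {input.geno} -prefix sim/myprefix -snp_count 1000 -indel_count 100 > {log}"

The rename_X rules then move simuG's fixed default outputs (e.g. myprefix.simseq.genome.fa) to friendlier names like myprefix.fasta.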

Summary by CodeRabbit

Release Notes

  • New Features

    • Added new rules for renaming and concatenating variant files in the SNP and indel simulation workflows.
    • Updated output file naming conventions for clearer organization of simulated sequences and VCF files.
    • Enhanced the view command to include an option for viewing the workflow's configuration file.
    • Introduced improved handling of conda environments across various commands, allowing for better configuration and dependency management.
    • Added a new function to check the existence of required conda environment definition files, enhancing error reporting.
    • Enhanced the workflow_summary rules to provide detailed summaries of workflow execution parameters and commands.
    • Updated the GitHub Actions workflow configuration for testing to align with new output paths for simulation commands.
  • Bug Fixes

    • Improved regular expression handling in file processing by switching to raw string literals, improving clarity and preventing escape-sequence issues.
    • Enhanced error handling for missing files in the view command.
    • Corrected a typo in the R Markdown file for proper knitting functionality.
  • Chores

    • Removed dependency on the simuG.pl script for several simulation workflows, streamlining the execution process.
    • Updated environment directory paths for better organization and clarity across all workflows.
    • Standardized naming conventions for input file types within the CLI framework.
    • Modified the .gitignore file so that the .harpy_envs/ directory is tracked by Git.


coderabbitai bot commented Nov 22, 2024

Walkthrough

The pull request introduces several modifications across multiple files in the harpy module. Key changes include the update of the create_conda_recipes function to accept parameters for output directories and environment lists, the addition of a new dependency for conda environments, and the restructuring of output file naming conventions in Snakemake workflows. New rules for renaming and concatenating variant files have been added, enhancing the clarity and organization of output files. Overall, the changes streamline workflows while maintaining core functionalities.
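
As a rough sketch of the updated call pattern (the import path and values here are illustrative; the real signature lives in harpy/_conda.py):

from harpy._conda import create_conda_recipes

output_dir = "Align/bwa"                 # the workflow's output directory
conda_envs = ["align", "r", "qc"]        # environments this command needs
create_conda_recipes(output_dir, conda_envs)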

Changes

File | Change Summary
harpy/_conda.py Updated create_conda_recipes to accept outdir and envs; added dependency "bioconda::simug>1.0.0".
harpy/popgroup.py Updated regular expression patterns to raw string literals; retained error handling and output file logic.
harpy/simulate.py Removed fetch_script(workflowdir, "simuG.pl") from snpindel, inversion, cnv, and translocation; updated create_conda_recipes call.
harpy/snakefiles/simulate_snpindel.smk Restructured output naming conventions; added rules rename_haploid, rename_diploid, and concatenate_variants.
harpy/snakefiles/simulate_variants.smk Updated output naming conventions; added rules rename_haploid, rename_diploid; modified existing rules for clarity.
harpy/view.py Updated view command to include a new --config option for viewing workflow configuration files.
harpy/_validations.py Removed check_envdir function, eliminating its associated error handling logic.
harpy/align.py Updated command functions to include conda_envs for conda environment management.
harpy/assembly.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/container.py Updated create_conda_recipes call to include "container" as an argument.
harpy/deconvolve.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/demultiplex.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/impute.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/metassembly.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/phase.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/preflight.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/qc.py Updated create_conda_recipes call to include output_dir and conda_envs.
harpy/snakefiles/align_bwa.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/align_ema.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/align_strobealign.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/assembly.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/containerize.smk Removed onsuccess rule; updated rules to use new envdir variable.
harpy/snakefiles/deconvolve.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/demultiplex_gen1.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/impute.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/metassembly.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/phase.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/preflight_bam.smk Updated envdir to point to the new directory structure; defined new rules for handling BAM files.
harpy/snakefiles/preflight_fastq.smk Updated envdir to point to the new directory structure; defined new rules for handling FASTQ files.
harpy/snakefiles/qc.smk Updated envdir to point to the new directory structure; added rules for deconvolution.
harpy/snakefiles/simulate_linkedreads.smk Updated envdir to point to the new directory structure; modified rules for linked read simulation.
harpy/snakefiles/snp_freebayes.smk Updated envdir to point to the new directory structure; modified rules for variant calling.
harpy/snakefiles/snp_mpileup.smk Updated envdir to point to the new directory structure; modified rules for mpileup processing.
harpy/snakefiles/sv_leviathan.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/sv_leviathan_pop.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/sv_naibr.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/sv_naibr_phase.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/sv_naibr_pop.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snakefiles/sv_naibr_pop_phase.smk Updated envdir to point to the new directory structure; enhanced workflow_summary rule.
harpy/snp.py Added conda_envs variable for mpileup and freebayes commands; updated create_conda_recipes calls.
harpy/sv.py Added conda_envs variable for leviathan and naibr commands; updated create_conda_recipes calls.

Possibly related PRs

  • Leverage heterozygosity to create diploid genomes #140: The changes in this PR enhance the simulation of genomic variants by introducing new options and restructuring existing commands, which directly relates to the modifications made in the create_conda_recipes function in the main PR.
  • Add Metassembly #144: This PR adds metassembly support and modifies various workflows, which may indirectly relate to the changes in environment management and recipe creation in the main PR.
  • Molecule coverage #149: The introduction of molecular coverage calculations in this PR aligns with the overall theme of enhancing simulation and reporting capabilities, which is relevant to the changes in the main PR.
  • add singleton/non stats #156: The addition of singleton/non-singleton statistics in this PR complements the improvements in reporting and data handling in the main PR.
  • haplotagging barcodes as default #162: The modifications to barcode handling and simulation processes in this PR are relevant to the changes in environment management and recipe creation in the main PR.
  • better sim demux support #163: While the description is empty, the focus on improving simulation demultiplexing support may relate to the overall enhancements in the simulation workflows in the main PR.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
harpy/popgroup.py (1)

Line range hint 41-47: Consider enhancing file handling robustness

While the error messages are clear and helpful, consider adding checks for:

  1. File permissions when writing the output file
  2. Directory write permissions before attempting to create the output file
  3. Maximum filename length limitations

Here's a suggested improvement for the file handling:

import os
import sys

def check_write_permissions(filepath):
    """Check if we have permission to write to the directory"""
    directory = os.path.dirname(filepath) or '.'
    return os.access(directory, os.W_OK)

# Add before opening the output file (print_error and print_solution are
# Harpy's own messaging helpers, assumed to be in scope):
if not check_write_permissions(output):
    print_error("permission denied", f"Cannot write to output file: {output}")
    print_solution("Please ensure you have write permissions or choose a different output location")
    sys.exit(1)
🧰 Tools
🪛 Ruff (0.7.0)

34-34: Do not use bare except

(E722)

harpy/snakefiles/simulate_variants.smk (1)

112-138: Consider unifying the rename implementation approach.

The two rename rules use different approaches to achieve the same goal:

  • rename_haploid uses Python's os.rename
  • rename_diploid uses shell mv commands

Consider unifying the implementation for better maintainability.

-rule rename_diploid:
-    input:
-        fasta = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.simseq.genome.fa",
-        mapfile = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.refseq2simseq.map.txt"
-    output:
-        fasta = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.fasta",
-        mapfile = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.{variant}.map"
-    container:
-        None
-    shell:
-        """
-        mv {input.fasta} {output.fasta}
-        mv {input.mapfile} {output.mapfile}
-        """
+rule rename_diploid:
+    input:
+        fasta = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.simseq.genome.fa",
+        mapfile = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.refseq2simseq.map.txt"
+    output:
+        fasta = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.fasta",
+        mapfile = f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.{variant}.map"
+    run:
+        for i,j in zip(input, output):
+            os.rename(i,j)
harpy/snakefiles/simulate_snpindel.smk (2)

141-148: Consider standardizing log file naming

While the input/output paths follow the new naming convention, the log file path at line 149 could be updated to match:

-    log:
-        f"{outdir}/logs/{outprefix}.hap{{haplotype}}.log"
+    log:
+        f"{outdir}/logs/{outprefix}.hap{{haplotype}}.simseq.log"

Also applies to: 153-154, 158-158


187-201: Consider enhancing VCF concatenation robustness

The current implementation could benefit from:

  1. Sorting variants by position
  2. Validating VCF format
  3. Checking for duplicate variants
 rule concatenate_variants:
     input:
         f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.indel.vcf",
         f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.snp.vcf"
     output:
         f"{outdir}/diploid/{outprefix}.hap{{haplotype}}.snpindel.vcf"
+    conda:
+        f"{envdir}/simulations.yaml"
     shell:
-        import shutil
-        shutil.copy(input[0], output[0])
-        with open(output[0], "a") as outvcf, open(input[1], "r") as invcf:
-            for record in invcf:
-                if record.startswith("#"):
-                    continue
-                outvcf.write(record)
+        "bcftools concat -a {input} | bcftools sort -o {output}"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 736e540 and 98b1b66.

📒 Files selected for processing (5)
  • harpy/_conda.py (1 hunks)
  • harpy/popgroup.py (1 hunks)
  • harpy/simulate.py (0 hunks)
  • harpy/snakefiles/simulate_snpindel.smk (2 hunks)
  • harpy/snakefiles/simulate_variants.smk (2 hunks)
💤 Files with no reviewable changes (1)
  • harpy/simulate.py
🧰 Additional context used
🪛 Ruff (0.7.0)
harpy/popgroup.py

34-34: Do not use bare except

(E722)

🔇 Additional comments (10)
harpy/popgroup.py (1)

28-28: LGTM! Good improvement to regex patterns

The switch to raw string literals (r"...") for regular expressions is a good practice that improves readability and prevents potential issues with escape sequences.

Also applies to: 36-36
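
For illustration (not part of the diff), the practical difference the raw-string prefix makes:

import re

# without the r-prefix, every regex backslash must be doubled:
fq_pattern = re.compile("\\.f(ast)?q(\\.gz)?$", re.IGNORECASE)
# with it, the pattern reads exactly as the regex engine sees it:
fq_pattern = re.compile(r"\.f(ast)?q(\.gz)?$", re.IGNORECASE)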

harpy/_conda.py (1)

63-68: Verify if all perl-related dependencies are still needed

Since we're moving from a local perl-based simuG to a conda package, please verify if all the perl-related dependencies (perl-math-random, perl-inline-c, perl-parse-recdescent, perl) are still required by the conda version of simuG.

✅ Verification successful

Perl dependencies are still required - local Perl script is present

The verification shows that Perl dependencies are still needed as the package includes a Perl script (harpy/scripts/HaploSim.pl) that actively uses these modules:

  • Math::Random for random number generation
  • Inline::C for C code integration
  • The script also uses various other Perl modules

While the package now uses the conda version of simuG, the Perl dependencies are required for other functionality within the package itself.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check simuG conda package dependencies to verify perl requirements

# Get simuG package info from bioconda
curl -s https://api.anaconda.org/package/bioconda/simug | jq -r '.dependencies[]' | grep -i 'perl'

# Search for perl usage in package
conda create -n simug_test bioconda::simug -y 2>/dev/null
conda list -n simug_test | grep -i 'perl'

Length of output: 344


Script:

#!/bin/bash
# Let's try a different approach to verify perl dependencies

# First, check if simug is using perl in its implementation
rg -i "perl|use\s+[\w:]+|require\s+[\w:]+" --type perl

# Check for any perl script execution or perl-related system calls
rg -i "system.*perl|exec.*perl|\`perl|\$PERL|perl\s+[-\w./]+"

# Look for perl module imports or requirements in the codebase
rg -i "use\s+Math::Random|use\s+Inline::C|use\s+Parse::RecDescent"

# Check package metadata files for perl dependencies
fd "meta.ya?ml|environment.ya?ml|setup.py|pyproject.toml" --exec cat {}

Length of output: 3931

harpy/snakefiles/simulate_variants.smk (5)

60-62: LGTM! Improved output file naming clarity.

The explicit output paths better reflect simuG's default naming convention, making the workflow more predictable and easier to understand.


75-78: LGTM! Consistent path naming updates.

The path updates in both rules maintain consistency with the new naming convention, and the temporary VCF file is properly managed.

Also applies to: 96-101


143-147: LGTM! Consistent workflow summary updates.

The input paths properly reflect the new file naming convention and maintain appropriate conditional logic for heterozygosity.


Line range hint 1-147: Verify no broken dependencies on old file paths.

Let's check for any remaining references to the old file naming pattern that might cause broken dependencies.

✅ Verification successful

No dependencies found on old simuG file patterns

The verification confirms that there are no references to the old simuG default naming patterns (simseq.genome.fa or refseq2simseq) outside of Snakemake files. The renaming changes in the workflow are self-contained and do not impact any external dependencies.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential references to old file patterns
rg -g '!*.smk' "simseq.genome.fa|refseq2simseq" || echo "No references found to simuG default naming patterns outside Snakemake files"

Length of output: 212


71-71: Verify simuG availability in conda environment.

The command now directly uses simuG instead of the script. Let's verify the conda environment configuration.

✅ Verification successful

Let me try to locate the conda environment file using a different approach.


Let me check the content of the conda-related files we found to verify the simuG configuration.


simuG is properly configured in the conda environment

The verification shows that simuG is correctly specified in the conda dependencies:

  • It's included in the simulations environment with version constraint bioconda::simug>1.0.0
  • The environment is automatically created by the _conda.py script when needed
  • The direct usage of simuG command is appropriate as it will be available in the conda environment
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if simuG is properly specified in conda environment file
cat .harpy_envs/simulations.yaml || echo "Conda environment file not found at expected location"

Length of output: 212


Script:

#!/bin/bash
# Search for any yaml/yml files that might contain environment specifications
fd -e yaml -e yml

# Also check if there's any mention of simuG in the repository
rg -i "simug" -g "!{.git,node_modules}" --no-binary

Length of output: 1553


Script:

#!/bin/bash
# Check conda-related YAML files
cat resources/meta.yaml
echo "-------------------"
cat resources/harpy.yaml

# Also check the Python file that seems to handle conda dependencies
cat harpy/_conda.py

Length of output: 5704

harpy/snakefiles/simulate_snpindel.smk (3)

97-100: LGTM: Output paths are well-structured

The output paths follow a clear naming convention and correctly handle conditional SNP/indel outputs.


113-116: LGTM: Consistent directory structure and naming

The diploid variant rules maintain a clear and consistent directory structure under 'diploid/' with well-organized output paths.

Also applies to: 134-137


206-211: LGTM: Comprehensive workflow summary updates

The workflow summary correctly captures all output files with appropriate conditionals for different variant types and ploidy scenarios.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
harpy/view.py (2)

20-21: Minor docstring improvement needed.

The docstring mentions "Use --snakefile or --config" but splits across lines in a way that might affect readability.

-    of a Harpy output directory. Use `--snakefile` or `--config` to view the workflow
-    snakefile or config.yaml file instead, respectively. Output is printed to the screen via `less` and
+    of a Harpy output directory. Use `--snakefile` or `--config` to view the workflow's
+    snakefile or config.yaml file, respectively. Output is printed to the screen via `less` and

44-51: Consider making the config filename configurable.

The config file path is currently hard-coded as config.yaml. Consider making this configurable to support different naming conventions or multiple config files.

+DEFAULT_CONFIG_NAME = "config.yaml"
+
 def view(directory, snakefile, config):
     ...
     elif config:
-        files = [f"{directory}/workflow/config.yaml"]
+        files = [f"{directory}/workflow/{DEFAULT_CONFIG_NAME}"]
         err_dir = f"{directory}/workflow/"
-        err_file = "There is no [blue]config.yaml[/blue] file"
+        err_file = f"There is no [blue]{DEFAULT_CONFIG_NAME}[/blue] file"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 98b1b66 and 19a8a43.

📒 Files selected for processing (1)
  • harpy/view.py (4 hunks)
🔇 Additional comments (3)
harpy/view.py (3)

12-15: LGTM! Well-structured CLI option addition.

The new --config option is properly integrated with appropriate flags and help text, maintaining consistency with the existing --snakefile option.


32-34: LGTM! Good error handling for mutually exclusive options.

The mutual exclusivity check is well-implemented with a clear error message using rich formatting.
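
A hypothetical sketch of such a guard in a click command (the actual decorators and wording in harpy/view.py may differ):

import sys
import rich_click as click  # assumption: a drop-in replacement for plain click

@click.command()
@click.option("--snakefile", is_flag=True, help="View the workflow snakefile instead")
@click.option("--config", is_flag=True, help="View the workflow config.yaml instead")
@click.argument("directory")
def view(directory, snakefile, config):
    """View the log, snakefile, or config.yaml of a Harpy output directory."""
    if snakefile and config:
        click.echo("Error: --snakefile and --config are mutually exclusive", err=True)
        sys.exit(1)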


69-69: LGTM! Consistent error message formatting.

The error message follows the established pattern and properly uses rich formatting.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🧹 Outside diff range and nitpick comments (49)
harpy/container.py (1)

18-18: Address the TODO comment regarding environment completeness.

The TODO comment suggests that not all required conda environments are being included in the container build. This could lead to missing dependencies when running workflows in the containerized environment.

Would you like me to help identify all the required conda environments across the codebase? I can create a GitHub issue to track this task.

harpy/snakefiles/containerize.smk (1)

4-4: Consider making the environment directory configurable

While the path construction is correct, consider making the environment directory configurable through Snakemake's config system or workflow.basedir instead of hardcoding the relative path. This would provide more flexibility for different deployment scenarios.

-envdir = os.path.join(os.getcwd(), "container/workflow/envs")
+envdir = os.path.join(workflow.basedir, config.get("env_dir", "container/workflow/envs"))
harpy/snakefiles/preflight_bam.smk (2)

Line range hint 7-12: Preserve error logs for debugging

The current onerror handler removes the log file, which could make debugging failures more difficult. Consider preserving error logs in a dedicated directory.

 onerror:
-    os.remove(logger.logfile)
+    error_log = os.path.join(outdir, "logs", "error.log")
+    os.makedirs(os.path.dirname(error_log), exist_ok=True)
+    shutil.copy2(logger.logfile, error_log)
+    logger.error(f"Workflow failed. Error log saved to {error_log}")

Line range hint 89-103: Enhance workflow summary with environment details

The workflow summary could be more informative by including details about the conda environments and container versions used.

     run:
         os.makedirs(f"{outdir}/workflow/", exist_ok= True)
         summary = ["The harpy preflight bam workflow ran using these parameters:"]
+        # Add container version
+        summary.append(f"Container version: {containerized}")
+        # Add conda environment details
+        if os.path.exists(envdir):
+            env_files = glob.glob(os.path.join(envdir, "*.yaml"))
+            if env_files:
+                summary.append("\nConda environments:")
+                for env in env_files:
+                    summary.append(f"\t{os.path.basename(env)}")
         valids = "Validations were performed with:\n"
harpy/snakefiles/preflight_fastq.smk (1)

Line range hint 48-60: LGTM with a minor suggestion

The concatenation logic is solid with proper temporary file handling and sorted output. Consider using a heredoc for the header line to improve readability:

     shell:
         """
-        cat {input} | sort -k1 > {output.tmp}
-        echo -e "file\treads\tnoBX\tbadBX\tbadSamSpec\tbxNotLast\n$(cat {output.tmp})" > {output.final}
+        cat {input} | sort -k1 > {output.tmp}
+        cat << 'EOF' > {output.final}
+file	reads	noBX	badBX	badSamSpec	bxNotLast
+EOF
+        cat {output.tmp} >> {output.final}
         """
harpy/resume.py (1)

27-34: Fix typo in docstring requirements

The docstring updates accurately reflect the changes, but there's a typo in the requirements section.

-    - the targest directory has `workflow/envs/*.yaml` present in it
+    - the target directory has `workflow/envs/*.yaml` present in it
harpy/snakefiles/deconvolve.smk (1)

Line range hint 1-1: Consider standardizing environment management approach.

The workflow mixes different environment management approaches:

  • Some rules use containers (containerized directive at the top)
  • Some rules explicitly set container: None
  • Some rules use conda environments

Consider standardizing the approach to use either conda environments consistently (aligning with the PR's objective of conda-based installation) or containers throughout the workflow. This would make the workflow more maintainable and reduce potential environment-related issues.

Also applies to: 18-18, 41-44, 52-59, 61-67

harpy/_conda.py (2)

107-113: Consider enhancing the SPAdes post-deployment script

The current implementation could be improved in several ways:

  1. Add SHA256 checksum verification for the downloaded file
  2. Make the SPAdes version configurable
  3. Check for wget availability before use

Here's a suggested improvement:

-            shellscript.write("wget -O .spades.tar.gz https://github.com/ablab/spades/releases/download/v4.0.0/SPAdes-4.0.0-Linux.tar.gz\n")
-            shellscript.write("tar -xvzf .spades.tar.gz && rm .spades.tar.gz\n")
+            shellscript.write("""
+SPADES_VERSION="4.0.0"
+SPADES_SHA256="904e52c505fb3ce2e8ca1147957f31500208df7dd55547474cf7da42c6935c2c"
+
+command -v wget >/dev/null 2>&1 || { echo "wget is required but not installed. Aborting." >&2; exit 1; }
+
+wget -O .spades.tar.gz "https://github.com/ablab/spades/releases/download/v${SPADES_VERSION}/SPAdes-${SPADES_VERSION}-Linux.tar.gz"
+echo "${SPADES_SHA256}  .spades.tar.gz" | sha256sum -c - || { echo "Checksum verification failed!" >&2; exit 1; }
+tar -xzf .spades.tar.gz && rm .spades.tar.gz
+""")

134-134: Fix typo in error message

There's a typo in "otheriwse".

-            "Check that the names conform to Harpy's expectations, otheriwse you can recreate this directory using the [green bold]--conda[/green bold] option.",
+            "Check that the names conform to Harpy's expectations, otherwise you can recreate this directory using the [green bold]--conda[/green bold] option.",
harpy/snakefiles/demultiplex_gen1.smk (2)

Line range hint 89-96: Consider improving the robustness of sample demultiplexing.

The current grep-based approach for demultiplexing might be fragile:

  1. Fixed pattern matching (A..{params}B..D) could break if the format changes
  2. Silent failure with touch {output} might mask issues

Consider:

  1. Using a more robust parsing approach
  2. Adding error logging
  3. Implementing validation of the output file size
 rule demultiplex_samples:
     input:
         outdir + "/demux_R{FR}_001.fastq.gz"
     output:
         outdir + "/{sample}.R{FR}.fq.gz"
     params:
         c_barcode = lambda wc: samples[wc.get("sample")]
+    log:
+        outdir + "/logs/demux/{sample}.R{FR}.log"
     container:
         None
     shell:
         """
-        ( zgrep -A3 "A..{params}B..D" {input} | grep -v "^--$" | gzip -q > {output} ) || touch {output}
+        (
+            zgrep -A3 "A..{params}B..D" {input} | grep -v "^--$" | gzip -q > {output}
+            if [ ! -s {output} ]; then
+                echo "Warning: No reads found for barcode pattern A..{params}B..D" >> {log}
+                touch {output}
+            fi
+        ) 2>> {log}
         """

Line range hint 98-127: Consider improving error handling in quality assessment.

The hardcoded fallback output in case of failure might become outdated if the falco tool's output format changes.

Consider:

  1. Moving the fallback template to a separate file
  2. Adding version checking
  3. Implementing proper error codes
 rule assess_quality:
     input:
         outdir + "/{sample}.R{FR}.fq.gz"
     output: 
         outdir + "/reports/data/{sample}.R{FR}.fastqc"
     log:
         outdir + "/logs/{sample}.R{FR}.qc.log"
     params:
         f"{outdir}/reports/data"
     threads:
         1
     conda:
         f"{envdir}/qc.yaml"
     shell:
         """
-        ( falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} ) > {log} 2>&1 ||
-cat <<EOF > {output}
-##Falco	1.2.4
->>Basic Statistics	fail
-#Measure	Value
-Filename	{wildcards.sample}.R{wildcards.FR}.fq.gz
-File type	Conventional base calls
-Encoding	Sanger / Illumina 1.9
-Total Sequences	0
-Sequences flagged as poor quality	0
-Sequence length	0
-%GC	0
->>END_MODULE
-EOF      
+        falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} > {log} 2>&1
+        if [ $? -ne 0 ]; then
+            echo "Error: Quality assessment failed for {wildcards.sample}.R{wildcards.FR}" >> {log}
+            exit 1
+        fi
         """
harpy/metassembly.py (1)

72-72: Consider version pinning conda environments for reproducibility.

While the conda environments list is comprehensive for the workflow needs, consider adding version specifications to ensure reproducible builds across different systems.

Example modification:

-    conda_envs = ["align", "assembly", "metassembly", "qc", "spades"]
+    conda_envs = {
+        "align": "1.0",
+        "assembly": "1.0",
+        "metassembly": "1.0",
+        "qc": "1.0",
+        "spades": "3.15.5"  # or your preferred version
+    }

Also applies to: 85-85

harpy/qc.py (1)

79-79: Consider making conda environments configurable.

While the current hardcoded list ["qc", "r"] works for the current setup, consider making these environments configurable through a configuration file or CLI parameter for better flexibility and maintainability.
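
One possible shape for that, sketched with a made-up file name and key (not harpy's actual API):

import yaml

def load_conda_envs(command, path="conda_envs.yaml", default=("qc", "r")):
    """Return the conda env names for a command, falling back to defaults."""
    try:
        with open(path) as fh:
            data = yaml.safe_load(fh) or {}
            return data.get(command, list(default))
    except FileNotFoundError:
        return list(default)

conda_envs = load_conda_envs("qc")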

harpy/impute.py (1)

85-93: Consider improving conda environment configuration.

While the addition of conda environments is aligned with the PR objectives, consider these improvements:

  1. Move the environment list to a configuration file for better maintainability
  2. Add validation for required environment names
  3. Consider if additional environments might be needed for the imputation workflow

Example implementation:

+ from typing import List
+ def validate_conda_envs(envs: List[str]) -> None:
+     required_envs = {"r", "stitch"}
+     if not required_envs.issubset(set(envs)):
+         missing = required_envs - set(envs)
+         raise ValueError(f"Missing required conda environments: {missing}")

- conda_envs = ["r", "stitch"]
+ # Load from config file
+ conda_envs = load_conda_config()  # implement this function
+ validate_conda_envs(conda_envs)
harpy/snakefiles/assembly.smk (3)

Line range hint 1-1: Consider using a specific container version for reproducibility.

Using the latest tag for the container image can lead to reproducibility issues as the image content may change over time. Consider pinning to a specific version.

-containerized: "docker://pdimens/harpy:latest"
+containerized: "docker://pdimens/harpy:1.0.0"  # Replace with actual version

Line range hint 6-9: Improve log file handling for better debugging.

The current implementation removes the log file in both success and error cases, which might make debugging harder. Consider:

  1. Preserving the log file on error
  2. Moving successful logs to an archive directory
onstart:
    logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
onsuccess:
-    os.remove(logger.logfile)
+    os.rename(logger.logfile, os.path.join(outdir, "logs", "archive", os.path.basename(logger.logfile)))
onerror:
-    os.remove(logger.logfile)
+    logger.logger.error("Workflow failed. Log file preserved at: %s", logger.logfile)

Line range hint 156-169: Improve error handling in BUSCO analysis.

The current implementation silently creates an empty output file on failure (touch {output}), which could mask real issues. Consider failing the workflow or at least logging the error.

rule BUSCO_analysis:
    input:
        f"{outdir}/scaffolds.fasta"
    output:
        f"{outdir}/busco/short_summary.specific.{lineagedb}_odb10.busco.txt"
    log:
        f"{outdir}/logs/busco.log"
    params:
        output_folder = outdir,
        out_prefix = "-o busco",
        lineage = f"-l {lineagedb}",
        download_path = f"--download_path {outdir}/busco",
        metaeuk = "--metaeuk" if organism == "eukaryote" else "" 
    threads:
        workflow.cores
    conda:
        f"{envdir}/assembly.yaml"
    shell:
-        "( busco -f -i {input} -c {threads} -m genome --out_path {params} > {log} 2>&1 ) || touch {output}"
+        """
+        if ! busco -f -i {input} -c {threads} -m genome --out_path {params} > {log} 2>&1; then
+            echo "BUSCO analysis failed. Check the log file at {log}" >&2
+            exit 1
+        fi
+        """
harpy/snakefiles/simulate_variants.smk (2)

60-62: LGTM! Consider updating documentation

The changes correctly implement the switch to conda-based simuG installation by:

  1. Removing the Perl script dependency
  2. Using more descriptive output filenames that include the variant type

Consider adding a comment explaining the output file naming convention for future maintainability.

Also applies to: 71-71
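
For example, a short comment block above the rename rules could spell out the mapping (taken from the diffs in this rule):

# simuG writes fixed default names; the rename rules map them to Harpy's convention:
#   {outprefix}.simseq.genome.fa       -> {outprefix}.fasta
#   {outprefix}.refseq2simseq.map.txt  -> {outprefix}.{variant}.map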


112-138: Standardize implementation of rename rules

The two rename rules use different approaches to achieve the same goal:

  • rename_haploid uses Python's os.rename
  • rename_diploid uses shell mv command

Consider standardizing the implementation. Here's a suggested refactor for rename_diploid:

-    container:
-        None
-    shell:
-        """
-        mv {input.fasta} {output.fasta}
-        mv {input.mapfile} {output.mapfile}
-        """
+    run:
+        for i,j in zip(input, output):
+            os.rename(i,j)
harpy/phase.py (1)

Line range hint 83-94: Consider environment validation and configuration flexibility.

While the addition of conda environments aligns with the migration to conda-based installation, consider these improvements:

  1. Add validation to ensure the required environments exist
  2. Consider moving environment names to a configuration file for better maintainability

Here's a suggested approach:

-    conda_envs = ["phase", "r"]
+    # Import from a central configuration
+    from ._config import PHASE_ENVIRONMENTS
+    
+    conda_envs = PHASE_ENVIRONMENTS
+    
+    # Validate environments before proceeding
+    if not all(validate_conda_env(env) for env in conda_envs):
+        raise ValueError("One or more required conda environments are missing")
harpy/snakefiles/sv_naibr.smk (1)

Line range hint 191-193: Consider cleaning up temporary workflow directories on completion

The workflow creates temporary directories under workflow/input for NAIBR configuration files, but these aren't cleaned up after the workflow completes. Only the final summary is retained in the workflow directory.

Consider adding cleanup in the onsuccess directive:

 onsuccess:
     os.remove(logger.logfile)
+    shutil.rmtree(os.path.join(outdir, "workflow", "input"), ignore_errors=True)

Don't forget to add the import:

 import os
 import re
 import logging
+import shutil
 from pathlib import Path

Also applies to: 207-208

harpy/assembly.py (1)

Line range hint 87-128: Architecture supports conda-based installation effectively

The changes effectively support the transition to conda-based installation by:

  1. Explicitly defining required conda environments
  2. Integrating them into the workflow configuration
  3. Properly managing them through the create_conda_recipes function

This approach eliminates the need for a local copy of Harpy while maintaining workflow functionality.

harpy/snakefiles/simulate_linkedreads.smk (2)

Line range hint 73-83: Consider version pinning for external tools

The workflow uses several external tools (dwgsim, seqtk, msort, perl) without version pinning. This could lead to reproducibility issues.

Add version constraints in the conda environment file:

# simulations.yaml
dependencies:
-  - dwgsim
-  - seqtk
-  - msort
-  - perl
+  - dwgsim=0.1.13  # Replace with actual version
+  - seqtk=1.4      # Replace with actual version
+  - msort=8.53     # Replace with actual version
+  - perl=5.32.1    # Replace with actual version

Also applies to: 127-137


Line range hint 186-218: Enhance workflow summary with version information

The summary provides good documentation of parameters but lacks version information for the tools used. This information is crucial for reproducibility.

Consider adding version information to the summary:

     summary = ["The harpy simulate linkedreas workflow ran using these parameters:"]
+    # Add version information
+    versions = ["Tool versions used:"]
+    for tool in ["dwgsim", "seqtk", "msort", "perl"]:
+        result = shell(f"conda list {tool} | grep {tool}", capture_output=True)
+        versions.append(f"{tool}: {result.stdout.decode().strip()}")
+    summary.extend(["\n".join(versions), ""])
     summary.append(f"Genome haplotype 1: {gen_hap1}")
harpy/snakefiles/snp_mpileup.smk (4)

Line range hint 27-37: Add input validation for configuration and files

The region handling code could benefit from additional validation:

  1. Validate file existence before opening
  2. Add type checking for configuration values
  3. Handle malformed lines in region file

Consider this improved implementation:

 if regiontype == "region":
+    if not isinstance(regioninput, str):
+        raise ValueError("Region input must be a string")
     intervals = [regioninput]
     regions = {f"{regioninput}" : f"{regioninput}"}
 else:
+    if not os.path.exists(regioninput):
+        raise FileNotFoundError(f"Region file not found: {regioninput}")
     with open(regioninput, "r") as reg_in:
         intervals = set()
         for line in reg_in:
-            cont,startpos,endpos = line.split()
+            try:
+                cont,startpos,endpos = line.strip().split()
+                # Validate numeric positions
+                int(startpos), int(endpos)
+            except (ValueError, TypeError) as e:
+                raise ValueError(f"Invalid region format in line: {line}") from e
             intervals.add(f"{cont}:{startpos}-{endpos}")
     regions = dict(zip(intervals, intervals))

Line range hint 114-127: Add error handling to mpileup rule

The pipe output in the mpileup rule could lead to silent failures if downstream processing fails. Consider adding error checking and logging.

 rule mpileup:
     input:
         bamlist = outdir + "/logs/samples.files",
         bam = bamlist,
         bai = collect("{bam}.bai", bam = bamlist),
         genome  = f"Genome/{bn}",
         genome_fai = f"Genome/{bn}.fai"
     output: 
         bcf = pipe(outdir + "/mpileup/{part}.mp.bcf"),
         logfile = temp(outdir + "/logs/{part}.mpileup.log")
     params:
         region = lambda wc: "-r " + regions[wc.part],
         extra = mp_extra
     container:
         None
     shell:
-        "bcftools mpileup --fasta-ref {input.genome} --bam-list {input.bamlist} --annotate AD --output-type b {params} > {output.bcf} 2> {output.logfile}"
+        """
+        set -o pipefail
+        bcftools mpileup --fasta-ref {input.genome} \
+            --bam-list {input.bamlist} \
+            --annotate AD \
+            --output-type b \
+            {params} > {output.bcf} 2> {output.logfile} || {{
+                echo "mpileup failed for region {wildcards.part}" >> {output.logfile}
+                exit 1
+        }}
+        """

Line range hint 246-276: Enhance workflow summary with more detailed information

The workflow summary could benefit from additional information about the execution environment and input validation results.

 rule workflow_summary:
     default_target: True
     input:
         vcf = collect(outdir + "/variants.{file}.bcf", file = ["raw"]),
         agg_log = outdir + "/logs/mpileup.log",
         reports = collect(outdir + "/reports/variants.{file}.html", file = ["raw", "normalized"]) if not skip_reports else []
     params:
         ploidy = f"--ploidy {ploidy}",
         populations = f"--populations {groupings}" if groupings else "--populations -"
     run:
-        summary = ["The harpy snp freebayes workflow ran using these parameters:"]
+        summary = ["# Harpy SNP Mpileup Workflow Summary"]
+        summary.append("\n## Environment Information")
+        summary.append(f"Execution timestamp: {datetime.datetime.now()}")
+        summary.append(f"Harpy version: {config.get('version', 'unknown')}")
+        
+        summary.append("\n## Input Validation")
+        summary.append(f"Number of BAM files: {len(bamlist)}")
+        summary.append(f"Number of intervals processed: {len(intervals)}")
+        
+        summary.append("\n## Workflow Parameters")
         summary.append(f"The provided genome: {bn}")
         if windowsize:
             summary.append(f"Size of intervals to split genome for variant calling: {windowsize}")
         else:
             summary.append(f"Genomic positions for which variants were called: {regioninput}")
+        
+        summary.append("\n## Command Parameters")
         mpileup = "The mpileup parameters:\n"
         mpileup += f"\tbcftools mpileup --fasta-ref GENOME --region REGION --bam-list BAMS --annotate AD --output-type b {mp_extra}"
         summary.append(mpileup)

Line range hint 1-276: Overall assessment: Approved with suggestions

The changes align well with the PR objectives of moving to conda-based installation. The code is well-structured and maintains good separation of concerns. Consider implementing the suggested improvements for input validation and error handling in future iterations.

For future improvements, consider:

  1. Adding a configuration validation step at workflow start
  2. Implementing retry logic for failed regions (see the sketch after this list)
  3. Adding progress tracking for long-running operations
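
On the retry point, a minimal sketch using Snakemake's built-in retries directive (supported in recent Snakemake releases; the rule body is illustrative):

rule call_region:
    input:
        "mpileup/{part}.bcf"
    output:
        "calls/{part}.bcf"
    retries: 2   # re-run a failed region up to two more times before giving up
    shell:
        "bcftools call -m -Ob -o {output} {input}"
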
harpy/preflight.py (1)

Line range hint 79-153: Architecture feedback: Clean conda integration

The changes cleanly integrate conda environment management into the existing preflight check system. The implementation:

  • Maintains separation of concerns
  • Uses consistent patterns across commands
  • Preserves the original functionality while adding conda support

Consider documenting the conda environment requirements in the docstrings to help users understand what environments are needed and why.
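
A hypothetical example of such a docstring note (command name and environment list invented for illustration):

def bam(inputs, output_dir):
    """
    Run validity checks on BAM files before downstream processing.

    Conda environments: requires the `qc` and `r` environments, which are
    written to output_dir/workflow/envs by create_conda_recipes().
    """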

harpy/snakefiles/phase.smk (2)

Line range hint 58-71: Add error handling for alignment file matching

The helper functions could benefit from better error handling when no matches or multiple matches are found.

Consider this improvement:

 def get_alignments(wildcards):
     """returns a list with the bam file for the sample based on wildcards.sample"""
     r = re.compile(fr".*/({wildcards.sample})\.(bam|sam)$", flags = re.IGNORECASE)
     aln = list(filter(r.match, bamlist))
+    if not aln:
+        raise ValueError(f"No alignment file found for sample {wildcards.sample}")
+    if len(aln) > 1:
+        raise ValueError(f"Multiple alignment files found for sample {wildcards.sample}: {aln}")
     return aln[0]

Line range hint 132-148: Add resource specifications for compute-intensive rules

Consider adding resource specifications for compute-intensive rules like process_genome to help with workflow scheduling and resource allocation.

Add resources like this:

 rule process_genome:
     input:
         genomefile
     output: 
         geno
     container:
         None
+    resources:
+        mem_mb=4000,
+        runtime=60
     shell: 
         "seqtk seq {input} > {output}"
harpy/snakefiles/sv_naibr_pop.smk (3)

18-19: Consider relocating conda environments directory

The conda environments directory (envdir) is currently placed under the output directory. This could lead to duplicate environments across different runs and waste disk space. Consider moving it to a central location, such as .conda/envs in the project root.

-envdir       = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+envdir       = os.path.join(os.getcwd(), ".conda", "envs")

Line range hint 283-311: Improve robustness and safety of workflow summary generation

Several issues need attention:

  1. Using os.system for file removal is unsafe and should be replaced with shutil
  2. String concatenation could be simplified using list operations
  3. Hardcoded paths should be derived from configuration

Here's a safer implementation:

-        os.system(f"rm -rf {outdir}/naibrlog")
+        import shutil
+        naibrlog_path = os.path.join(outdir, "naibrlog")
+        if os.path.exists(naibrlog_path):
+            shutil.rmtree(naibrlog_path)

         summary = ["The harpy sv naibr workflow ran using these parameters:"]
         summary.append(f"The provided genome: {bn}")
-        concat = "The alignments were concatenated using:\n"
-        concat += "\tconcatenate_bam.py -o groupname.bam -b samples.list"
-        summary.append(concat)
+        summary.append("The alignments were concatenated using:\n\tconcatenate_bam.py -o groupname.bam -b samples.list")

-        naibr = "naibr variant calling ran using these configurations:\n"
-        naibr += "\tbam_file=BAMFILE\n"
-        naibr += "\tprefix=PREFIX\n"
-        naibr += "\toutdir=Variants/naibr/PREFIX\n"
-        naibr += "\n\t".join([f"{k}={v}" for k,v in argdict.items()])
-        summary.append(naibr)
+        naibr_config = [
+            "naibr variant calling ran using these configurations:",
+            "\tbam_file=BAMFILE",
+            "\tprefix=PREFIX",
+            "\toutdir=Variants/naibr/PREFIX",
+            *[f"\t{k}={v}" for k,v in argdict.items()]
+        ]
+        summary.append("\n".join(naibr_config))

-        sm = "The Snakemake workflow was called via command line:\n"
-        sm = f"\t{config['workflow_call']}"
-        summary.append(sm)
+        summary.append(f"The Snakemake workflow was called via command line:\n\t{config['workflow_call']}")

Line range hint 1-311: Consider enhancing error handling across rules

While the conda integration looks good, consider adding error handling for common failure scenarios:

  1. Missing input files
  2. Failed conda environment creation
  3. NAIBR execution failures

Example approach for handling NAIBR failures:

# Add to the call_variants rule
onerror:
    logger.error(f"NAIBR execution failed for {wildcards.population}. Check {log} for details.")
    # Clean up partial outputs
    for f in [output.bedpe, output.refmt, output.vcf]:
        if os.path.exists(f):
            os.remove(f)
harpy/snakefiles/sv_leviathan_pop.smk (2)

Line range hint 123-123: Add validation for conda environment files

The workflow references several conda environment files but doesn't validate their existence. Consider adding an onstart check to ensure all required environment files exist.

Add this validation to the onstart block:

required_envs = ["variants.yaml", "align.yaml", "r.yaml"]
for env in required_envs:
    env_path = os.path.join(envdir, env)
    if not os.path.exists(env_path):
        raise FileNotFoundError(f"Required conda environment file not found: {env_path}")

Also applies to: 144-144, 186-186, 251-251, 267-267


Line range hint 6-11: Improve error handling for log files

Currently, log files are removed on both success and error, which might hinder debugging. Consider:

  1. Archiving logs on success instead of removing them
  2. Preserving logs on error for debugging
onstart:
    logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
onsuccess:
-    os.remove(logger.logfile)
+    if os.path.exists(logger.logfile):
+        archive_dir = os.path.join(outdir, "archived_logs")
+        os.makedirs(archive_dir, exist_ok=True)
+        os.rename(logger.logfile, os.path.join(archive_dir, f"success_{os.path.basename(logger.logfile)}"))
onerror:
-    os.remove(logger.logfile)
+    if os.path.exists(logger.logfile):
+        error_dir = os.path.join(outdir, "error_logs")
+        os.makedirs(error_dir, exist_ok=True)
+        os.rename(logger.logfile, os.path.join(error_dir, f"error_{os.path.basename(logger.logfile)}"))
harpy/snakefiles/align_strobealign.smk (2)

Line range hint 1-1: Consider standardizing environment management across rules.

The workflow currently mixes container-based and conda-based rules, with some rules having no explicit environment specification. Given the PR's focus on conda-based installations, consider standardizing the environment management approach across all rules.

Consider:

  1. Documenting the rationale for using different environments in different rules
  2. Standardizing on conda environments where possible
  3. Explicitly specifying environments for all rules

Also applies to: 17-17, 73-73, 89-89, 108-108, 156-156, 186-186, 205-205
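
As a point of comparison, the two styles this workflow currently mixes, sketched side by side (envdir is the variable defined at the top of the snakefile; the rule bodies are illustrative):

# conda-managed rule: the tool comes from the per-workflow environment
rule quality_report:
    input: "sample.fq.gz"
    output: "reports/sample.fastqc"
    conda: f"{envdir}/qc.yaml"
    shell: "falco --quiet -data-filename {output} {input}"

# plain-shell rule: core utilities only, container explicitly disabled
rule sort_counts:
    input: "counts.tsv"
    output: "counts.sorted.tsv"
    container: None
    shell: "sort -k1,1 {input} > {output}"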


Line range hint 207-249: Consider enhancing the workflow summary with environment information.

The workflow summary captures command-line parameters and workflow steps but could be enhanced to include information about the environments used (conda/container) and their locations.

Consider adding environment information to the summary:

     summary.append(duplicates)
+    environments = "Environment configuration:\n"
+    environments += f"\tConda environments directory: {envdir}\n"
+    environments += "\tContainer image: docker://pdimens/harpy:latest"
+    summary.append(environments)
     sm = "The Snakemake workflow was called via command line:\n"
harpy/snakefiles/simulate_snpindel.smk (1)

141-148: LGTM with suggestion: Consider adding error handling

The path handling and command construction are well-implemented. However, consider adding error handling for the simuG command execution.

-        "simuG -refseq {input.geno} -prefix {params.prefix} {params.snp} {params.indel} > {log}"
+        """
+        if ! simuG -refseq {input.geno} -prefix {params.prefix} {params.snp} {params.indel} > {log} 2>&1; then
+            echo "simuG failed. Check the log file at {log}" >&2
+            exit 1
+        fi
+        """

Also applies to: 152-154, 158-159

harpy/snakefiles/metassembly.smk (2)

Line range hint 1-24: Consider documenting conda environment setup in workflow summary

The switch to conda-based installation is a significant change. Consider enhancing the workflow summary to include:

  1. Required conda environment setup steps
  2. Minimum conda version requirements
  3. Instructions for environment creation

This documentation would help users understand the new conda-based setup and ensure reproducibility across different systems.


Line range hint 82-103: Add error handling for conda environment failures

The rules that depend on conda environments should include error handling for cases where environment creation or activation fails. Currently, only the BUSCO rule has explicit error handling.

Consider adding similar error handling to other conda-dependent rules. For example:

 rule spades_assembly:
     input:
         fastq_R1C = outdir + "/error_correction/corrected/input.R1.fq00.0_0.cor.fastq.gz",
         fastq_R2C = outdir + "/error_correction/corrected/input.R2.fq00.0_0.cor.fastq.gz",
         fastq_UNC = outdir + "/error_correction/corrected/input.R_unpaired00.0_0.cor.fastq.gz"
     output:
         f"{outdir}/spades_assembly/contigs.fasta" 
     params:
         outdir = outdir + "/spades_assembly",
         k = k_param,
         mem = max_mem // 1000,
         extra = extra
     log:
         outdir + "/logs/spades_assembly.log"
     threads:
         workflow.cores
     resources:
         mem_mb=max_mem
     conda:
         f"{envdir}/spades.yaml"
     container:
         None
     shell:
-        "metaspades.py -t {threads} -m {params.mem} -k {params.k} {params.extra} -1 {input.fastq_R1C} -2 {input.fastq_R2C} -s {input.fastq_UNC} -o {params.outdir} --only-assembler > {log}"
+        """
+        (metaspades.py -t {threads} -m {params.mem} -k {params.k} {params.extra} \
+            -1 {input.fastq_R1C} -2 {input.fastq_R2C} -s {input.fastq_UNC} \
+            -o {params.outdir} --only-assembler > {log} 2>&1) || \
+        (echo "Error: SPAdes assembly failed. Check conda environment and logs." >> {log} && exit 1)
+        """

Also applies to: 104-125, 126-146, 147-156, 157-170, 206-225, 226-251, 252-270, 271-280

harpy/snakefiles/align_ema.smk (2)

Line range hint 391-419: Consider using f-strings consistently in workflow_summary.

The workflow_summary rule mixes string concatenation styles. Consider using f-strings consistently for better readability.

Example improvement for the counts variable:

-        counts = "Barcodes were counted and validated with EMA using:\n"
-        counts += f"\tseqtk mergepe forward.fq.gz reverse.fq.gz | ema count {params.beadtech}"
+        counts = f"""Barcodes were counted and validated with EMA using:
+        seqtk mergepe forward.fq.gz reverse.fq.gz | ema count {params.beadtech}"""

Line range hint 183-196: Consider quoting variables in shell commands.

In the mark_duplicates and index_duplicates rules, consider quoting variables in shell commands to handle paths with spaces.

Example improvement:

         samtools collate -O -u {input.sam} |
             samtools fixmate -m -u - - |
-            samtools sort -T {params.tmpdir} -u --reference {input.genome} -l 0 -m {resources.mem_mb}M - |
-            samtools markdup -@ {threads} -S --barcode-tag BX -f {log} - {output}
-        rm -rf {params.tmpdir}
+            samtools sort -T "{params.tmpdir}" -u --reference "{input.genome}" -l 0 -m {resources.mem_mb}M - |
+            samtools markdup -@ {threads} -S --barcode-tag BX -f "{log}" - "{output}"
+        rm -rf "{params.tmpdir}"

Also applies to: 197-208

harpy/align.py (2)

Line range hint 116-127: Consider moving conda environments list to a module-level constant.

The conda environments list ["align", "r", "qc"] is duplicated across all three alignment commands. To improve maintainability and reduce duplication, consider moving it to a module-level constant.

+ # At the top of the file, with other imports
+ ALIGNMENT_CONDA_ENVS = ["align", "r", "qc"]

- conda_envs = ["align", "r", "qc"]
+ conda_envs = ALIGNMENT_CONDA_ENVS

Also applies to: 140-140


221-222: Remove unnecessary empty line.

The empty line between the conda environments list declaration and the configs dictionary is unnecessary and inconsistent with other commands.

 conda_envs = ["align", "r", "qc"]
-
 configs = {
harpy/simulate.py (1)

201-201: Consider adding error handling for create_conda_recipes calls

The create_conda_recipes calls could benefit from error handling to gracefully handle potential failures in conda environment setup.

Example improvement:

-    create_conda_recipes(output_dir, conda_envs)
+    try:
+        create_conda_recipes(output_dir, conda_envs)
+    except Exception as e:
+        print_error("conda setup", f"Failed to create conda recipes: {str(e)}")
+        sys.exit(1)

Also applies to: 338-338, 444-444, 563-563, 665-665

harpy/snakefiles/qc.smk (3)

Line range hint 47-70: Ensure correct expansion of params in the fastp shell command

In the fastp rule within the deconvolve branch, {params} is passed directly into the shell command. Snakemake expands {params} by joining every parameter value with spaces, so this runs, but it is fragile: it silently depends on declaration order, and disabled flags (empty strings) leave stray whitespace in the command. An explicit, named parameter string is more robust.

Proposed Fix:

Concatenate the parameter values into a single string and pass it to the shell command. Update the params section to include a fullparams string:

 params:
     minlen = f"--length_required {min_len}",
     maxlen = f"--max_len1 128,234",
     trim_adapters = "--detect_adapter_for_pe" if trimadapters else "--disable_adapter_trimming",
     dedup = "-D" if dedup else "",
     extra = extra,
+    # params entries cannot reference sibling keys, so rebuild from the module-level
+    # variables; filter(None, ...) drops flags that are disabled (empty strings)
+    fullparams = " ".join(filter(None, [
+        f"--length_required {min_len}",
+        "--max_len1 128,234",
+        "--detect_adapter_for_pe" if trimadapters else "--disable_adapter_trimming",
+        "-D" if dedup else "",
+        extra
+    ]))

 shell:
     """
-    fastp --trim_poly_g --cut_right {params} --thread {threads} -i {input.fw} -I {input.rv} --stdout -h {output.html} -j {output.json} -R "{wildcards.sample} QC Report" 2> {log.serr} > {output.fq}
+    fastp --trim_poly_g --cut_right {params.fullparams} --thread {threads} -i {input.fw} -I {input.rv} --stdout -h {output.html} -j {output.json} -R "{wildcards.sample} QC Report" 2> {log.serr} > {output.fq}
     """

This ensures that all parameters are correctly passed to fastp as a single string.


Line range hint 170-182: Fix uninterpolated variables in workflow_summary

Within the workflow_summary rule's run block, the deconvolution summary references decon_k, decon_w, decon_d, and decon_a inside a plain (non-f) string, so the placeholders are printed literally instead of being interpolated. These values are also only defined when deconvolve is enabled, so they are best passed in explicitly.

Proposed Fix:

Pass these variables into the workflow_summary rule using the params section:

 rule workflow_summary:
     default_target: True
     input:
         fq = collect(outdir + "/{sample}.{FR}.fq.gz", FR = ["R1", "R2"], sample = samplenames),
         bx_report = outdir + "/reports/barcode.summary.html" if not skip_reports else [],
         agg_report = outdir + "/reports/qc.report.html" if not skip_reports else []
     params:
         minlen = f"--length_required {min_len}",
         maxlen = f"--max_len1 {max_len}",
         trim_adapters = "--detect_adapter_for_pe" if trimadapters else "--disable_adapter_trimming",
         dedup = "-D" if dedup else "",
         extra = extra,
+        decon_k = decon_k if deconvolve else None,
+        decon_w = decon_w if deconvolve else None,
+        decon_d = decon_d if deconvolve else None,
+        decon_a = decon_a if deconvolve else None,
+        threads = workflow.cores
     run:
         summary = ["The harpy qc workflow ran using these parameters:"]
         fastp = "fastp ran using:\n"
         fastp += f"\tfastp --trim_poly_g --cut_right {params}"
         summary.append(fastp)
         if deconvolve:
             deconv = "Deconvolution occurred using QuickDeconvolution:\n"
-            deconv += "\tQuickDeconvolution -t threads -i infile.fq -o output.fq -k {decon_k} -w {decon_w} -d {decon_d} -a {decon_a}"
+            deconv += f"\tQuickDeconvolution -t {params.threads} -i infile.fq -o output.fq -k {params.decon_k} -w {params.decon_w} -d {params.decon_d} -a {params.decon_a}"
             summary.append(deconv)

This modification ensures that all values used within the run block are explicitly provided and correctly interpolated into the summary.


Line range hint 172-176: Use actual thread count in deconvolution summary

In the deconvolution summary within workflow_summary, the number of threads is listed as a string literal "threads" rather than the actual thread count.

Suggested Improvement:

Replace the string "threads" with the actual thread count:

-    deconv += "\tQuickDeconvolution -t threads -i infile.fq -o output.fq -k {params.decon_k} -w {params.decon_w} -d {params.decon_d} -a {params.decon_a}"
+    deconv += f"\tQuickDeconvolution -t {params.threads} -i infile.fq -o output.fq -k {params.decon_k} -w {params.decon_w} -d {params.decon_d} -a {params.decon_a}"

This provides accurate information in the summary about the resources used during deconvolution.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 19a8a43 and 7454316.

📒 Files selected for processing (40)
  • harpy/_conda.py (3 hunks)
  • harpy/_validations.py (0 hunks)
  • harpy/align.py (9 hunks)
  • harpy/assembly.py (3 hunks)
  • harpy/container.py (1 hunks)
  • harpy/deconvolve.py (2 hunks)
  • harpy/demultiplex.py (2 hunks)
  • harpy/impute.py (2 hunks)
  • harpy/metassembly.py (3 hunks)
  • harpy/phase.py (3 hunks)
  • harpy/preflight.py (2 hunks)
  • harpy/qc.py (2 hunks)
  • harpy/resume.py (2 hunks)
  • harpy/simulate.py (15 hunks)
  • harpy/snakefiles/align_bwa.smk (1 hunks)
  • harpy/snakefiles/align_ema.smk (1 hunks)
  • harpy/snakefiles/align_strobealign.smk (1 hunks)
  • harpy/snakefiles/assembly.smk (1 hunks)
  • harpy/snakefiles/containerize.smk (1 hunks)
  • harpy/snakefiles/deconvolve.smk (1 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (1 hunks)
  • harpy/snakefiles/impute.smk (1 hunks)
  • harpy/snakefiles/metassembly.smk (1 hunks)
  • harpy/snakefiles/phase.smk (1 hunks)
  • harpy/snakefiles/preflight_bam.smk (1 hunks)
  • harpy/snakefiles/preflight_fastq.smk (1 hunks)
  • harpy/snakefiles/qc.smk (1 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (1 hunks)
  • harpy/snakefiles/simulate_snpindel.smk (3 hunks)
  • harpy/snakefiles/simulate_variants.smk (3 hunks)
  • harpy/snakefiles/snp_freebayes.smk (1 hunks)
  • harpy/snakefiles/snp_mpileup.smk (1 hunks)
  • harpy/snakefiles/sv_leviathan.smk (1 hunks)
  • harpy/snakefiles/sv_leviathan_pop.smk (1 hunks)
  • harpy/snakefiles/sv_naibr.smk (1 hunks)
  • harpy/snakefiles/sv_naibr_phase.smk (1 hunks)
  • harpy/snakefiles/sv_naibr_pop.smk (1 hunks)
  • harpy/snakefiles/sv_naibr_pop_phase.smk (1 hunks)
  • harpy/snp.py (6 hunks)
  • harpy/sv.py (6 hunks)
💤 Files with no reviewable changes (1)
  • harpy/_validations.py
🔇 Additional comments (68)
harpy/container.py (1)

19-19: Verify container build completeness with new conda-based simuG.

The change to create_conda_recipes("container") aligns with the PR's objective to switch to conda-based simuG installation. However, we should verify that all necessary environments are properly configured.

Let's check the conda environment configurations and their usage:

harpy/snakefiles/containerize.smk (3)

11-53: LGTM! Consistent environment rule structure

The environment rules follow a consistent pattern and correctly use the new envdir variable. The structure is clean and maintainable.


5-8: Verify environment completeness for simuG integration

The change from 'snp' to 'variants' aligns with the PR objectives. Let's verify that all required environments for the conda-based simuG installation are included.

✅ Verification successful

Environment list is complete and correctly updated for simuG integration

The environment list in the rule all is complete and correctly matches the environment definitions in _conda.py. The change from 'snp' to 'variants' aligns with the codebase, where simuG (>1.0.0) is properly included in the 'simulations' environment, and variant-calling tools are grouped in the 'variants' environment. All nine environments listed in the rule (align, assembly, metassembly, phase, qc, r, simulations, stitch, variants) are defined in the codebase with their respective dependencies.
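For orientation, the environment registry in harpy/_conda.py presumably maps names to dependency lists along these lines. The environ name and the qc/variants pins come from the verifications in this review; the overall dict shape and the channel-free spellings are assumptions:

# assumed shape of the registry in harpy/_conda.py
environ = {
    "qc": ["falco=1.2.4", "fastp", "multiqc=1.25", "pysam=0.22", "quickdeconvolution"],
    "simulations": ["simug>1.0.0"],  # plus the other simulation dependencies
    "variants": ["freebayes=1.3.6", "bcftools=1.20", "leviathan", "naibr-plus"],
    # ...and align, assembly, metassembly, phase, r, spades, stitch
}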

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if all necessary environment files exist and contain required dependencies

# Check for environment files
echo "Checking for environment files..."
for env in align assembly metassembly phase qc r simulations stitch variants; do
  if [ -f "container/workflow/envs/${env}.yaml" ]; then
    echo "Found ${env}.yaml"
    # Check if simuG is included in variants.yaml
    if [ "$env" = "variants" ]; then
      echo "Checking for simuG in variants.yaml..."
      grep -A 5 "dependencies:" "container/workflow/envs/${env}.yaml"
    fi
  else
    echo "Missing ${env}.yaml"
  fi
done

Length of output: 1678


Script:

#!/bin/bash
# Let's check if the environment files are in a different location
echo "Searching for yaml/yml files..."
fd -e yaml -e yml

# Also check if there are any Snakefile rules that might indicate the correct path
echo -e "\nSearching for Snakefile rules with environment definitions..."
rg -A 3 "conda:" 

Length of output: 24504


Script:

#!/bin/bash
# Let's check the content of _conda.py to verify the environment definitions
echo "Checking environment definitions in _conda.py..."
cat harpy/_conda.py

# Also check if there are any other environment-related files
echo -e "\nChecking for other environment files..."
fd -t f -e yaml -e yml -e json "env|conda" --exclude "*.env"

Length of output: 5675


20-23: Verify the variants environment configuration

The new variants rule replaces the old snp rule. Let's ensure this change is reflected in the corresponding environment file and that it includes all necessary dependencies for simuG.

harpy/snakefiles/preflight_bam.smk (2)

Line range hint 1-103: Overall structure looks good

The workflow is well-organized and follows Snakemake best practices. The changes to environment management align with the PR's objective of improving conda-based installations.


18-18: Verify environment path changes across workflows

The environment directory path has been moved from a global .harpy_envs to a workflow-specific location. While this provides better isolation, we should verify that all related workflows are updated consistently.

Consider adding environment cleanup logic to the onsuccess handler to prevent accumulation of unused conda environments when running multiple workflows. For example:

import shutil  # rmtree is not imported by default; add this alongside the Snakefile's other imports

onsuccess:
    os.remove(logger.logfile)
    # Add cleanup of workflow-specific conda environments
    if os.path.exists(envdir):
        shutil.rmtree(envdir)
✅ Verification successful

Environment path changes are consistent across workflows

The verification shows that:

  1. All Snakemake workflow files consistently use the new environment path pattern: os.path.join(os.getcwd(), outdir, "workflow", "envs")
  2. The only remaining reference to .harpy_envs is in _printing.py, which is just an error message text and not a functional path.

The environment path changes have been properly propagated across all workflow files, ensuring consistent behavior. The suggested cleanup logic in the original review is still valid but optional.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to the old .harpy_envs path
# and verify consistency of the new environment path across workflow files

# Check for any remaining references to .harpy_envs
echo "Checking for old environment path references..."
rg -l "\.harpy_envs" 

# Check environment path construction in other Snakemake files
echo "Checking environment path construction in other Snakemake files..."
rg -A 2 "envdir\s*=.*" "*.smk"

Length of output: 368


Script:

#!/bin/bash
# Let's search more precisely for Snakemake files and environment path references

# Find all Snakemake files
echo "=== Finding all Snakemake files ==="
fd -e smk

# Check environment path construction in Snakemake files
echo -e "\n=== Checking environment paths in Snakemake files ==="
fd -e smk --exec rg -A 2 "envdir\s*=.*" {}

# Check for .harpy_envs references in _printing.py
echo -e "\n=== Checking .harpy_envs reference in _printing.py ==="
cat harpy/_printing.py | grep -A 2 -B 2 "\.harpy_envs"

Length of output: 5302

harpy/snakefiles/preflight_fastq.smk (3)

18-18: Verify environment directory creation

The environment directory path now includes the output directory, which is good for artifact containment. However, we should ensure this directory exists before it's used.

✅ Verification successful

Environment directory creation is properly handled

The environment directory {outdir}/workflow/envs is created by the _conda.py module with the exist_ok=True flag, ensuring the directory exists before use.
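A minimal sketch of that pattern, assuming the envdir convention used across the workflows:

import os

def ensure_envdir(outdir: str) -> str:
    """Create <outdir>/workflow/envs up front; exist_ok makes reruns safe."""
    envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
    os.makedirs(envdir, exist_ok=True)
    return envdir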

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the environment directory is created in other workflow files
rg -A 3 "os\.makedirs.*envs" 

Length of output: 233


Line range hint 37-46: Verify check_fastq.py script availability and functionality

The rules depend on check_fastq.py but we need to ensure:

  1. The script is available in PATH
  2. The script handles both gzipped and uncompressed FASTQ files

Additionally, setting container: None might affect reproducibility. Consider using the defined Docker container instead.

✅ Verification successful

Script is available and handles both gzipped and uncompressed FASTQ files correctly

The check_fastq.py script is properly located in harpy/bin/ and uses pysam.FastxFile, which handles both gzipped and uncompressed FASTQ files. The script performs comprehensive checks on FASTQ headers, including BX tag validation, SAM spec compliance, and tag positioning.
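To illustrate why pysam.FastxFile covers both cases, here is a stripped-down sketch of the kind of header check described above; it is not the actual harpy/bin/check_fastq.py logic, and the filename is illustrative:

import pysam

# FastxFile transparently reads gzipped and plain FASTQ
with pysam.FastxFile("sample.R1.fq.gz") as fq:
    for record in fq:
        comment = record.comment or ""  # header text after the read name
        if "BX:Z:" not in comment:
            print(f"{record.name}: no BX:Z tag in header")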

The only remaining concern is:

  • Consider using the defined Docker container instead of container: None to ensure reproducibility
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if check_fastq.py exists in the repository
fd "check_fastq.py$"

Length of output: 47


Script:

#!/bin/bash
# Check the content of check_fastq.py to verify FASTQ handling capabilities
cat harpy/bin/check_fastq.py

Length of output: 2090


Line range hint 62-89: LGTM! Verify R markdown template

The report generation and workflow summary rules are well-structured with proper logging and environment handling. Let's verify the R markdown template exists:

harpy/resume.py (3)

10-10: LGTM: Import changes align with conda management improvements

The switch from check_envdir to check_environments reflects the architectural improvement in conda environment management.


17-17: LGTM: CLI help text updated appropriately

The simplified help text for the --conda flag better reflects the new environment management approach.


41-45: Verify config validation and enhance error handling

While the new conda environment management logic is cleaner, consider adding validation for the conda_environments config key and explicit error handling.

     with open(f"{directory}/workflow/config.yaml", 'r', encoding="utf-8") as f:
         harpy_config = yaml.full_load(f)
+    if "conda_environments" not in harpy_config:
+        print_error("Invalid config", "Missing 'conda_environments' in config.yaml")
+        sys.exit(1)
     conda_envs = harpy_config["conda_environments"] 

Let's verify the config structure across the codebase:

harpy/snakefiles/deconvolve.smk (2)

18-18: LGTM! Path construction follows best practices.

The use of os.path.join() with os.getcwd() ensures proper path construction across different platforms.


18-18: Verify conda environment path accessibility.

Please ensure that the conda environment path will be accessible when the workflow is executed from different working directories.

✅ Verification successful

Let me try a different approach to verify this concern.


Let me check one more thing to understand how the output directory is handled in the workflow.


Based on the comprehensive analysis of the codebase, I can now provide the final response:

The environment path construction is correctly implemented and consistent across the workflow.

The current implementation using os.path.join(os.getcwd(), outdir, "workflow", "envs") is actually the correct approach because:

  1. The output_directory is consistently configured through the config dictionary across all Snakemake files
  2. Using os.getcwd() is intentional here as it ensures the environment path is always resolved relative to the workflow execution directory
  3. This pattern is uniformly implemented across all 23+ Snakemake files in the workflow, indicating it's a deliberate design choice
  4. The path construction follows Snakemake's recommended practice for handling workflow-specific conda environments
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify if similar environment path constructions exist in other workflow files
# and check if the environment directory is created in the expected location

# Check for similar environment path constructions in other workflow files
echo "Checking for environment path patterns in other Snakemake files..."
rg -l "envdir.*=.*os\.path\.join.*outdir.*workflow.*envs" --type snakemake

# Check if the qc.yaml environment file exists
echo "Checking for the referenced conda environment file..."
fd -p "workflow/envs/qc.yaml$"

Length of output: 403


Script:

#!/bin/bash
# Let's check for any environment path constructions and environment files

# Search for files with .smk extension
echo "Finding Snakemake files..."
fd -e smk

# Check for environment path constructions
echo -e "\nChecking environment path patterns..."
rg "envdir.*=.*os\.path\.join.*outdir.*workflow.*envs"

# Look for environment files
echo -e "\nLooking for environment files in the repository..."
fd -e yaml -e yml "workflow/envs"

# Check the content of deconvolve.smk
echo -e "\nChecking deconvolve.smk content..."
rg -A 5 -B 5 "envdir.*=.*os\.path\.join" harpy/snakefiles/deconvolve.smk

Length of output: 4088


Script:

#!/bin/bash
# Let's check how outdir is used and if there's a Snakefile that imports these modules

# Check for main Snakefile
echo "Looking for main Snakefile..."
fd Snakefile

# Check how outdir is configured
echo -e "\nChecking outdir configuration in Snakefiles..."
rg "output_directory" -A 2 -B 2

# Check for workflow configuration files
echo -e "\nLooking for workflow configuration files..."
fd -e yaml -e yml -e json "config"

Length of output: 15315

harpy/deconvolve.py (2)

83-85: LGTM! Verify create_conda_recipes implementation

The updated call to create_conda_recipes with explicit parameters improves configuration flexibility. The setup_only flag provides appropriate workflow control.

✅ Verification successful

Function signature matches across all usages

The verification shows that create_conda_recipes is properly defined in harpy/_conda.py with the signature (outdir: str, envs: list=None), which matches the usage in deconvolve.py. The function is consistently imported from ._conda module and called with appropriate parameters across all modules.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify create_conda_recipes implementation across the codebase
ast-grep --pattern 'def create_conda_recipes($output_dir, $conda_envs)'

Length of output: 73


Script:

#!/bin/bash
# Search for create_conda_recipes function definition with ripgrep for better context
rg "def create_conda_recipes" -A 5

# Also search for any imports or usage of create_conda_recipes
rg "create_conda_recipes"

Length of output: 2776


Line range hint 67-77: Verify conda environment configuration for simuG

The PR objectives mention switching to conda-based simuG installation, but only the "qc" environment is configured. Please verify if:

  1. simuG should be included in the "qc" environment
  2. A separate conda environment for simuG is needed
  3. The simuG configuration is handled elsewhere
harpy/_conda.py (4)

4-11: LGTM: Clean import additions and function signature update

The new imports and updated function signature provide better error handling and flexibility for environment management.


71-71: Skipping comment about simuG version pinning

A previous review has already addressed this concern.


89-105: LGTM: Well-structured recipe creation logic

The implementation includes proper error handling, clear error messages, and safe directory creation.


115-138: LGTM: Excellent environment validation implementation

The new check_environments function provides thorough validation with clear, user-friendly error reporting using rich tables.
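As a rough sketch of the reported behavior (the function name comes from this PR; the implementation details below are assumptions, not the actual code):

import os
import sys
from rich import print as rprint
from rich.table import Table

def check_environments(envdir: str, envs: list) -> None:
    # report any expected recipe files missing from envdir, then exit
    missing = [e for e in envs if not os.path.isfile(os.path.join(envdir, f"{e}.yaml"))]
    if not missing:
        return
    table = Table(title="Missing environment definitions")
    table.add_column("environment")
    table.add_column("expected file")
    for env in missing:
        table.add_row(env, os.path.join(envdir, f"{env}.yaml"))
    rprint(table)
    sys.exit(1)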

harpy/demultiplex.py (2)

85-85: LGTM! Config structure maintained.

The addition of conda environments to the configuration maintains the existing structure while properly integrating the new conda-based workflow requirements.


79-79: LGTM! Verify QC environment definition.

The addition of the conda environment list is appropriate. However, let's verify that the "qc" environment is properly defined in the conda recipes.

✅ Verification successful

QC environment is properly defined but needs to be created

The "qc" environment is correctly defined in the codebase. The environment definition exists in harpy/_conda.py with the following dependencies:

  • falco=1.2.4
  • fastp
  • multiqc=1.25
  • pysam=0.22
  • quickdeconvolution

The environment will be created when needed through the workflow's conda integration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the QC environment definition exists
# Look for environment.yaml or similar files defining the QC environment
fd -g "*environment*.y*ml" | xargs grep -l "name:.*qc"

Length of output: 56


Script:

#!/bin/bash
# Let's try a broader search for conda environment definitions
fd -e yaml -e yml | xargs grep -l "name:.*qc"

# Also search for any files that might define conda environments
fd -e yaml -e yml

# And let's check the content of any conda-related files
rg -l "conda" 

Length of output: 1400


Script:

#!/bin/bash
# Let's check the content of the conda-related YAML files and _conda.py
cat resources/harpy.yaml resources/meta.yaml
cat harpy/_conda.py

# Also check the QC-related files for conda environment usage
cat harpy/qc.py
cat harpy/snakefiles/qc.smk

Length of output: 21679

harpy/metassembly.py (1)

Line range hint 72-98: Verify conda environment usage across workflow files.

Let's ensure the conda environment names are consistently used across the workflow files and that all necessary tools are available in these environments.

✅ Verification successful

Based on the shell script results, I can now generate the final response:

Conda environments are correctly configured and consistently used across the workflow

The verification shows:

  • The conda environments defined in metassembly.py (align, assembly, metassembly, qc, spades) are properly set up in _conda.py
  • Required tools are correctly distributed:
    • spades environment: cloudspades
    • align environment: bwa, samtools
    • qc environment: quast, multiqc
  • The workflow files consistently use these environments and their respective tools
  • The environment setup is properly handled by create_conda_recipes() function
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for conda environment references in workflow files

# Search for environment references in workflow files
echo "Checking environment references in workflow files..."
rg -l "conda:" . | while read -r file; do
  echo "=== $file ==="
  rg "conda:.*(?:align|assembly|metassembly|qc|spades)" "$file" || true
done

# Search for tool usage to ensure they're in correct environments
echo -e "\nChecking tool usage in workflow files..."
rg -l "spades|bwa|samtools|quast" . | while read -r file; do
  echo "=== $file ==="
  rg "(spades|bwa|samtools|quast)" "$file" || true
done

Length of output: 33036

harpy/qc.py (2)

97-97: LGTM!

The addition of conda environments to the config dictionary is well-structured and consistent with the conda-based installation approach.


104-104: Verify create_conda_recipes implementation.

The function call looks correct, but let's ensure the create_conda_recipes function in _conda.py properly handles these parameters.

✅ Verification successful

Let me gather more information about the function implementation to ensure it properly handles the parameters.


Let me get the actual implementation of the function to verify it properly handles both parameters.


Function implementation verified and properly handles parameters

The create_conda_recipes function in _conda.py correctly handles both parameters:

  • outdir: str - Used to create the conda environment files in {outdir}/workflow/envs/
  • envs: list (optional) - List of environment names to create, defaults to all environments if None

The function is properly imported and consistently used across multiple modules with the same parameter pattern.
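For reference, calls matching the verified signature (the output path here is illustrative):

from harpy._conda import create_conda_recipes

# writes qc.yaml (and only qc.yaml) into QC/output/workflow/envs/
create_conda_recipes("QC/output", ["qc"])

# omitting envs emits recipes for every defined environment,
# the pattern container.py uses via create_conda_recipes("container")
create_conda_recipes("QC/output")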

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify create_conda_recipes function implementation
# Look for the function definition and its usage

# Search for the function definition
echo "Searching for create_conda_recipes function definition:"
ast-grep --pattern 'def create_conda_recipes($_, $_):
  $$$
'

# Search for other usages of the function
echo "Searching for other create_conda_recipes usages:"
rg "create_conda_recipes\(" -A 2

Length of output: 3784


Script:

#!/bin/bash
# Get the full implementation of create_conda_recipes function
ast-grep --pattern 'def create_conda_recipes($_, $_):
  $$$
}' harpy/_conda.py

# Also check if there are any imports or dependencies related to this function
rg "from.*_conda.*import create_conda_recipes|import.*_conda" -A 1

Length of output: 1740


Script:

#!/bin/bash
# Get the content of _conda.py to see the full implementation
cat harpy/_conda.py

Length of output: 5411

harpy/impute.py (1)

Line range hint 85-106: Verify conda environment consistency across codebase.

Let's ensure the conda environment configuration is consistent with other workflow files.

✅ Verification successful

Conda environments are consistently defined and used across the codebase

The verification shows that:

  • The create_conda_recipes function is used consistently across all workflow files
  • Each workflow defines its specific conda environments based on its requirements
  • The impute workflow's conda environments ["r", "stitch"] are unique to its functionality and don't conflict with other workflows
  • The resume functionality correctly reads conda environments from the config file
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other create_conda_recipes calls and conda environment definitions
echo "Checking create_conda_recipes usage:"
rg "create_conda_recipes\(" --type py -A 2

echo -e "\nChecking conda environment definitions:"
rg "conda_envs.*=.*\[.*\]" --type py

Length of output: 4852

harpy/snakefiles/assembly.smk (1)

16-16: Verify conda environment resolution with the new path structure.

The environment directory path has been updated to include the output directory and a standardized "workflow/envs" subdirectory structure. While this change aligns with the PR's objective of moving to conda-based installation, we should verify that conda can properly resolve environments in this new location.

✅ Verification successful

Environment directory path change is safe and follows standard practice

The change to include outdir in the environment directory path (os.path.join(os.getcwd(), outdir, "workflow", "envs")) is safe and follows Snakemake's best practices for workflow organization. The evidence shows:

  1. The environment directory is consistently referenced across multiple Snakemake files using the envdir variable
  2. The path structure workflow/envs follows Snakemake's recommended workflow organization
  3. The conda environments are properly referenced in rules using f-strings (e.g., f"{envdir}/assembly.yaml")
  4. The environment files are expected to be created during workflow execution, which explains why they're not present in the repository
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and structure of conda environment files

# Check if the environment files exist in any output directories
fd -e yaml -e yml . | rg "workflow/envs/(assembly|qc)\.ya?ml$"

# Verify the content of these environment files to ensure they include necessary dependencies
fd -e yaml -e yml . | rg "workflow/envs/(assembly|qc)\.ya?ml$" -l | xargs cat 2>/dev/null | rg "dependencies:"

Length of output: 163


Script:

#!/bin/bash
# Let's check for any environment files in the repository and their locations
fd -e yaml -e yml . | sort

# Check if there are any references to conda environments in Snakemake files
fd -e smk . | xargs rg -A 2 -B 2 'conda:'

# Check how envdir is used in Snakemake files
fd -e smk . | xargs rg -A 2 -B 2 'envdir'

# Look for any conda-related configuration
fd 'conda|environment' . | grep -i 'conda\|environment'

Length of output: 62550


Script:

#!/bin/bash
# Let's check if the environment directory exists in the expected location
fd -t d "workflow/envs" .

# Check for environment files in resources directory
fd -e yaml -e yml . resources/ | grep -i "env"

# Check if there are any environment files in the repository root
ls -la *.yaml *.yml 2>/dev/null

# Check for any environment-related configuration in the config files
fd -e yaml -e yml . | xargs grep -l "workflow/envs\|conda"

Length of output: 200

harpy/snakefiles/simulate_variants.smk (2)

143-147: LGTM! Input paths correctly updated

The workflow summary rule has been properly updated to reflect the new file naming convention while maintaining the existing conditional logic for heterozygosity.


15-15: Verify environment directory path construction

The environment directory path is now constructed relative to the output directory. While the path construction is correct, ensure that:

  1. The output directory exists before the workflow starts
  2. The workflow/envs subdirectory structure is created
✅ Verification successful

Environment directory path construction is correct and consistent

The path construction for the environment directory is correct and consistent with the codebase's design. The evidence shows that:

  1. The same path construction pattern (os.path.join(os.getcwd(), outdir, "workflow", "envs")) is used consistently across multiple Snakemake workflow files
  2. The environment directory is managed by Harpy's conda environment handling system through create_conda_recipes() function
  3. The conda environment files (*.yaml) are properly referenced in the workflow files
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if similar path construction exists in other workflow files
rg -A 1 "envdir.*=.*os\.path\.join.*output" --type snakemake

Length of output: 96


Script:

#!/bin/bash
# Check if similar path construction exists in other Snakemake files
rg -A 1 "envdir.*=.*os\.path\.join.*output"

# Find all Snakemake files to understand the pattern
fd "Snakefile|\.smk$"

# Check the content of the specific file to understand the context better
cat harpy/snakefiles/simulate_variants.smk

Length of output: 7721


Script:

#!/bin/bash
# Check if the envs directory is created by other means
rg -A 1 "mkdir.*workflow/envs|workflow/envs.*directory"

# Check if conda env files exist and their location
fd "\.yaml$" harpy

# Check if there are any other environment directory references
rg "envdir|conda.*env"

Length of output: 18288

harpy/phase.py (1)

108-108: Verify create_conda_recipes implementation.

The updated function call aligns with the new conda-based installation approach. Let's verify the implementation of this function to ensure proper environment setup.

✅ Verification successful

Let me analyze the implementation further to ensure proper environment setup.


Let me get the complete implementation to verify the environment setup.


Function implementation verified and working as expected

The create_conda_recipes function is properly implemented with:

  • Correct parameter handling (output directory and optional environment list)
  • Comprehensive environment definitions for different workflows
  • Proper error handling for invalid environment names
  • Creation of environment YAML files with correct structure
  • Special handling for SPAdes environment with post-deployment script

The function call in harpy/phase.py is consistent with the implementation and correctly passes the required parameters.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for create_conda_recipes implementation
ast-grep --pattern 'def create_conda_recipes($output_dir, $conda_envs)' 

# Check for any error handling around this function call
rg -A 5 'create_conda_recipes\(' 

Length of output: 9174


Script:

#!/bin/bash
# Get the full implementation of create_conda_recipes
ast-grep --pattern 'def create_conda_recipes($outdir, $envs) {
  $$$
}'

# Check if there's any error handling in _conda.py
rg -A 10 'create_conda_recipes' harpy/_conda.py

Length of output: 516


Script:

#!/bin/bash
# Get the full file content to see the complete implementation
cat harpy/_conda.py

# Check if there are any related utility functions
rg -g '!*.pyc' -g '!*.yaml' 'def.*conda' harpy/

Length of output: 15199

harpy/snakefiles/snp_freebayes.smk (1)

16-17: 🛠️ Refactor suggestion

Consider more robust path handling for environment directory.

While moving to configuration-based output directory is good, the environment directory construction has potential issues:

  1. Using os.getcwd() makes the path sensitive to the working directory at runtime.
  2. Nesting conda environments deep in the output directory (outdir/workflow/envs) might not be the best structure.

Consider this alternative approach:

-envdir      = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+# Option 1: Use workflow.basedir (Snakemake's recommended way)
+envdir      = os.path.join(workflow.basedir, "envs")
+
+# Option 2: If environments must be in output directory, use absolute paths
+envdir      = os.path.abspath(os.path.join(outdir, "workflow", "envs"))

Let's verify the current usage of environment files:

harpy/snakefiles/sv_leviathan.smk (2)

Line range hint 64-77: LGTM! Well-structured rules for barcode indexing and variant calling

The implementation of both rules is solid with:

  • Proper dependency management
  • Efficient use of temporary files
  • Good thread allocation
  • Comprehensive logging

Also applies to: 91-116


Line range hint 71-71: Verify conda environment files existence and contents

Let's ensure all required conda environment files are present and properly configured.

Also applies to: 106-106

✅ Verification successful

Conda environments are managed programmatically through _conda.py

The verification reveals that this project uses a programmatic approach to conda environment management rather than static YAML files. The environments referenced in the Snakefiles (align.yaml, variants.yaml, etc.) are generated dynamically through Python code, likely during runtime. This is a valid and maintainable approach to conda environment management.

Key findings:

  • Environment references in Snakefiles point to dynamically generated environments
  • The project uses conda/mamba for package management as documented in README.md
  • Environment definitions are handled programmatically through harpy/_conda.py
  • The only YAML files found (harpy.yaml and meta.yaml) are in resources/ and likely serve other purposes
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for conda environment files and their contents

# Find all conda environment references in the workflow
echo "Checking conda environment references in workflow:"
rg -A 1 'conda:' 

# Check if the environment files exist
echo -e "\nChecking for environment files:"
fd -e yaml . | grep -i "envs"

# Display contents of variant environment file if found
echo -e "\nChecking variants.yaml contents:"
fd -e yaml . | grep -i "variants.yaml" -A 5

Length of output: 12870


Script:

#!/bin/bash
# Let's check for the conda environment files in the repository

# First, let's identify all unique environment files referenced
echo "Unique environment files referenced:"
rg "f\"\{envdir\}/(.*?)\.yaml\"" -o --no-filename | sort -u | sed 's/f"{envdir}\///g' | sed 's/"}//g'

# Now let's check if these files exist in the repository
echo -e "\nSearching for environment files in the repository:"
fd "\.yaml$" harpy/envs/

# If no files found in harpy/envs, let's search in other common locations
echo -e "\nSearching in other potential locations:"
fd "\.yaml$" envs/
fd "\.yaml$" environment/

Length of output: 897


Script:

#!/bin/bash
# Let's check if conda environment definitions are stored in _conda.py
echo "Checking conda environment definitions in _conda.py:"
rg "def get_.*_env" harpy/_conda.py -A 2

# Let's also check if there are any YAML files in the repository that might contain environment definitions
echo -e "\nSearching for any YAML files in the repository:"
fd -e yml -e yaml .

# Check for any documentation about conda environments
echo -e "\nChecking documentation about conda environments:"
rg -i "conda.*environment" README.md -A 2

Length of output: 1081

harpy/snakefiles/sv_naibr.smk (1)

17-18: Ensure output directory exists before accessing envdir path

The envdir path now includes subdirectories under the output directory. While this provides better organization, we should ensure these directories exist to prevent potential issues.

Let's verify if the workflow creates these directories:

Consider adding a directory creation step in the onstart directive:

 onstart:
     logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
+    os.makedirs(os.path.join(outdir, "workflow", "envs"), exist_ok=True)
harpy/assembly.py (1)

Line range hint 88-115: LGTM! Configuration structure is well-organized

The configuration dictionary is well-structured and properly integrates the conda environments while maintaining all necessary parameters for the assembly workflow.

harpy/snakefiles/simulate_linkedreads.smk (2)

Line range hint 1-218: LGTM! The changes successfully implement conda-based installation

The changes effectively implement the transition to conda-based installation while maintaining the workflow's functionality. The introduction of the envdir variable provides a clean way to manage conda environments.


Line range hint 1-1: Consider standardizing environment management

The workflow mixes containerized and conda environments:

  • Some rules use container: None
  • Others use conda environments
  • The top-level container is defined but not consistently used

Consider standardizing the approach across all rules for better maintainability.

Also applies to: 73-73, 89-89, 108-108, 127-127, 147-147, 164-164

harpy/preflight.py (3)

141-153: LGTM: Consistent implementation across commands

The changes to conda environment management are consistently implemented between the fastq and bam commands, following good software engineering practices.


91-91: Verify create_conda_recipes parameter changes

The function call has been updated to pass parameters. Let's verify this change is consistent across the codebase.

✅ Verification successful

Let me check the implementation of create_conda_recipes to understand the expected parameters.


Let me try a different approach to find the function definition using ripgrep.


Based on the function definition and all its usage across the codebase, I can now provide the final response:

Function parameters are consistent with definition

The create_conda_recipes function is defined with two parameters: outdir: str and an optional envs: list=None. From the previous search results, all calls to this function across the codebase follow one of two valid patterns:

  • Two parameters: create_conda_recipes(output_dir, conda_envs)
  • Single parameter: create_conda_recipes("container") (only in container.py)

The call in the reviewed file create_conda_recipes(output_dir, conda_envs) matches the expected signature.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for all create_conda_recipes calls
ast-grep --pattern 'create_conda_recipes($$$)'

Length of output: 1639


Script:

#!/bin/bash
# Search for create_conda_recipes definition
ast-grep --pattern 'def create_conda_recipes($$$)'

Length of output: 52


Script:

#!/bin/bash
# Search for create_conda_recipes definition using ripgrep
rg "def create_conda_recipes" -A 3

Length of output: 267


79-85: Consider adding simuG to conda environments

The conda environments list only includes "r", but according to the PR objectives, this change is meant to support simuG through conda installation. Consider adding the simuG environment if required.
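If simuG does turn out to be needed here, the fix would presumably be a one-line change, since the simulations environment is the one that carries simuG (per the containerize verification earlier in this review):

conda_envs = ["r", "simulations"]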

harpy/snakefiles/align_bwa.smk (1)

Line range hint 72-72: Verify conda environment files and their dependencies

The following conda environment files are referenced in the workflow:

  • align.yaml: Used by bwa_index and align rules
  • r.yaml: Used by sample_reports and barcode_report rules
  • qc.yaml: Used by samtools_report rule

Let's verify the environment files and their contents:

Also applies to: 108-108, 185-185, 246-246, 262-262

✅ Verification successful

Let's try another verification strategy by directly looking for the environment files in their typical locations and checking their contents:


Let's check the repository structure to find where these environment files are actually located, since we see references to {envdir} in the Snakefile:


Environment files are dynamically created at runtime

Based on the verification results:

  1. The environment files (align.yaml, r.yaml, qc.yaml) are not present in the repository
  2. The envdir is dynamically set to {outdir}/workflow/envs at runtime, where outdir is provided during workflow execution
  3. This is a common pattern in Snakemake workflows where environment files are generated during workflow initialization

The absence of these files in the repository is not an issue because:

  • The main package dependencies are properly defined in resources/harpy.yaml and resources/meta.yaml
  • The workflow-specific environment files will be created in the workflow/envs directory when the workflow is executed
  • This is a standard Snakemake practice for managing workflow-specific conda environments
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check conda environment files for required dependencies

# Function to check yaml file content
check_yaml() {
    local file=$1
    echo "Checking $file:"
    if [ -f "$file" ]; then
        # Look for key dependencies
        case "$file" in
            *align.yaml)
                echo "Looking for BWA in align.yaml:"
                rg -i "bwa" "$file"
                ;;
            *r.yaml)
                echo "Looking for R packages in r.yaml:"
                rg -i "^[[:space:]]*-[[:space:]]*r-" "$file"
                ;;
            *qc.yaml)
                echo "Looking for MultiQC in qc.yaml:"
                rg -i "multiqc" "$file"
                ;;
        esac
    else
        echo "File not found"
    fi
    echo
}

# Search for environment files
for env in align r qc; do
    fd -e yaml -e yml "$env"
done | while read -r file; do
    check_yaml "$file"
done

Length of output: 371


Script:

#!/bin/bash
# Check common locations for conda environment files
echo "Checking for environment files in common locations:"
fd -t f ".*\.ya?ml$" envs/ conda/ environment/ .

echo -e "\nChecking content of any found files:"
for file in $(fd -t f ".*\.ya?ml$" envs/ conda/ environment/ .); do
    echo -e "\n=== Content of $file ==="
    cat "$file"
done

echo -e "\nChecking Snakefile for conda directive paths:"
rg -A 1 "conda:" "harpy/snakefiles/align_bwa.smk"

Length of output: 3621


Script:

#!/bin/bash
# Look for environment files and envdir definition
echo "Looking for environment files:"
fd -t f -e yaml -e yml

echo -e "\nSearching for envdir definition:"
rg "envdir\s*=" harpy/

echo -e "\nChecking harpy directory structure:"
fd . harpy/ -t d

Length of output: 2831

harpy/snakefiles/phase.smk (3)

Line range hint 42-56: LGTM! Robust genome file handling implementation.

The implementation handles both gzipped and uncompressed genome files correctly, and the conditional logic for indels processing is well-structured.


Line range hint 177-190: LGTM! Well-structured conda environment usage

The rules consistently use conda environments with the updated path structure, which aligns well with the PR's objective of moving to conda-based installation.

Also applies to: 191-204, 205-221


24-24: Verify environment directory path consistency across workflows

The change to use outdir in the environment path is good for organization. However, let's ensure this path structure is consistently applied across all workflow files.

✅ Verification successful

Environment directory path is consistent across workflow files

All Snakemake workflow files in the repository use the same environment directory path construction os.path.join(os.getcwd(), outdir, "workflow", "envs"), except for containerize.smk which intentionally uses a different path for container builds.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check environment directory path construction across workflow files
# Expected: All workflow files should use the same path structure

# Search for environment directory path construction in workflow files
rg -A 1 "envdir\s*=" "harpy/snakefiles/"

Length of output: 4296

harpy/snakefiles/sv_naibr_pop.smk (1)

Line range hint 134-149: Enhance environment management and thread allocation

A few suggestions to improve robustness:

  1. The conda environment file (variants.yaml) should be version-pinned to ensure reproducibility
  2. The thread count calculation should account for system limits
-    threads:
-        min(10, workflow.cores - 1)
+    threads:
+        lambda wildcards, threads: min(10, threads)

Also, please verify that the variants.yaml environment file exists and contains version-pinned dependencies.

harpy/snakefiles/align_strobealign.smk (1)

17-17: LGTM! Verify conda environment creation path.

The change to use os.path.join for constructing the conda environment path is a good practice. The new structure ensures environments are created under the output directory, which aligns with the PR's goal of better managing tool installations.

Let's verify that all conda environment references in the workflow will work with the new path:

✅ Verification successful

Path construction for conda environments is consistent across workflows

The verification shows that:

  1. The new envdir path construction using os.path.join(os.getcwd(), outdir, "workflow", "envs") is consistently implemented across all Snakemake workflow files, except for containerize.smk which intentionally uses a different path for container-specific environments.
  2. The conda environment references are properly managed through yaml files, as evidenced by their presence in the workflow files.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify conda environment references in the workflow
# Check for conda environment yaml files referenced in the workflow
rg -l 'conda:.*yaml' .

# Check if the new envdir path structure is consistently used across files
rg 'envdir.*=.*os\.path\.join.*workflow.*envs'

Length of output: 2533

harpy/snakefiles/impute.smk (3)

Line range hint 1-277: LGTM! Well-structured workflow with clear documentation

The workflow maintains a clear structure with distinct phases for preprocessing, imputation, and reporting. The workflow summary provides comprehensive documentation of all parameters and steps. The change to conda-based installation is cleanly integrated without affecting the core workflow logic.


22-22: Verify environment directory setup for conda-based installation

The change to store conda environments under the workflow output directory ({outdir}/workflow/envs) aligns with the PR's goal of switching to conda-based installation. However, we need to ensure proper initialization of this directory structure.

Let's verify the environment setup:

✅ Verification successful

Based on the inspection of impute.smk, I can now generate the final response:

The environment directory change is properly integrated with conda usage

The change to envdir is verified to be correct and well-integrated:

  • The file properly uses conda environments in the impute and impute_reports rules
  • The environment setup is handled by the _conda.py module which creates the directory structure (workflow/envs)
  • The environment path is consistently used across the codebase, as seen in other files like resume.py and _conda.py
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for environment files and directory structure

# Check if environment files exist in the repository
echo "Checking for environment files in repository..."
fd -e yaml -e yml . | grep -E "stitch.ya?ml|r.ya?ml"

# Check for environment setup in other Snakemake files
echo -e "\nChecking for environment directory setup in other files..."
rg -A 3 "workflow/envs" 

Length of output: 2762


Script:

#!/bin/bash
# Let's check the content of environment files and their setup

# Check for environment setup in the main Snakefile or related configuration
echo "Checking main Snakefile and configuration..."
rg -l "conda:" --type snakemake

# Check the conda environment initialization code
echo -e "\nChecking conda environment initialization..."
rg -A 5 "def setup_conda|def create_conda" 

# Look for stitch and R environment definitions
echo -e "\nLooking for stitch and R environment definitions..."
fd environ . | xargs cat

Length of output: 781


Script:

#!/bin/bash
# Let's check the specific file and its conda environment usage

# Check the impute.smk file for conda directives and environment usage
echo "Checking impute.smk for conda environment usage..."
cat harpy/snakefiles/impute.smk

# Check conda environment creation code specifically for environments
echo -e "\nChecking conda environment definitions..."
rg -A 10 'environ = {' harpy/_conda.py

Length of output: 10396


Line range hint 89-90: Verify conda environment specifications

The rules impute and impute_reports use conda environments from the new location. We should verify that these environment files contain the necessary dependencies, particularly for STITCH and R.

Let's check the environment specifications:

Also applies to: 171-172

harpy/snakefiles/simulate_snpindel.smk (5)

15-15: LGTM: Environment directory path properly configured

The environment directory path is correctly constructed using os.path.join and properly references the output directory structure.


97-100: LGTM: Improved output path handling and simuG invocation

The changes properly handle conditional outputs and use direct simuG invocation, aligning with the conda-based installation approach.

Also applies to: 109-109


113-116: LGTM: Improved haplotype output organization

The output paths are well-organized with clear haplotype separation and intuitive naming conventions.


160-186: See the earlier review comment about adding error handling to the simuG invocation


206-211: LGTM: Comprehensive workflow summary configuration

The input collection and parameter handling are well-implemented, covering all possible output scenarios.

Also applies to: 215-216

harpy/snakefiles/align_ema.smk (1)

28-28: LGTM! Verify consistency across workflow files.

The change to store conda environments within the output directory is a good practice as it:

  • Isolates environments per workflow run
  • Prevents conflicts between different runs
  • Keeps all workflow artifacts contained

Let's verify this change is consistent across other workflow files:

✅ Verification successful

Change is consistent across workflow files, with one exception

The verification shows that all workflow files in harpy/snakefiles/ use the same envdir path construction (os.path.join(os.getcwd(), outdir, "workflow", "envs")), except for containerize.smk which uses a different path (os.path.join(os.getcwd(), "container/workflow/envs")). This is expected since containerize.smk is a special workflow for container creation.

All conda environment references in the rules consistently use the envdir variable through f-strings (e.g., f"{envdir}/align.yaml"), indicating proper usage throughout the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if other workflow files use the same envdir structure
# Expected: All workflow files should use the same envdir path construction

# Search for envdir definitions in other workflow files
rg -A 1 'envdir\s*=' 'harpy/snakefiles/'

# Search for conda: directives to ensure all environment references are updated
rg -A 1 'conda:' 'harpy/snakefiles/'

Length of output: 14493

harpy/snp.py (2)

142-142: LGTM! Function call updated correctly.

The create_conda_recipes call has been properly updated to match its new signature, passing both the output directory and conda environments list.


220-220: LGTM! Verify variants environment dependencies.

The addition of both "r" and "variants" conda environments is appropriate for freebayes functionality and report generation.

Let's verify the variants environment dependencies:

Also applies to: 236-236, 248-248

✅ Verification successful

LGTM! The variants environment is properly configured.

The variants environment is correctly defined in harpy/_conda.py with all necessary dependencies:

  • freebayes=1.3.6
  • bcftools=1.20
  • leviathan
  • naibr-plus

These dependencies align with the workflow requirements in snp_freebayes.smk and are sufficient for variant calling functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for variants environment definition
fd -t f 'variants.ya?ml' workflow/envs/

# Check for freebayes-related commands in the workflow
rg 'freebayes' workflow/snp_freebayes.smk

Length of output: 252


Script:

#!/bin/bash
# Let's first check the repository structure to locate environment files
fd -t d 'env'

# Search for any yaml files that might contain environment definitions
fd -t f '\.ya?ml$'

# Search for freebayes-related files and configurations
rg -l 'freebayes'

Length of output: 217


Script:

#!/bin/bash
# Check the content of conda-related files for variants environment definition
cat harpy/_conda.py

# Check the snakefile for freebayes dependencies
cat harpy/snakefiles/snp_freebayes.smk

# Check if environment definitions exist in resources
cat resources/harpy.yaml

Length of output: 13136

harpy/sv.py (3)

Line range hint 108-118: LGTM: Appropriate conda environments for leviathan workflow

The conda environments ("align", "r", "variants") are well-chosen for the leviathan structural variant detection workflow, covering alignment operations, R-based reporting, and variant calling functionalities.


132-132: LGTM: Updated create_conda_recipes call with proper parameters

The function call has been properly updated to pass both the output directory and conda environments list, maintaining consistency with the new function signature.


Line range hint 210-221: LGTM: Well-chosen conda environments for naibr workflow

The conda environments are appropriately selected for the naibr workflow:

  • "phase" environment for handling phased BAM files (a requirement for naibr)
  • "r" for R-based reporting
  • "variants" for variant calling

The configuration and create_conda_recipes call are properly updated to use these environments.

Also applies to: 236-236

harpy/align.py (1)

Line range hint 317-329: LGTM! Consistent implementation across all alignment commands.

The conda environment configuration and usage is consistently implemented across all three alignment commands (BWA, EMA, and Strobe), making the code maintainable and predictable.

Also applies to: 342-342

harpy/simulate.py (3)

Line range hint 175-188: LGTM: Consistent conda environment setup across simulation functions

The changes standardize the conda environment setup by using a single "simulations" environment across all functions. This approach is more maintainable and aligns well with the PR's objective of switching to conda-based installation.

Also applies to: 302-327, 415-433, 531-552, 637-653


201-201: LGTM: Consistent update of create_conda_recipes calls

The changes consistently update all create_conda_recipes calls to include output_dir and conda_envs parameters, supporting the transition to conda-based installation.

Also applies to: 338-338, 444-444, 563-563, 665-665


Line range hint 1-677: Verify complete removal of simuG.pl dependencies

Let's ensure all references to simuG.pl have been removed from the codebase.

✅ Verification successful

The results show no matches for "simuG.pl" in the codebase. One additional verification checks for any remaining fetch_script calls in simulate.py to ensure completeness.


Successfully verified removal of simuG.pl

The verification confirms that:

  • No references to "simuG.pl" exist in the codebase
  • The only fetch_script call in simulate.py is for "HaploSim.pl", which is unrelated to simuG
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to simuG.pl

# Search for any remaining references to simuG.pl
echo "Searching for simuG.pl references..."
rg -i "simug\.pl"

# Search for fetch_script calls that might be related
echo "Searching for fetch_script calls..."
ast-grep --pattern 'fetch_script($_, "simuG.pl")'

Length of output: 235


Script:

#!/bin/bash
# Check for all fetch_script calls in simulate.py
ast-grep --pattern 'fetch_script($_,$_)' harpy/simulate.py

Length of output: 127

harpy/snakefiles/qc.smk (3)

18-18: Confirm the impact of changing the envdir location

Changing envdir to os.path.join(os.getcwd(), outdir, "workflow", "envs") moves the conda environments into the output directory specified by outdir. This could lead to the environments being recreated if the output directory is cleaned or deleted between runs, potentially affecting reproducibility and performance. Please verify that this change aligns with your workflow expectations and won't introduce unintended side effects.
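
If environment re-creation becomes a concern, one mitigation (sketched here, assuming the workflow is driven through Snakemake directly) is to pin environments to a stable location with Snakemake's --conda-prefix flag:

# Sketch: store conda envs outside the output tree so cleaning outputs
# does not force environment re-creation (the prefix path is arbitrary)
snakemake --use-conda --conda-prefix "$HOME/.harpy_conda" -s harpy/snakefiles/qc.smk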


Line range hint 88-94: Verify the necessity of setting container: None in extract_forward rule

In the extract_forward rule, you've set container: None, which overrides the global container setting for this rule. This means that the rule will execute outside of the specified container environment. Ensure that this is intentional and that the required tool seqtk is available in the execution environment.

If seqtk is installed on the host system and you intend to run this rule outside the container, no action is needed. Otherwise, consider removing container: None or specifying an appropriate container that includes seqtk.
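
A quick host-side check, sketched without any Harpy-specific paths:

# Verify seqtk is on PATH where the rule will actually run;
# seqtk with no arguments prints its usage text
command -v seqtk >/dev/null 2>&1 && seqtk 2>&1 | head -n 2 \
    || echo "seqtk not found on PATH" >&2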


Line range hint 150-155: Consider checking the availability of QuickDeconvolution in the conda environment

In the deconvolve rule, you are running QuickDeconvolution, but it's not specified whether this tool is included in the conda environment defined by qc.yaml.

Run the following script to verify if QuickDeconvolution is installed in the qc.yaml conda environment:
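
(A sketch of such a check; the grep target assumes QuickDeconvolution's declaration lives in harpy/_conda.py, per this PR's layout.)

#!/bin/bash
# Check whether QuickDeconvolution is declared in the qc environment definition
rg -i 'quickdeconvolution' harpy/_conda.py

# If a rendered qc.yaml exists in an output directory, inspect it as well
fd -g 'qc.yaml' | xargs -r rg -i 'quickdeconvolution'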

Ensure that QuickDeconvolution is included in your qc.yaml conda environment or adjust the environment as necessary.

harpy/snakefiles/sv_naibr_phase.smk (1)

17-18: LGTM: Proper initialization of outdir and envdir variables

Defining outdir and envdir using config["output_directory"] ensures that output and environment directories are correctly set up, enhancing the organization and consistency of the workflow.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
harpy/_conda.py (2)

101-103: Enhance error message for invalid environment names

The current error message could be more user-friendly. Consider including a newline and formatting for better readability.

-            sys.stderr.write(f"Key '{i}' is not an available conda environment name. The options are: " + ", ".join(environ.keys()))
+            sys.stderr.write(f"\nError: Environment '{i}' not found.\nAvailable environments:\n  - " + "\n  - ".join(sorted(environ.keys())) + "\n")

115-116: Enhance function documentation and type hints

Consider improving the docstring and adding type hints for better code maintainability.

-def check_environments(dirpath: str, envs: list) -> None:
-    """Check that the provided dir exists and contains the necessary environment definitions"""
+def check_environments(dirpath: str, envs: list[str]) -> None:
+    """Validate the presence of required conda environment definition files.
+    
+    Args:
+        dirpath: Path to the working directory containing workflow/envs
+        envs: List of environment names to verify
+        
+    Raises:
+        SystemExit: If the envs directory is missing or any required environment file is not found
+    """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7454316 and 4e854e4.

📒 Files selected for processing (6)
  • harpy/_cli_types_generic.py (1 hunks)
  • harpy/_conda.py (3 hunks)
  • harpy/demultiplex.py (4 hunks)
  • harpy/resume.py (2 hunks)
  • harpy/snp.py (8 hunks)
  • harpy/sv.py (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • harpy/demultiplex.py
  • harpy/resume.py
  • harpy/snp.py
  • harpy/sv.py
🔇 Additional comments (4)
harpy/_conda.py (2)

4-8: LGTM: Clean import additions for enhanced error handling

The new imports are well-organized and appropriate for the enhanced error reporting functionality.


109-113: Verify the hardcoded SPAdes version

The post-deployment script downloads SPAdes v4.0.0. Let's verify if this is still the recommended version.

✅ Verification successful

SPAdes version 4.0.0 is correctly specified

The hardcoded SPAdes version (4.0.0) in the post-deployment script matches the latest release version from the official repository, confirming it's up to date.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check latest SPAdes release version
gh api repos/ablab/spades/releases/latest --jq .tag_name

Length of output: 65

harpy/_cli_types_generic.py (2)

59-59: LGTM! Good standardization.

Adding the name attribute aligns this class with click's best practices and matches the pattern used by other parameter type classes in this file.


59-59: Verify simuG output compatibility

Since this PR changes simuG to a conda-based installation, please verify that the filedict in the InputFile class supports all file extensions that the conda version of simuG produces.

✅ Verification successful

File extension validation is compatible with simuG outputs

Based on the code inspection, the InputFile class's filedict already supports all necessary file extensions for simuG:

  • Input genome files: .fasta, .fas, .fa, .fna (covered by fasta extensions)
  • Output VCF files: .vcf, .vcf.gz (covered by vcf extensions)

The changes to conda-based simuG installation don't introduce any new file types or extensions that would require updates to the extension validation.
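
For illustration, the extension map described above might look roughly like this (a sketch of the shape, not the verbatim source):

# Hypothetical sketch of the extension groups validated by InputFile
filedict = {
    "fasta": [".fasta", ".fas", ".fa", ".fna"],
    "vcf": [".vcf", ".vcf.gz"],
}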

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if simuG conda package documentation mentions any additional file extensions
# that need to be supported by the InputFile class

# Search for file extension patterns in repository documentation
rg -i "simug.*\.(fa|fasta|vcf|gff|gz)" --type md

# Search for actual file extensions in test data or example outputs
fd -e fa -e fasta -e vcf -e gff -e gz | grep -i simug

Length of output: 106


Script:

#!/bin/bash
# Let's check the actual implementation of InputFile class and its file extension handling

# First, let's find the InputFile class implementation
ast-grep --pattern 'class InputFile'

# Then, let's look for any file extension validation or handling
rg -A 5 'filedict|extensions' harpy/_cli_types_generic.py

# Also check for any simuG-related file handling in tests
rg -i 'simug.*test' -A 5

Length of output: 1990


Script:

#!/bin/bash
# Let's check the PR description and documentation for any mentions of simuG output files
rg -i "simug.*output" 

# Also check for any documentation about simuG usage
fd README.md | xargs rg -i "simug"

# And check the actual changes in this PR related to simuG
rg -i "simug" -g '!{*.md,*.txt}'

Length of output: 1532

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
harpy/snakefiles/simulate_variants.smk (2)

73-84: Add file existence checks before renaming

The rename_haploid rule should verify file existence before renaming to prevent potential errors.

Consider adding validation:

 rule rename_haploid:
     input:
         fasta = f"{outdir}/{outprefix}.simseq.genome.fa",
         vcf = f"{outdir}/{outprefix}.refseq2simseq.{variant}.vcf",
         mapfile = f"{outdir}/{outprefix}.refseq2simseq.map.txt"
     output:
         fasta = f"{outdir}/{outprefix}.fasta",
         vcf = f"{outdir}/{outprefix}.{variant}.vcf",
         mapfile = f"{outdir}/{outprefix}.{variant}.map"
     run:
+        for i in input:
+            if not os.path.exists(i):
+                raise ValueError(f"Input file {i} does not exist")
         for i,j in zip(input, output):
             os.rename(i,j)

95-96: Consider consolidating directory creation

Directory creation for haplotypes could be moved to a separate rule or done at workflow start to improve organization.

Consider creating a setup rule:

rule setup_directories:
    output:
        directory(f"{outdir}/haplotype_1"),
        directory(f"{outdir}/haplotype_2")
    run:
        for dir in output:
            os.makedirs(dir, exist_ok=True)

Then update the diploid_variants rule to depend on these directories:

 rule diploid_variants:
     input:
+        dirs = [f"{outdir}/haplotype_1", f"{outdir}/haplotype_2"],
         vcf = f"{outdir}/{outprefix}.{variant}.vcf"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4e854e4 and 5e7d8c0.

📒 Files selected for processing (2)
  • harpy/snakefiles/simulate_snpindel.smk (3 hunks)
  • harpy/snakefiles/simulate_variants.smk (3 hunks)
🔇 Additional comments (8)
harpy/snakefiles/simulate_variants.smk (2)

71-71: Verify simuG conda package availability and version

The workflow now depends on simuG being available through conda. Let's verify its availability and version constraints.

Also applies to: 125-125

✅ Verification successful

simuG package is properly configured in the workflow

The verification shows that:

  1. simuG is available in bioconda with versions 1.0.0 and 1.0.1
  2. The package is properly specified in the workflow's conda dependencies as shown in harpy/_conda.py with the constraint "bioconda::simug>1.0.0"
  3. The simulation rules in simulate_variants.smk correctly use the conda environment specified in {envdir}/simulations.yaml
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if simuG is properly specified in conda environment file
rg -A 5 "simug" "${envdir}/simulations.yaml"

# Verify package availability in bioconda
curl -s "https://conda.anaconda.org/bioconda/noarch/repodata.json" | \
  jq '.packages | to_entries[] | select(.value.name=="simug") | .value.version'

Length of output: 262


Script:

#!/bin/bash
# Let's check for environment files in the repository
fd -g "*environment*.y*ml"

# Also check for any conda/mamba related config files
fd -g "*conda*.y*ml"

# Let's also check the Snakefile or other workflow files for environment specifications
rg -A 3 "conda:" 

Length of output: 24283


15-15: 🛠️ Refactor suggestion

Ensure output directory exists before accessing envdir path

The environment directory path now includes the output directory, but there's no explicit creation of this path. This could lead to errors if the directory doesn't exist when first accessed.

Consider adding directory creation similar to how it's done for haplotype directories:

 envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+os.makedirs(envdir, exist_ok=True)
harpy/snakefiles/simulate_snpindel.smk (6)

15-15: LGTM: Improved path handling

The use of os.path.join for path construction is a good practice, making the code more portable across different platforms.


97-100: LGTM: Improved output file naming

The output paths are now more intuitive and consistent with the PR's objective of making simuG names less unwieldy.


111-125: Add error handling for rename operations

The rename rules should include error handling as suggested in the previous review.

Also applies to: 178-187
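
A helper sketch of that guard (the function name is hypothetical; a rename rule's run block could inline the same loop or call rename_outputs(input, output)):

import os

def rename_outputs(inputs, outputs):
    """Rename simuG outputs, failing loudly if any expected file is absent."""
    for src, dst in zip(inputs, outputs):
        # Fail with a clear message instead of an opaque FileNotFoundError
        if not os.path.exists(src):
            raise ValueError(f"Expected simuG output {src} is missing")
        os.rename(src, dst)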


192-196: LGTM: Consistent path updates in workflow summary

The workflow summary has been properly updated to reflect the new file naming convention and path structure.

Also applies to: 200-201


154-155: Add VCF validation before operations

Consider adding VCF validation as suggested in the previous review.
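
One lightweight form of validation, sketched as a shell guard before the concatenation step (the file names are placeholders):

# Hypothetical pre-flight check: make sure each input VCF has a readable header
for v in sim.snp.vcf sim.indel.vcf; do
    bcftools head "$v" > /dev/null || { echo "Malformed or missing VCF: $v" >&2; exit 1; }
done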


109-109: Verify conda environment configuration

The switch to conda-based simuG installation requires proper environment setup.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
harpy/reports/align_stats.Rmd (5)

50-51: Enhance error handling using stop() instead of print() and knitr::knit_exit().

Consider using stop(paste("Input data file", infile, "is empty")) to provide clearer error messages and terminate the script appropriately.


58-65: Consolidate VALID_PRESENT and SKIPTEXT variables for simplicity.

Using a single variable to control conditional evaluations can simplify the code and improve readability. Consider using VALID_PRESENT alone to manage the evaluations and remove SKIPTEXT.


151-154: Streamline conditional text display based on VALID_PRESENT.

Instead of using both SKIPTEXT and VALID_PRESENT, consider using !VALID_PRESENT directly in the eval parameter to control the evaluation of the message, reducing redundancy.


193-193: Ensure consistent conditional evaluation of plots.

All plot code chunks are conditioned on VALID_PRESENT. Verify that this variable accurately represents when plots should be generated and that it aligns with data availability.

Also applies to: 218-218, 255-255, 301-301, 326-326


421-426: Enhance error handling using stop() instead of knitr::knit_exit().

Similar to earlier, using stop("Input data file is empty") provides clearer error messages and halts execution appropriately.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5e7d8c0 and 5e9fec8.

📒 Files selected for processing (5)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (0 hunks)
  • harpy/_printing.py (1 hunks)
  • harpy/reports/align_bxstats.Rmd (1 hunks)
  • harpy/reports/align_stats.Rmd (10 hunks)
💤 Files with no reviewable changes (1)
  • .gitignore
✅ Files skipped from review due to trivial changes (1)
  • harpy/reports/align_bxstats.Rmd
🔇 Additional comments (3)
harpy/_printing.py (1)

26-26: Verify the new conda environment path and consider making it more explicit.

While the path change from .harpy_envs to /workflows/envs aligns with the conda-based installation changes, let's ensure the new path is correct and clear to users.

Let's verify the existence and structure of the new conda environment path:

Consider making the path reference more explicit to avoid confusion:

-"There was an issue creating the software environment(s) necessary to run this workflow. If you manually edited the conda dependencies in [blue]/workflows/envs[/blue], see the error below for troubleshooting. If you didn't, it might be a bug or related to how your system is setup for Conda or Singularity environments and you should submit an issue on GitHub: [bold]https://github.com/pdimens/harpy/issues"
+"There was an issue creating the software environment(s) necessary to run this workflow. If you manually edited the conda dependencies in [blue]<harpy_root>/workflows/envs[/blue], see the error below for troubleshooting. If you didn't, it might be a bug or related to how your system is setup for Conda or Singularity environments and you should submit an issue on GitHub: [bold]https://github.com/pdimens/harpy/issues"
✅ Verification successful

Path reference is correct and consistent with codebase

The path /workflows/envs is consistently referenced across the codebase, particularly in _conda.py and _printing.py. The path is relative to Harpy's working directory and is used in error messages for missing environment files. The current implementation is correct and doesn't require modification as:

  1. The path format is consistent across error messages
  2. It's clearly used as a relative path in the codebase
  3. The error message provides sufficient context about conda dependencies
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and structure of the new conda environment path

# Check if the workflows/envs directory exists and list its contents
if [ -d "workflows/envs" ]; then
    echo "Directory exists. Contents:"
    ls -la workflows/envs/
else
    echo "Directory not found at workflows/envs"
fi

# Check for any references to the old path
rg -g '!.git' '.harpy_envs'

Length of output: 140


Script:

#!/bin/bash
# Let's search for any environment-related files and directories
fd -t d "env"

# Check for any conda or environment-related YAML files
fd -e yml -e yaml

# Search for references to environment paths in the codebase
rg -g '!.git' "workflows/envs|\.harpy_envs" -A 2 -B 2

Length of output: 1340

.github/workflows/tests.yml (1)

692-692: LGTM! Consistent naming pattern for simulation outputs.

The changes standardize the output file paths across all simulation types (SNPs, indels, inversions, CNVs, and translocations) by using the "haplotype_1" naming convention. This improves clarity and maintainability.

Let's verify that the new file paths are consistently used across the codebase:

Also applies to: 698-698, 704-704, 710-710

✅ Verification successful

Changes to simulation output paths are consistent with codebase structure

The changes in the workflow file align with the codebase's implementation:

  • The new haplotype_1 pattern is properly implemented in the simulation Snakefiles (harpy/snakefiles/simulate_variants.smk and harpy/snakefiles/simulate_snpindel.smk)
  • The references to "diploid" in the codebase are primarily in documentation, help text, and rule names, which don't affect the actual file paths
  • The workflow changes correctly reflect the file structure created by the Snakemake rules, which use the {outdir}/haplotype_1/{outprefix}.hap1.{variant}.vcf pattern
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to the old "diploid" naming pattern
# in simulation-related files and verify the new "haplotype_1" pattern is used consistently.

# Search for any remaining "diploid" references in simulation-related files
rg -i "diploid" --type python

# Search for the new "haplotype_1" pattern to ensure consistent usage
rg -i "haplotype_1" --type python

Length of output: 125


Script:

#!/bin/bash
# Let's try searching without file type restriction since Python might not be recognized
# Search for any remaining "diploid" references in simulation-related files
rg -i "diploid" 

# Search for the new "haplotype_1" pattern to ensure consistent usage
rg -i "haplotype_1"

# Also check for files containing simulation-related patterns
fd -t f "sim.*\.vcf"

Length of output: 4959

harpy/reports/align_stats.Rmd (1)

45-46: Good practice: Use dynamic parameter for molecule distance.

Replacing the hardcoded value with sm_moldist <- snakemake@params$mol_dist enhances flexibility and allows for dynamic configuration.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5e9fec8 and caa8df3.

📒 Files selected for processing (1)
  • harpy/snakefiles/simulate_variants.smk (3 hunks)
🔇 Additional comments (3)
harpy/snakefiles/simulate_variants.smk (3)

72-72: Add error handling for simuG command
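
For reference, the shape such a guard could take (a sketch; the flags follow the classic simuG interface and the file names are placeholders):

# Hypothetical: abort the rule if simuG exits non-zero or writes no genome
simuG -refseq genome.fa -snp_count 1000 -prefix sim \
    || { echo "simuG failed" >&2; exit 1; }
[ -s sim.simseq.genome.fa ] || { echo "simuG produced an empty simulated genome" >&2; exit 1; }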


91-92: LGTM: Well-organized directory structure

The new directory structure with separate haplotype directories is well-organized and follows Snakemake best practices:

  • Directories are created in the rule that needs them
  • Using exist_ok=True prevents race conditions
  • Clear separation of haplotype outputs

Also applies to: 96-97


142-146: LGTM: Comprehensive workflow summary

The workflow summary has been properly updated to reflect the new file structure and provides good traceability with:

  • Correct paths to all output files
  • Proper conditional collection of haplotype files
  • Clear separation between haploid and diploid outputs

@pdimens pdimens merged commit 46f5c80 into main Nov 26, 2024
20 checks passed
@pdimens pdimens deleted the pop_out_simuG branch November 26, 2024 02:20