add command-line level validations #159

Merged: 27 commits, Nov 5, 2024
Conversation

pdimens (Owner) commented Nov 4, 2024

This PR moves the file-extension validation to the command-line level. It also adds basic software-specific validation for all workflows that feature --extra-params.
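
For context, the general pattern behind these new types is a click.ParamType subclass whose convert() method rejects bad values before any workflow is launched. A minimal, hypothetical sketch (illustrative only; the real classes and extension lists live in harpy/_cli_types_generic.py and harpy/_cli_types_params.py):

import os
import click

class FastaFile(click.ParamType):
    """Hypothetical example type: accept only existing .fa/.fasta(.gz) paths."""
    name = "fasta_file"
    def convert(self, value, param, ctx):
        valid_ext = (".fa", ".fasta", ".fa.gz", ".fasta.gz")
        if not value.lower().endswith(valid_ext):
            self.fail(f"{value} does not end in one of {valid_ext}", param, ctx)
        if not os.path.exists(value):
            self.fail(f"{value} does not exist", param, ctx)
        return value

@click.command()
@click.option("--genome", type=FastaFile(), help="Reference genome")
def align(genome):
    click.echo(f"aligning against {genome}")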

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new custom parameter types for command-line options, enhancing input validation across various commands.
    • Added --fragment-density option in the ema command for read fragment density optimization.
    • Updated Snakemake workflows to include fragment density optimization in alignment processes.
    • Enhanced command-line interfaces for various commands by implementing specific parameter types for improved validation.
  • Bug Fixes

    • Improved error handling in file validation functions to provide clearer feedback on issues.
  • Documentation

    • Updated command documentation to reflect new parameter types and options, ensuring clarity for users.
  • Refactor

    • Streamlined command-line interface by replacing generic types with specific parameter types for better validation and type safety.

coderabbitai bot commented Nov 4, 2024

Walkthrough

This pull request includes significant updates to the command-line interface of the Harpy project. Key changes involve the introduction of new custom parameter types in multiple files, enhancing input validation for command-line options. The existing _cli_types.py file has been removed, and its functionality has been migrated to a new file called _cli_types_generic.py. Additionally, several validation functions have been modified or removed, and various workflows in Snakemake have been updated to reflect these changes. The overall structure and logic of the commands remain intact.

Changes

File path: change summary
.github/filters.yml: Updated reference from 'harpy/_cli_types.py' to 'harpy/_cli_types_generic.py'.
harpy/_cli_types.py: File deleted; contained classes for custom parameter types.
harpy/_cli_types_generic.py: New file added; contains classes for custom parameter types with enhanced validation mechanisms.
harpy/_cli_types_params.py: New file added; defines parameter types for various tools.
harpy/_validations.py: Removed validate_input_by_ext function; modified check_fasta with enhanced error handling.
harpy/align.py: Updated parameter types for commands; added --fragment-density flag to ema command.
harpy/assembly.py: Updated parameter types for command-line options to use specific parameter types.
harpy/deconvolve.py: Updated --hpc and --snakemake options to use HPCProfile and SnakemakeParams.
harpy/demultiplex.py: Updated --hpc and --snakemake options to use HPCProfile and SnakemakeParams.
harpy/hpc.py: Changed directory structure for generated config.yaml files; updated installation commands.
harpy/impute.py: Updated parameter types for command-line options; removed validate_input_by_ext call.
harpy/metassembly.py: Updated parameter types for command-line options to use specific parameter types.
harpy/phase.py: Updated parameter types for command-line options to use specific parameter types.
harpy/preflight.py: Updated --hpc and --snakemake options to use HPCProfile and SnakemakeParams.
harpy/qc.py: Updated parameter types for command-line options to use specific parameter types.
harpy/simulate.py: Updated parameter types for command-line options; removed validate_input_by_ext calls.
harpy/snakefiles/align_ema.smk: Added frag_opt variable; updated align_ema and workflow_summary rules to include it.
harpy/snakefiles/assembly.smk: Removed newline in BUSCO_analysis rule; updated workflow_summary for clarity.
harpy/snp.py: Updated parameter types for mpileup and freebayes commands; adjusted error handling.
harpy/sv.py: Updated parameter types for leviathan and naibr commands; refined help descriptions.

Possibly related PRs

  • Leverage heterozygosity to create diploid genomes #140: The changes in this PR enhance the command-line interface for simulating genomic variants, which includes updates to the handling of input files and parameters that are relevant to the CLI types mentioned in the main PR.
  • Add Metassembly #144: This PR introduces metassembly support and modifies various workflows and command-line options, which may relate to the changes in the CLI types and parameters in the main PR.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

🧹 Outside diff range and nitpick comments (15)
harpy/qc.py (1)

Line range hint 46-116: Consider enhancing error handling and configuration validation.

While the command function is well-structured, consider these improvements:

  1. Add validation for the output directory path to ensure write permissions.
  2. Add error handling for YAML configuration writing.
  3. Validate the combination of parameters (e.g., when both deduplicate and deconvolve are enabled).

Here's a suggested improvement for the configuration writing:

-    with open(os.path.join(workflowdir, 'config.yaml'), "w", encoding="utf-8") as config:
-        yaml.dump(configs, config, default_flow_style= False, sort_keys=False, width=float('inf'))
+    config_path = os.path.join(workflowdir, 'config.yaml')
+    try:
+        with open(config_path, "w", encoding="utf-8") as config:
+            yaml.dump(configs, config, default_flow_style=False, sort_keys=False, width=float('inf'))
+    except (IOError, yaml.YAMLError) as e:
+        click.echo(f"Error writing configuration to {config_path}: {str(e)}", err=True)
+        sys.exit(1)
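
For the first point (write permission on the output directory), a rough sketch of what such a pre-flight check could look like; the helper name and message wording here are assumptions, not existing Harpy code:

import os
import sys
import click

def check_output_writable(output_dir: str) -> None:
    """Illustrative helper: fail early if the output directory cannot be written to."""
    # If the directory already exists, check it directly; otherwise check its parent,
    # since that is where it would have to be created.
    target = output_dir if os.path.isdir(output_dir) else os.path.dirname(os.path.abspath(output_dir)) or "."
    if not os.access(target, os.W_OK):
        click.echo(f"Error: no write permission for {target}", err=True)
        sys.exit(1)
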
harpy/impute.py (1)

Line range hint 45-57: Enhance docstring with validation-related information.

While the docstring provides good examples, it could be enhanced by mentioning the new validation features, particularly:

  • Supported VCF file formats (uncompressed only)
  • Format requirements for the HPC profile
  • Validation rules for STITCH parameters

Consider adding a section like:

    ```
    harpy ... -x 'switchModelIterations = 39, splitReadIterations = NA, reference_populations = c("CEU","GBR")'...
    ```
+   
+   Note: The command performs validation on:
+   - VCF files (uncompressed format only)
+   - STITCH parameters (R-style syntax validation)
+   - HPC profile (yaml configuration validation)
harpy/snakefiles/assembly.smk (1)

Line range hint 1-24: Consider enhancing parameter validation.

Given this PR's focus on command-line validation, consider adding parameter validation at the workflow level:

  1. The container tag is hard-coded to latest which can lead to reproducibility issues
  2. Input parameters like max_mem, k_param, etc., lack validation

Consider:

  1. Using a specific container tag instead of latest
  2. Adding parameter validation in the workflow configuration:
# Add at the start of the workflow
validate_config = {
    "inputs": {
        "fastq_r1": str,
        "fastq_r2": str
    },
    "spades": {
        "max_memory": lambda x: isinstance(x, int) and x > 0,
        "k": lambda x: isinstance(x, int) and x > 0 and x % 2 == 1
    }
    # ... add more validation rules
}

from snakemake.utils import validate
validate(config, validate_config)
harpy/assembly.py (1)

38-38: Enhance help text for --extra-params.

Consider updating the help text to indicate the expected format for SPAdes parameters, similar to how it's done for the --arcs-extra option.

-@click.option('-x', '--extra-params', type = SpadesParams(), help = 'Additional spades parameters, in quotes')
+@click.option('-x', '--extra-params', type = SpadesParams(), help = 'Additional SPAdes parameters, in quotes (`option=arg` format)')
harpy/preflight.py (2)

45-45: Enhance help text and option ordering.

While the type changes improve validation, consider the following improvements:

  1. Enhance help text to be more descriptive:
    • For --hpc: Specify the expected format of the config.yaml file
    • For --snakemake: Provide examples of valid parameters
  2. Consider grouping related options together (e.g., move --snakemake next to --hpc since both are workflow control options)
-@click.option('--hpc',  type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file')
+@click.option('--hpc',  type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file (must contain cluster configuration)')
 @click.option('--quiet',  is_flag = True, show_default = True, default = False, help = 'Don\'t show output text while running')
-@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters, in quotes')
+@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters in quotes (e.g. "--restart-times 3 --keep-going")')

Also applies to: 47-47


105-106: Maintain consistency between fastq and bam commands.

While the type changes are good, the order of options differs between the fastq and bam commands. Consider maintaining the same option order for consistency in the CLI interface.

-@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters, in quotes')
-@click.option('--hpc',  type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file')
+@click.option('--hpc',  type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file (must contain cluster configuration)')
+@click.option('--quiet',  is_flag = True, show_default = True, default = False, help = 'Don\'t show output text while running')
+@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters in quotes (e.g. "--restart-times 3 --keep-going")')
harpy/snakefiles/align_ema.smk (1)

358-359: Consider improving the command string construction.

The parameters are correctly added, but there are some potential improvements:

  1. The command string uses single quotes which could cause issues with the RG tag escaping
  2. The indentation in the command string could be more consistent

Consider applying these changes:

-        ema_align += f'\tema align {extra} {params.frag_opt} -p {platform} -R "@RG\\tID:SAMPLE\\tSM:SAMPLE" |\n'
-        ema_align += f"\tsamtools view -h {params.unmapped} -q {config["alignment_quality"]} - |\n"
-        ema_align += "\tsamtools sort --reference genome"
+        ema_align += (
+            f'\tema align {extra} {params.frag_opt} -p {platform} '
+            f'-R "@RG\\tID:SAMPLE\\tSM:SAMPLE" |\n'
+            f'\tsamtools view -h {params.unmapped} '
+            f'-q {config["alignment_quality"]} - |\n'
+            '\tsamtools sort --reference genome'
+        )

Also applies to: 370-373

harpy/snp.py (1)

Line range hint 67-150: Consider reducing code duplication

There's significant code duplication between the mpileup and freebayes functions, particularly in the configuration setup and workflow launch logic. Consider extracting the common code into a shared helper function.

Here's a suggested approach:

def setup_snp_workflow(
    workflow_name: str,
    inputs: list,
    output_dir: str,
    genome: str,
    threads: int,
    populations: str,
    ploidy: int,
    regions: str,
    extra_params: dict,
    snakemake: str,
    skip_reports: bool,
    quiet: bool,
    hpc: str,
    conda: bool,
    setup_only: bool
) -> None:
    """Common setup logic for SNP workflow commands."""
    output_dir = output_dir.rstrip("/")
    workflowdir = os.path.join(output_dir, 'workflow')
    # ... rest of the common setup code ...
    
    configs = {
        "workflow": f"snp {workflow_name}",
        # ... rest of the config setup ...
    }
    
    # ... rest of the common launch code ...

This would simplify both command functions to:

@click.command(...)
def mpileup(...):
    """Call variants from using bcftools mpileup"""
    setup_snp_workflow(
        workflow_name="mpileup",
        inputs=inputs,
        output_dir=output_dir,
        # ... rest of the parameters ...
    )

@click.command(...)
def freebayes(...):
    """Call variants using freebayes"""
    setup_snp_workflow(
        workflow_name="freebayes",
        inputs=inputs,
        output_dir=output_dir,
        # ... rest of the parameters ...
    )

Also applies to: 172-255

harpy/sv.py (1)

10-11: Remove unused imports

The following imports from ._cli_types_params are not used in the code:

  • LeviathanParams
  • NaibrParams
from ._cli_types_generic import ContigList, HPCProfile, InputFile, SnakemakeParams
-from ._cli_types_params import LeviathanParams, NaibrParams
🧰 Tools
🪛 Ruff

11-11: ._cli_types_params.LeviathanParams imported but unused

Remove unused import

(F401)


11-11: ._cli_types_params.NaibrParams imported but unused

Remove unused import

(F401)

harpy/_validations.py (3)

Line range hint 415-474: Add error handling for file operations and improve resource management.

The function should handle file operations more safely to prevent resource leaks and provide better error messages.

Consider using a context manager and try-catch blocks:

 def check_fasta(genofile):
     """perform validations on fasta file for extensions and file contents"""
-    if is_gzip(genofile):
-        fasta = gzip.open(genofile, 'rt')
-    elif is_plaintext(genofile):
-        fasta = open(genofile, 'r', encoding="utf-8")
-    else:
+    try:
+        if is_gzip(genofile):
+            fasta = gzip.open(genofile, 'rt')
+        elif is_plaintext(genofile):
+            fasta = open(genofile, 'r', encoding="utf-8")
+        else:
+            print_error("unknown file type", f"Unable to determine file encoding for [blue]{genofile}[/blue]. Please check that it is a gzipped or uncompressed FASTA file.")
+            sys.exit(1)
+    except IOError as e:
         print_error("unknown file type", f"Unable to determine file encoding for [blue]{genofile}[/blue]. Please check that it is a gzipped or uncompressed FASTA file.")
         sys.exit(1)
-    line_num = 0
-    seq_id = 0
-    seq = 0
-    last_header = False
-    for line in fasta:
+
+    try:
+        with fasta:
+            line_num = 0
+            seq_id = 0
+            seq = 0
+            last_header = False
+            for line in fasta:
🧰 Tools
🪛 Ruff

419-419: Use a context manager for opening files

(SIM115)


Line range hint 476-497: Add type hints and improve documentation.

The function implementation is solid, but could benefit from better documentation and type hints.

Consider these improvements:

-def fasta_contig_match(contigs, fasta):
-    """Checks whether a list of contigs are present in a fasta file"""
+def fasta_contig_match(contigs: list[str], fasta: str) -> None:
+    """Checks whether a list of contigs are present in a fasta file.
+    
+    Args:
+        contigs: List of contig names to verify.
+        fasta: Path to the FASTA file.
+        
+    Raises:
+        SystemExit: If any contig is not found in the FASTA file.
+    """
🧰 Tools
🪛 Ruff

419-419: Use a context manager for opening files

(SIM115)


Line range hint 499-542: Add parameter validation and improve memory efficiency.

The function could benefit from input validation and memory optimization.

Consider these improvements:

-def validate_fastq_bx(fastq_list, threads, quiet):
+def validate_fastq_bx(fastq_list: list[str], threads: int, quiet: bool) -> None:
     """
     Parse a list of fastq files to verify that they have BX/BC tag, and only one of those two types per file
+    
+    Args:
+        fastq_list: List of FASTQ file paths to validate
+        threads: Number of parallel threads to use
+        quiet: Whether to suppress progress bar
+        
+    Raises:
+        SystemExit: If validation fails for any file
     """
+    if not fastq_list:
+        print_error("no files provided", "The list of FASTQ files is empty.")
+        sys.exit(1)
+    if threads < 1:
+        print_error("invalid threads", f"Invalid number of threads: {threads}")
+        sys.exit(1)
+
     def validate(fastq):
         BX = False
         BC = False
-        if is_gzip(fastq):
-            fq = gzip.open(fastq, "rt")
-        else:
-            fq = open(fastq, "r")
-        with fq:
+        opener = gzip.open if is_gzip(fastq) else open
+        with opener(fastq, "rt") as fq:
             while True:
                 line = fq.readline()
                 if not line:
                     break
                 if not line.startswith("@"):
                     continue
                 BX = True if "BX:Z" in line else BX
                 BC = True if "BC:Z" in line else BC
                 if BX and BC:
                     print_error("clashing barcode tags", f"Both [green bold]BC:Z[/green bold] and [green bold]BX:Z[/green bold] tags were detected in the read headers for [blue]{os.path.basename(fastq)}[/blue]. Athena accepts [bold]only[/bold] one of [green bold]BC:Z[/green bold] or [green bold]BX:Z[/green bold].")
                     print_solution("Check why your data has both tags in use and remove/rename one of the tags.")
                     sys.exit(1)
🧰 Tools
🪛 Ruff

419-419: Use a context manager for opening files

(SIM115)

harpy/simulate.py (1)

127-127: Consider using InputFile type for VCF options.

Several options still use generic Path type where InputFile could provide better validation:

  • --vcf options in inversion, cnv, and translocation commands

Example improvement:

-@click.option('-v', '--vcf', type=click.Path(exists=True, dir_okay=False, readable=True), help = 'VCF file of known variants to simulate')
+@click.option('-v', '--vcf', type=InputFile("vcf", gzip_ok = True), help = 'VCF file of known variants to simulate')

Also applies to: 345-345, 454-454, 566-566

harpy/_cli_types_generic.py (1)

109-112: Use os.path.join for cross-platform compatibility

When constructing file paths, it's recommended to use os.path.join to ensure compatibility across different operating systems.

Apply this diff to improve path handling:

         if not os.path.exists(value):
             self.fail(f"{value} does not exist. Please check the spelling and try again.", param, ctx)
         elif not os.access(value, os.R_OK):
             self.fail(f"{value} is not readable. Please check file/directory permissions and try again", param, ctx)
         if os.path.isfile(value):
             self.fail(f"{value} is a file, but input should be a directory.", param, ctx)
-        if not os.path.exists(f"{value}/config.yaml"):
+        config_file = os.path.join(value, "config.yaml")
+        if not os.path.exists(config_file):
             self.fail(f"{value} does not contain the necessary config.yaml file.", param, ctx)
-        elif not os.access(f"{value}/config.yaml", os.R_OK):
+        elif not os.access(config_file, os.R_OK):
             self.fail(f"{value}/config.yaml does not have read access. Please check the file permissions and try again.", param, ctx)
harpy/_cli_types_params.py (1)

5-231: Consider refactoring to reduce code duplication across parameter validation classes

The convert methods in all parameter validation classes (FastpParams, BwaParams, EmaParams, etc.) share similar logic for parsing and validating options. Refactoring common code into a base class or utility functions can enhance maintainability and reduce redundancy. This approach will make it easier to add new parameter types or update validation logic in the future.
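
One possible shape for that refactor, sketched under the assumption that the tool-specific classes differ only in their option lists and documentation link (class and option names below are illustrative, not the current Harpy API):

import click

class ToolParams(click.ParamType):
    """Hypothetical shared base class: subclasses declare only their option lists."""
    name = "tool_params"
    harpy_options: set = set()   # options Harpy already manages internally
    valid_options: set = set()   # options the tool accepts via --extra-params
    docs_url: str = ""

    def convert(self, value, param, ctx):
        for opt in value.split():
            arg = opt.split("=", 1)[0].lstrip("-")
            if arg in self.harpy_options:
                self.fail(f"{arg} is already managed by Harpy and cannot be passed through", param, ctx)
            if self.valid_options and arg not in self.valid_options:
                self.fail(f"{arg} is not a recognized option (see {self.docs_url})", param, ctx)
        return value

class ExampleSpadesParams(ToolParams):
    """Illustrative subclass; option lists here are examples, not the full SPAdes set."""
    name = "spades_params"
    harpy_options = {"t", "threads", "m", "memory", "o"}
    valid_options = {"k", "careful", "isolate", "cov-cutoff"}
    docs_url = "https://ablab.github.io/spades/"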

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6b98fdf and 07500c0.

📒 Files selected for processing (20)
  • .github/filters.yml (1 hunks)
  • harpy/_cli_types.py (0 hunks)
  • harpy/_cli_types_generic.py (1 hunks)
  • harpy/_cli_types_params.py (1 hunks)
  • harpy/_validations.py (1 hunks)
  • harpy/align.py (12 hunks)
  • harpy/assembly.py (3 hunks)
  • harpy/deconvolve.py (2 hunks)
  • harpy/demultiplex.py (2 hunks)
  • harpy/hpc.py (6 hunks)
  • harpy/impute.py (2 hunks)
  • harpy/metassembly.py (2 hunks)
  • harpy/phase.py (3 hunks)
  • harpy/preflight.py (3 hunks)
  • harpy/qc.py (2 hunks)
  • harpy/simulate.py (11 hunks)
  • harpy/snakefiles/align_ema.smk (5 hunks)
  • harpy/snakefiles/assembly.smk (1 hunks)
  • harpy/snp.py (5 hunks)
  • harpy/sv.py (7 hunks)
💤 Files with no reviewable changes (1)
  • harpy/_cli_types.py
🧰 Additional context used
🪛 Ruff
harpy/_cli_types_generic.py

21-21: Do not use bare except

(E722)


22-22: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

harpy/sv.py

11-11: ._cli_types_params.LeviathanParams imported but unused

Remove unused import

(F401)


11-11: ._cli_types_params.NaibrParams imported but unused

Remove unused import

(F401)

🔇 Additional comments (40)
.github/filters.yml (1)

4-4: LGTM! Verify the new file location.

The path update from _cli_types.py to _cli_types_generic.py aligns with the PR's objective to enhance command-line validation. Since this file is referenced in the common section and used across multiple workflows, let's verify its existence.

✅ Verification successful

File path and content verification successful

The file harpy/_cli_types_generic.py exists at the specified location and contains the expected parameter type definitions for command-line validation, including IntList, KParam, ContigList, InputFile, SnakemakeParams, and HPCProfile classes that extend click.ParamType. The path update in filters.yml is correct and consistent with the actual file location.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the new CLI types file exists and contains the expected content

# Test 1: Check if the file exists at the new location
fd "_cli_types_generic.py$" harpy/

# Test 2: Verify it contains parameter type definitions
rg -A 5 "class.*\(.*ParamType\)" harpy/_cli_types_generic.py

Length of output: 4561

harpy/deconvolve.py (2)

9-9: LGTM! Import changes align with validation improvements.

The import of custom parameter types from ._cli_types_generic supports the PR's objective of enhancing command-line validation.


37-37: Verify backward compatibility of HPCProfile validation.

The change from click.Path to HPCProfile suggests enhanced validation. Please ensure that:

  1. All valid HPC config formats are still accepted
  2. Error messages are clear when validation fails

Let's check for any existing HPC config examples in the repository:

#!/bin/bash
# Search for HPC config examples or tests
fd -e yaml -e yml --exec grep -l "HPC|cluster|submit" {}
harpy/demultiplex.py (2)

10-10: LGTM! Import of custom parameter types aligns with validation objectives.

The import of HPCProfile and SnakemakeParams from the new _cli_types_generic module supports the PR's goal of enhancing command-line validation.


47-47: Enhanced type validation for --hpc and --snakemake options.

The changes improve input validation by:

  • Replacing generic path validation with HPC-specific validation for --hpc
  • Adding parameter syntax validation for --snakemake instead of accepting any string

This aligns well with the PR's objective of enhancing command-line validation.

Let's verify the consistency of these changes across other commands:

Also applies to: 50-50

✅ Verification successful

Type validation changes are consistently applied across all commands

The search results confirm that:

  1. The --hpc option consistently uses HPCProfile() type across all commands
  2. The --snakemake option consistently uses SnakemakeParams() type across all commands

The changes are part of a systematic update to improve command-line validation across the entire codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if all commands use the same type validation

# Test 1: Verify HPCProfile usage is consistent
echo "Checking HPCProfile usage..."
rg --type python "type\s*=\s*(click\.Path|HPCProfile)\(\)" -A 2 -B 2

# Test 2: Verify SnakemakeParams usage is consistent
echo "Checking SnakemakeParams usage..."
rg --type python "type\s*=\s*(str|SnakemakeParams)\(\)" -A 2 -B 2

Length of output: 344


Script:

#!/bin/bash
# Let's try without the --type flag and use file extension patterns instead

# Test 1: Verify HPCProfile usage is consistent
echo "Checking HPCProfile usage..."
rg "type\s*=\s*(click\.Path|HPCProfile)\(\)" -g "*.py" -A 2 -B 2

# Test 2: Verify SnakemakeParams usage is consistent
echo "Checking SnakemakeParams usage..."
rg "type\s*=\s*(str|SnakemakeParams)\(\)" -g "*.py" -A 2 -B 2

# Test 3: Find all Click options to ensure we haven't missed any
echo "Finding all Click options..."
rg "@click\.option" -g "*.py" -A 1

Length of output: 78634

harpy/metassembly.py (4)

12-13: LGTM: Well-organized type imports.

The separation of CLI types into generic and parameter-specific modules improves code organization and maintainability.


Line range hint 48-114: LGTM: Clean implementation of enhanced parameter validation.

The implementation successfully:

  • Maintains existing functionality while improving type safety
  • Adds software-specific validation for SPAdes parameters
  • Standardizes parameter handling across the codebase

41-41: Verify consistent parameter type usage across commands.

The changes to HPCProfile() and SnakemakeParams() types suggest improved validation. Let's verify consistent usage across other commands:

#!/bin/bash
# Description: Check consistent usage of parameter types across commands

# Test: Look for HPCProfile usage across all Python files
rg -l "HPCProfile\(\)" "*.py"

# Test: Look for SnakemakeParams usage across all Python files
rg -l "SnakemakeParams\(\)" "*.py"

# Test: Look for any remaining old-style parameter handling
rg "type\s*=\s*click\.(?!Choice|Path|IntRange)" "*.py"

Also applies to: 45-45


35-35: Verify SPAdes parameter validation implementation.

The change to SpadesParams() type aligns with the PR objective of adding software-specific validation. Let's verify the implementation:

✅ Verification successful

SPAdes parameter validation is properly implemented

The SpadesParams class provides comprehensive validation:

  • Maintains a list of valid SPAdes-specific options
  • Prevents conflicts with Harpy's internal options
  • Validates each parameter against SPAdes documentation
  • Provides helpful error messages with documentation links
  • Handles mate-pair library options (--mpX-Y and --hqmpX-Y)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the implementation of SpadesParams validation

# Test: Look for the SpadesParams class definition and validation logic
ast-grep --pattern 'class SpadesParams:
  $$$
'

# Test: Look for any SPAdes-specific parameter validation
rg -A 5 'SpadesParams'

Length of output: 479


Script:

#!/bin/bash
# Search for SpadesParams class definition and usage
rg -A 10 "class SpadesParams"

# Search for files containing SpadesParams
fd -e py | xargs rg "SpadesParams"

# Look for parameter validation patterns in Python files
fd -e py | xargs rg -A 5 "def.*param.*valid"

Length of output: 1648


Script:

#!/bin/bash
# Get the complete implementation of SpadesParams class
rg -A 30 "class SpadesParams" harpy/_cli_types_params.py

Length of output: 2470

harpy/qc.py (2)

12-13: LGTM! Well-organized type imports.

The separation of CLI types into generic and parameter-specific modules improves code organization and maintainability.


32-32: Verify the validation logic in the new parameter types.

The switch to specific parameter types (FastpParams, HPCProfile, SnakemakeParams) improves type safety. Let's verify their implementation:

Also applies to: 40-40, 43-43

✅ Verification successful

Validation logic in parameter types is well-implemented and comprehensive

The validation logic in all three parameter types is robust and well-designed:

  • FastpParams: Validates against predefined lists of allowed and Harpy-specific options, preventing conflicts
  • HPCProfile: Performs thorough filesystem checks:
    • Existence of directory
    • Read permissions
    • Directory (not file) validation
    • Presence of required config.yaml
  • SnakemakeParams: Implements comprehensive option validation:
    • Prevents usage of forbidden options that conflict with Harpy
    • Validates against a complete list of available Snakemake parameters
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the validation logic in the new parameter types

# Test 1: Check FastpParams validation
echo "Checking FastpParams validation..."
rg -A 10 "class FastpParams" 

# Test 2: Check HPCProfile validation
echo "Checking HPCProfile validation..."
rg -A 10 "class HPCProfile"

# Test 3: Check SnakemakeParams validation
echo "Checking SnakemakeParams validation..."
rg -A 10 "class SnakemakeParams"

Length of output: 6800

harpy/impute.py (3)

10-11: LGTM: Clean import organization for new type validators.

The new imports from _cli_types_generic and _cli_types_params are well-organized and align with the PR's objective of enhancing command-line validation.


32-32: LGTM: Enhanced type validation at command-line level.

The updated click options with custom type validators (StitchParams, InputFile, HPCProfile, SnakemakeParams) provide stronger validation at the command-line level, which aligns well with the PR objectives.

Also applies to: 34-34, 36-36, 39-39, 42-42


36-36: Verify VCF file handling restrictions.

The gzip_ok=False parameter on the VCF input type suggests that compressed VCF files are not supported. This might be intentional, but it's worth verifying if this restriction is necessary, as compressed VCF files are common in bioinformatics workflows.
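
For reference, gzipped (and bgzipped) VCFs are easy to detect from their first two magic bytes, so the type could in principle accept both forms if the restriction turns out to be unnecessary; a small hypothetical check (not Harpy's actual implementation):

def is_gzipped(path: str) -> bool:
    """Illustrative helper: gzip and BGZF files both start with the bytes 0x1f 0x8b."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"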

harpy/phase.py (3)

12-13: LGTM! Clean separation of CLI type validations.

The imports reflect a well-structured separation between generic CLI types and parameter-specific validations, aligning with the PR's objective of moving file-extension validation to the command-line level.

Also applies to: 15-15


31-32: Strong improvement in command-line parameter validation!

The changes enhance type safety by:

  • Adding software-specific validation for HapCut2 parameters
  • Enforcing correct file extensions and compression formats
  • Implementing structured validation for HPC profiles and Snakemake parameters

Let's verify backward compatibility with existing parameter values:

Also applies to: 38-38, 42-42, 45-45

✅ Verification successful

Parameter type changes maintain compatibility and add robust validation

The implementation shows well-designed type validators that enhance CLI safety while maintaining compatibility:

  • HapCutParams: Validates against predefined lists of valid HapCut2 options
  • InputFile: Performs file existence and extension checks while explicitly handling gzip
  • SnakemakeParams: Comprehensive validation against allowed/forbidden Snakemake parameters
  • HPCProfile: Validates HPC profile directory structure

Since no existing usage examples were found in documentation or tests, and the type implementations show proper parameter handling with clear fallback paths, the changes appear safe. The validators primarily add validation on top of existing string inputs rather than changing the fundamental parameter format.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the new parameter types maintain backward compatibility
# with existing command usages in documentation, tests, or example scripts.

# Test: Search for example usages of the changed parameters
echo "Searching for example usages of changed parameters..."
rg -A 2 "harpy phase.*(-x|--extra-params|--genome|-g|--vcf|-v|--hpc|--snakemake)" \
  --glob '!*.py' \
  --glob '!*.pyc' \
  --glob '!*.git*'

Length of output: 248


Script:

#!/bin/bash
# Let's try a different approach to find usage examples and tests

# Check for test files that might contain CLI invocations
echo "=== Searching test files ==="
fd "test.*\.(py|sh|txt|md)" --type f --exec rg -l "harpy phase" {}

# Look for documentation files
echo -e "\n=== Searching documentation files ==="
fd "\.(md|rst|txt)" --type f --exec rg -l "harpy phase" {}

# Search for example scripts or configuration files
echo -e "\n=== Searching example files ==="
fd "(example|sample|config)" --type f --exec rg -l "harpy phase" {}

# Check for specific parameter type implementations
echo -e "\n=== Checking parameter type implementations ==="
rg -A 5 "class (HapCutParams|InputFile|HPCProfile|SnakemakeParams)" 

Length of output: 5350


76-76: Verify the impact of removing the quiet parameter.

The quiet parameter has been removed from the check_fasta call. Let's ensure this doesn't affect error reporting behavior.

✅ Verification successful

Let me try a different approach to find the check_fasta implementation since the previous attempt didn't yield results.


Let me check the git history to understand the changes made to the check_fasta function.


Removal of file extension validation is an intentional simplification

The git history shows that the check_fasta function was simplified by removing the file extension validation logic that was previously controlled by the quiet parameter. The function now focuses solely on validating file contents and encoding, which is a more robust approach since file extensions aren't always reliable indicators of file format. This change is an improvement as it removes potentially confusing warnings about unconventional file extensions while maintaining the essential validation of the file's actual content.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the implementation of check_fasta function
# to understand the impact of removing the quiet parameter

# Test: Search for the check_fasta function implementation
echo "Searching for check_fasta implementation..."
ast-grep --pattern 'def check_fasta($$$)' -A 10

Length of output: 146


Script:

#!/bin/bash
# Search for check_fasta using ripgrep for better context
echo "Searching for check_fasta definition and usage..."
rg "def check_fasta" -A 10
rg "check_fasta\(" -A 2

Length of output: 3284


Script:

#!/bin/bash
# Check git history for changes to check_fasta function
git log -p -- harpy/_validations.py | grep -A 15 "check_fasta"

Length of output: 1681

harpy/assembly.py (4)

9-10: LGTM! Good separation of concerns.

The separation of CLI types into generic and software-specific modules improves code organization and maintainability.


40-40: LGTM! Clear parameter validation and documentation.

The change to ArcsParams() type enhances validation while maintaining clear documentation of the expected format.


59-59: Verify Snakemake parameter handling.

The change to SnakemakeParams() adds validation, but we should verify that:

  1. The validation aligns with Snakemake's expected parameter format
  2. The parameter handling in the command construction is compatible with the validated format
#!/bin/bash
# Description: Check Snakemake parameter usage patterns

# Test: Search for Snakemake command construction patterns
rg -A 10 'snakemake.*--configfile'

# Test: Search for Snakemake parameter validation
ast-grep --pattern 'class SnakemakeParams'

Also applies to: 74-80


55-55: Verify backward compatibility of HPC profile validation.

The change from basic path validation to HPCProfile() suggests enhanced validation. Please ensure this change maintains compatibility with existing HPC configuration files.

✅ Verification successful

Based on the code inspection, I can now provide a final response:

HPC profile validation maintains backward compatibility

The HPCProfile type performs the same essential validation as before, checking that:

  1. The directory exists and is readable
  2. A config.yaml file exists in the directory and is readable

The change from path validation to HPCProfile() actually provides better error messages and more specific validation, while maintaining the same core requirements for HPC configuration files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential breaking changes in HPC config validation

# Test: Search for HPC config usage patterns
rg -A 5 'config\.yaml' --type yaml

Length of output: 81


Script:

#!/bin/bash
# Let's check the HPCProfile implementation and its usage
ast-grep --pattern 'class HPCProfile'

# Search for HPC config related code
rg -A 5 'HPC.*config' 

# Look for config file handling in assembly.py
rg -A 10 'config.*yaml' assembly.py

# Check for any HPC profile validation logic
rg -A 5 'validate.*hpc|hpc.*validate'

Length of output: 17557


Script:

#!/bin/bash
# Let's check the HPCProfile implementation
cat harpy/_cli_types_generic.py

Length of output: 8453

harpy/preflight.py (2)

9-9: LGTM: Import of custom parameter types.

The import of HPCProfile and SnakemakeParams from the new ._cli_types_generic module aligns with the PR's objective of enhancing command-line validation.


9-9: Verify custom type implementations.

Let's verify the implementation of the custom parameter types to ensure proper validation.

Also applies to: 45-45, 47-47, 105-106

✅ Verification successful

Custom type implementations are properly validated

The implementation shows robust input validation:

  • HPCProfile validates:

    • Directory existence and readability
    • Ensures input is a directory not a file
    • Checks for required config.yaml file presence
  • SnakemakeParams validates:

    • Maintains a comprehensive list of forbidden options that conflict with Harpy
    • Validates against available Snakemake parameters
    • Prevents parameter duplication

The types are consistently used across all CLI modules for --hpc and --snakemake options with proper help text.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check implementation and usage of custom parameter types

# Check implementation of custom types
echo "Checking custom type implementations..."
rg -A 10 "class (HPCProfile|SnakemakeParams)" 

# Check usage patterns
echo -e "\nChecking usage patterns..."
rg "(HPCProfile|SnakemakeParams)\(\)" 

Length of output: 9967

harpy/hpc.py (2)

16-18: LGTM: Documentation accurately reflects the new directory structure.

The updated docstring clearly explains the new organization where configuration files are stored in scheduler-specific subdirectories (hpc/system/), which matches the actual implementation.
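
As a concrete illustration of that layout (the scheduler name below is chosen arbitrarily):

hpc/
└── slurm/
    └── config.yaml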


41-41: Verify the installation commands for all scheduler plugins.

The installation commands have been standardized to use conda instead of mamba. While this change is consistent, we should verify:

  1. That these packages are available in the bioconda channel
  2. That these are the correct package names for each scheduler

Also applies to: 83-83, 119-119, 154-154, 189-189

harpy/snakefiles/align_ema.smk (3)

22-22: LGTM: Clear and consistent variable declaration.

The new frag_opt variable is well-named and appropriately placed with other configuration variables.


163-163: LGTM: Fragment optimization flag correctly integrated in command.

The frag_opt parameter is properly positioned in the EMA align command.


153-153: LGTM: Appropriate parameter handling for fragment density optimization.

The parameter is correctly configured to conditionally enable fragment density optimization with the -d flag.

✅ Verification successful

LGTM: Fragment density optimization is properly implemented and documented

The verification confirms that the -d flag for fragment density optimization is:

  • Properly documented in the CLI as --fragment-density with help text "Perform read fragment density optimization"
  • Correctly configured in the align module's configuration as fragment_density_optimization
  • Appropriately handled in the Snakefile by setting frag_opt = "-d" when enabled
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that fragment density optimization is documented in EMA's help text
# Test: Check if -d flag is documented in EMA's help. Expect: Description of fragment density optimization.
rg -A 5 "fragment.*density|[-]d\s" README* docs/*

Length of output: 102


Script:

#!/bin/bash
# Let's try to find any documentation or usage of the -d flag in the codebase
rg -A 5 "\-d\s|fragment.*density" .

# Also search for EMA documentation or configuration files
fd -e md -e txt -e yaml -e yml -e json . | xargs rg -l "ema|align"

# Look for the actual usage of frag_opt in the code
ast-grep --pattern 'frag_opt = $_'

Length of output: 21992

harpy/snp.py (4)

10-11: LGTM: Clean separation of CLI type validations

The separation of CLI types into generic and parameter-specific modules improves code organization and maintainability.


52-53: LGTM: Enhanced parameter validation

The introduction of specific parameter types (MpileupParams, InputFile, HPCProfile, SnakemakeParams) strengthens input validation at the CLI level, which aligns well with the PR objectives.

Also applies to: 59-59, 64-64


157-158: LGTM: Consistent parameter validation across commands

The freebayes command implements the same enhanced validation types as mpileup, maintaining consistency across the module. This is a good practice that ensures uniform behavior and maintainability.

Also applies to: 166-166, 169-169, 203-203


98-98: Verify error handling behavior

The removal of the quiet parameter from check_fasta call might affect error reporting behavior.

harpy/sv.py (5)

67-67: LGTM: Enhanced parameter validation

The use of custom parameter types (HPCProfile, SnakemakeParams) improves command-line validation by ensuring proper format and content of these parameters before workflow execution.

Also applies to: 70-70


158-158: LGTM: Consistent parameter validation

The parameter validation approach is consistently applied across both commands, using the same custom types (HPCProfile, SnakemakeParams) for improved command-line validation.

Also applies to: 161-161


Line range hint 195-205: LGTM: Comprehensive validation sequence

The validation sequence is well-structured and covers all necessary checks:

  1. Genome FASTA validation
  2. Contig validation (if specified)
  3. Population file validation (if specified)
  4. VCF phasing validation (if specified)

This ensures all inputs are valid before workflow execution.


145-146: Verify extra parameters validation for naibr

The --extra-params option should validate NAIBR-specific parameters to prevent invalid options from being passed to the workflow.

✅ Verification successful

Let me gather more context about the NaibrParams implementation to verify the validation logic.


Let me try a different approach to get the implementation details.


Based on the implementation shown in _cli_types_params.py, I can now generate the final response:

The --extra-params option already has proper NAIBR-specific parameter validation

The NaibrParams class implements thorough validation that:

  • Prevents usage of parameters already managed by Harpy (bam_file, prefix, outdir, threads, min_mapq, d, min_sv, k)
  • Validates against allowed NAIBR parameters (blacklist, candidates)
  • Enforces correct parameter format (ARG VAL pairs without dashes)
  • Provides clear error messages with documentation links
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for NAIBR parameter validation
rg -A 5 "NaibrParams|validate_naibr"

Length of output: 888


Script:

#!/bin/bash
# Get full implementation of NaibrParams class
ast-grep --pattern 'class NaibrParams {
  $$$
}'

# Also check usage of NaibrParams in sv.py
rg -A 3 "NaibrParams\(" harpy/sv.py

Length of output: 88


Script:

#!/bin/bash
# Get the full content of the file containing NaibrParams
cat harpy/_cli_types_params.py

# Check how extra-params is used in the naibr command
rg -B 2 -A 5 "extra-params.*naibr" 

Length of output: 20774


56-57: Verify extra parameters validation for leviathan

The --extra-params option should validate leviathan-specific parameters to prevent invalid options from being passed to the workflow.

✅ Verification successful

Let me gather more context about how these parameters are actually used in the code.


Let me modify the search pattern to correctly find the leviathan function and its usage.


Let me try one final search pattern to locate the leviathan function and its parameter usage.


Based on the gathered context, I can now provide a final response about the parameter validation:

Extra parameters validation is properly implemented for leviathan

The --extra-params option is properly validated through the LeviathanParams type which:

  • Defines allowed Harpy-specific options (-b -i -g -o -v --minVariantSize etc.)
  • Defines valid Leviathan-specific options (--regionSize --maxLinks --mediumSize etc.)
  • The validated parameters are correctly passed to the workflow configuration via the extra key
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for leviathan parameter validation
rg -A 5 "LeviathanParams|validate_leviathan" 

Length of output: 1087


Script:

#!/bin/bash
# Search for usage of LeviathanParams and extra_params
ast-grep --pattern 'def leviathan($_) {
  $$$
}'

rg -B 2 -A 5 '@click\.option.*extra-params.*\n.*def leviathan'

Length of output: 307


Script:

#!/bin/bash
# Search for leviathan function implementation and usage of extra_params
rg -U -B 2 -A 10 '@click\.option.*extra-params.*\n.*def leviathan'

# Also search for the leviathan function definition
ast-grep --pattern 'def leviathan($params) { $$$  }'

Length of output: 123


Script:

#!/bin/bash
# Search for the leviathan function and its implementation
rg -A 20 "^def leviathan" harpy/sv.py

# Search for where extra_params is used
rg -A 5 "extra_params" harpy/sv.py

Length of output: 2705

harpy/simulate.py (2)

9-9: LGTM: Import changes align with validation improvements.

The imports have been updated to use the new custom parameter types from _cli_types_generic.py, which supports the move to command-line level validation.

Also applies to: 14-14


137-137: LGTM: Enhanced CLI validation with custom types.

The use of InputFile, HPCProfile, and SnakemakeParams types improves input validation by enforcing file extensions and standardizing parameter handling at the command-line level.

Also applies to: 141-143, 211-212, 219-219, 221-221, 229-229, 232-233, 345-346, 354-354, 357-358, 454-455, 463-463, 466-467, 566-567, 575-575, 578-579

harpy/_cli_types_params.py (1)

104-116: Verify the handling of options starting with dashes in ArcsParams

Disallowing options that start with a dash could prevent users from specifying valid arcs options. While the intention is to avoid arguments meant for arcs-make, some valid arcs options might begin with a dash (e.g., -v for verbose output). Please verify whether this restriction is necessary.

To confirm whether arcs supports options starting with a dash, you can check the official documentation or run the following command:

This script fetches the ARCS README and extracts options that start with one or two dashes. Review the output to ensure that all valid options are accounted for and adjust the validation logic accordingly.

✅ Verification successful

Let me gather more information about how ARCS options are handled in the codebase.


Based on the search results, I can now generate the final response:

The restriction on dash-prefixed options in ArcsParams is correct and necessary

Looking at the codebase, ARCS parameters are explicitly designed to be used in the format ARG=VAL without dashes. This is evident from:

  1. The valid_options list in ArcsParams which contains options like "d", "k", "o", "e" (without dashes)
  2. The error message correctly instructs users to use the format ARG=VAL (e.g., k=15)
  3. The distinction between arcs-make arguments (which use dashes) and arcs parameters is intentional to prevent parameter confusion
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for options starting with a dash in ARCS documentation.

# Expected result: List of valid ARCS options that start with a dash.

curl -s https://raw.githubusercontent.com/bcgsc/arcs/master/README.md | grep -Eo '^-{1,2}[a-zA-Z\-]+' | sort | uniq

Length of output: 124


Script:

#!/bin/bash
# Search for ARCS command-line options in the codebase
rg -A 5 "arcs options|ARCS options" --type py

# Search for any hardcoded ARCS options or parameters
rg -A 5 "valid_options.*=|harpy_options.*=" --type py

Length of output: 12494

Resolved review threads: harpy/deconvolve.py, harpy/snakefiles/assembly.smk, harpy/simulate.py, harpy/_cli_types_generic.py (2), harpy/_cli_types_params.py (4, two outdated), harpy/align.py (outdated)
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (6)
harpy/_cli_types_generic.py (6)

24-25: Enhance error message with count details.

The error message could be more helpful by including the actual number of entries provided.

Apply this diff to improve the error message:

-            self.fail(f"{value} is not a valid list of integers. The value should be {self.max_entries} integers separated by a comma.", param, ctx)
+            self.fail(f"{value} is not a valid list of integers. Found {len(parts)} entries but expected exactly {self.max_entries} integers separated by commas.", param, ctx)

30-40: Enhance input validation and error messages.

The validation could be more robust with specific error messages for different failure cases.

Apply this diff to improve the validation:

     def convert(self, value, param, ctx):
         try:
+            if not value:
+                raise ValueError("Empty input is not allowed")
             if value == "auto":
                 return value
             parts = [i.strip() for i in value.split(',')]
             for i in parts:
-                if int(i) % 2 == 0 or int(i) > 128:
-                    raise ValueError
+                num = int(i)
+                if num % 2 == 0:
+                    self.fail(f"'{i}' is not an odd integer", param, ctx)
+                if num > 128:
+                    self.fail(f"'{i}' exceeds maximum allowed value of 128", param, ctx)
             return [int(i) for i in parts]
         except ValueError:
             self.fail(f"{value} is not 'auto' or odd integers <128 separated by a comma.", param, ctx)

67-68: Simplify dictionary key check.

The .keys() call is unnecessary when checking for key existence.

Apply this diff:

-        if self.filetype not in filedict.keys():
+        if self.filetype not in filedict:
🧰 Tools
🪛 Ruff

67-67: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)


91-92: Optimize parameter validation using sets.

Converting the lists to sets would improve lookup performance, especially given the large number of options.

Apply this diff:

-        forbidden = "--rerun-incomplete --ri --show-failed-logs --rerun-triggers --nolock --software-deployment-method --smd --deployment --deployment-method --conda-prefix --cores -c --directory -d --snakefile -s --configfile --configfiles".split() 
-        available = "--dry-run --dryrun -n --profile --cache --jobs -j --local-cores --resources --res --set-threads --max-threads --set-resources --set-scatter --set-resource-scopes --default-resources --default-res --preemptible-rules --preemptible-retries --envvars --touch -t --keep-going -k --force -f --executor -e --forceall -F --forcerun -R --prioritize -P --batch --until -U --omit-from -O --shadow-prefixDIR --scheduler --wms-monitor --wms-monitor-arg --scheduler-ilp-solver --conda-base-path --no-subworkflows --nosw --precommand --groups --group-components --report --report-stylesheet --reporterPLUGIN --draft-notebook --edit-notebook --notebook-listen --lint --generate-unit-tests --containerize --export-cwl --list-rules --list -l --list-target-rules --lt --dag --rulegraph --filegraph --d3dag --summary -S --detailed-summary -D --archive --cleanup-metadata --cmFILE --cleanup-shadow --skip-script-cleanup --unlock --list-changes --lc --list-input-changes --li --list-params-changes --lp --list-untracked --lu --delete-all-output --delete-temp-output --keep-incomplete --drop-metadata --version -v --printshellcmds -p --debug-dag --nocolor --quiet -q --print-compilation --verbose --force-use-threads --allow-ambiguity -a --ignore-incomplete --ii --max-inventory-time --latency-wait --output-wait -w --wait-for-files --wait-for-files-file --queue-input-wait-time --notemp --nt --all-temp --unneeded-temp-files --keep-storage-local-copies --target-files-omit-workdir-adjustment --allowed-rules --max-jobs-per-timespan --max-jobs-per-second --max-status-checks-per-second --seconds-between-status-checks --retries --restart-times -T --wrapper-prefix --default-storage-provider --default-storage-prefix --local-storage-prefix --remote-job-local-storage-prefix --shared-fs-usage --scheduler-greediness --greediness --no-hooks --debug --runtime-profile --local-groupid --attempt --log-handler-script --log-service --job-deploy-sources --benchmark-extended --container-image --immediate-submit --is --jobscript --js --jobname --jn --flux --container-cleanup-images --use-conda --conda-not-block-search-path-envvars --list-conda-envs --conda-cleanup-envs --conda-cleanup-pkgs --conda-create-envs-only --conda-frontend --use-apptainer --use-singularity --apptainer-prefix --singularity-prefix --apptainer-args --singularity-args --use-envmodules --scheduler-solver-path --deploy-sources --target-jobs --mode --report-html-path --report-html-stylesheet-path".split()
+        forbidden = set("--rerun-incomplete --ri --show-failed-logs --rerun-triggers --nolock --software-deployment-method --smd --deployment --deployment-method --conda-prefix --cores -c --directory -d --snakefile -s --configfile --configfiles".split())
+        available = set("--dry-run --dryrun -n --profile --cache --jobs -j --local-cores --resources --res --set-threads --max-threads --set-resources --set-scatter --set-resource-scopes --default-resources --default-res --preemptible-rules --preemptible-retries --envvars --touch -t --keep-going -k --force -f --executor -e --forceall -F --forcerun -R --prioritize -P --batch --until -U --omit-from -O --shadow-prefixDIR --scheduler --wms-monitor --wms-monitor-arg --scheduler-ilp-solver --conda-base-path --no-subworkflows --nosw --precommand --groups --group-components --report --report-stylesheet --reporterPLUGIN --draft-notebook --edit-notebook --notebook-listen --lint --generate-unit-tests --containerize --export-cwl --list-rules --list -l --list-target-rules --lt --dag --rulegraph --filegraph --d3dag --summary -S --detailed-summary -D --archive --cleanup-metadata --cmFILE --cleanup-shadow --skip-script-cleanup --unlock --list-changes --lc --list-input-changes --li --list-params-changes --lp --list-untracked --lu --delete-all-output --delete-temp-output --keep-incomplete --drop-metadata --version -v --printshellcmds -p --debug-dag --nocolor --quiet -q --print-compilation --verbose --force-use-threads --allow-ambiguity -a --ignore-incomplete --ii --max-inventory-time --latency-wait --output-wait -w --wait-for-files --wait-for-files-file --queue-input-wait-time --notemp --nt --all-temp --unneeded-temp-files --keep-storage-local-copies --target-files-omit-workdir-adjustment --allowed-rules --max-jobs-per-timespan --max-jobs-per-second --max-status-checks-per-second --seconds-between-status-checks --retries --restart-times -T --wrapper-prefix --default-storage-provider --default-storage-prefix --local-storage-prefix --remote-job-local-storage-prefix --shared-fs-usage --scheduler-greediness --greediness --no-hooks --debug --runtime-profile --local-groupid --attempt --log-handler-script --log-service --job-deploy-sources --benchmark-extended --container-image --immediate-submit --is --jobscript --js --jobname --jn --flux --container-cleanup-images --use-conda --conda-not-block-search-path-envvars --list-conda-envs --conda-cleanup-envs --conda-cleanup-pkgs --conda-create-envs-only --conda-frontend --use-apptainer --use-singularity --apptainer-prefix --singularity-prefix --apptainer-args --singularity-args --use-envmodules --scheduler-solver-path --deploy-sources --target-jobs --mode --report-html-path --report-html-stylesheet-path".split())

111-114: Improve path handling and error messages.

Use os.path.join for path construction and consolidate similar error messages.

Apply this diff:

-        if not os.path.exists(f"{value}/config.yaml"):
-            self.fail(f"{value} does not contain the necessary config.yaml file.", param, ctx)
-        elif not os.access(f"{value}/config.yaml", os.R_OK):
-            self.fail(f"{value}/config.yaml does not have read access. Please check the file permissions and try again.", param, ctx)
+        config_path = os.path.join(value, "config.yaml")
+        if not os.path.exists(config_path):
+            self.fail(f"{value} does not contain the necessary config.yaml file.", param, ctx)
+        elif not os.access(config_path, os.R_OK):
+            self.fail(f"{config_path} does not have read access. Please check the file permissions and try again.", param, ctx)

117-117: Document planned program-specific parameter types.

Consider adding more detailed documentation about the planned program-specific parameter types, including:

  • Expected parameter types to be implemented
  • Their purpose and validation requirements
  • Any dependencies or prerequisites
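
As a rough illustration of what that documentation could cover, a minimal sketch of a program-specific type built on click.ParamType is shown below; the class name and the rejected flags are hypothetical placeholders, not the rules harpy actually enforces.

import click

class SomeToolParams(click.ParamType):
    """Illustrative sketch: validate text passed through --extra-params to one specific tool."""
    name = "some_tool_params"
    def convert(self, value, param, ctx):
        # hypothetical deny-list: options the workflow already sets internally
        forbidden = {"-t", "-R", "-o"}
        for arg in value.split():
            if arg in forbidden:
                self.fail(f"{arg} is set by the workflow and cannot be passed via --extra-params", param, ctx)
        return value

Documentation for each planned type could then state which tool it targets, which flags are reserved by the workflow, and which are passed through untouched.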
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 07500c0 and 1ff9d39.

📒 Files selected for processing (5)
  • .github/workflows/tests.yml (2 hunks)
  • harpy/_cli_types_generic.py (1 hunks)
  • harpy/_cli_types_params.py (1 hunks)
  • harpy/align.py (12 hunks)
  • harpy/simulate.py (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/_cli_types_params.py
🧰 Additional context used
🪛 Ruff
harpy/_cli_types_generic.py

21-21: Do not use bare except

(E722)


22-22: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


67-67: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)

🔇 Additional comments (11)
harpy/align.py (3)

12-13: LGTM! Import changes align with validation objectives.

The new imports from _cli_types_generic and _cli_types_params support the enhanced command-line validation strategy.


68-68: LGTM! Enhanced type safety with custom parameter types.

The changes improve input validation by:

  • Using InputFile("fasta", gzip_ok=True) for genome files
  • Using specific parameter types for extra parameters (BwaParams, EmaParams, StrobeAlignParams)
  • Using HPCProfile and SnakemakeParams for workflow controls

Also applies to: 70-70, 79-79, 82-82, 152-152, 156-156, 166-166, 169-169, 258-258, 260-260, 270-270, 273-273
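
To make the pattern concrete, here is a hedged sketch of how such option declarations typically look; the option names, help text, and import paths are illustrative and not copied from align.py.

import click
from harpy._cli_types_generic import HPCProfile, InputFile, SnakemakeParams
from harpy._cli_types_params import BwaParams

@click.command()
@click.option("--genome", type=InputFile("fasta", gzip_ok=True), required=True)
@click.option("--extra-params", type=BwaParams(), help="Additional arguments passed to bwa")
@click.option("--hpc", type=HPCProfile(), help="Directory containing an HPC config.yaml")
@click.option("--snakemake", type=SnakemakeParams(), help="Extra Snakemake arguments, quoted")
def bwa(genome, extra_params, hpc, snakemake):
    """Illustrative stub: by the time this body runs, every option has already been validated."""

With this arrangement, a bad file extension or a disallowed Snakemake flag is rejected before any workflow files are written.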


108-108: LGTM! Consistent configuration handling across commands.

The configuration handling is well-structured and follows best practices:

  • Consistent pattern for optional parameters
  • Proper path resolution for input files
  • Clear separation of workflow and input configurations

Also applies to: 209-209, 302-302

.github/workflows/tests.yml (2)

347-347: LGTM! Verify the fragment-density functionality.

Removing the -x "-d" option simplifies the test command; its behavior is now provided by the new --fragment-density flag in the ema command.

#!/bin/bash
# Description: Verify the new --fragment-density flag is properly implemented
# in the harpy align ema command.

# Test: Search for the fragment-density parameter definition
ast-grep --pattern 'def align_ema($$$):
  $$$
  fragment_density$$$
  $$$'

232-232: LGTM! Verify the new filter parameter.

The change from --trim_poly_g to --low_complexity_filter appears to be a more comprehensive approach to filtering low-quality sequences.

harpy/simulate.py (6)

9-9: LGTM: Import of custom types for enhanced validation.

The addition of custom types from _cli_types_generic aligns with the PR objective of moving file-extension validation to the command-line level.


137-141: LGTM: Enhanced type validation for linkedreads command options.

The command options now use custom types for better validation:

  • HPCProfile() for HPC configuration
  • SnakemakeParams() for Snakemake parameters

142-143: LGTM: Improved file validation for genome inputs.

The genome arguments now use InputFile("fasta", gzip_ok = True) to validate FASTA files with gzip support.


211-212: LGTM: Enhanced file validation for snpindel command options.

The command options now use InputFile with appropriate extensions:

  • VCF files: gzip_ok = False
  • GFF files: gzip_ok = True

Also applies to: 219-221


229-232: LGTM: Improved parameter validation for snpindel workflow controls.

The workflow control options now use custom types:

  • HPCProfile() for HPC configuration
  • SnakemakeParams() for Snakemake parameters
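
As a rough usage-level illustration of what SnakemakeParams guards against, assuming the forbidden/available lists quoted earlier and that the type can be exercised outside of a live click context:

from harpy._cli_types_generic import SnakemakeParams

sp = SnakemakeParams()
sp.convert("--dry-run --quiet", None, None)   # accepted: both flags are on the allowed list
try:
    sp.convert("--cores 4", None, None)       # rejected: --cores is reserved for harpy itself
except Exception as err:
    print(err)

The same idea applies to HPCProfile(), which checks that the supplied directory actually contains a readable config.yaml before the workflow starts.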

167-168: Remove the quiet parameter from check_fasta calls.

The check_fasta function signature in _validations.py has been updated to accept only a single parameter, genofile. The calls need to be updated to match this change.

Also applies to: 275-275, 387-387, 502-502, 607-607
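
A minimal sketch of the call-site change being requested; the import path and file name are assumptions used for illustration.

from harpy._validations import check_fasta

check_fasta("genome.fasta")            # updated call: only the FASTA path is passed
# check_fasta("genome.fasta", quiet)   # old two-argument form, no longer matches the signature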


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
harpy/popgroups.py (1)

50-54: Fix color tag and consider consolidating status messages

The formatting improvements enhance readability, but there are two small issues:

  1. Line 50: Missing closing tag for [blue]
  2. Lines 52 and 54: Similar message structures could be consolidated

Apply these changes:

-    rprint(f"[bold]{len(samplenames)} samples [/bold] detected in [blue]{inputdir}[blue]", file = sys.stderr)
+    rprint(f"[bold]{len(samplenames)} samples [/bold] detected in [blue]{inputdir}[/blue]", file = sys.stderr)
-    if os.path.exists(output):
-        write_text = f"The file [blue]{output}[/blue] was overwritten."
-    else:
-        write_text = f"Created sample population grouping file [blue]{output}[/blue]."
+    write_text = f"{'Overwrote' if os.path.exists(output) else 'Created'} sample population grouping file [blue]{output}[/blue]"
harpy/hpc.py (2)

33-49: Enhance robustness of package validation.

While the function works, consider these improvements for better reliability:

  1. Add error handling for subprocess.run
  2. Make the package name prefix configurable
  3. Use more robust output parsing

Consider this implementation:

 def package_exists(pkg):
     """helper function to search for a package in the active conda environment"""
-    search_pkg = subprocess.run(f"conda list snakemake-executor-plugin-{pkg}".split(), capture_output = True, text = True)
-    exists = False
-    for i in search_pkg.stdout.split("\n"):
-        if i and not i.startswith("#"):
-            exists = True
+    pkg_prefix = "snakemake-executor-plugin-"
+    try:
+        search_pkg = subprocess.run(
+            f"conda list {pkg_prefix}{pkg}".split(),
+            capture_output=True,
+            text=True,
+            check=False  # Don't raise on non-zero exit
+        )
+        # Check if any non-header line contains the package name
+        exists = any(
+            line.strip() and not line.startswith("#")
+            for line in search_pkg.stdout.splitlines()
+        )
+    except subprocess.SubprocessError as e:
+        print_notice(f"Error checking for package: {e}")
+        return False

Line range hint 1-240: Architecture Review: Good separation of concerns and extensibility.

The changes improve the codebase by:

  1. Better organization of HPC profiles under system-specific directories
  2. Enhanced validation through package existence checks
  3. Consistent implementation across different HPC systems
  4. Improved user feedback through rich formatting

The modular design makes it easy to add support for new HPC systems in the future.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9212929 and 408114b.

📒 Files selected for processing (4)
  • .github/filters.yml (1 hunks)
  • harpy/hpc.py (7 hunks)
  • harpy/popgroups.py (2 hunks)
  • harpy/stitchparams.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • harpy/stitchparams.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/filters.yml
🔇 Additional comments (3)
harpy/popgroups.py (1)

7-7: LGTM: Rich library import for enhanced CLI output

The addition of rich's print function supports the improved command-line output formatting in this PR.

harpy/hpc.py (2)

4-7: LGTM: Import changes and docstring updates are appropriate.

The new imports support the package validation feature, and the switch to rich.print enhances user feedback. The docstring accurately reflects the new directory structure.

Also applies to: 19-21


56-57: LGTM: Consistent implementation of validation and output formatting.

The changes across all HPC profile generators are consistent and well-implemented:

  1. Improved output formatting using rich
  2. Added package validation before file creation

Let's verify the impact of the directory structure changes:

Also applies to: 92-93, 122-123, 151-152, 180-181

✅ Verification successful

Let me verify if these paths are being used correctly by checking the context around these usages.


Let me check if there are any other files that might be affected by or depend on these directory structures.


Directory structure changes are correctly implemented

The directory structure changes are properly handled in the code:

  • All HPC profile generators consistently create their respective directories using os.makedirs()
  • The paths are only referenced in harpy/hpc.py and nowhere else in the codebase
  • The error from the fd command confirms that the hpc/ directory doesn't exist yet, which is expected because it is created at runtime
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if any files in the codebase still reference the old hpc/ directory structure
rg -g '!*.{yaml,yml}' 'hpc/[^/]+/config\.yaml'

Length of output: 312


Script:

#!/bin/bash
# Check the context of how these paths are used
rg -B 2 -A 2 'outfile = "hpc/.+/config\.yaml"' harpy/hpc.py

Length of output: 1188


Script:

#!/bin/bash
# Search for any other references to hpc directory structure
rg -g '!{*.yaml,*.yml,*.pyc}' -l 'hpc/[^/]+' --type-add 'test:*.{py,sh,txt}' --type test

# Check if there are any config templates that might need to be moved
fd -e yaml -e yml . hpc/

Length of output: 222

@pdimens pdimens merged commit 31c905d into main Nov 5, 2024
19 of 20 checks passed
@pdimens pdimens deleted the cli_validations branch November 5, 2024 15:56