add command-line level validations #159
Conversation
📝 Walkthrough
This pull request includes significant updates to the command-line interface of the Harpy project. Key changes involve the introduction of new custom parameter types in multiple files, enhancing input validation for command-line options.
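To make the pattern concrete: a custom click parameter type validates its value in `convert()` and reports problems through `self.fail()`, so bad input is rejected before any workflow is configured. The sketch below illustrates the approach only — the class name and rules are hypothetical, not Harpy's actual `KParam` implementation:

```python
import click

class OddKParam(click.ParamType):
    """Hypothetical example of the custom-type pattern this PR introduces:
    parse a comma-separated list of k-mer sizes and reject even values."""
    name = "odd_k_param"

    def convert(self, value, param, ctx):
        try:
            ks = [int(i) for i in value.split(",")]
        except ValueError:
            self.fail(f"{value!r} is not a comma-separated list of integers", param, ctx)
        for k in ks:
            if k % 2 == 0:
                self.fail(f"k-mer size {k} must be odd", param, ctx)
        return ks
```

Attached via `@click.option('-k', type=OddKParam())`, invalid input produces a usage error at parse time instead of a mid-run failure.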
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (15)
harpy/qc.py (1)
Line range hint 46-116: Consider enhancing error handling and configuration validation. While the command function is well-structured, consider these improvements:
- Add validation for the output directory path to ensure write permissions.
- Add error handling for YAML configuration writing.
- Validate the combination of parameters (e.g., when both `deduplicate` and `deconvolve` are enabled).

Here's a suggested improvement for the configuration writing:

```diff
-    with open(os.path.join(workflowdir, 'config.yaml'), "w", encoding="utf-8") as config:
-        yaml.dump(configs, config, default_flow_style= False, sort_keys=False, width=float('inf'))
+    config_path = os.path.join(workflowdir, 'config.yaml')
+    try:
+        with open(config_path, "w", encoding="utf-8") as config:
+            yaml.dump(configs, config, default_flow_style=False, sort_keys=False, width=float('inf'))
+    except (IOError, yaml.YAMLError) as e:
+        click.echo(f"Error writing configuration to {config_path}: {str(e)}", err=True)
+        sys.exit(1)
```

harpy/impute.py (1)
Line range hint 45-57: Enhance docstring with validation-related information. While the docstring provides good examples, it could be enhanced by mentioning the new validation features, particularly:
- Supported VCF file formats (uncompressed only)
- Format requirements for the HPC profile
- Validation rules for STITCH parameters

Consider adding a section like:

```diff
     harpy ... -x 'switchModelIterations = 39, splitReadIterations = NA, reference_populations = c("CEU","GBR")'
     ...
+
+    Note: The command performs validation on:
+    - VCF files (uncompressed format only)
+    - STITCH parameters (R-style syntax validation)
+    - HPC profile (yaml configuration validation)
```

harpy/snakefiles/assembly.smk (1)
Line range hint 1-24: Consider enhancing parameter validation. Given this PR's focus on command-line validation, consider adding parameter validation at the workflow level:
- The container tag is hard-coded to `latest`, which can lead to reproducibility issues
- Input parameters like `max_mem`, `k_param`, etc., lack validation

Consider:
- Using a specific container tag instead of `latest`
- Adding parameter validation in the workflow configuration:

```python
# Add at the start of the workflow
validate_config = {
    "inputs": {
        "fastq_r1": str,
        "fastq_r2": str
    },
    "spades": {
        "max_memory": lambda x: isinstance(x, int) and x > 0,
        "k": lambda x: isinstance(x, int) and x > 0 and x % 2 == 1
    }
    # ... add more validation rules
}
from snakemake.utils import validate
validate(config, validate_config)
```

harpy/assembly.py (1)
38-38: Enhance help text for `--extra-params`. Consider updating the help text to indicate the expected format for SPAdes parameters, similar to how it's done for the `--arcs-extra` option.

```diff
-@click.option('-x', '--extra-params', type = SpadesParams(), help = 'Additional spades parameters, in quotes')
+@click.option('-x', '--extra-params', type = SpadesParams(), help = 'Additional SPAdes parameters, in quotes (`option=arg` format)')
```

harpy/preflight.py (2)
45-45: Enhance help text and option ordering. While the type changes improve validation, consider the following improvements:
- Enhance help text to be more descriptive:
  - For `--hpc`: Specify the expected format of the `config.yaml` file
  - For `--snakemake`: Provide examples of valid parameters
- Consider grouping related options together (e.g., move `--snakemake` next to `--hpc`, since both are workflow control options)

```diff
-@click.option('--hpc', type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file')
+@click.option('--hpc', type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file (must contain cluster configuration)')
 @click.option('--quiet', is_flag = True, show_default = True, default = False, help = 'Don\'t show output text while running')
-@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters, in quotes')
+@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters in quotes (e.g. "--restart-times 3 --keep-going")')
```

Also applies to: 47-47
105-106: Maintain consistency between fastq and bam commands. While the type changes are good, the order of options differs between the `fastq` and `bam` commands. Consider maintaining the same option order for consistency in the CLI interface.

```diff
-@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters, in quotes')
-@click.option('--hpc', type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file')
+@click.option('--hpc', type = HPCProfile(), help = 'Directory with HPC submission `config.yaml` file (must contain cluster configuration)')
+@click.option('--quiet', is_flag = True, show_default = True, default = False, help = 'Don\'t show output text while running')
+@click.option('--snakemake', type = SnakemakeParams(), help = 'Additional Snakemake parameters in quotes (e.g. "--restart-times 3 --keep-going")')
```

harpy/snakefiles/align_ema.smk (1)
358-359: Consider improving the command string construction. The parameters are correctly added, but there are some potential improvements:
- The command string uses single quotes, which could cause issues with the RG tag escaping
- The indentation in the command string could be more consistent

Consider applying these changes:

```diff
-    ema_align += f'\tema align {extra} {params.frag_opt} -p {platform} -R "@RG\\tID:SAMPLE\\tSM:SAMPLE" |\n'
-    ema_align += f"\tsamtools view -h {params.unmapped} -q {config["alignment_quality"]} - |\n"
-    ema_align += "\tsamtools sort --reference genome"
+    ema_align += (
+        f'\tema align {extra} {params.frag_opt} -p {platform} '
+        f'-R "@RG\\tID:SAMPLE\\tSM:SAMPLE" |\n'
+        f'\tsamtools view -h {params.unmapped} '
+        f'-q {config["alignment_quality"]} - |\n'
+        '\tsamtools sort --reference genome'
+    )
```

Also applies to: 370-373
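As a sanity check on the reformatting suggested above: adjacent f-string literals inside parentheses are concatenated by the parser, so the multi-line form builds an identical command string. A small self-contained demonstration — `extra`, `frag_opt`, and `platform` below are placeholder values, not the Snakefile's real variables:

```python
# Placeholder values standing in for the Snakefile's variables
extra = "-d"
frag_opt = "-x"
platform = "haplotag"

one_line = f'\tema align {extra} {frag_opt} -p {platform} -R "@RG\\tID:SAMPLE\\tSM:SAMPLE" |\n'
multi_line = (
    f'\tema align {extra} {frag_opt} -p {platform} '
    f'-R "@RG\\tID:SAMPLE\\tSM:SAMPLE" |\n'
)
# The two forms produce byte-for-byte identical strings
assert one_line == multi_line
```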
harpy/snp.py (1)
Line range hint 67-150: Consider reducing code duplication. There's significant code duplication between the `mpileup` and `freebayes` functions, particularly in the configuration setup and workflow launch logic. Consider extracting the common code into a shared helper function.

Here's a suggested approach:

```python
def setup_snp_workflow(
    workflow_name: str,
    inputs: list,
    output_dir: str,
    genome: str,
    threads: int,
    populations: str,
    ploidy: int,
    regions: str,
    extra_params: dict,
    snakemake: str,
    skip_reports: bool,
    quiet: bool,
    hpc: str,
    conda: bool,
    setup_only: bool
) -> None:
    """Common setup logic for SNP workflow commands."""
    output_dir = output_dir.rstrip("/")
    workflowdir = os.path.join(output_dir, 'workflow')
    # ... rest of the common setup code ...
    configs = {
        "workflow": f"snp {workflow_name}",
        # ... rest of the config setup ...
    }
    # ... rest of the common launch code ...
```

This would simplify both command functions to:

```python
@click.command(...)
def mpileup(...):
    """Call variants using bcftools mpileup"""
    setup_snp_workflow(
        workflow_name="mpileup",
        inputs=inputs,
        output_dir=output_dir,
        # ... rest of the parameters ...
    )

@click.command(...)
def freebayes(...):
    """Call variants using freebayes"""
    setup_snp_workflow(
        workflow_name="freebayes",
        inputs=inputs,
        output_dir=output_dir,
        # ... rest of the parameters ...
    )
```

Also applies to: 172-255
harpy/sv.py (1)
10-11: Remove unused imports. The following imports from `._cli_types_params` are not used in the code:
- `LeviathanParams`
- `NaibrParams`

```diff
 from ._cli_types_generic import ContigList, HPCProfile, InputFile, SnakemakeParams
-from ._cli_types_params import LeviathanParams, NaibrParams
```

🧰 Tools
🪛 Ruff
11-11: `._cli_types_params.LeviathanParams` imported but unused. Remove unused import (F401)
11-11: `._cli_types_params.NaibrParams` imported but unused. Remove unused import (F401)
harpy/_validations.py (3)
Line range hint 415-474: Add error handling for file operations and improve resource management. The function should handle file operations more safely to prevent resource leaks and provide better error messages.

Consider using a context manager and try-except blocks:

```diff
 def check_fasta(genofile):
     """perform validations on fasta file for extensions and file contents"""
-    if is_gzip(genofile):
-        fasta = gzip.open(genofile, 'rt')
-    elif is_plaintext(genofile):
-        fasta = open(genofile, 'r', encoding="utf-8")
-    else:
+    try:
+        if is_gzip(genofile):
+            fasta = gzip.open(genofile, 'rt')
+        elif is_plaintext(genofile):
+            fasta = open(genofile, 'r', encoding="utf-8")
+        else:
+            print_error("unknown file type", f"Unable to determine file encoding for [blue]{genofile}[/blue]. Please check that it is a gzipped or uncompressed FASTA file.")
+            sys.exit(1)
+    except IOError as e:
         print_error("unknown file type", f"Unable to determine file encoding for [blue]{genofile}[/blue]. Please check that it is a gzipped or uncompressed FASTA file.")
         sys.exit(1)
-    line_num = 0
-    seq_id = 0
-    seq = 0
-    last_header = False
-    for line in fasta:
+
+    try:
+        with fasta:
+            line_num = 0
+            seq_id = 0
+            seq = 0
+            last_header = False
+            for line in fasta:
```

🧰 Tools
🪛 Ruff
419-419: Use a context manager for opening files (SIM115)
Line range hint 476-497: Add type hints and improve documentation. The function implementation is solid, but could benefit from better documentation and type hints.

Consider these improvements:

```diff
-def fasta_contig_match(contigs, fasta):
-    """Checks whether a list of contigs are present in a fasta file"""
+def fasta_contig_match(contigs: list[str], fasta: str) -> None:
+    """Checks whether a list of contigs are present in a fasta file.
+
+    Args:
+        contigs: List of contig names to verify.
+        fasta: Path to the FASTA file.
+
+    Raises:
+        SystemExit: If any contig is not found in the FASTA file.
+    """
```
Line range hint 499-542: Add parameter validation and improve memory efficiency. The function could benefit from input validation and memory optimization.

Consider these improvements:

```diff
-def validate_fastq_bx(fastq_list, threads, quiet):
+def validate_fastq_bx(fastq_list: list[str], threads: int, quiet: bool) -> None:
     """
     Parse a list of fastq files to verify that they have BX/BC tag, and only one of those two types per file
+
+    Args:
+        fastq_list: List of FASTQ file paths to validate
+        threads: Number of parallel threads to use
+        quiet: Whether to suppress progress bar
+
+    Raises:
+        SystemExit: If validation fails for any file
     """
+    if not fastq_list:
+        print_error("no files provided", "The list of FASTQ files is empty.")
+        sys.exit(1)
+    if threads < 1:
+        print_error("invalid threads", f"Invalid number of threads: {threads}")
+        sys.exit(1)
+
     def validate(fastq):
         BX = False
         BC = False
-        if is_gzip(fastq):
-            fq = gzip.open(fastq, "rt")
-        else:
-            fq = open(fastq, "r")
-        with fq:
+        opener = gzip.open if is_gzip(fastq) else open
+        with opener(fastq, "rt") as fq:
             while True:
                 line = fq.readline()
                 if not line:
                     break
                 if not line.startswith("@"):
                     continue
                 BX = True if "BX:Z" in line else BX
                 BC = True if "BC:Z" in line else BC
                 if BX and BC:
                     print_error("clashing barcode tags", f"Both [green bold]BC:Z[/green bold] and [green bold]BX:Z[/green bold] tags were detected in the read headers for [blue]{os.path.basename(fastq)}[/blue]. Athena accepts [bold]only[/bold] one of [green bold]BC:Z[/green bold] or [green bold]BX:Z[/green bold].")
                     print_solution("Check why your data has both tags in use and remove/rename one of the tags.")
                     sys.exit(1)
```
harpy/simulate.py (1)
127-127: Consider using InputFile type for VCF options. Several options still use the generic `Path` type where `InputFile` could provide better validation:
- `--vcf` options in the inversion, cnv, and translocation commands

Example improvement:

```diff
-@click.option('-v', '--vcf', type=click.Path(exists=True, dir_okay=False, readable=True), help = 'VCF file of known variants to simulate')
+@click.option('-v', '--vcf', type=InputFile("vcf", gzip_ok = True), help = 'VCF file of known variants to simulate')
```

Also applies to: 345-345, 454-454, 566-566
harpy/_cli_types_generic.py (1)
109-112: Use `os.path.join` for cross-platform compatibility. When constructing file paths, it's recommended to use `os.path.join` to ensure compatibility across different operating systems.

Apply this diff to improve path handling:

```diff
         if not os.path.exists(value):
             self.fail(f"{value} does not exist. Please check the spelling and try again.", param, ctx)
         elif not os.access(value, os.R_OK):
             self.fail(f"{value} is not readable. Please check file/directory permissions and try again", param, ctx)
         if os.path.isfile(value):
             self.fail(f"{value} is a file, but input should be a directory.", param, ctx)
-        if not os.path.exists(f"{value}/config.yaml"):
+        config_file = os.path.join(value, "config.yaml")
+        if not os.path.exists(config_file):
             self.fail(f"{value} does not contain the necessary config.yaml file.", param, ctx)
-        elif not os.access(f"{value}/config.yaml", os.R_OK):
+        elif not os.access(config_file, os.R_OK):
             self.fail(f"{value}/config.yaml does not have read access. Please check the file permissions and try again.", param, ctx)
```

harpy/_cli_types_params.py (1)
5-231: Consider refactoring to reduce code duplication across parameter validation classes. The `convert` methods in all parameter validation classes (`FastpParams`, `BwaParams`, `EmaParams`, etc.) share similar logic for parsing and validating options. Refactoring common code into a base class or utility functions can enhance maintainability and reduce redundancy. This approach will make it easier to add new parameter types or update validation logic in the future.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (20)
- .github/filters.yml (1 hunks)
- harpy/_cli_types.py (0 hunks)
- harpy/_cli_types_generic.py (1 hunks)
- harpy/_cli_types_params.py (1 hunks)
- harpy/_validations.py (1 hunks)
- harpy/align.py (12 hunks)
- harpy/assembly.py (3 hunks)
- harpy/deconvolve.py (2 hunks)
- harpy/demultiplex.py (2 hunks)
- harpy/hpc.py (6 hunks)
- harpy/impute.py (2 hunks)
- harpy/metassembly.py (2 hunks)
- harpy/phase.py (3 hunks)
- harpy/preflight.py (3 hunks)
- harpy/qc.py (2 hunks)
- harpy/simulate.py (11 hunks)
- harpy/snakefiles/align_ema.smk (5 hunks)
- harpy/snakefiles/assembly.smk (1 hunks)
- harpy/snp.py (5 hunks)
- harpy/sv.py (7 hunks)
💤 Files with no reviewable changes (1)
- harpy/_cli_types.py
🧰 Additional context used
🪛 Ruff
harpy/_cli_types_generic.py
21-21: Do not use bare `except` (E722)
22-22: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
harpy/sv.py
11-11: `._cli_types_params.LeviathanParams` imported but unused. Remove unused import (F401)
11-11: `._cli_types_params.NaibrParams` imported but unused. Remove unused import (F401)
🔇 Additional comments (40)
.github/filters.yml (1)
4-4: LGTM! Verify the new file location.
The path update from `_cli_types.py` to `_cli_types_generic.py` aligns with the PR's objective to enhance command-line validation. Since this file is referenced in the `common` section and used across multiple workflows, let's verify its existence.
✅ Verification successful
File path and content verification successful
The file `harpy/_cli_types_generic.py` exists at the specified location and contains the expected parameter type definitions for command-line validation, including `IntList`, `KParam`, `ContigList`, `InputFile`, `SnakemakeParams`, and `HPCProfile` classes that extend `click.ParamType`. The path update in `filters.yml` is correct and consistent with the actual file location.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Verify that the new CLI types file exists and contains the expected content
# Test 1: Check if the file exists at the new location
fd "_cli_types_generic.py$" harpy/
# Test 2: Verify it contains parameter type definitions
rg -A 5 "class.*\(.*ParamType\)" harpy/_cli_types_generic.py
```

Length of output: 4561
harpy/deconvolve.py (2)
9-9: LGTM! Import changes align with validation improvements.
The import of custom parameter types from `._cli_types_generic` supports the PR's objective of enhancing command-line validation.
37-37: Verify backward compatibility of HPCProfile validation.
The change from `click.Path` to `HPCProfile` suggests enhanced validation. Please ensure that:
- All valid HPC config formats are still accepted
- Error messages are clear when validation fails

Let's check for any existing HPC config examples in the repository:

```shell
#!/bin/bash
# Search for HPC config examples or tests
fd -e yaml -e yml --exec grep -l "HPC|cluster|submit" {}
```
harpy/demultiplex.py (2)
10-10: LGTM! Import of custom parameter types aligns with validation objectives.
The import of `HPCProfile` and `SnakemakeParams` from the new `_cli_types_generic` module supports the PR's goal of enhancing command-line validation.
47-47: Enhanced type validation for --hpc and --snakemake options.
The changes improve input validation by:
- Replacing generic path validation with HPC-specific validation for `--hpc`
- Adding parameter syntax validation for `--snakemake` instead of accepting any string

This aligns well with the PR's objective of enhancing command-line validation.
Let's verify the consistency of these changes across other commands:
Also applies to: 50-50
✅ Verification successful
Type validation changes are consistently applied across all commands
The search results confirm that:
- The `--hpc` option consistently uses the `HPCProfile()` type across all commands
- The `--snakemake` option consistently uses the `SnakemakeParams()` type across all commands

The changes are part of a systematic update to improve command-line validation across the entire codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check if all commands use the same type validation
# Test 1: Verify HPCProfile usage is consistent
echo "Checking HPCProfile usage..."
rg --type python "type\s*=\s*(click\.Path|HPCProfile)\(\)" -A 2 -B 2
# Test 2: Verify SnakemakeParams usage is consistent
echo "Checking SnakemakeParams usage..."
rg --type python "type\s*=\s*(str|SnakemakeParams)\(\)" -A 2 -B 2
```

Length of output: 344
Script:

```shell
#!/bin/bash
# Let's try without the --type flag and use file extension patterns instead
# Test 1: Verify HPCProfile usage is consistent
echo "Checking HPCProfile usage..."
rg "type\s*=\s*(click\.Path|HPCProfile)\(\)" -g "*.py" -A 2 -B 2
# Test 2: Verify SnakemakeParams usage is consistent
echo "Checking SnakemakeParams usage..."
rg "type\s*=\s*(str|SnakemakeParams)\(\)" -g "*.py" -A 2 -B 2
# Test 3: Find all Click options to ensure we haven't missed any
echo "Finding all Click options..."
rg "@click\.option" -g "*.py" -A 1
```

Length of output: 78634
harpy/metassembly.py (4)
12-13: LGTM: Well-organized type imports.
The separation of CLI types into generic and parameter-specific modules improves code organization and maintainability.
Line range hint 48-114: LGTM: Clean implementation of enhanced parameter validation.
The implementation successfully:
- Maintains existing functionality while improving type safety
- Adds software-specific validation for SPAdes parameters
- Standardizes parameter handling across the codebase
41-41: Verify consistent parameter type usage across commands.
The changes to the `HPCProfile()` and `SnakemakeParams()` types suggest improved validation. Let's verify consistent usage across other commands:

```shell
#!/bin/bash
# Description: Check consistent usage of parameter types across commands
# Test: Look for HPCProfile usage across all Python files
rg -l "HPCProfile\(\)" "*.py"
# Test: Look for SnakemakeParams usage across all Python files
rg -l "SnakemakeParams\(\)" "*.py"
# Test: Look for any remaining old-style parameter handling
rg "type\s*=\s*click\.(?!Choice|Path|IntRange)" "*.py"
```

Also applies to: 45-45
35-35: Verify SPAdes parameter validation implementation.
The change to the `SpadesParams()` type aligns with the PR objective of adding software-specific validation. Let's verify the implementation:
✅ Verification successful
SPAdes parameter validation is properly implemented
The `SpadesParams` class provides comprehensive validation:
- Maintains a list of valid SPAdes-specific options
- Prevents conflicts with Harpy's internal options
- Validates each parameter against SPAdes documentation
- Provides helpful error messages with documentation links
- Handles mate-pair library options (`--mpX-Y` and `--hqmpX-Y`)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check the implementation of SpadesParams validation
# Test: Look for the SpadesParams class definition and validation logic
ast-grep --pattern 'class SpadesParams:
  $$$
'
# Test: Look for any SPAdes-specific parameter validation
rg -A 5 'SpadesParams'
```

Length of output: 479

Script:

```shell
#!/bin/bash
# Search for SpadesParams class definition and usage
rg -A 10 "class SpadesParams"
# Search for files containing SpadesParams
fd -e py | xargs rg "SpadesParams"
# Look for parameter validation patterns in Python files
fd -e py | xargs rg -A 5 "def.*param.*valid"
```

Length of output: 1648

Script:

```shell
#!/bin/bash
# Get the complete implementation of SpadesParams class
rg -A 30 "class SpadesParams" harpy/_cli_types_params.py
```

Length of output: 2470
harpy/qc.py (2)
12-13: LGTM! Well-organized type imports.
The separation of CLI types into generic and parameter-specific modules improves code organization and maintainability.
32-32: Verify the validation logic in the new parameter types.
The switch to specific parameter types (`FastpParams`, `HPCProfile`, `SnakemakeParams`) improves type safety. Let's verify their implementation:
Also applies to: 40-40, 43-43
✅ Verification successful
Validation logic in parameter types is well-implemented and comprehensive
The validation logic in all three parameter types is robust and well-designed:
- `FastpParams`: Validates against predefined lists of allowed and Harpy-specific options, preventing conflicts
- `HPCProfile`: Performs thorough filesystem checks:
  - Existence of directory
  - Read permissions
  - Directory (not file) validation
  - Presence of required `config.yaml`
- `SnakemakeParams`: Implements comprehensive option validation:
  - Prevents usage of forbidden options that conflict with Harpy
  - Validates against a complete list of available Snakemake parameters
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check the validation logic in the new parameter types
# Test 1: Check FastpParams validation
echo "Checking FastpParams validation..."
rg -A 10 "class FastpParams"
# Test 2: Check HPCProfile validation
echo "Checking HPCProfile validation..."
rg -A 10 "class HPCProfile"
# Test 3: Check SnakemakeParams validation
echo "Checking SnakemakeParams validation..."
rg -A 10 "class SnakemakeParams"
```

Length of output: 6800
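These per-type checks can also be exercised end to end with click's `CliRunner`, which is a convenient way to regression-test them. The validator below is a simplified stand-in for demonstration only — it is not the real `SnakemakeParams` and its forbidden-option list is illustrative:

```python
import click
from click.testing import CliRunner

class FakeSnakemakeParams(click.ParamType):
    """Simplified stand-in for the real validator, for demonstration only."""
    name = "snakemake_params"
    forbidden = {"--cores", "--configfile", "--snakefile"}

    def convert(self, value, param, ctx):
        for token in value.split():
            if token in self.forbidden:
                self.fail(f"{token} is controlled by the workflow and cannot be overridden", param, ctx)
        return value

@click.command()
@click.option("--snakemake", type=FakeSnakemakeParams())
def cmd(snakemake):
    click.echo(snakemake or "none")

runner = CliRunner()
accepted = runner.invoke(cmd, ["--snakemake", "--keep-going"])   # passes validation
rejected = runner.invoke(cmd, ["--snakemake", "--cores 8"])      # rejected at parse time
```

`accepted.exit_code` is 0 while `rejected` exits non-zero with a usage error, so a handful of such invocations pin the validator's behavior in the test suite.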
harpy/impute.py (3)
10-11: LGTM: Clean import organization for new type validators.
The new imports from `_cli_types_generic` and `_cli_types_params` are well-organized and align with the PR's objective of enhancing command-line validation.
32-32: LGTM: Enhanced type validation at command-line level.
The updated click options with custom type validators (`StitchParams`, `InputFile`, `HPCProfile`, `SnakemakeParams`) provide stronger validation at the command-line level, which aligns well with the PR objectives.
Also applies to: 34-34, 36-36, 39-39, 42-42
36-36: Verify VCF file handling restrictions.
The `gzip_ok=False` parameter on the VCF input type suggests that compressed VCF files are not supported. This might be intentional, but it's worth verifying if this restriction is necessary, as compressed VCF files are common in bioinformatics workflows.
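Whatever the decision, detecting compression for such a check is cheap: gzip streams always begin with the two magic bytes `0x1f 0x8b`, so a validator can accept or reject gzipped input without trusting the file extension. A minimal sketch of the idea — an assumption about the approach, not Harpy's actual `is_gzip` helper:

```python
def is_gzip(path: str) -> bool:
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"
```

An `InputFile` type with `gzip_ok=False` could call a check like this and fail early with a clear message, instead of letting the workflow crash mid-run on an unreadable file.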
harpy/phase.py (3)
12-13: LGTM! Clean separation of CLI type validations.
The imports reflect a well-structured separation between generic CLI types and parameter-specific validations, aligning with the PR's objective of moving file-extension validation to the command-line level.
Also applies to: 15-15
31-32: Strong improvement in command-line parameter validation!
The changes enhance type safety by:
- Adding software-specific validation for HapCut2 parameters
- Enforcing correct file extensions and compression formats
- Implementing structured validation for HPC profiles and Snakemake parameters
Let's verify backward compatibility with existing parameter values:
Also applies to: 38-38, 42-42, 45-45
✅ Verification successful
Parameter type changes maintain compatibility and add robust validation
The implementation shows well-designed type validators that enhance CLI safety while maintaining compatibility:
- `HapCutParams`: Validates against predefined lists of valid HapCut2 options
- `InputFile`: Performs file existence and extension checks while explicitly handling gzip
- `SnakemakeParams`: Comprehensive validation against allowed/forbidden Snakemake parameters
- `HPCProfile`: Validates HPC profile directory structure
Since no existing usage examples were found in documentation or tests, and the type implementations show proper parameter handling with clear fallback paths, the changes appear safe. The validators primarily add validation on top of existing string inputs rather than changing the fundamental parameter format.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check if the new parameter types maintain backward compatibility
# with existing command usages in documentation, tests, or example scripts.
# Test: Search for example usages of the changed parameters
echo "Searching for example usages of changed parameters..."
rg -A 2 "harpy phase.*(-x|--extra-params|--genome|-g|--vcf|-v|--hpc|--snakemake)" \
  --glob '!*.py' \
  --glob '!*.pyc' \
  --glob '!*.git*'
```

Length of output: 248
Script:

```shell
#!/bin/bash
# Let's try a different approach to find usage examples and tests
# Check for test files that might contain CLI invocations
echo "=== Searching test files ==="
fd "test.*\.(py|sh|txt|md)" --type f --exec rg -l "harpy phase" {}
# Look for documentation files
echo -e "\n=== Searching documentation files ==="
fd "\.(md|rst|txt)" --type f --exec rg -l "harpy phase" {}
# Search for example scripts or configuration files
echo -e "\n=== Searching example files ==="
fd "(example|sample|config)" --type f --exec rg -l "harpy phase" {}
# Check for specific parameter type implementations
echo -e "\n=== Checking parameter type implementations ==="
rg -A 5 "class (HapCutParams|InputFile|HPCProfile|SnakemakeParams)"
```

Length of output: 5350
76-76: Verify the impact of removing the quiet parameter.
The `quiet` parameter has been removed from the `check_fasta` call. Let's ensure this doesn't affect error reporting behavior.
✅ Verification successful
Let me try a different approach to find the `check_fasta` implementation, since the previous attempt didn't yield results.
Let me check the git history to understand the changes made to the `check_fasta` function.
Removal of file extension validation is an intentional simplification
The git history shows that the `check_fasta` function was simplified by removing the file extension validation logic that was previously controlled by the `quiet` parameter. The function now focuses solely on validating file contents and encoding, which is a more robust approach, since file extensions aren't always reliable indicators of file format. This change is an improvement, as it removes potentially confusing warnings about unconventional file extensions while maintaining the essential validation of the file's actual content.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check the implementation of check_fasta function
# to understand the impact of removing the quiet parameter
# Test: Search for the check_fasta function implementation
echo "Searching for check_fasta implementation..."
ast-grep --pattern 'def check_fasta($$$)' -A 10
```

Length of output: 146

Script:

```shell
#!/bin/bash
# Search for check_fasta using ripgrep for better context
echo "Searching for check_fasta definition and usage..."
rg "def check_fasta" -A 10
rg "check_fasta\(" -A 2
```

Length of output: 3284

Script:

```shell
#!/bin/bash
# Check git history for changes to check_fasta function
git log -p -- harpy/_validations.py | grep -A 15 "check_fasta"
```

Length of output: 1681
harpy/assembly.py (4)
9-10: LGTM! Good separation of concerns.
The separation of CLI types into generic and software-specific modules improves code organization and maintainability.
40-40: LGTM! Clear parameter validation and documentation.
The change to the `ArcsParams()` type enhances validation while maintaining clear documentation of the expected format.
59-59: Verify Snakemake parameter handling.
The change to `SnakemakeParams()` adds validation, but we should verify that:
- The validation aligns with Snakemake's expected parameter format
- The parameter handling in the command construction is compatible with the validated format

```shell
#!/bin/bash
# Description: Check Snakemake parameter usage patterns
# Test: Search for Snakemake command construction patterns
rg -A 10 'snakemake.*--configfile'
# Test: Search for Snakemake parameter validation
ast-grep --pattern 'class SnakemakeParams'
```

Also applies to: 74-80
55-55: Verify backward compatibility of HPC profile validation.
The change from basic path validation to `HPCProfile()` suggests enhanced validation. Please ensure this change maintains compatibility with existing HPC configuration files.
✅ Verification successful
Based on the code inspection, I can now provide a final response:
HPC profile validation maintains backward compatibility
The `HPCProfile` type performs the same essential validation as before, checking that:
- The directory exists and is readable
- A `config.yaml` file exists in the directory and is readable

The change from path validation to `HPCProfile()` actually provides better error messages and more specific validation, while maintaining the same core requirements for HPC configuration files.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential breaking changes in HPC config validation
# Test: Search for HPC config usage patterns
rg -A 5 'config\.yaml' --type yaml
Length of output: 81
Script:
#!/bin/bash
# Let's check the HPCProfile implementation and its usage
ast-grep --pattern 'class HPCProfile'
# Search for HPC config related code
rg -A 5 'HPC.*config'
# Look for config file handling in assembly.py
rg -A 10 'config.*yaml' assembly.py
# Check for any HPC profile validation logic
rg -A 5 'validate.*hpc|hpc.*validate'
Length of output: 17557
Script:
#!/bin/bash
# Let's check the HPCProfile implementation
cat harpy/_cli_types_generic.py
Length of output: 8453
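The checks the review describes for `HPCProfile` (readable directory, readable `config.yaml` inside it) can be sketched as a plain function. This is a simplified illustration, not the actual Harpy implementation; a real click `ParamType` would report failures via `self.fail` rather than raising `ValueError`:

```python
import os

def validate_hpc_profile(path: str) -> str:
    """Check that `path` is a readable directory containing a readable config.yaml."""
    if not os.path.isdir(path):
        raise ValueError(f"{path} is not an existing directory")
    if not os.access(path, os.R_OK):
        raise ValueError(f"{path} does not have read access")
    config = os.path.join(path, "config.yaml")
    if not os.path.exists(config):
        raise ValueError(f"{path} does not contain the necessary config.yaml file")
    if not os.access(config, os.R_OK):
        raise ValueError(f"{config} does not have read access")
    return path
```

Because the directory and the file inside it are validated separately, the error message can tell the user exactly which requirement failed.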
harpy/preflight.py (2)
9-9
: LGTM: Import of custom parameter types.
The import of HPCProfile
and SnakemakeParams
from the new ._cli_types_generic
module aligns with the PR's objective of enhancing command-line validation.
9-9
: Verify custom type implementations.
Let's verify the implementation of the custom parameter types to ensure proper validation.
Also applies to: 45-45, 47-47, 105-106
✅ Verification successful
Custom type implementations are properly validated
The implementation shows robust input validation:
-
HPCProfile
validates:- Directory existence and readability
- Ensures input is a directory not a file
- Checks for required
config.yaml
file presence
-
SnakemakeParams
validates:- Maintains a comprehensive list of forbidden options that conflict with Harpy
- Validates against available Snakemake parameters
- Prevents parameter duplication
The types are consistently used across all CLI modules for --hpc
and --snakemake
options with proper help text.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check implementation and usage of custom parameter types
# Check implementation of custom types
echo "Checking custom type implementations..."
rg -A 10 "class (HPCProfile|SnakemakeParams)"
# Check usage patterns
echo -e "\nChecking usage patterns..."
rg "(HPCProfile|SnakemakeParams)\(\)"
Length of output: 9967
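The `SnakemakeParams` behavior confirmed above (forbidden options, an allowlist of Snakemake options, and duplicate detection) can be sketched with toy option sets. The real class carries the full forbidden/available lists; the sets below are stand-ins for illustration, and a click type would call `self.fail` instead of raising:

```python
def validate_snakemake_params(value: str, forbidden: set, available: set) -> str:
    """Reject options Harpy manages itself, unrecognized options, and duplicates."""
    seen = set()
    for arg in value.split():
        if not arg.startswith("-"):
            continue  # treat non-dashed tokens as values for the preceding option
        if arg in forbidden:
            raise ValueError(f"{arg} is managed by Harpy and cannot be passed through")
        if arg not in available:
            raise ValueError(f"{arg} is not a recognized Snakemake option")
        if arg in seen:
            raise ValueError(f"{arg} was specified more than once")
        seen.add(arg)
    return value
```

Validating the string up front means a typo fails at the command line rather than deep inside a Snakemake invocation.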
harpy/hpc.py (2)
16-18
: LGTM: Documentation accurately reflects the new directory structure.
The updated docstring clearly explains the new organization where configuration files are stored in scheduler-specific subdirectories (hpc/system/
), which matches the actual implementation.
41-41
: Verify the installation commands for all scheduler plugins.
The installation commands have been standardized to use conda
instead of mamba
. While this change is consistent, we should verify:
- That these packages are available in the bioconda channel
- That these are the correct package names for each scheduler
Also applies to: 83-83, 119-119, 154-154, 189-189
harpy/snakefiles/align_ema.smk (3)
22-22
: LGTM: Clear and consistent variable declaration.
The new frag_opt
variable is well-named and appropriately placed with other configuration variables.
163-163
: LGTM: Fragment optimization flag correctly integrated in command.
The frag_opt
parameter is properly positioned in the EMA align command.
153-153
: LGTM: Appropriate parameter handling for fragment density optimization.
The parameter is correctly configured to conditionally enable fragment density optimization with the -d
flag.
✅ Verification successful
LGTM: Fragment density optimization is properly implemented and documented
The verification confirms that the -d
flag for fragment density optimization is:
- Properly documented in the CLI as
--fragment-density
with help text "Perform read fragment density optimization" - Correctly configured in the align module's configuration as
fragment_density_optimization
- Appropriately handled in the Snakefile by setting
frag_opt = "-d"
when enabled
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that fragment density optimization is documented in EMA's help text
# Test: Check if -d flag is documented in EMA's help. Expect: Description of fragment density optimization.
rg -A 5 "fragment.*density|[-]d\s" README* docs/*
Length of output: 102
Script:
#!/bin/bash
# Let's try to find any documentation or usage of the -d flag in the codebase
rg -A 5 "\-d\s|fragment.*density" .
# Also search for EMA documentation or configuration files
fd -e md -e txt -e yaml -e yml -e json . | xargs rg -l "ema|align"
# Look for the actual usage of frag_opt in the code
ast-grep --pattern 'frag_opt = $_'
Length of output: 21992
harpy/snp.py (4)
10-11
: LGTM: Clean separation of CLI type validations
The separation of CLI types into generic and parameter-specific modules improves code organization and maintainability.
52-53
: LGTM: Enhanced parameter validation
The introduction of specific parameter types (MpileupParams
, InputFile
, HPCProfile
, SnakemakeParams
) strengthens input validation at the CLI level, which aligns well with the PR objectives.
Also applies to: 59-59, 64-64
157-158
: LGTM: Consistent parameter validation across commands
The freebayes command implements the same enhanced validation types as mpileup, maintaining consistency across the module. This is a good practice that ensures uniform behavior and maintainability.
Also applies to: 166-166, 169-169, 203-203
98-98
: Verify error handling behavior
The removal of the quiet
parameter from check_fasta
call might affect error reporting behavior.
harpy/sv.py (5)
67-67
: LGTM: Enhanced parameter validation
The use of custom parameter types (HPCProfile
, SnakemakeParams
) improves command-line validation by ensuring proper format and content of these parameters before workflow execution.
Also applies to: 70-70
158-158
: LGTM: Consistent parameter validation
The parameter validation approach is consistently applied across both commands, using the same custom types (HPCProfile
, SnakemakeParams
) for improved command-line validation.
Also applies to: 161-161
Line range hint 195-205
: LGTM: Comprehensive validation sequence
The validation sequence is well-structured and covers all necessary checks:
- Genome FASTA validation
- Contig validation (if specified)
- Population file validation (if specified)
- VCF phasing validation (if specified)
This ensures all inputs are valid before workflow execution.
145-146
: Verify extra parameters validation for naibr
The --extra-params
option should validate NAIBR-specific parameters to prevent invalid options from being passed to the workflow.
✅ Verification successful
Let me gather more context about the NaibrParams implementation to verify the validation logic.
Let me try a different approach to get the implementation details.
Based on the implementation shown in _cli_types_params.py
, I can now generate the final response:
The --extra-params
option already has proper NAIBR-specific parameter validation
The NaibrParams
class implements thorough validation that:
- Prevents usage of parameters already managed by Harpy (bam_file, prefix, outdir, threads, min_mapq, d, min_sv, k)
- Validates against allowed NAIBR parameters (blacklist, candidates)
- Enforces correct parameter format (ARG VAL pairs without dashes)
- Provides clear error messages with documentation links
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for NAIBR parameter validation
rg -A 5 "NaibrParams|validate_naibr"
Length of output: 888
Script:
#!/bin/bash
# Get full implementation of NaibrParams class
ast-grep --pattern 'class NaibrParams {
$$$
}'
# Also check usage of NaibrParams in sv.py
rg -A 3 "NaibrParams\(" harpy/sv.py
Length of output: 88
Script:
#!/bin/bash
# Get the full content of the file containing NaibrParams
cat harpy/_cli_types_params.py
# Check how extra-params is used in the naibr command
rg -B 2 -A 5 "extra-params.*naibr"
Length of output: 20774
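The `NaibrParams` checks verified above (reserved Harpy-managed names, an allowlist, dash-free `ARG VAL` pairs) can be sketched as follows. The reserved/allowed names are taken from the review text; a click type would report failures via `self.fail` rather than raising:

```python
def validate_naibr_params(value: str) -> str:
    """Validate NAIBR-style 'ARG VAL' pairs, written without dashes."""
    reserved = {"bam_file", "prefix", "outdir", "threads", "min_mapq", "d", "min_sv", "k"}
    allowed = {"blacklist", "candidates"}
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("options must be given as ARG VAL pairs, e.g. 'blacklist file.bed'")
    for arg in tokens[::2]:  # every other token is an option name
        if arg.startswith("-"):
            raise ValueError(f"{arg}: NAIBR options are written without dashes")
        if arg in reserved:
            raise ValueError(f"{arg} is already managed by Harpy")
        if arg not in allowed:
            raise ValueError(f"{arg} is not a recognized NAIBR option")
    return value
```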
56-57
: Verify extra parameters validation for leviathan
The --extra-params
option should validate leviathan-specific parameters to prevent invalid options from being passed to the workflow.
✅ Verification successful
Let me gather more context about how these parameters are actually used in the code.
Let me modify the search pattern to correctly find the leviathan function and its usage.
Let me try one final search pattern to locate the leviathan function and its parameter usage.
Based on the gathered context, I can now provide a final response about the parameter validation:
Extra parameters validation is properly implemented for leviathan
The --extra-params
option is properly validated through the LeviathanParams
type which:
- Defines allowed Harpy-specific options (
-b -i -g -o -v --minVariantSize
etc.) - Defines valid Leviathan-specific options (
--regionSize --maxLinks --mediumSize
etc.) - The validated parameters are correctly passed to the workflow configuration via the
extra
key
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for leviathan parameter validation
rg -A 5 "LeviathanParams|validate_leviathan"
Length of output: 1087
Script:
#!/bin/bash
# Search for usage of LeviathanParams and extra_params
ast-grep --pattern 'def leviathan($_) {
$$$
}'
rg -B 2 -A 5 '@click\.option.*extra-params.*\n.*def leviathan'
Length of output: 307
Script:
#!/bin/bash
# Search for leviathan function implementation and usage of extra_params
rg -U -B 2 -A 10 '@click\.option.*extra-params.*\n.*def leviathan'
# Also search for the leviathan function definition
ast-grep --pattern 'def leviathan($params) { $$$ }'
Length of output: 123
Script:
#!/bin/bash
# Search for the leviathan function and its implementation
rg -A 20 "^def leviathan" harpy/sv.py
# Search for where extra_params is used
rg -A 5 "extra_params" harpy/sv.py
Length of output: 2705
harpy/simulate.py (2)
9-9
: LGTM: Import changes align with validation improvements.
The imports have been updated to use the new custom parameter types from _cli_types_generic.py
, which supports the move to command-line level validation.
Also applies to: 14-14
137-137
: LGTM: Enhanced CLI validation with custom types.
The use of InputFile
, HPCProfile
, and SnakemakeParams
types improves input validation by enforcing file extensions and standardizing parameter handling at the command-line level.
Also applies to: 141-143, 211-212, 219-219, 221-221, 229-229, 232-233, 345-346, 354-354, 357-358, 454-455, 463-463, 466-467, 566-567, 575-575, 578-579
harpy/_cli_types_params.py (1)
104-116
: Verify the handling of options starting with dashes in ArcsParams
Disallowing options that start with a dash could prevent users from specifying valid arcs
options. While the intention is to avoid arguments meant for arcs-make
, some valid arcs
options might begin with a dash (e.g., -v
for verbose output). Please verify whether this restriction is necessary.
To confirm whether arcs
supports options starting with a dash, you can check the official documentation or run the following command:
This script fetches the ARCS README and extracts options that start with one or two dashes. Review the output to ensure that all valid options are accounted for and adjust the validation logic accordingly.
✅ Verification successful
Let me gather more information about how ARCS options are handled in the codebase.
Based on the search results, I can now generate the final response:
The restriction on dash-prefixed options in ArcsParams
is correct and necessary
Looking at the codebase, ARCS parameters are explicitly designed to be used in the format ARG=VAL
without dashes. This is evident from:
- The
valid_options
list inArcsParams
which contains options like "d", "k", "o", "e" (without dashes) - The error message correctly instructs users to use the format
ARG=VAL
(e.g.,k=15
) - The distinction between
arcs-make
arguments (which use dashes) andarcs
parameters is intentional to prevent parameter confusion
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for options starting with a dash in ARCS documentation.
# Expected result: List of valid ARCS options that start with a dash.
curl -s https://raw.githubusercontent.com/bcgsc/arcs/master/README.md | grep -Eo '^-{1,2}[a-zA-Z\-]+' | sort | uniq
Length of output: 124
Script:
#!/bin/bash
# Search for ARCS command-line options in the codebase
rg -A 5 "arcs options|ARCS options" --type py
# Search for any hardcoded ARCS options or parameters
rg -A 5 "valid_options.*=|harpy_options.*=" --type py
Length of output: 12494
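The `ARG=VAL` convention described above can be sketched as a small validator. The `valid_options` set below is the illustrative subset named in the review ("d", "k", "o", "e"), not the exhaustive list, and the real click type would use `self.fail` instead of raising:

```python
def validate_arcs_params(value: str) -> str:
    """Validate arcs-style 'ARG=VAL' tokens, rejecting dash-prefixed arcs-make arguments."""
    valid_options = {"d", "k", "o", "e"}  # illustrative subset of the real list
    for token in value.split():
        if token.startswith("-"):
            raise ValueError(f"{token}: arcs parameters use the ARG=VAL format (e.g. k=15), without dashes")
        arg, eq, val = token.partition("=")
        if not eq or not val:
            raise ValueError(f"{token} is not in ARG=VAL format")
        if arg not in valid_options:
            raise ValueError(f"{arg} is not a recognized arcs parameter")
    return value
```

Rejecting anything that starts with a dash is what keeps `arcs-make` arguments from leaking into the `arcs` parameter string.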
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (6)
harpy/_cli_types_generic.py (6)
24-25
: Enhance error message with count details. The error message could be more helpful by including the actual number of entries provided.
Apply this diff to improve the error message:
- self.fail(f"{value} is not a valid list of integers. The value should be {self.max_entries} integers separated by a comma.", param, ctx) + self.fail(f"{value} is not a valid list of integers. Found {len(parts)} entries but expected exactly {self.max_entries} integers separated by commas.", param, ctx)
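The suggested count-aware message can be sketched as a plain parsing function (a simplified stand-in for the `IntList` type; click would route the error through `self.fail`):

```python
def parse_int_list(value: str, max_entries: int) -> list:
    """Parse a comma-separated list of exactly `max_entries` integers."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != max_entries:
        raise ValueError(
            f"found {len(parts)} entries but expected exactly {max_entries} "
            "integers separated by commas"
        )
    return [int(p) for p in parts]  # int() raises ValueError for non-integers
```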
30-40
: Enhance input validation and error messages. The validation could be more robust with specific error messages for different failure cases.
Apply this diff to improve the validation:
def convert(self, value, param, ctx): try: + if not value: + raise ValueError("Empty input is not allowed") if value == "auto": return value parts = [i.strip() for i in value.split(',')] for i in parts: - if int(i) % 2 == 0 or int(i) > 128: - raise ValueError + num = int(i) + if num % 2 == 0: + self.fail(f"'{i}' is not an odd integer", param, ctx) + if num > 128: + self.fail(f"'{i}' exceeds maximum allowed value of 128", param, ctx) return [int(i) for i in parts] except ValueError: self.fail(f"{value} is not 'auto' or odd integers <128 separated by a comma.", param, ctx)
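The per-case checks suggested in the diff above can be sketched as a standalone parser ("auto" passthrough, odd integers, upper bound of 128). This is an illustration of the validation logic, not the actual `convert` method:

```python
def parse_kmer_lengths(value: str):
    """Return 'auto' unchanged, else a list of odd integers no greater than 128."""
    if value == "auto":
        return value
    lengths = []
    for part in (p.strip() for p in value.split(",")):
        num = int(part)  # raises ValueError for non-integers
        if num % 2 == 0:
            raise ValueError(f"{num} is not an odd integer")
        if num > 128:
            raise ValueError(f"{num} exceeds the maximum allowed value of 128")
        lengths.append(num)
    return lengths
```

Splitting the single combined check into three distinct failures is what makes the error messages specific: the user learns whether the value was non-numeric, even, or too large.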
67-68
: Simplify dictionary key check. The .keys() call is unnecessary when checking for key existence. Apply this diff:
- if self.filetype not in filedict.keys(): + if self.filetype not in filedict:
🧰 Tools
🪛 Ruff
67-67: Use
key not in dict
instead ofkey not in dict.keys()
Remove
.keys()
(SIM118)
91-92
: Optimize parameter validation using sets. Converting the lists to sets would improve lookup performance, especially given the large number of options.
Apply this diff:
- forbidden = "--rerun-incomplete --ri --show-failed-logs --rerun-triggers --nolock --software-deployment-method --smd --deployment --deployment-method --conda-prefix --cores -c --directory -d --snakefile -s --configfile --configfiles".split() - available = "--dry-run --dryrun -n --profile --cache --jobs -j --local-cores --resources --res --set-threads --max-threads --set-resources --set-scatter --set-resource-scopes --default-resources --default-res --preemptible-rules --preemptible-retries --envvars --touch -t --keep-going -k --force -f --executor -e --forceall -F --forcerun -R --prioritize -P --batch --until -U --omit-from -O --shadow-prefixDIR --scheduler --wms-monitor --wms-monitor-arg --scheduler-ilp-solver --conda-base-path --no-subworkflows --nosw --precommand --groups --group-components --report --report-stylesheet --reporterPLUGIN --draft-notebook --edit-notebook --notebook-listen --lint --generate-unit-tests --containerize --export-cwl --list-rules --list -l --list-target-rules --lt --dag --rulegraph --filegraph --d3dag --summary -S --detailed-summary -D --archive --cleanup-metadata --cmFILE --cleanup-shadow --skip-script-cleanup --unlock --list-changes --lc --list-input-changes --li --list-params-changes --lp --list-untracked --lu --delete-all-output --delete-temp-output --keep-incomplete --drop-metadata --version -v --printshellcmds -p --debug-dag --nocolor --quiet -q --print-compilation --verbose --force-use-threads --allow-ambiguity -a --ignore-incomplete --ii --max-inventory-time --latency-wait --output-wait -w --wait-for-files --wait-for-files-file --queue-input-wait-time --notemp --nt --all-temp --unneeded-temp-files --keep-storage-local-copies --target-files-omit-workdir-adjustment --allowed-rules --max-jobs-per-timespan --max-jobs-per-second --max-status-checks-per-second --seconds-between-status-checks --retries --restart-times -T --wrapper-prefix --default-storage-provider --default-storage-prefix --local-storage-prefix 
--remote-job-local-storage-prefix --shared-fs-usage --scheduler-greediness --greediness --no-hooks --debug --runtime-profile --local-groupid --attempt --log-handler-script --log-service --job-deploy-sources --benchmark-extended --container-image --immediate-submit --is --jobscript --js --jobname --jn --flux --container-cleanup-images --use-conda --conda-not-block-search-path-envvars --list-conda-envs --conda-cleanup-envs --conda-cleanup-pkgs --conda-create-envs-only --conda-frontend --use-apptainer --use-singularity --apptainer-prefix --singularity-prefix --apptainer-args --singularity-args --use-envmodules --scheduler-solver-path --deploy-sources --target-jobs --mode --report-html-path --report-html-stylesheet-path".split() + forbidden = set("--rerun-incomplete --ri --show-failed-logs --rerun-triggers --nolock --software-deployment-method --smd --deployment --deployment-method --conda-prefix --cores -c --directory -d --snakefile -s --configfile --configfiles".split()) + available = set("--dry-run --dryrun -n --profile --cache --jobs -j --local-cores --resources --res --set-threads --max-threads --set-resources --set-scatter --set-resource-scopes --default-resources --default-res --preemptible-rules --preemptible-retries --envvars --touch -t --keep-going -k --force -f --executor -e --forceall -F --forcerun -R --prioritize -P --batch --until -U --omit-from -O --shadow-prefixDIR --scheduler --wms-monitor --wms-monitor-arg --scheduler-ilp-solver --conda-base-path --no-subworkflows --nosw --precommand --groups --group-components --report --report-stylesheet --reporterPLUGIN --draft-notebook --edit-notebook --notebook-listen --lint --generate-unit-tests --containerize --export-cwl --list-rules --list -l --list-target-rules --lt --dag --rulegraph --filegraph --d3dag --summary -S --detailed-summary -D --archive --cleanup-metadata --cmFILE --cleanup-shadow --skip-script-cleanup --unlock --list-changes --lc --list-input-changes --li --list-params-changes --lp 
--list-untracked --lu --delete-all-output --delete-temp-output --keep-incomplete --drop-metadata --version -v --printshellcmds -p --debug-dag --nocolor --quiet -q --print-compilation --verbose --force-use-threads --allow-ambiguity -a --ignore-incomplete --ii --max-inventory-time --latency-wait --output-wait -w --wait-for-files --wait-for-files-file --queue-input-wait-time --notemp --nt --all-temp --unneeded-temp-files --keep-storage-local-copies --target-files-omit-workdir-adjustment --allowed-rules --max-jobs-per-timespan --max-jobs-per-second --max-status-checks-per-second --seconds-between-status-checks --retries --restart-times -T --wrapper-prefix --default-storage-provider --default-storage-prefix --local-storage-prefix --remote-job-local-storage-prefix --shared-fs-usage --scheduler-greediness --greediness --no-hooks --debug --runtime-profile --local-groupid --attempt --log-handler-script --log-service --job-deploy-sources --benchmark-extended --container-image --immediate-submit --is --jobscript --js --jobname --jn --flux --container-cleanup-images --use-conda --conda-not-block-search-path-envvars --list-conda-envs --conda-cleanup-envs --conda-cleanup-pkgs --conda-create-envs-only --conda-frontend --use-apptainer --use-singularity --apptainer-prefix --singularity-prefix --apptainer-args --singularity-args --use-envmodules --scheduler-solver-path --deploy-sources --target-jobs --mode --report-html-path --report-html-stylesheet-path".split())
111-114
: Improve path handling and error messages. Use os.path.join for path construction and consolidate similar error messages. Apply this diff:
- if not os.path.exists(f"{value}/config.yaml"): - self.fail(f"{value} does not contain the necessary config.yaml file.", param, ctx) - elif not os.access(f"{value}/config.yaml", os.R_OK): - self.fail(f"{value}/config.yaml does not have read access. Please check the file permissions and try again.", param, ctx) + config_path = os.path.join(value, "config.yaml") + if not os.path.exists(config_path): + self.fail(f"{value} does not contain the necessary config.yaml file.", param, ctx) + elif not os.access(config_path, os.R_OK): + self.fail(f"{config_path} does not have read access. Please check the file permissions and try again.", param, ctx)
117-117
: Document planned program-specific parameter types. Consider adding more detailed documentation about the planned program-specific parameter types, including:
- Expected parameter types to be implemented
- Their purpose and validation requirements
- Any dependencies or prerequisites
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
.github/workflows/tests.yml (2 hunks)
harpy/_cli_types_generic.py (1 hunks)
harpy/_cli_types_params.py (1 hunks)
harpy/align.py (12 hunks)
harpy/simulate.py (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- harpy/_cli_types_params.py
🧰 Additional context used
🪛 Ruff
harpy/_cli_types_generic.py
21-21: Do not use bare except
(E722)
22-22: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
67-67: Use key not in dict
instead of key not in dict.keys()
Remove .keys()
(SIM118)
🔇 Additional comments (11)
harpy/align.py (3)
12-13
: LGTM! Import changes align with validation objectives.
The new imports from _cli_types_generic
and _cli_types_params
support the enhanced command-line validation strategy.
68-68
: LGTM! Enhanced type safety with custom parameter types.
The changes improve input validation by:
- Using
InputFile("fasta", gzip_ok=True)
for genome files - Using specific parameter types for extra parameters (
BwaParams
,EmaParams
,StrobeAlignParams
) - Using
HPCProfile
andSnakemakeParams
for workflow controls
Also applies to: 70-70, 79-79, 82-82, 152-152, 156-156, 166-166, 169-169, 258-258, 260-260, 270-270, 273-273
108-108
: LGTM! Consistent configuration handling across commands.
The configuration handling is well-structured and follows best practices:
- Consistent pattern for optional parameters
- Proper path resolution for input files
- Clear separation of workflow and input configurations
Also applies to: 209-209, 302-302
.github/workflows/tests.yml (2)
347-347
: LGTM! Verify the fragment-density functionality.
The removal of the -x "-d" option simplifies the command; fragment density optimization is now handled by the new --fragment-density flag in the ema command.
#!/bin/bash
# Description: Verify the new --fragment-density flag is properly implemented
# in the harpy align ema command.
# Test: Search for the fragment-density parameter definition
ast-grep --pattern 'def align_ema($$$):
$$$
fragment_density$$$
$$$'
232-232
: LGTM! Verify the new filter parameter.
The change from --trim_poly_g
to --low_complexity_filter
appears to be a more comprehensive approach to filtering low-quality sequences.
harpy/simulate.py (6)
9-9
: LGTM: Import of custom types for enhanced validation.
The addition of custom types from _cli_types_generic
aligns with the PR objective of moving file-extension validation to the command-line level.
137-141
: LGTM: Enhanced type validation for linkedreads command options.
The command options now use custom types for better validation:
HPCProfile()
for HPC configurationSnakemakeParams()
for Snakemake parameters
142-143
: LGTM: Improved file validation for genome inputs.
The genome arguments now use InputFile("fasta", gzip_ok = True)
to validate FASTA files with gzip support.
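The extension-based validation that `InputFile` performs can be sketched as follows. The extension lists here are assumptions for illustration, not Harpy's exhaustive lists, and a click type would call `self.fail` rather than raising:

```python
def validate_input_file(path: str, filetype: str, gzip_ok: bool) -> str:
    """Check that `path` carries an accepted extension for `filetype`."""
    extensions = {  # illustrative subsets, not the real exhaustive lists
        "fasta": [".fasta", ".fa", ".fna"],
        "vcf": [".vcf", ".bcf"],
        "gff": [".gff", ".gff3"],
    }
    accepted = list(extensions[filetype])
    if gzip_ok:
        accepted += [ext + ".gz" for ext in extensions[filetype]]
    if not any(path.lower().endswith(ext) for ext in accepted):
        raise ValueError(f"{path} does not end in one of: {', '.join(accepted)}")
    return path
```

The `gzip_ok` switch is what lets the same type accept `genome.fasta.gz` for FASTA inputs while rejecting `calls.vcf.gz` where an uncompressed VCF is required.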
211-212
: LGTM: Enhanced file validation for snpindel command options.
The command options now use InputFile
with appropriate extensions:
- VCF files:
gzip_ok = False
- GFF files:
gzip_ok = True
Also applies to: 219-221
229-232
: LGTM: Improved parameter validation for snpindel workflow controls.
The workflow control options now use custom types:
HPCProfile()
for HPC configurationSnakemakeParams()
for Snakemake parameters
167-168
: Remove the quiet
parameter from check_fasta
calls.
The check_fasta
function signature in _validations.py
has been updated to only accept a single parameter genofile
. The calls need to be updated to match this change.
Also applies to: 275-275, 387-387, 502-502, 607-607
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
harpy/popgroups.py (1)
50-54
: Fix color tag and consider consolidating status messages. The formatting improvements enhance readability, but there's a small issue:
- Line 50: Missing closing tag for
[blue]
- Lines 52 and 54: Similar message structures could be consolidated
Apply these changes:
- rprint(f"[bold]{len(samplenames)} samples [/bold] detected in [blue]{inputdir}[blue]", file = sys.stderr) + rprint(f"[bold]{len(samplenames)} samples [/bold] detected in [blue]{inputdir}[/blue]", file = sys.stderr) - if os.path.exists(output): - write_text = f"The file [blue]{output}[/blue] was overwritten." - else: - write_text = f"Created sample population grouping file [blue]{output}[/blue]." + write_text = f"{'Overwrote' if os.path.exists(output) else 'Created'} sample population grouping file [blue]{output}[/blue]"harpy/hpc.py (2)
33-49
: Enhance robustness of package validation. While the function works, consider these improvements for better reliability:
- Add error handling for subprocess.run
- Make the package name prefix configurable
- Use more robust output parsing
Consider this implementation:
def package_exists(pkg): """helper function to search for a package in the active conda environment""" - search_pkg = subprocess.run(f"conda list snakemake-executor-plugin-{pkg}".split(), capture_output = True, text = True) - exists = False - for i in search_pkg.stdout.split("\n"): - if i and not i.startswith("#"): - exists = True + pkg_prefix = "snakemake-executor-plugin-" + try: + search_pkg = subprocess.run( + f"conda list {pkg_prefix}{pkg}".split(), + capture_output=True, + text=True, + check=False # Don't raise on non-zero exit + ) + # Check if any non-header line contains the package name + exists = any( + line.strip() and not line.startswith("#") + for line in search_pkg.stdout.splitlines() + ) + except subprocess.SubprocessError as e: + print_notice(f"Error checking for package: {e}") + return False
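The output-parsing half of the suggestion above can be isolated and tested on its own: `conda list` prints comment headers prefixed with `#`, so a package is present exactly when some non-comment line remains. This is a sketch of that parsing step, not the full subprocess-handling function:

```python
def conda_package_listed(stdout: str) -> bool:
    """Return True if `conda list <pkg>` output contains a non-comment package line."""
    return any(
        line.strip() and not line.startswith("#")
        for line in stdout.splitlines()
    )
```

Separating the parsing from the `subprocess.run` call makes the logic unit-testable without a conda environment.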
Line range hint
1-240
: Architecture Review: Good separation of concerns and extensibility. The changes improve the codebase by:
- Better organization of HPC profiles under system-specific directories
- Enhanced validation through package existence checks
- Consistent implementation across different HPC systems
- Improved user feedback through rich formatting
The modular design makes it easy to add support for new HPC systems in the future.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
.github/filters.yml (1 hunks)
harpy/hpc.py (7 hunks)
harpy/popgroups.py (2 hunks)
harpy/stitchparams.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- harpy/stitchparams.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/filters.yml
🔇 Additional comments (3)
harpy/popgroups.py (1)
7-7
: LGTM: Rich library import for enhanced CLI output
The addition of rich's print function aligns with the PR's objective of improving command-line output formatting.
harpy/hpc.py (2)
4-7
: LGTM: Import changes and docstring updates are appropriate.
The new imports support the package validation feature, and the switch to rich.print enhances user feedback. The docstring accurately reflects the new directory structure.
Also applies to: 19-21
56-57
: LGTM: Consistent implementation of validation and output formatting.
The changes across all HPC profile generators are consistent and well-implemented:
- Improved output formatting using rich
- Added package validation before file creation
Let's verify the impact of the directory structure changes:
Also applies to: 92-93, 122-123, 151-152, 180-181
✅ Verification successful
Let me verify if these paths are being used correctly by checking the context around these usages.
Let me check if there are any other files that might be affected by or depend on these directory structures.
Directory structure changes are correctly implemented
The directory structure changes are properly handled in the code:
- All HPC profile generators consistently create their respective directories using
os.makedirs()
- The paths are only referenced in
harpy/hpc.py
and nowhere else in the codebase - The error from
fd
command confirms that thehpc/
directory doesn't exist yet, which is expected as it's created at runtime
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if any files in the codebase still reference the old hpc/ directory structure
rg -g '!*.{yaml,yml}' 'hpc/[^/]+/config\.yaml'
Length of output: 312
Script:
#!/bin/bash
# Check the context of how these paths are used
rg -B 2 -A 2 'outfile = "hpc/.+/config\.yaml"' harpy/hpc.py
Length of output: 1188
Script:
#!/bin/bash
# Search for any other references to hpc directory structure
rg -g '!{*.yaml,*.yml,*.pyc}' -l 'hpc/[^/]+' --type-add 'test:*.{py,sh,txt}' --type test
# Check if there are any config templates that might need to be moved
fd -e yaml -e yml . hpc/
Length of output: 222
This PR moves the file-extension validation to the command-line level. It also adds basic-ish software-specific validation for all workflows that feature --extra-params.
Summary by CodeRabbit
Release Notes
New Features
--fragment-density option in the ema command for read fragment density optimization.
Bug Fixes
Refactor