bugfix - divide-by-zero error #161
Conversation
📝 Walkthrough
The changes in this pull request involve modifications to multiple scripts within the repository.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (7)
harpy/bin/depth_windows.py (3)
26-28: Good addition of zero check, but needs error logging.
The protection against division by zero is good, but we should log when we skip windows to help diagnose issues.

 if WINSIZE > 0:
     depth = _SUM / WINSIZE
     print(f"{LAST_CONTIG}\t{POSITION}\t{depth}", file = sys.stdout)
+else:
+    print(f"Warning: Skipping window for {LAST_CONTIG} due to zero window size", file=sys.stderr)
Line range hint 7-11: Add input validation for windowsize.
The script should validate that the windowsize argument is positive before processing any data.

 parser.add_argument('windowsize', type= int, help = "The window size to use to calcualte mean depth over (non-overlapping)")
 args = parser.parse_args()
+if args.windowsize <= 0:
+    print("Error: Window size must be greater than zero", file=sys.stderr)
+    sys.exit(1)
+
 _SUM = 0
Line range hint 1-44: Consider adding error handling and logging.
The script would benefit from proper error handling and logging to help diagnose issues in production:
- Add try-except blocks around file operations
- Use Python's logging module
- Add progress indicators for long-running operations
Would you like me to provide an example implementation with these improvements?
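For illustration, a minimal self-contained sketch of that pattern (the `process_position` handler is hypothetical; the real script's structure will differ):

```python
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stderr,
                    format="%(levelname)s: %(message)s")

def process_position(line: str) -> None:
    """Hypothetical per-line handler; the real script accumulates depths."""
    contig, pos, depth = line.rstrip("\n").split("\t")
    int(pos), float(depth)  # raises ValueError on malformed input

try:
    for n, line in enumerate(sys.stdin, start=1):
        process_position(line)
        if n % 1_000_000 == 0:
            logging.info("processed %d positions", n)  # progress indicator
except (OSError, ValueError) as err:
    logging.error("failed while reading input: %s", err)
    sys.exit(1)
```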
harpy/bin/assign_mi.py (1)
86-89: Consider defining the default BX value as a constant.
The implementation correctly removes MI and DI tags while adding a default BX tag. However, the hardcoded value "A00C00B00D00" could be defined as a constant at the module level for better maintainability.
Consider adding this at the top of the file:
+# Default BX tag value for records with missing BX tags
+DEFAULT_BX_VALUE = "A00C00B00D00"

Then update the write_missingbx function:

-    tags.append(("BX", "A00C00B00D00"))
+    tags.append(("BX", DEFAULT_BX_VALUE))

harpy/bin/deconvolve_alignments.py (3)
25-25: Avoid using 'input' as an argument name to prevent shadowing
Using 'input' as the argument name shadows the built-in input() function in Python, which might lead to unexpected behavior. Consider renaming the argument to 'input_file'.

Apply this diff:

-parser.add_argument('input', help = "Input coordinate-sorted bam/sam file")
+parser.add_argument('input_file', help = "Input coordinate-sorted bam/sam file")
 ...
-bam_input = args.input
+bam_input = args.input_file
52-71: Resolve contradiction regarding MI tag handling in write_invalidbx
The docstring states that the function keeps the existing MI tag if present, but the comment contradicts this by stating it will not keep an existing MI tag. Please ensure the documentation and code behavior are consistent.
Update the docstring or adjust the code to accurately reflect the function's behavior.
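For example, if the code's behavior (not keeping the MI tag) is the intended one, the docstring could read as follows (the function signature is assumed from context):

```python
def write_invalidbx(bam, alnrecord):
    """
    Write an alignment record with an invalid BX tag to the output BAM.
    Any existing MI tag is removed rather than preserved, since an
    invalid barcode does not identify a valid molecule.
    """
```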
101-102: Simplify nested if statements by combining them
To improve code readability, you can combine the nested if statements into a single condition using and.

Apply this diff to simplify the code:

-if bam_input.lower().endswith(".bam"):
-    if not os.path.exists(bam_input + ".bai"):
+if bam_input.lower().endswith(".bam") and not os.path.exists(bam_input + ".bai"):

🧰 Tools
🪛 Ruff
101-102: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- harpy/_cli_types_generic.py (2 hunks)
- harpy/bin/assign_mi.py (2 hunks)
- harpy/bin/deconvolve_alignments.py (1 hunks)
- harpy/bin/depth_windows.py (2 hunks)
🧰 Additional context used
🪛 Ruff
harpy/bin/deconvolve_alignments.py
101-102: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
129-129: Do not use bare except
(E722)
🔇 Additional comments (6)
harpy/bin/assign_mi.py (2)
Line range hint 65-89: Verify if these changes resolve the ZeroDivisionError.
While the changes improve tag handling consistency by removing MI tags from invalid/missing BX records, we should verify if this fully addresses the ZeroDivisionError mentioned in issue #160. Could you provide more context about how these changes prevent the divide-by-zero condition in the molecule_coverage rule?
Let's check the molecule_coverage rule implementation:
#!/bin/bash
# Search for the molecule_coverage rule and its dependencies
rg -l "molecule_coverage" | while read -r file; do
echo "=== $file ==="
rg "rule molecule_coverage" "$file" -A 10
done
# Search for potential division operations in related files
rg "ZeroDivisionError|/ 0|\b/\s*0\b"
65-67: Verify the impact of MI tag removal.
The change to remove existing MI tags from invalid BX records is well-documented and implemented correctly. However, since this changes the previous behavior, we should verify that downstream processes handle records without MI tags appropriately.
Let's check for any downstream dependencies on MI tags:
✅ Verification successful
Based on the comprehensive code analysis, I can now provide a final response:
Removing MI tags from invalid BX records is safe and correct
The change to remove existing MI tags from invalid BX records is appropriate because:
- The molecule_coverage.py script doesn't use MI tags at all - it works with BX stats data
- The only downstream usage of MI tags is in concatenate_bam.py, which properly handles records with missing MI tags
- Invalid BX records shouldn't have MI tags as they don't represent valid molecules, which is consistent with the tool's documentation
The change aligns with the codebase's design where MI tags are only assigned to valid molecular identifiers, making the behavior more consistent and preventing potential issues with invalid molecular assignments.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for code that relies on MI tags
rg -l "MI" | grep -v "assign_mi.py" | while read -r file; do
echo "=== $file ==="
rg "get_tag\(['\"]MI['\"]" "$file" -A 2
done
Length of output: 1791
Script:
#!/bin/bash
# Let's check for MI tag usage in molecule_coverage.py since it's related to the divide-by-zero error
echo "=== molecule_coverage.py content ==="
cat harpy/bin/molecule_coverage.py
# Also check for any other files that might process MI tags
echo -e "\n=== Files referencing MI tags ==="
rg -l "MI" --type py
# Check specific MI tag processing
echo -e "\n=== MI tag processing in Python files ==="
rg "MI" --type py -B2 -A2
Length of output: 19671
harpy/_cli_types_generic.py (3)
67-67: LGTM! Clear error handling for unsupported file types.
The validation check and error message effectively guide users when encountering unsupported file types.

116-116: Whitespace change.

Line range hint 67-116: Verify the relevance to the divide-by-zero fix.
These changes appear unrelated to the main PR objective of fixing the ZeroDivisionError in the molecule_coverage rule. Let's verify if this file is in the relevant code path.
harpy/bin/deconvolve_alignments.py (1)
168-169: Ensure correct handling of negative distances in dist calculation
The calculation dist = pos_start - d[bx]["lastpos"] may result in negative distances if pos_start is less than d[bx]["lastpos"]. Ensure that the logic accounts for this scenario and handles negative distances appropriately.
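One possible guard, sketched as a fragment around the reviewed lines (`d` is assumed to map barcodes to per-molecule state and `pos_start` to be the current alignment start):

```python
dist = pos_start - d[bx]["lastpos"]
if dist < 0:
    # Coordinate-sorted input should not yield a negative distance;
    # clamp to zero so the cutoff comparison cannot be bypassed by
    # an out-of-order or overlapping record.
    dist = 0
```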
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (17)
harpy/bin/parse_phaseblocks.py (1)
35-35: Consider keeping print()
Unless there's a specific performance requirement, using print() would be more idiomatic and maintainable in Python:

-    sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
+    print(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}")

harpy/bin/rename_bam.py (1)
9-10: LGTM! Consider enhancing error handling for samtools dependency.
The improved description and usage string make the script's requirements and usage clearer. However, since samtools is a required dependency, it would be good to verify its availability early in the script.

Consider adding this check at the start of the script:

def check_samtools():
    if os.system("which samtools > /dev/null 2>&1") != 0:
        sys.exit("Error: samtools is required but not found in PATH")

check_samtools()

harpy/bin/count_bx.py (3)
Line range hint 19-23: Consider adding explicit input file validation
While pysam.FastxFile handles file errors internally, adding an explicit check for file existence before processing would provide clearer error messages to users.

 args = parser.parse_args()
+
+# Validate input file exists
+if not os.path.exists(args.input):
+    sys.stderr.write(f"Error: Input file '{args.input}' does not exist\n")
+    sys.exit(1)
Line range hint 29-35: Fix potential case sensitivity mismatch in invalid tag counting
The invalid tag regex pattern [AaBbCcDd]00 matches both upper and lowercase letters, but the inv_dict only contains uppercase keys. This could lead to missed counts for lowercase invalid tags.

 inv_dict = {
     "A" : 0,
     "B" : 0,
     "C" : 0,
-    "D" : 0
+    "D" : 0,
+    "a" : 0,
+    "b" : 0,
+    "c" : 0,
+    "d" : 0
 }

Alternatively, normalize the case when counting:

 if inv:
     for i in inv:
-        inv_dict[i[0]] += 1
+        inv_dict[i[0].upper()] += 1
     continue
58-65: Optimize output writing performance
While using sys.stdout.write is good, multiple consecutive writes can be inefficient. Consider combining all output into a single write operation.

-sys.stdout.write(f"totalReads\t{N_READS}\n")
-sys.stdout.write(f"bxTagCount\t{N_BX}\n")
-sys.stdout.write(f"bxValid\t{N_VALID}\n")
-sys.stdout.write(f"bxInvalid\t{N_BX - N_VALID}\n")
-sys.stdout.write(f"A00\t{inv_dict['A']}\n")
-sys.stdout.write(f"C00\t{inv_dict['C']}\n")
-sys.stdout.write(f"B00\t{inv_dict['B']}\n")
-sys.stdout.write(f"D00\t{inv_dict['D']}\n")
+output = (
+    f"totalReads\t{N_READS}\n"
+    f"bxTagCount\t{N_BX}\n"
+    f"bxValid\t{N_VALID}\n"
+    f"bxInvalid\t{N_BX - N_VALID}\n"
+    f"A00\t{inv_dict['A']}\n"
+    f"C00\t{inv_dict['C']}\n"
+    f"B00\t{inv_dict['B']}\n"
+    f"D00\t{inv_dict['D']}\n"
+)
+sys.stdout.write(output)

harpy/bin/infer_sv.py (2)
Line range hint 22-27: Add error handling for invalid orientations and file operations.
The script could fail with a KeyError if it encounters an unexpected orientation pattern in line[6]. Additionally, file operations should be wrapped in try-except blocks.

Here's a suggested improvement for the orientation lookup:

-        inference = conversions[line[6]]
+        try:
+            inference = conversions[line[6]]
+        except KeyError:
+            sys.stderr.write(f"Warning: Unknown orientation pattern '{line[6]}' encountered. Skipping record.\n")
+            continue
+        except IndexError:
+            sys.stderr.write(f"Warning: Invalid record format. Skipping record.\n")
+            continue

Also consider wrapping file operations in try-except blocks to handle potential IOErrors gracefully.
Also applies to: 45-58
Line range hint 28-58: Refactor to reduce code duplication and improve readability.
The current implementation has duplicate logic for header processing and could benefit from a more Pythonic approach.

Consider this refactoring:

-if args.failfile:
-    with open(args.bedfile, "r", encoding="utf-8") as f, open(args.failfile, "w", encoding="utf-8") as failout:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        failout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            inference = conversions[line[6]]
-            line.append(inference)
-            NEWROW = "\t".join(line) + "\n"
-            if "FAIL" not in line:
-                sys.stdout.write(NEWROW)
-            else:
-                failout.write(NEWROW)
-else:
-    with open(args.bedfile, "r", encoding="utf-8") as f:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            if "FAIL" not in line:
-                inference = conversions[line[6]]
-                line.append(inference)
-                NEWROW = "\t".join(line) + "\n"
-                sys.stdout.write(NEWROW)
+def process_header(f):
+    header = f.readline().strip().split("\t")
+    header_corrected = [i.title().replace(" ", "") for i in header]
+    header_corrected.append("SV")
+    return "\t".join(header_corrected) + "\n"
+
+def process_line(line, conversions):
+    fields = line.strip().split("\t")
+    if not fields:
+        return None, None
+    inference = conversions[fields[6]]
+    fields.append(inference)
+    return fields, "\t".join(fields) + "\n"
+
+with open(args.bedfile, "r", encoding="utf-8") as f:
+    header = process_header(f)
+    sys.stdout.write(header)
+
+    if args.failfile:
+        with open(args.failfile, "w", encoding="utf-8") as failout:
+            failout.write(header)
+            for line in f:
+                fields, formatted_line = process_line(line, conversions)
+                if fields:
+                    if "FAIL" not in fields:
+                        sys.stdout.write(formatted_line)
+                    else:
+                        failout.write(formatted_line)
+    else:
+        for line in f:
+            fields, formatted_line = process_line(line, conversions)
+            if fields and "FAIL" not in fields:
+                sys.stdout.write(formatted_line)

This refactoring:
- Extracts header and line processing into separate functions
- Uses a more Pythonic for loop for file iteration
- Reduces code duplication
- Improves maintainability
harpy/bin/molecule_coverage.py (2)
22-22: Improve argument validation but remove redundant parameter.
Good improvement making the -f/--fai argument required to prevent missing input errors. However, the required=True parameter is redundant for the positional statsfile argument as positional arguments are already required by default.

-parser.add_argument('statsfile', required = True, help = "stats file produced by harpy via bx_stats.py")
+parser.add_argument('statsfile', help = "stats file produced by harpy via bx_stats.py")

Also applies to: 26-27
Line range hint 89-92: Add bounds checking and optimize memory usage.
The coverage calculation could be improved with:
- Validation of start/end values against contig length
- Memory optimization for large ranges
Add bounds checking and chunked updates:
 start = int(splitline[IDX_START])
 end = int(splitline[IDX_END])
+if start < 1 or end > contigs[contig]:
+    sys.stderr.write(f"Error: Position {start}-{end} out of bounds for contig {contig} (length: {contigs[contig]})\n")
+    sys.exit(1)
+
+# Update in chunks to reduce memory usage for large ranges
+chunk_size = 1000000  # Adjust based on available memory
+for chunk_start in range(start, end + 1, chunk_size):
+    chunk_end = min(chunk_start + chunk_size, end + 1)
+    coverage.update(range(chunk_start, chunk_end))
-coverage.update(range(start, end + 1))

harpy/bin/make_windows.py (3)
Line range hint 31-44: Consider optimizing the makewindows function.
While the changes look good, there's room for improvement in terms of type safety and performance.

-def makewindows(_contig, _c_len, windowsize):
+def makewindows(_contig: str, _c_len: int, windowsize: int) -> None:
     """create a file of the specified windows"""
     start = args.mode
     end = min(_c_len, windowsize)
-    starts = [start]
-    ends = [end]
-    # write to output file
-    for (startpos, endpos) in zip (starts,ends):
+    while start < _c_len:
+        end = min(start + windowsize, _c_len)
         sys.stdout.write(f"{_contig}\t{startpos}\t{endpos}\n")
+        start += windowsize
47-56: Use context manager for file operations.
The current implementation correctly handles FAI files, but could be improved with proper resource management.

-if testname.endswith("fai"):
-    with open(args.infile, "r", encoding="utf-8") as fai:
-        while True:
-            line = fai.readline()
-            if not line:
-                break
+if testname.endswith("fai"):
+    with open(args.infile, "r", encoding="utf-8") as fai:
+        for line in fai:
             lsplit = line.split("\t")
             contig = lsplit[0]
             c_len = int(lsplit[1])
             c_len += args.mode
             makewindows(contig, c_len, args.window)
Line range hint 58-85: Improve FASTA file handling and resource management.
The current implementation has several issues:
- Missing context manager for file operations
- Complex and error-prone FASTA parsing
- Potential resource leaks
Consider using BioPython's SeqIO for robust FASTA parsing:
+from Bio import SeqIO
+
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
-    if is_gzip(args.infile):
-        fopen = gzip.open(args.infile, "rt")
-    else:
-        fopen = open(args.infile, "r", encoding="utf-8")
-    line = fopen.readline()
-    while True:
-        C_LEN=0
-        if not line:
-            break
-        if line.startswith(">"):
-            # remove newline, > starting symbol, and any comments after name
-            contig = line.rstrip("\n").lstrip(">").split()[0]
-            # keep reading until hitting another ">"
-            HEADER = False
-            while not HEADER:
-                line = fopen.readline()
-                if not line:
-                    break
-                line = line.rstrip("\n")
-                if line.startswith(">"):
-                    HEADER = True
-                    break
-                else:
-                    C_LEN += len(line)-1
-            HEADER = False
-            C_LEN += args.mode
-            makewindows(contig, C_LEN, args.window)
-    fopen.close()
+    mode = "rt" if is_gzip(args.infile) else "r"
+    opener = gzip.open if is_gzip(args.infile) else open
+    with opener(args.infile, mode) as fasta:
+        for record in SeqIO.parse(fasta, "fasta"):
+            contig = record.id
+            c_len = len(record.seq) + args.mode
+            makewindows(contig, c_len, args.window)

🧰 Tools
🪛 Ruff
60-60: Use a context manager for opening files
(SIM115)
62-62: Use a context manager for opening files
(SIM115)
harpy/bin/concatenate_bam.py (2)
30-33: LGTM! Error handling improvement.
Good change to sys.stderr.write for error messages, aligning with best practices and maintaining consistency with other files in the codebase.

Consider combining the multiple writes into a single multi-line string for better performance:

-    sys.stderr.write("Please provide a single file to \'--bamlist\' (-b) featuring all the files you want to concatenate (one per line):\n")
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt\n\n")
-    sys.stderr.write("Alternatively, provide the files after \'-o output.bam\':\n",)
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam\n")
+    sys.stderr.write(
+        "Please provide a single file to '--bamlist' (-b) featuring all the files you want to concatenate (one per line):\n"
+        "[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt\n\n"
+        "Alternatively, provide the files after '-o output.bam':\n"
+        "[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam\n"
+    )
Line range hint 89-95: Improve error handling in MI tag processing.
The bare except clause could mask important errors. Consider catching specific exceptions and providing appropriate error messages.

 try:
     mi = record.get_tag("MI")
     # if previously converted for this sample, use that
     if mi in MI_LOCAL:
         record.set_tag("MI", MI_LOCAL[mi])
     else:
         record.set_tag("MI", MI_NEW)
         # add to sample conversion dict
         MI_LOCAL[mi] = MI_NEW
         # increment to next unique MI
         MI_NEW += 1
-except:
+except KeyError:
+    # Skip records without MI tag
     pass
+except ValueError as e:
+    sys.stderr.write(f"Error processing MI tag in {xam}: {str(e)}\n")
+    raise

harpy/bin/deconvolve_alignments.py (3)
16-16: Fix typo in help text: "presevered" → "preserved"

-    with an invalid barcode (00 as one of its segments) are presevered
+    with an invalid barcode (00 as one of its segments) are preserved
23-23: Add validation for negative cutoff values
The cutoff distance should be positive. Consider adding validation to ensure the value is greater than zero.

-parser.add_argument('-c','--cutoff', type=int, default = 100000, help = "Distance in base pairs at which alignments with the same barcode should be considered different molecules (default: 100000)")
+parser.add_argument('-c','--cutoff', type=lambda x: abs(int(x)), default = 100000, help = "Distance in base pairs at which alignments with the same barcode should be considered different molecules (must be positive, default: 100000)")
1-204: Overall assessment: Well-structured implementation with minor improvements needed
The script effectively handles BX tag processing and MI tag assignment. While it's not directly related to the division by zero errors mentioned in the PR objectives, the implementation is solid with good error handling for the core functionality. The suggested improvements focus on:
- Adding error handling for BAM indexing operations
- Fixing the incorrect is_forward attribute usage
- Input validation and code simplification
These changes will make the code more robust and maintainable.
🧰 Tools
🪛 Ruff
101-102: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (15)
- .github/filters.yml (3 hunks)
- harpy/bin/assign_mi.py (3 hunks)
- harpy/bin/check_bam.py (1 hunks)
- harpy/bin/check_fastq.py (1 hunks)
- harpy/bin/concatenate_bam.py (1 hunks)
- harpy/bin/count_bx.py (1 hunks)
- harpy/bin/deconvolve_alignments.py (1 hunks)
- harpy/bin/depth_windows.py (3 hunks)
- harpy/bin/infer_sv.py (1 hunks)
- harpy/bin/make_windows.py (4 hunks)
- harpy/bin/molecule_coverage.py (2 hunks)
- harpy/bin/parse_phaseblocks.py (2 hunks)
- harpy/bin/rename_bam.py (1 hunks)
- harpy/snakefiles/phase.smk (1 hunks)
- harpy/snp.py (2 hunks)
✅ Files skipped from review due to trivial changes (2)
- harpy/bin/check_bam.py
- harpy/bin/check_fastq.py
🚧 Files skipped from review as they are similar to previous changes (2)
- harpy/bin/assign_mi.py
- harpy/bin/depth_windows.py
🧰 Additional context used
🪛 Ruff
harpy/bin/deconvolve_alignments.py
101-102: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
harpy/bin/make_windows.py
60-60: Use a context manager for opening files
(SIM115)
62-62: Use a context manager for opening files
(SIM115)
🔇 Additional comments (18)
harpy/bin/parse_phaseblocks.py (3)
8-9: LGTM! Enhanced CLI documentation.
The addition of clear description and usage parameters improves user experience.

12-14: LGTM! Improved CLI error handling.
Good addition of help message when no arguments are provided, following CLI best practices.

11-11: Verify impact of argument change across the codebase.
The change from a named argument (-i) to a positional argument is a breaking change. Let's verify all callers have been updated.
✅ Verification successful
Breaking change is safe - only one caller found using positional argument
The change from named to positional argument is safe as there is only one caller in harpy/snakefiles/phase.smk and it's already using the positional argument syntax (parse_phaseblocks.py $i). No instances of the old -i flag usage were found in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining usage of parse_phaseblocks.py with -i flag
rg -l "parse_phaseblocks\.py.*-i"
# Search for all Snakemake rules using parse_phaseblocks.py
rg "parse_phaseblocks\.py" --type smk
Length of output: 103
Script:
#!/bin/bash
# Search for any usage of parse_phaseblocks.py in the codebase
rg -l "parse_phaseblocks\.py"
# Search for Snakemake workflow files with common extensions
fd -e snakefile -e smk -e snake -e rules
# If Snakemake files are found, search within them
rg "parse_phaseblocks\.py" -g "*.{snakefile,smk,snake,rules}"
# Look for any shell scripts or Python scripts that might call parse_phaseblocks.py
rg "parse_phaseblocks\.py" -g "*.{sh,py}"
Length of output: 1284
harpy/bin/rename_bam.py (2)
9-13: Verify if these changes are within PR scope.
While these improvements to the CLI are valuable, they seem unrelated to the PR's primary objective of fixing the divide-by-zero error in the molecule_coverage rule. Please confirm if these changes should be part of this PR or moved to a separate enhancement PR.
Let's check if this file is related to the reported issue:
#!/bin/bash
# Check if this file is referenced in the molecule_coverage rule
rg -l "molecule_coverage.*rename_bam\.py" --type snakefile
# Check for connections to the reported error
rg -l "ZeroDivisionError.*rename_bam\.py"
12-13: 🛠️ Refactor suggestion
Verify path handling for output file.
The argument order change looks good, but there's a potential issue with output path handling. The script uses os.path.dirname(args.input) to determine the output directory, which could lead to unexpected behavior with relative paths.
Let's verify how this script is typically used in the workflow:
Consider improving path handling:
-outdir = os.path.dirname(args.input)
+outdir = os.path.dirname(os.path.abspath(args.input))
harpy/bin/count_bx.py (1)
Line range hint 1-65: Verify the source of ZeroDivisionError
This file doesn't contain any division operations that could cause the ZeroDivisionError mentioned in issue #160. We should verify the molecule_coverage rule implementation.
✅ Verification successful
Potential division by zero found in bx_stats.py
After examining the code, I found potential division by zero issues in bx_stats.py where molecule coverage calculations are performed:
- Lines 89-93: Division by x[_mi]["inferred"] without checking if it's zero first:
if x[_mi]["inferred"] > 0:
    x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / x[_mi]["inferred"], 1.0),4)
else:
    x[_mi]["covered_bp"] = 0
The code already has a fix implemented with the if x[_mi]["inferred"] > 0 check, which matches the PR's objective of fixing the ZeroDivisionError. The same pattern is used for the covered_inserts calculation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for the molecule_coverage rule and related Python scripts
echo "Searching for molecule_coverage rule definition..."
rg -l "rule molecule_coverage"
echo "Searching for potential division operations in Python scripts..."
rg -l "def molecule_coverage" --type py
Length of output: 425
Script:
#!/bin/bash
# Check the molecule_coverage rule implementation in the identified files
echo "=== Checking molecule_coverage rule in align_bwa.smk ==="
rg -A 10 "rule molecule_coverage" harpy/snakefiles/align_bwa.smk
echo "=== Checking molecule_coverage rule in align_ema.smk ==="
rg -A 10 "rule molecule_coverage" harpy/snakefiles/align_ema.smk
echo "=== Checking molecule_coverage rule in align_strobealign.smk ==="
rg -A 10 "rule molecule_coverage" harpy/snakefiles/align_strobealign.smk
echo "=== Searching for Python scripts with division operations ==="
rg "/ *0|/0|\/ 0|\/ *0" --type py
Length of output: 1701
Script:
#!/bin/bash
# Let's check the shell command used in molecule_coverage rules and search for division operations with corrected regex
echo "=== Checking complete molecule_coverage rule including shell command ==="
rg -A 15 "rule molecule_coverage" harpy/snakefiles/align_bwa.smk
echo "=== Searching for Python scripts with division operations (corrected regex) ==="
rg "[^a-zA-Z0-9](\/[ ]*0|div)" --type py
echo "=== Looking for molecule coverage related Python scripts ==="
fd -e py -x grep -l "molecule.*coverage"
Length of output: 1232
Script:
#!/bin/bash
# Let's examine the molecule_coverage.py script and bx_stats.py for potential division operations
echo "=== Checking molecule_coverage.py content ==="
cat harpy/bin/molecule_coverage.py
echo "=== Checking bx_stats.py content ==="
cat harpy/bin/bx_stats.py
echo "=== Checking depth_windows.py content (used in the pipeline) ==="
cat harpy/bin/depth_windows.py
Length of output: 11083
harpy/bin/infer_sv.py (2)
14-14: LGTM: Argument parser changes are well-implemented.
The modification to make the fail file optional is implemented correctly with clear help text and consistent argument naming.

Line range hint 1-58: Verify relationship with PR objectives.
While the changes in this file improve the code, they don't appear to be directly related to fixing the ZeroDivisionError mentioned in the PR objectives. Please clarify if these changes are necessary for the bugfix or if they should be in a separate PR.
harpy/bin/molecule_coverage.py (1)
Line range hint 39-45: Prevent potential zero-length contig issues.
While investigating the reported ZeroDivisionError, I notice that contig lengths from the FAI file are used without validation. A zero-length contig could cause issues in downstream processing.
Let's verify if there are any zero-length contigs in the codebase:
Add validation for contig lengths:
splitline = line.split()
contig = splitline[0]
length = splitline[1]
+if int(length) <= 0:
+ sys.stderr.write(f"Error: Invalid contig length ({length}) for {contig}\n")
+ sys.exit(1)
contigs[contig] = int(length)
harpy/bin/make_windows.py (2)
Line range hint 88-90: LGTM!
The error message is clear and informative, properly explaining the supported file extensions.
10-10: Update usage string to match current CLI.
The usage string shows the -i flag, which was removed from the implementation.
- usage = "make_windows.py -w <window.size> -m <0,1> input.fasta > output.bed",
+ usage = "make_windows.py [-w <window.size>] [-m <0,1>] input.fasta > output.bed",
Likely invalid or redundant comment.
harpy/bin/concatenate_bam.py (1)
Line range hint 65-71: Consider atomic file operations for index management.
The current approach to index file creation and deletion could lead to race conditions in concurrent executions. Consider using temporary files or atomic operations.
Let's check if there are any concurrent uses of this script:
Also applies to: 102-105
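One possible approach, sketched under the assumption that pysam is used for indexing (the helper name is hypothetical):

```python
import os
import tempfile

import pysam

def index_bam_atomically(bam_path: str) -> None:
    """Build the .bai index in a temporary file, then atomically rename
    it into place so concurrent readers never see a half-written index."""
    target = bam_path + ".bai"
    fd, tmp = tempfile.mkstemp(
        dir=os.path.dirname(os.path.abspath(bam_path)), suffix=".bai.tmp"
    )
    os.close(fd)
    try:
        pysam.index(bam_path, tmp)   # equivalent to: samtools index <bam> <out.bai>
        os.replace(tmp, target)      # atomic rename on POSIX filesystems
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```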
.github/filters.yml (1)
60-60: LGTM! Verify the script path exists.
The addition of 'harpy/bin/depth_windows.py' to the bwa, ema, and strobealign sections is consistent with the module organization and aligns with the PR's objectives for fixing issues in the align module.
Let's verify the script exists and is executable:
Also applies to: 72-72, 85-85
✅ Verification successful
✓ Path exists and has correct permissions
The script 'harpy/bin/depth_windows.py' exists and has the appropriate executable permissions (755), confirming the validity of the path added to the filters.yml configuration.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify depth_windows.py exists and has correct permissions
# Expected: File should exist and be executable
if [ -f "harpy/bin/depth_windows.py" ]; then
echo "File exists"
ls -l "harpy/bin/depth_windows.py"
else
echo "File not found: harpy/bin/depth_windows.py"
exit 1
fi
Length of output: 183
harpy/bin/deconvolve_alignments.py (2)
33-87: LGTM! Well-structured helper functions
The helper functions are well-documented and correctly handle different cases of BX tag processing. The tag management logic is robust and consistent.
189-190: Fix incorrect attribute usage: replace is_forward with not is_reverse
The attribute is_forward does not exist in pysam.AlignedSegment.
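A minimal sketch of the suggested replacement (`record` stands for the pysam.AlignedSegment in question):

```python
# Before (attribute unavailable here):
#     if record.is_forward:
# After, using the long-standing attribute:
if not record.is_reverse:
    pass  # handle forward-strand alignment
```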
harpy/snakefiles/phase.smk (2)
250-250: LGTM: Command simplification looks good.
The removal of the -i flag from parse_phaseblocks.py aligns with the changes in the script where the input argument was changed to a positional argument.
Line range hint 1-300: Important: The molecule_coverage rule is not present in this file.
The PR objectives mention fixing a ZeroDivisionError in the molecule_coverage rule, but this rule is not present in the current file. To properly review the fix, we need access to:
- The file containing the molecule_coverage rule
- The implementation of the fix for the division by zero error
Let's search for the molecule_coverage rule in the codebase:
harpy/snp.py (1)
104-104: Verify make_windows.py interface changes.
The command-line interface for make_windows.py has been modified to use positional arguments instead of -i and -o flags. Let's verify these changes:
Also applies to: 209-209
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (11)
harpy/bin/make_windows.py (1)
Line range hint 65-93: Fix FASTA processing resource management and length calculation.
Several issues need attention:
- File handle not properly managed with context manager
- Length calculation has an off-by-one error (subtracting 1 from each line length)
Apply this diff:
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
     if is_gzip(args.input):
-        fopen = gzip.open(args.input, "rt")
+        opener = gzip.open
     else:
-        fopen = open(args.input, "r", encoding="utf-8")
+        opener = open
+
+    try:
+        with opener(args.input, "rt", encoding="utf-8") as fopen:
             line = fopen.readline()
             while True:
                 C_LEN = 0
                 if not line:
                     break
                 if line.startswith(">"):
                     contig = line.rstrip("\n").lstrip(">").split()[0]
                     HEADER = False
                     while not HEADER:
                         line = fopen.readline()
                         if not line:
                             break
                         line = line.rstrip("\n")
                         if line.startswith(">"):
                             HEADER = True
                             break
                         else:
-                            C_LEN += len(line)-1
+                            C_LEN += len(line)
                     HEADER = False
                     C_LEN += args.mode
                     makewindows(contig, C_LEN, args.window)
-    fopen.close()
+    except Exception as e:
+        sys.stderr.write(f"Error processing FASTA file: {e}\n")
+        sys.exit(1)

🧰 Tools
🪛 Ruff
67-67: Use a context manager for opening files
(SIM115)
69-69: Use a context manager for opening files
(SIM115)
harpy/bin/10xtoHaplotag.py (2)
24-29: Good addition of input validation, consider enhancing it further.
The new error handling for missing input files is a good improvement. It validates all files and provides a clear error message listing all missing files at once.
Consider adding these additional validations:
 err = []
 for i in [args.forward, args.reverse, args.barcodes]:
     if not os.path.exists(i):
         err.append(i)
+    elif not os.access(i, os.R_OK):
+        err.append(f"{i} (not readable)")
+    elif i.endswith('.gz') and not is_valid_gzip(i):
+        err.append(f"{i} (invalid gzip)")
 if err:
     parser.error("Some input files were not found on the system:\n" + ", ".join(err))
+
+def is_valid_gzip(filepath):
+    try:
+        with gzip.open(filepath, 'rb') as f:
+            f.read(1)
+        return True
+    except:
+        return False
Line range hint 77-94: Consider adding error handling for file operations and improving resource management.
The file processing logic could benefit from additional error handling and resource management improvements.
Consider these improvements:
-fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
-rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
-
-with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
+try:
+    fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
+    rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
+except IOError as e:
+    parser.error(f"Failed to create output files: {str(e)}")
+
+try:
+    with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
+        try:
             # store the fq records here
             fw_record = []
             rv_record = []
             i = 0
             for fw, rv in zip_longest(fw_i, rv_i):
                 fw_line = fw.decode().rstrip("\n")
                 rv_line = rv.decode().rstrip("\n")
+        except (IOError, UnicodeDecodeError) as e:
+            parser.error(f"Error processing input files: {str(e)}")
+        finally:
+            fw_out.close()
+            rv_out.close()
+except IOError as e:
+    parser.error(f"Failed to open input files: {str(e)}")
-
-fw_out.close()
-rv_out.close()

Also consider:
- Using a context manager for output files (see the sketch after this list)
- Adding progress indicators for large files
- Implementing memory-efficient barcode dictionary handling
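A sketch of the first item, reusing the names from the diff above — all four handles close automatically, even on error:

```python
with gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6) as fw_out, \
     gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6) as rv_out, \
     gzip.open(fw_reads, "r") as fw_i, \
     gzip.open(rv_reads, "r") as rv_i:
    # process records here; no explicit close() calls needed
    ...
```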
harpy/bin/bx_stats.py (2)
Line range hint 49-59: Consider improving error handling for zero-length molecules
While the code handles the case where inferred length is 0 by setting coverage to 0, this might mask underlying data quality issues. Consider:
- Logging these occurrences for debugging
- Adding validation to ensure start position is always less than end position
- Documenting why zero-length molecules might occur and how they're handled
 def writestats(x, writechrom, destination):
     """write to file the bx stats dictionary as a table"""
     for _mi in x:
         x[_mi]["inferred"] = x[_mi]["end"] - x[_mi]["start"]
+        # Log zero-length molecules for debugging
+        if x[_mi]["inferred"] == 0:
+            sys.stderr.write(f"Warning: Zero-length molecule detected: {writechrom}:{_mi}\n")
         if x[_mi]["inferred"] > 0:
             x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / x[_mi]["inferred"], 1.0),4)
         else:
             x[_mi]["covered_bp"] = 0
         if x[_mi]["inferred"] > 0:
             x[_mi]["covered_inserts"] = round(min(x[_mi]["insert_len"] / x[_mi]["inferred"], 1.0), 4)
         else:
             x[_mi]["covered_inserts"] = 0
Line range hint 134-146: Document edge cases and rationale for length calculations
The logic for calculating insert sizes and lengths handles different types of reads (paired, supplementary, unpaired) but could benefit from better documentation explaining:
- Why max(0, read.template_length) is used for paired reads
- Why max(abs(read.template_length), read.infer_query_length()) is used for unpaired reads
- The implications of these calculations on molecule coverage
 if read.is_paired:
-    # by using max(), will either add 0 or positive TLEN to avoid double-counting
+    # Use max() to ensure non-negative TLEN and avoid double-counting
+    # For paired reads, we only want to count positive template lengths
     isize = max(0, read.template_length)
-    #isize = min(bp, abs(read.template_length))
-    # only count the bp of the first read of paired end reads
-    #bp = bp if read.is_read1 else 0
+    # Use the smaller of template length or query length to avoid overestimation
     bp = min(abs(read.template_length), read.infer_query_length())
 elif read.is_supplementary:
-    # if it's a supplementary alignment, just use the alignment length
+    # For supplementary alignments, use alignment length directly
+    # as they represent additional mappings of the same read
     isize = bp
 else:
-    # if it's unpaired, use the TLEN or query length, whichever is bigger
+    # For unpaired reads, use the larger of TLEN or query length
+    # to capture the full extent of the molecule
     isize = max(abs(read.template_length), read.infer_query_length())

harpy/reports/leviathan.Rmd (2)
Line range hint 211-219: LGTM! Consider adding input validation.
The conditional check improvement makes the filtering logic more robust by ensuring proper handling of the "default" case. However, consider adding input validation for plot_contigs to handle edge cases.

Add validation before the conditional check:

+# Validate plot_contigs input
+if (!is.null(plot_contigs) && !is.vector(plot_contigs)) {
+  stop("plot_contigs must be a vector")
+}
+
 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
     # make sure that the contigs with SV are the ones being plotted, not the others
     fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$contig), 1:2]
Line range hint 190-190: LGTM! Consider documenting the rationale.
The replacement of dots with "1" in the length column helps prevent potential issues with invalid values. However, consider adding a comment to explain why dots are replaced with "1" specifically.

Add documentation:

+# Replace dots (representing missing/invalid lengths) with "1" to ensure valid numeric values
+# This prevents potential issues in downstream calculations and visualizations
 sv$length <- gsub("\\.", "1", sv$length) %>% as.numeric()

harpy/reports/naibr.Rmd (1)
Line range hint 294-301: Consider improving code style and documentation.
The code block could benefit from:
- Consistent indentation (R typically uses 2 spaces)
- A more descriptive comment explaining the purpose of the default contig selection
Consider this refinement:
-if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
-  # make sure that the contigs with SV are the ones being plotted, not the others
-  fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
-  # limit the data to only the 30 of the largest present contigs
-  fa.sizes <- fa.sizes[1:min(nrow(fa.sizes), 30), ]
-} else {
-  fa.sizes <- fa.sizes[fa.sizes$contig %in% plot_contigs, 1:2]
-}
+# When "default" is explicitly specified as the only plotting option,
+# automatically select up to 30 of the largest contigs that contain structural variants
+if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) {
+  # Filter for contigs containing structural variants
+  fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
+  # Limit to 30 largest contigs
+  fa.sizes <- fa.sizes[1:min(nrow(fa.sizes), 30), ]
+} else {
+  # Use explicitly specified contigs
+  fa.sizes <- fa.sizes[fa.sizes$contig %in% plot_contigs, 1:2]
+}

harpy/reports/leviathan_pop.Rmd (1)
224-224: LGTM! The condition is now more robust.
The improved condition ensures that the default contig selection only triggers when plot_contigs is exactly ["default"]. This prevents potential runtime errors from incorrect assumptions about the plot_contigs parameter.

Consider adding a comment to explain the default behavior for better maintainability:

+# Use default contig selection when plot_contigs is ["default"]
 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
harpy/reports/naibr_pop.Rmd (1)
Line range hint 296-303: Consider adding validation for empty data scenarios.
While the condition check is improved, there are potential edge cases that could lead to issues:
- If there are no SVs, fa.sizes could become empty after filtering.
- Empty fa.sizes could affect subsequent operations and visualizations.

Consider adding validation:
 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
     # make sure that the contigs with SV are the ones being plotted, not the others
     fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
+    if (nrow(fa.sizes) == 0) {
+      warning("No contigs with structural variants found.")
+      return(NULL)
+    }
     # limit the data to only the 30 of the largest present contigs
     fa.sizes <- fa.sizes[1:min(nrow(fa.sizes), 30), ]
 } else {
     fa.sizes <- fa.sizes[fa.sizes$contig %in% plot_contigs, 1:2]
+    if (nrow(fa.sizes) == 0) {
+      warning("None of the specified contigs found in the data.")
+      return(NULL)
+    }
 }

harpy/reports/align_stats.Rmd (1)
Line range hint 588-597: Add validation for plot_contigs to prevent potential errors
If plot_contigs is empty or contains contig names not present in contigs, the filtering operations may result in empty data frames, which could cause errors downstream. Consider adding validation to ensure plot_contigs contains valid contig names.

Apply this diff to add validation:

 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
     # Limit the data to only the 30 largest contigs
     if (nrow(contigs) > 30){
         .contigs <- contigs[1:30, ]
         tb <- filter(tb, contig %in% .contigs$contig)
     } else {
         .contigs <- contigs
     }
 } else {
+    # Validate that plot_contigs are present in contigs
+    valid_contigs <- plot_contigs[plot_contigs %in% contigs$contig]
+    if (length(valid_contigs) == 0){
+        stop("No matching contigs found in the data for the specified plot_contigs.")
+    }
     .contigs <- filter(contigs, contig %in% valid_contigs)
     tb <- filter(tb, contig %in% valid_contigs)
 }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (20)
- harpy/bin/10xtoHaplotag.py (2 hunks)
- harpy/bin/assign_mi.py (4 hunks)
- harpy/bin/bx_stats.py (2 hunks)
- harpy/bin/check_bam.py (2 hunks)
- harpy/bin/check_fastq.py (2 hunks)
- harpy/bin/concatenate_bam.py (2 hunks)
- harpy/bin/count_bx.py (3 hunks)
- harpy/bin/deconvolve_alignments.py (1 hunks)
- harpy/bin/depth_windows.py (3 hunks)
- harpy/bin/infer_sv.py (2 hunks)
- harpy/bin/make_windows.py (4 hunks)
- harpy/bin/molecule_coverage.py (3 hunks)
- harpy/bin/parse_phaseblocks.py (2 hunks)
- harpy/bin/rename_bam.py (2 hunks)
- harpy/reports/align_stats.Rmd (1 hunks)
- harpy/reports/hapcut.Rmd (1 hunks)
- harpy/reports/leviathan.Rmd (1 hunks)
- harpy/reports/leviathan_pop.Rmd (1 hunks)
- harpy/reports/naibr.Rmd (1 hunks)
- harpy/reports/naibr_pop.Rmd (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- harpy/bin/assign_mi.py
- harpy/bin/check_bam.py
- harpy/bin/check_fastq.py
- harpy/bin/concatenate_bam.py
- harpy/bin/count_bx.py
- harpy/bin/deconvolve_alignments.py
- harpy/bin/depth_windows.py
- harpy/bin/infer_sv.py
- harpy/bin/parse_phaseblocks.py
- harpy/bin/rename_bam.py
🧰 Additional context used
🪛 Ruff
harpy/bin/make_windows.py
67-67: Use a context manager for opening files
(SIM115)
69-69: Use a context manager for opening files
(SIM115)
🔇 Additional comments (14)
harpy/bin/molecule_coverage.py (5)
Line range hint 4-8: LGTM: Import statements are properly organized.
The addition of the os module import is appropriate for the new file existence validation.
35-40: Excellent improvement in input validation!
The new file existence checks are well-implemented:
- Collects all missing files before erroring
- Provides clear error messages listing all missing files
- Uses proper error reporting through parser.error()

This directly addresses the PR objective of verifying input file integrity.
79-80: Consider enhancing column validation.
While the stderr change is good, we could further improve the column validation:
Consider this enhancement:
- sys.stderr.write("Error: Required columns 'contig', 'start', or 'end' not found in header\n")
+ missing_cols = []
+ if IDX_CONTIG is None:
+ missing_cols.append("contig")
+ if IDX_START is None:
+ missing_cols.append("start")
+ if IDX_END is None:
+ missing_cols.append("end")
+ sys.stderr.write(f"Error: Required columns not found in header: {', '.join(missing_cols)}\n")
sys.exit(1)
This would provide more specific information about which columns are missing.
23-23: LGTM: Improved argument validation.
Making both arguments required is a good improvement that helps prevent potential errors from missing inputs. This aligns well with the PR objectives for better input validation.
Let's verify the argument handling across the codebase:
Also applies to: 27-28
✅ Verification successful
Let me check the argument handling patterns in these files to verify consistency.
Based on the results, I can now generate the final response as I have sufficient information about the argument handling patterns across the codebase.
Argument handling is consistent with codebase patterns
The changes to make both arguments required and update the usage message in molecule_coverage.py align well with the established patterns in the codebase:
- Other scripts like make_windows.py and 10xtoHaplotag.py follow similar usage message formatting
- Required arguments are consistently enforced across scripts
- Help messages are descriptive and follow a uniform style
- All scripts implement proper argument validation with ArgumentParser
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistent argument handling in related scripts
# Look for similar argument patterns in other harpy scripts
rg -l "statsfile|fai" --type py
Length of output: 367
Script:
#!/bin/bash
# Check argument handling patterns in related files
rg -A 5 "ArgumentParser|add_argument.*statsfile|add_argument.*fai" harpy/bin/
Length of output: 7847
Line range hint 91-101: Verify coverage calculation for potential divide-by-zero scenarios.
While the error handling improvements are good, we should verify the coverage calculation logic for potential divide-by-zero scenarios, especially since this PR aims to fix a ZeroDivisionError.
Let's check for similar coverage calculation patterns across the codebase:
harpy/bin/make_windows.py (3)
3-22: LGTM! Argument parsing and input validation look good.
The argument parser is well-configured with clear help text and proper input validation.
Line range hint 38-51: LGTM! Window calculation logic is robust.
The function correctly handles window generation with proper edge case handling for the last window.
54-63: 🛠️ Refactor suggestion
Use context manager for FAI file processing.
To ensure proper resource cleanup, use a context manager for file operations.
Apply this diff:
- with open(args.input, "r", encoding="utf-8") as fai:
+ try:
+ with open(args.input, "r", encoding="utf-8") as fai:
for line in fai:
lsplit = line.split("\t")
contig = lsplit[0]
c_len = int(lsplit[1])
c_len += args.mode
makewindows(contig, c_len, args.window)
+ except ValueError as e:
+ sys.stderr.write(f"Error processing FAI file: {e}\n")
+ sys.exit(1)
Likely invalid or redundant comment.
harpy/bin/10xtoHaplotag.py (1)
3-3: LGTM! Import statements are correctly ordered and necessary.
Both imports are required for the script's functionality - os for file existence checks and gzip for compressed file handling.
Also applies to: 5-5
harpy/bin/bx_stats.py (2)
3-3: LGTM: Import addition is appropriate
The addition of the os module import is well-placed and necessary for the new file validation functionality.
33-34: LGTM: Input validation is well implemented
The file existence check is appropriately placed before any file operations and uses parser.error for clean error reporting and program termination.
harpy/reports/hapcut.Rmd (1)
192-192: LGTM! More robust vector comparison.
The updated condition if ((length(plotting_contigs) == 1) && (plotting_contigs[1] == "default")) is more robust than the previous version as it explicitly checks both the length and value, preventing potential issues with vector comparison in R.
Note: While this is a good improvement, it doesn't appear to be related to the divide-by-zero error mentioned in the PR objectives. Let's verify if there are any other files that might contain the actual fix:
harpy/reports/naibr.Rmd (1)
294-294: LGTM! The condition is now more precise.
The change improves the logic by ensuring the default contig selection only triggers when explicitly requested with a single "default" value.
Let's verify the behavior with different inputs:
✅ Verification successful
The search results show that the same condition if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) is consistently used across multiple report templates (naibr.Rmd, leviathan.Rmd, naibr_pop.Rmd, leviathan_pop.Rmd, align_stats.Rmd). The parameter plot_contigs is set from the configuration file and passed through various Snakemake rules.
Let's verify how this parameter is defined in the configuration:
Let's try searching in config directories which are common locations for yaml files:
Based on the search results, I can now generate the final response:
The condition change is safe and consistent with the codebase
The modified condition if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) is:
- Used consistently across all report templates
- Correctly handles the default configuration where plot_contigs is set to "default" when no contigs are specified
- Matches the configuration logic that sets plot_contigs either to user-specified contigs or "default"
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check how plot_contigs parameter is used across the codebase
# to ensure this change won't break existing calls
# Search for Snakemake rules that set plot_contigs parameter
rg -A 5 "plot_contigs.*="
Length of output: 6797
Script:
#!/bin/bash
# Search for plot_contigs configuration in yaml files
rg -A 3 "plot_contigs:" "*.yaml" "*.yml"
# Also check for any example configurations
fd -e yaml -e yml -x cat {} \; | rg -A 3 "reports:"
Length of output: 188
Script:
#!/bin/bash
# Search for yaml files first
fd -e yaml -e yml
# Then look for plot_contigs in any potential config files
fd . -t f -x cat {} \; | rg -B 2 -A 3 "plot_contigs"
# Also check for example configs in docs or config directories
fd "config|docs" -t d
Length of output: 693
harpy/reports/naibr_pop.Rmd (1)
296-296: LGTM! The condition check is now more precise.
The updated condition if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) is more robust as it:
- Explicitly checks both the length and value of plot_contigs
- Prevents potential issues when plot_contigs is empty or has multiple values
- Ensures the default behavior only triggers under the exact intended condition
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (21)
harpy/bin/parse_phaseblocks.py (2)
8-13: Enhance CLI documentation for better user guidance.
While the change to a positional argument simplifies usage, consider improving the help messages:

 parser = argparse.ArgumentParser(
     prog='parse_phaseblocks.py',
     description='Parse a phase block file from HapCut2 to pull out summary information',
-    usage = "parse_phaseblocks.py input > output.txt"
+    usage = "parse_phaseblocks.py <blocks_file> > <output.txt>"
 )
-parser.add_argument("input", type=str, help="input HapCut2 phase blocks file")
+parser.add_argument("input", type=str, help="input HapCut2 phase blocks file (typically ending in .blocks)")
Line range hint 24-42: Consider improving file processing robustness.
While the core logic works, there are several opportunities for improvement:
- Add validation for the HapCut2 blocks file format
- Add error handling for malformed lines
- Use a more Pythonic approach for file iteration
Here's a suggested refactor:
-with open(args.input, "r", encoding="utf-8") as blocks:
-    FIRST_LINE = True
-    while True:
-        line = blocks.readline()
-        if not line:
-            break
-        lsplit = line.split()
-        if lsplit[0] == "BLOCK:":
-            n_snp = int(lsplit[6])
-            len_block = int(lsplit[8])
-            FIRST_LINE = True
-        else:
-            if FIRST_LINE:
-                pos_start = int(lsplit[4])
-                contig = lsplit[3]
-                FIRST_LINE = False
-                sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
-            else:
-                continue
+with open(args.input, "r", encoding="utf-8") as blocks:
+    try:
+        first_line = True
+        for line in blocks:
+            try:
+                lsplit = line.strip().split()
+                if not lsplit:  # Skip empty lines
+                    continue
+
+                if lsplit[0] == "BLOCK:":
+                    if len(lsplit) < 9:
+                        raise ValueError("Invalid BLOCK line format")
+                    n_snp = int(lsplit[6])
+                    len_block = int(lsplit[8])
+                    first_line = True
+                elif first_line:
+                    if len(lsplit) < 5:
+                        raise ValueError("Invalid data line format")
+                    pos_start = int(lsplit[4])
+                    contig = lsplit[3]
+                    first_line = False
+                    sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
+            except (IndexError, ValueError) as e:
+                sys.stderr.write(f"Warning: Skipping malformed line: {line.strip()} ({str(e)})\n")
+    except Exception as e:
+        parser.error(f"Error processing file: {str(e)}")

This refactor:
- Uses a for loop instead of while True
- Adds proper error handling for malformed lines
- Validates line format before processing
- Provides informative error messages
harpy/bin/depth_windows.py (3)
8-24: LGTM! Minor typo in help text.
The argument validation and special case handling look good. There's a small typo in the help text: "calcualte" should be "calculate".

-parser.add_argument('windowsize', type= int, help = "The window size to calcualte mean depth over (non-overlapping)")
+parser.add_argument('windowsize', type= int, help = "The window size to calculate mean depth over (non-overlapping)")
39-41: Add warning message for skipped windows.
While the zero division protection is good, it silently skips windows when WINSIZE is zero. Consider adding a warning message to help with debugging.

 if WINSIZE > 0:
     depth = _SUM / WINSIZE
     sys.stdout.write(f"{LAST_CONTIG}\t{POSITION}\t{depth}\n")
+else:
+    sys.stderr.write(f"Warning: Skipping window for {LAST_CONTIG} due to insufficient positions\n")
Line range hint `1-52`: Consider adding unit tests for edge cases.

While the changes improve error handling, this script would benefit from unit tests to verify:
- Edge cases with varying window sizes
- Behavior with empty input
- Contig transitions
- Window size validation
This would help prevent future regressions and ensure consistent behavior.
Would you like me to help create a test suite for this script?
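As a starting point, here is a minimal pytest-style sketch of such tests. It is purely illustrative: the script is currently a CLI, so the per-window calculation would first need to be factored out into an importable helper like the hypothetical `mean_depth` below.

```python
def mean_depth(total_depth: int, winsize: int):
    """Hypothetical helper mirroring the script's per-window calculation;
    returns None instead of dividing when the window size is zero."""
    return total_depth / winsize if winsize > 0 else None

def test_normal_window():
    assert mean_depth(100, 10) == 10.0

def test_zero_window_size_is_skipped():
    # the divide-by-zero case this PR guards against
    assert mean_depth(100, 0) is None

def test_empty_window_has_zero_depth():
    assert mean_depth(0, 10) == 0.0
```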
harpy/bin/infer_sv.py (2)
`22-23`: Enhance the error message for file not found.

While the file existence check is good, consider making the error message more descriptive to help users understand and resolve the issue.

```diff
-    parser.error(f"{args.bedfile} was not found")
+    parser.error(f"Input BEDPE file '{args.bedfile}' does not exist or is not accessible")
```
Line range hint `25-67`: Add robust error handling for data processing.

Several potential issues need to be addressed:
- Missing validation for the orientation field against the `conversions` dictionary
- No protection against index errors for malformed input
- Duplicate code for header processing
- No error handling for file operations
Consider refactoring the code as follows:
```diff
 conversions = {
     "+-": "deletion",
     "--": "inversion",
     "++": "inversion",
     "-+": "duplication"
 }

+def process_header(header_line):
+    """Process the header line and return the formatted header."""
+    fields = header_line.strip().split("\t")
+    fields_corrected = [i.title().replace(" ", "") for i in fields]
+    fields_corrected.append("SV")
+    return "\t".join(fields_corrected) + "\n"
+
+def process_data_line(line, conversions):
+    """Process a data line and return the formatted line with SV type."""
+    fields = line.strip().split("\t")
+    if len(fields) < 7:
+        raise ValueError(f"Invalid input line: expected at least 7 fields, got {len(fields)}")
+    orientation = fields[6]
+    if orientation not in conversions:
+        raise ValueError(f"Invalid orientation '{orientation}'. Expected one of {list(conversions.keys())}")
+    fields.append(conversions[orientation])
+    return "\t".join(fields) + "\n"
+
 if args.failfile:
-    with open(args.bedfile, "r", encoding="utf-8") as f, open(args.failfile, "w", encoding="utf-8") as failout:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        failout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            inference = conversions[line[6]]
-            line.append(inference)
-            NEWROW = "\t".join(line) + "\n"
-            if "FAIL" not in line:
-                sys.stdout.write(NEWROW)
-            else:
-                failout.write(NEWROW)
+    try:
+        with open(args.bedfile, "r", encoding="utf-8") as f, open(args.failfile, "w", encoding="utf-8") as failout:
+            # Process header
+            header = process_header(f.readline())
+            sys.stdout.write(header)
+            failout.write(header)
+
+            # Process data lines
+            for line in f:
+                try:
+                    processed_line = process_data_line(line, conversions)
+                    if "FAIL" not in line:
+                        sys.stdout.write(processed_line)
+                    else:
+                        failout.write(processed_line)
+                except ValueError as e:
+                    sys.stderr.write(f"Warning: Skipping invalid line: {e}\n")
+                    continue
+    except IOError as e:
+        parser.error(f"Error processing files: {e}")
 else:
-    with open(args.bedfile, "r", encoding="utf-8") as f:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            if "FAIL" not in line:
-                inference = conversions[line[6]]
-                line.append(inference)
-                NEWROW = "\t".join(line) + "\n"
-                sys.stdout.write(NEWROW)
+    try:
+        with open(args.bedfile, "r", encoding="utf-8") as f:
+            # Process header
+            header = process_header(f.readline())
+            sys.stdout.write(header)
+
+            # Process data lines
+            for line in f:
+                if "FAIL" not in line:
+                    try:
+                        processed_line = process_data_line(line, conversions)
+                        sys.stdout.write(processed_line)
+                    except ValueError as e:
+                        sys.stderr.write(f"Warning: Skipping invalid line: {e}\n")
+                        continue
+    except IOError as e:
+        parser.error(f"Error processing files: {e}")
```

This refactoring:
- Adds validation for orientation field
- Protects against malformed input
- Reduces code duplication through helper functions
- Adds proper error handling for file operations
- Uses a `for` loop instead of `while True` for cleaner iteration

harpy/bin/make_windows.py (1)
Line range hint `65-93`: Fix resource management and sequence length calculation.

Several issues in the FASTA processing logic:
- Files aren't properly closed (use context manager)
- Sequence length calculation has an off-by-one error (`C_LEN += len(line)-1`)
- Complex nested loop structure could be simplified
```diff
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
     if is_gzip(args.input):
-        fopen = gzip.open(args.input, "rt")
+        opener = gzip.open
     else:
-        fopen = open(args.input, "r", encoding="utf-8")
+        opener = open
-    line = fopen.readline()
-    while True:
-        C_LEN=0
-        if not line:
-            break
-        if line.startswith(">"):
-            # remove newline, > starting symbol, and any comments after name
-            contig = line.rstrip("\n").lstrip(">").split()[0]
-            # keep reading until hitting another ">"
-            HEADER = False
-            while not HEADER:
-                line = fopen.readline()
-                if not line:
-                    break
-                line = line.rstrip("\n")
-                if line.startswith(">"):
-                    HEADER = True
-                    break
-                else:
-                    C_LEN += len(line)-1
-                    HEADER = False
-            C_LEN += args.mode
-            makewindows(contig, C_LEN, args.window)
-    fopen.close()
+    with opener(args.input, "rt", encoding="utf-8") as fopen:
+        current_contig = None
+        current_len = 0
+
+        for line in fopen:
+            line = line.rstrip()
+            if line.startswith(">"):
+                if current_contig:
+                    current_len += args.mode
+                    makewindows(current_contig, current_len, args.window)
+                current_contig = line.lstrip(">").split()[0]
+                current_len = 0
+            else:
+                current_len += len(line)
+
+        if current_contig:
+            current_len += args.mode
+            makewindows(current_contig, current_len, args.window)
```
🪛 Ruff
67-67: Use a context manager for opening files
(SIM115)
69-69: Use a context manager for opening files
(SIM115)
harpy/bin/10xtoHaplotag.py (4)
`24-29`: LGTM! Consider additional file validation checks.

The new error handling for missing input files is a good improvement.
Consider adding these additional validations:
```diff
 err = []
 for i in [args.forward, args.reverse, args.barcodes]:
     if not os.path.exists(i):
         err.append(i)
+    elif not os.access(i, os.R_OK):
+        err.append(f"{i} (not readable)")
+
+# Validate output directory is writable
+outdir = os.path.dirname(args.prefix) or "."
+if not os.access(outdir, os.W_OK):
+    err.append(f"{outdir} (not writable)")
+
 if err:
     parser.error("Some input files were not found on the system:\n" + ", ".join(err))
```
Line range hint `82-83`: Add context manager for output files.

The output files should be managed using a context manager to ensure proper cleanup on errors.
```diff
-fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
-rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
+# Use context manager to ensure files are properly closed
+with gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6) as fw_out, \
+     gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6) as rv_out:
+    with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
+        # ... rest of the processing code ...
```
Line range hint `85-116`: Add error handling and progress reporting for large files.

The main processing loop lacks error handling and progress reporting.
Consider these improvements:
- Add try-except blocks for I/O operations
- Add progress reporting for large files
- Clean up partial output files on error
```diff
+import sys
+from pathlib import Path
+
+def process_fastq_files(fw_i, rv_i, fw_out, rv_out):
+    try:
+        records_processed = 0
 fw_record = []
 rv_record = []
 i = 0
 for fw, rv in zip_longest(fw_i, rv_i):
+    if records_processed % 1000000 == 0:
+        print(f"Processed {records_processed:,} records", file=sys.stderr)
     fw_line = fw.decode().rstrip("\n")
     rv_line = rv.decode().rstrip("\n")
     if fw_line.startswith("@") and i > 0:
         i += 1
+        records_processed += 1
         # process the full record
         new_fw, new_rv = process_record(fw_record, rv_record)
         # write new record to files
         fw_out.write(new_fw.encode("utf-8"))
         rv_out.write(new_rv.encode("utf-8"))
         # reset the record
         fw_record = [fw_line]
         rv_record = [rv_line]
     elif fw_line.startswith("@") and i == 0:
         # its the first record don't write anything
         i += 1
         fw_record = [fw_line]
         rv_record = [rv_line]
     else:
         # just append the other lines to the record
         fw_record.append(fw_line)
         rv_record.append(rv_line)
+    except Exception as e:
+        # Clean up partial output files
+        fw_out.close()
+        rv_out.close()
+        Path(f"{args.prefix}.R1.fq.gz").unlink(missing_ok=True)
+        Path(f"{args.prefix}.R2.fq.gz").unlink(missing_ok=True)
+        raise RuntimeError(f"Error processing FASTQ files: {str(e)}")
```
Line range hint `31-45`: Consider memory optimization for barcode dictionary.

The barcode dictionary could grow large with many unique barcodes.
Consider using a more memory-efficient approach:
- Process barcodes in batches
- Use a database for very large barcode sets
- Implement streaming processing for the barcode file
Example optimization:
```python
def process_barcodes_in_batches(barcode_file, batch_size=1000000):
    """Process barcodes in memory-efficient batches."""
    batch = {}
    for line in barcode_file:
        if len(batch) >= batch_size:
            yield batch
            batch = {}
        bc = line.rstrip("\n").split()
        batch[str(bc[0])] = None
    if batch:
        yield batch
```

harpy/bin/bx_stats.py (2)
`3-3`: LGTM! Consider enhancing the error message.

The file existence check is a good addition. Consider making the error message more informative by checking file permissions as well.

```diff
-if not os.path.exists(args.input):
-    parser.error(f"{args.input} was not found")
+if not os.path.exists(args.input):
+    parser.error(f"Input file '{args.input}' was not found")
+elif not os.access(args.input, os.R_OK):
+    parser.error(f"Input file '{args.input}' exists but is not readable")
```

Also applies to: 33-34
Line range hint `51-58`: Improve division by zero handling and add documentation.

While the code handles division by zero cases by checking if `inferred > 0`, it could be more explicit and documented. This appears to be related to the ZeroDivisionError mentioned in issue #160.

```diff
-        if x[_mi]["inferred"] > 0:
-            x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / x[_mi]["inferred"], 1.0),4)
-        else:
-            x[_mi]["covered_bp"] = 0
-        if x[_mi]["inferred"] > 0:
-            x[_mi]["covered_inserts"] = round(min(x[_mi]["insert_len"] / x[_mi]["inferred"], 1.0), 4)
-        else:
-            x[_mi]["covered_inserts"] = 0
+        # Handle coverage calculations, avoiding division by zero
+        # When inferred length is 0, we can't calculate meaningful coverage
+        inferred_len = x[_mi]["inferred"]
+        if inferred_len <= 0:
+            x[_mi]["covered_bp"] = 0
+            x[_mi]["covered_inserts"] = 0
+        else:
+            x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / inferred_len, 1.0), 4)
+            x[_mi]["covered_inserts"] = round(min(x[_mi]["insert_len"] / inferred_len, 1.0), 4)
```

harpy/reports/leviathan.Rmd (1)
`211-211`: LGTM! Consider documenting the `plot_contigs` parameter.

The change from `plot_contigs == "default"` to `all(plot_contigs == "default")` is a good improvement that properly handles vector comparisons. This prevents potential bugs when `plot_contigs` contains multiple values.

Consider adding:
- Documentation about the expected format of `plot_contigs` (e.g., in a comment)
- Input validation to ensure `plot_contigs` is not NULL or empty

```diff
+# plot_contigs: character vector specifying which contigs to plot
+# if set to "default", plots up to 30 of the largest contigs with SVs
+if (is.null(plot_contigs) || length(plot_contigs) == 0) {
+  stop("plot_contigs parameter cannot be NULL or empty")
+}
 if (all(plot_contigs == "default")){
```

harpy/reports/leviathan_pop.Rmd (1)
`224-224`: LGTM! Consider adding input validation.

The change from `plot_contigs == "default"` to `all(plot_contigs == "default")` correctly handles the case where `plot_contigs` is a vector, preventing potential bugs in contig filtering logic.

Consider adding input validation to ensure `plot_contigs` is not NULL or empty:

```diff
+if (is.null(plot_contigs) || length(plot_contigs) == 0) {
+  plot_contigs <- "default"
+}
 if (all(plot_contigs == "default")){
```

harpy/reports/hapcut.Rmd (5)
Line range hint `17-24`: Avoid installing packages within the R Markdown script to prevent runtime side effects.

The `using` function installs missing packages at runtime. Installing packages during script execution can lead to unexpected behavior, reproducibility issues, and potential security risks, especially in production environments or when users lack installation permissions.

Consider removing the package installation logic and ensure all required packages are installed beforehand. You can document the dependencies in the setup instructions or use package management tools like `renv` for reproducibility.
Line range hint `224-232`: Filter out non-positive values before applying `log()` to avoid computation errors.

Using `log(subset_cont)` without filtering may result in `-Inf` or `NaN` values if `subset_cont` contains zeros or negative numbers, leading to issues in the plot.

Consider adding a filter to include only positive block lengths:
```diff
 for (cont in plot_contigs) {
   subset_cont <- df$block_length[df$contig == cont]
+  subset_cont <- subset_cont[subset_cont > 0]
   fig <- fig %>%
     add_trace(
       x = log(subset_cont),
       type = "violin",
       name = cont,
       side = "positive",
       meanline = list(visible = T),
       visible = "legendonly"
     )
 }
```
Line range hint `310-318`: Exclude non-positive values before logarithmic transformation to prevent errors.

Applying `log(subset_sample)` directly can cause issues if there are zero or negative block lengths in `subset_sample`. This can result in plot anomalies or warnings.

Include a filter to ensure only positive values are used:
```diff
 subset_sample <- df$block_length[df$sample == samp]
+subset_sample <- subset_sample[subset_sample > 0]
 fig <- fig %>%
   add_trace(
     x = log(subset_sample),
     type = "violin",
     name = samp,
     side = "positive",
     meanline = list(visible = T),
     visible = "legendonly"
   )
```
Line range hint `196-203`: Use `n()` instead of `round(length(n_snp))` for accurate haplotype counts per contig.

In the `summarise` function, `round(length(n_snp))` is a roundabout way to count haplotypes: within a grouped `summarise`, `length(n_snp)` returns the group size, so the `round()` call is redundant. Replace it with `n()`, the idiomatic dplyr way to count the number of rows (haplotypes) per group:

```diff
 percontig <- df %>% group_by(contig) %>% summarise(
-  n_haplo = round(length(n_snp)),
+  n_haplo = n(),
   mean_snps = round(mean(n_snp), digits = 0),
   median_snps = median(n_snp),
   mean_blocksize = round(mean(block_length), digits = 0),
   median_blocksize = median(block_length),
   max_blocksize = max(block_length)
 )
```
Line range hint `284-291`: Count haplotypes per sample with `n()` instead of `sum(length(n_snp))`.

The expression `sum(length(n_snp))` is an indirect way to count haplotypes per sample: inside a grouped `summarise`, `length(n_snp)` already equals the group size, so wrapping it in `sum()` adds nothing. Use `n()` to count the number of haplotypes directly:

```diff
 persample <- df %>% group_by(sample) %>% summarise(
-  n_haplo = sum(length(n_snp)),
+  n_haplo = n(),
   mean_snps = round(mean(n_snp), digits = 0),
   median_snps = median(n_snp),
   mean_blocksize = round(mean(block_length), digits = 0),
   median_blocksize = median(block_length),
   max_blocksize = max(block_length)
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (21)
- harpy/_cli_types_generic.py (4 hunks)
- harpy/bin/10xtoHaplotag.py (2 hunks)
- harpy/bin/assign_mi.py (4 hunks)
- harpy/bin/bx_stats.py (2 hunks)
- harpy/bin/check_bam.py (2 hunks)
- harpy/bin/check_fastq.py (2 hunks)
- harpy/bin/concatenate_bam.py (2 hunks)
- harpy/bin/count_bx.py (3 hunks)
- harpy/bin/deconvolve_alignments.py (1 hunks)
- harpy/bin/depth_windows.py (3 hunks)
- harpy/bin/infer_sv.py (2 hunks)
- harpy/bin/make_windows.py (4 hunks)
- harpy/bin/molecule_coverage.py (3 hunks)
- harpy/bin/parse_phaseblocks.py (2 hunks)
- harpy/bin/rename_bam.py (2 hunks)
- harpy/reports/align_stats.Rmd (1 hunks)
- harpy/reports/hapcut.Rmd (1 hunks)
- harpy/reports/leviathan.Rmd (1 hunks)
- harpy/reports/leviathan_pop.Rmd (1 hunks)
- harpy/reports/naibr.Rmd (1 hunks)
- harpy/reports/naibr_pop.Rmd (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- harpy/_cli_types_generic.py
- harpy/bin/assign_mi.py
- harpy/bin/check_bam.py
- harpy/bin/check_fastq.py
- harpy/bin/concatenate_bam.py
- harpy/bin/count_bx.py
- harpy/bin/deconvolve_alignments.py
- harpy/bin/molecule_coverage.py
- harpy/bin/rename_bam.py
🧰 Additional context used
🪛 Ruff
harpy/bin/make_windows.py
67-67: Use a context manager for opening files
(SIM115)
69-69: Use a context manager for opening files
(SIM115)
🔇 Additional comments (11)
harpy/bin/parse_phaseblocks.py (2)
`3-6`: LGTM! Clean and well-organized imports.

The imports are appropriate and follow PEP 8 style guidelines.

`14-20`: LGTM! Robust argument validation and error handling.

The added checks for no arguments and file existence improve the script's robustness and user experience.
harpy/bin/depth_windows.py (1)
`51-52`: Critical: Add zero division protection for window calculation.
While windowsize is validated at script start, it's still good practice to add protection here for consistency and future maintainability.
```diff
-depth = _SUM / args.windowsize
-sys.stdout.write(f"{contig}\t{END}\t{depth}\n")
+if args.windowsize > 0:
+    depth = _SUM / args.windowsize
+    sys.stdout.write(f"{contig}\t{END}\t{depth}\n")
+else:
+    sys.stderr.write(f"Warning: Skipping window for {contig} due to zero window size\n")
```
harpy/bin/infer_sv.py (2)
`3-5`: LGTM: Import changes are appropriate.

The addition of the `os` import is necessary for the new file existence check.
Line range hint `1-67`: Verify script usage and input file format.
Let's verify how this script is used in the workflow and check the expected BEDPE file format.
✅ Verification successful
Script is correctly integrated and used in the workflow
The verification shows that:
- The script is properly integrated into multiple NAIBR-related Snakemake workflows (`sv_naibr.smk`, `sv_naibr_pop.smk`, etc.)
- The usage is consistent across workflows, being used to process BEDPE files with the same command structure: `infer_sv.py {input.bedpe} -f {output.fail} > {output.bedpe}`
- The script maintains its position in the workflow, with expected input/output handling
The changes in the script are safe and maintain compatibility with the existing workflow integration.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check script usage and input file format
# Check for workflow files that use this script
echo "Checking workflow usage..."
rg -l "infer_sv\.py" --type snakefile
# Check for example BEDPE files to understand the expected format
echo -e "\nChecking for example BEDPE files..."
fd -e bedpe
# Check for documentation about the BEDPE format
echo -e "\nChecking for format documentation..."
rg -l "bedpe.*format|format.*bedpe" --type md
Length of output: 374
Script:
#!/bin/bash
# Let's try a broader search for workflow files and documentation
# Check for any workflow files (Snakefile, nextflow, etc.) that might use this script
echo "Checking workflow usage..."
rg -l "infer_sv"
# Look for documentation files in various formats
echo -e "\nChecking documentation..."
fd -e md -e rst -e txt
# Check if there are any test files or test data
echo -e "\nChecking for test files..."
fd -g "*test*"
Length of output: 464
Script:
#!/bin/bash
# Let's examine the workflow files and README for context
# Check how the script is used in the workflow files
echo "Checking workflow usage in Snakefiles..."
rg -A 5 -B 5 "infer_sv" harpy/snakefiles/sv_naibr.smk
rg -A 5 -B 5 "infer_sv" harpy/snakefiles/sv_naibr_pop.smk
# Check README for any relevant documentation
echo -e "\nChecking README..."
cat README.md
# Look for any input/output examples in the test directory
echo -e "\nChecking test directory..."
fd . test/vcf/ -t f -x head -n 5 {}
Length of output: 6081
harpy/bin/make_windows.py (2)
Line range hint `38-51`: LGTM! The window calculation logic is robust.

The function correctly (a small illustrative sketch follows this list):
- Handles edge cases for the last window
- Adjusts for 0/1-based coordinates
- Uses efficient list operations
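To make the window arithmetic concrete, here is a self-contained sketch of non-overlapping window generation with a truncated final window. It is a simplified illustration only, not the script's actual code; the `windows` helper and its coordinate convention are assumptions for the example.

```python
def windows(contig: str, length: int, size: int):
    """Yield non-overlapping (contig, start, end) windows in 1-based,
    inclusive coordinates, truncating the last window at the contig end."""
    start = 1
    while start <= length:
        end = min(start + size - 1, length)
        yield contig, start, end
        start = end + 1

# A 25 bp contig with 10 bp windows ends in a short final window:
print(list(windows("contig1", 25, 10)))
# [('contig1', 1, 10), ('contig1', 11, 20), ('contig1', 21, 25)]
```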
Line range hint `1-97`: Verify connection to divide-by-zero fix.

While the changes improve code robustness, they don't directly address the divide-by-zero error mentioned in PR #160. Let's verify if this file is used by the affected `molecule_coverage` rule.
harpy/bin/bx_stats.py (1)
Line range hint `49-50`: Validate molecule length calculations.

The inferred length calculation (`end - start`) could potentially be zero or negative. Consider adding validation to ensure meaningful length values.
- x[_mi]["inferred"] = x[_mi]["end"] - x[_mi]["start"]
+ # Ensure meaningful molecule length
+ length = x[_mi]["end"] - x[_mi]["start"]
+ x[_mi]["inferred"] = max(0, length) # Prevent negative lengths
harpy/reports/naibr.Rmd (1)
Line range hint `294-299`: LGTM! The contig filtering logic improves robustness.
The change ensures that only contigs with structural variants are plotted, which should prevent potential errors when processing empty datasets. However, consider adding input validation to handle edge cases.
Add validation for empty `plot_contigs`:
```diff
 if (all(plot_contigs == "default")){
+  if (length(unique(sv$Chr1)) == 0) {
+    warning("No structural variants found in any contigs")
+    return()
+  }
   # make sure that the contigs with SV are the ones being plotted, not the others
   fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
```
Let's verify if this change fixes the divide-by-zero error:
✅ Verification successful
Let me gather more information about the potential division operations in the plotting code, particularly around the modified section.
rg -B 5 -A 10 '\`\`\`\{r.*plot' harpy/reports/naibr.Rmd
# Look for any operations with fa.sizes to understand potential division operations
rg -B 2 -A 5 'fa.sizes' harpy/reports/naibr.Rmd
---
Let me fix the syntax error and try again.
rg -B 5 -A 10 '\`\`\`\{r.*plot' harpy/reports/naibr.Rmd
# Look for any operations with fa.sizes to understand potential division operations
rg -B 2 -A 5 'fa\.sizes' harpy/reports/naibr.Rmd
# Search for any code chunks that might contain division operations
rg -B 2 -A 5 'plot|ggplot|geom_' harpy/reports/naibr.Rmd
Let me try one more time with properly escaped quotes:
Based on the code context, I can now generate the final response:
The contig filtering change is correct and prevents potential divide-by-zero errors
The modification ensures that only contigs containing structural variants are included in `fa.sizes`, which is used to create the circos plot. This prevents potential errors in the following ways:
- Filters out contigs without SVs before they reach the plotting functions
- Ensures the `fa.sizes` and `sv` dataframes are properly aligned
- The subsequent BioCircos plotting code won't attempt to create tracks for empty contigs
The change is safe and doesn't require additional validation because:
- The code already handles empty `fa.sizes` through `min(nrow(fa.sizes), 30)`
- The filtering is applied before any division operations in the plotting code
- The circos plot generation has proper error handling through the R markdown's chunk options
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for potential division operations in the plotting code
rg -A 5 "/(|\s+/\s+|\s*/\s*)" harpy/reports/naibr.Rmd
# Search for related error handling in test files
fd -e py test_ --exec rg -l "ZeroDivisionError|divide.*zero"
Length of output: 2345
Script:
#!/bin/bash
# Search for plotting code chunks and their context
rg -B 5 -A 10 '\`\`\`\{r.*plot' harpy/reports/naibr.Rmd
Length of output: 93
Script:
#!/bin/bash
# Search for plotting code chunks and their context
rg -B 5 -A 10 '\`\`\`\{r.*plot' harpy/reports/naibr.Rmd
Length of output: 93
Script:
#!/bin/bash
# Search for plotting code chunks and their context
rg -B 5 -A 10 "\`\`\`\{r.*plot" harpy/reports/naibr.Rmd
# Look for any operations with fa.sizes
rg -B 2 -A 5 "fa\.sizes" harpy/reports/naibr.Rmd
# Search for any code chunks that might contain division operations
rg -B 2 -A 5 "plot|ggplot|geom_" harpy/reports/naibr.Rmd
Length of output: 6384
harpy/reports/naibr_pop.Rmd (1)
`296-296`: Good fix for the vector comparison!

The change from `if (plot_contigs == "default")` to `if (all(plot_contigs == "default"))` is a crucial fix. In R, comparing a vector with a scalar without `all()` only checks the first element, which could lead to incorrect contig filtering. This ensures that the condition works correctly when `plot_contigs` contains multiple values.

Let's verify the usage of the `plot_contigs` parameter:
✅ Verification successful
Good fix for the vector comparison!

The change from `if (plot_contigs == "default")` to `if (all(plot_contigs == "default"))` is a crucial fix. The codebase search confirms that:
- `plot_contigs` is consistently passed as a parameter from Snakemake configuration across multiple report templates (naibr.Rmd, leviathan.Rmd, align_stats.Rmd, etc.)
- The parameter can be either "default" or a list of specific contigs, as seen in the Python code: `**({'plot_contigs': contigs} if contigs else {'plot_contigs': "default"})`
- The same vector comparison pattern is used consistently across all R markdown reports, confirming this is the correct approach for handling the `plot_contigs` parameter

The fix ensures proper handling of both single and multiple contig cases, preventing potential issues with vector comparison in R.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check how plot_contigs parameter is used in Snakemake rules
# Test: Search for plot_contigs parameter in Snakemake rules
rg -A 5 "plot_contigs" '*.smk' '*.snake'
Length of output: 134
Script:
#!/bin/bash
# Let's search for plot_contigs in all files to understand its usage
rg -A 5 "plot_contigs"
# Also search for Snakemake files with different extensions
fd -e snakefile -e Snakefile -e snake -e smk
Length of output: 17613
harpy/reports/hapcut.Rmd (1)
`192-192`: Correct handling of vector comparison with the `all()` function.

Changing the condition to `if (all(plotting_contigs == "default"))` ensures proper evaluation when `plotting_contigs` is a vector. This prevents unintended behavior from element-wise comparison, which could lead to incorrect branching due to partial matches.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
harpy/bin/make_windows.py (2)
`54-63`: Use context manager for file handling.

While the file handling works, it would be better to use a context manager to ensure proper resource cleanup.
```diff
-with open(args.input, "r", encoding="utf-8") as fai:
+try:
+    with open(args.input, "r", encoding="utf-8") as fai:
         while True:
             line = fai.readline()
             if not line:
                 break
             lsplit = line.split("\t")
             contig = lsplit[0]
             c_len = int(lsplit[1])
             c_len += args.mode
             makewindows(contig, c_len, args.window)
+except IOError as e:
+    parser.error(f"Error reading {args.input}: {e}")
```
Line range hint `65-93`: Improve FASTA file handling with context managers and error handling.

The current implementation could benefit from:
- Using context managers for file handling
- Better error handling for file operations
- Simplified file reading logic
```diff
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
-    if is_gzip(args.input):
-        fopen = gzip.open(args.input, "rt")
-    else:
-        fopen = open(args.input, "r", encoding="utf-8")
-    line = fopen.readline()
-    while True:
+    try:
+        open_func = gzip.open if is_gzip(args.input) else open
+        with open_func(args.input, "rt", encoding="utf-8") as fopen:
+            line = fopen.readline()
+            while line:
                 C_LEN=0
-        if not line:
-            break
                 if line.startswith(">"):
                     contig = line.rstrip("\n").lstrip(">").split()[0]
                     HEADER = False
                     while not HEADER:
                         line = fopen.readline()
                         if not line:
                             break
                         line = line.rstrip("\n")
                         if line.startswith(">"):
                             HEADER = True
                             break
                         else:
                             C_LEN += len(line)-1
                             HEADER = False
                     C_LEN += args.mode
                     makewindows(contig, C_LEN, args.window)
+                line = fopen.readline()
-    fopen.close()
+    except IOError as e:
+        parser.error(f"Error reading {args.input}: {e}")
```
🪛 Ruff
67-67: Use a context manager for opening files
(SIM115)
69-69: Use a context manager for opening files
(SIM115)
harpy/_validations.py (1)
`170-170`: LGTM! Good improvement to error reporting.

The change to use `os.path.basename()` improves error message clarity by showing only the filename instead of the full path. This makes error messages more concise and user-friendly.

Consider extracting the basename earlier in the function to avoid repeated `os.path.basename()` calls if more error reporting is added in the future:

```diff
 def check_RG(bamfile):
-    samplename = Path(bamfile).stem
+    basename = os.path.basename(bamfile)
+    samplename = Path(basename).stem
     samview = subprocess.run(f"samtools samples {bamfile}".split(), stdout = subprocess.PIPE).stdout.decode('utf-8').split()
     if samplename != samview[0]:
-        return os.path.basename(bamfile), samview[0]
+        return basename, samview[0]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- .github/workflows/tests.yml (1 hunks)
- harpy/_validations.py (1 hunks)
- harpy/bin/make_windows.py (4 hunks)
- harpy/bin/molecule_coverage.py (3 hunks)
- harpy/snp.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- harpy/bin/molecule_coverage.py
- harpy/snp.py
🧰 Additional context used
🪛 Ruff
harpy/bin/make_windows.py
67-67: Use a context manager for opening files
(SIM115)
69-69: Use a context manager for opening files
(SIM115)
🔇 Additional comments (3)
harpy/bin/make_windows.py (2)
Line range hint `38-51`: LGTM! Clean implementation of window creation.
The function correctly handles window creation and follows Unix philosophy by writing to stdout.
Line range hint `1-95`: Verify integration with molecule_coverage rule.
Since this PR aims to fix a ZeroDivisionError in the molecule_coverage rule, let's verify how this script is used in that context.
.github/workflows/tests.yml (1)
`556-556`: Command update looks good but let's verify the script.
The change from `rename_bam` to `rename_bam.py` with reordered arguments is an improvement that makes the script extension explicit.
Let's verify the script exists and is executable:
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
harpy/bin/haplotag_acbd.py (2)
`18-20`: Add error handling for directory operations.

The directory creation lacks proper error handling and user feedback. Consider adding checks for permissions and providing status messages.
Here's a suggested improvement:
```diff
 args = parser.parse_args()
 outdir = args.output_directory.rstrip("/")
-os.makedirs(outdir, exist_ok = True)
+try:
+    os.makedirs(outdir, exist_ok=True)
+    # Verify write permissions by attempting to create a test file
+    test_file = os.path.join(outdir, '.write_test')
+    try:
+        with open(test_file, 'w') as f:
+            pass
+        os.remove(test_file)
+    except IOError as e:
+        sys.stderr.write(f"Error: No write permission in directory '{outdir}'\n")
+        sys.exit(1)
+except OSError as e:
+    sys.stderr.write(f"Error creating directory '{outdir}': {e}\n")
+    sys.exit(1)
```
Line range hint `89-93`: Improve file writing operations and error handling.

The current implementation uses list comprehension for side effects and lacks error handling for file operations.
Here's a more idiomatic approach:
```diff
 for BC in ["A","C","B","D"]:
-    with open(f"{outdir}/BC_{BC}.txt", "w", encoding="utf-8") as f:
-        ID = [f"{BC}{number:02d}" for number in range(1, 97)]
-        delim = [" ".join(tup) for tup in zip(ID, BX[BC])]
-        _ = [f.write(f"{i}\n") for i in delim]
+    output_file = os.path.join(outdir, f"BC_{BC}.txt")
+    try:
+        with open(output_file, "wt", encoding="utf-8") as f:
+            IDs = [f"{BC}{number:02d}" for number in range(1, 97)]
+            for id_str, barcode in zip(IDs, BX[BC]):
+                f.write(f"{id_str}\t{barcode}\n")
+    except IOError as e:
+        sys.stderr.write(f"Error writing to '{output_file}': {e}\n")
+        sys.exit(1)
```

Changes made:
- Used `os.path.join` for path construction
- Added proper error handling
- Used "wt" mode to explicitly indicate text mode
- Replaced list comprehension with a more readable loop
- Fixed the variable naming to be more descriptive
harpy/bin/concatenate_bam.py (5)
`34-37`: Consider using f-strings for error messages.

The error messages are clear and helpful, but could be more concise using f-strings.
```diff
-    sys.stderr.write("Please provide a single file to \'--bamlist\' (-b) featuring all the files you want to concatenate (one per line):\n")
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt\n\n")
-    sys.stderr.write("Alternatively, provide the files after \'-o output.bam\':\n",)
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam\n")
+    sys.stderr.write(f"""Please provide a single file to '--bamlist' (-b) featuring all the files you want to concatenate (one per line):
+[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt
+
+Alternatively, provide the files after '-o output.bam':
+[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam
+""")
```
`50-57`: Consider using list comprehension for file validation.

The validation logic is correct but could be more concise.
```diff
-err = []
-for i in aln_list:
-    if not os.path.exists(i):
-        err.append(i)
-if err:
-    parser.error("Some input files were not found on the system:\n" + ", ".join(err))
+missing_files = [f for f in aln_list if not os.path.exists(f)]
+if missing_files:
+    parser.error(f"Some input files were not found on the system:\n{', '.join(missing_files)}")
```
`58-60`: Document the significance of the haplotag limit value.

Consider adding a comment explaining why the limit is 96^4 (e.g., "96 possible values for each of the 4 positions in the barcode"); see the sketch below.
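For example, something along these lines (a sketch; `haplotag_limit` is the variable name already used in the script, while the comment wording is only a suggestion):

```python
# Haplotag barcodes have 4 positions (A, C, B, D), each taking one of 96
# zero-padded values (01-96), so at most 96**4 = 84,934,656 unique barcodes
haplotag_limit = 96 ** 4
```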
`85-89`: Add type hints and document the BX tag format.

Consider adding type hints and documenting the BX tag format for better maintainability.
```diff
+# BX tag format: A##C##B##D## where ## are zero-padded numbers from 01-96
 if args.bx_unique:
-    bc_range = [f"{i}".zfill(2) for i in range(1,97)]
-    bc_generator = product("A", bc_range, "C", bc_range, "B", bc_range, "D", bc_range)
+    bc_range: list[str] = [f"{i}".zfill(2) for i in range(1,97)]
+    bc_generator: Iterator[tuple[str, ...]] = product("A", bc_range, "C", bc_range, "B", bc_range, "D", bc_range)
```
`93-96`: Simplify BX_NEW initialization using ternary operator.

The initialization can be more concise.
```diff
-    if args.bx_unique:
-        BX_NEW = "".join(next(bc_generator))
-    else:
-        BX_NEW = None
+    BX_NEW = "".join(next(bc_generator)) if args.bx_unique else None
```
🪛 Ruff
93-96: Use ternary operator
BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None
instead ofif
-else
-blockReplace
if
-else
-block withBX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None
(SIM108)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- harpy/bin/concatenate_bam.py (5 hunks)
- harpy/bin/haplotag_acbd.py (1 hunks)
- harpy/snakefiles/sv_leviathan_pop.smk (2 hunks)
🧰 Additional context used
🪛 Ruff
harpy/bin/concatenate_bam.py
93-96: Use ternary operator `BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None` instead of `if`-`else`-block

Replace `if`-`else`-block with `BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None`
(SIM108)
126-126: Yoda condition detected
Rewrite as haplotag_limit < MI_NEW
(SIM300)
🔇 Additional comments (6)
harpy/bin/haplotag_acbd.py (1)
`3-6`: LGTM! Clean import organization.
The added imports are appropriate and follow PEP 8 guidelines for import ordering.
harpy/bin/concatenate_bam.py (3)
`7-7`: LGTM: Well-documented argument parser changes!

The addition of the `--bx` flag and updated help text clearly explains the new functionality for handling BX tag uniqueness.
Also applies to: 16-18, 20-20, 26-27
`129-135`: Comprehensive error message for haplotag limit.
The error message provides clear guidance and alternatives when hitting the haplotag limit.
`114-128`: LGTM: Robust tag handling with proper error checks.
The logic for handling MI and BX tags is sound, with proper error handling for the haplotag limit.
✅ Verification successful
Let me gather more information about the BX tag handling to verify its uniqueness across samples.
Let me check one more thing about the haplotag_limit and bc_range initialization to fully understand the uniqueness constraints.
BX tag handling is correctly implemented with guaranteed uniqueness

The implementation ensures unique BX tags across samples through (sketched below):
- A deterministic barcode generator using `itertools.product` with a fixed pattern (A,C,B,D)
- Numbers 01-96 for each position, allowing up to 96^4 unique combinations
- Proper error handling when exceeding the haplotag limit
- Consistent BX tag generation between concatenate_bam.py and 10xtoHaplotag.py
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify that BX tags are unique across samples when --bx is used
rg -A 2 'set_tag\("BX"' | grep -v 'set_tag'
Length of output: 271
Script:
#!/bin/bash
# Check the implementation of bc_generator and BX_NEW initialization
ast-grep --pattern 'bc_generator = $_'
# Also check for BX_NEW initialization
rg -B 3 'BX_NEW ='
Length of output: 952
Script:
#!/bin/bash
# Check haplotag_limit and bc_range initialization
rg -B 2 'haplotag_limit ='
rg -B 2 'bc_range ='
Length of output: 591
🧰 Tools
🪛 Ruff
126-126: Yoda condition detected
Rewrite as haplotag_limit < MI_NEW
(SIM300)
harpy/snakefiles/sv_leviathan_pop.smk (2)
`279-279`: LGTM! Documentation updated correctly.

The workflow summary has been updated to accurately reflect the new command with the `--bx` flag, maintaining consistency with the actual implementation.
`84-84`: Verify the impact of the `--bx` flag on BAM concatenation.

The addition of the `--bx` flag to `concatenate_bam.py` suggests a change in how BX tags are handled during BAM concatenation. Let's verify the implementation and its potential effects.
✅ Verification successful
Let me gather more information about the `concatenate_bam.py` script to understand how it handles the `--bx` flag.
The `--bx` flag implementation is properly handled and safe to use

The code review reveals that the `--bx` flag is well-implemented in `concatenate_bam.py`:
- It ensures BX tags remain unique between samples by generating new unique identifiers
- It includes proper error handling for cases exceeding the haplotag limit (96^4)
- The implementation maintains consistency with MI tags and handles both BAM/SAM formats correctly
- The change is documented in the script's help text
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the implementation of the --bx flag in concatenate_bam.py
# and its potential impact on BAM files.
# Test 1: Check if concatenate_bam.py handles the --bx flag
rg -A 5 "def.*main" harpy/bin/concatenate_bam.py
# Test 2: Look for related BX tag handling code
rg -l "BX:Z:" harpy/bin/
Length of output: 230
Script:
#!/bin/bash
# Check the content of concatenate_bam.py to understand the --bx flag implementation
cat harpy/bin/concatenate_bam.py
# Also check for any related test files
fd concatenate_bam.*test harpy/
Length of output: 6234
```python
parser = argparse.ArgumentParser(
    prog = 'haplotag_acbd.py',
    description ="Generates the BC_{ABCD}.txt files necessary to demultiplex Gen I haplotag barcodes",
    usage = "haplotag_acbd.py output_directory",
    exit_on_error = False
    )
parser.add_argument("output_directory", type = str, help = "Directory to create barcode files")
if len(sys.argv) == 1:
    parser.print_help(sys.stderr)
    sys.exit(1)
```
🛠️ Refactor suggestion
Improve argument parser robustness and error handling.
A few suggestions to enhance the argument parsing:
- Remove `exit_on_error=False` to ensure parsing errors are properly reported
- Use exit code 0 for help message as it's not an error condition
- Add path validation for the output directory
Here's the suggested implementation:
```diff
 parser = argparse.ArgumentParser(
     prog = 'haplotag_acbd.py',
     description ="Generates the BC_{ABCD}.txt files necessary to demultiplex Gen I haplotag barcodes",
-    usage = "haplotag_acbd.py output_directory",
-    exit_on_error = False
+    usage = "%(prog)s output_directory"
 )
-parser.add_argument("output_directory", type = str, help = "Directory to create barcode files")
+parser.add_argument(
+    "output_directory",
+    type=str,
+    help="Directory to create barcode files",
+    metavar="PATH"
+)
 if len(sys.argv) == 1:
     parser.print_help(sys.stderr)
-    sys.exit(1)
+    sys.exit(0)
```
fixes #160
Summary by CodeRabbit
Release Notes
New Features
Improvements
Documentation