
bugfix - divide-by-zero error #161


Merged

merged 24 commits, Nov 8, 2024

Conversation

@pdimens (Owner) commented Nov 6, 2024

fixes #160

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new script for processing BAM/SAM files with BX-tagged barcodes, enabling assignment of Molecular Identifier (MI) tags based on distance cutoffs.
  • Improvements

    • Enhanced uniqueness of outputs in specific classes for better data integrity.
    • Simplified logic for file type validation and tag management in existing scripts.
    • Enhanced error handling for missing input files across multiple scripts.
    • Updated argument parser configurations for improved user guidance and error reporting.
    • Streamlined output methods across multiple scripts for consistency.
    • Improved handling of structural variant data in reports for better clarity and visualization.
    • Enhanced data processing steps and visualizations in R Markdown reports.
    • Added functionality to rewrite BX tags for uniqueness in BAM file concatenation.
  • Documentation

    • Updated comments for clarity on tag handling processes.
    • Added usage messages to several scripts for better user instructions.

coderabbitai bot commented Nov 6, 2024

📝 Walkthrough

The changes in this pull request involve modifications to multiple scripts within the harpy module. Key updates include enhancements to the KParam and ContigList classes for ensuring uniqueness in outputs, simplifications in input validation logic in the InputFile class, adjustments to tag handling in the assign_mi.py script, the introduction of a new script for processing BAM/SAM files with BX tags, and improvements in error handling across various scripts. These changes aim to enhance the clarity and robustness of the code without altering core functionalities.

Changes

  • harpy/_cli_types_generic.py: Enhanced KParam and ContigList for uniqueness; simplified validation in InputFile.
  • harpy/bin/assign_mi.py: Updated tag handling in the write_invalidbx and write_missingbx functions.
  • harpy/bin/deconvolve_alignments.py: New script for processing BAM/SAM files; added functions for MI and BX tag management.
  • harpy/bin/depth_windows.py: Added checks to prevent division by zero when calculating mean depth.
  • harpy/bin/check_bam.py: Changed output from print to sys.stdout.write.
  • harpy/bin/check_fastq.py: Changed output from print to sys.stdout.write.
  • harpy/bin/concatenate_bam.py: Changed error messages from print to sys.stderr.write.
  • harpy/bin/count_bx.py: Updated regex for BX tags and changed output from print to sys.stdout.write.
  • harpy/bin/infer_sv.py: Made the failfile argument optional.
  • harpy/bin/make_windows.py: Updated input handling for gzipped files and changed output to sys.stdout.
  • harpy/bin/molecule_coverage.py: Made the statsfile and fai arguments required; updated error reporting.
  • harpy/bin/parse_phaseblocks.py: Changed the input argument to positional; updated output method to sys.stdout.write.
  • harpy/bin/rename_bam.py: Updated the argument parser for clarity and usability.
  • harpy/snakefiles/phase.smk: Simplified the command for parsing phase blocks in the workflow.
  • harpy/snp.py: Modified command-line options for invoking make_windows.py.
  • .github/filters.yml: Added harpy/bin/depth_windows.py to multiple sections.

Assessment against linked issues

  • Identify and Fix Bug (#160): not addressed; the changes do not directly address the BrokenPipeError.
  • Address ZeroDivisionError (#160): addressed; checks added in depth_windows.py prevent the ZeroDivisionError.
  • Ensure Output Integrity (#160): not addressed; no changes were made to validate output file integrity.
  • Review Input and Output Files (#160): not addressed; no changes were made to validate input file formats or accessibility.
  • Log Analysis (#160): not addressed; no log examination or related changes were included.

Possibly related PRs

  • Molecule coverage #149: The changes in the main PR simplify input validation logic in the InputFile class, which is relevant to the broader context of handling input files, including those related to molecular coverage calculations in the retrieved PR.


coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (7)
harpy/bin/depth_windows.py (3)

26-28: Good addition of zero check, but needs error logging.

The protection against division by zero is good, but we should log when we skip windows to help diagnose issues.

 if WINSIZE > 0:
     depth = _SUM / WINSIZE
     print(f"{LAST_CONTIG}\t{POSITION}\t{depth}", file = sys.stdout)
+else:
+    print(f"Warning: Skipping window for {LAST_CONTIG} due to zero window size", file=sys.stderr)

Line range hint 7-11: Add input validation for windowsize.

The script should validate that the windowsize argument is positive before processing any data.

 parser.add_argument('windowsize', type= int, help = "The window size to use to calcualte mean depth over (non-overlapping)")

 args = parser.parse_args()
+if args.windowsize <= 0:
+    print("Error: Window size must be greater than zero", file=sys.stderr)
+    sys.exit(1)
+
 _SUM = 0

Line range hint 1-44: Consider adding error handling and logging.

The script would benefit from proper error handling and logging to help diagnose issues in production:

  1. Add try-except blocks around file operations
  2. Use Python's logging module
  3. Add progress indicators for long-running operations

Would you like me to provide an example implementation with these improvements?
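
A minimal sketch of what that could look like, assuming the script reads samtools-depth-style lines (contig, position, depth) from stdin; the names and the per-position windowing below are illustrative simplifications, not the script's actual structure:

import logging
import sys

# log to stderr so warnings and progress never mix with the BED output on stdout
logging.basicConfig(stream=sys.stderr, level=logging.INFO, format="%(levelname)s: %(message)s")

def mean_depth_windows(stream, windowsize):
    """Print the mean depth of consecutive, non-overlapping windows of input positions."""
    total = 0
    n = 0
    try:
        for n, line in enumerate(stream, start=1):
            contig, pos, depth = line.rstrip("\n").split("\t")
            total += int(depth)
            if n % 1_000_000 == 0:
                logging.info("processed %s positions (at %s:%s)", n, contig, pos)
            if n % windowsize == 0:
                print(f"{contig}\t{pos}\t{total / windowsize}")
                total = 0
    except (ValueError, IndexError) as exc:
        logging.error("malformed depth line %s: %s", n, exc)
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) != 2 or int(sys.argv[1]) <= 0:
        sys.exit("usage: depth_windows_sketch.py <windowsize>  (windowsize must be > 0)")
    mean_depth_windows(sys.stdin, int(sys.argv[1]))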

harpy/bin/assign_mi.py (1)

86-89: Consider defining the default BX value as a constant.

The implementation correctly removes MI and DI tags while adding a default BX tag. However, the hardcoded value "A00C00B00D00" could be defined as a constant at the module level for better maintainability.

Consider adding this at the top of the file:

+# Default BX tag value for records with missing BX tags
+DEFAULT_BX_VALUE = "A00C00B00D00"

Then update the write_missingbx function:

-    tags.append(("BX", "A00C00B00D00"))
+    tags.append(("BX", DEFAULT_BX_VALUE))
harpy/bin/deconvolve_alignments.py (3)

25-25: Avoid using 'input' as an argument name to prevent shadowing

Using 'input' as the argument name shadows the built-in input() function in Python, which might lead to unexpected behavior.

Consider renaming the argument to 'input_file'.

Apply this diff:

-parser.add_argument('input', help = "Input coordinate-sorted bam/sam file")
+parser.add_argument('input_file', help = "Input coordinate-sorted bam/sam file")

...

-bam_input = args.input
+bam_input = args.input_file

52-71: Resolve contradiction regarding MI tag handling in write_invalidbx

The docstring states that the function keeps the existing MI tag if present, but the comment contradicts this by stating it will not keep an existing MI tag. Please ensure the documentation and code behavior are consistent.

Update the docstring or adjust the code to accurately reflect the function's behavior.
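
For illustration, if the intended behavior is to discard any existing MI tag on invalid-barcode records, the docstring can state that explicitly; a sketch using pysam (the function body below is an assumption based on the review context, not the script's exact code):

def write_invalidbx(bam, record):
    """Write a record whose BX barcode is invalid, dropping any existing MI (and DI) tag.

    Invalid barcodes do not represent a real molecule, so a leftover MI tag
    from earlier processing is deliberately removed rather than preserved.
    """
    # keep every tag except MI and DI; the BX tag itself is left untouched
    tags = [t for t in record.get_tags() if t[0] not in ("MI", "DI")]
    record.set_tags(tags)
    bam.write(record)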


101-102: Simplify nested if statements by combining them

To improve code readability, you can combine the nested if statements into a single condition using and.

Apply this diff to simplify the code:

-if bam_input.lower().endswith(".bam"):
-    if not os.path.exists(bam_input + ".bai"):
+if bam_input.lower().endswith(".bam") and not os.path.exists(bam_input + ".bai"):
🧰 Tools
🪛 Ruff

101-102: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ec5b2b2 and d786d8e.

📒 Files selected for processing (4)
  • harpy/_cli_types_generic.py (2 hunks)
  • harpy/bin/assign_mi.py (2 hunks)
  • harpy/bin/deconvolve_alignments.py (1 hunks)
  • harpy/bin/depth_windows.py (2 hunks)
🧰 Additional context used
🪛 Ruff
harpy/bin/deconvolve_alignments.py

101-102: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)


129-129: Do not use bare except

(E722)

🔇 Additional comments (6)
harpy/bin/assign_mi.py (2)

Line range hint 65-89: Verify if these changes resolve the ZeroDivisionError.

While the changes improve tag handling consistency by removing MI tags from invalid/missing BX records, we should verify if this fully addresses the ZeroDivisionError mentioned in issue #160. Could you provide more context about how these changes prevent the divide-by-zero condition in the molecule_coverage rule?

Let's check the molecule_coverage rule implementation:

#!/bin/bash
# Search for the molecule_coverage rule and its dependencies
rg -l "molecule_coverage" | while read -r file; do
  echo "=== $file ==="
  rg "rule molecule_coverage" "$file" -A 10
done

# Search for potential division operations in related files
rg "ZeroDivisionError|/ 0|\b/\s*0\b" 

65-67: Verify the impact of MI tag removal.

The change to remove existing MI tags from invalid BX records is well-documented and implemented correctly. However, since this changes the previous behavior, we should verify that downstream processes handle records without MI tags appropriately.

Let's check for any downstream dependencies on MI tags:

✅ Verification successful

Based on the comprehensive code analysis, I can now provide a final response:

Removing MI tags from invalid BX records is safe and correct

The change to remove existing MI tags from invalid BX records is appropriate because:

  1. The molecule_coverage.py script doesn't use MI tags at all - it works with BX stats data
  2. The only downstream usage of MI tags is in concatenate_bam.py, which properly handles records with missing MI tags
  3. Invalid BX records shouldn't have MI tags as they don't represent valid molecules, which is consistent with the tool's documentation

The change aligns with the codebase's design where MI tags are only assigned to valid molecular identifiers, making the behavior more consistent and preventing potential issues with invalid molecular assignments.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for code that relies on MI tags
rg -l "MI" | grep -v "assign_mi.py" | while read -r file; do
  echo "=== $file ==="
  rg "get_tag\(['\"]MI['\"]" "$file" -A 2
done

Length of output: 1791


Script:

#!/bin/bash
# Let's check for MI tag usage in molecule_coverage.py since it's related to the divide-by-zero error
echo "=== molecule_coverage.py content ==="
cat harpy/bin/molecule_coverage.py

# Also check for any other files that might process MI tags
echo -e "\n=== Files referencing MI tags ==="
rg -l "MI" --type py

# Check specific MI tag processing
echo -e "\n=== MI tag processing in Python files ==="
rg "MI" --type py -B2 -A2

Length of output: 19671

harpy/_cli_types_generic.py (3)

67-67: LGTM! Clear error handling for unsupported file types.

The validation check and error message effectively guide users when encountering unsupported file types.


116-116: Whitespace change.


Line range hint 67-116: Verify the relevance to the divide-by-zero fix.

These changes appear unrelated to the main PR objective of fixing the ZeroDivisionError in the molecule_coverage rule. Let's verify if this file is in the relevant code path.

harpy/bin/deconvolve_alignments.py (1)

168-169: Ensure correct handling of negative distances in dist calculation

The calculation dist = pos_start - d[bx]["lastpos"] may result in negative distances if pos_start is less than d[bx]["lastpos"]. Ensure that the logic accounts for this scenario and handles negative distances appropriately.
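
One defensive option is to warn and clamp when the input is not actually coordinate-sorted; a sketch (the helper name and the clamping policy are assumptions, not part of the script):

import sys

def molecule_distance(pos_start: int, last_pos: int, bx: str) -> int:
    """Distance between consecutive alignments sharing a barcode.

    On coordinate-sorted input this is never negative; if it is, the input is
    out of order, so warn and clamp to 0 instead of comparing a negative
    number against the cutoff.
    """
    dist = pos_start - last_pos
    if dist < 0:
        sys.stderr.write(f"Warning: alignments for barcode {bx} appear unsorted ({pos_start} < {last_pos}); clamping distance to 0\n")
        dist = 0
    return dist

# e.g. start a new molecule when molecule_distance(pos_start, d[bx]["lastpos"], bx) > args.cutoff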

coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (17)
harpy/bin/parse_phaseblocks.py (1)

35-35: Consider keeping print for better readability.

Unless there's a specific performance requirement, using print() would be more idiomatic and maintainable in Python:

-                sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
+                print(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}")
harpy/bin/rename_bam.py (1)

9-10: LGTM! Consider enhancing error handling for samtools dependency.

The improved description and usage string make the script's requirements and usage clearer. However, since samtools is a required dependency, it would be good to verify its availability early in the script.

Consider adding this check at the start of the script:

def check_samtools():
    if os.system("which samtools > /dev/null 2>&1") != 0:
        sys.exit("Error: samtools is required but not found in PATH")

check_samtools()
harpy/bin/count_bx.py (3)

Line range hint 19-23: Consider adding explicit input file validation

While pysam.FastxFile handles file errors internally, adding an explicit check for file existence before processing would provide clearer error messages to users.

 args = parser.parse_args()
+
+# Validate input file exists
+if not os.path.exists(args.input):
+    sys.stderr.write(f"Error: Input file '{args.input}' does not exist\n")
+    sys.exit(1)

Line range hint 29-35: Fix potential case sensitivity mismatch in invalid tag counting

The invalid tag regex pattern [AaBbCcDd]00 matches both upper and lowercase letters, but the inv_dict only contains uppercase keys. This could lead to missed counts for lowercase invalid tags.

 inv_dict = {
     "A" : 0,
     "B" : 0,
     "C" : 0,
-    "D" : 0
+    "D" : 0,
+    "a" : 0,
+    "b" : 0,
+    "c" : 0,
+    "d" : 0
 }

Alternatively, normalize the case when counting:

 if inv:
     for i in inv:
-        inv_dict[i[0]] += 1
+        inv_dict[i[0].upper()] += 1
     continue

58-65: Optimize output writing performance

While using sys.stdout.write is good, multiple consecutive writes can be inefficient. Consider combining all output into a single write operation.

-sys.stdout.write(f"totalReads\t{N_READS}\n")
-sys.stdout.write(f"bxTagCount\t{N_BX}\n")
-sys.stdout.write(f"bxValid\t{N_VALID}\n")
-sys.stdout.write(f"bxInvalid\t{N_BX - N_VALID}\n")
-sys.stdout.write(f"A00\t{inv_dict['A']}\n")
-sys.stdout.write(f"C00\t{inv_dict['C']}\n")
-sys.stdout.write(f"B00\t{inv_dict['B']}\n")
-sys.stdout.write(f"D00\t{inv_dict['D']}\n")
+output = (
+    f"totalReads\t{N_READS}\n"
+    f"bxTagCount\t{N_BX}\n"
+    f"bxValid\t{N_VALID}\n"
+    f"bxInvalid\t{N_BX - N_VALID}\n"
+    f"A00\t{inv_dict['A']}\n"
+    f"C00\t{inv_dict['C']}\n"
+    f"B00\t{inv_dict['B']}\n"
+    f"D00\t{inv_dict['D']}\n"
+)
+sys.stdout.write(output)
harpy/bin/infer_sv.py (2)

Line range hint 22-27: Add error handling for invalid orientations and file operations.

The script could fail with a KeyError if it encounters an unexpected orientation pattern in line[6]. Additionally, file operations should be wrapped in try-catch blocks.

Here's a suggested improvement for the orientation lookup:

-            inference = conversions[line[6]]
+            try:
+                inference = conversions[line[6]]
+            except KeyError:
+                sys.stderr.write(f"Warning: Unknown orientation pattern '{line[6]}' encountered. Skipping record.\n")
+                continue
+            except IndexError:
+                sys.stderr.write(f"Warning: Invalid record format. Skipping record.\n")
+                continue

Also consider wrapping file operations in try-except blocks to handle potential IOErrors gracefully.

Also applies to: 45-58


Line range hint 28-58: Refactor to reduce code duplication and improve readability.

The current implementation has duplicate logic for header processing and could benefit from a more Pythonic approach.

Consider this refactoring:

-if args.failfile:
-    with open(args.bedfile, "r", encoding="utf-8") as f, open(args.failfile, "w", encoding="utf-8") as failout:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        failout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            inference = conversions[line[6]]
-            line.append(inference)
-            NEWROW = "\t".join(line) + "\n"
-            if "FAIL" not in line:
-                sys.stdout.write(NEWROW)
-            else:
-                failout.write(NEWROW)
-else:
-    with open(args.bedfile, "r", encoding="utf-8") as f:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            if "FAIL" not in line:
-                inference = conversions[line[6]]
-                line.append(inference)
-                NEWROW = "\t".join(line) + "\n"
-                sys.stdout.write(NEWROW)
+def process_header(f):
+    header = f.readline().strip().split("\t")
+    header_corrected = [i.title().replace(" ", "") for i in header]
+    header_corrected.append("SV")
+    return "\t".join(header_corrected) + "\n"
+
+def process_line(line, conversions):
+    fields = line.strip().split("\t")
+    if not fields:
+        return None, None
+    inference = conversions[fields[6]]
+    fields.append(inference)
+    return fields, "\t".join(fields) + "\n"
+
+with open(args.bedfile, "r", encoding="utf-8") as f:
+    header = process_header(f)
+    sys.stdout.write(header)
+    
+    if args.failfile:
+        with open(args.failfile, "w", encoding="utf-8") as failout:
+            failout.write(header)
+            for line in f:
+                fields, formatted_line = process_line(line, conversions)
+                if fields:
+                    if "FAIL" not in fields:
+                        sys.stdout.write(formatted_line)
+                    else:
+                        failout.write(formatted_line)
+    else:
+        for line in f:
+            fields, formatted_line = process_line(line, conversions)
+            if fields and "FAIL" not in fields:
+                sys.stdout.write(formatted_line)

This refactoring:

  1. Extracts header and line processing into separate functions
  2. Uses a more Pythonic for loop for file iteration
  3. Reduces code duplication
  4. Improves maintainability
harpy/bin/molecule_coverage.py (2)

22-22: Improve argument validation but remove the invalid required parameter.

Good improvement making the -f/--fai argument required to prevent missing input errors. However, required=True is not valid for the positional statsfile argument: positional arguments are already required by default, and argparse raises a TypeError when required is passed to a positional.

-parser.add_argument('statsfile', required = True, help = "stats file produced by harpy via bx_stats.py")
+parser.add_argument('statsfile', help = "stats file produced by harpy via bx_stats.py")

Also applies to: 26-27


Line range hint 89-92: Add bounds checking and optimize memory usage.

The coverage calculation could be improved with:

  1. Validation of start/end values against contig length
  2. Memory optimization for large ranges

Add bounds checking and chunked updates:

 start = int(splitline[IDX_START])
 end = int(splitline[IDX_END])
+if start < 1 or end > contigs[contig]:
+    sys.stderr.write(f"Error: Position {start}-{end} out of bounds for contig {contig} (length: {contigs[contig]})\n")
+    sys.exit(1)
+
+# Update in chunks to reduce memory usage for large ranges
+chunk_size = 1000000  # Adjust based on available memory
+for chunk_start in range(start, end + 1, chunk_size):
+    chunk_end = min(chunk_start + chunk_size, end + 1)
+    coverage.update(range(chunk_start, chunk_end))
-coverage.update(range(start, end + 1))
harpy/bin/make_windows.py (3)

Line range hint 31-44: Consider optimizing the makewindows function.

While the changes look good, there's room for improvement in terms of type safety and performance.

-def makewindows(_contig, _c_len, windowsize):
+def makewindows(_contig: str, _c_len: int, windowsize: int) -> None:
     """create a file of the specified windows"""
     start = args.mode
     end = min(_c_len, windowsize)
-    starts = [start]
-    ends = [end]
-    # write to output file
-    for (startpos, endpos) in zip (starts,ends):
-        sys.stdout.write(f"{_contig}\t{startpos}\t{endpos}\n")
+    while start < _c_len:
+        end = min(start + windowsize, _c_len)
+        sys.stdout.write(f"{_contig}\t{start}\t{end}\n")
+        start += windowsize

47-56: Use context manager for file operations.

The current implementation correctly handles FAI files, but could be improved with proper resource management.

-if testname.endswith("fai"):
-    with open(args.infile, "r", encoding="utf-8") as fai:
-        while True:
-            line = fai.readline()
-            if not line:
-                break
+if testname.endswith("fai"):
+    with open(args.infile, "r", encoding="utf-8") as fai:
+        for line in fai:
             lsplit = line.split("\t")
             contig = lsplit[0]
             c_len = int(lsplit[1])
             c_len += args.mode
             makewindows(contig, c_len, args.window)

Line range hint 58-85: Improve FASTA file handling and resource management.

The current implementation has several issues:

  1. Missing context manager for file operations
  2. Complex and error-prone FASTA parsing
  3. Potential resource leaks

Consider using BioPython's SeqIO for robust FASTA parsing:

+from Bio import SeqIO
+
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
-    if is_gzip(args.infile):
-        fopen = gzip.open(args.infile, "rt")
-    else:
-        fopen = open(args.infile, "r", encoding="utf-8")
-    line = fopen.readline()
-    while True:
-        C_LEN=0
-        if not line:
-            break
-        if line.startswith(">"):
-            # remove newline, > starting symbol, and any comments after name
-            contig = line.rstrip("\n").lstrip(">").split()[0]
-            # keep reading until hitting another ">"
-            HEADER = False
-            while not HEADER:
-                line = fopen.readline()
-                if not line:
-                    break
-                line = line.rstrip("\n")
-                if line.startswith(">"):
-                    HEADER = True
-                    break
-                else:
-                    C_LEN += len(line)-1
-                    HEADER = False
-            C_LEN += args.mode
-            makewindows(contig, C_LEN, args.window)
-    fopen.close()
+    mode = "rt" if is_gzip(args.infile) else "r"
+    opener = gzip.open if is_gzip(args.infile) else open
+    with opener(args.infile, mode) as fasta:
+        for record in SeqIO.parse(fasta, "fasta"):
+            contig = record.id
+            c_len = len(record.seq) + args.mode
+            makewindows(contig, c_len, args.window)
🧰 Tools
🪛 Ruff

60-60: Use a context manager for opening files

(SIM115)


62-62: Use a context manager for opening files

(SIM115)

harpy/bin/concatenate_bam.py (2)

30-33: LGTM! Error handling improvement.

Good change from print to sys.stderr.write for error messages, aligning with best practices and maintaining consistency with other files in the codebase.

Consider combining the multiple writes into a single multi-line string for better performance:

-    sys.stderr.write("Please provide a single file to \'--bamlist\' (-b) featuring all the files you want to concatenate (one per line):\n")
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt\n\n")
-    sys.stderr.write("Alternatively, provide the files after \'-o output.bam\':\n",)
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam\n")
+    sys.stderr.write(
+        "Please provide a single file to '--bamlist' (-b) featuring all the files you want to concatenate (one per line):\n"
+        "[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt\n\n"
+        "Alternatively, provide the files after '-o output.bam':\n"
+        "[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam\n"
+    )

Line range hint 89-95: Improve error handling in MI tag processing.

The bare except clause could mask important errors. Consider catching specific exceptions and providing appropriate error messages.

                 try:
                     mi = record.get_tag("MI")
                     # if previously converted for this sample, use that
                     if mi in MI_LOCAL:
                         record.set_tag("MI", MI_LOCAL[mi])
                     else:
                         record.set_tag("MI", MI_NEW)
                         # add to sample conversion dict
                         MI_LOCAL[mi] = MI_NEW
                         # increment to next unique MI
                         MI_NEW += 1
-                except:
+                except KeyError:
+                    # Skip records without MI tag
                     pass
+                except ValueError as e:
+                    sys.stderr.write(f"Error processing MI tag in {xam}: {str(e)}\n")
+                    raise
harpy/bin/deconvolve_alignments.py (3)

16-16: Fix typo in help text: "presevered" → "preserved"

-    with an invalid barcode (00 as one of its segments) are presevered
+    with an invalid barcode (00 as one of its segments) are preserved

23-23: Add validation for negative cutoff values

The cutoff distance should be positive. Consider adding validation to ensure the value is greater than zero.

-parser.add_argument('-c','--cutoff', type=int, default = 100000, help = "Distance in base pairs at which alignments with the same barcode should be considered different molecules (default: 100000)")
+parser.add_argument('-c','--cutoff', type=lambda x: abs(int(x)), default = 100000, help = "Distance in base pairs at which alignments with the same barcode should be considered different molecules (must be positive, default: 100000)")
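
If the intent is to reject non-positive values rather than silently take their absolute value, a validating argparse type is another option; a small sketch (the function name is illustrative):

import argparse

def positive_int(value: str) -> int:
    """argparse type: accept only integers greater than zero."""
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive integer")
    return ivalue

# parser.add_argument('-c', '--cutoff', type=positive_int, default=100000, help="...")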

1-204: Overall assessment: Well-structured implementation with minor improvements needed

The script effectively handles BX tag processing and MI tag assignment. While it's not directly related to the division by zero errors mentioned in the PR objectives, the implementation is solid with good error handling for the core functionality. The suggested improvements focus on:

  1. Adding error handling for BAM indexing operations
  2. Fixing the incorrect is_forward attribute usage
  3. Input validation and code simplification

These changes will make the code more robust and maintainable.

🧰 Tools
🪛 Ruff

101-102: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d786d8e and 72ba05b.

📒 Files selected for processing (15)
  • .github/filters.yml (3 hunks)
  • harpy/bin/assign_mi.py (3 hunks)
  • harpy/bin/check_bam.py (1 hunks)
  • harpy/bin/check_fastq.py (1 hunks)
  • harpy/bin/concatenate_bam.py (1 hunks)
  • harpy/bin/count_bx.py (1 hunks)
  • harpy/bin/deconvolve_alignments.py (1 hunks)
  • harpy/bin/depth_windows.py (3 hunks)
  • harpy/bin/infer_sv.py (1 hunks)
  • harpy/bin/make_windows.py (4 hunks)
  • harpy/bin/molecule_coverage.py (2 hunks)
  • harpy/bin/parse_phaseblocks.py (2 hunks)
  • harpy/bin/rename_bam.py (1 hunks)
  • harpy/snakefiles/phase.smk (1 hunks)
  • harpy/snp.py (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • harpy/bin/check_bam.py
  • harpy/bin/check_fastq.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • harpy/bin/assign_mi.py
  • harpy/bin/depth_windows.py
🧰 Additional context used
🪛 Ruff
harpy/bin/deconvolve_alignments.py

101-102: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

harpy/bin/make_windows.py

60-60: Use a context manager for opening files

(SIM115)


62-62: Use a context manager for opening files

(SIM115)

🔇 Additional comments (18)
harpy/bin/parse_phaseblocks.py (3)

8-9: LGTM! Enhanced CLI documentation.

The addition of clear description and usage parameters improves user experience.


12-14: LGTM! Improved CLI error handling.

Good addition of help message when no arguments are provided, following CLI best practices.


11-11: Verify impact of argument change across the codebase.

The change from a named argument (-i) to a positional argument is a breaking change. Let's verify all callers have been updated.

✅ Verification successful

Breaking change is safe - only one caller found using positional argument

The change from named to positional argument is safe as there is only one caller in harpy/snakefiles/phase.smk and it's already using the positional argument syntax (parse_phaseblocks.py $i). No instances of the old -i flag usage were found in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining usage of parse_phaseblocks.py with -i flag
rg -l "parse_phaseblocks\.py.*-i" 

# Search for all Snakemake rules using parse_phaseblocks.py
rg "parse_phaseblocks\.py" --type smk

Length of output: 103


Script:

#!/bin/bash
# Search for any usage of parse_phaseblocks.py in the codebase
rg -l "parse_phaseblocks\.py"

# Search for Snakemake workflow files with common extensions
fd -e snakefile -e smk -e snake -e rules

# If Snakemake files are found, search within them
rg "parse_phaseblocks\.py" -g "*.{snakefile,smk,snake,rules}"

# Look for any shell scripts or Python scripts that might call parse_phaseblocks.py
rg "parse_phaseblocks\.py" -g "*.{sh,py}"

Length of output: 1284

harpy/bin/rename_bam.py (2)

9-13: Verify if these changes are within PR scope.

While these improvements to the CLI are valuable, they seem unrelated to the PR's primary objective of fixing the divide-by-zero error in the molecule_coverage rule. Please confirm if these changes should be part of this PR or moved to a separate enhancement PR.

Let's check if this file is related to the reported issue:

#!/bin/bash
# Check if this file is referenced in the molecule_coverage rule
rg -l "molecule_coverage.*rename_bam\.py" --type snakefile

# Check for connections to the reported error
rg -l "ZeroDivisionError.*rename_bam\.py"

12-13: 🛠️ Refactor suggestion

Verify path handling for output file.

The argument order change looks good, but there's a potential issue with output path handling. The script uses os.path.dirname(args.input) to determine the output directory, which could lead to unexpected behavior with relative paths.

Let's verify how this script is typically used in the workflow:

Consider improving path handling:

-outdir = os.path.dirname(args.input)
+outdir = os.path.dirname(os.path.abspath(args.input))
harpy/bin/count_bx.py (1)

Line range hint 1-65: Verify the source of ZeroDivisionError

This file doesn't contain any division operations that could cause the ZeroDivisionError mentioned in issue #160. We should verify the molecule_coverage rule implementation.

✅ Verification successful

Potential division by zero found in bx_stats.py

After examining the code, I found potential division by zero issues in bx_stats.py where molecule coverage calculations are performed:

  • Lines 89-93: division by x[_mi]["inferred"], which would raise ZeroDivisionError when the inferred length is zero; the current code guards it:
if x[_mi]["inferred"] > 0:
    x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / x[_mi]["inferred"], 1.0),4)
else:
    x[_mi]["covered_bp"] = 0

The code already has a fix implemented with the if x[_mi]["inferred"] > 0 check, which matches the PR's objective of fixing the ZeroDivisionError. The same pattern is used for covered_inserts calculation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the molecule_coverage rule and related Python scripts
echo "Searching for molecule_coverage rule definition..."
rg -l "rule molecule_coverage"

echo "Searching for potential division operations in Python scripts..."
rg -l "def molecule_coverage" --type py

Length of output: 425


Script:

#!/bin/bash
# Check the molecule_coverage rule implementation in the identified files
echo "=== Checking molecule_coverage rule in align_bwa.smk ==="
rg -A 10 "rule molecule_coverage" harpy/snakefiles/align_bwa.smk

echo "=== Checking molecule_coverage rule in align_ema.smk ==="
rg -A 10 "rule molecule_coverage" harpy/snakefiles/align_ema.smk

echo "=== Checking molecule_coverage rule in align_strobealign.smk ==="
rg -A 10 "rule molecule_coverage" harpy/snakefiles/align_strobealign.smk

echo "=== Searching for Python scripts with division operations ==="
rg "/ *0|/0|\/ 0|\/ *0" --type py

Length of output: 1701


Script:

#!/bin/bash
# Let's check the shell command used in molecule_coverage rules and search for division operations with corrected regex
echo "=== Checking complete molecule_coverage rule including shell command ==="
rg -A 15 "rule molecule_coverage" harpy/snakefiles/align_bwa.smk

echo "=== Searching for Python scripts with division operations (corrected regex) ==="
rg "[^a-zA-Z0-9](\/[ ]*0|div)" --type py

echo "=== Looking for molecule coverage related Python scripts ==="
fd -e py -x grep -l "molecule.*coverage"

Length of output: 1232


Script:

#!/bin/bash
# Let's examine the molecule_coverage.py script and bx_stats.py for potential division operations
echo "=== Checking molecule_coverage.py content ==="
cat harpy/bin/molecule_coverage.py

echo "=== Checking bx_stats.py content ==="
cat harpy/bin/bx_stats.py

echo "=== Checking depth_windows.py content (used in the pipeline) ==="
cat harpy/bin/depth_windows.py

Length of output: 11083

harpy/bin/infer_sv.py (2)

14-14: LGTM: Argument parser changes are well-implemented.

The modification to make the fail file optional is implemented correctly with clear help text and consistent argument naming.


Line range hint 1-58: Verify relationship with PR objectives.

While the changes in this file improve the code, they don't appear to be directly related to fixing the ZeroDivisionError mentioned in the PR objectives. Please clarify if these changes are necessary for the bugfix or if they should be in a separate PR.

harpy/bin/molecule_coverage.py (1)

Line range hint 39-45: Prevent potential zero-length contig issues.

While investigating the reported ZeroDivisionError, I notice that contig lengths from the FAI file are used without validation. A zero-length contig could cause issues in downstream processing.

Let's verify if there are any zero-length contigs in the codebase:

Add validation for contig lengths:

 splitline = line.split()
 contig = splitline[0]
 length = splitline[1]
+if int(length) <= 0:
+    sys.stderr.write(f"Error: Invalid contig length ({length}) for {contig}\n")
+    sys.exit(1)
 contigs[contig] = int(length)
harpy/bin/make_windows.py (2)

Line range hint 88-90: LGTM!

The error message is clear and informative, properly explaining the supported file extensions.


10-10: ⚠️ Potential issue

Update usage string to match current CLI.

The usage string shows -i flag which was removed from the implementation.

-    usage = "make_windows.py -w <window.size> -m <0,1> input.fasta > output.bed",
+    usage = "make_windows.py [-w <window.size>] [-m <0,1>] input.fasta > output.bed",

Likely invalid or redundant comment.

harpy/bin/concatenate_bam.py (1)

Line range hint 65-71: Consider atomic file operations for index management.

The current approach to index file creation and deletion could lead to race conditions in concurrent executions. Consider using temporary files or atomic operations.

Let's check if there are any concurrent uses of this script:

Also applies to: 102-105
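
As one possible shape for that, sketched with pysam and a temporary directory so a half-written index is never visible under the final name (the helper name is an assumption, not code from the script):

import os
import shutil
import tempfile

import pysam

def index_bam_atomically(bam_path: str) -> str:
    """Create <bam_path>.bai without ever exposing a partially written index."""
    final_index = bam_path + ".bai"
    with tempfile.TemporaryDirectory(dir=os.path.dirname(bam_path) or ".") as tmpdir:
        tmp_bam = os.path.join(tmpdir, os.path.basename(bam_path))
        try:
            os.link(bam_path, tmp_bam)          # hard-link if possible
        except OSError:
            shutil.copy2(bam_path, tmp_bam)     # fall back to a copy
        pysam.index(tmp_bam)                    # writes tmp_bam + ".bai"
        os.replace(tmp_bam + ".bai", final_index)  # atomic within one filesystem
    return final_index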

.github/filters.yml (1)

60-60: LGTM! Verify the script path exists.

The addition of 'harpy/bin/depth_windows.py' to the bwa, ema, and strobealign sections is consistent with the module organization and aligns with the PR's objectives for fixing issues in the align module.

Let's verify the script exists and is executable:

Also applies to: 72-72, 85-85

✅ Verification successful

✓ Path exists and has correct permissions

The script 'harpy/bin/depth_windows.py' exists and has the appropriate executable permissions (755), confirming the validity of the path added to the filters.yml configuration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify depth_windows.py exists and has correct permissions
# Expected: File should exist and be executable

if [ -f "harpy/bin/depth_windows.py" ]; then
    echo "File exists"
    ls -l "harpy/bin/depth_windows.py"
else
    echo "File not found: harpy/bin/depth_windows.py"
    exit 1
fi

Length of output: 183

harpy/bin/deconvolve_alignments.py (2)

33-87: LGTM! Well-structured helper functions

The helper functions are well-documented and correctly handle different cases of BX tag processing. The tag management logic is robust and consistent.


189-190: Fix incorrect attribute usage: replace is_forward with not is_reverse

The attribute is_forward does not exist in pysam.AlignedSegment.
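
A portable way to express the check, sketched as a tiny helper (the name is illustrative):

import pysam

def is_forward_read(record: pysam.AlignedSegment) -> bool:
    """True when the alignment is on the forward strand.

    Uses the long-standing is_reverse flag rather than is_forward, which is
    not available in all pysam releases.
    """
    return not record.is_reverse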

harpy/snakefiles/phase.smk (2)

250-250: LGTM: Command simplification looks good.

The removal of the -i flag from parse_phaseblocks.py aligns with the changes in the script where the input argument was changed to a positional argument.


Line range hint 1-300: Important: The molecule_coverage rule is not present in this file.

The PR objectives mention fixing a ZeroDivisionError in the molecule_coverage rule, but this rule is not present in the current file. To properly review the fix, we need access to:

  1. The file containing the molecule_coverage rule
  2. The implementation of the fix for the division by zero error

Let's search for the molecule_coverage rule in the codebase:

harpy/snp.py (1)

104-104: Verify make_windows.py interface changes.

The command-line interface for make_windows.py has been modified to use positional arguments instead of -i and -o flags. Let's verify these changes:

Also applies to: 209-209

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (11)
harpy/bin/make_windows.py (1)

Line range hint 65-93: Fix FASTA processing resource management and length calculation.

Several issues need attention:

  1. File handle not properly managed with context manager
  2. Length calculation has an off-by-one error (subtracting 1 from each line length)

Apply this diff:

 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
     if is_gzip(args.input):
-        fopen = gzip.open(args.input, "rt")
+        opener = gzip.open
     else:
-        fopen = open(args.input, "r", encoding="utf-8")
+        opener = open
+    
+    try:
+        with opener(args.input, "rt", encoding="utf-8") as fopen:
             line = fopen.readline()
             while True:
                 C_LEN = 0
                 if not line:
                     break
                 if line.startswith(">"):
                     contig = line.rstrip("\n").lstrip(">").split()[0]
                     HEADER = False
                     while not HEADER:
                         line = fopen.readline()
                         if not line:
                             break
                         line = line.rstrip("\n")
                         if line.startswith(">"):
                             HEADER = True
                             break
                         else:
-                            C_LEN += len(line)-1
+                            C_LEN += len(line)
                             HEADER = False
                     C_LEN += args.mode
                     makewindows(contig, C_LEN, args.window)
-    fopen.close()
+    except Exception as e:
+        sys.stderr.write(f"Error processing FASTA file: {e}\n")
+        sys.exit(1)
🧰 Tools
🪛 Ruff

67-67: Use a context manager for opening files

(SIM115)


69-69: Use a context manager for opening files

(SIM115)

harpy/bin/10xtoHaplotag.py (2)

24-29: Good addition of input validation, consider enhancing it further.

The new error handling for missing input files is a good improvement. It validates all files and provides a clear error message listing all missing files at once.

Consider adding these additional validations:

 err = []
 for i in [args.forward, args.reverse, args.barcodes]:
     if not os.path.exists(i):
         err.append(i)
+    elif not os.access(i, os.R_OK):
+        err.append(f"{i} (not readable)")
+    elif i.endswith('.gz') and not is_valid_gzip(i):
+        err.append(f"{i} (invalid gzip)")
 if err:
     parser.error("Some input files were not found on the system:\n" + ", ".join(err))
+
+def is_valid_gzip(filepath):
+    try:
+        with gzip.open(filepath, 'rb') as f:
+            f.read(1)
+        return True
+    except:
+        return False

Line range hint 77-94: Consider adding error handling for file operations and improving resource management.

The file processing logic could benefit from additional error handling and resource management improvements.

Consider these improvements:

-fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
-rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
-
-with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
+try:
+    fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
+    rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
+except IOError as e:
+    parser.error(f"Failed to create output files: {str(e)}")
+
+try:
+    with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
+        try:
             # store the fq records here
             fw_record = []
             rv_record = []
             i = 0
             for fw, rv in zip_longest(fw_i, rv_i):
                 fw_line = fw.decode().rstrip("\n")
                 rv_line = rv.decode().rstrip("\n")
+        except (IOError, UnicodeDecodeError) as e:
+            parser.error(f"Error processing input files: {str(e)}")
+        finally:
+            fw_out.close()
+            rv_out.close()
+except IOError as e:
+    parser.error(f"Failed to open input files: {str(e)}")
-
-fw_out.close()
-rv_out.close()

Also consider:

  1. Using a context manager for output files (see the ExitStack sketch below)
  2. Adding progress indicators for large files
  3. Implementing memory-efficient barcode dictionary handling
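
For the first item, contextlib.ExitStack keeps both gzip writers inside one with block; a minimal sketch (the prefix value stands in for args.prefix):

import gzip
from contextlib import ExitStack

prefix = "sample"  # stands in for args.prefix

with ExitStack() as stack:
    fw_out = stack.enter_context(gzip.open(f"{prefix}.R1.fq.gz", "wb", compresslevel=6))
    rv_out = stack.enter_context(gzip.open(f"{prefix}.R2.fq.gz", "wb", compresslevel=6))
    # both handles are guaranteed to close, even if an exception is raised mid-write
    fw_out.write(b"@read1/1\nACGT\n+\nFFFF\n")
    rv_out.write(b"@read1/2\nTTTT\n+\nFFFF\n")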
harpy/bin/bx_stats.py (2)

Line range hint 49-59: Consider improving error handling for zero-length molecules

While the code handles the case where inferred length is 0 by setting coverage to 0, this might mask underlying data quality issues. Consider:

  1. Logging these occurrences for debugging
  2. Adding validation to ensure start position is always less than end position
  3. Documenting why zero-length molecules might occur and how they're handled
 def writestats(x, writechrom, destination):
     """write to file the bx stats dictionary as a table"""
     for _mi in x:
         x[_mi]["inferred"] = x[_mi]["end"] - x[_mi]["start"]
+        # Log zero-length molecules for debugging
+        if x[_mi]["inferred"] == 0:
+            sys.stderr.write(f"Warning: Zero-length molecule detected: {writechrom}:{_mi}\n")
         if x[_mi]["inferred"] > 0:
             x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / x[_mi]["inferred"], 1.0),4)
         else:
             x[_mi]["covered_bp"] = 0
         if x[_mi]["inferred"] > 0:
             x[_mi]["covered_inserts"] = round(min(x[_mi]["insert_len"] / x[_mi]["inferred"], 1.0), 4)
         else:
             x[_mi]["covered_inserts"] = 0

Line range hint 134-146: Document edge cases and rationale for length calculations

The logic for calculating insert sizes and lengths handles different types of reads (paired, supplementary, unpaired) but could benefit from better documentation explaining:

  1. Why max(0, read.template_length) is used for paired reads
  2. Why max(abs(read.template_length), read.infer_query_length()) is used for unpaired reads
  3. The implications of these calculations on molecule coverage
     if read.is_paired:
-        # by using max(), will either add 0 or positive TLEN to avoid double-counting
+        # Use max() to ensure non-negative TLEN and avoid double-counting
+        # For paired reads, we only want to count positive template lengths
         isize = max(0, read.template_length)
-        #isize = min(bp, abs(read.template_length))
-        # only count the bp of the first read of paired end reads
-        #bp = bp if read.is_read1 else 0
+        # Use the smaller of template length or query length to avoid overestimation
         bp = min(abs(read.template_length), read.infer_query_length())
     elif read.is_supplementary:
-        # if it's a supplementary alignment, just use the alignment length
+        # For supplementary alignments, use alignment length directly
+        # as they represent additional mappings of the same read
         isize = bp
     else:
-        # if it's unpaired, use the TLEN or query length, whichever is bigger
+        # For unpaired reads, use the larger of TLEN or query length
+        # to capture the full extent of the molecule
         isize = max(abs(read.template_length), read.infer_query_length())
harpy/reports/leviathan.Rmd (2)

Line range hint 211-219: LGTM! Consider adding input validation.

The conditional check improvement makes the filtering logic more robust by ensuring proper handling of the "default" case. However, consider adding input validation for plot_contigs to handle edge cases.

Add validation before the conditional check:

+# Validate plot_contigs input
+if (!is.null(plot_contigs) && !is.vector(plot_contigs)) {
+  stop("plot_contigs must be a vector")
+}
+
 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
   # make sure that the contigs with SV are the ones being plotted, not the others
   fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$contig), 1:2]

Line range hint 190-190: LGTM! Consider documenting the rationale.

The replacement of dots with "1" in the length column helps prevent potential issues with invalid values. However, consider adding a comment to explain why dots are replaced with "1" specifically.

Add documentation:

+# Replace dots (representing missing/invalid lengths) with "1" to ensure valid numeric values
+# This prevents potential issues in downstream calculations and visualizations
 sv$length <- gsub("\\.", "1", sv$length) %>%  as.numeric()
harpy/reports/naibr.Rmd (1)

Line range hint 294-301: Consider improving code style and documentation.

The code block could benefit from:

  1. Consistent indentation (R typically uses 2 spaces)
  2. A more descriptive comment explaining the purpose of the default contig selection

Consider this refinement:

-if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
-  # make sure that the contigs with SV are the ones being plotted, not the others
-  fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
-  # limit the data to only the 30 of the largest present contigs
-  fa.sizes <- fa.sizes[1:min(nrow(fa.sizes), 30), ]
-} else {
-  fa.sizes <- fa.sizes[fa.sizes$contig %in% plot_contigs, 1:2]
-}
+# When "default" is explicitly specified as the only plotting option,
+# automatically select up to 30 of the largest contigs that contain structural variants
+if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) {
+  # Filter for contigs containing structural variants
+  fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
+  # Limit to 30 largest contigs
+  fa.sizes <- fa.sizes[1:min(nrow(fa.sizes), 30), ]
+} else {
+  # Use explicitly specified contigs
+  fa.sizes <- fa.sizes[fa.sizes$contig %in% plot_contigs, 1:2]
+}
harpy/reports/leviathan_pop.Rmd (1)

224-224: LGTM! The condition is now more robust.

The improved condition ensures that the default contig selection only triggers when plot_contigs is exactly ["default"]. This prevents potential runtime errors from incorrect assumptions about the plot_contigs parameter.

Consider adding a comment to explain the default behavior for better maintainability:

+# Use default contig selection when plot_contigs is ["default"]
 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
harpy/reports/naibr_pop.Rmd (1)

Line range hint 296-303: Consider adding validation for empty data scenarios.

While the condition check is improved, there are potential edge cases that could lead to issues:

  1. If there are no SVs, fa.sizes could become empty after filtering.
  2. Empty fa.sizes could affect subsequent operations and visualizations.

Consider adding validation:

 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
   # make sure that the contigs with SV are the ones being plotted, not the others
   fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]
+  if (nrow(fa.sizes) == 0) {
+    warning("No contigs with structural variants found.")
+    return(NULL)
+  }
   # limit the data to only the 30 of the largest present contigs
   fa.sizes <- fa.sizes[1:min(nrow(fa.sizes), 30), ]
 } else {
   fa.sizes <- fa.sizes[fa.sizes$contig %in% plot_contigs, 1:2]
+  if (nrow(fa.sizes) == 0) {
+    warning("None of the specified contigs found in the data.")
+    return(NULL)
+  }
 }
harpy/reports/align_stats.Rmd (1)

Line range hint 588-597: Add validation for plot_contigs to prevent potential errors

If plot_contigs is empty or contains contig names not present in contigs, the filtering operations may result in empty data frames, which could cause errors downstream. Consider adding validation to ensure plot_contigs contains valid contig names.

Apply this diff to add validation:

 if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")){
   # Limit the data to only the 30 largest contigs
   if (nrow(contigs) > 30){
     .contigs <- contigs[1:30, ]
     tb <- filter(tb, contig %in% .contigs$contig)
   } else {
     .contigs <- contigs
   }
 } else {
+  # Validate that plot_contigs are present in contigs
+  valid_contigs <- plot_contigs[plot_contigs %in% contigs$contig]
+  if (length(valid_contigs) == 0){
+    stop("No matching contigs found in the data for the specified plot_contigs.")
+  }
   .contigs <- filter(contigs, contig %in% valid_contigs)
   tb <- filter(tb, contig %in% valid_contigs)
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 72ba05b and 59be86b.

📒 Files selected for processing (20)
  • harpy/bin/10xtoHaplotag.py (2 hunks)
  • harpy/bin/assign_mi.py (4 hunks)
  • harpy/bin/bx_stats.py (2 hunks)
  • harpy/bin/check_bam.py (2 hunks)
  • harpy/bin/check_fastq.py (2 hunks)
  • harpy/bin/concatenate_bam.py (2 hunks)
  • harpy/bin/count_bx.py (3 hunks)
  • harpy/bin/deconvolve_alignments.py (1 hunks)
  • harpy/bin/depth_windows.py (3 hunks)
  • harpy/bin/infer_sv.py (2 hunks)
  • harpy/bin/make_windows.py (4 hunks)
  • harpy/bin/molecule_coverage.py (3 hunks)
  • harpy/bin/parse_phaseblocks.py (2 hunks)
  • harpy/bin/rename_bam.py (2 hunks)
  • harpy/reports/align_stats.Rmd (1 hunks)
  • harpy/reports/hapcut.Rmd (1 hunks)
  • harpy/reports/leviathan.Rmd (1 hunks)
  • harpy/reports/leviathan_pop.Rmd (1 hunks)
  • harpy/reports/naibr.Rmd (1 hunks)
  • harpy/reports/naibr_pop.Rmd (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • harpy/bin/assign_mi.py
  • harpy/bin/check_bam.py
  • harpy/bin/check_fastq.py
  • harpy/bin/concatenate_bam.py
  • harpy/bin/count_bx.py
  • harpy/bin/deconvolve_alignments.py
  • harpy/bin/depth_windows.py
  • harpy/bin/infer_sv.py
  • harpy/bin/parse_phaseblocks.py
  • harpy/bin/rename_bam.py
🧰 Additional context used
🪛 Ruff
harpy/bin/make_windows.py

67-67: Use a context manager for opening files

(SIM115)


69-69: Use a context manager for opening files

(SIM115)

🔇 Additional comments (14)
harpy/bin/molecule_coverage.py (5)

Line range hint 4-8: LGTM: Import statements are properly organized.

The addition of the os module import is appropriate for the new file existence validation.


35-40: Excellent improvement in input validation!

The new file existence checks are well-implemented:

  • Collects all missing files before erroring
  • Provides clear error messages listing all missing files
  • Uses proper error reporting through parser.error()

This directly addresses the PR objective of verifying input file integrity.
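
For reference, the pattern described above reduces to something like this minimal sketch (the flag/positional layout and help strings are illustrative, not copied from molecule_coverage.py):

import argparse
import os

parser = argparse.ArgumentParser(prog="molecule_coverage.py")
parser.add_argument("statsfile", help="per-molecule stats file")
parser.add_argument("fai", help="FASTA index (.fai) of the reference")
args = parser.parse_args()

# Collect every missing input first, then report them all in a single error
missing = [f for f in (args.statsfile, args.fai) if not os.path.exists(f)]
if missing:
    parser.error("Some input files were not found on the system:\n" + ", ".join(missing))

Reporting all missing files at once saves the user from fixing them one failed run at a time.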


79-80: Consider enhancing column validation.

While the stderr change is good, we could further improve the column validation:

Consider this enhancement:

-        sys.stderr.write("Error: Required columns 'contig', 'start', or 'end' not found in header\n")
+        missing_cols = []
+        if IDX_CONTIG is None:
+            missing_cols.append("contig")
+        if IDX_START is None:
+            missing_cols.append("start")
+        if IDX_END is None:
+            missing_cols.append("end")
+        sys.stderr.write(f"Error: Required columns not found in header: {', '.join(missing_cols)}\n")
         sys.exit(1)

This would provide more specific information about which columns are missing.


23-23: LGTM: Improved argument validation.

Making both arguments required is a good improvement that helps prevent potential errors from missing inputs. This aligns well with the PR objectives for better input validation.

Let's verify the argument handling across the codebase:

Also applies to: 27-28

✅ Verification successful

Let me check the argument handling patterns in these files to verify consistency.


Based on the results, I can now generate the final response as I have sufficient information about the argument handling patterns across the codebase.

Argument handling is consistent with codebase patterns

The changes to make both arguments required and update the usage message in molecule_coverage.py align well with the established patterns in the codebase:

  • Other scripts like make_windows.py and 10xtoHaplotag.py follow similar usage message formatting
  • Required arguments are consistently enforced across scripts
  • Help messages are descriptive and follow a uniform style
  • All scripts implement proper argument validation with ArgumentParser
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent argument handling in related scripts
# Look for similar argument patterns in other harpy scripts
rg -l "statsfile|fai" --type py

Length of output: 367


Script:

#!/bin/bash
# Check argument handling patterns in related files
rg -A 5 "ArgumentParser|add_argument.*statsfile|add_argument.*fai" harpy/bin/

Length of output: 7847


Line range hint 91-101: Verify coverage calculation for potential divide-by-zero scenarios.

While the error handling improvements are good, we should verify the coverage calculation logic for potential divide-by-zero scenarios, especially since this PR aims to fix a ZeroDivisionError.
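
For illustration, the kind of guard being asked about looks like this minimal sketch (the function and names are hypothetical, not the actual molecule_coverage.py code, which computes its values inline):

def mean_coverage(total: float, span_length: int) -> float:
    """Mean coverage over a span, returning 0.0 instead of raising ZeroDivisionError for empty spans."""
    if span_length <= 0:
        return 0.0
    return total / span_length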

Similar coverage calculation patterns elsewhere in the codebase are worth checking as well.

harpy/bin/make_windows.py (3)

3-22: LGTM! Argument parsing and input validation look good.

The argument parser is well-configured with clear help text and proper input validation.


Line range hint 38-51: LGTM! Window calculation logic is robust.

The function correctly handles window generation with proper edge case handling for the last window.


54-63: 🛠️ Refactor suggestion

Use context manager for FAI file processing.

To ensure proper resource cleanup, use a context manager for file operations.

Apply this diff:

-    with open(args.input, "r", encoding="utf-8") as fai:
+    try:
+        with open(args.input, "r", encoding="utf-8") as fai:
             for line in fai:
                 lsplit = line.split("\t")
                 contig = lsplit[0]
                 c_len = int(lsplit[1])
                 c_len += args.mode
                 makewindows(contig, c_len, args.window)
+    except ValueError as e:
+        sys.stderr.write(f"Error processing FAI file: {e}\n")
+        sys.exit(1)

Likely invalid or redundant comment.

harpy/bin/10xtoHaplotag.py (1)

3-3: LGTM! Import statements are correctly ordered and necessary.

Both imports are required for the script's functionality - os for file existence checks and gzip for compressed file handling.

Also applies to: 5-5

harpy/bin/bx_stats.py (2)

3-3: LGTM: Import addition is appropriate

The addition of the os module import is well-placed and necessary for the new file validation functionality.


33-34: LGTM: Input validation is well implemented

The file existence check is appropriately placed before any file operations and uses parser.error for clean error reporting and program termination.

harpy/reports/hapcut.Rmd (1)

192-192: LGTM! More robust vector comparison.

The updated condition if ((length(plotting_contigs) == 1) && (plotting_contigs[1] == "default")) is more robust than the previous version as it explicitly checks both the length and value, preventing potential issues with vector comparison in R.

Note: While this is a good improvement, it doesn't appear to be related to the divide-by-zero error mentioned in the PR objectives, so it is worth checking whether other changed files contain the actual fix.

harpy/reports/naibr.Rmd (1)

294-294: LGTM! The condition is now more precise.

The change improves the logic by ensuring the default contig selection only triggers when explicitly requested with a single "default" value.

Let's verify the behavior with different inputs:

✅ Verification successful

The search results show that the same condition if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) is consistently used across multiple report templates (naibr.Rmd, leviathan.Rmd, naibr_pop.Rmd, leviathan_pop.Rmd, align_stats.Rmd). The parameter plot_contigs is set from the configuration file and passed through various Snakemake rules.

Let's verify how this parameter is defined in the configuration:


Let's try searching in config directories which are common locations for yaml files:


Based on the search results, I can now generate the final response:

The condition change is safe and consistent with the codebase

The modified condition if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) is:

  • Used consistently across all report templates
  • Correctly handles the default configuration where plot_contigs is set to "default" when no contigs are specified
  • Matches the configuration logic that sets plot_contigs either to user-specified contigs or "default"
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how plot_contigs parameter is used across the codebase
# to ensure this change won't break existing calls

# Search for Snakemake rules that set plot_contigs parameter
rg -A 5 "plot_contigs.*=" 

Length of output: 6797


Script:

#!/bin/bash
# Search for plot_contigs configuration in yaml files
rg -A 3 "plot_contigs:" "*.yaml" "*.yml"

# Also check for any example configurations
fd -e yaml -e yml -x cat {} \; | rg -A 3 "reports:"

Length of output: 188


Script:

#!/bin/bash
# Search for yaml files first
fd -e yaml -e yml

# Then look for plot_contigs in any potential config files
fd . -t f -x cat {} \; | rg -B 2 -A 3 "plot_contigs"

# Also check for example configs in docs or config directories
fd "config|docs" -t d

Length of output: 693

harpy/reports/naibr_pop.Rmd (1)

296-296: LGTM! The condition check is now more precise.

The updated condition if ((length(plot_contigs) == 1) && (plot_contigs[1] == "default")) is more robust as it:

  1. Explicitly checks both the length and value of plot_contigs
  2. Prevents potential issues when plot_contigs is empty or has multiple values
  3. Ensures the default behavior only triggers under the exact intended condition

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (21)
harpy/bin/parse_phaseblocks.py (2)

8-13: Enhance CLI documentation for better user guidance.

While the change to a positional argument simplifies usage, consider improving the help messages:

 parser = argparse.ArgumentParser(
     prog='parse_phaseblocks.py',
     description='Parse a phase block file from HapCut2 to pull out summary information',
-    usage = "parse_phaseblocks.py input > output.txt"
+    usage = "parse_phaseblocks.py <blocks_file> > <output.txt>"
     )
-parser.add_argument("input", type=str, help="input HapCut2 phase blocks file")
+parser.add_argument("input", type=str, help="input HapCut2 phase blocks file (typically ending in .blocks)")

Line range hint 24-42: Consider improving file processing robustness.

While the core logic works, there are several opportunities for improvement:

  1. Add validation for the HapCut2 blocks file format
  2. Add error handling for malformed lines
  3. Use a more Pythonic approach for file iteration

Here's a suggested refactor:

-with open(args.input, "r", encoding="utf-8") as blocks:
-    FIRST_LINE = True
-    while True:
-        line = blocks.readline()
-        if not line:
-            break
-        lsplit = line.split()
-        if lsplit[0] == "BLOCK:":
-            n_snp     = int(lsplit[6])
-            len_block = int(lsplit[8])
-            FIRST_LINE = True
-        else:
-            if FIRST_LINE:
-                pos_start = int(lsplit[4])
-                contig    = lsplit[3]
-                FIRST_LINE = False
-                sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
-            else:
-                continue
+with open(args.input, "r", encoding="utf-8") as blocks:
+    try:
+        first_line = True
+        for line in blocks:
+            try:
+                lsplit = line.strip().split()
+                if not lsplit:  # Skip empty lines
+                    continue
+                
+                if lsplit[0] == "BLOCK:":
+                    if len(lsplit) < 9:
+                        raise ValueError("Invalid BLOCK line format")
+                    n_snp = int(lsplit[6])
+                    len_block = int(lsplit[8])
+                    first_line = True
+                elif first_line:
+                    if len(lsplit) < 5:
+                        raise ValueError("Invalid data line format")
+                    pos_start = int(lsplit[4])
+                    contig = lsplit[3]
+                    first_line = False
+                    sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
+            except (IndexError, ValueError) as e:
+                sys.stderr.write(f"Warning: Skipping malformed line: {line.strip()} ({str(e)})\n")
+    except Exception as e:
+        parser.error(f"Error processing file: {str(e)}")

This refactor:

  • Uses for loop instead of while True
  • Adds proper error handling for malformed lines
  • Validates line format before processing
  • Provides informative error messages
harpy/bin/depth_windows.py (3)

8-24: LGTM! Minor typo in help text.

The argument validation and special case handling look good. There's a small typo in the help text: "calcualte" should be "calculate".

-parser.add_argument('windowsize', type= int, help = "The window size to calcualte mean depth over (non-overlapping)")
+parser.add_argument('windowsize', type= int, help = "The window size to calculate mean depth over (non-overlapping)")

39-41: Add warning message for skipped windows.

While the zero division protection is good, it silently skips windows when WINSIZE is zero. Consider adding a warning message to help with debugging.

 if WINSIZE > 0:
     depth = _SUM / WINSIZE
     sys.stdout.write(f"{LAST_CONTIG}\t{POSITION}\t{depth}\n")
+else:
+    sys.stderr.write(f"Warning: Skipping window for {LAST_CONTIG} due to insufficient positions\n")

Line range hint 1-52: Consider adding unit tests for edge cases.

While the changes improve error handling, this script would benefit from unit tests to verify:

  1. Edge cases with varying window sizes
  2. Behavior with empty input
  3. Contig transitions
  4. Window size validation

This would help prevent future regressions and ensure consistent behavior.

Would you like me to help create a test suite for this script?
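
As a starting point, a parametrized pytest sketch for the window-size edge cases could look like the following; it exercises a stand-in helper rather than the script itself, since depth_windows.py is invoked as a command-line program:

import pytest

# Stand-in for the guarded per-window mean; the real script computes this inline
def mean_depth(total_depth, window_length):
    return total_depth / window_length if window_length > 0 else 0.0

@pytest.mark.parametrize(
    "total, window, expected",
    [
        (100, 50, 2.0),   # ordinary window
        (0, 50, 0.0),     # window with no coverage
        (100, 0, 0.0),    # zero-length window must not raise ZeroDivisionError
    ],
)
def test_mean_depth_edge_cases(total, window, expected):
    assert mean_depth(total, window) == expected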

harpy/bin/infer_sv.py (2)

22-23: Enhance the error message for file not found.

While the file existence check is good, consider making the error message more descriptive to help users understand and resolve the issue.

-    parser.error(f"{args.bedfile} was not found")
+    parser.error(f"Input BEDPE file '{args.bedfile}' does not exist or is not accessible")

Line range hint 25-67: Add robust error handling for data processing.

Several potential issues need to be addressed:

  1. Missing validation for the orientation field against the conversions dictionary
  2. No protection against index errors for malformed input
  3. Duplicate code for header processing
  4. No error handling for file operations

Consider refactoring the code as follows:

 conversions = {
     "+-": "deletion",
     "--": "inversion",
     "++": "inversion",
     "-+": "duplication"
     }

+def process_header(header_line):
+    """Process the header line and return the formatted header."""
+    fields = header_line.strip().split("\t")
+    fields_corrected = [i.title().replace(" ", "") for i in fields]
+    fields_corrected.append("SV")
+    return "\t".join(fields_corrected) + "\n"
+
+def process_data_line(line, conversions):
+    """Process a data line and return the formatted line with SV type."""
+    fields = line.strip().split("\t")
+    if len(fields) < 7:
+        raise ValueError(f"Invalid input line: expected at least 7 fields, got {len(fields)}")
+    orientation = fields[6]
+    if orientation not in conversions:
+        raise ValueError(f"Invalid orientation '{orientation}'. Expected one of {list(conversions.keys())}")
+    fields.append(conversions[orientation])
+    return "\t".join(fields) + "\n"
+
 if args.failfile:
-    with open(args.bedfile, "r", encoding="utf-8") as f, open(args.failfile, "w", encoding="utf-8") as failout:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        failout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            inference = conversions[line[6]]
-            line.append(inference)
-            NEWROW = "\t".join(line) + "\n"
-            if "FAIL" not in line:
-                sys.stdout.write(NEWROW)
-            else:
-                failout.write(NEWROW)
+    try:
+        with open(args.bedfile, "r", encoding="utf-8") as f, open(args.failfile, "w", encoding="utf-8") as failout:
+            # Process header
+            header = process_header(f.readline())
+            sys.stdout.write(header)
+            failout.write(header)
+            
+            # Process data lines
+            for line in f:
+                try:
+                    processed_line = process_data_line(line, conversions)
+                    if "FAIL" not in line:
+                        sys.stdout.write(processed_line)
+                    else:
+                        failout.write(processed_line)
+                except ValueError as e:
+                    sys.stderr.write(f"Warning: Skipping invalid line: {e}\n")
+                    continue
+    except IOError as e:
+        parser.error(f"Error processing files: {e}")
 else:
-    with open(args.bedfile, "r", encoding="utf-8") as f:
-        # first line, the header
-        line = f.readline().strip().split("\t")
-        line_corrected = [i.title().replace(" ", "") for i in line]
-        line_corrected.append("SV")
-        HEADERLINE = "\t".join(line_corrected) + "\n"
-        sys.stdout.write(HEADERLINE)
-        # the reamining lines
-        while True:
-            line = f.readline()
-            if not line:
-                break
-            line = line.strip().split("\t")
-            if "FAIL" not in line:
-                inference = conversions[line[6]]
-                line.append(inference)
-                NEWROW = "\t".join(line) + "\n"
-                sys.stdout.write(NEWROW)
+    try:
+        with open(args.bedfile, "r", encoding="utf-8") as f:
+            # Process header
+            header = process_header(f.readline())
+            sys.stdout.write(header)
+            
+            # Process data lines
+            for line in f:
+                if "FAIL" not in line:
+                    try:
+                        processed_line = process_data_line(line, conversions)
+                        sys.stdout.write(processed_line)
+                    except ValueError as e:
+                        sys.stderr.write(f"Warning: Skipping invalid line: {e}\n")
+                        continue
+    except IOError as e:
+        parser.error(f"Error processing files: {e}")

This refactoring:

  1. Adds validation for orientation field
  2. Protects against malformed input
  3. Reduces code duplication through helper functions
  4. Adds proper error handling for file operations
  5. Uses for loop instead of while True for cleaner iteration
harpy/bin/make_windows.py (1)

Line range hint 65-93: Fix resource management and sequence length calculation.

Several issues in the FASTA processing logic:

  1. Files aren't properly closed (use context manager)
  2. Sequence length calculation has an off-by-one error (C_LEN += len(line)-1)
  3. Complex nested loop structure could be simplified
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
     if is_gzip(args.input):
-        fopen = gzip.open(args.input, "rt")
+        opener = gzip.open
     else:
-        fopen = open(args.input, "r", encoding="utf-8")
+        opener = open
-    line = fopen.readline()
-    while True:
-        C_LEN=0
-        if not line:
-            break
-        if line.startswith(">"):
-            # remove newline, > starting symbol, and any comments after name
-            contig = line.rstrip("\n").lstrip(">").split()[0]
-            # keep reading until hitting another ">"
-            HEADER = False
-            while not HEADER:
-                line = fopen.readline()
-                if not line:
-                    break
-                line = line.rstrip("\n")
-                if line.startswith(">"):
-                    HEADER = True
-                    break
-                else:
-                    C_LEN += len(line)-1
-                    HEADER = False
-            C_LEN += args.mode
-            makewindows(contig, C_LEN, args.window)
-    fopen.close()
+    with opener(args.input, "rt", encoding="utf-8") as fopen:
+        current_contig = None
+        current_len = 0
+        
+        for line in fopen:
+            line = line.rstrip()
+            if line.startswith(">"):
+                if current_contig:
+                    current_len += args.mode
+                    makewindows(current_contig, current_len, args.window)
+                current_contig = line.lstrip(">").split()[0]
+                current_len = 0
+            else:
+                current_len += len(line)
+        
+        if current_contig:
+            current_len += args.mode
+            makewindows(current_contig, current_len, args.window)
🧰 Tools
🪛 Ruff

67-67: Use a context manager for opening files

(SIM115)


69-69: Use a context manager for opening files

(SIM115)

harpy/bin/10xtoHaplotag.py (4)

24-29: LGTM! Consider additional file validation checks.

The new error handling for missing input files is a good improvement.

Consider adding these additional validations:

 err = []
 for i in [args.forward, args.reverse, args.barcodes]:
     if not os.path.exists(i):
         err.append(i)
+    elif not os.access(i, os.R_OK):
+        err.append(f"{i} (not readable)")
+
+# Validate output directory is writable
+outdir = os.path.dirname(args.prefix) or "."
+if not os.access(outdir, os.W_OK):
+    err.append(f"{outdir} (not writable)")
+
 if err:
     parser.error("Some input files were not found on the system:\n" + ", ".join(err))

Line range hint 82-83: Add context manager for output files.

The output files should be managed using a context manager to ensure proper cleanup on errors.

-fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
-rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
+# Use context manager to ensure files are properly closed
+with gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6) as fw_out, \
+     gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6) as rv_out:
+    with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
+        # ... rest of the processing code ...

Line range hint 85-116: Add error handling and progress reporting for large files.

The main processing loop lacks error handling and progress reporting.

Consider these improvements:

  1. Add try-except blocks for I/O operations
  2. Add progress reporting for large files
  3. Clean up partial output files on error
+import sys
+from pathlib import Path
+
+def process_fastq_files(fw_i, rv_i, fw_out, rv_out):
+    try:
+        records_processed = 0
         fw_record = []
         rv_record = []
         i = 0
         for fw, rv in zip_longest(fw_i, rv_i):
+            if records_processed % 1000000 == 0:
+                print(f"Processed {records_processed:,} records", file=sys.stderr)
             fw_line = fw.decode().rstrip("\n")
             rv_line = rv.decode().rstrip("\n")
             if fw_line.startswith("@") and i > 0:
                 i += 1
+                records_processed += 1
                 # process the full record
                 new_fw, new_rv = process_record(fw_record, rv_record)
                 # write new record to files
                 fw_out.write(new_fw.encode("utf-8"))
                 rv_out.write(new_rv.encode("utf-8"))
                 # reset the record
                 fw_record = [fw_line]
                 rv_record = [rv_line]
             elif fw_line.startswith("@") and i == 0:
                 # its the first record don't write anything
                 i += 1
                 fw_record = [fw_line]
                 rv_record = [rv_line]
             else:
                 # just append the other lines to the record
                 fw_record.append(fw_line)
                 rv_record.append(rv_line)
+    except Exception as e:
+        # Clean up partial output files
+        fw_out.close()
+        rv_out.close()
+        Path(f"{args.prefix}.R1.fq.gz").unlink(missing_ok=True)
+        Path(f"{args.prefix}.R2.fq.gz").unlink(missing_ok=True)
+        raise RuntimeError(f"Error processing FASTQ files: {str(e)}")

Line range hint 31-45: Consider memory optimization for barcode dictionary.

The barcode dictionary could grow large with many unique barcodes.

Consider using a more memory-efficient approach:

  1. Process barcodes in batches
  2. Use a database for very large barcode sets
  3. Implement streaming processing for the barcode file

Example optimization:

def process_barcodes_in_batches(barcode_file, batch_size=1000000):
    """Process barcodes in memory-efficient batches."""
    batch = {}
    for line in barcode_file:
        if len(batch) >= batch_size:
            yield batch
            batch = {}
        bc = line.rstrip("\n").split()
        batch[str(bc[0])] = None
    if batch:
        yield batch
harpy/bin/bx_stats.py (2)

3-3: LGTM! Consider enhancing the error message.

The file existence check is a good addition. Consider making the error message more informative by checking file permissions as well.

-if not os.path.exists(args.input):
-    parser.error(f"{args.input} was not found")
+if not os.path.exists(args.input):
+    parser.error(f"Input file '{args.input}' was not found")
+elif not os.access(args.input, os.R_OK):
+    parser.error(f"Input file '{args.input}' exists but is not readable")

Also applies to: 33-34


Line range hint 51-58: Improve division by zero handling and add documentation.

While the code handles division by zero cases by checking if inferred > 0, it could be more explicit and documented. This appears to be related to the ZeroDivisionError mentioned in issue #160.

-        if x[_mi]["inferred"] > 0:
-            x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / x[_mi]["inferred"], 1.0),4)
-        else:
-            x[_mi]["covered_bp"] = 0
-        if x[_mi]["inferred"] > 0:
-            x[_mi]["covered_inserts"] = round(min(x[_mi]["insert_len"] / x[_mi]["inferred"], 1.0), 4)
-        else:
-            x[_mi]["covered_inserts"] = 0
+        # Handle coverage calculations, avoiding division by zero
+        # When inferred length is 0, we can't calculate meaningful coverage
+        inferred_len = x[_mi]["inferred"]
+        if inferred_len <= 0:
+            x[_mi]["covered_bp"] = 0
+            x[_mi]["covered_inserts"] = 0
+        else:
+            x[_mi]["covered_bp"] = round(min(x[_mi]["bp"] / inferred_len, 1.0), 4)
+            x[_mi]["covered_inserts"] = round(min(x[_mi]["insert_len"] / inferred_len, 1.0), 4)
harpy/reports/leviathan.Rmd (1)

211-211: LGTM! Consider documenting the plot_contigs parameter.

The change from plot_contigs == "default" to all(plot_contigs == "default") is a good improvement that properly handles vector comparisons. This prevents potential bugs when plot_contigs contains multiple values.

Consider adding:

  1. Documentation about the expected format of plot_contigs (e.g., in a comment)
  2. Input validation to ensure plot_contigs is not NULL or empty
+# plot_contigs: character vector specifying which contigs to plot
+# if set to "default", plots up to 30 of the largest contigs with SVs
+if (is.null(plot_contigs) || length(plot_contigs) == 0) {
+  stop("plot_contigs parameter cannot be NULL or empty")
+}
 if (all(plot_contigs == "default")){
harpy/reports/leviathan_pop.Rmd (1)

224-224: LGTM! Consider adding input validation.

The change from plot_contigs == "default" to all(plot_contigs == "default") correctly handles the case where plot_contigs is a vector, preventing potential bugs in contig filtering logic.

Consider adding input validation to ensure plot_contigs is not NULL or empty:

+if (is.null(plot_contigs) || length(plot_contigs) == 0) {
+  plot_contigs <- "default"
+}
 if (all(plot_contigs == "default")){
harpy/reports/hapcut.Rmd (5)

Line range hint 17-24: Avoid installing packages within the R Markdown script to prevent runtime side effects

The using function installs missing packages at runtime. Installing packages during script execution can lead to unexpected behavior, reproducibility issues, and potential security risks, especially in production environments or when users lack installation permissions.

Consider removing the package installation logic and ensure all required packages are installed beforehand. You can document the dependencies in the setup instructions or use package management tools like renv for reproducibility.


Line range hint 224-232: Filter out non-positive values before applying log() to avoid computation errors

Using log(subset_cont) without filtering may result in -Inf or NaN values if subset_cont contains zeros or negative numbers, leading to issues in the plot.

Consider adding a filter to include only positive block lengths:

for (cont in plot_contigs) {
  subset_cont <- df$block_length[df$contig == cont]
+ subset_cont <- subset_cont[subset_cont > 0]
  fig <- fig %>%
    add_trace(
      x = log(subset_cont),
      type = "violin",
      name = cont,
      side = "positive",
      meanline = list(visible = T),
      visible = "legendonly"
    )
}

Line range hint 310-318: Exclude non-positive values before logarithmic transformation to prevent errors

Applying log(subset_sample) directly can cause issues if there are zero or negative block lengths in subset_sample. This can result in plot anomalies or warnings.

Include a filter to ensure only positive values are used:

subset_sample <- df$block_length[df$sample == samp] 
+ subset_sample <- subset_sample[subset_sample > 0]
fig <- fig %>%
  add_trace(
    x = log(subset_sample),
    type = "violin",
    name = samp,
    side = "positive",
    meanline = list(visible = T),
    visible = "legendonly"
  )

Line range hint 196-203: Use n() instead of round(length(n_snp)) for the haplotype count per contig

Inside summarise(), length(n_snp) already equals the group size, so the count itself is correct, but wrapping it in round() is a no-op and obscures the intent.

Replace it with n(), which counts the rows (haplotypes) per group directly:

percontig <- df %>% group_by(contig) %>% summarise(
-   n_haplo = round(length(n_snp)),
+   n_haplo = n(),
    mean_snps = round(mean(n_snp), digits = 0),
    median_snps = median(n_snp),
    mean_blocksize = round(mean(block_length), digits = 0),
    median_blocksize = median(block_length),
    max_blocksize = max(block_length)
  )

Line range hint 284-291: Count haplotypes per sample with n() instead of sum(length(n_snp))

Inside summarise(), length(n_snp) is already the group size and the outer sum() of that single value is a no-op, so the expression works only incidentally and reads poorly.

Use n() to count the haplotypes per group directly:

persample <- df %>% group_by(sample) %>% summarise(
-   n_haplo = sum(length(n_snp)),
+   n_haplo = n(),
    mean_snps = round(mean(n_snp), digits = 0),
    median_snps = median(n_snp),
    mean_blocksize = round(mean(block_length), digits = 0),
    median_blocksize = median(block_length),
    max_blocksize = max(block_length)
)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 72ba05b and b1cfddd.

📒 Files selected for processing (21)
  • harpy/_cli_types_generic.py (4 hunks)
  • harpy/bin/10xtoHaplotag.py (2 hunks)
  • harpy/bin/assign_mi.py (4 hunks)
  • harpy/bin/bx_stats.py (2 hunks)
  • harpy/bin/check_bam.py (2 hunks)
  • harpy/bin/check_fastq.py (2 hunks)
  • harpy/bin/concatenate_bam.py (2 hunks)
  • harpy/bin/count_bx.py (3 hunks)
  • harpy/bin/deconvolve_alignments.py (1 hunks)
  • harpy/bin/depth_windows.py (3 hunks)
  • harpy/bin/infer_sv.py (2 hunks)
  • harpy/bin/make_windows.py (4 hunks)
  • harpy/bin/molecule_coverage.py (3 hunks)
  • harpy/bin/parse_phaseblocks.py (2 hunks)
  • harpy/bin/rename_bam.py (2 hunks)
  • harpy/reports/align_stats.Rmd (1 hunks)
  • harpy/reports/hapcut.Rmd (1 hunks)
  • harpy/reports/leviathan.Rmd (1 hunks)
  • harpy/reports/leviathan_pop.Rmd (1 hunks)
  • harpy/reports/naibr.Rmd (1 hunks)
  • harpy/reports/naibr_pop.Rmd (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • harpy/_cli_types_generic.py
  • harpy/bin/assign_mi.py
  • harpy/bin/check_bam.py
  • harpy/bin/check_fastq.py
  • harpy/bin/concatenate_bam.py
  • harpy/bin/count_bx.py
  • harpy/bin/deconvolve_alignments.py
  • harpy/bin/molecule_coverage.py
  • harpy/bin/rename_bam.py
🧰 Additional context used
🪛 Ruff
harpy/bin/make_windows.py

67-67: Use a context manager for opening files

(SIM115)


69-69: Use a context manager for opening files

(SIM115)

🔇 Additional comments (11)
harpy/bin/parse_phaseblocks.py (2)

3-6: LGTM! Clean and well-organized imports.

The imports are appropriate and follow PEP 8 style guidelines.


14-20: LGTM! Robust argument validation and error handling.

The added checks for no arguments and file existence improve the script's robustness and user experience.

harpy/bin/depth_windows.py (1)

51-52: ⚠️ Potential issue

Critical: Add zero division protection for window calculation.

While windowsize is validated at script start, it's still good practice to add protection here for consistency and future maintainability.

-depth = _SUM / args.windowsize
-sys.stdout.write(f"{contig}\t{END}\t{depth}\n")
+if args.windowsize > 0:
+    depth = _SUM / args.windowsize
+    sys.stdout.write(f"{contig}\t{END}\t{depth}\n")
+else:
+    sys.stderr.write(f"Warning: Skipping window for {contig} due to zero window size\n")
harpy/bin/infer_sv.py (2)

3-5: LGTM: Import changes are appropriate.

The addition of the os import is necessary for the new file existence check.


Line range hint 1-67: Verify script usage and input file format.

Let's verify how this script is used in the workflow and check the expected BEDPE file format.

✅ Verification successful

Script is correctly integrated and used in the workflow

The verification shows that:

  1. The script is properly integrated into multiple NAIBR-related Snakemake workflows (sv_naibr.smk, sv_naibr_pop.smk, etc.)
  2. The usage is consistent across workflows, being used to process BEDPE files with the same command structure: infer_sv.py {input.bedpe} -f {output.fail} > {output.bedpe}
  3. The script maintains its position in the workflow, with expected input/output handling

The changes in the script are safe and maintain compatibility with the existing workflow integration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check script usage and input file format

# Check for workflow files that use this script
echo "Checking workflow usage..."
rg -l "infer_sv\.py" --type snakefile

# Check for example BEDPE files to understand the expected format
echo -e "\nChecking for example BEDPE files..."
fd -e bedpe

# Check for documentation about the BEDPE format
echo -e "\nChecking for format documentation..."
rg -l "bedpe.*format|format.*bedpe" --type md

Length of output: 374


Script:

#!/bin/bash
# Let's try a broader search for workflow files and documentation

# Check for any workflow files (Snakefile, nextflow, etc.) that might use this script
echo "Checking workflow usage..."
rg -l "infer_sv"

# Look for documentation files in various formats
echo -e "\nChecking documentation..."
fd -e md -e rst -e txt

# Check if there are any test files or test data
echo -e "\nChecking for test files..."
fd -g "*test*"

Length of output: 464


Script:

#!/bin/bash
# Let's examine the workflow files and README for context

# Check how the script is used in the workflow files
echo "Checking workflow usage in Snakefiles..."
rg -A 5 -B 5 "infer_sv" harpy/snakefiles/sv_naibr.smk
rg -A 5 -B 5 "infer_sv" harpy/snakefiles/sv_naibr_pop.smk

# Check README for any relevant documentation
echo -e "\nChecking README..."
cat README.md

# Look for any input/output examples in the test directory
echo -e "\nChecking test directory..."
fd . test/vcf/ -t f -x head -n 5 {}

Length of output: 6081

harpy/bin/make_windows.py (2)

Line range hint 38-51: LGTM! The window calculation logic is robust.

The function correctly:

  • Handles edge cases for the last window
  • Adjusts for 0/1-based coordinates
  • Uses efficient list operations

Line range hint 1-97: Verify connection to divide-by-zero fix.

While the changes improve code robustness, they don't directly address the divide-by-zero error mentioned in PR #160. Let's verify if this file is used by the affected molecule_coverage rule.

harpy/bin/bx_stats.py (1)

Line range hint 49-50: Validate molecule length calculations.

The inferred length calculation (end - start) could potentially be zero or negative. Consider adding validation to ensure meaningful length values.

-        x[_mi]["inferred"] = x[_mi]["end"] - x[_mi]["start"]
+        # Ensure meaningful molecule length
+        length = x[_mi]["end"] - x[_mi]["start"]
+        x[_mi]["inferred"] = max(0, length)  # Prevent negative lengths
harpy/reports/naibr.Rmd (1)

Line range hint 294-299: LGTM! The contig filtering logic improves robustness.

The change ensures that only contigs with structural variants are plotted, which should prevent potential errors when processing empty datasets. However, consider adding input validation to handle edge cases.

Add validation for empty plot_contigs:

 if (all(plot_contigs == "default")){
+  if (length(unique(sv$Chr1)) == 0) {
+    warning("No structural variants found in any contigs")
+    return()
+  }
   # make sure that the contigs with SV are the ones being plotted, not the others
   fa.sizes <- fa.sizes[fa.sizes$contig %in% unique(sv$Chr1), 1:2]

Let's verify if this change fixes the divide-by-zero error:

✅ Verification successful

Let me gather more information about the potential division operations in the plotting code, particularly around the modified section.

rg -B 5 -A 10 '```{r.*plot' harpy/reports/naibr.Rmd

# Look for any operations with fa.sizes to understand potential division operations
rg -B 2 -A 5 'fa.sizes' harpy/reports/naibr.Rmd


---

Let me fix the syntax error and try again.


rg -B 5 -A 10 '```\{r.*plot' harpy/reports/naibr.Rmd

# Look for any operations with fa.sizes to understand potential division operations
rg -B 2 -A 5 'fa\.sizes' harpy/reports/naibr.Rmd

# Search for any code chunks that might contain division operations
rg -B 2 -A 5 'plot|ggplot|geom_' harpy/reports/naibr.Rmd

Let me try one more time with properly escaped quotes:


Based on the code context, I can now generate the final response:

The contig filtering change is correct and prevents potential divide-by-zero errors

The modification ensures that only contigs containing structural variants are included in fa.sizes, which is used to create the circos plot. This prevents potential errors in the following ways:

  • Filters out contigs without SVs before they reach the plotting functions
  • Ensures fa.sizes and sv dataframes are properly aligned
  • The subsequent BioCircos plotting code won't attempt to create tracks for empty contigs

The change is safe and doesn't require additional validation because:

  • The code already handles empty fa.sizes through min(nrow(fa.sizes), 30)
  • The filtering is applied before any division operations in the plotting code
  • The circos plot generation has proper error handling through the R markdown's chunk options
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential division operations in the plotting code
rg -A 5 "/(|\s+/\s+|\s*/\s*)" harpy/reports/naibr.Rmd

# Search for related error handling in test files
fd -e py test_ --exec rg -l "ZeroDivisionError|divide.*zero"

Length of output: 2345


Script:

#!/bin/bash
# Search for plotting code chunks and their context
rg -B 5 -A 10 '```{r.*plot' harpy/reports/naibr.Rmd

Length of output: 93


Script:

#!/bin/bash
# Search for plotting code chunks and their context
rg -B 5 -A 10 '```{r.*plot' harpy/reports/naibr.Rmd

Length of output: 93


Script:

#!/bin/bash
# Search for plotting code chunks and their context
rg -B 5 -A 10 "\`\`\`\{r.*plot" harpy/reports/naibr.Rmd

# Look for any operations with fa.sizes
rg -B 2 -A 5 "fa\.sizes" harpy/reports/naibr.Rmd

# Search for any code chunks that might contain division operations
rg -B 2 -A 5 "plot|ggplot|geom_" harpy/reports/naibr.Rmd

Length of output: 6384

harpy/reports/naibr_pop.Rmd (1)

296-296: Good fix for the vector comparison!

The change from if (plot_contigs == "default") to if (all(plot_contigs == "default")) is a crucial fix. In R, comparing a vector with a scalar is element-wise; handing that result to if() either uses only the first element (with a warning, in older R) or raises a "condition has length > 1" error (R >= 4.2), which could derail the contig filtering. Wrapping the comparison in all() makes the condition behave correctly when plot_contigs contains multiple values.

Let's verify the usage of plot_contigs parameter:

✅ Verification successful

Good fix for the vector comparison!

The change from if (plot_contigs == "default") to if (all(plot_contigs == "default")) is a crucial fix. The codebase search confirms that:

  1. plot_contigs is consistently passed as a parameter from Snakemake configuration across multiple report templates (naibr.Rmd, leviathan.Rmd, align_stats.Rmd, etc.)
  2. The parameter can be either "default" or a list of specific contigs, as seen in the Python code:
    **({'plot_contigs': contigs} if contigs else {'plot_contigs': "default"})
  3. The same vector comparison pattern is used consistently across all R markdown reports, confirming this is the correct approach for handling the plot_contigs parameter.

The fix ensures proper handling of both single and multiple contig cases, preventing potential issues with vector comparison in R.
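
For context, that conditional-kwargs expression behaves like the following sketch (the wrapper function and the extra key are hypothetical; only the plot_contigs expression is taken from the codebase):

def report_params(contigs):
    # Hypothetical wrapper; only the plot_contigs expression mirrors the quoted code
    return {
        "sample": "example",  # illustrative extra parameter
        **({'plot_contigs': contigs} if contigs else {'plot_contigs': "default"}),
    }

print(report_params(["contig_1", "contig_7"]))  # plot_contigs is the user-supplied list
print(report_params([]))                        # plot_contigs falls back to the string "default"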

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how plot_contigs parameter is used in Snakemake rules
# Test: Search for plot_contigs parameter in Snakemake rules
rg -A 5 "plot_contigs" '*.smk' '*.snake'

Length of output: 134


Script:

#!/bin/bash
# Let's search for plot_contigs in all files to understand its usage
rg -A 5 "plot_contigs" 

# Also search for Snakemake files with different extensions
fd -e snakefile -e Snakefile -e snake -e smk

Length of output: 17613

harpy/reports/hapcut.Rmd (1)

192-192: Correct handling of vector comparison with all() function

Changing the condition to if (all(plotting_contigs == "default")) ensures proper evaluation when plotting_contigs is a vector. Without all(), the element-wise comparison handed to if() would either be truncated to its first element or raise a length-condition error, leading to incorrect branching.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
harpy/bin/make_windows.py (2)

54-63: Use context manager for file handling.

While the file handling works, it would be better to use a context manager to ensure proper resource cleanup.

-with open(args.input, "r", encoding="utf-8") as fai:
+try:
+    with open(args.input, "r", encoding="utf-8") as fai:
         while True:
             line = fai.readline()
             if not line:
                 break
             lsplit = line.split("\t")
             contig = lsplit[0]
             c_len = int(lsplit[1])
             c_len += args.mode
             makewindows(contig, c_len, args.window)
+except IOError as e:
+    parser.error(f"Error reading {args.input}: {e}")

Line range hint 65-93: Improve FASTA file handling with context managers and error handling.

The current implementation could benefit from:

  1. Using context managers for file handling
  2. Better error handling for file operations
  3. Simplified file reading logic
 elif testname.endswith("fasta") or testname.endswith("fa") or testname.endswith("fasta.gz") or testname.endswith("fa.gz"):
-    if is_gzip(args.input):
-        fopen = gzip.open(args.input, "rt")
-    else:
-        fopen = open(args.input, "r", encoding="utf-8")
-    line = fopen.readline()
-    while True:
+    try:
+        open_func = gzip.open if is_gzip(args.input) else open
+        with open_func(args.input, "rt", encoding="utf-8") as fopen:
+            line = fopen.readline()
+            while line:
                 C_LEN=0
-                if not line:
-                    break
                 if line.startswith(">"):
                     contig = line.rstrip("\n").lstrip(">").split()[0]
                     HEADER = False
                     while not HEADER:
                         line = fopen.readline()
                         if not line:
                             break
                         line = line.rstrip("\n")
                         if line.startswith(">"):
                             HEADER = True
                             break
                         else:
                             C_LEN += len(line)-1
                             HEADER = False
                     C_LEN += args.mode
                     makewindows(contig, C_LEN, args.window)
+                line = fopen.readline()
-    fopen.close()
+    except IOError as e:
+        parser.error(f"Error reading {args.input}: {e}")
🧰 Tools
🪛 Ruff

67-67: Use a context manager for opening files

(SIM115)


69-69: Use a context manager for opening files

(SIM115)

harpy/_validations.py (1)

170-170: LGTM! Good improvement to error reporting.

The change to use os.path.basename() improves error message clarity by showing only the filename instead of the full path. This makes error messages more concise and user-friendly.

Consider extracting the basename earlier in the function to avoid repeated os.path.basename() calls if more error reporting is added in the future:

def check_RG(bamfile):
-   samplename = Path(bamfile).stem
+   basename = os.path.basename(bamfile)
+   samplename = Path(basename).stem
    samview = subprocess.run(f"samtools samples {bamfile}".split(), stdout = subprocess.PIPE).stdout.decode('utf-8').split()
    if samplename != samview[0]:
-       return os.path.basename(bamfile), samview[0]
+       return basename, samview[0]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b1cfddd and 1a57392.

📒 Files selected for processing (5)
  • .github/workflows/tests.yml (1 hunks)
  • harpy/_validations.py (1 hunks)
  • harpy/bin/make_windows.py (4 hunks)
  • harpy/bin/molecule_coverage.py (3 hunks)
  • harpy/snp.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • harpy/bin/molecule_coverage.py
  • harpy/snp.py
🧰 Additional context used
🪛 Ruff
harpy/bin/make_windows.py

67-67: Use a context manager for opening files

(SIM115)


69-69: Use a context manager for opening files

(SIM115)

🔇 Additional comments (3)
harpy/bin/make_windows.py (2)

Line range hint 38-51: LGTM! Clean implementation of window creation.

The function correctly handles window creation and follows Unix philosophy by writing to stdout.


Line range hint 1-95: Verify integration with molecule_coverage rule.

Since this PR aims to fix a ZeroDivisionError in the molecule_coverage rule, let's verify how this script is used in that context.

.github/workflows/tests.yml (1)

556-556: Command update looks good but let's verify the script.

The change from rename_bam to rename_bam.py with reordered arguments is an improvement that makes the script extension explicit.

Let's verify the script exists and is executable.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (7)
harpy/bin/haplotag_acbd.py (2)

18-20: Add error handling for directory operations.

The directory creation lacks proper error handling and user feedback. Consider adding checks for permissions and providing status messages.

Here's a suggested improvement:

 args = parser.parse_args()
 outdir = args.output_directory.rstrip("/")
-os.makedirs(outdir, exist_ok = True)
+try:
+    os.makedirs(outdir, exist_ok=True)
+    # Verify write permissions by attempting to create a test file
+    test_file = os.path.join(outdir, '.write_test')
+    try:
+        with open(test_file, 'w') as f:
+            pass
+        os.remove(test_file)
+    except IOError as e:
+        sys.stderr.write(f"Error: No write permission in directory '{outdir}'\n")
+        sys.exit(1)
+except OSError as e:
+    sys.stderr.write(f"Error creating directory '{outdir}': {e}\n")
+    sys.exit(1)

Line range hint 89-93: Improve file writing operations and error handling.

The current implementation uses list comprehension for side effects and lacks error handling for file operations.

Here's a more idiomatic approach:

 for BC in ["A","C","B","D"]:
-    with open(f"{outdir}/BC_{BC}.txt", "w", encoding="utf-8") as f:
-        ID = [f"{BC}{number:02d}" for number in range(1, 97)]
-        delim = ["\t".join(tup) for tup in zip(ID, BX[BC])]
-        _ = [f.write(f"{i}\n") for i in delim]
+    output_file = os.path.join(outdir, f"BC_{BC}.txt")
+    try:
+        with open(output_file, "wt", encoding="utf-8") as f:
+            IDs = [f"{BC}{number:02d}" for number in range(1, 97)]
+            for id_str, barcode in zip(IDs, BX[BC]):
+                f.write(f"{id_str}\t{barcode}\n")
+    except IOError as e:
+        sys.stderr.write(f"Error writing to '{output_file}': {e}\n")
+        sys.exit(1)

Changes made:

  1. Used os.path.join for path construction
  2. Added proper error handling
  3. Used "wt" mode to explicitly indicate text mode
  4. Replaced list comprehension with a more readable loop
  5. Fixed the variable naming to be more descriptive
harpy/bin/concatenate_bam.py (5)

34-37: Consider using f-strings for error messages.

The error messages are clear and helpful, but could be more concise using f-strings.

-    sys.stderr.write("Please provide a single file to \'--bamlist\' (-b) featuring all the files you want to concatenate (one per line):\n")
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt\n\n")
-    sys.stderr.write("Alternatively, provide the files after \'-o output.bam\':\n",)
-    sys.stderr.write("[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam\n")
+    sys.stderr.write(f"""Please provide a single file to '--bamlist' (-b) featuring all the files you want to concatenate (one per line):
+[example]: concatenate_bam.py -o c_acronotus.bam -b alignments.txt
+
+Alternatively, provide the files after '-o output.bam':
+[example]: concatenate_bam.py -o c_acronotus.bam sample1.bam sample2.bam
+""")

50-57: Consider using list comprehension for file validation.

The validation logic is correct but could be more concise.

-err = []
-for i in aln_list:
-    if not os.path.exists(i):
-        err.append(i)
-if err:
-    parser.error("Some input files were not found on the system:\n" + ", ".join(err))
+missing_files = [f for f in aln_list if not os.path.exists(f)]
+if missing_files:
+    parser.error(f"Some input files were not found on the system:\n{', '.join(missing_files)}")

58-60: Document the significance of the haplotag limit value.

Consider adding a comment explaining why the limit is 96^4 (e.g., "96 possible values for each of the 4 positions in the barcode").
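
For example, a named constant makes the origin of the number obvious (the constant name is illustrative):

# Four barcode positions (A, C, B, D), each with 96 possible zero-padded values (01-96)
HAPLOTAG_LIMIT = 96 ** 4  # 84,934,656 unique barcodes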


85-89: Add type hints and document the BX tag format.

Consider adding type hints and documenting the BX tag format for better maintainability.

+# BX tag format: A##C##B##D## where ## are zero-padded numbers from 01-96
 if args.bx_unique:
-    bc_range = [f"{i}".zfill(2) for i in range(1,97)]
-    bc_generator = product("A", bc_range, "C", bc_range, "B", bc_range, "D", bc_range)
+    bc_range: list[str] = [f"{i}".zfill(2) for i in range(1,97)]
+    bc_generator: Iterator[tuple[str, ...]] = product("A", bc_range, "C", bc_range, "B", bc_range, "D", bc_range)

93-96: Simplify BX_NEW initialization using ternary operator.

The initialization can be more concise.

-    if args.bx_unique:
-        BX_NEW = "".join(next(bc_generator))
-    else:
-        BX_NEW = None
+    BX_NEW = "".join(next(bc_generator)) if args.bx_unique else None
🧰 Tools
🪛 Ruff

93-96: Use ternary operator BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None instead of if-else-block

Replace if-else-block with BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None

(SIM108)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1a57392 and 0bdd82a.

📒 Files selected for processing (3)
  • harpy/bin/concatenate_bam.py (5 hunks)
  • harpy/bin/haplotag_acbd.py (1 hunks)
  • harpy/snakefiles/sv_leviathan_pop.smk (2 hunks)
🧰 Additional context used
🪛 Ruff
harpy/bin/concatenate_bam.py

93-96: Use ternary operator BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None instead of if-else-block

Replace if-else-block with BX_NEW = ''.join(next(bc_generator)) if args.bx_unique else None

(SIM108)


126-126: Yoda condition detected

Rewrite as haplotag_limit < MI_NEW

(SIM300)

🔇 Additional comments (6)
harpy/bin/haplotag_acbd.py (1)

3-6: LGTM! Clean import organization.

The added imports are appropriate and follow PEP 8 guidelines for import ordering.

harpy/bin/concatenate_bam.py (3)

7-7: LGTM: Well-documented argument parser changes!

The addition of the --bx flag and updated help text clearly explains the new functionality for handling BX tag uniqueness.

Also applies to: 16-18, 20-20, 26-27


129-135: Comprehensive error message for haplotag limit.

The error message provides clear guidance and alternatives when hitting the haplotag limit.


114-128: LGTM: Robust tag handling with proper error checks.

The logic for handling MI and BX tags is sound, with proper error handling for the haplotag limit.

✅ Verification successful

Let me gather more information about the BX tag handling to verify its uniqueness across samples.


Let me check one more thing about the haplotag_limit and bc_range initialization to fully understand the uniqueness constraints.


BX tag handling is correctly implemented with guaranteed uniqueness

The implementation ensures unique BX tags across samples through:

  • A deterministic barcode generator using itertools.product with a fixed pattern (A,C,B,D); a short sketch of this generator follows the list below
  • Numbers 01-96 for each position, allowing up to 96^4 unique combinations
  • Proper error handling when exceeding the haplotag limit
  • Consistent BX tag generation between concatenate_bam.py and 10xtoHaplotag.py
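
The generator quoted in the diff above reduces to this short, runnable sketch:

from itertools import product

bc_range = [f"{i}".zfill(2) for i in range(1, 97)]
bc_generator = product("A", bc_range, "C", bc_range, "B", bc_range, "D", bc_range)

print("".join(next(bc_generator)))  # A01C01B01D01
print("".join(next(bc_generator)))  # A01C01B01D02  (the last position cycles fastest)
print(96 ** 4)                      # 84934656 possible unique barcodes in total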
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify that BX tags are unique across samples when --bx is used
rg -A 2 'set_tag\("BX"' | grep -v 'set_tag'

Length of output: 271


Script:

#!/bin/bash
# Check the implementation of bc_generator and BX_NEW initialization
ast-grep --pattern 'bc_generator = $_'

# Also check for BX_NEW initialization
rg -B 3 'BX_NEW =' 

Length of output: 952


Script:

#!/bin/bash
# Check haplotag_limit and bc_range initialization
rg -B 2 'haplotag_limit ='
rg -B 2 'bc_range ='

Length of output: 591

🧰 Tools
🪛 Ruff

126-126: Yoda condition detected

Rewrite as haplotag_limit < MI_NEW

(SIM300)

harpy/snakefiles/sv_leviathan_pop.smk (2)

279-279: LGTM! Documentation updated correctly.

The workflow summary has been updated to accurately reflect the new command with the --bx flag, maintaining consistency with the actual implementation.


84-84: Verify the impact of the --bx flag on BAM concatenation.

The addition of the --bx flag to concatenate_bam.py suggests a change in how BX tags are handled during BAM concatenation. Let's verify the implementation and its potential effects.

✅ Verification successful

Let me gather more information about the concatenate_bam.py script to understand how it handles the --bx flag.


The --bx flag implementation is properly handled and safe to use

The code review reveals that the --bx flag is well-implemented in concatenate_bam.py:

  • It ensures BX tags remain unique between samples by generating new unique identifiers
  • It includes proper error handling for cases exceeding the haplotag limit (96^4)
  • The implementation maintains consistency with MI tags and handles both BAM/SAM formats correctly
  • The change is documented in the script's help text
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of the --bx flag in concatenate_bam.py
# and its potential impact on BAM files.

# Test 1: Check if concatenate_bam.py handles the --bx flag
rg -A 5 "def.*main" harpy/bin/concatenate_bam.py

# Test 2: Look for related BX tag handling code
rg -l "BX:Z:" harpy/bin/

Length of output: 230


Script:

#!/bin/bash
# Check the content of concatenate_bam.py to understand the --bx flag implementation
cat harpy/bin/concatenate_bam.py

# Also check for any related test files
fd concatenate_bam.*test harpy/

Length of output: 6234

Comment on lines +7 to +16
parser = argparse.ArgumentParser(
prog = 'haplotag_acbd.py',
description ="Generates the BC_{ABCD}.txt files necessary to demultiplex Gen I haplotag barcodes",
usage = "haplotag_acbd.py output_directory",
exit_on_error = False
)
parser.add_argument("output_directory", type = str, help = "Directory to create barcode files")
if len(sys.argv) == 1:
parser.print_help(sys.stderr)
sys.exit(1)

🛠️ Refactor suggestion

Improve argument parser robustness and error handling.

A few suggestions to enhance the argument parsing:

  1. Remove exit_on_error=False to ensure parsing errors are properly reported
  2. Use exit code 0 for help message as it's not an error condition
  3. Add path validation for the output directory (a separate sketch for this point follows the suggested diff below)

Here's the suggested implementation:

 parser = argparse.ArgumentParser(
     prog = 'haplotag_acbd.py',
     description ="Generates the BC_{ABCD}.txt files necessary to demultiplex Gen I haplotag barcodes",
-    usage = "haplotag_acbd.py output_directory",
-    exit_on_error = False
+    usage = "%(prog)s output_directory"
     )
-parser.add_argument("output_directory", type = str, help = "Directory to create barcode files")
+parser.add_argument(
+    "output_directory",
+    type=str,
+    help="Directory to create barcode files",
+    metavar="PATH"
+)
 if len(sys.argv) == 1:
     parser.print_help(sys.stderr)
-    sys.exit(1)
+    sys.exit(0)
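
For point 3 (path validation), which the diff above does not cover, a hedged sketch could look like this; it mirrors the script's existing rstrip/makedirs handling, and the error wording is illustrative:

import argparse
import os

parser = argparse.ArgumentParser(prog="haplotag_acbd.py")
parser.add_argument("output_directory", metavar="PATH", help="Directory to create barcode files")
args = parser.parse_args()

outdir = args.output_directory.rstrip("/")
if os.path.exists(outdir) and not os.path.isdir(outdir):
    parser.error(f"'{outdir}' exists but is not a directory")
os.makedirs(outdir, exist_ok=True)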

@pdimens pdimens merged commit ac9a13c into main Nov 8, 2024
19 of 20 checks passed
@pdimens pdimens deleted the secondary_deconvolve branch November 8, 2024 21:41
Development

Successfully merging this pull request may close these issues.

Error in rule molecule_coverage
1 participant