Demultiplexing

The demultiplex.py script performs the following:

  1. Demultiplexing
  2. Cluster density calculation

It contains 2 classes:

  • GetRunfolders
  • DemultiplexRunfolder

Protocol

  1. The GetRunfolders() class collects runfolders in the config-specified runfolders directory
  2. GetRunfolders.setoff_processing() is called to:
    • Check that bcl2fastq2 and gatk (used for cluster density calculations) are installed on the workstation
    • Initiate processing for each identified runfolder that lacks a bcl2fastq2 logfile (bcl2fastq2_output.log, whose presence denotes that demultiplexing has been performed). bcl2fastq2 stdout and stderr streams are written to this file
  3. If the criteria in step 2 are met, DemultiplexRunfolder().setoff_workflow() is called, which performs a set of further checks on the runfolder to determine whether demultiplexing is required:
    • Sequencing is complete (presence of the RTAComplete.txt file, which the sequencer creates when sequencing is complete)
    • SampleSheet does not contain any errors that would cause demultiplexing to fail - checks are carried out by the samplesheet_validator.py module, which makes use of the seglh-naming library. The absence of error messages for specific tests is checked:
      • SampleSheet is present
      • SampleSheet name is valid (validated using the seglh-naming library)
      • SampleSheet is not empty
      • SampleSheet contains the minimum expected [Data] section headers: Sample_ID, Sample_Name, index
      • Sample name does not contain any illegal characters (in case this was not rectified after the early warning checks, as illegal characters cause bcl2fastq2 to fail)
    • If the sequencer does not require an integrity check, the script skips straight to run_demultiplexing()
    • If the sequencer does require an integrity check, the following requirements must be met for run_demultiplexing() to be called:
      1. The checksum file generated by the integrity checking script must be present
      2. The run has not failed a previous integrity check performed by this script
      3. The md5 checksums in the checksum file match. This verifies the integrity of the data between the workstation and the sequencer
  4. If the criteria in step 3 are met, the demultiplexing log file is created to prevent a simultaneous attempt on the next run of the script (bcl2fastq2 is slow to create the logfile), and the cluster density calculations are performed
  5. If the criteria in step 4 are met, and the run is not a TSO run, run_demultiplexing() then executes the bcl2fastq2 (v2.20) demultiplexing command
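
The gating logic in the protocol can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the function names needs_demultiplexing and claim_runfolder and their parameters are hypothetical, and the SampleSheet and integrity check results are passed in as booleans rather than computed. Only the filenames RTAComplete.txt and bcl2fastq2_output.log come from the description above.

```python
from pathlib import Path
import tempfile

# Filenames taken from the protocol description.
RTA_COMPLETE = "RTAComplete.txt"
DEMUX_LOG = "bcl2fastq2_output.log"


def needs_demultiplexing(runfolder: Path, samplesheet_ok: bool,
                         integrity_check_required: bool,
                         integrity_ok: bool = False) -> bool:
    """Hypothetical helper: return True only if every check passes."""
    if (runfolder / DEMUX_LOG).exists():
        return False  # logfile present: already demultiplexed or in progress
    if not (runfolder / RTA_COMPLETE).exists():
        return False  # sequencing has not finished
    if not samplesheet_ok:
        return False  # SampleSheet errors would cause bcl2fastq2 to fail
    if integrity_check_required and not integrity_ok:
        return False  # checksum file missing, failed, or mismatched
    return True


def claim_runfolder(runfolder: Path) -> None:
    """Hypothetical helper: create the logfile up front so that a later
    invocation of the script skips this runfolder."""
    (runfolder / DEMUX_LOG).touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rf = Path(tmp)
        (rf / RTA_COMPLETE).touch()
        print(needs_demultiplexing(rf, True, False))  # checks pass
        claim_runfolder(rf)
        print(needs_demultiplexing(rf, True, False))  # logfile now present
```

Creating the logfile before the slow bcl2fastq2 call is what stops a second, overlapping invocation from picking up the same runfolder.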

Usage

The module can be used either from the command line or as a module import:

Command line

Multiple runfolders

The script should be run with no inputs provided when assessing production runs on the workstation. This allows it to loop over multiple runfolders and demultiplex in succession:

python3 -m demultiplex

Single runfolder

The script can be run manually for an individual runfolder as follows:

python3 -m demultiplex -r $RUNFOLDER_NAME
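
The two command line modes could be handled along these lines. This is a sketch under stated assumptions, not the script's actual argument handling: only the -r flag is taken from the usage above, while parse_args, select_runfolders and the dest name runfolder are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser: -r is optional, so omitting it means
    'process every runfolder'."""
    parser = argparse.ArgumentParser(prog="demultiplex")
    parser.add_argument(
        "-r",
        dest="runfolder",  # assumed attribute name, for illustration only
        default=None,
        help="name of a single runfolder to process",
    )
    return parser.parse_args(argv)


def select_runfolders(all_runfolders, requested=None):
    """Return every runfolder, or only the requested one if it exists."""
    if requested is None:
        return list(all_runfolders)
    return [rf for rf in all_runfolders if rf == requested]


if __name__ == "__main__":
    args = parse_args()
    # The runfolder names below are placeholders, not real runs.
    found = ["999999_A01229_0000_00000TEST1", "999999_A01229_0000_00000TEST2"]
    print(select_runfolders(found, args.runfolder))
```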

Module import

from demultiplex import demultiplex

gr_obj = demultiplex.GetRunfolders()
gr_obj.setoff_processing()

Configuration

Settings are imported from ad_config.py.

Logging

Logging is performed using ad_logger.

| Alias | Description | Filename | Location |
| --- | --- | --- | --- |
| Demultiplex output | Catches any traceback from errors when running the cron job that are not caught by exception handling within the script | TIMESTAMP.txt | /usr/local/src/mokaguys/automate_demultiplexing_logfiles/Demultiplex_cron_stdout |
| demultiplex (script_logger) | Records script-level logs for the demultiplex script | TIMESTAMP_demultiplex_script.log | /usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/ |
| demultiplex (demux_rf_logger) | Records runfolder-level logs for the demultiplex script | RUNFOLDERNAME_demultiplex_runfolder.log | /usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/ |
| Bcl2fastq output | STDERR from bcl2fastq2 | bcl2fastq2_output.log | Within the runfolder |

Testing

N.B. Tests and test cases/files MUST be maintained and updated accordingly in conjunction with script development

Test datasets are stored in /test/data. The script has a test suite.

These tests should be run before pushing any code to ensure all tests in the GitHub Actions workflow pass.

Demultiplex.py tests

This directory contains test files used in the demultiplex test suite.

test_runfolders contains runfolders used to test GetRunfolders().rundemultiplexrunfolders(), GetRunfolders.loop_through_runs(), DemultiplexRunfolder.run_demultiplexing() and DemultiplexRunfolder.check_demultiplexing_required().

The test cases are described below.

Test SampleSheets

Lone SampleSheet test cases are detailed below. These have been created for the purpose of testing SampleSheet related functions in the demultiplex script (valid_samplesheet and no_disallowed_sserrs). The test cases are as follows:

| SampleSheet name | Run Type |
| --- | --- |
| 210408_M02631_0186_000000000-JFMNK_SampleSheet.csv | SNP |
| 210917_NB551068_0409_AH3YNFAFX3_SampleSheet.csv | Custom Panel |
| 221021_A01229_0145_BHGGTHDMXY_SampleSheet.csv | TSO500 |
| 221024_A01229_0146_BHKGG2DRX2_SampleSheet.csv | WES Skin |

TODO check if these cover all cases

| SampleSheet Name | Details | Expected behaviour |
| --- | --- | --- |
| 21aA08_A01229_0040_AHKGTFDRXY_SampleSheet.csv | Empty SampleSheet with invalid name (letter in date) | |
| 21108_A01229_0040_AHKGTFDRXY_SampleSheet.csv | Empty SampleSheet with invalid name (date too short) | |
| 220413_A01229_0032_AHGKBIEKFR_SampleSheet.csv | Empty SampleSheet | |
| 200817_NB068_0009_AH3YERAFX3_SampleSheet.csv | Custom Panel SampleSheet with invalid name (invalid sequencer ID), invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) # DONE | |
| 210513_M02631_0236_000000000-JFMNK_SampleSheet.csv | SNP SampleSheet with invalid characters in the sample name | |
| 220404_B01229_0348_HFGIFEIOPY_SampleSheet.csv | TSO SampleSheet with invalid name (invalid sequencer ID), invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) # DONE | |
| 220408_A02631_0186_000000000-JLJFE_SampleSheet.csv | SNP SampleSheet with invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) # DONE | |
| 2110915_M02353_0632_000000000-K242J_SampleSheet.csv | SNP SampleSheet with invalid name (date too long), invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) # DONE | |
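
To illustrate what the invalid-name test cases exercise, here is a simplified regular expression check. It is only a stand-in: the script validates names with the seglh-naming library, whose actual rules are not reproduced here. The pattern (six-digit date, a sequencer ID of A or M plus five digits or NB plus six digits, a four-digit run number, a flowcell ID) is an assumption inferred from the file names listed above, and samplesheet_name_is_valid is a hypothetical function name.

```python
import re

# Assumed naming pattern, inferred from the test-case file names only.
SAMPLESHEET_NAME = re.compile(
    r"(?P<date>\d{6})_"                       # YYMMDD, digits only
    r"(?P<sequencer>A\d{5}|M\d{5}|NB\d{6})_"  # assumed sequencer ID formats
    r"(?P<run>\d{4})_"                        # four-digit run number
    r"(?P<flowcell>[A-Z0-9-]+)"               # flowcell ID, may contain '-'
    r"_SampleSheet\.csv"
)


def samplesheet_name_is_valid(name: str) -> bool:
    """Hypothetical helper: True if the whole name matches the pattern."""
    return SAMPLESHEET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in (
        "210408_M02631_0186_000000000-JFMNK_SampleSheet.csv",   # valid name
        "21aA08_A01229_0040_AHKGTFDRXY_SampleSheet.csv",        # letter in date
        "200817_NB068_0009_AH3YERAFX3_SampleSheet.csv",         # bad sequencer ID
    ):
        print(name, samplesheet_name_is_valid(name))
```

A name check of this kind only covers the file name. The content checks (headers, sample names, pan numbers, runtype) are separate tests.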

test_runfolders

| Runfolder | Details | Expected behaviour |
| --- | --- | --- |
| 999999_A01229_0000_00000TEST1 | bcl2fastq2_output.log (Demultiplexing already complete) | demultiplexing_required returns False |
| 999999_A01229_0000_00000TEST2 | No flag files (Sequencing not finished) | demultiplexing_required returns False |
| 999999_A01229_0000_00000TEST3 | RTAComplete.txt, invalid SampleSheet present in test samplesheet dir with disallowed errors that would cause demultiplexing to fail (Sequencing complete but no processing has taken place yet) | demultiplexing_required returns False |
| 999999_M02631_0000_00000TEST4 | RTAComplete.txt, matching valid SampleSheet present in test samplesheet dir, InterOp and RunInfo.xml files for Picard CollectIlluminaLaneMetrics calculation, integrity check not required (Sequencing complete but no processing has taken place yet) | demultiplexing_required returns True |
| 999999_A01229_0000_00000TEST5 | RTAComplete.txt, matching valid SampleSheet present in test samplesheet dir, integrity check required, but no checksum file (Sequencing complete but no processing has taken place yet) | demultiplexing_required returns False |
| 999999_A01229_0000_00000TEST6 | RTAComplete.txt, matching valid SampleSheet present in test samplesheet dir, integrity check required, md5checksum.txt present and contains integrity check fail string (Sequencing complete but no processing has taken place yet, previous integrity check has failed) | demultiplexing_required returns False |
| 999999_A01229_0000_00000TEST7 | RTAComplete.txt (sequencing complete), matching valid SampleSheet present in test samplesheet dir, InterOp and RunInfo.xml files for Picard CollectIlluminaLaneMetrics calculation, integrity check required, md5checksum.txt present and contains matching checksums but no previously checked checksums string (Sequencing complete but no processing has taken place yet, integrity check passed) | demultiplexing_required returns True |
| 999999_A01229_0000_00000TEST8 | Matching valid SampleSheet present in samplesheet dir containing TSO samples | run_demultiplexing returns False, self.run_processed == True |
| 999999_A01229_0000_00000TEST9 | RTAComplete.txt (sequencing complete), matching valid SampleSheet present in samplesheet dir containing non-TSO samples | run_demultiplexing returns True, self.run_processed == True (bcl2fastq2 command replaced by a dummy command) |
| 999999_A01229_0000_0000TEST10 | RTAComplete.txt (sequencing complete), SampleSheet missing, integrity check not required (md5checksum.txt present and contains matching checksums with a previously checked checksums string - processing has taken place, integrity check passed) | demultiplexing_required returns False |
| 999999_A01229_0000_0000TEST11 | RTAComplete.txt, SampleSheet present and contains TSO samples, integrity check required (md5checksum.txt present and contains matching checksums but no previously checked checksums string - no processing has taken place) | |