The demultiplex.py script performs the following:
- Demultiplexing
- Cluster density calculation
It contains 2 classes:
- GetRunfolders
- DemultiplexRunfolder

The `GetRunfolders()` class collects runfolders in the config-specified runfolders directory. `GetRunfolders.setoff_processing()` is called to:

1. Check that `bcl2fastq2` and `gatk` (used for cluster density calculations) are installed on the workstation
2. Initiate processing for each identified runfolder that has no bcl2fastq2 logfile (`bcl2fastq2_output.log`, whose presence denotes that demultiplexing has been performed). bcl2fastq2 stdout and stderr streams are written to this file
3. If criterion 2 is met, `DemultiplexRunfolder().setoff_workflow()` is called, which performs a set of further checks on the runfolder to determine whether demultiplexing is required:
    - Sequencing is complete (presence of the `RTAComplete.txt` file, created by the sequencer when sequencing is complete)
    - SampleSheet does not contain any errors that would cause demultiplexing to fail. Checks are carried out by the samplesheet_validator.py module, which makes use of the seglh-naming library. The absence of error messages for specific tests is checked:
        - SampleSheet is present
        - SampleSheet name is valid (validated using the seglh-naming library)
        - SampleSheet is not empty
        - SampleSheet contains the minimum expected `[Data]` section headers: Sample_ID, Sample_Name, index
        - Sample name does not contain any illegal characters (in case this was not rectified after the early warning checks, as this will cause bcl2fastq2 to fail)
    - If the sequencer does not require an integrity check, it skips straight to `run_demultiplexing()`
    - If the sequencer does require an integrity check, the following requirements must be met for `run_demultiplexing()` to be called:
        - Checksum file generated by the integrity checking script must be present
        - The run has not failed a previous integrity check performed by this script
        - The md5 checksums in the checksum file match. This verifies the integrity between the workstation and sequencer
4. If criterion 3 is met, the demultiplexing log file is created to prevent a simultaneous demultiplexing attempt on the next run of the script (bcl2fastq2 is slow to create the logfile), and the cluster density calculations are performed
5. If criterion 4 is met and the run is not a TSO run, `run_demultiplexing()` then executes the `bcl2fastq2` (v2.20) demultiplexing command
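The checks above can be condensed into a short sketch. This is an illustration only, not the script's actual implementation: the function names `demultiplexing_required` and `claim_runfolder` and the parameters `samplesheet_ok`, `integrity_check_required` and `integrity_ok` are hypothetical, and the SampleSheet and checksum checks are passed in as booleans so that no assumption is made about how they are carried out. Only the two filenames come from the description above.

```python
from pathlib import Path
import tempfile

LOGFILE = "bcl2fastq2_output.log"  # presence means demultiplexing was already started
RTA_FLAG = "RTAComplete.txt"       # written by the sequencer when sequencing is complete


def demultiplexing_required(
    runfolder: Path,
    samplesheet_ok: bool,
    integrity_check_required: bool,
    integrity_ok: bool = False,
) -> bool:
    """Return True only if every precondition in the list above holds.

    samplesheet_ok and integrity_ok are hypothetical inputs standing in for
    the SampleSheet validation and the md5 checksum comparison.
    """
    if (runfolder / LOGFILE).exists():
        return False  # already demultiplexed, or claimed by an earlier run
    if not (runfolder / RTA_FLAG).exists():
        return False  # sequencing has not finished
    if not samplesheet_ok:
        return False  # SampleSheet errors would make bcl2fastq2 fail
    if integrity_check_required and not integrity_ok:
        return False  # checksum file missing, failed previously, or mismatched
    return True


def claim_runfolder(runfolder: Path) -> Path:
    """Create the logfile up front so a later run of the script skips this runfolder."""
    logfile = runfolder / LOGFILE
    logfile.touch()
    return logfile


if __name__ == "__main__":
    # Build a throwaway runfolder in which sequencing is marked as complete.
    with tempfile.TemporaryDirectory() as tmp:
        runfolder = Path(tmp)
        (runfolder / RTA_FLAG).touch()
        if demultiplexing_required(runfolder, samplesheet_ok=True,
                                   integrity_check_required=False):
            claim_runfolder(runfolder)
        # A second pass over the same runfolder now sees the logfile.
        print(demultiplexing_required(runfolder, True, False))
```

Creating the logfile before the slow bcl2fastq2 command starts is what stops a second invocation of the script from picking up the same runfolder.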
The module can be used either from the command line or as a module import.

From the command line, the script should be run with no inputs provided when assessing production runs on the workstation. This allows it to loop over multiple runfolders and demultiplex them in succession:

    python3 -m demultiplex

The script can be run manually for an individual runfolder as follows:

    python3 -m demultiplex -r $RUNFOLDER_NAME

As a module import:

    from demultiplex import demultiplex
    gr_obj = demultiplex.GetRunfolders()
    gr_obj.setoff_processing()
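As an illustration of how the optional `-r` argument could select between the two modes, here is a minimal `argparse` sketch. Only the `-r` flag is taken from the usage above; the long option name `--runfolder_name` and the helpers `parse_args` and `select_runfolders` are assumptions made for this example and may not match the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; -r is optional (long option name is hypothetical)."""
    parser = argparse.ArgumentParser(prog="demultiplex")
    parser.add_argument(
        "-r",
        "--runfolder_name",
        default=None,
        help="Name of a single runfolder to process; omit to process all runfolders",
    )
    return parser.parse_args(argv)


def select_runfolders(all_runfolders, runfolder_name=None):
    """Return the single requested runfolder, or every runfolder if none was given."""
    if runfolder_name is None:
        return list(all_runfolders)
    return [runfolder_name]


if __name__ == "__main__":
    args = parse_args()
    # The runfolder names here are placeholders for illustration.
    for name in select_runfolders(["RUN_A", "RUN_B"], args.runfolder_name):
        print(f"Would process {name}")
```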
Settings are imported from ad_config.py.
Logging is performed using ad_logger.
Alias | Description | Filename | Location |
---|---|---|---|
Demultiplex output | Catches any traceback from errors when running the cron job that are not caught by exception handling within the script | TIMESTAMP.txt | /usr/local/src/mokaguys/automate_demultiplexing_logfiles/Demultiplex_cron_stdout |
demultiplex (script_logger) | Records script-level logs for the demultiplex script | TIMESTAMP_demultiplex_script.log | /usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/ |
demultiplex (demux_rf_logger) | Records runfolder-level logs for the demultiplex script | RUNFOLDERNAME_demultiplex_runfolder.log | /usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/ |
Bcl2fastq output | STDERR from bcl2fastq2 | bcl2fastq2_output.log | Within the runfolder |
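The split between a script-level and a runfolder-level logfile can be sketched with the standard `logging` module. This does not reproduce ad_logger, whose interface is not shown here: the helper names `script_logfile`, `runfolder_logfile` and `make_file_logger`, the timestamp format and the log line format are all assumptions. Only the filename patterns are taken from the table above.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def script_logfile(log_dir: Path, timestamp: str) -> Path:
    """Script-level logfile: TIMESTAMP_demultiplex_script.log."""
    return log_dir / f"{timestamp}_demultiplex_script.log"


def runfolder_logfile(log_dir: Path, runfolder_name: str) -> Path:
    """Runfolder-level logfile: RUNFOLDERNAME_demultiplex_runfolder.log."""
    return log_dir / f"{runfolder_name}_demultiplex_runfolder.log"


def make_file_logger(name: str, logfile: Path) -> logging.Logger:
    """Return a logger that appends to logfile (the format string is an assumption)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(logfile)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # hypothetical format
        script_log = make_file_logger(
            "script_logger", script_logfile(log_dir, timestamp)
        )
        script_log.info("Starting demultiplex script")
        logging.shutdown()  # close file handles before the directory is removed
```

Keeping one logfile per runfolder means the history of a single run can be read without filtering the script-level log.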
N.B. Tests and test cases/files MUST be maintained and updated accordingly in conjunction with script development
Test datasets are stored in /test/data. The script has a test suite.
These tests should be run before pushing any code to ensure all tests in the GitHub Actions workflow pass.
This directory contains test files used in the demultiplex test suite.
test_runfolders contains runfolders used to test `GetRunfolders().rundemultiplexrunfolders()`, `DemultiplexRunfolder.run_demultiplexing()`, `DemultiplexRunfolder.check_demultiplexing_required()` and `GetRunfolders.loop_through_runs()`.
The test cases are described below.
Lone SampleSheet test cases have been created for testing the SampleSheet-related functions in the demultiplex script (valid_samplesheet and no_disallowed_sserrs). The test cases are as follows:
SampleSheet name | Run Type |
---|---|
210408_M02631_0186_000000000-JFMNK_SampleSheet.csv | SNP |
210917_NB551068_0409_AH3YNFAFX3_SampleSheet.csv | Custom Panel |
221021_A01229_0145_BHGGTHDMXY_SampleSheet.csv | TSO500 |
221024_A01229_0146_BHKGG2DRX2_SampleSheet.csv | WES Skin |
SampleSheet Name | Details | Expected behaviour |
---|---|---|
21aA08_A01229_0040_AHKGTFDRXY_SampleSheet.csv | Empty SampleSheet with invalid name (letter in date) | |
21108_A01229_0040_AHKGTFDRXY_SampleSheet.csv | Empty SampleSheet with invalid name (date too short) | |
220413_A01229_0032_AHGKBIEKFR_SampleSheet.csv | Empty SampleSheet | |
200817_NB068_0009_AH3YERAFX3_SampleSheet.csv | Custom Panel SampleSheet with invalid name (invalid sequencer ID), invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) | # DONE |
210513_M02631_0236_000000000-JFMNK_SampleSheet.csv | SNP SampleSheet with invalid characters in the sample name | |
220404_B01229_0348_HFGIFEIOPY_SampleSheet.csv | TSO SampleSheet with invalid name (invalid sequencer ID), invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) | # DONE |
220408_A02631_0186_000000000-JLJFE_SampleSheet.csv | SNP SampleSheet with invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) | # DONE |
2110915_M02353_0632_000000000-K242J_SampleSheet.csv | SNP SampleSheet with invalid name (date too long), invalid contents (invalid header, invalid sample names, non-matching sample names, invalid pan number, invalid runtype) | # DONE |
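Two of the conditions exercised by these SampleSheets, the six-digit date at the start of the name and the minimum `[Data]` section headers, can be illustrated with a simplified sketch. The real checks are performed by samplesheet_validator.py and the seglh-naming library, whose rules are stricter and are not reproduced here. The function names `has_valid_date_prefix` and `missing_data_headers` are hypothetical, and the sketch assumes that the header row is the line directly after `[Data]`.

```python
import re

# Exactly six digits followed by an underscore at the start of the name.
DATE_PREFIX = re.compile(r"^\d{6}_")

EXPECTED_HEADERS = ("Sample_ID", "Sample_Name", "index")


def has_valid_date_prefix(samplesheet_name: str) -> bool:
    """Check only the shape of the date field, not that it is a real calendar date."""
    return DATE_PREFIX.match(samplesheet_name) is not None


def missing_data_headers(samplesheet_text: str, expected=EXPECTED_HEADERS) -> list:
    """Return the expected headers that are absent from the row after [Data]."""
    lines = samplesheet_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("[Data]"):
            header_row = lines[i + 1] if i + 1 < len(lines) else ""
            headers = [column.strip() for column in header_row.split(",")]
            return [name for name in expected if name not in headers]
    return list(expected)  # no [Data] section at all, e.g. an empty SampleSheet


if __name__ == "__main__":
    for name in (
        "210408_M02631_0186_000000000-JFMNK_SampleSheet.csv",
        "21aA08_A01229_0040_AHKGTFDRXY_SampleSheet.csv",
        "21108_A01229_0040_AHKGTFDRXY_SampleSheet.csv",
        "2110915_M02353_0632_000000000-K242J_SampleSheet.csv",
    ):
        print(name, has_valid_date_prefix(name))
```

An empty SampleSheet is reported as missing all three headers, which is one way the empty-file cases in the table could be caught by the same function.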
Runfolder | Details | Expected behaviour |
---|---|---|
999999_A01229_0000_00000TEST1 | bcl2fastq2_output.log (Demultiplexing already complete) | demultiplexing_required returns False |
999999_A01229_0000_00000TEST2 | No flag files (Sequencing not finished) | demultiplexing_required returns False |
999999_A01229_0000_00000TEST3 | RTAComplete.txt, invalid SampleSheet present in test samplesheet dir with disallowed errors that would cause demultiplexing to fail (Sequencing complete but no processing has taken place yet) | demultiplexing_required returns False |
999999_M02631_0000_00000TEST4 | RTAComplete.txt, matching valid SampleSheet present in test samplesheet dir, InterOp and RunInfo.xml files for Picard CollectIlluminaLaneMetrics calculation, integrity check not required (Sequencing complete but no processing has taken place yet) | demultiplexing_required returns True |
999999_A01229_0000_00000TEST5 | RTAComplete.txt, matching valid SampleSheet present in test samplesheet dir, integrity check required, but no checksum file (Sequencing complete but no processing has taken place yet) | demultiplexing_required returns False |
999999_A01229_0000_00000TEST6 | RTAComplete.txt, matching valid SampleSheet present in test samplesheet dir, integrity check required, md5checksum.txt present and contains integrity check fail string (Sequencing complete but no processing has taken place yet, previous integrity check has failed) | demultiplexing_required returns False |
999999_A01229_0000_00000TEST7 | RTAComplete.txt (sequencing complete), matching valid SampleSheet present in test samplesheet dir, InterOp and RunInfo.xml files for Picard CollectIlluminaLaneMetrics calculation, integrity check required, md5checksum.txt present and contains matching checksums but no previously checked checksums string (Sequencing complete but no processing has taken place yet, integrity check passed) | demultiplexing_required returns True |
999999_A01229_0000_00000TEST8 | Matching valid SampleSheet present in samplesheet dir containing TSO samples | run_demultiplexing returns False, self.run_processed == True |
999999_A01229_0000_00000TEST9 | RTAComplete.txt (sequencing complete), matching valid SampleSheet present in samplesheet dir containing non-TSO samples | run_demultiplexing returns True, self.run_processed == True (bcl2fastq2 command replaced by a dummy command) |
999999_A01229_0000_0000TEST10 | RTAComplete.txt (sequencing complete), SampleSheet missing, integrity check not required (md5checksum.txt present and contains matching checksums with a previously checked checksums string - processing has taken place, integrity check passed) | demultiplexing_required returns False |
999999_A01229_0000_0000TEST11 | RTAComplete.txt, SampleSheet present and contains TSO samples, integrity check required (md5checksum.txt present and contains matching checksums but no previously checked checksums string - no processing has taken place) | |