Example inputs can be found in the example_data/ folder in the root directory of this repository.
The following describes the required and optional arguments to run moalmanac/moalmanac.py.
- Required arguments
- Optional arguments
- Tumor type
- Stage
- Somatic single nucleotide variants
- Somatic insertion and deletion variants
- Bases covered
- Copy number alterations
- Fusions
- Germline variants
- Somatic variants from validation sequencing
- Microsatellite status
- Mutational signatures
- Purity
- Ploidy
- Whole genome doubling
- Disable matchmaking
- Description
- Output directory
- Preclinical databases
Alternatively, a simplified version of the interpretation algorithm can be run by using moalmanac/simplified_input.py.
The following arguments are required to run Molecular Oncology Almanac.
--patient_id
expects a single string value which is used for labeling outputs.
--config
expects a file path to the config.ini file.
This config file contains the following sections:
- function_toggle - allows several features of the MOAlmanac algorithm to be enabled or disabled
- logging - specifies the level that the logger should be configured to use
- versions - specifies the versions of the MOAlmanac algorithm (interpreter) and database
- exac - specifies the allele frequency threshold used with ExAC to decide whether a variant is a common variant
- fusion - specifies the minimum spanning fragments required for review by MOAlmanac, the column names expected from inputs, and how "Fusion" should be written from input
- mutations - specifies the minimum coverage and allelic fraction that a variant needs for review by MOAlmanac
- seg - specifies the percentile used to evaluate copy gain and loss variants from segmented copy number input files, as well as how amplification and deletion should be written as strings
- signatures - specifies the minimum contribution required to review COSMIC mutational signatures by MOAlmanac
- validation_sequencing - specifies thresholds for minimum power to detect variants and minimum allelic fraction for annotation from validation sequencing. This is further described in the Methods section of our paper.
- feature_types - specifies string labels for each biomarker type passed to the algorithm. These values will be included in the feature_type column of outputs.
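As a quick sanity check before a run, the section list above can be compared against a config file. The following is a minimal sketch using Python's standard configparser; it is not part of MOAlmanac, and the option name `level` in the example is a hypothetical placeholder because the options inside each section are not listed here.

```python
import configparser

# Section names as listed above. Option names inside each section are not
# documented here, so this sketch only checks that the sections exist.
EXPECTED_SECTIONS = [
    "function_toggle", "logging", "versions", "exac", "fusion",
    "mutations", "seg", "signatures", "validation_sequencing", "feature_types",
]


def missing_sections(config_text):
    """Return the expected section names that are absent from an ini document."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return [name for name in EXPECTED_SECTIONS if not parser.has_section(name)]


# Hypothetical, deliberately incomplete config used only for illustration.
example_config = "[logging]\nlevel = INFO\n"
print(missing_sections(example_config))
```

For a file on disk, the same check could be applied to the file's text, for example after reading config.ini with `open(...).read()`.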
--dbs
expects a file path to the annotation-databases.ini file.
This config file contains a single section, databases, that lists the following:
- root - path to the datasources/ directory
- almanac_handle - path within root that points to the molecular-oncology-almanac.json datasource file
- cancerhotspots_handle - path within root that points to the Cancer Hotspots datasource file
- 3dcancerhotspots_handle - path within root that points to the Cancer Hotspots 3D datasource file
- cgc_handle - path within root that points to the Cancer Gene Census file
- cosmic_handle - path within root that points to the COSMIC datasource file
- gsea_pathways_handle - path within root that points to the GSEA pathways datasource file
- gsea_modules_handle - path within root that points to the GSEA modules datasource file
- exac_handle - path within root that points to the ExAC datasource file
- acmg_handle - path within root that points to the ACMG datasource file
- clinvar_handle - path within root that points to the ClinVar datasource file
- hereditary_handle - path within root that points to the genes related to hereditary cancers datasource file
- oncotree_handle - path within root that points to the Oncotree datasource file
- lawrence_handle - path within root that points to the Lawrence et al. TCGA mutational burden datasource file
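Because every handle is a path within root, the full path to each datasource is the root joined with the handle value. Below is a minimal sketch of that resolution with Python's configparser. The directory layout and the ExAC file name in the example are hypothetical, and this is an illustration rather than MOAlmanac's own loading code.

```python
import configparser
import os


def resolve_handles(ini_text):
    """Map each *_handle key in [databases] to root joined with its value."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["databases"]
    root = section["root"]
    return {
        key: os.path.join(root, value)
        for key, value in section.items()
        if key != "root"
    }


# Hypothetical values; only the key names and the almanac file name come
# from the description above.
example_ini = """
[databases]
root = datasources
almanac_handle = molecular-oncology-almanac.json
exac_handle = exac.example.txt
"""

for key, path in resolve_handles(example_ini).items():
    print(key, path)
```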
For more information about each datasource, view the datasources directory.
Molecular Oncology Almanac will run successfully given any combination of the following arguments:
--tumor_type
expects a string representing the reported tumor type. MOAlmanac will attempt to map this string to either an Oncotree term or code. MOAlmanac will consider clinically relevant matches of the same tumor type prior to considering matches of another tumor type.
--stage
also expects a string and is intended for entering disease stage. It is not used functionally within the method and is only displayed in the produced actionability report.
--snv_handle
anticipates a tab delimited file which contains somatic single nucleotide variants (snvs). This file should follow the guidelines set by either TCGA or GDC Mutation Annotation Format (MAF). Insertions and deletions can also be included in this input; the two MAFs are simply concatenated together.
Hugo_Symbol | NCBI_Build | Chromosome | Start_position | End_position | Reference_Allele | Tumor_Seq_Allele1 | Tumor_Seq_Allele2 | Variant_Classification | Tumor_Sample_Barcode | Matched_Norm_Sample_Barcode | Annotation_Transcript | Protein_Change | t_ref_count | t_alt_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BRAF | 37 | 7 | 140453136 | 140453136 | A | T | A | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000288602.6 | p.V600E | 70 | 35 |
MSH2 | 37 | 2 | 47739466 | 47739466 | G | A | G | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000406134.1 | p.D887N | 50 | 25 |
STAG2 | 37 | X | 123191810 | 123191810 | A | T | A | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000371160.1 | p.F467I | 60 | 20 |
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- Hugo_Symbol, gene symbol associated with the variant
- NCBI_Build, reference genome used
- Chromosome, chromosome of the variant
- Start_position, genomic start position of the variant
- End_position, genomic end position of the variant
- Reference_Allele, reference allele at the genomic location
- Variant_Classification, consequence of variant: Missense, Nonsense, Nonstop, Splice_Site, Frame_Shift_Ins, Frame_Shift_Del, In_Frame_Ins, or In_Frame_Del
- Tumor_Seq_Allele1, alternate allele at the genomic location
- Tumor_Seq_Allele2, second allele at the genomic location (will be the same as Reference_Allele for SNVs)
- Tumor_Sample_Barcode, string associated with the tumor profile
- Matched_Norm_Sample_Barcode, string associated with the corresponding normal profile
- Annotation_Transcript, transcript associated with variant
- t_ref_count, number of reference alleles observed at genomic position
- t_alt_count, number of alternate alleles observed at genomic position
At least one of the following also must be included:
- HGVSp_Short, protein change associated with the variant using the one-letter amino acid codes
- Protein_Change, protein change associated with the variant using the one-letter amino acid codes
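A MAF header can be checked against the fields above before it is passed to --snv_handle. The sketch below is a hypothetical pre-flight check, not MOAlmanac's own validation. It assumes the default column names (which colnames.ini may override) and compares them case-insensitively, as described above.

```python
import csv
import io

# Default column names from the list above; colnames.ini may change them.
REQUIRED_COLUMNS = [
    "Hugo_Symbol", "NCBI_Build", "Chromosome", "Start_position", "End_position",
    "Reference_Allele", "Variant_Classification", "Tumor_Seq_Allele1",
    "Tumor_Seq_Allele2", "Tumor_Sample_Barcode", "Matched_Norm_Sample_Barcode",
    "Annotation_Transcript", "t_ref_count", "t_alt_count",
]
PROTEIN_CHANGE_COLUMNS = ["HGVSp_Short", "Protein_Change"]


def check_maf_header(handle):
    """Return (missing required columns, whether a protein change column exists)."""
    header = next(csv.reader(handle, delimiter="\t"))
    present = {name.strip().lower() for name in header}
    missing = [name for name in REQUIRED_COLUMNS if name.lower() not in present]
    has_protein_change = any(name.lower() in present for name in PROTEIN_CHANGE_COLUMNS)
    return missing, has_protein_change


# Deliberately incomplete header to show what the check reports.
example = io.StringIO("hugo_symbol\tNCBI_Build\tChromosome\n")
missing, has_protein_change = check_maf_header(example)
print(missing, has_protein_change)
```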
--indel_handle
anticipates a tab delimited file which contains somatic insertions and deletions (indels). This file should follow the guidelines set by either TCGA or GDC Mutation Annotation Format (MAF). Single nucleotide variants can also be included in this input; the two MAFs are simply concatenated together.
Hugo_Symbol | NCBI_Build | Chromosome | Start_position | End_position | Reference_Allele | Tumor_Seq_Allele1 | Tumor_Seq_Allele2 | Variant_Classification | Tumor_Sample_Barcode | Matched_Norm_Sample_Barcode | Annotation_Transcript | Protein_Change | t_ref_count | t_alt_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PMPCA | 37 | 9 | 139312448 | 139312449 | - | G | - | Intron | ProfileA-Tumor | ProfileA-Normal | ENST00000371717.3 | | 92 | 31 |
C10orf2 | 37 | 10 | 102748300 | 102748301 | TC | TC | - | Frame_Shift_Del | ProfileA-Tumor | ProfileA-Normal | ENST00000370228.1 | p.L112fs | 294 | 28 |
MEST | 37 | 7 | 130138285 | 130138285 | C | C | - | Frame_Shift_Del | ProfileA-Tumor | ProfileA-Normal | ENST00000223215.4 | p.L168fs | 60 | 20 |
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- Hugo_Symbol, gene symbol associated with the variant
- NCBI_Build, reference genome used
- Chromosome, chromosome of the variant
- Start_position, genomic start position of the variant
- End_position, genomic end position of the variant
- Reference_Allele, reference allele at the genomic location
- Variant_Classification, consequence of variant: Missense, Nonsense, Nonstop, Splice_Site, Frame_Shift_Ins, Frame_Shift_Del, In_Frame_Ins, or In_Frame_Del
- Tumor_Seq_Allele1, alternate allele at the genomic location
- Tumor_Seq_Allele2, second allele at the genomic location (will be the same as Reference_Allele for SNVs)
- Tumor_Sample_Barcode, string associated with the tumor profile
- Matched_Norm_Sample_Barcode, string associated with the corresponding normal profile
- Annotation_Transcript, transcript associated with variant
- t_ref_count, number of reference alleles observed at genomic position
- t_alt_count, number of alternate alleles observed at genomic position
At least one of the following also must be included:
- HGVSp_Short, protein change associated with the variant using the one-letter amino acid codes
- Protein_Change, protein change associated with the variant using the one-letter amino acid codes
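The t_ref_count and t_alt_count columns are what a coverage and allelic fraction filter would draw on; the mutations section of config.ini is described as setting the minimum values for both. The sketch below assumes coverage is the sum of the two counts and allelic fraction is the alternate count divided by coverage. The threshold values are hypothetical, and MOAlmanac's exact calculation may differ.

```python
def passes_filters(t_ref_count, t_alt_count, min_coverage, min_allelic_fraction):
    """Check a variant's read counts against coverage and allelic fraction minimums."""
    coverage = t_ref_count + t_alt_count  # assumed definition of coverage
    if coverage == 0:
        return False  # no reads at this position, nothing to evaluate
    allelic_fraction = t_alt_count / coverage
    return coverage >= min_coverage and allelic_fraction >= min_allelic_fraction


# Hypothetical thresholds, for illustration only.
MIN_COVERAGE = 15
MIN_ALLELIC_FRACTION = 0.05

# Counts taken from the MEST row of the example table (60 reference, 20 alternate).
print(passes_filters(60, 20, MIN_COVERAGE, MIN_ALLELIC_FRACTION))
```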
--bases_covered_handle
anticipates a tab delimited file which contains a single integer representing the number of bases tested for somatic variants. This is used as the denominator to calculate tumor mutational burden (number of coding somatic variants / somatic bases tested).
29908096 |
---|
This input is looking for an integer value.
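The burden calculation described above is a single division. The sketch below reads the integer from the file's text and divides a variant count by it. The variant count is a made-up number, and the per-megabase conversion is a common reporting convention added here as an assumption rather than something stated in this document.

```python
def read_bases_covered(text):
    """Parse the single integer stored in a bases covered file."""
    return int(text.strip())


def mutational_burden(n_coding_variants, bases_covered):
    """Number of coding somatic variants divided by somatic bases tested."""
    if bases_covered <= 0:
        raise ValueError("bases_covered must be a positive integer")
    return n_coding_variants / bases_covered


bases = read_bases_covered("29908096\n")
burden = mutational_burden(150, bases)  # 150 is a hypothetical variant count
print(burden)
print(burden * 1_000_000)  # the same value expressed per megabase (assumed convention)
```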
--called_cn_handle
anticipates a tab delimited file which contains one column for gene name and a second for the copy number call. For the latter, only the values Amplification and Deletion will be used by the Molecular Oncology Almanac.
gene | call |
---|---|
TP53 | Deletion |
CDKN2A | Deletion |
BRAF | Baseline |
EGFR | Amplification |
The rows associated with TP53, CDKN2A, and EGFR will be interpreted and scored by Molecular Oncology Almanac while BRAF will be filtered.
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- gene, gene symbol associated with the copy number alteration
- call, copy number event of the gene. Amplification and Deletion are accepted and all other values will be filtered.
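The filtering rule above can be illustrated with the example table. This is a minimal sketch using Python's csv module with the default column names; exact string matching on the call value is an assumption.

```python
import csv
import io

ACCEPTED_CALLS = {"Amplification", "Deletion"}


def filter_called_copy_number(handle):
    """Keep rows whose call is Amplification or Deletion; drop everything else."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["call"] in ACCEPTED_CALLS]


example = io.StringIO(
    "gene\tcall\n"
    "TP53\tDeletion\n"
    "CDKN2A\tDeletion\n"
    "BRAF\tBaseline\n"
    "EGFR\tAmplification\n"
)
kept = filter_called_copy_number(example)
print([row["gene"] for row in kept])
```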
--cnv_handle
anticipates a tab delimited file which contains total copy number from a source such as GATK CNV or ReCapSeg; support for allele specific copy number is in progress. This file should have genes associated with segments. Amplifications are called from the top 2.5% of all unique segments and deletions are called from the bottom 2.5% of all unique segments.
gene | segment_contig | segment_start | segment_end | sample | segment_mean |
---|---|---|---|---|---|
BRAF | 7 | 140035556 | 142013739 | ProfileA-Tumor | 1.250062303 |
CDKN2A | 9 | 21818453 | 27173612 | ProfileA-Tumor | 0.822108092 |
BOC | 3 | 112282632 | 113393977 | ProfileA-Tumor | 0.957205107 |
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- gene, gene symbol associated with the copy number alteration
- segment_contig, chromosome of the copy number alteration
- segment_start, genomic location of the segment's start position
- segment_end, genomic location of the segment's end position
- sample, string associated with the tumor profile
- segment_mean, normalized segment mean
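The percentile-based calling described above can be sketched as follows. Several details are assumptions made for illustration: segments are treated as unique by contig, start, and end; percentiles use linear interpolation; and the comparisons are inclusive. MOAlmanac's implementation may differ on each point.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def call_copy_number(rows, tail=2.5):
    """Label each gene from where its segment mean falls among unique segments."""
    unique_segments = {
        (row["segment_contig"], row["segment_start"], row["segment_end"]): float(row["segment_mean"])
        for row in rows
    }
    means = list(unique_segments.values())
    high = percentile(means, 100 - tail)
    low = percentile(means, tail)
    calls = {}
    for row in rows:
        mean = float(row["segment_mean"])
        if mean >= high:
            calls[row["gene"]] = "Amplification"
        elif mean <= low:
            calls[row["gene"]] = "Deletion"
        else:
            calls[row["gene"]] = "Baseline"
    return calls


# Rows from the example table above.
rows = [
    {"gene": "BRAF", "segment_contig": "7", "segment_start": "140035556",
     "segment_end": "142013739", "segment_mean": "1.250062303"},
    {"gene": "CDKN2A", "segment_contig": "9", "segment_start": "21818453",
     "segment_end": "27173612", "segment_mean": "0.822108092"},
    {"gene": "BOC", "segment_contig": "3", "segment_start": "112282632",
     "segment_end": "113393977", "segment_mean": "0.957205107"},
]
print(call_copy_number(rows))
```

With only three segments the extremes are always called, so a realistic input needs many more segments for the thresholds to be meaningful.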
--fusion_handle
anticipates a tab delimited file which contains fusions, specifically in the format of STAR Fusion.
#FusionName | SpanningFragCount | LeftBreakpoint | RightBreakpoint |
---|---|---|---|
EML4--ALK | 0 | 6:47471176 | 11:66563752 |
COL1A2--APBA3 | 6 | 9:35657873 | 21:46320255 |
POLR2A--AP2M1 | 12 | 17:7406801 | 3:183898675 |
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- #FusionName, gene symbols associated with the fusion separated by --. Genes are labeled from 5' to 3'.
- SpanningFragCount, counts of RNA-seq fragments supporting the fusion
- LeftBreakpoint, genomic position of the fusion's left breakpoint
- RightBreakpoint, genomic position of the fusion's right breakpoint
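The fields above are enough to split each fusion into its two partners and to apply a minimum spanning fragment count, which the fusion section of config.ini is described as controlling. The sketch below is illustrative; the threshold of 5 is a hypothetical value, not MOAlmanac's default.

```python
def filter_fusions(rows, min_spanning_fragments):
    """Split #FusionName into 5' and 3' genes and drop poorly supported fusions."""
    kept = []
    for row in rows:
        if int(row["SpanningFragCount"]) < min_spanning_fragments:
            continue
        # Genes are labeled from 5' to 3' and separated by two dashes.
        gene_5prime, gene_3prime = row["#FusionName"].split("--", 1)
        kept.append({
            "gene_5prime": gene_5prime,
            "gene_3prime": gene_3prime,
            "spanning_fragments": int(row["SpanningFragCount"]),
        })
    return kept


# Rows from the example table above (breakpoint columns omitted for brevity).
rows = [
    {"#FusionName": "EML4--ALK", "SpanningFragCount": "0"},
    {"#FusionName": "COL1A2--APBA3", "SpanningFragCount": "6"},
    {"#FusionName": "POLR2A--AP2M1", "SpanningFragCount": "12"},
]
for fusion in filter_fusions(rows, min_spanning_fragments=5):  # hypothetical threshold
    print(fusion)
```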
--germline_handle
anticipates a tab delimited file which contains germline variants (both snvs and indels) associated with a case profile. This file should follow the guidelines set by either TCGA or GDC Mutation Annotation Format (MAF). Column names are not case-sensitive.
Hugo_Symbol | NCBI_Build | Chromosome | Start_position | End_position | Reference_Allele | Tumor_Seq_Allele1 | Tumor_Seq_Allele2 | Variant_Classification | Tumor_Sample_Barcode | Matched_Norm_Sample_Barcode | Annotation_Transcript | Protein_Change | t_ref_count | t_alt_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BRAF | 37 | 7 | 140453136 | 140453136 | A | T | A | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000288602.6 | p.V600E | 40 | 25 |
MSH2 | 37 | 2 | 47739466 | 47739466 | G | A | G | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000406134.1 | p.D887N | 40 | 30 |
STAG2 | 37 | X | 123191810 | 123191810 | A | T | A | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000371160.1 | p.F467I | 80 | 26 |
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- Hugo_Symbol, gene symbol associated with the variant
- NCBI_Build, reference genome used
- Chromosome, chromosome of the variant
- Start_position, genomic start position of the variant
- End_position, genomic end position of the variant
- Reference_Allele, reference allele at the genomic location
- Variant_Classification, consequence of variant: Missense, Nonsense, Nonstop, Splice_Site, Frame_Shift_Ins, Frame_Shift_Del, In_Frame_Ins, or In_Frame_Del
- Tumor_Seq_Allele1, alternate allele at the genomic location
- Tumor_Seq_Allele2, second allele at the genomic location (will be the same as Reference_Allele for SNVs)
- Tumor_Sample_Barcode, string associated with the tumor profile
- Matched_Norm_Sample_Barcode, string associated with the corresponding normal profile
- Annotation_Transcript, transcript associated with variant
- t_ref_count, number of reference alleles observed at genomic position
- t_alt_count, number of alternate alleles observed at genomic position
At least one of the following also must be included:
- HGVSp_Short, protein change associated with the variant using the one-letter amino acid codes
- Protein_Change, protein change associated with the variant using the one-letter amino acid codes
--validation_handle
anticipates a tab delimited file which contains somatic variants (snvs and/or indels) from any form of validation or orthogonal sequencing on the tumor, such as re-sequencing the same tissue or somatic variants called from RNA. This file should follow the guidelines set by either TCGA or GDC Mutation Annotation Format (MAF).
Variants from this file are only used for confirmation and are not used for discovery. Specifically, MOAlmanac will look for reported somatic variants in the validation sequencing and identify if any supporting reads are present and if there is sufficient power for detection. This is consistent with best practices recommended by Yizhak et al. 2019.
Hugo_Symbol | NCBI_Build | Chromosome | Start_position | End_position | Reference_Allele | Tumor_Seq_Allele1 | Tumor_Seq_Allele2 | Variant_Classification | Tumor_Sample_Barcode | Matched_Norm_Sample_Barcode | Annotation_Transcript | Protein_Change | t_ref_count | t_alt_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BRAF | 37 | 7 | 140453136 | 140453136 | A | T | A | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000288602.6 | p.V600E | 40 | 25 |
MSH2 | 37 | 2 | 47739466 | 47739466 | G | A | G | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000406134.1 | p.D887N | 40 | 30 |
STAG2 | 37 | X | 123191810 | 123191810 | A | T | A | Missense_Mutation | ProfileA-Tumor | ProfileA-Normal | ENST00000371160.1 | p.F467I | 80 | 26 |
Required fields can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- Hugo_Symbol, gene symbol associated with the variant
- NCBI_Build, reference genome used
- Chromosome, chromosome of the variant
- Start_position, genomic start position of the variant
- End_position, genomic end position of the variant
- Reference_Allele, reference allele at the genomic location
- Variant_Classification, consequence of variant: Missense, Nonsense, Nonstop, Splice_Site, Frame_Shift_Ins, Frame_Shift_Del, In_Frame_Ins, or In_Frame_Del
- Tumor_Seq_Allele1, alternate allele at the genomic location
- Tumor_Seq_Allele2, second allele at the genomic location (will be the same as Reference_Allele for SNVs)
- Tumor_Sample_Barcode, string associated with the tumor profile
- Matched_Norm_Sample_Barcode, string associated with the corresponding normal profile
- Annotation_Transcript, transcript associated with variant
- t_ref_count, number of reference alleles observed at genomic position
- t_alt_count, number of alternate alleles observed at genomic position
At least one of the following also must be included:
- HGVSp_Short, protein change associated with the variant using the one-letter amino acid codes
- Protein_Change, protein change associated with the variant using the one-letter amino acid codes
--ms_status
is a categorical input for microsatellite status, anticipating one of four options:
- --unknown, when status is unknown
- --mss for microsatellite stable tumors, MSS
- --msil for microsatellite instability "low", MSI-L
- --msih for microsatellite instability "high", MSI-H
Microsatellite status is reported in the clinical actionability report.
--mutational_signatures
anticipates a tab delimited file which contains contributions to Single Base Substitution (SBS) Mutational Signatures from COSMIC version 3.4. The file should only contain signature contributions for the tumor sample being studied. We recommend generating SBS mutational signatures with SigProfilerAssignment, and have prepared a wrapper GitHub repository to run SigProfilerAssignment and format signature contributions as expected.
signature | contribution |
---|---|
SBS1 | 0.03846154 |
SBS2 | 0 |
SBS3 | 0.8525641 |
... | ... |
SBS95 | 0 |
The required fields for this file can be changed from their default expectations by editing the appropriate section of colnames.ini. Column names are not case-sensitive.
- signature, labels for each of the 79 SBS mutational signatures included in COSMIC mutational signatures version 3.4
- contribution, a float value between 0 and 1 for the row's associated signature weight. This column's values should sum to 1.
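The two constraints on the contribution column (each value between 0 and 1, and the column summing to 1) can be checked before running, and a minimum contribution can then be applied in the way the signatures section of config.ini is described as doing. This is a sketch under assumptions: the tolerance, the threshold of 0.2, and the example weights are all hypothetical.

```python
import math


def reviewable_signatures(contributions, min_contribution, tolerance=1e-6):
    """Validate signature weights, then keep those at or above a minimum."""
    for name, weight in contributions.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} has a contribution outside 0 to 1: {weight}")
    total = sum(contributions.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"contributions sum to {total}, expected 1")
    return {name: weight for name, weight in contributions.items() if weight >= min_contribution}


# Hypothetical weights for a tumor with three contributing signatures.
example = {"SBS1": 0.1, "SBS3": 0.85, "SBS5": 0.05}
print(reviewable_signatures(example, min_contribution=0.2))
```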
--purity
anticipates a float value between 0.0 and 1.0 for the reported tumor purity. This is just used for reporting in the clinical actionability report.
--ploidy
anticipates a float value for the reported tumor ploidy. This is just used for reporting in the clinical actionability report.
--wgd
is a boolean argument that, when passed, is interpreted as the tumor harboring a whole-genome doubling event. If passed, MOAlmanac will match against relevant assertions.
--disable_matchmaking
removes patient-to-cell line matchmaking from being included in the actionability report.
--description
is a string input that is generally a free text field for users to enter any additional comments.
--output-directory
allows users to specify a directory to write outputs to; the current working directory will be used if unspecified.
--preclinical-dbs
expects a file path to the preclinical-databases.ini file. This argument and ini file are required to run either of the modules that:
- Looks at the efficacy of relationships in cancer cell lines
- Performs genomic similarity to cancer cell lines
This config file contains a single section, preclinical, that lists the following:
- root - path to the datasources/preclinical/ directory
- almanac_gdsc_mappings - path within root that points to the formatted/almanac-gdsc-mappings.json datasource file
- summary - path within root that points to the formatted/cell-lines.summary.txt datasource file
- variants - path within root that points to the annotated/cell-lines.somatic-variants.annotated.txt datasource file
- copynumbers - path within root that points to the annotated/cell-lines.copy-numbers.annotated.txt file
- fusions - path within root that points to the annotated/cell-lines.fusions.annotated.txt datasource file
- fusions1 - path within root that points to the annotated/cell-lines.fusions.annotated.gene1.txt datasource file
- fusions2 - path within root that points to the annotated/cell-lines.fusions.annotated.gene2.txt datasource file
- gdsc - path within root that points to the formatted/sanger.gdsc.txt datasource file
- dictionary - path within root that points to the cell-lines.pkl datasource file
For more information about each datasource, view the datasources/preclinical/ directory.
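To show how the arguments for moalmanac.py fit together, the sketch below assembles a command line as a list suitable for `subprocess.run`. It is a hypothetical helper, not part of MOAlmanac. The script location (`moalmanac.py` in the working directory), the file names, and the example values are assumptions; the flag names are the ones documented above.

```python
import sys


def build_command(patient_id, config, dbs, optional=None):
    """Assemble a moalmanac.py command; `optional` maps flag names to values."""
    command = [
        sys.executable, "moalmanac.py",  # assumed script location
        "--patient_id", patient_id,
        "--config", config,
        "--dbs", dbs,
    ]
    for name, value in (optional or {}).items():
        if value is True:
            command.append(f"--{name}")  # boolean flags such as --wgd
        elif value is not None and value is not False:
            command.extend([f"--{name}", str(value)])
    return command


# Hypothetical values for illustration.
command = build_command(
    "example-patient",
    "config.ini",
    "annotation-databases.ini",
    optional={"tumor_type": "Melanoma", "purity": 0.85, "wgd": True, "stage": None},
)
print(" ".join(command))
```

The list form avoids shell quoting issues if a value such as --description contains spaces.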
--input
is an argument only used with simplified_input.py. It accepts a tab delimited file with one genomic alteration per row based on MOAlmanac's standardized feature columns. In short, the following columns are expected:
- feature_type, the data type of the molecular feature; accepts Somatic Variant, Germline Variant, Copy Number, or Rearrangement. These strings can be customized in the feature_types section of config.ini.
- gene or feature, the gene name of the genomic alteration.
- alteration_type, classification or consequence of the genomic alteration
  - For somatic and germline variants: Missense, Nonsense, Nonstop, Splice_Site, Frame_Shift_Ins, Frame_Shift_Del, In_Frame_Ins, or In_Frame_Del
  - For copy number alterations: Amplification or Deletion
  - For rearrangements: Fusion or Translocation
- alteration, specific genomic alteration
  - For somatic and germline variants: the protein change with 1-letter amino acid codes, p.HGVSp_Short
  - For copy number alterations: leave blank
  - For rearrangements: the full fusion separated by two dashes, --
For example,
feature_type | feature | alteration_type | alteration |
---|---|---|---|
Somatic Variant | BRAF | Missense | p.V600E |
Copy Number | CDK4 | Amplification | |
Rearrangement | COL1A1 | Fusion | COL1A1--CITED4 |
Germline Variant | BRCA2 | Frameshift | p.S1982fs |
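A simplified input file like the one above can be produced with Python's csv module. The sketch below writes the four columns as tab delimited text, using rows taken from the example table. The helper name is hypothetical, and the blank alteration for the copy number row follows the guidance above.

```python
import csv
import io

COLUMNS = ["feature_type", "feature", "alteration_type", "alteration"]


def write_simplified_input(records, handle):
    """Write one genomic alteration per row as tab delimited text."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


records = [
    {"feature_type": "Somatic Variant", "feature": "BRAF",
     "alteration_type": "Missense", "alteration": "p.V600E"},
    {"feature_type": "Copy Number", "feature": "CDK4",
     "alteration_type": "Amplification", "alteration": ""},  # left blank for copy number
    {"feature_type": "Rearrangement", "feature": "COL1A1",
     "alteration_type": "Fusion", "alteration": "COL1A1--CITED4"},
]

buffer = io.StringIO()
write_simplified_input(records, buffer)
print(buffer.getvalue())
```

To write to disk instead, an open file handle (opened with `newline=""`) can be passed in place of the StringIO buffer.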
If you use this method, please cite our publication: