knowledgesystems/reVUE-data

reVUE-data

Welcome to the reVUE-data repository! This repository is designed to store data related to reVUE, a platform dedicated to collecting and analyzing information about Variants of Unknown Effect (VUEs) in cancer research. The repository contains all reVUE data in VUEs.json.

Data structure

All reVUE variants are stored in VUEs.json and grouped by gene. For each gene:

  • hugoGeneSymbol (string)
  • transcriptId (string)
  • genomicLocationDescription (string): Description of the variant location, usually the exon or intron involved or a pattern the variants have in common.
  • defaultEffect (string): Summary of default effect predicted by VEP, e.g., "splice".
  • comment (string): Short summary of the actual effects, e.g. "Complete exon 3 skip"
  • context (string): Clinical or research context, e.g., "Actionable in GIST".
  • revisedProteinEffects (array): A list of reVUE variants for this gene. Each element in this array represents a variant.
    • variant (string): HGVSg
    • genomicLocation (string)
    • transcriptId (string)
    • vepPredictedProteinEffect (string)
    • vepPredictedVariantClassification (string)
    • revisedProteinEffect (string)
    • revisedVariantClassification (string)
    • confirmed (boolean): If a variant has confirmed=true, Genome Nexus replaces the VEP protein change and variant classification with the reVUE revised values in its annotation APIs.
    • references (array)
      • pubmedId (string)
      • referenceText (string)
    • mutationOrigin (string): germline or somatic
    • counts (object): Generated by variant_count.py under ./scripts/. The counts are drawn from mskimpact, tcga (all xxx_tcga_pan_can_atlas_2018), genie, and mskimpact_nonsignedout (look for data_nonsignedout_mutations.txt):
      • germlineVariantsCount (integer)
      • somaticVariantsCount (integer)
      • unknownVariantsCount (integer)
      • totalPatientCount (integer)
      • genePatientCount (integer)
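
The structure above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes the top level of VUEs.json is a JSON array of gene objects, and `load_vues` and `index_variants` are hypothetical helper names. The sample record reuses values quoted elsewhere in this README.

```python
import json
from pathlib import Path


def load_vues(path="VUEs.json"):
    # Assumption: the file's top level is a JSON array of gene objects.
    return json.loads(Path(path).read_text(encoding="utf-8"))


def index_variants(genes):
    """Map each HGVSg string to its variant record, tagged with the parent gene symbol."""
    index = {}
    for gene in genes:
        for record in gene.get("revisedProteinEffects", []):
            index[record["variant"]] = {**record, "hugoGeneSymbol": gene["hugoGeneSymbol"]}
    return index


# Tiny in-memory stand-in for VUEs.json so the sketch runs without the real file.
sample = [{
    "hugoGeneSymbol": "EGFR",
    "transcriptId": "ENST00000275493",
    "revisedProteinEffects": [{
        "variant": "7:g.55248980_55248981insTCCAGGAAGCCT",
        "revisedProteinEffect": "p.A763_Y764insFQEA",
        "confirmed": True,
    }],
}]

index = index_variants(sample)
confirmed = [hgvsg for hgvsg, record in index.items() if record.get("confirmed")]
print(confirmed)
```

With the real file, `index_variants(load_vues())` would build the same lookup over every gene.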

VUE updating process

Submit potential VUEs

Submit a new VUE here, or create a GitHub issue.

List of papers

TODO

Curation and updating VUEs.json

See the Curation section below for details.

Deployment

After new updates to VUEs.json are merged, the reVUE website automatically displays the most recent data.

We regularly release updates for Genome Nexus. After each release, the Genome Nexus annotation API response reflects the latest data.

Curation

Revise variant classification

ReVUE variant classification

  • Splice_
    • Exon_Skip_ (skip one or multiple whole exon)
      • In_Frame
      • Out_Of_Frame
      • Non_Start (first coding exon skipped)
    • Exon_Extension_ (extend exon)
      • In_Frame
      • Out_Of_Frame
      • Nonsense (introduce stop codon)
    • Exon_shortening_ (truncate a portion of exon)
      • In_Frame
      • Out_Of_Frame
      • Nonsense (introduce stop codon)
    • Intron_Retention_ (introduce the whole intron)
      • In_Frame
      • Out_Of_Frame

Notes:

  1. If a deletion spans a whole exon, it's classified as Splice_Exon_Skip_. For example, a variant that deletes exon 4 and half of exon 5 is classified as Splice_Exon_Skip_, not Splice_Exon_shortening_.
  2. If a variant introduces only part of an intron, it's classified as Splice_Exon_Extension_. It's Splice_Intron_Retention_ only when the whole intron is introduced.
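
The hierarchy above composes into full classification names such as Splice_Exon_Extension_In_Frame (the value shown in the Genome Nexus example at the end of this README). The sketch below enumerates the combinations so a curated value can be checked for typos. It reads Non_Start and Nonsense as leaf suffixes of their own, which is an interpretation of the list rather than something the README states, and `SPLICE_CLASSES` and `is_valid_classification` are hypothetical names.

```python
# Leaf suffixes per splice category, transcribed from the classification list.
# Assumption: Non_Start and Nonsense are separate leaves, not modifiers of Out_Of_Frame.
SPLICE_CLASSES = {
    "Exon_Skip": ["In_Frame", "Out_Of_Frame", "Non_Start"],
    "Exon_Extension": ["In_Frame", "Out_Of_Frame", "Nonsense"],
    "Exon_shortening": ["In_Frame", "Out_Of_Frame", "Nonsense"],
    "Intron_Retention": ["In_Frame", "Out_Of_Frame"],
}

# Every full name of the form Splice_<category>_<suffix>.
ALL_CLASSIFICATIONS = {
    f"Splice_{category}_{suffix}"
    for category, suffixes in SPLICE_CLASSES.items()
    for suffix in suffixes
}


def is_valid_classification(name):
    """Return True if name is one of the enumerated reVUE classifications."""
    return name in ALL_CLASSIFICATIONS


print(is_valid_classification("Splice_Exon_Extension_In_Frame"))
print(is_valid_classification("Splice_Exon_Skip_Nonsense"))
```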

In-Frame vs Out-Of-Frame

If the length of the insertion or deletion is a multiple of 3, the variant is In-Frame; otherwise it's Out-Of-Frame.
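
As a quick check of this rule, the sketch below classifies a change by its length in nucleotides. `frame_of` is a hypothetical helper; the two inputs are the 12-nucleotide EGFR insertion and the 172-nucleotide ATM exon 17 (57 codons plus 1 nucleotide) discussed elsewhere in this README.

```python
def frame_of(length_nt):
    """Classify an inserted or deleted length (in nucleotides) by the multiple-of-3 rule."""
    if length_nt <= 0:
        raise ValueError("length must be a positive number of nucleotides")
    return "In_Frame" if length_nt % 3 == 0 else "Out_Of_Frame"


egfr_insertion = "TCCAGGAAGCCT"      # inserted bases from 7:g.55248980_55248981insTCCAGGAAGCCT
atm_exon17_length = 57 * 3 + 1       # 57 codons plus 1 nucleotide

print(frame_of(len(egfr_insertion)))  # 12 nt, a multiple of 3
print(frame_of(atm_exon17_length))    # 172 nt, leaves a remainder of 1
```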

Revise protein change

In-Frame

For example, take the MET exon 14 skipping variant 7:g.116412044G>A. First, verify the transcript ID in the published or submitted data. Then open the Integrative Genomics Viewer (IGV) and locate position 116412044. This position lies at the boundary of exon 14 and intron 14; the variant disrupts splicing there and causes exon 14 to be skipped. Exon 13 ends with one nucleotide, G, from codon 963 Asp (D), while exon 15 starts with two nucleotides, A and T, from codon 1010 Asp (D). When exon 13 is joined to exon 15, the remaining G combines with A and T to form the codon GAT, which again encodes Asp (D). The revised protein change is therefore p.D963_E1009del.

[Figure: met_exon_14_skip_example]
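
The junction reasoning in the MET example can be reproduced with the standard genetic code. This is an illustrative sketch: `CODON_TABLE` is built from the usual TCAG ordering, `junction_codon` is a hypothetical helper, and the only inputs are the three nucleotides and the codon numbers quoted in the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def junction_codon(upstream_tail, downstream_head):
    """Translate the codon formed where two exons are joined after a skip."""
    codon = upstream_tail + downstream_head
    if len(codon) != 3:
        raise ValueError("the two fragments must add up to exactly one codon")
    return CODON_TABLE[codon]


# Exon 13 contributes G (from codon 963); exon 15 contributes AT (from codon 1010).
new_residue = junction_codon("G", "AT")

# p.D963_E1009del removes residues 963 through 1009 inclusive.
deleted_residues = 1009 - 963 + 1

print(new_residue, deleted_residues)
```

Because the joined codon still encodes Asp, the skip reads as a clean in-frame deletion rather than a deletion plus substitution.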

Frameshift

For example, take the ATM exon 17 skipping variant 11:g.108138071T>C. First, verify the transcript ID in the published or submitted data. Then open the Integrative Genomics Viewer (IGV) and locate position 108138071. This position is the second nucleotide at the junction of exon 17 and intron 17; the variant disrupts splicing there and causes exon 17 to be skipped. Exon 17 is 57 codons plus 1 nucleotide long (172 nucleotides, not a multiple of 3), so skipping it shifts the reading frame. Exon 16 ends with a complete codon, while exon 18 begins with two nucleotides (G and T) of codon 880 Gly (G). When exon 18 is joined to exon 16, translation continues from the first nucleotide of exon 18 in steps of 3, which is a different frame from the wild type. In the wild type, the codon at this position is 823 Ala (A), the first codon of exon 17. With exon 17 deleted, exon 18 takes its place and the amino acid sequence becomes V P L I L, followed by a stop codon. The first changed codon is therefore 823 Ala (A), which changes to Val (V), and translation terminates after 5 codons. The protein change of this variant is p.A823Vfs*5.

[Figure: frameshift_ATM]
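
The "read every 3 nucleotides until a stop codon" step in the ATM example can be sketched as follows. The sequence used here is a toy string chosen only so that it translates to V P L I L and then stops; it is not the real ATM exon 18 sequence, and `translate_until_stop` is a hypothetical helper.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_until_stop(sequence):
    """Translate from the first base in steps of 3, stopping at the first stop codon."""
    peptide = []
    for start in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[start:start + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Hypothetical post-junction sequence (NOT real ATM sequence): GTT CCT CTT ATT CTT TAA GGC
toy_sequence = "GTTCCTCTTATTCTTTAAGGC"
new_peptide = translate_until_stop(toy_sequence)

print(new_peptide)
```

In real curation the input would be the exon 18 sequence read from IGV for the verified transcript, starting at its first nucleotide.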

VEP predicted annotation

The Genome Nexus annotation API provides the VEP predicted annotation. For example: https://www.genomenexus.org/annotation/7:g.55248980_55248981insTCCAGGAAGCCT?fields=annotation_summary Replace the variant ID in the URL and check annotation_summary in the response for the VEP predicted annotation.
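
Building that URL for an arbitrary variant can be scripted. The sketch below follows the URL pattern shown above and nothing more; `annotation_url` and `fetch_annotation` are hypothetical helper names, and percent-encoding characters such as ">" is an assumption about what the server accepts. The fetch helper makes a network call and returns the parsed JSON without assuming its layout.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://www.genomenexus.org/annotation/"


def annotation_url(hgvsg, fields="annotation_summary"):
    """Build the annotation URL for an HGVSg string, following the pattern in this README."""
    # Keep ':' literal as in the README example; other reserved characters are percent-encoded.
    return f"{BASE_URL}{quote(hgvsg, safe=':')}?fields={fields}"


def fetch_annotation(hgvsg):
    """Fetch and parse the annotation JSON (requires network access)."""
    with urlopen(annotation_url(hgvsg), timeout=30) as response:
        return json.load(response)


print(annotation_url("7:g.55248980_55248981insTCCAGGAAGCCT"))
print(annotation_url("7:g.116412044G>A"))
```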

ReVUE variant count

Preparation:

The data used for counting come from mskimpact, tcga (all xxx_tcga_pan_can_atlas_2018), genie, and mskimpact_nonsignedout (look for data_nonsignedout_mutations.txt). Make sure all of these files are downloaded locally.

Script

The counts are generated by variant_count.py under ./scripts/. After filling in all other fields, run this command:

python variant_count.py

The script does the counting and writes the numbers directly into the JSON file.

Genome Nexus API

For a confirmed reVUE variant, the Genome Nexus API returns the following information in its response (example of an EGFR in-frame insertion):

"vues": {
    "hugoGeneSymbol": "EGFR",
    "genomicLocationDescription": "5 bases upstream from the 5' end of exon 20 (7:g.55248980_55248981insTCCAGGAAGCCT)",
    "defaultEffect": "splice",
    "comment": "Inset a repeated sequence from 55248980-55248992",
    "variant": "7:g.55248980_55248981insTCCAGGAAGCCT",
    "genomicLocation": "7,55248980,55248981,-,TCCAGGAAGCCT",
    "transcriptId": "ENST00000275493",
    "revisedProteinEffect": "p.A763_Y764insFQEA",
    "revisedVariantClassification": "Splice_Exon_Extension_In_Frame",
    "revisedVariantClassificationStandard": "In_Frame_Ins",
    "context": "Recurrent in lung cancer, can be linked to Level 1 TKIs",
    "vepPredictedProteinEffect": "p.X762_splice",
    "vepPredictedVariantClassification": "Splice_Region",
    "mutationOrigin": null,
    "references": [
        {
            "pubmedId": 31715539,
            "referenceText": "Sousa et al., 2020"
        }
    ],
    "confirmed": true
}
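
A client that reads this block might compare the VEP prediction with the revised values. The sketch below is one way to do that and is not Genome Nexus code: `effective_annotation` is a hypothetical helper, and the dictionary holds a subset of the fields from the example response above.

```python
def effective_annotation(vues):
    """Pick the revised values when the reVUE entry is confirmed, else the VEP prediction."""
    if vues.get("confirmed"):
        return {
            "proteinEffect": vues["revisedProteinEffect"],
            "variantClassification": vues["revisedVariantClassification"],
            "source": "reVUE",
        }
    return {
        "proteinEffect": vues.get("vepPredictedProteinEffect"),
        "variantClassification": vues.get("vepPredictedVariantClassification"),
        "source": "VEP",
    }


# Subset of the "vues" object from the EGFR example response.
vues = {
    "variant": "7:g.55248980_55248981insTCCAGGAAGCCT",
    "revisedProteinEffect": "p.A763_Y764insFQEA",
    "revisedVariantClassification": "Splice_Exon_Extension_In_Frame",
    "vepPredictedProteinEffect": "p.X762_splice",
    "vepPredictedVariantClassification": "Splice_Region",
    "confirmed": True,
}

print(effective_annotation(vues))
print(effective_annotation({**vues, "confirmed": False}))
```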