Victoria edited this page Apr 14, 2023 · 5 revisions

What does mapper do?

It merges protein variant data with related protein structures. Specifically, it generates a mapping for the IDs of choice (protein, transcript, gene, or variant IDs) using a conversion file provided by the user.

Input data

The mapper command line tool requires specific input data to generate the mapping of protein, transcript, gene or variant IDs. Here is a breakdown of the input data required:

  • Interfaces database directory path: This is a directory path to the interfaces database that was previously generated using the makestructuralsdb command line tool.
  • Variants database directory path: This is a directory path to the variants database that was previously generated using the makevariantsdb command line tool.
  • IDs of choice: This can be protein, transcript, gene, or variant IDs that the user wants to generate the mapping for.
  • ID mapping file: A CSV file that contains the conversion between protein, transcript, and gene IDs, and optionally APPRIS isoform IDs. The user creates this file manually, according to their data. The file must have three columns with fixed names: "geneID", "transcriptID", and "protID". Every ID that appears in the structural dataset and in the positions/variants files must be present in it. For instance, if the structural dataset was generated with a UniProt target proteome, the protID column would contain UniProt IDs. If the variants file was generated with VEP, the geneID and transcriptID columns would contain Ensembl IDs.
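
Because the three column names are fixed, it can help to check the header of the ID mapping file before running mapper. The following is a minimal sketch using only the Python standard library. The helper name `missing_columns` and the placeholder IDs are assumptions made for illustration; they are not part of the tool.

```python
import csv
import io

# Column names required by the ID mapping file, as listed above.
REQUIRED_COLUMNS = ("geneID", "transcriptID", "protID")


def missing_columns(handle):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Placeholder identifiers, for illustration only (not real IDs).
example = io.StringIO(
    "geneID,transcriptID,protID\n"
    "GENE_A,TRANSCRIPT_A1,PROT_A\n"
)
print(missing_columns(example))  # prints []
```

For a file on disk, the same function can be called on `open(path, newline="")`.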

It is essential to provide accurate input data to obtain the desired output from the mapper command line tool.

Input arguments

Input arguments of the program:

  • -pid, --prot-id: one or more protein, transcript, or gene IDs, provided on the command line or in a file
  • -vid, --var-id: one or more variant IDs, provided on the command line or in a file
  • -psdb: interfaces database directory (required)
  • -vdb, --vardb: variants database directory (required)
  • -o, --out: output directory (default: ./3dmapper_results)
  • --id_mapping: file that contains the conversion between protein, transcript, and gene IDs, and optionally APPRIS isoform IDs (required)
  • -i, --isoform: if APPRIS isoforms are available in the ID mapping file, filter by one or more of them. The principal isoform is selected by default. Options are: principal1, principal2, ...
  • -c, --consequence: filter by variant or position consequence type (default: None)
  • -d, --dist: maximum interface distance allowed, in angstroms. By default, the maximum value is the one selected in makeinterfacedb
  • --pident: threshold of sequence identity (percentage) (default: 20)
  • -e, --evalue: E-value threshold (default: None)
  • -f, --force: overwrite existing output (default: False)
  • -a, --append: allow two or more calls to the program to append results to the same output file (default: False)
  • -p, --parallel: parallelize process (default: False)
  • -j, --jobs: number of jobs to run in parallel (default: 1)
  • -v, --verbose: print progress (default: False)
  • -l, --location: map all variants and detect their location (default: False)
  • -csv: write the mapped data to a CSV file (default: False)
  • -hdf: write the mapped data to an HDF5 file using HDFStore (default: False)
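
The arguments above can be combined into a single call. The sketch below assembles such a command line from Python without running it. The flags are the ones listed above; the executable name `mapper` is taken from this page, while the helper `build_mapper_command`, the paths, and the IDs are placeholders. Passing several IDs to -pid separated by spaces is an assumption about how the tool parses its input.

```python
import shlex


def build_mapper_command(psdb, vardb, id_mapping, prot_ids,
                         out="./3dmapper_results", csv_output=True):
    """Assemble a mapper command line as a list of arguments (nothing is run)."""
    cmd = [
        "mapper",
        "-psdb", psdb,               # interfaces database directory (required)
        "-vdb", vardb,               # variants database directory (required)
        "--id_mapping", id_mapping,  # ID mapping CSV file (required)
        "-pid", *prot_ids,           # assumption: IDs are space-separated
        "-o", out,                   # output directory
    ]
    if csv_output:
        cmd.append("-csv")           # write the mapped data to CSV files
    return cmd


# Placeholder paths and IDs, for illustration only.
cmd = build_mapper_command("path/to/psdb", "path/to/vardb",
                           "id_mapping.csv", ["PROT_A", "PROT_B"])
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real databases.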

Output

The output of the mapper command depends on the selected format:

  • CSV: Generates 4 files, containing the following mapping results:

    • Interfaces: Contains the mapping results for protein interfaces.
    • Structures: Contains the mapping results for protein structures.
    • Unmapped: Contains the unmapped variants.
    • Non-coding: Contains the non-coding variants.

This output option is useful for keeping all the results of each type in a single file. However, when mapping many variants or using many structures, the files can become large and difficult to handle in downstream analysis.

  • HDF5: Generates 4 directories, each containing individual files per protein, transcript, gene, or variant ID. The directories and their contents are as follows:

    • Interface: Contains individual mapping results for protein interfaces.
    • Structure: Contains individual mapping results for protein structures.
    • Unmapped: Contains individual unmapped variants.
    • Non-coding: Contains the non-coding variants.

This output option is useful for large-scale mapping as it eases downstream analysis. HDF5 files can be quickly read using the vaex library in Python.
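
Since the HDF5 output is split into one file per ID, a first downstream step is often just to see which files were written. The sketch below lists the files in each subdirectory of the output directory using `pathlib`. It assumes the four directories sit directly inside the output directory; the helper name `list_results` is hypothetical, and the code makes no assumption about the directory or file names.

```python
from pathlib import Path


def list_results(out_dir):
    """Map each subdirectory of out_dir to the sorted names of the files in it."""
    out = Path(out_dir)
    if not out.is_dir():
        return {}  # nothing has been written yet
    results = {}
    for sub in sorted(p for p in out.iterdir() if p.is_dir()):
        results[sub.name] = sorted(f.name for f in sub.iterdir() if f.is_file())
    return results


# Default output directory from the argument list above.
for name, files in list_results("./3dmapper_results").items():
    print(f"{name}: {len(files)} file(s)")
```

Each listed file can then be opened individually with an HDF5-capable reader, so only the IDs of interest need to be loaded.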
