Primate Project Data

Robert J. Gifford edited this page Nov 27, 2024 · 1 revision

GLUE Project Components

  1. Full Genome Reference Sequences:
    A curated set of full genome reference sequences, each with associated metadata.

  2. Alignment of Reference Sequences:
    An alignment of the full genome reference sequences, providing a standardized framework for comparative analysis.

  3. Genome Feature Definitions:
    A set of defined genome features relevant to HIV-1.

  4. Coordinate Mapping:
    A list of coordinates linking genome features to specific positions within at least one reference genome.

  5. Phylogenetic 'Alignment Tree':
    A tree that defines phylogenetic relationships between different HIV-1 clades.


Review of LANL Sequence Choices

We are assessing the LANL sequence selections to determine whether every sequence is needed for our project:

  • Metadata Considerations: Reviewing the GenBank metadata to verify sequence reliability.
  • Patent Sequence: One CRF sequence is labeled as a patent sequence. We need to evaluate its reliability.
  • Rare Subtypes: Subtypes K and H are rare and may not be sampled again, so including them in a minimal project may add little value. This warrants further discussion.
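
As a rough aid to the metadata review, the sketch below flags records whose GenBank LOCUS line carries the PAT (patent) division code. It assumes the "patent sequence" label corresponds to GenBank's PAT division and that records are available as flat-file text. The function names and the two example records are hypothetical and for illustration only.

```python
def locus_division(genbank_text):
    """Return the division code from the LOCUS line of a GenBank flat-file record.

    Assumes the usual LOCUS layout, in which the division code is the
    second-to-last whitespace-separated field and the date is the last.
    Returns None if no usable LOCUS line is found.
    """
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) >= 3:
                return fields[-2]
            return None
    return None


def is_patent_record(genbank_text):
    """True if the record sits in GenBank's patent (PAT) division."""
    return locus_division(genbank_text) == "PAT"


# Made-up records for illustration; these are not real accessions.
example_patent = "LOCUS       EXAMPLE01    9000 bp    DNA     linear   PAT 01-JAN-2000\n"
example_viral = "LOCUS       EXAMPLE02    9000 bp    DNA     linear   VRL 01-JAN-2000\n"

flagged = [name for name, text in
           [("EXAMPLE01", example_patent), ("EXAMPLE02", example_viral)]
           if is_patent_record(text)]
```

A flagged record is only a prompt for manual review; the division code alone says nothing about sequence quality.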

Alignment Creation and Manual Adjustments

An initial alignment of reference sequences was generated using MUSCLE (Edgar), then manually adjusted in [specify alignment viewing software].

Manual adjustments included:

  1. In-Frame Coding: Ensuring that all coding genes remain in-frame by removing frameshifting indels.
  2. Codon Grouping: Keeping nucleotides grouped into codons wherever possible, by rejoining sets of three nucleotides that the alignment software had split.
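
The two adjustments above can be checked automatically. The following is a minimal sketch, not part of GLUE, that scans one aligned coding region for gap runs that would shift the frame (length not a multiple of three) or split a codon (run not starting on a codon boundary). It assumes the input string begins at the first codon position of the gene and uses `-` for gaps; the function names are hypothetical.

```python
def gap_runs(aligned_seq, gap="-"):
    """Return (start, length) for each run of gap characters (0-based start)."""
    runs = []
    start = None
    for i, ch in enumerate(aligned_seq):
        if ch == gap:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(aligned_seq) - start))
    return runs


def frame_problems(aligned_seq, gap="-"):
    """List gap runs that break the reading frame or split a codon.

    A run whose length is not a multiple of three is reported as a
    frameshift. A run of valid length that does not start on a codon
    boundary (column index divisible by three) is reported as a split codon.
    """
    problems = []
    for start, length in gap_runs(aligned_seq, gap):
        if length % 3 != 0:
            problems.append((start, length, "frameshift"))
        elif start % 3 != 0:
            problems.append((start, length, "split codon"))
    return problems


clean = frame_problems("ATG---AAA")    # gap of 3 on a codon boundary
shifted = frame_problems("ATG--AAAA")  # gap of 2 shifts the frame
split = frame_problems("AT---GAAA")    # gap of 3 inside a codon
```

Running a check like this over each coding region after manual editing gives a quick list of columns that still need attention.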

Locating Genome Features on Reference Sequences

GLUE requires genome features to be mapped to specific coordinates on at least one reference sequence. We annotated genome features on the following references:

  1. HXB2 (Subtype B): The primary reference used in most epidemiological and clinical studies.
  2. NL43 (Subtype B): Commonly used in laboratory studies, providing a practical alternative reference.
  3. Subtype C Reference: Subtype C accounts for a large share of global HIV-1 infections, particularly in sub-Saharan Africa, where it contributes significantly to morbidity and mortality. Many transmission pair sequences are also subtype C, which further justifies its inclusion.

Having multiple annotated references lets users choose between HXB2, NL43, and a subtype C reference, so the coordinate frame can match the context of a given study.
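
To illustrate what mapping a feature to reference coordinates involves, here is a small sketch that converts 1-based positions on an ungapped reference into columns of the alignment. It is not GLUE's own implementation or API; the function names and the toy sequence are assumptions made for illustration.

```python
def reference_to_column_map(aligned_ref, gap="-"):
    """Map 1-based ungapped reference positions to 0-based alignment columns."""
    mapping = {}
    pos = 0
    for col, ch in enumerate(aligned_ref):
        if ch != gap:
            pos += 1
            mapping[pos] = col
    return mapping


def feature_columns(aligned_ref, start, end, gap="-"):
    """Return the alignment columns of a feature's first and last reference positions.

    start and end are 1-based, inclusive coordinates on the ungapped reference.
    """
    mapping = reference_to_column_map(aligned_ref, gap)
    return mapping[start], mapping[end]


# Toy aligned reference row; gaps mark insertions present in other sequences.
toy_reference_row = "AC--GTA-C"
columns = feature_columns(toy_reference_row, 2, 5)
```

Because each annotated reference has its own row in the alignment, the same feature can be located through whichever reference a study prefers.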
