
NCBI Filter Background

Robert J. Gifford edited this page Oct 20, 2024 · 9 revisions

Large quantities of influenza virus genome sequence data are publicly available via GenBank. However, GenBank serves as a comprehensive repository for genetic information across all domains of life and was not developed specifically for viral data. Consequently, influenza virus researchers face a number of data integration and comparison challenges when using GenBank, including:

  • Separate Entries for Each Segment: GenBank stores sequences for each influenza genome segment separately, which can make it cumbersome for researchers to access and analyze all relevant genomic information for a particular strain.
  • Lack of Standardized Isolate Information: GenBank lacks a standardized format for recording isolate-associated information, such as the geographical location of sample collection, host species, date of isolation, and clinical data. As a result, researchers may encounter inconsistent or incomplete information across different entries. This hampers efforts to conduct comprehensive analyses, track viral spread, and understand the epidemiology of influenza strains.
  • Quality Control and Data Verification: With the vast amount of data submitted to GenBank, maintaining quality control and verifying the accuracy of submitted sequences and associated metadata can be an overwhelming task. Inaccurate or poorly annotated entries may lead to erroneous conclusions in research studies and impede the progress of influenza virus research.

Isolate databases

| Influenza Virus Species | Sequence Entries File | Complete Genome Isolates | Incomplete Genome Isolates |
|---|---|---|---|
| Influenza A virus | Sequence Entries | Complete Genome Isolates | Incomplete Genome Isolates |
| Influenza B virus | Sequence Entries | Complete Genome Isolates | Incomplete Genome Isolates |
| Influenza C virus | Sequence Entries | Complete Genome Isolates | Incomplete Genome Isolates |
| Influenza D virus | Sequence Entries | Complete Genome Isolates | Incomplete Genome Isolates |

GenBank Filtering Tools

Flu-GLUE includes tools specifically designed to process and filter influenza virus sequence data from GenBank. These tools bring consistent structure to the data by linking sequences to isolates, standardizing metadata, and validating the data:

  1. GLUE Projects for each influenza species: These projects can download influenza sequence data from GenBank, extract metadata, validate data fields, and rapidly perform genotyping across genome segments.
  2. Console-based Perl program: Processes cleaned, validated influenza virus sequence data, checking metadata consistency and selecting the best representative sequences. It generates GLUE module definitions and console code for selective import into Flu-GLUE.
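
To illustrate the kind of metadata extraction and validation described above, the following Python sketch parses a strain name written in the standard influenza nomenclature (type/host/location/strain number/year, with the host omitted for human isolates) into separate fields. It is not Flu-GLUE code: the function name `parse_strain_name`, the regular expression, and the returned field names are assumptions made for this example, and names that do not fit the pattern are simply rejected.

```python
import re
from typing import Optional

# Assumed pattern: a name body, optionally followed by a subtype such as "(H1N1)".
STRAIN_RE = re.compile(r"^(?P<body>[^()]+?)\s*(?:\((?P<subtype>H\d+N\d+)\))?$")


def parse_strain_name(name: str) -> Optional[dict]:
    """Split a strain name into fields, or return None if it does not validate."""
    match = STRAIN_RE.match(name.strip())
    if match is None:
        return None
    parts = [part.strip() for part in match.group("body").split("/")]
    if len(parts) == 4:
        # The host is conventionally omitted for human isolates.
        virus_type, location, strain_id, year = parts
        host = "human"
    elif len(parts) == 5:
        virus_type, host, location, strain_id, year = parts
    else:
        return None
    # Validate the fields this sketch can check: virus type and a four-digit year.
    if virus_type not in {"A", "B", "C", "D"} or not re.fullmatch(r"\d{4}", year):
        return None
    return {
        "type": virus_type,
        "host": host.lower(),
        "location": location,
        "strain_id": strain_id,
        "year": int(year),
        "subtype": match.group("subtype"),  # None when no subtype is given
    }
```

Calling `parse_strain_name("A/swine/Iowa/15/1930(H1N1)")` yields a record with host `swine` and year `1930`, while a name with a two-digit year is rejected. A real pipeline would likely route rejected names to manual review or to additional rules instead of discarding them.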

The GenBank filtering tools enhance the usability of influenza virus data from GenBank by:

  • Capturing Links Between Sequences and Isolates: Establishing connections between sequences and their corresponding isolates to ensure data consistency.

  • Validating and Standardizing Metadata: Addressing issues like redundancy, variable data quality, and non-standard definitions by validating and standardizing sequence-associated metadata.

  • Segment Recognition and Genotyping: Independently confirming the segment origin of sequences and performing genotyping, which is crucial for influenza virus classification.

  • Redundancy Management: Handling redundant sequences by selecting the best representative for each isolate segment.

  • Incomplete Isolate Identification: Identifying isolates with missing segment sequences and exporting this information for further analysis.
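
The redundancy management and incomplete isolate identification steps above can be sketched in Python as follows. This is a minimal illustration, not the Flu-GLUE implementation: the record fields (`accession`, `isolate`, `segment`, `length`) and both function names are hypothetical, and choosing the longest sequence as the "best representative" is an assumption made for the example. The segment counts reflect the eight-segment genomes of influenza A and B viruses and the seven-segment genomes of influenza C and D viruses.

```python
from collections import defaultdict

# Number of genome segments per influenza virus species.
SEGMENT_COUNT = {"A": 8, "B": 8, "C": 7, "D": 7}


def select_representatives(records):
    """Group records by isolate and keep one record per segment.

    Assumption for this sketch: the longest sequence is the best representative.
    """
    best = defaultdict(dict)
    for rec in records:
        current = best[rec["isolate"]].get(rec["segment"])
        if current is None or rec["length"] > current["length"]:
            best[rec["isolate"]][rec["segment"]] = rec
    return dict(best)


def find_incomplete_isolates(best, species):
    """Map each incomplete isolate to the sorted list of segments it lacks."""
    expected = set(range(1, SEGMENT_COUNT[species] + 1))
    incomplete = {}
    for isolate, segments in best.items():
        missing = expected - set(segments)
        if missing:
            incomplete[isolate] = sorted(missing)
    return incomplete
```

Given a list of record dictionaries, `select_representatives` returns a mapping from isolate to segment number to the chosen record, and `find_incomplete_isolates(best, "A")` reports which isolates lack one or more of the eight expected segments. In practice the selection criterion could also weigh factors such as sequence quality or annotation completeness.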

Via this filtering process, Flu-GLUE introduces a higher level of order to influenza virus sequence data, allowing researchers to efficiently process and analyze large datasets with improved accuracy.