GeneFamily: A Comprehensive Mammalian Gene Family Database with Extensive Annotation and Interactive Visualization
Gene families underlie genetic innovation and phenotypic diversification. However, identifying, analyzing, and interpreting gene families depends on custom bioinformatics and visualization workflows that are largely inaccessible to non-expert users. To reduce the learning curve associated with analytical tools and the time spent on data collection and processing, we present GeneFamily (https://gh.deepomics.org/), a comprehensive mammalian gene family database that incorporates extensive annotations and interactive visualizations. Built from whole-genome data for 138 mammalian species and Pfam-A hidden Markov models, GeneFamily encompasses 2,036 gene families. Users can explore gene family distributions, phylogenetic trees, member diversities, sequence identities, structural features, conserved motifs, chromosomal localizations, neighboring regions, gene collinearities, duplication events, and adaptive signatures via an intuitive web interface. GeneFamily reduces technical barriers and accelerates research into gene family evolution, functional diversification, and comparative genomics across mammals.
- Anaconda
- Python 3.9
- pandas >= 2.0.3
- tqdm >= 4.66.5
- scipy >= 1.10.1
- numpy >= 1.24.3
- perl >= 5.10.0
IsoformFilter.py filters protein isoforms so that only one representative protein sequence is retained for each gene. A single gene may yield multiple protein isoforms through alternative splicing and other transcript-level variation. Because these isoforms are not independent evolutionary units, one representative sequence per gene is selected for evolutionary analysis.
- Protein sequence files (.pep format) from Ensembl mammalian database
- Initial gene family search results: Feature1.all.txt
- Filtered results file: Feature.all.txt
- Group protein sequences by gene ID and species information
- Sort protein sequences within each group by length in descending order
- Select the longest protein sequence in each group as the gene representative
- Filter out remaining isoform sequences
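
The steps above can be sketched with pandas. This is a minimal illustration, not the code of IsoformFilter.py: the function name and the column names (`species`, `gene_id`, `protein_id`, `sequence`) are assumptions made for the example, and ties in length are resolved by keeping the first record encountered.

```python
import pandas as pd

def keep_longest_isoform(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the longest protein per (species, gene_id) group (hypothetical helper)."""
    out = df.assign(length=df["sequence"].str.len())
    # Sort by length, longest first; a stable sort keeps input order among ties.
    out = out.sort_values("length", ascending=False, kind="stable")
    # After sorting, the first row seen for each group is its longest isoform.
    out = out.drop_duplicates(subset=["species", "gene_id"], keep="first")
    # Restore the original row order and drop the helper column.
    return out.sort_index().drop(columns="length")

# Toy records; column names are assumed, not taken from the real files.
records = pd.DataFrame({
    "species": ["human", "human", "human", "mouse"],
    "gene_id": ["G1", "G1", "G2", "G1"],
    "protein_id": ["P1", "P2", "P3", "P4"],
    "sequence": ["MKT", "MKTAYIAK", "MSD", "MKTA"],
})
representatives = keep_longest_isoform(records)
print(representatives["protein_id"].tolist())
```

Grouping on both species and gene ID matters because the same gene identifier scheme is not guaranteed to be unique across species in a merged table.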
This module counts and organizes gene family member information. It assigns the protein sequences filtered by IsoformFilter.py to their corresponding gene symbols (the members of each gene family) and builds a mapping from each gene family to its members.
- Gene family data file processed by IsoformFilter.py: Feature1.all.txt
{
"gene_family_name": ["member1", "member2", ...]
}
- Read filtered protein sequence data
- Match corresponding gene symbols for each sequence
- Mark sequences with missing gene symbols as "UNKNOWN"
- Remove duplicates from each gene family member list
- Generate mapping dictionary from gene family to member list
- Establish a clear mapping of gene family member relationships
- Facilitate subsequent gene family analysis and evolutionary research
- Ensure uniqueness and traceability of each gene family member
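
A minimal sketch of the mapping step follows, assuming the filtered table exposes a family column and a gene symbol column. The names `gene_family`, `gene_symbol`, and `build_family_members` are hypothetical and chosen only for this example; the output follows the `"gene_family_name": ["member1", ...]` shape shown above.

```python
import json
import pandas as pd

def build_family_members(df: pd.DataFrame) -> dict:
    """Map each gene family to its de-duplicated member symbols (hypothetical helper)."""
    # Sequences without a gene symbol are labeled "UNKNOWN".
    symbols = df["gene_symbol"].fillna("UNKNOWN")
    members = {}
    for family, symbol in zip(df["gene_family"], symbols):
        bucket = members.setdefault(family, [])
        # Remove duplicates while preserving first-seen order.
        if symbol not in bucket:
            bucket.append(symbol)
    return members

# Toy input; the column names are assumptions.
table = pd.DataFrame({
    "gene_family": ["Homeobox", "Homeobox", "Homeobox", "Globin"],
    "gene_symbol": ["HOXA1", "HOXA1", None, "HBB"],
})
family_members = build_family_members(table)
print(json.dumps(family_members, indent=2))
```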
This script analyzes the distribution of gene families across different species, generates gene family diversity data, and provides foundational data support for evolutionary analysis and inter-species comparison.
- Gene family data file processed by IsoformFilter.py: Feature1.all.txt
Gene family distribution statistics across species, including:
- Number of members of specific gene families in each species
- Species distribution patterns of gene families
- Differences in gene family member numbers between species
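
As an illustration of these statistics, the sketch below counts members per family and species with `pandas.crosstab`. It is not the script itself; the column names and the helper `family_by_species` are assumptions made for the example.

```python
import pandas as pd

def family_by_species(df: pd.DataFrame) -> pd.DataFrame:
    """Count gene family members per species (hypothetical helper)."""
    # Count each gene once per family and species.
    unique = df.drop_duplicates(subset=["gene_family", "species", "gene_id"])
    # Rows are families, columns are species; absent combinations become 0.
    return pd.crosstab(unique["gene_family"], unique["species"])

# Toy input with assumed column names.
hits = pd.DataFrame({
    "gene_family": ["Globin", "Globin", "Globin", "Homeobox"],
    "species": ["human", "human", "mouse", "mouse"],
    "gene_id": ["G1", "G2", "G3", "G4"],
})
counts = family_by_species(hits)
print(counts)

# Difference in member numbers between two species, per family.
diff = counts["human"] - counts["mouse"]
print(diff)
```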
This pipeline contains two scripts that extract and organize gene structure information for gene family members, supporting comparative analysis of gene structures.
- Filtered gene family data: Feature1.all.txt
- Ensembl mammalian database GFF files
- Gene structure information categorized by species for the gene family
- Extract structure information of target genes from GFF files
- Parse exon and intron location information
- Extract transcript structure features
- Collect gene structure-related annotation information
- Gene structure information output from Step 1
- Gene family member information and species distribution information
- Gene structure information organized by gene family members
- Gene structure information organized by gene family species distribution
- Group gene structure information by species
- Classify by gene family members
- Generate structured data files
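
The exon and intron parsing described for the first script can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the pipeline's code: it assumes tab-separated GFF3 rows with nine columns, exon rows whose `Parent` attribute looks like `transcript:<ID>`, and a single parent per exon. All function names are hypothetical.

```python
from collections import defaultdict

def parse_attributes(field):
    """Turn 'key=value;key=value' from GFF3 column 9 into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}

def exons_by_transcript(gff_lines, wanted):
    """Collect sorted exon (start, end) spans for the wanted transcript IDs."""
    exons = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Assumed format: Parent=transcript:<ID>; keep only the ID part.
        parent = parse_attributes(cols[8]).get("Parent", "").split(":")[-1]
        if parent in wanted:
            exons[parent].append((int(cols[3]), int(cols[4])))
    return {tid: sorted(spans) for tid, spans in exons.items()}

def introns_from_exons(spans):
    """Introns are the gaps between consecutive exons sorted by start."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
            if next_start - prev_end > 1]

# Toy GFF3 rows; the transcript IDs are made up.
gff = [
    "##gff-version 3",
    "1\tensembl\texon\t300\t400\t.\t+\t.\tParent=transcript:T1",
    "1\tensembl\texon\t100\t200\t.\t+\t.\tParent=transcript:T1",
    "1\tensembl\texon\t500\t600\t.\t+\t.\tParent=transcript:T2",
]
structures = exons_by_transcript(gff, wanted={"T1"})
introns = introns_from_exons(structures["T1"])
print(structures, introns)
```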
This script extracts the genomic locations of gene family members and combines them with chromosome-level gene density data to support analysis of gene distribution patterns.
- Filtered gene family data: Feature1.all.txt
- Ensembl mammalian database GFF files
- Chromosome length information files
- Chromosome gene density data
- Chromosome position information of gene transcripts
- Chromosome gene density distribution data
- Process input parameters (project ID, species list, etc.)
- Validate working directory
- Create necessary working directories
- Extract species list
- Generate transcript ID list
- Parse gene location information from GFF files
- Extract chromosome coordinates of transcripts
- Copy chromosome length information
- Integrate gene density data
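
One common way to obtain chromosome gene density is to count gene start positions in fixed-size windows along each chromosome. The sketch below shows that idea only; the source does not state how density is computed, so the 1 Mbp window size and the function name `gene_density` are assumptions.

```python
from collections import Counter

def gene_density(gene_starts, chrom_length, window=1_000_000):
    """Count genes per fixed-size window (hypothetical helper, 1-based coordinates)."""
    # Number of windows needed to cover the chromosome (ceiling division).
    n_windows = -(-chrom_length // window)
    # Convert each 1-based start position to a 0-based window index.
    counts = Counter((start - 1) // window for start in gene_starts)
    return [counts.get(i, 0) for i in range(n_windows)]

# Toy data: five gene starts on a 3 Mbp chromosome.
density = gene_density([1, 500_000, 1_000_000, 1_000_001, 2_500_000], 3_000_000)
print(density)
```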
- Process GFF format genome annotation files
- Extract gene and transcript structure information
- Establish gene position index for each species
- Ensembl mammalian database GFF files
- Filtered gene family transcript structure information
- Species-specific gene block data (.pkl format)
- Gene structure information cache files
- Analyze genomic neighboring regions (±1 Mbp around each target gene)
- Extract genomic structure around target genes
- Organize data by species and gene family members
- Gene structure information data corresponding to gene family
- Species-specific gene block data (.pkl format)
- Gene structure information cache files
- Gene neighborhood structure information organized by gene family members
- Gene neighborhood structure information organized by gene family species distribution
- Read gene family information
- Parse GFF annotation files
- Establish gene position index
- Determine the ±1 Mbp search window
- Extract gene information in target regions
- Collect structure features of neighboring genes
- Group by species
- Group by gene family members
- Generate JSON format output
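
The window search above can be sketched as follows. This is a simplified illustration, not the pipeline's implementation: the index layout, the function names, and the decision to count any gene that overlaps the window (including the target gene itself) are assumptions.

```python
import json

def build_index(genes):
    """Group (chrom, start, end, gene_id) records by chromosome, sorted by start."""
    index = {}
    for chrom, start, end, gene_id in genes:
        index.setdefault(chrom, []).append((start, end, gene_id))
    for entries in index.values():
        entries.sort()
    return index

def neighbors(index, chrom, start, end, flank=1_000_000):
    """Return IDs of genes overlapping [start - flank, end + flank] on chrom."""
    lo, hi = max(1, start - flank), end + flank
    return [gid for s, e, gid in index.get(chrom, []) if s <= hi and e >= lo]

# Toy annotation for one species; gene IDs and coordinates are made up.
human_genes = [
    ("1", 100, 200, "A"),
    ("1", 900_000, 950_000, "B"),
    ("1", 1_500_000, 1_600_000, "C"),
    ("1", 3_000_000, 3_100_000, "D"),
]
index = build_index(human_genes)
nearby = neighbors(index, "1", 1_500_000, 1_600_000)

# Organize by species, then by gene family member, and emit JSON.
output = {"human": {"C": nearby}}
print(json.dumps(output))
```

A linear scan keeps the example short; for whole-genome annotations, a sorted index combined with binary search or an interval tree would avoid scanning every gene on the chromosome.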
This script performs statistical analysis of gene family collinearity results between species. It classifies collinear gene pairs by whether their gene symbols are conserved and quantifies that conservation, which helps reveal the evolutionary dynamics and functional differentiation patterns of gene families.
- Gene family data after isoform filtering: Feature1.all.txt
- Inter-species collinearity analysis results (from MCScanX analysis)
- {gene_family_name}.links.flow.txt
- Records collinear gene pairs with identical gene symbols
- {gene_family_name}.links.flow.unknown.txt
- Records collinear gene pairs with unknown or different gene symbols
- {gene_family_name}.species.links.flow.txt
- Statistics of gene family member collinearity distribution in each species
- Read Feature.all.txt file
- Standardize mRNA_ID (remove version numbers)
- Process Gene_symbol (convert to uppercase, fill NA with "UNKNOWN")
- Establish mRNA_ID to Gene_symbol mapping dictionary
- Filter .link.txt files
- Filter tandem repeat (tandem.link.txt) files
- Extract inter-species collinearity relationships
- Same gene symbol pairs (same_genemembers)
- Unknown gene symbol pairs (diff_unknown_genemembers)
- Different gene symbol pairs (diff_genesymbol_genemembers)
- Establish inter-species connection counts (sp_counts)
- Calculate gene symbol frequency in each species (sp_genecounts)
- Generate species gene member matrix (sp_genemembers)
- Calculate percentage of completely conserved gene symbols between species
- Identify highly conserved gene family members
- Evaluate degree of gene function conservation
- Quantify collinearity relationship strength between different species
- Analyze distribution characteristics of collinear blocks
- Study conservation level of gene arrangement
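
The ID standardization and pair classification can be illustrated with a short sketch. The bucket names `same_genemembers`, `diff_unknown_genemembers`, and `diff_genesymbol_genemembers` come from the list above; everything else is an assumption for the example, including the helper names, the rule that a pair with at least one unknown symbol goes into the unknown bucket, and the toy IDs.

```python
import re

def strip_version(mrna_id):
    """Remove a trailing version suffix such as '.2' from an mRNA ID."""
    return re.sub(r"\.\d+$", "", mrna_id)

def normalize_symbol(symbol):
    """Uppercase the gene symbol; missing values become 'UNKNOWN'."""
    if symbol is None:
        return "UNKNOWN"
    text = str(symbol).strip().upper()
    return "UNKNOWN" if text in {"", "NA", "NAN"} else text

def classify_pairs(pairs, id_to_symbol):
    """Sort collinear mRNA pairs into the three symbol-conservation buckets."""
    buckets = {
        "same_genemembers": [],
        "diff_unknown_genemembers": [],
        "diff_genesymbol_genemembers": [],
    }
    for left, right in pairs:
        a = id_to_symbol.get(strip_version(left), "UNKNOWN")
        b = id_to_symbol.get(strip_version(right), "UNKNOWN")
        if "UNKNOWN" in (a, b):  # assumed rule: any unknown symbol wins
            key = "diff_unknown_genemembers"
        elif a == b:
            key = "same_genemembers"
        else:
            key = "diff_genesymbol_genemembers"
        buckets[key].append((a, b))
    return buckets

# Toy annotation rows (mRNA_ID, Gene_symbol) and collinear pairs; IDs are made up.
rows = [("ENST1.2", "hoxa1"), ("ENSMUST1.1", "Hoxa1"), ("ENST2.1", None),
        ("ENSMUST2.3", "Hoxb2"), ("ENST3.1", "HOXA2")]
id_to_symbol = {strip_version(i): normalize_symbol(s) for i, s in rows}
pairs = [("ENST1.2", "ENSMUST1.1"), ("ENST2.1", "ENSMUST2.3"), ("ENST3.1", "ENSMUST2.3")]

buckets = classify_pairs(pairs, id_to_symbol)
# Percentage of collinear pairs whose gene symbols are fully conserved.
conserved_pct = 100 * len(buckets["same_genemembers"]) / len(pairs)
print(buckets, round(conserved_pct, 1))
```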