2020年8月28日MER在线发表中国农科院植保所王桂荣课题组关于绿盲蝽基因组测序的文章。本文主要从基因组层面揭示绿盲蝽的多食性以及食叶特性。
绿盲蝽属于半翅目盲蝽科昆虫,基因组大小为1.02Gb,contig N50为785kb,scaffold N50为68Mb,1016Mb的contig组装成17个大的scaffold,对应17条染色体。其巨大的基因组与转座子的扩增有关。绿盲蝽属于杂食性昆虫,主要取食叶肉。与消化,化学感受以及解毒相关的基因家族的扩增与其杂食性的特性有关。研究人员发现其唾液腺中分泌一种特异性的多聚半乳糖醛酸酶(polygalacturonase),这与其主要取食叶肉汁液的习性可能相关。
inbred strain:12 generations
Illumina sequencing:a female adult
PacBio sequencing:100 female siblings
contig assembly
Canu:(ovsMethod = sequential genomeSize = 1g):18,403 contigs of total length ~1.97 Gb, contig N50 of 259 Kb.
Three rounds of contig polishing
- The first round:contigs were polished using PacBio reads with the Arrow consensus caller in smrt-link . The original bam files generated from PacBio Sequel were aligned with contig assembly by pbalign. Then, using arrow , we polished the assembly.
- The second and third rounds: filter out adaptors and low quality sequences in raw Illumina reads. The clean data were mapped to the contigs using bwa and the assembly errors were corrected using pilon.
To filter haplotypic duplication
purge _ dups on the polished assembly: purged primary assembly of total length 1.03 Gb, contig N50 of 785 kb and a haplotig assembly of total length 936 Mb, contig N50 of 88 kb.
To assess the completeness of genome assembly, we run busco using the insecta database (OrthoDB version 9).
we used clean Illumina data to filter possible contaminations in assembly. We used bwa to align clean Illumina data with the assembly and if any contig had an Illumina coverage rate lower than 5%, it was removed.
After quality control, clean Hi-C paired-end reads were first mapped to the contig assembly by bowtie2, and then hic-pro used the alignment to detect valid alignments and filter multiple hits and singletons. Finally, lachesis was used to cluster, order and orient the contigs.
-
The 20 samples included spawn, eight tissues (antenna, mouthpart, salivary gland, head, gut, leg, wing, body) from third instar nymphs, and 11 tissues (male antenna, female antenna, mouthpart, salivary gland, head, gut, leg, wing, male genital, female genital, body) from adults.
-
RNA samples from the whole body of A. lucorum in ==six different developmental stages== including ==first to fifth nymph==, and ==adult== were also prepared for full-length transcriptome sequencing using the PacBio Iso-Seq protocol.
基因结构注释
- 重复序列
Tandem repeats were identified by tandem repeats finder
Transposable elements (TEs)
-
searching against the TE database (dfam 3.0, RepBase) by repeatmasker and searching against the TE protein database by repeatproteinmask.
-
constructing a de novo repeat library by repeatmodeler, followed by repeatmasker to find TE repeats.
- 蛋白编码基因
The gene models in A. lucorum were predicted using augustus on the TE soft-masked genome, integrating evidence from RNA sequencing alignments, Isoform sequencing alignments and protein homology searches.
- RNA sequencing alignments
20 paired-end datasets from different tissues were aligned with the genome using star. After filtering by filterBam in augustus, the sorted bam file was transferred to a hints file by bam2hints in augustus.
- Isoform sequencing alignments:
we used Iso-Seq to assist in gene prediction. gmap was used to align Iso-Seq sequences with the genome and blat2hints.pl in augustus was used to generate a hints file.
- Protein homology evidence
all Hemiptera proteins in NCBI RefSeq were download. We aligned the proteins with the genome by tblastn using 1e−5 as cutoff and filtered those with less than 50% identity.
We used exonerate to align the remaining proteins with the genome and used exonerate2hints.pl in augustus to generate a hints file.
Finally, we combined all hints files from RNA-Seq, Iso-Seq and protein homology, and used augustus with the combined hints file to predict gene models, resulting in 23,106 gene models.
To get accurate gene sets, we filtered genes with less than 35 amino acids. We aligned protein sequences of gene models with the NR database in diamond blastp using 1e−5 as a cutoff, and 16,187 gene models had homologous proteins in NR.
We also aligned 20 RNA-Seq data sets with coding sequences of gene models in bwa and 17,953 gene models had a coverage rate higher than 95%. We retained genes that either had homologous proteins in NR or had RNA support, resulting in 20,386 genes. After that, we detected 33 genes with two or more errors in start codons, stop codons or nontriplet length. We filtered those wrong genes and got the final official gene set (OGS) including 20,353 gene models.
基因功能注释
we aligned the protein sequences of genes with kegg, eggnog, nr, swiss-prot databases by diamond, using 1e−5 as a cutoff and got the best hit. We also used interproscan to search interpro databases to find motifs and domains. Taken together, ==18,721 (91.98%) genes had homologous information in those databases==, indicating that the OGS is reasonably accurate.
Moreover, trnascan-se was used to find tRNAs with default parameters.
Nine sequenced hemipteran insects and Drosophlia melanogaster as an outgroup were used to infer gene
orthology in OrthoFinder.
Phylogenetic tree and gene orthology results were displayed and annotated using Evolview.
Expanded orthologous groups in A. lucorum were determined using a rank sum test compared to other eight insects in Hemiptera.
Protein sequences of single copy genes from each species were aligned in muscle, then concatenated into one super-sequence. PhyML was used to reconstruct the phylogenetic tree based on the concatenated super-sequence with the LG + I + G + F model. Divergence times among species were calculated in mcmctree (paml package). Calibration times were set according to a previous paper, minimum = 320 Ma and maximum = 390 Ma for D. melanogaster and A. lucorum.
GO (Gene Ontology) annotation results were obtained from Interpro. GO enrichment analysis was performed using the OmicShare tools.
The reciprocal BLAST best hit was used to calculate the synonymous mutation rate (Ks) by kaks _ calculator 2.0 with default parameters.
Duplicate_gene_classifier in MCscanX was implemented to classify the origins of the duplicated genes into different types.
A set of described Hemiptera odorant receptors (ORs) and gustatory receptors (GRs) was used to search the A. lucorum gene sets by blastp with the cutoff e-value 1e−5.
Multiple PSI-BLASTP searches were initiated with divergent ORs and GRs to find any additional annotated proteins that might belong to these families, and up to four iterations were used. Finally, some ORs and GRs were corrected manually.
Ionotropic receptors (IRs), digestive enzymes, and detoxification enzymes were annotated using diamond results compared to the nr database, uniprot database and kegg database with e-value 1e−5 and confirmed by InterProScan or eggNOG.
To get a complete gene family set, reannotation of the gene family was performed. First, all digestive enzyme, chemosensory receptor, and detoxification enzyme genes got from former gene set was mapped to eight Hemiptera genomes by exonerate with identity >35%, and exonerate2hints.pl was used to generate a hints file.
Then, the region where these genes can map was used to predict gene models by augustus with the hints file, and short gene models (less than 200 bp) were filtered. The predicted gene model that doesn't exist in former gene set was added to the gene family sets.
- In A. lucorum, LTR (98 Mb), LINE (73 Mb) and DNA (88 Mb) elements are the major types of TEs, and LTR is considerably in excess of that from other compared insects.
-
The phylogeny showed that A. lucorum diverged from C. lectularius about 168 million years ago (Ma) and from A. pisum about 275 MYA.
-
Gene ontology analyses observed significant enriched GO terms involved in odorant recognition, including sensory perception of smell (GO: 0007608) and sensory perception of chemical stimulus (GO:0007606; Figure S9), which provided clues for the extremely broad host plant ranges of A. lucorum.
-
enriched GO terms associated with digestion in A. lucorum were also observed, such as hydrolase activity, acting on glycosyl bonds (GO: 001698), hydrolase activity, hydrolysing O-glycosyl compounds (GO: 0004553) and polygalacturonase (PG) activity (GO: 0004650).
PG is an essential enzyme for digestion, which hydrolyse spectin substances and then destroys plant cell walls.
- These expanded genes could play an important role in the severe damage on a wide range of plants, as PGs can hydrolyse the pectin substances and then destroy the plant cell walls and ORs could promote the pest search for host.
- Novel genes were mostly generated from gene duplication, which is recognized as a driving force of evolution.
- Using a within-genome reciprocal best blast hit, 2,609 paralogue pairs were identified, and distribution of synonymous distances (Ks values) showed that 1,502 (58%) paralogue pairs had a Ks value smaller than 0.3, suggesting that most gene duplications possibly occurred in a recent period.
- recent duplicated genes in A. lucorum are mostly derived from small local scale gene duplications, instead of whole genome duplications.
-
A. lucorum had a comprehensive digestive enzyme spectrum, with a unique group of polygalacturonase (PGs) and a significantly expanded group of serine proteases(SPs).
-
PG is a group of plant cell wall-degrading enzyme, ubiquitous in fungi, bacteria, and plants. It is also found in Hemiptera and Coleoptera, predicted to be horizontally transferred from fungi.
-
The expression profile showed that 55 PGs were specifically expressed in salivary gland with high expression levels, indicating that the salivary gland of A. lucorum has a very high ability to synthesize PGs.
- SPs are involved in various physiological processes of insects, such as digestion, development and innate immunity.
- The expansion of SPs in A. lucorum can improve its digestive capacity and may contribute to its omnivorous feeding habit, mainly phytophagous with prey to complement.
-
A large number of chemosensory receptors containing 135 ORs, 57 GRs and 33 IRs were identified in the A. lucorum genome.
-
The phylogenetic analysis that showed 40% of the OR genes (55) were contained in several clades with high protein sequence identity (>80%), indicating that OR has experienced recent gene replication.
-
Expression profile analysis showed that GRs exhibit different expression patterns and most GRs are expressed in various tissues, which is consistent with the previous reports in C. lectularius and O. fasciatus, indicating the diversity of GR functions.
-
Expression profile studies showed that 88% of IRs are expressed in antennae, suggesting that most IRs have olfactory functions. However, some IRs were highly expressed in tissues other than antennae, reflecting the diversity of IR functions.
-
Agricultural pests usually employ an efficient detoxification system containing various enzymes to overcome numerous toxins in food sources or the environment.
-
GST is a superfamily of multifunctional isoenzymes involved in the cellular detoxification of various physiological and xenobiotic substances, which is highly related to insecticide resistance in insects, as it can directly detoxify the insecticides.
-
Phylogenetic analysis of the GSTs of four true bugs showed three A. lucorum-specific branches, which contained 58% GST genes, suggesting the GSTs experienced a recent species-specific expansion in A. lucorum, enabling better detoxification of toxic substances and adaptation to the environment.
-
P450s constitute the largest and most functionally diverse class of insect detoxification enzymes, including four distinct clades CYP2, CYP3, CYP4 and CYPMito.
-
The phylogenetic tree of P450s exhibited four A.lucorum-specific branches, suggesting that P450s
experienced similar species-specific expansion as GSTs, enhancing the detoxification activity in A. lucorum.
DOI: 10.1111/1755-0998.13253