
Research On Prediction Of Angiosperm Orphan Gene Based On Ensemble Learning

Posted on: 2022-09-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q J Gao
GTID: 1480306740968089
Subject: Agriculture and Bioinformatics
Abstract/Summary:
Orphan genes are genes that are unique to one species or lineage and have no sequence similarity to any genes in other species or lineages. They are widespread in animal and plant genomes and are closely related to the unique biological characteristics and environmental adaptability of species. In recent years, the development of high-throughput sequencing technology has not only promoted the sequencing of animal and plant genomes but has also made orphan genes a hot topic in comparative genomics. An in-depth understanding of the distribution patterns and functional characteristics of orphan genes in angiosperms is of great significance for analyzing their evolution and revealing the genetic basis of special traits. Accurate and rapid identification of plant orphan genes is a prerequisite for such study. At present, however, most identification methods for plant orphan genes rely on BLAST sequence alignment, which is time-consuming and low in throughput, and this has become the major bottleneck for further study of the evolution of orphan genes in angiosperms. This project therefore takes the previously identified high-quality orphan genes of representative angiosperms as its research object and uses machine learning methods to construct a prediction model for plant orphan genes, with the aim of developing a machine learning algorithm suited to the rapid and accurate identification of angiosperm orphan genes. The method is then used to identify angiosperm orphan genes systematically, and a database of plant orphan genes is constructed. The main research findings are as follows:

(1) Through comparative genomics, 4,649; 7,013; 1,330; 103; 1,417; 109; 509; 790; 993; and 2,036 orphan genes were identified, respectively, in the genomes of one early-evolved plant and nine representative angiosperms, namely Chlamydomonas reinhardtii, Amborella trichopoda, Arabidopsis thaliana, Camellia sinensis, Citrus sinensis, Populus trichocarpa, Sorghum bicolor, Oryza sativa, Triticum aestivum L., and Zea mays. Further study found that, compared with non-orphan genes, the orphan genes of angiosperms show distinctive structural differences: shorter average length, lower GC content, fewer introns, and a higher isoelectric point.

(2) An orphan gene prediction model based on machine learning was constructed. Taking Amborella trichopoda, Arabidopsis thaliana, Oryza sativa, and Camellia sinensis as research objects, prediction models for plant orphan genes were built with five machine learning methods (SVM, RF, AdaBoost, LightGBM, and XGBoost), using seven gene features: gene length, GC/AT percentage, protein length, molecular weight, protein isoelectric point, average exon number, and average intron number. Because the orphan and non-orphan gene samples are unbalanced, a method combining "oversampling" and "random undersampling" (R?SMOTE) was proposed to obtain a balanced sample data set. Feeding the balanced data sets into the five models showed that R?SMOTE effectively improved the accuracy, recall, and F1 score of each prediction model. The results also showed that the ensemble learning models predicted better than SVM.

(3) Experimental comparison further revealed that the composite models combining R?SMOTE with ensemble learning methods (RF, AdaBoost, LightGBM, and XGBoost) had higher prediction accuracy, recall, and F1 score than the single model combining R?SMOTE with SVM, and that the composite model combining R?SMOTE with XGBoost had the highest prediction accuracy, 90%. The importance of the gene features was also studied, and the results showed that protein molecular weight and isoelectric point contributed the most to the accuracy of a prediction model.

(4) Orphan genes of 100 angiosperm species were systematically identified, using the developed XGBoost prediction model together with extensively collected angiosperm genomic data and gene feature information. Orphan genes were found in varying numbers across angiosperms, accounting for about 1% to 30% of all protein-coding genes in a species. Compared with early-diverging species such as Chlamydomonas reinhardtii and Amborella trichopoda, orphan genes have multiplied in more recently diverged crops.

(5) A database of angiosperm orphan genes was established. Integrating the identified angiosperm orphan gene information, the Platform of Angiosperm Orphan Genes Database (PAOGD) was built. It also integrates the plant orphan gene prediction platform and various bioinformatics tools, such as functional enrichment analysis, correlation analysis, primer design, and sequence alignment, and provides visualization, helping researchers quickly retrieve and deeply mine the plant orphan gene data in the database.

The prediction algorithm for plant orphan genes and the database developed in this study not only provide a methodological basis for analyzing the evolution of orphan genes in angiosperms but also lay a solid data foundation for further studying the function of angiosperm orphan genes and revealing the genetic basis of their special traits.
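
The resampling idea in finding (2), pairing oversampling of the rare orphan class with random undersampling of the abundant non-orphan class, can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not say how the two steps are combined, so the function names (`smote_oversample`, `random_undersample`, `balance`), the neighbour count `k`, the `target` class size, and the toy two-feature vectors are all assumptions made for illustration. The oversampling step uses the generic SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. It needs at least two minority samples, and real features such as gene length and GC percentage would have to be scaled before distances are computed.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=3, rng=None):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def random_undersample(majority, n_keep, rng=None):
    """Keep n_keep majority samples chosen uniformly without replacement."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)


def balance(minority, majority, target, k=3, seed=0):
    """Grow the minority class up to `target` samples and shrink the
    majority class down to `target` samples (hypothetical combination)."""
    rng = random.Random(seed)
    n_new = max(0, target - len(minority))
    new_minority = list(minority) + smote_oversample(minority, n_new, k, rng)
    new_majority = random_undersample(majority, min(target, len(majority)), rng)
    return new_minority, new_majority


# Toy data: 5 "orphan" and 50 "non-orphan" genes with two scaled features.
data_rng = random.Random(1)
orphans = [(data_rng.random(), data_rng.random()) for _ in range(5)]
non_orphans = [(data_rng.random() + 2, data_rng.random() + 2) for _ in range(50)]

balanced_orphans, balanced_non_orphans = balance(orphans, non_orphans, target=20)
print(len(balanced_orphans), len(balanced_non_orphans))
```

The balanced lists could then be passed to any classifier. Resampling should be applied to the training split only, so that evaluation reflects the original class ratio.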
Keywords/Search Tags: Orphan gene, machine learning, model building, design of algorithm, database construction, genome evolution