| Proteins are an important material basis for life activities and are considered the main components of cellular metabolism. All physiological processes of organisms require the participation of proteins. However, proteins in cells are not isolated; they interact with other proteins to perform their biological functions together. The loss of certain proteins, which are called essential proteins, may disrupt normal cellular function and may even affect the development and survival of the organism. Such key proteins are indispensable for maintaining normal physiological function and are vital to disease diagnosis and drug target design. Identifying key proteins is therefore important for revealing molecular mechanisms and biological processes. Traditional biological methods are time-consuming and expensive. Recently, with the development of high-throughput technologies, protein-protein interaction (PPI) data have accumulated. Numerous computational methods based on these large biological datasets have been developed to identify essential proteins. These methods can be roughly divided into two categories: those based on the topological characteristics of the network, and those that also fuse other biological information about proteins. How to effectively integrate multiple kinds of biological information, mine the inherent characteristics of different data, and improve the recognition rate is therefore a hot topic. In this thesis, based on the topological properties of the PPI network, three different methods are proposed to identify essential proteins by fusing other biological information. The thesis focuses on three aspects: (1) A new method is developed for identifying essential proteins using the GO-PPI network, which is reconstructed using gene ontology semantic similarity metrics. Because the PPI network contains a large number of false positives, five gene ontology (GO) based semantic similarity metrics are used to calculate confidence scores for PPIs. The links with
low-confidence scores are treated as false positives and filtered out, and the GO-PPI network is built from the remaining links. Six topology-based centrality methods are applied to test the approach, and the numerical results show that these centrality methods perform better on the refined PPI networks than on the original networks. (2) A method combining GO annotation information, subcellular localization information, and protein domain information is proposed to predict key proteins. First, structural importance is described using protein domain information. Then, the edge clustering coefficient, GO annotation information, and subcellular localization information are used to describe functional importance. Finally, a new method, TGSD, is proposed by combining the above information. The experimental results show that TGSD effectively increases the number of essential proteins correctly predicted. (3) A new method is proposed to predict essential proteins by fusing multi-omics data. The accuracy of methods based solely on the topological properties of the PPI network is unsatisfactory, so we combine other biological information with the topological properties of the network. Because different biological data reflect different characteristics of proteins, we combine known protein complex information, gene expression profiles, GO term information, subcellular localization information, and protein orthology data with the PPI network, and develop an algorithm named CEGSO. Fusing multiple sources of information can effectively reduce the false positive rate in the PPI network. The optimal parameters are obtained using the differential evolution algorithm, and the performance of CEGSO is evaluated on a benchmark PPI network. The simulation results show that CEGSO is more accurate and robust than the compared methods. |
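
The refinement step in aspect (1), filtering low-confidence links and then ranking proteins by a topological centrality, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the five GO semantic similarity metrics, the six centrality methods, or the filtering threshold, so the confidence scores below are invented toy values, the threshold of 0.5 is arbitrary, and degree centrality stands in for whichever centrality methods the thesis actually uses. The function names are hypothetical.

```python
from collections import defaultdict


def refine_network(scored_edges, threshold):
    """Keep only interactions whose confidence score is at least `threshold`.

    `scored_edges` is a list of (protein_a, protein_b, score) tuples. In the
    thesis the score comes from a GO semantic similarity metric; here it is
    just a number supplied by the caller (an assumption of this sketch).
    """
    return [(u, v) for u, v, score in scored_edges if score >= threshold]


def degree_centrality(edges):
    """Count the distinct interaction partners of each protein."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            neighbors[u].add(v)
            neighbors[v].add(u)
    return {protein: len(partners) for protein, partners in neighbors.items()}


def rank_proteins(edges):
    """Rank proteins by degree, highest first; ties are broken by name."""
    degree = degree_centrality(edges)
    return sorted(degree, key=lambda p: (-degree[p], p))


# Toy PPI network with invented confidence scores (hypothetical data).
toy_ppi = [
    ("P1", "P2", 0.9),
    ("P1", "P3", 0.8),
    ("P1", "P4", 0.2),  # low confidence: treated as a false positive
    ("P2", "P3", 0.7),
    ("P3", "P6", 0.6),
    ("P4", "P5", 0.1),  # low confidence: treated as a false positive
]

refined = refine_network(toy_ppi, threshold=0.5)
ranking = rank_proteins(refined)
```

In this toy network, filtering removes both links involving `P4`, so `P4` and `P5` drop out of the refined network entirely and the ranking is computed only over proteins that keep at least one trusted interaction. A real evaluation would compare the top-ranked proteins against a list of known essential proteins, which this sketch does not attempt.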