Research On Protein Function Prediction Based On Deep Convolutional Network And Data Fusion

Posted on: 2022-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: G J Zhou
GTID: 2480306530498264
Subject: Computer application technology
Abstract/Summary:
Proteins are usually translated from mature RNAs (isoforms). They are an important material basis of living organisms and participate in a wide range of life processes. With the widespread application of high-throughput biotechnology, the volume of protein sequence data, protein functional annotation data, and transcriptome sequencing (RNA-Seq) data has grown rapidly, and the number of proteins with unknown functions keeps increasing. Accurately and comprehensively annotating protein functions not only helps researchers understand the mechanisms of life, but also provides important data for drug development, disease analysis, and gene enrichment analysis. Protein function annotation is therefore one of the basic tasks of the post-genomic era.

Gene Ontology (GO) uses a set of standardized GO terms to annotate protein functions. GO contains more than 45,000 functional terms, and the structural relationships between terms form a directed acyclic graph (DAG). Because of the limitations of wet-lab experiments, a protein is usually annotated with only a few to a few dozen terms, and a large portion of its annotations are missing, which poses a major challenge for accurate protein function prediction. In addition, a single gene can be transcribed into different isoforms through alternative splicing, yielding a variety of protein variants. These variants have different functions but are generated from the same gene, which makes it difficult for computational methods to accurately identify protein functions. The rapid development of RNA-Seq technology has produced a large amount of transcriptome data, which provides instance-level data support and lays the foundation for accurately identifying individual functions at the protein variant level. As a result, predicting the function of isoforms, the direct translation templates of protein variants, has become a new, more fine-grained direction and a research hotspot in protein function prediction in recent years.

To improve the accuracy of protein function prediction and address the deficiencies of current prediction algorithms, this thesis combines multiple types of protein data with the Gene Ontology structure. The main work is as follows:

(1) To use the knowledge encoded in the GO hierarchy, we propose a deep Graph Convolutional Network (GCN) based model, DeepGOA, to predict the GO annotations of proteins. DeepGOA first quantifies the correlations (edges) between GO terms and updates the edge weights of the DAG by leveraging GO annotations and the GO hierarchy. It then learns the semantic representations and latent inter-relations of GO terms by applying a GCN to the updated DAG. Meanwhile, a Convolutional Neural Network (CNN) learns a feature representation of the amino acid sequences with respect to these semantic representations. Finally, DeepGOA computes the dot product of the two representations, which enables the whole network to be trained end-to-end. Experiments on maize and human protein sequence datasets show that DeepGOA effectively integrates GO structural information with amino acid sequence information, annotates proteins accurately, and outperforms state-of-the-art deep learning based methods. The ablation study shows that the GCN exploits the knowledge in GO and boosts performance.

(2) We propose a deep multi-instance learning based framework, DMIL-IsoFun, to differentiate the functions of isoforms. DMIL-IsoFun first trains a multi-instance learning convolutional neural network on isoform sequences and gene-level annotations to extract feature vectors and initialize the annotations of isoforms. It then uses a class-imbalance Graph Convolutional Network to refine the annotations of individual isoforms based on the isoform co-expression network and the extracted features. On the known isoform function annotations in the maize B73 version 5 genome data, DMIL-IsoFun is significantly more accurate than existing isoform function prediction methods. This thesis further examines three specific GO terms: DNA binding (GO:0003677), zinc ion binding (GO:0008270), and phosphatidylinositol phosphokinase activity (GO:0016307). DMIL-IsoFun accurately distinguishes these GO terms at the isoform level. In addition, the feasibility and superiority of the method were verified on a human dataset.
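
The scoring idea described in (1), a graph convolution over the weighted GO graph followed by a dot product with a sequence representation, can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: the three-term graph, the edge weights, the weight matrix, the row normalisation, and the `protein_vector` standing in for the CNN output are all hypothetical values and assumptions chosen only to show the data flow.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and row-normalise, so each term averages over itself and its neighbours."""
    n = len(adj)
    with_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_self]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj_norm, features, weights):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    hidden = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


def predict_scores(term_embeddings, protein_vector):
    """Sigmoid of the dot product between the protein vector and each term embedding."""
    scores = []
    for term in term_embeddings:
        logit = sum(t * p for t, p in zip(term, protein_vector))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Hypothetical toy GO graph: term 0 is a parent of terms 1 and 2.
# Edge weights are made up; the thesis derives them from annotations and hierarchy.
adjacency = [
    [0.0, 0.8, 0.5],
    [0.8, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
one_hot_terms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer_weights = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.6]]  # assumed, untrained

adj_norm = normalize_adjacency(adjacency)
term_embeddings = gcn_layer(adj_norm, one_hot_terms, layer_weights)

# Stand-in for the CNN's sequence representation (same dimension as the embeddings).
protein_vector = [1.0, -0.5]
scores = predict_scores(term_embeddings, protein_vector)

for term_id, score in enumerate(scores):
    print(f"GO term {term_id}: score {score:.3f}")
```

Because the final score is a plain dot product of the two representations, gradients would flow into both the graph branch and the sequence branch, which is what makes end-to-end training possible in a real deep learning framework.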
Keywords/Search Tags: Protein function prediction, Alternative splicing isoforms, Gene Ontology, Deep learning, Graph convolutional network