Extraction Of Shortest Representation Of Protein Folds Based On Convolutional Neural Network

Posted on: 2021-08-17 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Pan | Full Text: PDF
GTID: 2480306500475514 | Subject: Biophysics
Abstract/Summary:
The protein universe is the set of proteins from all living organisms; it connects the sequences, structures and functions of different proteins. Establishing a complete protein universe from the sequence and structure information acquired by experimental methods is a key problem in bioinformatics, and is of great importance for protein structure prediction, analysis of protein evolution paths and protein structure design. In this thesis, starting from the contact map, a simplified representation of protein structure, we trained a deep convolutional neural network (DCNN) and studied the shortest feature vectors that are able to recognize different protein folds correctly. We analyzed the high-dimensional distribution of these shortest feature vectors with spectral clustering and other methods, and then constructed a protein fold space. The shortest feature vectors we obtained balance information integrity against redundancy, and are able to represent all seven common protein classes and their spatial relationships. Our research fills a gap left by previous studies, which did not describe the spatial positions and relationships of the classes, and may improve the understanding of the similarity between protein classes.

The DCNN is the main feature extraction method in our research. By modifying the network structure, features of significant benefit to fold recognition were extracted from protein structure information, so that domains of different classes are separated in the feature vector space. To find the shortest representation of protein folds, the number of neurons in the feature layers was strictly limited. Multiple network models were built and their results were evaluated by several methods. We found that all the information required for fold recognition is already present at feature vector length 8. Meanwhile, the order in which the classes separate during clustering is consistent with the common understanding of protein structure similarity. Feature vectors of length 8 were therefore defined as our shortest representation of protein folds. In addition, we found a critical dimension for protein fold representation at around 4 dimensions. At the critical dimension, the information in the feature vector changes rapidly and folds from different classes are basically separated.

After acquiring the shortest representation of folds, we projected the vectors into 3-dimensional space by principal component analysis (PCA) for visualization. In this fold space map, folds from different classes occupy their own regions with clear boundaries. The clustering result of the folds is highly consistent with the true class labels. Moreover, not only the common classes that often appeared in previous studies, such as All α, All β, α/β and α+β, but also multi-domain proteins, membrane proteins and small proteins find their positions in our fold space map, which makes it possible to advance the study of their relationships in protein space.

The thesis is arranged as follows. Chapter 1 briefly introduces the protein universe, protein fold recognition and deep learning as the background of our work. Chapter 2 explains in detail the main methods and structure parameters used in this thesis. Chapter 3 discusses the distribution of our shortest representation in high-dimensional space and presents the 3-dimensional fold space map for clear display. Finally, Chapter 4 summarizes our work and gives an outlook on future work.
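
The network input mentioned in the abstract is a contact map. As a minimal illustration of that representation only, the sketch below builds a binary contact map from a list of residue coordinates. The function name `contact_map`, the use of one 3-D point per residue, and the 8 Å distance cutoff are assumptions made for this example; the abstract does not state how the thesis defines a contact.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric 0/1 matrix marking residue pairs within `cutoff`.

    coords: list of (x, y, z) tuples, one per residue (assumed here to be
            C-alpha positions in angstroms; this is an illustrative choice).
    cutoff: distance threshold in angstroms (8.0 is an assumed value).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Mark both (i, j) and (j, i); the diagonal is left at 0,
            # so a residue is not counted as in contact with itself.
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Toy chain of four residues spaced 3.8 angstroms apart along the x axis.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_map(toy_chain):
    print(row)
```

In this toy chain only the first and last residues, 11.4 Å apart, fall outside the cutoff. A matrix of this kind can be treated as a single-channel image, which is one reason a convolutional network is a natural fit for it.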
Keywords/Search Tags: Protein universe, deep learning, convolutional neural network, protein fold recognition