Research on Methods for Protein Sequence Motif Discovery Based on Profile Hidden Markov Model

Posted on: 2016-11-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Son
GTID: 1310330482967096
Subject: Control theory and control engineering
Abstract/Summary:
Applying machine learning to protein sequence motif discovery means detecting biologically meaningful patterns in a set of sequences that share a common attribute, and it has been a focus of bioinformatics research. Protein sequence motifs play an important role in the cellular functions of proteins, such as post-translational modification, interaction and subcellular localization. Motif discovery from protein sequences faces problems such as data imbalance and a lack of data. Traditionally, computational methods for motif discovery are based on regular expressions and position weight matrices. The hidden Markov model (HMM) is an important probabilistic model in sequence data processing and statistical learning, and it has been widely adopted in speech recognition, behavior recognition, text recognition, fault diagnosis, biological sequence analysis and other fields. Compared with regular expressions and position weight matrices, HMM-based methods give a richer representation of the motifs they find. This dissertation studies machine learning algorithms based on the profile HMM and their application to protein sequence motif discovery. The main research contents are as follows:

1. To address the data imbalance in targeting motif discovery, which is caused by the uneven distribution of proteins across subcellular compartments, a novel motif discovery algorithm is proposed based on a balanced sampling method that combines undersampling with oversampling. By mimicking cellular sorting pathways, the algorithm uses discriminative HMMs to distinguish the targeting motifs of different compartments. At the data preprocessing stage, a simulated evolution method is applied to solve the multi-class imbalance problem; at the HMM training stage, random undersampling is introduced to handle the imbalance between the positive and negative datasets.
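
The balanced sampling described in item 1 can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names are hypothetical, "simulated evolution" is approximated here by uniform random point substitutions (a real method would more likely use an evolutionary substitution model), and the oversampling target is assumed to be the size of the largest class.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Copy seq, redrawing each residue with probability `rate`.

    Assumption: uniform substitutions stand in for simulated evolution.
    A redraw may return the same residue again.
    """
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def oversample_minorities(classes, rate, rng):
    """Grow every class to the size of the largest one with mutated copies."""
    target = max(len(seqs) for seqs in classes.values())
    balanced = {}
    for label, seqs in classes.items():
        grown = list(seqs)
        while len(grown) < target:
            grown.append(mutate(rng.choice(seqs), rate, rng))
        balanced[label] = grown
    return balanced


def positives_and_negatives(classes, label, rng):
    """Build a one-vs-rest training set with randomly undersampled negatives."""
    positives = classes[label]
    pool = [s for other, seqs in classes.items() if other != label for s in seqs]
    negatives = rng.sample(pool, min(len(positives), len(pool)))
    return positives, negatives


# Toy data: the compartment labels and sequences are made up for illustration.
rng = random.Random(0)
classes = {
    "cytoplasm": ["MKVLA", "MLVQA", "MKIRA", "MRVTA"],
    "peroxisome": ["AASKL"],
}
balanced = oversample_minorities(classes, rate=0.1, rng=rng)
positives, negatives = positives_and_negatives(balanced, "peroxisome", rng)
```

Oversampling happens once, before training, while the negative set is drawn again for each compartment. This mirrors the two stages named in the abstract.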
Experimental results show that, in the task of discovering targeting motifs for nine subcellular compartments, the motifs found by this algorithm are more conserved than those found by methods that do not consider data imbalance, and they recover the most known targeting motifs from Minimotif Miner and InterPro. The discovered motifs are also used to predict protein subcellular localization, where they achieve higher precision and recall for the minority classes. For the noisy sequences generated by simulated evolution, this dissertation further adopts active learning to select the most informative and representative synthetic sequences for discovering discriminative motifs. Experimental results show that the improved algorithm identifies more significant targeting motifs, and that the motifs it finds are more conserved and more useful in predicting protein subcellular localization.

2. To improve the recognition of multi-type functional motifs, a new motif discovery algorithm based on selective training of profile HMMs is proposed. First, because protein motifs are predominantly located in intrinsically disordered regions of sequences and motif residues are more conserved than their surrounding residues, ordered region masking and relative local conservation (RLC) masking are adopted to improve the signal-to-noise ratio of the sequences. Masking decreases the number of positions where a motif can occur by chance, which increases the likelihood of seeing a given motif multiple times and makes motifs of interest easier to identify. To handle the masked sequences, a new HMM-based motif discovery algorithm is introduced.
Experimental results show that this algorithm not only reduces the computational complexity of training HMMs but also maintains the quality of motif discovery based on profile HMMs. Second, the algorithm applies evolutionary weighting so that sequences that are important in the evolutionary process receive more attention during the selective training of profile HMMs. Experimental results show that the profile HMM-based algorithm complements existing ones in finding complex motifs and provides another approach to multi-type functional motif analysis.

3. Variant motifs of the same functional site class are typically characterized by similar motif patterns. To discover them, this dissertation first uses the average disorder profile and a statistical significance test of motif location to systematically study the relationship between variant motifs and predicted disorder. Under the default parameters of IUPred, a tool for predicting intrinsically disordered regions of proteins, ordered region masking improves the signal-to-noise ratio of the sequences. Second, based on ordered region masking, the proposed algorithm augments the training set to reduce the impact of the shortage of variant motif sequences. Finally, it trains discriminative profile HMMs to distinguish between the variant motifs. Experimental results on 37 sets of variant motifs show that ordered region masking and training set augmentation help discriminative algorithms overcome their inherent problems, and benefit them more than generative motif discovery algorithms when distinguishing between variant motifs whose patterns differ only slightly.
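
The ordered region masking used in items 2 and 3 can be sketched as follows. It is an illustration under stated assumptions, not the dissertation's code: the per-residue disorder scores are invented rather than produced by IUPred, the 0.5 cutoff and the `X` mask character are assumptions, and both function names are hypothetical.

```python
def mask_ordered_regions(seq, disorder, threshold=0.5, mask_char="X"):
    """Replace residues predicted as ordered (score below threshold) with mask_char.

    Assumption: `disorder` holds one score in [0, 1] per residue, such as a
    disorder predictor might report; the cutoff is illustrative.
    """
    if len(seq) != len(disorder):
        raise ValueError("need exactly one disorder score per residue")
    return "".join(
        aa if score >= threshold else mask_char
        for aa, score in zip(seq, disorder)
    )


def candidate_windows(masked, width, mask_char="X"):
    """Return start positions of windows of `width` that contain no masked residue."""
    return [
        i
        for i in range(len(masked) - width + 1)
        if mask_char not in masked[i : i + width]
    ]


# Toy example with made-up scores.
seq = "MKTAYIAKQR"
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.7, 0.8, 0.9]
masked = mask_ordered_regions(seq, scores)      # "MKTXXXAKQR"
starts = candidate_windows(masked, width=3)     # [0, 6, 7]
```

In this toy case, masking cuts the number of candidate start positions for a width-3 motif from 8 to 3. This is the sense in which masking reduces the positions where a motif can occur by chance.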
Keywords/Search Tags: Machine Learning, Bioinformatics, Hidden Markov Model, Protein Sequence Motif Discovery