Multiple Classifier Systems For Protein Function Prediction

Posted on: 2011-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: D M Huang
Full Text: PDF
GTID: 2120360305454951
Subject: Bioinformatics
Abstract/Summary:
With the rapid development of information technology, data mining techniques have emerged to extract important hidden information from large amounts of stored data. Within data mining, classification plays an important role as a data analysis technique: it learns from a training set with known characteristics, derives an accurate description or model, and then predicts the class of unknown data samples. Classification has been studied extensively in artificial intelligence, machine learning, pattern recognition and other fields, and many traditional classification algorithms exist. However, these algorithms train a single classifier on a data set of known classes and are weak in scalability and efficiency. In addition, they have great difficulty handling classification tasks on large, complex data. The multiple classifier system was therefore proposed. It uses a combination of member classifiers, the related test information and an ensemble method to obtain a comprehensive classification prediction, thereby improving classification accuracy and reliability. How to obtain more useful information from the different members of the ensemble in order to improve classification performance has become an important research question in data mining.

Classification usually requires predicting the class label to which a data item belongs. In the sample set, each item belongs to one of a set of discrete, unordered classes. A classification algorithm learns from the training set, analyses it, and builds a classification model. In the next phase, this model is used to classify data whose class is unknown.
Here we describe traditional classification techniques, including commonly used classifier models such as k-nearest neighbors, decision trees, support vector machines, Bayesian methods and neural networks, and then introduce methods for evaluating classifier performance, such as the hold-out and cross-validation methods.

A multiple classifier system with good performance should satisfy a necessary and sufficient condition: its base classifiers should be accurate and diverse. In other words, a multiple classifier system needs to address the following issues: the base classifier generation strategy, base classifier selection, base classifier fusion methods, and assessment of the system. The "overproduce and choose" strategy is adopted. For classifier generation, one can manipulate the data set, the classes or the attributes, change the structure of the classification model, or improve the classification algorithm.

The author studied the structure of multiple classifier systems and the integration strategies at each level, investigated diversity evaluation, and summarized combination methods. A classifier generation strategy based on training with data sets from different sources is then proposed. It is a data set manipulation method that extracts the most representative samples. It takes classification performance into account when selecting the representative data set, and can generate candidate classifiers with better performance. From these candidate classifiers, an optimal subset must then be selected. To take the performance of the system as a whole into account, we use a selection method based on diversity and accuracy. The selection method considers not only the diversity that conventional classifier selection considers, but also the classification accuracy itself and the ensemble performance, which helps improve overall classification accuracy.
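The abstract does not give the exact selection criterion, so the following is only a minimal sketch of an "overproduce and choose" step under stated assumptions: diversity is measured by pairwise disagreement on a validation set, candidates are added greedily, and a hypothetical weight `alpha` trades accuracy against diversity. The function names and the scoring formula are illustrative and are not taken from the thesis.

```python
def accuracy(preds, labels):
    """Fraction of validation samples a candidate classifies correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def disagreement(preds_a, preds_b):
    """Pairwise diversity: fraction of samples on which two candidates differ."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def select_subset(candidate_preds, labels, k, alpha=0.5):
    """Greedily choose k candidates by accuracy and diversity.

    candidate_preds: one list of predicted labels per candidate classifier,
                     all made on the same validation set.
    alpha:           assumed weight on accuracy; (1 - alpha) weights the mean
                     disagreement with the candidates already selected.
    Returns the indices of the selected candidates in selection order.
    """
    remaining = list(range(len(candidate_preds)))
    # Seed the ensemble with the most accurate candidate (lowest index on ties).
    first = max(remaining, key=lambda i: accuracy(candidate_preds[i], labels))
    selected = [first]
    remaining.remove(first)

    def score(i):
        acc = accuracy(candidate_preds[i], labels)
        div = sum(
            disagreement(candidate_preds[i], candidate_preds[j]) for j in selected
        ) / len(selected)
        return alpha * acc + (1 - alpha) * div

    while len(selected) < k and remaining:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    labels = [0, 1, 1, 0, 1, 0]
    candidates = [
        [0, 1, 1, 0, 1, 1],  # accurate
        [0, 1, 1, 0, 1, 1],  # accurate but identical to the first
        [1, 1, 1, 0, 1, 0],  # accurate and makes a different error
        [1, 0, 0, 1, 0, 1],  # diverse but always wrong
    ]
    print(select_subset(candidates, labels, k=2))
```

In this toy run the duplicate of the first candidate adds no diversity and the always-wrong candidate adds no accuracy, so the second pick is the candidate that is both accurate and different. A criterion that also scores the combined ensemble output, as the abstract suggests, would replace the `score` function.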
In the final phase, the outputs of the member classifiers are combined using the maximum principle to determine the final output.

On protein function prediction, this thesis introduces the commonly used protein databases and divides protein function prediction methods into three categories from the perspective of machine learning: supervised, semi-supervised and unsupervised methods.

In this thesis, multiple classifier systems show good results in both theoretical and technical respects. However, many problems need deeper study. Examples include the topology of the multiple classifier system and its decision integration; the selection of an optimal subset from the candidate classifier set, which needs to consider independence among classifiers, diversity, locality and other conditions; and how to integrate the outputs of multiple member classifiers to obtain better classification performance, which involves building a fusion system. The various factors that affect the classification system should therefore be considered. In the member selection phase, the mutual independence of classifiers deserves attention: whether a sounder theoretical analysis can give a better measure of the correlation among member classifiers, together with comprehensive consideration of the problems arising in classifier generation and combination. In addition, the optimization of system design, a research priority, has achieved some meaningful results, but it is still not possible to dynamically choose the best multiple classifier system architecture for a given classification task; this remains an unresolved issue. Finally, research on multiple classifier systems has tended to stay within this conventional pattern, and other approaches to improvement may be worth exploring.
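The combination step in the final phase is described only as using the maximum principle. Assuming this refers to the standard max rule for classifier fusion, where each member reports a confidence per class and the class with the largest single confidence wins, a minimal sketch could look as follows. The function name `max_rule`, the class labels and the scores are hypothetical and chosen only for illustration.

```python
def max_rule(member_scores):
    """Fuse member outputs with the max rule.

    member_scores: one dict per member classifier, mapping class label to
                   that member's confidence for the sample being classified.
    Returns (winning_label, combined), where combined maps each class to the
    highest confidence any member assigned to it.
    """
    if not member_scores:
        raise ValueError("at least one member classifier output is required")
    classes = set().union(*member_scores)
    # A class that a member does not report is treated as confidence 0.0.
    combined = {
        c: max(scores.get(c, 0.0) for scores in member_scores) for c in classes
    }
    # sorted() makes tie-breaking deterministic (alphabetical order).
    winner = max(sorted(combined), key=combined.get)
    return winner, combined


if __name__ == "__main__":
    members = [
        {"class_a": 0.60, "class_b": 0.40},
        {"class_a": 0.30, "class_b": 0.70},
        {"class_a": 0.55, "class_b": 0.45},
    ]
    label, combined = max_rule(members)
    print(label, combined)
```

Note that in this example two of the three members prefer `class_a`, yet the max rule returns `class_b` because one member is more confident in it than any member is in `class_a`. This is the main behavioral difference from majority voting.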
Keywords/Search Tags: classification technology, multiple classifier systems, GPCRs, protein function prediction