
A Study On Multi-Label And Imbalanced Data Classification In Bioinformatics

Posted on: 2008-11-29  Degree: Master  Type: Thesis
Country: China  Candidate: K Chen  Full Text: PDF
GTID: 2120360212476045  Subject: Computer software and theory
Abstract/Summary:
Multi-label and imbalanced data classification is an important problem in machine learning. Many real-world applications, such as text categorization and subcellular localization of protein sequences, involve multi-label classification with imbalanced data. Unfortunately, most traditional learning algorithms are designed for single-label, balanced problems, so they do not work well on multi-label and imbalanced data sets.

The subcellular localization of proteins is an important problem in bioinformatics. The subcellular location of a protein is closely related to its function, so knowing where a protein is located is very helpful for understanding what it does. However, experimental determination of subcellular location is time-consuming and costly. It is therefore necessary to study how to predict the subcellular location of a protein from its amino acid sequence using machine learning methods. Unfortunately, this is a typical multi-label and imbalanced data problem: some locations contain many more proteins than others, and a single protein may exist in more than one location. Most traditional learning algorithms do not work well on this kind of problem.

In this paper, we use the min-max modular (M3) network to address the subcellular localization problem. The M3 network is an efficient classifier for solving large-scale complex problems. It decomposes a complex problem into a series of small, simple subproblems. These subproblems are independent of each other and can be solved in parallel. In the prediction phase, two simple integration principles are employed to combine the outputs of the subproblems into the solution of the original problem. Experimental results show that the M3 network achieves higher classification accuracy than a traditional SVM classifier on the subcellular localization problem, especially for the small classes. For these locations, the SVM classifier achieves only very low accuracy, whereas the classification accuracy improves significantly when the M3 network is used to decompose the problem into small subproblems. Our experiments also show that the M3 network is much faster than traditional classifiers. Because the subproblems of the M3 network can be processed in parallel, our method will be even faster on massively parallel machines.
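
The decomposition and combination steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions: it covers only a single two-class problem (a multi-class or multi-label task would first be broken into such two-class problems), it takes the two integration principles to be the minimization and maximization rules that give the min-max modular network its name, it splits each class into contiguous chunks, and it uses a nearest-centroid scorer as a stand-in for the SVM base classifier. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Scorer = Callable[[Point], float]


def split(items: Sequence[Point], n_parts: int) -> List[Sequence[Point]]:
    """Split items into contiguous chunks (assumed partitioning scheme)."""
    size = -(-len(items) // n_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def centroid(points: Sequence[Point]) -> Point:
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_centroid_scorer(pos: Sequence[Point], neg: Sequence[Point]) -> Scorer:
    """Stand-in base learner: a score above 0 means x is closer to the
    positive centroid than to the negative centroid."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: sq_dist(x, cn) - sq_dist(x, cp)


def train_m3(pos: Sequence[Point], neg: Sequence[Point],
             n_pos: int, n_neg: int) -> List[List[Scorer]]:
    """Train one small module per (positive chunk, negative chunk) pair.
    The modules do not depend on each other, so they could be trained in parallel."""
    pos_chunks, neg_chunks = split(pos, n_pos), split(neg, n_neg)
    return [[train_centroid_scorer(p, n) for n in neg_chunks] for p in pos_chunks]


def predict_m3(modules: List[List[Scorer]], x: Point) -> float:
    """Combine module outputs: MIN over modules that share a positive chunk,
    then MAX across positive chunks. A result above 0 means 'positive'."""
    return max(min(scorer(x) for scorer in row) for row in modules)


# Toy 1-D data: the positive class lies on both sides of the negative class,
# so a single nearest-centroid scorer cannot separate the two classes.
pos = [(-3.0,), (-2.5,), (2.5,), (3.0,)]
neg = [(-0.6,), (-0.2,), (0.2,), (0.6,)]

modules = train_m3(pos, neg, n_pos=2, n_neg=2)
for x in [(-3.0,), (0.0,), (3.0,)]:
    label = "positive" if predict_m3(modules, x) > 0 else "negative"
    print(x, label)
```

In this toy setting each module sees only a small, roughly balanced slice of the data, which is the intuition behind applying such a decomposition to imbalanced problems. How the thesis actually partitions the classes and which SVM settings it uses are not stated in the abstract.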
Keywords/Search Tags: min-max modular network, multi-label and imbalanced data classification, bioinformatics, subcellular localization, support vector machine