The Research Of Feature Selection And Its Application In Bioinformatics

Posted on: 2016-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: W J Zhang
GTID: 2180330461476515
Subject: Computer application technology
Abstract/Summary:
With the development of systems biology, techniques from genomics, proteomics, metabolomics and related fields are widely applied in disease diagnosis, drug treatment, and similar areas. Bioinformatics data, such as genomic, proteomic, and metabolomic data, are usually high dimensional. Feature selection methods help to reduce the dimension and to filter out noise and irrelevant features. This thesis studies feature selection methods and their applications in bioinformatics.

The analysis of bioinformatics time series data helps to trace the development of diseases and to find predictive biomarkers that can provide early warning of the occurrence of a disease. Traditional time series methods usually deal with the values of a monitored variable measured continuously over relatively short time intervals, and are mainly used for prediction, anomaly detection, classification, and similar tasks. Unlike traditional time series data, bioinformatics time course data usually contain multiple samples and high-dimensional variables but only a small number of time points. This thesis studies feature selection methods for bioinformatics time course data and proposes a weighted relative difference accumulation method, wRDA. Different time points may lie in different stages of the disease development process. To reflect this difference, wRDA weights the time points and accumulates the weighted relative differences over the time points to find predictive biomarkers. To validate the method, wRDA was applied to liver metabolomics time course data from an animal experiment and from a human population, respectively. In addition, according to the characteristics of clinical time course data and considering the influence of sample storage time, wRDA was extended to w2RDA by weighting the sampling time points. The results on the animal metabolomics time course data show that wRDA can find known important metabolites, which reflect different courses of liver disease.
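The abstract does not give the wRDA formula, so the following is only a minimal sketch of the idea of weighting time points and accumulating relative differences. It assumes, hypothetically, that the relative difference of a feature at a time point is the absolute difference between the case-group and control-group means divided by the control-group mean, and that the time-point weights are supplied by the user. The function names `wrda_scores` and `rank_features` and the parameter `eps` are illustrative and do not come from the thesis.

```python
import numpy as np

def wrda_scores(case, control, time_weights, eps=1e-12):
    """Hypothetical wRDA-style score for each feature.

    case, control: arrays of shape (n_samples, n_timepoints, n_features).
    time_weights:  array of shape (n_timepoints,), one weight per time point.
    """
    case = np.asarray(case, dtype=float)
    control = np.asarray(control, dtype=float)
    w = np.asarray(time_weights, dtype=float)

    case_mean = case.mean(axis=0)      # shape (n_timepoints, n_features)
    ctrl_mean = control.mean(axis=0)   # shape (n_timepoints, n_features)

    # Assumed definition: relative difference of the group means per time point.
    rel_diff = np.abs(case_mean - ctrl_mean) / (np.abs(ctrl_mean) + eps)

    # Weight each time point, then accumulate over time.
    return (w[:, None] * rel_diff).sum(axis=0)

def rank_features(scores):
    """Return feature indices ordered from highest to lowest score."""
    return np.argsort(-np.asarray(scores))
```

Under these assumptions, a feature whose group means stay apart at heavily weighted time points receives a high accumulated score, and `rank_features` lists such features first as candidate biomarkers.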
The metabolites selected by wRDA also show excellent classification performance: they discriminate the liver disease group from the control group, and hepatocellular carcinoma (HCC) samples from non-HCC samples. In the human liver metabolomics time course experiment, serum bile acid was found to keep rising over a long period in the precancerous stage. Bile acid is therefore speculated to be a risk factor for liver cancer.

To deal with the problem that bioinformatics data usually have a high dimension and a small number of samples, this thesis proposes a feature selection method, ReliefF-WS, to filter out noise and reduce the dimension. First, the quality of each sample is measured and the sample is given a weight based on its degree of class overlap: samples of good quality receive high weights and samples of poor quality receive low weights. ReliefF is a fast and efficient filter method for feature selection. Applying the idea of weighting samples by class overlap to ReliefF reduces the negative influence of poor-quality samples when the feature weights are updated. Experiments on nine public bioinformatics datasets showed that ReliefF-WS ranks the variables more accurately than the ReliefF algorithm.
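The abstract does not specify how class overlap is measured or how the sample weights enter the ReliefF update, so the following is only a minimal sketch under stated assumptions. It assumes, hypothetically, that a sample's weight is the share of its k nearest neighbours that carry the same class label, and that this weight scales the sample's contribution to the feature scores. It also simplifies ReliefF by pooling all other classes when choosing nearest misses and by visiting every sample once. The names `sample_weights` and `relieff_ws` are illustrative and not taken from the thesis.

```python
import numpy as np

def _pairwise_manhattan(X):
    """Manhattan distance between all pairs of rows, self-distance set to inf."""
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(D, np.inf)
    return D

def sample_weights(D, y, k):
    """Assumed overlap measure: share of the k nearest neighbours with the same label."""
    nn = np.argsort(D, axis=1)[:, :k]
    return (y[nn] == y[:, None]).mean(axis=1)

def relieff_ws(X, y, k=3):
    """Simplified, sample-weighted ReliefF-style feature scores (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Scale every feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    D = _pairwise_manhattan(Xs)
    s = sample_weights(D, y, k)
    scores = np.zeros(X.shape[1])

    for i in range(len(Xs)):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]
        other = np.flatnonzero(y != y[i])
        if len(same) == 0 or len(other) == 0:
            continue
        hits = same[np.argsort(D[i, same])[:k]]       # nearest same-class samples
        misses = other[np.argsort(D[i, other])[:k]]   # nearest other-class samples
        diff_hit = np.abs(Xs[hits] - Xs[i]).mean(axis=0)
        diff_miss = np.abs(Xs[misses] - Xs[i]).mean(axis=0)
        # Samples lying in overlapping regions (low weight) contribute less.
        scores += s[i] * (diff_miss - diff_hit)

    return scores / max(s.sum(), 1e-12)
```

In this sketch a feature scores highly when it differs across classes but stays similar within a class, and samples surrounded by the opposite class have a reduced say in that score.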
Keywords/Search Tags:Bioinformatics, Metabolomics, Time Series, Feature Selection