
Application Of Data Mining In Medical Diagnosis Based On Regularized Regression Model

Posted on: 2018-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: J H Liang
Full Text: PDF
GTID: 2334330518467095
Subject: Computer technology
Abstract/Summary:
Data mining is an effective method for information extraction and discovery. With it, we can extract data from a hospital database, analyse and evaluate the data, uncover the hidden rules and information behind them, and provide a basis for scientific methods of medical treatment and judgment. The random forest algorithm is a relatively new data mining technique. It is a classifier combination model capable of processing high-dimensional and nonlinear sample data, so it has been widely used in many fields. At present, however, the random forest algorithm has two problems: first, none of the proposed methods based on it has been proved theoretically, so they cannot be relied on in practice; second, the efficiency of random forest remains a weak point. To address these two problems, this paper first proves the soundness of the proposed method theoretically and then validates the improved algorithm experimentally. It proposes a new method called the Optimal Sampling Times and No Release Random Forest Algorithm (OSNR-RF). The main contents of the paper are as follows. Firstly, the paper introduces the basic knowledge of data mining, illustrates the importance of data preprocessing in data mining, and discusses in detail the feature selection algorithm used in the paper. Secondly, the paper introduces the ridge regression (RR) model, a regularized regression model characterized by high prediction accuracy and strong interpretability. It also reviews the estimation properties of the parameters that the paper attributes to the model: unbiasedness, validity, consistency, and asymptotic normality. It then performs variable selection on the data with the ridge regression model. Next, the random forest algorithm is introduced briefly, and the paper studies how changes in the training-set sample size and the improved sampling method affect the original algorithm. The correctness of the algorithm is proved theoretically, in two parts: (1) to find the optimal number of repeated samples, the paper proposes and proves that the error rate of random forests decreases as the number of repeated samples n increases, and, after repeated experiments, proposes an optimal sampling interval (N<n<2N); (2) the paper proposes a new equivalent non-repeated sampling method that reduces the running time of the random forest algorithm and thereby improves its efficiency. The two improvements are combined and are both proved in theory and verified by experiments, so that the OSNR-RF algorithm achieves higher classification efficiency. Finally, in the experimental verification section, using standard UCI data sets and the breast cancer data set of the Maternity and Child-care Hospital, the dissertation uses the RR model to reduce overfitting and then preprocesses the data sets. Experiments with the OSNR-RF algorithm are performed on the processed data sets, classification accuracy and training efficiency are compared, and detailed performance tests are carried out. In comparison, the combined random forest algorithm with sampling without replacement improves classification performance, and its overall performance is more stable.
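The relationship between repeated (with-replacement) sampling and non-repeated sampling can be illustrated with a short sketch. This is not the OSNR-RF algorithm itself, whose details the abstract does not give. It assumes that N is the training-set size and n is the number of draws per tree, and it uses the standard result that n uniform draws with replacement from N items are expected to contain a fraction 1 - (1 - 1/N)^n of distinct items. The function names and the rule for choosing the without-replacement subset size are hypothetical and serve only as illustration.

```python
import random


def expected_unique_fraction(N, n):
    """Expected fraction of N items seen at least once in n draws
    made uniformly with replacement (standard bootstrap result)."""
    return 1.0 - (1.0 - 1.0 / N) ** n


def sample_with_replacement(N, n, rng):
    """Classic bootstrap: n index draws from range(N), duplicates allowed."""
    return [rng.randrange(N) for _ in range(n)]


def sample_without_replacement(N, n, rng):
    """Hypothetical 'equivalent' draw (an assumption, not the paper's rule):
    take as many distinct indices as the bootstrap is expected to contain."""
    k = round(N * expected_unique_fraction(N, n))
    return rng.sample(range(N), k)


if __name__ == "__main__":
    rng = random.Random(0)
    N = 1000  # assumed training-set size
    for n in (N, 3 * N // 2, 2 * N):  # the endpoints and the midpoint of N..2N
        boot = sample_with_replacement(N, n, rng)
        sub = sample_without_replacement(N, n, rng)
        print(
            f"n={n}: distinct in bootstrap={len(set(boot))}, "
            f"without-replacement size={len(sub)}, "
            f"expected fraction={expected_unique_fraction(N, n):.3f}"
        )
```

Under these assumptions, the without-replacement subset carries roughly the same number of distinct training samples as the bootstrap sample while storing no duplicates, which is one plausible reason such a scheme could shorten training time.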
Keywords/Search Tags: medical data mining, regularization, ridge regression, random forest, non-repeated sampling