Chinese Parallel LDA Algorithm Based On Hadoop And Data Mining In Electronic Medical Records

Posted on: 2017-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: X L Ding
GTID: 2334330512463713
Subject: Management Science and Engineering
Abstract/Summary:
As the basis of Internet medical technology, electronic medical records are a valuable resource containing patients' clinical diagnosis and treatment records. The total data size of the medical record information system is above 100 TB, and new data is growing rapidly. The data types are diverse, which conforms to the academic definition of big data. At present, data mining of electronic medical records relies on traditional clustering algorithms and association rule analysis, which process structured data on a single computer and cannot adapt to a big data environment.

Hadoop is currently a popular distributed processing system that combines a large number of cheap, general-purpose machines into a huge resource pool. Hadoop is simple to deploy and, compared with Spark, offers higher fault tolerance. This paper chooses the LDA topic model as the target of parallelization, with Gibbs sampling as the parameter estimation method.

We introduce the pointwise mutual information algorithm PMIk to add a dynamic update function to the ICTCLAS word segmentation system, and we give a parallel framework for large-scale data sets. The input documents are partitioned into external and internal sub-blocks, and the data are distributed along diagonals so that the sampled parameters do not depend on one another. During Gibbs sampling, a suitable random number sequence is superimposed on the normalized word-frequency vector of each word, and words that fall below the threshold are filtered out.

The Chinese corpus of Fudan University is used to analyze the experimental results on three indexes: accuracy, perplexity, and speedup. The results show that the improved word segmentation algorithm effectively improves the accuracy and recall of word segmentation, and that the improved parallel LDA algorithm significantly reduces model run time.

Finally, this paper takes real electronic medical records of newborns as the object of data mining and uses the parallel LDA algorithm for document classification and feature discovery. The mining results show that the classification accuracy of the algorithm is high and that the algorithm's descriptive word matrix contains the candidate features. One-way analysis of variance (ANOVA) is used to test the factors influencing the incidence of four neonatal diseases.
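The diagonal data distribution mentioned in the abstract can be illustrated with a small sketch. The abstract gives no implementation details, so the following is only one common reading of the idea, not the thesis's actual scheme: documents and vocabulary are each split into P blocks, and in epoch e worker w samples only the tokens in document block w and word block (w + e) mod P. The function names `diagonal_schedule` and `is_conflict_free` are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (document block index, word block index)


def diagonal_schedule(num_workers: int) -> List[List[Block]]:
    """Return, for each epoch, the block assigned to each worker.

    Assumption (not from the thesis): worker w takes document block w
    and word block (w + epoch) mod num_workers, i.e. one diagonal of
    the document-by-word block grid per epoch.
    """
    return [
        [(w, (w + epoch) % num_workers) for w in range(num_workers)]
        for epoch in range(num_workers)
    ]


def is_conflict_free(assignments: List[Block]) -> bool:
    """True if no two workers share a document block or a word block."""
    doc_blocks = [d for d, _ in assignments]
    word_blocks = [v for _, v in assignments]
    return (len(set(doc_blocks)) == len(doc_blocks)
            and len(set(word_blocks)) == len(word_blocks))


schedule = diagonal_schedule(3)
for epoch, assignments in enumerate(schedule):
    print(epoch, assignments, is_conflict_free(assignments))
```

Under this assumed layout, workers within one epoch never touch the same document-topic rows or word-topic columns, and after P epochs every block has been sampled exactly once. Global per-topic totals would still need to be reconciled between epochs in a real Hadoop job.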
Keywords: Medical data, Parallel LDA, Gibbs sampling, Hadoop