Chinese Parallel LDA Algorithm Based On Hadoop And Data Mining In Electronic Medical Records

Posted on:2017-02-21

Degree:Master

Type:Thesis

Country:China

Candidate:X L Ding

Full Text:PDF

GTID:2334330512463713

Subject:Management Science and Engineering

Abstract/Summary:

As the basis of Internet medical technology, electronic medical records are a valuable resource containing patients' clinical diagnosis and treatment records. The total data size of the medical record information system exceeds 100 TB, and new data is growing rapidly. The data types are diverse, which matches the definition of big data used in academic circles. At present, data mining of electronic medical records relies on traditional clustering algorithms and association rule analysis, which process structured data on a single computer and cannot adapt to a big data environment. Hadoop is currently a popular distributed processing system that combines a large number of cheap general-purpose machines into a huge resource pool. Compared with Spark, Hadoop is simple to deploy and offers higher fault tolerance.

This paper chooses the LDA model, a topic model, as the target of parallelization, with Gibbs sampling as the parameter estimation method. We introduce the pointwise mutual information algorithm PMIk to add a dynamic update function to the ICTCLAS word segmentation system, and give a parallel framework for handling large-scale data sets. The input documents are partitioned into blocks at both an outer and an inner level; to avoid dependencies between the sampled parameters, the data blocks are distributed along diagonals. During Gibbs sampling, an appropriate random number sequence is superimposed on each word's normalized word-frequency vector, and words that fall below a threshold are then filtered out. The Chinese corpus of Fudan University is used to analyze the experimental results on three indexes: accuracy, perplexity, and speedup. The results show that the improved word segmentation algorithm effectively improves the accuracy and recall of word segmentation, and that the improved parallel LDA algorithm significantly reduces model run time.

Finally, this paper takes real electronic medical records of newborns as the object of data mining and uses the parallel LDA algorithm for document classification and feature discovery. The mining results show that the classification accuracy of the algorithm is high, and that the descriptive word matrix produced by the algorithm contains the candidate features. One-way analysis of variance (ANOVA) is used to test the factors influencing the incidence of four neonatal diseases.
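
The abstract does not spell out how the diagonal distribution of data is built. The following is a minimal sketch, assuming the scheme commonly used for parallel Gibbs sampling: documents and vocabulary are each split into P blocks, and in each round worker p samples only the tokens in document block p and word block (p + round) mod P. Under that assumption no two workers touch the same document rows or the same word rows in a round, so their count updates do not depend on each other. The names diagonal_schedule and block_of are hypothetical and do not come from the thesis.

```python
from typing import List, Tuple


def diagonal_schedule(num_workers: int) -> List[List[Tuple[int, int]]]:
    """Return, for each round, the (doc_block, word_block) pair given to each worker.

    Assumed scheme: worker p in round r gets document block p and
    word block (p + r) % num_workers, i.e. one diagonal of the block grid.
    """
    return [
        [(p, (p + r) % num_workers) for p in range(num_workers)]
        for r in range(num_workers)
    ]


def block_of(index: int, total: int, num_blocks: int) -> int:
    """Map an item index in [0, total) to one of num_blocks contiguous blocks."""
    return index * num_blocks // total


# Example with 3 workers: each round is one diagonal of the 3 x 3 block grid.
schedule = diagonal_schedule(3)
for r, assignments in enumerate(schedule):
    print(f"round {r}: {assignments}")
```

After P rounds every (document block, word block) pair has been visited exactly once, which corresponds to one full Gibbs sweep over the corpus. In a Hadoop setting each round would presumably map to one job or one wave of tasks, but that mapping is an assumption rather than something stated in the abstract.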

Keywords/Search Tags:

Medical data, Parallel LDA, Gibbs sampling, Hadoop

Related items

1	Based On The Motif Of The Gibbs Sampling Algorithm To Find New Methods Of Research
2	Research And Design Of TCM Data Mining System Based On Hadoop
3	Research And Improvement Of Apriori Algorithm For Medical Cloud Data Based On Hadoop
4	Research On Medical Insurance Data Mining Based On Hadoop
5	Research On Parallel Processing Methods For Big Data In Medical&Healthcare
6	Analysis And Research Application Of Hyperthyroidism Disease Model Based On Medical Big Data
7	Design And Implementation Of ECG Data Acquisition And Storage System Based On Hadoop
8	Research On Parallel Processing Of Sparse MRI And Direct Fourier Transform Reconstruction Algorithm
9	Analysis And Research Of Tumor Mode Based On Medical Big Data
10	Design And Implementation Of Medical Big Data Analysis And Prediction System Based On Regression Model