| In recent years, with the popularization of domestic Electronic Health Record (EHR) systems, the volume of accumulated medical text has grown. These texts contain a large amount of important patient information, such as diseases, symptoms, diagnoses, and treatments, and the data play an important role in subsequent work such as disease analysis and disease prevention. The mining and analysis of EHRs has therefore received increasing attention in the field of natural language processing. EHR information is stored as text, and because doctors follow personal habits when writing medical records, the terms used to describe diseases and symptoms are not uniform. This inconsistency leads to errors in tasks such as interfacing with medical expense payment systems and compiling medical statistics. It is therefore important to map clinical text to a standard terminology database, that is, to represent the text as codes. This thesis studies entity analysis and automatic coding of medical data. The specific research contents are as follows: 1) A clinical text entity recognition method based on Att-Bi-LSTM-CRF is proposed. The method feeds Chinese word embeddings with stroke n-gram information (cw2vec) into a Bi-directional Long Short-Term Memory (Bi-LSTM) network and uses an attention mechanism to determine how much information to use. Finally, to make the predicted labels more reasonable, a conditional random field (CRF) is used for labeling. 2) A short text clustering method based on a convolutional neural network and K-means is proposed. Because disease descriptions are short and simply expressed, this thesis expands the short texts using the external ICD-10 terminology database, learns representations of the expanded texts with word2vec, uses a convolutional neural network to learn deep feature representations, and performs clustering with K-means. 3) An automatic disease coding method based on 
deep learning and examples is proposed. The method combines several techniques: deep learning, similarity calculation, and an example-based comparison table. The neural network learns the mapping between text and codes from the training data and uses it to predict codes. Similarity calculation based on TF-IDF is used to select the codes most similar to the disease description. The example-based method is used to handle problematic codes. The experimental results show that the methods proposed in this thesis are effective. For disease or diagnosis descriptions in medical data, the accuracy of the entity recognition method based on the deep learning model is about 82%. Combining disease short text expansion, a convolutional neural network, and the traditional K-means algorithm accomplishes short text clustering of diseases. The deep learning method handles the codes used most frequently in hospital diagnoses, while similarity calculation and the example-based comparison table handle codes that are infrequent and difficult to judge. By combining deep learning and example-based methods, as many code types as possible are covered and the accuracy of automatic disease coding is improved. Finally, this thesis describes the existing problems and the plans for further research. |
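
The TF-IDF similarity step mentioned above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`rank_codes`, `build_idf`, and so on), the smoothed IDF formula, the whitespace tokenizer, and the three-entry code table are all assumptions made for illustration. Real clinical text in Chinese would need a word segmenter in place of `tokenize`, and the full ICD-10 table in place of the excerpt.

```python
import math
from collections import Counter

# Illustrative excerpt only; descriptions are simplified for the example.
CODE_TABLE = {
    "E11": "type 2 diabetes mellitus",
    "I10": "essential primary hypertension",
    "J18.9": "pneumonia unspecified organism",
}


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text needs word segmentation.
    return text.lower().split()


def build_idf(documents):
    # Smoothed inverse document frequency over the code descriptions.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector; terms unseen in the code table are ignored.
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_codes(description, code_table):
    # Return (code, similarity) pairs, most similar first.
    codes = list(code_table)
    docs = [tokenize(code_table[c]) for c in codes]
    idf = build_idf(docs)
    query = tfidf_vector(tokenize(description), idf)
    scored = [(c, cosine(query, tfidf_vector(d, idf))) for c, d in zip(codes, docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for code, score in rank_codes("primary hypertension", CODE_TABLE):
        print(code, round(score, 3))
```

In a combined pipeline like the one the abstract describes, a ranking of this kind would plausibly serve as a fallback for descriptions whose codes are too rare for the neural network to learn, with the top-scoring code taken as the candidate.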