Information Extraction And Recognition Algorithms For Chinese Papery Medical Documents

Posted on: 2020-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Zhao
Full Text: PDF
GTID: 2404330575995030
Subject: Computer Science and Technology
Abstract/Summary:
Artificial intelligence has penetrated all walks of life and brought considerable convenience to society. However, many fields with large demand remain untapped. For example, the large number of paper-based medical documents kept by patients have not been put to use, although they are an important source of medical big data. Paper-based medical documents are still widely used, but content printed on paper is difficult for patients to store and manage. Electronic medical documents, in contrast, not only help solve these problems but also promote the development of telemedicine and medical big data. Transforming traditional printed medical documents into electronic ones has therefore become a key issue. Recognizing Chinese medical documents in image form is a challenging task: they contain a variety of characters, including Chinese characters, English letters, Greek letters, and mathematical symbols, and the structure of Chinese characters is often intricate.

To address this problem, this paper designs a complete deep learning-based solution for extracting information from Chinese medical documents, covering both text detection and text recognition. The deep learning-based text detection algorithm avoids the cumbersome steps of traditional methods and offers high detection efficiency as well as high accuracy. Because labeling data for a single-character recognition model is too costly, we customized a recognition module for Chinese medical documents based on Tesseract. Tesseract can serve as a pre-labeling tool, supports multi-language character recognition, and is faster to train than a neural network model.

Analysis of the experimental results of the single-character recognition model shows that it relies on the result of character segmentation, performs poorly on near-words, and is sensitive to image quality (image blur, text tilt, etc.). In addition, most mainstream text recognition models are designed for single-scale characters and are mainly used for recognizing English letters and digits. We therefore analyze the scale of character features in depth and propose a multi-scale sequence recognition model, which greatly reduces recognition errors on near-words. The proposed model treats each text block as a sequence, which eliminates the character segmentation step, and the context information of the entire sequence also assists recognition. To validate the proposed model, we trained it on a synthesized data set and evaluated it on real test sheets that we collected. The experimental results show that our model recognizes text better than current mainstream algorithms.
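The abstract does not say how the sequence recognition model turns its output into text. As an illustration only, the sketch below assumes a CTC-style greedy decoding step, which is a common choice for segmentation-free recognizers but is not confirmed by the source. The names `greedy_ctc_decode`, `BLANK`, the toy alphabet, and the per-frame scores are all hypothetical. It shows how a whole text block can be read as one sequence of frame predictions without first cutting it into individual characters.

```python
from itertools import groupby

# Hypothetical blank symbol; real models reserve one class index for it.
BLANK = "-"


def greedy_ctc_decode(frame_scores, alphabet):
    """Decode per-frame scores into text, CTC-style (assumed, not from the thesis).

    frame_scores: list of score lists, one score per entry in `alphabet`.
    alphabet: list of symbols; must contain BLANK.
    """
    # 1. Pick the highest-scoring symbol for every frame.
    best = [alphabet[max(range(len(s)), key=s.__getitem__)] for s in frame_scores]
    # 2. Collapse runs of the same symbol into a single symbol.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Drop blanks; what remains is the recognized string.
    return "".join(ch for ch in collapsed if ch != BLANK)


# Toy alphabet mixing Chinese characters, a digit, and punctuation.
alphabet = [BLANK, "血", "糖", "5", "."]

# Made-up scores for eight frames across one text block.
frames = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # 血
    [0.10, 0.70, 0.10, 0.05, 0.05],  # 血 again (same character, wider glyph)
    [0.90, 0.02, 0.03, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.05, 0.05],  # 糖
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.00, 0.00, 0.80, 0.10],  # 5
    [0.10, 0.00, 0.00, 0.10, 0.80],  # .
    [0.20, 0.00, 0.00, 0.70, 0.10],  # 5
]

decoded = greedy_ctc_decode(frames, alphabet)
print(decoded)
```

In this toy run the two consecutive frames for the first character collapse into one, and the blanks disappear, so no explicit character boundaries are ever computed. A genuinely doubled character would need a blank frame between its two occurrences to survive the collapse.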
Keywords/Search Tags: optical character recognition, text detection, text recognition, Chinese medical documents, deep learning