Structured Methods For Pathological Reporting Of Lung Cancer

Posted on: 2022-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: X H Wu
GTID: 2504306494980529
Subject: Applied Statistics
Abstract/Summary:
Pathology reports for lung cancer contain a wealth of disease-related information that is valuable for the prevention and treatment of lung cancer. This data is not used effectively, however, because it is stored in databases as natural-language text without structured processing and therefore cannot be interpreted by a computer. Structuring pathology reports would provide data support for medical academic research. Existing work on structuring tends to focus on improving theoretical algorithms, and a single algorithm rarely meets the needs of engineering applications. Building on existing theoretical models, this thesis constructs a comprehensive extraction framework that combines a model with rules, with the aim of improving the accuracy of lung cancer pathology report structuring and making it usable in practice.

The research covers the following aspects. A comprehensive extraction framework was built on lung cancer diagnosis reports from a Grade III, Class A hospital in Shanghai, and the various entities in the reports were extracted by a model plus rules. For model design, several models were trained with different word embedding methods, algorithms, and features, and a BERT-based CRF model was selected as the best of them. For rule setting, regular expressions were used to match the "degree of differentiation" entity type, and the longest matched text was taken as the output. In the experiments, the overall accuracy of the final model was 0.987, and precision and recall both reached 1 for several entity types.

Finally, a backtracking module was designed. It supports viewing and counting entity types that were predicted incorrectly or not predicted at all, which provides data support for improving and verifying extraction performance. Three problems (unreasonable entity type definitions, labeling errors, and too few training samples for some entity types) were addressed by restructuring the entity types, correcting the annotated corpus, and applying rule-based information extraction, which effectively improved prediction performance.

The thesis makes two main contributions. In terms of theory, a comprehensive rule-plus-model extraction framework was constructed: the best model was first selected by comparing the word embedding methods word2vec and BERT, and entity types that appear infrequently in the corpus were then extracted by rules, further improving precision and recall on multiple entity types. In practical terms, a backtracking module was built for viewing incorrectly predicted and unpredicted entity types for analysis and verification. The module also links each result back to the original pathology report, which makes it easy to correct mislabeled data and improves the quality of the sequence labeling corpus.
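
The rule described in the abstract for the "degree of differentiation" entity (match with regular expressions, then keep the longest matched text) can be illustrated with a minimal Python sketch. The abstract does not give the actual expressions, so the patterns, the English wording of the grades, and the function name `extract_differentiation` are all assumptions made for illustration only.

```python
import re

# Hypothetical vocabulary: the thesis does not list its real patterns.
GRADE = r"(?:well|moderately|poorly)"

# Several overlapping patterns, from simple to compound expressions.
PATTERNS = [
    re.compile(rf"{GRADE} differentiated", re.IGNORECASE),
    re.compile(rf"{GRADE}(?: to |-){GRADE} differentiated", re.IGNORECASE),
    re.compile(r"undifferentiated", re.IGNORECASE),
]


def extract_differentiation(text):
    """Return the longest text matched by any pattern, or None if nothing matches."""
    candidates = [m.group(0) for p in PATTERNS for m in p.finditer(text)]
    if not candidates:
        return None
    # Keeping the longest candidate prefers the compound phrase
    # over the shorter phrase contained in it.
    return max(candidates, key=len)


report = "Invasive adenocarcinoma, moderately to poorly differentiated, acinar predominant."
result = extract_differentiation(report)
```

Here both "poorly differentiated" and "moderately to poorly differentiated" match the sample sentence, and choosing the longest candidate keeps the full compound grade rather than its fragment.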
Keywords/Search Tags: Pathological Report, Named Entity Recognition, Attention Mechanism, BERT, Bi-LSTM, CRF