With the widespread application of computer technology in the medical field, using big data and artificial intelligence techniques to mine the key information in medical case literature remains a challenge. A medical big data solution with high reference value must address several difficulties: handling data processing at different scales, designing a highly available data warehouse, providing accurate analysis results, and implementing a simple, easy-to-use visualization platform. This paper mines text information from the free full-text database of biomedical and life sciences journal literature of the U.S. National Institutes of Health and the National Library of Medicine. To manage the library data in a standardized way, the paper draws on the experience of leading Internet companies in large-scale data development and works across three stages: data ETL, component selection and deployment of the data pipeline, and data analysis and visualization. Across these stages it introduces a flat data warehouse modeling method, a containerized distributed development environment, and a configurable data analysis and visualization platform, in order to build a complete medical case literature data warehouse and an efficient data processing pipeline. During the preliminary technical research, we studied and compared the commonly used technology stacks for storage, computation, query, analysis, and visualization. Among them, the distributed services Kylin, Impala, and Hive were deployed, and their query performance was tested on data sets at the scale of tens of millions of records. In the text processing stage, the medical case report literature is cleaned and extracted using commonly used machine learning preprocessing libraries and natural language processing algorithms, so that the unstructured text is split sensibly and mined for useful information; a complete data warehouse of medical case literature is then built on self-developed user-defined functions. Using the constructed data warehouse and the algorithm classification results, the system supports analysis of the literature by research field, funding institution, country of publication, publication period, and similar dimensions, as well as word frequency statistics for the keywords of each module and multi-dimensional, multi-metric queries that join multiple tables. Finally, using Superset, which offers rich display styles and supports mounting multiple data sources, we visualize the time distribution, research field distribution, and geographical distribution of the documents, together with a keyword word cloud, to present the overall distribution of the document database. The medical big data solution studied in this paper offers a useful reference for processing and mining complex, massive text data.