Research On The Key Technologies Of Health Database Storage Architecture And Efficient Data Access

Posted on: 2019-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
GTID: 2404330623450543
Subject: Engineering
Abstract/Summary:
A health database holds the dynamic records related to residents' health, and it is important for establishing and improving residents' medical treatment, health care services, and public health development. Medical and health data are schema-free and semi-structured, and managing residents' health records involves massive data volumes. Designing a storage strategy that also meets the efficient data access needs of different users therefore poses a major challenge to the development of a universal health record database, which makes research on health database storage architecture and efficient data access methods important.

The main content of this thesis is the design and implementation of a health database storage architecture that uses MongoDB as the data source for daily business processing, Hive as the target data store, and Spark SQL for upper-level statistical analysis. We perform load-driven performance optimization to obtain more efficient data access and statistical analysis. To construct the overall storage architecture, the thesis implements and optimizes the ETL process, which includes change data capture based on the oplog, anomaly detection using a Gaussian mixture model (GMM), and ETL automation. It also designs and implements a data parsing method that improves on the default parsing efficiency of PDI, and replaces the original polling (round-robin) partitioning with a new data partitioning mode. In addition, Hive and Spark SQL are compared in terms of statistical analysis capability and optimization in order to achieve more efficient data access. The main work of this thesis covers the following aspects:

(1) Design of the health database storage architecture. The thesis analyzes and tests MongoDB's operating mechanism, storage architecture, and synchronization mechanism, as well as its performance under different scenarios, and analyzes the performance of connecting MongoDB to Spark SQL in comparison with Hive. Based on these tests and analyses, the overall health database storage architecture is designed and determined.

(2) Implementation of the health database storage architecture. The thesis designs and implements OBMCDC, a change data capture method for MongoDB based on the MongoDB operation log (oplog). It requires no changes to the source data, imposes no requirements on the source data format, and does not interfere with the primary business. Abnormal data is detected using a Gaussian mixture model, and the ETL process is automated using PDI. The thesis also designs and implements the DataParse data parsing method and the UDM partitioning method. DataParse serializes the result of a MongoDB query, resolves the path of each required field, reduces the field-by-field reassembly performed during parsing, and thereby optimizes PDI's default parsing process. UDM is added as a new partitioning method because PDI supports only the default polling partition. With UDM, logically adjacent data is placed in the same partition, which reduces data transfer between partitions and improves the performance of statistical queries.

(3) Implementation of efficient data access to the health database. The thesis designs and implements performance optimizations for data access and statistical analysis in Hive and Spark SQL. A scheme for handling data skew is designed and implemented, and Hive is optimized through its join strategies and the number of MapReduce tasks. The thesis compares the performance of Hive and Spark SQL under different scenarios, and analyzes the impact and applicable scenarios of file formats and compression methods for different query types (aggregation, named-field queries, and joins).
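
Item (2) describes change data capture driven by the MongoDB oplog. The abstract does not describe how OBMCDC works internally, so the following is only a minimal Python sketch of the general log-based idea, not the thesis's method. It assumes simplified oplog entries held in memory, with plain integers standing in for BSON timestamps. The names `capture_changes` and `sample_oplog` and the `health.records` namespace are hypothetical. The `ts`, `op`, `ns`, `o`, and `o2` fields follow the layout of real oplog entries.

```python
from typing import Any, Dict, Iterable, Iterator

# Simplified oplog entry; real entries live in the local.oplog.rs collection
# and carry a BSON Timestamp in "ts" (an int is used here as a stand-in).
OplogEntry = Dict[str, Any]

# Oplog operation codes that represent data changes.
OP_NAMES = {"i": "insert", "u": "update", "d": "delete"}


def capture_changes(
    oplog: Iterable[OplogEntry], namespace: str, last_ts: int
) -> Iterator[Dict[str, Any]]:
    """Yield change records for one namespace that are newer than last_ts."""
    for entry in oplog:
        if entry["ts"] <= last_ts:
            continue  # already handled by an earlier run
        if entry.get("ns") != namespace:
            continue  # change belongs to another collection
        op = entry.get("op")
        if op not in OP_NAMES:
            continue  # skip no-ops ("n") and commands ("c")
        # For updates the target _id is in "o2"; otherwise it is in "o".
        key_doc = entry["o2"] if op == "u" else entry["o"]
        yield {
            "ts": entry["ts"],
            "kind": OP_NAMES[op],
            "id": key_doc["_id"],
            "payload": entry["o"],
        }


# Hypothetical sample log (update payload simplified to a $set document).
sample_oplog = [
    {"ts": 1, "op": "i", "ns": "health.records", "o": {"_id": 1, "bp": 120}},
    {"ts": 2, "op": "n", "ns": "", "o": {"msg": "periodic noop"}},
    {"ts": 3, "op": "u", "ns": "health.records",
     "o": {"$set": {"bp": 135}}, "o2": {"_id": 1}},
    {"ts": 4, "op": "i", "ns": "health.visits", "o": {"_id": 9}},
    {"ts": 5, "op": "d", "ns": "health.records", "o": {"_id": 1}},
]

# Resume from checkpoint 1, then advance the checkpoint to the last change seen.
changes = list(capture_changes(sample_oplog, "health.records", last_ts=1))
checkpoint = changes[-1]["ts"] if changes else 1
```

Because the reader only consumes log entries, the source collections are never modified, which matches the non-intrusive property described in item (2). A real reader would query `local.oplog.rs` on a replica set member and persist the checkpoint between runs.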
Keywords/Search Tags: ETL, MongoDB, Hive, Spark SQL, Change Data Capture, Partition