| Government big data applications have accumulated a wide variety of data, with petabyte-scale volumes in storage and more than 1 TB of new data produced each day. The data sources are not unified, the data types are diverse, the storage methods differ, and the business systems are scattered. Users expect ever faster responses to data retrieval and application requests, while the performance and full-text search capability of the existing system have declined sharply. At the same time, loading external data in multiple source formats into the database efficiently and quickly is also a problem to be considered. Therefore, big data processing technology is urgently needed to design a business application for big data storage and integrated use. This article discusses how to solve the above problems with Hadoop, ElasticSearch, and ETL applications. Traditional relational databases and traditional full-text databases have significant performance bottlenecks when processing big data; these need to be addressed with a distributed data comparison engine and distributed full-text search technology, and the high-frequency loading of multiple data sources needs to be realized with an ETL tool. At present, distributed data search technology based on the Hadoop architecture, ElasticSearch distributed full-text retrieval technology, and the Kettle ETL application can meet the above requirements. However, problems remain in the comparison and retrieval efficiency of address-type data and in the efficiency of high-frequency incremental loading from multiple data sources. The address matching algorithm, the Chinese word segmentation, and Kettle's own data loading plug-in need to be improved and optimized. To solve these problems, the main work is: (1) to analyze the overall architecture and function design of the system; 
(2) to establish a distributed data matching engine and optimize the address comparison algorithm; (3) to establish a distributed full-text search application and improve full-text search efficiency; (4) to select a suitable ETL extraction method for high-frequency incremental loading, and to improve data loading performance through multi-threaded processing and loading code. System testing and implementation show that the distributed data matching engine and full-text retrieval technology based on Hadoop and ElasticSearch meet the requirements of the unit concerned: they solve the data comparison and full-text retrieval problems, achieve high-frequency incremental loading from multiple data sources, reduce overall investment, improve overall system performance, and realize the integration of the business systems. |
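
Work item (4) combines incremental extraction with multi-threaded loading. The following is a minimal sketch of that idea in plain Python, not the Kettle plug-in described in the article. It assumes each source row carries a modification timestamp that can serve as a watermark. The names `SOURCE_ROWS`, `Target`, `extract_increment`, `split_batches`, and `incremental_load` are hypothetical, and the in-memory `Target` only stands in for a real database.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Hypothetical source rows: (id, payload, modified_at as an integer timestamp).
SOURCE_ROWS = [
    (1, "a", 100),
    (2, "b", 105),
    (3, "c", 110),
    (4, "d", 120),
    (5, "e", 125),
]


def extract_increment(rows, watermark):
    """Return only the rows modified after the last loaded timestamp."""
    return [row for row in rows if row[2] > watermark]


def split_batches(rows, batch_size):
    """Split rows into fixed-size batches so threads can load them in parallel."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


class Target:
    """In-memory stand-in for the target store; the lock keeps writes thread-safe."""

    def __init__(self):
        self.rows = {}
        self._lock = Lock()

    def load_batch(self, batch):
        with self._lock:
            for row_id, payload, modified_at in batch:
                self.rows[row_id] = (payload, modified_at)  # upsert by key
        return len(batch)


def incremental_load(source_rows, target, watermark, batch_size=2, workers=4):
    """Load rows newer than `watermark`; return (rows_loaded, new_watermark)."""
    delta = extract_increment(source_rows, watermark)
    if not delta:
        return 0, watermark
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = sum(pool.map(target.load_batch, split_batches(delta, batch_size)))
    # Advance the watermark only after every batch has been loaded.
    return loaded, max(row[2] for row in delta)


target = Target()
loaded, watermark = incremental_load(SOURCE_ROWS, target, watermark=105)
print(loaded, watermark)
```

Because the watermark is returned only after all batches complete, a failed batch raises before the watermark moves, so the next run would pick up the same rows again. Upserting by key keeps such a re-run idempotent.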