Design And Implementation Of Data Cleaning System Based On Memcached

Posted on: 2018-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: F X Qiu
GTID: 2428330542968209
Subject: Software engineering
Abstract/Summary:
Dirty data is widespread. Typical examples include misspellings, inconsistent output formats, invalid data values, non-standard abbreviations and capitalization, multiple representations of the same entity (duplicates), and inconsistent referential integrity. Cleaning large volumes of data normally calls for a high-performance dedicated server, which means additional hardware expense for individual data analysts and businesses. At the same time, a large number of computer nodes (personal PCs or dedicated servers) have spare capacity. In practical data mining and analysis, a large number of unclassified, unformatted, and uncleaned data files are collected. These files contain a wealth of relationships that can be mined, but extracting valuable data from such a mass of data requires that it be categorized, identified, formatted, and cleaned. Done manually, this work is time-consuming and labor-intensive, and the quality of the cleaned data is low, so it needs to be standardized and automated. Cleaning massive data files places relatively high demands on machine processing performance, yet the volume of data differs from one cleaning job to the next, so adding several high-performance machines may waste resources. This thesis divides the data cleaning system into a data loading module, a data cleaning rule module, a data processing module, and a data analysis module, and uses the in-memory database Memcached as an intermediate cache server, so that each module can run independently and exchange data through Memcached. Using the Memcached distributed cache server as the exchange queue for the intermediate data of all modules, the system joins ordinary office PCs or off-the-shelf servers into a large-scale data classification and cleaning system that classifies and cleans large amounts of data in parallel, making maximum use of existing hardware and network resources. In the early stage of data mining, the system produces well-formed, structured data, which provides a better foundation for the integrity and reliability of the later data mining and analysis work.
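
The abstract describes modules that run independently and exchange intermediate data through Memcached used as a queue. The thesis text is not available here, so the following is only a minimal sketch of that idea and not the author's implementation. It assumes a queue built from two counters plus numbered item keys, and every name in it (`FakeMemcached`, `CacheQueue`, `clean`, `run_pipeline`) is hypothetical. `FakeMemcached` is an in-process stand-in that imitates the `get`/`set`/`add`/`incr`/`delete` commands so the example runs without a server. A real client would return bytes and would need handling for concurrent producers and consumers, which this sketch omits.

```python
class FakeMemcached:
    """In-process stand-in imitating a few memcached commands (assumption)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value
        return True

    def add(self, key, value):
        # Like memcached "add": store only if the key does not exist yet.
        if key in self._store:
            return False
        self._store[key] = value
        return True

    def incr(self, key, delta):
        # Like memcached "incr": returns the new value, or None if missing.
        if key not in self._store:
            return None
        self._store[key] = int(self._store[key]) + delta
        return self._store[key]

    def delete(self, key):
        return self._store.pop(key, None) is not None


class CacheQueue:
    """Hypothetical FIFO queue layered on key/value commands."""

    def __init__(self, cache, name):
        self.cache = cache
        self.name = name
        # "add" keeps existing counters if another module created them first.
        cache.add(f"{name}:head", 0)
        cache.add(f"{name}:tail", 0)

    def put(self, record):
        slot = self.cache.incr(f"{self.name}:tail", 1)  # reserve a slot
        self.cache.set(f"{self.name}:item:{slot}", record)

    def get(self):
        head = int(self.cache.get(f"{self.name}:head"))
        tail = int(self.cache.get(f"{self.name}:tail"))
        if head >= tail:
            return None  # queue is empty
        slot = self.cache.incr(f"{self.name}:head", 1)  # claim next slot
        key = f"{self.name}:item:{slot}"
        record = self.cache.get(key)
        self.cache.delete(key)
        return record


def clean(record):
    """Example cleaning rule: collapse whitespace and lowercase."""
    return " ".join(record.split()).lower()


def run_pipeline(raw_records):
    cache = FakeMemcached()
    raw_q = CacheQueue(cache, "raw")
    clean_q = CacheQueue(cache, "clean")

    # Data loading module: push raw records into the exchange queue.
    for record in raw_records:
        raw_q.put(record)

    # Data processing module: apply rules, drop empties and duplicates.
    seen = set()
    while (record := raw_q.get()) is not None:
        cleaned = clean(record)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            clean_q.put(cleaned)

    # Data analysis module: drain the cleaned records.
    results = []
    while (record := clean_q.get()) is not None:
        results.append(record)
    return results
```

In this sketch the three stages run one after another in a single process. In a deployment like the one the abstract describes, each stage would presumably run on a separate machine and share only the cache server address.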
Keywords/Search Tags: Memory Database, Memcached, Data Cleaning