Research And Implementation Of Some Main Techniques In Data Preprocessing System

Posted on: 2013-04-16; Degree: Master; Type: Thesis
Country: China; Candidate: F W Bai; Full Text: PDF
GTID: 2248330371977796; Subject: Computer Science and Technology
Abstract/Summary:
With the development of database and information technology, enterprises have accumulated large amounts of data for transaction management and data analysis. How to use these data effectively has become the greatest concern of enterprises. Data mining emerged from the need to extract a small amount of extremely valuable information from large volumes of data. However, because of the unique architectural design of each database, data collection errors, random data entry errors, inadequate maintenance, and so on, these data inevitably contain problems. In addition, the sharp increase in data volume makes data mining tasks much more difficult. These problems strongly affect the success of data mining tasks. It is therefore necessary to improve data quality before data mining is carried out, which is the purpose of data preprocessing.

This paper first introduces the basic concepts and main tasks of data preprocessing. It then describes the data preprocessing system in detail, including the parts the system had already implemented and the parts implemented in this work. Next, it describes data represented in XML format, proposes a data format based on XML Schema definition together with a batch processing method for handling large data collections, and analyzes XML parsing methods. It then describes and compares similarity measure algorithms, points out their problems and ways to improve them, and proposes a distance measure algorithm that takes the data distribution into account and an improved cosine similarity measure algorithm. Finally, it analyzes discretization algorithms and proposes a discretization algorithm based on similarity measures.
Keywords/Search Tags: Data Mining, Data Preprocessing, Preprocessing System, XML Format, Similarity Measure, Discretization