Data Mining Classification Algorithms in a Distributed Environment

Posted on: 2006-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: N Bin
Full Text: PDF
GTID: 2208360182968970
Subject: Communication and Information System
Abstract/Summary:
Classification is an important research area in data mining. Traditional classification algorithms and models work by regularly uploading mission-critical data to a warehouse for subsequent centralized data mining. This centralized approach is fundamentally inappropriate for most distributed and ubiquitous data mining applications: response times are long, distributed resources are poorly used, and the basic characteristics of centralized data mining algorithms do not suit distributed environments. Taking a distributed point of view, this thesis explores and analyzes classification knowledge and proposes improved distributed algorithms.
First, the thesis proposes the DSPRINT algorithm, which uses vertically partitioned datasets and a synchronously updated hash table to build decision trees in a heterogeneous distributed environment, together with an improved version that adds section estimation and section filtration. In DSPRINT, a histogram data structure is used to merge the class list into each attribute list, which reduces the data so that it fits in memory. DSPRINT further increases the accuracy of the classification model by using the hash table to record and supervise node splitting across the distributed data sites. Under the strategy of vertical partitioning and synchronous hash-table updates, the corresponding hash-table entries are selected and modified according to the lowest Gini value. Because DSPRINT handles continuous attributes inefficiently, the improved algorithm partitions the value domain of each continuous attribute into sections by sampling, estimates the probability that each section contains the best split point, and then searches the likely sections one by one.
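The Gini-based split search and the section idea described above can be illustrated with a small single-machine sketch. This is not the thesis's DSPRINT implementation: the function names, the `num_sections` and `keep` parameters, and the way sections are ranked are all assumptions made for illustration. In particular, sections are formed here as equal-size chunks of the sorted distinct values rather than by sampling, and the probability estimate is replaced by a simple stand-in heuristic that ranks each section by the Gini value at its two boundary points.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(records, threshold):
    """Weighted Gini of splitting (value, label) records at value <= threshold."""
    left = [label for value, label in records if value <= threshold]
    right = [label for value, label in records if value > threshold]
    n = len(records)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def candidate_thresholds(records):
    """Every distinct attribute value except the largest is a candidate."""
    values = sorted({value for value, _ in records})
    return values[:-1]


def best_split_exhaustive(records):
    """Baseline: evaluate every candidate threshold, return (threshold, gini)."""
    candidates = candidate_thresholds(records)
    if not candidates:
        return None
    scored = ((t, split_gini(records, t)) for t in candidates)
    return min(scored, key=lambda pair: pair[1])


def best_split_by_sections(records, num_sections=4, keep=2):
    """Hypothetical section filter: search only the `keep` most promising sections.

    Assumption: a section is ranked by the lower Gini of its first and last
    candidate. This heuristic stands in for the probability estimate
    mentioned in the abstract and may miss the true optimum.
    """
    candidates = candidate_thresholds(records)
    if not candidates:
        return None
    size = max(1, len(candidates) // num_sections)
    sections = [candidates[i:i + size] for i in range(0, len(candidates), size)]

    def boundary_score(section):
        return min(split_gini(records, section[0]), split_gini(records, section[-1]))

    likely = sorted(sections, key=boundary_score)[:keep]
    scored = [(t, split_gini(records, t)) for section in likely for t in section]
    return min(scored, key=lambda pair: pair[1])
```

The trade-off is the usual one for filtering schemes: `best_split_by_sections` evaluates fewer thresholds than the exhaustive search, but its result is only as good as the ranking heuristic.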
Measurements from actual implementations of these algorithms show that DSPRINT and its improved version achieve approximately equal classification accuracy when an appropriate number of sections is selected. Moreover, the accuracy is high and remains unchanged as the number of distributed sites increases. The experiments also show that the improved DSPRINT runs faster than DSPRINT.
Second, for classification problems with monotonicity constraints in distributed data mining, the thesis proposes extending R. Potharst's method for building monotone decision trees to the distributed environment. As a complement to DSPRINT, the algorithm adds an Update rule to DSPRINT. A non-monotone decision tree can thus be repaired into a monotone one simply by adding corner elements to the leaves and growing a few more branches where necessary, without frequently inserting elements into the datasets at the distributed data sites.
Third, to address problems of traditional distributed data mining such as data fragmentation, result integration, and security, the thesis combines mobile agent technology with data mining, which makes it feasible to build a data mining system for analyzing and applying large distributed databases. In this approach, communication among facilitators, data mining agents, databases, and users, as well as program execution and code transfer, can be carried out thanks to the mobility, parallelism, asynchrony, and resource optimization of mobile agents.
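To make the monotonicity constraint from the second contribution concrete, the sketch below checks whether a classifier is monotone on a set of points: whenever one point is less than or equal to another in every attribute, its predicted class should not be higher. It only detects violations and does not implement the Update rule or the corner-element repair described in the abstract. The names `dominates` and `monotonicity_violations`, and the use of integer class labels, are assumptions for illustration.

```python
from itertools import combinations


def dominates(x, y):
    """True if x is less than or equal to y in every attribute."""
    return all(a <= b for a, b in zip(x, y))


def monotonicity_violations(points, predict):
    """Return ordered pairs (lower, upper) where lower <= upper in every
    attribute but predict(lower) > predict(upper)."""
    violations = []
    for x, y in combinations(points, 2):
        if dominates(x, y) and predict(x) > predict(y):
            violations.append((x, y))
        elif dominates(y, x) and predict(y) > predict(x):
            violations.append((y, x))
    return violations
```

An empty result means the classifier is monotone on the given points. The pairwise check is quadratic in the number of points, so it is suited to small validation sets rather than to full distributed datasets.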
Keywords/Search Tags: data mining, classification rules, decision trees, monotone constraints, mobile agent