Distributed Environment, Data Mining Classification Algorithm

Posted on:2006-09-01

Degree:Master

Type:Thesis

Country:China

Candidate:N Bin

Full Text:PDF

GTID:2208360182968970

Subject:Communication and Information System

Abstract/Summary:

Classification is an important research area in data mining. Traditional classification algorithms and models work by regularly uploading mission-critical data to a warehouse for subsequent centralized mining. This centralized approach is fundamentally inappropriate for most distributed and ubiquitous data mining applications: it suffers from long response times and makes poor use of distributed resources, and centralized data mining algorithms by their nature do not work well in distributed environments. Taking a distributed point of view, the thesis explores and analyzes classification knowledge in depth and proposes improved distributed algorithms.

Firstly, the thesis proposes the DSPRINT algorithm, which uses vertically partitioned datasets and a synchronously updated hash table to build decision trees in a heterogeneous distributed environment, together with an improved version based on section estimation and section filtration. In DSPRINT, a histogram data structure merges the class list into each attribute list, which reduces the data so that it fits in memory. DSPRINT further increases the accuracy of the classification model by using a hash table to record and supervise node splitting across the distributed data sites. Under the vertical-partitioning and synchronous-update strategy, the corresponding hash table entries are selected and modified according to the lowest Gini value. Because DSPRINT handles continuous attributes inefficiently, the improved algorithm introduces section estimation and section filtration: it partitions the value domain of each continuous attribute into sections by sampling, estimates the probability that the best split point lies in each section, and finally searches the likely sections one by one.
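The idea of choosing a split by the lowest Gini value across vertically partitioned sites, and then recording the outcome in a shared hash table, can be sketched as follows. This is a minimal single-process illustration in the style of SPRINT attribute lists, not the thesis's implementation: the function names, the `sites` layout, the midpoint split rule, and the `"left"`/`"right"` hash table encoding are all assumptions made for the example, and no network communication, section estimation, or section filtration is modeled.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class histogram: 1 - sum of squared class shares."""
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_split(attribute_list):
    """Scan one attribute list of (value, class_label, record_id) triples.

    Returns (split_value, weighted_gini) for the test `value <= split_value`
    with the lowest weighted Gini index, or (None, inf) if no split exists.
    """
    records = sorted(attribute_list, key=lambda r: r[0])
    n = len(records)
    below = Counter()                        # class histogram left of the cursor
    above = Counter(r[1] for r in records)   # class histogram right of the cursor
    best_value, best_score = None, float("inf")
    for i in range(n - 1):
        value, label, _ = records[i]
        below[label] += 1
        above[label] -= 1
        next_value = records[i + 1][0]
        if value == next_value:
            continue  # no split point between equal values
        n_below, n_above = i + 1, n - i - 1
        score = (n_below * gini(below, n_below) + n_above * gini(above, n_above)) / n
        if score < best_score:
            best_value, best_score = (value + next_value) / 2, score
    return best_value, best_score


def choose_global_split(sites):
    """Pick the lowest-Gini split over all sites and build the hash table.

    `sites` maps a (hypothetical) site name to {attribute_name: attribute_list};
    each site holds different attributes of the same records.
    """
    best = None
    for site, attributes in sites.items():
        for name, attribute_list in attributes.items():
            value, score = best_split(attribute_list)
            if value is not None and (best is None or score < best[0]):
                best = (score, site, name, value)
    if best is None:
        return None, {}
    _, site, name, value = best
    # The winning site maps each record id to a child node; the other sites
    # would use this table to split their own attribute lists consistently.
    hash_table = {rid: ("left" if v <= value else "right")
                  for v, _, rid in sites[site][name]}
    return best, hash_table


sites = {
    "site_a": {"age": [(23, "no", 0), (35, "yes", 1), (41, "yes", 2), (19, "no", 3)]},
    "site_b": {"income": [(30, "no", 0), (20, "yes", 1), (50, "yes", 2), (40, "no", 3)]},
}
best, hash_table = choose_global_split(sites)
print(best)        # (score, site, attribute, split value)
print(hash_table)  # record id -> child node
```

In this toy data the `age` attribute separates the two classes perfectly, so its split wins over `income`, and the hash table tells every site which child each record belongs to.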
Measurements from actual implementations show that DSPRINT and its improved version achieve approximately equal classification accuracy when an appropriate number of sections is selected. Moreover, the accuracy is higher and remains stable as the number of distributed sites increases. The experiments also show that improved DSPRINT runs faster than DSPRINT.

Secondly, for classification problems with monotonicity constraints in distributed data mining, the thesis proposes extending R. Potharst's method for building monotone decision trees to the distributed environment. As a complement to DSPRINT, the algorithm adds an Update rule to DSPRINT. A non-monotone decision tree can then be repaired into a monotone one simply by adding corner elements to the leaves and growing a few more branches where necessary, without frequently inserting elements into the datasets at the distributed data sites.

Thirdly, to address problems in traditional distributed data mining such as data fragmentation, result integration, and security, the thesis combines mobile agent technology with data mining, which makes it feasible to build a data mining system for analyzing and applying huge, distributed databases. In this approach, communication among facilitators, data mining agents, databases, and users, as well as program execution and code transfer, is made possible by the mobility, parallelism, asynchrony, and resource optimization of mobile agents.
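The monotonicity constraint behind the second contribution can be made concrete with a small check: whenever one example is less than or equal to another in every attribute, its ordinal class must not be higher. The sketch below only detects violating pairs in a labeled set. It does not implement the Update rule or the corner-element repair described in the abstract, and the function names and data layout are assumptions made for illustration.

```python
from itertools import combinations


def dominates(x, y):
    """True if x is less than or equal to y in every attribute."""
    return all(a <= b for a, b in zip(x, y))


def monotonicity_violations(examples):
    """Find pairs that break the monotonicity constraint.

    `examples` is a list of (attribute_vector, ordinal_class) pairs.
    Each returned pair ((x, cx), (y, cy)) has x <= y attribute-wise
    but cx > cy, i.e. the smaller example carries the higher class.
    """
    violations = []
    for (x, cx), (y, cy) in combinations(examples, 2):
        if dominates(x, y) and cx > cy:
            violations.append(((x, cx), (y, cy)))
        elif dominates(y, x) and cy > cx:
            violations.append(((y, cy), (x, cx)))
    return violations


examples = [((1, 1), 0), ((2, 2), 1), ((3, 1), 0), ((3, 3), 0)]
violations = monotonicity_violations(examples)
print(violations)
```

Here the example `(2, 2)` with class 1 is dominated by `(3, 3)` with class 0, so that pair is reported. A labeled set, or the leaf labels of a tree evaluated at such points, is monotone exactly when the returned list is empty.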

Keywords/Search Tags:

data mining, classification rules, decision trees, monotone constraints, mobile agent

Related items

1	Decision Tree Classification Algorithm Parallelization And Its Application
2	Mining Association Rules With Constraints
3	Research Of Data Mining Algorithm Based On Association Rules
4	Research And Applications Of Data Mining
5	Study On Afforestation Decision Based On Data Mining
6	Design And Implementation Of Course Assessment And Analysis Of Decision System Based On Data Mining
7	Data Mining And Its Applications In The Field Of Medicine
8	Search Of Classification Algorithms For Data Mining
9	Study On Data Mining Technologies Based On Mobile Agent
10	Research Of The Decision Trees And It's Application