Research Of Web Text Classification Based On Decision Tree Classification Algorithm

Posted on:2012-04-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Lin

Full Text:PDF

GTID:2178330338994824

Subject:Computer software and theory

Abstract/Summary:

Data mining plays an important role in the theory of computer algorithms. Since the start of the new century in particular, it has become central to work on databases and data warehouses, and the success of search engines has made it an important branch of computer science research. The development of decision tree classification reflects this trend.

The CLS method was the first decision tree classification algorithm. It was followed by ID3, then by C4.5 (an improvement on ID3), CART, SLIQ, SPRINT and others. The emergence and successive refinement of these algorithms have enriched the decision tree method. Text classification is a very important task in Web data mining. It involves four main steps: text representation, feature extraction, classifier construction and rule extraction. Feature extraction and classifier construction are computationally expensive, so the choice of method for each has a significant impact on the entire classification process.

In this thesis, several classical decision tree classification algorithms are first studied and analyzed, and the differences among them are identified by comparison. Second, the C4.5 algorithm is improved: the Maclaurin formula is substituted into the original formula to obtain a simplified expression for the information gain ratio. The resulting formula greatly reduces the complexity of the original without introducing deviation.

C4.5 assumes that attributes are independent of one another, with no association between them. In practice this assumption may not hold, so the concepts of attribute correlation and user interest degree are introduced, and their impact on the algorithm is analyzed.
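For orientation, the quantity that C4.5 uses to choose a split, the information gain ratio, can be sketched as follows. This is the standard textbook definition for a categorical attribute, not the simplified formula developed in the thesis, which the abstract does not state. The function names `entropy` and `gain_ratio` and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a sequence of hashable items."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Standard C4.5 gain ratio of one categorical attribute.

    values: attribute value of each sample
    labels: class label of each sample (same length as values)
    """
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute values themselves.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): the first attribute separates the classes
# perfectly, the second carries no information about the class.
labels = [1, 1, 0, 0]
print(gain_ratio(["a", "a", "b", "b"], labels))
print(gain_ratio(["a", "b", "a", "b"], labels))
```

Every call to `entropy` evaluates logarithms, which is the cost that a simplified formula of the kind described above would aim to reduce.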
One advantage of C4.5 is that it can handle continuous attributes. The thesis proposes an improvement on the original method that greatly reduces the memory and computation time needed to handle continuous attributes, improving computational efficiency.

Applying the improved C4.5 decision tree algorithm to Web text classification broadens the applicability of decision tree classification. The thesis analyzes a shortcoming of the χ² statistic for feature extraction: it does not distinguish positive from negative contributions to class separation. The statistic is improved so that this contribution is reflected more clearly. The improved decision tree classifier is then applied to contract classification, and rule extraction is implemented. Finally, the algorithm is applied in a simple form to information collection in the OA system of a county development zone; experimental data show that the workload of editing information was reduced.
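The shortcoming of the χ² statistic noted above can be illustrated with a small sketch. The first function is the standard χ² score of a term for a class, computed from a 2×2 document-count table. The second attaches the sign of the correlation to that score; it is only one possible way to separate positive from negative contributions and is an assumption made here, not the improvement proposed in the thesis. All names and counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def signed_chi_square(a, b, c, d):
    """Chi-square score carrying the sign of the term/class correlation.

    Illustrative assumption: positive when the term is over-represented
    in the class, negative when it is under-represented.
    """
    score = chi_square(a, b, c, d)
    return score if a * d - c * b >= 0 else -score


# Hypothetical counts: a term concentrated inside the class, and a term
# concentrated outside it.
print(chi_square(40, 10, 10, 40), chi_square(10, 40, 40, 10))
print(signed_chi_square(40, 10, 10, 40), signed_chi_square(10, 40, 40, 10))
```

Because the difference `a * d - c * b` is squared, the standard score is identical for both terms, while the signed variant tells them apart.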

Keywords/Search Tags:

data mining, decision tree, C4.5 algorithm, text classification, rule extraction

Related items

1	Observation Analysis Of Decision Tree Extraction From Artificial Neural Network
2	Research And Application Of Classification Algorithm Based On Decision Tree Rules
3	Inductive Decision Tree Classification Model In The Military Transport Vehicle Management System
4	Hot-rolled Data Analysis Based On Decision Tree Learning And Rules Extraction
5	The Research Of Decision Tree Algorithm In Data Mining
6	Research On The Automatic Classification Algorithm Of Archive Text Based On Decision Tree
7	Research Of Decision Tree Optimizing And Association Rule Mining Algorithms
8	Classification Rule Mining In Financial Applications
9	The Research On Application Of Data Classification In Teaching Of High Learning
10	The Decision Tree Classification Method Is Applied Research In The Insurance Business