| Data coming from the real life as classification information is usually continuous format, but continuous feature is not proper inputs for most inductive machine learning algorithms. Discrete values are intervals in a continuous spectrum of values. Transforming continuous feature into discrete ones can reduce the tested data's scale greatly and the discrete values are more concise to represent and specify, easier to use and comprehend and closer to a knowledge-level representation than continuous values. Discretization can improve many machine learning methods' efficiency and accuracy, so people are paying more and more attention on it.For the advantages of "no need of assumption of data for machine learning", "getting knowledge from incomplete or inconsistent data" and "the form of the knowledge more concise to represent and comprehend", rough set theory have been considered as an advanced induction learning method. The discretization methods based on rough set can describe the dependency among the attributes precisely to produce the better discretized data.Most existing discretization methods have been reviewed and analyzed in this paper. Furthermore, new optimized discretization methods based on rough set for supervised learning are also proposed to resolve the problems discovered in the existing methods. The main research work consists of several aspects as follows1. Propose a new axe by which discretization methods can be classified further and introduce most existing methods under the classification frame.In general, discretization methods of continuous attributes can beclassified on several axes: Dynamic vs. Static, Unsupervised vs. Supervised, Local vs. Global, Direct vs. Incremental, Top-down vs. Bottom-up. The new axe "Attribute-independent vs. Attribute-dependent" is proposed in this paper. Furthermore, we can see the methods of supervised, global, incremental, attribute-dependent features lead to the advanced discretization result through analyzing existing algorithms under the classification frame and the methods based on rough set often have the characteristics mentioned above.2. Analyze the discretization methods based on rough set and improve the MD-Algorithm according to the problems discovered in the analysis.Making use of the ordering of the cuts and attribute values on continuous features, a new method based on rough set theory is designed for reducing the complexity of the existing MD-Algorithm. The new method doesn't compress the data structure into linear space only, but also computes the cuts' importance by specific formula. So it can improve the efficiency of the original MD-Algorithm a lot.Considering the impacts of relationship among cuts to discretization accuracy, design the discretization methods that employs the cut dependency computation proposed. The methods can improve the accuracy of original MD-Algorithm.3. Complete the emulational experiment and verify the advantages of the improved algorithms.Implement the algorithms designed and several classical discretization methods to the huge scale geography information sets. Compare and analyze the discretization results to prove the advantages of the proposed methods. |