| Korean characters, with more than five hundred years of history, not only share features of both Chinese and Western characters but also have a character structure of their own. With clear rules for structure and pronunciation, the Korean character system has played a tremendous role in the cultural and historical development of the Korean nationality; it is widely used and holds important influence in oriental countries. The structure of Korean characters is complex, and the characters form a large symbol set. Hence, research on the structure of Korean characters based on the theory and methods of information theory and machine learning is an important subject for the intelligent processing of Korean characters. In this dissertation, the statistical characteristics of Korean character structure were studied in detail after analyzing its composition rules, which provides a basis for pre-classification decisions in Korean character recognition. Firstly, the concept of the structure distance of Korean characters and a simple method for calculating it were proposed, based on the uniqueness of the linearization of Korean characters and the occurrence of the basic letters that compose a character; this provides a measure of the difference between character structures. According to the proposed structure distance, the whole character set was divided into 42 equivalence classes, each corresponding to the subset of characters that share the same structure. This offered a new method for pre-classification in character recognition and may greatly reduce the burden on the fine character classifier. Then, the probability distribution of Korean character structures was analyzed on a large number of actual Korean documents. 
The probabilities of characters with different structures in Korean documents revealed how those structures are used, which structures are dominant, and the average complexity of characters in documents. Finally, the information gain of the basic letters at different positions in a character was obtained with respect to structure classification, and a classification decision tree for character structure was then built using the ID3 algorithm. From the decision tree, the set of key basic letters with the maximum information gain could be obtained. A pre-classification algorithm for printed characters was presented based on the 12 most commonly used character structures, which offers an effective scheme for pre-classification in Korean character recognition. The statistical experimental results on actual documents show that modern Korean documents are generally composed of characters with simple structures. More than 99% of the content in actual documents can be expressed with only 17 of the 42 Korean character structures. On average, each Korean character in actual documents contains 2.67 basic letters. The vowels and final consonants are the key letter types for the pre-classification of character structures, and an effective pre-classification algorithm can be designed and implemented on that basis. |
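
The information-gain criterion that ID3 uses to choose split attributes can be illustrated with a minimal sketch. The feature names (`vowel_shape`, `has_final`, `script`), the toy samples, and the class labels below are hypothetical and chosen only for illustration; they are not the dissertation's feature set, data, or 42 structure classes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(samples, labels, feature):
    """Entropy reduction obtained by splitting the samples on one feature."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[feature], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: each sample describes one character by letter-level
# attributes, and each label names an illustrative structure class.
SAMPLES = [
    {"vowel_shape": "vertical", "has_final": "no", "script": "hangul"},
    {"vowel_shape": "vertical", "has_final": "yes", "script": "hangul"},
    {"vowel_shape": "horizontal", "has_final": "no", "script": "hangul"},
    {"vowel_shape": "horizontal", "has_final": "yes", "script": "hangul"},
]
LABELS = ["A", "B", "C", "D"]

GAINS = {f: information_gain(SAMPLES, LABELS, f) for f in SAMPLES[0]}
```

In this toy set, the two attributes that vary each halve the uncertainty about the class, while the constant attribute contributes nothing. ID3 would split on a highest-gain attribute first and then recurse on each subset.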