A Method For Multi-language Recognition And Multi-coded Identification

Posted on: 2013-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: S S Li
Full Text: PDF
GTID: 2235330371488386
Subject: Information Science
Abstract/Summary:
Driven by globalization, educational, economic and cultural activities are increasingly carried out across national borders. On the one hand, the rapid growth of the Internet speeds up globalization; on the other hand, the language barrier may prove to be its last remaining obstacle. Under these circumstances, automatic language recognition technology has advanced only slowly. Considerable research has been devoted, at home and abroad, to automatic text classification, machine translation and multilingual information retrieval. There is a consensus in the field that "automatic language recognition can be regarded as a special case of text auto-classification based on certain characteristics". Research on text auto-classification began in the 1960s and has since passed through three periods: text auto-classification, manually assisted classification, and machine learning. Many statistical algorithms are available for auto-classification, such as KNN, decision trees, the Rocchio algorithm, Naive Bayes, support vector machines, the maximum entropy model, genetic algorithms and neural networks, and all of them perform well in this area. Machine translation is one of the most important research fields of machine learning and plays a crucial role in multilingual information retrieval, so it is expected to serve as the core module of most cross-language information retrieval systems. With the help of dictionaries, corpora and ontologies, or of free Internet tools built on them (for example, the Internet Passport MT system and the online WorldLingo MT system), it is easy to match a query against the multilingual documents in a back-end database.
Although language recognition is a precursor step to machine translation and has always had an important effect on multilingual information retrieval, it has been neglected for years. In the author's view, the open problems in language recognition belong to natural language processing rather than to text auto-classification. The language recognition program in this paper is based on the N-Gram language model, which corresponds to an (N-1)-order Markov chain. The model has been applied to part-of-speech tagging, sound-to-character conversion and speech recognition, where it has been one of the most successful approaches to building fast and accurate systems. In this paper it is used to automatically recognize the language of written text. Seven languages that are among the most widely used on the Internet were chosen as experimental subjects: Chinese, English, French, German, Russian, Japanese and Korean. The language recognition experiments were divided into two stages, training on a multilingual corpus and testing of language recognition; the training and test texts came from the Open Directory Project. The results show that the program performs well on English and German, for which the average recognition accuracy on both long and short texts is 100%. Russian comes next at 94.44%, followed by Chinese Simplified at 94.44%, Chinese Traditional at 83.33%, French at 83.33%, and Korean at 16.67%. Korean is recognized well once interference from Chinese characters is excluded. Furthermore, following the same procedure, the experiment took two Japanese encodings, EUC-JP and SHIFT-JIS, and ran an exploratory test of how effective the N-Gram approach is when applied to encoding identification.
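The train-then-test procedure described above can be illustrated with a minimal sketch. It is not the thesis's program: it scores a text by smoothed n-gram frequencies, which is a simplification of a full Markov chain model, and the names `ngrams`, `NGramModel` and `identify`, the choice of bigrams, the add-one smoothing and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(seq, n=2):
    """Return all overlapping n-grams of a str or bytes sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NGramModel:
    """N-gram frequency profile with add-one smoothing (illustrative only)."""

    def __init__(self, samples, n=2):
        self.n = n
        self.counts = Counter()
        for sample in samples:          # training stage: count n-grams
            self.counts.update(ngrams(sample, n))
        self.total = sum(self.counts.values())
        # +1 reserves some probability mass for n-grams never seen in training
        self.vocab = len(self.counts) + 1

    def log_score(self, seq):
        """Average smoothed log-probability per n-gram of `seq`."""
        grams = ngrams(seq, self.n)
        if not grams:
            return float("-inf")
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def identify(seq, models):
    """Testing stage: return the label whose model scores `seq` highest."""
    return max(models, key=lambda label: models[label].log_score(seq))


# Toy training data (assumed; the thesis trained on Open Directory Project texts).
models = {
    "English": NGramModel(["the quick brown fox jumps over the lazy dog",
                           "this is a simple english sentence"]),
    "German": NGramModel(["der schnelle braune fuchs springt über den faulen hund",
                          "dies ist ein einfacher deutscher satz"]),
}
print(identify("the dog is lazy", models))
```

Because `ngrams` slices any sequence, the same sketch accepts `bytes`, so one model per encoding can be trained on encoded text in the spirit of the exploratory EUC-JP and SHIFT-JIS test.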
More encouragingly, the approach also does a good job of distinguishing the Japanese encoding EUC-JP from the Japanese encoding SHIFT-JIS: the proportions correctly identified are 85% and 95%, and both identification errors were below 0.0020. Encoding identification using the N-Gram approach is one of the highlights of this paper. The paper then introduces the full-text search framework Lucene 3.5 and, with reference to its core code, explains how the parts of its index module and search module that relate to multi-language recognition work. It then analyzes Lucene's built-in class Analyzer and modifies the details of the language recognition process to meet the needs of the index and search module interfaces: the identification results for Chinese Simplified and Chinese Traditional are unified into the return type "Chinese", while Japanese and Korean are unified into the return type "CJK". In this way the multi-language recognition program becomes a module of Lucene 3.5, adding multi-language recognition whenever indexes are built or queries are submitted. The program may thus help with cross-language information retrieval and smooth the user experience. This work is preliminary: because of space and time constraints, only the module design and its interfaces are provided, and the next step is to implement a cross-language information retrieval system based on Lucene.
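The unification of recognition results into the return types "Chinese" and "CJK" can be sketched as a simple lookup. This is an illustration only: the function name `unified_label` and the input label strings are assumptions, and the thesis's actual change was made inside Lucene 3.5, in Java, whose code is not shown in the abstract.

```python
# Hypothetical recognizer labels mapped to the merged return types named in
# the abstract; labels not listed here pass through unchanged.
MERGED_LABELS = {
    "Chinese Simplified": "Chinese",
    "Chinese Traditional": "Chinese",
    "Japanese": "CJK",
    "Korean": "CJK",
}


def unified_label(detected):
    """Map a recognized language to the type handed to indexing and search."""
    return MERGED_LABELS.get(detected, detected)


print(unified_label("Chinese Traditional"))
print(unified_label("Korean"))
print(unified_label("German"))
```

Keeping the merge in one table means the recognizer itself can still report fine-grained labels, while the index and search modules see only the coarser types they need.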
Keywords/Search Tags: Language Recognition, Coded Identification, N-Gram, Cross-language Information Retrieval, Lucene