The Design And Implementation Of The Chinese E-Mail Classification System Based On Text Classification Technology

Posted on: 2008-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhong
Full Text: PDF
GTID: 2178360215497623
Subject: Computer application technology
Abstract/Summary:
Spam filtering is an important research problem for the Internet. Its main difficulty is that the definition of spam is subjective: a message that is spam to one user may contain useful information for another. In this sense, classifying e-mails by their content is more meaningful than simply dividing them into spam and non-spam.

Classifying e-mails by content is an important application of text classification technology. The thesis therefore first introduces the basic concepts and background of text classification and systematically discusses the process of automatic text classification. The key techniques involved, including the vector space model, feature extraction, and machine learning methods, are explained theoretically and described as algorithms.

An implementation plan for an automatic e-mail classification system is then presented, together with a system architecture based on text classification technology. Word segmentation and syntactic analysis of the text use existing tools: the Chinese lexical analysis system ICTCLAS, and a syntactic parser based on PCFG-PROP from the Chinese Academy of Sciences. As a result, the index terms extracted from the text are more likely to reflect the focus words, which improves the precision and recall of the system. The concept of a threshold is introduced to strengthen the categorization capability of the system. The system was implemented under VC, and evaluations and results are given.

The experimental results show that using the improved Simple Vector Distance Algorithm as the classification algorithm effectively improves the classification performance of the system and achieves the expected recall and precision.
Keywords/Search Tags: E-mail classification, text classification, VSM, mail decoding, Chinese Phrase Segmentation, Feature selection, threshold value, Simple Vector Distance Algorithm
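
The abstract names a vector space model, a Simple Vector Distance Algorithm, and a threshold, but does not give their details. The following is a minimal sketch of how a plain (not the thesis's improved) simple vector distance classifier is commonly built: each class is represented by the centroid of its training vectors, and a new message is assigned to the class whose centroid is closest by cosine similarity. The function names, the term-frequency weighting, the threshold value of 0.2, and the rule that a message below the threshold is left unclassified are all assumptions made for illustration. Messages are assumed to be already segmented into tokens (the thesis uses ICTCLAS for that step).

```python
import math
from collections import Counter, defaultdict


def to_vector(tokens):
    """Term-frequency vector for one already-segmented message (assumed weighting)."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(labeled_docs):
    """Average the vectors of each class to get one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for label, tokens in labeled_docs:
        sums[label].update(to_vector(tokens))  # adds term counts
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(tokens, centroids, threshold=0.2):
    """Return (label, score); label is None when no centroid reaches the threshold.

    The threshold rule is an assumption: the abstract only says a threshold
    is used, not how it is applied.
    """
    vec = to_vector(tokens)
    best_label, best_score = None, 0.0
    for label, centroid in centroids.items():
        score = cosine(vec, centroid)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Hypothetical toy data, for illustration only.
training = [
    ("work", ["meeting", "project", "deadline"]),
    ("work", ["project", "report", "meeting"]),
    ("shopping", ["discount", "order", "sale"]),
    ("shopping", ["order", "delivery", "discount"]),
]
centroids = train_centroids(training)
label, score = classify(["project", "meeting", "tomorrow"], centroids)
unknown_label, unknown_score = classify(["weather", "holiday"], centroids)
```

In this toy run the first message shares terms only with the "work" centroid and is assigned to it, while the second shares no terms with any class and is left unclassified. A real system would likely replace raw term frequency with a weighting such as TF-IDF and apply feature selection before building the centroids.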