The Study Of Short Text Classification Algorithm Based On Semantic

Posted on: 2014-11-21   Degree: Master   Type: Thesis
Country: China   Candidate: J J Liu   Full Text: PDF
GTID: 2268330425481306   Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology and the widespread popularity of mobile communication devices, the number of short texts in the form of microblogs, online chats, BBS titles, news headlines, and purchase reviews has grown rapidly. Short texts are mostly produced in people's interactions. Their topics span politics, the economy, the military, entertainment, and daily life, and they express people's views and positions on various social phenomena. Short texts are brief in form yet rich in content and carry a large amount of information. Their rapid growth has brought great convenience to people's lives. At the same time, because short texts grow much faster than people can make use of them, people are at a loss in front of the vast amount of information and find it difficult to retrieve the useful information submerged in it, which costs the majority of users a great deal of time, money, and effort. On the other hand, short texts contain a mass of harmful and useless information, which seriously affects the decision-making efficiency of managers in government departments, companies, and enterprises. In the face of massive text data, text classification technology plays a pivotal role in obtaining the required data and information accurately and efficiently. In particular, applying text classification technology to short text data has broad application prospects in early warning of public opinion, popular language analysis, and topic detection and tracking. How to classify short texts so as to meet the requirements of various kinds of information processing has gradually become a research hotspot in the field in recent years. Although many scholars have done research on Chinese short text classification methods, the overall research is still in its infancy.
Backed by the sub-topic "Web personalized information analysis" of the network personalized information customized acquisition and mining system, and after an in-depth analysis of short text features and text classification techniques, this paper proposes a semantic-based short text classification algorithm, which has both research and practical significance. The main contributions of this paper are as follows:
1. Through an in-depth study of the characteristics of short text and of text classification algorithms, the paper concludes that the difficulty of short text classification lies in sparse features and high-dimensional data. It adopts the concept as the granularity of short text feature items, because concepts carry rich semantic information and can effectively improve the semantic expression ability of short text features.
2. A short text feature processing method based on "HowNet" is proposed. When handling large volumes of short text, the method effectively reduces the dimension of the short text feature space and enhances the category expression ability of the feature space. It uses HowNet to perform word sense disambiguation on the keywords of a short text and determine the concepts that the words represent in context. At the same time, it extracts class feature concepts from the concept features of the training set texts and increases their weights, which highlights the main features that favor classification while maintaining the sparsity of secondary features and noise.
3. A semantic-based short text classification algorithm is proposed.
The algorithm builds on the HowNet-based short text feature processing method and makes two improvements that target the drawbacks of the traditional KNN classification algorithm, notably its large amount of computation. First, it computes the center vector of each category in the training set, together with the radius of the center region and the radius of the approximate region of each category. Second, it makes an initial category judgment for each text to be classified and records the candidate categories in a category record table. During classification, the distance between the text's feature vector and the center vector of each training set category is computed. The categories in the record table are then checked in turn: a text that falls within the center region of a category is judged directly to belong to that category, a text that falls outside the approximate region of a category is judged directly not to belong to it, and only a text that falls within the approximate region of a category actually requires a KNN judgment. The experimental results show that the algorithm can effectively improve the efficiency and performance of short text classification.
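The region-based pre-filtering described above can be sketched roughly as follows. This is only an illustrative reading of the abstract, not the thesis implementation: the Euclidean distance, the choice of the approximate radius as the largest member-to-center distance, the `core_ratio` parameter, the fallback to plain KNN when no region contains the text, and all function names are assumptions introduced here. Plain numeric tuples stand in for the HowNet concept feature vectors.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_regions(train, core_ratio=0.5):
    """Return {label: (center, core_radius, approx_radius)} per category.

    Assumption: the approximate radius is the largest distance from a class
    member to the class center, and the center (core) radius is a fixed
    fraction of it. The abstract does not say how the radii are chosen.
    """
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    regions = {}
    for label, vecs in by_label.items():
        center = tuple(sum(col) / len(vecs) for col in zip(*vecs))
        approx_radius = max(distance(v, center) for v in vecs)
        regions[label] = (center, core_ratio * approx_radius, approx_radius)
    return regions


def classify(x, train, regions, k=3):
    """Return (label, how), where how is "center" or "knn"."""
    d = {label: distance(x, r[0]) for label, r in regions.items()}
    # Inside a center region: accept that category directly, without KNN.
    in_core = [label for label in d if d[label] <= regions[label][1]]
    if in_core:
        return min(in_core, key=d.get), "center"
    # Category record table: categories whose approximate region contains x.
    record = [label for label in d if d[label] <= regions[label][2]]
    if not record:
        record = list(regions)  # assumed fallback: plain KNN over all classes
    # KNN restricted to training texts of the recorded categories.
    pool = [(vec, label) for vec, label in train if label in record]
    pool.sort(key=lambda p: distance(x, p[0]))
    votes = Counter(label for _, label in pool[:k])
    return votes.most_common(1)[0][0], "knn"


# Toy data: two well-separated categories in a 2-D feature space.
TRAIN = [((0, 0), "A"), ((0, 2), "A"), ((2, 0), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((6, 8), "B"), ((8, 6), "B"), ((8, 8), "B")]
REGIONS = fit_regions(TRAIN)

print(classify((1.0, 1.2), TRAIN, REGIONS))  # close to the center of A
print(classify((1.9, 1.9), TRAIN, REGIONS))  # in A's approximate region only
```

In this sketch the neighbor search runs only for texts that miss every center region, and only against training texts of the recorded categories, which is where a saving over plain KNN would come from.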
Keywords/Search Tags: short text, text classification, HowNet, feature extension, KNN algorithm