
Multi-label Chinese Question Classification Research

Posted on: 2017-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: X X Qi
Full Text: PDF
GTID: 2358330488964845
Subject: Software engineering
Abstract/Summary:
CQA is a form of web application that has emerged and become popular in recent years; typical examples include Baidu Knows and Yahoo. On CQA sites, users post their own questions, answer other users' questions, and rate the answers they receive. The questions on CQA sites cover every aspect of daily life, and the answers accumulated over a long period form a huge encyclopedic knowledge base. Today, interactive question answering aims to provide users with a set of the most accurate and relevant information, not just a single answer to a single question. Research on related questions on these sites amounts to an online test-collection problem: pose a series of questions, collect the related questions recommended for each one, and then judge whether those recommended questions are actually related to the question the user raised.

A question answering system generally consists of three main parts: question analysis, information retrieval, and answer extraction. How to fully understand the user's question in the question analysis stage, how to retrieve the relevant documents in the information retrieval stage, and how to accurately extract the answer from the relevant documents in the answer extraction stage are the core problems that a question answering system must solve.

Compared with ordinary text, a question generally does not exceed 200 characters, which not only makes its features noticeably sparse but also causes other problems, such as weak signals for the concepts being described and more noisy data.
In addition, abbreviations, irregular word variants, and colloquial words that appear in questions also hurt the performance of traditional text preprocessing and text representation. Chinese questions therefore have their own characteristics, such as short length, sparse features, and frequent ambiguity. For example, the question "what to eat to keep fit" belongs both to the category "Facial toning" and to the category "Health care". This paper therefore studies Chinese questions in light of these characteristics, namely sparse features and ambiguity. After in-depth research and experimentation, we obtained the following results for multi-label Chinese question classification.

(1) To strengthen the category-discriminating power of feature words and retain as much of their semantic category information as possible, this paper uses Wikipedia knowledge to mine hidden information. Words that are highly correlated with the feature words at the semantic level are chosen to assist question classification. The set of related concept words extracted from Wikipedia serves as the expansion-word set, and features are extended at the semantic level through this set to construct the semantic vector space. As an open, collaboratively edited knowledge system, Wikipedia has many merits, such as wide knowledge coverage, a high degree of structure, and fast updates. Wikipedia is a collection of hypertext documents consisting of richly linked pages. It mainly contains theme pages, redirects, disambiguation pages, and links. The theme page, the most basic and important element of Wikipedia, carries a unique ID and describes a single concept. Wikipedia uses a single page to describe synonymous concepts.
Among these concepts, only one page contains the description and explanation; the other synonymous concepts reach that page through redirects. A disambiguation page provides a way to handle polysemy. Links act as bridges between pages: pages are connected by the hyperlinks in theme pages.

(2) The underlying assumption of ML-kNN is that labels are independent, with no correlation among them. We use our proposed multi-label Chinese question classification algorithm to infer the label set of unlabeled questions. Using maximum a posteriori estimation, the algorithm takes into account the correlation with the statistics of the other labels among the neighbors. Furthermore, by iterating, the multi-label Chinese question classification algorithm (ML-CQC) can fully leverage the label correlations obtained from already classified instances. Experiments show that the proposed ML-CQC algorithm is feasible and effective when built on the feature expansion of Chinese questions.

(3) Building on ML-CQC, we propose the SML-CQC algorithm. Its core idea is to shrink the training set through instance extraction, which reduces the time complexity of computing neighbors. It then weights the prior probability according to the category labels of the neighbor instances, and finally finds the maximum a posteriori probability to obtain the labels of unseen instances.
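
For readers unfamiliar with the ML-kNN baseline named in item (2), the following is a minimal sketch of the standard ML-kNN decision rule, which treats each label independently and applies maximum a posteriori estimation to the number of neighbors carrying that label. It is not the ML-CQC or SML-CQC algorithm of this thesis, whose details the abstract does not give. The toy questions, the label names, the cosine similarity measure, and all function names (`cosine`, `neighbours`, `train_mlknn`, `predict`) are assumptions made for illustration only.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def neighbours(x, pool, k, skip=None):
    """Indices of the k vectors in pool most similar to x, optionally skipping one index."""
    ranked = sorted(
        (i for i in range(len(pool)) if i != skip),
        key=lambda i: cosine(x, pool[i]),
        reverse=True,
    )
    return ranked[:k]


def train_mlknn(X, Y, labels, k=2, s=1.0):
    """Estimate, for every label, its smoothed prior and neighbour-count likelihoods."""
    m = len(X)
    model = {"X": X, "Y": Y, "k": k, "labels": labels, "prior": {}, "like": {}}
    nn = [neighbours(X[i], X, k, skip=i) for i in range(m)]
    for lab in labels:
        has = [lab in y for y in Y]
        # Prior P(label present), with Laplace smoothing s.
        model["prior"][lab] = (s + sum(has)) / (2 * s + m)
        # c1[j]: instances WITH the label whose k neighbours contain it j times.
        # c0[j]: the same count for instances WITHOUT the label.
        c1 = [0] * (k + 1)
        c0 = [0] * (k + 1)
        for i in range(m):
            j = sum(has[n] for n in nn[i])
            (c1 if has[i] else c0)[j] += 1
        model["like"][lab] = (
            [(s + c) / (s * (k + 1) + sum(c1)) for c in c1],
            [(s + c) / (s * (k + 1) + sum(c0)) for c in c0],
        )
    return model


def predict(model, x):
    """Return every label whose posterior for 'present' beats 'absent' (MAP rule)."""
    nn = neighbours(x, model["X"], model["k"])
    out = set()
    for lab in model["labels"]:
        j = sum(lab in model["Y"][n] for n in nn)
        p1 = model["prior"][lab]
        like1, like0 = model["like"][lab]
        if p1 * like1[j] > (1 - p1) * like0[j]:
            out.add(lab)
    return out


# Hypothetical toy data: each question is a bag of words with a set of labels.
X = [
    {"eat": 1, "fit": 1}, {"eat": 1, "diet": 1}, {"fit": 1, "diet": 1},
    {"cpu": 1, "ram": 1}, {"cpu": 1, "disk": 1}, {"ram": 1, "disk": 1},
]
Y = [{"health", "beauty"}, {"health"}, {"health"}, {"tech"}, {"tech"}, {"tech"}]
LABELS = ["health", "beauty", "tech"]

model = train_mlknn(X, Y, LABELS, k=2, s=1.0)
print(predict(model, {"eat": 1, "fit": 1, "diet": 1}))
```

Because each label is scored on its own, this baseline cannot use the fact that two labels tend to co-occur; that independence assumption is the limitation item (2) sets out to address.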
Keywords/Search Tags:Question Answering, classification of Chinese questions, feature extension, multi-label