
Research On Text Vector Representation Model And Its Improvement

Posted on: 2019-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2417330551958737
Subject: Statistics
Abstract/Summary:
Text mining is one of the most important steps in knowledge discovery in databases and natural language processing. Unlike general data mining, text mining works with semi-structured data, so its primary task is to give the text a structured representation. Existing text representation methods, however, have several problems: insufficient extraction of semantic information, high-dimensional representations, and high model-construction complexity. It is therefore necessary to study text representation models and to address these problems. Focusing on text representation, the core problem of text mining, this thesis reviews and summarizes the existing text vector representation models and, drawing on co-occurrence analysis theory, long-tail theory, and the operation rules of Boolean algebra, studies the core issues in depth and improves the text vector representation model accordingly.

First, the thesis introduces the research background, purpose, significance, and current state of research on text vector representation models, and sets out the main research content, research methods, technical route, and innovations of the work.

Second, it introduces the relevant theory and the history of text vector models. The co-occurrence analysis theory, long-tail theory, and Boolean algebra operations used in the thesis are presented, and the development of text vector representation models is reviewed in detail to clarify their construction principles and development context.

Third, it improves the key techniques of the text vector representation model. The co-occurrence latent semantic vector space model (CLSVSM), a new text vector representation model, is studied in depth and improved with respect to weight setting, feature dimension reduction, and semantic information extraction. For weight setting, a CLSVSM based on multiple estimation methods is proposed. To reduce model complexity, a truncated co-occurrence latent semantic vector space model (TCLSVSM) is proposed by drawing on long-tail theory. Most importantly, for the extraction of latent semantic information, a generalized latent semantic vector space model (GCLSVSM) is proposed based on the idea of the Boolean algebraic product, which broadens the scope of topic aggregation of the latent semantic vector space model.

Fourth, the improved latent semantic vector space models are applied to clustering and evaluated. A series of experiments shows the advantages of the improved models in text clustering.

Finally, the thesis summarizes its main content and proposes directions for future research and improvement.
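
The abstract does not give the formulas behind CLSVSM or TCLSVSM, so the following is only a rough illustration of the general idea: using term co-occurrence to give a document non-zero weight on related terms it does not contain, then truncating weak co-occurrences. The Jaccard-style strength, the "take the strongest link" rule, and all function names are assumptions made for this sketch and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count documents per term and documents per unordered term pair."""
    term_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))  # Boolean view: a term is present or absent
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))  # pairs (a, b) with a < b
    return term_counts, pair_counts


def cooccurrence_strength(term_counts, pair_counts):
    """Assumed strength: docs containing both terms / docs containing either."""
    return {
        (a, b): n / (term_counts[a] + term_counts[b] - n)
        for (a, b), n in pair_counts.items()
    }


def truncate(strengths, keep):
    """Keep only the `keep` strongest pairs and drop the long tail."""
    ranked = sorted(strengths.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])


def enrich(doc, vocabulary, strengths):
    """Boolean document vector with co-occurrence weights for absent terms."""
    present = set(doc)
    vector = []
    for term in vocabulary:
        if term in present:
            vector.append(1.0)
        else:
            # Assumed rule: weight an absent term by its strongest link
            # to any term that does occur in the document.
            vector.append(max(
                (strengths.get(tuple(sorted((term, p))), 0.0) for p in present),
                default=0.0,
            ))
    return vector
```

Passing the output of `truncate` instead of the full strength table to `enrich` gives sparser document vectors, which is the kind of complexity reduction the truncated model aims at. How the thesis actually sets the cutoff is not stated in the abstract.
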
Keywords/Search Tags:vector representation model, co-occurrence analysis, text clustering, TCLSVSM, GCLSVSM