Research On The Methods Of Web Text Mining For Information Retrieval

Posted on: 2013-03-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Wen
Full Text: PDF
GTID: 1228330395475815
Subject: Computer application technology
Abstract/Summary:
Today, the Internet has become a popular, interactive medium for publishing information. As a huge, open, heterogeneous, and dynamic data container, it produces and holds information of all kinds on a large scale. Because resources are scattered and there is no unified management structure, the content a user is actually interested in is buried among large amounts of unrelated information.

Through research on web data mining, new web text mining methods are applied to information retrieval. They can improve the accuracy and efficiency of web page classification and clustering and the organization of search results, and they can directly or indirectly address the shortcomings of search engines. Research on web text mining methods for information retrieval therefore has both theoretical significance and commercial application value.

At present, web text mining is a very active research field from the perspective of information retrieval. Although there have been some encouraging results and applications, the field has not reached a mature stage and still faces many critical problems that demand prompt solutions. There is no “best” feature selection method yet. It is difficult to improve the accuracy and efficiency of traditional classification and clustering algorithms on high-dimensional sparse text data. Relevant information is difficult to find in huge amounts of data, and how to effectively improve the organization and presentation of search results remains unsolved.

Building on existing methods and research, this thesis studies the key issues of web text mining. For class-imbalanced data and for online product review samples, solutions for feature dimension reduction are given respectively. With semi-supervised learning as the main object of study, several new semi-supervised algorithms are proposed and applied to web text mining. An effective solution is proposed to improve the organization of search results. Experiments on several standard datasets verify the validity of the improved methods.

The main research topics and contributions of the thesis are as follows:

1. A semi-supervised classification method based on Naive Bayes and an enhanced Expectation Maximization (EM) algorithm is proposed for classification on imbalanced text data. First, a feature selection function carrying strong category information is constructed to control the dimension of the feature vector space and preserve useful feature terms for imbalanced datasets. Second, the basic EM algorithm is improved by gradually transferring the unlabeled documents with the maximum posterior category probability to the labeled collection. This helps prevent those documents from interfering with the category identification of the other unlabeled samples.

2. A semi-supervised classification method using feature distribution is proposed for classifying online product review text, which has an obvious sentiment tendency. The joint probability distribution of feature terms and categories in the information gain method is revised using the category distribution of the terms. By enlarging the feature difference between categories, the adjusted feature selection method retains the features that truly have higher category distinguishability. The feature distribution selection method is combined with the enhanced EM algorithm to carry out the semi-supervised task. The improved methods achieve better classification performance.

3. An affinity propagation clustering method based on strong classification features is proposed to address the problem that traditional web document clustering methods cannot achieve ideal accuracy and efficiency. The fast affinity propagation algorithm is improved by incorporating the idea of semi-supervised clustering. During clustering, strong classification features are extracted from a small number of labeled samples to improve the cosine similarity matrix of the training samples. In each iteration, the unlabeled documents with the maximum category certainty are transferred from the unlabeled collection to the labeled collection to achieve better convergence.

4. An affinity propagation text clustering method that integrates seed spreading is proposed to make better use of a small number of samples with category labels. At the initial stage of clustering, the Seeded-K-means algorithm and nearest-neighbor pruning are used to expand a small number of labeled samples into a larger, higher-quality seed set. Supervised information is extracted from the seeds to construct a more effective similarity matrix. The algorithm is accelerated toward the correct convergence target. It offers an effective solution to the problem of analyzing large-scale, high-dimensional, sparse web text.

5. A web search results clustering algorithm based on latent semantic information and suffix trees is proposed to improve the organization of search results. The method first combines the advantages of the vector space model and the suffix tree clustering model. Page documents that share many of the same phrases form a base cluster. The method uses the semantic information of candidate label phrases to provide descriptive, readable, and conceptual topic labels for the final page clusters. Clustering web snippets makes search results easy to browse and helps users quickly find the web information they are interested in.
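The transfer step described in the first contribution (moving the most confident unlabeled documents into the labeled collection each round) can be illustrated with a minimal sketch. This is not the thesis's algorithm: it is a simplified hard-label self-training loop around a multinomial Naive Bayes classifier with Laplace smoothing, and it omits the thesis's feature selection function. The function names (`train_nb`, `posterior`, `self_training_nb`), the `per_round` parameter, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels, vocab):
    """Fit multinomial Naive Bayes with Laplace (add-one) smoothing."""
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        word_counts[c].update(t for t in tokens if t in vocab)
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like


def posterior(tokens, model):
    """Return normalized class posteriors P(c | tokens) as a dict."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in log_prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def self_training_nb(labeled, unlabeled, per_round=1):
    """Each round, retrain and move the `per_round` most confident
    unlabeled documents (with their predicted label) to the labeled set.
    Hypothetical simplification of the transfer idea, not the thesis's EM."""
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    vocab = {t for d in docs + pool for t in d}
    while pool:
        model = train_nb(docs, labels, vocab)
        scored = []
        for i, d in enumerate(pool):
            post = posterior(d, model)
            best = max(post, key=post.get)
            scored.append((post[best], i, best))
        scored.sort(reverse=True)
        chosen = scored[:per_round]
        for _, i, c in chosen:
            docs.append(pool[i])
            labels.append(c)
        moved = {i for _, i, _ in chosen}
        pool = [d for i, d in enumerate(pool) if i not in moved]
    return train_nb(docs, labels, vocab)


# Toy usage with made-up review tokens (illustrative data only).
labeled = [(["good", "great"], "pos"), (["bad", "poor"], "neg")]
unlabeled = [["good", "nice"], ["bad", "awful"], ["nice", "great"]]
model = self_training_nb(labeled, unlabeled, per_round=1)
print(posterior(["nice"], model))
```

Transferring only a few documents per round keeps low-confidence predictions out of the training set until the classifier has seen more evidence, which is the motivation the abstract gives for the gradual transfer.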
Keywords/Search Tags: Web text mining, Semi-supervised learning, Expectation Maximization, Affinity propagation, Suffix tree clustering