Research On Generating Text And Image Summarization Of Microblog Event

Posted on: 2017-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: H Qu
Full Text: PDF
GTID: 2348330503489880
Subject: Computer software and theory
Abstract/Summary:
Event extraction from social media such as Twitter and Sina micro-blog is an emerging research area that has attracted much attention. Messages posted on micro-blogs report everything from daily life stories to the latest local and global news and events. How to filter noisy information out of social media streams, and how to identify and summarize events, have become active research topics. Existing approaches consider only textual summaries, which are often poorly written. Images, by contrast, can convey information quickly. In this paper, I therefore also investigate how images can be used as a source for summarizing events.

This paper proposes a new offline method for multi-modal text and image summarization of micro-blog events. I introduce a probabilistic method that jointly exploits three types of relations, namely text-text, image-image, and text-image relations, to identify events. In particular, I propose a topic-based graph ranking method to calculate relevance, which jointly exploits the different types of relations among the texts and images of a given event in micro-blogs. This paper also presents a method for combining event relevance with information diversity in micro-blog event summarization. The random-walk graph ranking model strives to reduce redundancy while maintaining relevance when re-ranking retrieved micro-blogs and when selecting appropriate texts and images for the event summary.

In contrast to conventional media, event detection from micro-blog streams poses new challenges. Micro-blogs contain a large number of meaningless messages and much polluted content, which degrade detection performance. In addition, traditional text mining techniques are not suitable, because micro-blogs are short, contain many spelling and grammatical errors, and frequently use informal and mixed language.
To address these problems, I expand the words of each micro-blog using Latent Dirichlet Allocation (LDA): LDA derives latent topics from the corpus, and these topics are used as features to enrich the representation of short micro-blog texts. Since images in social media can also be noisy, irrelevant, and repetitive, I use perceptual hashing to detect duplicate and near-duplicate images. Aside from duplicates, the major problem with social media images is irrelevant images such as self-portraits, reaction images, and QR codes. I train a Support Vector Machine (SVM) classifier to filter out these images because they are unsuitable for summarization. The experimental results show that the proposed method achieves better average accuracy and recall.
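The abstract does not say which perceptual hash is used, so the following is only a minimal sketch of near-duplicate image detection under stated assumptions: it uses the simple "average hash" variant, treats an image as a grid of grayscale values, and compares hashes by Hamming distance. The function names, the 8x8 hash size, and the `max_distance` threshold are all hypothetical choices made for illustration, not the thesis's implementation.

```python
from typing import List, Sequence

# Assumption: an image is a grid of grayscale pixel values in 0..255.
Image = Sequence[Sequence[int]]


def average_hash(image: Image, hash_size: int = 8) -> int:
    """Return a (hash_size * hash_size)-bit average hash of a grayscale image."""
    height, width = len(image), len(image[0])
    # Downsample by nearest-neighbour sampling onto a hash_size x hash_size grid.
    small = [
        image[(r * height) // hash_size][(c * width) // hash_size]
        for r in range(hash_size)
        for c in range(hash_size)
    ]
    mean = sum(small) / len(small)
    # One bit per sampled pixel: 1 if brighter than the mean, else 0.
    bits = 0
    for value in small:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(images: List[Image], max_distance: int = 5) -> List[int]:
    """Return indices of images kept after discarding near-duplicates.

    An image is discarded when its hash lies within max_distance bits of
    the hash of an image that was already kept (hypothetical threshold).
    """
    kept: List[int] = []
    kept_hashes: List[int] = []
    for index, image in enumerate(images):
        h = average_hash(image)
        if all(hamming_distance(h, other) > max_distance for other in kept_hashes):
            kept.append(index)
            kept_hashes.append(h)
    return kept
```

Because each bit records only whether a region is brighter than the image's own mean, small uniform changes in brightness tend to leave the hash unchanged, which is what lets slightly altered re-posts be caught. The threshold trades missed duplicates against false matches and would need tuning on real data.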
Keywords/Search Tags: micro-blog analysis, event identification, text and image summarization, multi-modal correlation