Segmentation Of Social Media Web Page And Extraction Of Topic Frequent Cluster

Posted on: 2012-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: S Jie
GTID: 2218330338971982
Subject: Computer Science and Technology
Abstract/Summary:
With the spread of the Internet and the rapid development of computer technology, the Web has become an important platform for obtaining knowledge, sharing technology, and exchanging information in daily life. A growing number of Internet users publish experience-based, user-centered content on blogs, BBS forums, and online communities, producing rich social media in the form of text, images, music, and video. Extracting useful information from this content is a pressing problem, and Web information extraction has therefore attracted increasing attention in both academia and industry.
Compared with traditional information resources, social media is dominated by unstructured or semi-structured web pages that lack a standard syntactic structure; such pages account for more than 95% of social media web pages. Because social media pages are massive in volume, open, diverse, and dynamic, traditional natural language processing techniques and applications cannot obtain or use much of the information they contain. Extracting and comprehensively analyzing relevant information from these pages, such as product information and forum posts, makes it possible to understand user demand, product defects, and trending social topics broadly, which has great social and economic value.
At present, most social media sites use databases and predefined templates to generate web pages dynamically. Different areas of a page represent the menu, navigation, copyright notice, main content, and so on. The HTML tags used in these areas generally differ from one area to another, but within some local areas the same HTML tags recur. This paper exploits that feature of social media web pages to extract information automatically, using stable patterns induced from these recurring local areas.
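The observation that template-generated pages repeat the same HTML tags within a local area can be illustrated with a minimal sketch. It is not the method of this paper: the class `SignatureCollector`, the function `recurring_signatures`, the thresholds `min_repeat` and `min_tags`, and the sample page are all hypothetical names and values chosen for illustration. The sketch records the tag sequence of every element's subtree and reports the sequences that occur more than once.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without an end tag; a small illustrative subset, not the full list.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SignatureCollector(HTMLParser):
    """Records, for every closed element, the tag sequence of its subtree."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one list of tag names per currently open element
        self.signatures = []   # tag-sequence tuple for each completed element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void elements never close, so only note them in the parent.
            if self.stack:
                self.stack[-1].append(tag)
            return
        self.stack.append([tag])

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # this sketch simply skips mismatched end tags
        subtree = self.stack.pop()
        self.signatures.append(tuple(subtree))
        if self.stack:
            # The parent's signature includes the child's whole subtree.
            self.stack[-1].extend(subtree)


def recurring_signatures(html, min_repeat=2, min_tags=2):
    """Return {signature: count} for subtree shapes that repeat in the page."""
    parser = SignatureCollector()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.signatures)
    return {sig: n for sig, n in counts.items()
            if n >= min_repeat and len(sig) >= min_tags}


# Hypothetical page: one navigation block and three identically shaped posts.
page = (
    "<html><body>"
    "<div><a>Home</a></div>"
    "<ul>"
    "<li><h3>Post 1</h3><p>text</p></li>"
    "<li><h3>Post 2</h3><p>text</p></li>"
    "<li><h3>Post 3</h3><p>text</p></li>"
    "</ul>"
    "</body></html>"
)
found = recurring_signatures(page)
print(found)  # {('li', 'h3', 'p'): 3}
```

Exact matching of tag sequences is the simplest possible notion of "same structure"; real pages would likely need a tolerant similarity measure, since repeated blocks often differ by an optional tag or two.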
This paper takes social media web pages as its research object, focuses on the key technologies of web information extraction, and presents a segmentation and extraction method for content-rich social media pages. The main contributions are as follows: 1. It identifies frequent blocks with similar structure within a page using the k-means method and obtains a collection of frequent clusters. 2. It identifies the topic frequent cluster within that collection. 3. It induces extraction rules from the frequent blocks in the topic frequent cluster. Experimental results show that the method is efficient and robust on social media pages with various styles and layouts, achieving high precision and recall.
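The three steps above can be sketched end to end under several stated assumptions. The abstract does not specify the block features, the criterion for choosing the topic cluster, or the form of the extraction rules, so everything below is hypothetical: blocks are described by (link count, paragraph count, image count), the topic cluster is taken to be the one with the most text, and the "rule" is simply the tag path shared by the blocks in that cluster. The k-means routine is a plain textbook version with farthest-point initialisation so that the result is deterministic.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    # Farthest-point initialisation: deterministic, no random seed needed.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def topic_cluster(blocks, labels):
    """Assumed heuristic: the cluster holding the most text is the topic one."""
    totals = {}
    for block, lab in zip(blocks, labels):
        totals[lab] = totals.get(lab, 0) + block["text_len"]
    best = max(totals, key=totals.get)
    return [i for i, lab in enumerate(labels) if lab == best]


def common_path(paths):
    """Assumed rule form: the tag-path prefix shared by all given blocks."""
    prefix = []
    for parts in zip(*(p.split("/") for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/".join(prefix)


# Hypothetical blocks: features are (links, paragraphs, images).
blocks = [
    {"features": (5, 0, 0), "text_len": 20, "path": "html/body/div/a"},
    {"features": (6, 0, 0), "text_len": 25, "path": "html/body/div/a"},
    {"features": (1, 3, 1), "text_len": 400, "path": "html/body/div/ul/li"},
    {"features": (1, 4, 1), "text_len": 520, "path": "html/body/div/ul/li"},
    {"features": (0, 3, 1), "text_len": 380, "path": "html/body/div/ul/li/div"},
]

labels = kmeans([b["features"] for b in blocks], k=2)
topic = topic_cluster(blocks, labels)
rule = common_path([blocks[i]["path"] for i in topic])

print(labels)  # [0, 0, 1, 1, 1]
print(topic)   # [2, 3, 4]
print(rule)    # html/body/div/ul/li
```

In this toy input the navigation links and the posts separate cleanly because their tag counts differ sharply; on real pages the choice of features and of `k` would need tuning, and the abstract gives no detail on how either was set.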
Keywords/Search Tags: social media, web information extraction, clustering, extraction rules