Segmentation Of Social Media Web Page And Extraction Of Topic Frequent Cluster

Posted on:2012-12-27

Degree:Master

Type:Thesis

Country:China

Candidate:S Jie

Full Text:PDF

GTID:2218330338971982

Subject:Computer Science and Technology

Abstract/Summary:

With the spread of the Internet and the rapid development of computer technology, the Web has become an important platform for obtaining knowledge, sharing technology, and exchanging information in daily life. More and more Internet users turn to blogs, BBS forums, and online communities to publish content that is drawn from their own experience and centered on users, which forms rich social media in the form of text, images, music, and video. Extracting useful information from this content is an important and urgent problem, and Web information extraction technology has therefore attracted growing attention in both academia and industry.
Compared with traditional information resources, social media web pages are mostly unstructured or semi-structured: pages that lack a standard syntactic structure account for more than 95% of them. Because social media web pages are massive in volume, open, diverse, and dynamic, traditional natural language processing techniques and applications cannot obtain or use much of the information they contain. By extracting and comprehensively analyzing relevant information from these pages, such as product information and forum posts, we can gain a broad understanding of user demand, product defects, and trending social topics, which has great social and economic value.
At present, most social media sites use databases and predefined templates to generate web pages dynamically. Different areas of a page represent the menu, navigation, copyright notice, main content, and so on. The HTML tags used in these areas usually differ from one area to another, but within some local areas the same HTML tags recur. This paper takes advantage of this feature of social media web pages to extract information automatically, using stable patterns induced from these recurring local areas.
This paper takes social media web pages as its research object, focuses on the key techniques of Web information extraction, and presents a segmentation and extraction method for content-rich social media pages. The main contributions are as follows: 1. it identifies frequent blocks with similar structure within a page using the k-means method and obtains a collection of frequent clusters; 2. it identifies the topic frequent cluster from this collection of frequent clusters; 3. it induces extraction rules from the frequent blocks in the topic frequent cluster. Experimental results show that the method is efficient and robust across social media pages of various styles and layouts, achieving high precision and recall.
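
The three steps listed above can be illustrated with a small sketch. It is not the thesis's implementation: the abstract names k-means but does not give the block features, the definition of "frequent", the criterion for the topic cluster, or the rule format. The sketch assumes tag-count vectors as features, a minimum cluster size as the frequency test, total text length as the topic criterion, and a common tag prefix as the rule. All function names and the sample blocks are hypothetical.

```python
from collections import Counter
import math

def tag_vector(tags, vocab):
    """Count how often each tag in vocab occurs in a block (assumed feature)."""
    counts = Counter(tags)
    return [counts[t] for t in vocab]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iterations=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda j: distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def frequent_clusters(labels, min_size=2):
    """Assumption: a cluster is 'frequent' if it has at least min_size blocks."""
    sizes = Counter(labels)
    return sorted(j for j, n in sizes.items() if n >= min_size)

def pick_topic_cluster(blocks, labels, candidates):
    """Assumption: the topic cluster is the candidate holding the most text."""
    def total_text(j):
        return sum(length for (_, length), label in zip(blocks, labels) if label == j)
    return max(candidates, key=total_text)

def induce_rule(tag_sequences):
    """Assumption: the rule is the tag prefix shared by every block."""
    rule = []
    for column in zip(*tag_sequences):
        if len(set(column)) != 1:
            break
        rule.append(column[0])
    return rule

# Hypothetical page: each block is (tag sequence, text length in characters).
blocks = [
    (["li", "a"], 12), (["li", "a"], 9), (["li", "a"], 11),   # navigation
    (["div", "span", "p", "p"], 240), (["div", "span", "p"], 180),
    (["div", "span", "p", "p"], 310),                         # forum posts
    (["div", "small"], 30),                                   # copyright
]
vocab = sorted({tag for tags, _ in blocks for tag in tags})
labels = kmeans([tag_vector(tags, vocab) for tags, _ in blocks], k=3)
topic = pick_topic_cluster(blocks, labels, frequent_clusters(labels))
rule = induce_rule([tags for (tags, _), label in zip(blocks, labels) if label == topic])
print(labels, topic, rule)
```

On this toy page the navigation items and the forum posts each form a recurring cluster, the single copyright block does not, and the post cluster is chosen because it holds the most text. A real page would need richer features (for example DOM paths) and a way to choose k.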

Keywords/Search Tags:

social media, web information extraction, clustering, extraction rules

Related items

1	Research On Language And Key Techniques For Accurate Information Extraction Rules Towards Complex Web
2	Design And Implementation Of Web Information Extraction Rules
3	Research On Information-Quality-Oriented Methods For Relation Extraction From Social Media
4	Optimizing Of Extraction Rules And Expressing Of The Rules With XQuery In Web Information Extraction Systems
5	Research Of Chinese Personal Social Relation Extraction Based On News Data
6	Social Media Based Disaster Event Extraction And Spatiotemporal Analysis
7	XML-based WEB Information Extraction System Research And Implementation
8	Design And Implementation Of Web Information Extraction System SEU-WIE
9	The Design And Implementation Of Information Extraction In VOCA
10	Technology For Domain-oriented Automatic Information Extraction From Semi-structured Web