Research on Ontology-based Video Website Supervision Method
Posted on: 2014-01-12 | Degree: Doctor | Type: Dissertation | Country: China | Candidate: W K Yin | GTID: 1228330398456592 | Subject: Information security

Abstract/Summary:

With growing network bandwidth, a growing number of Internet users, and the spread of digital products, video content has become richer and the number of video websites has increased sharply. However, because the Internet is open, anonymous, and loosely managed, many unhealthy video websites have appeared. These websites have a strongly negative effect on the development of young people and on social stability. Although the government has stepped up its fight against unhealthy websites, they still exist and are easy to find with a search engine. How to automatically discover and accurately identify unhealthy video websites, so that they can be supervised effectively, is therefore a problem worth studying.

The main problems in video website supervision are as follows:

(1) Existing topic-specific website discovery methods do not provide a way to build the initial website list for topic crawling, even though the quality of that list and the number of websites in it strongly affect the efficiency of focused crawling. In addition, current methods for calculating the topical relevance of a video website are all based on text features and ignore the visual characteristics of video websites. Finally, how to tunnel effectively through irrelevant pages has not been well resolved.

(2) There is no computer-readable semantic description of unhealthy video websites. Traditional automatic or semi-automatic domain ontology construction methods rely on natural language processing techniques.
However, because natural language is complex and natural language processing tools have limited performance, the domain ontologies built by these methods are often of low quality.

(3) The document representation models used in traditional web page classification generally assume that feature terms are independent of one another, but words in natural language rarely satisfy this independence condition. Existing ontology-based classification systems generally use the ontology only as an aid to classification and do not take full advantage of the ontology's own structure and semantic information.

To address these problems, the main research work and innovations are as follows:

1. We propose a meta-search based method for automatic video website discovery. The method first uses meta-search technology to discover an initial set of video websites automatically, and designs a keyword update and evaluation mechanism that supplies high-quality search keywords to the search agent. The meta-search results are then passed as the initial list to the topic crawling module for further video website crawling. Next, we judge whether a web page is a video page by analyzing the visual features of the candidate video player and the label features. Once a page has been judged to be a video page, subsequent pages can be judged by calculating the similarity between their DOM trees and that of the identified video page. Finally, based on a correlation analysis of web pages and links, we propose an energy model that calculates the energy of each hyperlink parsed during the search; this energy determines the direction and step length of the topic crawling. Experimental results show that the proposed method can effectively find video websites.

2. We propose a method for automatic domain ontology construction based on clustering of a hyperlink structure graph. The method first constructs a domain-specific hyperlink structure graph using Wiki, and then uses the LSI algorithm to calculate the weight of each hyperlink.
The method then uses the CPMw algorithm to cluster the weighted undirected hyperlink structure graph, which yields the domain thesaurus. Experiments show that the method achieves better results.

3. We propose an ontology-based method for calculating the health degree of a web page. The method first transforms the web page into a document concept graph, and then uses a random walk based weighting algorithm to calculate the edge weights in the domain ontology graph and the document concept graph respectively. We then use a maximum spanning tree algorithm to convert the document concept graph and the domain ontology graph into tree structures. Using edit distance as the similarity metric, we calculate the minimum-cost matching between the domain ontology tree and the document concept tree, and from this obtain the health degree of the web page. Experimental results show that the accuracy, recall, and F1 value of our method are 0.96, 0.957, and 0.958 respectively, which shows that the method can effectively identify unhealthy web pages.

Keywords/Search Tags: Ontology, Ontology Automatic Construction, Video Website Discovery, Video Website Identification, Topic Crawling, Web Page Tunneling, Domain Ontology Graph, Web Document Concept Graph, Web Page Relevance Calculation
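
The three evaluation figures reported in the abstract can be checked for internal consistency. The sketch below is not taken from the dissertation: it assumes that the reported "accuracy" denotes precision, and it uses the standard definition of F1 as the harmonic mean of precision and recall. The function and variable names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the abstract.
# Assumption: the abstract's "accuracy" is read here as precision.
reported_precision = 0.96
reported_recall = 0.957
reported_f1 = 0.958

computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.4f}, reported F1 = {reported_f1}")
```

Under that assumption the computed value agrees with the reported F1 to three decimal places, which suggests the three figures are mutually consistent when "accuracy" is read as precision.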