| As the Internet reaches into every aspect of society, more and more personal information appears online. People-oriented search engines have sprung up in recent years; domain-specific people-oriented search engines are newer still, and research on them is not yet mature. The teaching and research quality of college teachers is drawing increasing attention, and demand for searching information about teachers is growing quickly. Taking information extraction for college teachers in computer science as its application background, this paper concentrates on domain-specific people information extraction from heterogeneous web sources. Ultimately, we build a search system oriented to college teachers in computer science. This paper focuses on the following issues. First, we obtain the data by two methods: one is based on a topic spider, and the other identifies the informative web pages of teachers among the results returned by a search engine. We treat the second method as a web page categorization problem and solve it with an SVM classifier built on the structural and content features of web pages. To reduce the classifier's processing time, we then propose two new feature selection methods, based on a feature's contribution to the category and on SVM training weights. Second, according to the characteristics of informative web pages, we design a classifier to categorize them. To improve classification, we combine a rule-based method with a machine learning method, considering both the structural and content features of web pages. To classify multi-record web pages, we use two methods: one is based on the density of HTML tags, and the other is based on content features. 
For single-record pages, we perform extraction based on page structure and build a classifier according to the SVM model. Experimental results show that the rule-based and structure-based classifiers perform well. Third, on the basis of the classification of informative web pages, we propose a rule-based method for personal attribute extraction. We first construct a base of cue words for domain person information extraction. We also construct a rule base for personal attribute extraction, according to the characteristics of domain person information and of page structure. With these two bases we extract the personal attributes. Experimental results show that the proposed method performs well. Finally, we apply the methods in this paper to extract information on college teachers in computer science and build a search system oriented to them. The system has so far collected information on 4134 computer science teachers from around 120 colleges in China and provides several search methods. |
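The rule-based attribute extraction described in the third contribution can be illustrated with a minimal sketch. It is not the system's actual implementation: the cue words in `CUE_WORDS`, the single pattern in `VALUE_RULES`, and the function names `match_cue` and `extract_properties` are all hypothetical, chosen only to show how a cue-word base and a rule base might work together on the text of a single-record teacher page.

```python
import re

# Hypothetical cue-word base: each attribute maps to words that signal it.
# These entries are illustrative assumptions, not the paper's word base.
CUE_WORDS = {
    "name": ["Name"],
    "title": ["Title", "Position"],
    "email": ["Email", "E-mail"],
    "research_interests": ["Research Interests", "Research Areas"],
}

# Hypothetical rule base: an optional pattern the value must satisfy.
VALUE_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def match_cue(line, cues):
    """Return the text after 'cue:' if the line starts with any cue word."""
    for cue in cues:
        m = re.match(rf"\s*{re.escape(cue)}\s*:\s*(.+)", line, re.IGNORECASE)
        if m:
            return m.group(1).strip()
    return None


def extract_properties(text):
    """Extract attribute values line by line using cue words and rules."""
    found = {}
    for line in text.splitlines():
        for attr, cues in CUE_WORDS.items():
            if attr in found:
                continue  # keep the first value seen for each attribute
            value = match_cue(line, cues)
            if value is None:
                continue
            rule = VALUE_RULES.get(attr)
            if rule is not None:
                hit = rule.search(value)
                if hit is None:
                    continue  # cue matched but the value breaks the rule
                value = hit.group(0)
            found[attr] = value
    return found


if __name__ == "__main__":
    sample = (
        "Name: Li Ming\n"
        "Title: Associate Professor\n"
        "E-mail: liming@example.edu.cn (office)\n"
        "Research Interests: information extraction, web mining\n"
    )
    for attr, value in extract_properties(sample).items():
        print(f"{attr}: {value}")
```

Keeping the cue words and the value patterns in separate tables mirrors the separation between the word base and the rule base: new attributes can be covered by adding entries rather than changing the extraction loop.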