Automatic Annotation Technique For Information Extraction

Posted on: 2011-04-26    Degree: Master    Type: Thesis
Country: China    Candidate: Y F Shi    Full Text: PDF
GTID: 2178360302999179    Subject: Computer Science and Technology
Abstract/Summary:
The explosive growth and popularity of the World Wide Web have resulted in a huge number of information sources on the Internet. Because Web information sources are heterogeneous and lack structure, access to them was previously limited to browsing and searching. Many intelligent information processing techniques for retrieval, integration, extraction and data mining have since emerged to help people access Web data of interest more readily. Information extraction (IE) is one such effort: it automates the translation of input pages into structured data.

There are currently many IE systems and tools, such as WIEN, SoftMealy and SRV. Most of them are supervised systems that require manually annotated training instances in order to learn extraction rules. However, such annotation is tedious and time-consuming, and it has to be redone whenever the sources change, in particular when Web sites are upgraded. Providing semantic annotation for training documents has therefore become an urgent problem, and automating this annotation work is desirable so that different data sources can be handled readily.

In this paper, we present a finite-state-transducer-based method of automatic annotation that can handle pages with missing attributes, multiple-valued attributes and multi-ordering attributes. We also augment it with probability theory to reduce the uncertainty of the state machine. The experimental results show that our algorithm annotates Web pages efficiently and accurately, and thus speeds up the learning of extraction rules in Web information extraction systems.

For evaluation, we run experiments on real Web pages and calculate recall and precision. The results show that our algorithm annotates pages with missing attributes, multiple-valued attributes and multi-ordering attributes well...
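
As an illustration of the evaluation step mentioned above, the following is a minimal sketch of how precision and recall could be computed for automatically produced annotations against a manually annotated gold standard. The annotation format (page id, token index, attribute label), the function name `precision_recall` and the sample data are assumptions made for this example; the abstract does not specify how the thesis represents annotations or which pages were used.

```python
from typing import Set, Tuple

# Hypothetical annotation format: (page_id, token_index, attribute_label).
Annotation = Tuple[str, int, str]


def precision_recall(predicted: Set[Annotation],
                     gold: Set[Annotation]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted annotations against gold ones."""
    # An annotation counts as correct only if page, position and label all match.
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up sample data: one page with a multiple-valued "author" attribute.
gold = {
    ("p1", 0, "title"),
    ("p1", 1, "author"),
    ("p1", 2, "author"),
    ("p1", 3, "price"),
}
predicted = {
    ("p1", 0, "title"),
    ("p1", 1, "author"),
    ("p1", 3, "year"),  # wrong label, so it counts against precision
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this made-up case two of the three predicted annotations are correct and two of the four gold annotations are found. A missed value of a multiple-valued attribute lowers recall, while a wrongly labelled token lowers precision.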
Keywords/Search Tags: Information Extraction, Annotation, GATE, Finite State Transducer, Probability Theory