Literature Information Extraction System From Academic Homepage

Posted on: 2012-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2218330362956478
Subject: Computer system architecture
Abstract/Summary:
As the Internet grows, a huge amount of information has become available on the Web. Web documents are written in HTML, which is designed for human reading only. With the explosive growth of information, people need a way to have machines help find specific information. This is not an easy task under the current web architecture, not only because webpages are written in natural language, but also because they differ greatly in style and layout. Thus, technologies that can perform deep analysis of webpages are becoming a necessity.

LineX is a literature information extraction system for academic homepages. The system automatically detects academic homepages and extracts the author's profile and publications. The extracted results are further processed and integrated into a search system. Academic homepages usually differ in style and content, so a purely rule-based extraction method will not achieve good performance. The core algorithms in LineX are the support vector machine (SVM) and the conditional random field (CRF). SVM is mainly used to classify text content, and CRF is used to extract subfields from long text. LineX first divides a webpage into sequences of coherent text units and then performs classification and tagging. The results are then trimmed and normalized. During the extraction process, additional webpage features such as the title, tag features, and semantic associations are used to improve extraction precision. Rule-based methods are also used to handle the cases that the machine learning methods cannot.

Experiments were conducted on randomly sampled academic homepages. The results show that LineX achieves very high precision on all subfield extraction tasks. Dictionary features and HTML features contribute substantially to the overall performance.
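
The first stage described in the abstract, dividing a webpage into coherent text units and attaching HTML features to each, can be illustrated with a minimal Python sketch. This is not the LineX implementation: the class `TextUnitSplitter`, the function `split_page`, the choice of block-level tags, and the features `in_list` and `has_year` are all assumptions made for illustration, and the sample page is invented. In a system like the one described, such units and features would then be passed to an SVM classifier and a CRF tagger, which are not shown here.

```python
from html.parser import HTMLParser

# Tags treated as boundaries between text units (an assumption of this sketch).
BLOCK_TAGS = {"p", "li", "div", "br", "tr", "td", "ul", "ol", "table",
              "h1", "h2", "h3", "h4"}
# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class TextUnitSplitter(HTMLParser):
    """Hypothetical helper: splits a page into text units at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []          # currently open tags
        self.buffer = []         # text fragments of the unit being built
        self.buffer_tags = set() # tags seen around the current unit's text
        self.title_parts = []    # text found inside <title>
        self.units = []          # finished units

    def flush(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.units.append({"text": text, "tags": sorted(self.buffer_tags)})
        self.buffer = []
        self.buffer_tags = set()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if "script" in self.stack or "style" in self.stack:
            return
        if "title" in self.stack:
            self.title_parts.append(data.strip())
            return
        if data.strip():
            self.buffer.append(data)
            self.buffer_tags.update(self.stack)


def split_page(html):
    """Return (page title, list of text units with simple illustrative features)."""
    parser = TextUnitSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    title = " ".join(part for part in parser.title_parts if part)
    for unit in parser.units:
        tokens = [tok.strip("().,;") for tok in unit["text"].split()]
        # Illustrative features only; the thesis's actual feature set is not given.
        unit["in_list"] = "li" in unit["tags"]
        unit["has_year"] = any(t.isdigit() and len(t) == 4 for t in tokens)
    return title, parser.units


# Invented sample page, used only to exercise the sketch.
SAMPLE = """<html><head><title>A. Author - Homepage</title></head><body>
<h2>Publications</h2>
<ul><li>A. Author. <i>A Sample Paper Title</i>. Hypothetical Conf., 2011.</li></ul>
<p>Contact: office 101</p></body></html>"""

if __name__ == "__main__":
    page_title, page_units = split_page(SAMPLE)
    print(page_title)
    for u in page_units:
        print(u)
```

Splitting at block-level tags keeps inline markup such as `<i>` inside a single unit, so a publication entry stays together as one candidate for later classification and subfield tagging.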
Keywords/Search Tags: Information Extraction, Natural Language Processing, Machine Learning, Semi-Structured Information