Literature Information Extraction System From Academic Homepage

Posted on:2012-06-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

Full Text:PDF

GTID:2218330362956478

Subject:Computer system architecture

Abstract/Summary:

As the Internet grows, a huge amount of information has become available on the Web. Web documents are written in HTML, which is designed for human reading only. With the explosive growth of information, people need a way to let machines help find specific information. This is not an easy task under the current web architecture, not only because webpages are written in natural language, but also because they differ greatly in style and layout. Thus, technologies that can perform deep analysis of webpages are becoming a necessity.

LineX is a literature information extraction system for academic homepages. The system automatically detects academic homepages and extracts the author's profile and publications. The extracted results are further processed and integrated into a search system. Academic homepages usually differ in style and content, so a purely rule-based extraction method does not achieve good performance. The core algorithms in LineX are the support vector machine (SVM) and the conditional random field (CRF). SVM is mainly used to classify text content, and CRF is used to extract subfields from long text. LineX first divides a webpage into sequences of coherent text units and then performs classification and tagging. The results are then trimmed and normalized. During extraction, additional webpage features such as the title, tag features and semantic associations are used to improve extraction precision. Rule-based methods are also used to handle the cases that the machine learning methods cannot.

Experiments were carried out on randomly sampled academic homepages. The results show that LineX achieves very high precision on all subfield extraction tasks. Dictionary features and HTML features contribute substantially to the overall performance.
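The first stage described above, splitting a page into text units and attaching HTML and dictionary features to each, can be sketched as follows. This is only an illustrative sketch and not the LineX implementation: the set of block-level tags, the tiny venue dictionary, the feature names, and the helpers `split_units` and `features` are all assumptions made for the example. The SVM classification and CRF tagging stages that would consume these features are not shown.

```python
from html.parser import HTMLParser

# Assumption: these block-level tags delimit "coherent text units".
BLOCK_TAGS = {"p", "li", "div", "h1", "h2", "h3", "tr"}

# Assumption: a toy dictionary standing in for a real venue lexicon.
VENUE_WORDS = {"proceedings", "conference", "journal", "workshop"}


class UnitSplitter(HTMLParser):
    """Collects (enclosing_block_tag, text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buf = []
        self._tag = None

    def _flush(self):
        # Join buffered text and collapse runs of whitespace.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.units.append((self._tag, text))
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = None

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def split_units(html):
    """Return the list of (tag, text) units found in the HTML string."""
    parser = UnitSplitter()
    parser.feed(html)
    parser.close()
    return parser.units


def features(tag, text):
    """Build a small feature dict for one text unit (hypothetical features)."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return {
        "tag": tag,  # HTML feature
        "n_tokens": len(tokens),
        "has_year": any(t.isdigit() and len(t) == 4 for t in tokens),
        "has_venue_word": any(t in VENUE_WORDS for t in tokens),  # dictionary feature
    }


page = """<h1>Jane Doe</h1>
<p>Associate Professor, Example University</p>
<ul>
<li>J. Doe. A Toy Paper. In Proceedings of an Example Conference, 2011.</li>
<li>J. Doe. Another Toy Paper. Example Journal, 2010.</li>
</ul>"""

units = split_units(page)
for tag, text in units:
    print(features(tag, text), text)
```

In a full pipeline, feature dictionaries like these would be vectorized and passed to a classifier that decides whether a unit is, for example, a publication entry or profile text, and a sequence tagger would then label tokens inside the units classified as publications.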

Keywords/Search Tags:

Information Extraction, Natural Language Processing, Machine Learning, Semi-Structured Information

Related items

1	Research On Machine Learning For Natural Language Processing And Transmission
2	Research On High Risk Information Processing Module Of Internet Public Opinion Based On Natural Language Processing
3	Modeling And Learning Of Representations For Natural Language Sentence-level Structures
4	Design And Implementation Of The Information Processing System Of Safety Accidents Based On Understanding Natural Language
5	Research On Keyword Extraction And Structured List Data Extraction
6	Research On Multimodal Algorithm For Structured Document Information Extraction
7	Narrative Information Extraction with Non-Linear Natural Language Processing Pipeline
8	Research On Key Technologies Of Constructing Person Entity Relation Graph For Public Information In The Web
9	Research On Semantic Information Extraction For Semi-structured Documents
10	Natural Language Processing Aiming To The Core Texts Of Scientific Literature