| Faced with the massive volume of text produced by the information explosion, obtaining the necessary information quickly and accurately has become a common concern. Research on text information extraction arose against this background; its purpose is to provide tools and methods for obtaining information from massive online text quickly and accurately. Extracting information from research papers not only makes it possible to organize and manage these papers effectively and to improve users' retrieval efficiency, but also supports many kinds of statistical work, such as topic analysis, statistics on related papers, and citation analysis of journals, research institutes, individual papers, or scholars. In addition, it helps to identify research hot spots and trends. Automatic extraction of information from research papers is therefore of great value to research. At present, methods based on statistical learning are a relatively new approach to text information extraction. They have achieved good results and are considered to be of great practical value. Among them, text information extraction based on conditional random fields (CRFs) has attracted particular attention. After a comprehensive analysis of various text information extraction approaches, approaches to information extraction from research papers based on CRFs were studied in particular, and the traditional approaches were found to have two limitations: ① the granularity of the text object to be extracted was fixed either at the word level or at the text-block level, so the traditional approaches could not segment and extract the text flexibly at the granularity appropriate to different circumstances; ② during extraction, the traditional approaches could not make adequate use of the rich overall feature information contained in the text or of its rich context information. 
Such limitations were particularly evident when dealing with text composed of complex fields or containing a large amount of information. Building on the research results of scholars at home and abroad, a hierarchical method of information extraction from research papers based on CRFs was proposed. First, according to the layout information, lines whose first character was not a space were combined with the preceding lines into big lines, which were processed as the basic units of extraction. Second, according to the requirements of information extraction from research papers based on CRFs, appropriate feature functions were developed for the CRFs. Third, the algorithm made use of format information such as list separators, newline characters, and line header characters, and combined it with the feature functions of the CRFs to segment the text hierarchically into appropriate lines, blocks, and words. Finally, the parameters of the CRFs were obtained through training, and the CRFs were then applied to information extraction from research papers in specific fields. Experimental results show that the proposed method performs better than CRF-based methods that simply segment the whole text into words or into blocks. |
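
The first two steps of the method described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one literal reading of the merging rule (a line that starts with a non-space character continues the previous big line, while a line that starts with a space opens a new one), and the function names `merge_big_lines` and `line_features`, as well as the specific features chosen, are hypothetical. CRF training and the hierarchical segmentation into blocks and words are not shown.

```python
from typing import Dict, List


def merge_big_lines(raw_lines: List[str]) -> List[str]:
    """Merge physical lines into 'big lines'.

    Assumed reading of the abstract: a line whose first character is not
    a space continues the previous big line; a line that starts with a
    space (or the very first line) opens a new one. Blank lines are skipped.
    """
    big_lines: List[str] = []
    for line in raw_lines:
        if not line.strip():
            continue  # ignore empty or whitespace-only lines
        starts_new = not big_lines or line[0].isspace()
        if starts_new:
            big_lines.append(line.strip())
        else:
            big_lines[-1] = big_lines[-1] + " " + line.strip()
    return big_lines


def line_features(big_lines: List[str], i: int) -> Dict[str, object]:
    """Hypothetical observation features for big line i (illustrative only)."""
    text = big_lines[i]
    tokens = text.split()
    feats: Dict[str, object] = {
        "position": i,
        "n_tokens": len(tokens),
        "has_at_sign": "@" in text,
        "has_digit": any(ch.isdigit() for ch in text),
        "starts_upper": text[:1].isupper(),
        # Format cue: list separators such as commas or semicolons.
        "has_list_separator": any(sep in text for sep in (",", ";")),
    }
    # Context feature: token count of the previous big line (0 if none).
    feats["prev_n_tokens"] = len(big_lines[i - 1].split()) if i > 0 else 0
    return feats
```

Feature dictionaries of this kind, one per big line, are a common input format for sequence labelers; which features the original work actually used is not stated in the abstract.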