| Proteins carry out their physiological functions by interacting with other proteins. Identifying and analyzing protein-protein interactions helps explain cellular mechanisms and may provide information for developing new drugs and new methods of disease diagnosis. A large amount of new protein-protein interaction information is published every month in research papers, which are usually indexed in biomedical literature databases. Extracting this information by manually reading such an enormous body of literature is very time-consuming and labor-intensive, so completing the task rapidly and effectively has become a major challenge. This dissertation therefore explores protein-protein interaction information extraction. The main research work and contributions are as follows: We survey the state of the art in protein interaction information extraction and summarize the main problems in existing algorithms. A two-step algorithm for protein interaction information extraction is presented: protein name information is first extracted from the literature, and protein-protein interaction information is then extracted from the results of the first step. This algorithm provides a novel and effective approach to protein interaction information extraction. The databases used in this research field are compared, and GENIA3.02 is selected as the main data set. 
Experiments that combine five word features (word, part of speech, prefix, suffix, and pre-class) under different combination policies show that the protein name information extraction algorithm performs best when all five features are combined; that the protein name information extraction algorithm based on the Support Vector Machine (SVM) outperforms the traditional dictionary-based algorithm and performs similarly to the algorithm based on Maximum Entropy; and that the protein interaction information extraction algorithm developed in this dissertation outperforms other algorithms. Finally, a protein-protein interaction information extraction system is designed with a modular structure. It consists of the following six modules: Documentation Pre-processing Module, Feature Extraction Module, Protein Name Information Extraction Module, Protein Name Information Extraction Results Filter Module, Protein-Protein Interaction Information Extraction Module, and Data Display Module. All modules except the Data Display Module have been implemented. |
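
As an illustration of the five word features named above, the following is a minimal sketch of how a per-token feature set might be assembled before it is passed to a classifier such as an SVM. It is not the dissertation's implementation. The function name `token_features`, the affix length of 3, the `O` label, and the reading of "pre-class" as the class assigned to the preceding token are all assumptions made for this example. The `enabled` argument stands in for the different feature combination policies.

```python
from typing import Dict, Sequence

# The five feature names from the abstract; "pre_class" is assumed here to
# mean the class label assigned to the preceding token.
FEATURE_NAMES = ("word", "pos", "prefix", "suffix", "pre_class")


def token_features(
    tokens: Sequence[str],
    pos_tags: Sequence[str],
    i: int,
    prev_class: str,
    affix_len: int = 3,  # assumed affix length, not taken from the source
    enabled: Sequence[str] = FEATURE_NAMES,
) -> Dict[str, str]:
    """Build the feature dictionary for token i, keeping only enabled features."""
    word = tokens[i]
    all_features = {
        "word": word,
        "pos": pos_tags[i],
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "pre_class": prev_class,
    }
    # Selecting a subset models one feature combination policy.
    return {name: all_features[name] for name in enabled}


# Hypothetical example sentence and part-of-speech tags.
tokens = ["IL-2", "activates", "NF-kappaB"]
pos_tags = ["NN", "VBZ", "NN"]

full = token_features(tokens, pos_tags, 2, prev_class="O")
word_and_pos = token_features(tokens, pos_tags, 2, "O", enabled=("word", "pos"))
```

Comparing classifier performance across different `enabled` subsets would correspond to the feature combination experiments described in the abstract.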