| As a fundamental part of biomedical text mining, Protein-Protein Interaction (PPI) extraction has great research significance and application value, and has received increasing attention from researchers in recent years. Current research on PPI extraction generally adopts statistical machine learning methods and has achieved acceptable results. However, current methods still suffer from two difficult problems: one is the lack of annotated data; the other is the vocabulary gap and data sparseness in feature representation. First, insufficient annotated data leads to lower efficiency, and manual annotation usually requires large and expensive experiments. Second, the one-hot encoding that is widely used for feature representation in traditional machine learning methods for PPI extraction ignores word order and semantic information and cannot express latent relational information, which limits performance on PPI extraction. To address these problems, this paper conducts research in the following two aspects: (1) We introduce transfer learning to solve the problem of insufficient annotated data, and propose an improved algorithm called "DisTrAdaboost" to avoid "negative transfer". To overcome the lack of training data, we apply an instance-based transfer learning method to boost performance on PPI extraction. Because of the distribution difference between data domains, the existing TrAdaboost algorithm converges too slowly. In contrast, our DisTrAdaboost algorithm accelerates convergence by adjusting the initial weights according to the relative distribution. 
In our experiments, both DisTrAdaboost and TrAdaboost achieve good performance on the AIMed corpus; when the same experiment is performed on IEPA, TrAdaboost falls into "negative transfer", while DisTrAdaboost maintains its transfer efficiency. (2) We propose a word representation approach to feature representation to overcome the "data sparseness" and "vocabulary gap" problems. We employ an unsupervised word representation approach to learn latent semantic information from large-scale unannotated data. Each word is then mapped to a real-valued vector or assigned to a cluster based on this semantic information, so that similar words share similar representations, which addresses both problems. We evaluate three word representation methods: distributed representation, vector clustering representation, and Brown clustering representation, and compare their effects on the PPI extraction task. Experimental results show that the distributed representation yields large improvements on five public PPI corpora (AIMed, BioInfer, HPRD50, IEPA and LLL) and performs much better than the two clustering-based representations, achieving F-scores of 69.7%, 74.0%, 78.0%, 76.5% and 87.3%, respectively, which is better than other state-of-the-art methods. |
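The "vocabulary gap" attributed to one-hot encoding above can be illustrated with a minimal sketch. It is not the paper's implementation: the three-word vocabulary, the helper names `one_hot` and `cosine`, and the two-dimensional values in `TOY_EMBEDDINGS` are all hypothetical, hand-picked for illustration rather than learned from any corpus. The sketch only shows that two distinct words always have zero similarity under one-hot encoding, whereas dense real-valued vectors can place related words close together.

```python
import math


def one_hot(word, vocab):
    """Return the one-hot vector of `word` over the ordered list `vocab`."""
    return [1.0 if w == word else 0.0 for w in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vocabulary (assumption, not taken from any PPI corpus).
VOCAB = ["binds", "interacts", "patient"]

# Hypothetical hand-picked dense vectors standing in for learned
# distributed representations; real embeddings would be trained.
TOY_EMBEDDINGS = {
    "binds": [0.9, 0.1],
    "interacts": [0.8, 0.2],
    "patient": [0.1, 0.9],
}

# One-hot: distinct words are orthogonal, so related words look unrelated.
one_hot_sim = cosine(one_hot("binds", VOCAB), one_hot("interacts", VOCAB))

# Dense vectors: related words can receive a high similarity score.
dense_sim = cosine(TOY_EMBEDDINGS["binds"], TOY_EMBEDDINGS["interacts"])
dense_unrelated = cosine(TOY_EMBEDDINGS["binds"], TOY_EMBEDDINGS["patient"])

print(one_hot_sim, dense_sim, dense_unrelated)
```

Under these assumed values, the one-hot similarity between "binds" and "interacts" is exactly zero, while the dense vectors rank that pair as more similar than "binds" and "patient". This is the property that lets a classifier generalize from an interaction verb seen in training to a related verb that was not.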