Research On Biomedical Event Extraction Method For Small Sample And Imbalanced Dataset

Posted on: 2020-09-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Lu
Full Text: PDF
GTID: 1360330602955535
Subject: Bioinformatics
Abstract/Summary:
The overwhelming amount of literature published in the molecular biology domain makes it difficult for life science researchers to obtain detailed views of biological information. Traditional reading can no longer meet researchers' needs. In recent years, researchers have used text mining to provide named entity recognition (NER) and relation extraction services in the biomedical domain. However, NER and relation extraction are insufficient to help researchers understand increasingly complex biomedical texts. The focus of biomedical text mining has therefore shifted from NER and relation extraction to fine-grained, complex events. Event extraction from biomedical text is the task of extracting the semantic and role information of biological processes, which often have complex structures. It is therefore crucial to represent biomedical events as structured knowledge, and an efficient and accurate extraction method is needed. Better extraction of biomedical events supports the mining and curation work, and the research efficiency, behind gene ontology databases, protein relation databases and pathway databases.

Two problems exist in biomedical event corpora. First, the sample distribution is highly imbalanced. An order-of-magnitude difference between the numbers of positive and negative samples tilts the classifier toward the majority class and markedly reduces classification performance. Second, the datasets are small: annotating training data is very costly, so training data is limited, and limited training data leads to over-fitting. However, few methods improve biomedical event extraction performance by addressing small-sample and imbalanced datasets. Therefore, to address the small-sample and imbalanced-data problems in multi-class event classification in biomedical literature, this dissertation uses a pairwise-model event extraction method and, starting from sentence representation and effective sample selection, studies prediction correction, semi-supervised learning and active learning to improve the performance of biomedical event extraction. The main work is as follows:

(1) To address the small-sample problem in biomedical event extraction, a collaborative learning method based on an SVM classification model and a CNN model is proposed. First, events are extracted from the unlabeled biomedical corpus by an SVM model with manually designed features. Second, two new representations, the dependency word sequence and the dependency type sequence, are generated by expanding the dependency path of each sample in the event sample set. A CNN model based on these two sequence vectors is then used to extract events from the unlabeled corpus. The pseudo-labeled results of the two models are fused, with results selected according to conflict probability evaluation rules, to enlarge the training set, and the SVM classification model is then applied to the test set. The performance and effectiveness of the approach are evaluated through a number of experiments. The experimental results show that the approach can alleviate the small-sample problem in the biomedical event corpus and improve the generalization ability of the classification model.

(2) To address the imbalanced-dataset problem in biomedical event extraction, the dissertation proposes sample filtering based on sequential patterns, together with correction of the prediction results based on a joint scoring mechanism, to improve the classification performance and recognition rate for biomedical events. First, the sample dataset is constructed with the pairwise model. A sequential pattern algorithm is used to filter the negative samples and adjust the proportion of positive to negative samples, so that the influence of the two on the classifier is closer to balanced. Second, considering the joint information between triggers and arguments in multi-argument events, the triples of multi-argument events are extracted directly with an SVM classifier, and the prediction results for binary relations and triple relations are integrated. Finally, a joint scoring mechanism that combines the Convolutional Deep Structured Semantic Model with the importance of the trigger is used to correct the prediction results. Experiments show that the method can effectively balance the dataset in biomedical event extraction, reduce the bias of the classification boundary toward the majority classes, and improve the predictive ability of the model.

(3) To address small samples and differences in class distribution in biomedical events, a biomedical event extraction method based on clustering, query synthesis and confidence evaluation is proposed, which combines semi-supervised learning with active learning. Low-confidence samples are labeled by experts, and high-confidence samples are expanded adaptively. First, an SVM model is used to predict the unlabeled data, and a prediction analysis dataset is constructed. Second, clustering is applied to the prediction analysis dataset to determine the representative and non-representative samples in each cluster. The outliers among the representative samples and the near-center points among the non-representative samples are queried and synthesized as abnormal points. These abnormal points are treated as low-confidence samples to be labeled by experts, while the remaining samples are treated as high-confidence samples. According to the distribution of each event class, the high-confidence samples of each class are adaptively added to the corpus to mitigate the imbalance and small-sample problems across the multiple event classes. The experimental results show that the method achieves better biomedical event extraction performance and improves the generalization ability of the classification model.

In conclusion, to address small-sample and imbalanced datasets in biomedical event corpora, this dissertation studies semi-supervised learning, prediction correction and active learning from the perspective of expanding the dataset and adjusting the sample distribution of each event class. The proposed approaches improve the generalization ability, accuracy and robustness of biomedical event extraction classifiers.
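
As a rough illustration of the adaptive expansion idea in (3), the sketch below adds high-confidence pseudo-labeled samples to a training set and gives under-represented classes a larger share. It is not the dissertation's algorithm. The function name `adaptive_expand`, the `threshold` and `target` parameters, the rule of filling each class up to a target count, and the example class names are all assumptions made for illustration; the abstract does not specify how confidence is computed or how many samples each class receives.

```python
from collections import defaultdict


def adaptive_expand(labeled_counts, pseudo_labeled, threshold=0.9, target=None):
    """Select high-confidence pseudo-labeled samples, favoring rare classes.

    labeled_counts: dict mapping class -> number of labeled samples (hypothetical input)
    pseudo_labeled: list of (sample_id, predicted_class, confidence) tuples
    threshold: minimum confidence treated as "high confidence" (assumed value)
    target: per-class count to fill up to; defaults to the largest class size
    """
    if target is None:
        target = max(labeled_counts.values())

    # Group the high-confidence candidates by their predicted class.
    by_class = defaultdict(list)
    for sample_id, cls, conf in pseudo_labeled:
        if conf >= threshold:
            by_class[cls].append((conf, sample_id))

    selected = []
    for cls, candidates in by_class.items():
        # Classes already at or above the target receive no new samples.
        deficit = max(0, target - labeled_counts.get(cls, 0))
        # Take the most confident candidates first.
        candidates.sort(key=lambda c: c[0], reverse=True)
        selected.extend((sid, cls) for _, sid in candidates[:deficit])
    return selected


# Illustrative data only: class names and confidences are made up.
counts = {"Binding": 2, "Regulation": 5}
preds = [
    ("s1", "Binding", 0.95),
    ("s2", "Binding", 0.80),
    ("s3", "Regulation", 0.99),
    ("s4", "Binding", 0.97),
]
print(adaptive_expand(counts, preds))
```

In this toy run only the minority class gains samples, because the majority class is already at the target size. In a setup like the one the abstract describes, the samples below the threshold would go to expert annotation instead of being discarded.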
Keywords/Search Tags: Biomedical event extraction, Small sample dataset, Imbalanced dataset, Sequential pattern, Collaborative learning