Research on a Hybrid Sampling Algorithm and Its Application in a Medical Question Answering System

Posted on: 2019-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: L X Zhang
GTID: 2428330545454894
Subject: Computer technology
Abstract/Summary:
With the rapid development of machine learning and big data, imbalanced dataset classification has become an active research topic. Imbalanced datasets are common in practical applications such as medical diagnosis, fraud detection, and earthquake prediction. The research focus is how to improve the classification accuracy of positive samples: most classification algorithms are biased toward the negative class (the majority class), while the recognition rate for the positive class (the minority class) is low. Based on an in-depth analysis of imbalanced data processing methods and of related research on medical question answering (QA) systems, this thesis proposes, for binary classification datasets, a hybrid sampling algorithm based on sample subdivision, named SS-HSA (Hybrid Sampling Algorithm Based on Sample Subdivision), and studies its application in a medical QA system. The main contents are as follows: (1) The ENN undersampling, Borderline-SMOTE oversampling, Random-SMOTE+ENN hybrid sampling, and ISMOTE oversampling algorithms are analyzed in detail, providing the theoretical basis for the proposed SS-HSA algorithm. (2) A hybrid sampling algorithm based on sample subdivision is presented at the data level. It combines the advantages of the Borderline-SMOTE, ISMOTE, and ENN sampling algorithms and adds the idea of sample subdivision. On the one hand, it precisely controls the number of samples generated and makes the newly generated positive samples more reasonable; on the other hand, it effectively removes boundary samples from the dataset, making the class boundary clearer. Comparative experiments show that the combined method outperforms the Borderline-SMOTE, ISMOTE, and Random-SMOTE+ENN sampling methods in classification performance on both the overall dataset and the positive class. (3) In the medical QA system, the proposed algorithm is used to sample the dataset so that the numbers of positive and negative answers are equalized; model training and sorting are then performed on the balanced data. Experimental results show that applying this hybrid sampling method effectively improves the judging ability of the medical QA system.
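The abstract names the ingredients of SS-HSA but not its exact procedure, so the following is only an illustrative sketch of the general idea, not the thesis's algorithm. It assumes a Borderline-SMOTE-style subdivision of positive samples into safe, border, and noise groups, SMOTE-style interpolation to generate new positives, and an ENN-style cleaning pass that (by assumption) removes only negatives. All function names, thresholds, and parameters (`subdivide_positives`, `oversample`, `enn_clean`, `hybrid_sample`, `k`, `enn_k`) are hypothetical.

```python
import math
import random


def nearest(X, i, candidates, k):
    """Return the k candidates closest to X[i] (Euclidean), excluding i itself."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]


def subdivide_positives(X, y, k=5, pos=1):
    """Tag each positive sample 'safe', 'border' or 'noise' by its negative-neighbour count.

    Assumed rule (Borderline-SMOTE style): all neighbours negative -> noise,
    at least half negative -> border, otherwise safe.
    """
    tags = {}
    everyone = range(len(X))
    for i in everyone:
        if y[i] != pos:
            continue
        neigh = nearest(X, i, everyone, k)
        n_neg = sum(1 for j in neigh if y[j] != pos)
        if n_neg == len(neigh):
            tags[i] = "noise"
        elif 2 * n_neg >= len(neigh):
            tags[i] = "border"
        else:
            tags[i] = "safe"
    return tags


def oversample(X, y, n_new, k=5, pos=1, seed=0):
    """Append n_new synthetic positives interpolated between a seed and a positive neighbour."""
    rng = random.Random(seed)
    tags = subdivide_positives(X, y, k, pos)
    positives = [i for i in tags if tags[i] != "noise"]
    # Prefer border samples as seeds; fall back to all non-noise positives.
    seeds = [i for i in positives if tags[i] == "border"] or positives
    X_out, y_out = [list(p) for p in X], list(y)
    if len(positives) < 2:
        return X_out, y_out  # nothing to interpolate between
    for _ in range(n_new):
        i = rng.choice(seeds)
        j = rng.choice(nearest(X, i, positives, k))
        gap = rng.random()
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(pos)
    return X_out, y_out


def enn_clean(X, y, k=3, pos=1):
    """Drop negatives whose k nearest neighbours are mostly positive (assumed ENN variant)."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        if y[i] != pos:
            neigh = nearest(X, i, everyone, k)
            n_pos = sum(1 for j in neigh if y[j] == pos)
            if 2 * n_pos > len(neigh):
                continue
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def hybrid_sample(X, y, k=5, enn_k=3, pos=1, seed=0):
    """Oversample positives up to the negative count, then clean the boundary with ENN."""
    n_pos = sum(1 for label in y if label == pos)
    n_new = max(0, (len(y) - n_pos) - n_pos)
    X_bal, y_bal = oversample(X, y, n_new, k, pos, seed)
    return enn_clean(X_bal, y_bal, enn_k, pos)
```

Calling `hybrid_sample(X, y)` on lists of feature vectors and 0/1 labels returns a resampled copy. Oversampling first and cleaning second mirrors the order used by SMOTE+ENN-style methods; whether SS-HSA follows the same order is not stated in the abstract.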
Keywords/Search Tags: Imbalanced dataset, Positive class, Negative class, Hybrid Sampling, QA System