Research on Machine Learning-Based Medical Insurance Fraud Detection

Posted on: 2021-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Feng
GTID: 2514306455481824
Subject: Applied Statistics
Abstract/Summary:
China's social security system comprises social insurance, assistance, and subsidies, and the medical insurance system holds an indispensable position within it. As the national economy has developed, living standards have continued to improve and people pay more attention to personal health. Since its implementation, the medical insurance fund has played an important role in easing the financial burden on the general public when they seek medical care or fall ill. However, it has been accompanied by the problem of medical insurance fraud, which not only seriously threatens the lawful use of the medical insurance fund but also hinders the effective implementation of government health insurance policies.

This paper uses desensitized consumption records of insured persons. First, statistical methods and visualization techniques are used to clean the data and carry out a descriptive analysis of the variables, in order to study how the distributions of normal insured persons and fraudsters differ. Then four types of features are constructed: amount, frequency of individual doctor visits, the three-item items, and disease type. Using the resulting sample feature matrix, Logistic regression, GBDT, and XGBoost models are fitted; to address class imbalance, the Easy Ensemble method and combined SMOTE + Tomek links sampling are adopted.

In the final prediction results, the GBDT model based on SMOTE + Tomek links sampling is more balanced between accuracy and recall; overall, however, the stacking-based fusion model performs better on the F1 value, with a recall of 75.47% and a precision of 69.32%, which can improve the efficiency of medical insurance institutions in identifying fraudsters. According to the variable importance analysis of the GBDT and XGBoost models, fraudsters generally have higher values than normal insured persons for spending on the various item types, the total amount declared across items, and the total drug amounts declared and incurred. Fraudulent behavior usually shows a pattern of high frequency within a short period.
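
The abstract reports recall and precision for the stacking-based fusion model but does not state the F1 value itself. As a minimal sketch, assuming the standard definition of F1 as the harmonic mean of precision and recall, the implied value can be computed from the two reported figures. The function name `f1_from_precision_recall` and the variable names are illustrative and are not taken from the thesis.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return F1, the harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for the stacking-based fusion model.
stacking_recall = 0.7547
stacking_precision = 0.6932

# Derived here for illustration; the abstract does not state this value.
stacking_f1 = f1_from_precision_recall(stacking_precision, stacking_recall)
print(f"Implied F1 for the stacking model: {stacking_f1:.4f}")
```

Under these assumptions the implied F1 is roughly 0.72. Because F1 is a harmonic mean, it sits closer to the lower of the two figures, so it penalizes a model that gains recall only by giving up precision.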
Keywords/Search Tags: Medical fraud, Logistic regression, GBDT, XGBoost, Stacking, Importance of variables