In the context of Healthy China, people's demand for drug safety is increasing. In recent years, as drug safety monitoring in China has been strengthened, the National Medical Products Administration has collected a large amount of adverse drug reaction (ADR) event data every year in the form of spontaneous reports. ADRs pose a serious threat to people's health, and how to strengthen the risk re-evaluation of post-marketing drugs has become an important topic. At present, most countries divide drugs into two categories, prescription drugs (Rx) and over-the-counter drugs (OTC), according to the symptoms and severity of adverse reactions. In China, OTC drugs are further divided into OTC-A and OTC-B. The order of drug risk levels from high to low is Rx, OTC-A and OTC-B. The category conversion of post-marketing drugs in China is mainly performed manually by experienced medical experts, who re-evaluate the drugs according to factors such as the frequency of adverse events and the degree of harm during clinical use, and reassign drug categories accordingly. However, this method is flawed because of insufficient samples and subjective bias. Therefore, this paper proposes a drug risk assessment model based on ADR big data and machine learning to provide decision support for drug risk management. The research mainly includes the following parts:

(1) Data preprocessing. This paper uses 781,956 spontaneous reports from 2011 to 2018, provided by the Jiangsu ADR Monitoring Center, as the research data. The data are split and normalized in three respects (data format, drug names and adverse reaction terms) to obtain a dataset with a one-to-one correspondence between drug names and adverse reactions. The proportional reporting ratio (PRR) algorithm is used to detect potential signals between drugs and adverse reactions, thereby establishing a data matrix with drugs as samples and adverse reactions as features. The class of each drug is labeled with reference to the Chinese Pharmacopoeia, resulting in a dataset with three class labels, which lays the data foundation for the subsequent classification models.

(2) Sample augmentation. Because the three classes of drugs are imbalanced in the dataset, this paper introduces a kernel function into the original synthetic minority over-sampling technique (SMOTE) and proposes an improved K-SMOTE algorithm. The algorithm expands the minority classes (OTC-A and OTC-B) so that the number of drugs in each class is balanced.

(3) Feature enhancement. Combining feature selection (FS) and generative adversarial networks (GANs), this paper proposes an FS_GAN model for feature enhancement of high-dimensional sparse data. The model measures the importance of the adverse reaction features in the ADR dataset and selects the features with higher importance scores. GANs are then trained on these high-scoring features to generate artificial features that conform to the distribution of the real data. While retaining the original adverse reaction features, the FS_GAN model adds more effective data to reduce the sparsity of the high-dimensional feature space and thereby enhance the features of the ADR data.

(4) Model construction. In this paper, the random forest (RF) algorithm is used to build a three-class classifier, which is combined with the FS_GAN feature enhancement model and the K-SMOTE algorithm to construct a drug risk assessment model (FS_GAN+K-SMOTE+RF). Two further models (RF and K-SMOTE+RF) are set up for comparative experiments, and the validity of the models for classifying drug risk levels is verified with a variety of evaluation indicators.

The results show that the K-SMOTE algorithm and the FS_GAN feature enhancement technique proposed in this paper can substantially improve the performance of the model. The overall accuracy of our model reaches 97.90%, with F1 scores of 98.76% for Rx, 92.51% for OTC-A and 94.62% for OTC-B. Therefore, the drug risk assessment model proposed in this paper has practical application value and will contribute to pharmacovigilance in China.
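
As an illustration of the PRR step described in part (1), the following is a minimal sketch of how a proportional reporting ratio can be computed from one-to-one (drug, adverse reaction) report pairs using the standard 2×2 contingency table. The function name `prr_scores` and the toy data are hypothetical and are not taken from this paper; the signal thresholds applied in the paper are not stated in this abstract, so none are applied here.

```python
from collections import Counter


def prr_scores(pairs):
    """Compute PRR for every observed (drug, reaction) pair.

    `pairs` is a list of (drug, reaction) tuples, one per report row.
    For a given pair, the 2x2 table is:
        a = reports with this drug and this reaction
        b = reports with this drug and any other reaction
        c = reports with any other drug and this reaction
        d = reports with any other drug and any other reaction
    and PRR = [a / (a + b)] / [c / (c + d)].
    """
    pair_counts = Counter(pairs)
    drug_counts = Counter(drug for drug, _ in pairs)
    reaction_counts = Counter(reaction for _, reaction in pairs)
    total = len(pairs)

    scores = {}
    for (drug, reaction), a in pair_counts.items():
        b = drug_counts[drug] - a
        c = reaction_counts[reaction] - a
        d = total - a - b - c
        if c == 0 or (c + d) == 0:
            # PRR is undefined when no other drug reports this reaction
            # (or when there are no other drugs at all); skip the pair.
            continue
        scores[(drug, reaction)] = (a / (a + b)) / (c / (c + d))
    return scores


# Hypothetical toy data: two drugs, two reactions.
example = (
    [("drug_A", "rash")] * 3
    + [("drug_A", "nausea")] * 1
    + [("drug_B", "rash")] * 1
    + [("drug_B", "nausea")] * 5
)
example_scores = prr_scores(example)
```

In this toy data, the pair (`drug_A`, `rash`) has a = 3, b = 1, c = 1 and d = 5, so its PRR is (3/4) / (1/6) = 4.5. How such scores are turned into entries of the drug-by-reaction feature matrix (for example raw PRR values or thresholded signals) is a design choice that this sketch leaves open.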