Genome-wide association study (GWAS) is an important method in genomics. By comparing whole-genome sequences between a diseased population and a healthy population, statistical analysis is used to find the genetic variation sites, single nucleotide polymorphisms (SNPs), that have a significant effect on disease occurrence. Regression analysis then combines these significant SNP loci with covariates such as gender, age, and ethnicity to study their joint effect on the disease, thereby revealing its cause.

However, GWAS data release carries two kinds of privacy risk. First, the genotype sample data sets used in SNP significance studies, together with their test statistics, are published anonymously on research websites, and some patients voluntarily provide their genotype data to public research sites. These genotype data are the main source for re-identifying individuals, inferring the genotypes of relatives, and revealing an individual's disease status, yet anonymization and access control are not enough to protect them: they remain exposed to statistical attacks, multi-database linkage attacks, and background-knowledge attacks. Second, when regression analysis is used to study the joint effect of multiple SNPs and covariates on disease, the regression coefficients are usually computed through a cost function to obtain a regression model, that is, a model of how multiple SNPs and covariates affect disease probability. Releasing this regression model directly exposes it to model inversion attacks, in which the training data set can be reconstructed and leaked. If a model service interface is provided to researchers instead, there is a risk of model extraction attacks.

GWAS genotype data and regression models are released so that researchers can obtain more research material and derive more statistical algorithms, which in turn leads to new discoveries. However, the privacy risks involved undoubtedly hinder the contribution, sharing, and release of data. At present, differential privacy is considered the most suitable privacy protection method for data publishing scenarios. A differential privacy mechanism uses the privacy budget to quantify the privacy risk that remains after the data are perturbed. Regarding the two risks above, existing genetic privacy work considers only differential privacy for the release of GWAS statistics, and cryptography and secure multi-party computation for securing genotype data. It has ignored the privacy of genotype data published on research sites or volunteer sites, and it has not yet guarded against the privacy risks caused by leakage from regression analysis models. In view of these two shortcomings, the specific work of this thesis is as follows:

(1) Introduce the basic knowledge and current status of GWAS privacy protection. First, the basics of GWAS are summarized, including the genetic variation loci (SNPs), different genetic models, statistical testing methods, and regression analysis methods. Second, the thesis presents the two main research topics of GWAS together with their research methods and the results they release. Finally, it summarizes the current privacy risks and the state of protection for the different release scenarios.

(2) To address the privacy issue of GWAS genotype data release, Chapter 3 introduces a differential privacy protection method that satisfies a Nash equilibrium, in order to balance the privacy and the data utility of the perturbed genotype data. We model patients and researchers as the two players of a non-cooperative game in which the number of genotype perturbations is the strategy. Patients care about the privacy measure, and researchers care about the utility after perturbation, measured by the p-value. First, a reasonable perturbation interval for the number of target genotypes is calculated from the expected utility. Second, within this interval, the Nash equilibrium point between utility and privacy is obtained from the utility-privacy payoff matrix. Finally, the original genotype matrix is perturbed under differential privacy according to the equilibrium point to obtain the corresponding randomized genotype matrix. Both theoretical and experimental results show that the number of genotype perturbations found by this method simultaneously satisfies the best expected utility and privacy as we define them.

(3) To address the privacy issue of GWAS regression model release, this thesis introduces a differential privacy protection method that perturbs the cost function and then computes new regression coefficients. Most phenotypes in GWAS are categorical and require logistic regression analysis, so this thesis focuses on protecting logistic regression models. The cost function essentially represents the probability that the value predicted by the regression model matches the actual value, and the regression coefficients are obtained by maximum likelihood estimation on this cost function. Our approach therefore differs from previous protection methods, which add noise directly to the logistic regression coefficients. We expand the cost function by Taylor expansion, converting the log-likelihood function into a low-order polynomial. The difference between the original polynomial and the perturbed polynomial is called the "difference function". We add noise only to the coefficients of the difference function of the cost function and then solve for new regression coefficients such that the perturbed cost function is as close as possible to the original one, so that the regression behavior of the new coefficients stays close to that of the original model. Verification shows that this method reduces the sensitivity of the data set to noise, while the randomness of the noise guarantees a certain degree of privacy. The perturbed regression model thus has strong privacy while preserving the prediction accuracy of the original model, and the method can be used to protect the privacy of published regression models.
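The equilibrium step in item (2) can be illustrated with a minimal sketch. It assumes a finite two-player game given as two payoff matrices and looks only for pure-strategy Nash equilibria. The function name `pure_nash_equilibria` and the payoff values are hypothetical, chosen only for illustration; the abstract does not specify how the utility-privacy payoff matrix is built or how the equilibrium is computed.

```python
import numpy as np


def pure_nash_equilibria(patient_payoff, researcher_payoff):
    """Return every (row, col) cell where neither player gains by deviating alone.

    Rows are the patient's strategies, columns are the researcher's strategies.
    Both matrices must have the same shape.
    """
    P = np.asarray(patient_payoff, dtype=float)
    R = np.asarray(researcher_payoff, dtype=float)
    if P.shape != R.shape:
        raise ValueError("payoff matrices must have the same shape")

    # The patient chooses a row: mark the best row(s) for each fixed column.
    patient_best = P == P.max(axis=0, keepdims=True)
    # The researcher chooses a column: mark the best column(s) for each fixed row.
    researcher_best = R == R.max(axis=1, keepdims=True)

    # A cell is a pure Nash equilibrium when it is a best response for both.
    cells = np.argwhere(patient_best & researcher_best)
    return [(int(i), int(j)) for i, j in cells]


# Hypothetical payoffs for three candidate strategies per player.
patient = [[1, 2, 0],
           [3, 4, 2],
           [2, 3, 1]]
researcher = [[5, 3, 1],
              [2, 4, 3],
              [1, 2, 3]]
equilibria = pure_nash_equilibria(patient, researcher)
print(equilibria)
```

A game of this kind may have no pure-strategy equilibrium, in which case the list is empty and a mixed-strategy solver would be needed.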
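The idea in item (3) of perturbing a polynomial approximation of the cost function, instead of the fitted coefficients, can be sketched as follows. This is an illustration under stated assumptions and not the thesis's difference-function method: it uses the second-order expansion log(1 + e^z) ≈ log 2 + z/2 + z²/8 around z = 0, adds Laplace noise to every polynomial coefficient, and minimizes the noisy quadratic. All function names are hypothetical, labels are assumed to lie in {0, 1}, and the `sensitivity` value is supplied by the caller: it has to be derived from the actual bounds on the features, so the sketch by itself does not establish a differential privacy guarantee.

```python
import numpy as np


def taylor_logistic_coefficients(X, y):
    """Coefficients of the 2nd-order Taylor approximation of the logistic loss.

    Per sample: log(1 + exp(x.w)) - y * x.w  ~  log 2 + (1/2 - y) x.w + (x.w)^2 / 8.
    Summed over samples this is const + linear.w + w' quadratic w.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    linear = ((0.5 - y)[:, None] * X).sum(axis=0)  # shape (d,)
    quadratic = (X.T @ X) / 8.0                    # shape (d, d)
    return linear, quadratic


def solve_quadratic(linear, quadratic, floor=1e-3):
    """Minimize linear.w + w' quadratic w after forcing the matrix to be positive definite."""
    sym = (quadratic + quadratic.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Noise can make the matrix indefinite; clip eigenvalues so a minimum exists.
    eigvals = np.clip(eigvals, floor, None)
    # Setting the gradient linear + 2 A w to zero gives w = -A^{-1} linear / 2.
    return -0.5 * eigvecs @ ((eigvecs.T @ linear) / eigvals)


def private_logistic_fit(X, y, epsilon, sensitivity, rng=None):
    """Fit coefficients from Laplace-perturbed polynomial coefficients.

    Assumption: `sensitivity` is provided by the caller. Adding independent
    noise to every matrix entry and symmetrizing is a simplification.
    """
    rng = np.random.default_rng() if rng is None else rng
    linear, quadratic = taylor_logistic_coefficients(X, y)
    scale = sensitivity / epsilon
    noisy_linear = linear + rng.laplace(0.0, scale, size=linear.shape)
    noisy_quadratic = quadratic + rng.laplace(0.0, scale, size=quadratic.shape)
    return solve_quadratic(noisy_linear, noisy_quadratic)
```

Because the noise enters through the objective, the released coefficients are always the minimizer of a well-defined perturbed cost function, which is the property the paragraph above relies on.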