| With the development of big data, information in modern society can be divided into two kinds: effective information that carries value, and information that appears worthless or repetitive. Across the whole process of data acquisition, cleaning, analysis, and publishing, every stage poses privacy disclosure problems, regardless of the nature of the data. Privacy disclosure arises in two ways: private information may be maliciously mined because of correlations within the data itself, and the third party responsible for data analysis may not be trustworthy. It is therefore necessary to protect data privacy during analysis and publication. This thesis studies privacy protection technology for the data analysis and data publishing processes. For data analysis, existing studies approach dimensionality reduction from the perspective of data correlation and propose correlation sensitivity to reduce the impact of correlation on privacy protection. The difficulty lies in finding an approximately optimal subset that can replace the original dataset while reducing its dimensionality. For data publishing, the mainstream privacy protection technology is privacy-preserving histogram publishing, which either reconstructs the histogram or changes the order in which noise addition and reconstruction are applied. Here too the difficulty lies in the histogram reconstruction process. To address these problems, this thesis proposes a statistics publishing technology for healthcare data based on differential privacy. The main research work is as follows. Firstly, the background and practical significance of the research are analyzed. The thesis reviews related work from three aspects, analyzes the shortcomings of existing work, and presents its main research work and contributions. Secondly, in view of the correlation between
multiple features in healthcare data and the correlation between records, both of which can weaken the protection offered by differential privacy, the risk of privacy leakage caused by correlation can be reduced by deleting features. However, existing feature deletion methods randomly select one feature from two specific collinear features to delete, so the resulting dataset may degrade prediction performance in data analysis. This thesis presents a feature selection method that extracts a maximum feature set from the original dataset. The Bron-Kerbosch algorithm is used to find the maximum clique of the complement of the undirected graph, which yields the maximum independent set of the original undirected graph, and the features in that set are retained. This addresses the information leakage caused by correlation between features. Experimental comparison verifies the feasibility and effectiveness of the proposed method, showing that it extracts features effectively and improves prediction performance compared with other methods. Thirdly, most healthcare data exhibit 'zero bucket' and 'heavy tail' phenomena, because some intervals contain very little data or almost none, while in other intervals the histogram appears 'very gentle' and hides features of the data. Existing histograms therefore cannot faithfully reflect salient features of the data distribution, and the allocation of the privacy budget when adding noise to the histogram remains a problem. This thesis proposes a non-equal-width histogram method. The histogram is constructed from a non-uniform empirical distribution function according to data sparsity, which gives the boundary points of each group, and the privacy budget is allocated according to the size of each group so as to protect the histogram reconstruction. The feasibility and effectiveness of the
proposed method are verified by experimental comparison. The thesis demonstrates the rationality of the grouping and privacy budget allocation, and shows that the method represents the data distribution more clearly and improves query accuracy for long-range queries compared with other methods. |
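
The feature selection step described in the abstract (maximum clique of the complement graph, giving a maximum independent set of the original graph) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that the undirected graph has one vertex per feature and an edge between every pair of features judged to be correlated; how that judgment is made is not stated in the abstract. The function names, the feature names, and the list of correlated pairs are all hypothetical. Bron-Kerbosch enumerates maximal cliques, so the sketch keeps the largest one found.

```python
from itertools import combinations


def bron_kerbosch(r, p, x, adj, cliques):
    """Bron-Kerbosch with pivoting: append every maximal clique to `cliques`."""
    if not p and not x:
        cliques.append(set(r))
        return
    # Pivot on the vertex with the most neighbours in P to prune branches.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}


def max_independent_feature_set(features, correlated_pairs):
    """Return a largest set of features containing no correlated pair.

    Assumption: `correlated_pairs` lists the edges of the undirected
    graph; the rule that produces these pairs is not given in the source.
    """
    correlated = {frozenset(pair) for pair in correlated_pairs}
    # Complement graph: join two features when they are NOT correlated.
    adj = {f: set() for f in features}
    for a, b in combinations(features, 2):
        if frozenset((a, b)) not in correlated:
            adj[a].add(b)
            adj[b].add(a)
    cliques = []
    bron_kerbosch(set(), set(features), set(), adj, cliques)
    # A maximum clique of the complement graph is a maximum
    # independent set of the original graph.
    return max(cliques, key=len)


# Hypothetical example: "bmi" is correlated with "weight" and "glucose".
features = ["age", "bmi", "weight", "glucose"]
correlated_pairs = [("bmi", "weight"), ("bmi", "glucose")]
selected = max_independent_feature_set(features, correlated_pairs)
print(sorted(selected))
```

In this toy input, any feature set containing "bmi" can hold at most two features, so the largest set with no correlated pair drops "bmi" and keeps the other three. Enumerating maximal cliques takes exponential time in the worst case, so a sketch like this is only practical for a moderate number of features.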