| Association rule mining is an important research direction in the era of big data: it extracts important information and provides decision support. Association rules are generally divided into positive and negative association rules. Mining association rules in large data sets is a difficult problem. Earlier work proposed an association rule mining algorithm based on Gibbs sampling that can effectively mine important association rules from large data sets; it reduces the search space from 2^n to a polynomial level, which makes mining association rules in large data sets feasible. However, the time cost of the Gibbs sampling process in this algorithm is too high, which hinders its wide application. In addition, the algorithm mines only positive association rules and does not consider negative ones. How to mine positive and negative association rules in big data sets at low time cost has therefore become an important research topic. To address the excessive sampling time of the Gibbs-sampling-based association rule mining algorithm, this thesis proposes, from the algorithmic point of view, an association rule mining algorithm based on dynamic Hash_Gibbs sampling. First, the data set is preprocessed: after missing values are handled, the data are one-hot encoded and converted into binary form. The binary data are then divided into blocks, and each block is converted to a decimal number to reduce storage space. After this conversion, the data and their support are stored in a hash table for fast subsequent lookup. Gibbs sampling requires computing the support of the data and then the conditional transition probability. When computing the support, the hash table is searched first; if the search succeeds, the corresponding support is returned directly. If the search fails, the support is computed with the improved support calculation method and returned. To reduce the probability of a repeated failure, the data and
support are dynamically inserted into the hash table. The improved support calculation method first obtains the indices of the columns whose value is 1 in the sample and then traverses only those columns in the data set. After sampling, the importance of each dimension is judged by an evaluation function, and the most important features are selected to form a new data set, in which association rules are mined. The time and space complexity of the original algorithm and the proposed algorithm are analyzed. Experiments on simulated and real data sets show that the proposed algorithm is faster than the original algorithm across the factors that affect Gibbs sampling, at the cost of increased memory use. To address the problems that the Gibbs-sampling-based association rule mining algorithm can mine only positive association rules and that its mining speed is too slow, this thesis proposes, from the perspective of parallel technology, a positive and negative association rule mining algorithm for Gibbs sampling based on OpenMP. The algorithm adds the mining of negative association rules and introduces a correlation coefficient into Gibbs sampling to distinguish positive from negative association rules. Calculation methods for four different types of rules in Gibbs sampling are given. To speed up sampling, OpenMP is used to accelerate the support calculation. Experiments are carried out on a simulated data set and a real data set. In the simulated-data experiments, the Apriori algorithm, the Apriori algorithm based on the correlation coefficient, the Apriori algorithm based on cosine similarity, and the positive and negative association rule mining algorithm for Gibbs sampling based on OpenMP are applied to a simulated data set with a small number of features. In the real-data experiments, four kinds of association rules are mined using the Apriori algorithm based
on the correlation coefficient and the Apriori algorithm based on cosine similarity under different minimum support and minimum confidence thresholds. Under different tuning parameters, the positive and negative association rule mining algorithm for Gibbs sampling based on OpenMP is used to mine two types of association rules, A→B and A→¬B, and the frequency of the most important feature IDs and the most important association rules are obtained. The parallel efficiency of the algorithm in mining these two types of rules is analyzed. Experiments show that the algorithm can effectively mine positive and negative association rules in data sets and that its parallel mining efficiency is high. |
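
The hash-table support lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `to_key`, `support_by_columns`, `SupportCache`, and `BLOCK_BITS`, the block width of 8 bits, and the choice to start from an empty table (rather than one preloaded with the data set's rows) are all assumptions made for illustration. The sketch shows the three ideas named above: converting a binary vector to per-block decimal numbers, computing support by inspecting only the columns whose value is 1, and inserting a computed support into the table after a failed lookup.

```python
from typing import Dict, List, Sequence, Tuple

BLOCK_BITS = 8  # assumed block width; the abstract does not state one


def to_key(bits: Sequence[int], block_bits: int = BLOCK_BITS) -> Tuple[int, ...]:
    """Split a 0/1 vector into blocks and store each block as a decimal integer."""
    key = []
    for start in range(0, len(bits), block_bits):
        value = 0
        for bit in bits[start:start + block_bits]:
            value = (value << 1) | bit  # read the block as a binary number
        key.append(value)
    return tuple(key)


def support_by_columns(sample: Sequence[int], rows: List[Sequence[int]]) -> float:
    """Fraction of rows that contain every item set to 1 in `sample`.

    Only the columns where the sample is 1 are inspected in each row.
    """
    if not rows:
        return 0.0
    cols = [j for j, bit in enumerate(sample) if bit == 1]
    hits = sum(1 for row in rows if all(row[j] == 1 for j in cols))
    return hits / len(rows)


class SupportCache:
    """Hash table from decimalized samples to their support (hypothetical helper)."""

    def __init__(self, rows: List[Sequence[int]]) -> None:
        self.rows = rows
        self.table: Dict[Tuple[int, ...], float] = {}
        self.misses = 0  # number of failed lookups, kept for inspection

    def support(self, sample: Sequence[int]) -> float:
        key = to_key(sample)
        cached = self.table.get(key)
        if cached is not None:       # successful lookup: return stored support
            return cached
        self.misses += 1             # failed lookup: compute, then insert
        value = support_by_columns(sample, self.rows)
        self.table[key] = value
        return value


# Tiny made-up data set: 4 transactions over 4 one-hot encoded items.
rows = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]
cache = SupportCache(rows)
print(cache.support([1, 1, 0, 0]))  # first call computes and inserts
print(cache.support([1, 1, 0, 0]))  # second call is served from the table
print(cache.misses)
```

Because a Gibbs sampler tends to revisit the same states many times, caching each computed support trades memory for time, which matches the time advantage and memory increase reported in the abstract.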