Query optimization has always been a central problem in the database field. Business applications of a database issue large numbers of queries, and the speed of query execution determines the overall efficiency of the database, so query optimization is the primary means of accelerating query execution. Decades of research on traditional databases have accumulated many query optimization results. Traditional query optimization methods usually draw on past research experience and set fixed rules for query optimization. Such fixed rule settings are general, but they cannot learn and therefore cannot adapt as the data distribution changes. At the same time, the traditional query optimizer requires staff to tune a large number of parameters, and its performance improvement depends mainly on the CPU's progress, whose development has itself reached a bottleneck.

Cardinality estimation is a critical component of the query optimizer. It addresses the problem of estimating the number of results before a query is executed. Accurate cardinality estimation can guide the selection of join order and cost prediction in query optimization and thus improve the effect of query optimization. Traditional cardinality estimation methods usually rely on the independence assumption or on sampling and histograms. The independence-assumption model is coarse: although its time efficiency is high, its accuracy is low. The sampling-and-histogram method depends on sampled data and statistics and must consume significant computational resources to improve the accuracy of its estimates. Traditional query optimization methods therefore have to trade efficiency against accuracy and cannot achieve both.

In recent years, deep learning models have achieved excellent research results on high-dimensional data such as images, audio, and text, and combining learned models with cardinality estimation has become a new research hotspot. However, existing methods mainly solve the cardinality estimation problem through supervised learning, which requires many training samples, so the learning model must be trained for a long time before it achieves a good estimation effect. This paper therefore adopts unsupervised learning: based on the Transformer model of deep learning, a cardinality estimation method built on a multi-feature partitioned hybrid model is proposed.
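For orientation, the formulation this line of work builds on (stated here in assumed notation, not quoted from this paper) expresses the cardinality as the table size times the selectivity of the query predicate, and factorizes the joint distribution over attribute columns into exactly the conditional probabilities that a masked model can learn:

```latex
% Cardinality of query Q with predicate \theta over table T:
\mathrm{card}(Q) = |T| \cdot P(\theta)
% Chain-rule factorization over attribute columns x_1, \dots, x_n:
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})
```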
The main contributions of this paper are:

1) The core problem of cardinality estimation is learning the overall distribution of the data, so the high-dimensional data table must first be encoded. The encoding method used in this paper traverses the whole data table, counts the distinct attribute values in each column, and represents each attribute value as a number. Once the table is in numeric form, the data are encoded with one-hot and embedding encodings. Only a single pass over the table is needed to complete the preliminary coding; the vectors required by the model are then produced from the numbers in the data table (a sketch of this encoding step follows this list).

2) A cardinality estimation algorithm based on the Transformer and a mask mechanism is proposed. The mask hides part of the data so that the model's outputs satisfy the definition of conditional probability, while the Transformer's attention mechanism learns the relationships within the data and between different attribute columns. By combining the two closely, the model learns the conditional probability distribution and thereby solves the cardinality estimation problem (see the masking sketch after this list).

3) A hybrid method based on a multi-feature partitioned model is proposed. According to the complexity of the data features (the number of feature types), the data are fed into models with different numbers of layers to learn the data distribution. Finally, the learned distributions are mixed through a three-layer fully connected neural network, which effectively improves the accuracy of cardinality estimation (see the mixing sketch after this list).

4) Through a large number of experiments, the cardinality estimation method proposed in this paper is compared with an open-source database, a commercial database, a Bayesian network, and a supervised learning method. The conclusion is that this method can accurately model the data distribution and achieves better experimental results in both time efficiency and estimation accuracy.
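To make the encoding step in contribution 1 concrete, the following is a minimal sketch in Python with NumPy; the toy table, column contents, and embedding size are assumptions for illustration, not the paper's implementation. It builds per-column value dictionaries in one pass and then turns the numeric codes into one-hot or embedding vectors.

```python
import numpy as np

def build_dictionaries(table):
    """One pass over the table: map each column's distinct attribute
    values to consecutive integer codes."""
    dicts = [{} for _ in range(len(table[0]))]
    for row in table:
        for c, value in enumerate(row):
            if value not in dicts[c]:
                dicts[c][value] = len(dicts[c])
    return dicts

def encode_table(table, dicts):
    """Replace every attribute value with its integer code."""
    return np.array([[dicts[c][v] for c, v in enumerate(row)]
                     for row in table], dtype=np.int64)

# Hypothetical toy table with three attribute columns.
table = [("red", "small", 1), ("blue", "large", 2), ("red", "large", 2)]
dicts = build_dictionaries(table)
codes = encode_table(table, dicts)             # shape (rows, columns)

# One-hot form for a low-cardinality column:
one_hot_col0 = np.eye(len(dicts[0]))[codes[:, 0]]

# Embedding lookup per column (random weights stand in for learned ones).
embed_dim = 8                                  # assumed embedding size
embed_tables = [np.random.randn(len(d), embed_dim) for d in dicts]
row_vectors = np.stack([embed_tables[c][codes[:, c]]
                        for c in range(codes.shape[1])], axis=1)
# row_vectors: (rows, columns, embed_dim), the model's input vectors.
```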
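The mask mechanism in contribution 2 can be illustrated with a causal attention mask, as in the following PyTorch sketch; the architecture sizes are assumptions, and the paper's exact masking scheme may differ. With a right-shifted input, the output at position i may attend only to columns 1 through i-1, so it parameterizes the conditional P(x_i | x_1, ..., x_{i-1}).

```python
import torch
import torch.nn as nn

n_cols, d_model, vocab = 3, 32, 100            # assumed sizes

layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_cols)])

# Boolean causal mask: True marks forbidden attention edges, so
# position i can only see positions 0..i.
mask = torch.triu(torch.ones(n_cols, n_cols, dtype=torch.bool), diagonal=1)

x = torch.randn(16, n_cols, d_model)           # embedded rows, batch of 16
start = torch.zeros(16, 1, d_model)            # stand-in for a learned start token
x_in = torch.cat([start, x[:, :-1]], dim=1)    # shift right: position i holds column i-1

h = encoder(x_in, mask=mask)
# The output at position i conditions only on columns 1..i-1, so these
# softmaxes are the conditionals P(x_i | x_1, ..., x_{i-1}); their
# product over i gives the joint probability of a full tuple.
cond = [torch.softmax(heads[i](h[:, i]), dim=-1) for i in range(n_cols)]
```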
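Contribution 3's mixing step could look like the following sketch, under the assumption that each partition's sub-model emits a probability estimate for the same tuple and that a three-layer fully connected network combines them; the layer widths and the sigmoid output are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class Mixer(nn.Module):
    """Three-layer fully connected network that mixes per-partition
    estimates into a single selectivity (hypothetical widths)."""
    def __init__(self, n_partitions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_partitions, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, partition_probs):
        # partition_probs: (batch, n_partitions), one estimate per sub-model
        return self.net(partition_probs)

mixer = Mixer(n_partitions=3)
probs = torch.rand(16, 3)                      # stand-in sub-model outputs
selectivity = torch.sigmoid(mixer(probs))      # mixed estimate in (0, 1)
# Final cardinality estimate: selectivity times the number of table rows.
```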