| Clustering is a common unsupervised learning technique for discovering the category structure in a data set. Although many clustering algorithms exist, they rarely address feature selection, that is, which features of the data the clustering algorithm should use. Feature selection is harder for clustering than for supervised learning: the data carry no category labels, so there is no obvious criterion to guide the search. In addition, the number of clusters must be determined, and this choice interacts with the feature selection problem. In this article, we perform feature selection with a multivariate mixture of Erlangs model that accounts for irrelevant features. We first use the CMM algorithm to obtain high-quality initial values and then fit the model parameters with the GECM algorithm extended with feature saliency. We then add the minimum message length (MML) criterion, which drives the feature saliency of irrelevant features toward 0, as feature selection requires. The resulting algorithm estimates the feature saliencies and the number of clusters simultaneously. Finally, we apply the GECM-MML algorithm to simulated and real data and compare its feature selection results with those of other models. The results show that feature selection with this algorithm improves both the fitting performance and the clustering performance of the model and effectively reduces its prediction error rate. |
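
To make the role of feature saliency concrete, the following is a minimal sketch of the log-likelihood of a mixture in which each feature is either relevant (cluster-specific Erlang density) or irrelevant (one Erlang density shared by all clusters). It assumes the commonly used feature-saliency form, in which feature l contributes `saliency[l] * f(x | cluster params) + (1 - saliency[l]) * q(x | common params)` and features are independent within a cluster; the abstract does not spell out the exact density, so this form, the function names, and the parameter layout are all illustrative assumptions, not the article's implementation. The CMM initialization, the GECM updates, and the MML penalty are not shown.

```python
import math


def erlang_logpdf(x, shape, scale):
    """Log-density of an Erlang(shape, scale) distribution at x > 0.

    shape is a positive integer, so lgamma(shape) == log((shape - 1)!).
    """
    return ((shape - 1) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def saliency_mixture_loglik(data, weights, shapes, scales,
                            saliency, common_shapes, common_scales):
    """Log-likelihood under an assumed feature-saliency Erlang mixture.

    data            : list of rows, each holding D positive values
    weights[j]      : mixing weight of cluster j
    shapes[j][l],
    scales[j][l]    : cluster-specific Erlang parameters of feature l
    saliency[l]     : probability that feature l is relevant
    common_shapes[l],
    common_scales[l]: cluster-independent Erlang parameters of feature l
    """
    total = 0.0
    for row in data:
        per_cluster = []
        for j, w in enumerate(weights):
            log_term = math.log(w)
            for l, x in enumerate(row):
                relevant = saliency[l] * math.exp(
                    erlang_logpdf(x, shapes[j][l], scales[j][l]))
                irrelevant = (1.0 - saliency[l]) * math.exp(
                    erlang_logpdf(x, common_shapes[l], common_scales[l]))
                log_term += math.log(relevant + irrelevant)
            per_cluster.append(log_term)
        # Log-sum-exp over clusters for numerical stability.
        m = max(per_cluster)
        total += m + math.log(sum(math.exp(v - m) for v in per_cluster))
    return total
```

In this formulation, a feature whose saliency is 0 contributes the same factor to every cluster, so it cannot influence the cluster assignment. This is why a criterion that pushes the saliency of irrelevant features toward 0 acts as feature selection.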