Two-stage Clustering Algorithm Based On K-Means And Prototype Network And Its Application

Posted on: 2021-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Peng
GTID: 2428330626455308
Subject: Applied Statistics
Abstract/Summary:
The rapid development of big data has made user data increasingly complete, the number of data warehouses more reasonable, and data quality steadily better, so the value of data keeps growing. How to make reasonable use of user data for personalized services and recommendations has become a research hotspot for intelligent social platforms. User layering is the basis of personalized services, so the choice of clustering algorithm is very important. The widely used K-Means algorithm is limited by its choice of similarity measure when clustering mixed data. This paper therefore proposes a two-stage clustering algorithm based on the K-Means algorithm and the prototype network. The prototype network is extended to unsupervised clustering: training it yields an embedding space in which the projected mixed data is tightly aggregated within each class and well separated between classes.

To verify the feasibility of the algorithm, we first tested it on a handwritten dataset with 1700 samples in 10 categories. The first step uses the K-Means algorithm to label the samples within the threshold range, implementing the data transformation; the labeled samples are then used to train the prototype network, which yields the embedding space and completes the clustering of all samples. We also compared the clustering results with those of the K-Means algorithm and a PCA-based algorithm, using six indicators: homogeneity score, completeness score, ARI, AMI, silhouette coefficient, and V-measure. The proposed algorithm scores highest on the indicators, followed by the PCA-based algorithm, with the traditional K-Means algorithm lowest. The homogeneity score of the proposed algorithm is 0.707, which is 0.036 higher than that of the PCA-based algorithm, indicating higher purity within each cluster; the completeness score increased by 0.058, indicating that members of a given class are more consistently assigned to the same cluster; the ARI increased by 5 percentage points, indicating improved clustering accuracy; and the silhouette coefficient is 0.332, which is 0.176 higher than that of the PCA-based algorithm. This marked improvement in the silhouette coefficient indicates that samples of the same type are more tightly aggregated, the differences between types are more distinct, and the clustering effect is better.

After verifying the algorithm, we applied it to user layering. The user data comes from a short-term rental platform and has been desensitized. Data extraction and integration were completed in MySQL, yielding 27 user behavior features that include continuous variables, ordered discrete variables, and unordered discrete variables. Descriptive statistics of the users' gender ratio and age structure show that the data has no gender skew and that the sample covers a wide range of ages, with mainly young and middle-aged users. The cleaned data contains no missing values. The data was then preprocessed to remove the effect of dimensions (units) and clustered with the proposed algorithm, which finally divided the users into five categories:
The first category: all indicators for these users are negative; they are completely lost users.
The second category: these users pay more attention to the cost-effectiveness of the room; they are ordinary users and make up the largest proportion.
The third category: the consumption activity and platform interaction of these users have almost disappeared; they are lost users.
The fourth category: these users registered recently and are highly active; they are potential new users of the platform.
The fifth category: these are high-value users with high loyalty and economic benefit.
On the basis of this user layering, digital precision marketing can be achieved.
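
The two-stage procedure described in the abstract can be illustrated with a minimal sketch, which is not the author's implementation. It assumes that "within the threshold range" means a Euclidean distance to the K-Means centroid no larger than a hypothetical `threshold` parameter, and it replaces the trained prototype network with an identity placeholder `embed`, so only the confident-labeling and nearest-prototype steps are shown. All function and parameter names are illustrative.

```python
import math
import random


def nearest(point, centers):
    """Return (index, distance) of the center closest to point."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-Means; returns the k centroids."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)[0]].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def two_stage_cluster(points, k, threshold, embed=lambda p: p):
    """Stage 1: pseudo-label samples close to a K-Means centroid.
    Stage 2: build one prototype per pseudo-class in the embedding
    space and assign every sample to its nearest prototype.

    `embed` is an ASSUMED placeholder for the trained prototype
    network; the identity default performs no learning at all.
    """
    centers = kmeans(points, k)
    labeled = [[] for _ in range(k)]
    for p in points:
        idx, dist = nearest(p, centers)
        if dist <= threshold:  # assumed meaning of "threshold range"
            labeled[idx].append(embed(p))
    # Prototype = mean embedding of the pseudo-labeled samples of a class;
    # fall back to the embedded centroid if no sample passed the threshold.
    prototypes = [mean(g) if g else embed(c) for g, c in zip(labeled, centers)]
    return [nearest(embed(p), prototypes)[0] for p in points]
```

Calling `two_stage_cluster(points, k=2, threshold=0.6)` on a list of numeric feature vectors returns one cluster index per sample. In a fuller version, `embed` would be a network trained on the pseudo-labeled samples, and mixed continuous and discrete features would need encoding before any distance is computed.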
Keywords/Search Tags: K-Means algorithm, Prototype network, User layering, Mixed data clustering