Outlier Detection For Categorical Data Based On Attribute Grouping Weight And Maximum Likelihood

Posted on: 2024-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: K Q Zhang
Full Text: PDF
GTID: 2568307094981689
Subject: Computer Science and Technology
Abstract/Summary:
Outlier detection is an important research topic in data mining and machine learning. Its task is to identify objects that are generated by a different mechanism from the rest of the data, and the rich, valuable, and latent information it uncovers can support decision-making. Attribute grouping is an effective approach to high-dimensional outlier detection: it divides the attributes into several subsets according to the relationships among them, then identifies or scores outliers within each attribute subset, so that both global and local outliers can be found and the effect of the "curse of dimensionality" is alleviated. However, existing attribute grouping outlier detection methods fail to capture the differences between attribute groups and the degree of deviation of each attribute group, which limits their performance on high-dimensional data. This thesis makes full use of the relevance and deviation among attributes and studies attribute grouping and outlier detection for categorical data based on attribute group weights and the normalized maximum likelihood criterion, with the aim of further improving the performance of high-dimensional outlier detection. The main research results are as follows:

(1) An attribute group weight-based outlier detection algorithm for categorical data is proposed. First, the method defines an attribute group deviation factor from the frequencies of data patterns and their code lengths and uses it as the basis for merging attribute groups. This factor captures the deviation among attribute groups and improves search efficiency during attribute grouping. Second, the method uses the cumulative sum of information entropy to define the attribute group weights, which reflect the differences among attribute groups. Third, the outlier score function is redefined in terms of the attribute group weights, and an outlier detection algorithm for categorical data is proposed on this basis. Finally, experimental results on UCI, NTU, KEEL, and synthetic datasets show that the algorithm achieves high detection accuracy and efficiency as well as good extensibility and scalability, and that it can be applied to outlier detection on high-dimensional categorical datasets.

(2) An attribute grouping and outlier detection algorithm for categorical data based on the normalized maximum likelihood criterion is proposed. First, the algorithm binarizes the data attributes and redefines the attribute group deviation factor according to normalized maximum likelihood code lengths. The redefined factor reflects the uncertainty of the attribute groups more comprehensively and captures their degree of deviation, which further improves search efficiency during attribute grouping. Second, the weight of each attribute group is redefined as the proportion of the normalized maximum likelihood code length contributed by that group, which reflects the importance of the related attribute groups. Third, the outlier score function is redefined according to these attribute group weights, and an outlier detection algorithm for categorical data based on the normalized maximum likelihood criterion is proposed. Finally, experimental results on UCI, NTU, KEEL, and synthetic datasets show that the algorithm achieves high detection performance as well as good extensibility and scalability.
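The abstract does not give the exact definitions of the group weights or the score function, so the following is only a minimal sketch of the general idea behind result (1), under stated assumptions. It assumes that the attribute groups are already given, that a group's weight is its share of the summed per-attribute Shannon entropies, and that a record's score is the weighted sum of the code lengths `-log2(frequency)` of its value patterns in each group. The function names `entropy`, `group_weights`, and `outlier_scores` are hypothetical and do not come from the thesis.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def group_weights(rows, groups):
    """Assumed weighting: each group's share of the summed attribute entropies."""
    sums = [
        sum(entropy([row[a] for row in rows]) for a in group)
        for group in groups
    ]
    total = sum(sums)
    if total == 0:
        # All attributes are constant; fall back to uniform weights.
        return [1.0 / len(groups)] * len(groups)
    return [s / total for s in sums]


def outlier_scores(rows, groups):
    """Assumed score: weighted sum of pattern code lengths over the groups."""
    n = len(rows)
    weights = group_weights(rows, groups)
    scores = [0.0] * n
    for w, group in zip(weights, groups):
        # Project every record onto the attributes of this group.
        patterns = [tuple(row[a] for a in group) for row in rows]
        freq = Counter(patterns)
        for i, p in enumerate(patterns):
            # Rare patterns have long code lengths and raise the score.
            scores[i] += w * (-log2(freq[p] / n))
    return scores


# Toy data: three identical records and one record that deviates everywhere.
rows = [("a", "x", "p"), ("a", "x", "p"), ("a", "x", "p"), ("b", "y", "q")]
groups = [(0, 1), (2,)]  # attribute indices per group (assumed given)
scores = outlier_scores(rows, groups)
```

In this toy example the last record receives the largest score because its value pattern is the rarest in every group. The thesis additionally learns the grouping itself by merging groups according to a deviation factor, which this sketch does not attempt.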
Keywords/Search Tags:Outlier detection for categorical data, Attribute group, Attribute group weight, Deviation factor, Normalized maximum likelihood criterion