| Outlier detection in high-dimensional data suffers from the curse of dimensionality: as the number of features irrelevant to outlier detection grows, computational complexity increases and detection results degrade. Data self-representation methods can be used for outlier detection because they amplify the differences and correlations among data points. However, existing techniques fail to account for the influence of inter-feature correlations on outlier detection, which makes them unsuitable for high-dimensional data. To address these issues, this paper explores inter-feature correlations and studies outlier detection based on feature grouping and data self-representation, proposing an outlier detection algorithm suited to high-dimensional data. The main contributions of this paper are as follows: (1) We propose a feature extraction and grouping algorithm called the Balanced Association-based Feature Grouping (BAFG) algorithm. First, feature extraction and grouping balance data proximity against the probabilistic relationships among features in order to extract a subset of strongly associated features. Second, we define the basis for partitioning the final feature groups by measuring the redundancy among features, which both reduces the impact of high dimensionality on the grouping results and yields more effective feature partitions. This algorithm reveals the information in the data more comprehensively and provides a solid foundation for subsequent detection algorithms. (2) Building on this work, we propose the Feature grouping and Data Self-Representation based Outlier Detection (FDSR-OD) algorithm. First, we incorporate the balanced association measure into the sparse linear combination of data points, resulting in a data
self-representation matrix that contains both data and feature information. Second, we propose a calculation method based on fusing the inter-group data self-representations, which forms a global data self-representation matrix. We further introduce an outlier detection algorithm based on the fused data self-representation, which detects outliers through random walks on the directed weighted graph formed by the global data self-representation matrix. Finally, we combine this with the BAFG algorithm to obtain the feature grouping and data self-representation based outlier detection algorithm, which effectively improves detection accuracy and generalization. (3) Based on the above research, a system for detecting outliers in astronomical spectra based on feature grouping and data self-representation was designed and implemented using PyCharm as the development tool. The architecture and functional modules of the system are described in detail, and performance analysis shows that it provides an effective approach for large-scale astronomical spectral outlier mining. The effectiveness and generalization of the BAFG and FDSR-OD algorithms are demonstrated on synthetic datasets, UCI datasets, and ODDS datasets, where they achieve higher detection accuracy than the compared algorithms. Additionally, when applied to LAMOST astronomical spectral data, they offer a new approach for large-scale outlier mining in astronomical spectra. |
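
The random-walk detection step described in contribution (2) can be illustrated with a minimal sketch. This is not the FDSR-OD implementation: the abstract does not specify how the group-wise matrices are fused or how the walk is configured, so the element-wise averaging in `fuse`, the damping factor, and the toy coefficient matrices below are all assumptions made for illustration. An entry `C[i][j]` is read as the weight with which point `j` contributes to reconstructing point `i`; points that few others rely on receive a low stationary visiting probability and are flagged as outlier candidates.

```python
def fuse(group_matrices):
    """Average |C_g| element-wise over feature groups (assumed fusion rule)."""
    n = len(group_matrices[0])
    g = len(group_matrices)
    return [[sum(abs(C[i][j]) for C in group_matrices) / g for j in range(n)]
            for i in range(n)]


def random_walk_scores(C, damping=0.85, iters=200, tol=1e-12):
    """Stationary visiting probabilities of a damped random walk on |C|."""
    n = len(C)
    # Edge i -> j carries weight |C[i][j]|; self-loops are dropped.
    W = [[abs(C[i][j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    P = []
    for row in W:
        total = sum(row)
        # A point with no outgoing weight jumps uniformly (assumption).
        P.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - damping) / n
               + damping * sum(pi[i] * P[i][j] for i in range(n))
               for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if delta < tol:
            break
    return pi


# Toy self-representation matrices for two hypothetical feature groups.
# Points 0-3 reconstruct one another; point 4 is reconstructed from the
# others, but no other point uses it (column 4 is all zeros).
C_a = [[0.0, 0.5, 0.3, 0.2, 0.0],
       [0.4, 0.0, 0.4, 0.2, 0.0],
       [0.3, 0.4, 0.0, 0.3, 0.0],
       [0.2, 0.3, 0.5, 0.0, 0.0],
       [0.3, 0.2, 0.3, 0.2, 0.0]]
C_b = [[0.0, 0.4, 0.4, 0.2, 0.0],
       [0.5, 0.0, 0.3, 0.2, 0.0],
       [0.2, 0.5, 0.0, 0.3, 0.0],
       [0.3, 0.3, 0.4, 0.0, 0.0],
       [0.1, 0.4, 0.2, 0.3, 0.0]]

scores = random_walk_scores(fuse([C_a, C_b]))
# Lowest visiting probability = strongest outlier candidate.
ranking = sorted(range(len(scores)), key=scores.__getitem__)
```

Here `ranking[0]` is point 4, the only point that no other point uses in its reconstruction. In practice the coefficient matrices would come from a sparse self-representation solver rather than being written by hand.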