
A New Weighted GKNN For The Filling Of Missing Data

Posted on: 2022-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Wen
Full Text: PDF
GTID: 2480306542999379
Subject: Applied Mathematics
Abstract/Summary:
Missing data are common across many fields; they can hinder data mining and even alter its results. Imputing missing values is therefore a routine step in classification problems whose training data are incomplete. One widely used family of imputation methods is based on the k-nearest neighbor algorithm (KNN), one of the simplest and most effective techniques in machine learning. Classical KNN imputation uses Euclidean distance as its similarity measure and fills a missing value from the corresponding attribute values of the nearest neighbors. This measure works well for purely numerical data, but it is not suitable for heterogeneous data. In addition, much existing imputation work in classification ignores class labels and the dependencies between attributes. This thesis addresses both issues, proposing a new attribute-weighting scheme and an imputation method that exploits class labels in heterogeneous data. The main contributions are as follows:

(1) To impute non-randomly missing financial (numerical) data, the classical k-nearest neighbor algorithm is extended: the third-order Minkowski distance is used to select the samples most similar to an incomplete record within the same class (its k nearest neighbors), which form a new training set. A combination-weighted KNN algorithm (OKNN) is then proposed to improve on plain mean imputation by assigning combined weight coefficients to each attribute over this new training set. Finally, the algorithm with its optimized weight coefficients is validated on a worked example; the results show that the proposed OKNN combined imputation outperforms both classical KNN imputation and distance-weighted KNN imputation.

(2) When Euclidean distance is used to measure the similarity between samples, the importance of sample
attributes is ignored, and genuinely different attributes are treated as if they were alike. Euclidean distance therefore measures similarity well only when density correlation in the data set is not pronounced; its sensitivity to density correlation is a weakness. Chapter 3 shows in detail that when density correlation is pronounced, the Mahalanobis distance is the more advantageous measure, and since Euclidean distance is a special case of it, this choice is reasonable to a certain extent. For mixed data, preprocessing and the choice of a similarity measure are considerably more complicated. Earlier work has shown that grey relational analysis is better suited to measuring the similarity between two samples in heterogeneous data sets. Building on this, the grey distance replaces Euclidean distance and a grey KNN imputation model is established whose performance exceeds that of classical KNN imputation. The importance of attributes, the influence of different attributes, and their degree of correlation must also be taken into account. An iterative KNN imputation method is therefore proposed that weights the grey distance between an incomplete record and all training data by attribute importance; this selects similar samples more accurately and directs the imputation of the training data toward better filling performance. Finally, a suitable amount of data is drawn from the UCI repository for experimental validation. Compared with other KNN-type algorithms, the weighted grey iterative KNN shows better performance in filling in missing data.
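The neighbor-selection step of contribution (1) can be sketched as follows. This is a minimal illustration, not the thesis's full OKNN: `minkowski3` and `knn_impute` are hypothetical names, and the inverse-distance neighbor weights stand in for the combined weight coefficients, which the abstract does not specify.

```python
import numpy as np

def minkowski3(a, b):
    """Third-order Minkowski distance between two complete vectors."""
    return np.sum(np.abs(a - b) ** 3) ** (1.0 / 3.0)

def knn_impute(sample, train, k=5):
    """Fill NaN entries of `sample` from its k nearest complete
    training samples (assumed to belong to the same class), measured
    by Minkowski-3 distance over the observed attributes only.
    Neighbor contributions use inverse-distance weights, a simplified
    stand-in for the thesis's combined weight coefficients."""
    obs = ~np.isnan(sample)                       # observed-attribute mask
    d = np.array([minkowski3(sample[obs], row[obs]) for row in train])
    idx = np.argsort(d)[:k]                       # k nearest neighbors
    w = 1.0 / (d[idx] + 1e-12)                    # inverse-distance weights
    w /= w.sum()
    filled = sample.copy()
    for j in np.where(~obs)[0]:                   # each missing attribute
        filled[j] = np.dot(w, train[idx, j])      # weighted neighbor mean
    return filled
```

Restricting `train` to rows sharing the incomplete record's class label, as the thesis does, is what turns this into a class-aware imputer.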
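The grey-distance measure underlying contribution (2) can be illustrated with the standard grey relational coefficient (resolution coefficient ρ = 0.5). The per-attribute importance `weights` are assumed to be given, since the abstract does not state how they are derived, and the iterative refilling of the training data is omitted; this is a sketch of the distance only.

```python
import numpy as np

def grey_distance(x, y, weights, rho=0.5):
    """Grey-relational 'distance' between two complete samples:
    1 minus the attribute-importance-weighted mean of the grey
    relational coefficients (rho is the resolution coefficient)."""
    delta = np.abs(x - y)                         # per-attribute differences
    dmin, dmax = delta.min(), delta.max()
    if dmax == 0.0:                               # identical samples
        return 0.0
    coeff = (dmin + rho * dmax) / (delta + rho * dmax)
    return 1.0 - float(np.dot(weights, coeff))
```

Ranking neighbors by this quantity instead of Euclidean distance, and then imputing as in ordinary KNN, yields a weighted grey KNN of the kind the thesis describes.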
Keywords/Search Tags:Missing value, K-nearest neighbor algorithm, Grey correlation, Data filling, Attribute importance