Research On Air Quality Time Series Data Processing And Clustering Analysis

Posted on: 2024-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Zhu
GTID: 2530306929973689
Subject: Electronic information
Abstract/Summary:
With the rise of artificial intelligence and the advent of the big data era, large volumes of complex and varied data are being produced. Mining latent patterns and information from time series data is an active research topic. As an unsupervised data mining technique, clustering analysis identifies structural features in time series data, grouping similar series into the same cluster and assigning dissimilar series to different clusters. Clustering analysis of air quality time series data can help predict air quality changes over a future period, locate pollution sources, and give policy makers useful decision support. However, real-time air quality data collected from monitoring stations usually contain missing values, which reduce the accuracy of air quality time series data mining, including clustering analysis. To cluster time series data more accurately, this paper first proposes a feature-driven time series clustering algorithm, called k Feat TS, based on graphs constructed from mutual k nearest neighbors. Second, it proposes a first five and last three logistic regression imputation method, called FTLRI, to deal effectively with missing values in time series data. Finally, k Feat TS is applied to the clustering analysis of air quality time series data. Experiments show that FTLRI and k Feat TS are effective in missing value imputation and in clustering analysis of air quality data, respectively. The main contributions of this paper are summarized as follows:

(1) Because common time series clustering methods measure similarity on time series fragments or on fixed features, they cannot handle feature-rich time series data. This paper proposes a feature-driven time series clustering algorithm, called k Feat TS, based on graphs constructed from mutual k nearest neighbors. The method extracts the main features of the time series data, constructs graphs based on these main features, uses a mixed matrix to integrate the graphs, and finally performs clustering. In experiments on 9 datasets from the UCR database against 5 common time series clustering algorithms from recent years, k Feat TS achieves more accurate clustering results on time series datasets of various sizes, lengths and categories, and shows a degree of robustness.

(2) Missing values in air quality datasets affect the accuracy of clustering analysis, common missing value imputation methods cannot exploit the correlation of time series data along the time axis, and data with a high missing rate cannot be filled accurately. To address these problems, this paper proposes a first five and last three logistic regression imputation method, called FTLRI. Building on the sliding window model, a first five and last three model is proposed, which fully considers both the correlation of data along the time axis and the correlation between attributes. FTLRI combines these two correlations and uses the logistic regression algorithm to train a high-accuracy imputer that fills in the missing values. Before the experiments, rows with missing values are deleted so that the datasets are complete, and the datasets are then processed to have missing rates of 5%, 10%, 20% and 40%, according to the size of each dataset and a certain step size. FTLRI is compared with five common missing value imputation methods and a recent neural network imputation method on the processed datasets. The results show that FTLRI has better imputation performance than the other methods.

(3) The proposed time series clustering algorithm k Feat TS is applied to air quality data monitored at Lanyuan Hotel in Lanzhou in 2022. The time series of six pollutants (PM2.5, PM10, SO2, NO2, O3 and CO) are analyzed separately. The experiment shows that k Feat TS divides the time series of each pollutant into different clusters, yielding clusters that correspond to different pollution levels for each pollutant. This demonstrates the usefulness and correctness of k Feat TS in air quality time series data analysis.
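
The mutual k nearest neighbor graph that the clustering algorithm builds on can be illustrated with a short Python sketch. The abstract does not specify the features, the distance measure, or how the mixed matrix integrates the per-feature graphs, so everything below is an assumption made for illustration: the function names `mutual_knn_graph` and `connected_components` are hypothetical, the distance is Euclidean, each series is assumed to be already reduced to a small feature vector, and connected components stand in for the final clustering step. It is not the thesis's implementation.

```python
import numpy as np

def mutual_knn_graph(features, k):
    """Build a symmetric 0/1 adjacency matrix in which i and j are linked
    only if each is among the other's k nearest neighbors.
    Assumption: Euclidean distance on precomputed feature vectors."""
    n = features.shape[0]
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest points
    knn = np.zeros((n, n), dtype=bool)
    knn[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return (knn & knn.T).astype(int)            # keep only mutual links

def connected_components(adj):
    """Label each node with the index of its connected component.
    Stand-in for the final clustering step (an assumption)."""
    n = adj.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in np.flatnonzero(adj[node]):
                if labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(neighbor)
        current += 1
    return labels

# Toy feature vectors (hypothetical): two well-separated groups of three series.
features = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
adjacency = mutual_knn_graph(features, k=2)
labels = connected_components(adjacency)
print(labels)
```

Requiring the neighbor relation to hold in both directions removes one-sided links, so a series that is far from every group is not pulled into a cluster merely because its own nearest neighbors happen to lie there.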
Keywords/Search Tags: Time Series Data, Cluster Analysis, Missing Value Imputation, Air Quality Data