
Data Preprocessing And Latent Space Mapping Analysis On Large-scale Multi-source Time Series

Posted on: 2019-03-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W W Shi
Full Text: PDF
GTID: 1360330590470397
Subject: Electronic Science and Technology
Abstract/Summary:
In recent years, with the rapid development of the mobile Internet and big data technology, the amount of data has been growing at an exponential rate, and the quality of data preprocessing has become a key issue in industrial processes, market success, and decision-making activities. Data preprocessing typically refers to the handling of input data that contain missing values, noise, and redundancy. This dissertation first introduces the basic concepts and analysis methods of time series and, in connection with the concrete research content, summarizes the research methods and current state of research on multi-source time series. It then studies data preprocessing algorithms for multi-source time series, proceeding from simple problems to harder ones. The focus is on multi-source time series, including series with auxiliary data sources and series with high dimensionality and multiple kinds of noise.

First, a data preprocessing framework is proposed for low-dimensional multi-source time series with auxiliary sources. Based on the fact that the time series are collected from multiple data sources, optimal linear regression and optimal support vector machine models are established that exploit the correlations between data sources, with the aid of auxiliary data sources, in order to improve data quality. Since different systems offer many optional data sources, an overall preprocessing framework for data fusion and noise reduction is established for system fault diagnosis. Concretely, we propose a data preprocessing framework (DPF) applicable to power systems. Within this framework, missing-data prediction models based on Optimized Linear Regression (OLR), Optimized Support Vector Machine (OSVM), and Refined Support Vector Machine (RSVM) are designed to better predict the missing values in the chromatogram of transformer oil, to improve the quality of the raw power-system data, and to prepare for subsequent fault diagnosis. In addition, we propose a method based on Pearson correlation analysis to fuse data from auxiliary data sources and to extract the auxiliary information, which is a hidden factor of the diagnosis process, and we design a data cleaning method based on principal component analysis (PCA). Preprocessing the merged data reduces the dimensionality and noise of the original training set, which improves the accuracy of fault diagnosis and further reduces model training time. Implementing the proposed methods on large-scale multi-source time series in a parallel environment verifies their effectiveness and yields higher execution efficiency.
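As a concrete illustration of two steps of this framework, the following minimal sketch on synthetic data selects auxiliary sources by Pearson correlation, fills missing values of the target series by regressing on them, and cleans the merged data with PCA. It is not the dissertation's implementation: the correlation threshold, the plain least-squares model standing in for OLR/OSVM/RSVM, and the 95% variance cutoff are illustrative assumptions.

# Sketch only: Pearson-based auxiliary-source selection, regression-based
# missing-value prediction, and PCA-based cleaning on synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_sources = 500, 8
X = rng.normal(size=(n_samples, n_sources))            # candidate data sources
target = X[:, 0] + 0.1 * rng.normal(size=n_samples)    # series with gaps to fill
missing = rng.random(n_samples) < 0.1                  # ~10% of the target is missing

# 1) Keep auxiliary sources whose Pearson correlation with the target is strong.
corr = np.array([np.corrcoef(X[~missing, j], target[~missing])[0, 1]
                 for j in range(n_sources)])
aux = np.where(np.abs(corr) > 0.3)[0]                  # illustrative threshold

# 2) Predict the missing values from the auxiliary sources (ordinary least squares
#    here, standing in for the OLR/OSVM/RSVM models of the framework).
reg = LinearRegression().fit(X[~missing][:, aux], target[~missing])
filled = target.copy()
filled[missing] = reg.predict(X[missing][:, aux])

# 3) Clean the merged data with PCA, keeping 95% of the variance.
merged = np.column_stack([X, filled])
pca = PCA(n_components=0.95)
cleaned = pca.fit_transform(merged)
print(cleaned.shape, "retained components:", pca.n_components_)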
Second, a data preprocessing algorithm is proposed based on matrix factorization with designed regularization. To improve the robustness of the proposed methods and to analyze the raw low-dimensional multi-source time series as a whole, the raw data matrix is mapped into a latent space and combined with regularized constraint terms to improve the accuracy of missing-data prediction. We propose a method that accurately extracts the hidden factors of the latent space during matrix factorization, and the constrained factorization is used to predict the missing data of the multi-source time series. The objective function constrains the decomposition using the smoothness of each time series and cross-source information; accordingly, Smoothness, Correlated Sensor Constraint (CSR), and Uncorrelated Sensor Constraint (USR) terms are introduced, and five corresponding models are established. The experimental results confirm the validity of hidden-factor extraction in the decomposition process once the constraints are introduced, and the implementation in a parallel environment not only verifies the effectiveness of the proposed method but also demonstrates its efficiency on large-scale data.

A dynamic matrix decomposition model is further established to handle the dynamic characteristics of time series: an already trained model can be updated rapidly when new samples enter the system. The dynamic model keeps the error within reasonable limits when it is updated after new samples arrive, and a refining strategy keeps it robust under long-term updating. As with the other methods, the dynamic matrix decomposition model is implemented in a parallel computing environment for large-scale data processing.
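The sketch below, again on synthetic data, illustrates the kind of regularized factorization described above: the observed entries of a source-by-time matrix are fit by a low-rank product, a temporal-smoothness penalty on the latent time factors stands in for the smoothness constraint (the CSR/USR terms are not reproduced), and a small ridge update of a newly arrived column stands in for the dynamic model. The rank, penalty weights, and learning rate are assumptions for the demo, not the thesis's settings.

# Sketch only: matrix factorization with a smoothness regularizer for
# missing-value prediction, plus a simple dynamic-style column update.
import numpy as np

rng = np.random.default_rng(1)
n_sources, n_time, rank = 6, 200, 3

# Smooth synthetic ground truth with roughly 20% of the entries missing.
t = np.linspace(0, 4 * np.pi, n_time)
truth = np.stack([np.sin(t + p) for p in rng.uniform(0, np.pi, n_sources)])
observed = rng.random(truth.shape) > 0.2
Y = np.where(observed, truth + 0.05 * rng.normal(size=truth.shape), 0.0)

U = 0.1 * rng.normal(size=(n_sources, rank))           # latent source factors
V = 0.1 * rng.normal(size=(rank, n_time))              # latent time factors
lr, lam, lam_smooth = 0.01, 0.1, 1.0

for _ in range(2000):
    E = observed * (U @ V - Y)                         # residual on observed entries
    D = np.diff(V, axis=1)                             # V[:, t+1] - V[:, t]
    grad_smooth = np.zeros_like(V)
    grad_smooth[:, :-1] -= D                           # smoothness penalty gradient
    grad_smooth[:, 1:] += D                            # (constant factor folded into lr)
    grad_U = E @ V.T + lam * U
    grad_V = U.T @ E + lam * V + lam_smooth * grad_smooth
    U -= lr * grad_U
    V -= lr * grad_V

pred = U @ V
rmse = np.sqrt(np.mean((pred[~observed] - truth[~observed]) ** 2))
print("RMSE on the missing entries:", round(float(rmse), 4))

# Dynamic-style update: when a new time sample arrives, only its latent time
# factor is refreshed, by ridge regression against the already trained U.
y_new = truth[:, -1] + 0.05 * rng.normal(size=n_sources)
seen = rng.random(n_sources) > 0.2
A = U[seen]
v_new = np.linalg.solve(A.T @ A + lam * np.eye(rank), A.T @ y_new[seen])
print("reconstructed new column:", np.round(U @ v_new, 3))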
Third, a data preprocessing algorithm based on feature selection and tensor factorization is proposed for multi-source time series with high dimensionality and multiple kinds of noise. For such noisy time series, we propose a suitable feature extraction scheme and establish a kernel model in the latent space so that the noisy series can be classified accurately. Specifically, we propose a Supervised Temporal Tensor Kernel (STT) framework to obtain better time series classification accuracy. STT is designed to extract compact and precise representations from high-dimensional time series with multiple kinds of noise, and it overcomes several drawbacks of traditional approaches, such as the relatively high completeness requirement on the given training data sets, the requirement that there be no time delay between the multiple data sources of the original time series, and the need for a high signal-to-noise ratio. STT consists of three steps: (1) robust max pooling for feature selection; (2) supervised temporal factorization to extract a more compact representation of the selected features; and (3) kernel generation via tensor structure projection. The experimental results show that the proposed method achieves excellent performance even when the noise in the raw data sets is very high.

Finally, a data preprocessing framework based on selective tensor construction and tensor structure projection is proposed. To predict future values of a time series without first filling in the missing values of the raw data, we propose an Incomplete time series prediction framework based on Selective tensor modeling and Multi-kernel learning (ISM). ISM consists of three parts: tensor construction, hidden factor extraction, and multi-kernel learning. For multi-source time series with missing values and noise, an optimal tensor construction method is designed; the tensor is then used to map the noisy data into the latent space to reduce noise; finally, tensor structure projection combined with multi-kernel learning is used to predict the future values of the time series. The ISM framework achieves better performance than traditional and state-of-the-art methods, which demonstrates the effectiveness of latent factor extraction and of combining multi-kernel learning with tensor structure projection.
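To make the pattern shared by STT and ISM more concrete (projecting each multi-source sample into a latent space through tensor-structure factors and then combining several kernels on the projected representation), here is an illustrative sketch on a synthetic classification task. It is not the STT or ISM algorithm: the HOSVD-style factors, the fixed equal kernel weights, and the synthetic data are assumptions, and a real multi-kernel learner would learn the kernel weights instead of averaging them.

# Sketch only: tensor-style latent projection followed by a simple
# multi-kernel SVM on synthetic multi-source segments.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(2)
n_samples, n_sources, window = 200, 5, 30

# Two classes that differ in dominant frequency, with heavy additive noise.
labels = rng.integers(0, 2, n_samples)
t = np.linspace(0, 1, window)
X = np.stack([
    np.stack([np.sin(2 * np.pi * (3 + 3 * y) * t + rng.uniform(0, np.pi))
              for _ in range(n_sources)])
    + 0.5 * rng.normal(size=(n_sources, window))
    for y in labels
])                                                     # shape (samples, sources, time)

train, test = np.arange(150), np.arange(150, n_samples)

# Tensor-structure projection: factor matrices from the source- and time-mode
# unfoldings of the training tensor (an HOSVD-style stand-in).
r_src, r_time = 3, 6
Xtr = X[train]
U_src = np.linalg.svd(Xtr.transpose(1, 0, 2).reshape(n_sources, -1), full_matrices=False)[0]
U_time = np.linalg.svd(Xtr.transpose(2, 0, 1).reshape(window, -1), full_matrices=False)[0]
Z = np.einsum('nst,sa,tb->nab', X, U_src[:, :r_src], U_time[:, :r_time]).reshape(n_samples, -1)
F = StandardScaler().fit(Z[train]).transform(Z)        # latent features per sample

# Unweighted average of three base kernels on the latent features.
def multi_kernel(A, B):
    d = A.shape[1]
    return (linear_kernel(A, B) / d
            + rbf_kernel(A, B, gamma=1.0 / d)
            + rbf_kernel(A, B, gamma=0.1 / d)) / 3.0

clf = SVC(kernel='precomputed').fit(multi_kernel(F[train], F[train]), labels[train])
print("test accuracy:", round(clf.score(multi_kernel(F[test], F[train]), labels[test]), 3))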
Keywords/Search Tags:time series, regression model, matrix factorization, tensor factorization, latent space