Research And Implementation Of Principal Component Analysis And Factor Analysis Parallelization Based On Spark

Posted on:2018-05-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

GTID:2310330518495567

Subject:Computer Science and Technology

Abstract/Summary:

With the continuous development of science and technology, every industry now produces data on a scale that was unimaginable in the past, and extracting useful information from this massive data has become necessary. Principal component analysis and factor analysis are important methods for doing so. This thesis studies the principal component analysis and factor analysis algorithms and implements both on the Spark platform. The main work is as follows:

(1) A set of mathematical statistics functions is implemented on the Hadoop platform. It covers the most commonly used statistical functions: descriptive statistics (11 statistics such as the mean, variance, count, and median); univariate analysis (determining the correlation between a dependent variable and several independent variables by calculating the covariance between them); multivariate analysis (determining the pairwise correlation between variables by calculating their correlation coefficient matrix); hypothesis testing (the one-sample t-test, paired-sample t-test, and independent-samples t-test); and the bootstrap method (resampling the data to estimate the mean and variance of the sample).

(2) Principal component analysis and factor analysis are implemented on Spark. The divide-and-conquer idea is to turn a "big problem" into "small problems" and then use Spark's distributed computing capabilities to solve the small problems in parallel, reducing computing time as far as possible. QR decomposition, which the thesis describes as the most efficient matrix factorization, is used to compute the eigenvalues of the block matrices, which makes solving each "small problem" more efficient.

The algorithm combines the divide-and-conquer idea with the efficient QR decomposition algorithm and makes full use of the parallel computing ability of the Spark platform. Finally, experiments are carried out on data sets of different sizes. The results show that the parallel algorithm proposed in this thesis improves computing efficiency.
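The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, which is not given here. It is plain Python with no Spark dependency, and every function name in it (block_stats, merge_stats, covariance, qr_decompose, qr_eigenvalues) is hypothetical. As a simplifying assumption, divide-and-conquer is applied only to accumulating the covariance matrix from row blocks, which on Spark would correspond roughly to a per-partition map followed by a reduce, and the eigenvalues are then obtained with an unshifted QR iteration on the full covariance matrix rather than on block matrices.

```python
import math
from functools import reduce


def block_stats(rows):
    """Map step: row count, column sums and cross-product sums of one block."""
    d = len(rows[0])
    col_sums = [sum(r[j] for r in rows) for j in range(d)]
    cross = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    return len(rows), col_sums, cross


def merge_stats(a, b):
    """Reduce step: combine the partial sums of two blocks."""
    n = a[0] + b[0]
    col_sums = [x + y for x, y in zip(a[1], b[1])]
    cross = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, col_sums, cross


def covariance(stats):
    """Sample covariance matrix from the merged partial sums."""
    n, col_sums, cross = stats
    d = len(col_sums)
    mean = [s / n for s in col_sums]
    return [[(cross[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
            for i in range(d)]


def qr_decompose(a):
    """QR factorization by modified Gram-Schmidt (assumes a non-singular matrix)."""
    n = len(a)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(n)]
        for k, q in enumerate(q_cols):
            r[k][j] = sum(q[i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * q[i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def qr_eigenvalues(a, iterations=100):
    """Unshifted QR iteration A <- R Q; for a symmetric positive-definite
    matrix with distinct eigenvalues the diagonal approaches the eigenvalues."""
    for _ in range(iterations):
        q, r = qr_decompose(a)
        a = matmul(r, q)
    return sorted((a[i][i] for i in range(len(a))), reverse=True)


# Toy data split into two row blocks (illustrative values only).
blocks = [[[1.0, 2.0], [2.0, 1.0]], [[3.0, 4.0], [4.0, 3.0]]]
cov = covariance(reduce(merge_stats, map(block_stats, blocks)))
eigenvalues = qr_eigenvalues(cov)
print(cov)
print(eigenvalues)
```

For this toy data the covariance matrix has 5/3 on the diagonal and 1 off the diagonal, so the eigenvalues come out at approximately 8/3 and 2/3. The larger one is the variance along the first principal component. Merging partial sums keeps each block's work independent, which is what makes the map step parallelizable.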

Keywords/Search Tags:

principal component analysis, factor analysis, Spark, QR decomposition, divide and conquer

Related items

1	Research On PM2.5 Concentration Prediction Of Improved GS-SVM Based On Wavelet Decomposition And Principal Component Analysis
2	Research On Divide And Conquer Algorithms For Complex Electromagnetic Problems
3	The Research Of Divide And Conquer Algorithms For Skew-symmetric Tridiagonal Eigenvalue Problems
4	Robust Principal Component Analysis And Its Applications
5	Application Of Spatial Weighting And Higher-Order Principal Component Analysis In Multivariate Geoscience Information Synthesis
6	Research And Application Of Principal Component Analysis Algorithm For Low-rank Tensor Decomposition
7	The Empirical Analysis Of A-share Market On The Collinearity Problem Of Multi-factor Stock Selection Model
8	Research On Stocks Data Analysis Based On Spark MLlib
9	Studies And Applications Of Grouped Principal Component Analysis And Kernel Principal Component Analysis
10	How To Effectively Use The Principal Component Of Principal Component Analysis