Research And Implementation Of Principal Component Analysis And Factor Analysis Parallelization Based On Spark

Posted on: 2018-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2310330518495567
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of science and technology, every industry now produces data at volumes that were previously unimaginable, and useful information has to be extracted from this mass of data. Principal component analysis and factor analysis are important methods for doing so. This thesis studies the principal component analysis and factor analysis algorithms and implements both on the Spark platform. The main work is as follows:

(1) A set of mathematical statistics functions is implemented on the Hadoop platform. It covers the most commonly used statistical functions: common statistics (11 statistics, including the mean, variance, count, and median); univariate analysis (the covariance between a dependent variable and multiple independent variables is calculated to determine how the dependent variable relates to each independent variable); multivariate analysis (the correlation coefficient matrix of multiple variables is calculated to determine the correlation between each pair of variables); hypothesis testing (one-sample t-test, paired-sample t-test, and independent-samples t-test); and the bootstrap method (the data are resampled to estimate the mean and variance of the sample).

(2) Principal component analysis and factor analysis are implemented on Spark. The divide-and-conquer idea turns a "big problem" into "small problems", and Spark's distributed computing capabilities then solve the "small problems" in parallel to save as much computing time as possible. QR decomposition, chosen for its efficiency as a matrix factorization, is used to compute the eigenvalues of the block matrices, which speeds up the solution of each "small problem".

The algorithm combines the divide-and-conquer idea with the efficient QR decomposition algorithm and makes full use of the parallel computing ability of the Spark platform. Finally, experiments are carried out on data sets of different sizes. The experimental results show that the parallel algorithm proposed in this thesis improves computing efficiency.
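
The abstract does not give the details of how the matrix is partitioned or how the per-block results are combined, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes the "small problems" are per-partition summaries (row count, column sums, sums of outer products) that are merged into one covariance matrix, and it uses the plain unshifted QR iteration to obtain the eigenvalues. Python's built-in `map` and `functools.reduce` stand in for Spark's map and reduce steps; all function names are hypothetical.

```python
import math
from functools import reduce

def partition_stats(rows):
    """'Small problem': summarize one partition as (count, column sums, sum of outer products)."""
    d = len(rows[0])
    sums = [sum(r[j] for r in rows) for j in range(d)]
    outer = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    return len(rows), sums, outer

def merge_stats(a, b):
    """Combine two partition summaries. The operation is associative, so it suits a reduce step."""
    d = len(a[1])
    return (a[0] + b[0],
            [a[1][j] + b[1][j] for j in range(d)],
            [[a[2][i][j] + b[2][i][j] for j in range(d)] for i in range(d)])

def covariance_from_partitions(partitions):
    """Sample covariance matrix assembled from the merged partition summaries."""
    n, s, outer = reduce(merge_stats, map(partition_stats, partitions))
    d = len(s)
    return [[(outer[i][j] - s[i] * s[j] / n) / (n - 1) for j in range(d)]
            for i in range(d)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def qr_decompose(a):
    """Modified Gram-Schmidt QR of a square matrix. Assumes full rank (no zero pivots)."""
    n = len(a)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(n)]          # j-th column of a
        for k in range(j):
            r[k][j] = sum(q_cols[k][i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * q_cols[k][i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r

def qr_eigen(a, iterations=200):
    """Unshifted QR iteration for a symmetric matrix.

    Returns (eigenvalues, eigenvectors as columns). Assumes a positive definite
    input with distinct eigenvalues, where the iteration converges to a diagonal matrix.
    """
    n = len(a)
    vecs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        q, r = qr_decompose(a)
        a = matmul(r, q)          # similar to the previous a, so eigenvalues are preserved
        vecs = matmul(vecs, q)    # accumulate the orthogonal transformations
    return [a[i][i] for i in range(n)], vecs

# Usage: two partitions of a tiny 3-row, 2-column data set.
partitions = [[[1.0, 2.0], [3.0, 6.0]], [[5.0, 4.0]]]
cov = covariance_from_partitions(partitions)
eigenvalues, components = qr_eigen(cov)
```

The columns of `components` are the principal directions and `eigenvalues` are the variances along them. In an actual Spark job the per-partition summaries would be computed on the executors and only the small merged matrix would be decomposed on the driver; that division of labor is an assumption of this sketch rather than something the abstract states.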
Keywords/Search Tags: principal component analysis, factor analysis, Spark, QR decomposition, divide and conquer