
Analysis of large data sets with linear and logistic regression

Posted on: 2004-08-14
Degree: Ph.D.
Type: Dissertation
University: University of Central Florida
Candidate: Hill, Christopher Michael
GTID: 1468390011466124
Subject: Engineering
Abstract/Summary:
The need to analyze extremely large data sets is common in many environments, including business, industry, and government. Whatever techniques are employed, they are all susceptible to problems inherent in large data sets. This research investigates the impact of large numbers of observations on traditional linear and logistic regression analysis. The research uses simulated data sets with known relationships, so that results can be compared against the true underlying model. Ensuring that the simulation reasonably represents what one would expect in the real world is a significant issue, and the research provides a guide for constructing the simulated data based on benchmarking historic studies, foundational literature, and real data. The results show that these traditional regression techniques are susceptible to some common problems, including size-related issues, identification of non-significant patterns as significant, and data quality problems (noise in the data). In addition, there are issues specific to traditional regression analysis: breakdown of the statistics normally used in model building, effects of spurious patterns and Type I and II regression errors, and the impact of variable types. The effect of these issues is incorrect selection of variables and poor coefficient estimation. The prediction function of regression seems less affected than the estimation function. The research also investigates approaches to reduce Type I and II variable selection errors by considering simple adjustments to statistics, experimental design properties, other variable selection methods, and segregation of insignificant terms. The result is that no single solution can simultaneously address all issues. The best approach depends on the user, the data, the application, and the impact of each kind of variable selection error. Despite these issues, the traditional regression techniques may be some of the better techniques available when dealing with large data sets.
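The "identification of non-significant patterns as significant" issue can be illustrated with a small simulation. The sketch below is not taken from the dissertation; it is a minimal illustration under assumed settings (a true slope of 0.02, unit noise, and the helper name `slope_t_statistic` are all hypothetical choices). It fits a simple linear regression at increasing sample sizes and reports the slope's t-statistic alongside R-squared, to show how a practically negligible effect can pass a conventional significance threshold once the number of observations is large enough.

```python
import math
import random


def slope_t_statistic(n, true_slope, noise_sd=1.0, seed=0):
    """Simulate y = true_slope * x + noise, fit y = a + b*x by ordinary
    least squares, and return (b_hat, t_statistic, r_squared)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    b_hat = sxy / sxx                       # OLS slope estimate
    sse = syy - b_hat * sxy                 # residual sum of squares
    se_b = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    r_squared = 1.0 - sse / syy
    return b_hat, b_hat / se_b, r_squared


# Assumed illustration: the same tiny effect examined at growing sample sizes.
for n in (100, 10_000, 200_000):
    b_hat, t, r2 = slope_t_statistic(n, true_slope=0.02)
    print(f"n={n:>7}  slope={b_hat:+.4f}  t={t:+.2f}  R^2={r2:.5f}")
```

Because the standard error of the slope shrinks roughly with the square root of the number of observations, the t-statistic grows with sample size even though R-squared stays near zero. This is one reason a significance test alone can be a weak variable selection criterion on very large data sets.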
Keywords/Search Tags:Large data sets, Regression, Issues, Techniques