Detection Of Influential Observations In Mixed Linear Models Using Two Types Of Estimation And Prediction Methods For Genetic Data Analysis

Posted on: 2009-04-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S F You
Full Text: PDF
GTID: 1100360242494323
Subject: Statistical genetics and bioinformatics
Abstract/Summary:
Fitting mixed linear models in genetic data analysis is one of the most challenging problems for statisticians and geneticists alike, because such analysis has traditionally relied on linear, quadratic and likelihood-based estimation methods, which are not robust to aberrant cases in the response or in the factor space. Careful inspection, through data quality checks and model specification, is the only way to understand the effect of unusual data points on the results of an analysis. With this in mind, the present study proposes a technique in the framework of adjusted unbiased prediction (AUP) via the minimum norm quadratic unbiased estimation (MINQUE) method (Method-I) for detecting unusual data points in mixed linear models for genetic data analysis. To check its validity, the proposed method was compared with best linear unbiased prediction (BLUP) via the expectation-maximization (EM) algorithm (Method-II). In addition, the consequences of influential observations and outliers in biological research were examined using two real data sets.

A general genetic model was considered to illustrate the proposed method and to compare it with the existing methods across several influence diagnostic statistics. Four of these, namely the analogue of Cook's distance (CD(β)), the Andrews-Pregibon statistic (AP), the Cook-Weisberg statistic (CW) and the variance ratio (VR), were applied to detect data points that influence the fixed effects of a mixed linear model, while a further analogue of Cook's distance (CD(e)) was used to inspect data points that influence the random components of the model. To check the efficacy and reliability of the proposed method, Monte Carlo simulations were conducted for various settings of aberrant observations in the phenotype data of a general genetic model. All simulations were performed by a program written in C++.
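The case-deletion idea behind a Cook's-distance analogue for the fixed effects can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation: it assumes the commonly used form CD_i(β) = (β̂ − β̂_(i))' (X'V⁻¹X) (β̂ − β̂_(i)) / p, treats the covariance matrix V as known (the study estimates variance components by MINQUE or EM), and uses a made-up toy data set with three genotypes observed in two environments. All function names and values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gls(X, V, y):
    """Generalized least squares for known V; returns (beta_hat, X'V^-1 X)."""
    n, p = len(X), len(X[0])
    # W[j] holds column j of V^-1 X.
    W = [solve(V, [X[i][j] for i in range(n)]) for j in range(p)]
    A = [[sum(X[i][j] * W[k][i] for i in range(n)) for k in range(p)]
         for j in range(p)]
    rhs = [sum(W[j][i] * y[i] for i in range(n)) for j in range(p)]
    return solve(A, rhs), A


def cook_distance_beta(X, V, y):
    """Case-deletion Cook's-distance analogue for the fixed effects."""
    n, p = len(y), len(X[0])
    beta, A = gls(X, V, y)
    out = []
    for i in range(n):
        keep = [k for k in range(n) if k != i]
        beta_i, _ = gls([X[k] for k in keep],
                        [[V[r][c] for c in keep] for r in keep],
                        [y[k] for k in keep])
        d = [b - bi for b, bi in zip(beta, beta_i)]
        out.append(sum(d[j] * A[j][k] * d[k]
                       for j in range(p) for k in range(p)) / p)
    return out


# Toy data (assumed): 3 genotypes x 2 environments, last value is aberrant.
genotype = [0, 0, 1, 1, 2, 2]
X = [[1.0, 0.0], [1.0, 1.0]] * 3          # intercept, environment indicator
y = [10.0, 11.0, 10.2, 11.1, 9.9, 18.0]
sigma_g2, sigma_e2 = 1.0, 1.0             # variance components, assumed known
V = [[sigma_g2 * (genotype[r] == genotype[c]) + sigma_e2 * (r == c)
      for c in range(6)] for r in range(6)]

cd = cook_distance_beta(X, V, y)
for i, value in enumerate(cd):
    print(i, round(value, 4))
```

In this toy example the aberrant observation (index 5) receives the largest value. In practice the variance components would be re-estimated, or held fixed at their full-data estimates, for each deleted case; which of the two the study does is not stated in the abstract.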
It was not rigorously shown that Method-I performs better than Method-II, or vice versa. Both methods showed almost the same detection ability and trends with respect to aberrant observations in the response, using the aforementioned influence diagnostic statistics for the influence of the i-th data point on the fixed and random components of a mixed linear model.

In the present study, the two methods were also compared for false positive rate using a clean data set. The values of each influence diagnostic statistic for the fixed and random components of a general genetic model (mixed linear model) were more tightly clustered under Method-I than under Method-II. This indicates the robustness of the proposed method (Method-I) in the presence of unusual observations and builds our confidence that it will perform better in identifying aberrant observations. In simulation, for different perturbations of the phenotype data with regard to various genotype(s), location(s) and year(s), our approach showed the same trend as, and close agreement with, Method-II under a variety of influence diagnostic statistics. However, in some situations Method-I showed larger magnitudes for some of the influence statistics, and in others Method-II did.

The main results from the simulations and the real data sets are summarized as follows:

1. Our approach is verified to perform well in identifying an aberrant observation in the response vector of a mixed linear model, if one exists. If there is only one aberrant observation in the phenotype data, for any genotype in any location or year, it can be successfully detected using any of the influence diagnostic statistics under both methods. If there are multiple influential observations in the phenotype data of a general genetic model, some of them can be effectively detected by both methods, while for others the influence diagnostic statistics show some noise.
2. A program written in C++ was developed to identify influential observations and outliers in the data analysis of a general genetic experiment in the framework of the mixed linear model. The program also provides estimates of variance components and predictions of random effects involved in the model, as well as the significance (P-value) of each individual observation in a data set.
3. The results for the general genetic model, analyzed in the framework of the mixed linear model, showed both masking and swamping effects in the presence of multiple unusual data points in the phenotype values.
4. In the worked example (general genetic experiment), it was observed that the presence of influential observations and outliers can badly distort the estimates of variance components and the predictions of random effects (breeding values). Removing these data points can drastically change the parameter estimates of a mixed linear model and provide useful results. In the QTL mapping data, the results demonstrate that a clean data set allows additional QTLs with individual effects to be identified, and that improved estimates of phenotypic variation (heritability), and particularly of the residuals, can be obtained in the absence of influential observations and outliers. In general, in both data sets analyzed, the removal of influential observations and outliers brought substantial change in the estimates of various parameters of a mixed linear model. However, it is not claimed that outliers and influential observations are biologically invalid data points.
5. The method can be easily extended to more complex genetic models, e.g. additive-dominance and additive-dominance-maternal models, for studying the effect of unusual data points on the various genetic and non-genetic effects involved in the mixed linear model. It can also be used in microarray data analysis based on the mixed linear model approach to identify hidden peculiarities caused by machine, data entry or recording errors, or possibly by differentially expressed (or non-expressed) genes.
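The Monte Carlo design described in the abstract (perturb the phenotype of one genotype-by-environment cell, then check whether a diagnostic flags it) can be sketched roughly as below. This is an illustration under stated assumptions, not the study's C++ program: the diagnostic here is a simple stand-in (the absolute residual from an additive genotype-plus-environment fit) rather than CD(β), AP, CW or VR, and the function names, design size, variances and shift sizes are all hypothetical.

```python
import random


def simulate_phenotypes(n_geno, n_env, rng, sd_g=1.0, sd_env=1.0, sd_e=1.0,
                        mu=10.0):
    """Simulate y[i][j] = mu + genotype_i + environment_j + residual."""
    g = [rng.gauss(0.0, sd_g) for _ in range(n_geno)]
    env = [rng.gauss(0.0, sd_env) for _ in range(n_env)]
    return [[mu + g[i] + env[j] + rng.gauss(0.0, sd_e) for j in range(n_env)]
            for i in range(n_geno)]


def additive_residuals(y):
    """Residuals from the additive two-way fit of a complete, balanced table."""
    n_geno, n_env = len(y), len(y[0])
    grand = sum(map(sum, y)) / (n_geno * n_env)
    row = [sum(r) / n_env for r in y]
    col = [sum(y[i][j] for i in range(n_geno)) / n_geno for j in range(n_env)]
    return [[y[i][j] - row[i] - col[j] + grand for j in range(n_env)]
            for i in range(n_geno)]


def detection_rate(shift, n_rep=200, n_geno=5, n_env=4, seed=1):
    """Share of replicates in which the perturbed cell has the largest
    absolute residual (the stand-in diagnostic)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_geno) for j in range(n_env)]
    hits = 0
    for _ in range(n_rep):
        y = simulate_phenotypes(n_geno, n_env, rng)
        i0, j0 = rng.randrange(n_geno), rng.randrange(n_env)
        y[i0][j0] += shift                      # inject one aberrant value
        res = additive_residuals(y)
        flagged = max(cells, key=lambda ij: abs(res[ij[0]][ij[1]]))
        hits += flagged == (i0, j0)
    return hits / n_rep


for shift in (0.0, 2.0, 5.0, 10.0):
    print(shift, detection_rate(shift))
```

With a shift of zero the "perturbed" cell is indistinguishable from the others, so the rate stays near chance level, which gives a rough baseline against which larger shifts can be judged. Extending the sketch to several perturbed cells at once would be one way to explore the masking and swamping effects mentioned in result 3.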
Keywords/Search Tags: Influential data points, mixed linear model, general genetic model, simulation, AUP via MINQUE (1), BLUP via EM-algorithm