A general theory for evaluating joint data interaction when combining diverse data sources
Posted on: 2009-02-03 | Degree: Ph.D. | Type: Thesis | University: Stanford University | Candidate: Polyakova, Evgenia I | GTID: 2448390002497981 | Subject: Mathematics
Abstract/Summary:
Accounting for data interaction is a necessary and critical step in any data integration algorithm. Data interaction, whether through information redundancy, compounding, or cancellation, can completely change the picture obtained by merely combining individual data while ignoring their interaction. Data interaction is just as important as the information content of the individual data, and it depends on both the data values and the unknown being assessed. Yet most data integration algorithms ignore data interaction completely or partially, by assuming some form of data independence for a given question asked or, worse, for any question asked. More advanced analysis acknowledges the dependence of data information content but still models it only as linear dependence (linear correlation) between any two data, rather than considering all data together; this linear correlation is also assumed independent of the data values and of the event or value being estimated.

In this study, the general problem of data integration is expressed as combining probability distributions, each conditioned on an individual datum or data event, into a posterior probability for the unknown conditioned jointly on all data. The goal of this thesis is to develop a method of statistical analysis that accounts for data interaction. To this end, we propose the nu expression, a sister of the previously developed tau expression. Both the nu and tau expressions provide an exact analytical solution to the problem of data integration by combining individually conditioned probabilities while accounting for interaction between data.
This is achieved by separating individual data information from data interaction. The nu and tau interaction parameters depend on the data values and, even more critically, on the value of the unknown. This data value-dependency (heteroscedasticity) allows for a better representation of joint data interaction than traditional regression or kriging weights, which are independent of the data values. However, the greater that heteroscedasticity, the more difficult the inference of the data interaction parameters becomes. We investigate the behavior of the nu and tau parameters versus data values. The nu parameters, being ratios of ratios of likelihood probabilities, appear more stable than the tau parameters and could be estimated starting from summary statistics of the actual data values taken altogether. In addition, the tau interaction weights depend on the specific ordering of the data. While such ordering is important, in most applications it is the global representation of the interaction, independent of the data sequence, that matters. The tau expression fails to provide such a global measure. The nu model allows the derivation of a single interaction measure that is independent of the data sequence.

The proposed nu model is extensively tested using synthetic data sets. The test experiments confirmed the superior features of the nu model compared with the tau model or traditional statistical approximations. The practicality of the nu expression will depend on our ability to generate proxy training data from which to borrow and export the nu parameters.

Keywords/Search Tags: Data interaction, Data integration, Parameters, Data values, Combining, Individual data information, Data information content
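
The abstract describes combining individually conditioned probabilities into a joint posterior but does not state the expressions themselves. The following is a minimal sketch, assuming the distance-ratio forms in which the tau and nu expressions are usually written in the literature: with x = (1 - P)/P, the tau form is x/x0 = prod (x_i/x0)^tau_i and the nu form is x/x0 = nu0 * prod (x_i/x0). These formulas, the function names, and the numbers used are assumptions for illustration and are not taken from the thesis text. The interaction parameters are passed in as fixed constants, whereas the abstract stresses that in practice they depend on the data values and on the unknown.

```python
import math


def odds_distance(p):
    """Return (1 - p) / p, the 'distance' to the event for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return (1.0 - p) / p


def combine_tau(prior, conditionals, taus):
    """Assumed tau form: x / x0 = prod_i (x_i / x0) ** tau_i.

    prior        -- P(A), the prior probability of the unknown event A
    conditionals -- list of P(A | D_i), one per datum
    taus         -- one interaction exponent per datum (hypothetical values)
    """
    x0 = odds_distance(prior)
    ratio = 1.0
    for p_i, tau_i in zip(conditionals, taus):
        ratio *= (odds_distance(p_i) / x0) ** tau_i
    x = x0 * ratio
    return 1.0 / (1.0 + x)  # back-transform the distance to a probability


def combine_nu(prior, conditionals, nu0=1.0):
    """Assumed nu form: x / x0 = nu0 * prod_i (x_i / x0).

    nu0 is a single global interaction parameter; nu0 = 1 reproduces the
    no-interaction case (same result as all tau_i = 1).
    """
    x0 = odds_distance(prior)
    ratio = nu0
    for p_i in conditionals:
        ratio *= odds_distance(p_i) / x0
    x = x0 * ratio
    return 1.0 / (1.0 + x)


if __name__ == "__main__":
    prior = 0.5                # illustrative prior P(A)
    conditionals = [0.8, 0.8]  # illustrative P(A | D_1), P(A | D_2)
    print(combine_nu(prior, conditionals))             # no interaction
    print(combine_tau(prior, conditionals, [1.0, 1.0]))  # same case, tau form
    print(combine_nu(prior, conditionals, nu0=4.0))    # fully redundant data
```

With these illustrative inputs, the no-interaction case compounds the two data to a posterior of about 0.94, while nu0 = 4 cancels the second datum entirely and returns 0.8, the probability given either datum alone. In this sketch a single nu0 summarizes the interaction, whereas the tau form needs one exponent per datum, which mirrors the abstract's contrast between a global measure and sequence-dependent weights.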