Statistical Researches On Local Similarity Analysis And Their Applications In Biological Time Series

Posted on: 2020-12-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Zhang
GTID: 1360330572489008
Subject: Financial mathematics and financial engineering
Abstract/Summary:
A metagenome is the collection of all microbial genetic material in a natural environment, and metagenomics is one of the most active research fields in bioinformatics. Metagenomics does not need to culture microorganisms. It directly extracts the genetic information of all microorganisms present in the environment, studies the interactions between them, and analyzes the diversity of microbial communities. With the rapid development of molecular biology technology, especially the emergence of Next Generation Sequencing (NGS) technology, substantial metagenomic sequencing data are constantly being produced. How to use these massive sequencing data to analyze the microbiome is a great challenge for bioinformatics. Benefiting from the sharp decline in sequencing cost, a large number of time series microbial community datasets have been generated in molecular biology over the past decade. Among the statistical methods for time series, local similarity analysis (LSA) has been applied extensively in a wide range of environments to investigate the temporal and spatial evolution of microbial communities and to capture potential local and time-shifted associations between microorganisms that cannot be detected by traditional correlation analysis. Initially, the permutation test was widely used to evaluate the statistical significance of local similarity analysis. More recently, a theoretical method has also been developed to analyze the statistical significance of local similarity scores. However, both this method and the permutation test require the assumption that the time series data are independent and identically distributed (i.i.d.), which can be violated in many practical problems. In this dissertation, some novel approaches are developed to accurately evaluate the statistical significance of LSA for stationary time series data. The main contents can be summarized as follows.

In chapter 2, based on the theoretical approximation for i.i.d. data, we develop a theoretical method, Data-Driven LSA (DDLSA), to assess the statistical significance of LSA for stationary time series data. In DDLSA, the long-run variance is used to adjust the asymptotic theory of LSA, and the limit distribution of the LS score for stationary time series is obtained, where the long-run variance is estimated by a nonparametric kernel method. In addition, we investigate an alternative method, LSA for residuals (LSAres), which evaluates the theoretical statistical significance of LSA on the residuals of a predefined statistical model. Simulations show that both methods have controllable type I error rates for stationary time series, while the tests from other approaches can be grossly oversized. We apply both methods to human and marine microbial datasets, where most of the possibly significant associations are captured and false positives are effectively reduced.

In chapter 3, a new approach based on the moving block bootstrap is proposed to analyze the statistical significance of LSA for stationary time series, denoted Moving Block Bootstrap LSA (MBBLSA). First, the original sequence is divided into overlapping blocks of the same length. Then blocks are randomly drawn with replacement and concatenated into resamples of the same length as the original sequence. Since each block is stationary, the resamples also preserve part of the stationary structure of the original sequence. Therefore, MBBLSA overcomes the limitation of the permutation test, which requires the time series to be i.i.d. The choice of block length plays a crucial role in the moving block bootstrap, so an appropriate block length selector is needed. In this dissertation, we choose a simple block length selector based on the autoregressive coefficient of an AR(1) model. Finally, MBBLSA is applied in simulation studies and to three real datasets, and the results indicate that it performs better than the other methods.

In chapter 4, we introduce a variant of LSA, local trend analysis (LTA). In LTA, the original sequence is converted into a trend sequence, and the local trend score is then calculated as the local similarity score of the trend sequence. To evaluate the statistical significance of the local trend score, a new method, Stationary Theoretical Local Trend Analysis (STLTA), is proposed. Using matrix spectral decomposition theory, we obtain the adjusted variance of the trend series for different trend-change alphabets. A more precise limit distribution of the local trend score can then be obtained. The simulations show that the type I error rate of STLTA is closer to the nominal significance level across different time series models. STLTA is applied to several metagenomic datasets. The results show that our method is more efficient than the permutation test and than the statistical significance evaluation of local trend analysis designed for i.i.d. series.
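
The resampling step described for chapter 3 (cut the series into overlapping blocks of equal length, draw blocks with replacement, join them into a resample as long as the original) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the dissertation's implementation: the function name `moving_block_bootstrap`, its parameters, and the decision to trim the surplus of the last drawn block are hypothetical choices made here, and the AR(1)-based block length selector is not reproduced because its formula is not given in the abstract.

```python
import numpy as np


def moving_block_bootstrap(x, block_length, rng=None):
    """Return one moving-block-bootstrap resample of the 1-D series x.

    Hypothetical helper for illustration; block_length is supplied by the
    caller (the abstract's AR(1)-based selector is not reproduced here).
    """
    x = np.asarray(x)
    n = x.shape[0]
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(x)")
    rng = np.random.default_rng(rng)

    # Overlapping blocks start at positions 0, 1, ..., n - block_length.
    n_starts = n - block_length + 1
    # Draw enough blocks to cover n points (ceiling division).
    n_blocks = -(-n // block_length)
    starts = rng.integers(0, n_starts, size=n_blocks)

    # Row k holds the consecutive indices of the k-th drawn block.
    idx = starts[:, None] + np.arange(block_length)[None, :]
    # Concatenate the blocks and trim to the original length (assumption).
    return x[idx.ravel()[:n]]
```

With `block_length=1` this reduces to ordinary resampling with replacement, which ignores serial dependence; longer blocks keep the within-block ordering of the original series, which is the property the abstract relies on.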
Keywords/Search Tags: Local similarity analysis, Statistical significance, Stationary time series, Long run variance, Nonparametric kernel estimation, Moving block bootstrap, Local trend analysis