Some Statistical Computations Models And Its Applications In Biomedical Information Processing

Posted on: 2017-05-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G C Liu
GTID: 1224330485479615
Subject: Probability theory and mathematical statistics
Abstract/Summary:
All topics in this dissertation come from real medical and biological problems. Four statistical models were developed using time series analysis, statistical signal processing, statistical machine learning and pattern recognition, meta-analysis, and other approaches. These models were then applied to the following problems: 1) fetal electrocardiogram (FECG) extraction and noise reduction; 2) identification of protein coding regions in eukaryotes; 3) virus prediction (classification) in large-scale next-generation sequencing (NGS) reads; and 4) the association between alcohol dependence (AD) and neuropeptide Y (NPY) genetic polymorphism.

A high-resolution fetal electrocardiogram (FECG) plays an important role in helping physicians detect fetal changes in the womb and make clinical decisions. In practice, however, a clear FECG is difficult to extract because it is usually overwhelmed by the dominant maternal ECG (MECG) and contaminated by other noise such as baseline wander and high-frequency noise. In Chapter 1, we proposed a novel integrated adaptive algorithm, denoted ICA-EEMD-WS, for FECG separation and noise reduction, based on Independent Component Analysis (ICA), Ensemble Empirical Mode Decomposition (EEMD), and wavelet shrinkage (WS) denoising. First, ICA was used to separate the mixed abdominal ECG (AECG) signal and obtain the noisy FECG. Second, the noise in the FECG was reduced by a three-step integrated algorithm comprising EEMD decomposition, statistical inference of the useful subcomponents followed by WS processing, and partial reconstruction for baseline wander reduction. Finally, we evaluated the proposed algorithm using simulated datasets.
The results indicated that the proposed ICA-EEMD-WS outperformed conventional algorithms in signal denoising.

The identification of protein coding regions plays a critical role in gene structure prediction and can be viewed as a pattern recognition or classification problem in bioinformatics. A number of techniques have been suggested for discriminating between protein coding regions (exons) and noncoding regions (introns) in eukaryotic DNA sequences. Among these methods, digital signal processing (DSP) techniques based on the discrete Fourier transform (DFT) have been used successfully because they require no prior knowledge. However, DFT-based methods rapidly lose effectiveness on short DNA sequences because of their inherent limitations of low spectral resolution and spectral leakage. In Chapter 2, a novel integrated algorithm based on autoregressive (AR) spectrum analysis and the wavelet packet transform (WPT) is presented to improve the efficiency and accuracy of coding region identification. In this algorithm, the DNA sequences are converted into numerical sequences by the Code13 mapping method. Then, taking the numerical sequences as the observed signal of an AR model, the efficient Marple algorithm is used to estimate the parameters of the AR model from the Yule-Walker equations. Finally, the power spectral density (PSD) at frequency θ=2π/3, which reflects the three-base periodicity (TBP) property, is used to compute a numerical quantity called the signal-to-noise ratio (SNR), and the SNR curve, after denoising by WPT, is used to identify the exons.
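The AR-spectrum step can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: it assumes a simple binary indicator mapping (1 where a chosen base occurs) instead of the Code13 mapping, solves the Yule-Walker equations with the Levinson-Durbin recursion instead of the Marple algorithm, and uses an arbitrary AR order of 2. The function names are hypothetical. It only shows how the AR power spectral density peaks at θ=2π/3 for a sequence with three-base periodicity.

```python
import cmath
import math

def indicator(seq, base):
    """Binary indicator mapping (an assumption; the source uses Code13)."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]

def autocorr(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag] of the mean-removed signal."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [sum(c[t] * c[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def yule_walker(r, order):
    """Levinson-Durbin solution of the Yule-Walker equations.

    Model convention: x[t] + a[0]*x[t-1] + ... + a[p-1]*x[t-p] = e[t].
    Returns the coefficients a and the prediction-error variance.
    """
    a, err = [], r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = -acc / err  # reflection coefficient
        a = [a[j] + k * a[m - 2 - j] for j in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err

def ar_psd(a, err, theta):
    """AR power spectral density at angular frequency theta."""
    denom = 1 + sum(aj * cmath.exp(-1j * theta * (j + 1))
                    for j, aj in enumerate(a))
    return err / abs(denom) ** 2

# Toy sequence with perfect three-base periodicity (hypothetical data).
signal = indicator("GCA" * 50, "G")
a, err = yule_walker(autocorr(signal, 2), 2)
peak = ar_psd(a, err, 2 * math.pi / 3)  # PSD at the TBP frequency
off = ar_psd(a, err, math.pi / 2)       # PSD away from the TBP frequency
```

For the toy sequence, `peak` is much larger than `off`. In a sliding-window setting, a ratio of this kind could serve as a rough stand-in for the SNR described above, although the dissertation's exact SNR definition is not given in the abstract.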
By testing on the GENSCAN65, HMR195 and BG570 benchmark datasets, we show that the new algorithm clearly outperforms conventional DFT-based approaches in the prediction accuracy of protein coding regions.

Viruses (especially pathogenic viruses) have been a threat to human health for thousands of years, and in recent years new viruses and their variants have emerged frequently. Computational biology approaches therefore play an important role in giving medical experts technical support to quickly narrow down the range of suspected viral sequences in the massive short-read databases produced by next-generation sequencing. By providing high-quality candidates, they also help to accelerate the follow-up experiments that confirm the virus, significantly reduce experimental costs, improve the capacity and timeliness of the emergency response to new viruses, accelerate vaccine development and mass production, save lives, and reduce the infected population. In Chapter 3, we developed a comprehensive set of classification algorithms that distinguish viral from human sequences and then predict the virus class, combining sequence alignment with alignment-free approaches. First, BLAST was used to classify the unknown sequence(s) as virus or human by aligning them to a large virus database and a human database, respectively. If a highly homologous target sequence is found in the virus or human database, the unknown sequence is assigned to the category of the target sequence and the algorithm stops. For sequences that cannot be aligned, our proposed alignment-free method plays a complementary role. The DNA sequence was first converted into a numeric vector, which served as the input of a support vector machine (SVM) classifier. The SVM classifier performed a virus-human level prediction, and its output is the category of the unknown sequence.
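The abstract does not say how a DNA sequence is converted into a numeric vector. One common alignment-free choice is a normalized k-mer frequency vector, and the Python sketch below assumes that encoding purely for illustration. The function name `kmer_vector` and the default k=3 are hypothetical and are not taken from the dissertation.

```python
from itertools import product

def kmer_vector(seq, k=3):
    """Normalized k-mer frequency vector over the alphabet ACGT.

    Assumed encoding for illustration only. Windows containing
    characters other than A, C, G, T (for example N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        # Sequence shorter than k or with no valid window.
        return [0.0] * len(kmers)
    return [c / total for c in counts]

# A read of any length maps to a fixed-length vector (4**k entries),
# which is the kind of input a classifier such as an SVM expects.
vec = kmer_vector("ACGTACGT", k=2)
```

Because the vector length depends only on k, short and long reads can be fed to the same classifier.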
Furthermore, if a sequence has been predicted as "virus", users may still want to know more about the virus, for example whether it is a double-stranded DNA (dsDNA) virus or a positive-sense single-stranded RNA (ssRNA(+)) virus; that is, they may want to predict its virus-level label. Based on these results, multiple Random Forest (RF) classifiers were therefore used to perform a six-class virus-level prediction for the sequences classified as "virus" in the SVM step. Eight independent testing datasets were used to test the proposed approach and other approaches. The results indicated that our proposed integrated algorithm outperformed the other approaches at the virus-human level, especially for short sequences. Although the accuracy of the six-class virus-level prediction was lower, the results can still serve as a reference for biologists. In summary, this study could help biologists and medical experts filter candidate viruses from massive NGS short reads and thus greatly narrow the set of candidate viral sequences. This will improve the efficiency of virus confirmation, in particular the identification of pathogenic viruses, and provide strong technical support for the treatment and prevention of major epidemic diseases.

Alcohol dependence (AD) is a typical form of chronic alcoholism, a particular mental state caused by long-term repeated drinking. Among the world's disease risk factors, alcohol use rose rapidly from sixth to third place between 1990 and 2010 (20 years), ranking only behind hypertension and secondhand smoke. Excessive alcohol consumption not only damages health but also harms society through traffic accidents, crime, child abuse, domestic violence, and other forms of harm. Accordingly, alcohol-related problems have become an important public health problem worldwide, including in China.

Although the incidence of alcohol dependence continues to increase, its exact cause and pathogenesis are still not fully understood.
Current research suggests that AD is a complex mental illness associated with multiple genetic and environmental factors, and a large number of studies have confirmed that AD is closely associated with genetic factors. The association between AD and neuropeptide Y (NPY) gene polymorphism has been studied for more than ten years in different populations worldwide. For two single nucleotide polymorphisms (SNPs), namely rs16139 and rs16147, however, the results are inconsistent, and in some cases completely opposite conclusions have been drawn. The susceptibility genes for AD therefore remain unidentified. This may be because populations differ in genetic background and environmental factors, and allele and genotype frequencies of the same gene can differ among ethnic groups, so that the effect on the same disease may also differ. Using the existing randomized case-control studies to find susceptibility genes for alcohol dependence syndrome, to screen high-risk groups at the genetic level, and to provide targeted early intervention, diagnosis, and personalized treatment therefore has important clinical value and social benefits. Given the inconsistent findings of existing association studies on NPY gene polymorphism and alcohol dependence syndrome, Chapter 4 focuses on whether there is a significant association between NPY gene polymorphism and AD. A meta-analysis of published studies on NPY gene polymorphisms, in particular the two important SNPs rs16139 and rs16147, was performed to quantitatively analyze and comprehensively evaluate AD risk in the epidemiological literature.
In this chapter, we performed the analysis strictly in accordance with the basic requirements of meta-analysis of SNPs, extensively collecting existing high-quality research literature and conducting a comprehensive quantitative analysis of the association between NPY gene polymorphism and alcohol dependence syndrome. First, Hardy-Weinberg equilibrium (HWE) was tested in the control groups, followed by a test of heterogeneity across studies. Then a genetic model selection strategy based on a logistic regression model was used to find the best genetic model, and the dominant genetic model was recommended as the best model for this research. The dominant genetic model was therefore used to combine the p values of the studies and to conduct a subgroup analysis. Finally, the funnel plot, Egger's linear regression, and Begg's rank correlation method were used to test for publication bias. The results indicated that overall, i.e. for the majority of populations, there is insufficient evidence to support a significant association between NPY gene polymorphism and alcohol dependence syndrome. However, subgroup analysis found a significant association between SNP rs16139 and AD in individual groups (such as the Finns). Performing a meta-analysis of multiple existing studies increases the sample size by pooling them and improves statistical power. Especially when individual results are inconsistent or not statistically significant, meta-analysis can yield a more comprehensive result that is closer to the real situation. This will help clinicians and researchers understand the pathogenesis of AD in depth and provide a scientific basis for gene diagnosis and treatment.

In Chapter 5, we summarize the four research subtopics, focusing in particular on the limitations of the research and their causes. Finally, specific improvement plans are given for further research.
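The first steps of the Chapter 4 meta-analysis (HWE test in controls, heterogeneity test, pooling under the dominant model) can be sketched in Python as follows. This is a hedged illustration and not the dissertation's procedure: it assumes fixed-effect inverse-variance pooling of log odds ratios and Cochran's Q as the heterogeneity statistic, which are standard choices the abstract does not confirm. All function names and genotype counts are hypothetical, and all counts are assumed to be non-zero.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for Hardy-Weinberg
    equilibrium from genotype counts; values above about 3.84 reject
    HWE at the 5% level."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def dominant_log_or(case, control):
    """Log odds ratio and its variance under the dominant model.

    case and control are genotype counts (AA, Aa, aa), where 'a' is the
    putative risk allele; carriers (Aa + aa) are compared with AA.
    """
    a, b = case[1] + case[2], case[0]
    c, d = control[1] + control[2], control[0]
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def fixed_effect_pool(effects):
    """Inverse-variance pooled effect, its standard error, and Cochran's Q.

    effects is a list of (log_or, variance) pairs, one per study.
    """
    weights = [1 / var for _, var in effects]
    total = sum(weights)
    pooled = sum(w * e for w, (e, _) in zip(weights, effects)) / total
    q_stat = sum(w * (e - pooled) ** 2 for w, (e, _) in zip(weights, effects))
    return pooled, math.sqrt(1 / total), q_stat

# Hypothetical genotype counts (cases, controls) for two studies.
studies = [((10, 20, 10), (20, 15, 5)), ((10, 20, 10), (20, 15, 5))]
effects = [dominant_log_or(case, ctrl) for case, ctrl in studies]
pooled, se, q_stat = fixed_effect_pool(effects)
```

Under Cochran's Q, a large value relative to a chi-square distribution with (number of studies - 1) degrees of freedom would suggest heterogeneity, in which case a random-effects model is usually preferred over the fixed-effect pooling shown here.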
Keywords/Search Tags: fetal electrocardiogram (FECG), Ensemble Empirical Mode Decomposition (EEMD), protein coding regions, virus prediction, alignment-free approach, alcohol dependence (AD), genetic polymorphism, Meta-analysis