
Development Of BIPES Pipeline For Bioinformatic Analysis Of Microbial Diversity

Posted on: 2013-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: H F Sheng
Full Text: PDF
GTID: 2230330395461843
Subject: Occupational and Environmental Health
Microbial communities exist in every part of the biosphere and are relevant to many research areas. In medicine, the human symbiotic flora, known as the second genome, is closely related to health. In environmental science, microbial communities drive the biogeochemical cycling of the basic elements of life, such as C, N and S, and decompose various pollutants. In ecology, much work concerns microbial community structure and its dynamic changes. Microbial community research also extends to industrial and resource microorganisms and to agricultural and soil microorganisms.

Answering scientific questions about a microbial community first requires a clear and accurate analysis of its structure, that is, which microbial species are present in a sample and in what numbers. In traditional community analysis methods, however, three factors (throughput, accuracy and cost) make the determination of microbial community structure a technical bottleneck across disciplines. High throughput means both that a large amount of data is obtained for each sample and that the method can handle a large number of samples. Accuracy means, on the one hand, that the microbial species (or taxa) are characterized as clearly as possible and, on the other hand, that the different taxa are quantified as accurately as possible. Traditional technologies such as DGGE and gene chips cannot meet the requirements of low cost, high throughput and accuracy together.

In recent years, sequencing short 16S rRNA tags on the 454 platform has become a breakthrough in the study of microbial communities.
This approach uses pyrosequencing to obtain high-throughput data, and the joint development of the sequencing method and the associated bioinformatics tools has driven a breakthrough in the methodology of microbial community structure research. However, the high cost of 454 16S rRNA tag sequencing hinders its widespread use, and sequencing errors and the bioinformatics tools also pose problems.

Compared with 454, the Illumina platform provides far more sequences, which significantly increases sample throughput and reduces analysis cost, and it offers higher sequence accuracy. However, Illumina reads are short, so in the past they could not cover a target 16S rRNA variable region. Meanwhile, because the number of sequences obtained from the Illumina platform has increased tenfold, the existing bioinformatics tools are no longer adequate, and this computational bottleneck restricts Illumina-based analysis of microbial communities.

In this study, we first validated a workflow in which a 16S rRNA variable region is amplified with barcoded primers, the PCR products are sequenced by Illumina paired-end sequencing, and the reads are then sorted, assembled, quality-controlled and analyzed bioinformatically to obtain the representative sequences of the microbial community in the target samples. This method is called BIPES. Taking advantage of progress in sequencing technology, we were the first to use Illumina PE75 and PE101 sequencing to read through the 16S rRNA V6 variable region, and we established a series of quality-control algorithms to compare the accuracy of different analysis pipelines. The results showed that the accuracy of Illumina single-end sequences is only about 97.9%, falling from 99.9% at the 5’ end of the sequence to 85% at the 3’ end.
During the reverse-complement overlap of the paired-end reads, the quality decline at the 3’ end is corrected, which raises the sequencing accuracy significantly, to 99.65%. Removing sequences that have more than 1 mismatched base in the 40-70 bp region of the overlap, or any erroneous base in the primers, further improves the accuracy of BIPES sequences to 99.93%. The error rate is an order of magnitude lower than that of the 454 method. We also found that BIPES broadly reflects the relative amounts of the initial template sequences, but long sequences and sequences with high GC content are underestimated, which indicates that PCR also has a significant impact on community analysis. For 16S rRNA V6 sequencing, a single BIPES run yields 20-50 times as many reads as pyrosequencing, and each BIPES read costs less than 1/40 of a pyrosequencing read. BIPES uses the 16S rRNA V6 variable region as the signature of a taxon, which allows further phylogenetic analysis and more accurate comparison. As a cost-effective method, BIPES can be widely applied to microbial community structure in the environment and in the human microbiome.

Once a large number of sequences has been acquired, extensive bioinformatics analysis is needed to determine the community structure they represent and then to carry out alpha and beta diversity analysis. Sequences are first aligned, and sequences above a certain similarity are then clustered into operational taxonomic units (OTUs); this is a critical step in the bioinformatic analysis of microbial diversity. This study developed a new two-stage clustering (TSC) method that reduces the demand for computing resources while retaining good accuracy. In TSC, sequences are divided into two groups according to their abundance and then clustered.
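To make the paired-end overlap step described above more concrete, here is a minimal Python sketch. It is not the BIPES implementation: the function name `merge_pair`, the parameters `min_overlap` and `max_mismatch`, and the rule of trusting each read for the half of the overlap nearer its own 5’ end are all assumptions for illustration. It also assumes the amplicon is at least as long as each read, and it does not model the primer check or the 40-70 bp window mentioned in the abstract.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(fwd: str, rev: str,
               min_overlap: int = 10,
               max_mismatch: int = 1) -> Optional[str]:
    """Merge a read pair, or return None if no acceptable overlap exists.

    Hypothetical helper: tries the longest overlap first and accepts the
    first one with at most `max_mismatch` mismatching positions.
    """
    rev_rc = reverse_complement(rev)
    longest = min(len(fwd), len(rev_rc))
    for olen in range(longest, min_overlap - 1, -1):
        tail = fwd[-olen:]      # 3' end of the forward read
        head = rev_rc[:olen]    # 3' end of the reverse read, re-oriented
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch:
            # Assumption: quality drops toward each read's 3' end, so use
            # the forward read for the first half of the overlap and the
            # reverse read for the second half.
            half = olen // 2
            overlap = tail[:half] + head[half:]
            return fwd[:-olen] + overlap + rev_rc[olen:]
    return None  # pair discarded: too many mismatches in every overlap


template = "ACGTTGCAAGGCTTACCGATGGCATTCAGA"
forward = template[:20]
reverse = reverse_complement(template[10:])
merged = merge_pair(forward, reverse)
```

With the defaults, a single erroneous base at the 3’ end of the forward read is replaced by the base from the reverse read, while a pair with two mismatches in the overlap is dropped.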
Because of the abundance distribution of microbial communities and the nature of high-throughput sequencing errors, the sequencing result contains few high-frequency sequences and many low-frequency ones. A strict hierarchical clustering algorithm is applied to the high-abundance sequences. This algorithm is highly accurate, but its computational cost grows geometrically with the number of sequences; TSC effectively limits the number of sequences that enter hierarchical clustering. We then use a greedy heuristic method to cluster the low-frequency group, which contains most of the rare sequences, to improve performance. Throughout this process, all comparisons are based on the Needleman-Wunsch algorithm, the most accurate global alignment algorithm, to obtain accurate OTU clustering. To further improve computational efficiency and accuracy, TSC employs a two-step pre-clustering.

Analysis of the Clone4397up data showed that TSC clusters known data accurately, yielding the expected 43 OTUs. On a real dataset of about 110,000 sequences (Costello day3), TSC needed only 370 s and 185 MB of memory to complete the clustering; apart from UCLUST, the other methods required more than 10 times the time and more than 5 times the memory of TSC. This study found that clustering after dividing the sequences into two groups not only improves computational efficiency but also reduces spurious OTUs composed of noisy sequences. An OTU of this kind, in which two abundant sequences are joined by a low-abundance sequence, is named ARA. Further analysis showed that the TSC OTUs contain no ARA, whereas the proportion of ARA in the other methods is: SLP 4.2%, UCLUST 3.0%, Mothur CL 2%, Mothur AL 2.3%, Mothur SL 45.5%, ESPRIT-SL 22%. Rarefaction curves show that the TSC curve is flatter and lower than those of UCLUST and the other methods that use the AL algorithm.
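The two-stage idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the TSC program itself: the names `two_stage_cluster`, `min_abundance` and `cutoff` are hypothetical, complete linkage is assumed for the hierarchical stage, the 0.03 distance cutoff is a conventional choice not given in the abstract, the global alignment uses unit costs (equivalent to edit distance), and the two-step pre-clustering is omitted.

```python
from collections import Counter
from typing import List


def nw_distance(a: str, b: str) -> float:
    """Global-alignment distance with unit mismatch and gap costs,
    normalised by the longer sequence length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # gap in b
                           cur[j - 1] + 1,              # gap in a
                           prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = cur
    return prev[-1] / max(len(a), len(b), 1)


def complete_linkage(seqs: List[str], cutoff: float) -> List[List[str]]:
    """Naive agglomerative clustering; fine for a small abundant group."""
    clusters = [[s] for s in seqs]
    while len(clusters) > 1:
        dist, i, j = min(
            (max(nw_distance(x, y) for x in clusters[i] for y in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > cutoff:
            break
        merged = clusters.pop(j)    # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def two_stage_cluster(reads: List[str],
                      min_abundance: int = 3,
                      cutoff: float = 0.03) -> List[List[str]]:
    """Stage 1: hierarchical clustering of abundant unique sequences.
    Stage 2: greedy assignment of rare sequences to the first OTU whose
    seed (most abundant member) is within `cutoff`, else a new OTU."""
    counts = Counter(reads)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    abundant = [s for s in ordered if counts[s] >= min_abundance]
    rare = [s for s in ordered if counts[s] < min_abundance]
    otus = complete_linkage(abundant, cutoff)
    for seq in rare:
        for otu in otus:
            if nw_distance(seq, otu[0]) <= cutoff:
                otu.append(seq)
                break
        else:
            otus.append([seq])
    return otus


seq_a = "ACGT" * 10
seq_b = "TTGGCCAATT" * 4
variant = "ACGT" * 9 + "ACGA"          # one mismatch from seq_a
novel = "G" * 10 + "C" * 10 + "A" * 10 + "T" * 10
otus = two_stage_cluster([seq_a] * 5 + [seq_b] * 4 + [variant, novel])
```

Only the abundant group pays the quadratic cost of hierarchical clustering. Because a rare sequence can only attach to an existing seed or start its own OTU, it cannot bridge two abundant sequences in this sketch.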
In addition, DCA and PCoA were used to analyze how the OTUs obtained from the Costello data by different clustering methods affect community structure comparison; the results show that TSC, UCLUST and ESPRIT-AL all separate oral cavity and gut samples well. The analysis of a set of unpublished data shows that TSC reveals both location and temperature as factors affecting the sample communities, whereas UCLUST reveals only temperature. Based on these two datasets, we conclude that TSC and UCLUST generally give similar beta diversity results, but TSC sometimes performs slightly better than UCLUST, and vice versa. This study suggests that, in high-throughput sequencing analysis of PCR amplicons, the distribution characteristics of the sequencing data are a very useful feature for improving computational efficiency and accuracy.

We then analyzed an antibiotics dataset with this method; BIPES removed 7-22% of the sequences in this dataset as low quality. Alpha diversity analysis showed that the day 0 samples have the highest diversity and that the day 3-7 samples are richer than the day 14-21 samples. The beta diversity results show that time and antibiotic concentration are the two major factors affecting microbial community structure.

This thesis established an analysis method for microbial community diversity based on Illumina sequencing. We established the BIPES technology to obtain high-quality V6 sequences and developed the TSC algorithm, which can process millions of sequences and produce accurate clustering. The OTUs can then undergo further taxonomic analysis with GAST and RDP to determine the species present in the samples and the relative abundance of each taxon. From the clustering results, we can compare the alpha and beta diversity of different samples and perform statistical analysis to identify the characteristics of their microbial communities, which provides a bioinformatics basis for biological studies of the microbiome.
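As a small illustration of the alpha and beta diversity comparisons mentioned above, the following Python sketch computes three common measures from per-sample OTU counts. The abstract does not say which indices were used, so observed richness, the Shannon index and Bray-Curtis dissimilarity are assumptions chosen for illustration, and the sample counts are made up.

```python
import math
from typing import Dict


def richness(counts: Dict[str, int]) -> int:
    """Alpha diversity: number of OTUs observed in one sample."""
    return sum(1 for n in counts.values() if n > 0)


def shannon(counts: Dict[str, int]) -> float:
    """Alpha diversity: Shannon index (natural log) of one sample."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)


def bray_curtis(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Beta diversity: Bray-Curtis dissimilarity between two samples
    (0 means identical counts, 1 means no shared OTUs)."""
    shared = sum(min(a.get(otu, 0), b.get(otu, 0)) for otu in set(a) | set(b))
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))


# Made-up OTU tables for two hypothetical samples.
sample_1 = {"otu1": 50, "otu2": 30, "otu3": 20}
sample_2 = {"otu1": 90, "otu4": 10}

alpha_1 = (richness(sample_1), shannon(sample_1))
alpha_2 = (richness(sample_2), shannon(sample_2))
beta_12 = bray_curtis(sample_1, sample_2)
```

A matrix of pairwise dissimilarities such as `beta_12` is the usual input to an ordination method such as PCoA.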
Keywords: Microbe, Community, BIPES, Bioinformatic Methods