The Error Analysis Method Research Of High-throughput Sequencing Data

Posted on: 2015-10-07 | Degree: Master | Type: Thesis
Country: China | Candidate: Y S Dong | Full Text: PDF
GTID: 2310330518972138 | Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
The birth of high-throughput DNA sequencing technology was a technical revolution in molecular biology research, and its low cost, speed, and high throughput have led it to replace traditional sequencing technology. With the continuous development of sequencing technology, high-throughput sequencing has gradually matured and is now widely used in many fields, such as biology and medicine. As its applications have spread, researchers have paid more and more attention to improving its accuracy. In any field where sequencing technology is used, the accuracy requirements for the sequencing data are very high. Errors introduced during sequencing affect the subsequent analysis and processing, and can even prevent that work from being completed. For this reason, we study methods for analyzing sequencing errors.
The defining characteristic of high-throughput DNA sequencing technology is the direct sequencing of a target group of nucleotide sequences, which, in comparison with conventional sequencing techniques, greatly improves the accuracy with which genetic information is acquired. To obtain information on specific genes, we must first align the obtained sequences to the reference genome to find their locations, in preparation for subsequent analysis. Because the obtained sequences differ from the reference genome through individual variation, and because sequencing errors arise during the sequencing process, some sequencing data cannot be mapped to the reference genome during alignment and therefore cannot be used.
In this paper, we analyze short-read high-throughput sequencing data. Because different sequencing platforms generate different sequencing errors, the method designed in this paper differs from traditional analysis methods and addresses their shortcoming of not basing the analysis on concrete sequencing data.
In this paper, for a specific data set, we analyze its particular patterns of sequencing error generation and estimate its error pattern using Bayesian theory. Using this estimate as a reference during sequence alignment, we improve the mapping success rate of the data. Experiments show that the probability of a sequencing error is higher at positions toward the end of a sequencing read than at other positions, and that this probability varies with position and with error type. These patterns also differ between sequencing platforms and change with the experimental environment. In a verification experiment, the method designed in this paper successfully rescues part of the sequencing data that previously could not be mapped to the genome. We also demonstrate the effectiveness of the proposed method through an analysis of the overlapping regions between the reliable data and the rescued data. By increasing the mapping rate of the sequencing data, we improve the utilization of the sequencing data.
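The idea of estimating a position-dependent and type-dependent error pattern can be illustrated with a small sketch. This is not the method of the thesis, whose details are not given in the abstract; it is a minimal example assuming a Beta-Binomial model, in which the posterior mean error rate at each read position is (errors + alpha) / (observations + alpha + beta). The function name `estimate_error_profile`, the prior parameters `alpha` and `beta`, and the toy read/reference pairs are all hypothetical. The sketch also assumes that reads are already aligned without gaps, and it treats every mismatch as a sequencing error, although in real data a mismatch may be a true individual variant.

```python
from collections import defaultdict


def estimate_error_profile(pairs, read_length, alpha=1.0, beta=1.0):
    """Estimate per-position error rates from aligned (read, reference) pairs.

    Assumes a Beta(alpha, beta) prior on the error rate at each position and
    returns the posterior mean for every position, together with raw counts
    of each substitution type keyed by (position, reference base, read base).
    """
    errors = [0] * read_length
    totals = [0] * read_length
    substitution_counts = defaultdict(int)

    for read, reference in pairs:
        # zip stops at the shorter string; the slice caps it at read_length.
        for pos, (observed, expected) in enumerate(
            zip(read[:read_length], reference)
        ):
            totals[pos] += 1
            if observed != expected:
                errors[pos] += 1
                substitution_counts[(pos, expected, observed)] += 1

    # Posterior mean of a Beta-Binomial model at each position.
    rates = [
        (errors[i] + alpha) / (totals[i] + alpha + beta)
        for i in range(read_length)
    ]
    return rates, dict(substitution_counts)


# Toy data (hypothetical): four reads aligned to the same reference segment.
example_pairs = [
    ("ACGT", "ACGT"),
    ("ACGA", "ACGT"),
    ("ACGT", "ACGT"),
    ("ACTA", "ACGT"),
]
rates, substitutions = estimate_error_profile(example_pairs, read_length=4)
for position, rate in enumerate(rates):
    print(position, round(rate, 3))
```

In this toy data the mismatches cluster at the last read position, so its estimated rate is higher than at the first position. A profile of this kind could, in principle, be used to weight mismatches by position and type when deciding whether a previously unmapped read should be accepted.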
Keywords/Search Tags: next generation sequencing, Bayesian theory, DNA, mapping, error analysis