The Information Analysis Of Real-Time Sequencing Based On Dual Mononucleotide Addition
Posted on: 2016-01-16 | Degree: Master | Type: Thesis | Country: China | Candidate: C G Mao | GTID: 2310330503477351 | Subject: Biomedical engineering

Abstract/Summary:

Real-time sequencing methods have many advantages, such as preserving the characteristics of natural nucleotides and eliminating post-processing between sequencing cycles. One drawback remains: not every sequencing reaction yields an informative signal, which lowers cycle efficiency and in turn limits read length. Our research group recently proposed a real-time decoding sequencing strategy in which a template is determined not by measuring the base sequence directly but by decoding two sets of encodings obtained from two parallel sequencing runs. When a template is cyclically interrogated twice with any two kinds of dual mononucleotide addition, two sets of encodings are obtained sequentially. Together, the two sets of encodings allow the bases to be decoded one after another, from first to last, in a deterministic manner. Compared with the traditional real-time sequencing strategy, this strategy needs fewer cycles to reach a longer read length. This thesis studies the bioinformatics of real-time decoding sequencing in order to provide software support for DNA sequencing. The main contents of the study are as follows:

[1] The encoding and decoding algorithm

According to the decoding sequencing principle, three encoding and decoding algorithms (the character algorithm, the first-order mode algorithm, and the bitwise algorithm) are designed, and all three pass tests on simulated data sets. In the simulated data sets, 1000 randomly generated fragments of 1000 bp each are translated into three sets of encoding information.
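The encoding and decoding principle can be illustrated with a minimal Python sketch. It is not one of the three algorithms of the thesis. It assumes that the three dual-nucleotide schemes are AG/CT, AC/GT and AT/CG, that the two mixes of a scheme are dispensed alternately, and that every dispensation reports an exact incorporation count. The names `SCHEMES`, `encode`, `expand` and `decode` are hypothetical.

```python
# Assumption: the three ways to split {A, C, G, T} into two dual-nucleotide mixes.
SCHEMES = {
    "AG/CT": ("AG", "CT"),
    "AC/GT": ("AC", "GT"),
    "AT/CG": ("AT", "CG"),
}

def encode(seq, scheme):
    """Return the incorporation count of each dispensation in one run.

    The two mixes of the scheme are added alternately, starting with the
    first one, so only the first count can be zero.
    """
    mixes = SCHEMES[scheme]
    counts, i, turn = [], 0, 0
    while i < len(seq):
        mix = mixes[turn % 2]
        n = 0
        while i < len(seq) and seq[i] in mix:
            n += 1
            i += 1
        counts.append(n)
        turn += 1
    return counts

def expand(counts, scheme):
    """List, for every base position, the mix that incorporated it."""
    mixes = SCHEMES[scheme]
    per_base = []
    for turn, n in enumerate(counts):
        per_base.extend([mixes[turn % 2]] * n)
    return per_base

def decode(counts_a, scheme_a, counts_b, scheme_b):
    """Recover the sequence from two encodings made with different schemes."""
    bases = []
    for mix_a, mix_b in zip(expand(counts_a, scheme_a), expand(counts_b, scheme_b)):
        # A mix from one scheme and a mix from another share exactly one base.
        (base,) = set(mix_a) & set(mix_b)
        bases.append(base)
    return "".join(bases)

# Usage: encode a template under two schemes, then decode it again.
template = "GATTACA"
run1 = encode(template, "AG/CT")
run2 = encode(template, "AC/GT")
assert decode(run1, "AG/CT", run2, "AC/GT") == template
```

Decoding is deterministic in this sketch because each base position falls into one mix per scheme, and the two mixes intersect in a single nucleotide.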
For each DNA sequence, two sets of encodings are randomly selected and decoded by the algorithm, and the recovered DNA sequence is compared with the original simulated sequence. All 1000 DNA sequences were tested, and a decoding accuracy of 100% was obtained.

[2] The sequencing simulation algorithm

The real-time decoding sequencing method does not essentially change how signal intensity is acquired and assessed, so the statistical distribution of signal strength is similar to that of conventional real-time sequencing platforms. The model of real-time decoding sequencing is established by studying the statistical distribution of signal strength on the 454 sequencing platform. The model uses a normal distribution to simulate positive signals and a log-normal distribution to simulate negative signals. A second sequencing simulation algorithm, modeled on the ART sequencing simulation algorithm, has also been implemented. It first simulates the sequence replication process by randomly breaking the sequence into fragments, and then simulates the sequencing process based on an empirical distribution. Both simulation algorithms have been tested on simulated data sets. The results indicate that the longer the "homopolymers" or "similar polymers", the lower the sequence quality and the more sequencing errors occur. These algorithms provide a simple simulation of the real-time decoding sequencing procedure, and they offer theoretical support for evaluating the validity and accuracy of the data processing algorithms and for predicting sequence information in real-time decoding sequencing.

[3] The sequencing data processing

① The alignment algorithms for resequencing

The real-time decoding sequencing method has the problem of "homopolymers" and "similar polymers".
As a result, traditional sequence alignment algorithms produce false matches, which affects downstream analysis. Based on the Smith-Waterman-Gotoh alignment algorithm, two algorithms that can recognize "homopolymers" or "similar polymers" are designed: Homopolymer-Aware-Smith-Waterman-Gotoh and Peak-Aware-Smith-Waterman-Gotoh. Homopolymer-Aware-Smith-Waterman-Gotoh treats "homopolymers" or "similar polymers" as a unit, and longer homopolymer segments receive a smaller gap penalty; its homopolymer penalty function is a linearly decreasing function. Peak-Aware-Smith-Waterman-Gotoh uses the peak value to improve the quality of sequence alignment, but its penalty function is not linear. In both methods, the homopolymer penalty is set in advance according to the reference sequence. The results show that both methods achieve the best match for a sequence and effectively avoid false matches. To improve alignment performance while maintaining the accuracy of the Smith-Waterman-Gotoh algorithm, a strategy similar to SSAHA is adopted: a hash table of the genome is built first and used to position seed sequences from the short sequences. The seeds are then extended with one of the alignment algorithms above, which recognizes "homopolymers" and "similar polymers", to reach the optimal alignment.

② The algorithms for reverse complementary sequences

Both strands of the template are sequenced in high-throughput DNA sequencing. However, the information from a single DNA strand cannot be used directly for alignment.
It must first be converted into its reverse complementary sequence. Simple algorithms for the reverse complement conversion are designed and have been tested on simulated data sets.

[4] The characteristic analysis algorithm

Like SOLiD™, real-time decoding sequencing can distinguish between "SNPs" and "sequencing errors". This study uses that feature to design a characteristic analysis algorithm for real-time decoding sequencing. First, the algorithm identifies all non-matching loci from pairwise alignments and excludes the invalid ones. Then, non-matching sites that do not meet the requirements are excluded by thresholds on sequence quality, on the average sequence quality in the neighborhood, on the distance from the alignment end, and on alignment quality. Finally, the distinction between "SNPs" and "sequencing errors" in real-time decoding sequencing is further optimized. Tests on simulated data sets show that the algorithm can distinguish true "SNPs" from "sequencing errors".

Keywords/Search Tags: real-time sequencing, encoding, decoding, sequence alignment, homopolymer, similar polymers, characteristic analysis
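The reverse complement conversion mentioned in [3]② can be sketched in a few lines of Python. This is a generic illustration, not the thesis implementation; the function name `reverse_complement` is hypothetical, and the handling of `N` and lowercase bases is an added assumption.

```python
# Translation table mapping each base to its complement.
# Assumption: N (unknown base) maps to itself and case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand direction."""
    return seq.translate(_COMPLEMENT)[::-1]

# Usage: applying the conversion twice restores the original strand.
read = "ATGCCGTA"
assert reverse_complement(reverse_complement(read)) == read
```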