Test And Comparison Of Software Suitable For RNA-seq Read Mapping Via Simulated And Real Reads

Posted on: 2016-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: W Tian
Full Text: PDF
GTID: 2180330503453050
Subject: Bio-engineering
Abstract/Summary:
RNA-seq, in use since 2008, works on the following principle: cDNA produced by reverse transcription is sequenced at random, and transcriptome sequence and expression information is obtained by assembling the reads or aligning them to a reference sequence. Over years of development, RNA-seq has helped scientists solve many problems at the transcription level, such as characterizing differences in the expression of one or several genes, detecting gene fusions, alternative splicing events, and base editing, and assembling and annotating novel transcripts. In recent years, lncRNA and circular RNA sequencing based on RNA-seq have drawn widespread attention and provide a new way to interpret life. RNA-seq has now entered a mature phase, and high-throughput sequencing has advanced alongside it. In 2008, one RNA-seq run could produce only tens of thousands of reads of 25 to 40 bp. With the rapid development of high-throughput sequencing, a single run now yields hundreds of millions or even billions of reads that are 100 bp or longer.
Compared with DNA sequencing, RNA-seq poses two challenges for alignment. First, an RNA sequence generally covers several exons that are separated by long introns in the genome. Reads from the HiSeq 2000 are 100 bp long, and the HiSeq 2500 and 4000 produce longer reads; a long read may span two or more exons, which makes alignment more difficult. Second, large numbers of RNA editing sites, indels, alternative splicing events, and fusion genes demand more robust alignment software. RNA-seq analysis begins with mapping reads against a reference genome to determine the location from which the reads originated [1,2]. In this process, the quality of the alignment software is often a prerequisite for the success of the entire project.
We evaluate alignment on three criteria: speed, demand on computer hardware (mainly memory), and accuracy. At present, the RNA-seq alignment software used in scientific research and production in China mainly includes SOAP, BWA, and TopHat. Professor Zhang of Ji’nan University introduced Fanse2, an alignment program that runs on Windows, and demonstrated that it performs outstandingly in speed, system requirements, and alignment accuracy, which also reflects China’s progress in bioinformatics. Existing comparative studies of RNA-seq alignment software include those by Brian J. Haas [3], Garber M [4], and Kvam VM [5], among others. However, these studies concentrated mainly on reviews of RNA-seq methods and on statistical methods for gene expression. In 2015, Steven Salzberg compared HISAT with three other programs and found that HISAT is faster and more accurate than the others, but his study did not include commonly used software such as SOAP and BWA. Building on Salzberg’s experiment, we used three models (no mismatch, mismatch, and indel) to analyze simulated sequencing data and real data. We comprehensively compared eight programs on mapping speed, alignment accuracy, and demand on computer performance. The results show that HISAT and STAR have the advantage in speed, that HISAT's alignment accuracy is significantly higher than that of the other programs, and that HISAT consumes the least memory. Overall, HISAT shows the strongest all-round performance.
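As an illustration of how alignment accuracy on simulated reads can be scored, the following is a minimal Python sketch and not the procedure used in the thesis. It assumes that each simulated read carries its true origin and that the aligner output has already been parsed into a read-to-position table; the names `score_alignments`, `truth`, `aligned`, and `tolerance` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (chromosome, 1-based leftmost coordinate); an assumed representation
Position = Tuple[str, int]


def score_alignments(
    truth: Dict[str, Position],
    aligned: Dict[str, Optional[Position]],
    tolerance: int = 0,
) -> Dict[str, float]:
    """Compare reported positions with the simulated origin of each read."""
    total = len(truth)
    mapped = 0
    correct = 0
    for read_id, (true_chrom, true_pos) in truth.items():
        hit = aligned.get(read_id)
        if hit is None:
            continue  # read missing from the output or reported as unmapped
        mapped += 1
        chrom, pos = hit
        # A read counts as correct if it lands on the true chromosome
        # within `tolerance` bases of the simulated start position.
        if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
            correct += 1
    return {
        "mapping_rate": mapped / total if total else 0.0,
        "accuracy": correct / total if total else 0.0,
        "precision": correct / mapped if mapped else 0.0,
    }


# Toy data (invented for illustration): four simulated reads, one unmapped.
truth = {
    "r1": ("chr1", 100),
    "r2": ("chr1", 500),
    "r3": ("chr2", 42),
    "r4": ("chr2", 900),
}
aligned = {
    "r1": ("chr1", 100),
    "r2": ("chr1", 503),
    "r3": None,
    "r4": ("chr2", 900),
}
scores = score_alignments(truth, aligned, tolerance=5)
print(scores)
```

A nonzero tolerance is one way to avoid penalizing small offsets caused by soft clipping or by indels near the read start. Running the same scoring separately on the no-mismatch, mismatch, and indel read sets would give one accuracy figure per model and per program.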
Tests on reads covering only one exon show that Fanse2 gives the best results when aligning reads containing mismatches or indels to the reference sequence, indicating that Fanse2 is better suited to gene expression analysis and to SNP and indel detection. To find out whether the mapping rate is influenced by genome size and degree of repetition, we also compared alignment results using human and C. elegans as the reference. The results show no significant difference when genome size and degree of repetition change. Using different test data, we compared from several aspects eight RNA-seq alignment programs commonly used in scientific research worldwide, aiming to provide insight and a practical reference for scientific research and production.
Keywords/Search Tags:RNA-seq sequencing, reads, gene expression, mapping, mismatch