
A New Method For Quickly Quantifying Gene Abundance Based On Sequencing Data

Posted on: 2021-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Lin
GTID: 2370330611498001
Subject: Biology
Abstract/Summary:
Since the birth of sequencing technology in the 1970s, decades of technological innovation have produced a landscape dominated by next-generation sequencing. High-throughput sequencing, which is based on next-generation sequencing technology, has become the most common and important tool in developmental and disease research. The popularity of high-throughput and deep sequencing places new demands on sequence alignment and gene quantification software.

The software currently used to process sequencing data falls into two types: alignment-based and alignment-free. Alignment-based software retains more information about the relationships between subsequences, but it consumes more time and memory, is more susceptible to mutations and sequencing errors, and generally requires a separate quantification tool. Alignment-free software uses far less time and fewer resources, and it avoids the effects of exchange and recombination of small fragments within a sequence. Overall, the accuracy of existing software still leaves room for improvement. We therefore set out to develop quantification software that improves accuracy while preserving speed.

We fragment the reference sequences and the reads into short fragments of a specific length and count the occurrences of each short fragment. The types and counts of the short fragments contained in each reference sequence form the characteristic spectrum (feature spectrum) of that reference sequence. We assume a linear relationship between the number of short fragments derived from a reference sequence and the number derived from the sequencing result; once the ratio between the two is found, the abundance of the reference sequence follows. We use the feature spectra of the reference sequences as the coefficient matrix and the short-fragment counts derived from the sequencing results as the result vector to construct a system of linear equations. The least squares method is then used to simplify and solve the system. We designed several adjustable parameters for this model and found their optimal values through a series of gradient experiments.

Tests on simulated data show that our software is more accurate overall than several existing mainstream alignment-free tools. It appears to be more accurate for genes with high and medium expression levels. For genes with low expression levels, it produces some false negative and false positive results. Our analysis of the results suggests a possible reason: these genes contain regions that are highly homologous to some highly expressed genes, and under the interference of those highly expressed genes their abundance may be overestimated or underestimated. We also measured the time and memory the software requires and compared them with those of other software. The results show that our software consumes relatively more time and memory when indexing and relatively little when quantifying. Finally, we made a preliminary extension to single-cell data processing, where the results show that our method has a greater advantage in time and memory use. Through the development of this software, we hope to improve the speed and accuracy of sequencing data processing to meet the need for rapid and accurate detection in research and medical treatment.
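The modelling idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis software: the function names, the toy sequences, the fragment length k=4, and the use of plain unconstrained least squares via NumPy are all assumptions made for illustration. The abstract does not state the actual fragment length, the tuned parameters, or how the system is simplified before solving.

```python
from collections import Counter

import numpy as np


def kmer_counts(seq, k):
    """Count every overlapping length-k fragment (k-mer) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def estimate_abundance(references, reads, k):
    """Hypothetical helper: solve A x ~= b by ordinary least squares.

    A[i, j] = count of k-mer i in reference j (the feature spectrum),
    b[i]    = count of k-mer i across all reads,
    x[j]    = estimated abundance of reference j.
    """
    names = list(references)
    spectra = [kmer_counts(references[name], k) for name in names]

    # Only k-mers present in at least one reference can be explained by the model.
    kmers = sorted(set().union(*spectra))
    row = {kmer: i for i, kmer in enumerate(kmers)}

    A = np.zeros((len(kmers), len(names)))
    for j, spectrum in enumerate(spectra):
        for kmer, count in spectrum.items():
            A[row[kmer], j] = count

    b = np.zeros(len(kmers))
    for read in reads:
        for kmer, count in kmer_counts(read, k).items():
            if kmer in row:  # k-mers absent from every reference are ignored
                b[row[kmer]] += count

    # Unconstrained least squares; estimates may be slightly negative on real data.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dict(zip(names, x))


# Toy data (assumed): each "read" is a full copy of a reference, 3 copies of
# geneA and 1 copy of geneB, so the linear model holds exactly.
refs = {"geneA": "ACGTACGGTCA", "geneB": "TTGACCGTAAC"}
reads = [refs["geneA"]] * 3 + [refs["geneB"]]
abundance = estimate_abundance(refs, reads, k=4)
print(abundance)
```

In this toy case the read counts are an exact linear combination of the two feature spectra, so the estimates come out at approximately 3 for geneA and 1 for geneB even though the two references share the fragment CGTA. With real reads, which are partial, noisy, and drawn from homologous genes, the fit is only approximate, which is consistent with the low-expression errors the abstract reports.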
Keywords/Search Tags: transcriptome sequencing, quantification, alignment-free, least squares method