Big Data Analysis And Application Of Forest Genomes

Posted on: 2019-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: X L Wang
GTID: 2393330590450347
Subject: Computer application technology
Abstract/Summary:
With the extensive development of large-scale sequencing and the application of next-generation sequencing technologies, more and more biological sequences and related information have been sequenced. Data management for biological genome sequences mainly covers data access, comparison, mining and research, and how to manage genomic data is a key issue for bioinformatics researchers. The genomic sequences of most higher organisms consist of a very large number of bases. However, sequencing experiments only obtain subsequences of a genome; the vast majority of the sequence cannot be acquired in a single pass. It is therefore necessary to use computer algorithms and corresponding software to guide genome sequence assembly. Useful information in the assembled genome is then mined using bioinformatics methods. This paper proposes three types of algorithms, for error correction of sequencing data, genome assembly and gene family identification. The main work is summarized as follows.

This paper proposes an improved IKNN algorithm, which takes short reads from second-generation sequencing as the samples and long reads from third-generation sequencing as the test data. Second-generation sequencing technology, currently the main production platform, has the advantage of producing high-throughput, high-accuracy sequencing data, but its reads are short. Third-generation sequencing technology, which is still evolving, produces longer reads with a high error rate. Designing algorithms and software to correct third-generation sequencing data is therefore a necessary step. The IKNN algorithm classifies a sample according to the K training samples nearest to it. With the optimal K value set, the short fragments are BLASTed against the long fragments and used to correct them. The algorithm not only yields third-generation sequencing data with high classification accuracy, but also provides a hybrid error correction and assembly approach that achieves high accuracy for third-generation sequencing data.

This paper presents the LSA algorithm, which is based on mixed assembly of second- and third-generation data. Second-generation sequencing technology has produced a large amount of sequencing data and has led to a number of genome assembly programs, making it a relatively mature way to obtain a complete genome. However, most genomes have features such as multiple repeats, high heterozygosity and multiple branches, so the correct path must be selected during assembly in order to obtain a high-precision genome. The principle of the LSA algorithm is to use third-generation long reads to guide the assembly path. This not only selects the path where branches emerge, but also avoids the problem of assembly being unable to continue for lack of path guidance. In this paper, the method was successfully used to assemble the chloroplast genome of Ziziphus jujuba and the mitochondrial genomes of Thellungiella parvula and Salix suchowensis (GenBank accession numbers: KU351660, KT988071 and NC029317.1). The structure and function of these three plant organelle genomes were further analyzed, which provides an important reference for future plant organelle research.

This paper proposes a gene family identification algorithm based on the HMM algorithm and designs a general workflow based on functional analysis. Transcription factors mainly regulate plant development and cell metabolism. In developmental regulation, the products encoded by transcription factor genes play an important role. Finally, this algorithm and the functional analysis workflow are used to mine the WOX gene family of Salix suchowensis. The results show that Salix suchowensis has 15 WOX genes. These members play an important role in the maintenance of stem cells, the development of lateral organs, the formation of floral organs and the development of embryos. Further functional analysis of the WOX gene family, including sequence analysis, chromosome location, structure and motif location, phylogenetic analysis and expression profiling, is conducive to revealing species differentiation, evolutionary history and gene function. It lays a solid foundation for studying the role of transcription factors in plant growth and resistance to adverse environments.
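
The idea of using long reads to guide the assembly path at a branch, as described above for the LSA algorithm, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `pick_branch` and `assemble`, the graph representation (each node is the segment's own sequence), the use of exact substring matching instead of error-tolerant alignment, and the rule that each segment is visited at most once are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def pick_branch(current: str, candidates: List[str], long_reads: List[str]) -> Optional[str]:
    """Return the candidate that the most long reads place directly after `current`.

    Assumption: support is counted by exact substring match; real long reads
    are noisy and would need an error-tolerant alignment instead.
    """
    support = {c: sum((current + c) in read for read in long_reads) for c in candidates}
    if not support:
        return None
    best = max(support, key=support.get)
    # With no supporting long read there is no path guidance, so stop.
    return best if support[best] > 0 else None


def assemble(start: str, graph: Dict[str, List[str]], long_reads: List[str]) -> str:
    """Walk the segment graph from `start`, resolving branches with long reads."""
    path = [start]
    visited = {start}
    node = start
    while True:
        # Simplification: a segment is used at most once (ignores true repeats).
        options = [n for n in graph.get(node, []) if n not in visited]
        if not options:
            break
        nxt = options[0] if len(options) == 1 else pick_branch(node, options, long_reads)
        if nxt is None:
            break
        path.append(nxt)
        visited.add(nxt)
        node = nxt
    return "".join(path)


# Toy example: "ACGT" branches to "TTGA" or "CCAG"; one long read spans the branch.
graph = {"ACGT": ["TTGA", "CCAG"], "TTGA": ["GGC"], "CCAG": []}
long_reads = ["GACGTTTGAGGCA"]
result = assemble("ACGT", graph, long_reads)
print(result)
```

In this toy case the long read contains "ACGT" followed directly by "TTGA", so the walk takes that branch and then extends through "GGC". When no long read spans a branch, the sketch stops extending rather than guessing.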
Keywords/Search Tags: IKNN Algorithm for Sequencing Data Correction, LSC Algorithm for Plant Organellar Genomes Assembly, HMM Algorithm for Identification of Transcription Factor Family