Lossless Reference DNA Data Compression Method Based On ICBDS Optimization

Posted on: 2020-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: S W Du
GTID: 2370330620951114
Subject: Computer Science and Technology
Abstract/Summary:
DNA is a polymer that stores the genetic information of living things, and research on DNA has become a hot topic. With the continuous development of high-throughput sequencing technology, the cost of sequencing keeps falling and the sequencing cycle keeps shortening, so the volume of DNA data is growing exponentially. Storing massive DNA data resources in a small amount of space has become a new challenge for biologists and computer scientists. In recent years, DNA data compression methods have been proposed, some to increase the compression ratio and some to reduce the compression time. The compression method proposed by Nour and Amr has a great advantage in compression time over previous methods, but it is limited to bacterial DNA data. In this paper, the RU (recently used) transform and the MG (merged) transform are proposed to improve that method, and two improved step-by-step compression methods are given. Each method consists of two compression stages. The main work of this paper is as follows:

(1) An RU transform is proposed for DNA data compression. The first compression stage performs a series of operations on the DNA data. First, the DNA data is converted into a binary file containing only 0s and 1s and a base sequence file in which adjacent characters always differ. The base sequence file is then passed through the RU transform to turn it into a file of small integers, which is converted into a binary file using the idea of Huffman coding. Finally, all the binary files are converted into ordinary character files. In the second compression stage, the general-purpose text compression algorithm LZ77 is used to compress all the resulting character files together.

(2) An MG transform is proposed for DNA data compression. The first compression stage performs a series of operations on the DNA data. First, the DNA data is converted into a binary file containing only 0s and 1s and a base sequence file containing only three distinct characters. Next, the MG transform converts the base sequence file into a binary file and a base sequence file of half the length. The resulting base sequence file is converted into a binary file using the idea of Huffman coding, and finally all the binary files are converted into ordinary character files. In the second compression stage, the general-purpose text compression algorithm LZ77 is used to compress all the resulting character files together.

For the two compression methods in this paper, test data commonly used by DNA data compression algorithms, taken from the GenBank database, is selected for experimental demonstration. The experimental results show that, compared with the method of Nour and Amr: for bacterial DNA data, the RU-based method saves more than 70% of both compression time and decompression time, but its compression rate is reduced by 1.5% on average; the MG-based method saves more than 50% of both compression time and decompression time, but its compression rate is reduced by 0.5% on average. For non-bacterial DNA data, both methods improve the compression rate while saving more than 20% of compression time and decompression time.
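The abstract names the steps of the RU-based pipeline but does not define them, so the following Python sketch is only an illustration under stated assumptions, not the thesis's implementation. It assumes that the "binary file" marks whether each base repeats its predecessor, that the RU transform behaves like a classic move-to-front transform (a well-known recency-based transform used here as a stand-in), and that Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) can stand in for the LZ77 second stage. All function names are hypothetical.

```python
import zlib


def split_runs(seq):
    """Assumed first step: flags[i] is 1 when seq[i] repeats seq[i-1].

    The returned base string keeps only the first base of each run,
    so its adjacent characters always differ.
    """
    flags, bases, prev = [], [], None
    for ch in seq:
        if ch == prev:
            flags.append(1)
        else:
            flags.append(0)
            bases.append(ch)
        prev = ch
    return flags, "".join(bases)


def merge_runs(flags, bases):
    """Inverse of split_runs: rebuild the original sequence."""
    out, it = [], iter(bases)
    for flag in flags:
        out.append(out[-1] if flag else next(it))
    return "".join(out)


def mtf_encode(bases, alphabet="ACGT"):
    """Move-to-front, used here as a stand-in for the RU transform."""
    table, out = list(alphabet), []
    for ch in bases:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))  # most recently used base moves to front
    return out


def mtf_decode(indices, alphabet="ACGT"):
    """Inverse of mtf_encode."""
    table, out = list(alphabet), []
    for i in indices:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


def second_stage(data: bytes) -> bytes:
    """Assumed second stage: general-purpose compression via zlib (DEFLATE)."""
    return zlib.compress(data, 9)
```

For example, `split_runs("AACCCGTTA")` yields the base string `"ACGTA"` plus a flag list, and `mtf_decode(mtf_encode("ACGTA"))` returns `"ACGTA"` again. The Huffman-style packing of the integer sequence and the conversion to character files are omitted because the abstract gives no detail on them.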
Keywords/Search Tags: DNA Data, RU, MG, LZ77 Algorithm, Compression, Decompression