
Research on Compression Algorithms for Structure-Aware Compressive Data

Posted on: 2015-05-16  Degree: Master  Type: Thesis
Country: China  Candidate: P H Li
GTID: 2180330452464067  Subject: Information and Communication Engineering
Abstract/Summary:
With growing demand and the continuous development of information technology, genome sequencing and spectral imaging technologies are also developing rapidly. These technical advances provide excellent services to the public. However, with growing demand and constant change in modern industry, we have entered the age of “Big Data”. Faced with limits on storage capacity, network bandwidth, battery life, resolution and computing capacity, efficient signal sampling and compression of structure-aware compressive data are becoming important concerns for researchers. In bioinformatics, the different genome sequence formats are related to each other. FASTA is a text format produced after sequencing that represents nucleotide or amino acid sequences. The SAM/BAM format (Sequence Alignment/Map format) contains complete genome alignment information for downstream analysis and provides a common format for comparing different sequencing platforms. The 21st century is the biomedical century. With the development of sequencing technology and the rapid expansion of sequencing organizations, existing genetic data contain massive redundancy, especially data in FASTA and SAM/BAM format. Meanwhile, as compressive sensing matures, researchers are studying it in a wide variety of fields, of which multi-spectral imaging is one important branch. Whether it is genome data in FASTA and SAM/BAM format or the sampled measurements obtained through the Coded Aperture Snapshot Spectral Imaging (CASSI) system, each kind of data has its own compressible structure. Designing compression algorithms adapted to these structures is a new challenge in signal processing.

Accordingly, this thesis presents a two-pass lossless genome compression algorithm for the FASTA format, based on non-sequential contextual models and the Maximum Entropy Principle (MEP).
The first pass handles genome compression both with and without reference sequences, using a dictionary method to substitute repeats within or between sequences and thereby improve compression efficiency. In the second pass, we introduce non-sequential contextual models, which are better suited to gene sequences with non-traditional regularity, and combine them with traditional sequential contextual models to improve the diversity and comprehensiveness of the model set. Meanwhile, unlike the Bayesian averaging method, which with MAP (Maximum A Posteriori) estimation tends to favor only one model among many, a logistic regression model based on MEP is proposed to combine the contextual models. The corresponding paper, “DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique”, has been published in PLoS ONE.

For the SAM/BAM format, this thesis proposes a hierarchical multi-reference genome compression algorithm. It takes a SAM file sorted by position within the reference, extracts the 11 mandatory fields and the variable number of optional fields into 12 separate files, and compresses these files in parallel. For the “Sequence” field, we increase the rate of exactly mapped reads in the target sequence by taking advantage of several public reference sequences, gradually shortening the unaligned reads, and realigning the shortened reads. For the “Quality Value” field, we further propose a lossy quantization approach using the k-means clustering algorithm, in which users can set the compression grade k and investigate its impact on downstream applications. For the remaining ten fields, we explore their self-regularity and interrelationships and then adopt an appropriate compression algorithm for each. Compared with existing schemes, the program not only improves compression efficiency but also offers a choice of compression grades, making it more adaptable and scalable.
The corresponding paper, “HUGO: Hierarchical mUlti-reference Genome cOmpression For Aligned Reads”, has been published in the Journal of the American Medical Informatics Association.

To store more useful information in the compressed measurements, compressive sensing requires an incoherent sampling matrix, so that the sampled data are not themselves sparse in the sparsifying basis. The measurements obtained with the sensing matrix are therefore no longer redundant or compressible. However, in multi-spectral imaging systems, researchers still hope to compress the compressive sensing measurements further to ensure real-time transmission in applications such as environmental remote sensing, astrophysics and military target discrimination. Accordingly, this thesis presents a lossless compression scheme targeted at CASSI. Using a conditional entropy minimization model, we convert the compression problem into the search for a reversible transformation of the measurement matrix that exposes more redundancy and correlation to compress. Through statistical modeling, the transformed measurements, based on a mean filter transformation and the coded aperture, are shown to approximate the distribution of the original spectral images more closely. Meanwhile, bit-plane coding is applied to the transformed measurements by taking advantage of the known coded aperture. The performance of this embedded coder on random compressive measurements is evaluated through experiments. The corresponding paper, “Embedded Transform Coding based Lossless Compression in Compressive Spectral Imaging with Coded Aperture”, has been accepted by the Data Compression Conference (DCC’2014).
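
As a rough illustration of the bit-plane idea mentioned for the CASSI measurements, the sketch below splits a list of non-negative integers into bit planes and rebuilds them without loss. It is only a generic decomposition under stated assumptions: the function names `to_bit_planes` and `from_bit_planes` and the sample values are hypothetical, and the sketch does not reproduce the thesis's mean filter transformation, its use of the known coded aperture, or its entropy coder.

```python
from typing import List


def to_bit_planes(values: List[int], num_bits: int) -> List[List[int]]:
    """Split non-negative integers into bit planes, most significant plane first."""
    if any(v < 0 or v >= (1 << num_bits) for v in values):
        raise ValueError("values must fit in num_bits unsigned bits")
    # Plane for bit b holds bit b of every value, in the original order.
    return [[(v >> b) & 1 for v in values] for b in range(num_bits - 1, -1, -1)]


def from_bit_planes(planes: List[List[int]]) -> List[int]:
    """Rebuild the integers from planes ordered most significant first."""
    values = [0] * len(planes[0])
    for plane in planes:
        # Shift what has been accumulated and append the next bit of each value.
        values = [(v << 1) | bit for v, bit in zip(values, plane)]
    return values


# Made-up sample values standing in for (transformed) measurements.
measurements = [5, 12, 7, 0, 9]
planes = to_bit_planes(measurements, num_bits=4)

# The decomposition is reversible, as a lossless scheme requires.
assert from_bit_planes(planes) == measurements
```

An embedded coder would then encode the planes from most to least significant, so that truncating the stream still leaves a coarse version of every value; how each plane is modeled and entropy coded is where a scheme such as the one described above would exploit the coded aperture.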
Keywords: Genome compression, FASTA, SAM/BAM, Compressive sensing, Coded aperture, Multi-spectral imaging, Transform coding, Bit-plane coding