
Double-byte text compression

Posted on: 2002-09-07
Degree: Ph.D
Type: Thesis
University: Hong Kong Polytechnic (People's Republic of China)
Candidate: Cheng, Kwok Shing
GTID: 2468390014450389
Subject: Computer Science

Abstract/Summary:
This thesis describes the inefficiency of conventional compression algorithms for double-byte text and proposes efficient algorithms that improve compression performance for double-byte languages in general, rather than targeting any specific language.

In this thesis, a survey of commonly used text compression algorithms is presented first. Then, the characteristics of double-byte text and important terminology related to text compression are introduced. Since the properties of double-byte text differ from those of ASCII text, conventional compression algorithms achieve unsatisfactory compression performance on double-byte text. The weaknesses of these algorithms are explained in detail.

To evaluate the performance of both the conventional and the proposed compression algorithms accurately, we have built representative corpora, totalling over 40 Mbytes, for Chinese, Japanese and Korean, the best-known and most commonly used double-byte languages in Asia. The corpora cover a wide range of areas and topics, and they will be released for public use.

Throughout this thesis, a systematic study is carried out, progressing from the character-based compression model, through the word-based compression model, to the cascading compression model. The main concern is to propose algorithms that improve compression ratios; the execution time and memory consumption of the algorithms are also considered. In this project, all compressors are run on a Sun UltraSPARC 5 workstation with a 400-MHz UltraSPARC III processor and 64 Mbytes of memory, under the Solaris operating system.

In the character-based model, a new algorithm called the indicator-dependent Huffman coding scheme (IDC) is proposed. IDC uses multiple Huffman trees to encode the input text. It obtains the best average compression ratio, 1.91, among the variations of the Huffman coding scheme, and outperforms PACK, a UNIX Huffman-based compressor, by 43%.

Another encoding algorithm, the arithmetic coding scheme, falls under the same category as the Huffman coding scheme; both belong to the family of statistical compression algorithms.

In the word-based compression model, the well-known compression algorithms LZSS and LZW are chosen for the experiments.

In the cascading compression model, we propose to combine the advantages of the above two models to form a better compression model.

The last part of this project studies the compression model for Hypertext Markup Language (HTML). (Abstract shortened by UMI.)
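The abstract does not spell out how IDC selects among its multiple Huffman trees, so the sketch below only illustrates the general idea of using more than one Huffman table for double-byte text: one table for lead bytes and one for trail bytes, so that the two byte positions are modelled by separate statistics instead of a single combined tree. The function names (huffman_code, encode_double_byte) and the Big5 sample input are illustrative assumptions, not taken from the thesis.

# Minimal sketch (not the thesis's exact IDC scheme): Huffman coding with
# two code tables, one for lead bytes and one for trail bytes of a
# double-byte text stream.
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Return a dict {symbol: bitstring} for the given frequency table."""
    if len(freqs) == 1:                       # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    tiebreak = count()                        # unique tiebreaker so dicts are never compared
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # merge the two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encode_double_byte(data: bytes) -> str:
    """Encode, assuming the input is a pure sequence of 2-byte characters."""
    lead = Counter(data[0::2])                # statistics of first bytes
    trail = Counter(data[1::2])               # statistics of second bytes
    lead_tbl, trail_tbl = huffman_code(lead), huffman_code(trail)
    bits = []
    for i in range(0, len(data) - 1, 2):
        bits.append(lead_tbl[data[i]])        # lead byte uses the lead-byte table
        bits.append(trail_tbl[data[i + 1]])   # trail byte uses the trail-byte table
    return "".join(bits)

if __name__ == "__main__":
    sample = "雙位元組文字壓縮測試".encode("big5")    # Big5: 2 bytes per character
    bits = encode_double_byte(sample)
    print(len(sample) * 8, "input bits ->", len(bits), "coded bits")

A real compressor would also have to transmit the code tables (or rebuild them adaptively) and handle mixed single-byte/double-byte input; the sketch omits both for brevity.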
Keywords/Search Tags: Compression, Text, Huffman coding scheme