
Double-byte text compression

Posted on: 2002-09-07
Degree: Ph.D
Type: Thesis
University: Hong Kong Polytechnic (People's Republic of China)
Candidate: Cheng, Kwok Shing
GTID: 2468390014450389
Subject: Computer Science

Abstract/Summary:
This thesis describes the inefficiency of conventional compression algorithms for double-byte text and proposes efficient algorithms that improve compression performance for double-byte languages in general, rather than targeting any specific language.

In this thesis, a survey of commonly used text compression algorithms is presented first. Then, the characteristics of double-byte text and important terminology related to text compression are introduced. Since the properties of double-byte text differ from those of ASCII text, conventional compression algorithms achieve unsatisfactory compression performance on double-byte text. The weaknesses of these algorithms are explained in detail.

To evaluate the performance of both the conventional and the proposed compression algorithms accurately, we have built representative corpora, totalling over 40 Mbytes, for Chinese, Japanese and Korean, the best-known and most commonly used double-byte languages in Asia. The corpora cover a wide range of areas and topics, and they will be released for public use.

Throughout this thesis, a systematic study is carried out, progressing from the character-based compression model, through the word-based compression model, to the cascading compression model. The main concern is to propose algorithms that improve compression ratios; the execution time and memory consumption of the algorithms are also considered. In this project, all compressors are run on a Sun UltraSPARC 5 workstation with a 400-MHz UltraSPARC III processor and 64 Mbytes of memory, under the Solaris operating system.

In the character-based model, a new algorithm called the indicator-dependent Huffman coding scheme (IDC) is proposed. IDC uses multiple Huffman trees to encode the input text. It obtains the best average compression ratio, 1.91, among the variations of the Huffman coding scheme, and outperforms PACK, a UNIX Huffman-based compressor, by 43%.

Another encoding algorithm, the arithmetic coding scheme, falls under the same category as the Huffman coding scheme; both belong to the family of statistical compression algorithms.

In the word-based compression model, the well-known compression algorithms LZSS and LZW are chosen for the experiments.

In the cascading compression model, we propose to combine the advantages of the above two models to form a better compression model.

The last part of this project studies the compression model for Hypertext Markup Language (HTML). (Abstract shortened by UMI.)
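The abstract does not spell out how IDC selects among its multiple Huffman trees, so the sketch below only illustrates the general idea of using more than one Huffman table for double-byte text: one table for lead bytes and one for trail bytes, so that the two byte positions are modelled by separate statistics instead of a single combined tree. The function names (huffman_code, encode_double_byte) and the Big5 sample input are illustrative assumptions, not taken from the thesis.

# Minimal sketch (not the thesis's exact IDC scheme): Huffman coding with
# two code tables, one for lead bytes and one for trail bytes of a
# double-byte text stream.
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Return a dict {symbol: bitstring} for the given frequency table."""
    if len(freqs) == 1:                       # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    tiebreak = count()                        # unique tiebreaker so dicts are never compared
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # merge the two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encode_double_byte(data: bytes) -> str:
    """Encode, assuming the input is a pure sequence of 2-byte characters."""
    lead = Counter(data[0::2])                # statistics of first bytes
    trail = Counter(data[1::2])               # statistics of second bytes
    lead_tbl, trail_tbl = huffman_code(lead), huffman_code(trail)
    bits = []
    for i in range(0, len(data) - 1, 2):
        bits.append(lead_tbl[data[i]])        # lead byte uses the lead-byte table
        bits.append(trail_tbl[data[i + 1]])   # trail byte uses the trail-byte table
    return "".join(bits)

if __name__ == "__main__":
    sample = "雙位元組文字壓縮測試".encode("big5")    # Big5: 2 bytes per character
    bits = encode_double_byte(sample)
    print(len(sample) * 8, "input bits ->", len(bits), "coded bits")

A real compressor would also have to transmit the code tables (or rebuild them adaptively) and handle mixed single-byte/double-byte input; the sketch omits both for brevity.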
Keywords/Search Tags: Compression, Text, Huffman coding scheme