| Automatic Cantonese text segmentation is important for Cantonese information processing: a practical segmentation system would help advance word-level and even sentence-level processing. Because word segmentation is inherently subjective, this paper aims to develop a practical Cantonese word segmentation standard, with reference to the national standard GB13715. Unlike Putonghua, written Cantonese has long lacked standardization, with serious problems such as variant characters and the mixed use of traditional and simplified characters. The existence of variant characters gives rise to variant words and impedes the development of Cantonese information processing. Building on previous studies, this paper collates the frequently used Cantonese variant characters and compiles them into a variant character list. It also designs a set of part-of-speech tags for Cantonese information processing, with reference to existing part-of-speech tag sets for Cantonese and Putonghua, and completes the Cantonese word segmentation specification. We tested the specification on self-built and existing corpora. The results show that the specification is practical and can serve as a basis for developing a software system for automatic Cantonese text segmentation. |