| This paper studies heterogeneous data problems in Chinese part-of-speech (POS) tagging. Chinese word segmentation and POS tagging (Chinese S&T) is a basic task in Chinese natural language processing that underlies syntactic and semantic analysis. Many systems and algorithms, such as dialogue systems, information retrieval, information extraction, and keyword extraction, rely on the results of segmentation and POS tagging. With the popularization of the Internet and the rising interest of researchers, the amount of heterogeneous data is increasing. This paper studies heterogeneous data from two aspects: heterogeneous target data and heterogeneous training data. The main problem of heterogeneous target data in POS tagging is that the objects to be labeled are themselves heterogeneous. Modern Chinese articles and conversations often include a few English words, especially in emails and Internet literature, so analyzing Chinese-English mixed texts has become an important and challenging topic. The underlying problem is how to assign POS tags to the embedded English words. Owing to the lack of a specially annotated corpus, most English words are tagged with the oversimplified type "foreign words". In this paper, we present a method that uses dynamic features to tag the POS of mixed texts and exploits word-level information to boost the tagger. Training on synthetic data overcomes the lack of an annotated mixed-text corpus, and the "unified label" reduces the influence of out-of-vocabulary (OOV) words. Experiments show that our method achieves higher performance than traditional sequence labeling methods. Meanwhile, our method also improves POS tagging for pure Chinese texts. Research on heterogeneous training data mainly concerns how to make better use of heterogeneous corpora to boost the performance of tasks such as segmentation and POS tagging. 
Recently, exploiting heterogeneous annotation corpora for Chinese S&T has attracted growing research interest. In this paper, we propose a unified model for Chinese S&T with heterogeneous annotation corpora. We first automatically construct a loose and uncertain mapping between two representative heterogeneous corpora, Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD). We then regard Chinese S&T on the heterogeneous corpora as two "related" tasks and train our model on both corpora simultaneously. Experiments show that our method boosts performance on both corpora by using the shared information and achieves significant improvements over state-of-the-art methods. This paper makes two major contributions: 1. It utilizes word-level information through dynamic features and overcomes the OOV and lack-of-annotated-corpus problems in POS tagging for heterogeneous Chinese-English mixed text. 2. It proposes a model that can be trained on two heterogeneous corpora simultaneously and boosts performance on both by using the shared information. |