Text is the most widely used form of digital content, and text comprehension is the foundation of understanding multimedia content. Document classification is one of the most important methods for organizing and exploiting information that exists in unstructured textual form, and it is a widely studied research area in natural language processing. However, we are often interested in more than pure classification: we would also like to understand document topics, paragraph labels, and keywords. For example, in TV news labeling, we not only need the classification labels of news scripts, but also want to understand the topics and keywords of scenes and news stories. In complaint processing, we not only need the complaint category, but also want to understand the keywords of complaints and the trending issues of recent days. In such cases, we need interpretable and flexible classification methods. Traditional machine learning approaches, e.g., decision trees and k-nearest neighbors, are interpretable but perform relatively poorly on document classification. Deep learning methods, e.g., CNNs and RNNs, achieve outstanding classification performance but do not facilitate a comprehensive theoretical understanding of what is learned. Recently, topic modeling approaches, e.g., Latent Dirichlet Allocation (LDA), have gained the attention of researchers as interpretable, flexible, and high-performance classification approaches. As one of the most powerful tools in text mining and natural language processing, topic modeling comprises a family of methods for uncovering latent structure in text corpora, and has been widely applied in information retrieval, media analysis, keyword discovery, text classification, software engineering, etc. However, standard LDA is a completely unsupervised algorithm: it cannot support supervised document classification and does not exploit prior knowledge, such as observed document labels, which can significantly improve classification accuracy. Consequently, there is growing interest in incorporating prior information into customizations of topic modeling, which raises many challenges, e.g., efficiently supporting multi-label document classification, handling the label noise that is widespread in real-world applications, and supporting semi-supervised classification. Motivated by these challenges, this thesis conducts the following research. The main novelties and contributions are summarized as follows:

(1) To efficiently support multi-label document classification, this thesis proposes a novel supervised topic model, Twin Labeled LDA (TL-LDA), which runs two parallel topic modeling processes: one incorporates prior label information through hierarchical Dirichlet distributions, and the other models grouping tags, which carry prior knowledge about label correlations. The two processes are independent of each other, so TL-LDA can be trained efficiently with multi-threaded parallel computing. Quantitative experiments against state-of-the-art approaches, e.g., Dependency-LDA, demonstrate that the proposed model achieves the best scores on single-label classification and the best scores on three of four metrics for multi-label classification, on both non-power-law and power-law datasets. In summary, benefiting from the cooperation of its twin sub-models, TL-LDA delivers outstanding performance and generalizability on document classification while remaining concise and efficient.

(2) Few studies address topic modeling under label noise, which is widespread in real-world applications. To address this issue, this thesis proposes two robust topic models for document classification: Smoothed Labeled LDA (SL-LDA) and Adaptive Labeled LDA (AL-LDA). SL-LDA extends Labeled LDA (L-LDA), a classical supervised topic model, and overcomes L-LDA's tendency to overfit noisy labels through Dirichlet smoothing. AL-LDA is an iterative optimization framework built on SL-LDA: at each iteration, the Dirichlet prior, which incorporates the observed labels, is updated by a concise algorithm, so that the influence of noisy labels is reduced over successive iterations. The updating algorithm, based on the maximum-entropy and minimum-cross-entropy principles, is an effective and unified approach that optimizes the Dirichlet prior over both correctly labeled and mislabeled documents. In other words, it avoids explicitly identifying noisy labels, a common difficulty in label-noise cleaning algorithms. Quantitative experiments under Noisy Completely at Random (NCAR) and Multiple Noisy Sources (MNS) settings on the original datasets, including single-label and multi-label collections, demonstrate that our models perform strongly under noisy labels. In particular, AL-LDA has significant advantages over state-of-the-art topic modeling approaches, e.g., Dependency-LDA, under massive label noise.

(3) To support semi-supervised document classification, this thesis proposes a novel supervised topic modeling approach, Neural Labeled LDA (NL-LDA). Supervised topic models incorporate prior knowledge into the generative procedure to improve classification performance; however, such customizations are limited by the cumbersome derivation of a specific inference algorithm for each modification. To address this issue, NL-LDA builds on the variational autoencoder (VAE) framework, treated as a black-box inference method. To incorporate prior information, NL-LDA has a special weighted generative network that allows a variety of extensions for incorporating prior knowledge. The proposed model supports semi-supervised learning based on the manifold assumption and the low-density assumption. Quantitative experiments demonstrate that our model outperforms the compared approaches, e.g., SCHOLAR and MCCTM, including traditional statistical and neural topic models, on both supervised and semi-supervised document classification. In particular, NL-LDA performs significantly well on semi-supervised classification with only a small amount of labeled data.

(4) Traditional TV news labeling relies on human annotators, whose accuracy and efficiency cannot meet the requirements of convergent media. Recently, automatic TV news labeling methods based on artificial intelligence have begun to attract industrial attention. To address this challenge, this thesis proposes an automated end-to-end pipeline for story segmentation and content labeling in TV news broadcasts. In particular, we apply supervised topic modeling approaches, which not only support interpretable news script classification, but also provide topics and keywords for scenes and news stories. The proposed topic models both improve labeling performance and assist the story segmentation process. Specifically, this thesis uses TL-LDA for multi-label news script classification, improves robustness with AL-LDA, and supports semi-supervised news script classification with NL-LDA. These techniques have been successfully deployed at TV stations, e.g., CCTV and SZMG.
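The label-informed Dirichlet priors that the above models build on can be illustrated with a minimal sketch. In L-LDA, a document's topic prior places mass only on its observed labels, so a wrongly observed label locks the model onto a wrong topic; Dirichlet smoothing, as used in SL-LDA, adds a small uniform mass so every topic remains reachable. The function names and the constants `alpha` and `epsilon` below are illustrative assumptions for exposition, not the specific formulation or values used in this thesis.

```python
import numpy as np

def labeled_lda_prior(labels, num_topics, alpha=1.0):
    # L-LDA-style prior: topic mass only on the document's observed
    # labels; topics outside the label set can never be assigned.
    prior = np.zeros(num_topics)
    prior[list(labels)] = alpha
    return prior

def smoothed_prior(labels, num_topics, alpha=1.0, epsilon=0.1):
    # Dirichlet-smoothed variant (the SL-LDA idea): a small uniform
    # mass epsilon keeps every topic reachable, which softens the
    # effect of a wrong (noisy) label in the observed set.
    return labeled_lda_prior(labels, num_topics, alpha) + epsilon

# A document with observed labels {0, 2} out of 4 topics:
hard = labeled_lda_prior({0, 2}, num_topics=4)  # [1.0, 0.0, 1.0, 0.0]
soft = smoothed_prior({0, 2}, num_topics=4)     # [1.1, 0.1, 1.1, 0.1]
```

Under the hard prior, topics 1 and 3 have zero prior mass and can never explain the document's words; under the smoothed prior they retain a small probability, which is what allows the model to recover when an observed label is noisy.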