Weakly-supervised Text Classification (WTC) is an important task in Natural Language Processing that can significantly reduce the labor cost of human annotation required in the fully supervised setting. It adopts category words as the only supervision and employs weakly-supervised methods to achieve results competitive with supervised methods. WTC mainly consists of two steps: 1) generating category words as supervision; 2) training weakly-supervised classification models. In the category-word generation stage, existing methods mostly rely on directly using the label description words or on generating words with conventional shallow topic models. Compared with these methods, Neural Topic Models (NTMs) can learn topical words that are representative of and semantically related to each class, and therefore provide more accurate category words as supervision. Nevertheless, even with confident category words, the weak supervision remains noisy, which makes it difficult to learn reliable classification models. It is therefore necessary to design weakly-supervised classification methods that can handle the weak supervision and refine it during classification.

To address these issues, we investigate the two stages of WTC separately, and the main research consists of three parts. First, we study NTM methods that generate confident category words as weak supervision. Second, we study weakly-supervised methods that learn reliable classification models from the weak supervision. Finally, we combine the two sets of approaches: the category words generated by our proposed NTM methods serve as weak supervision for training classification models with our proposed weakly-supervised methods. Our research goal is weakly-supervised text classification that incorporates neural topic models for supervision generation. The detailed contributions are as follows:

1. Neural topic modeling methods can provide category words that sufficiently represent each class. However, existing methods do not extract semantic information adequately, resulting in poor semantic expressiveness of the generated topical words. To address three aspects of this problem, this thesis makes the following three contributions:

(1) Due to the extreme sparsity of short text data, standard neural topic models struggle to obtain sufficient word co-occurrences to learn semantically informative topic representations. To address the lack of semantic information in topics, we propose a Dual Word Graph Topic Model (DWGTM) for short text datasets. DWGTM learns semantic topic representations and alleviates the sparsity problem of short texts by simultaneously modeling a global word co-occurrence graph and a word semantic correlation graph. Experimental results indicate that the model learns topic representations that balance coherence and semantic information.

(2) Some existing methods build the decoder from the inner product of topic and word vectors, so words that are close in the semantic space receive similar weights in the topic representations, which severely limits the expressive power of topics. To address the issue of topic redundancy, we propose a Generative Model with Nonlinear Neural Topics (GMNNT) for general text datasets. GMNNT constructs an independent neural network for each topic to capture the nonlinear relationships among words in the semantic space, resulting in more accurate topic representations, as illustrated by the sketch below. Experimental results demonstrate that the model learns reliable topic representations and significantly alleviates the topic semantic redundancy of existing methods.
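To make the contrast concrete, the following is a minimal sketch, under assumed tensor sizes and network widths, of the difference between the inner-product decoder used by many NTMs and a per-topic nonlinear decoder in the spirit of GMNNT; it is an illustration only, not the implementation used in this thesis.

```python
# Illustrative sketch (not the thesis implementation): an inner-product topic
# decoder versus per-topic nonlinear decoders. Sizes are placeholder assumptions.
import torch
import torch.nn as nn

K, V, D = 20, 5000, 100                              # topics, vocabulary size, embedding dim
word_emb = nn.Parameter(torch.randn(V, D))           # shared word embeddings

# (a) Inner-product decoder used by many NTMs: words with similar embeddings
#     necessarily receive similar weights in every topic.
topic_emb = nn.Parameter(torch.randn(K, D))
beta_linear = torch.softmax(topic_emb @ word_emb.T, dim=-1)        # K x V topic-word matrix

# (b) Per-topic nonlinear decoder: each topic owns a small network over the
#     word embeddings, so nearby words can still get different topic weights.
topic_nets = nn.ModuleList(
    [nn.Sequential(nn.Linear(D, 64), nn.Tanh(), nn.Linear(64, 1)) for _ in range(K)]
)
beta_nonlinear = torch.softmax(
    torch.cat([net(word_emb).T for net in topic_nets], dim=0), dim=-1
)                                                                   # K x V topic-word matrix

# Either matrix can reconstruct a document's bag of words from its topic mixture:
theta = torch.softmax(torch.randn(1, K), dim=-1)     # document-topic proportions
recon = theta @ beta_nonlinear                       # predicted word distribution (1 x V)
```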
(3) For text data with network links, the generated topics often fail to capture semantic information and discriminability at the same time. To address this issue, we propose a Layer-Assisted Neural Topic Model (LANTM). LANTM uses two channels to encode the text contents and the network links separately, and designs a layer-wise aggregation method to interactively learn topic representations enriched by both types of information. Experimental results demonstrate that the model learns topic representations with rich semantic information and strong discriminative power.

2. Most existing weakly-supervised methods adopt self-training to learn classification models, which struggles to handle extremely sparse weak supervision. To address two aspects of this problem, this thesis makes the following two contributions:

(1) To address the issue of extremely limited weak supervision, we propose a Weakly-supervised Text Classification method with Wasserstein Barycenter Regularization (WTC-WBR) for the multi-class classification task. WTC-WBR introduces a general weak self-training method that supplements and iteratively refines the scarce supervision by combining the initial supervision with the model predictions. In addition, WTC-WBR incorporates a Wasserstein barycenter regularization to constrain the deep text features in a geometric space. Experimental results indicate that WTC-WBR can effectively utilize the scarce supervision to learn reliable WTC models.

(2) To address the strong sparsity of weak supervision in the multi-label scenario, we propose a Category Word Selection method with Significance Ranking and Crowd-sourcing (CWS-SRC) and a Weakly-supervised Multi-Label Text Classification method with Correlation-aware Label Propagation (WMLTC-CLP). CWS-SRC generates category words that sufficiently represent category information through significance ranking over the topical words. WMLTC-CLP introduces a general text correlation-aware label propagation method that supplements and iteratively updates the sparse weak supervision using text neighbors and label correlations. Experimental results indicate that CWS-SRC generates more accurate category words and that WMLTC-CLP effectively supplements and refines the extremely sparse weak supervision.

3. To combine the above two lines of work, we use the category words generated by our proposed NTM methods as weak supervision and train the classification models with our proposed weakly-supervised classification methods. The details are as follows. First, based on the category-word generation framework CWS-SRC, we generate category words with the three NTM methods proposed in the first part, DWGTM, GMNNT and LANTM. Second, we use these category words as weak supervision, train the classification models with the weakly-supervised methods WTC-WBR and WMLTC-CLP proposed in the second part, and validate the performance. Experimental results indicate that, compared with category words generated by other NTM methods and other settings, the category words generated by our NTMs lead to better classification results, showing that incorporating neural topic models into weakly-supervised text classification is feasible and effective.
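To make the combined pipeline concrete, the following is a minimal sketch, with toy documents, assumed seed category words, and an assumed mixing weight alpha, of how category words can serve as the only supervision and be iteratively refined through self-training in the spirit of WTC-WBR; the Wasserstein barycenter regularization and the label-propagation components are omitted, and the sketch is not the thesis implementation.

```python
# Illustrative sketch (not the thesis implementation) of the overall pipeline:
# category words act as the only supervision, and a self-training loop mixes
# the initial seed supervision with model predictions to refine pseudo-labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the team won the match", "stocks fell after the report",
        "the striker scored twice", "the market rallied on earnings"]
category_words = {0: ["team", "match", "striker"],       # e.g. a sports class
                  1: ["stocks", "market", "earnings"]}   # e.g. a finance class

def seed_scores(doc):
    # Initial weak supervision: soft labels from category-word occurrence counts.
    counts = np.array([sum(w in doc.split() for w in ws)
                       for ws in category_words.values()], dtype=float)
    return counts / counts.sum() if counts.sum() > 0 else np.full(len(category_words), 0.5)

seed = np.stack([seed_scores(d) for d in docs])           # N x C soft seed labels
X = TfidfVectorizer().fit_transform(docs)

pseudo = seed.copy()
alpha = 0.5                                               # weight of the seed supervision (assumed)
for _ in range(3):                                        # self-training rounds
    clf = LogisticRegression(max_iter=1000).fit(X, pseudo.argmax(axis=1))
    pred = clf.predict_proba(X)                           # current model beliefs
    pseudo = alpha * seed + (1 - alpha) * pred            # refine while keeping the seeds in play

print(pseudo.argmax(axis=1))                              # final pseudo-label assignment
```

In this sketch the seed soft labels could just as well come from topical words produced by a neural topic model, which is the role DWGTM, GMNNT, and LANTM play in the combined experiments.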