In recent years, massive amounts of information have become available on the internet, and large volumes of news are published by all kinds of mass media. In the finance domain in particular, people tend to pay close attention to internet news, as enterprises are highly sensitive to their reputation. However, while a great deal of news is available, not all of it is worth reading, let alone detailed analysis: much financial news consists of advertisements or routine reports, which increases the time cost of finding relevant information. At the same time, proper analysis of valuable financial news can bring considerable commercial value. Effectively identifying such news before analysis would therefore save substantial work and free analysts from reading huge amounts of irrelevant news. It is thus well worth researching how to effectively identify potentially valuable financial news among the large volume of news on the internet.

In this paper, we identify valuable financial news with text classification methods. Unlike traditional text classification tasks, the texts we need to categorize are longer, and texts from different categories are more similar to one another; what we face is therefore a fine-grained long-text classification problem. As one of the fundamental natural language processing tasks, text classification has always been an important research topic, and researchers usually focus on text modeling methods when addressing it. However, most existing text modeling methods achieve good performance mainly on short texts, and there have been fewer attempts at document modeling, since documents contain far more tokens than ordinary short texts.

In this paper, we propose that a hierarchical modeling architecture can perform better when modeling text for fine-grained document classification. Inspired by the work of predecessors, we build a hierarchical long-text modeling architecture based on the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers). The proposed architecture can be equipped with different document encoders to capture document representations; we instantiate it with two particular document encoders and obtain two long-text classifiers. Experiments on several datasets show that the proposed classifiers outperform existing models on two news classification datasets, which demonstrates that our long-text representation framework does model long texts better.
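This section does not spell out the architecture's implementation; the following is a minimal sketch of one plausible realization, assuming the document is split into fixed-length segments, each segment is encoded with BERT and represented by its [CLS] vector, and a swappable document encoder (here a BiLSTM, one illustrative choice) aggregates the segment vectors. All names and hyperparameters below are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of a hierarchical BERT long-text classifier.
# Assumptions (not specified in the text): [CLS] as the segment
# representation, a BiLSTM in the interchangeable document-encoder slot,
# and mean pooling over segments.
import torch
import torch.nn as nn
from transformers import BertModel

class HierarchicalBertClassifier(nn.Module):
    def __init__(self, num_classes, bert_name="bert-base-uncased", doc_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Document encoder over segment embeddings; swappable
        # (e.g. a Transformer layer could be used instead).
        self.doc_encoder = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=doc_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * doc_hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (batch, num_segments, seq_len)
        b, s, l = input_ids.shape
        out = self.bert(
            input_ids=input_ids.view(b * s, l),
            attention_mask=attention_mask.view(b * s, l),
        )
        # Segment representation: the [CLS] token of each segment.
        seg_vecs = out.last_hidden_state[:, 0].view(b, s, -1)
        doc_states, _ = self.doc_encoder(seg_vecs)
        doc_vec = doc_states.mean(dim=1)  # pool over segments
        return self.classifier(doc_vec)
```

In this sketch the BiLSTM merely stands in for the document-encoder slot; the two encoders actually used in this work are a design choice of the architecture, not fixed by the sketch.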
As mentioned above, the texts we need to classify are similar to one another across categories, which is to say we need to solve a fine-grained classification problem. To this end, we build auxiliary text matching tasks for the classification tasks and perform multi-task learning to improve fine-grained classification performance. When constructing the dataset for the text matching task, we design an algorithm that builds negative examples according to the confusion matrix, so that we can deliberately obtain more negative examples drawn from easily confused categories. Experiments on both our own dataset and an open dataset show that this carefully designed auxiliary task improves classification performance on top of multiple base models. Finally, we continue fine-tuning on the target task after multi-task learning, and the experimental results show that performance can be further improved in this way.
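The construction algorithm is only summarized here; the sketch below illustrates one way to realize the stated idea, assuming the confusion matrix comes from a validation set and that the second category of a negative pair is sampled with probability proportional to its confusion with the first. Function and variable names are illustrative, not the paper's exact algorithm.

```python
# Illustrative sketch of confusion-matrix-guided pair construction for the
# auxiliary text matching task. This is one reasonable realization of the
# idea described above, not the paper's exact procedure.
import random
from collections import defaultdict

def build_matching_pairs(texts, labels, confusion, num_pairs):
    """texts: list[str]; labels: list[int] with class ids 0..K-1
    (each class assumed to contain at least two texts);
    confusion[i][j]: how often class i was predicted as class j
    on a held-out validation set."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    classes = sorted(by_label)
    pairs = []
    for _ in range(num_pairs):
        if random.random() < 0.5:
            # Positive pair: two texts from the same category.
            c = random.choice(classes)
            a, b = random.sample(by_label[c], 2)
            pairs.append((a, b, 1))
        else:
            # Negative pair: pick the second category with probability
            # proportional to its confusion with the first, so easily
            # confused categories yield more (harder) negatives.
            c = random.choice(classes)
            weights = [confusion[c][d] if d != c else 0 for d in classes]
            if sum(weights) == 0:  # fall back to uniform negatives
                weights = [1 if d != c else 0 for d in classes]
            d = random.choices(classes, weights=weights, k=1)[0]
            pairs.append((random.choice(by_label[c]),
                          random.choice(by_label[d]), 0))
    return pairs
```

In multi-task training, pairs produced this way would presumably feed a matching head optimized jointly with the classification head, for example by summing the two losses; the fine-tuning stage mentioned above would then continue training on the classification objective alone.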