| In recent years, bullet subtitle video websites have developed rapidly, and a growing number of users have embraced bullet subtitle culture and its real-time interactive viewing mode. However, some bullet subtitles contain illegal information, advertisements, abuse, pornography, and other inappropriate content, which greatly degrades the user experience. Filtering and visualizing bullet subtitle content can improve the viewing experience and help purify cyberspace. Taking the BILIBILI video website as the research object, this thesis studies methods for bullet subtitle filtering and visualization. The main work is as follows. Firstly, traditional machine learning text classification methods perform poorly on bullet subtitles because the text features are sparse, and they neglect the combination of text content and user information, which leads to a low recognition rate. To address these problems, a bullet subtitle short text classification algorithm based on improved feature extension is proposed. First, user features are constructed from user attributes, and two new features, user credibility level and user identity credibility, are added to the original features. Then, the text extension method is improved to retain the original text semantics as far as possible and avoid invalid extension. Finally, the text features and user features are integrated to classify the bullet subtitles. The experimental results show that this method effectively improves the recognition rate of inappropriate bullet subtitles. Secondly, to address the scarcity of labeled bullet subtitle data, an improved semi-supervised classification algorithm that uses both labeled and unlabeled data is proposed. First, the text extension method is used to expand the bullet subtitle text and enhance its information representation ability. Second, using a small amount of labeled data and a large amount of unlabeled data, an LDA model is combined with a support vector machine: unlabeled samples for which the LDA model and the initial support vector machine assign consistent category labels are added to the labeled data, which expands the annotated bullet subtitle set. Finally, the bullet subtitle classification model is trained on the newly constructed labeled data set and used to classify bullet subtitles. The experimental results show that the proposed semi-supervised classification method effectively improves bullet subtitle classification performance. Thirdly, in view of the current lack of effective visualization methods for video bullet subtitles, this thesis analyzes the problems of existing bullet subtitle video websites. Building on the first two parts, it proposes a classification-based bullet subtitle visualization scheme and a high-quality bullet subtitle display scheme based on quantitative indicators, which purify the viewing environment and improve the user’s viewing experience. |
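
The label-set expansion step of the semi-supervised method can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the function names `expand_labeled_set` and `keyword_classifier` are hypothetical, and the two toy keyword predictors merely stand in for the LDA model and the initial support vector machine. Only the agreement rule itself (keep an unlabeled sample when both models assign it the same category) comes from the text.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (bullet subtitle text, category label)


def expand_labeled_set(
    labeled: Sequence[Sample],
    unlabeled: Sequence[str],
    predict_a: Callable[[str], str],
    predict_b: Callable[[str], str],
) -> Tuple[List[Sample], List[str]]:
    """Add unlabeled texts to the labeled set when both predictors agree.

    predict_a and predict_b are placeholders for the two models named in
    the abstract (an LDA-based labeler and an initial SVM); how those
    models are trained is outside the scope of this sketch.
    """
    expanded = list(labeled)
    remaining: List[str] = []
    for text in unlabeled:
        label_a = predict_a(text)
        label_b = predict_b(text)
        if label_a == label_b:
            # Consistent category labels: treat the sample as labeled.
            expanded.append((text, label_a))
        else:
            # Disagreement: leave the sample unlabeled.
            remaining.append(text)
    return expanded, remaining


def keyword_classifier(keywords: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in predictor (hypothetical): flags text containing a keyword."""
    keyword_list = [k.lower() for k in keywords]

    def predict(text: str) -> str:
        lowered = text.lower()
        hit = any(k in lowered for k in keyword_list)
        return "inappropriate" if hit else "normal"

    return predict


if __name__ == "__main__":
    model_a = keyword_classifier(["buy", "cheap"])   # stands in for LDA
    model_b = keyword_classifier(["buy", "click"])   # stands in for SVM
    seed = [("buy followers now", "inappropriate"), ("great episode", "normal")]
    pool = ["buy cheap coins", "click this link", "love this scene"]
    expanded, remaining = expand_labeled_set(seed, pool, model_a, model_b)
    print(len(expanded), remaining)
```

In a fuller pipeline, the final classifier would then be retrained on the expanded set, as the abstract describes. Samples on which the two models disagree stay in the unlabeled pool rather than being forced into a category.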