| The rapid development of social media has produced an abundance of short texts, presenting new challenges for natural language processing tasks such as keyphrase extraction and sentiment analysis. In particular, when keyphrase extraction techniques are applied to short social media texts, missing and redundant keyphrases often arise. The primary reasons are as follows. First, the brevity of such texts limits the available contextual information, leading to sparse feature representations. Second, current keyphrase generation methods based on static word embeddings fail to capture the full contextual and semantic information of short texts, making it difficult to handle polysemy. Moreover, keyphrase generation and sentiment classification are usually treated as independent tasks, without exploiting the interdependence between them, which results in redundant training and wasted resources. To address these issues, this study explores a topic-aware sequence-to-sequence model that incorporates the dynamic word embedding model BERT. The specific contributions are as follows: (a) To tackle the problems of missing keyphrases and weak feature extraction, a BERT-based keyphrase generation model is proposed. A copy mechanism is employed to handle missing keyphrases, while the knowledge in BERT, which is pre-trained on a large-scale corpus, is leveraged to capture contextual and semantic information more comprehensively, alleviating the issue of sparse features. This not only reduces the model’s reliance on large-scale annotated data but also addresses the inability of traditional static word embeddings to express polysemy, thus aiding the generation of higher-quality keyphrases. (b) To address the incomplete and redundant keyphrases generated by the model, a topic-aware keyphrase generation model is explored. On the one hand, topic modeling is used to extract thematic information from the entire corpus, transforming the
text into a low-dimensional topic distribution vector. This vector is then fused with the sequence features in the decoder, enriching the current text representation and yielding more comprehensive, topic-relevant keyphrases. On the other hand, attention mechanisms are employed to better learn the correlation between target keyphrases and source text features, thereby reducing the generation of redundant keyphrases. (c) This study adds sentiment classification labels to a general microblog dataset and explores a multi-task joint learning model for keyphrase generation and sentiment classification based on shared text encoding. A shared encoder learns the semantic feature representation of each sentence; the decoder then performs keyphrase generation, while a classifier outputs the sentiment label of the text. This approach reduces the training redundancy and resource waste that occur when the two models are trained independently, while also enhancing the quality of keyphrase generation. Experiments were conducted on a Chinese microblog dataset, and the proposed models were compared with and analyzed against other models. The results show significant improvements across several evaluation metrics, indicating the potential of the enhanced model for social media analysis. |
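
The copy mechanism mentioned in contribution (a) can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes a common pointer-generator style mixture: the final word distribution combines a vocabulary distribution with attention weights over source tokens, weighted by a generation probability. The function name `copy_distribution`, its parameters, and the toy values are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def copy_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix a generation distribution with a copy distribution.

    Assumed formulation (not confirmed by the source):
        P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on positions holding w

    p_vocab:       dict mapping vocabulary token -> probability (sums to 1)
    attention:     list of attention weights over source positions (sums to 1)
    source_tokens: list of source tokens, aligned with `attention`
    p_gen:         probability in [0, 1] of generating from the vocabulary
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have the same length")
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")

    final = defaultdict(float)
    # Generation term: scale every vocabulary probability by p_gen.
    for token, prob in p_vocab.items():
        final[token] += p_gen * prob
    # Copy term: add attention mass to whichever token sits at each source position.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Toy example: "weibo" is absent from the vocabulary but present in the source text.
p_vocab = {"the": 0.5, "model": 0.3, "<unk>": 0.2}
source_tokens = ["weibo", "model", "weibo"]
attention = [0.4, 0.2, 0.4]

dist = copy_distribution(p_vocab, attention, source_tokens, p_gen=0.6)
for token, prob in sorted(dist.items(), key=lambda item: -item[1]):
    print(f"{token}: {prob:.2f}")
```

In this toy input, the out-of-vocabulary token `weibo` receives probability mass only through the copy term, which is how a copy mechanism of this kind can recover keyphrase words that a fixed output vocabulary would otherwise miss.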