Research On Short Text Classification Method Based On Feature Reduction And Semantic Extension

Posted on:2021-03-17

Degree:Master

Type:Thesis

Country:China

Candidate:M Zhou

GTID:2428330614460381

Subject:Computer application technology

Abstract/Summary:

In the Internet era, short text has gradually become a mainstream text format, driven by online social networking. Compared with traditional text, short texts are shorter in length and larger in data scale, so high dimensionality and sparsity are the first challenges in mining short text data. Furthermore, short texts carry little semantic information and are often ambiguous, which makes it difficult for traditional text mining methods to complete classification tasks efficiently and accurately. Therefore, how to further compress the feature dimensions of short texts, improve the quality of short text representation, and thereby achieve higher classification accuracy has become a research hotspot in short text mining. In view of these problems, this dissertation focuses on short text classification, and our main work is as follows:

(2) To address the high dimensionality and sparsity of short texts, a classification method based on signed hash feature reduction is proposed. The method first preprocesses the short texts: it uses an improved jieba-fast multi-threaded word segmentation to split the text into words and removes stop words to improve the text representation. Second, to reduce the high dimensionality of massive short text collections, a signed hash mapping projects the high-dimensional short texts into a vector space of fixed dimension, stores the text content as a sparse matrix, and distinguishes text that may be ambiguous. Finally, a random forest is used as the classification model for prediction. Experimental results show that the proposed method achieves good short text classification accuracy while striking a good balance between hardware consumption and model accuracy.

(3) To address the poor text representation caused by the limited semantic information in short texts, a classification model based on fuzzy semantic extension is proposed, built on hierarchical clustering and LSTM (Long Short-Term Memory). First, the model uses Skip-Gram to train word vectors on the data sets and applies hierarchical clustering in the word embedding space. The cluster center vectors are then fuzzy-matched against the word vectors of an external corpus according to semantic similarity, yielding a text representation enriched with semantic information. Second, this representation is fed into an LSTM for high-level feature extraction, followed by a stochastic pooling layer that extracts global features and further reduces dimensionality, and finally a softmax layer outputs the classification results. Experimental results show that this method can effectively supplement the semantic information of short texts and produce more accurate classification results.
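
The signed hash mapping described in (2) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `signed_hash_vector`, the parameter `n_features`, and the use of MD5 as the hash function are assumptions made for illustration. Each token is hashed to a bucket index in a fixed-dimension space, and a separate hash bit chooses a sign of +1 or -1, so that tokens colliding in the same bucket tend to cancel rather than accumulate. The result is kept sparse by storing only non-zero buckets.

```python
import hashlib
from collections import defaultdict


def signed_hash_vector(tokens, n_features=1024):
    """Project a token list into a fixed-dimension sparse vector.

    Returns a dict {bucket_index: value} holding only non-zero entries.
    Illustrative sketch only; MD5 is an assumed, arbitrary hash choice.
    """
    vec = defaultdict(float)
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # First 8 bytes of the digest select the bucket.
        index = int.from_bytes(digest[:8], "big") % n_features
        # A bit from a different byte selects the sign (+1 or -1).
        sign = 1.0 if digest[8] & 1 == 0 else -1.0
        vec[index] += sign
    # Drop buckets whose signed contributions cancelled to zero.
    return {i: v for i, v in vec.items() if v != 0.0}


if __name__ == "__main__":
    # Tokens would normally come from word segmentation and stop-word removal.
    doc = ["short", "text", "classification", "short"]
    print(signed_hash_vector(doc, n_features=16))
```

The output dimension depends only on `n_features`, not on the vocabulary size, which is what makes memory use predictable. Library implementations of the same idea exist (for example, scikit-learn's `HashingVectorizer` with `alternate_sign=True`), and the resulting sparse rows could then be passed to a random forest classifier; whether the dissertation used such a library is not stated in the abstract.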

Keywords/Search Tags:

short texts classification, hash map, random forest, hierarchical clustering, semantic extension

Related items

1	Research Of Short Texts Classification Algorithm
2	Researches On GPR Shallow Target Detection Based On Hierarchical Clustering Algorithm And Random Forest
3	Research On Hierarchical Classification Methods For Chinese Texts And The Related Application
4	Research On Short Text Classification Based On Semantic Extension
5	Semantic Representation and Interpretation of Short Texts with Deep Learning
6	Research On Short Texts Classification Methods Based On Features Fusion And BiLSTM
7	The Research Of Theme Analysis Technology On Short Texts
8	Research On Cross-domain Classification For Short Texts
9	Research On Algorithm Of Semantic Net Mining Of Short Texts Based On Wordnet
10	Analysis Of Affective Tendency For Chinese Short Texts