Improved Naive Bayes Algorithm With Application To Text Classification

Posted on: 2021-03-16; Degree: Master; Type: Thesis
Country: China; Candidate: G Wu
GTID: 2517306113453504; Subject: Statistics
Abstract/Summary:
The rapid growth of the Internet produces ever more text data, and this information overload makes tasks such as retrieval and classification harder. Traditional text classification can no longer meet users' needs. Automatic text classification makes up for its shortcomings: it completes classification tasks automatically, making information retrieval and classification simpler and more efficient. Machine learning, as a data mining technology, can learn from large amounts of data to extract the information people need. As an important classification technique in machine learning, the Naive Bayes algorithm is widely used in text classification because of its simple structure, solid theoretical basis, high efficiency, and accuracy. However, its feature independence assumption and its requirements on the probability distribution of the input data are difficult to meet in practice, and its limitations as a shallow learner can lead to poor classification performance. This thesis therefore improves the Naive Bayes algorithm from the following two perspectives.

First, to address the problem that Naive Bayes is a shallow learner, a new Naive Bayes ensemble algorithm called deep ensemble naive Bayes (DENB) is proposed. Inspired by the ensemble idea of Deep Forest (gcForest), the algorithm combines Bernoulli Naive Bayes (BNB), Gaussian Naive Bayes (GNB), and Multinomial Naive Bayes (MNB) into a Naive Bayes model with a deep structure. The proposed DENB algorithm overcomes the insufficient feature representation of shallow learning. Experiments on three classic data sets (sports article classification, company type classification, and spam filtering) show that the accuracy, recall, and F1 score of the proposed algorithm increase significantly, indicating better overall performance.

Second, to address the strict requirements that Naive Bayes places on the probability distribution of the input data and on feature independence, this thesis takes a binary classification task as an example and proposes an improved Bernoulli Naive Bayes algorithm based on coding. The algorithm first encodes the original input with a tree ensemble, and then trains and tests the Bernoulli Naive Bayes algorithm on the encoded data. The results show that the encoding satisfies the Bernoulli Naive Bayes algorithm's requirement on the probability distribution of the input data, and that the differences among the trees used for encoding guarantee, to a certain extent, independence between the encoded features. Experiments on sports article classification show that the improved algorithm achieves good classification accuracy, which verifies the effectiveness of the coding-based improvement.
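The coding-based idea (encode the input with a tree ensemble, then fit Bernoulli Naive Bayes on the resulting binary features) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say which tree ensemble is used, so the sketch assumes an ensemble of randomly drawn depth-1 trees (stumps), and all function names, the synthetic two-class data, and the parameter values are hypothetical choices made only for illustration.

```python
import math
import random


def fit_stumps(X, n_stumps, rng):
    """Draw random depth-1 trees (stumps): each is (feature index, threshold).

    Assumption: random stumps stand in for the unspecified tree ensemble.
    """
    stumps = []
    for _ in range(n_stumps):
        j = rng.randrange(len(X[0]))
        column = [row[j] for row in X]
        stumps.append((j, rng.uniform(min(column), max(column))))
    return stumps


def encode(X, stumps):
    """Map each sample to a 0/1 vector: the leaf it falls into for each stump."""
    return [[1 if row[j] > t else 0 for j, t in stumps] for row in X]


def fit_bernoulli_nb(B, y, alpha=1.0):
    """Estimate log priors and Laplace-smoothed P(bit = 1 | class)."""
    model = {}
    for c in sorted(set(y)):
        rows = [b for b, label in zip(B, y) if label == c]
        n = len(rows)
        p = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[c] = (math.log(n / len(y)), p)
    return model


def predict(model, B):
    """Pick the class with the highest log posterior under Bernoulli NB."""
    def score(b, prior, p):
        return prior + sum(
            math.log(pk if bit else 1 - pk) for bit, pk in zip(b, p)
        )
    return [max(model, key=lambda c: score(b, *model[c])) for b in B]


def make_data(n, rng):
    """Hypothetical synthetic binary task: two Gaussian blobs in 5 dimensions."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label, 1.0) for _ in range(5)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)

stumps = fit_stumps(X_train, 30, rng)                 # "tree ensemble" encoder
model = fit_bernoulli_nb(encode(X_train, stumps), y_train)
pred = predict(model, encode(X_test, stumps))

accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print(f"accuracy on synthetic data: {accuracy:.2f}")
```

Because every encoded feature is a 0/1 leaf indicator, the input matches the Bernoulli model by construction, whatever the distribution of the raw features. Drawing each stump on a randomly chosen feature and threshold is one simple way to make the trees differ from one another, which is the property the abstract relies on for approximate independence of the encoded features.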
Keywords/Search Tags: Naive Bayes, Text Classification, Ensemble, gcForest, Encoder