Improved Naive Bayes Algorithm With Application To Text Classification

Posted on:2021-03-16

Degree:Master

Type:Thesis

Country:China

Candidate:G Wu

Full Text:PDF

GTID:2517306113453504

Subject:Statistics

Abstract/Summary:

The rapid growth of the Internet produces more text data every day, and this information overload makes tasks such as retrieval and classification harder. Traditional text classification can no longer meet practical needs. Automatic text classification makes up for this shortcoming: it completes classification tasks automatically, making information retrieval and classification simpler and more efficient. Machine learning, as a data mining technology, can learn from large amounts of data to extract the information people need. As an important classification technique in machine learning, the Naive Bayes algorithm is widely used in text classification because of its simple structure, solid theoretical foundation, high efficiency and accuracy. However, its feature independence assumption and its requirements on the probability distribution of the data are difficult to satisfy in practice, and its limitations as a shallow learner can lead to poor classification performance. This thesis therefore improves the Naive Bayes algorithm from the following two perspectives.

First, to address the fact that Naive Bayes is a shallow learner, a new Naive Bayes ensemble algorithm called deep ensemble naive Bayes (DENB) is proposed. Inspired by the ensemble idea of Deep Forest (gcForest), the algorithm combines Bernoulli Naive Bayes (BNB), Gaussian Naive Bayes (GNB) and Multinomial Naive Bayes (MNB) into a Naive Bayes model with a deep learning structure. The proposed DENB algorithm overcomes the insufficient feature representation of shallow learning. Experiments on three classic data sets (sports article classification, company type classification and spam filtering) show that the accuracy, recall and F1 score of the proposed algorithm increase significantly.

Second, to address the strict requirements that Naive Bayes places on the probability distribution of the input data and on feature independence, this thesis takes a binary classification task as an example and proposes an improved Bernoulli Naive Bayes algorithm based on encoding. The algorithm first encodes the original input with an ensemble of trees, and then trains and tests the Bernoulli Naive Bayes algorithm on the encoded data. The results show that the encoding satisfies the requirements of Bernoulli Naive Bayes on the probability distribution of the input data, and that the differences between the trees used for encoding guarantee, to a certain extent, the independence of the encoded features. Experiments on sports article classification show that the improved algorithm achieves good classification accuracy, which verifies the effectiveness of the encoding-based improvement.
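The two ideas described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the thesis implementation, whose details the abstract does not give. The assumptions are: each cascade level holds one BNB, one GNB and one MNB model; each level passes out-of-fold class probabilities, concatenated with the original features, to the next level in the manner of gcForest; the final prediction averages the last level's probabilities; and the tree encoding is a one-hot encoding of leaf indices from random trees (RandomTreesEmbedding). The function names fit_cascade, predict_cascade, make_encoded_bnb and toy_counts, and all parameter values, are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline


def fit_cascade(X, y, n_levels=2, cv=3):
    """Fit a gcForest-style cascade of BNB, GNB and MNB (assumed layout).

    X is assumed dense and non-negative (e.g. word counts), as MNB requires.
    """
    levels, augmented = [], X
    for _ in range(n_levels):
        models = [BernoulliNB(), GaussianNB(), MultinomialNB()]
        # Out-of-fold probabilities keep training labels from leaking forward.
        probas = [
            cross_val_predict(m, augmented, y, cv=cv, method="predict_proba")
            for m in models
        ]
        for m in models:
            m.fit(augmented, y)
        levels.append(models)
        # The next level sees the original features plus all class probabilities.
        augmented = np.hstack([X] + probas)
    return levels


def predict_cascade(levels, X):
    """Run X through every level, then average the last level's probabilities."""
    augmented = X
    for models in levels:
        probas = [m.predict_proba(augmented) for m in models]
        augmented = np.hstack([X] + probas)
    classes = levels[-1][0].classes_
    return classes[np.mean(probas, axis=0).argmax(axis=1)]


def make_encoded_bnb(n_trees=50, max_depth=3, seed=0):
    """Tree-encode the input as binary leaf indicators, then apply BNB."""
    return make_pipeline(
        RandomTreesEmbedding(
            n_estimators=n_trees, max_depth=max_depth, random_state=seed
        ),
        BernoulliNB(),
    )


def toy_counts(n_per_class=40, seed=0):
    """Synthetic two-class count data, used for illustration only."""
    rng = np.random.default_rng(seed)
    n = n_per_class
    a = np.hstack([rng.poisson(8.0, (n, 5)), rng.poisson(0.5, (n, 5))])
    b = np.hstack([rng.poisson(0.5, (n, 5)), rng.poisson(8.0, (n, 5))])
    return np.vstack([a, b]).astype(float), np.repeat([0, 1], n)
```

A typical call would be X, y = toy_counts(), then predict_cascade(fit_cascade(X, y), X) for the cascade, or make_encoded_bnb().fit(X, y).predict(X) for the encoded variant. The one-hot leaf encoding yields binary features, which match the input that Bernoulli Naive Bayes models, and because each random tree splits differently, the encoded features are less tightly coupled than they would be if every tree were identical.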

Keywords/Search Tags:

Naive Bayes, Text Classification, Ensemble, gcForest, Encoder

Related items

1	Chinese Text Categorization Method And Implementation
2	Research On Imbalanced News Text Mining Based On Improved Random Forest
3	Research On News Classification And Recommendation Method Of Taiyuan Education Bureau Government Affairs Big Data Platform
4	Random Forest Algorithm Based On Optimized Auto-encoder
5	Research And Implementation Of Automatic Correction Of Programming Work Based On Git
6	Realization Of Text Classification And Recognition Based On NLP Method
7	Research On Classification Of Imbalanced Datasets Based On Random Forest
8	Research On The Influence Of Students' Behavior On Academic Achievement Based On Data Mining
9	Research On Unbalanced Data Classification Based On Ensemble Learning
10	The Method Of Selecting Local Feature Words And Its Application In Text Classification