The Text Corpus Refining Research

Posted on:2018-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhang

Full Text:PDF

GTID:2335330515971841

Subject:Statistics

Abstract/Summary:


A text corpus is the foundation of text data mining. Many corpora come from real production work and everyday life, and their categories are usually defined by industry experts. The data set studied in this thesis comes from the mayor's public telephone office. Because the industry categories have changed over time, corpora from different periods inevitably contain a large amount of bad data, and the corpus is too large for experts to check in detail. Data mining methods are therefore needed to find the records that are likely misclassified, so that industry experts can recheck those records one by one. This thesis is about filtering the corpus to find such records so that the category labels assigned by industry experts can be corrected; in other words, it addresses category discrimination for text data. It first presents the techniques and process of text classification, then discusses the properties of the Naive Bayes method, and finally studies text corpus refinement, including the category discrimination methods used to select wrongly labeled data, together with an empirical analysis.

With large data sets, having industry experts label text data manually consumes a great deal of manpower, material and financial resources, and time, so it is impractical to employ experts to correct the whole corpus by hand. Labeling the text data in bulk according to fixed rules is an alternative that avoids these costs, but the accuracy of the resulting category labels is relatively low. Combining the advantages of the two approaches, we propose a third method: first label the text data in bulk, then give the records whose category labels appear to be wrong to industry experts for manual labeling, and finally correct the corpus using the records labeled by the experts. Text corpus refinement is built on this third method. Different methods are used to extract the records whose category is discriminated incorrectly; the records that every method discriminates incorrectly are the ones most likely to carry a wrong category label. The purpose of text corpus refinement is to extract these records. They are given to industry experts for manual labeling, and the corpus is then corrected based on the expert annotations.

The thesis briefly introduces the general process of text classification, followed by the basic theory, parameter estimation, and optimization of the Naive Bayes algorithm. Finally, it studies the preprocessing of the text corpus, keyword extraction, the purpose and method of corpus refinement, and the extraction of records whose categories are discriminated incorrectly. The focus of the thesis is the method for extracting misclassified text data.
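The refinement idea described above (several discrimination methods, with only the records that all of them reject being sent to experts) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the toy records, the category names `water` and `traffic`, the helper names (`fit_counts`, `predict_nb`, `predict_overlap`, `flag_suspects`), the second "word overlap" method, and the leave-one-out scheme are all hypothetical choices made for the example. Only the use of Naive Bayes and the "wrong under every method" rule come from the abstract.

```python
import math
from collections import Counter, defaultdict

def fit_counts(docs, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing, in log space."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def predict_overlap(model, tokens):
    """Hypothetical second method: pick the class whose words overlap the record most."""
    class_counts, word_counts, _ = model

    def score(label):
        total_words = sum(word_counts[label].values())
        return sum(word_counts[label][tok] for tok in tokens) / total_words

    return max(sorted(class_counts), key=score)

def flag_suspects(docs, labels, predictors):
    """Return indices of records whose label is rejected by every predictor.

    Each record is held out in turn (leave-one-out), so its own, possibly
    wrong, label does not influence the models that judge it.
    """
    suspects = []
    for i, (tokens, label) in enumerate(zip(docs, labels)):
        model = fit_counts(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        if all(predict(model, tokens) != label for predict in predictors):
            suspects.append(i)
    return suspects

# Toy, already tokenized records; the last one is deliberately mislabeled.
docs = [
    ["water", "pipe", "leak"],
    ["water", "supply", "cut"],
    ["pipe", "burst", "water"],
    ["bus", "route", "late"],
    ["traffic", "light", "broken"],
    ["bus", "stop", "traffic"],
    ["water", "pipe", "burst"],
]
labels = ["water", "water", "water", "traffic", "traffic", "traffic", "traffic"]

suspects = flag_suspects(docs, labels, [predict_nb, predict_overlap])
print(suspects)  # [6]
```

Requiring agreement among all methods keeps the list handed to experts short: a record is flagged only when no method supports its current label, which trades some recall for a lower manual checking cost.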

Keywords/Search Tags:

Naive Bayes, High-dimensional, Short text, Text corpus, refining


Related items

1	An Empirical Study On The Development Of The Text Organizing Ability Of English Composition In High School From A Corpus-based Perspective
2	A Study Of Framing Narratives In Tourism Text Translation-A Corpus-based Approach
3	The Application Of Machine Learning In The Prediction Of Movie Box Office
4	Research And Implementation Of Automatic Labeling System For Quasi Written Language Korean Speech Corpus
5	A Research On English Text Types And Their Translation
6	Text Types And Translation Strategies
7	A Practice Report On The C-E Translation Of Nuclear Power Engineering Text From The Perspective Of Eco-translatology
8	Research In Automatic Contrast Technique Of Vocabulary In Mongolian Text
9	Text Character Design Exploration Of Multi-dimensional Performance
10	A Study On Translation Strategies Of Scientific Text From The Perspective Of Three-Dimensional Transformation In Eco-translatology