Research On Korean Big Data Text Mining Based On Statistical Methods

Posted on:2020-11-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y M Quan

GTID:2415330572475390

Subject:Basic mathematics

Abstract/Summary:

We live in a fast-moving, constantly evolving society. Continual technological innovation has brought us into a new era of big data. Advanced technology has not only enriched people's lives but also brought people closer together, narrowing the distance between them and, more importantly, changing the way they communicate. Big data, at the center of the modern technology environment, is an extremely important resource. As its name implies, big data is "a huge amount of data", but its real value lies not in its size but in the information it contains that can be put to effective use. It is this usable information that has gradually made big data a meaningful product of the times. How can information that is valuable and usable be identified? It must be explored through text mining technology. Text mining combines machine learning, parallel computing, statistics, data mining, natural language processing, probability, graph theory and other fields, drawing on the essence of each. For this reason it is studied by many scholars and experts: it combines multiple disciplines and technologies and has no obvious academic boundaries, which allows scholars from various fields to exchange ideas and cooperate.

Because of North Korea's long-standing policy of restricting information, official statistics are scattered across various documents and news reports, which has greatly inconvenienced the systematic study of Korean Peninsula issues. The study of big data text mining for the Korean language aims to solve such problems. About 5 million news items were selected for this study, including 1.5 million Korean items and 3.5 million Chinese items. These data were imported into Transwarp Data Hub, a relatively comprehensive big data platform. The key steps carried out were data cleaning and conversion, text pre-processing and classification, establishing a Korean news-specific vocabulary, and updating the Korean corpus. R Studio on the server was then used to apply statistical methods to the text mining research. An analytical model was established according to the characteristics of the Korean Peninsula data and of the specific problem, and was combined with various mature algorithms to achieve Korean big data text mining. The key link is the update of the Korean corpus and the establishment of the Korean news-specific vocabulary.

With the updated corpus and the Korean news-specific vocabulary in place, the updated corpus shows higher consistency with the Korean corpus in part-of-speech tagging, and the results are analyzed. The final results can provide strong data and theoretical support for humanities and social science scholars studying the many issues concerning the Korean Peninsula.
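The pipeline described above (cleaning, pre-processing, building a news-specific vocabulary, and checking part-of-speech tagging consistency) can be illustrated with a minimal sketch. The thesis reports using R Studio; Python is used here only for illustration, and this is not the author's implementation. The function names (`clean`, `tokenize`, `candidate_vocabulary`, `tagging_agreement`), the frequency threshold, and the toy documents are all hypothetical. The tokenizer simply splits on word characters, whereas a real Korean pipeline would likely need a morphological analyzer.

```python
import re
from collections import Counter


def clean(text):
    """Remove HTML-like tag remnants and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Naive tokenizer: runs of word characters (assumption, not a
    Korean morphological analysis)."""
    return re.findall(r"\w+", text)


def candidate_vocabulary(docs, base_lexicon, min_count=2):
    """Return tokens that occur at least `min_count` times in the news
    documents but are missing from the existing lexicon. These are
    candidates for a news-specific vocabulary."""
    counts = Counter(tok for doc in docs for tok in tokenize(clean(doc)))
    return {
        tok: n
        for tok, n in counts.items()
        if n >= min_count and tok not in base_lexicon
    }


def tagging_agreement(tags_a, tags_b):
    """Fraction of positions where two part-of-speech tag sequences
    for the same tokens agree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    if not tags_a:
        return 0.0
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Toy data, invented purely for illustration.
docs = ["<p>경제 건설 총력</p>", "경제 건설 성과", "인민 경제"]
base_lexicon = {"경제", "인민"}

new_terms = candidate_vocabulary(docs, base_lexicon)
agreement = tagging_agreement(["NNG", "NNG", "VV"], ["NNG", "NNP", "VV"])
print(new_terms, agreement)
```

In this sketch, a higher `tagging_agreement` value between tags produced with the updated corpus and a reference tagging would correspond to the "higher consistency" the abstract mentions, though the thesis does not state which consistency measure it used.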

Keywords/Search Tags:

Statistical methods, Big data text mining, R Studio, Korean corpus

Related items

1	A Comparative Study On The Text Complexity Of CET 4 And CET 6 Reading Comprehension Texts: A Data Mining Approach
2	Analysis And Research To The Data Of CET-4 Score Based On Data Mining
3	Applying Web Data Mining To The Parallel Corpus: The Automatic Identification And Alignment Of The Corresponding Units
4	A Report On The Translation Of Data Science For Business-What You Need To Know About Data Mining And Data-analytic Thinking(Chapter 3)
5	A Translation Project Report Of Data Science For Business -What You Need To Know About Data Mining And Data-analytic Thinking(Chapter One) By Foster Provost And Tom Fawcett
6	New statistical methods for analysis of historical data from wildlife population
7	Application Of Machine Learning In Data Mining Of Earthen Archaeological Sites Monitoring
8	Research On Data Mining Of Massive Minority Cultural Resources Based On Spark
9	Research And Application Of Data Mining In College English Teaching And Evaluation
10	A Research On Scoring SAQs Of Online Listening By Data Mining