
Research On Korean Big Data Text Mining Based On Statistical Methods

Posted on: 2020-11-14 | Degree: Master | Type: Thesis
Country: China | Candidate: Y M Quan | Full Text: PDF
GTID: 2415330572475390 | Subject: Basic mathematics
Abstract/Summary:
We live in a fast-moving, constantly evolving society. Continuous technological innovation has brought us into a new era of big data. Advanced technology not only enriches people's lives but also brings people closer together, narrows the distance between them and, more importantly, changes the way they communicate. Big data, which sits at the center of the modern technology environment, is an extremely important resource. As its name implies, big data is "a huge amount of data", but its real value lies not in its size but in the information it contains that can be put to effective use; using that information is what has gradually made big data a meaningful product of the times. How can we identify which information is valuable and usable? This requires text mining. Text mining combines machine learning, parallel computing, statistics, data mining, natural language processing, probability, graph theory and other fields, drawing on the essence of each. For this reason it is studied by many scholars and experts: it brings together multiple disciplines and technologies and has no obvious disciplinary boundaries, which allows scholars from different fields to exchange ideas and cooperate. Because of North Korea's long-standing policy of restricting information, official statistics are scattered across various documents and news reports, which has made systematic study of Korean Peninsula issues inconvenient. The study of Korean big data text mining aims to address this problem. This study uses about 5 million news items in total, comprising 1.5 million Korean items and 3.5 million Chinese items. The data are imported into Transwarp Data Hub, a relatively comprehensive big data platform. The key steps are data cleaning and conversion, text pre-processing and classification, building a Korean news-specific vocabulary, and updating the Korean corpus. R Studio on the server is then used to apply statistical methods to the text mining task: an analytical model is established according to the characteristics of the Korean Peninsula data and of the specific problem, and is combined with various mature algorithms to carry out Korean big data text mining. The critical steps are updating the Korean corpus and building the Korean news-specific vocabulary. With the updated corpus and the Korean news-specific vocabulary in place, the updated corpus and the Korean corpus show higher consistency in part-of-speech tagging, and the results are analyzed. The final results can provide strong data support and theoretical support for humanities and social science scholars studying the many issues of the Korean Peninsula.
Keywords/Search Tags: Statistical methods, Big data text mining, R Studio, Korean corpus