| With the development of the Internet, the volume of data is growing exponentially. There is an urgent need to extract useful information and knowledge from massive data, especially text data, to guide production and daily life. Russian is one of the working languages of the United Nations and an official language of many countries, including Russia and Kazakhstan. It is spoken by approximately 5.7% of the world's population. Moreover, Russia is a neighbor of China, and economic and trade relations between Russia and China occupy an important position for both countries. Research on Russian text analysis and the corresponding text mining methods can provide strong support for the commercial analysis and management decision-making of relevant organizations, and it helps promote Sino-Russian trade cooperation. To improve the performance of Russian text mining methods, this paper studies Russian text clustering methods and the related term extraction methods. The main research content is as follows: (1) Research on a multi-strategy method for Russian text term extraction. Term extraction is a key foundation of text mining. To address the low recall of current Russian term extraction methods, this paper proposes a method that combines multiple strategies, including Russian part-of-speech (POS) analysis, grammatical rules, and string frequency statistics, to automatically extract Russian words and multiword expressions. The method uses a stop part-of-speech list and a stop word list, both derived experimentally, to divide the Russian text into collections of word strings, and then combines string frequency statistics with substring deletion to extract frequent word strings. The frequent word strings are filtered using Russian grammar rules to obtain a set of candidate terms. Experiments show that this method is convenient and effective, and the resulting term set can be directly used as a term database for text mining 
tasks. (2) Research on the DMD-kmeans method for Russian text clustering. Factors such as feature selection, the number of clusters, and the algorithm model all affect the performance of text clustering. To overcome the shortcomings of the k-means method, this paper proposes a Russian text clustering method based on term extraction. Based on the term extraction results, the method first selects feature terms to construct text vectors and uses mean shift clustering to determine the value of k. It then combines density and distance principles to determine the initial cluster centers and applies k-means clustering to cluster the Russian texts. Experiments show that this method outperforms existing algorithms in terms of sum of squared errors and stability. (3) Application research on Russian text term extraction and clustering. To further verify the effectiveness of the proposed methods, we apply them to extract terms from and cluster Russian texts in two corpora: the United Nations Parallel Corpus and the Taiga Corpus. In summary, focusing on Russian text clustering, this paper studies a multi-strategy Russian term extraction method and a Russian text clustering method based on term extraction. Experiments show that the proposed term extraction method achieves higher accuracy than the n-gram method, and that the proposed clustering method outperforms k-means and k-means++ in terms of sum of squared errors and stability. These methods can therefore be used to quickly extract valuable information from massive Russian text data, providing strong support for the management decision-making of related organizations. |
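
The term extraction pipeline described in (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the stop word list, the `min_freq` and `max_len` parameters, and all function names are hypothetical. POS tagging and the grammar-rule filter are omitted, and substring deletion is approximated by dropping a shorter string whenever a longer frequent string contains it with the same frequency.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; the paper derives its own list experimentally.
STOP_WORDS = {"и", "в", "на", "с", "по", "для", "о", "не", "что", "это"}


def split_into_word_strings(text):
    """Cut the text at punctuation and stop words, returning lists of words."""
    strings = []
    for chunk in re.split(r"[^\w\s-]+", text.lower()):  # punctuation ends a string
        current = []
        for word in re.findall(r"\w+(?:-\w+)*", chunk):
            if word in STOP_WORDS:
                if current:
                    strings.append(current)
                current = []
            else:
                current.append(word)
        if current:
            strings.append(current)
    return strings


def count_ngrams(word_strings, max_len):
    """Count every contiguous word sequence of up to max_len words."""
    counts = Counter()
    for words in word_strings:
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def is_contiguous_sub(short, long):
    """True if the tuple `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def extract_candidates(text, min_freq=2, max_len=4):
    """Return (term, frequency) pairs for frequent word strings."""
    counts = count_ngrams(split_into_word_strings(text), max_len)
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    kept = []
    for gram, c in frequent.items():
        # Simplified substring deletion: drop a string that only ever
        # appears as part of a longer, equally frequent string.
        absorbed = any(
            len(other) > len(gram) and oc == c and is_contiguous_sub(gram, other)
            for other, oc in frequent.items()
        )
        if not absorbed:
            kept.append((" ".join(gram), c))
    return sorted(kept, key=lambda item: (-item[1], item[0]))


sample = ("Обработка текста важна. Обработка текста и кластеризация текста. "
          "Кластеризация текста для анализа.")
candidates = extract_candidates(sample)
```

In this toy input, the single words "обработка" and "кластеризация" are dropped because they occur only inside the equally frequent two-word strings, while "текста" is kept because it is more frequent than any longer string containing it.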
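
The clustering steps described in (2) can be illustrated with a small pure-Python sketch. It assumes the texts have already been converted to numeric vectors, and it should not be read as the DMD-kmeans algorithm itself: the abstract does not give the exact density and distance rule, so the seeding score used here (neighbour count times distance to the nearest chosen seed) is an assumption, as are the `bandwidth` and `radius` parameters and all function names.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def estimate_k_mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; the number of distinct modes is taken as k."""
    modes = []
    for p in points:
        cur = tuple(p)
        for _ in range(max_iter):
            neighbours = [q for q in points if dist(cur, q) <= bandwidth]
            new = mean(neighbours)
            shift = dist(cur, new)
            cur = new
            if shift < tol:
                break
        # Merge modes that converged to nearly the same location.
        if all(dist(cur, m) > bandwidth / 2 for m in modes):
            modes.append(cur)
    return len(modes)


def density_distance_seeds(points, k, radius):
    """Pick the densest point first, then points that are dense and far away."""
    density = [sum(1 for q in points if dist(p, q) <= radius) for p in points]
    chosen = [max(range(len(points)), key=lambda i: density[i])]
    while len(chosen) < k:
        def score(i):
            # Assumed rule: density weighted by distance to the nearest seed.
            return density[i] * min(dist(points[i], points[j]) for j in chosen)
        chosen.append(max(range(len(points)), key=score))
    return [tuple(points[i]) for i in chosen]


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse


def cluster(points, bandwidth):
    """Estimate k, choose seeds, then run k-means (hypothetical wrapper)."""
    k = estimate_k_mean_shift(points, bandwidth)
    seeds = density_distance_seeds(points, k, radius=bandwidth)
    return kmeans(points, seeds)


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centers, sse = cluster(points, bandwidth=3)
```

Because the seeds are chosen deterministically, repeated runs on the same data give the same partition, which is one plausible reading of the stability claim; the sum of squared errors returned as `sse` is the quantity the abstract uses for comparison.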