| Automatic identification of abbreviations is a key to the automatic understanding of documents. As big data, cloud computing, and the Internet of Things grow in influence and gradually enter people's daily lives, the level of intelligence in community management keeps rising, which demands stronger development, use, integration, and sharing of information resources. It also demands that modern information technology such as computers and networks be applied widely to every field, aspect, and link of social management and services, promoting intelligent technology in public administration and sharply raising the efficiency and benefits of social management. Official documents occupy an important place in modern society, cover a wide range, and carry a mass of information, so handling them manually is inefficient. How to use computer information processing technology to understand and process documents automatically is one of our most important tasks. Understanding and processing documents automatically, and thereby realizing office automation, is an important element of intelligent community management. The key is to use all available features and identifying components of official documents for their automatic understanding and processing. As an important part of the vocabulary, abbreviations have distinctive formal features and unique, rich connotations, and they are an important symbolic component of documents. Knowledge mining of abbreviations in documents is therefore an important task in automatic document understanding. At the same time, automatic identification of abbreviations is a difficulty in automatic document understanding. In terms of form, abbreviations include many "unknown words" that the computer cannot identify.
In terms of meaning, although abbreviations are simple in form, they often have unique and rich connotations that are difficult to understand and grasp from a literal reading alone; even computer programs with a higher level of artificial intelligence cannot accurately understand such "sublime words with deep meaning". The research on knowledge mining of document abbreviations in this paper aims to find structural and semantic features and the rules governing changes in use and development, through statistics, analysis, induction, and comparison in static and dynamic systems, providing ideas and methods for the automatic identification of document abbreviations in the service of intelligent document information processing. Knowledge mining of official-document abbreviations can provide a theoretical and practical basis for the automatic recognition of document abbreviations, help further improve and optimize the performance of automatic segmentation and tagging software, improve the efficiency and accuracy of automatic document understanding, and help solve the problem of automatic document understanding. In addition, it can support in-depth statistics and description, contribute to the study of the common language, and provide a reference for exploring social, political, and cultural development, so it has important theoretical significance and application value. This paper builds a document keyword table, 11 kinds of abbreviation dictionaries, and "Modern Chinese Dictionary" abbreviation databases. Through statistical analysis of the abbreviations' mode, length, structure, and attributes, this paper finds that extracting core morphemes is the main way of forming abbreviations. The relationship between the morphemes of an abbreviation formed by extracting core morphemes is a random parameter.
Correlation is of great significance for the recognition of abbreviations and suggests an approach to recognition based on relevance theory. On this basis, this paper establishes a "contemporary Chinese political educational document corpus" of more than 12 million words. After word segmentation, tagging, and processing, this paper analyzes the dynamic distribution and the bivariate correlation of combinations, verifies the conclusions of abbreviation knowledge mining in the static system, and obtains good results in identifying and extracting abbreviations. The thesis is divided into six major parts. Chapter 1, Introduction: mainly elaborates the purpose and significance of this paper, the research status, and the theory and methods. Chinese lexicology, computational linguistics, language information processing, and office automation theory are the main guiding theories of this research. Corpus linguistics, static and dynamic methods, and qualitative and quantitative analysis are the main methods. Chapter 2, Basic research on knowledge mining of document abbreviations: through statistical analysis of the document keyword table, the abbreviation dictionaries, and the "Modern Chinese Dictionary" abbreviation databases, extracts features of the abbreviations' mode, length, structure, and attributes. We found that extracting core morphemes is the main way of abbreviating; that the relationship between the constituent morphemes is random, and frequency features are an important parameter that enables recognition according to correlation theory; that noun abbreviations and verb abbreviations are the focus of knowledge mining; and that the important grammatical feature of numeral-type abbreviations is a "numeral+noun" or "numeral+verb" combination with a reasonable meaning, which provides important ideas for the automatic recognition of numeral-type abbreviations.
Conclusion: taking the correlation between the morphemes of an abbreviation as the basic parameter, taking function as an auxiliary parameter based on relevance theory, and focusing on words of two to four syllables can be the basic path to the automatic identification of abbreviations in official documents. Chapter 3, Development of the document corpus: introduces the research purpose of the document corpus, the principles of selection, the sampling method, and corpus processing, especially the principles and methods of abbreviation proofreading. Chapter 4, Quantitative analysis of abbreviations in the document corpus: analyzes the dynamic distribution of abbreviations' length, functional attributes, and structural modes, and verifies the basic conclusions of static abbreviation knowledge extraction. Chapter 5, Study on the automatic identification of document abbreviations: this is the core of the study and its main innovation. Based on bivariate correlation theory, it compiles sampling statistics of word correlation in the document corpus and runs extraction experiments focused on the "1+1", "1+2", "2+1", and "2+2" types and the "numeral+noun" and "numeral+verb" pattern combinations, following the conclusions of abbreviation knowledge mining in the static and dynamic systems. Good results have been obtained. The conclusions are as follows. Extraction through statistics and analysis of word correlation based on relevance theory is a sound approach. The focus of abbreviation recognition and extraction should be the "1+1", "1+2", and "2+1" type combinations. Frequency and function are important parameters for identifying abbreviations automatically, and combining the two checks can enhance the pertinence, feasibility, and validity of the recognition and extraction of abbreviations. The quasi-abbreviations in documents can help the computer identify the form, source, and system unit of a document; the key to identifying them automatically should be the "noun+noun" pattern combination.
The recognition and extraction of numeral-type (blanket) abbreviations can focus on the "numeral+noun" and "numeral+verb" patterns. Chapter 6, Conclusion: verifies that knowledge mining of document abbreviations based on relevance theory is feasible and effective, puts forward principles for optimizing a dynamic glossary of document abbreviations, and points out the deficiencies of the research. |
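
The extraction idea summarized above, combining a frequency check with a correlation check on "1+1", "1+2", and "2+1" combinations, might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which correlation statistic the thesis uses, so pointwise mutual information (PMI) stands in for it here, and the function name `score_candidates`, the `min_freq` threshold, and the toy token list are all hypothetical.

```python
import math
from collections import Counter


def score_candidates(tokens, patterns=((1, 1), (1, 2), (2, 1)), min_freq=2):
    """Score adjacent token pairs as abbreviation candidates.

    tokens   -- a word-segmented corpus as a flat list of strings
    patterns -- allowed (left length, right length) pairs in characters,
                used here as a stand-in for syllable counts
    min_freq -- minimum pair frequency (assumed threshold, not from the thesis)

    Returns {(left, right): (frequency, pmi)}. PMI is an assumed
    association measure, not necessarily the one used in the thesis.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scored = {}
    for (left, right), freq in bigrams.items():
        # Frequency check and length-pattern check come first.
        if freq < min_freq or (len(left), len(right)) not in patterns:
            continue
        # Correlation check: how much more often the pair occurs
        # together than independent occurrence would predict.
        p_pair = freq / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        scored[(left, right)] = (freq, math.log2(p_pair / (p_left * p_right)))
    return scored


# Toy usage: a segmenter has split the abbreviation 北大 into two tokens.
toy = ["北", "大", "的", "学生", "在", "北", "大", "学习"]
for pair, (freq, pmi) in score_candidates(toy).items():
    print("".join(pair), freq, round(pmi, 2))
```

In the toy input only the pair 北 + 大 passes both the pattern filter and the frequency threshold. A function-based check such as the "numeral+noun" pattern would additionally need part-of-speech tags, which this sketch leaves out.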