As the information age arrives, our lives depend more and more on computer technology, and large volumes of information and data have accumulated. These data are records of people's activities; by analyzing them we can uncover hidden patterns of behavior, which can then serve as the basis for marketing and strategic decision-making. Over recent decades, relational database technology has matured and is widely applied to statistics, queries, and similar tasks. But it cannot thoroughly explain the logical relationships among the fields of a database table, build a model from sample data, or predict future trends from current data. One of the difficult problems now facing the IT community is how to extract the hidden knowledge from massive information automatically and efficiently. Software development and application in this field are still at an early stage, so the "information explosion" continues with no effective response. Data collected in large database systems, or scattered across various kinds of text files, becomes a "data grave": it lies dormant from the moment it is entered. Data owners should make decisions based on this vast store of data, but in practice they rely on intuition, because they cannot extract valuable knowledge from the data. In such cases they resort to domain experts or an expert support system. Since current knowledge is never complete and must continually be discovered and improved, bias and errors in the resulting decisions are inevitable. Moreover, converting experts' knowledge and experience into a computer system's knowledge base is a time-consuming and laborious process. We hope that computer technology can help to find valuable knowledge from massive data more effectively and systematically, so that the "data grave" becomes a "data gold mine".
In 1989, Gregory Piatetsky-Shapiro put forward the concept of KDD (Knowledge Discovery in Databases). KDD is the process of discovering interesting, interpretable, useful, and novel knowledge from data, and data mining is an important step in KDD. The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining, which since 1989 has hosted an annual international conference and published its proceedings. Scholars in different fields, concerned with the development and research of knowledge discovery, have put forward a number of related terms, for example "data mining", "information extraction", "intelligent data analysis", "exploratory data analysis", "information discovery", "information harvesting", "data archaeology", and so on. "Knowledge discovery" and "data mining" are the most widely used. "Data mining" appears mainly in mathematical statistics, database systems, data analysis systems, and information management systems, while "knowledge discovery" is used more in artificial intelligence and machine learning. Many software vendors have rolled out their own data mining products and are eager to set the industry standard, for example the European-developed Cross Industry Standard Process for Data Mining (CRISP-DM 1.0, 1999) and the Java Data Mining standard (JDM 1.0, 2004). Freely available open-source systems such as RapidMiner, Weka, KNIME, and the R Project have become an informal standard for defining data-mining processes. Most of these systems can import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that they can be shared among different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009.
Our domestic universities and research institutions have also invested in the research and development of data mining technology, but practical applications remain scarce; for most domestic enterprises, data mining is still a "luxury consumer good". Financial institutions, securities companies, and telecommunications enterprises, thanks to their high degree of informatization, already meet the preconditions for applying data mining techniques. Each securities company provides its customers with appraisal reports on the performance of listed companies, systematic analyses of financial indexes, and classifications of listed companies into different states. This research implements data mining classification techniques and applies them to the classification of sub-groups of companies listed on the Shanghai Stock Exchange. It verifies whether financial-index data can significantly distinguish the performance of different groups, and ultimately how much reliable information financial indexes give investors. At the same time, it pays attention to abnormal groups and alerts investors to their changes. This paper introduces the concept of data mining, its current state of development, and related algorithms. Following the procedure of data collection, data cleaning and transformation, model building, and model evaluation and analysis, implemented with Excel 2007, SPSS, SQL Server 2005, C#, and other technologies, it achieves a classification of listed companies. To assign each financial index a corresponding coefficient, it conducts a principal component analysis, which also resolves the collinearity among the financial indexes. It then classifies the listed companies using the K-means clustering algorithm.
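The pipeline just described, principal component analysis to remove collinearity among the financial indexes, followed by K-means clustering, can be sketched in Python (the study itself used SPSS, SQL Server 2005, and C#). The data below are randomly generated stand-ins for real financial indexes, and the component-variance threshold and cluster count are illustrative assumptions, not the paper's actual settings:

```python
# Sketch, under stated assumptions: standardize the indexes, run PCA to
# obtain uncorrelated components, then cluster companies with K-means.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 100 hypothetical companies; 8 financial indexes built from 3 latent
# factors, so the columns are deliberately collinear.
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 5))])

X_std = StandardScaler().fit_transform(X)   # put all indexes on one scale
pca = PCA(n_components=0.95)                # keep 95% of the variance
scores = pca.fit_transform(X_std)           # uncorrelated component scores

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scores)
labels = km.labels_                          # one group label per company
print(pca.n_components_, np.bincount(labels))
```

Because the components are uncorrelated, the Euclidean distances that K-means relies on are no longer distorted by redundant, highly correlated indexes, which is the point of running PCA first.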
K-means clustering is a kind of unsupervised learning that groups cases according to the similarity of their attribute fields. It makes it easy to see whether examples cluster together naturally, and thus to verify whether they belong to the same underlying concept; this provides guidance for selecting input attribute fields for a supervised learning model and for further analysis. By the same reasoning, if cases cannot be clustered naturally in the unsupervised mode, a classification produced in the supervised mode cannot be confirmed as correct, and other attribute fields should be selected to redesign the model. In addition, the K-means algorithm can reveal a group of abnormal cases. In this paper, the abnormal group means companies with abnormal financial indexes, which may have falsified financial statements or may be performing either very well or very badly. To guard against investment risk, investors need to pay particular attention to these abnormal companies. Based on the results, we can extract common factors from the financial indexes. To make the review more persuasive, the analysis of financial indexes should be combined with experts' opinions.
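The idea of using K-means to surface an abnormal group can be sketched as follows: companies with extreme financial indexes tend to land together in a small, isolated cluster, so unusually small clusters are flagged for closer inspection. The data, the three-cluster choice, and the 5% size cutoff are illustrative assumptions, not the study's actual values:

```python
# Sketch, under stated assumptions: fit K-means, then flag companies
# that fall into a cluster holding fewer than 5% of all firms.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))    # stand-in financial indexes, 200 firms
X[:8] += 10.0                    # inject a handful of extreme firms

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
sizes = np.bincount(km.labels_, minlength=3)
small = np.flatnonzero(sizes < 0.05 * len(X))      # clusters with <5% of firms
abnormal_firms = np.flatnonzero(np.isin(km.labels_, small))
print(sizes, abnormal_firms)
```

In a real analysis the flagged companies would not be labeled fraudulent automatically; as the paragraph above notes, the statistical signal only marks candidates, and experts' opinions are still needed to judge each case.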