| The use of data mining as an interdisciplinary field of knowledge and technology has expanded dramatically. However, despite the widespread use of data mining approaches in solving complex problems, there is no consensus on the appropriateness of the application, the choice of the best data mining method for specific problems, or the attribute selection method. Furthermore, because the nature of data changes rapidly across disciplines, data mining methods need continuous adjustment to keep pace. In this thesis, our primary focus is on using data mining models and examining their suitability for Financial Statement Fraud (FSF) detection. FSF detection is interesting in the data mining context for the following reasons. (ⅰ) The class imbalance of financial data, caused by the rarity of fraudulent observations in a random sample, requires adequate attention before running the model. (ⅱ) Many ratios and raw data items are presented in financial statements, but not all of these variables are useful for identifying fraudulent data. Therefore, these variables need to be examined and the relevant attributes selected carefully. (ⅲ) Research is needed on the use of state-of-the-art data mining methods such as ensemble learning in fraud detection. In this thesis, we used two distinct datasets and fully developed and implemented two Cross Industry Standard Process for Data Mining (CRISP-DM) frameworks to detect fraudulent financial statements. We used financial data from China’s and the U.S. stock markets. We set one of these frameworks as the benchmark for the other. Furthermore, we set another external benchmark to validate and evaluate the results of this thesis. To the best of our knowledge, no other study in the research literature has two distinct frameworks derived from two different datasets. We extracted China’s data from the COMPUSTAT database and the U.S. data from the Accounting and Auditing Enforcement
Releases (AAER) database. The U.S. Securities and Exchange Commission (SEC) database is also used for obtaining real fraudulent data. To address the issue of data redundancy, we used four feature selection methods, three of which are state of the art and one of which is a classical model. We also extended the literature by proposing a new feature selection model that combines the genetic algorithm and fuzzy logic. Financial data is naturally imbalanced because fraudulent observations are rare in a dataset. To solve this problem, we utilized the Synthetic Minority Oversampling Technique (SMOTE) in the first framework and RUSBoost as the classifier in the other framework. RUSBoost has the advantage of both oversampling and undersampling. In the first framework, we first used a clustering-based classifier for preliminary data classification and then tested the success rate of five supervised classifiers according to their precision and recall values. To align the first framework with real-world scenarios, we set up experimental samples with different percentages of fraud, non-fraud, and suspicious data, and also formed testing samples of various sizes. The second benchmark of the study is Bao’s latest research, which uses raw data instead of financial ratios and ensemble learning instead of classical classifiers to test the effectiveness of his model. We also used raw financial data, ensemble learning, and the same performance metrics to build a common foundation for comparison. Furthermore, we expanded the research literature by introducing another performance indicator called "execution time". We find that the Multi-Layer Feed Forward Neural Network (MFFNN) model has a higher success rate than other classic models in the first framework. Based on this, we set the MFFNN classifier as the primary benchmark of this thesis. Based on comparisons between 14 financial ratios and 28 raw financial data items in this thesis, we find that having the theoretical
backing of financial ratios does not necessarily make these ratios more robust than raw data in fraud detection. We also find that a larger number of variables does not necessarily lead to a higher classifier success rate and may even degrade classifier results. Further, we find that while all five feature selection models gave more or less the same results, the Wilcoxon model was the best in terms of choosing the fewest variables without compromising the classifier’s success rate. Our proposed feature selection model has a similar rate of successful classification and a shorter running time. Using the same benchmark, we found that at least 7 out of 28 variables can be omitted. Our study contributes to the data mining literature by (ⅰ) developing two distinct data mining frameworks derived from two different datasets for FSF detection, (ⅱ) examining a wide range of classic classifiers and one state-of-the-art classifier and presenting a comprehensive comparative analysis, (ⅲ) addressing the problem of attribute selection using an innovative model and four existing models, and (ⅳ) introducing a new performance metric to identify real-time classifiers. The thesis also makes several theoretical and practical contributions to FSF detection. Among other things, this thesis expands the fraud diamond framework by adding a new dimension called "abnormal patterns." This new dimension is supported by the results of the empirical studies performed in this thesis. |
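
The abstract evaluates classifiers by precision, recall, and an added "execution time" indicator. The following is a minimal sketch of how these three numbers could be computed together for a binary fraud label (1 = fraud, 0 = non-fraud). The `evaluate` helper, the `toy_rule` classifier, and the sample rows are hypothetical illustrations and are not taken from the thesis; in particular, timing only the prediction step is an assumption about what "execution time" covers.

```python
import time
from typing import Callable, Dict, Sequence


def evaluate(predict: Callable[[Sequence[float]], int],
             rows: Sequence[Sequence[float]],
             labels: Sequence[int]) -> Dict[str, float]:
    """Return precision and recall for the fraud class (label 1),
    plus the wall-clock time spent producing the predictions."""
    start = time.perf_counter()
    preds = [predict(row) for row in rows]
    elapsed = time.perf_counter() - start

    # Confusion-matrix counts for the positive (fraud) class.
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "seconds": elapsed}


def toy_rule(row: Sequence[float]) -> int:
    """Hypothetical stand-in classifier: flag a row when its first
    feature exceeds 0.5. A real model would be plugged in here."""
    return 1 if row[0] > 0.5 else 0


# Hypothetical, deliberately imbalanced sample: 3 fraud cases out of 8.
rows = [[0.9], [0.7], [0.2], [0.6], [0.1], [0.3], [0.4], [0.05]]
labels = [1, 0, 1, 1, 0, 0, 0, 0]

result = evaluate(toy_rule, rows, labels)
print(result)
```

Reporting the elapsed time next to precision and recall makes it possible to compare two classifiers with similar success rates on how quickly they produce predictions, which is the kind of comparison the "execution time" indicator is meant to support.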