| Computer scientists and statisticians are making considerable efforts to design and implement algorithms and techniques for the efficient storage, management, processing, and analysis of biological databases. Data mining is an emerging area of computational intelligence that offers new theories, techniques and tools for processing large volumes of data (Sriraam, Natasha & Kaur, Data mining approaches for kidney dialysis treatment, 2006). Data mining and statistical learning techniques are used to discover consistent and useful patterns in large datasets, and they are applied in the fields of computational biology and bioinformatics. Computational biology and bioinformatics seek to solve biological problems by combining aspects of biology, computer science, mathematics, and other disciplines (Adams, Matheson & Pruim, BLASTED: Integrating biology and computation, 2008). The main focus of this study is to expand understanding of how biologists, medical practitioners and scientists can benefit from data mining and statistical learning techniques in the prediction of breast cancer survivability and prognosis, using the R statistical computing tool and the Weka machine learning tool. In this dissertation, data mining and statistical learning techniques were applied to breast cancer datasets for survival analysis. Breast cancer datasets from the University of California, Irvine (UCI) machine learning database system and the National Cancer Institute (NCI) biological database system were used for prediction and for a comparative study of the data mining and statistical learning techniques. The classifiers' results were mixed. In terms of accuracy, logistic regression outperformed the decision tree, SVM, AdaBoost, Bagging and naive Bayes algorithms. However, the artificial neural network showed a slight improvement over logistic regression, and the decision tree achieved slightly higher classification accuracy than the AdaBoost, Bagging and naive Bayes models.
No single algorithm emerged as the best model overall; the performance of each algorithm appeared to depend on the size, dimensionality and cleanliness of the dataset. |