Cancer Identification Based On Serum Protein Markers Feature Engineering And A Multi-model CV-Stacking Approach

Posted on: 2024-09-10  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Wang  Full Text: PDF
GTID: 2544307160979629  Subject: Master of Applied Statistics
Abstract/Summary:
Cancer has gradually become one of the most serious threats to human health worldwide. Research on cancer recognition relies mainly on genetic testing, that is, on genomic data, while cancer recognition based on proteomics has received relatively little study. Because blood proteins are simple and convenient to collect and cause little harm to the body, blood protein testing has better application prospects than complex and expensive genetic testing. Studying cancer recognition methods based on serum protein markers is therefore of great practical significance, and finding machine learning or statistical learning methods with higher classification performance supports clinical diagnosis and is an important step toward intelligent medicine. This thesis studies classification on the cancer data set of serum protein measurements published in Science by Cohen et al. of the Johns Hopkins University School of Medicine. For this data set, it improves cancer recognition performance in three respects: feature engineering method design, algorithm design, and classification model construction. The main work is as follows: (1) RFFS, a feature selection method based on random forest, and MIRFS, a feature selection method based on the feature mutual information ratio, are designed to screen out redundant features. (2) The feature information extraction ability of various feature engineering methods and the classification performance of various classification models are compared and analyzed. (3) Based on multiple ensemble classification algorithms and the multi-model Stacking framework, the CV-Stacking algorithm is designed for cancer classification and cancer AJCC classification. (4) Combining the feature engineering methods and the CV-Stacking algorithm designed in this thesis, the RFFS-CV-Stacking model for cancer classification and the MIRFS-CV-Stacking model for cancer AJCC classification are constructed. The results show that the RFFS-CV-Stacking and MIRFS-CV-Stacking models outperform the other classification models on the two classification tasks of the Cohen cancer data set, with accuracies of 84.98% and 73.18%, respectively. On the cancer classification task, they also perform significantly better than the supervised learning method used by Cohen et al. in their study. Verification of the proposed CV-Stacking algorithm on UCI public data sets likewise shows that its classification performance is better than that of the other classification models and that it is robust. This work can serve as a reference for research on proteomics-based cancer recognition and can support clinical cancer recognition and intelligent medicine.
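The abstract does not give the details of RFFS or CV-Stacking, so the following is only a generic sketch of the two ideas it names: feature selection driven by random-forest importances, followed by a stacked ensemble whose meta-learner is trained on cross-validated (out-of-fold) predictions. It uses scikit-learn on synthetic data; the base learners, the median importance threshold, the fold counts, and the helper name `build_model` are all assumptions made for illustration, not the thesis's method or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def build_model(random_state=0):
    """Hypothetical helper: RF-based feature selection + CV stacking."""
    # Step 1: keep features whose random-forest importance is at least
    # the median importance (the threshold is an illustrative assumption).
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=random_state),
        threshold="median",
    )
    # Step 2: stacking. With cv=5 the meta-learner is fitted on
    # out-of-fold predictions of the base learners.
    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
            ("gb", GradientBoostingClassifier(n_estimators=50, random_state=random_state)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )
    return make_pipeline(selector, stack)


# Synthetic stand-in data (NOT the Cohen et al. data set).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, random_state=0
)

model = build_model()
# The outer 5-fold CV estimates accuracy of the whole pipeline, so feature
# selection is refitted inside each training fold and does not see test data.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Placing the selector inside the pipeline, rather than selecting features once on the full data, keeps the reported accuracy from being inflated by information leaking from the held-out folds.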
Keywords/Search Tags: Cancer identification, proteomics, feature engineering, ensemble classification algorithm, Stacking model