Research On Oncology Data Mining Method Based On Multi-objective Optimization

Posted on: 2024-07-21  Degree: Master  Type: Thesis
Country: China  Candidate: L Zhao  Full Text: PDF
GTID: 2544307172481394  Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of high-throughput sequencing technology, the volume of tumor omics data collected at the molecular level has grown exponentially. When machine learning models are applied directly to tumor omics data for diagnosis and prognosis, they learn a large amount of redundant feature information, which degrades model performance. In addition, medical omics data are inherently class-imbalanced: diseased samples are far fewer than non-diseased samples, which biases the learned model toward the majority category. This is undesirable because the minority categories are usually the ones that carry the information on diseased samples. There is therefore an urgent need to mine, from high-dimensional tumor omics data, features that give high precision and classify the minority categories well. To this end, this thesis takes multi-omics data of colorectal cancer (CRC) as its research object and combines machine learning methods to mine the decisive features in CRC data and to improve recognition of the minority categories. The main research work is as follows:

1. Multi-omics CRC data were downloaded from the TCGA database and preprocessed. Preprocessing comprised numeric encoding of the classification labels, one-hot encoding of character-valued features, removal of samples, handling of missing values, and merging of the data sets. This lays the foundation for the subsequent feature selection, imbalance studies, and interpretability analysis of the CRC classification models.

2. Feature selection was carried out in two stages. In the first stage, different machine learning classification algorithms were used as base estimators for Recursive Feature Elimination (RFE), producing different feature subsets, and the union of these subsets was taken. In the second stage, the feature subsets were analyzed so as to exploit the respective strengths of the different algorithms: the intersection of the subsets was used as the base set, and the remaining features in the union were sorted and added to the base set one at a time. On each of the resulting growing union feature sets, different classification algorithms were used to classify and predict CRC.

3. Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Stacking were each used to classify the union feature subsets mined by U-RFE. The performance of each classification model was analyzed until the set with the best prediction performance was found; this set was taken as the final decisive union feature set.

4. Based on the decisive union feature set, two algorithm-level frameworks were proposed: a phased Stacking ensemble learning framework and a genetic-algorithm-based multi-objective optimization framework. Both aim to improve classification performance on the minority categories as much as possible while maintaining good overall classification accuracy.

5. To address the black-box nature of machine learning models, the local interpretable model-agnostic explanations (LIME) method was applied, on the decisive union feature set, to explain the internal decision basis of the multi-objective optimization model and of the logistic regression model on CRC data. The decision rules that each of the two models uses for each category of CRC data were summarized to assist doctors in diagnosing colorectal cancer.
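
The second-stage set construction described in item 2 (intersection as the base set, then the rest of the union added one feature at a time) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function name `union_candidate_sets`, the feature names, and the example subsets are hypothetical, and the sorting criterion is an assumption: the abstract does not say how the remaining features are ordered, so the sketch ranks them by how many RFE subsets selected them.

```python
from collections import Counter
from typing import Iterable, Iterator


def union_candidate_sets(subsets: Iterable[Iterable[str]]) -> Iterator[list[str]]:
    """Yield nested candidate feature sets built from several RFE subsets.

    The first candidate is the intersection of all subsets (the base set).
    Each later candidate adds one more feature from the rest of the union.
    Assumption: remaining features are ordered by how many subsets selected
    them, with ties broken alphabetically.
    """
    subsets = [set(s) for s in subsets]
    if not subsets:
        return  # nothing to combine

    base = set.intersection(*subsets)
    union = set.union(*subsets)

    # Count how many subsets contain each feature (assumed ranking signal).
    votes = Counter(f for s in subsets for f in s)
    remaining = sorted(union - base, key=lambda f: (-votes[f], f))

    current = sorted(base)
    yield list(current)
    for feature in remaining:
        current.append(feature)
        yield list(current)


# Hypothetical feature subsets, one per RFE base estimator.
rfe_subsets = {
    "LR": ["g1", "g2", "g3"],
    "SVM": ["g1", "g2", "g4"],
    "RF": ["g1", "g3", "g5"],
}
candidates = list(union_candidate_sets(rfe_subsets.values()))
for features in candidates:
    print(features)
```

Each candidate set would then be scored with the classifiers named in item 3, and the best-scoring set kept as the decisive union feature set.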
Keywords/Search Tags: multi-omics data, feature selection, category imbalance, machine learning, multi-objective optimization, interpretability studies