| Colorectal cancer is a common type of cancer. Because of its alarming incidence and mortality rates, early detection and treatment have received increasing attention. Colorectal polyps form and grow in the initial stages of most colorectal cancer cases, so detecting and removing them can effectively reduce the incidence of colorectal cancer. In clinical practice, colonoscopy is the primary means of detecting whether an individual has polyps. However, the examination is expensive and painful, and compliance is low. Because medical resources are limited and the population is large, it is difficult for China to carry out colonoscopy screening for the entire target population as governments in developed countries do. It is therefore more desirable in China than in industrialized countries to characterize the relationships between colorectal polyp occurrence and various potential determinants. Based on these factors, a risk prediction model was constructed in this study. The model can then be used to predict polyp incidence for each individual, so that personalized screening and treatment programs can be provided for the higher-risk groups. This study is part of the key project “Data Analysis and Decision for Healthcare”, which is funded by the National Natural Science Foundation of China. Analysis of physical examination data collected from a Grade A tertiary hospital in Beijing confirms the risk factors for colorectal polyps and yields several colorectal polyp risk prediction models. The results can be used to guide colorectal polyp screening, improve the utilization of medical resources, and reduce the incidence of colorectal cancer in China. Research on colorectal polyp risk prediction can play an important role in reducing the incidence of colorectal cancer in China. The literature review found that this study fills a gap in the field of disease risk prediction. Through data exploration on the oral
physical examination dataset, the quality of the dataset is improved and potential patterns are found. The data exploration process includes missing value processing, outlier processing, univariate analysis, and bivariate analysis. Most previous disease risk prediction studies use traditional biostatistical methods. In addition to these, this study adopts four machine learning methods: decision trees, random forests, boosted trees, and artificial neural networks. A comparison of the test predictions of the five models shows that the neural network model achieves the highest test accuracy. The study further analyzes the relative importance and partial dependence of the variables. In this paper, risk prediction models for colorectal polyps are constructed using machine learning methods based on Chinese population data. The machine learning methods, especially the artificial neural network, achieve more accurate results than traditional biostatistical methods. This study also confirms emotional tendency as an important risk factor for colorectal polyps. It can not only guide the implementation of personalized screening programs but also provide new insights into the primary prevention of colorectal cancer. |