| Lung cancer is one of the most dangerous malignant tumors and a serious threat to human health. Patient survival rates would improve if a method for early diagnosis were available. Breath detection is a promising candidate because it is both fast and non-invasive. Based on the detection of volatile organic compounds (VOCs) in human exhaled breath, this paper reports research on data mining for lung cancer-associated VOCs, starting from the design of a gas sampling instrument and extending to an SQL database and a Hadoop platform. The main contributions are: 1. Optimized design of the gas sampling instrument, covering the gas path, circuits, software, and a standard sampling procedure. To meet the needs of breath sampling, an apparatus was developed that collects both VOCs and exhaled breath condensate (EBC); it proved effective in several experiments. 2. Using the designed instrument, this paper analyzed over 5000 VOCs determined by gas chromatography-mass spectrometry (GC-MS) and identified 5 confirmed lung cancer-associated VOCs as biomarkers, along with a group of potential biomarkers. An optimized diagnostic model based on the selected VOCs was built using random forest; it achieved an overall accuracy of 86.89%. 3. Based on the diagnostic model, a lung cancer database was designed to provide the basis for data storage and management as well as further mining and analysis. The database is implemented in MySQL. 4. To address the lack of cross-validation across multiple fields and of multi-dimensional analysis, a cancer data cloud platform based on Hadoop was built. This paper completed the platform architecture and deployment, and then investigated parallel algorithms based on the MapReduce programming model. |
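
The abstract does not describe the parallel algorithm itself, so the following is only a minimal sketch of the MapReduce programming model it refers to, written in plain Python rather than against the Hadoop API. It counts how many samples in each group contain a given VOC. The sample records, the VOC names (`voc_a`, `voc_b`, `voc_c`), and the function names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical records: (sample_id, label, VOC names detected in that sample).
# These values are illustrative only and do not come from the paper.
SAMPLES = [
    ("s1", "cancer", ["voc_a", "voc_b"]),
    ("s2", "cancer", ["voc_a"]),
    ("s3", "control", ["voc_b", "voc_c"]),
]


def map_sample(record):
    """Map step: emit ((label, voc), 1) once per distinct VOC in a sample."""
    _sample_id, label, vocs = record
    return [((label, voc), 1) for voc in set(vocs)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def voc_counts(samples):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_sample(r) for r in samples)
    return reduce_counts(shuffle(mapped))


counts = voc_counts(SAMPLES)
```

On a real Hadoop cluster the map and reduce functions would run on separate nodes and the framework would perform the shuffle; here all three steps run in one process purely to show the data flow.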