| As the number of data samples and the number of dimensions grow, the era of big data places higher demands on data analysis and machine learning algorithms. Machine learning has gone through a long period of research and development; from early theoretical work to practical applications in modern life, it has changed how we live and work and has shown strong vitality. This article studies the application of machine learning in bioinformatics and in chemical processes. Lung cancer is one of the most dangerous diseases facing humanity, and smoking is a major cause of it. The difference between the patterns of smokers and non-smokers with lung cancer is therefore worth studying. Based on genome-wide expression, methylation, and copy number variation data from lung adenocarcinoma patients, TCGA data are used as the training set and EDRN/SPORE data as the test set. As a novel step, the samples are classified using gene expression differences, known important genes, and a partial least squares correlation algorithm, thereby identifying different patterns and screening key characteristic genes. In total, 43 gene expression characteristic genes, 48 methylation characteristic genes, and 75 copy number variation characteristic genes were obtained. The accuracies on the TCGA training set are 79.2%, 87.5%, and 77.1%; on the EDRN/SPORE test set they are 86.3%, 76.4%, and 77.3%. Finally, the results are verified against the Kyoto Encyclopedia of Genes and Genomes (KEGG), which strengthens the credibility of the selected characteristic genes. Fault detection and diagnosis is an important safeguard for safe production in chemical processes and for the interests of factories. With the rapid development of sensors and Internet of Things technology, industrial data have come to be characterized by large volume, correlated variables, and time-varying behavior. In this paper, a recursive 
distributed principal component analysis based on mutual information of variables (IRDPCA) is proposed. To account for the correlation among variables in industrial data, mutual information (MI) is used to measure the relationships among variables and divide them into blocks. To address the big data problem, MapReduce-based recursive distributed principal component analysis is used for modeling, and it is optimized with a forgetting factor so that new data are not submerged in old data. Correspondingly, a recursive Bayesian decision fusion and recursive hierarchical fault diagnosis scheme is proposed after recursive modeling. The performance of IRDPCA was evaluated on a fluorochemical engineering process and on the Tennessee Eastman process, into which a slow change in agitator efficiency that does not affect product quality was introduced. Thanks to reasonable block division and the ability to track variable changes over time, IRDPCA shows a distinct advantage. |
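
The role of the forgetting factor in recursive modeling can be illustrated with a minimal sketch. It is not the IRDPCA implementation: it omits the mutual-information block division, the MapReduce distribution, and the Bayesian fusion, and it assumes one common exponentially weighted update of the mean and covariance. The class name `RecursivePCA`, its parameters, and the default value of `forgetting` are hypothetical choices made for illustration only.

```python
import numpy as np


class RecursivePCA:
    """Illustrative recursive PCA with a forgetting factor (assumed form)."""

    def __init__(self, n_features, n_components, forgetting=0.99):
        if not 0.0 < forgetting < 1.0:
            raise ValueError("forgetting must lie strictly between 0 and 1")
        self.lam = forgetting              # weight kept by the old model
        self.k = n_components
        self.mean = np.zeros(n_features)   # running mean estimate
        self.cov = np.eye(n_features)      # running covariance estimate

    def update(self, x):
        """Fold one new sample into the model, down-weighting old data."""
        x = np.asarray(x, dtype=float)
        # Exponentially weighted mean: old data decays by lam at every step.
        self.mean = self.lam * self.mean + (1.0 - self.lam) * x
        d = x - self.mean
        # Exponentially weighted covariance (simplified, assumed update rule).
        self.cov = self.lam * self.cov + (1.0 - self.lam) * np.outer(d, d)

    def components(self):
        """Return the k largest eigenvalues and their eigenvectors."""
        eigvals, eigvecs = np.linalg.eigh(self.cov)   # ascending order
        order = np.argsort(eigvals)[::-1][: self.k]   # keep the k largest
        return eigvals[order], eigvecs[:, order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = RecursivePCA(n_features=3, n_components=2, forgetting=0.95)
    # Old operating condition centered at 0, then a new one centered at 5.
    for _ in range(300):
        model.update(rng.normal(0.0, 1.0, size=3))
    for _ in range(500):
        model.update(rng.normal(5.0, 1.0, size=3))
    print("tracked mean:", model.mean)
```

A smaller `forgetting` value makes the model follow recent samples more closely at the cost of noisier estimates; a value close to 1 behaves more like an ordinary batch model in which new data carry little weight.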