| In recent decades, tumors have remained among the most difficult diseases to cure; they seriously endanger human health, and no generally effective targeted treatment has yet been found. Treating cancer first requires correctly determining the tumor subtype, a task known in bioinformatics as tumor classification. The emergence of high-throughput technology has produced a large volume of tumor data from different platforms and different laboratories. These data are costly to produce, yet they typically combine small sample sizes with a large number of genes, and data from a single platform or a single laboratory may be one-sided and unreliable. This paper therefore addresses two problems: reliance on a single data source and poor tumor classification performance. (1) An analysis of the current mainstream data integration algorithms shows that the ComBat fusion algorithm performs well on all evaluation indexes, especially for small samples (fewer than 25). However, this study found that ComBat moves all batches toward the overall mean, so when data transformed by ComBat are used with a validated, fixed gene signature, the signature values on those data are shifted as well. To address this, the paper computes the mean and variance from a single batch and cycles through the batches to be fused, trying each as the reference and selecting the best one, instead of adjusting with the traditional overall mean and population variance. In addition, this study found that current fusion algorithms process each batch as a whole and do not take into account the characteristics of the samples within it. To solve this problem, the paper divides each batch according to the first principal component (F1) of its features: a threshold k splits the batch into two parts (k+ for samples above the threshold, k- for samples below it), and the same is done for the other batches. The k+ parts of the batches are then fused, the k- parts are fused, and finally the two fused parts are fused with each other. Comparative experiments show that the improved method performs very well against four other fusion algorithms. (2) Building on the study of the fusion algorithm, this paper found that applying fused data can improve results in many areas, such as the selection of specific genes and regulatory networks. To address the low accuracy of current tumor classification, this paper presents a framework for tumor classification. It first analyzes classification on a single data set and then improves classification accuracy through data fusion. Because a single data set and a combination of several data sets differ in sample size and cannot be compared directly, the paper also presents a simple comparison method. Finally, two groups of experiments are carried out on real breast cancer data, using common machine learning classification algorithms, and they verify that tumor classification based on cross-platform data fusion is effective. |
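
The two ideas in part (1) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes each batch is a samples-by-genes array, it uses a plain per-gene location and scale adjustment with no empirical Bayes shrinkage (so it is a simplification of ComBat, not ComBat itself), and the function names `adjust_to_reference`, `choose_reference`, and `split_by_first_pc` are hypothetical. The abstract does not state how the "best" reference batch is judged, so the scoring function is left to the caller.

```python
import numpy as np


def adjust_to_reference(batches, ref_idx):
    """Shift and scale every batch to the per-gene mean and standard
    deviation of one reference batch (assumption: simplified
    location/scale adjustment, without empirical Bayes shrinkage).

    batches: list of arrays shaped (samples, genes), each with >= 2 samples.
    """
    ref = batches[ref_idx]
    ref_mean = ref.mean(axis=0)
    ref_std = ref.std(axis=0, ddof=1)
    adjusted = []
    for i, batch in enumerate(batches):
        if i == ref_idx:
            # The reference batch is left untouched.
            adjusted.append(batch.copy())
            continue
        mean = batch.mean(axis=0)
        std = batch.std(axis=0, ddof=1)
        std = np.where(std > 0, std, 1.0)  # guard against constant genes
        adjusted.append((batch - mean) / std * ref_std + ref_mean)
    return adjusted


def choose_reference(batches, score):
    """Try each batch as the reference and return the index whose adjusted
    data maximizes `score`. `score` is a caller-supplied, hypothetical
    quality measure that takes the list of adjusted batches."""
    return max(
        range(len(batches)),
        key=lambda i: score(adjust_to_reference(batches, i)),
    )


def split_by_first_pc(batch, k=0.0):
    """Split the samples of one batch by their first principal component
    score: rows scoring above k form the k+ part, the rest the k- part
    (assumption: ties at exactly k go to the k- part)."""
    centered = batch - batch.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    return batch[scores > k], batch[scores <= k]
```

Under these assumptions, the k+ parts from all batches would be passed to `adjust_to_reference` together, likewise the k- parts, and the two results would then be fused in a final pass. Keeping the reference batch unchanged is what prevents values tied to a fixed gene signature from drifting for that batch.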