Study On Parallel Implementation Of Screening Algorithm SAM Of Differential Expression Gene For Renal Cell Carcinoma Based On Spark

Posted on: 2019-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Zheng
Full Text: PDF
GTID: 2394330548456873
Subject: Engineering
Abstract/Summary:
Renal cell carcinoma is one of the most common tumors worldwide. Although researchers have studied its pathogenesis and treatment in depth, the specific causes of the disease and effective treatments have not yet been found. As research on human genes has advanced, screening existing samples for disease-related differentially expressed genes has become a focus of current genetics and medical research, and it is of great significance for overcoming the disease from a genetic point of view. DNA chip technology is currently the main technology for studying genes, and researchers can use it to screen out disease-related differentially expressed genes. In 2001, Virginia Tusher and Robert Tibshirani proposed the SAM (Significance Analysis of Microarrays) algorithm. The SAM algorithm not only screens out more differentially expressed genes but also keeps the false discovery rate (FDR) at a relatively low level.

With the advent of the big data era, the big data analysis technology Spark emerged. Spark computes in memory, which avoids disk reads during the calculation, and provides the RDD (resilient distributed dataset) as a fast tool for large-scale data processing. It can perform complex batch processing and parallel computation and speeds up operations. It is currently a main technology for big data analysis and can filter and analyze big data quickly and efficiently. With the in-depth study of human genes, researchers have obtained large amounts of gene expression data, and the efficiency of traditional single-machine serial computation can no longer meet their needs. To improve the efficiency of data mining, this thesis combines gene expression profile data of renal cell carcinoma with Spark and runs the SAM algorithm in parallel, which screens out differentially expressed genes quickly and efficiently. The purpose of adopting Spark is to improve the efficiency of screening for differentially expressed genes, which is of great significance for the in-depth study of the pathogenesis and treatment of the disease.

We first download the raw renal cell carcinoma data from the GEO database, comprising an experimental group and a control group, and preprocess it to obtain the gene expression data needed for the experiment. We then use the Spark big data analysis and computing platform to parallelize the SAM algorithm on this gene expression profile data. A Spark cluster is built on Linux using VMware virtual machines, and the Spark-Shell provided by Spark is used for interactive analysis and computation, which yields the running time for screening out the differentially expressed genes. The R language is then used to run the SAM algorithm serially in stand-alone mode as a comparison experiment. The results of the Spark parallel experiment are compared with those of the R experiment to obtain the efficiency improvement ratio of the SAM algorithm. Finally, based on this research process, a SAM algorithm parallelization system is implemented. It presents an introduction to the SAM algorithm, the original data, the screened differentially expressed genes, and SAM visualization plots. The system helps researchers who want to understand the SAM algorithm and supports further analysis and experimentation with the differentially expressed genes.

The parallel experiment screened out a total of 1224 differentially expressed genes related to renal cell carcinoma, of which 540 were up-regulated and 684 were down-regulated, with a running time of 6237 ms. The serial comparison experiment in R screened out a total of 1181 differentially expressed genes, of which 570 were up-regulated and 611 were down-regulated, with a running time of 64043 ms. Compared with the serial experiment, the parallelized SAM algorithm improved efficiency by more than 10 times.

Future work will focus on building a physical cluster, parallelizing the SAM algorithm on the renal cell carcinoma gene expression profile data to screen out differentially expressed genes, and measuring the efficiency improvement relative to the serial and virtual-machine cluster experiments.
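
As a rough illustration of why SAM lends itself to parallelization, the sketch below computes the SAM relative difference d = (mean of group 1 - mean of group 2) / (s + s0) independently for each gene. It is a minimal pure-Python sketch, not the thesis's Spark or R implementation: the gene names, the toy expression values, and the fudge factor `s0=0.1` are all hypothetical, and the permutation step that SAM uses to estimate the FDR is omitted. The last line only reproduces the speedup arithmetic from the two running times reported in the abstract.

```python
import math

def sam_d_statistic(group1, group2, s0=0.0):
    """SAM relative difference for one gene: (mean1 - mean2) / (s + s0)."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Pooled sum of squared deviations across both groups.
    ss = sum((x - mean1) ** 2 for x in group1) + sum((x - mean2) ** 2 for x in group2)
    # Gene-specific scatter s; s0 is a small constant that stabilizes
    # the ratio for genes with very low variance.
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean1 - mean2) / (s + s0)

def score_genes(expression, s0=0.1):
    """Score every gene independently; each gene needs only its own row,
    which is what makes the computation easy to distribute."""
    return {gene: sam_d_statistic(g1, g2, s0) for gene, (g1, g2) in expression.items()}

# Hypothetical toy data: gene -> (experimental group, control group).
toy = {
    "GENE_UP": ([8.0, 9.0, 10.0], [2.0, 3.0, 4.0]),
    "GENE_DOWN": ([2.0, 3.0, 4.0], [8.0, 9.0, 10.0]),
    "GENE_FLAT": ([5.0, 6.0, 7.0], [5.0, 6.0, 7.0]),
}
scores = score_genes(toy)

# Speedup implied by the reported running times (serial ms / parallel ms).
speedup = 64043 / 6237
```

Because each gene's score depends only on that gene's own measurements, the same per-gene function could in principle be applied to a keyed Spark RDD (for example via `mapValues`), although how the thesis actually partitions the work is not stated in the abstract.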
Keywords/Search Tags:Differentially Expressed Genes, SAM Algorithm, Renal Cell Carcinoma, Spark