| Single-cell RNA sequencing (scRNA-seq) enables transcriptome analysis at single-cell resolution and holds promise for precision medicine. The workflow from raw FASTQ data to the gene expression matrix is called scRNA-seq data processing; it comprises sequence quality control, genome alignment, and transcript counting. As sequencing throughput continues to rise and sequencing costs fall sharply, the volume of scRNA-seq data is growing exponentially. Conventional scRNA-seq data processing tools run only on a single machine and offer limited parallelism. Large amounts of intermediate data must be written to and read back from disk between steps, which makes these tools extremely slow. This cannot meet the need for rapid and accurate scRNA-seq data processing in clinical practice. Based on an in-depth analysis of the most widely used existing scRNA-seq data processing tools, we developed ScSpark, a high-performance scRNA-seq data processing tool built on Spark and Hadoop that supports in-memory computing and is scalable. ScSpark distributes scRNA-seq data processing tasks across a cluster and significantly reduces disk access for intermediate results. We implement sequence quality control and transcript counting by combining Spark with our own functions, and we use the STAR program as the aligner. To avoid unnecessary disk access when reading FASTQ files and writing SAM files, we use the Java Native Interface (JNI) to pass the data of the FASTQ RDD to the aligner and wrap the returned value as a SAM RDD. In multi-node and single-node experiments, we demonstrate that ScSpark significantly improves performance over existing scRNA-seq data processing tools and scales well, and we also perform a preliminary biological verification. ScSpark’s higher CPU and memory utilization relative to traditional tools is an important reason for its improved performance. As the price of computer hardware continues to fall and the throughput of scRNA-seq continues to increase, the strategy of trading CPU and memory resources for time becomes increasingly worthwhile. ScSpark is designed to meet the future need of precision medicine for rapid, high-throughput scRNA-seq data processing. |