Research On The Algorithm Of Gene Data Analysis Based On MapReduce

Posted on: 2015-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: J J Tu
Full Text: PDF
GTID: 2180330467964516
Subject: Computer application technology
Abstract/Summary:
Gene data analysis is an active research topic that attracts many researchers in machine learning and data mining. Gene read mapping is a key step in gene data analysis, and gene clustering is both an important method for gene function analysis and a critical means by which biologists discover the functions of unknown genes. Both have therefore received wide attention.

However, with the rapid development of next-generation sequencing technology, the massive growth of gene read data has made traditional serial read mapping inefficient and no longer practical. Likewise, applying existing serial gene clustering algorithms directly to large-scale gene expression data is inefficient. The design of efficient parallel algorithms for gene read mapping and gene clustering is therefore the focus of this thesis. MapReduce, one of the popular parallel technologies in academia and industry, has been widely recognized. This thesis studies in depth the parallelization of gene read mapping and gene clustering using MapReduce. The main contributions are as follows:

1. Two MapReduce-based read mapping algorithms are proposed, PSeqMap and PJuncSeqMap. PSeqMap combines MapReduce with the read mapping algorithm of the SeqMap software and implements parallel mapping of reads that do not cross splice sites. PJuncSeqMap extends PSeqMap by splitting, matching, and stitching reads, which implements parallel mapping of reads that cross splice sites. Both algorithms use a load-balancing strategy whose main idea is to use random sampling to detect nodes likely to carry a higher load and then distribute the load evenly. Experiments on the Arabidopsis gene dataset show that the proposed algorithms are effective and efficient.

2. An improved read mapping algorithm in the MapReduce framework, MPJuncSeqMap, is proposed. It uses the Hadoop distributed cache mechanism, incorporates useful biological information into the algorithm, and reduces the time complexity of PJuncSeqMap. Experiments on the Arabidopsis gene dataset show that the proposed algorithm further improves the efficiency of read mapping at the cost of slightly lower sensitivity.

3. A density-based hierarchical clustering algorithm in the MapReduce framework, DisDHC, is proposed. The algorithm applies DHC to each subset of the gene data within the MapReduce framework to obtain sparse data, which is then clustered again. Experiments on the yeast dataset (GAL), the yeast cell cycle dataset (Cellcycle), and the human serum dataset (Serum) show that DisDHC effectively improves clustering efficiency while maintaining the accuracy of the original clustering algorithm.
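
The abstract describes the load-balancing strategy only in outline: sample randomly, detect the nodes likely to be overloaded, then spread the load evenly. The following Python sketch illustrates one way such a strategy could work and is not the thesis's implementation. It assumes that records are grouped by a string key (here a hypothetical chromosome label), that the sampled key frequencies approximate the true load, and that keys are assigned to reducers greedily, heaviest key first. The function names and the CRC32 fallback for unsampled keys are all assumptions made for illustration.

```python
import heapq
import random
import zlib
from collections import Counter


def estimate_key_loads(records, key_fn, sample_size, seed=0):
    """Estimate per-key load by counting keys in a random sample of records."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    return Counter(key_fn(record) for record in sample)


def assign_keys_to_reducers(key_loads, num_reducers):
    """Greedy assignment: give the heaviest remaining key to the
    reducer with the smallest estimated load so far."""
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load; break ties by key so the result is deterministic.
    for key, load in sorted(key_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def partition(key, assignment, num_reducers):
    """Route a key to its assigned reducer; keys that never appeared
    in the sample fall back to a stable hash partition."""
    if key in assignment:
        return assignment[key]
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Hypothetical skewed input: (chromosome label, read id) pairs.
    reads = [("chr1", i) for i in range(600)]
    reads += [("chr2", i) for i in range(200)]
    reads += [("chr3", i) for i in range(100)]
    reads += [("chr4", i) for i in range(100)]

    loads = estimate_key_loads(reads, key_fn=lambda r: r[0], sample_size=200)
    plan = assign_keys_to_reducers(loads, num_reducers=2)
    for chrom in sorted(plan):
        print(chrom, "-> reducer", partition(chrom, plan, 2))
```

A key-level assignment like this cannot split a single very heavy key across reducers; handling that case would require a finer-grained key, for example a chromosome label combined with a position range, which the abstract does not specify.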
Keywords/Search Tags: MapReduce, gene read mapping, gene cluster