Research On The Algorithm Of Gene Data Analysis Based On MapReduce

Posted on:2015-02-08

Degree:Master

Type:Thesis

Country:China

Candidate:J J Tu

Full Text:PDF

GTID:2180330467964516

Subject:Computer application technology

Abstract/Summary:

Gene data analysis is an active research topic in machine learning and data mining. Gene read mapping is a key step in gene data analysis, and gene clustering is both an important method for analyzing gene function and a critical means for biologists to infer the function of unknown genes. Both have therefore received wide attention from researchers. However, with the rapid development of next-generation sequencing technology, the massive growth in gene read data has made traditional serial read mapping inefficient and no longer applicable. Likewise, applying existing serial gene clustering algorithms directly to large-scale gene expression data is inefficient. Designing efficient parallel algorithms for gene read mapping and gene clustering is therefore the focus of this thesis. MapReduce is a popular parallel computing technology that is widely recognized in academia and industry. This thesis studies in depth the parallelization of gene read mapping and gene clustering using MapReduce. The main contributions are as follows:

1. Two MapReduce-based read mapping algorithms, PSeqMap and PJuncSeqMap, are proposed. PSeqMap combines MapReduce with the read mapping algorithm of the SeqMap software to implement parallel mapping of reads that do not cross splice sites. PJuncSeqMap extends PSeqMap by splitting, matching, and stitching reads to implement parallel mapping of reads that cross splice sites. Both algorithms use a load-balancing strategy whose main idea is to use random sampling to detect nodes likely to carry a higher load and then distribute the load evenly. Experiments on an Arabidopsis gene dataset show that the new algorithms are effective and efficient.

2. An improved read mapping algorithm in the MapReduce framework, MPJuncSeqMap, is proposed. It uses Hadoop's distributed cache mechanism, incorporates useful biological information into the algorithm, and reduces the time complexity of PJuncSeqMap. Experiments on the Arabidopsis gene dataset show that the proposed algorithm further improves read mapping efficiency at the cost of slightly lower sensitivity.

3. A density-based hierarchical clustering algorithm in the MapReduce framework, DisDHC, is proposed. The algorithm applies DHC to each subset of the gene data within the MapReduce framework to obtain sparse data, which is then clustered again. Experiments on the yeast dataset (GAL), the yeast cell cycle dataset (Cellcycle), and the human serum dataset (Serum) show that DisDHC effectively improves clustering efficiency while maintaining the accuracy of the original clustering algorithm.
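
The load-balancing idea mentioned in the abstract (sample at random to spot heavily loaded nodes, then spread the load evenly) can be illustrated with a small sketch. This is not the thesis's implementation, which the abstract does not describe in detail. It assumes that each record carries a partition key, that key frequencies in a random sample approximate the real load, and that a greedy heaviest-first assignment is an acceptable balancing rule. The function names and the example keys (`chr1`, `chr2`, ...) are hypothetical.

```python
import heapq
import random
from collections import Counter


def estimate_key_loads(keys, sample_size, seed=0):
    """Estimate per-key load from a uniform random sample of record keys.

    Assumption: the sampled frequency of a key approximates its true load.
    """
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return Counter(sample)


def balanced_assignment(key_loads, num_reducers):
    """Greedy assignment: heaviest key first, always to the lightest reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load; break ties by key so the result is deterministic.
    for key, load in sorted(key_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        current, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (current + load, reducer))
    return assignment


def reducer_loads(key_loads, assignment, num_reducers):
    """Sum the estimated load that each reducer receives."""
    loads = [0] * num_reducers
    for key, load in key_loads.items():
        loads[assignment[key]] += load
    return loads


if __name__ == "__main__":
    # Hypothetical estimated loads, for illustration only.
    loads = {"chr1": 8, "chr2": 4, "chr3": 3, "chr4": 1}
    plan = balanced_assignment(loads, num_reducers=2)
    print(plan, reducer_loads(loads, plan, 2))
```

In a Hadoop job, a mapping like `plan` would typically be consulted by a custom partitioner. Keys that never appear in the sample need a fallback rule, such as hashing the key modulo the number of reducers.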

Keywords/Search Tags:

MapReduce, gene read mapping, gene cluster

Related items

1	Identification Of Differentially Expressed Gene Sets Based Cluster
2	Building The Cluster Of Bacterial Essential Gene Model And Hence Constructing Its Minimal Gene Set
3	Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash
4	Expression Of Ureases From Different Sources In Prokaryotic System
5	Comparative Analysis Of Mutant Gene Cbn1 With Mutant Gene Cao And Molecular Mapping Of CAO Gene In Chlamydomonas Reinhardtii
6	Cloning And Characterization Of The Citreamicins Biosynthetic Gene Cluster From Micromonospora Citrea NRRL18351
7	Construction And Application Of The Platform For Gene Cluster Cloning And Modification Using Red/ET Homologous Recombination For Heterologous Expression
8	Study On The Function Of Overlapping Transcription Of SIS Gene Subcluster Of Zebrafish Six2a-OT And Six6b-OT In SIX Gene Cluster
9	Clone And Gene Expression Analysis Of Kai Gene Cluster In Arthrospira Maxima
10	Cloning And Function Verification Of The Antagonistic-related Gene Cluster Of Enterobacter Cloacae B8