
The Construction And Evaluation Of Parallel Biological Computing System And Genomic Clustering And Function Annotation Of Metastasis-related Genes

Posted on: 2007-05-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Zhu
GTID: 1104360212990115
Subject: Gynecologic Oncology
Abstract/Summary:
The Construction and Evaluation of a Parallel Biological Computing System

[OBJECTIVE] To investigate the construction strategy and implementation methods for a parallel computing cluster, the configuration and installation of parallel bioinformatics software, and the applications of that software, and to discuss the performance gains of the cluster and their significance for accelerating bioinformatics research.

[METHODS] A de novo construction strategy was adopted. The frontend and five compute nodes were connected to a switch via Gigabit Ethernet. RedHat Linux AS4 Update 2 was installed on the frontend from installation discs. DHCP, NFS, and TFTP servers were then configured so that installation on the other nodes could be started via PXE boot. After installation, password-free secure shell (SSH) communication was set up between any two nodes. The MPICH parallel compilation and runtime environment, the Ganglia visual cluster monitoring tool, and the OpenPBS job management system were built on the frontend. Once the parallel environment was in place, parallel bioinformatics software modules were installed for sequence BLAST searches, EST clustering and assembly, docking, and molecular dynamics simulation. Finally, cluster performance was tested with different tasks and the operating parameters were tuned for best performance.

[RESULTS] The parallel computing environment was constructed and the parallel bioinformatics software was installed successfully. The OpenPBS job management system made the allocation of computing resources more rational. The working status, task completion, and queue status of each node could be monitored easily with the Ganglia visual cluster monitoring tool. Authorized users could submit computing tasks either from the frontend or over SSH.
This parallel biological computing cluster is powerful enough for computationally intensive bioinformatics work such as sequence BLAST searches, EST clustering and assembly, docking, and molecular dynamics simulation. In our performance tests the cluster showed a superlinear speedup: 6 nodes ran 8.33 times faster than a single node, a parallel efficiency of about 1.39 (8.33/6).

Genomic Clustering and Functional Annotation of Metastasis-related Genes and High-throughput Virtual Screening of Novel Metastasis-related Genes

[OBJECTIVE] To explore the genomic clustering and functional annotation of currently known metastasis-related genes, and to determine whether coding hot spots for metastasis-related genes exist on the human genome. Starting from the nucleotide sequences of known metastasis-related genes, and with the help of the local parallel biological computing platform, the human EST database was searched for novel metastasis-related genes. The aim was to uncover the relationship between gene function and genomic location, to further elucidate the molecular mechanisms of tumor metastasis, and to find more targets for blocking tumor metastasis.

[METHODS] Genomic location data for known metastasis-related genes and for all human RefSeq genes were downloaded from public databases. After pretreatment (manual inspection, MySQL sorting, and ID conversion), a Perl script and 2x2 chi-square tests were used to identify statistically significant coding hot spots of metastasis-related genes on the human genome. All metastasis-related genes were annotated for gene function, structural domains, and metabolic pathways using Gene Ontology, InterPro, KEGG, BioCarta, and other resources. The Functional Classification Tool from DAVID was used to build a gene-term similarity matrix and to perform functional clustering with a fuzzy clustering algorithm. The same annotation and clustering were also attempted on the metastasis-related genes belonging to each coding hot spot.
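The 2x2 chi-square test named in the methods can be illustrated with a minimal Python sketch. The abstract does not say how the table was laid out or whether a continuity correction was applied, so the layout below (metastasis-related versus other genes, inside versus outside one chromosome band), the uncorrected Pearson statistic, and all counts are assumptions for illustration only. The dissertation itself used a Perl script.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for one chromosome band (not taken from the dissertation).
meta_in, meta_out = 30, 770        # metastasis-related genes inside / outside the band
other_in, other_out = 300, 16700   # remaining genes inside / outside the band

chi2, p = chi_square_2x2(meta_in, meta_out, other_in, other_out)
# A hot spot needs enrichment, not just any significant difference.
enriched = meta_in / (meta_in + meta_out) > other_in / (other_in + other_out)
print(f"chi2={chi2:.2f}, p={p:.3g}, hot spot candidate={p < 0.05 and enriched}")
```

Because one test is run per band, a real analysis would likely also need some form of multiple-testing control; the abstract does not state whether one was used.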
All RefSeq sequences and related mRNA sequences of the metastasis-related genes were downloaded from GenBank, and the latest human EST database was downloaded and formatted. The protein RefSeq sequences representing all metastasis-related genes were searched against the est_human database with TBLASTN, and the matched ESTs with an e-value below 10 were searched against the nr protein database with BLASTX. The candidate ESTs were pretreated (vector contamination removed; low-complexity and tandem repeat sequences masked) and then extended with P_PHRAP. Meaningful contigs were then subjected to further bioinformatics analysis in the expectation of finding potential novel metastasis-related genes.

[RESULTS] After download and pretreatment, 787 non-redundant, high-quality human metastasis-related genes and 16849 human RefSeq genes were obtained for further study. The genomic distribution of all metastasis-related genes was computed with the Perl script and chi-square statistics. Thirteen statistically significant coding hot spots (p<0.05) were identified, located at 2p25.2-2q31.3, 3p14.2-3q22.1, 4p16-4q31.23, 6p24-6p23, 8p23.1-8q24.2, 9p24.2-9q34, 11p15.5-11q24, 12p13, 13q12.3-13q13.3, 15q13, 17p13.3, 18p11.32-18q21.3, and Xp22.32-Xq28. Clustering of the metastasis-related genes showed that 9 groups of genes contribute strongly to tumor metastasis: a. serine-type endopeptidase inhibitors; b. various growth factors; c. various transmembrane surface receptors; d. hydrolases; e. apoptosis regulation genes; f. various protein kinases; g. intermediate filament proteins; h. nuclear transcription factors and TF receptors; i. genes participating in DNA repair. A TBLASTN search of the est_human database with 1115 protein RefSeq sequences from all the metastasis-related genes yielded 31293 matched ESTs with an e-value below 10. These ESTs were then searched against the nr protein database with BLASTX, and the matched ESTs were pretreated.
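The e-value cutoff applied to the TBLASTN hits can be sketched in Python as follows. This is only an illustration: it assumes the hits were saved in the standard 12-column tab-separated BLAST format (subject ID in column 2, e-value in column 11), which the abstract does not state. The function name and the sample IDs are hypothetical.

```python
import csv
import io


def ests_below_evalue(tabular_text, max_evalue=10.0):
    """Return the sorted, unique subject IDs whose e-value is below max_evalue.

    Assumes 12-column tab-separated BLAST output; for TBLASTN against an EST
    database, the subject (column 2) is the matched EST.
    """
    keep = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) < max_evalue:
            keep.add(row[1])
    return sorted(keep)


# Hypothetical hits: query, subject, %identity, length, mismatches, gap opens,
# query start/end, subject start/end, e-value, bit score.
sample = "\n".join([
    "query1\testA\t85.0\t120\t18\t0\t1\t120\t3\t362\t1e-50\t200",
    "query1\testB\t30.0\t40\t28\t0\t5\t44\t10\t129\t25\t20.1",
    "query2\testC\t55.0\t80\t36\t0\t2\t81\t7\t246\t0.002\t45.4",
    "query2\testA\t60.0\t90\t36\t0\t1\t90\t4\t273\t3e-12\t70.5",
])
print(ests_below_evalue(sample))  # ['estA', 'estC']
```

Collecting IDs in a set removes ESTs matched by more than one query protein, so each candidate EST is passed to the next step only once.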
RepeatMasker masked 247375 bp, or 6.89% of all bases. The pretreated ESTs were clustered and assembled with P_PHRAP, yielding 1682 contigs and 3125 singlets. After manual inspection, 64 candidate gene contigs remained.

[CONCLUSIONS] The genomic coding hot spots of metastasis-related genes were identified with bioinformatics methods. Fuzzy clustering showed 9 groups of genes that contribute strongly to metastasis and together cover all the steps of classic metastasis theory. Searching the human EST database with known sequences detected several potential novel candidate gene contigs. Data mining of the EST database is a good strategy for discovering novel genes, and the parallel computing environment provides strong support for it.
Keywords/Search Tags: bioinformatics, parallel computing, EST, docking, molecular dynamics, metastasis