
Comprehensive Analysis Of Omics Data For Plant Gene Structural Annotation And Functional Analysis Platform

Posted on: 2017-01-06  Degree: Doctor  Type: Dissertation
Country: China  Candidate: X Yi  Full Text: PDF
GTID: 1220330482992679  Subject: Bioinformatics
Abstract/Summary:
Big data refers to datasets that are too large and complex for traditional relational database systems to process. With the rapid development of sequencing technology and related biological applications, the life sciences have entered the "Age of Big Data", and handling these complex sequencing data is a major challenge for bioinformaticians. Driven by the demands of plant genomics research, in this project I applied existing bioinformatics methods to analyze multidimensional omics data produced by experimental scientists and to uncover the biology behind them. First, I designed a standard analysis pipeline for large-scale plant omics data to identify novel genes and novel alternative splicing patterns. Next, I constructed a gene set enrichment analysis toolkit focused on plants. Finally, I established a comprehensive database of plant non-coding RNAs. Together, these efforts address several important trends in biologically meaningful large-scale bioinformatics data mining.

Once the whole-genome sequence of a species is available, the first step is always gene structural annotation at the whole-genome level. With the fast development of sequencing technology, epigenomic and transcriptomic information is accumulating rapidly. I built a standard analytical pipeline to use these omics data efficiently for gene structure annotation. First, I used data sets produced by chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) to study two histone modifications, trimethylation of H3K4 (H3K4me3) and acetylation of H3K27 (H3K27ac), at the whole-genome level. Second, I used the genomic annotation of known genes to determine the distribution patterns of these two histone marks in genic regions. Then I used transcriptomic data to confirm the positive correlation between the histone marks and gene expression.
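The correlation check between histone mark signal and gene expression can be illustrated with a rank correlation. The following is a minimal sketch, assuming Spearman's rho as the statistic (the abstract does not say which measure was used); the variable names and the toy values are hypothetical and are not data from the dissertation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-gene values: mean H3K4me3 ChIP-seq signal near the
# transcription start site, and expression level from RNA-seq.
h3k4me3_signal = [0.5, 1.2, 3.4, 8.0, 15.1, 2.2]
expression_level = [0.1, 1.8, 5.0, 20.3, 75.0, 1.5]

rho = spearman(h3k4me3_signal, expression_level)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based measure is used here because ChIP-seq signal and expression values are typically skewed; a value of rho close to 1 would be consistent with the positive relationship described above.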
After integrating as much transcriptomic data as possible, including in-house and public data, I followed the pipeline to discover novel genes in Oryza sativa (Nipponbare) and Gossypium arboreum (Asiatic cotton), and refined the strand assignment of their sequences using the conserved distribution patterns of the two histone marks in genic regions. Several novel genes of G. arboreum were confirmed by qRT-PCR. After discovering the new genes, I studied their gene structures, their tissue-specific expression patterns, and the distribution characteristics of the histone marks along the chromosomes. Last but not least, I defined a series of criteria to predict alternative splicing sites in G. arboreum using ChIP-seq and RNA-seq data sets.

Building on the gene structure annotation, the next topic was how to use all available datasets effectively to analyze gene function. Plant GO enrichment analysis toolkits such as EasyGO and AgriGO perform statistical analysis to identify genes enriched in particular GO terms, helping biologists narrow down their research scope. To better interpret one or more groups of differentially expressed genes (DEGs), I extended the analysis beyond Gene Ontology (GO) terms to gene sets in nine categories, including GO, the Plant Ontology (PO), gene families, KEGG pathways, and plant metabolic pathways, among others, to help annotate genes at the whole-genome level. Compared with using a single category of gene sets, the annotation rate of the corresponding genome increased significantly, and the accuracy and scope of functional gene description improved greatly.
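Enrichment of a gene list against gene sets drawn from several categories can be sketched as an over-representation test. The example below is a minimal sketch, assuming a one-sided hypergeometric test with Benjamini-Hochberg correction; the abstract does not state which statistic the toolkit uses, and the gene identifiers and gene set names here are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the set."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(query, background, gene_sets):
    """Test each gene set for over-representation in the query gene list."""
    background = set(background)
    query = set(query) & background
    rows = []
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = len(query & members)
        p = hypergeom_sf(overlap, len(background), len(members), len(query))
        rows.append((name, overlap, p))
    fdr = benjamini_hochberg([p for _, _, p in rows])
    results = [(n, o, p, q) for (n, o, p), q in zip(rows, fdr)]
    return sorted(results, key=lambda r: r[2])


# Hypothetical background of 20 genes and gene sets from three categories.
background = [f"g{i}" for i in range(1, 21)]
gene_sets = {
    "GO:toy_term": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "PO:toy_tissue": ["g10", "g11", "g12", "g13", "g14", "g15"],
    "KEGG:toy_pathway": ["g1", "g2", "g16", "g17"],
}
degs = ["g1", "g2", "g3", "g4", "g5"]  # hypothetical differentially expressed genes

results = enrich(degs, background, gene_sets)
for name, overlap, p, q in results:
    print(f"{name}\toverlap={overlap}\tp={p:.4g}\tFDR={q:.4g}")
```

Pooling gene sets from several categories into one dictionary means a gene lacking a GO term can still be described through another category, which is the motivation given above for the higher annotation rate.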
Using the GSEA strategy, I developed a gene set enrichment analysis toolkit named PlantGSEA (http://structuralbiology.cau.edu.cn/PlantGSEA), which has been updated multiple times in response to user requests and has been well recognized by the scientific community.

In addition, secondary bioinformatics databases typically provide multiple kinds of functional information for a single DNA or protein sequence. Epigenomics research covers not only histone marks but also non-coding sequences with complicated regulatory mechanisms. During background research on plant non-coding RNAs (ncRNAs), I found that few databases contained multiple types of ncRNAs together with different aspects of functional information. After a thorough survey of the strengths and weaknesses of existing platforms, I used our resources and skills to build a comprehensive analysis platform for the study of plant non-coding RNAs, named PNRD, which is accessible at http://structuralbiology.cau.edu.cn/PNRD. PNRD contains a total of 25739 entries covering 11 types of ncRNAs from 150 plant species, 178138 miRNA-target pairs in 46 species, 35 miRNA expression profiles, and a text-mining pool with 148 references. The platform consists of five functional modules: a Search section, a Browse section, a Tools section, a Download page and a Help page.

This dissertation aimed to build a comprehensive platform for plant gene structural and functional annotation as well as omics data mining, providing some solutions for the analysis of massive data. As high-throughput data of high complexity and noise continue to emerge, it is our responsibility to strengthen the insight of experimental scientists and help them find the value hidden in those data.
Keywords/Search Tags: bioinformatics, big data, omics data, epigenome, transcriptome, functional annotation, gene set enrichment analysis, non-coding RNA