| The regulation of gene expression is essential for cells and tissues to acquire distinct functions in eukaryotic organisms: it drives cellular differentiation and morphogenesis, giving rise to the different cell types of multicellular organisms. Transcriptomic data analysis can be used to study gene expression regulation at the transcriptional level. RNA sequencing (RNA-Seq) provides researchers with a powerful toolbox for characterizing and quantifying the transcriptome and is widely used in the study of gene expression regulation. Furthermore, because RNA-Seq is a high-throughput random sequencing technology, it is also widely applied in genome reannotation. A complete, high-quality genome annotation is the basis of downstream transcriptome analyses, such as analyses of gene expression, epigenetic regulation, and phylogeny. This thesis therefore uses RNA-Seq technology for deep mining of human and giant panda transcriptomic data. The main purpose is to study how expression patterns differ between human tumor tissues and their corresponding normal tissues, in order to reveal the mechanisms of gene expression regulation in cancer cells. In addition, deep sequencing is applied to improve the genome annotation of the giant panda. (1) In most previous studies of cancer mechanisms or other disease types, researchers focused on the differences within an individual organ between its normal and corresponding pathological conditions, in order to find which genes chiefly contribute to the specific disease. The fundamental issue here is how to develop an effective analysis method for estimating expression pattern differences between different tumor tissues and their corresponding normal tissues. Many human tissue and cell transcriptome datasets generated by RNA-Seq are now publicly available, which provides the necessary prerequisite for this study. 
We define the gene expression pattern along three dimensions: 1) expression breadth, which reflects the on/off status of gene expression; 2) low/high and constant/variable expression, based on gene expression level and variation; and 3) the regulation of gene expression at the gene structure level. Cluster analysis of gene expression profiles indicates that gene expression patterns are more strongly related to physiological condition than to the spatial distance between tissues. Two sets of human housekeeping (HK) genes are defined, one for each physiological condition (normal and cancer) of the cells and tissues. To characterize gene expression patterns in terms of expression level and variation, we first apply an improved K-means algorithm and a gene expression variance model. We find that, in the cancer condition, cells turn on fewer genes, up-regulate the expression of cancer-associated HK genes (genes that are HK genes in the cancer group but not in the normal group), and express these genes more variably across different cancer cells. The genes regulated in cancer cells are enriched in functions related to cell cycle regulation and constitute cancer signatures. Cancer cells preferentially express AT-rich genes and avoid expressing large genes, a pattern that maintains their limitless self-proliferation ability. These studies will help us understand which patterns of gene expression differ among cell types, particularly in cancer cells. (2) We applied RNA-Seq to globally detect novel expressed genes in 12 tissues of the giant panda: liver, skeletal muscle, cerebral cortex, pituitary, tongue, stomach, small intestine, colon, ovary, testis, and black-haired and white-haired skin. Mapping of the transcriptomic datasets shows that about 50% of coding information is neglected in the current genome annotation. 
We used a combined transcriptome reconstruction strategy that merged the reference-based method of Cufflinks with the de novo strategy of Trinity to create a more comprehensive annotation from the transcriptomes of the 12 tissues. The transcripts assembled by Trinity can be used to evaluate the quality of the giant panda genome assembly and to improve the completeness of coding gene regions in the genome. With expression and homology validation, we detected three groups of full-length novel genes: 1) 1,076 homology-based genes that are identical to a known protein and a known cDNA sequence; 2) 10,494 unknown genes that have only homologous expressed sequence tag (EST) evidence, but no homologous protein or cDNA sequence; and 3) 12,575 hypothetical genes that have a definite ORF but no homology evidence. These novel genes can restore missing genome annotation and are an effective complement to automated genome annotation systems. A complete genome annotation would provide more reliable information for subsequent studies in comparative genomics, resequencing, and transcriptomics. |
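
The three evidence tiers used above to group novel giant panda genes can be illustrated with a minimal sketch. It is not the thesis pipeline: the `Evidence` record, its field names, and the function `classify_novel_gene` are hypothetical, and the sketch assumes that homology search and ORF prediction have already been reduced to simple yes/no flags per gene. Combinations the abstract does not describe (for example, a protein hit without a cDNA hit) are left unclassified rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-gene evidence flags (names are illustrative only)."""
    protein_hit: bool  # matches a known protein
    cdna_hit: bool     # matches a known cDNA sequence
    est_hit: bool      # matches a homologous EST
    has_orf: bool      # contains a predicted open reading frame


def classify_novel_gene(ev: Evidence) -> Optional[str]:
    """Assign a gene to one of the three groups named in the abstract.

    Returns None for evidence combinations the abstract does not cover.
    """
    # Group 1: supported by both a known protein and a known cDNA.
    if ev.protein_hit and ev.cdna_hit:
        return "homology-based"
    # Group 2: only EST support, no protein or cDNA homology.
    if ev.est_hit and not ev.protein_hit and not ev.cdna_hit:
        return "unknown"
    # Group 3: an ORF is present, but there is no homology evidence at all.
    if ev.has_orf and not (ev.protein_hit or ev.cdna_hit or ev.est_hit):
        return "hypothetical"
    return None


if __name__ == "__main__":
    examples = [
        Evidence(protein_hit=True, cdna_hit=True, est_hit=False, has_orf=True),
        Evidence(protein_hit=False, cdna_hit=False, est_hit=True, has_orf=True),
        Evidence(protein_hit=False, cdna_hit=False, est_hit=False, has_orf=True),
    ]
    for ev in examples:
        print(classify_novel_gene(ev))
```

Checking the groups in this order keeps them mutually exclusive, so each gene receives at most one label.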