Because the genome sequences of species represent the first closed data set in biology, gene structure annotation, which includes predicting gene composition, gene structure, and gene regulatory elements in genomic DNA sequences, has become a core issue in bioinformatics. An automatic genome annotation system based on bioinformatics analysis methods offers a rapid and effective way to annotate genomic features, including genes and gene structures. At the same time, scalable methods and technologies are needed to store and manage genome-scale annotation data, so that users can access and retrieve the data over the web with the necessary information security and data protection. Moreover, because of its heavy demand for computing power, an annotation system built on a set of analysis programs must run in a high-performance computing environment. To address these problems, the following work has been carried out.

A gene-building pipeline enables fast automated annotation of eukaryotic genomes based on evidence derived from known protein, cDNA/mRNA, EST, and whole-genome sequences. It integrates various analyses and algorithms, including protein alignment, EST-based gene building, and ab initio prediction. On this basis, an automatic gene annotation system has been set up.

Eukaryotic gene structure prediction was studied from several angles, including feature extraction from eukaryotic gene structures, EST data mining, model and algorithm design, and software development. As a result, a program for ab initio prediction of eukaryotic gene structure and a program for identifying true EST alignments and the exon regions of genes were developed.

A genome database integrating genome sequence data and annotations has been established. Its conceptual model is centered on the annotated features of the genome, so that the results of genome annotation can be stored and managed effectively.
Because the database follows a "build once, access many times" usage pattern, data access efficiency can be improved through design rules such as allowing redundancy, permitting variation in relational tables and attributes, and dividing entities. Through a series of measures, including index building, clustering data by coordinate, pre-sorting, data partitioning, and storing sequences in binary flat files, the database is optimized to support fast interactive use by web tools that provide powerful visualization and query capabilities for mining the data. Furthermore, a code generator was developed to reduce the cost of the genome database project.

A genome browser, a web tool for visually displaying and accessing any requested portion of the genome annotation, is provided. The annotation data are displayed as a series of aligned annotation 'tracks', an approach also adopted by the three well-known international genome browsers. Several improvements are proposed: organizing data around the annotation features, integrating data into aggregations of similar level, and SVG-based interactive browsing. Roaming and zooming across tracks with self-adaptive step sizes and scales are adopted to make it easier to navigate the sea of genomic data.

Built on a high-performance computing environment, an automatic gene structure annotation system for eukaryotic genomes is now available; it comprises a set of heuristic annotation programs, a genome database, and a web site for genome display. A computational framework characterized by a two-level job-loading system based on grid and cluster computing is also presented for completing large-scale computing tasks on high-performance computing resources.