| Prediction of protein-coding genes plays an important role in many genome projects: it is valuable for finding new genes, understanding genome composition, and identifying disease-relevant genes. Accurate identification of the splice sites of eukaryotic genes is one of the essential and challenging problems in gene structure prediction. Widely used splice site identification methods, such as the weight array model (WAM), rely on the conserved signal sequences around splice sites. In addition to this information, this paper exploits other features useful for identifying splice sites: the relationship between the conserved signals and the C+G content of the sequences around splice sites, and the compositional features of the sequences upstream and downstream of splice sites together with their dependence on that C+G content. Separate models are constructed to describe these features, and a logit-linear model is built to integrate them. The result is SpliceKey, a new program for splice site prediction. Test results show that the prediction accuracy of SpliceKey is significantly higher than that of WAM and also better than that of DGSplice, a recently released splice site prediction program. A novel approach to predicting causative disease genes by mining functional information from GO annotation, together with the corresponding program DCGene, is also presented. When GO terms are used to evaluate how likely a candidate gene is to be causative, the structure of the GO terms (their DAG information) is taken into account. The algorithm computes the degree of relevance between genes and a disease, which supports the accuracy of disease gene prediction.
To assess the method, leave-one-out tests are performed on 1057 disorders whose causative genes have been identified in the OMIM database, using two candidate sets: the genes in the chromosome region to which each disorder has been mapped (89 genes on average), and 12954 candidate genes from the human genome. The results demonstrate that the method can effectively predict disease genes from candidates at both the chromosome-region scale and the genome scale. Consequently, the predictions can be used either to identify causative genes within a chromosome region or to suggest potential loci on a genome-wide scale for linkage analysis of simple diseases and association studies of complex diseases. |
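
The leave-one-out assessment described above can be illustrated with a minimal sketch. It is not DCGene's implementation: the function names (`leave_one_out_ranks`, `top_k_fraction`), the data layout, and the `score` callback are all hypothetical, and the scoring function is left to the caller because the abstract does not specify how the GO-based relevance is computed. The sketch only shows the evaluation loop: hide one known disease-gene link, rank the candidates for that disease, and record where the true gene lands.

```python
from typing import Callable, Dict, List

# Hypothetical signature: score(gene, disease, training_links) -> relevance
Scorer = Callable[[str, str, Dict[str, str]], float]


def leave_one_out_ranks(
    disease_genes: Dict[str, str],
    candidates: Dict[str, List[str]],
    score: Scorer,
) -> Dict[str, int]:
    """Return the 1-based rank of each held-out causative gene."""
    ranks: Dict[str, int] = {}
    for disease, true_gene in disease_genes.items():
        # Hide the known link for this disease while scoring its candidates.
        training = {d: g for d, g in disease_genes.items() if d != disease}
        pool = candidates[disease]
        # Highest relevance first; ties keep the original candidate order.
        ordered = sorted(
            pool,
            key=lambda gene: score(gene, disease, training),
            reverse=True,
        )
        ranks[disease] = ordered.index(true_gene) + 1
    return ranks


def top_k_fraction(ranks: Dict[str, int], k: int) -> float:
    """Fraction of diseases whose true gene is ranked within the top k."""
    return sum(1 for r in ranks.values() if r <= k) / len(ranks)
```

The same loop would serve both settings mentioned in the text: only the contents of `candidates` change, from the genes of the mapped chromosome region to the genome-wide candidate set.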