| A copy number variation (CNV) is a DNA segment of 1 kb or larger that is present as an insertion, deletion, and/or amplification in comparison with a reference genome. CNVs are widespread in the genome, rich in repeat sequences, and closely associated with some diseases because they change gene expression directly or indirectly. Existing studies of CNVs focus mainly on those that contain genes, although most CNVs do not. Because genome annotation is imperfect and potential new genes may have been missed, genes in CNV regions need to be predicted to deepen the study. The main prediction methods at present are the alignment-based method, the ab initio method, and the hybrid method. However, none of the three is suitable for gene prediction in CNV regions, because they all ignore the repeat sequences that are widespread in CNVs. In this thesis, we present a gene prediction method for CNV regions. The novelty of this research is that we keep repeat sequences in the gene prediction process and choose CNV regions, instead of the whole genome sequence, as the target. The central work of this thesis is as follows: (1) We constructed a hybrid prediction system based on a hidden Markov model. The system takes DNA sequence, conservation sequence, and EST sequence as input and uses different statistical models to identify signals and regions, respectively. (2) We selected 230 reported CNV regions as a test data set for gene prediction. The results showed that the sensitivity of the system was 18%. Additionally, some predicted genes contained one or more exons beyond those of the related reference gene, and these exons or their nearby introns were rich in repeat sequences. |
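To make the hidden Markov model idea in point (1) concrete, the following is a minimal sketch of Viterbi decoding over a DNA sequence. It is not the system built in this thesis: the two states (`N` for non-coding, `C` for coding), every probability value, and the function name `viterbi` are hypothetical choices made only for illustration, and the conservation and EST inputs of the hybrid system are left out.

```python
import math

# Hypothetical two-state model: "N" = non-coding, "C" = coding.
# All probabilities are made-up illustrative values, not thesis parameters.
STATES = ("N", "C")
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    if not seq:
        return ""
    # score[s] is the log-probability of the best path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one back-pointer table per position after the first
    for base in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `viterbi("ACGT")` returns a string of `N` and `C` labels, one per base. Log-probabilities are used so that long sequences do not underflow. A realistic gene model would need many more states (for example splice sites, start and stop codons, and exon phases), which this sketch deliberately omits.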