| With the advance of modern medicine and the improvement of people's quality of life, many serious infectious and nutritional diseases have gradually been brought under control, and genetic diseases have therefore become more and more prominent. Many diseases that harm health and shorten life span, such as hypertension, diabetes, cancer and mental diseases, have been found to be related to genetic factors. Studies that seek relations between genetic variations and disease susceptibility have recently become a research hotspot. These studies have revealed many disease-related single nucleotide polymorphisms (SNPs), most of which are located in non-coding regions. Among non-coding SNPs, the regulatory SNP (rSNP), which lies in a cis-acting element (such as a promoter, enhancer, silencer or insulator) and changes the expression of a gene, has attracted the most attention. Experimental validation of regulatory SNPs is costly and time-consuming, so theoretical prediction is needed to complement and guide experiments. A great challenge in the analysis and prediction of rSNPs is that, although most prediction and analysis methods aim to determine the location of the SNP, there are no effective structural attributes that differentiate rSNPs from other SNPs. The main cause is that the underlying mechanism by which rSNPs arise is not yet understood. In this study, we focus on comprehensively analyzing structural features at different levels, examining the role of structural attributes in the prediction of rSNPs, and gaining a deeper understanding of the mechanism. The main work is as follows. First, machine learning algorithms were applied to an integrated analysis of several structural attributes. A new dataset of rSNPs and control SNPs was collected by searching the literature and databases. 
Eleven structural attributes that had not previously been examined by machine learning were collected, and our dataset and attribute set were compared with those of similar studies. Attribute ranking showed that, among the newly added attributes, the change in the DNA helical parameter Rise and the change in the hydroxyl radical cleavage pattern were the more important. Training and comparison showed that the naïve Bayes classifier performed best: it was about 6% better than the support vector machine used in the literature, and it could be improved slightly by rationally adding new structural attributes. This work showed that rSNP prediction is a complicated problem and that classifier performance can be improved by rationally adding new structural attributes. Two important methods, the site matrix and the hydroxyl radical cleavage pattern, were then analyzed in depth. The site matrix method requires cross-extracting reliable data from several related databases, and these data are also very useful for other methods. We extracted all human transcription factor binding sites (TFBSs) from the TRANSFAC database, which has the most comprehensive collection of TFBSs, and determined their coordinates in the Reference Sequence (RefSeq). One hundred and eighty-three TFBSs were found to contain SNPs, and 18 of them contained rSNPs. After duplicate sites were removed, a total of 13 rSNPs on 12 TFBSs remained. Of the 183 SNPs on TFBSs, only 32 lay within a site matrix, and only 5 of those were rSNPs. More rSNPs therefore need to be identified and collected before a reliable statistical analysis is possible. Statistics on our data showed that the frequency changes caused by SNPs and rSNPs are widely distributed. This contradicts the view that rSNP positions should be highly conserved in the site matrix. Based on this result, we hypothesize that there is more than one binding mode between a TF and DNA, and that the specific base at an rSNP plays a different role in each mode. 
However, more thorough research and analysis are needed to support this hypothesis. The ability of the hydroxyl radical cleavage pattern to recognize rSNPs was then examined. The statistics showed that, on the plus strand, the changes in hydroxyl radical cleavage pattern caused by regulatory SNPs were statistically significantly smaller than those caused by control SNPs, whereas no statistically significant difference was found when the minus strand was analyzed. We supposed that the discrepancy between the plus and minus strands was an oversight of the method. After communicating with the author of the method and applying a double-strand algorithm as a modification, the contradiction disappeared and the difference between rSNPs and control SNPs became more significant. The studies above analyzed the general structural attributes of rSNPs by probabilistic statistics, from integrated and separate perspectives respectively. To analyze the underlying mechanism of rSNPs, molecular dynamics simulation was applied to individual cases at the atomic level. First, homologues of the transcription factors (TFs) whose TFBSs contain rSNPs were searched for in the Protein Data Bank (PDB). Three TFs were then selected for molecular dynamics simulation of their protein-DNA complexes: the pituitary-specific transcription factor POU1F1, the vitamin D receptor VDR and the androgen receptor AR. The simulation results for the AR-DNA complex were analyzed first. They showed that the numbers of hydrogen bonds and of stable hydrogen bonds in the binding complex were far greater than in the non-binding complex, and that the hydrogen bonds were distributed widely over the whole DNA. Hydrophobic interaction analysis showed that the methyl group on the mutated base T greatly enhanced nearby hydrophobic interactions in the binding complex. Analysis of the relative motion of the two recognition helices showed that they were nearly parallel in the binding complex, whereas in the non-binding complex they resembled those of free AR. 
We therefore hypothesize that hydrophobic interaction is the key factor in the formation of that rSNP. This work showed that molecular dynamics simulation, when reasonably applied, can be an important reference for the analysis of rSNPs. The planar relative motion of the two recognition helices of homodimeric transcription factors was also examined. This work provides a guideline for analyzing the conformational motion of similar transcription factors and for understanding the dynamics of protein-DNA binding. The main contributions of this study are as follows: the ability of some previously unused structural attributes to recognize rSNPs was examined by machine learning; the SNPs and rSNPs on TFBSs were comprehensively collected, and a data platform for in-depth analysis of the function and mechanism of rSNPs was constructed; an oversight in the hydroxyl radical cleavage pattern method was found and fixed, and the change in hydroxyl radical cleavage pattern was shown to be related to rSNPs; by adopting recent results on protein-DNA recognition and analyzing the structural mechanism of an rSNP, it was shown that molecular dynamics simulation can be an important reference for the analysis of rSNPs; and the planar relative motion of the two recognition helices of homodimeric transcription factors was found by molecular dynamics simulation, and its theoretical background and range of application were analyzed in depth. New structural attributes and methods continue to emerge. As effective new attributes are added, the prediction performance of integrated theoretical models for rSNP recognition will gradually improve. Meanwhile, as more studies in structural proteomics are completed, the number of newly determined protein-DNA complexes is increasing, and more and more cases will become available for analyzing the underlying mechanism of rSNPs. Once enough case studies have accumulated, the mechanisms will gradually be worked out. |
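
The "change of frequency caused by SNP" in a binding-site frequency matrix, mentioned above, can be illustrated with a minimal sketch. It assumes the change is the relative frequency of the variant base minus that of the reference base in the affected matrix column; the abstract does not give the exact formula, so this definition, the function name `frequency_change` and the toy counts are all hypothetical and are not taken from TRANSFAC.

```python
from typing import Dict, List

BASES = "ACGT"


def frequency_change(matrix: List[Dict[str, float]],
                     position: int, ref: str, alt: str) -> float:
    """Return freq(alt) - freq(ref) at one column of a base-count matrix.

    Assumed definition for illustration only: values near -1 mean the SNP
    replaces a highly conserved base, values near 0 mean the column
    tolerates both bases about equally.
    """
    column = matrix[position]
    total = sum(column[b] for b in BASES)
    if total == 0:
        raise ValueError("empty matrix column")
    return (column[alt] - column[ref]) / total


# Hypothetical 3-column count matrix (illustrative values only).
toy_matrix = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 2, "C": 3, "G": 3, "T": 2},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]

conserved = frequency_change(toy_matrix, 2, "G", "A")  # fully conserved column
variable = frequency_change(toy_matrix, 1, "C", "G")   # weakly conserved column
```

Under this assumed definition, a wide spread of values across known rSNPs, rather than a cluster near -1, would correspond to the observation that rSNP positions are not always highly conserved.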