| Transcriptional regulation is an important part of gene expression regulation. Identifying transcription factor binding sites is a central task in decoding transcriptional regulation and an important step toward understanding its mechanism. However, accurately estimating transcription factor binding affinities across a huge number of base pairs remains a serious challenge. To address this problem, the main work of this paper is as follows: 1. We propose a logistic regression model for predicting transcription factor binding affinities from high-throughput ChIP-Seq data. The model enumerates all 5-mer sub-sequences over the alphabet A, T, C, G, compares them with each DNA sequence, and counts how many times each sub-sequence appears in that sequence, thereby constructing an affinity matrix that captures the transcription factor binding affinities. We then use this affinity matrix to fit the logistic regression model and apply the stability selection algorithm to optimize model selection. Compared with models based on the position weight matrix (PWM) and the position-specific scoring matrix (PSSM), this model improves the accuracy of transcription factor binding affinity prediction. 2. We improve the existing multiple linear regression model to develop a new multiple linear regression model based on protein binding microarray (PBM) data. The model enumerates all sub-sequences of length 3 to 8 bases, compares them with the DNA probe sequences, and records whether each sub-sequence appears in each probe sequence, thereby constructing a new affinity matrix. We then use this affinity matrix to fit the new multiple linear regression model, and we use SLEP to optimize model selection and improve prediction accuracy. 
Compared with the existing multiple linear regression model, the new PBM-based model is highly competitive in predicting transcription factor binding affinities. |
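
The two affinity matrices described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`all_kmers`, `count_row`, `presence_row`) are hypothetical, and it assumes that occurrences are counted with overlap, that only the forward strand is scanned, and that windows containing symbols other than A, C, G, T are skipped. None of these details is stated in the abstract. Each function returns one row of the matrix, that is, the feature vector for a single sequence or probe.

```python
from itertools import product

ALPHABET = "ACGT"


def all_kmers(k):
    """Return all 4**k k-mers over A, C, G, T in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def count_row(seq, k=5):
    """Count-based features: occurrences of every k-mer in seq.

    Assumptions (not from the abstract): overlapping windows are counted,
    only the forward strand is scanned, and windows with non-ACGT symbols
    are ignored.
    """
    seq = seq.upper()
    index = {kmer: i for i, kmer in enumerate(all_kmers(k))}
    row = [0] * len(index)
    for start in range(len(seq) - k + 1):
        col = index.get(seq[start:start + k])
        if col is not None:
            row[col] += 1
    return row


def presence_row(seq, k_min=3, k_max=8):
    """Presence-based features: 1 if a k-mer occurs in seq, else 0,
    concatenated over all lengths k_min..k_max."""
    seq = seq.upper()
    row = []
    for k in range(k_min, k_max + 1):
        seen = {seq[start:start + k] for start in range(len(seq) - k + 1)}
        row.extend(1 if kmer in seen else 0 for kmer in all_kmers(k))
    return row


if __name__ == "__main__":
    # One row per sequence gives the affinity matrix for a toy data set.
    sequences = ["ACGTACGTAC", "TTTTTGGGGG"]
    count_matrix = [count_row(s, k=5) for s in sequences]
    presence_matrix = [presence_row(s, 3, 4) for s in sequences]
    print(len(count_matrix), len(count_matrix[0]))
    print(len(presence_matrix), len(presence_matrix[0]))
```

Stacking the count rows gives a matrix with 4^5 = 1024 columns that could be passed to a logistic regression fit, while the presence rows for lengths 3 to 8 give 87,360 binary columns, which is why a sparse model selection step such as the one the abstract attributes to SLEP is a natural companion.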