
Distance-based Support Vector Machine To Predict DNA N6-methyladenine Modification Algorithm

Posted on: 2022-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Zhang
Full Text: PDF
GTID: 2480306764470614
Subject: Automation Technology
Abstract/Summary:
DNA methylation is one of the first epigenetic regulatory mechanisms discovered in humans. N6-methyladenine (6mA) is the most representative DNA modification in prokaryotes, where it is mainly involved in the restriction-modification system that protects individuals from the invasion of foreign DNA. However, it has received less attention than 5-methylcytosine. An important reason is that N6-methyladenine modifications were thought to be widespread only in prokaryotes and unicellular eukaryotes, and rare in multicellular eukaryotes. Recent studies have identified N6-methyladenine in eukaryotic genomes, including mammalian and plant genomes, and found that it plays an important role in growth, development and disease regulation. These studies open a new chapter in epigenetic modification in eukaryotes. As the initial and most critical step of this research, the identification of N6-methyladenine sites has strong theoretical and practical significance.

This thesis focuses on a class of DNA N6-methyladenine datasets from which features are difficult to extract with traditional machine learning methods. It therefore proposes a new classification and prediction method, a distance-based support vector machine, to predict DNA N6-methyladenine sites. The algorithm avoids the feature extraction problem of traditional machine learning and instead works from the distance matrix between sequences. First, a mature multiple sequence alignment approach is used: the center star alignment algorithm based on the suffix tree aligns the data set to obtain the similarity matrix. Second, a logarithmic transformation is applied to the similarity matrix to obtain the key distance matrix. Third, the distance matrix is converted through a Gaussian transformation into a kernel matrix that meets the training conditions. Finally, the kernel method is used in the support vector machine to classify and predict the data.

The thesis also improves on the proposed algorithm and proposes a distance-based support vector regression machine to predict DNA N6-methyladenine sites. The biggest difference between the two algorithms is that the support vector regression machine requires the distance between the hyperplane and the support vectors farthest from it to be as small as possible, whereas the support vector machine requires the distance between the classification hyperplane and the support vectors closest to it to be as large as possible. The basic evaluation indices (sensitivity, specificity and accuracy) and the advanced evaluation indices (Matthews correlation coefficient and F1 value) are used in a variety of comparative experiments, including 5-fold cross-validation and independent experiments. The experiments include a comparison of learning methods, a comparison with other support vector machine algorithms and a comparison with the latest research. In all experiments, the proposed algorithm shows advantages of varying degree over the previous algorithms.
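The chain from similarity matrix to kernel matrix, and the five evaluation indices, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the logarithmic transform d = -ln(s), the Gaussian transform k = exp(-gamma * d^2), the function names, the `gamma` and `eps` parameters and all numbers below are assumptions made for illustration, not the thesis's own definitions. The metric formulas are the standard ones based on confusion-matrix counts.

```python
import math


def similarity_to_distance(similarity, eps=1e-12):
    """Assumed log transform: d_ij = -ln(s_ij), with s_ij in (0, 1]."""
    # eps guards against log(0) for completely dissimilar pairs.
    return [[-math.log(max(s, eps)) for s in row] for row in similarity]


def distance_to_kernel(distance, gamma=1.0):
    """Assumed Gaussian transform: k_ij = exp(-gamma * d_ij ** 2)."""
    return [[math.exp(-gamma * d * d) for d in row] for row in distance]


def evaluate(tp, tn, fp, fn):
    """Standard definitions of the five metrics from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical pairwise similarities for three aligned sequences.
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
distance = similarity_to_distance(similarity)
kernel = distance_to_kernel(distance, gamma=0.5)

# Hypothetical confusion-matrix counts, for illustration only.
scores = evaluate(tp=40, tn=45, fp=5, fn=10)
```

A kernel matrix of this kind can be passed to any support vector machine implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. The abstract does not state which implementation the thesis uses.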
Keywords/Search Tags:DNA N6-methyladenine, Multiple Sequence Alignment, Distance Matrix, Support Vector Machine, Support Vector Regression Machine