Identification of protein complexes using machine learning (PyBrain and Scikit-Learn) based on DNA sequence data

Posted on:2015-11-02

Degree:M.S

Type:Thesis

University:Texas A&M University - Commerce

Candidate:Ruangchai, Wuthiwat

GTID:2470390020951540

Subject:Computer Science

Abstract/Summary:

The National Center for Biotechnology Information (NCBI) provides a wide range of science and health resources, including program downloads, databases, submission services, tools, and protocols. Its database section is divided into many subsections, such as BioProject (formerly Genome Project), BioSample, Bookshelf, GenBank, and the Nucleotide database. This study focused on the Nucleotide (DNA sequence) database. With data sets of this size, researchers can apply a variety of theories and algorithms to extract and mine knowledge. This study applied machine learning approaches to classify and identify protein complexes from DNA sequence inputs.

Machine learning is the construction of artificially intelligent systems that take in and analyze data in order to improve themselves and interact better with that data. In effect, the system learns how to do its job better. Machine learning is among the most useful tools for research and a prime example of learning from examples (Love, 2014). This study used PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) as a modular machine learning library, with supervised learning algorithms as the underlying structure of each machine.

In this study, the data were separated into two sets. The first set was used to train machines built on different algorithms, and the second set was used to test their accuracy. The machine models were built with specific parameters, then trained and tested on these data sets. The results were visualized with line charts and clustered column charts.

Across six types of protein complex data sets, the machine with the best accuracy and learning rate was the resilient propagation model, with 99.98% accuracy and a faster learning rate than the others. The accuracy of the backpropagation model was 97.76%, that of the support vector machine models was 93.60%, and that of the stochastic gradient descent model was 98.38%.
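
The train/test workflow described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis code: the abstract does not say how DNA sequences were encoded, so the k-mer frequency encoding, the 70/30 split, and every function name below are assumptions, and a simple nearest-centroid classifier stands in for the PyBrain and Scikit-Learn models the study actually used. The sequences are synthetic toy data generated only so the example runs end to end.

```python
import random
from collections import Counter
from itertools import product

K = 2  # k-mer length; an assumption, the abstract does not state the encoding
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 16 dinucleotides

def kmer_features(seq, k=K):
    """Encode a DNA string as a vector of normalized k-mer frequencies."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS) or 1  # avoid division by zero
    return [counts[m] / total for m in KMERS]

def split_train_test(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and return (train, test) lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Nearest-centroid stand-in model: average feature vector per label."""
    sums, counts = {}, Counter()
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda lbl: dist2(centroids[lbl]))

def accuracy(centroids, test):
    """Fraction of held-out samples classified correctly."""
    correct = sum(predict(centroids, f) == label for f, label in test)
    return correct / len(test)

# Synthetic toy data: two classes with different base compositions.
rng = random.Random(42)
toy = []
for _ in range(50):
    toy.append(("".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=60)), "AT-rich"))
    toy.append(("".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=60)), "GC-rich"))

dataset = [(kmer_features(seq), label) for seq, label in toy]
train_set, test_set = split_train_test(dataset)  # first set trains, second tests
model = fit_centroids(train_set)
print(f"held-out accuracy: {accuracy(model, test_set):.2%}")
```

To compare several algorithms as the study did, the same `train_set` and `test_set` would be passed to each model in turn, so that the reported accuracies are measured on identical held-out data.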

Keywords/Search Tags:

Machine, DNA sequence, Data, Accuracy, Using, Protein

Related items

1	Predicting Protein-protein Interactions From Protein Sequence Based On Multiple Feature Extractions
2	Predicting Protein-Protein Interactions Based On Support Vector Machine And Complete Protein Sequence
3	Relationship Between Prediction Results Of Machine Learning-based Protein-protein Interaction And Sample Repeatability
4	The Machine Learning Model Of Protein Structural Prediction Based On Protein Sequence
5	Predicting Protein-protein Interactions Based On Machine Learning Algorithms Using Logistic Regression Model To Improve Accuracy Of Peptide Identification In Mass Spectrometry Analysis
6	Prediction Method Research Of Special Protein Recognition Based On Protein Sequence Information
7	Identification of interface residues involved in protein-protein and protein-DNA interactions from sequence using machine learning approaches
8	Prediction Research Of Protein Function Based On Sequence
9	Research Of Protein Subcellular Location Using Machine Learning Algorithms
10	Research On Prediction Of Protein Domains Based On Support Vector Machines