Constructing Database Of Gene/Protein Interaction By Combining Dictionary And Condition Random Fields

Posted on: 2016-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: T Wei
GTID: 2180330461467367
Subject: Computer software and theory
Abstract/Summary:
Gene/protein interactions can affect the expression levels of intracellular RNA and protein. They also play an important part in how living organisms adapt to environmental change. Building a knowledge base of gene and protein interactions is therefore important for understanding biological processes. The first step in building such a knowledge base is to identify gene/protein named entities.

There are currently three major approaches to gene/protein named entity recognition: dictionary-based, rule-based, and machine learning based. The dictionary-based method is simple and practical, but recognition performance depends heavily on the size and quality of the dictionary, and a complete dictionary is difficult to create. The rule-based method lacks adaptability, and its recognition performance depends on how sound and how complete the hand-built rules are. The machine learning method trains a learning algorithm on a manually annotated corpus to obtain a model, and then uses that model to label unseen text. It is currently the most widely used of the three.

This dissertation proposes an approach to gene/protein named entity recognition that combines a dictionary with Conditional Random Fields (CRFs). First, we build a gene/protein named entity dictionary from biomedical databases such as UniProt, Gene Ontology, IntAct and HPRD. Second, using the CRF algorithm and a rich set of gene/protein entity features, we introduce a gene/protein dictionary feature and train a gene/protein recognition model. Third, we conduct experiments on the JNLPBA2004 data set with the open source package CRF++. The results indicate that the approach combining the dictionary with the CRF algorithm achieves comparatively high recognition performance. Finally, the CRF model is applied to recognize gene/protein names in biomedical abstracts, and a gene/protein interaction database is built by studying the links between the recognized entities.
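The dictionary feature described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `B-DIC`/`I-DIC`/`O` tag scheme, the greedy longest-match strategy, the tiny dictionary and the example labels are all assumptions chosen for illustration. The sketch marks which tokens match a dictionary entry and writes one token per row with the gold label in the last column, which is the whitespace-separated column layout that CRF++ reads for training data.

```python
def dictionary_tags(tokens, dictionary, max_len=5):
    """Tag tokens B-DIC / I-DIC / O by greedy longest match (assumed scheme)."""
    # Store each dictionary name as a tuple of lowercased tokens.
    entries = {tuple(name.lower().split()) for name in dictionary}
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matched = n
                break
        if matched:
            tags[i] = "B-DIC"
            for j in range(i + 1, i + matched):
                tags[j] = "I-DIC"
            i += matched
        else:
            i += 1
    return tags


def to_crf_rows(tokens, labels, dictionary):
    """Build rows of: token, dictionary feature, gold label (label last)."""
    dic = dictionary_tags(tokens, dictionary)
    return [f"{tok}\t{d}\t{lab}" for tok, d, lab in zip(tokens, dic, labels)]


# Hypothetical example sentence, dictionary and labels.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
labels = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein"]
dictionary = {"IL-2", "NF-kappa B"}

rows = to_crf_rows(tokens, labels, dictionary)
print("\n".join(rows))
```

In a setup like this, the CRF learns how much to trust the dictionary column alongside other features, so an incomplete dictionary lowers the value of one feature rather than causing missed entities outright.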
Keywords/Search Tags: gene/protein, named entity recognition, Conditional Random Fields, dictionary, interactions