Constructing Database Of Gene/Protein Interaction By Combining Dictionary And Conditional Random Fields

Posted on:2016-03-24

Degree:Master

Type:Thesis

Country:China

Candidate:T Wei

Full Text:PDF

GTID:2180330461467367

Subject:Computer software and theory

Abstract/Summary:

Gene/protein interactions affect the expression levels of intracellular RNA and protein, and they strongly influence how living organisms adapt to environmental change. Building a knowledge base of gene and protein interactions therefore plays an important role in understanding biological processes. The first step in building such a knowledge base is to identify gene/protein named entities.

Currently, there are three major approaches to gene/protein named entity recognition: dictionary-based, rule-based, and machine-learning-based. The dictionary-based method is simple and practical, but the size and quality of the dictionary greatly influence recognition performance, and it is difficult to create a complete dictionary. The rule-based method lacks adaptability, and its recognition performance depends on how sound and complete the constructed rules are. The machine learning method trains a learning algorithm on a manually annotated corpus to obtain a model, and then uses that model to label unseen text. Machine learning is currently the most widely used approach.

This dissertation proposes an approach to gene/protein named entity recognition based on a dictionary and Conditional Random Fields (CRFs). First, we build a gene/protein named entity dictionary from biomedical databases such as UniProt, Gene Ontology, IntAct and HPRD. Second, using the CRF algorithm and a rich set of gene/protein entity features, we introduce a gene/protein dictionary feature and train a gene/protein recognition model. Third, we conduct experiments on the JNLPBA2004 dataset using the open-source package CRF++. The results indicate that the approach combining the dictionary with the CRF algorithm achieves relatively high performance. Finally, the CRF model is applied to recognize gene/protein names in biomedical abstracts, and a gene/protein interaction database is built by studying the links between them.
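
The abstract does not spell out how the dictionary feature is encoded, so the following is only a minimal sketch of one common way to do it: a greedy longest-match lookup marks each token as inside or outside a dictionary entry, and that mark is written as an extra column in the whitespace-separated training format that CRF++ reads (one token per line, gold label in the last column). The function names, the `B-DICT`/`I-DICT`/`O-DICT` tag names, the `max_len` limit, and the toy dictionary are all assumptions made for illustration and are not taken from the thesis.

```python
from typing import List, Set, Tuple


def dictionary_tags(tokens: List[str],
                    dictionary: Set[Tuple[str, ...]],
                    max_len: int = 5) -> List[str]:
    """Greedy longest-match lookup against a set of lowercased token tuples.

    Returns one tag per token: B-DICT (first token of a match),
    I-DICT (later token of a match) or O-DICT (no match).
    The tag names are hypothetical, chosen only for this sketch.
    """
    tags = ["O-DICT"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B-DICT"
            for j in range(i + 1, i + matched):
                tags[j] = "I-DICT"
            i += matched
        else:
            i += 1
    return tags


def to_crfpp_rows(tokens: List[str],
                  labels: List[str],
                  dictionary: Set[Tuple[str, ...]]) -> List[str]:
    """Build rows of the form: token <TAB> dictionary tag <TAB> gold label."""
    dict_tags = dictionary_tags(tokens, dictionary)
    return [f"{tok}\t{dt}\t{lab}"
            for tok, dt, lab in zip(tokens, dict_tags, labels)]


if __name__ == "__main__":
    # Toy dictionary and labels, for illustration only.
    toy_dictionary = {("il-2",), ("nf-kappa", "b")}
    toy_tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
    toy_labels = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein"]
    print("\n".join(to_crfpp_rows(toy_tokens, toy_labels, toy_dictionary)))
```

With a file in this layout, a CRF++ feature template could refer to the dictionary column through a macro such as `%x[0,1]` (current row, second column). Which templates the thesis actually used is not stated in the abstract.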

Keywords/Search Tags:

gene/protein, named entity recognition, Conditional Random Fields, dictionary, interactions

Related items

1	Research On Biomedical Named Entity Recognition Method Based On Deep Learning
2	Research On Named Entity Recognition Methods For Clinical Medicine
3	Research And Implementation Of A Biomedical Named Entity Recognition Method Based On Deep Learning
4	Research On The Application Of Deep Learning Models In Geographic Named Entity Recognition
5	Research On The Identification And Standardization Of Medical Named Entities From Clinical Real-World Data
6	Research On Biomedical Named Entity Recognition Based On Deep Learning
7	Research On Biomedical Named Entity Recognition Method Based On Word Meaning Enhancement
8	Research On Biomedical Named Entity Recognition Based On Weak Supervision
9	Research On Biomedical Named Entity Recognition Algorithm Based On Multi-Task Learning
10	A Phenotypic Named Entity Recognition Method Based On Distant Supervision