Study on Methods of Knowledge Discovery Based on Biomedical Literature

Posted on: 2007-09-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Q Zhang
GTID: 1104360242461415
Subject: Biomedical engineering
Abstract/Summary:
Literature mining is a method for analyzing text data automatically. It draws on several research fields, such as data mining, text mining and natural language processing. As an efficient way to extract, integrate and discover knowledge from the literature, it can process large numbers of publications quickly and mine knowledge in specialized fields. With the introduction of related algorithms and the refinement of corpora, the performance and reliability of literature mining have improved, and it is now widely applied in research.

Biomedical research has accumulated a large body of literature, which holds a great deal of knowledge on many different aspects. Bioinformatics technologies, meanwhile, aim to manage and analyze the mass data generated by biomedical experiments and to offer predictions or guiding conclusions. As a new branch of bioinformatics, literature mining takes biomedical literature as its raw material, extracts and integrates the knowledge scattered across text data, and thereby presents and induces the knowledge embedded in it. This dissertation takes PubMed data as the source literature, extracts knowledge about proteins, diseases and chemicals with several developed or integrated mining methods, and induces new knowledge from the extracted facts. The main results of this study are summarized as follows:

1) Recognition of entity names in literature data and their mapping to a bio-molecular database. Entity recognition is the foundation of other mining work, and different methods should be applied to recognize different kinds of knowledge in different fields. Three common kinds of entities, namely proteins (genes), diseases and chemicals, appear widely in biomedical literature. Protein entities were recognized with a statistical model based on conditional random fields. Disease entities, belonging to 21 disease types at the third level of the MeSH database, were recognized with dictionary-based methods. Chemical entities were recognized from the literature in a similar way. The recognized protein entities were mapped to the Entrez Gene database, which had been transformed into three protein name dictionaries by different formatting processes. With this graded mapping strategy, the protein entities were classified into four sets of names: exact entities, reliable entities, likely entities and unknown entities.

2) Six sets of entity relations based on entity relation rules. Three kinds of entities yield six kinds of pairwise combinations, so six kinds of entity relations were discovered: protein-protein, disease-disease, chemical-chemical, protein-disease, protein-chemical and disease-chemical. These entity relations were first described by entity co-occurrence frequency. The sentences containing related entities were then processed with a POS-tagging tool, and 536 verbs describing entity relations were extracted. The verbs were classified into four types: interaction-related, regulation-related, protein-modification-related, and others. This verb list formed the entity relation rule library. The text data were scanned with the rule library, and six kinds of entity relation data were extracted. Biomedical interpretations of these entity relation types were given, and some possible reasons for normal and abnormal relations were also discussed.

3) Entity relation network construction based on entity relation data, and three sub-graph extraction strategies for discovering new knowledge. The six kinds of entity relation data were used to build entity relation networks, comprising six simple entity relation networks and two hybrid entity relation networks. The hybrid networks were composed of simple networks: the molecular interaction network contained protein-protein, chemical-chemical and protein-chemical relation data, while the full relation network was made up of all six simple networks. The topological characteristics of the networks were analyzed, and three kinds of sub-graphs can be extracted: connected sub-graphs, hub sub-graphs and relation sub-graphs, which can be used to infer entities with undirected relations, active entities, and the relation pathway of a set of related entities.

4) Prototype system of the biomedical literature mining platform. The platform integrated the tools needed for literature mining. These were either third-party tools or tools developed by the author, and they were given a unified interface and data format. The platform was able to perform three kinds of knowledge discovery tasks: recognizing entities, mining entity relations and building entity relation networks. It also provided a file format compatible with third-party graph tools, which made it possible to visualize entity relation networks and their sub-graphs.
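
As a rough illustration of results 2) and 3), the following Python sketch counts sentence-level entity co-occurrence, applies a small verb rule library, and extracts connected and hub sub-graphs from the resulting network. The dictionary entries, the verb list, the sample sentences, the function names and the degree threshold are all hypothetical and are not taken from the dissertation. Entity matching here is a naive substring lookup rather than the conditional random field model or MeSH-based dictionaries, and verbs are found by word lookup rather than POS tagging.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy dictionary (surface form -> entity type); not from the dissertation.
ENTITY_DICT = {
    "tp53": "protein",
    "mdm2": "protein",
    "insulin": "protein",
    "diabetes": "disease",
    "doxorubicin": "chemical",
}

# Hypothetical rule library (verb -> relation class); the real library held 536 verbs.
RELATION_VERBS = {
    "binds": "interaction",
    "inhibits": "regulation",
    "phosphorylates": "protein-modification",
}

def find_entities(sentence):
    """Return dictionary names found in the sentence, ordered by position (naive substring match)."""
    lowered = sentence.lower()
    hits = [(lowered.find(name), name) for name in ENTITY_DICT if name in lowered]
    return [name for _, name in sorted(hits)]

def extract_relations(sentences):
    """Count co-occurring entity pairs and collect (entity, verb, entity, class) tuples."""
    cooccurrence = Counter()
    typed = []
    for sentence in sentences:
        entities = find_entities(sentence)
        words = re.findall(r"[a-z0-9-]+", sentence.lower())
        verbs = [w for w in words if w in RELATION_VERBS]
        for a, b in combinations(entities, 2):
            cooccurrence[tuple(sorted((a, b)))] += 1
            # Naive: every rule verb in the sentence is attached to every entity pair.
            for verb in verbs:
                typed.append((a, verb, b, RELATION_VERBS[verb]))
    return cooccurrence, typed

def build_network(pairs):
    """Build an undirected adjacency map from an iterable of entity pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected_subgraphs(graph):
    """Return the node sets of the connected components (depth-first search)."""
    seen, components = set(), []
    for start in list(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def hubs(graph, min_degree):
    """Return nodes with at least min_degree neighbours (assumed hub criterion)."""
    return {node for node, nbrs in graph.items() if len(nbrs) >= min_degree}

SENTENCES = [
    "MDM2 binds TP53.",
    "Doxorubicin inhibits MDM2.",
    "TP53 and MDM2 were both measured.",
    "Insulin levels were abnormal in diabetes.",
]

cooccurrence, typed_relations = extract_relations(SENTENCES)
network = build_network(cooccurrence)
print(cooccurrence.most_common())
print(typed_relations)
print(connected_subgraphs(network))
print(hubs(network, min_degree=2))
```

In this toy data, pairs that co-occur without a rule verb still enter the network through the co-occurrence counts, while the verb rules add a relation class only where a listed verb appears. The relation sub-graph (the pathway linking a chosen set of entities) is not sketched here.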
Keywords/Search Tags: literature mining, knowledge discovery, entity, entity relation, entity relation network, prototype system