
The Application Of Text Mining In Human Gene Function And Molecular Network Research Based On Free Terms

Posted on: 2016-04-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Wang
Full Text: PDF
GTID: 1314330482956726
Subject: Oncology
Abstract/Summary:
Background:
Given a set of genes, for example from high-throughput experiments, it is helpful to know which biological functions and molecular networks may be involved. Identifying human genes related to particular biomedical events, such as diseases, biological or pathological processes and gene functions, is of considerable value to biomedical researchers, curators and annotators. In the age of "network medicine", collecting all known related genes (i.e., genes that have been reported in the literature or annotated by biocurators) and constructing the corresponding gene network are important steps toward discovering novel genes involved in a specific biomedical event and elucidating its underlying molecular mechanism.

Several manually curated databases and tools provide a standard way to do this. Gene Ontology (GO) uses structured vocabularies for molecular function, biological process and cellular component to annotate gene products. The KEGG database depicts various pathways. HPRD, BioGRID and IntAct curate and archive protein-protein interactions (PPIs) from the scientific literature. Annotation tools that integrate these manually curated databases, such as DAVID and EGAN, make them convenient to use in practice. Other curated databases release genes related to pre-defined topics: the Tumor Suppressor Gene database (TSGene) identifies hundreds of tumor suppressor genes, the oncomiRDB database annotates experimentally verified oncogenic and tumor-suppressive miRNAs, and the LncRNADisease database curates experimentally supported lncRNA-disease associations from the literature. The Comparative Toxicogenomics Database (CTD) provides manually curated chemical-gene, chemical-disease and gene-disease relationships from the published literature and integrates these data to generate inferred chemical-gene-disease networks. These manually curated databases are useful in practice; however, the coverage of knowledge bases is still incomplete, because the amount of biomedical
literature has grown exponentially, structured vocabularies cannot keep up with newly emerging terms, and manual curation is time-consuming. The PubMed database comprises over 24 million citations for biomedical literature and is growing at a rate of about 4% per year. Moreover, expert biocurators expect text-mining tools with appropriate functionality and an easy-to-use interface in their annotation workflow.

Some text-mining tools can compensate for these deficiencies. Martini and CoPub 5.0 adopted a keyword-based approach to annotate gene function; however, the keywords were still limited to pre-defined thesauri. iHOP and STRING generate gene networks based on the co-occurrence of genes in the literature. However, even when gene pairs co-occur in the same sentence, only 30% of the pairs have an actual interaction. FACTA+, EBIMed and PolySearch discover hidden relationships between biomedical concepts from Medline abstracts and thereby enable users to search for genes related to a search term. However, FACTA+ and EBIMed cannot search for a phrase, whereas PolySearch cannot search for multiple single words. The Disease and Gene Annotation (DGA) database integrates GeneRIF, Disease Ontology and molecular interaction networks to construct disease-gene, gene-gene and disease-disease relations. However, many disease names are given as abbreviations instead of long forms, so many disease-gene relations may be omitted from DGA.

In this thesis, we planned three studies using text-mining methods: i) analysis of gene functions with free terms (i.e., any terms in the literature) generated by literature mining or provided by the user; ii) accurate identification and integration of comprehensive molecular interactions from Medline abstracts, to construct molecular networks and sub-networks related to the free terms; and iii) exploration of genes that co-occur with any free terms in the literature, to identify related genes effectively and construct the corresponding gene network. Finally, we developed
two web-based text-mining servers, GenCLiP 2.0 and CooLGeN.

Material and methods:
1. Literature mining of human gene function and networks
(1) Identifying gene-related abstracts. To create the thesaurus, we took the gene symbols and aliases for each human Entrez Gene entry from NCBI Gene and HGNC. To edit the lexicon, we eliminated non-informative terms, highly ambiguous terms and common English words. We also automatically expanded spelling variants based on some simple rules. Our gene recognition procedure used dictionary-based and rule-based approaches to identify gene names in abstracts. The rules are based on patterns manually deduced and tuned from the BioCreative II Gene Normalization (GN) training set and our previous research. We used the gene recognition procedure to identify gene-related abstracts and to define gene-PMID links.
(2) Gene functional annotation and cluster analysis. Terms (including single words, GO terms and phrases followed by acronyms) that appeared frequently in the abstracts related to certain genes were considered keywords for these genes. We proposed a fuzzy clustering algorithm to group the keyword results. This method measures the relationship of every keyword-keyword pair with kappa statistics and clusters highly associated keywords. The user can also provide keywords to annotate the input genes, or remove keywords that are automatically generated by our server. A heat map that represents the cluster analysis results of the selected keywords and the input genes can also be created.
(3) Identifying molecular interactions. For de novo extraction of molecular interactions, a rule-based approach was used to search sentences; it considered the words surrounding gene names and interaction words, the distance between two genes or between an interaction word and a gene, and similar features. The rules were compiled from five corpora that contain PPI annotations: AImed, BioInfer, HPRD50, IEPA and LLL. Gene pairs from four manually curated databases (HPRD, BioGRID, CORUM and IntAct) that were
co-mentioned in sentences were additionally considered molecular interactions. To provide more context for the identified gene pairs, we collected all sentences that contained these gene pairs. These sentences and the corresponding abstracts were used as the context of these gene pairs.
(4) Gene network construction. The gene network was constructed from well-defined molecular interactions. A sub-network can subsequently be constructed based on free terms specified by a user: when the free terms appear in the sentence of a gene pair (or in an abstract containing that sentence), the connection is created. The node border color is highlighted for genes that are related to a search term. Moreover, a network of both up- and down-regulated genes can be constructed from a user-defined gene list, with the two groups highlighted in different colors. Random simulation was performed to determine whether a gene network was specific to the input genes.
2. Gene co-occurrence in the literature and gene networks
(1) Using the gene recognition procedure, gene names mentioned in MEDLINE abstracts were recognized and assigned to the corresponding Entrez Gene ID (GID), and the GID-PMID mapping was completed. Furthermore, we split the abstracts that mentioned genes into sentences and assigned each sentence a unique sentence ID (SID). Similarly, gene names appearing in these sentences were recognized and matched to the corresponding Entrez Gene ID, and the GID-SID mapping was then performed. Finally, we built indexes of words and phrases with the corresponding GID, SID and PMID to support the exploration of genes related to any search term.
(2) Extraction and complementation of GeneRIF. A GeneRIF statement includes a GID and a PMID. We extracted the statements for human genes, assigned each a unique sentence ID (RID), and then mapped the GID to the RID. Abbreviations and long forms identified by the BioADI and Allie databases were used to complement the undefined abbreviations in addition to gene names. Gene names mentioned in these sentences
were recognized based on the assigned GID and the gene thesaurus. We then built indexes of words and phrases with the corresponding GID and RID to support the search for term-related genes.
(3) Interaction data. Gene/protein interaction data are of two types: curated PPIs, which were integrated from the HPRD, BioGRID, IntAct and CORUM databases, and text-mined molecular interactions, which were automatically detected by the rule-based approach. These interaction data were used to remind users of known interactors while they explore gene-gene associations, and to construct a specific gene network for selected genes that may be associated with a gene or a certain topic.
3. Web server development
GenCLiP 2.0 and CooLGeN were built on a typical LAMP (Linux + Apache + MySQL + PHP/Perl) platform and designed to provide user-friendly access for database queries. Average-linkage hierarchical clustering of genes and keywords was performed with a Perl module for Cluster 3.0, and the result is represented as a heat map using the PHP GD library. Interactive gene networks are generated using the Flash-based Cytoscape Web and jQuery JavaScript libraries.
4. Application and comparison of the web servers
We tested the keyword annotation function of GenCLiP 2.0 on a set of cell-cycle-regulated genes and compared it with Martini, FatiGO and CoPub. We used GenCLiP 2.0 to analyze genes abnormally expressed in keloid as compared with hypertrophic scar, and compared its performance with other web-based tools such as CoPub, STRING and DAVID. We used CooLGeN to discover genes that are likely to interact with EZH2 and to explore genes, and construct a gene network, related to epithelial-mesenchymal transition, and compared it with iHOP, PolySearch, EBIMed, CoPub and FACTA+.
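The keyword-keyword kappa measure mentioned in step 1(2) of the methods can be illustrated with a minimal Python sketch. The abstract does not give the exact formulation, so this assumes Cohen's kappa computed over binary gene membership (each keyword either annotates a gene or does not). The function names, the toy gene labels and the 0.5 threshold are all hypothetical, and the fuzzy grouping step itself is not reproduced here; only the pairwise scoring is shown.

```python
from itertools import combinations


def kappa(genes_a, genes_b, all_genes):
    """Cohen's kappa between two keywords, each viewed as a yes/no label per gene."""
    universe = set(all_genes)
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("all_genes must not be empty")
    both = len(a & b)                 # genes annotated by both keywords
    neither = n - len(a | b)          # genes annotated by neither keyword
    observed = (both + neither) / n   # observed agreement
    p_a, p_b = len(a) / n, len(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 0.0  # degenerate case: both keywords cover all genes or none
    return (observed - expected) / (1 - expected)


def associated_pairs(keyword_genes, all_genes, threshold=0.5):
    """Return (keyword1, keyword2, kappa) for pairs scoring at or above threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    pairs = []
    for k1, k2 in combinations(sorted(keyword_genes), 2):
        score = kappa(keyword_genes[k1], keyword_genes[k2], all_genes)
        if score >= threshold:
            pairs.append((k1, k2, score))
    return pairs


# Toy data (invented for illustration, not taken from the thesis).
GENES = ["g1", "g2", "g3", "g4", "g5", "g6"]
KEYWORD_GENES = {
    "cell cycle": {"g1", "g2", "g3"},
    "mitosis": {"g1", "g2", "g3"},
    "dna replication": {"g1", "g2"},
    "apoptosis": {"g4", "g5", "g6"},
}

if __name__ == "__main__":
    for k1, k2, score in associated_pairs(KEYWORD_GENES, GENES):
        print(f"{k1} ~ {k2}: kappa = {score:.2f}")
```

Pairs with high kappa would then be the candidates for grouping into the same keyword cluster; keywords annotating disjoint gene sets receive negative scores and stay apart.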
Results:
1. Our gene recognition procedure achieved an F-measure of 82.8% (recall: 83.8%, precision: 81.8%) on the BioCreative II (GN) test set, which compared favorably with other tested methods. Moreover, we evaluated our procedure on the test set of iHOP; the F-measure was 0.86, which was better than that of iHOP. From all Medline abstracts, we identified 20,228 genes that occurred in about 3,780,000 abstracts and 14,820,000 sentences.
2. We identified 16,703 keywords for 20,160 of the 20,228 genes, of which 4,143 keywords were phrases with an acronym and 2,313 were GO terms. The de novo approach recognized 10,937 genes forming 83,037 pairs of molecular interactions, of which 69,095 pairs were not collected by the four PPI databases. In our manually defined test set and other test sets, the precision of the molecular interactions was nearly 90%. Details and a comparison with other tools are available in Supplementary Data 1. After integrating the four databases, the molecular interactions increased to 104,734 pairs, which appeared in about 2,750,000 sentences and 1,080,000 abstracts.
3. GenCLiP 2.0 is a web-based tool (http://ci.smu.edu.cn/GenCLiP2.0/) that can analyze human genes through three functions: (i) generation of enriched and clustered keywords, which are generated based on the occurrence frequencies of free terms in gene-related literature or provided by a user; (ii) construction of a gene network using accurate molecular interactions and generation of sub-networks based on user-defined query terms; and (iii) generation of enriched and clustered GO terms and pathways.
4. CooLGeN is accessible at http://ci.smu.edu.cn/Test/CooLGeN/. CooLGeN contains three main web interfaces: the input page, the result page showing genes with a literature view, and the gene network view page. The input is divided into two categories, free text and gene symbol, which support the discovery of term-gene and gene-gene associations, respectively. For the free-text search, users can enter Boolean search terms that contain multiple single words or phrases. The co-occurrence field comprises Medline
abstracts/sentences and GeneRIF sentences. Users can select genes to construct a gene network, and can add one or more expected genes to the network.
5. In our analysis of 118 abnormally expressed keloid genes in GenCLiP 2.0, the enriched keywords were mostly related to cell growth, extracellular matrix, epithelial-mesenchymal transition, cell migration, cell adhesion, mesenchymal stem cells and wound healing. When 'collagen' was manually entered as a search term, 10 up-regulated genes were found to be closely associated with collagen. These keywords are mostly concordant with well-known characteristics of keloid. Interestingly, keratinocyte and keratinocyte differentiation were also annotated as keywords, which suggests that keratinocytes deserve more attention. The resulting gene networks showed that up-regulated MMP2 played an important role in the network. Interestingly, THBS2, CST3 and GLB1, as activators of MMP2, were up-regulated, while three inhibitors, IL1RN, S100A8 and S100A9, were down-regulated. Most of these genes had not been investigated in keloid. Consequently, we proposed that the abnormal expression of these genes can cause up-regulation of MMP2 and may affect keloid progression. Compared with other similar tools, GenCLiP 2.0 offers unique features.
6. In this application, CooLGeN discovered genes associated with EZH2 effectively. We carefully reviewed the literature and identified 51 interactors of EZH2 that were not present in curated databases. Because CooLGeN supports Boolean search terms, we could retrieve, in a single query, all genes that co-occur with multiple terms related to epithelial-mesenchymal transition. We identified 140 genes that were not in the GO database and constructed their gene network, which reflected the complexity of the EMT process. Compared with similar tools, CooLGeN can discover related genes more conveniently and effectively and meets the needs of more biomedical researchers. Moreover, CooLGeN is the first tool that supports Boolean search to find related genes.
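The idea behind a Boolean free-text search over sentence-level co-occurrence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it handles single-word AND queries over an in-memory index, whereas the servers themselves are described as running on a LAMP stack with MySQL. The function names, the tokenizer, the placeholder gene labels and the toy sentences are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(sentences):
    """Index toy sentences.

    sentences: iterable of (sentence_id, text, gene_ids).
    Returns (word -> set of sentence IDs, sentence ID -> set of gene IDs).
    """
    word_to_sids = defaultdict(set)
    sid_to_gids = {}
    for sid, text, gene_ids in sentences:
        sid_to_gids[sid] = set(gene_ids)
        # Simplistic tokenizer (an assumption): lowercase alphanumeric runs.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            word_to_sids[word].add(sid)
    return word_to_sids, sid_to_gids


def genes_for_all_terms(terms, word_to_sids, sid_to_gids):
    """AND query: count, per gene, the sentences that contain every term."""
    if not terms:
        return {}
    sid_sets = [word_to_sids.get(term.lower(), set()) for term in terms]
    hits = set.intersection(*sid_sets)  # sentences matching all terms
    counts = defaultdict(int)
    for sid in hits:
        for gid in sid_to_gids[sid]:
            counts[gid] += 1
    return dict(counts)


# Toy sentences with placeholder genes (invented for illustration).
SENTENCES = [
    (1, "GENE_A promotes epithelial mesenchymal transition", ["GENE_A"]),
    (2, "GENE_A and GENE_B regulate the epithelial to mesenchymal transition",
     ["GENE_A", "GENE_B"]),
    (3, "GENE_C is involved in wound healing", ["GENE_C"]),
]

if __name__ == "__main__":
    words, genes = build_index(SENTENCES)
    print(genes_for_all_terms(["epithelial", "transition"], words, genes))
```

Intersecting sentence-ID sets keeps the query cost proportional to the size of the posting lists rather than the corpus. Phrase queries and OR/NOT operators would need positional information and set union or difference, which this sketch leaves out.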
Conclusion:
1. We present a web-based text-mining server, GenCLiP 2.0, which can analyze human genes with enriched keywords and molecular interactions. Compared with other similar tools, GenCLiP 2.0 offers two unique features: i) analysis of gene functions with free terms (i.e., any terms in the literature) generated by literature mining or provided by the user; and ii) accurate identification and integration of comprehensive molecular interactions from Medline abstracts, to construct molecular networks and sub-networks related to the free terms. Therefore, GenCLiP 2.0 has a unique advantage in elucidating the molecular mechanisms of a disease, building its molecular network, and discovering diagnostic and therapeutic targets. However, its drawback is a high false-positive rate of annotation; it is not as reliable as manually annotated databases such as GO and KEGG.
2. CooLGeN is a novel text-mining resource that can search for genes that co-occur with any search terms or with a certain gene in the literature, and can construct the corresponding gene network. CooLGeN provides an effective and efficient way for biomedical researchers to identify genes of interest and the interactions between them. CooLGeN can also help curators of biomedical databases annotate gene-associated information.
Keywords/Search Tags:Literature mining, Web server, Free term, Function annotation, Molecular network, Related-gene