
Functional Cluster Of Genes And Construction Of Molecular Networks Based On Free Terms

Posted on: 2013-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: J H Wang
GTID: 2234330395461642
Subject: Genetics
Abstract/Summary:
■ Background

High-throughput assays such as microarrays, proteomics, and RNA sequencing are now in general use. They typically yield a list of signature genes or gene products. It is then necessary to analyze the functions enriched in that list and the pathways or networks the genes are involved in, and to identify the key nodes or pathways of the gene networks.

With the rise of bioinformatics, literature mining has gradually become a routine auxiliary tool in biomedical research and an important means of obtaining raw data on a large scale. It helps advance diagnosis, prevention, and treatment, and it plays a major role in many areas of biological research, for example in obtaining protein-protein interactions, gene function annotations, and biological pathways.

Many databases of gene functions, pathways, and molecular interactions have been curated from the literature, for example the Gene Ontology (GO) database for gene functions, the KEGG database for pathways, and the HPRD database for protein-protein interactions. However, because human effort and resources are limited, these databases cover only a small part of the known gene functions, pathways, and molecular interactions, and their annotation types are standardized and cannot be changed. In the GO database, for example, many terms related to gene function, such as "embryonic stem cell" or the name of a specific virus, have not been defined as annotation terms.

Here, we developed a web server, GenCLiP 2.0, for literature mining of gene functions, pathways, and molecular interactions. The main features of this tool are: 1) it can analyze gene functions using free terms generated by machine or provided by the user; 2) it identifies and integrates the most comprehensive set of molecular interactions from the whole of PubMed to construct interaction networks and sub-networks related to any free term.

■ Material and methods

1. 
Literature mining of the functions and molecular networks of human genes

(1) Download databases: download the literature database from PubMed and the gene name databases from Entrez Gene and HUGO into a local database, and update them regularly.

(2) Identify gene-related literature: integrate the Entrez Gene and HUGO gene names to establish a database of human genes; based on the human gene names (including abbreviations, aliases, full names, and product names), develop gene name recognition rules, identify gene-related literature in PubMed, and build the database.

We combine a dictionary-based approach with a rule-based approach to identify gene-related literature. First, we extract the names of human genes (including abbreviations, aliases, full names, and product names) from Entrez and the gene profiles. We expand, delete, and correct gene names to improve the recall of gene name recognition. Meanwhile, based on the full gene names and gene profiles, we develop secondary search terms to improve the precision of gene name recognition. Then, we summarize complex gene name identification rules from the BioCreative II GN training set to further improve recall and precision.

(3) Identify gene functional annotations: extract non-common words and phrases that occur at high frequency in the literature related to at least two genes as functional annotations (keywords), and build the database.

We identify gene keywords in two ways. First, we identify non-common words that occur at high frequency in gene-related literature as candidate keywords. Second, we identify high-frequency phrases in gene-related literature (including biological process and molecular function annotations from the GO database, and abbreviated phrases found in the literature) as candidate keywords. 
Candidate keywords shared by at least two genes are then retained as gene keywords.

(4) Identify molecular interactions: extract molecular interactions (including protein-protein and protein-gene interactions) from the gene-related literature and build the database.

We collected the words widely used to describe molecular interactions. We summarized the usage of each of these words on five test sets of protein-protein interaction literature and formulated identification rules accordingly. Gene-related literature is divided into sentences, and the gene/protein interactions in each sentence are identified according to these rules.

(5) Integrate other molecular interaction databases: we collected four existing manually annotated molecular interaction databases (HPRD, BioGRID, CORUM, IntAct), extracted the molecule pairs that co-occur in the same sentence, and integrated them into the molecular interaction database.

(6) Word-related gene search: based on the word or word combination submitted by the user, identify sentences in the gene-related literature in which the word (combination) and a gene co-occur, and report the genes related to the word (combination).

(7) Gene function search and clustering: for a single gene or gene list submitted by the user, search the gene functional annotation database to identify the functional annotations of the single gene, or identify the functional annotations enriched in the gene list and group them by fuzzy clustering. 
Users can manually add or delete functional annotations of genes so that the clustering results better match their purpose.

(8) Construct molecular networks: for a single gene or gene list submitted by the user, search the database to identify the molecular networks in which the single gene is involved, or construct the molecular network of the gene list. Furthermore, based on the word or word combination submitted by the user, search the literature and identify the gene pairs in the molecular network that co-occur with the word (combination) in the same sentence, to build molecular networks related to specific keywords.

2. Mining the functions and pathways of human genes from databases

(1) Download the GO database and pathway databases (including metabolic pathways) to the local server.

(2) For a single gene or gene list submitted by the user, search the GO annotations and regulatory pathways, or perform enrichment analysis on them.

(3) Apply fuzzy clustering to the enrichment analysis results for the functions of the submitted gene list and display the results.

3. Development of the GenCLiP 2.0 web server

We used a LAMP stack (Linux + Apache + MySQL + PHP/Perl) on a server of a high-performance computer cluster; that is, the entire system runs on Linux, with Apache as the web server and MySQL as the database system, and is developed in PHP/Perl scripts combined with HTML and JavaScript. We aimed to design a stable, easily extensible web server with an easy-to-use web interface.

4. 
Testing and application of the GenCLiP 2.0 web server

(1) We tested the recall and precision of the gene name recognition module on the BioCreative II GN test set and the iHOP test set, and compared them with those of the systems that participated in the BioCreative II challenge and with iHOP.

(2) We randomly extracted 200 sentences (containing 442 pairs of molecular interactions) and validated them manually to determine the accuracy of molecular interaction recognition.

(3) We compared the molecular interactions identified by literature mining with the four existing PPI databases to determine the number of newly discovered molecular interactions.

(4) We entered the string cancer "stem cell" into the word-related gene search module to find genes related to cancer stem cells, and determined the accuracy by manual reading.

(5) Sengupta reported 695 genes differentially expressed in nasopharyngeal carcinoma compared with normal nasopharyngeal tissue, of which 326 are upregulated and 369 are downregulated. We entered the two groups of genes separately into GenCLiP 2.0 to perform clustering analysis of gene functions and to construct molecular networks.

■ Results

1. The gene name recognition module achieved a recall of 83.8%, a precision of 81.8%, and an F value of 0.828 on the BioCreative II GN test set, which is better than the best result in the competition. The F value on the iHOP test set is 0.86, which is better than the result reported for iHOP.

2. Among 19.65 million PubMed abstracts up to 2010, 18,305 human genes were found in 3.14 million abstracts, with 5.94 million occurrences in total. The average number of related articles per gene is 326.

3. A total of 17,497 keywords were recognized. Across 18,232 human genes, each gene has 24 keywords on average, and each keyword has 25 related genes on average. The frequency of each keyword in the literature related to each gene is calculated. These keywords and their frequencies can then be used for literature-based functional annotation and cluster analysis.

4. 
We developed 53 rules for recognizing molecular interactions. The accuracy on the training set is almost 90%. The molecular interaction module finally identified 60,609 gene pairs, of which less than a quarter overlap with the four popular PPI databases. After integrating these four PPI databases, the number of molecular interactions increased to 79,033 pairs.

5. We have finished building the web platform, available at http://ci.smu.edu.cn. The main functional modules of GenCLiP 2.0 are the word-related gene search module, the gene information module, the gene functional annotation clustering module, the literature mining molecular network module, the GO and pathway analysis module, and the registration module.

6. Using the string cancer "stem cell", we obtained 333 candidate cancer stem cell genes with the word-related gene search module. The genes are sorted by the number of articles in which they co-occur with cancer "stem cell". Manual reading showed that about 50% of the genes are correct.

7. The results of the GenCLiP 2.0 analysis of the 695 differentially expressed genes of nasopharyngeal carcinoma are consistent with Sengupta's GO annotation analysis. In addition, GenCLiP 2.0 found that the differentially expressed genes of nasopharyngeal carcinoma are closely associated with epithelial differentiation, EBV response, embryonic stem cells, and mesenchymal stem cells. These associations appear as free terms rather than in the standard form of GO annotations, so they cannot be found by GO annotation. Furthermore, GenCLiP 2.0 constructs the gene networks in which the differentially expressed genes of nasopharyngeal carcinoma are involved and that are related to specific functions, and identifies the key nodes of these networks.

■ Conclusion

1. GenCLiP 2.0 extracts functional annotations of human genes and molecular interactions from the literature. 
Its advantages are: 1) it makes full use of free terms and overcomes the restrictions of standardized databases such as GO; 2) it allows users to apply their professional knowledge to annotate gene functions by adding or deleting candidate annotation terms; 3) it covers the whole of PubMed, a wider range than manually annotated databases such as GO and KEGG. GenCLiP 2.0 therefore has a unique advantage in elucidating the molecular mechanisms of disease, building disease molecular networks, and discovering diagnostic and therapeutic targets. Its drawback is a high false positive rate: the annotations are not as reliable as those of manually annotated databases such as GO and KEGG.

2. The gene recognition module of GenCLiP 2.0 has high recall and precision, comparable to advanced literature mining software elsewhere, so the gene-related literature is reliable. The molecular interaction recognition module has a high accuracy, up to 89%, but its recall is very low, less than 30%. Both modules still have room for improvement and could be studied further or extended to more species.

3. The molecular interaction database of GenCLiP 2.0 provides over 60,000 gene pairs, most of which have not yet been annotated by other databases. After integration of the four existing PPI databases, the molecular interaction database contains more than 70,000 pairs, making it the most comprehensive abstract-based annotation database so far.

4. GenCLiP 2.0 offers fast analysis with a short analysis cycle, and has a user-friendly, easy-to-use interface.
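
The sentence-level interaction extraction and the F value described in this abstract can be illustrated with a minimal Python sketch. It is not the GenCLiP 2.0 implementation: the alias dictionary, the trigger-word list, and all function names are hypothetical, and plain trigger-word co-occurrence stands in for the 53 recognition rules, which the abstract does not spell out.

```python
import re
from itertools import combinations

# Hypothetical toy dictionary: lowercase alias -> official gene symbol.
GENE_ALIASES = {
    "tp53": "TP53",
    "p53": "TP53",
    "mdm2": "MDM2",
    "egfr": "EGFR",
    "erbb1": "EGFR",
}

# Hypothetical interaction words; a real system would use a curated list.
TRIGGERS = {"binds", "interacts", "phosphorylates", "activates", "inhibits"}


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_genes(sentence):
    """Dictionary lookup: map every known alias in the sentence to its symbol."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return {GENE_ALIASES[t.lower()] for t in tokens if t.lower() in GENE_ALIASES}


def extract_interactions(text):
    """Return gene pairs that share a sentence containing an interaction word."""
    pairs = set()
    for sentence in split_sentences(text):
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}
        if not words & TRIGGERS:
            continue  # no interaction word, so co-occurrence alone is ignored
        genes = sorted(find_genes(sentence))
        pairs.update(combinations(genes, 2))
    return pairs


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


text = "MDM2 binds p53 and promotes its degradation. EGFR was measured with TP53."
print(extract_interactions(text))          # {('MDM2', 'TP53')}
print(round(f_measure(0.818, 0.838), 3))   # 0.828
```

The second sentence mentions two genes but no interaction word, so it contributes no pair; this is the kind of filtering that trades recall for accuracy. The last line reproduces the reported F value of 0.828 from the reported precision (81.8%) and recall (83.8%).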
Keywords/Search Tags: Literature mining, Database, Web server, Free term, Function annotation, Molecular network