Bioinformatics Modelling On Texts And Its Research And Application On Prostate Cancer

Posted on: 2014-03-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Zhu
Full Text: PDF
GTID: 1264330431973661
Subject: Systems Biology
Abstract/Summary:
Biomedical texts provide a wealth of resources for biomedical researchers. However, it is impossible to process this enormous volume of text manually. Text mining can help researchers find information of interest in existing texts. Through text mining, relevant biomedical texts can be retrieved from literature databases; important information and knowledge can be extracted from unstructured texts that contain numerous research results and experimental data; and hypotheses and predictions can be generated to guide further research.

Cancer is one of the most serious diseases affecting human health. Research on cancer prevention, diagnosis, and treatment is one of the most active areas of biomedical research. Biomedical research relies heavily on existing material, and a large body of cancer-related literature and experimental data is available, while text mining is well suited to processing such information. Many researchers have therefore begun to combine cancer research with text mining to discover new knowledge and advance biomedical research.

In this dissertation, we review the sub-tasks of text mining, its general processes, and commonly used data sets and tools, and present some current applications of text mining in cancer research. We also analyze and summarize the routine process of text mining-based cancer systems biology research, and point out the limitations of text mining along with its challenges and possible solutions.

To obtain information from massive data by text mining, it is essential to identify biological terms in the texts. Named entity recognition aims to identify predefined types of entity names, such as genes and proteins.
However, because biological texts differ from general texts in many respects, such as term structure, grammar, morphology, semantics, and context, many recognition systems fail to identify terms in biological texts.

SVM (Support Vector Machine) performs well on small-scale, non-linear, high-dimensional pattern recognition and other machine learning problems. CRF (Conditional Random Fields) is good at solving sequence tagging problems. However, both have limitations and drawbacks. As they are complementary to some extent, combining the two methods can improve performance.

In this dissertation, we propose a series of algorithms to detect biological terms in texts. The method uses SVM to determine whether a term is a biological one and then uses CRF to decide the type of each biological term. After the results returned by SVM and CRF are merged, a correction algorithm uses maximal bi-directional probability to remove inconsistencies and ensure the maximal length of each term.

Test results on the GENIA and JNLPBA04 data sets show that the proposed method yields better results. The basic idea of the proposed method is to take full advantage of SVM to improve the effect of CRF. However, since SVM and CRF are two different methods, simply combining them may cause inconsistencies. The correction algorithms resolve this inconsistency problem, thereby improving recognition performance.

As biological research has progressed, people have gradually realized that complex biological functions and the phenomena of life result from complex interactions among a variety of basic biological units. Studying bio-molecular interaction networks in depth, to understand life through a variety of bio-molecular interactions, is an element of modern biology.

In a network environment, it is not appropriate to consider only the single interaction between two biological entities.
Therefore, in this dissertation, we propose the concept of comprehensive influence to measure interactions in the network context. Comprehensive influence includes both the direct interaction between two nodes, which represent two entities, and the indirect interactions between them. The results show that comprehensive influence is better suited to the network environment. We believe that greater influence implies a stronger force between two biological entities, as well as a higher probability that the interaction occurs. As most biological networks are not random networks, we put forward a network entropy evaluation method, based on comprehensive influence, that measures the irregularity of the network flow distribution in order to analyze the stability of the network during evolution. Because the final network after iteration is topologically different from a randomly built network, a network with lower network entropy, which indicates greater stability, is considered better.

In this dissertation, adopting the idea of reinforcement learning, we put forward an algorithm for forming interaction networks that builds on the actor-critic framework. In the algorithm, nodes represent bio-molecules and edges denote interactions. During the evolutionary process, a node selects which nodes in the network it tends to interact with. Different decisions result in different network entropy values. The average network entropy is used to evaluate the current state. Selection and iteration continue until an optimal network is eventually formed. The network is thus the result of dynamic learning behavior.

Prostate cancer is a malignancy that has long concerned researchers. In this dissertation, we obtain biological texts from PubMed and establish a prostate cancer protein interaction network using the proposed methods. The results show that the proposed method performs well.
Network topology analysis also shows that the node degree distribution of the network is scale-free.
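
As a rough illustration of how a degree distribution like the one mentioned above might be inspected, the following minimal Python sketch computes the fraction of nodes at each degree and a least-squares slope on log-log axes. It is not the dissertation's procedure: the function names `degree_distribution` and `loglog_slope`, the toy edge list, and the placeholder node names are all assumptions made for illustration. A clearly negative, roughly linear log-log trend is only a crude hint of scale-free behavior, not a rigorous statistical test.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the fraction of nodes that have degree k."""
    # Treat the network as undirected and simple: drop self-loops and duplicates.
    unique_edges = {tuple(sorted((u, v))) for u, v in edges if u != v}
    degree = Counter()
    for u, v in unique_edges:
        degree[u] += 1
        degree[v] += 1
    node_count = len(degree)
    histogram = Counter(degree.values())
    return {k: histogram[k] / node_count for k in sorted(histogram)}


def loglog_slope(distribution):
    """Least-squares slope of log P(k) versus log k (needs >= 2 distinct degrees)."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(p) for p in distribution.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical toy network: node names are placeholders, not real proteins.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
             ("P1", "P6"), ("P2", "P3"), ("P3", "P2")]

dist = degree_distribution(toy_edges)
slope = loglog_slope(dist)
print(dist)
print(f"log-log slope: {slope:.2f}")
```

In this toy case one hub node holds most of the edges, so the slope comes out negative. On a real network, the edge list would come from the extracted protein interactions, and a proper power-law fit would need more careful statistics than this sketch provides.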
Keywords/Search Tags: bioinformatics, text mining, prostate cancer, reinforcement learning, protein interaction network