
Distributed Parallel Machine Learning Algorithms And The Application In Biomedical Field

Posted on: 2019-10-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J G Chen
Full Text: PDF
GTID: 1364330545472898
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technologies such as the Internet, the Internet of Things, and sensor networks, large-scale datasets have proliferated across application fields. In the era of big data, efficiently and accurately extracting valuable knowledge from these datasets has attracted increasing attention in both academia and industry. Efficient machine learning and data mining technologies are urgently needed for big data processing. At the same time, parallel and distributed computing resources provide efficient computing power for machine learning technologies. This dissertation studies distributed parallel machine learning algorithms, including parallel classification, clustering, graph mining, and deep learning algorithms. The proposed algorithms are applied in medicine, bioinformatics, and biomedicine, providing a scientific basis for medical diagnosis and for exploring the laws of life and biological activity. The main contributions and innovations of this dissertation are as follows:

(1) A distributed parallel classification algorithm and its application to hospital queuing recommendation. We propose a Parallel Random Forest (PRF) classification algorithm based on the Apache Spark cloud platform. The parallel solution of PRF is designed from the perspectives of data parallelism and task parallelism. For data parallelism, vertical data partitioning and data multiplexing methods are proposed to reduce data communication costs among machines. For task parallelism, a two-layer parallel training method is proposed, in which training is performed in parallel both across the decision trees in the PRF model and across the nodes within each tree. The proposed PRF algorithm is applied to the Hospital Queuing-Recommendation (HQR) system, where PRF is used to train a model of patients' treatment time consumption. Then, based on the trained model and the current queuing situation of each treatment project, the HQR system can plan an intelligent treatment route for each patient.

(2) A parallel clustering algorithm and its application to disease diagnosis and treatment recommendation. We propose an Adaptive Domain Density-peak Clustering (ADDC) algorithm. First, to address the loss of sparse clusters on datasets with varying density distribution (VDD), we propose an adaptive domain density measurement method. Second, to address cluster fragmentation on datasets with multiple domain-density maximums (MDDM), we propose a cluster self-merging method. The proposed ADDC algorithm is applied to a disease diagnosis and treatment recommendation system. From massive historical disease treatment datasets, it identifies disease symptom clusters that have multiple symptoms and multiple etiologies. Association rules between the disease symptom clusters and their corresponding treatments are then analyzed. The system can automatically identify a patient's current disease symptoms from his or her inspection report and recommend the corresponding treatment plans.

(3) A parallel deep learning algorithm for distributed computing environments and its application to colon cancer cell nucleus detection and classification. Based on distributed computing, a Bi-layer Parallel Training architecture of Convolutional Neural Network (BPT-CNN) is proposed to improve CNN training performance. In the outer parallel training layer, strategies such as data parallelism, asynchronous weight updating, and dynamic data migration are proposed to address the problems of data communication, task synchronization, and workload balancing in distributed parallel computing. In the inner parallel training layer, the training process of each CNN sub-network is further accelerated on each machine. The proposed BPT-CNN algorithm is applied to the diagnosis of pathological images, and a deep learning-based colon cancer cell nucleus detection and classification algorithm is proposed. It can detect and classify cancer cell nuclei of different forms in pathological slice images.

(4) A parallel graph mining algorithm and its application to the Protein-Protein Interaction (PPI) network. First, we integrate the original PPI network with Gene Expression Datasets (GED) to construct a Weighted PPI (WPPI) network model, which considers both the protein topology of the PPI network and the genetic relationships in specific biological processes. A Multi-source Learning-based Protein Community Detection (MLPCD) algorithm is then proposed for WPPI networks. The detected protein communities are compared with known protein complexes and function modules, and Gene Ontology annotations are used to assess the functional enrichment of these communities. Experimental results show that the MLPCD algorithm is superior to related algorithms in terms of accuracy and performance.

The work of this dissertation has both theoretical value and practical significance. In the era of big data in particular, it makes full use of distributed and parallel computing resources to improve the performance of scalable parallel machine learning algorithms. We then explore the application of these algorithms to the field of biomedicine, laying a foundation for applications in other practical fields.
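
As a rough illustration of the density-peak idea that the ADDC algorithm in contribution (2) builds on, the sketch below computes the two quantities used in generic density-peak clustering: a local density for each point and the distance from each point to its nearest higher-density neighbor. It is not the dissertation's ADDC method. The k-nearest-neighbor density used here is an assumption standing in for the adaptive domain density measurement, and the names `density_peaks`, `pick_centers`, and the parameter `k` are hypothetical.

```python
import math


def density_peaks(points, k=2):
    """Return (rho, delta) for each point, as in generic density-peak clustering.

    rho[i]   : local density; ASSUMED here to be the inverse of the mean
               distance from point i to its k nearest neighbors.
    delta[i] : distance from point i to the nearest point with strictly
               higher density; for the densest point, the largest distance
               from it to any other point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        # Small constant avoids division by zero for duplicate points.
        rho.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))

    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, n_centers):
    """Pick the indices with the largest rho * delta as cluster centers."""
    score = [r * d for r, d in zip(rho, delta)]
    ranked = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
    return ranked[:n_centers]
```

Because the density here is measured relative to each point's own neighborhood, the center of a sparse cluster can still score highly even when a much denser cluster is present, which is the kind of situation the abstract describes for datasets with varying density distribution.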
Keywords/Search Tags: Distributed computing, Parallel computing, Machine learning, Big data, Biomedicine