Unlike traditional intelligent analytics applications in small-scale data scenarios, intelligent analytics applications in big data scenarios are no longer about a single stand-alone AI algorithm, but about the integration of big data, big models, and big computing, which requires joint consideration of algorithm design, big data processing, and efficient distributed parallel computing. As a result, a new set of challenges has emerged.

First, complex data mining and machine learning in real-world large-scale data scenarios suffer from serious computational efficiency problems. Big data has challenged traditional computational complexity theory and methods: in large-scale data scenarios, even traditional polynomial-complexity algorithms struggle to complete the computation of big data problems. Therefore, it is necessary to design efficient distributed and parallel methods and algorithms for large-scale data scenarios to improve computational efficiency. However, designing efficient distributed data mining and machine learning methods and algorithms raises a series of complex fundamental theoretical and key technical problems. It requires not only taking into account the inherent computational complexity of stand-alone serial algorithms, but also studying the parallelizability of distributed big data algorithms, the system complexity of storage, I/O, and network communication, and the deep performance optimization of distributed algorithms in a distributed parallel environment.

Second, existing big data intelligent analytics technologies and platforms also have serious usability problems. On the one hand, existing big data intelligent modeling methods have a high technical threshold and rely heavily on expert experience. It is therefore necessary to study efficient modeling methods based on automatic machine learning to lower the technical threshold and significantly improve the efficiency of AI model construction. Automatic machine learning in turn faces fundamental theoretical and key technical issues, such as the effectiveness of search and modeling methods and the computational efficiency of search. On the other hand, developing intelligent analytics applications in big data scenarios is not only a problem of algorithm design but also a problem of big data and big computing. It requires solving the key technical problems of the cross-fusion of big data intelligent analytic modeling and distributed parallel computing systems, in order to build a unified programming and computing platform for big data intelligent analytics that integrates algorithm design with big data programming and computing capabilities.

Building on fundamental theoretical and methodological research on distributed data mining and machine learning, automatic machine learning, and big data programming and computing methods, this thesis takes into account both the importance and technical challenges of the algorithms themselves and the practical application requirements of industry. First, it selects a series of data mining and machine learning algorithms that are commonly used, highly complex, computationally inefficient, and difficult to distribute, and studies efficient large-scale distributed and parallel methods and algorithms for them. Second, it proposes efficient automatic machine learning methods and algorithms for different task scenarios. Finally, by integrating distributed data mining and machine learning with automatic machine learning, the thesis studies the construction of an efficient and easy-to-use unified programming method and computing platform, which has been validated in practical applications. Specifically, the main research content and innovations are as follows.

(1) Attribute-reordering-based large-scale distributed functional dependency discovery algorithm. Functional dependencies are a basic and commonly used form of metadata in data mining. However, functional dependency discovery has high computational and memory complexity, leading to huge runtime and memory overheads in large-scale data scenarios. To address this problem, the thesis proposes SmartFD, a large-scale distributed functional dependency discovery algorithm based on attribute reordering. In the data preprocessing stage, an efficient attribute reordering method based on skewness and cardinality is designed. In the functional dependency discovery stage, the thesis proposes a distributed sampling method based on a fast-sampling and early-aggregation mechanism, an index-based validation method, and an adaptive method for switching between sampling and validation. The experimental results show that SmartFD achieves a performance improvement of one to two orders of magnitude over existing algorithms, with good data scalability and system scalability.

(2) Massively parallel spectral clustering algorithm based on the distributed data-parallel model. Spectral clustering can achieve better clustering quality than traditional clustering algorithms. However, its calculation process is complex, its computational complexity is high, and it is time-consuming, so computational efficiency becomes a major problem in large-scale data scenarios. To address this problem, the thesis proposes SCoS, a massively parallel spectral clustering algorithm based on the distributed data-parallel model. SCoS parallelizes similarity matrix construction and sparsification, Laplacian matrix construction and normalization, eigenvector computation for the normalized Laplacian matrix, and k-means clustering. The experimental results show that SCoS achieves good data scalability and system scalability in large-scale data scenarios.

(3) Distributed task-parallel deep forest training method and algorithm based on uniform split of sub-forests. In recent years, researchers have proposed the deep forest model, whose performance is comparable to that of deep neural networks. However, existing deep forest training algorithms are stand-alone and serial, with low computational efficiency and poor scalability, making it difficult to meet practical application requirements. To address this problem, the thesis proposes ForestLayer, a distributed task-parallel deep forest training algorithm based on fine-grained sub-forest splitting, which increases computational concurrency while reducing network communication overhead. Three system optimization methods are further proposed: lazy scan, pre-pooling, and partial transmission. The experimental results show that ForestLayer achieves a 7 to 20 times speedup over the existing deep forest training algorithm, with near-linear scalability and good load balance.

(4) Efficient automatic machine learning methods and algorithms for different task scenarios. First, for the full-pipeline big data analysis scenario, the thesis proposes RoboML, an automatic machine learning pipeline design algorithm based on reinforcement learning and Bayesian optimization, in which structure search and hyperparameter optimization are optimized alternately. Second, for the resource-constrained scenario, the thesis proposes BOASF, an AutoML algorithm based on adaptive successive filtering. BOASF models the AutoML problem as a multi-armed bandit problem and accelerates the AutoML search process through adaptive fast filtering and adaptive resource allocation. Finally, to address concept drift in the lifelong learning scenario, AutoLLE is proposed. It integrates global incremental models and local ensemble models and adaptively adjusts the weight of each model based on the time window and an error measure. AutoLLE can efficiently capture concept drift and improve the prediction performance of the machine learning model.

(5) Efficient automatic deep learning methods and algorithms for big data. First, to improve the computational performance of hyperparameter optimization for deep neural networks, the thesis proposes FastHO, an efficient hyperparameter optimization method that combines progressive multi-fidelity optimization with successive halving. It improves the efficiency of hyperparameter optimization by filtering out poorly performing hyperparameter configurations as early as possible and gradually allocating more resources to the remaining configurations. Second, to improve the architecture search efficiency of deep neural networks, the thesis proposes MGDARTS, a differentiable network architecture search algorithm that minimizes the discretization performance gap. By designing a weight function that saturates more easily and imposing an integral constraint on the sum of the weights of each edge in the super network, MGDARTS minimizes the performance loss caused by discretizing the super network. The experimental results show that the proposed algorithms all outperform existing algorithms.

(6) Efficient and easy-to-use unified big data intelligent analytics programming method and platform. To effectively support the development of intelligent big data analytics applications, the thesis integrates the distributed data mining and machine learning methods and the automatic machine learning methods described above, and designs and implements an efficient and easy-to-use cross-platform unified programming model and platform for big data intelligent analysis that supports multiple computational models. First, a cross-platform unified programming model is proposed that covers the table model, matrix model, tensor model, graph model, streaming data model, and other computational models. On top of this unified programming model, a visual programming method for big data intelligent analysis based on computational flow diagrams is designed. Second, a unified platform integration framework for big data intelligent analysis and a cross-platform unified job scheduling method are studied. Finally, practical application verification is carried out on the resulting unified big data intelligent analysis and visual programming platform.

This work has resulted in 7 first-author research papers (including 2 CCF A journal/conference papers, 2 CCF B journal/conference papers, and 1 Chinese CCF A journal paper). In addition, the AutoML research has won a total of 9 awards in AutoML competitions held at top international AI conferences such as NeurIPS, KDD, and PAKDD. It also won the national gold medal in the Fifth China College Students' "Internet+" Innovation and Entrepreneurship Competition sponsored by the Ministry of Education. The related technical achievements have been transferred to several large IT enterprises in China, such as Huawei and 360, for practical deployment.
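
The resource-allocation idea described in contribution (5), discarding weak hyperparameter configurations early and giving the survivors more resources, can be illustrated with a minimal sketch of generic successive halving. This is not the thesis's hyperparameter optimization implementation and omits its progressive multi-fidelity component. The names `successive_halving` and `toy_loss`, the `eta` and `min_budget` parameters, and the toy objective are all assumptions made for this example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Generic successive halving (illustrative sketch only).

    configs  -- non-empty list of candidate hyperparameter configurations
    evaluate -- hypothetical callback (config, budget) -> loss, lower is better
    eta      -- keep roughly the best 1/eta of the candidates each round
    """
    if not configs:
        raise ValueError("configs must not be empty")
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Filter out the poorly performing configurations early ...
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        # ... and allocate a larger budget to the remaining ones.
        budget *= eta
    return survivors[0]


def toy_loss(config, budget):
    """Toy objective (assumed for the demo): the best learning rate is 0.1,
    and a larger budget lowers the loss of every configuration."""
    return abs(config["lr"] - 0.1) + 1.0 / budget


candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 1.0)]
best = successive_halving(candidates, toy_loss)
print(best)
```

In a real setting, `evaluate` would train a model for `budget` epochs (or on a `budget`-sized data sample) and return a validation loss, so most of the total compute goes to the few configurations that survive the early rounds.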