As machine learning models grow in scale, their computational costs become harder to ignore. Larger models require more computation, more storage, and longer training times, all of which increase carbon emissions. With the widespread use of embedded devices, model inference is also becoming ubiquitous, which means that a vast fleet of embedded devices now carries its own growing environmental footprint. Embedded devices are typically constrained in computing power, storage capacity, and power budget. If a machine learning model is too large for such a device, or runs on it for long periods, the resulting power draw causes overheating, raises energy demand, and leads to more carbon dioxide emissions and environmental problems such as global warming. We must therefore attend not only to the power consumed during the training phase of machine learning models but also to the consumption of the inference phase and its impact on the environment. When developing machine learning models and products, we should strive to reduce computational requirements and adopt lightweight designs, minimizing environmental harm while improving the sustainability and long-term value of model operation.

This thesis responds to the recent call to action in the artificial intelligence community to strengthen research on the environmental impact of machine learning. The research direction we have chosen is to change the composition of the training data: we propose novel sampling methods that select representative training subsets and reduce the effect of sampling error on model training, thereby improving training efficiency. Many researchers advocate that model training should not blindly pursue big data but should instead focus on selecting high-quality, representative data, and that data scaling, sampling, and selection strategies should be carefully designed to improve training efficiency and reduce the environmental footprint of the entire process. Relevant studies have shown that the quality of data sampling directly affects the training time of machine learning models, and thus directly impacts their carbon footprint. This is because mainstream machine learning and deep learning models rely on simple random sampling during training, a method with significant sampling error whose performance deficiencies are masked by "computational brute force", wasting computational resources. Some researchers have proposed constructing "core subsets": relatively small subsets of samples drawn from large-scale datasets for model training. However, repeated experiments have shown that research on such methods is still immature. Building on this prior work, we therefore seek new sampling designs that improve sampling quality and enhance both the performance and the training efficiency of machine learning models.

Professor Fang Kaitai and Academician Wang Yuan pioneered the technical route of applying number theory methods to statistical techniques. Following in their footsteps, this thesis proposes the Best Discrepancy Cluster sampling (BDC) method, the Best Discrepancy Cluster Bootstrap (BDCB) method, and the Best Discrepancy Cluster Subsampled Double Bootstrap (BDC-SDB) method, all built on the theory of "low-discrepancy sequences" from analytic number theory. These methods are then applied across deep learning model training, random forests, deep forests, lightweight hybrid model design, and embedded device deployment.
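To make the low-discrepancy idea concrete, the following minimal Python sketch builds equal-sized clusters whose members spread evenly across a dataset. The van der Corput sequence, the one-dimensional PCA-style ordering, and the function names are illustrative assumptions, not the thesis's exact BDC construction.

```python
import numpy as np

def van_der_corput(n, base=2):
    """First n terms of the van der Corput low-discrepancy sequence.

    Illustrative choice: the thesis may use a different
    low-discrepancy construction.
    """
    seq = np.empty(n)
    for i in range(n):
        q, denom, x = i + 1, 1.0, 0.0
        while q > 0:
            q, r = divmod(q, base)
            denom *= base
            x += r / denom
        seq[i] = x
    return seq

def best_discrepancy_clusters(data, n_clusters):
    """Split a dataset into clusters that each spread over the whole.

    Hypothetical sketch: sort points along one principal direction,
    visit them in the balanced (bit-reversal-style) order induced by
    the van der Corput values, and deal them round-robin so every
    cluster covers the whole sorted range.
    """
    n = len(data)
    # Order the data along its first principal direction (a simple proxy).
    proj = data @ np.linalg.svd(data - data.mean(0), full_matrices=False)[2][0]
    order = np.argsort(proj)
    # Visiting order induced by the low-discrepancy sequence.
    lds_rank = np.argsort(van_der_corput(n))
    clusters = [[] for _ in range(n_clusters)]
    for k, idx in enumerate(order[lds_rank]):
        clusters[k % n_clusters].append(idx)
    return [np.asarray(c) for c in clusters]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
clusters = best_discrepancy_clusters(X, n_clusters=10)
print([len(c) for c in clusters])            # equal cluster sizes
print(X[clusters[0]].mean(0), X.mean(0))     # cluster mean roughly tracks global mean
```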
First, the BDC method constructs a sampling frame that exploits the uniform distribution property and rapid convergence of low-discrepancy sequences, and divides the entire dataset into many small clusters using whole-cluster sampling. Using the analysis-of-variance (ANOVA) decomposition, we prove theoretically that when the intra-cluster correlation coefficient of the sampling frame is negative, cluster sampling is more accurate than both simple random sampling and stratified sampling: for clusters of size m, the design effect of single-stage cluster sampling relative to simple random sampling is 1 + (m - 1)ρ, so a negative intra-cluster correlation ρ pushes it below 1. Equivalently, the variance within each small cluster is larger than the variance between clusters, making each small cluster a good representation of the whole dataset. Experiments 2 and 3 demonstrate that, for equal and unequal cluster sizes respectively, the BDC method's average estimation standard error is 61.46% and 62.33% lower than that of simple random sampling. Experiment 4 shows that on high-dimensional datasets the BDC method outperforms stratified sampling, with an estimation standard error 37.96% lower. Experiment 5 shows that the BDC method reduces the sampling error incurred by simple random sampling, cutting the prediction error rate of machine learning models by 36.95%.

In addition, this thesis proposes two new methods based on traditional resampling techniques: BDCB and BDC-SDB. Experiment 6 demonstrates that the proposed BDCB method attains an average mean squared error (MSE) of the estimated means 73.71% lower than that of the traditional bootstrap. Experiment 7 shows that, compared with the traditional bootstrap, BDCB, and Subsampled Double Bootstrap (SDB) methods, the BDC-SDB method yields smaller coefficient-estimation errors for machine learning models and converges faster. Results on 30 publicly available datasets demonstrate that the advantages of these two new methods are most pronounced when the subsample size is small, providing significant savings in computation time and resources.

Next, this thesis combines the proposed BDC method with deep learning training. Each batch is treated as a cluster, so the dataset can be divided into batches by the BDC method. Because the BDC method makes every batch representative of the whole training set, training becomes more efficient: the number of training iterations can be reduced and overfitting can be prevented. Experiment 8 demonstrates that on tabular datasets the convergence of deep learning training with the BDC method is approximately 40% faster than with conventional batching; Experiment 9 shows an improvement of around 83% on image datasets. Both experiments confirm that BDC-based training keeps each batch close to the whole training set, so the model can learn useful information from the training data in fewer iterations, accelerating the training process.
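As a concrete illustration of this batch-as-cluster training, the sketch below feeds precomputed representative clusters to a PyTorch DataLoader as whole batches. The BDCBatchSampler class and the strided stand-in partition are assumptions for illustration; the thesis's BDC partition would replace the stand-in.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, Sampler

class BDCBatchSampler(Sampler):
    """Yield precomputed representative clusters as training batches.

    Hypothetical sketch: `clusters` is any partition of the dataset
    indices in which every cluster mirrors the full data distribution
    (e.g. built with a low-discrepancy construction as sketched above).
    """
    def __init__(self, clusters, shuffle=True):
        self.clusters = clusters
        self.shuffle = shuffle

    def __iter__(self):
        ids = (np.random.permutation(len(self.clusters)) if self.shuffle
               else np.arange(len(self.clusters)))
        for c in ids:
            yield list(self.clusters[c])  # one full cluster = one batch

    def __len__(self):
        return len(self.clusters)

X = torch.randn(1024, 8)
y = (X.sum(1) > 0).long()
# Stand-in partition: evenly strided indices, so each batch spans the
# whole index range; a real BDC partition would replace this.
clusters = [np.arange(start, 1024, 16) for start in range(16)]
loader = DataLoader(TensorDataset(X, y), batch_sampler=BDCBatchSampler(clusters))
for xb, yb in loader:
    pass  # one gradient step per representative batch in a real training loop
```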
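The same cluster-level thinking extends from single-model training to resampling and ensembles. The following sketch, again with a hypothetical stand-in partition, fits one decision tree per representative cluster instead of per uniform bootstrap replicate; it conveys the flavour of BDCB-style resampling and of the BDCF algorithm evaluated next, not their exact definitions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def cluster_forest(X, y, clusters, rng):
    """Fit one tree per representative cluster (BDCF-flavoured sketch).

    Classic bagging draws n points with replacement per tree; here each
    tree instead sees one cluster that already mirrors the whole data,
    so fewer and smaller trees can be competitive.
    """
    trees = []
    for idx in clusters:
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote, binary case

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# Stand-in partition: strided index slices; a BDC partition would replace this.
clusters = [np.arange(s, 2000, 20) for s in range(20)]
trees = cluster_forest(X, y, clusters, rng)
print((predict(trees, X) == y).mean())
```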
What is more, this thesis combines the BDCB method with traditional random forests and deep forests to propose the Best Discrepancy Cluster Forest (BDCF) and Best Discrepancy Cluster Deep Forest (BDC-DF) algorithms. Experiments 10, 12, and 13 used a total of 35 datasets; the results show that the accuracy of the BDCF algorithm is on average approximately 1% higher than that of the traditional random forest, its generalization error bound is 3% smaller, and its ensemble size (number of trees) is only about one third of the traditional random forest's. Experiments 14, 15, and 16 used a total of 34 datasets; the results show that the accuracy of the BDC-DF algorithm is on average approximately 1.85% higher than that of the traditional deep forest, with a generalization error bound 76% smaller.

Finally, this thesis adopts a hybrid lightweight approach, combining a six-layer 2-D CNN with BDCF into a lightweight classification model for the diagnosis of heart and lung sounds. Experiment 17 uses two datasets of lung and heart sounds; the 11-class hybrid model achieves an accuracy of 99.97%, an F1 score of 99.89%, a precision of 99.90%, a specificity of 99.99%, and a sensitivity of 99.88%, all significantly higher than previous models on the same datasets. In Experiment 18, the hybrid model is integrated into an embedded electronic stethoscope built on the Raspberry Pi Zero 2 W single-board computer, where it successfully runs 20 automated auscultations under low power and limited computing resources.

In summary, the extensive experiments in this thesis demonstrate that the combination of number theory methods and statistical techniques yields an effective quasi-Monte Carlo methodology. The sampling and resampling methods proposed here significantly improve the training efficiency of random forests, deep forests, and deep learning models, reducing the computational resources consumed during training, and they can be hybridised with deep learning models for lightweight deployment on low-power embedded devices. This thesis therefore has high practical value, providing other researchers with a novel methodology for developing green algorithms and contributing to China's green and low-carbon development.
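As a closing illustration of the kind of low-power deployment described in Experiment 18, here is a minimal on-device inference sketch using TensorFlow Lite, a common choice for single-board computers such as the Raspberry Pi Zero 2 W. The model file name, the float32 input dtype, and the spectrogram input are assumptions; the thesis's stethoscope pipeline is certainly more involved.

```python
import numpy as np
import tflite_runtime.interpreter as tflite  # pip install tflite-runtime

# Hypothetical file: the hybrid CNN+BDCF model exported to TFLite.
interpreter = tflite.Interpreter(model_path="heart_lung_classifier.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Assumed input: one spectrogram patch of an auscultation clip,
# here filled with random values just to exercise the runtime.
spectrogram = np.random.rand(*inp["shape"]).astype(np.float32)
interpreter.set_tensor(inp["index"], spectrogram)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])[0]
print("predicted class:", int(np.argmax(probs)))
```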