Distributed Data Analysis And Applications Based On Optimal Transport Theory

Posted on:2024-01-21

Degree:Master

Type:Thesis

Country:China

Candidate:M Y Yang

Full Text:PDF

GTID:2568307067493014

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Optimal transport theory can be traced back to 1781,when it originated from an engineering problem proposed by a famous French mathematician.Given two probability distributions and the ”local” cost of moving one unit of mass from the source to the target,the goal of optimal transport problem is to find the optimal mapping that can achieve distribution transportation with minimum ”global” cost.The important advantage of optimal transport theory lies in its ability to define a distance measure between probability distributions,which can be applied even when the supports of the two distributions do not overlap.In addition,the optimal transport distance can capture distribution geometric properties that are ignored by other distribution metrics,and further provide a geometric path from the source distribution to the target distribution.However,despite these advantages,the computational cost of optimal transport theory has become a bottleneck for further development in the field of machine learning.Especially when data is scattered across different server nodes,large-scale distributed distribution comparison tasks are difficult to solve using existing algorithms,and there are mainly the following challenges:(1)Each server node in the distributed network often has limited storage and computational capabilities,and it is not advocated to directly access the raw data of other server nodes due to data privacy and security concerns;(2)In addition to computational complexity,communication efficiency is also an important issue that cannot be ignored in distributed optimization.Especially when communication protocols and storage protocols do not match,it is difficult to balance computational efficiency and communication efficiency.In addition to studying the problem of distribution comparison in distributed scenarios,this paper further explores the application of distribution comparison based on optimal transport theory at the application level,i.e.in the context of the link prediction fairness problem.The goal of fairness in link prediction tasks is to partially or completely eliminate the effect of sensitive features of nodes on link prediction results,having achieved a trade-off between fairness and accuracy in the algorithm.While existing work can achieve the fairness requirement,it is often limited to plain graph data and some of the heuristic algorithms lack theoretical guarantees.To address these issues,this paper proposes relevant algorithms,which are presented as follows:1.Based on the dual form of the entropy regularized optimal transport problem,this paper presents a framework for a distributed distribution comparison algorithm.Within this framework,a kernel matrix approximation algorithm is first proposed.The algorithm enables the server nodes to obtain an approximate kernel matrix through one-shot communication in order to achieve a reduction in the commu-nication complexity of the algorithm while protecting data privacy.Considering the inconsistency between actual distributed network communication protocols and data storage protocols,this paper further proposes a mini-batch random block coordinate descent algorithm to compute and update the dual variables in a dis-tributed manner.At the same time,the paper specifically analyses the errors of the proposed algorithm at the theoretical level.2.For the fairness problem of link prediction,this paper first extends fairness metrics to the dyadic level of graph data and uses the correlation between fairness metrics to establish a theoretical link between the problem of aligning conditional distri-butions of data with different sensitive features and the dyadic fairness problem of the link prediction task.Based on this theoretical foundation,this paper proposes a link prediction fairness pre-processing algorithm based on optimal transport theory,and further extends the algorithm to the case of multi-valued sensitive features to make it applicable to more practical application problems.3.In simulation experiments and distributed domain adaptation tasks,this paper verifies the effectiveness of a distributed distribution comparison algorithm frame-work based on optimal transport theory.At the same time,this paper validates the link prediction fairness algorithm based on optimal transport theory on two graph datasets.Experimental results show that the algorithm proposed in this pa-per can achieve an optimal trade-off between fairness and accuracy on the link prediction task.

Keywords/Search Tags:

Optimal Transport, Distributed Distribution Comparison, Privacy-preserving, Dyadic Fairness, Link Prediction

PDF Full Text Request

Related items

1	Research On Privacy Preserving Methods For Data Mining
2	Research On Privacy-preserving Methods Against Link-prediction-based Attacks
3	Privacy Preserving Feature Selection In Distributed Environment
4	Research On The Technologies Of Information Mining And Privacy Preserving In Distributed Environment
5	Research On Key Techniques In Privacy-Preserving Machine Learning Systems
6	Research On Privacy Preserving Classification Algorithm For Horizontal Distribution Data
7	The Researching Of Link Prediction Algorithm Based On Privacy In Social Networks
8	Research On Privacy-preserving Technologies In Distributed Environmen And Their Applications
9	Privacy-preserving Research Of The Distributed Clustering Algorithm Based On Density
10	Research On Privacy-preserving Distributed Optimization Algorithms