Font Size: a A A

Study On Key Issues Of Stochastic Block Models In Complex Networks

Posted on:2020-01-27Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z H ChangFull Text:PDF
GTID:1360330620953159Subject:Mathematical statistics
Abstract/Summary:PDF Full Text Request
Since the large scale,complex structure and difficult comprehend of complex networks(known as graphs in mathematics),approaches that can analyze the global statistical properties of networks have attracted attention,such as cluster analysis,which is often called community detection in network analysis.Although cluster analysis is an old topic,the graph clustering and clustering data points in metric spaces are quite different.The essential difference is that the former is related to connection patterns of edges.And the latter is mainly related to the "distance" in the metric space.The relational structures contained in networks lead to the dependence between edges.The complex dependence of networks does not follow the common assumptions in standard statistics,which brings new challenges to cluster analysis.The stochastic block models are a class of probabilistic generative model based on stochastic equivalence.Parameters that reflect different types of connection patterns of nodes can be flexibly introduced to the models.These parameters can generate a range of network structures.The stochastic block models bring convenience to the global statistical properties of networks such as cluster analysis.However,there are some key problems in stochastic block models that need to be solved,mainly in three aspects:(1)How to determine the number of communities,that is,the problem of model selection;(2)How to expand the model to use other information to improve the quality of community detection;(3)How to design efficient algorithms to adapt to the growing network scale.In this thesis,the above three issues will be researched based on the perspective of connectivity behavior of nodes and relying on stochastic block models as following:(1)Many stochastic block models have a problem with model selection.In fact,it is an open question how to determine exactly how many communities in the analysis of network structure,and there is no widely accepted model selection criterion.Most existing methods follow the model selection criteria in classical statistics to seek the best balance between model accuracy and model complexity.In contrast,an idea based on information entropy and mixing coefficient is proposed,which focuses on extracting the information contained in the mixing coefficients.We first note that the mixing coefficients(mixed prior distribution)contains information on community size.Extracting the information contained in the mixing coefficients by using entropy and removing the community with little information(i.e.,the community size is too small)can determine the number of communities.In Chapter 3,the above ideas are implemented by combining information entropy and NMM.NMM(Newman's mixture model)is a probabilistic generative model based on the idea of stochastic equivalence.It can explore a range of network structures including community structure,bipartite and core-periphery structures,etc.(called generalized community structure).However,NMM needs to know the number of communities in advance.An entropy regularized mixture model(called EMM)is proposed to infer the number of communities and identify network structure contained in a network,simultaneously.The coefficient of entropy term with self-adaptability is designed,so that the coefficient has both penalty function and balance function.In the model,the parameters are evaluated by expectation-maximization(EM)algorithm.By minimizing the coefficient of of entropy term,the small clusters contained little information can be discarded step by step.Therefore,EMM can automatically discover the number of communities when the algorithm converges.The empirical study on both synthetic networks and real networks has shown that the proposed model EMM is superior to the stateof-the-art methods.Although EMM is proposed in combination with entropy and NMM,the idea of extracting mixing coefficient information based on entropy is also applicable to other models with mixed coefficients.The method based on entropy and mixing coefficient extends the method of model selection.(2)An important extension of basic stochastic block models is that introducing attribute information to this model.However,the attribute information in networks is various.How to integrate these different types of information into the model to explore the structural regularities is a challenging problem.The attribute information may become interference information if inappropriate use.Many existing models uncover traditional community structure,i.e.,the division of the nodes of a network into densely connected groups with only sparse betweengroup connections.In Chapter 4,by sharing potential location of nodes between nodes and its attributes,a principled statistical model named PSB_PG that generates link topology and node attributes is proposed.This model for generating links is based on the stochastic blockmodels following a Poisson distribution.Therefore,it is capable of detecting a wide range of network structures including community structures,bipartite structures and other mixture structures.The model for generating node attributes assumes that node attributes are high dimensional and sparse and also follow a Poisson distribution.This makes the model be uniform and the model parameters can be directly estimated by expectation-maximization(EM)algorithm.Experimental results on artificial networks and real networks containing various structures have shown that the proposed model PSB_PG is not only competitive with the state-of-the-art models,but also provides good semantic interpretation for each community via the learned relationship between the community and its related attributes.(3)Two main problems exist in EM algorithms used in Chapters 3 and 4.One is that the time complexity of the parameter inference algorithm is high,especially when the models for generating topology information are based on the block assumption in stochastic blockmodels(SBM).The other is that the model parameters inferred by the EM algorithm depend on the observed network and are overfitting.To solve the above two problems,in Chapter 5,based on stochastic blockmodels,a joint probability generation model is proposed.A variational distribution family is introduced to approximate the posterior conditional distribution of community membership in the model.A non asymptotic approximation can be obtained through a variational Bayes EM algorithm.An algorithm VI-GCD which can quickly calculate the posterior distribution of the potential position of each node is designed.The new algorithm can discover the generalized communities in attributed networks and get the attributes related to each community.Meanwhile,the new algorithm reduces the complexity and avoids the problem of over-fitting of the EM algorithm.Therefore,the algorithm VI-GCD can be implemented on large-scale networks.
Keywords/Search Tags:complex network, network structure, NMM, SBM, EM algorithm, variational inference
PDF Full Text Request
Related items