| Community discovery divides the nodes of a network into multiple communities according to some partitioning method. In early research, researchers abstracted real networks as homogeneous information networks, that is, networks in which all nodes and edges are of a single type, and proposed many traditional community discovery methods for homogeneous information networks. However, most real-world networks are heterogeneous information networks, that is, networks whose nodes and edges are of multiple types. Community discovery methods designed for homogeneous information networks can be applied to heterogeneous information networks, but the resulting community division has low accuracy. Researchers have also found that most real-world communities overlap, so overlapping community discovery in heterogeneous information networks can both exploit rich semantic information and produce more realistic results. Most traditional community discovery methods divide communities using only the structural information of the nodes or only their attribute information. Graph Neural Networks (GNNs) can combine the structural information of the network with node attribute information and learn from both at the same time. Traditional graph neural networks perform well on community discovery in homogeneous information networks, but they cannot exploit the different node and edge types of heterogeneous information networks and therefore make poor use of their semantic information. Building on graph neural networks and the characteristics of heterogeneous information networks, this thesis proposes an overlapping community discovery method based on the heterogeneous graph attention network to solve the overlapping community discovery problem in heterogeneous information networks. The contributions are as follows: 1. To fully combine the structural information and 
attribute information of the heterogeneous information network, a heterogeneous network feature representation is constructed. Traditional graph neural networks are suitable only for homogeneous information networks: they combine the structural and attribute information of the nodes to learn a feature representation, and the resulting low-dimensional representation is used for downstream data analysis. To take advantage of the different node types in a heterogeneous information network, the structural information of the nodes is first represented as a node matrix for each specified meta-path, and the structural information of the different meta-paths is then combined with the attribute information to construct the node feature representation of the heterogeneous information network. 2. To fully mine heterogeneous network information, an improved heterogeneous graph attention network is used to extract node features. The heterogeneous graph attention network combines the graph neural network with the attention mechanism in heterogeneous information networks: it obtains the weights of meta-path-based neighbor nodes through a node-level attention mechanism, obtains the weights of the different meta-paths through a semantic-level attention mechanism, and fuses all of these weights to obtain a new node feature vector, which fully exploits the different semantic information in heterogeneous networks. This thesis improves the activation function of the semantic-level attention mechanism in the heterogeneous graph attention network to address the vanishing-gradient problem, and learns the node feature vector jointly with the subsequently generated community membership matrix. 3. For overlapping community discovery, the heterogeneous graph attention network is combined with the graph convolutional neural 
network and the loss is unified based on the Bernoulli-Poisson (B-P) model. The node feature vector generated by the heterogeneous graph attention network is passed through the graph convolutional neural network to generate the community membership matrix, and the negative log-likelihood of the B-P model is used as the loss function to jointly optimize the node feature vectors and the degree of community overlap. In this way the B-P model can be applied to overlapping community discovery in heterogeneous information networks, and the final community division is obtained by applying a community division threshold. This thesis selects two real heterogeneous information network datasets, DBLP and IMDB, and conducts comparative experiments against the traditional community discovery algorithm SLPA and against algorithms based on graph neural networks, namely graph convolutional neural networks, graph attention networks, heterogeneous graph attention networks, and the NOCD algorithm. The improved extended modularity EQ* is used to measure the quality of overlapping community discovery in heterogeneous information networks. The experimental results show that the proposed model improves to some degree on both the traditional community discovery algorithm and the graph-neural-network-based algorithms. In addition, analysis of the meta-path weights obtained at the end of training shows that the weights learned by the improved heterogeneous graph attention network are consistent with a real-world understanding of the semantic information. |
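
The abstract names the negative log-likelihood of the Bernoulli-Poisson model as the loss and a threshold as the final assignment step, but gives no formula for either. The following is a minimal sketch of how these two pieces are commonly written, assuming the standard B-P formulation in which an edge between nodes u and v exists with probability 1 - exp(-F_u · F_v), where F is a non-negative community membership matrix. The function names, the `eps` stabilizer, and the threshold value of 0.5 are illustrative assumptions, not taken from the thesis, which may use a balanced or mini-batch variant.

```python
import math


def bp_negative_log_likelihood(F, edges, eps=1e-10):
    """Negative log-likelihood of a graph under the Bernoulli-Poisson model.

    F     : list of rows, one per node, of non-negative community affiliations
    edges : set of undirected edges (u, v) with u < v
    eps   : small constant (an assumption here) that keeps log() finite
    """
    n = len(F)
    loss = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            # Affiliation strength F_u . F_v shared by the two nodes
            strength = sum(a * b for a, b in zip(F[u], F[v]))
            if (u, v) in edges:
                # Edge present: -log(1 - exp(-F_u . F_v))
                loss -= math.log(1.0 - math.exp(-strength) + eps)
            else:
                # Edge absent: -log(exp(-F_u . F_v)) = F_u . F_v
                loss += strength
    return loss


def assign_communities(F, threshold=0.5):
    """Node i joins community c when F[i][c] exceeds the threshold,
    so a node may belong to several communities at once."""
    return [{c for c, w in enumerate(row) if w > threshold} for row in F]


# Toy graph: nodes 0-1-2 form a triangle, node 2 also links to node 3.
edges = {(0, 1), (0, 2), (1, 2), (2, 3)}
F = [[2.0, 0.0], [2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

loss = bp_negative_log_likelihood(F, edges)
communities = assign_communities(F, threshold=0.5)
```

In this toy example node 2 has weight above the threshold in both columns of `F`, so it is assigned to both communities, which is the overlapping behavior the thresholding step is meant to allow. In the pipeline described above, `F` would be the output of the graph convolutional network rather than a hand-written matrix.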