| The 21st century is an era of information. With the development of Internet technology, cyberspace has become another important space for human activity alongside real space. People communicate, entertain themselves, and carry out financial activities over the Internet, and these activities are recorded as data. How to use this enormous wealth of data is a hotspot of current research; it has driven the development of data mining, machine learning, and other disciplines, and has finally brought the dawn of the era of artificial intelligence. Although the Internet has brought great convenience to our lives, it has also brought new challenges to public safety. Disclosure of personal privacy, Internet fraud, pyramid schemes, and other new criminal activities are intensifying. Moreover, the anonymity of cyberspace and the encryption of data have made such cases very difficult to investigate. To solve the problem of mapping between network virtual identities and real identities, the US Defense Advanced Research Projects Agency first proposed the concept of the network "gene" in 2010. China’s national key research and development plan also included a project for research in this area in 2017. The network "gene" borrows the concept of the biological gene in an attempt to connect virtual space with real space. A network "gene" can be expressed as a multi-dimensional data structure that captures the unique behavioral characteristics of users when they use the network, thereby enabling identification of a user’s trusted identity. The network "gene" can be divided into two parts, the identity "gene" and the behavior "gene", which are based on data generated by entities in real space and cyberspace. The identity "gene" is the physical information of the entity in real space, including credentials such as ID cards, personal registration accounts such as mobile phone numbers and bank card numbers, and individual biological characteristics such as fingerprints and irises. Individual
behavior data in cyberspace falls mainly into eight categories: mobile communication, instant messaging, e-mail, travel, online shopping, delivery, microblogging, and remote operation. The data are processed and analyzed to extract data that can characterize the entity. These data are used as "gene" fragments of the entity, and the independent behavior "gene" fragments are then combined into the behavioral "gene" of the entity, in which different "gene" fragments correspond to different behavioral characteristics. Finally, the network "gene" can uniquely identify the entity and reflect its essential characteristics. In general, the method for generating the network "gene" is based on social behavioral psychology and taxonomy theory. It proceeds from the semantic classification of data, to behavioral feature classification, to behavior classification, and finally to psychological classification, completing concept decomposition from top to bottom and data aggregation from bottom to top. By conceptualizing and symbolizing the results of each level of processing, the results are mapped to a unique, stable, appendable, and interpretable network "gene" code. The main research done in this paper is as follows: (1) The definition and basic characteristics of the network "gene" are studied, and the composition of the network "gene" is expounded. The basic flow for constructing the network "gene" is given: from extraction of the entity's features, to specification of the network "gene" fragments, to splicing and combination of the fragments, until the network "gene" is finally formed. (2) Among the many kinds of network behavior data, this paper analyzes three behavioral data sets (short message communication, travel, and remote operation), and the effective binomial and triad sets are determined. At the same time, the time interval distribution and location distribution of entity behavior, and the differences in entity behavior, are analyzed. Then the "gene" fragment
structure of each of these three behaviors is given, and a unified structure of the behavioral "gene" is abstracted from the three structures. (3) Based on the unified structure obtained, the similarity measurement method for the network "gene" is studied. An algorithm for calculating the similarity of network "genes" is given. The two cases of single-fragment and multiple-fragment matching are analyzed, and a solution is given for each. |
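
To make the idea of a fragment-based "gene" and its similarity measurement more concrete, the following is a minimal Python sketch. It is not the algorithm of this paper, since the abstract does not specify the fragment structure or the similarity measure. The class names `GeneFragment` and `NetworkGene`, the feature names, the use of cosine similarity for single-fragment matching, and the weighted average for multiple-fragment matching are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, Optional


@dataclass
class GeneFragment:
    """Hypothetical behavior "gene" fragment: one behavior type and its features."""
    behavior: str                 # e.g. "sms", "travel", "remote_operation"
    features: Dict[str, float]    # illustrative feature name -> numeric value


@dataclass
class NetworkGene:
    """Hypothetical network "gene": identity part plus behavior fragments."""
    identity: Dict[str, str]
    fragments: Dict[str, GeneFragment] = field(default_factory=dict)


def fragment_similarity(a: GeneFragment, b: GeneFragment) -> float:
    """Single-fragment matching, sketched as cosine similarity (an assumption)."""
    keys = set(a.features) | set(b.features)
    dot = sum(a.features.get(k, 0.0) * b.features.get(k, 0.0) for k in keys)
    norm_a = sqrt(sum(v * v for v in a.features.values()))
    norm_b = sqrt(sum(v * v for v in b.features.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty fragment carries no evidence
    return dot / (norm_a * norm_b)


def gene_similarity(g1: NetworkGene, g2: NetworkGene,
                    weights: Optional[Dict[str, float]] = None) -> float:
    """Multiple-fragment matching: weighted mean over shared behavior types."""
    shared = set(g1.fragments) & set(g2.fragments)
    if not shared:
        return 0.0
    weights = weights or {}
    total = sum(weights.get(b, 1.0) for b in shared)
    if total == 0.0:
        return 0.0
    score = sum(
        weights.get(b, 1.0) * fragment_similarity(g1.fragments[b], g2.fragments[b])
        for b in shared
    )
    return score / total


# Two illustrative entities: identical messaging habits, different travel.
g1 = NetworkGene({"phone": "A"}, {
    "sms": GeneFragment("sms", {"night": 1.0}),
    "travel": GeneFragment("travel", {"home": 1.0}),
})
g2 = NetworkGene({"phone": "B"}, {
    "sms": GeneFragment("sms", {"night": 1.0}),
    "travel": GeneFragment("travel", {"work": 1.0}),
})
print(gene_similarity(g1, g2))                               # unweighted
print(gene_similarity(g1, g2, {"sms": 3.0, "travel": 1.0}))  # sms weighted higher
```

In this sketch, only behavior types present in both "genes" contribute to the score, so an entity with no travel records can still be compared on its messaging behavior. The per-behavior weights are a placeholder for whatever relative importance a real system would assign to each fragment.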