
The Research Of Entity Matching Method And Application On Credit Reference System

Posted on: 2011-10-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 1119360305455689
Subject: Management Science and Engineering
Abstract/Summary:
Entities are individuals or organizations capable of economic activity. In a credit reference system, entities include individuals, families, enterprises, enterprise groups, and so on. Entity matching checks whether entities described with different syntax are semantically the same. The main function of a credit reference system is to collect the credit data scattered across different departments of society, and then to classify and release credit information by credit entity. The goal is a credit information management system that covers every economically active entity in the country.

Entity matching is the technical basis of the credit reference system, which performs many fuzzy matching operations on credit entities. There are two reasons: first, the primary key of a credit entity differs between information sources; second, the credit data contains various problems, such as input errors, spelling errors, and format differences. Entity matching in the credit reference system can be classified into three levels: field-level matching, record-level matching, and matching of entities with complex structure. In addition, the credit reference system must resolve difficult technical problems such as the huge volume of data to be matched and the large differences among data sources.

This paper studies entity matching in the credit reference system. It proposes solutions and algorithms for entity matching, choosing appropriate matching functions according to the characteristics of different data sources. The main contents of this paper are as follows.

(1) The problem of adaptive field matching is studied, and an adaptive string similarity calculation method based on associated tokens is proposed.
Based on sets of associated token operators, the proposed algorithm formally defines homophone similarity and extracts data characteristics from the word frequencies and associated-operator frequencies of different data sources. A support vector machine is then trained on these characteristics, such as adapted frequency and association type, to learn the matching classification and the similarity. Experiments and comparative analysis show that the method adapts well to different data quality and association types.

(2) The problem of efficiently matching massive entity data is studied, and a joint grouping model is designed. Indexing and grouping characteristics are abstracted as grouping operators, and formal disjunctive overall grouping schemes are introduced. These allow several grouping operators to be combined on the same data source to group the entity records to be matched, which reduces the number of record comparisons during entity matching. The best overall grouping scheme for each data source is then computed by solving a set cover problem, which addresses the efficiency of matching massive entity data while preserving matching accuracy. Experiments show that this method improves the efficiency of the matching operation.

(3) The problem of matching unlabeled fields from multiple sources is studied. This paper proposes a semi-supervised entity matching method based on active learning and an unsupervised automatic matching method based on iterative SVM learning. The active learning method constructs multiple matching-function learners and forms them into a learning committee. In the subsequent learning process, the committee itself selects, from the candidate samples, the training samples that yield the most information about the entity matching function.
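As a rough illustration of committee-based sample selection, the following Python sketch picks the candidate pair on which a committee of learners disagrees most, measured by vote entropy. It is not the thesis's implementation: the threshold-based learners, the `difflib` similarity, and all function names are assumptions made for this example.

```python
import math
from difflib import SequenceMatcher


def string_sim(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for a learned matching function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def make_learner(threshold):
    """Hypothetical committee member: votes 'match' when similarity reaches its threshold."""
    return lambda pair: string_sim(pair[0], pair[1]) >= threshold


def vote_entropy(votes):
    """Binary entropy of match / non-match votes; 0.0 means the committee is unanimous."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_query(committee, unlabeled_pairs):
    """Return the unlabeled pair with the highest committee disagreement."""
    return max(
        unlabeled_pairs,
        key=lambda pair: vote_entropy([member(pair) for member in committee]),
    )


# Illustrative data only: three learners with different strictness.
committee = [make_learner(t) for t in (0.3, 0.6, 0.9)]
pairs = [("abcd", "abcd"), ("abcd", "abxy"), ("abc", "xyz")]
query = select_query(committee, pairs)  # the pair to send for manual labeling
```

The selected pair would be labeled by a person and added to the training set before the learners are retrained; that retraining loop is omitted here.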
The method based on iterative SVM learning maximizes the classification margin between the support vectors and the separating hyperplane. It proceeds in two stages: the first uses the nearest neighbor method to select initial training samples automatically; the second exploits the SVM's margin-maximizing property to train the SVM automatically over successive iterations. Through experiments, this paper analyzes the merits and limiting conditions of the active learning method and of the iterative SVM automatic entity matching method.

(4) The problem of matching record cluster entities is studied. Based on the special data structure of record cluster entities, a normative record cluster entity matching model is built using weighted bipartite graph theory, and an upper and lower bound matching algorithm for record cluster entities is designed. By quickly deriving upper and lower bounds on the matching threshold, the algorithm reduces the number of maximum weight matching computations over an entity's sub-records. Experiments on data confirm that the proposed model and method effectively raise the precision and efficiency of record cluster entity matching.

(5) The problem of matching semi-structured XML entities is studied. The XML entity matching operation is carried out by computing the weights between attribute nodes of different types and their parent nodes in the XML text, setting a threshold on entity similarity, and deriving XML transformation rules and entity matching functions. The experimental results show that this method has good matching efficiency.

Based on the credit reference system constructed by the People's Bank of China, this paper summarizes the technical bottlenecks of entity matching in credit reference systems. Concrete research issues are proposed after analyzing the weak points of existing methods.
Meanwhile, the algorithms and solutions provided in this paper have mostly been applied to the credit reference systems for enterprises and individuals, where they resolve entity matching problems involving multi-source data, massive data, and complex structures. At present, the enterprise credit reference system has collected and matched 15 types of credit information from 882 agencies, including financial credit, clearance accounts, social security, and environmental violations. The individual credit reference system has collected and matched 11 types of credit information from 702 agencies, including financial credit, housing provident fund, endowment insurance, and telecommunications arrears. In short, a credit reference system that collects credit information for all kinds of entities has been realized.
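The grouping idea in item (2) can be illustrated with a small Python sketch: records are grouped under several grouping operators, and only pairs that share a group under at least one operator are compared. This is a minimal sketch under stated assumptions, not the thesis's joint grouping model; the operators `name_prefix` and `postcode`, the record fields, and the sample data are all hypothetical, and the set cover step for choosing the best operators is not shown.

```python
from collections import defaultdict
from itertools import combinations


def group_by(records, key_fn):
    """Group record ids by the key that one grouping operator assigns to each record."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key:  # records without a usable key are left ungrouped
            groups[key].append(rec_id)
    return groups


def candidate_pairs(records, key_fns):
    """Disjunctive combination: keep a pair if it shares a group under any operator."""
    pairs = set()
    for key_fn in key_fns:
        for ids in group_by(records, key_fn).values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical grouping operators.
def name_prefix(rec):
    return rec["name"][:3].lower()


def postcode(rec):
    return rec.get("postcode", "")


# Illustrative records only.
records = {
    1: {"name": "Acme Trading Co", "postcode": "100001"},
    2: {"name": "ACME Trading Company", "postcode": "100001"},
    3: {"name": "Beta Steel", "postcode": "200002"},
    4: {"name": "Acme Trade", "postcode": "300003"},
}
pairs = candidate_pairs(records, [name_prefix, postcode])
```

With four records there are six possible pairs; here only the pairs that share a name prefix or a postcode would go on to the more expensive similarity comparison.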
Keywords/Search Tags: Credit reference, Credit reference System, Entities Matching, Machine Learning, Support Vector Machine