
Research On Cluster-based Person Name Disambiguation

Posted on: 2012-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: F Pei
Full Text: PDF
GTID: 2218330368492540
Subject: Computer technology
Abstract/Summary:
Named entity disambiguation is one of the most important problems in natural language processing. Among named entity categories, person names are highly ambiguous, which makes person name disambiguation the most difficult case. Person name disambiguation is mainly applied in search engines, social networks, knowledge base population, and similar settings. Because person name ambiguity remains a major challenge, the international research community has organized three English person name disambiguation evaluations (Web People Search, WePS) and one Chinese person name disambiguation evaluation (CIPS-Sighan 2010 bakeoff-3). This thesis uses Hierarchical Agglomerative Clustering (HAC) to develop an English person name disambiguation (EPND) system and a two-stage Affinity Propagation (AP) clustering algorithm to develop a Chinese person name disambiguation (CPND) system.

We first introduce WePS and CIPS-Sighan 2010 bakeoff-3, including the evaluation corpora (development and test data), evaluation metrics, baseline systems, participants, and popular person name disambiguation techniques.

For the EPND system, we extract various kinds of features and select the effective ones through detailed experiments. Building on this feature extraction and selection, we improve clustering quality by fusing multiple features. We implement the EPND system with the comparatively mature HAC algorithm, but use group-average linkage rather than single linkage. The experimental results show that group-average linkage outperforms single linkage.

For the CPND system, we use an iterative greedy algorithm called the jumping-distance tree to extract N-grams from the context in which the person name appears, and we measure similarity with smoothed TF*IDF weighting. Results show that this approach copes with the Chinese word segmentation problem and thereby makes it possible to recognize documents that should be discarded. We also adopt a two-stage AP clustering algorithm, in which the first stage ensures high precision and the second stage improves recall. Results show that this is effective, and a diagnostic test shows that the quality of Chinese word segmentation also has an important influence on disambiguation.
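
The difference between group-average and single linkage mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's EPND implementation: the bag-of-words features, the cosine distance, the stopping threshold of 0.6, and the function names (`tf_vector`, `cosine_distance`, `hac`) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (an assumed, simplified feature set)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)

def hac(docs, threshold, linkage="average"):
    """Merge the closest pair of clusters until no pair is within `threshold`."""
    vectors = [tf_vector(d) for d in docs]
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def cluster_distance(c1, c2):
        pairwise = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)               # closest pair of members
        return sum(pairwise) / len(pairwise)   # group average over all pairs

    while len(clusters) > 1:
        d, x, y = min(
            (cluster_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical contexts for three documents that mention the same name.
docs = ["physics professor", "professor guitar", "guitar band"]
single_result = hac(docs, threshold=0.6, linkage="single")
average_result = hac(docs, threshold=0.6, linkage="average")
```

In this toy input, documents 0 and 2 share no terms but are each partly similar to document 1. Single linkage chains all three into one cluster, while group-average linkage keeps document 2 apart because its average distance to the merged pair exceeds the threshold. This chaining behavior is one commonly cited reason to prefer group-average linkage; whether it explains the thesis's own results is not stated in the abstract.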
Keywords/Search Tags: Person Name Disambiguation, Evaluation Metrics, Feature Extraction, Clustering Algorithm, Named Entity, N-Gram