Research On Data Cleaning And Fusion Techniques Of Multi-source Heterogeneous POI Data

Posted on: 2020-05-27  Degree: Master  Type: Thesis
Country: China  Candidate: C F Xian  Full Text: PDF
GTID: 2370330602459019  Subject: Systems analysis and integration
Abstract/Summary:
With the rapid development of location-based services (LBS) and the widespread use of electronic maps, point-of-interest (POI) data, the underlying data of electronic maps, has become part of daily life. Many researchers use POI data to mine people's travel trajectories and to identify urban functional areas and urban hotspots, with the aim of improving everyday services or supporting managers' decisions. However, data mining presupposes rich, high-quality data; otherwise a situation of "large amount of data and small amount of information" degrades the mining results. POI data from different websites suffers from many data quality problems, so how to improve its quality through data cleaning and fusion has long been a concern of researchers. When text classification is used to predict categories and so clean up ambiguous POI categories, the traditional FastText algorithm is not efficient at classifying Chinese short text. For the multi-source heterogeneous POI classification problem, the traditional distance-based classification algorithm has high time complexity and considers only the category when computing the similarity of non-spatial attributes.

The opening of shared data on the Internet and the development of crawler technology have made POI data much easier to obtain. In this context, this paper studies cleaning and fusion techniques for multi-source heterogeneous POI data. First, common POI data collection methods are studied. POI data from a real estate website and the Weibo check-in website is collected with a crawler built on the Scrapy framework, and POI data from the base layer of the Kunming map is extracted with ArcGIS as experimental data. Next, the characteristics of the multi-source heterogeneous POI data and its data quality problems, such as inconsistency, duplicate records, and missing values, are analyzed. The corresponding data cleaning algorithms are not targeted at the POI category, so an improved cleaning method based on the TI-FastText (TF-IDF, FastText) classification model is proposed, and its effectiveness is verified by comparison experiments. Finally, commonly used POI fusion algorithms are analyzed and several classical ones are briefly introduced. Building on this analysis, an improved POI fusion algorithm based on two-level clustering, TLCB (Two-level Clustering-based), is proposed. The algorithm combines two levels of clustering, over spatial attributes and over non-spatial attributes, and is validated by comparison experiments at different degrees of coincidence. The proposed TLCB algorithm performs well on POI fusion.
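
The idea of clustering first on spatial attributes and then on non-spatial attributes can be illustrated with a minimal Python sketch. This is not the TLCB algorithm from the thesis, whose details the abstract does not give. Everything here is an assumption made for illustration: the record fields (`name`, `lat`, `lon`), the haversine distance, the use of `difflib.SequenceMatcher` for name similarity, the single-link grouping, and both thresholds. The pairwise comparison is quadratic and is chosen for brevity, not efficiency.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def components(n, linked):
    """Group indices 0..n-1 into connected components under linked(i, j)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if linked(i, j):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def two_level_cluster(pois, max_dist_m=50.0, min_name_sim=0.6):
    """Hypothetical two-level grouping: spatial first, then by name similarity."""
    # Level 1: link records that lie within max_dist_m of each other.
    spatial = components(
        len(pois),
        lambda i, j: haversine_m(pois[i]["lat"], pois[i]["lon"],
                                 pois[j]["lat"], pois[j]["lon"]) <= max_dist_m,
    )
    result = []
    for group in spatial:
        # Level 2: inside one spatial cluster, link records with similar names.
        sub = components(
            len(group),
            lambda a, b: SequenceMatcher(
                None, pois[group[a]]["name"], pois[group[b]]["name"]
            ).ratio() >= min_name_sim,
        )
        result.extend([group[k] for k in comp] for comp in sub)
    return result


# Made-up sample records; coordinates are illustrative only.
pois = [
    {"name": "Green Lake Hotel", "lat": 25.0500, "lon": 102.7000},
    {"name": "Green Lake Hotel Kunming", "lat": 25.0501, "lon": 102.7001},
    {"name": "Cuihu Park", "lat": 25.0500, "lon": 102.7002},
    {"name": "Green Lake Hotel", "lat": 25.1000, "lon": 102.8000},
]
clusters = two_level_cluster(pois)
print(clusters)
```

In this sketch, records 0 and 1 are nearby and similarly named, so they end up in one cluster as fusion candidates. Record 2 is nearby but named differently, and record 3 shares a name but lies far away, so each stays on its own.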
Keywords/Search Tags: POI, Data cleaning, Data fusion, Clustering algorithm, Classification algorithm