Taxi Data Quality Analysis And Processing Based On Hadoop

Posted on: 2016-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: H Q Pang
Full Text: PDF
GTID: 2322330476455310
Subject: Information and Communication Engineering
Abstract/Summary:
Shenzhen has established a common information platform for intelligent transportation as part of its Intelligent Transportation System (ITS). Every day the platform collects massive amounts of traffic data containing abundant traffic information. High-quality traffic data is a precondition for the ITS to make correct decisions. However, during actual data collection, equipment failures, interference from the external environment, human error, and various other factors inevitably cause loss, redundancy, and other quality problems in the raw data. This thesis uses a Hadoop-based cloud computing platform to analyze the quality of massive taxi data from Shenzhen and to process it. The main work includes the following aspects:

(1) The thesis reviews the results obtained by domestic and foreign scholars in data quality assessment and data cleaning, analyzes their shortcomings, and then introduces its own main research content.

(2) It designs a data quality evaluation system that combines historical data with the Analytic Hierarchy Process (AHP) from decision science. AHP is used to calculate the weights of the evaluation indices, and a quality score is obtained by taking the expectations of the historical data as the benchmark, which quantifies the data quality and reflects it intuitively.

(3) Based on the features of the data, it designs evaluation schemes for the GPS data and the operating data of Shenzhen taxis. It first identifies the main factors affecting data quality in order to determine the respective evaluation indices; then, for the redundant, incomplete, and erroneous records present in the data sets, it proposes corresponding evaluation criteria and algorithms to determine whether the data meets the requirements.

(4) Guided by the evaluation results for the Shenzhen taxi data, it focuses on duplicate-data cleansing and proposes a MapReduce-based deduplication algorithm to delete duplicate records. It then proposes a Hadoop-based cleaning program for the GPS data and the operating data, aimed mainly at incompleteness, redundancy, and errors. The program migrates traditional cleaning techniques to the cloud platform, effectively improving both computational efficiency and data quality.

(5) The cleansed GPS data is applied to research on taxi stops, and a DBSCAN-based stop detection algorithm is proposed to find taxi stops from non-passenger trajectory data. The algorithm has three steps: acquiring candidate points, filtering candidate points, and clustering the candidate stop points. Candidate points are acquired with a candidate detection algorithm and then filtered using their temporal and spatial properties; finally, after the advantages and disadvantages of various clustering algorithms are analyzed, DBSCAN is chosen to cluster the stops.

By establishing the data quality evaluation system and evaluating the taxi GPS data and operating data, this thesis obtains a data quality score for each of the two data sets, which reflects the data quality intuitively and provides the basis for the subsequent cleaning tasks. It then studies data cleaning schemes matched to the evaluation results and identifies those that effectively improve data quality and support the ITS in making correct decisions. Finally, it studies taxi stops using the cleansed data, which helps city managers better understand the situation of taxi drivers and can also guide drivers in finding passengers.
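To make item (2) concrete, the following is a minimal sketch of how AHP index weights and a benchmark-relative quality score could be computed. It is not the thesis's implementation: the three-index comparison matrix, the function names `ahp_weights` and `quality_score`, and the scoring rule (each observed index value divided by its historical expectation, capped at 1) are all assumptions made for illustration. The consistency ratio uses Saaty's commonly cited random index table.

```python
# Saaty's random consistency index (RI), keyed by matrix order n.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Return (weights, consistency_ratio) for a pairwise comparison matrix.

    The principal eigenvector is approximated by power iteration and
    normalised so that the weights sum to 1.
    """
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    # Estimate the principal eigenvalue from A.w = lambda_max * w.
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(product[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return weights, cr


def quality_score(weights, observed, expected):
    """Hypothetical scoring rule (an assumption, not taken from the thesis).

    Each index is scored as observed / expected, capped at 1, where
    `expected` is the historical expectation (assumed > 0); the weighted
    sum is scaled to 0..100.
    """
    ratios = [min(1.0, o / e) for o, e in zip(observed, expected)]
    return 100.0 * sum(w * r for w, r in zip(weights, ratios))


# Illustrative judgments for three assumed indices:
# completeness vs. uniqueness vs. accuracy.
comparison = [
    [1.0, 3.0, 5.0],
    [1.0 / 3.0, 1.0, 2.0],
    [1.0 / 5.0, 1.0 / 2.0, 1.0],
]
index_weights, consistency_ratio = ahp_weights(comparison)
```

A consistency ratio below 0.1 is the usual threshold for accepting the pairwise judgments; above it, the comparison matrix would normally be revised before the weights are used.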
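The MapReduce-based deduplication in item (4) can be illustrated with a small in-memory simulation. This is a sketch of the general pattern only, not the thesis's algorithm: it assumes that two records are duplicates when their whole text is identical, and the function names and sample record layout (taxi id, timestamp, longitude, latitude) are hypothetical. On a real Hadoop cluster the sort step would be performed by the framework's shuffle phase.

```python
from itertools import groupby


def map_phase(lines):
    """Map: emit each non-empty record as the key, with no value."""
    for line in lines:
        record = line.strip()
        if record:
            yield record, None


def reduce_phase(sorted_pairs):
    """Reduce: all copies of a record share one key, so emit the key once."""
    for key, _group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield key


def deduplicate(lines):
    """Run map, a sort standing in for Hadoop's shuffle, then reduce."""
    shuffled = sorted(map_phase(lines), key=lambda pair: pair[0])
    return list(reduce_phase(shuffled))


# Hypothetical GPS records: taxi id, timestamp, longitude, latitude.
raw_records = [
    "B12345,2014-05-01 08:00:00,114.0579,22.5431",
    "B67890,2014-05-01 08:00:05,114.1000,22.6000",
    "B12345,2014-05-01 08:00:00,114.0579,22.5431",
]
unique_records = deduplicate(raw_records)
```

If duplicates were instead defined by a subset of fields, for example taxi id plus timestamp, only the key emitted in `map_phase` would change and the reducer would need a rule for choosing which copy to keep.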
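The stop detection pipeline in item (5) can be sketched as follows, with the caveat that the abstract does not give the candidate rule, the filter thresholds, or the DBSCAN parameters. The record fields (`lat`, `lon`, `speed_kmh`, `occupied`), the speed threshold, and the values of `eps_m` and `min_pts` are therefore assumptions, and the time-based filtering step is omitted. The clustering function is a plain textbook DBSCAN over haversine distances, written without external libraries.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def stop_candidates(records, max_speed_kmh=1.0):
    """Keep vacant, near-stationary fixes (a hypothetical candidate rule)."""
    return [(r["lat"], r["lon"]) for r in records
            if not r["occupied"] and r["speed_kmh"] <= max_speed_kmh]


def dbscan(points, eps_m, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Includes point i itself, as in the standard DBSCAN definition.
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reachable)
    return labels
```

Points labelled with the same non-negative id form one detected stop, and the centroid of each cluster could serve as its location. The neighbour search here is quadratic in the number of points, so a spatial index or a distributed variant would be needed at the data volumes the thesis describes.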
Keywords/Search Tags: Hadoop, Data Quality, Data Cleansing, Stops