| Acquiring and processing traffic data is a key technology of intelligent transportation systems. Advances in radio and computer technology have made RFID (Radio Frequency Identification) detection widely used for road traffic data collection. However, because of detection equipment failures, communication faults, and environmental factors, RFID traffic data suffer from redundancy, loss, errors, and imprecision, and these problems degrade performance when the data are used directly. Data cleaning is therefore necessary. The main contents are as follows. First, building on existing research on data cleaning, RFID technology, its working principle, and the attributes and structure of the data are analyzed, which helps in identifying the data problems and their causes. Nanjing is taken as a case study for the data cleaning; the experimental area contains 43 base stations and more than 83 million traffic records. Second, error data cleaning is introduced. It mainly concerns the license plate number: according to the license plate numbering rules, there are 4 types of error. Erroneous data are detected using a clustering method, the number of erroneous records is counted, the error rate is calculated, and its variation over time and space is analyzed. Next, redundant data cleaning is introduced. Redundant data can be divided into duplicate data and similar data, and their cleaning mainly concerns the license plate number and the passing time. Records that have the same plate but different passing times are counted for time differences from 1 second to 300 seconds; the redundancy rate is then calculated and plotted as a line chart. The point at which the curve flattens is taken as the redundancy point: data within this point are deleted and the rest are retained. Third, missing data cleaning is introduced. Missing data mainly concern the passing time and can be divided into 8 types according to the length of the gap, such as monthly loss, daily loss, and so on.
The loss rate is then calculated at granularities of a month, a day, an hour, 30 minutes, 15 minutes, 10 minutes, 5 minutes, and 1 minute, and the loss rate of a base station is determined from the resulting line chart. In addition, a new method is proposed based on the analytical approach, methods, and process described above. Finally, directions for future work are proposed and briefly analyzed on the basis of the study made in this thesis. |
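
The redundancy and loss analyses summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the rates, so the sketch assumes that the redundancy rate at threshold t is the share of records that follow an earlier record of the same plate by 1 to t seconds, and that the loss rate is the share of time bins containing no record. The function names, the tolerance `tol`, the use of integer-second timestamps, and the sample data are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def redundancy_rates(records, max_gap=300):
    """Return {t: rate} for t = 1..max_gap (assumed definition of the rate).

    records: iterable of (plate, passing_time_in_seconds) for one base station.
    rate(t) is the share of all records that follow the previous record of
    the same plate by between 1 and t seconds.
    """
    by_plate = defaultdict(list)
    total = 0
    for plate, ts in records:
        by_plate[plate].append(ts)
        total += 1

    # Gap between each record and the previous record of the same plate.
    # Zero gaps (exact duplicates) are skipped: only "same plate, different
    # passing time" pairs are counted here.
    gaps = []
    for times in by_plate.values():
        times.sort()
        gaps.extend(b - a for a, b in zip(times, times[1:]) if b > a)
    gaps.sort()

    if total == 0:
        return {t: 0.0 for t in range(1, max_gap + 1)}
    return {t: bisect_right(gaps, t) / total for t in range(1, max_gap + 1)}


def flattening_point(rates, tol=0.01):
    """Smallest t after which the curve rises by at most tol in total.

    A simple stand-in for reading the flattening point off the line chart.
    """
    last = rates[max(rates)]
    return min(t for t in rates if last - rates[t] <= tol)


def loss_rate(timestamps, start, end, bin_seconds):
    """Share of bins in [start, end) with no record (assumed definition).

    Assumes end > start and integer-second timestamps.
    """
    n_bins = (end - start + bin_seconds - 1) // bin_seconds
    seen = {(ts - start) // bin_seconds for ts in timestamps if start <= ts < end}
    return 1 - len(seen) / n_bins


# Hypothetical sample: (plate, passing time in seconds) at one base station.
sample = [("A", 0), ("A", 2), ("A", 400), ("B", 10), ("B", 10),
          ("B", 13), ("C", 50), ("C", 500)]
rates = redundancy_rates(sample)
threshold = flattening_point(rates)
minute_loss = loss_rate([ts for _, ts in sample], start=0, end=600, bin_seconds=60)
```

Calling `loss_rate` with `bin_seconds` set to 60, 300, 600, 900, 1800, 3600, and so on gives one value per granularity, which could then be plotted per base station in the way the abstract describes.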