Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (1): 113-121. DOI: 10.3778/j.issn.1002-8331.2012-0432

• Big Data and Cloud Computing •

Bus Big Data Cleaning Based on Spatiotemporal Correlation

谢智颖,何原荣,李清泉   

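  1. School of Computing and Information Engineering, Xiamen University of Technology, Xiamen, Fujian 361024, China
  2. Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Publication date: 2022-01-01 Release date: 2022-01-06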

Big Data Cleaning Method for Bus Based on Spatiotemporal Correlation

XIE Zhiying, HE Yuanrong, LI Qingquan   

  1. School of Computing and Information Engineering, Xiamen University of Technology, Xiamen, Fujian 361024, China
  2. Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Online: 2022-01-01 Published: 2022-01-06

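Abstract: With the development of big data and AI technologies, data-driven prediction models keep emerging, and data cleaning plays an important role in improving their predictions. Starting from the spatiotemporal correlation of bus operation data, this paper analyzes four types of anomalies found in bus big data. Building on an analysis of the characteristics of bus big data (temporal correlation, spatial proximity, and spatiotemporal dependence), it proposes four cleaning methods, namely redundancy cleaning, range cleaning, anomaly cleaning, and completion cleaning, which integrate spatiotemporal processing techniques such as buffers, quartiles, and time-dependent networks. These methods are then applied to bus station arrival/departure datasets and trajectory datasets. A comparison of LSTM bus arrival time prediction accuracy across differently cleaned datasets shows that data cleaning significantly improves prediction accuracy.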

Key words: data cleaning, spatiotemporal correlation, data quality, bus big data

Abstract: With the development of big data and AI technology, data-driven prediction models keep emerging, and data cleaning plays an important role in improving the predictions of these models. Firstly, the four types of anomalies that exist in bus big data are analyzed. Secondly, four cleaning methods are put forward, based on an analysis of the characteristics of bus big data such as temporal correlation, spatial proximity, and spatiotemporal dependence. The four methods are redundancy cleaning, range cleaning, anomaly cleaning, and completion cleaning. Thirdly, the bus station arrival/departure and trajectory datasets are cleaned with these methods. Finally, a comparative analysis of the accuracy of LSTM bus arrival time prediction on differently cleaned datasets shows that data cleaning can significantly improve prediction accuracy.

Key words: data cleaning, spatiotemporal correlation, data quality, bus big data
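
The abstract does not spell out how the quartile-based cleaning works, so the following is only a minimal sketch of one common quartile technique (Tukey's 1.5 × IQR fences), not the authors' implementation. The record layout, the field name `travel_s`, the function names, and the sample travel times are all hypothetical and chosen purely for illustration.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    # quantiles(..., n=4) yields the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(records, key, k=1.5):
    """Keep only the records whose `key` value lies inside the fences."""
    lower, upper = iqr_bounds([r[key] for r in records], k)
    return [r for r in records if lower <= r[key] <= upper]


# Hypothetical travel times (seconds) between two adjacent stops;
# the 900 s value stands in for an anomalous record.
samples = [118, 125, 121, 130, 119, 900, 123, 127]
trips = [{"trip": i, "travel_s": t} for i, t in enumerate(samples)]

cleaned = drop_outliers(trips, "travel_s")
print([r["travel_s"] for r in cleaned])
```

In this sketch the fences are computed over one pooled sample. A scheme that respects temporal correlation would presumably compute them per road segment and time-of-day bucket, but that grouping is an assumption rather than something the abstract states.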