Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (2): 213-220. DOI: 10.3778/j.issn.1002-8331.1709-0378

Prediction of PM2.5 Concentration Level Based on Random Forest and Meteorological Parameters

REN Cairong1, XIE Gang1,2   

  1. College of Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China
  2. School of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China
  • Online: 2019-01-15 Published: 2019-01-15

基于随机森林和气象参数的PM2.5浓度等级预测

任才溶1,谢  刚1,2   

  1. 太原理工大学 信息工程学院,太原 030024
  2. 太原科技大学 电子信息工程学院,太原 030024

Abstract: Air pollution, and PM2.5 in particular, harms people's physical and mental health and restricts the economic development of cities. To forecast the PM2.5 concentration level conveniently and accurately, a prediction model based on random forest is proposed. Its feature factors are the meteorological data of Taiyuan city from 2013 to 2016, the temporal pattern of PM2.5 concentration change at the prediction site, and the spatiotemporal correlation between that site and the surrounding sites. First, the K-Means algorithm is applied to cluster the raw meteorological data in order to reduce the correlation between different classifiers. Second, undersampling is used to balance the dataset so as to reduce the impact of class imbalance on classifier performance. Finally, the prediction model is built with random forest, which has good generalization ability. Verification on real data shows that the method achieves good recall, precision and F-score in predicting the PM2.5 concentration level.

Key words: PM2.5, random forest, meteorological factors, undersampling, prediction
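
The undersampling and evaluation steps named in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `undersample` and `per_class_scores`, the level names and the toy labels are hypothetical; the balancing rule (keep as many samples per class as the rarest class has) is one common choice that the abstract does not specify; the F-score is assumed to be F1; and the K-Means and random forest stages are omitted.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the rarest class.

    Assumption: balancing to the minority-class size; the paper may use
    a different ratio.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_keep = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_keep):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    # Toy, made-up concentration levels: 8 "good", 3 "moderate", 2 "heavy".
    labels = ["good"] * 8 + ["moderate"] * 3 + ["heavy"] * 2
    samples = list(range(len(labels)))  # stand-ins for feature vectors
    _, balanced_labels = undersample(samples, labels)
    print(Counter(balanced_labels))

    y_true = ["good", "good", "moderate", "heavy", "heavy"]
    y_pred = ["good", "moderate", "moderate", "heavy", "good"]
    for label, (p, r, f) in per_class_scores(y_true, y_pred).items():
        print(label, round(p, 2), round(r, 2), round(f, 2))
```

Reporting the three scores per level rather than a single accuracy figure matters here because heavy-pollution levels are typically rare, so a classifier can reach high accuracy while missing most of them.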

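Abstract (translated from the Chinese version): Air pollution not only endangers people's physical and mental health but also restricts the economic development of cities, and the impact of PM2.5 is especially prominent. To predict the PM2.5 concentration level in the air conveniently and accurately, a random-forest-based method for predicting the PM2.5 concentration level is proposed. The feature factors are the meteorological data of Taiyuan from 2013 to 2017, the temporal pattern of PM2.5 concentration change at the prediction site, and its spatiotemporal correlation with the surrounding sites. The method first clusters the raw meteorological data with the K-Means algorithm to reduce the correlation between different classifiers, then applies undersampling to balance the data and reduce the impact of class imbalance on classifier performance, and finally builds the prediction model with random forest, which generalizes well. Validation on real data shows that the method achieves good precision, recall and F-score in predicting the PM2.5 concentration level.

Key words (translated from the Chinese version): PM2.5, random forest, meteorological factors, undersampling, prediction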