Computer Engineering and Applications, 2021, Vol. 57, Issue (5): 79-87. DOI: 10.3778/j.issn.1002-8331.2002-0163

Prediction Model of Execution Time for Batch Application in Spark

LI Shuo, LIANG Yi   

  1. Faculty of Information, Beijing University of Technology, Beijing 100124, China
  • Online: 2021-03-01 Published: 2021-03-02

面向Spark的批处理应用执行时间预测模型

李硕,梁毅   

  1. 北京工业大学 信息学部,北京 100124

Abstract:

Predicting the execution time of batch applications in Spark is a key technology for guiding resource allocation and application balancing in Spark. However, existing work adopts a unified prediction model for applications with different behavior characteristics and considers only a limited set of factors during model learning, which reduces prediction accuracy. To address these problems, an execution time prediction model for Spark batch applications is proposed that accounts for the diversity of application behavior characteristics. The model first classifies the execution time of Spark batch applications based on strongly correlated metrics, and then uses principal component analysis (PCA) and the gradient boosting decision tree (GBDT) algorithm to predict the execution time for each application category. When an ad-hoc application arrives, it is mapped to a specific application category and its execution time is predicted with the corresponding prediction model. Experimental results show that, compared with the unified prediction model, the proposed method reduces the root mean square error and the mean absolute percentage error of the prediction results by 32.1% and 33.9% on average, respectively.
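
The two error measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas used in the paper, so the standard textbook definitions are assumed here; the function names `rmse`, `mape`, and `relative_reduction` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error (standard definition, assumed here)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Every actual value must be nonzero, since each error is
    divided by the corresponding actual value.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(actual)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Made-up execution times in seconds, not data from the paper.
actual_times = [100.0, 200.0]
predicted_times = [110.0, 180.0]
print(rmse(actual_times, predicted_times))
print(mape(actual_times, predicted_times))
```

Under these assumed definitions, a statement such as "reduces the error by 32.1%" would correspond to `relative_reduction` applied to the error of the unified model and the error of the per-category model.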

Key words: Spark, batch application, classification, prediction

摘要:

Spark批处理应用执行时间预测是指导Spark系统资源分配、应用均衡的关键技术。然而,既有研究对于具有不同运行特征的应用采用统一的预测模型,且预测模型考虑因素较少,降低了预测的准确度。针对上述问题,提出了一种考虑了应用特征差异的Spark批处理应用执行时间预测模型,该模型基于强相关指标对Spark批处理应用执行时间进行分类,对于每一类应用,采用PCA和GBDT算法进行应用执行时间预测。当即席应用到达后,通过判断其所属应用类别并采用相应的预测模型进行执行时间预测。实验结果表明,与采用统一预测模型相比,提出的方法可使得预测结果的均方根误差和平均绝对百分误差平均降低32.1%和33.9%。

关键词: Spark, 批处理应用, 分类, 预测