Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (23): 68-73. DOI: 10.3778/j.issn.1002-8331.1912-0293

• Big Data and Cloud Computing •


Research on Optimization for Iteration-Intensive Applications on Spark

WEI Zhanchen, LIU Xiaoyu, HUANG Qiulan, SUN Gongxing   

  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Online: 2020-12-01    Published: 2020-11-30


Abstract:

Spark is a popular and widely applicable big data processing framework with good usability and scalability. However, some problems remain in practical use. For example, in certain iteration-intensive computing scenarios the achieved speedup is far from ideal, because distributed systems such as Spark introduce considerable additional overhead. To analyze and reduce this overhead accurately, this paper proposes a Spark efficiency analysis formula, in which the additional overhead is measured by the distributed calculation cost and the execution efficiency is measured by the effective calculation ratio. On this basis, an optimization strategy for iteration-intensive applications on Spark is designed and implemented. Test results show that both the effective calculation ratio and execution performance are greatly improved: the effective calculation ratio increases by about 0.373 and the execution time decreases by about 68.2%.
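
The abstract does not give the efficiency formula itself, so the following is only a plausible reading of its terms: assuming the total execution time decomposes into the effective computation time T_eff and the distributed calculation cost C_dist (scheduling, serialization, shuffle and data-movement overhead), the effective calculation ratio would be

R_{eff} = \frac{T_{eff}}{T_{eff} + C_{dist}}

so any optimization that removes redundant distributed work lowers C_dist and pushes R_eff toward 1.

The optimization strategy itself is also not described in this abstract; the Scala sketch below only illustrates the general idea of cutting per-iteration distributed cost in an iteration-intensive Spark job by keeping loop-invariant data cached across iterations. The object name, input path and gradient-descent workload are hypothetical, not taken from the paper.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: gradient descent over a cached RDD. Persisting the loop-invariant
// input means each iteration only pays for scheduling and one small reduce instead of
// re-reading and re-parsing the data set.
object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iterative-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: text lines of "x,y" pairs.
    val points = sc.textFile("hdfs:///tmp/points.csv")
      .map { line => val a = line.split(","); (a(0).toDouble, a(1).toDouble) }
      .persist(StorageLevel.MEMORY_AND_DISK) // loop-invariant data stays resident

    val n = points.count() // first action materializes the cache
    var w = 0.0            // model parameter, kept on the driver
    val lr = 0.001         // learning rate

    for (_ <- 1 to 20) {
      // Only the scalar w is shipped to executors each iteration via the closure;
      // the points are served from the cache.
      val grad = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= lr * grad
    }

    println(s"w = $w")
    points.unpersist()
    spark.stop()
  }
}

Without the persist call, every iteration would recompute the textFile-and-parse lineage, so the distributed calculation cost would grow with the number of iterations while the effective computation stays the same.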

Key words: Spark, optimization for iteration-intensive applications, distributed calculation cost, effective calculation ratio