Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (4): 72-76. DOI: 10.3778/j.issn.1002-8331.1701-0238

• Big Data and Cloud Computing •

A shuffle optimization mechanism for Spark clusters

XIONG Anping1,2, XIA Yuchong1, YANG Fangfang1

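  1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065
  2. Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065
  • Online: 2018-02-15 Published: 2018-03-07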

Shuffle optimization for Spark cluster

XIONG Anping1,2, XIA Yuchong1, YANG Fangfang1   

  1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065, China
  • Online: 2018-02-15 Published: 2018-03-07

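Abstract: Spark is an in-memory distributed data processing framework. During its shuffle phase, large volumes of data must be transferred over the network, and this has become one of Spark's main bottlenecks. To address the uneven data distribution during shuffle, which leaves different nodes with unequal network I/O load, a restart strategy based on task locality levels is designed, and a balanced scheduling strategy is further proposed to even out the network I/O load of each node. Experiments verify that the optimization mechanism reduces the execution time of computing tasks and improves the efficiency of the whole shuffle process.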

Key words: Spark cluster, shuffle process, data transfer, locality, scheduling strategy

Abstract: Spark is a memory-based distributed data processing framework. Its shuffle process moves large amounts of data across the network, which has become one of the main bottlenecks of Spark performance. To solve the problem that unbalanced data distribution causes network I/O load imbalance among nodes, a restart policy based on task locality level is designed, and a balanced scheduling policy is further proposed to balance the network I/O load of the nodes. Finally, experiments verify that the optimization mechanism reduces task execution time and improves the efficiency of the shuffle process.

Key words: Spark cluster, shuffle process, data transfer, locality, scheduling strategy
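
The abstract does not give the details of the balanced scheduling strategy, so the following is only a minimal sketch of the general idea of locality-aware, load-balanced placement of reduce tasks, not the authors' algorithm. It assumes a hypothetical input `sizes`, mapping each node name to the bytes of map output it holds for every reduce partition, and a hypothetical `slots_per_node` limit. The function name `assign_reduce_tasks` and the greedy rule are illustrative assumptions and are not part of Spark's API.

```python
from typing import Dict, List, Tuple


def assign_reduce_tasks(
    sizes: Dict[str, List[int]], slots_per_node: int
) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Greedily place reduce tasks so inbound network traffic stays balanced.

    sizes[node][p] is the number of map-output bytes for reduce partition p
    stored on `node` (hypothetical input, assumed known before scheduling).
    Returns (placement, inbound): the node chosen for each partition and the
    bytes each node must fetch from remote nodes.
    """
    nodes = sorted(sizes)
    num_partitions = len(sizes[nodes[0]])
    if num_partitions > slots_per_node * len(nodes):
        raise ValueError("not enough task slots for all partitions")

    # Total bytes each reduce partition has to read across the cluster.
    totals = [sum(sizes[n][p] for n in nodes) for p in range(num_partitions)]

    inbound = {n: 0 for n in nodes}  # remote bytes fetched by each node
    used = {n: 0 for n in nodes}     # slots already taken on each node
    placement: Dict[int, str] = {}

    # Place the largest partitions first, since they constrain balance most.
    for p in sorted(range(num_partitions), key=lambda q: -totals[q]):
        candidates = [n for n in nodes if used[n] < slots_per_node]
        # Remote bytes for partition p on node n are totals[p] - sizes[n][p],
        # so nodes holding more of the data locally are cheaper (locality),
        # and adding the current inbound load keeps the nodes balanced.
        best = min(
            candidates,
            key=lambda n: (inbound[n] + totals[p] - sizes[n][p], n),
        )
        placement[p] = best
        inbound[best] += totals[p] - sizes[best][p]
        used[best] += 1

    return placement, inbound


if __name__ == "__main__":
    example = {"a": [90, 10, 40], "b": [5, 80, 40], "c": [5, 10, 20]}
    print(assign_reduce_tasks(example, slots_per_node=1))
```

The greedy rule trades some locality for balance: a node that already fetches a lot of remote data becomes less attractive even when it holds a large local share of a partition. A real scheduler would also have to handle task restarts and locality wait times, which this sketch leaves out.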