计算机工程与应用 ›› 2020, Vol. 56 ›› Issue (7): 67-73.DOI: 10.3778/j.issn.1002-8331.1909-0124

• 大数据与云计算 • 上一篇    下一篇

Spark环境下基于子图的异步迭代更新方法

李超,董新华,陈建峡   

  1. 湖北工业大学 计算机学院,武汉 430068
  • 出版日期:2020-04-01 发布日期:2020-03-28

Asynchronous Iterative Updates Method Based on Subgraph in Spark

LI Chao, DONG Xinhua, CHEN Jianxia   

  1. School of Computer Science, Hubei University of Technology, Wuhan 430068, China
  • Online:2020-04-01 Published:2020-03-28

摘要:

全局同步计算模型简单易用,但是路障同步导致收敛速度变慢。以顶点为中心的异步迭代虽然提高了收敛速度,但在计算节点之间需要频繁发送信息。在Spark环境下提出一种基于子图的异步迭代更新方法。在子图之间建立异步消息通信连接后,子图能以异步方式发送数据块;通过多线程同步避免数据读写冲突,保证异步更新时顶点状态的一致性。在大规模样本数据集上分别从收敛结果、收敛速度和通信代价验证方法有效性。实验结果表明,与全局同步迭代相比,该方法有效提高了计算收敛速度。与顶点为中心的异步更新方式相比,该方法在收敛时间上略有增长,但是显著降低了通信开销。

关键词: 子图, 异步更新, Spark环境, 图数据, 图切分

Abstract:

Bulk synchronous parallel model is easy to program. However, long waiting time is required for each vertex to step into next round in the BSP models, and frequent messages-passing are incurred by vertex-centric asynchronous methods. To accelerate the execution of iterative graph computations, this paper proposes an asynchronous iterative method in Spark, and exploits two means to guarantee the validity. Firstly, by leveraging remote procedure call to establish connections, data blocks can be transmitted asynchronously among vertex partitions and edge partitions. Secondly, to guarantee data consistency during updating, synchronization is adopted to make threads exclusive access to vertex state. Experiments on large scale graph data are conducted on a local cluster. Comparing with the BSP and vertex-centric model, the proposed method not only accelerates iteration, but also improves communication efficiency.

Key words: subgraph, asynchronous update, Spark, graph data, graph partition