计算机工程与应用 ›› 2018, Vol. 54 ›› Issue (22): 67-73.DOI: 10.3778/j.issn.1002-8331.1708-0293

• 大数据与云计算 • 上一篇    下一篇

Spark平台下的凸包问题研究

李格非1,马蔚吟2,李  力3   

  1. 1.上海交通大学 计算机科学与工程系,上海 200240
    2.南京医科大学 基础医学院,南京 211166
    3.上海交通大学 软件学院,上海 200240
  • 出版日期:2018-11-15 发布日期:2018-11-13

Convex hull problem based on Spark platform

LI Gefei1, MA Weiyin2, LI Li3   

  1. 1.Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
    2.School of Basic Medical Sciences, Nanjing Medical University, Nanjing 211166, China
    3.School of Software, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online:2018-11-15 Published:2018-11-13

摘要: 随着移动互联网时代的到来,越来越多的含地理位置信息的空间数据需要处理,如何在海量的空间数据中进行常见的几何查询成为一个挑战,凸包问题因其在模式识别、图像处理、统计学、地理信息系统、博弈论、图论等领域中被广泛应用成为近些年研究的一个热点。凸包问题的研究始于单机版的算法,进而过渡到Hadoop等基于硬盘的分布式系统,但是受限于单节点的计算存储能力的瓶颈以及Hadoop平台基于硬盘的特性,其计算性能尚不能达到人们的在线实时计算的需求。研究基于内存的分布式计算框架Spark下的凸包问题,给出基于Spark平台的凸包查询整体框架,框架从查询接口、语法解析和物理执行等多方面结合SparkSQL引擎。随后,给出基于Andrew单调链算法的单机算法CHStand,分析单机算法并行度上的问题后,提出基于Spark的CHSpark算法,进一步优化算法并提出一种Spark平台下的优化算法CHGeom。通过实验对比说明三种算法的相对性能提升,实验发现Spark平台下的解决方案相对传统的单机平台下的解决方案有着较大的性能提升,所提算法具有良好的拓展性和广泛的实际应用价值。

关键词: Spark平台, 分布式计算, 空间几何查询, 凸包运算

Abstract: With the arrival of mobile Internet era, more and more spatial data need to be stored and processed. How to conduct spatial geometrical queries on massive spatial data becomes a challenge in recent years. The convex hull problem becomes a focus for research on its wide usage in visual pattern matching, image processing, statistics, geographic information system, game theory and graph theory fields. The solution for convex hull problem starts with algorithms running on single computer, and then comes to some disk-based distributed systems, such as Hadoop. But bottlenecked by the storage and computing ability for single computer, the frameworks can not satisfy the demand for data scientists. This paper introduces an in-memory cluster computing framework Apache Spark to solve convex hull problem, the framework optimizes SparkSQL for geometrical operation from query interface, syntax parsing and physical execution. Firstly, a standalone algorithm CHStand on single computer is brought up to solve this problem, which is a na?ve solution on Spark named CHSpark. Then an optimized CHGeom algorithm is come up to solve the convex hull problem. Additional experiments show that CHGeom runs much faster than algorithms on single computer. Some ideas mentioned in this paper can be easily extended to other spatial geometrical operation.

Key words: Spark platform, distributed computing, spatial geometrical query, convex hull operation