Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (12): 76-84. DOI: 10.3778/j.issn.1002-8331.1605-0320

• Big Data and Cloud Computing •

Research on ETL method of transforming relational data to graph data based on sub-schema

DING Qianglong, WANG Jin, ZHANG Xuejie   

  1. School of Information Science and Engineering, Yunnan University, Kunming 650091, China
  • Online: 2017-06-15 Published: 2017-07-04

Abstract: Graph databases outperform relational databases on problems such as multi-level relationship queries and community detection. However, most existing application data is stored in relational form, so how to extract, transform, and load (ETL) relational data into graph data efficiently and completely remains an important problem for deploying graph database applications. Existing work suffers from three major limitations: (1) the quality of the converted graph data is poor; (2) the transformation efficiency is low; and (3) the transformed results are not well suited to distributed storage. To overcome these limitations, this paper proposes a sub-schema-based ETL method for transforming relational data into graph data, which improves the procedure and algorithms of previous ETL methods. The method splits the relational database schema into several sub-schemas and performs ETL on them in parallel. This improves ETL efficiency, and the transformed results satisfy the requirements of distributed graph data storage and can also serve as base data for the Spark GraphX computing framework. Finally, a prototype system was developed with Java EE and Neo4j and used for experimental verification. The results show that the improved ETL method achieves better transformation performance than existing methods.

Key words: graph database, distributed storage, extract-transform-load (ETL), sub-schema