Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (2): 88-97. DOI: 10.3778/j.issn.1002-8331.1605-0117

• Big Data and Cloud Computing •

大规模云同步归集数据系统的异步并行优化

杨海涛1,张传斌2,阮镇江1,徐  飞1   

  1. 广东省建设信息中心,广州 510055
  2. 中山大学 数据科学与计算机学院,广州 510006
  • 出版日期:2017-01-15 发布日期:2017-05-11

Asynchronous parallel optimization of large-scale cloud sync-collection data systems

YANG Haitao1, ZHANG Chuanbin2, RUAN Zhenjiang1, XU Fei1   

  1. Guangdong Construction Information Center, Guangzhou 510055, China
  2. School of Data and Computer Science, Sun Yat-Sen University, Guangzhou 510006, China
  • Online:2017-01-15 Published:2017-05-11

摘要: 国民经济非垂直管理行业或领域建立大数据中心,需要配备能大规模云同步归集行业数据的软件系统,“行业数据云通用的同步枢纽与大数据联合体平台”(GSMS)就是为此而研制的。GSMS主要用于通过互联网大规模同步采集各地异构自治系统(或设备)的业务或事实数据并加以开发应用。在实际应用中,当众多GSMS客户线程各自并发地向GSMS数据中心同步数据时,所产生的大规模数据同步会话将汇聚在GSMS服务端,从而形成处理瓶颈。此外,同步会话全程串行的锁步机制也会制约大规模数据同步归集的性能。为此,提出并实现了一种异步并行化改进GSMS系统方案:将服务端高时耗计算环节从数据同步串行锁步过程中分离出来,为其引入基于多道消息队列中间件的异步并行处理机制,并提供相应的松弛同步事务保障措施。实践表明,正确地实现这种异步并行处理能有效提升服务端处理速度并满足同步系统的可靠性和一致性要求。

关键词: 异步并行处理, 海量数据归集, 大规模云同步, 数据同步枢纽

Abstract: Building big-data centers for industries or sectors of the national economy that are not vertically administered requires a software system that can synchronize and collect industry data at large scale over the cloud. The Generic Sync-pivot Mega-data Syndicate platform for industry data clouds (GSMS) was developed for this purpose. GSMS is mainly used to sync-collect, over the Internet and at large scale, business or factual data from heterogeneous autonomous systems (or devices) in different localities, and to develop applications on top of those data. In practice, when many GSMS client threads concurrently sync data to a GSMS data center, the resulting sync sessions converge on the GSMS server and form a processing bottleneck. In addition, the lock-step mechanism, which keeps the entire sync session serial, limits the performance of large-scale data sync-collection. To address both problems, an asynchronous parallel improvement to GSMS is proposed and implemented: the time-consuming server-side computing stage is separated from the serial lock-step sync process and handled by an asynchronous parallel mechanism based on multi-channel message-queue middleware, together with corresponding safeguards for loosely synchronized transactions. Practice shows that a correct implementation of this asynchronous parallel processing effectively increases server-side processing speed while still meeting the reliability and consistency requirements of the sync system.

Key words: asynchronous parallel processing, massive data collection, large-scale cloud sync, data sync pivot
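
The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the GSMS implementation: Python's in-process `queue.Queue` stands in for the message-queue middleware, and the class `AsyncSyncServer`, its methods, the `pending`/`committed` status ledger, and the choice to route each client to a fixed queue are all hypothetical names and assumptions made for illustration only.

```python
import queue
import threading
import zlib

NUM_QUEUES = 4  # hypothetical number of message channels


class AsyncSyncServer:
    """Toy stand-in for the server side; not the GSMS implementation."""

    def __init__(self, num_queues=NUM_QUEUES):
        self.queues = [queue.Queue() for _ in range(num_queues)]
        self.status = {}  # batch_id -> "pending" | "committed"
        self.store = {}   # client_id -> processed records, in arrival order
        self.lock = threading.Lock()
        # One worker per queue, so each queue is consumed in FIFO order.
        for q in self.queues:
            threading.Thread(target=self._worker, args=(q,), daemon=True).start()

    def receive(self, client_id, batch_id, records):
        """Fast path of a sync session: record, enqueue, acknowledge."""
        with self.lock:
            self.status[batch_id] = "pending"
        # Stable hash (assumption): one client's batches always share a queue.
        idx = zlib.crc32(client_id.encode()) % len(self.queues)
        self.queues[idx].put((client_id, batch_id, records))
        return "accepted"

    def _expensive_step(self, record):
        # Placeholder for the time-consuming server-side computation.
        return record.upper()

    def _worker(self, q):
        while True:
            client_id, batch_id, records = q.get()
            processed = [self._expensive_step(r) for r in records]
            with self.lock:
                self.store.setdefault(client_id, []).extend(processed)
                self.status[batch_id] = "committed"
            q.task_done()

    def drain(self):
        """Block until every queued batch has been processed."""
        for q in self.queues:
            q.join()


server = AsyncSyncServer()
for n in range(3):
    server.receive("client-a", f"a-{n}", [f"row{n}"])
server.receive("client-b", "b-0", ["x", "y"])
server.drain()
print(server.status["a-2"], server.store["client-a"])
```

In this sketch `receive` returns as soon as a batch is queued, so the session no longer waits for the expensive step, while the status ledger lets a client check later whether its batch has been committed. A real deployment would need durable queues and failure handling, which are outside the scope of this illustration.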