Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (2): 7-13.

Previous Articles     Next Articles

Crowdsourced query optimization for selection query with multiple predicates

FENG Jianhong, HU Huiqi, WENG Xueping, FENG Jianhua   

  1. Department of Computer Science & Technology, Tsinghua University, Beijing 100084, China
  • Online:2016-01-15 Published:2016-01-28

众包环境下多谓词查询优化

冯剑红,胡卉芪,翁学平,冯建华   

  1. 清华大学 计算机科学与技术系,北京 100084

Abstract: Crowdsourced query optimization has attracted significant attention from the database community in recent years. In this paper, it considers the crowdsourced selection query with multiple predicates and leverage human power to find all objects that satisfy every query predicate. A straightforward method enumerates every object and checks whether it satisfies each predicate. The cost of this method is [|R|?n,] where [|R|] is the number of objects and n is the number of predicates. Obviously this method is rather expensive, especially for large datasets or many predicates. It finds that different predicates have different selectivities and if it first verifies a highly selective predicate, it can avoid checking other predicates for objects that do not satisfy the predicate and thus significantly reduce the cost. An important problem is to determine a good predicate order. However it is rather hard to obtain an optimal order. To address this problem, it proposes a sampling-based framework to find a high-quality order. In order to control the cost of order generation, it devises a random-sampling-based selection method by randomly selecting the predicate order. Since the random-based selection randomly selects predicate permutations, which may lead to large cost, it proposes a ?ltering-based algorithm to further reduce the cost. It evaluates the method using real-world datasets on real crowdsourcing platforms. Experimental results indicate that the methods obtain a high-quality predicate order while significantly reducing the monetary cost.

null

摘要: 近年来,众包查询优化得到了数据库领域的广泛关注。主要研究了众包多谓词选择查询问题——借助于人力找到满足多谓词查询条件的对象。一种简单的方法是枚举数据集中的对象,对于每个对象判断是否满足每条谓词。它产生的代价是[|R|?n],其中[|R|]为数据集中对象的数量,[n]为谓词的数量。很显然,当处理大数据集或者查询包含较多谓词的时候,简单方法的代价是非常昂贵的。由于不同的谓词具有不同的选择性,如果首先验证高选择性的谓词,那么在验证剩余谓词的时候,就可以避免验证不满足高选择性谓词的对象。因此,采用一个好的谓词顺序实现众包选择查询可以显著减少人工代价。然而,实际中很难获得最佳的谓词序列。针对该问题,提出了一种基于采样的框架来获得高质量的查询序列。为了控制查询序列生成的成本,设计了基于随机序列的最优选择方法,该方法通过随机选择序列获得最终的谓词顺序。由于基于随机序列的选择方法可能产生较大的代价,为了减少开销,提出了一种基于过滤的序列选择方法。通过在众包平台上使用真实数据集评测了提出的方法,实验结果表明,该方法能够显著减少查询序列生成的代价,同时获得高质量的查询序列。

关键词: 众包选择查询, 采样, 选择性