Crowdsourced Query Optimization for Selection Queries with Multiple Predicates

Abstract

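Summary: In recent years, crowdsourced query optimization has received wide attention in the database community. This paper studies the crowdsourced selection query with multiple predicates, which relies on human workers to find the objects that satisfy all predicates of a query. A simple method enumerates the objects in the dataset and checks, for each object, whether it satisfies each predicate. Its cost is $|R| \cdot n$, where $|R|$ is the number of objects in the dataset and $n$ is the number of predicates. Clearly, this simple method is very expensive for large datasets or for queries with many predicates. Because different predicates have different selectivities, verifying a highly selective predicate first means that objects failing it need not be verified against the remaining predicates. A good predicate order can therefore significantly reduce the human cost of a crowdsourced selection query. In practice, however, the optimal predicate order is hard to obtain. To address this problem, a sampling-based framework is proposed to obtain a high-quality predicate order. To control the cost of generating the order, a selection method based on random orders is designed, which obtains the final predicate order by randomly selecting candidate orders. Because this random selection method may incur a large cost, a filtering-based order selection method is proposed to reduce the overhead. The proposed methods are evaluated on crowdsourcing platforms using real datasets; the experimental results show that they significantly reduce the cost of generating the order while still obtaining a high-quality predicate order.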


Keywords: crowdsourced selection query, sampling, selectivity

Abstract: Crowdsourced query optimization has attracted significant attention from the database community in recent years. This paper considers the crowdsourced selection query with multiple predicates, which leverages human power to find all objects that satisfy every query predicate. A straightforward method enumerates every object and checks whether it satisfies each predicate. The cost of this method is $|R| \cdot n$, where $|R|$ is the number of objects and $n$ is the number of predicates. This method is expensive, especially for large datasets or queries with many predicates. Different predicates have different selectivities: if a highly selective predicate is verified first, the remaining predicates need not be checked for objects that fail it, which significantly reduces the cost. An important problem is therefore to determine a good predicate order, but an optimal order is hard to obtain. To address this problem, we propose a sampling-based framework to find a high-quality order. To control the cost of order generation, we devise a random-sampling-based selection method that chooses the predicate order by randomly selecting predicate permutations. Since selecting permutations at random may itself lead to a large cost, we propose a filtering-based algorithm to further reduce the cost. We evaluate the methods using real-world datasets on real crowdsourcing platforms. Experimental results indicate that the methods obtain a high-quality predicate order while significantly reducing the monetary cost.
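The effect of predicate order on cost can be illustrated with a minimal Python sketch. It is not the authors' framework: the crowd is replaced by ordinary Python functions that return the true answer, every question is assumed to cost the same, and the order is chosen by a common heuristic (sort predicates by the pass rate measured on a sample) rather than by the random or filtering-based selection described in the abstract. All names (`count_questions`, `order_from_sample`, the toy predicates and `OBJECTS`) are hypothetical.

```python
import random


def count_questions(objects, order):
    """Count questions asked when predicates are checked in `order`,
    stopping for an object as soon as it fails one predicate."""
    asked = 0
    for obj in objects:
        for predicate in order:
            asked += 1
            if not predicate(obj):
                break  # object is rejected; skip the remaining predicates
    return asked


def order_from_sample(objects, predicates, sample_size, rng):
    """Estimate each predicate's pass rate on a random sample and return
    the predicates sorted so that the one rejecting most objects is first."""
    sample = rng.sample(objects, sample_size)

    def pass_rate(predicate):
        return sum(predicate(obj) for obj in sample) / len(sample)

    return sorted(predicates, key=pass_rate)


# Toy data (an assumption for illustration): integers stand in for objects.
def is_even(x):
    return x % 2 == 0


def is_under_ten(x):
    return x < 10


def not_multiple_of_five(x):
    return x % 5 != 0


OBJECTS = list(range(100))
PREDICATES = [is_even, is_under_ten, not_multiple_of_five]

# Cost of the straightforward method: every predicate for every object.
naive_cost = len(OBJECTS) * len(PREDICATES)

# Order estimated from a sample; answering the sample itself costs
# sample_size * len(PREDICATES) questions in this simplified model.
sampled_order = order_from_sample(OBJECTS, PREDICATES, 20, random.Random(0))
ordered_cost = count_questions(OBJECTS, sampled_order)
```

In this simplified model the sample is also paid for, so a larger sample gives a more reliable order at a higher up-front cost, which is the trade-off the sampling-based framework is meant to control.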

Cite this article: FENG Jianhong, HU Huiqi, WENG Xueping, FENG Jianhua. Crowdsourced query optimization for selection query with multiple predicates[J]. Computer Engineering and Applications (计算机工程与应用), 2016, 52(2): 7-13.

