Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (4): 139-146. DOI: 10.3778/j.issn.1002-8331.2108-0343

• Pattern Recognition and Artificial Intelligence •

Multi-Representation Model for the First-Stage Semantic Retrieval

CAI Yinqiong, FAN Yixing, GUO Jiafeng, ZHANG Ruqing   

  1. Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Online: 2023-02-15 Published: 2023-02-15

基于多表达的第一阶段语义检索模型

蔡银琼,范意兴,郭嘉丰,张儒清   

  1. 中国科学院 计算技术研究所 网络数据科学与技术重点实验室,北京 100190
  2. 中国科学院大学,北京 100190

Abstract: Modern information retrieval systems generally adopt a multi-stage retrieve-then-rank architecture. Recently, retrieval models based on dense representations have been increasingly applied in the first stage of document retrieval, where they outperform traditional sparse vector space models. To meet the efficiency requirements of first-stage retrieval, most of these models employ a bi-encoder architecture, which encodes the query and the document independently into one dense representation vector each. The score of a query-document pair is then computed by applying a simple similarity function to the two representations. However, the document is encoded without knowledge of the query, and documents usually contain more topic information than queries, so this single-representation model may cause serious loss of document information. To address this problem, this work proposes a new dense retrieval method, multi-representation dense retrieval (MDR), which encodes the document into multiple dense representation vectors. It also introduces a coverage mechanism to ensure that the multiple vectors differ from one another, so that they cover the information of different topics in the document. Experimental results on the passage ranking and document ranking tasks of the MS MARCO dataset demonstrate the effectiveness of the proposed model.

Key words: semantic retrieval, bi-encoder model, information retrieval
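
The contrast the abstract draws between single-representation and multi-representation scoring can be illustrated with a minimal sketch. The abstract does not specify how MDR aggregates scores over the multiple document vectors or how the coverage mechanism is formulated, so everything below is an assumption made for illustration: the dot product as the similarity function, max aggregation in the hypothetical `multi_rep_score`, and the mean pairwise cosine similarity in the hypothetical `redundancy_penalty` as a stand-in for a diversity term. The vectors are toy values, not encoder outputs.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here as the simple similarity function."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def single_rep_score(query: Vector, doc: Vector) -> float:
    """Bi-encoder score with one dense vector per document."""
    return dot(query, doc)


def multi_rep_score(query: Vector, doc_vectors: Sequence[Vector]) -> float:
    """Score against several document vectors.

    Assumption: the best-matching vector determines the score (max
    aggregation). The abstract does not state the aggregation used by MDR.
    """
    return max(dot(query, d) for d in doc_vectors)


def redundancy_penalty(doc_vectors: Sequence[Vector]) -> float:
    """Mean pairwise cosine similarity among a document's vectors.

    Assumption: a hypothetical stand-in for a diversity objective, not the
    coverage mechanism of the paper. Lower values mean the vectors differ
    more from one another.
    """
    n = len(doc_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    total = sum(cosine(doc_vectors[i], doc_vectors[j]) for i, j in pairs)
    return total / len(pairs)


# Toy example: a document about two topics, and a query about the second.
query = [0.0, 1.0]
averaged_doc = [0.5, 0.5]                 # one vector blending both topics
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]  # one vector per topic

print(single_rep_score(query, averaged_doc))  # blended vector dilutes the match
print(multi_rep_score(query, topic_vectors))  # topic-specific vector matches fully
print(redundancy_penalty(topic_vectors))      # orthogonal vectors, no redundancy
```

In this toy setting the single blended vector scores lower than the multi-vector document because the query's topic is diluted, which mirrors the information-loss argument in the abstract. Document vectors can still be computed offline in either scheme, since neither scoring function needs the query at encoding time.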

摘要: 当前,信息检索系统通常采用“检索+重排序”的多级流水线架构。基于稠密表示的检索模型已经被逐渐应用到第一阶段检索中,并展现出了相比传统的稀疏向量空间模型更好的性能。考虑到第一阶段检索所需的高效性,大多数情况下这些模型的基本架构都采用双编码器(bi-encoder)结构。对查询和文档进行独立的编码,分别得到一个稠密表示向量,然后基于获得的查询和文档表示使用简单的相似度函数计算查询-文档对的得分。然而,在编码文档的过程中查询是不可知的,而且文档相比查询而言通常包含更多的主题信息,因此这种简单的单表示模型可能会造成严重的文档信息丢失。为了解决这个问题,设计了一种新的语义检索方法MDR(multi-representation dense retrieval),将文档编码成多个稠密向量表示。同时,该方法引入覆盖率(coverage)机制来保证多个向量之间的差异性,从而能够覆盖文档中不同主题的信息。为了评估模型性能,在MS MARCO数据集上进行了段落排序和文档排序任务,实验结果证明了MDR方法的有效性。

关键词: 语义检索, 双编码器模型, 信息检索