Computer Engineering and Applications ›› 2026, Vol. 62 ›› Issue (5): 1-17. DOI: 10.3778/j.issn.1002-8331.2503-0004

• Hot Topics and Reviews •

Survey on Question Answering Systems and Evaluation in the Era of Large Models

CUI Longfei1, WANG Zongshui2,3+, BAO Yingxu4, ZHAO Hong5

    1. School of Computing, Beijing Information Science & Technology University, Beijing 100192, China
    2. Business School, Beijing Information Science & Technology University, Beijing 100192, China
    3. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
    4. Faculty of Social Sciences, Lingnan University, Hong Kong 999077, China
    5. School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
    + Corresponding author E-mail: wangzongshui8@163.com
  • Received: 2025-03-03 Revised: 2025-09-18 Online: 2026-03-01 Published: 2026-03-01
  • Funding:
    National Natural Science Foundation of China (71972175); Key R&D Task Special Project of Xinjiang Uygur Autonomous Region (2024B03026); Outstanding Young Talents Cultivation Program of the Beijing Municipal Education Commission (BPHR202203237); General Program of the China Postdoctoral Science Foundation (2025M770691); Special Project on Higher Education Research (2023GXJK570).

Abstract: In the era of large models (LMs), question answering (QA) systems exhibit many new characteristics. Based on a review of the literature, this paper summarizes the characteristics of QA systems and the frameworks used to evaluate them. Across the stages of training and inference for QA models (training data, pre-training frameworks, model post-processing, and efficient model fine-tuning), it contrasts the early "pursuit of data and parameter scale" with the current "emphasis on data and model efficiency" and systematically analyzes the new characteristics of LM-based QA systems. It then summarizes the current types of evaluation frameworks for large QA models and reviews in detail the datasets, evaluation metrics, and quantitative calculation methods that the automated evaluation framework HELM (holistic evaluation of language models) applies to QA tasks. Future research on LM-based QA systems is expected to expand and deepen in several directions: multimodal fusion, high security, high interpretability, low resource consumption, and comprehensive evaluation frameworks that combine large models with automated evaluation.

Key words: large models (LMs), question answering (QA) systems, system features, HELM (holistic evaluation of language models) evaluation framework