Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (20): 194-205. DOI: 10.3778/j.issn.1002-8331.2501-0217

• Pattern Recognition and Artificial Intelligence •

QAAttack: Text Feature Analysis-Based Fuzz Testing Approach for Question-Answering Systems

FU Haikuo, QIAO Yuanxin, CHEN Jingjing, CUI Zhanqi, WANG Zhiwei   

  1. School of Computer Science, Beijing Information Science and Technology University, Beijing 100080, China
  2. Data and Technical Support Center, Cyberspace Administration of China, Beijing 100048, China
  • Online: 2025-10-15 Published: 2025-10-15

QAAttack:基于文本特征分析的问答系统模糊测试方法

符海阔,乔塬心,陈菁菁,崔展齐,王志伟   

  1. 北京信息科技大学 计算机学院,北京 100080
  2. 国家互联网信息办公室 数据与技术保障中心,北京 100048

Abstract: Traditional testing methods for question-answering (QA) systems rely on pre-annotated datasets, which are costly to annotate manually and are often unavailable in real-world testing scenarios. To address this, researchers have proposed generating test cases through metamorphic relations to evaluate system robustness. However, existing approaches lack guidance and selection strategies during test case generation, so they produce many redundant or invalid test cases, which may undermine the validity of the test results. This paper presents QAAttack, a fuzz testing method for QA systems based on textual feature analysis. QAAttack applies multiple metamorphic relations to mutate seeds, computes the TF-IDF scores of the words in each question to determine the order in which they are mutated, and introduces a seed selection strategy guided by answer semantic similarity that, in each iteration, prioritizes the seeds most likely to trigger system errors. Experimental results show that QAAttack detects erroneous behaviors in QA systems more effectively than existing methods: it identifies 14,236 more erroneous behaviors than QAAskeR, and its success rate in triggering erroneous behaviors is 37.08% higher than that of QATest.

Key words: question-answering systems, fuzz testing, metamorphic testing, natural language processing
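
The two guidance steps named in the abstract, TF-IDF-based mutation ordering and similarity-guided seed selection, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names (`tokenize`, `tfidf_ranking`, `token_overlap`, `select_seed`) are hypothetical, the smoothed IDF formula is one common variant, Jaccard token overlap is only a crude stand-in for the semantic similarity measure used in the paper, and the choice to prefer the candidate whose answer diverges most from the original answer is a guess at what "most likely to trigger system errors" means.

```python
import math
from collections import Counter

PUNCT = ".,?!;:\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_ranking(question, corpus):
    """Return the question's distinct words, highest TF-IDF first.

    Assumed formula: tf = count / question length,
    idf = log((1 + N) / (1 + df)) + 1 (a smoothed variant).
    """
    words = tokenize(question)
    tf = Counter(words)
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)
        idf = math.log((1 + n) / (1 + df)) + 1
        scores[word] = (count / len(words)) * idf
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))


def token_overlap(a, b):
    """Jaccard overlap of token sets (placeholder for semantic similarity)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def select_seed(candidates, original_answer):
    """Pick the (question, answer) pair whose answer is least similar
    to the original answer (assumed selection direction)."""
    return min(candidates, key=lambda qa: token_overlap(qa[1], original_answer))
```

Under these assumptions, a fuzzing loop would call `tfidf_ranking` on each seed question to decide which words to mutate first, query the QA system with the mutated questions, and pass the resulting (question, answer) pairs to `select_seed` to choose the seed for the next iteration.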
