[1] KEPUSKA V, BOHOUTA G. Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home)[C]//Proceedings of the IEEE 8th Annual Computing and Communication Workshop and Conference. Piscataway: IEEE, 2018: 99-103.
[2] LOPEZ A. Statistical machine translation[J]. ACM Computing Surveys, 2008, 40(3): 1-49.
[3] SUN Z Y, ZHANG J M, HARMAN M, et al. Automatic testing and improvement of machine translation[C]//Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. New York: ACM, 2020: 974-985.
[4] DERIU J, RODRIGO A, OTEGI A, et al. Survey on evaluation methods for dialogue systems[J]. Artificial Intelligence Review, 2021, 54(1): 755-810.
[5] SCHRAMOWSKI P, TURAN C, ANDERSEN N, et al. Language models have a moral dimension[J]. arXiv:2103.11790, 2021.
[6] HSU Y C, SHEN Y L, JIN H X, et al. Generalized ODIN: detecting out-of-distribution images without learning from out-of-distribution data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10948-10957.
[7] GU Y, KASE S E, VANNI M, et al. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases[C]//Proceedings of the Web Conference 2021. New York: ACM, 2021: 3477-3488.
[8] KEYSERS D, SCHÄRLI N, SCALES N, et al. Measuring compositional generalization: a comprehensive method on realistic data[J]. arXiv:1912.09713, 2019.
[9] MILLER J, KRAUTH K, RECHT B, et al. The effect of natural distribution shift on question answering models[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 6905-6916.
[10] CHEN S Q, JIN S, XIE X Y. Testing your question answering software via asking recursively[C]//Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2021: 104-116.
[11] LIU Z X, FENG Y, YIN Y N, et al. QATest: a uniform fuzzing framework for question answering systems[C]//Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: 1-12.
[12] CHEN S Q, JIN S, XIE X Y. Validation on machine reading comprehension software without annotated labels: a property-based method[C]//Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2021: 590-602.
[13] WU H C, LUK R W P, WONG K F, et al. Interpreting TF-IDF term weights as making relevance decisions[J]. ACM Transactions on Information Systems, 2008, 26(3): 1-37.
[14] QIN P G, YU J R, GAO Y, et al. Unified QA-aware knowledge graph generation based on multi-modal modeling[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 7185-7189.
[15] RAJPURKAR P, JIA R, LIANG P. Know what you don't know: unanswerable questions for SQuAD[J]. arXiv:1806.03822, 2018.
[16] CLARK C, LEE K, CHANG M W, et al. BoolQ: exploring the surprising difficulty of natural yes/no questions[J]. arXiv:1905.10044, 2019.
[17] KWIATKOWSKI T, PALOMAKI J, REDFIELD O, et al. Natural questions: a benchmark for question answering research[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 453-466.
[18] 江洋洋, 金伯, 张宝昌. 深度学习在自然语言处理领域的研究进展[J]. 计算机工程与应用, 2021, 57(22): 1-14.
JIANG Y Y, JIN B, ZHANG B C. Research progress of natural language processing based on deep learning[J]. Computer Engineering and Applications, 2021, 57(22): 1-14.
[19] 吴欢欢, 谢瑞麟, 乔塬心, 等. 基于可解释性分析的深度神经网络优化方法[J]. 计算机研究与发展, 2024, 61(1): 209-220.
WU H H, XIE R L, QIAO Y X, et al. Optimizing deep neural network based on interpretability analysis[J]. Journal of Computer Research and Development, 2024, 61(1): 209-220.
[20] RAJPURKAR P, ZHANG J, LOPYREV K, et al. SQuAD: 100,000+ questions for machine comprehension of text[J]. arXiv:1606.05250, 2016.
[21] CHOI E, HE H, IYYER M, et al. QuAC: question answering in context[J]. arXiv:1808.07036, 2018.
[22] YU F, SEFF A, ZHANG Y, et al. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop[J]. arXiv:1506.03365, 2015.
[23] BARR E T, HARMAN M, MCMINN P, et al. The oracle problem in software testing: a survey[J]. IEEE Transactions on Software Engineering, 2015, 41(5): 507-525.
[24] 王赞, 闫明, 刘爽, 等. 深度神经网络测试研究综述[J]. 软件学报, 2020, 31(5): 1255-1275.
WANG Z, YAN M, LIU S, et al. Survey on testing of deep neural networks[J]. Journal of Software, 2020, 31(5): 1255-1275.
[25] 谢瑞麟, 崔展齐, 陈翔, 等. IADT: 基于解释分析的深度神经网络差分测试[J]. 软件学报, 2024, 35(12): 5452-5469.
XIE R L, CUI Z Q, CHEN X, et al. IADT: interpretability-analysis-based differential testing for deep neural network[J]. Journal of Software, 2024, 35(12): 5452-5469.
[26] MCKEEMAN W M. Differential testing for software[J]. Digital Technical Journal, 1998, 10(1): 100-107.
[27] SEGURA S, FRASER G, SANCHEZ A B, et al. A survey on metamorphic testing[J]. IEEE Transactions on Software Engineering, 2016, 42(9): 805-824.
[28] CHEN T Y, KUO F C, LIU H, et al. Metamorphic testing[J]. ACM Computing Surveys, 2019, 51(1): 1-27.
[29] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out, 2004.
[30] CHEN Y, JIANG Y, MA F, et al. EnFuzz: ensemble fuzzing with seed synchronization among diverse fuzzers[J]. arXiv:1807.00182, 2018.
[31] CAVNAR W B, TRENKLE J M. N-gram-based text categorization[C]//Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
[32] LIN C Y, HOVY E. Automatic evaluation of summaries using N-gram co-occurrence statistics[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown: ACL, 2003: 71-78.
[33] JURAFSKY D, MARTIN J H. Speech and language processing[M]. New York: Prentice Hall, 2010.
[34] HERRERA A, GUNADI H, MAGRATH S, et al. Seed selection for successful fuzzing[C]//Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2021: 230-243.
[35] SHEN Q C, CHEN J J, ZHANG J M, et al. Natural test generation for precise testing of question answering software[C]//Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: 1-12.
[36] XIE X Y, JIN S, CHEN S Q. QAAskeR+: a novel testing method for question answering software via asking recursive questions[J]. Automated Software Engineering, 2023, 30(1): 14.
[37] WANG J, LI Y H, CHEN Z F, et al. Knowledge graph driven inference testing for question answering software[C]//Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. New York: ACM, 2024: 1-13.