[1] KEPUSKA V, BOHOUTA G. Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home)[C]//Proceedings of the IEEE 8th Annual Computing and Communication Workshop and Conference. Piscataway: IEEE, 2018: 99-103.
[2] LOPEZ A. Statistical machine translation[J]. ACM Computing Surveys, 2008, 40(3): 1-49.
[3] SUN Z Y, ZHANG J M, HARMAN M, et al. Automatic testing and improvement of machine translation[C]//Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. New York: ACM, 2020: 974-985.
[4] DERIU J, RODRIGO A, OTEGI A, et al. Survey on evaluation methods for dialogue systems[J]. Artificial Intelligence Review, 2021, 54(1): 755-810.
[5] SCHRAMOWSKI P, TURAN C, ANDERSEN N, et al. Language models have a moral dimension[J]. arXiv:2103.11790, 2021.
[6] HSU Y C, SHEN Y L, JIN H X, et al. Generalized ODIN: detecting out-of-distribution images without learning from out-of-distribution data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10948-10957.
[7] GU Y, KASE S E, VANNI M, et al. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases[C]//Proceedings of the Web Conference 2021. New York: ACM, 2021: 3477-3488.
[8] KEYSERS D, SCHÄRLI N, SCALES N, et al. Measuring compositional generalization: a comprehensive method on realistic data[J]. arXiv:1912.09713, 2019.
[9] MILLER J, KRAUTH K, RECHT B, et al. The effect of natural distribution shift on question answering models[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 6905-6916.
[10] CHEN S Q, JIN S, XIE X Y. Testing your question answering software via asking recursively[C]//Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2021: 104-116.
[11] LIU Z X, FENG Y, YIN Y N, et al. QATest: a uniform fuzzing framework for question answering systems[C]//Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: 1-12.
[12] CHEN S Q, JIN S, XIE X Y. Validation on machine reading comprehension software without annotated labels: a property-based method[C]//Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2021: 590-602.
[13] WU H C, LUK R W P, WONG K F, et al. Interpreting TF-IDF term weights as making relevance decisions[J]. ACM Transactions on Information Systems, 2008, 26(3): 1-37.
[14] QIN P G, YU J R, GAO Y, et al. Unified QA-aware knowledge graph generation based on multi-modal modeling[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 7185-7189.
[15] RAJPURKAR P, JIA R, LIANG P. Know what you don't know: unanswerable questions for SQuAD[J]. arXiv:1806.03822, 2018.
[16] CLARK C, LEE K, CHANG M W, et al. BoolQ: exploring the surprising difficulty of natural yes/no questions[J]. arXiv:1905.10044, 2019.
[17] KWIATKOWSKI T, PALOMAKI J, REDFIELD O, et al. Natural questions: a benchmark for question answering research[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 453-466.
[18] 江洋洋, 金伯, 张宝昌. 深度学习在自然语言处理领域的研究进展[J]. 计算机工程与应用, 2021, 57(22): 1-14.
JIANG Y Y, JIN B, ZHANG B C. Research progress of natural language processing based on deep learning[J]. Computer Engineering and Applications, 2021, 57(22): 1-14.
[19] 吴欢欢, 谢瑞麟, 乔塬心, 等. 基于可解释性分析的深度神经网络优化方法[J]. 计算机研究与发展, 2024, 61(1): 209-220.
WU H H, XIE R L, QIAO Y X, et al. Optimizing deep neural network based on interpretability analysis[J]. Journal of Computer Research and Development, 2024, 61(1): 209-220.
[20] RAJPURKAR P, ZHANG J, LOPYREV K, et al. SQuAD: 100,000+ questions for machine comprehension of text[J]. arXiv:1606.05250, 2016.
[21] CHOI E, HE H, IYYER M, et al. QuAC: question answering in context[J]. arXiv:1808.07036, 2018.
[22] YU F, SEFF A, ZHANG Y, et al. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop[J]. arXiv:1506.03365, 2015.
[23] BARR E T, HARMAN M, MCMINN P, et al. The oracle problem in software testing: a survey[J]. IEEE Transactions on Software Engineering, 2015, 41(5): 507-525.
[24] 王赞, 闫明, 刘爽, 等. 深度神经网络测试研究综述[J]. 软件学报, 2020, 31(5): 1255-1275.
WANG Z, YAN M, LIU S, et al. Survey on testing of deep neural networks[J]. Journal of Software, 2020, 31(5): 1255-1275.
[25] 谢瑞麟, 崔展齐, 陈翔, 等. IADT: 基于解释分析的深度神经网络差分测试[J]. 软件学报, 2024, 35(12): 5452-5469.
XIE R L, CUI Z Q, CHEN X, et al. IADT: interpretability-analysis-based differential testing for deep neural network[J]. Journal of Software, 2024, 35(12): 5452-5469.
[26] MCKEEMAN W M. Differential testing for software[J]. Digital Technical Journal, 1998, 10(1): 100-107.
[27] SEGURA S, FRASER G, SANCHEZ A B, et al. A survey on metamorphic testing[J]. IEEE Transactions on Software Engineering, 2016, 42(9): 805-824.
[28] CHEN T Y, KUO F C, LIU H, et al. Metamorphic testing[J]. ACM Computing Surveys, 2019, 51(1): 1-27.
[29] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out, 2004.
[30] CHEN Y, JIANG Y, MA F, et al. EnFuzz: ensemble fuzzing with seed synchronization among diverse fuzzers[J]. arXiv:1807.00182, 2018.
[31] CAVNAR W B, TRENKLE J M. N-gram-based text categorization[C]//Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
[32] LIN C Y, HOVY E. Automatic evaluation of summaries using N-gram co-occurrence statistics[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown: ACL, 2003: 71-78.
[33] JURAFSKY D, MARTIN J H. Speech and language processing[M]. New York: Prentice Hall, 2010.
[34] HERRERA A, GUNADI H, MAGRATH S, et al. Seed selection for successful fuzzing[C]//Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2021: 230-243.
[35] SHEN Q C, CHEN J J, ZHANG J M, et al. Natural test generation for precise testing of question answering software[C]//Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: 1-12.
[36] XIE X Y, JIN S, CHEN S Q. QAAskeR+: a novel testing method for question answering software via asking recursive questions[J]. Automated Software Engineering, 2023, 30(1): 14.
[37] WANG J, LI Y H, CHEN Z F, et al. Knowledge graph driven inference testing for question answering software[C]//Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. New York: ACM, 2024: 1-13.