
Computer Engineering and Applications ›› 2026, Vol. 62 ›› Issue (5): 1-17. DOI: 10.3778/j.issn.1002-8331.2503-0004
CUI Longfei1, WANG Zongshui2,3+, BAO Yingxu4, ZHAO Hong5
Received: 2025-03-03
Revised: 2025-09-18
Online: 2026-03-01
Published: 2026-03-01
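Abstract: In the era of large models, question answering (QA) systems exhibit many new characteristics. Based on a review of the literature, this paper summarizes the characteristics of QA systems and their evaluation frameworks. Across the stages of QA model training and inference (training data, pre-training framework, model post-processing, and efficient model fine-tuning), it contrasts the early large-model training approach of "pursuing data and parameter scale" with the current approach of "emphasizing data and model efficiency", and systematically analyzes the new characteristics of large-model-based QA systems. It then summarizes the current types of evaluation frameworks for large QA models and details the datasets, evaluation metrics, and quantitative computation methods that the automated evaluation framework HELM (holistic evaluation of language models) applies to QA tasks. Future research on large-model-based QA systems is expected to broaden and deepen along several directions: multimodal fusion, high safety, high interpretability, low resource consumption, and comprehensive evaluation frameworks that combine large-model-based and automated evaluation.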
CUI Longfei, WANG Zongshui, BAO Yingxu, ZHAO Hong. Survey on Question Answering Systems and Evaluation in the Era of Large Models [J]. Computer Engineering and Applications, 2026, 62(5): 1-17.