Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (20): 75-104.DOI: 10.3778/j.issn.1002-8331.2410-0452

• Research Hotspots and Reviews •

Survey of Feedback-Based Content and Behavior Alignment Methods for Large Language Models

ZHANG Yuying, YUN Jing, LIU Xueying, SHI Xiaoguo   

  1. College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China
    2.Inner Mongolia Autonomous Region Engineering & Technology Research Center of Big Data Based Software Service, Hohhot 010080, China
    3.Inner Mongolia Beijiang Cyberspace Security Key Laboratory, Hohhot 010080, China
  • Online: 2025-10-15 Published: 2025-10-15

Abstract: In recent years, large language models have demonstrated exceptional capabilities in natural language understanding, generation, and reasoning across a range of tasks. However, ensuring that their outputs conform to human-defined standards remains difficult, and alignment has become the key means of achieving it. This paper presents a systematic review of feedback-based alignment methods, focusing on the dual objectives of “content alignment” and “behavior alignment”, and spanning conceptual frameworks, technical implementations, and evaluation methodologies. Firstly, it clarifies the sources, formats, and intended purposes of feedback, establishing a conceptual framework for feedback-based alignment. Secondly, it summarizes existing feedback-based alignment methods in the order of model training, inference, and generation. It then reviews the fundamental technical metrics for evaluating large models, along with relevant datasets and benchmarks. Finally, it highlights the potential of feedback-based alignment methods to improve the performance of large language models, as well as the significant challenges and open issues they currently face.

Key words: large language models(LLMs), AI alignment, content security, evaluation benchmarks
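Many of the training-stage feedback alignment methods the survey covers start from pairwise human preferences over model outputs. As an illustrative sketch only (not code from the paper; the function name and reward values are assumptions), the Bradley-Terry-style loss commonly used to fit a reward model penalizes scoring the human-rejected answer above the human-chosen one:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry preference loss on one feedback pair:
    -log sigmoid(r_chosen - r_rejected).
    Small when the reward model ranks the human-preferred
    answer higher, large when it mis-ranks the pair."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A model that agrees with the human preference incurs low loss...
loss_agree = preference_loss(2.0, 0.5)
# ...while a mis-ranking is penalized more heavily.
loss_disagree = preference_loss(0.5, 2.0)
```

Minimizing this loss over a dataset of preference pairs yields the reward signal that policy-optimization methods such as RLHF then use to align model behavior.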
