Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (17): 129-138. DOI: 10.3778/j.issn.1002-8331.2306-0190

• Pattern Recognition and Artificial Intelligence •

Fine-Tuning via Masked Language Model Enhanced Representations Based Contrastive Learning and Its Application

ZHANG Dechi, WAN Weibing   

  1. School of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 200000, China
  • Online: 2024-09-01  Published: 2024-08-30

Abstract: Self-attention networks play an important role in Transformer-based language models, where the fully connected structure can capture non-contiguous dependencies in a sequence in parallel. However, a fully connected self-attention network easily overfits to spurious associations, such as spurious associations between words, or between words and the prediction target. This overfitting limits the ability of language models to generalize to out-of-domain or out-of-distribution data. To improve the robustness and generalization ability of Transformer language models against spurious associations, this paper proposes a fine-tuning framework via masked language model enhanced representations based contrastive learning (MCL-FT). Specifically, a text sequence and its randomly masked counterpart are fed into a Siamese network, and the model parameters are learned by combining a contrastive learning objective with the downstream task objective. Each branch of the Siamese network consists of a pre-trained language model and a task classifier. The fine-tuning framework is therefore more consistent with the masked language model pre-training paradigm and can preserve the generalization ability of pre-trained knowledge in downstream tasks. On the MNLI, FEVER, and QQP datasets and their challenge sets, the proposed method is compared with the latest baseline models, including the large language models ChatGPT, GPT-4, and LLaMA. Experimental results show that the proposed model maintains in-distribution performance while improving out-of-distribution performance. Experiments on the ATIS and Snips datasets further demonstrate that the model is also effective on common natural language processing tasks.
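To make the fine-tuning scheme in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a bert-base-uncased backbone, a BERT-style 15% masking rate, [CLS] pooling, an InfoNCE-style in-batch contrastive term with an assumed temperature of 0.05, and an assumed loss weight alpha; none of these specifics are stated in the abstract, and the actual MCL-FT architecture and objective may differ. Both views (the original text and its randomly masked version) pass through the same encoder and classifier, and the contrastive agreement loss is added to the downstream task loss.

```python
# Hedged sketch of a Siamese fine-tuning objective combining a downstream task
# loss with a contrastive loss between original and masked-sequence representations.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # assumed backbone; the paper may use another
MASK_PROB = 0.15                   # assumed masking rate, as in BERT pre-training

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)  # e.g. 3 NLI labels


def random_mask(input_ids, attention_mask):
    """Replace a random subset of non-special tokens with [MASK]."""
    ids = input_ids.clone()
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(row, already_has_special_tokens=True)
         for row in ids.tolist()], dtype=torch.bool)
    candidates = attention_mask.bool() & ~special
    mask = (torch.rand_like(ids, dtype=torch.float) < MASK_PROB) & candidates
    ids[mask] = tokenizer.mask_token_id
    return ids


def forward_branch(input_ids, attention_mask):
    """One branch of the Siamese network: shared encoder + task classifier."""
    out = encoder(input_ids=input_ids, attention_mask=attention_mask)
    rep = out.last_hidden_state[:, 0]          # [CLS] representation
    return rep, classifier(rep)


def mcl_ft_loss(texts, labels, alpha=0.1):
    """Task loss on both views plus an in-batch contrastive agreement loss.
    `alpha` is an assumed weighting coefficient."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    ids, attn = batch["input_ids"], batch["attention_mask"]
    masked_ids = random_mask(ids, attn)

    rep_a, logits_a = forward_branch(ids, attn)          # original sequence
    rep_b, logits_b = forward_branch(masked_ids, attn)   # masked sequence

    task_loss = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)

    # Each original representation should be most similar to its own masked
    # counterpart within the batch (InfoNCE-style contrastive objective).
    sim = F.normalize(rep_a, dim=-1) @ F.normalize(rep_b, dim=-1).T
    targets = torch.arange(sim.size(0))
    contrastive_loss = F.cross_entropy(sim / 0.05, targets)  # 0.05: assumed temperature

    return task_loss + alpha * contrastive_loss
```

In this sketch the masked view plays the same role as masked language model pre-training inputs, which is how the framework stays consistent with the pre-training regime while the contrastive term discourages the classifier from relying on associations that disappear under masking.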

Key words: Transformer, masked language model, contrastive learning, fine-tuning, spurious association, generalization ability
