Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (17): 129-138. DOI: 10.3778/j.issn.1002-8331.2306-0190

• Pattern Recognition and Artificial Intelligence •

Fine-Tuning via Masked Language Model Enhanced Representations Based Contrastive Learning and Its Application

ZHANG Dechi, WAN Weibing   

  1. School of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 200000, China
  • Online: 2024-09-01  Published: 2024-08-30

Abstract: Self-attention networks play an important role in Transformer-based language models, where the fully connected structure can capture non-contiguous dependencies in a sequence in parallel. However, a fully connected self-attention network easily overfits to spurious associations, such as spurious associations between words, or between words and the prediction target. This overfitting limits the ability of language models to generalize to out-of-domain or out-of-distribution data. To improve the robustness and generalization ability of Transformer language models against spurious associations, this paper proposes a fine-tuning framework via masked language model enhanced representations based contrastive learning (MCL-FT). Specifically, a text sequence and its randomly masked counterpart are fed into a Siamese network, and the model parameters are learned by combining a contrastive learning objective with the downstream task objective. Each branch of the Siamese network consists of a pre-trained language model and a task classifier. The fine-tuning framework is therefore more consistent with the masked language model pre-training paradigm and can preserve the generalization ability of pre-trained knowledge in downstream tasks. On the MNLI, FEVER, and QQP datasets and their challenge sets, the proposed method is compared with the latest baseline models, including the large language models ChatGPT, GPT-4, and LLaMA. Experimental results show that the proposed model maintains in-distribution performance while improving out-of-distribution performance. Experiments on the ATIS and Snips datasets further demonstrate that the model is also effective on common natural language processing tasks.
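To make the fine-tuning scheme in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a bert-base-uncased backbone, a BERT-style 15% masking rate, [CLS] pooling, an InfoNCE-style in-batch contrastive term with an assumed temperature of 0.05, and an assumed loss weight alpha; none of these specifics are stated in the abstract, and the actual MCL-FT architecture and objective may differ. Both views (the original text and its randomly masked version) pass through the same encoder and classifier, and the contrastive agreement loss is added to the downstream task loss.

```python
# Hedged sketch of a Siamese fine-tuning objective combining a downstream task
# loss with a contrastive loss between original and masked-sequence representations.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # assumed backbone; the paper may use another
MASK_PROB = 0.15                   # assumed masking rate, as in BERT pre-training

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)  # e.g. 3 NLI labels


def random_mask(input_ids, attention_mask):
    """Replace a random subset of non-special tokens with [MASK]."""
    ids = input_ids.clone()
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(row, already_has_special_tokens=True)
         for row in ids.tolist()], dtype=torch.bool)
    candidates = attention_mask.bool() & ~special
    mask = (torch.rand_like(ids, dtype=torch.float) < MASK_PROB) & candidates
    ids[mask] = tokenizer.mask_token_id
    return ids


def forward_branch(input_ids, attention_mask):
    """One branch of the Siamese network: shared encoder + task classifier."""
    out = encoder(input_ids=input_ids, attention_mask=attention_mask)
    rep = out.last_hidden_state[:, 0]          # [CLS] representation
    return rep, classifier(rep)


def mcl_ft_loss(texts, labels, alpha=0.1):
    """Task loss on both views plus an in-batch contrastive agreement loss.
    `alpha` is an assumed weighting coefficient."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    ids, attn = batch["input_ids"], batch["attention_mask"]
    masked_ids = random_mask(ids, attn)

    rep_a, logits_a = forward_branch(ids, attn)          # original sequence
    rep_b, logits_b = forward_branch(masked_ids, attn)   # masked sequence

    task_loss = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)

    # Each original representation should be most similar to its own masked
    # counterpart within the batch (InfoNCE-style contrastive objective).
    sim = F.normalize(rep_a, dim=-1) @ F.normalize(rep_b, dim=-1).T
    targets = torch.arange(sim.size(0))
    contrastive_loss = F.cross_entropy(sim / 0.05, targets)  # 0.05: assumed temperature

    return task_loss + alpha * contrastive_loss
```

In this sketch the masked view plays the same role as masked language model pre-training inputs, which is how the framework stays consistent with the pre-training regime while the contrastive term discourages the classifier from relying on associations that disappear under masking.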

Key words: Transformer, masked language model, contrastive learning, fine-tuning, spurious association, generalization ability
