Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (24): 121-130. DOI: 10.3778/j.issn.1002-8331.2208-0109

• Pattern Recognition and Artificial Intelligence •


Named Entity Recognition for Agglutinative Languages Based on Pruning and Multilingual Fine-Tuning

LUO Kai’ang, Abudukelimu Halidanmu, LIU Chang, Abudukelimu Abulizi, GUO Wenqiang   

  1. School of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, China
  • Online: 2023-12-15  Published: 2023-12-15

Abstract: Minority languages, represented by Uyghur, are agglutinative and resource-scarce, which poses great challenges for their named entity recognition tasks. Meanwhile, multilingual models suffer from problems such as large parameter scales, large vocabularies, and slow inference. To address this, CINO is re-pruned to obtain CINO-Agglu, a new version of CINO tailored to named entity recognition in low-resource agglutinative languages. To explore the best fine-tuning strategy and alleviate the low-resource problem, monolingual and multilingual fine-tuning are performed on five agglutinative languages: Uyghur, Kazakh, Kirghiz, Uzbek, and Tatar. The experimental results show that, compared with the unpruned model, CINO-Agglu reduces model size, parameter count, vocabulary size, and inference time by 45%, 44%, 92%, and 38%, respectively, and achieves an average F1 score of 85.9% across the five languages, exceeding all baseline models. Adding an appropriate amount of data from the same language branch further improves fine-tuning performance.
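
The abstract does not spell out the pruning procedure; a common way to shrink a multilingual encoder's vocabulary (the 92% reduction reported above) is frequency-based vocabulary pruning. The sketch below is a minimal illustration of that idea, assuming the CINO checkpoint name "hfl/cino-base-v2" and the per-language corpus file names, which are hypothetical and not taken from the paper.

```python
# A minimal sketch of frequency-based vocabulary pruning for a multilingual
# encoder. The checkpoint id and corpus paths are illustrative assumptions.
from collections import Counter

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/cino-base-v2")
model = AutoModel.from_pretrained("hfl/cino-base-v2")

# 1. Count which subword IDs actually occur in the target-language corpora.
counts = Counter()
for path in ["ug.txt", "kk.txt", "ky.txt", "uz.txt", "tt.txt"]:  # hypothetical corpora
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(tokenizer(line.strip())["input_ids"])

# 2. Keep the special tokens plus every subword observed in the corpora.
keep_ids = sorted(set(tokenizer.all_special_ids) | set(counts))

# 3. Slice the input embedding matrix down to the kept rows; a matching
#    tokenizer with remapped token IDs must be rebuilt as well (omitted here).
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(keep_ids), old_embeddings.size(1))
new_embeddings.weight.data = old_embeddings[keep_ids].clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(keep_ids)
model.save_pretrained("./cino-agglu")  # hypothetical output path
```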
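For the multilingual fine-tuning setup compared in the experiments, a natural implementation is to pool NER training data from the five related Turkic languages before fine-tuning the pruned model for token classification. The sketch below assumes the pruned checkpoint was saved to "./cino-agglu" and that each language's data has been tokenized into input_ids/labels; the label count, file names, and output paths are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of multilingual fine-tuning for NER on pooled data.
from datasets import Dataset, concatenate_datasets
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("./cino-agglu")
# num_labels=7 assumes a BIO scheme over PER/LOC/ORG plus O; adjust to the tag set.
model = AutoModelForTokenClassification.from_pretrained("./cino-agglu", num_labels=7)

# Multilingual fine-tuning: pool training data from related Turkic languages so
# the low-resource target language benefits from same-family transfer.
per_language = [
    Dataset.from_json(path)  # expects "input_ids" and "labels" columns
    for path in ["ug.json", "kk.json", "ky.json", "uz.json", "tt.json"]
]
train_set = concatenate_datasets(per_language)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./ner-out", num_train_epochs=3),
    train_dataset=train_set,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

Monolingual fine-tuning is the same setup restricted to a single language's dataset, which is what the paper contrasts against the pooled variant.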

Key words: agglutinative language, low-resource language, named entity recognition, cross-lingual transfer, model pruning