Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (21): 127-133. DOI: 10.3778/j.issn.1002-8331.2307-0200

• Pattern Recognition and Artificial Intelligence •

Research on Pre-Training Models for Tibetan Text with Character Awareness

Gadeng Luosang (洛桑嘎登), Nyima Tashi (尼玛扎西)

  1. Institute of Information Science and Technology, Tibet University, Lhasa 850000, China
  2. Key Laboratory of Tibetan Information Technology and Artificial Intelligence, Tibet University, Lhasa 850000, China
  • Online: 2024-11-01 Published: 2024-10-25

Abstract: Existing Tibetan pre-training models mainly use syllables to represent Tibetan words. Word representations built from syllable embeddings alone are incomplete and not very robust. To address this, the paper proposes a Tibetan character-aware pre-training model that fuses features at three levels, namely Tibetan characters, radical components (字丁), and syllables, so that Tibetan words are represented from finer-grained information. The method is evaluated on Tibetan word segmentation and named entity recognition, using both the original datasets and an adversarial test set containing spelling errors. The experimental results show that the method improves both the performance and the robustness of Tibetan pre-trained language models.

Key words: Tibetan, pre-training model, character awareness
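
The abstract names three granularities (characters, radical components, and syllables) without describing how the text is decomposed. As a rough illustration only, the sketch below shows one way these three views of a Tibetan string could be extracted before any embedding is computed. It is not the authors' tokenizer: the function names are hypothetical, syllables are assumed to be delimited by the tsheg mark (U+0F0B), and a "radical component" is approximated as a base letter together with the Unicode combining marks that follow it.

```python
import unicodedata

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg mark (assumed delimiter)."""
    return [s for s in text.split(TSHEG) if s]


def split_stacks(syllable):
    """Approximate stacked components: a base letter plus following combining marks."""
    stacks = []
    for ch in syllable:
        # Subjoined letters and vowel signs have a Unicode "Mark" category.
        if stacks and unicodedata.category(ch).startswith("M"):
            stacks[-1] += ch
        else:
            stacks.append(ch)
    return stacks


def three_views(text):
    """Return syllable-, stack-, and character-level views of each syllable."""
    return [
        {"syllable": s, "stacks": split_stacks(s), "chars": list(s)}
        for s in split_syllables(text)
    ]


# Two syllables ("bod" and "yig"), written with escapes for portability.
sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"
views = three_views(sample)
```

In a character-aware model, each of the three views would typically be mapped to its own embedding and the results combined; how the paper performs that fusion is not stated in the abstract, so it is left out here. Other punctuation, such as the shad mark, is not handled by this sketch.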