Research on Pre-Training Models for Tibetan Text with Character Awareness

doi:10.3778/j.issn.1002-8331.2307-0200

Abstract

Existing Tibetan pre-training models predominantly use syllables to represent Tibetan words. Word representations built from syllable embeddings alone, however, are incomplete and not robust. To address this, a character-aware Tibetan pre-training model is proposed. The model fuses features at three levels (Tibetan characters, radical components, and syllables) so that word representations draw on finer-grained information than syllables alone. The approach is evaluated on Tibetan automatic word segmentation and named entity recognition, using both the original dataset and an adversarial test set containing spelling errors. The experimental results show that the proposed method improves both the performance and the robustness of Tibetan pre-training language models.

Key words: Tibetan, pre-training model, character awareness
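
The three granularities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that syllables are delimited by the tsheg mark (U+0F0B), that "characters" are individual Unicode code points, and that a radical component can be approximated by a base letter together with the combining marks that follow it. The helper names (`split_syllables`, `split_characters`, `split_stacks`, `drop_one_character`) are hypothetical, and the deletion-based perturbation is only one plausible way to simulate a spelling error, since the abstract does not say how the adversarial test set was built.

```python
import random
import unicodedata

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]


def split_characters(syllable):
    """Return the individual Unicode code points of one syllable."""
    return list(syllable)


def split_stacks(syllable):
    """Group each base letter with the combining marks that follow it.

    Assumption: this grouping only approximates a radical component;
    the paper may define the unit differently.
    """
    stacks = []
    for ch in syllable:
        if stacks and unicodedata.category(ch).startswith("M"):
            stacks[-1] += ch  # attach subjoined letter or vowel sign
        else:
            stacks.append(ch)  # start a new stack at a base letter
    return stacks


def drop_one_character(syllable, rng):
    """Simulate a spelling error by deleting one random code point."""
    if len(syllable) < 2:
        return syllable
    i = rng.randrange(len(syllable))
    return syllable[:i] + syllable[i + 1:]


# "bod yig" (Tibetan script), written with escapes to avoid font issues.
text = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"
rng = random.Random(0)
for syl in split_syllables(text):
    print(syl, split_characters(syl), split_stacks(syl),
          drop_one_character(syl, rng))
```

A character-aware model would embed each of these views separately and combine them; the sketch covers only the segmentation step that would feed such embeddings.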


Gadeng Luosang, Nyima Tashi. Research on Pre-Training Models for Tibetan Text with Character Awareness[J]. Computer Engineering and Applications, 2024, 60(21): 127-133.

