Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (1): 141-144.

Research on Tibetan document encoding recognition

CHUN Yan, QU Zhen   

  1. Department of Computer Science and Technology, Tibet University, Lhasa 850012, China
  • Online: 2013-01-01 Published: 2013-01-16

藏文文本编码识别方法研究

春  燕,曲  珍   

  1. 西藏大学 计算机科学与技术系,拉萨 850012

Abstract: This paper discusses key problems in Tibetan encoding identification and conversion. It introduces the structural and statistical characteristics of Tibetan script together with several candidate recognition criteria, and analyzes and compares them. The method adopted uses two features, the regularity of the spacing between Tibetan syllable dots and the presence of high-frequency syllables, to identify the following Tibetan encodings: FOUNDER Windows, FOUNDER DOS, Tonguer, HURGURNG Windows, HURGURNG DOS, Pandita, the ASCII-based Tibetan encoding, the ISO/IEC 10646 basic set, and the Tibetan Coded Character Set Extension A. It also correctly distinguishes Tibetan text from text in other languages. On the test documents, the recognition rate of the algorithm reaches 100%.

Key words: Tibetan encoding, Tibetan encoding identification, syllable dot

摘要: 讨论了藏文编码识别与转换中的关键问题,介绍了藏文结构特点和统计学特征以及各种可能的识别准则,并进行分析比较。确定了使用以藏文的音节点间距规律和高频音节为特征的识别方法对方正Windows、方正Dos、同元、华光Windows、华光Dos、班智达、ASCII的藏文编码方案、ISO/IEC10646基本集、国家标准扩充集A的藏文编码识别,能够正确地将藏文文本与其他语言进行区分。在对目标样本的测试中,该算法的识别率可达100%。

关键词: 藏文编码, 藏文编码识别, 音节点
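
The syllable-dot spacing feature named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it covers only text already decoded as Unicode (ISO/IEC 10646), where the syllable dot (tsheg) is U+0F0B, and the function names and the thresholds `max_gap` and `min_ratio` are assumptions chosen for illustration. The legacy encodings listed in the abstract would need their own syllable-dot code values, which are not given here.

```python
TSHEG = "\u0F0B"  # TIBETAN MARK INTERSYLLABIC TSHEG (the syllable dot)


def tsheg_gaps(text):
    """Return the number of characters between consecutive syllable dots."""
    positions = [i for i, ch in enumerate(text) if ch == TSHEG]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def looks_like_unicode_tibetan(text, max_gap=8, min_ratio=0.9):
    """Heuristic: most gaps between syllable dots should be short and non-empty.

    max_gap and min_ratio are illustrative assumptions, not values from the paper.
    """
    gaps = tsheg_gaps(text)
    if not gaps:
        # Fewer than two syllable dots: no spacing evidence at all.
        return False
    regular = sum(1 for g in gaps if 1 <= g <= max_gap)
    return regular / len(gaps) >= min_ratio
```

A full identifier along the lines the abstract describes would combine such a spacing test with a check for high-frequency syllables, and would run one test per candidate encoding.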