Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (17): 143-147.

• Database, Signal and Information Processing •

COX: Chinese-oriented XML compressor with high compression ratio

ZHAO Youqiao (赵友桥)1, ZHANG Shanshan (张山山)1, LU Songfeng (路松峰)1, WU Zhijie (吴志杰)2

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  2. Institute of Computer Application Technology, China Academy of Engineering Physics, Mianyang, Sichuan 621900, China
  • Online: 2012-06-11 Published: 2012-06-20

Abstract: Commonly used XML compression algorithms do not take the characteristics of Chinese text into account and treat Chinese characters no differently from English words. To address this shortcoming, this paper combines the characteristics of Chinese and of XML and presents COX, a Chinese-oriented XML compressor with a high compression ratio. Input documents are first preprocessed with Chinese word segmentation. A sorted dictionary is then built by counting word frequencies, and high-frequency and long words are encoded using the idea of Huffman coding. After the XML document is parsed, its elements are classified, and elements of the same type are placed in the same container. COX also gives numerical data special treatment. Experimental results show that, compared with general-purpose compression software, COX achieves a higher compression ratio when the XML documents contain more Chinese words, at the cost of longer compression and decompression times.
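
The dictionary and coding steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the text has already been segmented into a list of words (the segmenter itself is out of scope here), it assigns a code to every word rather than only to the high-frequency and long ones, and the function names `build_sorted_dictionary`, `huffman_codes`, `encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_sorted_dictionary(tokens):
    """Return (word, frequency) pairs, most frequent first.

    Ties are broken by longer word first, then lexicographically.
    This ordering is an assumption made for the sketch.
    """
    freq = Counter(tokens)
    return sorted(freq.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))


def huffman_codes(freq_pairs):
    """Build a prefix-free binary code from (symbol, frequency) pairs."""
    pairs = list(freq_pairs)
    if not pairs:
        return {}
    if len(pairs) == 1:
        return {pairs[0][0]: "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in pairs]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode(tokens, codes):
    """Concatenate the code of each token into one bit string."""
    return "".join(codes[t] for t in tokens)


def decode(bits, codes):
    """Invert encode(); relies on the code being prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

Because frequent words receive shorter codes, a multi-character Chinese word that recurs often can be stored in a few bits. A real compressor would also need to store the dictionary alongside the bit stream so that the decoder can rebuild the code table.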

Key words: Chinese XML document, data compression, Chinese word segmentation, dictionary
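
The container step mentioned in the abstract can be sketched in the same hedged way. The example below groups element text by tag name and routes purely numeric text into separate containers. Grouping by tag name, the split into two container maps, and the function name `split_into_containers` are all assumptions made for illustration. The abstract does not say how COX defines an element type or how it treats numbers.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_into_containers(xml_text):
    """Group element text by tag; numeric text goes to separate containers.

    Returns (text_containers, numeric_containers), each mapping a tag
    name to the list of values found under that tag, in document order.
    """
    root = ET.fromstring(xml_text)
    text_containers = defaultdict(list)
    numeric_containers = defaultdict(list)
    for elem in root.iter():
        text = (elem.text or "").strip()
        if not text:
            continue  # structural element without character data
        # Treat text as numeric only if the integer round-trips exactly,
        # so values such as "007" are kept as text and nothing is lost.
        if text.isascii() and text.isdigit() and str(int(text)) == text:
            numeric_containers[elem.tag].append(int(text))
        else:
            text_containers[elem.tag].append(text)
    return dict(text_containers), dict(numeric_containers)
```

Placing similar values next to each other tends to help a back-end compressor, since each container holds data with similar statistics. Integers stored as numbers rather than as character strings can also be packed more compactly.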