Heuristic Chinese sentence compression algorithm based on hot word

HAN Jing, ZHANG Dongzhan   

  1. School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
韩  静,张东站   

  1. 厦门大学 信息科学与技术学院,福建 厦门 361005

Abstract: Most traditional sentence compression methods rely on parallel corpora of original and compressed sentences, which are difficult to obtain. This paper therefore proposes a linguistically motivated heuristic algorithm for Chinese sentence compression that does not depend on such corpora. From an analysis of human-produced compressions and linguistic knowledge, two sets of compression rules are derived, one at the word level and one at the clause level. The algorithm applies both rule sets to the parse tree of the original sentence and the dependencies between its words, and uses word heat (hot words) to strengthen the rules so that the algorithm stays flexible and accurate. In the final step, the compressed result is cleaned up and its grammar is repaired. Human-produced compression, the rule-only algorithm, and the hot-word-enhanced algorithm are compared, and the results are evaluated on compression rate, grammaticality, informativeness, and heat. The experimental results show that the hot-word-based heuristic algorithm improves the heat of the compressed sentences with little loss in compression rate, grammaticality, or informativeness.

Key words: Chinese sentence compression, hot word, linguistics, parse tree

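Abstract (translated from the Chinese): Traditional sentence compression methods mostly rely on aligned corpora of original and compressed sentence pairs, which are difficult to obtain; a Chinese sentence compression algorithm that does not depend on aligned corpora is therefore proposed. By studying human compression results in combination with linguistic knowledge, two sets of compression rules are proposed, one at the word level and one at the clause level. Working from the syntactic parse tree of the original sentence and the dependency relations between its words, the algorithm compresses with the two rule sets; to give the algorithm greater adaptability and accuracy, word heat is introduced to strengthen it. The final compressed sentence is obtained through sentence tidying and grammar repair. Three compression methods are compared: human compression, rule-only compression, and compression with word heat. The experimental results show that the heat-based heuristic Chinese sentence compression algorithm can raise the heat of the compressed sentences with little loss in compression ratio, grammaticality, and informativeness.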

Key words (translated from the Chinese): Chinese sentence compression, hot word, linguistics, parse tree
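
The abstract describes rule-based pruning that hot words can override, evaluated in part by compression rate. The following is only a minimal sketch of that idea, not the authors' implementation: the `Token` type, the dependency labels in `REMOVABLE_DEPS`, the example sentence, the hot-word set, and the character-based definition of compression rate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str  # dependency label of the word relative to its head (hypothetical tag set)


# Hypothetical word-level rule: modifiers with these labels are candidates for removal.
REMOVABLE_DEPS = {"amod", "advmod"}


def compress(tokens, hot_words=frozenset()):
    """Drop removable modifiers, but keep any word found in the hot-word set."""
    return [t for t in tokens if t.dep not in REMOVABLE_DEPS or t.text in hot_words]


def compression_rate(original, compressed):
    """Assumed definition: compressed length divided by original length, in characters."""
    return sum(len(t.text) for t in compressed) / sum(len(t.text) for t in original)


# Illustrative sentence: "著名 歌手 昨天 发布 全新 专辑"
sentence = [
    Token("著名", "amod"), Token("歌手", "nsubj"), Token("昨天", "advmod"),
    Token("发布", "root"), Token("全新", "amod"), Token("专辑", "dobj"),
]

rule_only = compress(sentence)
hot_enhanced = compress(sentence, hot_words={"全新"})

print("".join(t.text for t in rule_only))     # 歌手发布专辑
print("".join(t.text for t in hot_enhanced))  # 歌手发布全新专辑
print(compression_rate(sentence, rule_only))  # 0.5
```

In this toy case the hot word "全新" survives pruning, so the hot-word-enhanced output is slightly longer than the rule-only output while retaining the word assumed to be of current interest.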