Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (21): 25-29.DOI: 10.3778/j.issn.1002-8331.2008.21.007

• Doctoral Forum •

Research on Chinese text segmentation based on quantified conceptual relations extracted from Chinese dictionary

ZHONG Mao-sheng1,2, HU Yi1, LIU Lei1

  1. Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200240, China
  2. School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
  • Received:2008-04-30 Revised:2008-06-02 Online:2008-07-21 Published:2008-07-21
  • Contact: ZHONG Mao-sheng

A Chinese text segmentation method based on quantified relations between dictionary words

钟茂生1,2,胡 熠1,刘 磊1   

  1. 上海交通大学 计算机科学与工程系, 上海 200240
  2. 华东交通大学 信息工程学院, 南昌 330013
  • Corresponding author: 钟茂生

Abstract: With the rapid expansion of information resources on the Internet, processing massive amounts of unstructured text has become a major challenge. Topic-based text segmentation is an important preprocessing step in text processing, and its performance directly affects downstream tasks such as information retrieval, text summarization, and question answering (Q-A) systems. Two key problems arise in text segmentation: how to measure topical relevance, and how to devise a strategy that identifies segment boundaries from the relevance of the surrounding context. To address these problems, this paper presents a new approach that measures the relevance between sentences using Quantified Conceptual Relations (QCR) extracted from the Modern Chinese Standard Dictionary (MCSD), and builds a model that calculates the Segmentation Value of each gap point between sentences, so that text is segmented at the sentence level rather than the paragraph level. Experimental results show that this approach achieves a lower average error rate pk than state-of-the-art methods on Chinese text segmentation.

Key words: text segmentation, quantified conceptual relations, inter-sentence relevance measure, gap point, segmentation value
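
The abstract does not give the formulas behind the inter-sentence relevance measure or the Segmentation Value, so the following Python sketch only illustrates the general idea of scoring gap points. Everything in it is an assumption made for illustration: the toy relation table `QCR` (not taken from MCSD), the averaging in `sentence_relevance`, the windowed score in `gap_values`, and the fixed threshold in `pick_boundaries`.

```python
from typing import Dict, List, Tuple

# Hypothetical quantified relations between word pairs (toy values, not from MCSD).
QCR: Dict[Tuple[str, str], float] = {
    ("bank", "loan"): 0.8,
    ("bank", "interest"): 0.7,
    ("loan", "interest"): 0.9,
    ("river", "water"): 0.8,
    ("river", "fish"): 0.6,
    ("water", "fish"): 0.7,
}


def word_relation(a: str, b: str) -> float:
    """Symmetric lookup: identical words score 1.0, unknown pairs score 0.0."""
    if a == b:
        return 1.0
    return QCR.get((a, b), QCR.get((b, a), 0.0))


def sentence_relevance(s1: List[str], s2: List[str]) -> float:
    """Assumed measure: mean pairwise word relation between two tokenized sentences."""
    if not s1 or not s2:
        return 0.0
    total = sum(word_relation(a, b) for a in s1 for b in s2)
    return total / (len(s1) * len(s2))


def gap_values(sentences: List[List[str]], window: int = 2) -> List[float]:
    """Assumed segmentation value for gap i (between sentence i and i + 1):
    1 minus the mean relevance of sentence pairs that straddle the gap."""
    values = []
    for gap in range(len(sentences) - 1):
        left = sentences[max(0, gap - window + 1): gap + 1]
        right = sentences[gap + 1: gap + 1 + window]
        scores = [sentence_relevance(l, r) for l in left for r in right]
        values.append(1.0 - sum(scores) / len(scores))
    return values


def pick_boundaries(values: List[float], threshold: float = 0.9) -> List[int]:
    """Mark a gap as a segment boundary when its value reaches the threshold."""
    return [i for i, v in enumerate(values) if v >= threshold]


sentences = [
    ["bank", "loan"],
    ["loan", "interest"],
    ["river", "water"],
    ["water", "fish"],
]
values = gap_values(sentences)
print(values)
print(pick_boundaries(values))
```

In this toy input the two finance sentences and the two river sentences share no related words, so the gap between them receives the highest value. A real system would need Chinese word segmentation and relation weights derived from the dictionary, neither of which is shown here.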

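Abstract (Chinese version): With the rapid expansion of resources on the Internet, processing massive amounts of unstructured text has become a huge challenge. As an important preprocessing step in text processing, text segmentation directly affects the results of other tasks such as information retrieval, text summarization, and question answering. To address the two fundamental problems of text segmentation, namely measuring topical relevance and choosing a boundary-identification strategy, this paper proposes a method that measures inter-sentence relevance using quantified relations between dictionary words, and builds a mathematical model that computes the segmentation value of the gap points between sentences, thereby achieving sentence-level Chinese text segmentation. Experiments on three groups of test corpora selected from the national Chinese corpus (国家汉语语料库) show that both the average error probability pk for identifying segment boundaries and its lowest value are better than those of other existing Chinese text segmentation methods.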

Key words (Chinese version): text segmentation, quantified word relations, sentence relevance measure, gap point, segmentation value
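
The abstract reports results as an error probability pk without defining it. Assuming it refers to the Pk measure commonly used to evaluate text segmentation (a window of width k slides over the text, and an error is counted whenever the reference and the hypothesis disagree on whether the two ends of the window lie in the same segment), a minimal Python sketch looks like this. The function names, the boundary encoding, and the default choice of k as half the average reference segment length are assumptions of this sketch, not details from the paper.

```python
from typing import List, Optional


def labels_from_boundaries(n: int, boundaries: List[int]) -> List[int]:
    """Turn gap indices into per-sentence segment labels.
    A boundary i means a break between sentence i and sentence i + 1."""
    breaks = set(boundaries)
    labels, segment = [], 0
    for i in range(n):
        labels.append(segment)
        if i in breaks:
            segment += 1
    return labels


def pk(reference: List[int], hypothesis: List[int], n: int,
       k: Optional[int] = None) -> float:
    """Fraction of width-k windows on which reference and hypothesis disagree
    about whether the two window ends fall in the same segment."""
    ref = labels_from_boundaries(n, reference)
    hyp = labels_from_boundaries(n, hypothesis)
    if k is None:
        # Assumed convention: half the average reference segment length.
        num_segments = ref[-1] + 1
        k = max(1, round(n / num_segments / 2))
    total = n - k
    if total <= 0:
        raise ValueError("text is too short for the chosen window size k")
    errors = 0
    for i in range(total):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / total


# Eight sentences with one true boundary after the fourth sentence (gap index 3).
print(pk(reference=[3], hypothesis=[3], n=8))  # identical segmentations
print(pk(reference=[3], hypothesis=[], n=8))   # hypothesis misses the boundary
```

Lower values are better: identical segmentations score zero, and a missed or misplaced boundary is penalized only in the windows that span it.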