计算机工程与应用 (Computer Engineering and Applications) ›› 2012, Vol. 48 ›› Issue (12): 19-23.

• Doctoral Forum •

话语标记的计量与自动过滤提取

阚明刚   

  1. 中国传媒大学 文学院,国家语言资源监测与研究中心有声媒体语言分中心,北京 100024
  • 出版日期:2012-04-21 发布日期:2012-04-20

Statistics and auto-retrieving of discourse markers

KAN Minggang   

  1. School of Arts, China Broadcast Media Language Monitor and Research Branch, Communication University of China, Beijing 100024, China
  • Online: 2012-04-21 Published: 2012-04-20

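Abstract (Chinese version): Discourse markers in text are receiving growing attention in natural language processing. The goal of this study is a top-down survey of discourse markers based on large-scale corpora. Two genre corpora of 5 million character tokens each were built, discourse markers were extracted and counted with tools such as UltraEdit, and their usage was analyzed in detail. The analysis shows that discourse markers are not used only in spoken language and that each genre has its own usage characteristics. Building on the markers obtained, an algorithm for extracting them from large-scale corpora is given and implemented in code, which reduces manual work and improves recognition efficiency.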

Keywords (Chinese version): computer-assisted, discourse markers, quantification, filtering

Abstract: Discourse markers (DMs) have recently received increasing attention in natural language processing. This research aims to survey DMs top-down on the basis of large-scale corpora. Two genre corpora are built, each containing 5 million characters. Software tools such as UltraEdit are used to extract and count DMs. A detailed analysis of their usage shows that DMs are not confined to spoken discourse and that each genre has its own usage characteristics. An extraction algorithm is then given and implemented in C#, and a test shows that it is effective.

Key words: computer-assisted, discourse markers (DMs), calculation, filtration
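
The abstract does not describe the extraction algorithm itself, which the author implemented in C#. As a rough illustration only, the Python sketch below shows one way a lexicon-based filter of this kind might work: candidate markers are counted only when they open a clause, and counts are normalized per 10,000 characters so that genres can be compared. The marker list, the clause-initial rule, the sample texts, and all function names are assumptions made for this example and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical marker lexicon; the paper's actual list is not given in the abstract.
MARKERS = ["总之", "换句话说", "也就是说", "你知道"]

# Assumed filter rule: a marker counts only at the start of the text or right
# after clause-level punctuation (full-width or ASCII) or a newline.
BOUNDARY = r"(?:^|(?<=[,。!?;:、,.!?;:\n]))"


def build_pattern(markers):
    """Compile a regex that matches any marker in clause-initial position."""
    # Longest first, so a longer marker is preferred over one of its prefixes.
    alternatives = "|".join(
        re.escape(m) for m in sorted(markers, key=len, reverse=True)
    )
    return re.compile(BOUNDARY + r"\s*(" + alternatives + ")")


def count_markers(text, pattern):
    """Count clause-initial marker occurrences in one text."""
    return Counter(match.group(1) for match in pattern.finditer(text))


def per_10k(counts, text_length):
    """Normalize raw counts to occurrences per 10,000 characters."""
    if text_length == 0:
        return {}
    return {m: c * 10000 / text_length for m, c in counts.items()}


def count_by_genre(corpus, markers=MARKERS):
    """Map each genre name to a Counter of its clause-initial markers."""
    pattern = build_pattern(markers)
    return {genre: count_markers(text, pattern) for genre, text in corpus.items()}


# Tiny made-up sample standing in for the two genre corpora.
SAMPLE_CORPUS = {
    "spoken": "你知道,这件事很复杂。总之,我们要小心。",
    "written": "总之,结论成立。换句话说,假设有效。他说总之不行。",
}

GENRE_COUNTS = count_by_genre(SAMPLE_CORPUS)
```

In the sample, the last sentence of the written text contains 总之 in mid-clause, so the positional filter leaves it uncounted. A real system would need a richer filter than position alone, since the same string can serve as a discourse marker in one context and as ordinary content in another.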