Computer Engineering and Applications, 2019, Vol. 55, Issue (23): 142-149. DOI: 10.3778/j.issn.1002-8331.1809-0028

• Pattern Recognition and Artificial Intelligence •

基于词嵌入的书面语篇多层次差异探究

张学敬,吕学强,周强   

    1.北京信息科技大学 网络文化与数字传播北京市重点实验室,北京 100101
    2.北京信息科学与技术国家研究中心,北京 100084
    3.清华大学 信息技术研究院 语音和语言技术中心,北京 100084
  • 出版日期:2019-12-01 发布日期:2019-12-11

Multi-Level Difference Analysis of Written Discourse Based on Word Embedding

ZHANG Xuejing, LV Xueqiang, ZHOU Qiang   

    1.Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
    2.Beijing National Research Center for Information Science and Technology, Beijing 100084, China
    3.Center for Speech and Language Technology, Research Institute of Information Technology, Tsinghua University, Beijing 100084, China
  • Online: 2019-12-01 Published: 2019-12-11

摘要: 书面语篇包含有独白语篇和对话语篇两种类型,而独白语篇和对话语篇具有不同的描述功能和用词特点,这对基于这些语篇的不同分析任务计算建模提出了新的挑战。基于现有两种语篇标注库,采用统计分析方法,对两类语篇的不同层次功能结构差异性进行了定量分析。基于三种不同类型语料文本中自动训练得到的不同词嵌入向量,以字向量的角度初步分析了两类语篇在用词方面的不同分布特点。在此基础上针对两类语篇的4个典型分析任务,研究了不同词嵌入对深度学习模型分析性能的影响效果。实验结果表明,不同的词嵌入在不同语篇分析任务的表现能力存在明显差异,从而验证了独白语篇和对话语篇的多层次差异。

关键词: 独白语篇, 对话语篇, 词嵌入, 多层次差异分析

Abstract: Written discourse comprises two types, monologue text and dialogue text, which differ in descriptive function and vocabulary. These differences pose new challenges for the computational modeling of the analysis tasks built on such texts. Using two existing annotated discourse corpora, this paper applies statistical analysis to quantify the differences between the two text types in functional structure at multiple levels. It then uses word embeddings trained automatically on three different types of corpus text to give a preliminary analysis, from the perspective of character vectors, of how word usage is distributed differently in the two text types. On this basis, four typical analysis tasks for the two text types are used to study how different word embeddings affect the performance of deep learning models. The experimental results show that different word embeddings perform markedly differently across discourse analysis tasks, which confirms the multi-level differences between monologue text and dialogue text.

Key words: monologue text, dialogue text, word embedding, multi-level difference analysis
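
The abstract describes a statistical comparison of how characters and words are distributed in monologue and dialogue text. As a rough illustration only, and not the authors' method, the sketch below measures how far apart two character-frequency distributions are using the Jensen-Shannon divergence. The function names and the toy sentences are hypothetical and are not drawn from the paper's corpora.

```python
import math
from collections import Counter


def char_distribution(texts):
    """Return the relative frequency of each non-space character in texts."""
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution: the average of p and q over the joint vocabulary.
    m = {ch: 0.5 * (p.get(ch, 0.0) + q.get(ch, 0.0)) for ch in vocab}

    def kl(a):
        # KL divergence of a from the mixture m; zero-probability terms are skipped.
        return sum(pa * math.log2(pa / m[ch]) for ch, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy samples (assumption): stand-ins for the two text types.
monologue = ["the committee published its annual report on regional trade"]
dialogue = ["hi are you there", "yeah what's up", "ok see you soon"]

p = char_distribution(monologue)
q = char_distribution(dialogue)
score = js_divergence(p, q)
print(f"JS divergence between the toy samples: {score:.3f}")
```

With log base 2 the score lies between 0 (identical distributions) and 1 (no shared characters), so it gives a bounded, symmetric measure of difference. The same comparison could in principle be run on word-level counts instead of characters.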