Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (2): 120-126.DOI: 10.3778/j.issn.1002-8331.1809-0177


Convolutional Neural Networks Without Pooling Layer for Chinese Word Segmentation

TU Wenbo, YUAN Zhenming, YU Kai   

  1. College of Information Engineering, Hangzhou Normal University, Hangzhou 311121, China
  2. Engineering Research Center of Mobile Health Management System, Ministry of Education, Hangzhou 311121, China
  Online: 2020-01-15    Published: 2020-01-14


Abstract: In Chinese information processing, word segmentation is a very common and critical task. Usually, the first step of a Chinese Natural Language Processing(NLP) task is word segmentation, and subsequent steps operate on the segmented words. Over the years, methods for Chinese word segmentation have evolved from machine learning to deep learning. However, most models suffer, to varying degrees, from deficiencies such as overly complex architectures, heavy reliance on hand-crafted features, and poor performance on Out of Vocabulary(OOV) words. This paper proposes PCNN(Pure CNN), a Chinese word segmentation model based on Convolutional Neural Networks(CNN). The model assigns a tag to each character based on a context window of character vectors, and it has the advantages of a simple structure, no reliance on hand-crafted features, good stability, and high accuracy. Owing to the characteristics of distributed character vectors, the PCNN model needs no pooling operation: the features extracted by the convolutional layers are preserved intact, and the training speed of the model is greatly improved. Experimental results on public datasets show that the accuracy of the model reaches the level of current mainstream neural network models. Comparison experiments also verify that the network model without a pooling layer outperforms the network model with a pooling layer.
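The abstract does not include an implementation, but the architecture it describes (character vectors, stacked convolutions with no pooling layer, and per-character tag classification) can be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: the class name PCNNSegmenter, the layer sizes, the number of layers, and the B/M/E/S tag scheme are illustrative choices rather than the authors' actual configuration, and the context window is realized here through same-padded convolutions.

    # Minimal sketch of a pooling-free CNN character tagger for Chinese word
    # segmentation. Hyperparameters and the B/M/E/S tag set are assumptions
    # for illustration, not the paper's reported configuration.
    import torch
    import torch.nn as nn


    class PCNNSegmenter(nn.Module):
        """Predicts one segmentation tag (e.g. B/M/E/S) per input character."""

        def __init__(self, vocab_size, embed_dim=128, num_filters=128,
                     kernel_size=3, num_layers=3, num_tags=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            layers = []
            in_channels = embed_dim
            for _ in range(num_layers):
                # "Same" padding keeps one feature vector per character;
                # there is no pooling layer, so the convolution outputs
                # are preserved at full sequence resolution.
                layers.append(nn.Conv1d(in_channels, num_filters,
                                        kernel_size, padding=kernel_size // 2))
                layers.append(nn.ReLU())
                in_channels = num_filters
            self.convs = nn.Sequential(*layers)
            self.classifier = nn.Linear(num_filters, num_tags)

        def forward(self, char_ids):
            # char_ids: (batch, seq_len) character indices
            x = self.embed(char_ids)        # (batch, seq_len, embed_dim)
            x = x.transpose(1, 2)           # (batch, embed_dim, seq_len)
            x = self.convs(x)               # (batch, num_filters, seq_len)
            x = x.transpose(1, 2)           # (batch, seq_len, num_filters)
            return self.classifier(x)       # (batch, seq_len, num_tags)


    if __name__ == "__main__":
        model = PCNNSegmenter(vocab_size=5000)
        batch = torch.randint(0, 5000, (2, 20))  # 2 sentences, 20 chars each
        logits = model(batch)                    # (2, 20, 4) tag scores
        tags = logits.argmax(dim=-1)             # predicted tag per character

Because no pooling is applied, every character keeps its own feature vector through the convolution stack, which is what allows per-character tagging without reconstructing positions afterwards.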

Key words: Natural Language Processing(NLP), Chinese word segmentation, Convolutional Neural Networks(CNN), character vector
