Comparison study of using semantic and syntactic network characteristics to do text clustering

Computer Engineering and Applications, 2014, Vol. 50, Issue 2: 10-14.

CHEN Xinying1, LIU Haitao2

1. School of Foreign Studies, Xi’an Jiaotong University, Xi’an 710049, China
2. Center of Language-Behavior Patterns, Zhejiang University, Hangzhou 310058, China

Online: 2014-01-15 Published: 2014-01-26

Abstract

Dependency syntactic networks and dependency semantic networks were built from syntactic and semantic treebanks of six genres. The global features of these networks were compared: the number of edges, the number of nodes, the average node degree, the clustering coefficient, the average shortest path length, the network centralization, the diameter, the exponent of the power-law degree distribution, and the coefficient of determination of the power-law fit to the degree distribution. With these global features as variables, several clustering methods were applied to the syntactic and semantic networks of the six genres. The results show that, although both kinds of network are built on linguistic principles, dependency syntactic networks and dependency semantic networks differ clearly. Their parameters do not carry the same meaning, and clustering experiments based on those parameters give different results. Combinations of the main semantic network parameters yield relatively reasonable clustering results but do not separate written genres from spoken ones well. Combinations of the main syntactic network parameters distinguish texts of different genres well and yield fairly reasonable text clustering results.

Key words: genre, text clustering, network features
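
The global network features named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dependency pairs, the function names, and the choice of an undirected, unweighted word network are assumptions made for illustration. The power-law exponent and its coefficient of determination are left out, since they need a degree distribution far larger than a toy example provides.

```python
from collections import deque
from itertools import combinations


def build_network(dependencies):
    """Undirected word network: one node per word form, one edge per
    distinct governor-dependent pair (assumed construction)."""
    adj = {}
    for governor, dependent in dependencies:
        if governor == dependent:
            continue  # ignore self-loops
        adj.setdefault(governor, set()).add(dependent)
        adj.setdefault(dependent, set()).add(governor)
    return adj


def bfs_distances(adj, source):
    """Shortest path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def global_features(adj):
    """Global features of a network with at least one edge."""
    n = len(adj)
    degrees = [len(nbs) for nbs in adj.values()]
    m = sum(degrees) // 2
    # Average of local clustering coefficients (0 for nodes of degree < 2).
    local_cc = []
    for nbs in adj.values():
        k = len(nbs)
        if k < 2:
            local_cc.append(0.0)
        else:
            links = sum(1 for a, b in combinations(nbs, 2) if b in adj[a])
            local_cc.append(2.0 * links / (k * (k - 1)))
    # Path statistics over all reachable ordered pairs of distinct nodes.
    path_lengths = [d for v in adj for d in bfs_distances(adj, v).values() if d > 0]
    # Freeman degree centralization, normalised by the star-graph maximum.
    max_deg = max(degrees)
    centralization = (
        sum(max_deg - k for k in degrees) / ((n - 1) * (n - 2)) if n > 2 else 0.0
    )
    return {
        "nodes": n,
        "edges": m,
        "average_degree": 2.0 * m / n,
        "clustering_coefficient": sum(local_cc) / n,
        "average_path_length": sum(path_lengths) / len(path_lengths),
        "diameter": max(path_lengths),
        "centralization": centralization,
    }


# Hypothetical (governor, dependent) pairs from two toy sentences:
# "she reads the book" and "he reads a book".
toy_dependencies = [
    ("reads", "she"), ("reads", "book"), ("book", "the"),
    ("reads", "he"), ("reads", "book"), ("book", "a"),
]
features = global_features(build_network(toy_dependencies))
```

Computing one such feature dictionary per genre gives one vector per network; those vectors are the kind of input a clustering method would then group.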

CHEN Xinying, LIU Haitao. Comparison study of using semantic and syntactic network characteristics to do text clustering[J]. Computer Engineering and Applications, 2014, 50(2): 10-14.
