Discourse-Level Topic Segmentation Model with Multi-Level Information Enhanced Heterogeneous Graphs Network

doi:10.3778/j.issn.1002-8331.2212-0363

Abstract

Topic segmentation is a fundamental task in natural language processing that divides a text into blocks of semantically related content. Existing topic segmentation models, however, extract too little of the deep semantic information in sentences and ignore the hierarchical information and contextual interaction within a discourse. To address these problems, this paper proposes MHG-TS, a discourse-level topic segmentation model that enhances a heterogeneous graph with multi-level information. MHG-TS builds a heterogeneous graph network from the sentences and keywords of a discourse and uses the pre-trained language model BERT to capture deep semantic features of the graph nodes. At the level of the first-order neighborhood of sentence nodes, a graph attention mechanism assigns larger edge weights to semantically associated nodes, strengthening the information interaction among them. At the keyword-node level, keyword information reinforces the semantic feature representation of sentences. At the level of the high-order neighborhood of sentence nodes, keyword nodes serve as intermediaries for cross-sentence information interaction, enriching the non-sequential relationships between sentence nodes. Fusing this multi-level information yields sentence representations that contain global semantic information. Compared with the state-of-the-art model, MHG-TS improves the average of three evaluation metrics across multiple datasets by 3.08%, 2.56% and 5.92%, respectively, achieving the best experimental results.

Key words: graph attention mechanism, pre-trained language model, topic segmentation, sentence encoding
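
The graph structure described in the abstract can be illustrated with a small sketch. It is not the MHG-TS implementation: all function names are hypothetical, keywords are matched by naive whitespace tokenization (an assumption), toy vectors stand in for BERT features, and a dot-product softmax stands in for the graph attention mechanism, whose exact form the abstract does not give. The sketch only shows how sentence and keyword nodes might be linked, how keyword nodes can connect non-adjacent sentences, and how attention-style weights favor semantically closer neighbors.

```python
import math
from collections import defaultdict


def build_hetero_graph(sentences, keywords):
    """Link each sentence node to the keyword nodes it contains.

    Assumption: a keyword belongs to a sentence if it appears among the
    sentence's lowercase whitespace tokens.
    """
    sent_to_kw = defaultdict(set)
    kw_to_sent = defaultdict(set)
    for i, sent in enumerate(sentences):
        tokens = set(sent.lower().split())
        for kw in keywords:
            if kw in tokens:
                sent_to_kw[i].add(kw)
                kw_to_sent[kw].add(i)
    return sent_to_kw, kw_to_sent


def high_order_neighbors(i, sent_to_kw, kw_to_sent):
    """Sentences reachable from sentence i through a shared keyword node."""
    reached = set()
    for kw in sent_to_kw[i]:
        reached |= kw_to_sent[kw]
    reached.discard(i)  # a sentence is not its own neighbor
    return reached


def attention_weights(query, neighbors):
    """Softmax over dot-product scores (a stand-in for graph attention)."""
    scores = [sum(q * n for q, n in zip(query, vec)) for vec in neighbors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


sentences = [
    "the stock market fell sharply today",
    "heavy rain flooded the city streets",
    "investors fear the market will drop further",
]
s2k, k2s = build_hetero_graph(sentences, ["market", "rain"])
print(high_order_neighbors(0, s2k, k2s))  # {2}
```

In this toy discourse, sentences 0 and 2 are not adjacent, yet they become neighbors through the shared keyword node "market", which is the kind of non-sequential relationship the abstract refers to.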

ZHANG Yangning, ZHU Jing, DONG Rui, YOU Zeshun, WANG Zhen. Discourse-Level Topic Segmentation Model with Multi-Level Information Enhanced Heterogeneous Graphs Network[J]. Computer Engineering and Applications, 2024, 60(9): 203-211.
