Research on Keyword-Based Uyghur Single-Document Automatic Summarization

Computer Engineering and Applications, 2015, Vol. 51, Issue (16): 130-135.

• Database, Data Mining, Machine Learning •

Mahpirat Wali¹, ZHAO Mengyuan², Askar Hamdulla¹

1. Institute of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2. Research Center of Speech and Language Technology, Tsinghua University, Beijing 100086, China

Online: 2015-08-15 Published: 2015-08-14

Abstract

Abstract: The development of information technology, with the Internet as its most visible example, has made information easier to obtain than ever before, but it has also made the effective use of that information a challenge. Automatic summarization improves the efficiency of information use by automatically selecting representative sentences from a document. In recent years, automatic summarization for English and Chinese has received wide attention and made significant progress, while research on minority languages such as Uyghur remains insufficient. This paper builds an automatic summarization system for Uyghur. The document is first preprocessed using Uyghur linguistic knowledge, keywords are then extracted from it, and these keywords are used to produce an extractive summary. Two keyword extraction algorithms, one based on TF-IDF and one based on TextRank, are compared, and the results show that keywords extracted by TextRank are better suited to automatic summarization. The study demonstrates that, when Uyghur linguistic information is fully taken into account, keyword-based automatic summarization can achieve satisfactory results.

Key words: Uyghur, automatic summarization, TF-IDF algorithm, TextRank, ROUGE
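As an illustration only, the pipeline described in the abstract (keyword extraction followed by extractive sentence selection) might be sketched as follows. This is not the authors' implementation: the co-occurrence window, the damping factor, the rule that scores a sentence by the summed weight of the keywords it contains, and the function names `textrank_keywords` and `summarize` are all assumptions. The English tokens stand in for the Uyghur stems that a preprocessing step would produce.

```python
from collections import defaultdict
from itertools import combinations


def textrank_keywords(sentences, window=3, damping=0.85, iterations=50, top_k=5):
    """Rank words with TextRank over an undirected co-occurrence graph.

    `sentences` is a list of token lists. All parameter defaults are
    illustrative assumptions, not values taken from the paper.
    """
    neighbors = defaultdict(set)
    for tokens in sentences:
        for i in range(len(tokens)):
            # Link every pair of distinct words inside the sliding window.
            for a, b in combinations(tokens[i:i + window], 2):
                if a != b:
                    neighbors[a].add(b)
                    neighbors[b].add(a)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return {word: scores[word] for word in ranked[:top_k]}


def summarize(sentences, keywords, n_sentences=1):
    """Keep the sentences with the highest summed keyword weight (hypothetical rule)."""
    def weight(tokens):
        return sum(keywords.get(t, 0.0) for t in set(tokens))

    order = sorted(range(len(sentences)), key=lambda i: (-weight(sentences[i]), i))
    # Restore document order for the selected sentences.
    return [sentences[i] for i in sorted(order[:n_sentences])]


# Placeholder document: three tokenized sentences, the last one off-topic.
doc = [
    ["summary", "system", "selects", "sentences"],
    ["keywords", "guide", "summary", "sentences"],
    ["weather", "today"],
]
keywords = textrank_keywords(doc, top_k=2)
summary = summarize(doc, keywords, n_sentences=2)
print(list(keywords))
print(summary)
```

Swapping `textrank_keywords` for a TF-IDF scorer that returns the same word-to-weight mapping would leave `summarize` unchanged, which is one way to set up the kind of comparison the abstract describes.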

Mahpirat Wali, ZHAO Mengyuan, Askar Hamdulla. Keyword based Uyghur single document summarization[J]. Computer Engineering and Applications, 2015, 51(16): 130-135.

