Improved Method Study on Extracting Keywords in Chinese Judgment Documents

doi:10.3778/j.issn.1002-8331.2004-0097

Computer Engineering and Applications, 2020, Vol. 56, Issue (23): 153-160. DOI: 10.3778/j.issn.1002-8331.2004-0097

BAI Fengbo, CHANG Lin, WANG Shifan, LI Bin, WANG Yingjie, ZHOU Hong, LIU Yao

1. Institute of Evidence Law and Forensic Science, China University of Political Science and Law, Beijing 100088, China
2. Di’an Institute of Forensic Sciences in Zhejiang, Hangzhou 310000, China
3. School of Software Engineering, University of Science and Technology of China, Suzhou, Jiangsu 215000, China
4. College of Information Engineering, Dalian University, Dalian, Liaoning 116622, China
5. Institute of Forensic Sciences, Ministry of Public Security, Beijing 100038, China

Online: 2020-12-01 Published: 2020-11-30

Abstract

As China strengthens law-based governance, closer integration of fields such as natural language processing (NLP) and information retrieval (IR) with the rule of law is an inevitable trend. To give judicial workers accurate and comprehensive intelligent assistance and improve their efficiency, this paper studies keyword extraction for judgment documents. To address the weaknesses of traditional keyword extraction methods, it proposes an improved TF-IDF algorithm, the Improved Algorithm for Keyword Extraction in Forensics (IAKEF). IAKEF combines several factors (part of speech, word length, word span, word position, and the category the document belongs to), builds on the graph-based TextRank algorithm, and introduces the concepts of information entropy, dispersion degree, and fused features. It mainly addresses two shortcomings of traditional algorithms: they ignore the semantics of words, and they do not account for how information is distributed between classes and within a class. As a result, features can be selected from text more effectively. Comparative experiments are used to analyze and evaluate the improvement; the results show that the improved algorithm achieves significantly higher accuracy, recall, and F1-measure than the traditional algorithm.

Key words: improved TF-IDF, keyword extraction, information entropy, dispersion, feature fusion
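The abstract names TF-IDF as the baseline and information entropy over classes as one of the added concepts, but it does not give the IAKEF formulas. The following is a minimal sketch of those two textbook quantities only, not the authors' algorithm: classic TF-IDF, plus the Shannon entropy of a term's occurrences across document classes. The toy tokens, the class labels, and the function names `tf_idf` and `class_entropy` are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus with one class label per document.
docs = [
    ["theft", "defendant", "sentence"],
    ["theft", "property", "defendant"],
    ["contract", "defendant", "breach"],
    ["contract", "payment", "defendant"],
]
labels = ["criminal", "criminal", "civil", "civil"]


def tf_idf(docs):
    """Classic TF-IDF: tf = count / document length, idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def class_entropy(term, docs, labels):
    """Shannon entropy (in bits) of a term's occurrences across classes."""
    per_class = Counter()
    for doc, label in zip(docs, labels):
        per_class[label] += doc.count(term)
    total = sum(per_class.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1 / p) over classes where the term actually occurs.
    return sum((c / total) * math.log2(total / c) for c in per_class.values() if c > 0)


scores = tf_idf(docs)
```

In this toy corpus "defendant" appears in every document, so its TF-IDF score is zero and its occurrences are spread evenly over both classes, while "theft" occurs only in the "criminal" class and has zero entropy. How IAKEF actually combines entropy, dispersion, and the other factors is not stated in the abstract, so no combined score is sketched here.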

BAI Fengbo, CHANG Lin, WANG Shifan, LI Bin, WANG Yingjie, ZHOU Hong, LIU Yao. Improved Method Study on Extracting Keywords in Chinese Judgment Documents[J]. Computer Engineering and Applications, 2020, 56(23): 153-160.

