Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (23): 153-160. DOI: 10.3778/j.issn.1002-8331.2004-0097

• Pattern Recognition and Artificial Intelligence •

Research on an Improved Method for Keyword Extraction from Judgment Documents

BAI Fengbo, CHANG Lin, WANG Shifan, LI Bin, WANG Yingjie, ZHOU Hong, LIU Yao

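  1. Institute of Evidence Law and Forensic Science, China University of Political Science and Law, Beijing 100088
  2. Di’an Institute of Forensic Sciences in Zhejiang, Hangzhou 310000
  3. School of Software Engineering, University of Science and Technology of China, Suzhou, Jiangsu 215000
  4. College of Information Engineering, Dalian University, Dalian, Liaoning 116622
  5. Institute of Forensic Sciences, Ministry of Public Security, Beijing 100038
  • Publication date: 2020-12-01 Release date: 2020-11-30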

Improved Method Study on Extracting Keywords in Chinese Judgment Documents

BAI Fengbo, CHANG Lin, WANG Shifan, LI Bin, WANG Yingjie, ZHOU Hong, LIU Yao   

  1. Institute of Evidence Law and Forensic Science, China University of Political Science and Law, Beijing 100088, China
  2. Di’an Institute of Forensic Sciences in Zhejiang, Hangzhou 310000, China
  3. School of Software Engineering, University of Science and Technology of China, Suzhou, Jiangsu 215000, China
  4. College of Information Engineering, Dalian University, Dalian, Liaoning 116622, China
  5. Institute of Forensic Sciences, Ministry of Public Security, Beijing 100038, China
  • Online: 2020-12-01 Published: 2020-11-30

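Abstract:

Under the national policy of strengthening law-based governance, the deep integration of fields such as natural language processing (NLP) and information retrieval (IR) with a rule-of-law society is an inevitable trend. To give judicial workers accurate and comprehensive intelligent assistance and improve their efficiency, this paper studies keyword extraction methods for judgment documents. Addressing the weaknesses of traditional keyword extraction methods, it combines multiple factors, including a word's part of speech, length, span, and position and the category of the document, and, building on the graph-based TextRank algorithm, proposes an improved TF-IDF algorithm (IAKEF). The algorithm introduces the concepts of information entropy, dispersion, and fused features, and mainly addresses the traditional algorithms' neglect of word semantics and of how information is distributed between and within classes, so that features can be selected from text more effectively. Comparative experiments are used to analyze and evaluate the improved algorithm; the results show that it significantly improves accuracy, recall, and F1-Measure over the traditional algorithms.

Keywords: improved TF-IDF, keyword extraction, information entropy, dispersion, feature fusion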

Abstract:

Under the national policy of governing the country by law, combining artificial intelligence fields such as Natural Language Processing (NLP) and Information Retrieval (IR) with the needs of the rule of law is an inevitable trend. This paper studies keyword extraction methods for judicial documents, with the aim of providing accurate and comprehensive intelligent assistance that helps judicial workers improve their efficiency. To address the disadvantages of traditional keyword extraction methods, it proposes an improved TF-IDF algorithm, named Improved Algorithm for Keyword Extraction in Forensics (IAKEF). The algorithm takes into account multiple factors such as part of speech, length, word span, position, and document category, builds on the graph-based TextRank algorithm, and introduces the concepts of information entropy, dispersion degree, and fusion features. It mainly addresses two problems of traditional algorithms: their neglect of word semantics and their handling of how information is distributed between and within classes, so that features can be selected from text more effectively. Comparative experiments are used to analyze and verify the improvement; the results show that the improved algorithm significantly outperforms the traditional algorithm in accuracy, recall, and F1-Measure.

Key words: improved TF-IDF, keyword extraction, information entropy, dispersion, feature fusion
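
The abstract names the ingredients of IAKEF (TF-IDF, information entropy, class distribution) but does not give its formulas. As a rough illustration only, the sketch below computes classic TF-IDF and then scales each score by a hypothetical entropy-based factor, `1 - H/H_max`, so that terms spread evenly across document classes are down-weighted. The function names, the weighting formula, and the toy documents are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF. docs: list of token lists. Returns one {term: score} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def class_entropy(term, docs, labels):
    """Shannon entropy (bits) of a term's occurrence counts across classes."""
    per_class = Counter()
    for doc, label in zip(docs, labels):
        per_class[label] += doc.count(term)
    total = sum(per_class.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in per_class.values() if c > 0)


def entropy_weighted_scores(docs, labels):
    """Hypothetical weighting (an assumption): TF-IDF * (1 - H / H_max)."""
    n_classes = len(set(labels))
    max_entropy = math.log2(n_classes) if n_classes > 1 else 1.0
    return [
        {term: score * (1.0 - class_entropy(term, docs, labels) / max_entropy)
         for term, score in doc_scores.items()}
        for doc_scores in tf_idf(docs)
    ]


# Toy, already-tokenized documents (invented for illustration).
docs = [
    ["court", "contract", "breach", "evidence"],
    ["court", "contract", "loan", "interest"],
    ["court", "theft", "sentence", "evidence"],
    ["court", "theft", "prison", "evidence"],
]
labels = ["civil", "civil", "criminal", "criminal"]

weighted = entropy_weighted_scores(docs, labels)
ranked = sorted(weighted[0], key=weighted[0].get, reverse=True)
print(ranked)
```

In this toy setup, a term that appears in every document and is split evenly between the two classes ("court") receives no weight, while terms confined to one class keep their full TF-IDF score. A real system for Chinese judgment documents would also need word segmentation and the part-of-speech, length, span, and position features the abstract mentions.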