Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (6): 247-253. DOI: 10.3778/j.issn.1002-8331.1912-0372

• Engineering and Applications •

Research on a Semantics-Based Intelligent Classification Method for Archive Data

霍光煜,张勇,孙艳丰,尹宝才   

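  1. Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  2. Beijing Municipal Transportation Information Center, Beijing 100055, China
  • Publication date: 2021-03-15    Release date: 2021-03-12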

Research on Archive Data Intelligent Classification Based on Semantic

HUO Guangyu, ZHANG Yong, SUN Yanfeng, YIN Baocai   

  1. Multimedia and Intelligent Software Technology Laboratory, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  2. Beijing Transportation Research Center, Beijing 100055, China
  • Online: 2021-03-15    Published: 2021-03-12

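Abstract (translated from the Chinese original):

With the rapid development of information technology, the volume of digital archive data has grown explosively. How to mine and analyze archive data sensibly, and thereby improve the intelligent management of newly accessioned archives, has become a pressing problem. Existing archive classification is done manually to meet management needs; this manual approach is inefficient and ignores the inherent content of the archives. Moreover, discovering and using archival information requires further mining of the relationships among the contents of archive data. To meet the needs of intelligent archive management, this work further analyzes manually classified archives from the perspective of their text content. An LDA model extracts topic feature vectors from the documents, and the K-means algorithm then clusters these topic features to obtain the associations among archives. To classify newly accessioned archive data, a FastText deep learning model is trained in a supervised manner on the existing archive data, and the trained model then classifies new archive data fully automatically. Tests on the dataset show that the proposed clustering method improves accuracy on the document dataset by 6% over the traditional clustering algorithm based on TF-IDF features, and that the FastText-based archive classification method reaches an accuracy above 96%, a level sufficient to replace manual classification, which verifies the effectiveness and practicality of the method.

Keywords: LDA feature representation, text clustering, FastText text classification, archive management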
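
The clustering stage described in the abstract (LDA topic vectors fed to K-means) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn is available, and the toy corpus, the function name `cluster_by_topic`, the number of topics, and the number of clusters are all hypothetical. Real Chinese archive text would first need word segmentation so that tokens are separated by whitespace.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for archive documents.
docs = [
    "road traffic congestion survey report",
    "bus route traffic flow report",
    "traffic signal road maintenance plan",
    "staff salary budget annual audit",
    "annual budget finance audit summary",
    "finance staff salary payment record",
]


def cluster_by_topic(texts, n_topics=2, n_clusters=2, seed=0):
    """Return (document-topic matrix, cluster labels) for a list of texts."""
    # Bag-of-words counts are the input LDA expects.
    counts = CountVectorizer().fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    # Each row of theta is one document's distribution over topics.
    theta = lda.fit_transform(counts)
    # K-means groups documents by their topic vectors, not by raw word counts.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(theta)
    return theta, labels


theta, labels = cluster_by_topic(docs)
print(theta.round(2))
print(labels)
```

Clustering in topic space rather than on TF-IDF vectors means that documents sharing few exact words can still fall into the same group when they draw on the same topics, which is the comparison the abstract reports.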

Abstract:

With the rapid development of information technology, the volume of digital archive data has grown explosively. How to improve the intelligent management of newly collected archives through sound data mining and analysis has become an urgent problem. Existing archival data classification methods are manual and designed around management requirements; they are inefficient and ignore the inherent content of the archives. In addition, discovering and using archival information requires exploring the correlations among the contents of archive data. To meet the needs of intelligent archive management, this paper further analyzes manually classified archives from the perspective of their semantic information. The method extracts document-topic feature vectors with LDA and uses the K-means algorithm to cluster the topic features of the documents, obtaining the associations between archives. To classify newly collected archives, a FastText deep learning model is trained in a supervised manner on the existing archive data, and the trained model then classifies the new archives automatically. Results on the archive dataset show that the accuracy of the proposed clustering method is 6% higher than that of the traditional TF-IDF-based clustering algorithm. The FastText-based classification method is more accurate than the traditional classification method, with an accuracy above 96%, which reaches the level needed to replace manual classification and verifies the validity and practicality of the method.
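
The supervised stage described above, training FastText on already-labeled archives and then labeling new ones, might look roughly like the sketch below. It is an assumption-laden illustration rather than the paper's code: the helper names, the category names, and the hyperparameters (`epoch=25`, `wordNgrams=2`) are hypothetical, and the training function needs the third-party `fasttext` package. The only fixed convention is the input format, one example per line with the label prefixed by `__label__`.

```python
from pathlib import Path

LABEL_PREFIX = "__label__"  # default label prefix expected by fastText


def to_fasttext_line(label, text):
    """Format one labeled document as a fastText supervised-training line."""
    # Collapse newlines and repeated spaces: fastText reads one example per line.
    tokens = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {tokens}"


def write_training_file(samples, path):
    """Write (label, text) pairs to `path` and return the number of lines."""
    lines = [to_fasttext_line(label, text) for label, text in samples]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def train_and_predict(train_path, new_text):
    """Train a supervised model and return (label, probability) for one text."""
    # Imported lazily so the formatting helpers work without the package.
    import fasttext

    # Hyperparameters here are illustrative guesses, not values from the paper.
    model = fasttext.train_supervised(input=str(train_path), epoch=25, wordNgrams=2)
    labels, probs = model.predict(" ".join(new_text.split()))
    return labels[0].replace(LABEL_PREFIX, ""), float(probs[0])


# Hypothetical manually classified archives: (category, text).
samples = [
    ("traffic", "road traffic  congestion\nsurvey report"),
    ("finance", "annual budget finance audit"),
]
for label, text in samples:
    print(to_fasttext_line(label, text))
```

In use, `write_training_file(samples, "archives.train")` would be followed by `train_and_predict("archives.train", "bus route traffic report")`, where the file name is again a placeholder. Category labels must not contain spaces, since fastText splits each line on whitespace.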

Key words: LDA topic feature representation, text clustering, FastText text classification, archives management