Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (6): 247-253. DOI: 10.3778/j.issn.1002-8331.1912-0372


Research on Archive Data Intelligent Classification Based on Semantic

HUO Guangyu, ZHANG Yong, SUN Yanfeng, YIN Baocai   

  1. Multimedia and Intelligent Software Technology Laboratory, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  2. Beijing Transportation Research Center, Beijing 100055, China
  • Online: 2021-03-15 Published: 2021-03-12





With the rapid development of information technology, the volume of digital archive data of all kinds has grown explosively. How to improve the intelligent management of new archives through appropriate data mining and analysis has become an urgent problem. Existing archival data classification methods are designed around management requirements; such manual classification is inefficient and ignores the inherent content of the archives. In addition, the correlations among the contents of archive data need to be explored on the basis of archive information. To meet the needs of intelligent archive management, this paper further analyzes manually classified archives from the perspective of semantic information. The method extracts document-topic feature vectors based on LDA and uses the K-means algorithm to cluster the topic features of the documents, obtaining the associations between archives. To classify newly added archives, the existing archive data are used for supervised training of a FastText deep learning model, and newly collected archives are then classified automatically with the trained model. The results on the archives dataset show that the accuracy of the proposed clustering method is 6% higher than that of the traditional TF-IDF-based clustering algorithm. The classification method is also more accurate than traditional classification methods, with an accuracy above 96%, which is high enough to replace manual classification and verifies the validity and practicality of the method.

Key words: LDA topic feature representation, text clustering, FastText text classification, archives management
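
The two stages named in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the LDA step has already produced one document-topic vector per archive (the values in `doc_topics` are made up for illustration), clusters those vectors with a plain K-means that is seeded deterministically from the first k vectors, and formats a labeled archive as a line in the `__label__` convention that fastText's supervised mode reads by default. The helper names `kmeans` and `to_fasttext_line` and the category name `personnel` are hypothetical.

```python
import math


def kmeans(vectors, k, iters=100):
    """Plain K-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [-1] * len(vectors)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def to_fasttext_line(label, tokens):
    """Format one labeled document as a fastText supervised training line."""
    return "__label__" + label + " " + " ".join(tokens)


# Hypothetical document-topic vectors (rows: archives, columns: topic weights).
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.80, 0.15, 0.05],
]

labels, centroids = kmeans(doc_topics, k=2)
print(labels)
print(to_fasttext_line("personnel", ["staff", "transfer", "record"]))
```

Archives dominated by the same topic end up in the same cluster, which is the kind of association between archives the abstract refers to. In practice the topic vectors would come from a fitted LDA model, and the formatted lines would be written to a file and passed to a fastText trainer; both steps are outside this sketch.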


