Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (24): 277-288. DOI: 10.3778/j.issn.1002-8331.2209-0402

• Big Data and Cloud Computing •

Research on Code Classification Based on CodeBERT

成思强,刘建勋,彭珍连,曹奔   

  1. 湖南科技大学 计算机科学与工程学院,湖南 湘潭 411201
  2. 湖南科技大学 服务计算与软件服务新技术湖南重点实验室,湖南 湘潭 411201
  • Publication date: 2023-12-15    Release date: 2023-12-15

CodeBERT Based Code Classification Method

CHENG Siqiang, LIU Jianxun, PENG Zhenlian, CAO Ben   

  1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
  2. Key Laboratory For Services Computing and Novel Software Technology, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
  • Online: 2023-12-15    Published: 2023-12-15

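Abstract: As code big data continues to develop, the volume of source code in code repositories keeps growing. Classifying and managing that code quickly and effectively is of great importance to the development of software engineering. This paper introduces pre-trained models into code classification research for the first time and proposes an optimized code classification method, CBBCC. CBBCC preprocesses source code with WordPiece, uses the pre-trained CodeBERT model to extract feature representations of the source code, and then fine-tunes the pre-trained model on the classification task. To verify the effectiveness of the proposed model, experiments are conducted on the POJ104 dataset. The results show that, compared with seven baseline models, CBBCC exceeds 98% on every classification metric. Its accuracy is 1.1 percentage points higher than that of the best existing model, reaching the SOTA value for the classification task on the POJ104 code classification dataset. CBBCC can label code effectively, improve the management of source code in open-source communities, and promote the development of the software engineering field.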

Key words: code classification, code representation, CodeBERT, transfer learning, code snippets

Abstract: With the continuous development of code big data, the amount of source code in code repositories keeps growing, which makes software code management more complex. Classifying and managing this code quickly and effectively is of great importance to the development of software engineering. This paper introduces pre-trained models into code classification research for the first time and proposes an optimized code classification method, CBBCC. First, CBBCC uses WordPiece to preprocess the source code. Second, it uses the pre-trained CodeBERT model to extract feature representations of the source code. Finally, it fine-tunes the pre-trained model on the classification task. To verify the effectiveness of the proposed model, experiments are conducted on the POJ104 dataset. The results show that, compared with seven baseline models, CBBCC scores above 98% on every classification metric. Its accuracy is 1.1 percentage points higher than that of the best existing model, which is the state-of-the-art (SOTA) result for the classification task on the POJ104 code classification dataset. CBBCC can annotate code effectively, improve the management of source code in open-source communities, and promote the development of the software engineering field.

Key words: code classification, code representation, CodeBERT, transfer learning, code snippets
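
The abstract names WordPiece as the preprocessing step of CBBCC. As a rough illustration only, the sketch below shows the general greedy longest-match-first idea behind WordPiece on a code snippet. It is not the authors' implementation: `TOY_VOCAB`, `wordpiece`, and `tokenize_code` are hypothetical names, the vocabulary is a made-up toy set, and the CodeBERT feature extraction and fine-tuning steps are not shown because they require the pre-trained model.

```python
from typing import List, Set

# Hypothetical toy vocabulary (assumption): real WordPiece vocabularies are
# learned from a corpus and contain tens of thousands of entries.
TOY_VOCAB: Set[str] = {
    "[UNK]", "int", "main", "return", "count", "##er", "##s",
    "(", ")", "{", "}", ";", "0",
}


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize_code(code: str, vocab: Set[str]) -> List[str]:
    """Isolate a few punctuation symbols, split on whitespace, apply WordPiece."""
    for symbol in "(){};":
        code = code.replace(symbol, f" {symbol} ")
    tokens: List[str] = []
    for word in code.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens
```

With this toy vocabulary, an identifier such as `counters` is split into the pieces `count`, `##er`, and `##s`, while a word with no matching pieces maps to `[UNK]`. Sub-word splitting of this kind keeps the vocabulary bounded even though programmers invent identifiers freely, which is one plausible reason to use it before feeding code to a pre-trained model.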