CodeBERT Based Code Classification Method

doi:10.3778/j.issn.1002-8331.2209-0402

Abstract

Abstract: As code big data continues to grow, the volume of source code held in code repositories keeps increasing, which makes software code management more complex. Classifying and managing this code quickly and effectively is therefore important for software engineering. This paper introduces pre-trained models into code classification research for the first time and proposes an optimized code classification method, CBBCC. CBBCC first preprocesses the source code with WordPiece tokenization, then uses the pre-trained CodeBERT model to produce feature representations of the code, and finally fine-tunes the pre-trained model on the classification task. To verify the effectiveness of the proposed model, experiments are conducted on the POJ104 dataset. In a comparison against seven baseline models, CBBCC scores above 98% on every classification metric. Its accuracy is 1.1 percentage points higher than that of the best existing model, setting a new state of the art for classification on the POJ104 code classification dataset. CBBCC can label code effectively, improve the management of source code in open source communities, and support the development of the software engineering field.

Key words: code classification, code representation, CodeBERT, transfer learning, code snippets
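
The abstract names WordPiece as the preprocessing step but does not show it. The following is a minimal sketch of greedy longest-match-first WordPiece segmentation, which is how that scheme is commonly described. The function names and the toy vocabulary are hypothetical and chosen only for illustration; they are not CBBCC's or CodeBERT's actual tokenizer or vocabulary.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: map the whole word to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize_code(code: str, vocab: Set[str]) -> List[str]:
    """Whitespace-split a code snippet, then segment each word into pieces."""
    tokens: List[str] = []
    for word in code.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical toy vocabulary (an assumption for this example only).
TOY_VOCAB = {"int", "return", "get", "##max", "##value"}

print(wordpiece_tokenize("getmaxvalue", TOY_VOCAB))  # ['get', '##max', '##value']
print(tokenize_code("return getmax", TOY_VOCAB))     # ['return', 'get', '##max']
```

Splitting identifiers such as `getmaxvalue` into known pieces keeps the vocabulary small while still covering names that never appeared whole in the training data. In a pipeline like the one the abstract describes, the resulting token sequence would then be passed to the pre-trained encoder.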

CHENG Siqiang, LIU Jianxun, PENG Zhenlian, CAO Ben. CodeBERT Based Code Classification Method[J]. Computer Engineering and Applications, 2023, 59(24): 277-288.
