Computer Science ›› 2020, Vol. 47 ›› Issue (9): 10-16. doi: 10.11896/jsjkx.200400041

• Computer Software

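Cross-project Prediction of Clone Code Consistency Maintenance Requirements Based on Transfer Learning and Oversampling Technology

OUYANG Peng1, LU Lu1,2, ZHANG Fan-long3, QIU Shao-jian4

  1. 1 School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641
    2 Meizhou Technology Research Institute, South China University of Technology, Meizhou, Guangdong 514021
    3 School of Computers, Guangdong University of Technology, Guangzhou 510006
    4 School of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642
  • Received: 2020-04-09 Published: 2020-09-10
  • Corresponding author: LU Lu (lul@scut.edu.cn)
  • About author: 939956752@qq.com
  • Supported by:
    National Natural Science Foundation of China (61370103); Industry-University-Research Foundation of Guangzhou (201902020004); Industry-University-Research Project of Meizhou (2019A0101019)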

Cross-project Clone Consistency Prediction via Transfer Learning and Oversampling Technology

OUYANG Peng1, LU Lu1,2, ZHANG Fan-long3, QIU Shao-jian4   

  1. 1 School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, China
    2 Technology Research Institute, South China University of Technology, Meizhou, Guangdong 514021, China
    3 School of Computers, Guangdong University of Technology, Guangzhou 510006, China
    4 School of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
  • Received: 2020-04-09 Published: 2020-09-10
  • About author: OUYANG Peng, born in 1996, postgraduate. His main research interests include software reliability maintenance and transfer learning.
    LU Lu, born in 1971, Ph.D., professor, is a member of China Computer Federation. His main research interests include software engineering, software testing and software architecture design.
  • Supported by:
    National Natural Science Foundation of China (61370103), Industry-University-Research Foundation of Guangzhou (201902020004) and Industry-University-Research Project of Meizhou (2019A0101019).

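Abstract (translated from the Chinese original): In recent years, as demand for software has kept growing, developers have introduced large amounts of clone code into projects by reusing existing code. As software versions iterate and are updated, clone code changes; these changes incur additional maintenance cost and gradually become a burden on software maintenance. Researchers have tried to use machine learning to predict clone code consistency maintenance requirements: predicting whether a change to clone code will incur additional maintenance cost helps software quality assurance teams allocate maintenance resources more effectively, which improves efficiency and lowers operation and maintenance costs. In the early stages of software development, however, a project has usually not evolved enough and lacks the historical data needed to build an effective prediction model, which is why cross-project prediction of clone code consistency maintenance requirements was proposed. Taking the reduction of cross-project data distribution differences as its starting point, this paper proposes CPCCP+, a cross-project method for predicting clone code consistency maintenance requirements based on transfer learning and oversampling. The method maps the test set and the training set into a kernel space, reduces the distribution difference of cross-project data through transfer component analysis, and handles the class imbalance of the datasets, thereby improving the performance of the cross-project prediction model. For the experiments, seven open source datasets were selected, forming 42 cross-project prediction tasks in total. The proposed method is compared with methods that use only the base classifiers, with Precision, Recall and F-Measure as the evaluation metrics. The experimental results show that CPCCP+ predicts cross-project clone code consistency maintenance requirements more effectively.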
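The experimental setup in the abstract can be illustrated with a minimal sketch. Seven projects give 7 × 6 = 42 ordered (source, target) pairs, and each task is scored with precision, recall and F-measure. The project names and the helper `precision_recall_f` below are hypothetical placeholders, since the abstract does not name the datasets or the tooling used.

```python
from itertools import permutations

# Hypothetical project names; the abstract does not list the seven datasets.
projects = [f"project_{i}" for i in range(1, 8)]

# Each ordered (source, target) pair is one cross-project task:
# train on the source project, predict on the target project.
tasks = list(permutations(projects, 2))


def precision_recall_f(y_true, y_pred, positive=1):
    """Return (precision, recall, F-measure) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Ordered pairs are used because training on project A and testing on B is a different task from training on B and testing on A, which is what makes seven datasets yield 42 tasks rather than 21.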

关键词: 过采样技术, 克隆代码, 跨项目预测, 迁移学习, 一致性变化

Abstract: In recent years, as software requirements have grown, developers have introduced large amounts of clone code into projects by reusing existing code. As software versions are updated, clone code changes, and these changes can become a burden on software maintenance. Researchers have applied machine learning to clone code consistency prediction: by predicting whether a change to cloned code will incur additional maintenance cost, such models help software quality assurance teams allocate maintenance resources more effectively, improving work efficiency and reducing maintenance costs. However, in the early stages of software development, a project has often not evolved enough to provide the historical data needed to build an effective prediction model. Cross-project clone code consistency prediction methods have therefore been proposed. This paper proposes CPCCP+, a cross-project clone code consistency prediction method based on transfer learning and oversampling technology. The method maps the test set and training set into a kernel space, reduces the distribution discrepancy of cross-project data through transfer component analysis, and alleviates the class imbalance issue in order to improve the performance of the cross-project prediction model. The experiments use seven open source datasets, which form 42 cross-project clone code consistency prediction tasks in total. CPCCP+ is compared with a method that uses only the base classifier, with precision, recall and F-measure as the evaluation metrics. The experimental results show that CPCCP+ performs cross-project clone code consistency prediction more effectively.
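The two preprocessing steps named in the abstract can be sketched roughly as follows. This is a generic illustration and not the authors' implementation: it assumes NumPy, an RBF kernel, the standard transfer component analysis formulation (leading eigenvectors of (KLK + μI)⁻¹KHK), and plain random oversampling as a stand-in for whichever oversampling technique the paper uses. The function names and the parameters `gamma`, `mu` and `n_components` are illustrative choices.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def tca(Xs, Xt, n_components=2, mu=1.0, gamma=1.0):
    """Embed source and target samples in a shared low-dimensional space."""
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    K = rbf_kernel(np.vstack([Xs, Xt]), np.vstack([Xs, Xt]), gamma)
    # L encodes the maximum mean discrepancy between the two domains.
    e = np.vstack([np.full((ns, 1), 1.0 / ns), np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T
    # H is the centering matrix that preserves data variance.
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(M)
    top = np.argsort(-vals.real)[:n_components]
    Z = K @ vecs[:, top].real
    return Z[:ns], Z[ns:]


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    idx = [np.arange(len(y))]
    for c, k in zip(classes, counts):
        if k < counts.max():
            members = np.flatnonzero(y == c)
            idx.append(rng.choice(members, size=counts.max() - k, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

In a pipeline of this shape, the source project would be embedded together with the target project, the embedded source data would be rebalanced, and a base classifier would then be trained on it and applied to the embedded target data. Only the labeled source data is oversampled here, because the target labels are what the model is meant to predict.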

Key words: Code clone, Consistent change, Cross-project prediction, Oversampling technology, Transfer learning

CLC Number:

  • TP311