Computer Science ›› 2024, Vol. 51 ›› Issue (12): 79-86. doi: 10.11896/jsjkx.231200100

• Computer Software •

Ensemble Learning Based Open Source License Detection and Compatibility Assessment

BAI Jianghao, PIAO Yong

  1. School of Software Engineering, Dalian University of Technology, Dalian, Liaoning 116620, China
  • Received: 2023-12-15 Revised: 2024-05-06 Online: 2024-12-15 Published: 2024-12-10
  • Corresponding author: PIAO Yong (piaoy@dlut.edu.cn)
  • Author contact: ssdutbjh@163.com
  • About author: BAI Jianghao, born in 1999, postgraduate. His main research interests include natural language processing.
    PIAO Yong, born in 1975, Ph.D., associate professor, is a member of CCF (No. E3677M). His main research interests include data mining and intelligent computing.

Abstract: The security and reliability of the software supply chain strongly influence software quality and evolution, and license analysis of software components is an indispensable part of that chain. Open source licenses constrain the conditions under which open source software may be used, protecting intellectual property and sustaining the long-term development of open source projects. To avoid legal risk and property loss, it is essential to identify open source software licenses and to judge the compatibility between them. This paper proposes an ensemble-learning-based method for detecting open source licenses and a compatibility-based method for recommending licenses. Specifically, license detection relies primarily on ensemble learning built on large language models, supplemented by rule matching; compatibility judgment and license recommendation are then carried out according to user requirements using directed graph algorithms. Experiments show that, compared with traditional methods, the proposed method achieves better detection performance with lower maintenance cost and high extensibility, and that it can effectively detect compatibility and produce recommendations.

Key words: Large language model, Ensemble learning, Open source license, Sentence vector similarity, Compatibility assessment
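
The abstract says that compatibility judgment and license recommendation use directed graph algorithms, but it gives no further detail. The following is a minimal sketch of one way such a check could work, and it is not the authors' implementation. The edge table `COMPATIBLE_INTO` and the function names `reachable`, `is_compatible`, and `recommend` are hypothetical. The edges are a simplified illustrative chain, not a legal reference and not taken from the paper.

```python
from collections import deque

# ASSUMPTION: simplified, illustrative one-way compatibility edges.
# An edge A -> B means "code under A may be combined into a work
# released under B". Not taken from the paper; not legal advice.
COMPATIBLE_INTO = {
    "MIT": ["BSD-3-Clause"],
    "BSD-3-Clause": ["Apache-2.0"],
    "Apache-2.0": ["LGPL-3.0"],
    "LGPL-3.0": ["GPL-3.0"],
    "GPL-3.0": ["AGPL-3.0"],
    "AGPL-3.0": [],
}


def reachable(license_id):
    """Return every license reachable from license_id, including itself."""
    seen = {license_id}
    queue = deque([license_id])
    while queue:
        current = queue.popleft()
        for nxt in COMPATIBLE_INTO.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_compatible(source, target):
    """True if a directed path leads from source to target."""
    return target in reachable(source)


def recommend(component_licenses):
    """Return licenses reachable from every component license, sorted."""
    candidates = None
    for lic in component_licenses:
        reach = reachable(lic)
        candidates = reach if candidates is None else candidates & reach
    return sorted(candidates or [])


if __name__ == "__main__":
    print(is_compatible("MIT", "GPL-3.0"))
    print(is_compatible("GPL-3.0", "MIT"))
    print(recommend(["MIT", "Apache-2.0"]))
```

In this sketch, compatibility is one-directional reachability, and a recommended project license is any license that every component license can reach. A real system would also need to account for license versions, exceptions, and the way components are combined.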

CLC Number:

  • TP391