Classification of Unreproducible Build Causes Based on Log Information

doi:10.11896/jsjkx.220300227

Abstract

Abstract: Reproducible build is the ability to recreate binary artifacts in a predefined build environment. Because reproducible builds help secure the software build environment and improve the efficiency of software building and distribution, many open source software repositories (such as Debian) have put reproducible builds into practice. However, owing to the lack of sufficient diagnostic information and the complexity and diversity of source files, determining why a piece of software cannot be built reproducibly remains time-consuming and laborious. To address this problem, this paper studies machine-learning-based classification of the causes of unreproducible builds. Four typical causes are considered: timestamps, file ordering, randomness, and locale. The method represents text logs with word vectors generated by word2vec and then trains a logistic regression model on a text corpus that merges the difference log and the build log, so as to classify the causes of unreproducible builds automatically. The algorithm is implemented and evaluated on 671 unreproducible Debian packages. Experimental results show that the method achieves a macro-average precision of 80.75% and a macro-average recall of 86.07%, outperforming other commonly used machine learning algorithms. The relevance and importance of the difference log and the build log are also analyzed; the results indicate that both are essential for classifying the causes of unreproducible builds, and neither can be omitted. The method provides a reliable research basis for the automatic classification of unreproducible build causes.

Key words: Reproducible build, Cause classification, Difference log, Build log, Machine learning
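
The abstract reports macro-average precision and recall over the four cause classes. As a minimal sketch of how such figures are computed, the snippet below averages per-class precision and recall with equal weight per class. It assumes each package carries a single cause label, which the abstract does not state; the function name `macro_precision_recall`, the label strings, and the toy labels are hypothetical and are not taken from the paper's data or results.

```python
from collections import Counter

# Hypothetical label names for the four causes named in the abstract.
CAUSES = ["timestamp", "file_ordering", "randomness", "locale"]


def macro_precision_recall(y_true, y_pred, labels):
    """Return (macro precision, macro recall) over `labels`.

    Per-class precision and recall are computed first and then averaged
    with equal weight, so rare classes count as much as frequent ones.
    A class with no predictions (or no true samples) contributes 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls = [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precisions.append(tp[label] / predicted if predicted else 0.0)
        recalls.append(tp[label] / actual if actual else 0.0)

    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy labels for illustration only (assumption: one cause per package).
y_true = ["timestamp", "timestamp", "timestamp",
          "file_ordering", "randomness", "locale"]
y_pred = ["timestamp", "timestamp", "file_ordering",
          "file_ordering", "randomness", "timestamp"]

macro_p, macro_r = macro_precision_recall(y_true, y_pred, CAUSES)
print(f"macro precision = {macro_p:.4f}, macro recall = {macro_r:.4f}")
```

Because every class is weighted equally, a class that is never predicted correctly (here `locale`) pulls both macro figures down sharply even when it is rare, which is why macro averages suit class-imbalanced cause labels.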

CLC Number: TP311

MA Zhao, LIU Dong, REN Zhi-lei, JIANG He. Classification of Unreproducible Build Causes Based on Log Information[J]. Computer Science, 2022, 49(12): 109-117. https://doi.org/10.11896/jsjkx.220300227

