Computer Science, 2022, Vol. 49, Issue (12): 109-117. doi: 10.11896/jsjkx.220300227

• Computer Software •

Classification of Unreproducible Build Causes Based on Log Information

MA Zhao, LIU Dong, REN Zhi-lei, JIANG He   

  1. School of Software Engineering, Dalian University of Technology, Dalian, Liaoning 116620, China
  • Received: 2022-03-23 Revised: 2022-05-31 Published: 2022-12-14
  • About author: MA Zhao, born in 1996, postgraduate. His main research interests include reproducible builds. JIANG He, born in 1980, professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include system software and software engineering.

Abstract: Reproducible build refers to the ability to recreate binary artifacts in a predefined build environment. Because reproducible builds help secure the software build environment and improve the efficiency of building and distributing software, many open source software repositories (such as Debian) have adopted reproducible build practices. However, owing to insufficient diagnostic information and the complexity and diversity of source files, determining why a piece of software cannot be built reproducibly remains time-consuming and laborious. To address this challenge, this paper studies the classification and detection of the causes of unreproducible builds based on machine learning. It considers four typical causes of unreproducible builds: timestamp, fileordering, randomness and locale. The method represents the text logs with word vectors generated by word2vec and then trains a logistic regression model on a text corpus that combines the difference log and the build log, so as to classify the causes of unreproducible builds automatically. The algorithm is implemented and tested on 671 unreproducible Debian software packages. Experimental results show that the method achieves a macro-average precision of 80.75% and a macro-average recall of 86.07%, both better than those of other commonly used machine learning algorithms. In addition, the relevance and importance of the difference log and the build log are analyzed. The results indicate that both logs are significant for classifying the causes of unreproducible builds. This method provides a reliable research basis for the automatic classification of unreproducible build causes.
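The macro-average precision and recall reported above are per-cause scores averaged with equal weight across the four causes. The following minimal sketch shows how such figures can be computed. It is not the authors' evaluation code: the function name, the representation of each package as a set of cause labels, the convention that an undefined score counts as 0, and the toy data are all assumptions made for illustration.

```python
# The four cause labels named in the abstract.
CAUSES = ["timestamp", "fileordering", "randomness", "locale"]


def macro_precision_recall(true_sets, pred_sets, labels=CAUSES):
    """Return (macro precision, macro recall) over the given labels.

    true_sets and pred_sets are parallel lists; each element is the set of
    causes attached to one package (hypothetical representation).
    """
    precisions, recalls = [], []
    pairs = list(zip(true_sets, pred_sets))
    for cause in labels:
        tp = sum(1 for t, p in pairs if cause in t and cause in p)
        fp = sum(1 for t, p in pairs if cause not in t and cause in p)
        fn = sum(1 for t, p in pairs if cause in t and cause not in p)
        # Assumed convention: a score with a zero denominator counts as 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    # Macro averaging gives every cause the same weight, however rare it is.
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy data, invented for illustration only.
true_causes = [{"timestamp"}, {"timestamp", "locale"}, {"fileordering"}, {"randomness"}]
pred_causes = [{"timestamp"}, {"timestamp"}, {"fileordering", "timestamp"}, {"locale"}]

precision, recall = macro_precision_recall(true_causes, pred_causes)
print(f"macro precision = {precision:.3f}, macro recall = {recall:.3f}")
```

Because every cause contributes equally to the average, a rare cause that is classified poorly lowers the macro score as much as a common one would.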

Key words: Reproducible build, Cause classification, Difference log, Build log, Machine learning

CLC Number: TP311