Computer Science, 2026, Vol. 53, Issue (6): 396-407. doi: 10.11896/jsjkx.250700114

• Computer Software •

How to Filter Collaborative Development Projects from Open-source Communities: Exploratory Study on GitHub

DONG Guojun1, CHENG Can2,3,4, FANG Qing5, YOU Lan1, WANG Wei6, PENG Qingxi7   

  1. School of Computer Science, Hubei University, Wuhan 430062, China
  2. School of Artificial Intelligence, Hubei University, Wuhan 430062, China
  3. Key Laboratory of Intelligent Sensing System and Security, Ministry of Education (Hubei University), Wuhan 430062, China
  4. State Key Laboratory for Novel Software Technology at Nanjing University, Nanjing 210023, China
  5. Normal College, Jingchu University of Technology, Jingmen, Hubei 448000, China
  6. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  7. School of Information Engineering, Wuhan College, Wuhan 430212, China
  • Received:2025-07-21 Revised:2025-10-15 Online:2026-06-15 Published:2026-06-09
  • About author: DONG Guojun, born in 2000, postgraduate, is a student member of CCF (No.A11808G). His main research interests include open-source software ecosystem and large language models.
    YOU Lan, born in 1978, Ph.D, professor, is a senior member of CCF (No.H8967M). Her main research interests include spatio-temporal big data, natural language processing, and social computing.
  • Supported by:
    Hubei Provincial Natural Science Foundation Youth Program (2023AFB374), Open Subjects of State Key Laboratory for Novel Software Technology at Nanjing University (KFKT2025B48), Hubei Provincial Department of Education Philosophy and Social Sciences Research Project: Youth Program (24Q166), Technology Innovation Special Program of Hubei Province (2025BEB028, 2024BAB034) and Hubei Provincial University Excellent Young and Middle-aged Scientific and Technological Innovation Team (T2022055).

Abstract: Collaborative development projects (CDPs) and engineering development projects (EDPs) are two typical project models in the open-source software (OSS) ecosystem, reflecting a project's development status and engineering maturity. For developers, determining whether a project is a CDP is often difficult because no clear boundary defines its collaborative nature, while identifying an EDP requires access to sufficient valuable development information. For researchers, ignoring the CDP/EDP distinction during sample selection contaminates the sample pool with many projects that are not development-oriented or are only unintentionally collaborative, which diminishes the validity of research findings. Current research lacks automated screening methods for these two project types. To address this gap, this study constructs a standardized dataset and uses it to compare the screening performance of 50 combinations of methods and features, along with 24 combinations of machine learning algorithms and feature sets, giving researchers an efficient model for targeting relevant projects. The findings reveal: 1) For scenarios demanding high Precision, baseline methods excel, achieving Precision scores of 0.900 and 0.880 when screening CDPs and EDPs, respectively; 2) For scenarios prioritizing high F1-Score, machine learning methods perform best, yielding F1-Scores of 0.879 and 0.821 for CDP and EDP screening, respectively; 3) Large language model methods achieve Precision scores of 0.691 and 0.569 for CDP and EDP screening, respectively; 4) Integrating machine learning methods with existing screening approaches improves Precision by 4.56% to 42.8% for CDP screening and by 5.9% to 237.5% for EDP screening.
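The trade-off between findings 1) and 2) can be illustrated with a minimal Python sketch of the two metrics. The function names and all counts below are hypothetical and are not taken from the paper's dataset. The sketch also assumes that the percentages in finding 4) are relative gains over the existing approach, which the abstract does not state explicitly.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def relative_improvement(before: float, after: float) -> float:
    """Relative gain in percent (assumed reading of the reported improvements)."""
    return (after - before) / before * 100.0


# Hypothetical counts for 100 true CDPs in a candidate pool (illustration only).
# A conservative filter keeps few projects: high Precision, low recall.
strict = precision_recall_f1(tp=45, fp=5, fn=55)
# A broader classifier keeps more projects: lower Precision, higher F1.
balanced = precision_recall_f1(tp=80, fp=20, fn=20)

print("strict   P=%.2f R=%.2f F1=%.2f" % strict)
print("balanced P=%.2f R=%.2f F1=%.2f" % balanced)
```

With these made-up counts, the conservative filter reaches a Precision of 0.90 but an F1-Score of only 0.60, while the broader classifier scores 0.80 on both, which is the kind of difference that separates a Precision-first scenario from an F1-first one.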

Key words: Data mining, Open-source software, GitHub

CLC Number: 

  • TP311