Computer Science ›› 2024, Vol. 51 ›› Issue (10): 187-195. doi: 10.11896/jsjkx.230900071

• Computer Software

Data Mining and Information Service for Open Collaboration Digital Ecosystem

XIA Xiaoya1, ZHAO Shengyu2, HAN Fanyu1, BI Fenglin1, WANG Wei1, ZHOU Xuan1, ZHOU Aoying1   

  1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  2. School of Electronic and Information Engineering, Tongji University, Shanghai 200070, China
  • Received: 2023-09-13 Revised: 2024-03-12 Online: 2024-10-15 Published: 2024-10-11
  • About author: XIA Xiaoya, born in 1997, Ph.D. candidate. Her main research interests include mining software repositories and open source software ecosystem governance.
    WANG Wei, born in 1979, Ph.D., professor. His main research interests include open source measurements and computational education.
  • Supported by:
    National Natural Science Foundation of China (62137001) and Digital Transformation Innovation Research Project of Shanghai Municipal Education Commission (40400-22201).

Abstract: The large-scale development and proliferation of open source software has created an ecosystem for open source development and collaboration. Within this ecosystem, individuals and organizations collaboratively develop high-quality software that is accessible to all. Social collaboration platforms, represented by GitHub, have further facilitated large-scale, distributed, and fine-grained code collaboration and technical socialization. Every day, countless developers submit code, review code, report bugs, or propose new features on these platforms. The fully open collaborative development process thus produces a vast amount of behavioral data of immense value. This paper designs and implements OpenDigger, a one-stop data mining system for the open source collaboration digital ecosystem. Its goal is to build data infrastructure for the open source field and promote the continuous development of the open source ecosystem. The OpenDigger system consists primarily of a data collection module, a storage module, a tag data module, and an information service module, and is built upon an OLAP columnar database and a graph database. The system continuously collects data from multiple sources within the open source ecosystem and provides various types of open source information services to different user groups through a unified interface. Additionally, OpenDigger mines key information from the open source digital ecosystem from the perspective of collaborative relationship networks. Compared with traditional statistical indicators, the collaborative network perspective better illustrates the association characteristics between open source projects and developers.

Key words: Open source ecosystem, Open collaboration, Data mining, Information system, Graph analysis

CLC Number: 

  • TP391
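
As an illustration of the contrast the abstract draws between traditional statistical indicators and a collaborative network view, the following minimal sketch builds a developer collaboration network from a toy event log. It is not OpenDigger's implementation: the event records, the function names, and the edge-weight rule (the smaller of two developers' event counts in a shared repository) are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical event log: (developer, repository, number of events).
EVENTS = [
    ("alice", "repo-a", 5),
    ("bob", "repo-a", 3),
    ("bob", "repo-b", 2),
    ("carol", "repo-b", 4),
    ("dave", "repo-c", 9),
]


def activity_counts(events):
    """Traditional statistical indicator: total events per developer."""
    totals = defaultdict(int)
    for dev, _repo, n in events:
        totals[dev] += n
    return dict(totals)


def collaboration_network(events):
    """Project the developer-repository bipartite graph onto developers.

    Two developers are linked when both are active in the same repository.
    The edge weight sums, over shared repositories, the smaller of the two
    event counts (an illustrative choice, not taken from the paper).
    """
    by_repo = defaultdict(dict)
    for dev, repo, n in events:
        by_repo[repo][dev] = by_repo[repo].get(dev, 0) + n
    edges = defaultdict(int)
    for devs in by_repo.values():
        for a, b in combinations(sorted(devs), 2):
            edges[(a, b)] += min(devs[a], devs[b])
    return dict(edges)


def weighted_degree(edges):
    """Sum of incident edge weights for each developer that has an edge."""
    degree = defaultdict(int)
    for (a, b), w in edges.items():
        degree[a] += w
        degree[b] += w
    return dict(degree)


counts = activity_counts(EVENTS)
edges = collaboration_network(EVENTS)
degree = weighted_degree(edges)
```

In this toy data, dave has the highest raw activity count but no collaboration edges, while bob connects two repositories. A per-developer count alone cannot show that kind of association.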