Computer Science, 2024, Vol. 51, Issue (10): 187-195. doi: 10.11896/jsjkx.230900071

• Computer Software •

Data Mining and Information Service for Open Collaboration Digital Ecosystem

XIA Xiaoya1, ZHAO Shengyu2, HAN Fanyu1, BI Fenglin1, WANG Wei1, ZHOU Xuan1, ZHOU Aoying1

  1 School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
  2 School of Electronic and Information Engineering, Tongji University, Shanghai 200070, China
  • Received: 2023-09-13 Revised: 2024-03-12 Online: 2024-10-15 Published: 2024-10-11
  • Corresponding author: WANG Wei (wwang@dase.ecnu.edu.cn)
  • About author: XIA Xiaoya (xiaoya@stu.ecnu.edu.cn), born in 1997, Ph.D candidate. Her main research interests include mining software repositories and open source software ecosystem governance.
    WANG Wei, born in 1979, Ph.D, professor. His main research interests include open source measurements and computational education.
  • Supported by:
    National Natural Science Foundation of China (62137001) and Digital Transformation Innovation Research Project of Shanghai Municipal Education Commission (40400-22201).

Abstract: The large-scale development and adoption of open source software has created an ecosystem for open source development and collaboration, in which individuals and organizations jointly develop high-quality software that anyone can use. Social collaboration platforms, represented by GitHub, have further facilitated large-scale, distributed, and fine-grained code collaboration and technical socialization. Countless developers submit code, review code, report bugs, or propose new feature requests on these platforms every day. These activities produce a vast amount of behavioral data from a fully open collaborative development process, and how to mine valuable information from this data is a current research challenge. This paper designs and implements OpenDigger, a one-stop data mining system for the open source collaboration digital ecosystem. Its goal is to build data infrastructure for the open source field and promote the sustained development of the open source ecosystem. OpenDigger consists primarily of a data collection service, a data storage module, a tag data module, and an information service module. Built upon an OLAP columnar database and a graph database, it continuously collects open source ecosystem data from multiple sources and provides various open source information services to different user groups through a unified interface. OpenDigger mines key information in the open source digital ecosystem from the perspective of collaborative relationship networks. Compared with traditional statistical indicators, the collaborative network perspective better illustrates the association characteristics between open source projects and developers. Users can model and analyze open source ecosystem data using an online analysis environment or a CLI tool. OpenDigger serves several enterprises and communities, including Ant Financial, Alibaba, and the Mulan Open Source Community, and provides open source digital insight capabilities to OSPO (Open Source Program Office) practitioners and open source project operations leads.

Key words: Open source ecosystem, Open collaboration, Data mining, Information system, Graph analysis
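
The abstract does not state how the collaborative relationship network is built or scored, so the following is only a minimal sketch of the general idea, not OpenDigger's actual method. It assumes a hypothetical event log of (developer, repository, event count) triples, builds an undirected weighted developer-repository graph from it, and ranks the nodes with a simple edge-weighted PageRank computed by power iteration. All names, counts, and parameter values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical event log: (developer, repository, number of collaboration events).
# The names and counts are invented for illustration; they are not OpenDigger data.
EVENTS = [
    ("alice", "repo_a", 5),
    ("bob", "repo_a", 1),
    ("bob", "repo_b", 3),
    ("carol", "repo_b", 2),
]


def build_graph(events):
    """Build an undirected weighted developer-repository graph as an adjacency map."""
    graph = defaultdict(dict)
    for dev, repo, weight in events:
        d, r = ("dev", dev), ("repo", repo)
        graph[d][r] = graph[d].get(r, 0) + weight
        graph[r][d] = graph[r].get(d, 0) + weight
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=50):
    """Power iteration in which each node splits its score in proportion to edge weight."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the uniform "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, neighbours in graph.items():
            total = sum(neighbours.values())
            for other, weight in neighbours.items():
                new_rank[other] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


graph = build_graph(EVENTS)
ranks = weighted_pagerank(graph)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Unlike a plain count of events per developer, the score of a node here depends on whom it collaborates with: a developer connected to well-connected repositories receives more weight than one with the same number of events on an isolated repository. In a real deployment the graph would presumably be stored in and queried from a graph database rather than held in a Python dictionary.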

CLC Number: 

  • TP391