Computer Science (计算机科学) ›› 2020, Vol. 47 ›› Issue (3): 25-33. doi: 10.11896/jsjkx.191000087

Special topic: Intelligent Software Engineering


Approach of Automatic Fork Summary Generation in Open Source Community Based on Feature Extraction
(基于特征提取的开源社区Fork摘要自动生成方法)

ZHANG Chao (张超)1, MAO Xin-jun (毛新军)1,2, LU Yao (卢遥)1

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410000, China
  2. Key Laboratory of Complex System Software Engineering, Changsha 410000, China
  • Received: 2019-10-15 Online: 2020-03-15 Published: 2020-03-30
  • Corresponding author: MAO Xin-jun (xjmao@nudt.edu.cn)
  • About author: ZHANG Chao, born in 1991, postgraduate, is a member of the China Computer Federation. His main research interests include software engineering and open source communities. MAO Xin-jun, born in 1970, Ph.D., professor, is a member of the China Computer Federation. His main research interests include software engineering and open source communities.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2018YFB1004202) and the NSFC project Research on Mechanism and Method of Massive Online Collaborative Learning (61532004).

Abstract: Distributed collaborative development based on pull requests (P/R) has become the dominant software development method in open source communities. Developers fork a project to copy its repository, create their own branch, and develop independently on that branch. Because the P/R model is open, transparent, and parallel, developers who fork a project find it difficult to get an overview of the project's existing Forks and cannot tell whether other developers have already used a Fork to do the same or similar work, which easily leads to duplicate contributions and redundant development. To address this problem, this paper proposes a method for automatically generating Fork summaries, to help project managers strengthen project management, avoid redundant contributions, and enhance cooperation and communication among developers. The method first crawls Issue data carrying Feature and Bug labels from open source communities and trains a random forest classifier to classify Fork features. It then collects the software development activity data of each Fork branch and uses the TextRank algorithm to generate detailed Fork information that explains the main purpose of the Fork. Finally, a set of combination rules and a corresponding algorithm integrate the Fork's category, features, and other information into a complete Fork summary. To validate the effectiveness of the method in guiding distributed collaborative development, 30 groups of manual tests and 60 groups of real-world case tests were conducted on GitHub. The results show that the generated Fork summaries reach an accuracy of 67.2%, and that 76% of the project managers in the experiment consider Fork summaries helpful for managing projects and for strengthening communication and cooperation.

Key words: Distributed cooperative development, Fork summary, Open source community, Open source software

CLC Number: TP311
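
The abstract names TextRank as the step that turns a Fork's development activity into a short description of its purpose. The following is a minimal sketch of that idea only, not the authors' implementation: the input is assumed to be a list of commit messages, the function names (`tokenize`, `similarity`, `textrank`, `summarize`) are hypothetical, and the word-overlap similarity is the measure commonly paired with TextRank for sentence ranking rather than anything the abstract specifies.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(tokens_a, tokens_b):
    """Shared-word count normalized by the log lengths of both sentences."""
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0  # guards against a zero denominator (log 1 + log 1)
    overlap = len(set(tokens_a) & set(tokens_b))
    return overlap / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score each sentence by power iteration over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric edge weights; a sentence has no edge to itself.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_k=1):
    """Return the top_k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]


if __name__ == "__main__":
    # Hypothetical commit messages from a single Fork branch.
    commits = [
        "Add dark theme support to the settings page",
        "Fix typo in README",
        "Add dark theme toggle to the settings page header",
        "Update dark theme colors for the settings page",
    ]
    print(summarize(commits, top_k=1))
```

A message that shares no words with the others receives only the baseline score `1 - damping`, so isolated commits such as one-off typo fixes tend to drop out of the summary. A real pipeline would likely also remove stop words and combine several activity sources, which this sketch leaves out.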