Computer Science ›› 2020, Vol. 47 ›› Issue (3): 25-33. doi: 10.11896/jsjkx.191000087

• Intelligent Software Engineering •

Approach of Automatic Fork Summary Generation in Open Source Community Based on Feature Extraction

ZHANG Chao (1), MAO Xin-jun (1, 2), LU Yao (1)

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410000, China
  2. Key Laboratory of Complex System Software Engineering, Changsha 410000, China
  • Received: 2019-10-15 Online: 2020-03-15 Published: 2020-03-30
  • About author: ZHANG Chao, born in 1991, postgraduate, is a member of the China Computer Federation. His main research interests include software engineering and open source community. MAO Xin-jun, born in 1970, Ph.D., professor, is a member of the China Computer Federation. His main research interests include software engineering and open source community.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2018YFB1004202) and Research on Mechanism and Method of Massive Online Collaborative Learning (61532004).

Abstract: Distributed collaborative development based on pull requests (P/R) has become the dominant software development method in open source communities. Because development under the P/R model is open, transparent, and parallel, developers find it difficult to obtain a complete picture of all Forks of a project or to know whether other developers have already completed the same or similar tasks, which leads to duplicate contributions and redundant development. To address this problem, this paper proposes a method for automatically generating Fork summaries, which helps project managers strengthen project management, avoid redundant contributions, and improve cooperation and communication among developers. The method first crawls Issue data carrying feature and bug labels from open source communities and trains a random forest classifier to classify Fork features. It then collects data on the software development activities of each Fork branch and uses the TextRank algorithm to generate detailed Fork information that explains the main purpose of the Fork activity. Finally, a set of combination rules and a corresponding algorithm integrate the Fork's category, features, and other information into a complete Fork summary. To validate the method, 30 groups of manual tests and 60 groups of live studies were conducted on GitHub. The results show that the Fork summaries generated by this method reach an accuracy of 67.2%. In the experiment, 76% of project managers reported that Fork summaries help them manage projects better and strengthen communication and cooperation.
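
The TextRank step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify preprocessing, the similarity measure, or parameters, so the word-overlap similarity, the damping factor of 0.85, the fixed iteration count, and the sample commit messages below are all assumptions chosen for illustration.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Word-overlap similarity between two token sets, normalised by log length.

    Sentences with fewer than two distinct words are treated as unrelated,
    which also avoids a zero denominator (log(1) == 0).
    """
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating the weighted PageRank update."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric edge weights; a sentence is never linked to itself.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_k=1):
    """Return the top_k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]


# Hypothetical commit messages from a single Fork branch.
commits = [
    "Add OAuth login support to the user settings page",
    "Fix typo in README",
    "Add OAuth token refresh for login sessions",
    "Refactor login form to use OAuth token helper",
]
summary = summarize(commits, top_k=2)
```

In this sketch, messages that share vocabulary with many others rank higher, so an unrelated message such as the README fix is left out of the summary. A real pipeline would likely add stop-word removal and stemming before scoring.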

Key words: Open source, Open source community, Fork summary, Distributed cooperative development

CLC Number: TP311