Detection Method of Duplicate Defect Reports Fusing Text and Categorization Information

doi:10.11896/jsjkx.181102232

Abstract

Abstract: Software defects are the root cause of software errors and failures. They arise from factors such as unsound requirements analysis, imprecise programming languages, and inexperienced developers. Because defects are unavoidable, submitting defect reports is an important way to find and fix them. A defect report is the carrier that describes a defect, and resolving defect reports is a necessary means of improving software. Maintainers and users often submit reports for the same defect, which leaves a large number of redundant reports in the defect report repository, and manual triage can no longer keep up with increasingly complex software systems. Duplicate defect report detection filters redundant duplicates out of the repository so that labor and time can be invested in new defect reports. The prediction accuracy of existing methods remains low; the difficulty lies in finding a suitable and comprehensive way to measure the similarity between defect reports. Drawing on the idea of ensemble methods, this paper proposes BSO (a combination of BM25F, LSI and One-Hot), a duplicate defect report detection method that fuses text information and categorization information. After data preprocessing, each defect report is split into a text information domain and a categorization information domain. In the text information domain, the BM25F and LSI algorithms each produce a similarity score, and a similarity fusion method integrates the two scores. In the categorization information domain, the One-Hot algorithm produces a similarity score. The similarity fusion method then fuses the scores of the two domains, a recommendation list of candidate duplicate reports is generated for each defect report, and the detection accuracy is calculated. The method was implemented in Python and compared on the public OpenOffice dataset with a baseline method and the state-of-the-art methods REP and DBTM. The experimental results show that the accuracy of the proposed method is on average 4.7% higher than that of DBTM and 6.3% higher than that of REP, and that it is also higher than that of the baseline method, which supports the effectiveness of BSO.

Key words: Duplicate defect report, Information retrieval method, One-Hot, Similarity fusion, Topic model
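
The abstract names the building blocks of BSO but does not give the fusion formula, so the following is only a minimal sketch of how such a pipeline could be wired together. Everything specific in it is an assumption made for illustration: the min-max normalization, the weighted-sum fusion, the weights, the field names and values, and the precomputed BM25F and LSI scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[int]:
    """Encode a categorical value as a one-hot vector over its vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]


def categorical_similarity(a: Dict[str, str], b: Dict[str, str],
                           vocabularies: Dict[str, Sequence[str]]) -> float:
    """Fraction of categorical fields on which two reports agree.

    Each field contributes the dot product of the two one-hot vectors,
    which is 1 when the values match and 0 otherwise.
    """
    matches = 0
    for field, vocab in vocabularies.items():
        va, vb = one_hot(a[field], vocab), one_hot(b[field], vocab)
        matches += sum(x * y for x, y in zip(va, vb))
    return matches / len(vocabularies)


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different scorers become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(score_sets: List[Dict[str, float]],
         weights: List[float]) -> Dict[str, float]:
    """Weighted sum of normalized scores (an assumed fusion rule)."""
    normalized = [min_max(s) for s in score_sets]
    return {k: sum(w * n[k] for w, n in zip(weights, normalized))
            for k in normalized[0]}


def recommend(fused: Dict[str, float], k: int) -> List[str]:
    """Return the k candidate report ids with the highest fused score."""
    return sorted(fused, key=fused.get, reverse=True)[:k]


# Hypothetical categorical fields and values for three candidate reports.
vocabularies = {
    "product": ["Writer", "Calc"],
    "component": ["ui", "code"],
    "priority": ["P2", "P3"],
}
query = {"product": "Writer", "component": "ui", "priority": "P3"}
candidates = {
    "r1": {"product": "Writer", "component": "ui", "priority": "P2"},
    "r2": {"product": "Calc", "component": "code", "priority": "P3"},
    "r3": {"product": "Writer", "component": "ui", "priority": "P3"},
}

# Hypothetical text-domain scores, assumed to come from BM25F and LSI.
bm25f_scores = {"r1": 12.0, "r2": 3.0, "r3": 7.5}
lsi_scores = {"r1": 0.9, "r2": 0.2, "r3": 0.4}

# Step 1: fuse the two text-domain scores.
text_scores = fuse([bm25f_scores, lsi_scores], [0.5, 0.5])
# Step 2: score the categorization domain with one-hot vectors.
cat_scores = {rid: categorical_similarity(query, fields, vocabularies)
              for rid, fields in candidates.items()}
# Step 3: fuse both domains and build the recommendation list.
final_scores = fuse([text_scores, cat_scores], [0.7, 0.3])
print(recommend(final_scores, 2))
```

In this sketch the text domain is weighted more heavily than the categorization domain; that choice is arbitrary, and in practice the weights would have to be tuned on held-out data.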

CLC number: TP311.5

FAN Dao-yuan, SUN Ji-hong, WANG Wei, TU Ji-ping, HE Xin. Detection Method of Duplicate Defect Reports Fusing Text and Categorization Information[J]. Computer Science, 2019, 46(12): 192-200. https://doi.org/10.11896/jsjkx.181102232
