Computer Science ›› 2019, Vol. 46 ›› Issue (12): 192-200. doi: 10.11896/jsjkx.181102232

• Software & Database Technology •

Detection Method of Duplicate Defect Reports Fusing Text and Categorization Information

FAN Dao-yuan1, SUN Ji-hong2, WANG Wei1,3, TU Ji-ping1, HE Xin1   

  1. College of Software, Yunnan University, Kunming 650500, China
  2. Academy of Sciences in Yunnan Province, Kunming 650091, China
  3. Key Laboratory for Software Engineering of Yunnan Province, Kunming 650500, China
  • Received:2018-11-30 Online:2019-12-15 Published:2019-12-17

Abstract: Software defects are the root cause of software errors and failures. They arise from inadequate requirements analysis, imprecise programming languages and developers' lack of experience. Because defects are inevitable, submitting defect reports is an important way to find and fix them. A defect report is the carrier that describes a defect, and resolving defect reports is a necessary means of improving software. Maintainers and users often submit reports for the same defect repeatedly, which leaves a large number of redundant reports in the defect report repository, and manual triage cannot keep pace with increasingly complex software systems. Detecting duplicate defect reports filters redundant reports out of the repository so that manpower and time can be invested in new defect reports. The prediction accuracy of existing methods is not high, and the difficulty lies in finding a suitable and comprehensive measure of similarity between defect reports. Based on the idea of ensemble methods and implemented in Python, this paper proposes BSO (a combination of BM25F, LSI and One-Hot), a new method for detecting duplicate defect reports that uses both text information and categorization information. After data preprocessing, each defect report is divided into a text information domain and a categorization information domain. The BM25F and LSI algorithms yield similarity scores in the text information domain, and the One-Hot algorithm yields similarity scores in the categorization information domain. A similarity fusion method then combines the scores from the two domains, producing for each defect report a recommendation list of candidate duplicate reports, from which the detection accuracy is calculated. BSO is compared on OpenOffice with the baseline method and with the state-of-the-art methods REP and DBTM. The experimental results show that the accuracy of the proposed method is 4.7% higher than that of DBTM, 6.3% higher than that of REP, and higher than that of the baseline method, which demonstrates the effectiveness of BSO.
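The fusion idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the field names, field weights, the parameters `K1` and `B`, the fusion weight `alpha`, and the sample reports below are all hypothetical. The sketch combines a BM25F-style text score with a cosine similarity over one-hot encoded categorical fields; the LSI component is omitted to keep the example dependency-free.

```python
import math

# Hypothetical settings; none of these values come from the paper.
FIELD_WEIGHTS = {"summary": 3.0, "description": 1.0}   # text information domain
CATEGORY_FIELDS = ("product", "component", "version")  # categorization domain
K1, B = 1.2, 0.75

def tokenize(text):
    return text.lower().split()

def bm25f_score(query, doc, corpus):
    """BM25F-style score of `doc` for the terms of `query` over `corpus`."""
    n_docs = len(corpus)
    avg_len = {f: sum(len(tokenize(d[f])) for d in corpus) / n_docs
               for f in FIELD_WEIGHTS}
    terms = set()
    for f in FIELD_WEIGHTS:
        terms.update(tokenize(query[f]))
    score = 0.0
    for term in terms:
        # Field-weighted, length-normalized term frequency.
        tf = 0.0
        for f, w in FIELD_WEIGHTS.items():
            tokens = tokenize(doc[f])
            norm = (1.0 - B + B * len(tokens) / avg_len[f]) if avg_len[f] else 1.0
            tf += w * tokens.count(term) / norm
        if tf == 0.0:
            continue
        # Document frequency: reports containing the term in any text field.
        df = sum(any(term in tokenize(d[f]) for f in FIELD_WEIGHTS) for d in corpus)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf / (K1 + tf)
    return score

def one_hot(report, vocab):
    """Concatenate one one-hot vector per categorical field."""
    vec = []
    for f in CATEGORY_FIELDS:
        vec.extend(1.0 if report[f] == v else 0.0 for v in vocab[f])
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query, corpus, alpha=0.7, top_k=3):
    """Return ids of the top_k reports by fused similarity to `query`."""
    vocab = {f: sorted({d[f] for d in corpus} | {query[f]})
             for f in CATEGORY_FIELDS}
    text_scores = [bm25f_score(query, d, corpus) for d in corpus]
    max_text = max(text_scores) or 1.0  # scale text scores into [0, 1]
    q_vec = one_hot(query, vocab)
    fused = []
    for d, t in zip(corpus, text_scores):
        cat = cosine(q_vec, one_hot(d, vocab))
        fused.append((alpha * t / max_text + (1.0 - alpha) * cat, d["id"]))
    fused.sort(key=lambda pair: (-pair[0], pair[1]))
    return [report_id for _, report_id in fused[:top_k]]

# Made-up reports, for illustration only.
corpus = [
    {"id": 1, "summary": "crash when saving document",
     "description": "writer crashes on save to odt",
     "product": "Writer", "component": "save", "version": "3.0"},
    {"id": 2, "summary": "toolbar icons missing",
     "description": "icons do not render after update",
     "product": "Calc", "component": "ui", "version": "3.1"},
    {"id": 3, "summary": "spell checker ignores language",
     "description": "wrong dictionary used",
     "product": "Writer", "component": "linguistic", "version": "3.0"},
]
query = {"id": 4, "summary": "document crash on saving",
         "description": "application crashes when i save",
         "product": "Writer", "component": "save", "version": "3.0"}

ranking = rank_candidates(query, corpus)
```

Here `ranking` plays the role of the recommendation list for the new report. A linear weighted sum is only one plausible fusion rule; the paper's actual similarity fusion method may differ.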

Key words: Duplicate defect report, Information retrieval method, One-Hot, Similarity fusion, Topic model

CLC Number: TP311.5