Computer Science ›› 2021, Vol. 48 ›› Issue (6): 48-56. doi: 10.11896/jsjkx.200800217

• Database & Big Data & Data Science •

Analysis of Topics on Database Systems in Stack Overflow

LIU Yun-han, SHA Chao-feng, NIU Jun-yu

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  • Received: 2020-08-30 Revised: 2020-10-24 Online: 2021-06-15 Published: 2021-06-03
  • Corresponding author: SHA Chao-feng (cfsha@fudan.edu.cn)
  • About author: LIU Yun-han, born in 1996, postgraduate. Her main research interests include natural language processing and software engineering. (18212010018@fudan.edu.cn)
    SHA Chao-feng, born in 1976, Ph.D., associate professor. His main research interests include machine learning & data mining, and natural language processing.
  • Supported by:
    National Key Research and Development Program of China (2018YFB0904503).

Abstract: Although database management systems are relatively mature software systems, developers still encounter a variety of problems when using them to manage or analyze data, and they turn to Stack Overflow and other community Q&A (CQA) forums for solutions. In this paper, 94473 database-related questions are collected from Stack Overflow, and the LDA topic model is applied to group them into 25 topics; the results show that developers’ questions fall into topics such as “table”, “SQL” and “SELECT”. A study of the popularity and difficulty of the different database-related topics finds that questions on the “SQL” topic are relatively popular. In addition, three database systems, MySQL, Oracle and MongoDB, are studied separately, and the topic distribution of the questions related to each system is analyzed. The findings help to understand the challenges faced by database developers and thus provide a reference for database system version updates, the design of database course content, and even research questions in the database field.

Key words: Database, LDA, Stack Overflow, Topic modeling
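The topic-modeling step described in the abstract can be illustrated with a minimal sketch. It assumes collapsed Gibbs sampling as the LDA inference method; the abstract does not say which inference method or toolkit the authors used. The toy questions, the hyperparameters `alpha` and `beta`, the helper names, and the choice of two topics (instead of the paper's 25) are all hypothetical and serve only to keep the example small. The "share of questions per dominant topic" computed at the end is a simple illustrative summary, not the popularity or difficulty measure defined in the paper.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would also strip code, HTML and stop words)."""
    return text.lower().split()


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (theta, topic_word, vocab) where theta[d][k] is the estimated share
    of topic k in document d and topic_word[k][w] counts word id w in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n_kw
    topic_total = [0] * num_topics                               # n_k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


def dominant_topic_share(theta):
    """Fraction of documents whose highest-weight topic is k, for every topic k."""
    num_topics = len(theta[0])
    counts = Counter(max(range(num_topics), key=row.__getitem__) for row in theta)
    return {k: counts[k] / len(theta) for k in range(num_topics)}


# Hypothetical toy questions, not taken from the paper's dataset.
questions = [
    "select rows from table with join",
    "create table with foreign key column",
    "mongodb aggregate query on nested document",
    "mongodb find document by nested field",
    "select distinct column from table",
]
docs = [tokenize(q) for q in questions]
theta, topic_word, vocab = fit_lda(docs, num_topics=2, iterations=100)

for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=counts.__getitem__, reverse=True)[:4]
    print(f"topic {k}:", [vocab[w] for w in top])
print("share of questions per dominant topic:", dominant_topic_share(theta))
```

On a corpus of realistic size, an optimized library implementation of LDA would be the practical choice; the pure-Python sampler above is meant only to make the per-token update explicit.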

CLC Number:

  • TP311