Computer Science, 2019, Vol. 46, Issue (11A): 66-71.

• Intelligent Computing •

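Short Text Feature Extension Method Based on Bayesian Networks

LIU Hui-qing, GUO Yan-bu, LI Hong-ling, LI Wei-hua

  1. (School of Information, Yunnan University, Kunming 650500, China)
  • Online: 2019-11-10 Published: 2019-11-20
  • Corresponding author: LIU Hui-qing (born 1996), female, master's student, whose main research interests include natural language processing; LI Wei-hua (born 1977), female, Ph.D., associate professor, whose main research interest is machine learning, E-mail: lywey@163.com.
  • Supported by:
    This work was supported by the Key Project of the Applied Basic Research Program of Yunnan Province (2016FA026), the National Natural Science Foundation of China (61762090), and the Graduate Research Innovation Fund of Yunnan University (2018226).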

Abstract: To address the feature sparsity and insufficient representation ability of short texts, this paper proposes a short text feature extension method based on Bayesian networks. First, a semantic Bayesian network is constructed from the dependencies between the feature words in the short texts, and a correlation degree between a feature word and a short text is defined. The correlation degree is then computed by Bayesian network inference, and the feature words closely related to a short text are added to it, which reduces the noise in the short text and alleviates its feature sparsity. Finally, short text classification is used as the basic text analysis task to evaluate the feasibility and effectiveness of the proposed method. Experimental results on the Amazon product review dataset show that the proposed method is feasible and effective.

Key words: Bayesian network, Feature extension, Short text, Text analysis

CLC Number: TP391
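
The abstract outlines a pipeline: model dependencies between feature words, score how strongly a candidate word relates to a short text, and add the closely related words to that text. The abstract does not give the network construction or the exact definition of the correlation degree, so the following is only a minimal sketch under simplifying assumptions: pairwise conditional probabilities estimated from co-occurrence stand in for the Bayesian network, and the correlation degree is taken as the mean of P(word | v) over the words v in the text. The function names, the `threshold` and `top_k` parameters, and the toy corpus are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_conditionals(corpus):
    """Estimate P(w | v) as the share of texts containing v that also contain w.

    Assumption: this pairwise estimate replaces a learned Bayesian network.
    Returns a dict keyed by (w, v).
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for text in corpus:
        words = set(text)
        doc_freq.update(words)
        # Ordered pairs (w, v) of distinct words that share a text.
        pair_freq.update(permutations(words, 2))
    return {(w, v): n / doc_freq[v] for (w, v), n in pair_freq.items()}


def correlation(word, text, cond):
    """Hypothetical correlation degree: mean of P(word | v) over words v in text."""
    words = set(text)
    if not words:
        return 0.0
    return sum(cond.get((word, v), 0.0) for v in words) / len(words)


def extend(text, corpus, cond, threshold=0.5, top_k=2):
    """Append up to top_k vocabulary words whose correlation reaches threshold."""
    vocab = {w for t in corpus for w in t}
    candidates = vocab - set(text)
    # Sort by descending score, then alphabetically for a deterministic order.
    scored = sorted(
        ((correlation(w, text, cond), w) for w in candidates),
        key=lambda item: (-item[0], item[1]),
    )
    added = [w for score, w in scored if score >= threshold][:top_k]
    return list(text) + added


# Toy corpus of tokenized short texts (illustrative data only).
corpus = [
    ["battery", "charger", "phone"],
    ["battery", "phone"],
    ["phone", "screen"],
    ["book", "novel"],
]
cond = cooccurrence_conditionals(corpus)
print(extend(["battery"], corpus, cond))  # prints ['battery', 'phone', 'charger']
```

In this toy run, "phone" appears in every text that contains "battery" and "charger" in half of them, so both reach the 0.5 threshold and are appended. A faithful implementation would replace the pairwise estimates with inference over the learned network structure.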