Computer Science ›› 2017, Vol. 44 ›› Issue (7): 191-196. doi: 10.11896/j.issn.1002-137X.2017.07.034

• Artificial Intelligence •

Micro-blog Retweet Behavior Prediction Method Fusing Anomaly Detection and Random Forest

周先亭,黄文明,邓珍荣   

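  1. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004
  • Publication date: 2018-11-13  Release date: 2018-11-13
  • Funding:
    This work was supported by the Guangxi Science and Technology Research Project (桂科攻1598019-6)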

Micro-blog Retweet Behavior Prediction Algorithm Based on Anomaly Detection and Random Forest

ZHOU Xian-ting, HUANG Wen-ming and DENG Zhen-rong   

  • Online:2018-11-13 Published:2018-11-13

Abstract: Existing methods for predicting micro-blog retweet behavior select features arbitrarily and achieve limited accuracy. To address this, a prediction method that fuses anomaly detection with random forest is proposed. First, basic user features, basic post features, and topic features of post content are extracted, and user activity and post influence are computed based on relative entropy. Second, a key feature set is selected by combining filter and wrapper feature selection methods. Finally, anomaly detection and the random forest algorithm are fused to predict retweet behavior from the selected features, with the number of decision trees and the number of features in the random forest set by out-of-bag error estimation. In experiments on a real Sina Weibo dataset, the method is compared with prediction methods based on logistic regression, decision tree, naive Bayes, and random forest. The proposed method reaches a prediction accuracy of 90.5%, higher than that of random forest, the best of the baseline methods. The experiments also confirm the effectiveness of the feature selection method.

Key words: Retweet prediction, Random forest, Anomaly detection, Feature selection, Relative entropy
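
The abstract states that user activity and post influence are computed from relative entropy but does not give the formula. As a minimal sketch, the code below shows only the standard definition of relative entropy (Kullback-Leibler divergence) between two discrete distributions. The function name `relative_entropy`, the three behavior counts, and the uniform reference distribution are illustrative assumptions, not the authors' actual features or scoring scheme.

```python
import math


def relative_entropy(p, q, eps=1e-12):
    """Return D(P || Q) in bits for two lists of non-negative weights.

    Both inputs are normalized to sum to 1 before the divergence is computed.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    p_total, q_total = sum(p), sum(q)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("p and q must each have a positive total weight")

    divergence = 0.0
    for p_i, q_i in zip(p, q):
        p_norm = p_i / p_total
        q_norm = q_i / q_total
        if p_norm == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0 by convention
        # Clamp q to eps so a zero reference probability yields a large
        # finite value instead of a division-by-zero error.
        divergence += p_norm * math.log2(p_norm / max(q_norm, eps))
    return divergence


# Hypothetical per-user counts: [original posts, retweets, comments].
user_counts = [8, 1, 1]
# Hypothetical reference: a user who spreads activity evenly over all three.
uniform_reference = [1, 1, 1]

score = relative_entropy(user_counts, uniform_reference)
print(f"relative entropy vs. uniform reference: {score:.4f} bits")
```

A larger value means the user's behavior distribution departs further from the reference. How such a score would be mapped to "activity" or "influence" is a design choice the abstract leaves open.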

