Microblogging Water Army Identification Based on Semi-supervised Collaborative Training Algorithm

doi:10.11896/jsjkx.180901617

Abstract

Abstract: In the fast-developing Internet era, Weibo generates a large volume of information, but areas such as its topic pages attract a sizeable "water army" (paid posters and spam accounts), which makes it harder for ordinary users to learn the truth about a person or an event. To identify water-army accounts efficiently and accurately when water-army samples are scarce and non-water-army samples are plentiful, a semi-supervised co-training (collaborative training) algorithm is adopted. By studying and jointly analyzing multiple characteristics of Weibo users, the proposed algorithm redefines six attribute features, including account attention, number of microblogs posted per day, and microblog influence. Following the structure of co-training, the six features are divided into two attribute sets, each corresponding to one view. For each view, classifiers are trained with seven classification methods from the Scikit-Learn machine learning library and used to identify water-army accounts. Finally, experiments are conducted on a crawled dataset of Weibo users. The results show that accuracy, recall, precision, and F1-measure are all relatively high when the two views train their classifiers with the naive Bayes algorithm and the logistic regression algorithm, respectively. Therefore, a comprehensive analysis of Weibo user characteristics combined with a semi-supervised co-training algorithm that matches the actual data conditions can identify the Weibo water army accurately, efficiently, and quickly.

Keywords: Classifier, Collaborative training (co-training), Semi-supervised learning, Water army identification
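
The two-view co-training scheme described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the paper trains its classifiers with Scikit-Learn, whereas this sketch uses a small hand-written Gaussian naive Bayes (`TinyGaussianNB`, a hypothetical name) for both views so that it runs with the standard library alone. The data are synthetic, the three features per view are unnamed placeholders, and the rule of adding the most confident pool samples each round is an assumption, since the abstract does not state the selection rule.

```python
import math
import random


class TinyGaussianNB:
    """Hand-written Gaussian naive Bayes (stand-in for a Scikit-Learn classifier)."""

    def fit(self, X, y):
        self.stats = {}
        for c in (0, 1):  # 0 = ordinary user, 1 = water army
            rows = [x for x, label in zip(X, y) if label == c]
            params = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
                params.append((mean, var))
            self.stats[c] = (len(rows) / len(X), params)
        return self

    def predict_proba(self, x):
        """Return P(label = 1 | x) for one feature vector."""
        log_p = {}
        for c, (prior, params) in self.stats.items():
            lp = math.log(prior)
            for v, (mean, var) in zip(x, params):
                lp += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            log_p[c] = lp
        m = max(log_p.values())
        e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
        return e1 / (e0 + e1)


def fit_views(labeled):
    """Train one classifier per view from (view_a, view_b, label) triples."""
    y = [lab for _, _, lab in labeled]
    clf_a = TinyGaussianNB().fit([a for a, _, _ in labeled], y)
    clf_b = TinyGaussianNB().fit([b for _, b, _ in labeled], y)
    return clf_a, clf_b


def co_train(labeled, unlabeled, rounds=10, per_round=5):
    """Each round, every view labels its most confident pool samples for both views."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}
        for view, clf in enumerate(fit_views(labeled)):
            probs = [clf.predict_proba(sample[view]) for sample in pool]
            ranked = sorted(range(len(pool)), key=lambda i: abs(probs[i] - 0.5), reverse=True)
            for i in ranked[:per_round]:
                picked.setdefault(i, int(probs[i] >= 0.5))  # assumed selection rule
        labeled += [(pool[i][0], pool[i][1], lab) for i, lab in picked.items()]
        pool = [s for i, s in enumerate(pool) if i not in picked]
    return fit_views(labeled)


def predict(clf_a, clf_b, sample):
    """Average the two views' probabilities and threshold at 0.5."""
    p = (clf_a.predict_proba(sample[0]) + clf_b.predict_proba(sample[1])) / 2
    return int(p >= 0.5)


def make_user(label, rng):
    """Synthetic user: two views of three placeholder features each."""
    centre = 3.0 if label else 0.0
    return ([rng.gauss(centre, 1.0) for _ in range(3)],
            [rng.gauss(centre, 1.0) for _ in range(3)])


rng = random.Random(0)
labeled = [make_user(lab, rng) + (lab,) for lab in [0, 1] * 10]
unlabeled = [make_user(rng.randint(0, 1), rng) for _ in range(200)]
test_set = [(make_user(lab, rng), lab) for lab in [0, 1] * 100]

clf_a, clf_b = co_train(labeled, unlabeled)
preds = [predict(clf_a, clf_b, sample) for sample, _ in test_set]
truth = [lab for _, lab in test_set]

# The four evaluation measures named in the abstract.
tp = sum(1 for p, t in zip(preds, truth) if p == 1 and t == 1)
accuracy = sum(1 for p, t in zip(preds, truth) if p == t) / len(truth)
precision = tp / max(1, sum(preds))
recall = tp / max(1, sum(truth))
f1 = 2 * precision * recall / max(1e-12, precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

To follow the paper's setup more closely, the per-view classifier could be replaced with Scikit-Learn estimators such as `GaussianNB` for one view and `LogisticRegression` for the other. The real six features and their split into views are defined in the full paper and are not reproduced here.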

CLC Number: TP393

Cite this article: HAN Qing-qing, ZHANG Yan-mei, NIU Wa. Microblogging Water Army Identification Based on Semi-supervised Collaborative Training Algorithm[J]. Computer Science (计算机科学), 2019, 46(11): 202-208. https://doi.org/10.11896/jsjkx.180901617


