Computer Science, 2021, Vol. 48, Issue (12): 219-225. doi: 10.11896/jsjkx.201100128

• Database & Big Data & Data Science •

Microblog User Interest Recognition Based on Multi-granularity Text Feature Representation

YU You-qin, LI Bi-cheng

  1. College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021, China
  • Received: 2020-11-17 Revised: 2021-02-18 Online: 2021-12-15 Published: 2021-11-26
  • Corresponding author: LI Bi-cheng (lbclm@163.com)
  • About author: 418846636@qq.com
    YU You-qin, born in 1993, postgraduate. Her main research interests include user portrait and personalized information recommendation.
    LI Bi-cheng, born in 1970, Ph.D., professor, Ph.D. supervisor. His main research interests include text analysis and understanding, and information fusion.
  • Supported by:
    National Social Science Foundation of China (19BXW110).

Abstract: Discovering the interests of microblog users is important for personalized recommendation in social networks and for guiding information dissemination correctly. This paper therefore proposes a microblog user interest recognition method based on multi-granularity text feature representation. First, text vectors are constructed for each microblog user at three levels: the topic layer, the word-order layer, and the vocabulary layer. LDA extracts the topic features of the content, an LSTM learns its semantic features, and the open-source word vectors from Tencent AI Lab supply word-sense features. Then, the three feature vectors are concatenated into a multi-granularity text feature representation matrix, which is fed into a CNN for text classification training. Finally, a multi-terminal output layer identifies the interests of microblog users. Experimental results show that, compared with single-granularity feature representation models, the multi-granularity feature representation model improves precision, recall, and F1 by 8%, 12%, and 13%, respectively. By jointly considering coarse and fine semantic granularity together with word granularity, and combining them with a neural network classifier, the multi-granularity model outperforms the single-granularity models on every evaluation metric.

Key words: Social network, Weibo user, Interest recognition, Text feature, Text classification
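
The abstract says the three feature vectors are concatenated into a matrix but does not give the exact layout. The following is a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes one row per token, formed from that token's word vector, a sequence-model state for that token, and the document's topic distribution repeated on every row. The function name `build_multigranularity_matrix` and all toy values are hypothetical stand-ins for real pretrained embeddings, LSTM states, and LDA output.

```python
from typing import List

Vector = List[float]


def build_multigranularity_matrix(
    word_vectors: List[Vector],
    sequence_states: List[Vector],
    topic_vector: Vector,
) -> List[Vector]:
    """Concatenate word-, word-order-, and topic-level features row by row.

    Assumed layout (not confirmed by the paper): one row per token, made of
    the token's word embedding, the sequence model's state at that token,
    and the document-level topic distribution repeated on every row.
    """
    if len(word_vectors) != len(sequence_states):
        raise ValueError("need exactly one sequence state per token")
    # List concatenation builds a new row and leaves the inputs unchanged.
    return [w + h + topic_vector for w, h in zip(word_vectors, sequence_states)]


# Toy values standing in for pretrained embeddings, LSTM states, and LDA output.
word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
sequence_states = [[1.0], [2.0], [3.0]]
topic_vector = [0.7, 0.2, 0.1]

matrix = build_multigranularity_matrix(word_vectors, sequence_states, topic_vector)
print(len(matrix), len(matrix[0]))  # number of tokens, combined feature width
```

Under this layout the matrix has a fixed row width (the sum of the three feature sizes), which is the kind of input a convolutional text classifier can slide filters over. Other arrangements, such as stacking the three vectors as separate channels, would be equally consistent with the abstract.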

CLC Number: 

  • TP391