Computer Science (计算机科学), 2018, Vol. 45, Issue (1): 157-161. doi: 10.11896/j.issn.1002-137X.2018.01.027

• The 16th China Conference on Machine Learning •

Two-level Stacking Algorithm Framework for Building User Portrait

LI Heng-chao, LIN Hong-fei, YANG Liang, XU Bo, WEI Xiao-cong, ZHANG Shao-wu and Gulziya ANIWAR

  1. Information Retrieval Laboratory, School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024; School of Electronic and Information Engineering, Yili Normal University, Yining, Xinjiang 835000
  • Online: 2018-01-15 Published: 2018-11-13
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61632011,2,61562080,9).

Abstract: A user portrait is a tagged user model abstracted from information such as a user's social attributes, lifestyle and consumer behavior. The core work of building a user portrait is to "tag" the user. Based on users' query-word histories, this paper proposes a two-level stacking algorithm framework for predicting users' multi-dimensional labels. In the first level, several models are built for each label prediction subtask: traditional machine learning methods (an SVM model) combined with Trigram features capture differences in users' word-usage habits, the doc2vec shallow neural network model extracts semantic association information from the query words, and a convolutional neural network model extracts deep semantic association information between the query words. Experiments show that doc2vec achieves relatively good prediction accuracy on short-text tasks such as user queries. In the second level, because building a user portrait is a multi-label prediction task, the XGBTree model and the Stacking multi-model ensemble method are used to extract the association information among a user's label attributes, which further improves the average prediction accuracy by about 2%. In the 2016 big data competition "Sogou User Portrait Mining for Precision Marketing" organized by the China Computer Federation (CCF), the proposed two-level stacking framework won the championship among 894 teams.

Key words: User portraits, Tag prediction, Short text classification, Multi-model ensemble
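
The two-level idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the naive Bayes first-level models (over words and character trigrams) stand in for the SVM, doc2vec and CNN models, the nearest-centroid second-level model stands in for XGBTree, and all names and the toy queries are hypothetical. What the sketch does show is the stacking mechanics: out-of-fold predictions from every first-level model on every label task are concatenated, so each second-level model can see evidence about the other labels.

```python
import math
from collections import Counter

def char_trigrams(doc):
    """Character trigrams; a stand-in for the paper's Trigram features."""
    return [doc[i:i + 3] for i in range(len(doc) - 2)]

class NaiveBayes:
    """Tiny multinomial naive Bayes (hypothetical first-level model)."""
    def __init__(self, tokenize):
        self.tokenize = tokenize

    def fit(self, docs, labels):
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for doc, label in zip(docs, labels):
            self.counts[label].update(self.tokenize(doc))
        self.vocab = {t for c in self.counts for t in self.counts[c]}
        return self

    def predict_proba(self, doc):
        logp = {}
        for c, n_docs in self.prior.items():
            # Laplace smoothing, with one extra slot for unseen tokens.
            total = sum(self.counts[c].values()) + len(self.vocab) + 1
            logp[c] = math.log(n_docs) + sum(
                math.log((self.counts[c][t] + 1) / total)
                for t in self.tokenize(doc))
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        return {c: math.exp(v - top) / z for c, v in logp.items()}

class NearestCentroid:
    """Hypothetical second-level model (stand-in for XGBTree)."""
    def fit(self, X, y):
        sums, n = {}, Counter(y)
        for x, c in zip(X, y):
            acc = sums.setdefault(c, [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += v
        self.centroids = {c: [v / n[c] for v in s] for c, s in sums.items()}
        return self

    def predict(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=dist)

def oof_features(make_model, docs, labels, k):
    """Out-of-fold class probabilities: each doc is scored by a model
    that never saw it during training."""
    classes = sorted(set(labels))
    feats = [None] * len(docs)
    for fold in (list(range(i, len(docs), k)) for i in range(k)):
        held = set(fold)
        train = [i for i in range(len(docs)) if i not in held]
        model = make_model().fit([docs[i] for i in train],
                                 [labels[i] for i in train])
        for i in fold:
            proba = model.predict_proba(docs[i])
            feats[i] = [proba.get(c, 0.0) for c in classes]
    return feats

def two_level_stack(docs, tasks, base_factories, k=3):
    """Level 1: OOF predictions per (task, model). Level 2: one meta-model
    per task, trained on the concatenation across ALL tasks."""
    level1 = [[] for _ in docs]
    for labels in tasks.values():
        for make_model in base_factories:
            for row, f in zip(level1, oof_features(make_model, docs, labels, k)):
                row.extend(f)
    meta = {name: NearestCentroid().fit(level1, labels)
            for name, labels in tasks.items()}
    return level1, meta

# Toy data (hypothetical): six query histories with two label tasks.
docs = ["football scores tonight", "lipstick shades review",
        "football transfer news", "lipstick brands sale",
        "pension plan advice", "homework help math"]
tasks = {"gender": ["m", "f", "m", "f", "m", "f"],
         "age": ["young", "young", "young", "young", "old", "young"]}
base_factories = [lambda: NaiveBayes(str.split),
                  lambda: NaiveBayes(char_trigrams)]

level1, meta = two_level_stack(docs, tasks, base_factories, k=3)
for name, model in meta.items():
    print(name, [model.predict(row) for row in level1])
```

To score unseen users, one would typically refit each first-level model on all training data, build the same concatenated feature row, and pass it to the second-level model for each label; that step is omitted here to keep the sketch short.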

