Survey of Reinforcement Learning Based Recommender Systems

doi:10.11896/jsjkx.210200085

Abstract

Abstract: Recommender systems find and automatically recommend valuable information and services to users from massive data. They can effectively address the information overload problem and have become an important information technology in the era of big data. However, data sparsity, cold start, and interpretability remain key technical difficulties that limit the wide application of recommender systems. Reinforcement learning is an interactive learning technique: by interacting with users and obtaining feedback, it captures their interest drift in real time and thus models user preferences dynamically, so it can better address these classical problems of traditional recommender systems. In recent years, reinforcement learning has become a hot research topic in the field of recommender systems. This survey first briefly reviews recommender systems and reinforcement learning and, on that basis, analyzes how reinforcement learning can improve recommender systems. It then gives an overview of recent research on reinforcement learning based recommendation, summarizing work on traditional reinforcement learning based recommendation and on deep reinforcement learning based recommendation separately. It further summarizes several frontiers of reinforcement learning based recommendation research in recent years, together with its applications. Finally, the future development trends of reinforcement learning in recommender systems are analyzed and discussed.

Key words: Deep reinforcement learning, Markov decision process, Multi-armed bandits, Recommender systems, Reinforcement learning

CLC Number: TP183

YU Li, DU Qi-han, YUE Bo-yan, XIANG Jun-yao, XU Guan-yu, LENG You-fang. Survey of Reinforcement Learning Based Recommender Systems[J]. Computer Science, 2021, 48(10): 1-18. https://doi.org/10.11896/jsjkx.210200085

