计算机科学 ›› 2022, Vol. 49 ›› Issue (11A): 210800212-6.doi: 10.11896/jsjkx.210800212
林宝玲, 贾日恒, 林飞龙, 郑忠龙, 李明禄
LIN Bao-ling, JIA Ri-heng, LIN Fei-long, ZHENG Zhong-long, LI Ming-lu
摘要: 目前已有很多有关预算的多臂赌博机模型,但这些模型能解决的实际问题具有局限性,即这些问题必须都是全程受一个总预算限制。对此,文中提出基于预算时变的多臂赌博机模型,该模型能够打破这种局限性,并被用于解决其他更多的实际问题。该模型抓住了学习者每一轮的动作都受到相应这一轮预算限制的情况。更具体地说,每一轮,玩家都需要在相应这一轮预算的限制下选择拉L(L≥1)个臂(L不是一个固定值)。玩家的目标就是在每一轮预算的限制下,最大化总的平均奖励。根据这个模型,文中提出基于置信界的动态规划算法。该算法利用模型的特点,每一轮都以臂的经验平均奖励的置信上界为依据,然后使用动态规划算法进行拉臂操作。文中进一步引入遗憾的概念,并从理论上推导得出该算法遗憾的上界与最终预算的总和存在一定的关系。最后,通过实验,将所提算法在不同场景下和其他几个传统的预算受限的多臂赌博机算法(ε-first,KUBE,BTS)进行比较,验证了所提算法的可行性。
中图分类号:
[1]CHEN K.Research on Recommendation System Based onMulti-armed Bandit Algorithm[J].Journal of Changjiang Information and Communication,2021,34(3):43-46. [2]LI J F,LIAO S Z.Adversarial Multi-armed Bandit Model with Online Kernel Selection [J].Computer Science,2019,46(1):57-63. [3]SHI J,ZHENG J L,YUAN Y,et al.RFID Multi-reader Channel Resources Allocation Algorithm Based on Whittle Index[J].Computer Science,2019,46(10):122-127. [4]ERIC S,NICHOLAS T,SAMUEL J.Finding Structure inMulti-armed Bandits[J].Cognitive Psychology,2020,119:1-35. [5]ZHANG X F,ZHOU Q,LIANG B,et al.An Adaptive Algo-rithm in Multi-armed Bandit Problem[J].Journal of Computer Research and Development,2019,56(3):643-654. [6]ZUO J H,ZHANG X X,JOE-WONG C.Observe Before Play:Multi-armed Bandit with Pre-observations[C]//Proc of the AAAI Conf on Artificial Intelligence.New York:AAAI,2020:7023-7030. [7]WANG S W,HUANG L B.Multi-armed Bandits with Compensation[C]//Proc of the 32th Conf on Neural Information Processing Systems.montreal:NeurlPS,2018:5114-5122. [8]PATIL V,GHALME G,NAIR V,et al.Achieving Fairness in the Stochastic Multi-armed Bandit Problem[C]//Proc of the AAAI Conf on Artificial Intelligence.New York:AAAI,2020:5379-5386. [9]GUTOWSKI N,AMGHAR T,CAMP O,et al.Context En-hancement for Linear Contextual Multi-armed Bandits[C]//Proc of IEEE Int Conf on Tools with Artificial Intelligence.Volos:IEEE,2018:1048-1055. [10]MANICKAM I,LAN A S,BARANIUK R G.Contextual Multi-armed Bandit Algorithms for Personalized Learning Action Selection[C]//Proc of IEEE Int Conf on Acoustics.New Orleans:IEEE,2017:6344-6348. [11]AZIZM,ANDERTON J,KAUFMANN E,et al.Pure Exploration in Infinitely-armed Bandit Models with Fixed-confidence[C]//Proc of Algorithmic Learning Theory Int Conf.Lanzaro-te:ALT,2018:3-24. [12]BUBECK S,MUNOS R,STOLTZ G.Pure Exploration in Multi-armed Bandits Problems[C]//Proc of Algorithmic Learning Theory Int Conf.Porto:ALT,2009:23-37. [13]XUE Y,ZHOU P,MAO S W,et al.Pure-exploration Bandits for Channel Selection in Mission-critical Wireless Communications[J].IEEE Transactions on Vehicular Technology,2018,67(11):10995-11007. [14]LONG T T,CHAPMAN A,ROGERS A,et al.Knapsack Based Optimal Policies for Budget-limited Multi-armed Bandits[C]//Proc of the AAAI Conf on Artificial Intelligence.Toronto:AAAI,2012:1134-1140. [15]LONG T T,CHAPMAN A,ROGERS A,et al.Epsilon-first Pol-icies for Budget-limited Multi-armed Bandits[C]//Proc of the AAAI Conf on Artificial Intelligence.Atlanta:AAAI,2010:1211-1216. [16]XIA Y C,LI H F,QIN T,et al.Thompson Sampling for Budgeted Multi-armed Bandits[C]//Proc of the 24th International Joint Conf on Artificial Intelligence.Buenos Aires:IJCAI,2015:3960-3966. [17]DING W K,QIN T,ZHANG X D,et al.Multi-armed Bandit with Budget Constraint and Variable Costs[C]//Proc of the AAAI Conf on Artificial Intelligence.Washington:AAAI,2013:232-238. [18]XIA Y C,QIN T,MA W D,et al.Budgeted Multi-armed Bandits with Multipleplays[C]//Proc of the 25th International Joint Conf on Artificial Intelligence.New York:IJCAI,2016:2210-2216. [19]ZHOU D P,TOMLIN C J.Budget-constrained Multi-armedBandits with Multiple Plays[C]//Proc of the AAAI Conf on Artificial Intelligence.New Orleans:AAAI,2018:4572-4579. |
[1] | 陈莹, 郝应光, 王洪玉, 王坤. 基于局部梯度强度图的动态规划检测前跟踪算法 Dynamic Programming Track-Before-Detect Algorithm Based on Local Gradient and Intensity Map 计算机科学, 2022, 49(8): 150-156. https://doi.org/10.11896/jsjkx.210700135 |
[2] | 洪志理, 赖俊, 曹雷, 陈希亮, 徐志雄. 基于遗憾探索的竞争网络强化学习智能推荐方法研究 Study on Intelligent Recommendation Method of Dueling Network Reinforcement Learning Based on Regret Exploration 计算机科学, 2022, 49(6): 149-157. https://doi.org/10.11896/jsjkx.210600226 |
[3] | 马新宇, 姜春茂, 黄春梅. 基于三支聚类的云任务优化调度 Optimal Scheduling of Cloud Task Based on Three-way Clustering 计算机科学, 2022, 49(11A): 211100139-7. https://doi.org/10.11896/jsjkx.211100139 |
[4] | 李双刚, 张爽, 王兴伟. 基于自适应虚拟机迁移的云资源调度机制 Cloud Resource Scheduling Mechanism Based on Adaptive Virtual Machine Migration 计算机科学, 2020, 47(9): 238-245. https://doi.org/10.11896/jsjkx.190900189 |
[5] | 简琤峰, 平靖, 张美玉. 面向边缘计算的Storm边缘节点调度优化方法 Edge Computing-oriented Storm Edge Node Scheduling Optimization Method 计算机科学, 2020, 47(5): 277-283. https://doi.org/10.11896/jsjkx.190600048 |
[6] | 石静, 郑嘉利, 袁源, 王哲, 李丽. 基于Whittle索引的RFID多阅读器信道资源分配算法 RFID Multi-reader Channel Resources Allocation Algorithm Based on Whittle Index 计算机科学, 2019, 46(10): 122-127. https://doi.org/10.11896/jsjkx.180801602 |
[7] | 李峻樊, 廖士中. 在线核选择的对抗式多臂赌博机模型 Adversarial Multi-armed Bandit Model with Online Kernel Selection 计算机科学, 2019, 46(1): 57-63. https://doi.org/10.11896/j.issn.1002-137X.2019.01.009 |
[8] | 王正理,谢添,何琨,金燕. 考虑时间因素的0-1背包调度问题 0-1 Knapsack Variant with Time Scheduling 计算机科学, 2018, 45(4): 53-59. https://doi.org/10.11896/j.issn.1002-137X.2018.04.007 |
[9] | 张勋,顾春华,罗飞,常耀辉,文赓. 基于动态规划的虚拟机放置策略 Virtual Machine Placement Strategy Based on Dynamic Programming 计算机科学, 2017, 44(8): 54-59. https://doi.org/10.11896/j.issn.1002-137X.2017.08.010 |
[10] | 刘学军,陈玉凤,李斌. 基于不可信环境的移动位置隐私保护 Mobile Location Privacy Protection Based on Untrusted Environment 计算机科学, 2015, 42(2): 108-113. https://doi.org/10.11896/j.issn.1002-137X.2015.02.023 |
[11] | 孙焘,夏 斐,刘洪波. 基于动态规划求解时间序列DTW中心 Calculating DTW Center of Time Series Using Dynamic Planning 计算机科学, 2015, 42(12): 278-282. |
[12] | 周欢欢,姜瑛. 一种基于动态规划的云环境测试配置方法 Test Configuration Method Based on Dynamic Programming under Cloud Environment 计算机科学, 2014, 41(9): 215-219. https://doi.org/10.11896/j.issn.1002-137X.2014.09.040 |
[13] | 刘坤良,张大坤,武继刚. 基于权重约束的最大密度路径改进算法 Improved Algorithm for Finding Weight-constrained Maximum-density Path 计算机科学, 2014, 41(8): 122-124. https://doi.org/10.11896/j.issn.1002-137X.2014.08.027 |
[14] | 贺毅朝,田海燕,张新禄,王志威,高锁刚. 基于动态规划法求解动态o-1背包问题 Solving Dynamic 0-1 Knapsack Problems Based on Dynamic Programming Algorithm 计算机科学, 2012, 39(7): 237-241. |
[15] | 刘铭,徐杨,陈峥,梁瀚,孙婷婷. 基于Multi-Agent系统的多飞行器协同路径规划方法的研究 Decentralized Multi-Agent Based Cooperative Path Planning for Multi-UAVs 计算机科学, 2012, 39(1): 219-222. |
|