Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 210800212-6. doi: 10.11896/jsjkx.210800212

• Artificial Intelligence •

Multi-armed Bandit Model Based on Time-variant Budgets

LIN Bao-ling, JIA Ri-heng, LIN Fei-long, ZHENG Zhong-long, LI Ming-lu   

  1. College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, Zhejiang 321004, China
  • Online: 2022-11-10  Published: 2022-11-21
  • Corresponding author: JIA Ri-heng (rihengjia@zjnu.edu.cn)
  • About author: LIN Bao-ling, born in 1998, postgraduate (18205776899@163.com). Her main research interests include reinforcement learning and deep reinforcement learning.
    JIA Ri-heng, born in 1989, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include wireless networks, machine learning, and mobile charging.
  • Supported by:
    National Natural Science Foundation of China (61902358).

Abstract: Many budget-related multi-armed bandit models have been proposed, but the practical problems they can solve are limited: all of them must be subject to a single total budget over the whole process. Therefore, this paper proposes a multi-armed bandit model based on time-variant budgets, which breaks this limitation and can be used to solve a wider range of practical problems. The model captures the situation where the learner's actions in each round are limited by that round's budget. More specifically, at each round the player is required to pull L (L≥1) arms (L is not a fixed value) within the budget limit of that round. The player's goal is to maximize the total average reward subject to the budget limit of each round. Based on this model, a dynamic programming algorithm built on confidence bounds is proposed. The algorithm exploits the characteristics of the model: in each round it takes the upper confidence bound of each arm's empirical average reward as the selection criterion, and then uses dynamic programming to perform the arm-pulling operation. The paper further introduces the concept of regret and theoretically derives that the upper bound of the regret is related to the sum of the budgets. Finally, the feasibility of the proposed algorithm is verified by comparing it with several traditional budget-limited multi-armed bandit algorithms (ε-first, KUBE, BTS) under different scenarios.
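
The abstract describes the policy only at a high level, so the following is a minimal Python sketch of the general idea: maintain an upper confidence bound (UCB) on each arm's empirical average reward, and in every round solve a 0-1 knapsack by dynamic programming against that round's budget to decide which arms to pull. All names here (UCBKnapsackBandit, select_arms, update) and modeling details (known integer arm costs, rewards in [0, 1], at most one pull per arm per round) are illustrative assumptions, not taken from the paper.

import math
import random

class UCBKnapsackBandit:
    # Sketch of a round-by-round policy in the spirit of the abstract:
    # optimistic (UCB) estimates of each arm's average reward feed a
    # 0-1 knapsack solved by dynamic programming under this round's budget.
    # Assumed (not specified in the abstract): known integer costs,
    # rewards in [0, 1], at most one pull per arm per round.

    def __init__(self, costs):
        self.costs = costs
        self.n = len(costs)
        self.pulls = [0] * self.n    # times each arm has been pulled
        self.mean = [0.0] * self.n   # empirical average reward per arm

    def ucb(self, t):
        # Optimistic index: empirical mean plus an exploration bonus,
        # capped at 1.0 since rewards are assumed to lie in [0, 1];
        # an arm never pulled gets the maximal optimistic index.
        return [1.0 if self.pulls[i] == 0
                else min(1.0, self.mean[i]
                         + math.sqrt(2.0 * math.log(t) / self.pulls[i]))
                for i in range(self.n)]

    def select_arms(self, budget, t):
        # 0-1 knapsack by dynamic programming: pick the subset of arms
        # with maximal total UCB index whose total cost fits the budget.
        u = self.ucb(t)
        best = [0.0] * (budget + 1)              # best total index per capacity
        chosen = [[] for _ in range(budget + 1)]
        for i in range(self.n):
            c = self.costs[i]
            for b in range(budget, c - 1, -1):   # descending keeps it 0-1
                if best[b - c] + u[i] > best[b]:
                    best[b] = best[b - c] + u[i]
                    chosen[b] = chosen[b - c] + [i]
        return chosen[budget]

    def update(self, arm, reward):
        # Incremental update of the empirical average reward.
        self.pulls[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.pulls[arm]

# Toy run: 4 arms with Bernoulli rewards and a budget that varies from
# round to round, so the number of pulled arms L is not fixed.
costs = [1, 2, 2, 3]
true_p = [0.2, 0.5, 0.7, 0.9]
policy = UCBKnapsackBandit(costs)
for t in range(1, 201):
    budget = random.randint(2, 6)              # time-variant budget B_t
    for arm in policy.select_arms(budget, t):  # L >= 1 arms this round
        reward = 1.0 if random.random() < true_p[arm] else 0.0
        policy.update(arm, reward)

The knapsack step is what lets L vary with the round budget B_t: a larger B_t admits more, or costlier, arms. Under these assumptions, one plausible formalization of the regret the abstract refers to is R(T) = Σ_{t=1}^{T} (v*(B_t) − v_t(B_t)), the cumulative gap between the knapsack value under the true mean rewards and the value the policy actually collects, which matches the abstract's statement that the regret upper bound is related to the sum of the budgets.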

Key words: Multi-armed bandit, Time-variant budgets, Empirical average reward, Dynamic programming, Regret

CLC number: TP301