Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 210800212-6. doi: 10.11896/jsjkx.210800212

• Artificial Intelligence •

Multi-armed Bandit Model Based on Time-variant Budgets

LIN Bao-ling, JIA Ri-heng, LIN Fei-long, ZHENG Zhong-long, LI Ming-lu   

  1. College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, Zhejiang 321004, China
  • Online: 2022-11-10  Published: 2022-11-21
  • About author: LIN Bao-ling, born in 1998, postgraduate. Her main research interests include reinforcement learning and deep reinforcement learning.
    JIA Ri-heng, born in 1989, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include wireless networks, machine learning, and mobile charging.
  • Supported by:
    National Natural Science Foundation of China (61902358).

Abstract: Many budget-related multi-armed bandit models have been proposed, but the practical problems they can solve are limited: all of them assume a single fixed total budget. This paper therefore proposes a multi-armed bandit model based on time-variant budgets, which removes this limitation and applies to a wider range of practical problems. The model captures settings in which the learner's actions in each round are constrained by that round's own budget. More specifically, at each round the player must pull L (L≥1) arms within the budget limit of that round, where L is not a fixed value. The player's goal is to maximize the total average reward subject to the per-round budget limits. For this model, a dynamic programming algorithm based on confidence bounds is proposed. Exploiting the structure of the model, the algorithm uses the upper confidence bound of each arm's empirical average reward as the per-round selection criterion and then applies dynamic programming to carry out the arm pulls. The concept of regret is introduced, and it is shown theoretically that the upper bound on regret is related to the sum of the budgets. Finally, the feasibility of the proposed algorithm is verified by comparing it with traditional budget-limited multi-armed bandit algorithms (ε-first, KUBE, BTS) in different scenarios.
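
To make the objective concrete: the abstract does not spell out the exact definition of regret, but in the standard budgeted-bandit reading it is the gap between the expected cumulative reward of the budget-feasible optimal policy and that of the algorithm. A hedged reconstruction:

```latex
\mathrm{Regret}(T)
  = \mathbb{E}\Bigl[\sum_{t=1}^{T}\sum_{i \in S_t^{*}} r_{i,t}\Bigr]
  - \mathbb{E}\Bigl[\sum_{t=1}^{T}\sum_{i \in S_t} r_{i,t}\Bigr],
\qquad
\sum_{i \in S_t} c_i \le B_t \ \text{for every round } t,
```

where B_t is the budget of round t, c_i is the cost of pulling arm i, S_t* is an optimal budget-feasible pull set, and S_t is the set the algorithm chooses.

On the algorithmic side, the sketch below illustrates one plausible reading of the proposed confidence-bound-plus-dynamic-programming procedure: at each round, compute an upper confidence bound on every arm's empirical average reward, then solve a 0-1 knapsack by dynamic programming with the round budget as capacity to decide which arms to pull. This is a minimal sketch, not the authors' implementation; the UCB1-style index, the integer costs, the Bernoulli rewards, and all names (ucb_index, select_arms, run_episode) are illustrative assumptions.

```python
import math
import random


def ucb_index(mean_reward, pulls, total_pulls):
    """UCB1-style optimistic estimate of an arm's mean reward.
    An assumed index with rewards taken to lie in [0, 1]; the paper's
    exact confidence bound may differ."""
    if pulls == 0:
        return 2.0  # optimistic placeholder that forces initial exploration
    return mean_reward + math.sqrt(2.0 * math.log(max(total_pulls, 2)) / pulls)


def select_arms(ucbs, costs, budget):
    """Pick a subset of arms whose total cost fits the round budget and whose
    summed UCB value is maximal: classic 0-1 knapsack dynamic programming,
    assuming integer costs, in O(K * budget) time and space."""
    K = len(ucbs)
    dp = [0.0] * (budget + 1)   # dp[b] = best UCB sum achievable at capacity b
    used = [[False] * (budget + 1) for _ in range(K)]
    for i in range(K):
        for b in range(budget, costs[i] - 1, -1):  # reverse: each arm at most once
            cand = dp[b - costs[i]] + ucbs[i]
            if cand > dp[b]:
                dp[b] = cand
                used[i][b] = True
    chosen, b = [], budget      # backtrack to recover the chosen pull set
    for i in range(K - 1, -1, -1):
        if used[i][b]:
            chosen.append(i)
            b -= costs[i]
    return chosen


def run_episode(budgets, true_means, costs, seed=0):
    """Play one episode under time-variant budgets. Bernoulli rewards are a
    stand-in for whatever reward distributions the model actually assumes."""
    rng = random.Random(seed)
    K = len(true_means)
    pulls, sums, total, reward = [0] * K, [0.0] * K, 0, 0.0
    for B_t in budgets:         # each round t has its own budget B_t
        means = [sums[i] / pulls[i] if pulls[i] else 0.0 for i in range(K)]
        ucbs = [ucb_index(means[i], pulls[i], total) for i in range(K)]
        for i in select_arms(ucbs, costs, B_t):
            r = 1.0 if rng.random() < true_means[i] else 0.0
            pulls[i] += 1
            sums[i] += r
            total += 1
            reward += r
    return reward


# Example: three arms, per-round budgets that vary over time.
print(run_episode(budgets=[3, 1, 4, 2, 5],
                  true_means=[0.2, 0.5, 0.8],
                  costs=[1, 2, 1]))
```

Note that the reverse iteration over the budget in select_arms enforces that each arm is pulled at most once per round, matching the reading that the player chooses a set of distinct arms; the size L of the chosen set then emerges from the knapsack solution rather than being fixed in advance.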

Key words: Multi-armed bandit, Time-variant budgets, Empirical average reward, Dynamic programming, Regret

CLC Number: TP301

References:
[1]CHEN K.Research on Recommendation System Based on Multi-armed Bandit Algorithm[J].Journal of Changjiang Information and Communication,2021,34(3):43-46.
[2]LI J F,LIAO S Z.Adversarial Multi-armed Bandit Model with Online Kernel Selection [J].Computer Science,2019,46(1):57-63.
[3]SHI J,ZHENG J L,YUAN Y,et al.RFID Multi-reader Channel Resources Allocation Algorithm Based on Whittle Index[J].Computer Science,2019,46(10):122-127.
[4]SCHULZ E,FRANKLIN N T,GERSHMAN S J.Finding Structure in Multi-armed Bandits[J].Cognitive Psychology,2020,119:1-35.
[5]ZHANG X F,ZHOU Q,LIANG B,et al.An Adaptive Algo-rithm in Multi-armed Bandit Problem[J].Journal of Computer Research and Development,2019,56(3):643-654.
[6]ZUO J H,ZHANG X X,JOE-WONG C.Observe Before Play:Multi-armed Bandit with Pre-observations[C]//Proc of the AAAI Conf on Artificial Intelligence.New York:AAAI,2020:7023-7030.
[7]WANG S W,HUANG L B.Multi-armed Bandits with Compensation[C]//Proc of the 32nd Conf on Neural Information Processing Systems.Montreal:NeurIPS,2018:5114-5122.
[8]PATIL V,GHALME G,NAIR V,et al.Achieving Fairness in the Stochastic Multi-armed Bandit Problem[C]//Proc of the AAAI Conf on Artificial Intelligence.New York:AAAI,2020:5379-5386.
[9]GUTOWSKI N,AMGHAR T,CAMP O,et al.Context Enhancement for Linear Contextual Multi-armed Bandits[C]//Proc of IEEE Int Conf on Tools with Artificial Intelligence.Volos:IEEE,2018:1048-1055.
[10]MANICKAM I,LAN A S,BARANIUK R G.Contextual Multi-armed Bandit Algorithms for Personalized Learning Action Selection[C]//Proc of IEEE Int Conf on Acoustics.New Orleans:IEEE,2017:6344-6348.
[11]AZIZ M,ANDERTON J,KAUFMANN E,et al.Pure Exploration in Infinitely-armed Bandit Models with Fixed-confidence[C]//Proc of Algorithmic Learning Theory Int Conf.Lanzarote:ALT,2018:3-24.
[12]BUBECK S,MUNOS R,STOLTZ G.Pure Exploration in Multi-armed Bandits Problems[C]//Proc of Algorithmic Learning Theory Int Conf.Porto:ALT,2009:23-37.
[13]XUE Y,ZHOU P,MAO S W,et al.Pure-exploration Bandits for Channel Selection in Mission-critical Wireless Communications[J].IEEE Transactions on Vehicular Technology,2018,67(11):10995-11007.
[14]TRAN-THANH L,CHAPMAN A,ROGERS A,et al.Knapsack Based Optimal Policies for Budget-limited Multi-armed Bandits[C]//Proc of the AAAI Conf on Artificial Intelligence.Toronto:AAAI,2012:1134-1140.
[15]TRAN-THANH L,CHAPMAN A,ROGERS A,et al.Epsilon-first Policies for Budget-limited Multi-armed Bandits[C]//Proc of the AAAI Conf on Artificial Intelligence.Atlanta:AAAI,2010:1211-1216.
[16]XIA Y C,LI H F,QIN T,et al.Thompson Sampling for Budgeted Multi-armed Bandits[C]//Proc of the 24th International Joint Conf on Artificial Intelligence.Buenos Aires:IJCAI,2015:3960-3966.
[17]DING W K,QIN T,ZHANG X D,et al.Multi-armed Bandit with Budget Constraint and Variable Costs[C]//Proc of the AAAI Conf on Artificial Intelligence.Washington:AAAI,2013:232-238.
[18]XIA Y C,QIN T,MA W D,et al.Budgeted Multi-armed Bandits with Multiple Plays[C]//Proc of the 25th International Joint Conf on Artificial Intelligence.New York:IJCAI,2016:2210-2216.
[19]ZHOU D P,TOMLIN C J.Budget-constrained Multi-armed Bandits with Multiple Plays[C]//Proc of the AAAI Conf on Artificial Intelligence.New Orleans:AAAI,2018:4572-4579.