计算机科学 ›› 2022, Vol. 49 ›› Issue (11A): 210800261-9. doi: 10.11896/jsjkx.210800261

• 人工智能 •

基于后状态强化学习的最优订单接受决策

钱静, 吴克宇, 陈超, 胡星辰   

  1. 国防科技大学系统工程学院 长沙 410073
  • 出版日期:2022-11-10 发布日期:2022-11-21
  • 通讯作者: 吴克宇(keyuwu@nudt.edu.cn)
  • 作者简介:(2516591697@qq.com)
  • 基金资助:
    国家自然科学基金青年科学基金项目(62001495);湖南省自然科学基金青年科学基金项目(2020JJ5675)

Optimal Order Acceptance Decision Based on After-state Reinforcement Learning

QIAN Jing, WU Ke-yu, CHEN Chao, HU Xing-chen   

  1. College of Systems Engineering,National University of Defense Technology,Changsha 410073,China
  • Online:2022-11-10 Published:2022-11-21
  • About author:QIAN Jing, born in 1998, postgraduate. Her main research interests include reinforcement learning and computer-based intelligent decision-making technology.
    WU Ke-yu, born in 1990, assistant professor. His main research interests include reinforcement learning, deep learning and their applications in networked systems.
  • Supported by:
    National Natural Science Foundation of China(62001495) and Natural Science Foundation of Hunan Province,China(2020JJ5675).

摘要: 随着客户需求的多样化程度不断提升,根据客户对订单的不同需求来组织生产的订单生产型(Make-To-Order,MTO)模式在企业生产活动中越来越重要。根据企业有限的生产能力和订单状态来确定是否接受到达的订单,对企业提高利润至关重要。在传统的订单接受问题基础上,提出了更完备的MTO企业订单接受问题模型:在延期交货成本、拒绝成本、生产成本等传统模型要素的基础上,进一步考虑了订单的库存成本和多种顾客优先级因素,并将最优订单接受决策问题建模为马尔可夫决策过程(Markov Decision Process,MDP)。此外,由于经典的MDP求解方法依赖于对高维状态价值函数的求解和估计,计算复杂性较高,为了降低复杂性,证明了经典MDP问题中基于状态价值函数的最优策略可以等价地用基于后状态的价值函数进行定义和构造,从而将多维控制问题转化为一维控制问题。同时,为了处理连续且较大的状态空间,结合神经网络对后状态价值函数进行参数化表征。最后,通过仿真验证了所提出的订单接受策略模型和算法的适用性和优越性。
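为便于理解摘要中"后状态等价"的思路,下面给出一个示意性的最小化表述(其中后状态映射 $g$、后状态 $y$ 与到达分布 $P$ 均为说明用的假设记号,并非原文模型):设状态 $s$ 由决策前的剩余工作量与当前到达订单组成,动作 $a\in\{\text{接受},\text{拒绝}\}$ 执行后,系统先确定性地转移到后状态 $y=g(s,a)$,再由随机的订单到达得到下一状态 $s'\sim P(\cdot\mid y)$,则

$$V^{*}(s)=\max_{a}\big[r(s,a)+\gamma\,V^{\mathrm{as}}\big(g(s,a)\big)\big],\qquad V^{\mathrm{as}}(y)=\mathbb{E}_{s'\sim P(\cdot\mid y)}\Big[\max_{a}\big(r(s',a)+\gamma\,V^{\mathrm{as}}\big(g(s',a)\big)\big)\Big].$$

由于后状态 $y$ 只需刻画决策后的剩余工作量,而不再携带到达订单的多维属性,对 $V^{\mathrm{as}}$ 的贪心决策即可复现基于 $V^{*}$ 的最优策略,这也是"将多维控制问题转化为一维控制问题"的直观含义。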

关键词: 订单接受, 强化学习, 马尔可夫决策过程, 神经网络, 后状态

Abstract: As customer demand becomes increasingly diversified, the make-to-order (MTO) model, i.e., organizing production according to customers' orders, has attracted more and more attention from industry. Determining whether to accept an incoming order according to the enterprise's limited production capacity and current order status is crucial for improving profits. On the basis of traditional order acceptance problems, this paper proposes a more complete model. Besides the traditional model elements (including delayed delivery cost, rejection cost and production cost), we further consider the order inventory cost and multiple customer priority levels, and model the optimal order acceptance problem as a Markov decision process (MDP). Because classic MDP solution methods rely on solving and estimating a high-dimensional state value function, their computational complexity is high. To reduce this complexity, this paper proves that the optimal policy defined by the state value function in the classical MDP can be equivalently defined and constructed with a value function over after-states, thus transforming the multi-dimensional control problem into a one-dimensional control problem. Meanwhile, to handle the continuous and large state space, a neural network is used to parameterize the after-state value function. Finally, simulation experiments verify the applicability and superiority of the proposed order acceptance model and algorithm.
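To make the neural parameterization concrete, the following is a minimal illustrative sketch in Python, not the paper's implementation: an assumed small feed-forward network approximates the after-state value of the post-decision workload, and an arriving order is accepted greedily when the accept branch's immediate reward plus discounted after-state value exceeds that of rejection. The field names, cost terms, network size and decision rule are assumptions made for illustration, and training of the network (e.g., by fitted value iteration over collected after-states) is omitted.

# Illustrative sketch only: a small neural network approximates the after-state
# value V_as(workload); an incoming order is accepted when the accept branch's
# immediate reward plus the discounted after-state value beats rejection.
# All field names, cost terms and network sizes are assumed for illustration.
import torch
import torch.nn as nn

class AfterStateValue(nn.Module):
    """Maps the post-decision workload (a scalar) to its estimated value."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, workload):
        return self.net(workload)

def decide(v_as, workload, order, gamma=0.99):
    """Greedy accept/reject decision by comparing the two after-states."""
    # After-state if the order is accepted: workload grows by its processing time.
    y_accept = torch.tensor([[workload + order["processing_time"]]])
    # After-state if the order is rejected: workload is unchanged.
    y_reject = torch.tensor([[workload]])
    # Immediate rewards: revenue net of an estimated tardiness/holding penalty,
    # versus the cost of rejecting the order (all values are illustrative).
    r_accept = order["revenue"] - order["est_penalty"]
    r_reject = -order["rejection_cost"]
    with torch.no_grad():
        q_accept = r_accept + gamma * v_as(y_accept).item()
        q_reject = r_reject + gamma * v_as(y_reject).item()
    return "accept" if q_accept >= q_reject else "reject"

if __name__ == "__main__":
    v_as = AfterStateValue()
    order = {"processing_time": 5.0, "revenue": 40.0,
             "est_penalty": 3.0, "rejection_cost": 10.0}
    print(decide(v_as, workload=12.0, order=order))

Before training, the decision is of course arbitrary; the sketch only shows how comparing two after-state evaluations replaces estimating a value function over the full multi-dimensional (workload, order) state.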

Key words: Order acceptance, Reinforcement learning, Markov decision process, Neural network, After-state

中图分类号: TP399