Popov Ilya edited this page Apr 25, 2021 · 2 revisions

RLIO

https://web.mst.edu/~gosavia/emj.pdf On DRL in supply chain; not a word about the reward

https://towardsdatascience.com/deep-reinforcement-learning-for-supply-chain-optimization-3e4d99ad4b58 On DRL in supply chain

  1. Has a good benchmark (!)
  2. Uses the standard reward function from or-gym: https://arxiv.org/pdf/2008.06319.pdf — the or-gym paper
  3. Section 5.1 — on inventory optimisation
  4. P. 15 — on the reward in IO
  5. Equation 7f — appears to define the reward
  • 7a - material balance for the on-hand inventory (I) at the beginning of each period n
  • 7b - pipeline inventory (T) at the beginning of each period n
  • 7c - relates the accepted reorder quantity (R) to the requested reorder quantity (R̂)
  • 7d - gives the sales S at each period, which equal the accepted reorder quantities from the succeeding stages for stages 1 through M and equal the fulfilled customer demands for the retailer (stage 0)
It should be noted that R^{-1}_{t} ≡ D_{t} + B^{0}_{t-1}
  • 7e - the unfulfilled demand (U) or unfulfilled reorder requests
  • 7f - The profit P is given by Equation 7f, which discounts the profit (sales revenue minus procurement costs, unfulfilled-demand costs, and excess-inventory holding costs) with a discount factor α. Here p, r, k, and h are the unit sales price, unit procurement cost, unit penalty for unfulfilled demand, and unit inventory holding cost at each stage m, respectively. For stage M, R^{M}_{t} ≡ S^{M}_{t} and r^{M} represents a raw material procurement cost. If backlogging is not allowed, any unfulfilled demand or procurement orders are lost sales and all B terms are set to 0.

Legend: I — on-hand inventory at the beginning of each period; T — pipeline inventory at the beginning of each period n; R — reorder quantity that arrived; S — sales in the current period
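The per-period reward of Equation 7f can be sketched as follows. This is a hedged reconstruction from the description above, not or-gym's actual code; all variable names (`profit_reward`, the dict-per-stage layout) are illustrative assumptions.

```python
# Hedged sketch of the Eq. 7f profit reward: discounted sum over stages m of
# (sales revenue - procurement costs - unfulfilled-demand penalty - holding costs).
# Names and data layout are illustrative, not or-gym's API.

def profit_reward(t, S, R, U, I, p, r, k, h, alpha=0.97):
    """S, R, U, I: dicts stage -> quantity in period t (sales, accepted reorders,
    unfulfilled demand, on-hand inventory); p, r, k, h: dicts stage -> unit sales
    price, unit procurement cost, unit shortage penalty, unit holding cost."""
    profit = sum(
        p[m] * S[m] - r[m] * R[m] - k[m] * U[m] - h[m] * I[m]
        for m in p
    )
    return alpha ** t * profit
```

With a single stage selling 10 units at price 2, reordering 8 units at cost 1, no shortages, and 5 units held at cost 0.1, the undiscounted reward is 20 − 8 − 0.5 = 11.5.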

http://essay.utwente.nl/85432/1/Geevers_MA_BMS.pdf A very recent thesis specifically on RL in IO

https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_44.pdf Case study on factories in Brazil (read earlier; already on GitHub). Uses REINFORCE, approximate Sarsa, and a fixed heuristic based on the (ς, Q)-policy as a baseline (described by Tempelmeier, 2011 — see Horst Tempelmeier. Inventory Management in Supply Networks: Problems, Models, Solutions. Books on Demand, Norderstedt, 2nd ed., 2011. ISBN 978-3-8423-4677-2. URL http://deposit.d-nb.de/cgi-bin/dokserv?id=3676720&prov=M&dok_var=1&dok_ext=htm). The (ς, Q)-policy itself: Fred Janssen, R. Heuts, and Ton de Kok. The Value of Information in an (r, s, Q) Inventory Model. Feb 1996.
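The fixed baseline above is a reorder-point policy: order a fixed lot when inventory position falls to or below the reorder point. A minimal sketch (names are illustrative, not from the paper):

```python
# Minimal sketch of an (s, Q) reorder-point policy, the fixed heuristic baseline:
# if the inventory position has dropped to the reorder point s or below,
# order the fixed lot size Q; otherwise order nothing.

def s_q_policy(inventory_position: float, s: float, Q: float) -> float:
    return Q if inventory_position <= s else 0.0
```

RL agents in such papers are typically compared against exactly this kind of non-learning rule, since it is the standard practice in inventory control.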

https://arxiv.org/pdf/2006.04037.pdf A3C. A very specific problem formulation: the number of SKUs and the number of items each unit of the network can handle are limited, and the reward is built around penalizing the agent for ordering a lot of one product while ordering none of another
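Purely as an illustration of such an imbalance penalty (the actual formula is in the paper; this function, its name, and the deviation-from-uniform measure are all our assumptions):

```python
import numpy as np

# Illustrative sketch of an "imbalance penalty" in the spirit of the description
# above: the penalty grows as order quantities concentrate on some SKUs while
# others get none. NOT the paper's formula.

def imbalance_penalty(order_quantities, weight=1.0):
    q = np.asarray(order_quantities, dtype=float)
    if q.sum() == 0:
        return 0.0
    shares = q / q.sum()          # fraction of the order going to each SKU
    uniform = 1.0 / len(q)        # perfectly balanced share
    return -weight * float(np.abs(shares - uniform).sum())
```

A balanced order vector gets penalty 0; putting the whole order on one of two SKUs gets the maximum penalty.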

https://link.springer.com/content/pdf/10.1007/s11740-020-01000-8.pdf Deep Q-learning in a linear logistics network.

The decision criterion for controlling the existing process chain is formed by the resulting costs, which are subdivided into storage and backorder costs. The storage costs result from existing stocks due to excessive order quantities at the intermediate stations; the backlog costs, in turn, result from insufficient order quantities. To prevent an over-stimulation of the learning algorithm, the value of the reward function is limited to the interval [−1, 1]. This is achieved by a function (given in the paper) under which the reward at time t is calculated as a function of the storage and backlog costs C_s and C_late. Using the weighting factors f_sc and f_late, the relative weights of these two variables can be varied as a function of the observed learning behavior and the case-specific target variables. Depending on the size of the system, the total costs are scaled by the total cost factor f_c so that they lie within the defined interval of the reward function. In addition, f_c can be used to adjust the weighting of the reward function within the gradient-descent optimization by considering higher costs to a greater or lesser extent. For optimization, stochastic gradient descent (SGD) was selected as the optimizer, and the error was calculated as the mean absolute error (MAE). For the reward calculation, the late-order and storage costs are weighted equally with f_sc = f_late = 1, and the overall costs are weighted with f_c = 0.02 for all conducted experiments.
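Since the paper's exact formula is not reproduced in the excerpt, here is a hedged reconstruction consistent with the description: weighted storage and backlog costs, scaled by f_c, with the result bounded to [−1, 1] (the explicit clipping is our assumption; the function name is illustrative):

```python
# Hedged reconstruction of the DQN reward described above (NOT the paper's exact
# formula): negative weighted cost, scaled by the total-cost factor f_c, and
# bounded to the interval [-1, 1]. Clipping is our assumption.

def dqn_reward(C_s, C_late, f_sc=1.0, f_late=1.0, f_c=0.02):
    cost = f_c * (f_sc * C_s + f_late * C_late)
    return max(-1.0, min(1.0, -cost))  # lower costs -> reward closer to 0
```

With the paper's reported settings (f_sc = f_late = 1, f_c = 0.02), total costs up to 50 map linearly into [−1, 0]; anything above saturates at −1.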

https://libstore.ugent.be/fulltxt/RUG01/002/790/831/RUG01-002790831_2019_0001_AC.pdf Someone's very impressive thesis from abroad on RLIO
