Popov Ilya edited this page Apr 25, 2021 · 2 revisions

RLIO

https://web.mst.edu/~gosavia/emj.pdf On DRL in supply chain; not a word about the reward

https://towardsdatascience.com/deep-reinforcement-learning-for-supply-chain-optimization-3e4d99ad4b58 On DRL in supply chain

  1. Has a good benchmark (!)
  2. Uses the standard reward function from or-gym: https://arxiv.org/pdf/2008.06319.pdf — the or-gym paper
  3. Section 5.1 — on inventory optimisation
  4. P. 15 — on the reward in IO
  5. Equation 7f — appears to define the reward
  • 7a - material balance for the on-hand inventory (I) at the beginning of each period n
  • 7b - pipeline inventory (T) at the beginning of each period n
  • 7c - relates the accepted reorder quantity (R) to the requested reorder quantity (R̂)
  • 7d - gives the sales S at each period, which equal the accepted reorder quantities from the succeeding stages for stages 1 through M and equal the fulfilled customer demands for the retailer (stage 0)
It should be noted that R^{-1}_{t} ≡ D_{t} + B^{0}_{t-1}
  • 7e - the unfulfilled demand (U) or unfulfilled reorder requests
  • 7f - The profit P is given by Equation 7f, which discounts the profit (sales revenue minus procurement costs, unfulfilled-demand costs, and excess-inventory holding costs) with a discount factor α. Here p, r, k, and h are the unit sales price, unit procurement cost, unit penalty for unfulfilled demand, and unit inventory holding cost at each stage m, respectively. For stage M, R^{M}_{t} ≡ S^{M}_{t} and r^{M} represents a raw material procurement cost. If backlogging is not allowed, any unfulfilled demand or procurement orders are lost sales and all B terms are set to 0.

Legend: I — on-hand inventory at the beginning of each period; T — pipeline inventory at the beginning of each period n; R — reorder quantity that arrived; S — sales in the current period
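The per-period reward of Equation 7f can be sketched as follows. This is a hedged reconstruction from the description above, not or-gym's actual code; all variable names (`profit_reward`, the dict-per-stage layout) are illustrative assumptions.

```python
# Hedged sketch of the Eq. 7f profit reward: discounted sum over stages m of
# (sales revenue - procurement costs - unfulfilled-demand penalty - holding costs).
# Names and data layout are illustrative, not or-gym's API.

def profit_reward(t, S, R, U, I, p, r, k, h, alpha=0.97):
    """S, R, U, I: dicts stage -> quantity in period t (sales, accepted reorders,
    unfulfilled demand, on-hand inventory); p, r, k, h: dicts stage -> unit sales
    price, unit procurement cost, unit shortage penalty, unit holding cost."""
    profit = sum(
        p[m] * S[m] - r[m] * R[m] - k[m] * U[m] - h[m] * I[m]
        for m in p
    )
    return alpha ** t * profit
```

With a single stage selling 10 units at price 2, reordering 8 units at cost 1, no shortages, and 5 units held at cost 0.1, the undiscounted reward is 20 − 8 − 0.5 = 11.5.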

http://essay.utwente.nl/85432/1/Geevers_MA_BMS.pdf A very recent thesis specifically on RL in IO

https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_44.pdf Case study on factories in Brazil (read earlier; already on GitHub). Uses REINFORCE, approximate Sarsa, and a fixed heuristic based on the (ς, Q)-policy as a baseline (described by Tempelmeier, 2011 — see Horst Tempelmeier. Inventory Management in Supply Networks: Problems, Models, Solutions. Books on Demand, Norderstedt, 2nd ed., 2011. ISBN 978-3-8423-4677-2. URL http://deposit.d-nb.de/cgi-bin/dokserv?id=3676720&prov=M&dok_var=1&dok_ext=htm). The (ς, Q)-policy itself: Fred Janssen, R. Heuts, and Ton de Kok. The Value of Information in an (r, s, Q) Inventory Model. Feb 1996.
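The fixed baseline above is a reorder-point policy: order a fixed lot when inventory position falls to or below the reorder point. A minimal sketch (names are illustrative, not from the paper):

```python
# Minimal sketch of an (s, Q) reorder-point policy, the fixed heuristic baseline:
# if the inventory position has dropped to the reorder point s or below,
# order the fixed lot size Q; otherwise order nothing.

def s_q_policy(inventory_position: float, s: float, Q: float) -> float:
    return Q if inventory_position <= s else 0.0
```

RL agents in such papers are typically compared against exactly this kind of non-learning rule, since it is the standard practice in inventory control.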

https://arxiv.org/pdf/2006.04037.pdf A3C. A very specific problem formulation: the number of SKUs and the number of items each unit of the network can handle are limited, and the reward is built around penalizing the agent for ordering a lot of one product while ordering none of another
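Purely as an illustration of such an imbalance penalty (the actual formula is in the paper; this function, its name, and the deviation-from-uniform measure are all our assumptions):

```python
import numpy as np

# Illustrative sketch of an "imbalance penalty" in the spirit of the description
# above: the penalty grows as order quantities concentrate on some SKUs while
# others get none. NOT the paper's formula.

def imbalance_penalty(order_quantities, weight=1.0):
    q = np.asarray(order_quantities, dtype=float)
    if q.sum() == 0:
        return 0.0
    shares = q / q.sum()          # fraction of the order going to each SKU
    uniform = 1.0 / len(q)        # perfectly balanced share
    return -weight * float(np.abs(shares - uniform).sum())
```

A balanced order vector gets penalty 0; putting the whole order on one of two SKUs gets the maximum penalty.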

https://link.springer.com/content/pdf/10.1007/s11740-020-01000-8.pdf Deep Q-learning in a linear logistics network.

The decision criterion for controlling the existing process chain is formed by the resulting costs, which are subdivided into storage and backorder costs. The storage costs result from existing stocks due to excessive order quantities at the intermediate stations; the backlog costs, in turn, result from insufficient order quantities. To prevent an over-stimulation of the learning algorithm, the value of the reward function is limited to the interval [−1, 1]. This is achieved by a function (given in the paper) under which the reward at time t is calculated as a function of the storage and backlog costs C_s and C_late. Using the weighting factors f_sc and f_late, the relative weights of these two variables can be varied as a function of the observed learning behavior and the case-specific target variables. Depending on the size of the system, the total costs are scaled by the total cost factor f_c so that they lie within the defined interval of the reward function. In addition, f_c can be used to adjust the weighting of the reward function within the gradient-descent optimization by considering higher costs to a greater or lesser extent. For optimization, stochastic gradient descent (SGD) was selected as the optimizer, and the error was calculated as the mean absolute error (MAE). For the reward calculation, the late-order and storage costs are weighted equally with f_sc = f_late = 1, and the overall costs are weighted with f_c = 0.02 for all conducted experiments.
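Since the paper's exact formula is not reproduced in the excerpt, here is a hedged reconstruction consistent with the description: weighted storage and backlog costs, scaled by f_c, with the result bounded to [−1, 1] (the explicit clipping is our assumption; the function name is illustrative):

```python
# Hedged reconstruction of the DQN reward described above (NOT the paper's exact
# formula): negative weighted cost, scaled by the total-cost factor f_c, and
# bounded to the interval [-1, 1]. Clipping is our assumption.

def dqn_reward(C_s, C_late, f_sc=1.0, f_late=1.0, f_c=0.02):
    cost = f_c * (f_sc * C_s + f_late * C_late)
    return max(-1.0, min(1.0, -cost))  # lower costs -> reward closer to 0
```

With the paper's reported settings (f_sc = f_late = 1, f_c = 0.02), total costs up to 50 map linearly into [−1, 0]; anything above saturates at −1.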

https://libstore.ugent.be/fulltxt/RUG01/002/790/831/RUG01-002790831_2019_0001_AC.pdf Someone's very impressive thesis from abroad on RLIO
