Using eligibility traces -> E(s,a) for efficient credit assignment of rewards to (state, action) pairs. More specifically: Dutch eligibility traces.
-- λ: trace decay parameter, set to 0.1
-- α: learning rate, α ∈ (0, 1]
-- λ: has to be fine-tuned as training proceeds, λ ∈ (0, 0.3)
-- R ∈ [0, 1]
-- gradually decrease the temperature β in the softmax action selection to slowly increase exploitation (see the sketch below).
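A minimal sketch of the softmax (Boltzmann) action selection with an annealed temperature β, as noted above. The helper names and the schedule constants (beta_start, beta_min, decay) are illustrative assumptions, not values from these notes.

```python
import numpy as np

def softmax_action(q_values, beta):
    """Sample an action with probability proportional to exp(Q / beta)."""
    prefs = q_values / beta
    prefs -= prefs.max()               # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(q_values), p=probs)

def anneal_beta(episode, beta_start=1.0, beta_min=0.05, decay=0.995):
    """Exponentially decay the temperature; a lower beta means greedier (more exploitative) behaviour."""
    return max(beta_min, beta_start * decay ** episode)
```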
Learning policy: Q-learning instead of SARSA -> q(s_t, a_t) ← q(s_t, a_t) + α·(R_{t+1} + γ·max_{a'} q(s_{t+1}, a') - q(s_t, a_t))
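A minimal sketch of one training episode combining this Q-learning update with the eligibility traces from the list above, for the tabular case. Assumptions not in these notes: the Dutch trace is taken to reduce, for a tabular (one-hot) representation, to decaying all traces by γλ and then bumping the visited pair by 1 - α·E[s,a]; traces are cut after a non-greedy action (Watkins-style Q(λ)); the environment interface (env.reset(), env.step() returning (state, reward, done)) and the default γ are placeholders. It reuses softmax_action from the sketch above.

```python
import numpy as np

def run_episode(env, Q, alpha=0.1, gamma=0.9, lam=0.1, beta=1.0):
    """One episode of tabular Q-learning with Dutch-style eligibility traces (sketch)."""
    E = np.zeros_like(Q)                      # eligibility traces E(s, a)
    s = env.reset()                           # assumed to return an integer state index
    done = False
    while not done:
        a = softmax_action(Q[s], beta)        # Boltzmann exploration (see sketch above)

        # Watkins-style cut (assumption): an exploratory action breaks the greedy
        # return that earlier traces were accumulating toward, so clear them.
        if a != int(Q[s].argmax()):
            E.fill(0.0)

        s_next, r, done = env.step(a)         # assumed to return (state, reward, done)

        # TD error with the off-policy Q-learning target: R_{t+1} + gamma * max_a' Q(s', a')
        td_error = r + (0.0 if done else gamma * Q[s_next].max()) - Q[s, a]

        # Dutch-style trace: decay all traces by gamma*lambda, then bump the
        # visited pair so that E[s,a] becomes (1 - alpha)*gamma*lambda*E[s,a] + 1.
        E *= gamma * lam
        E[s, a] += 1.0 - alpha * E[s, a]

        # Propagate the TD error to all recently visited (state, action) pairs.
        Q += alpha * td_error * E

        s = s_next
    return Q
```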