QLearning comparison
Scripts for detailed procedure-learning comparisons between branches are available in OpenNARS-for-Applications/misc/evaluation/. On this page we compare the QLearnerComparison branch, which implements a Q-learner-based agent, with master.
The comparison scripts expect the following directory structure:
BASE/master <- v0.8.5 (built via its build.sh)
BASE/QLearnerComparison <- QLearnerComparison branch (built via its build.sh)
BASE/comparison.py (copy from master, OpenNARS-for-Applications/misc/evaluation/)
BASE/plot.py (copy from master, OpenNARS-for-Applications/misc/evaluation/)
First run BASE/comparison.py to generate the outputs from the procedure learning examples. Then run BASE/plot.py to generate the output plots from the generated data.
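The steps above can be sketched as a shell script. This is a hypothetical setup sketch, not part of the repository; the clone URLs and branch names are assumptions that should be adjusted to your fork:

```shell
# Hypothetical setup sketch; adjust clone URLs/branches as needed.
BASE="$PWD/BASE"
mkdir -p "$BASE"
# Clone and build both branches (each via its own build.sh, as described above):
# git clone -b master https://github.com/opennars/OpenNARS-for-Applications "$BASE/master"
# (cd "$BASE/master" && ./build.sh)
# git clone -b QLearnerComparison https://github.com/opennars/OpenNARS-for-Applications "$BASE/QLearnerComparison"
# (cd "$BASE/QLearnerComparison" && ./build.sh)
# Copy the evaluation scripts from master:
# cp "$BASE/master/misc/evaluation/comparison.py" "$BASE/comparison.py"
# cp "$BASE/master/misc/evaluation/plot.py" "$BASE/plot.py"
# Generate the outputs, then the plots:
# (cd "$BASE" && python3 comparison.py && python3 plot.py)
echo "directory skeleton created at $BASE"
```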
While hyperparameters usually exist for which Q-learning converges to a suitable policy on a given example, they have to be obtained either by meta-optimization or by hand-tuning, both of which take example-specific performance into account. ONA's parameters, on the other hand, are mostly independent of particular examples:
- ONA does not rely on learning rates or on schedules for changing them. New evidence always carries equal weight, and how much it changes an existing belief depends only on the amount of evidence that already supports it, which automatically makes well-reinforced beliefs more stable.
- ONA reduces motorbabbling by itself once the hypotheses its decisions are based on are stable and predict successfully, and hence does not depend on a time-dependent reduction of alpha.
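The equal weighting of evidence can be illustrated with a simplified sketch of NAL-style evidence accumulation, where belief stability grows with the total evidence count rather than with a learning rate. The formulas (frequency f = w+/w, confidence c = w/(w+k) with evidential horizon k = 1) follow NAL conventions; the function name is illustrative only:

```python
# Simplified sketch of NAL-style evidence accumulation (k = 1).
# Every observation contributes one unit of evidence; no learning rate
# is involved, so new evidence always carries equal weight.
def revise(w_plus, w, observation_positive):
    w_plus += 1 if observation_positive else 0
    w += 1
    frequency = w_plus / w    # proportion of positive evidence
    confidence = w / (w + 1)  # grows with total evidence: beliefs stabilize
    return w_plus, w, frequency, confidence

w_plus, w = 0, 0
for obs in [True, True, False, True]:
    w_plus, w, f, c = revise(w_plus, w, obs)
# After 4 observations: f = 3/4 = 0.75, c = 4/5 = 0.8
```

Note how a single conflicting observation shifts a well-supported belief only slightly, since its impact shrinks as accumulated evidence grows.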
All time dependencies of hyperparameters are implicitly example-specific, and hence have to be avoided when generality is evaluated. Decaying the learning rate hampers the learner's ability to keep adapting to new circumstances after some time, and temporal reduction of motorbabbling keeps it from even attempting alternative solutions. Generally, a lower learning rate makes the Q-learner learn more slowly, a potentially bad choice whenever learning speed matters. But a high learning rate is problematic too: it can keep the Q-learner from refining its behavior further, essentially letting it oscillate between options.
With parameter decay ruled out, the question remains what a good tradeoff between learning speed and success ratio after learning is, i.e. how high the learning rate should be, which of course depends on the requirements. Most importantly, ONA, and NARS in general, is not affected by this tradeoff, due to its way of handling uncertainty and measuring evidence, in particular allowing multiple hypotheses to coexist to avoid overfitting. These aspects allow it to learn as quickly as the evidence and resources allow, while not depending on implicit example-specific parameter assumptions.
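The learning-rate tradeoff can be seen directly in the standard tabular Q-learning update, here with the alpha and gamma values used in the comparison below. This is a minimal illustrative sketch, not code from the comparison scripts:

```python
from collections import defaultdict

GAMMA, ALPHA = 0.8, 0.1   # values used in the comparison below
ACTIONS = ["left", "right"]

Q = defaultdict(float)    # tabular Q-values, default 0.0

def q_update(state, action, reward, next_state):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Repeatedly rewarding the same transition: with alpha = 0.1 each update only
# moves Q(s,a) a tenth of the way toward the target, so learning is gradual,
# while a large alpha would make Q(s,a) jump around with noisy rewards.
for _ in range(3):
    q_update("s0", "right", 1.0, "terminal")
# Q[("s0", "right")] is now 0.271 (0.1, then 0.19, then 0.271)
```

Because each update blends the old estimate with the new target, a fixed alpha keeps the learner adaptive, but the speed/stability tradeoff it encodes remains example-dependent.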
Example plots from two Pong variants, obtainable with the scripts mentioned above, displayed for the first 10K steps, which show the largest differences in behavior (Q-learning parameters: Gamma = 0.8, Lambda = 0.1, Alpha = 0.1; ONA parameters: default config of ONA v0.8.5):
This highlights ONA's quicker learning: it reaches high success rates much earlier on average, which is visible both in pong and in pong2:
In pong2, the performance after 10K steps turned out to be comparable, though again ONA learned faster. Please also note that in pong2 only ONA managed to learn the use of the ^stop operator, at least in 2 of 10 runs, which gave it a success-ratio lead in those particular cases.
Unfortunately, with the same parameters, the Q-learner struggled to find a good state-action mapping in Cartpole within 10K steps, except for two runs where it started to look promising, while ONA performed excellently in all runs:
Lastly, in the ONA Alien example (a form of Space Invaders), ONA learns faster on average, while the end performance after 10K steps is comparable (as tends to be the case when both techniques succeed, due to the similar behavior found):
Clearly, ONA demands more computational resources, for several reasons. First, the correlations it mines are not restricted to reward as the consequent; this allows it to learn temporal patterns even in the absence of reward or goal fulfillment. Second, correlating is not all its inference does, and the general inference mechanisms have their cost.
There is a similarity between the decision theories of NARS and RL, in that both tend to choose the actions most likely to lead to the desired outcome or reward, though the calculations and design philosophies differ greatly.
Another key distinction is that in ONA goals can change, and multiple goals can be pursued simultaneously. This, however, is not the focus of this page, which concentrates on the overlapping capability and how it is affected by the different methods.