
QLearning comparison

Patrick Hammer edited this page Dec 17, 2020 · 60 revisions

Running QL comparisons with ONA

Scripts for detailed procedure learning comparisons between branches are available in OpenNARS-for-Applications/misc/evaluation/. On this page we compare the QLearnerComparison branch, which implements a QLearner-based agent, with master.

The scripts expect the following directory structure:

BASE/master <- v0.8.5 (built via its build.sh)
BASE/QLearnerComparison <- QLearnerComparison branch (built via its build.sh)
BASE/comparison.py (copy from master, OpenNARS-for-Applications/misc/evaluation/)
BASE/plot.py (copy from master, OpenNARS-for-Applications/misc/evaluation/)

First run BASE/comparison.py to generate the outputs from the procedure learning examples. Then run BASE/plot.py to generate the output plots from the generated data.
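Before running the two scripts, the layout can be checked with a few lines of Python. This is only a sketch: the helper name `missing_entries` is hypothetical and not part of the ONA repository, and it checks nothing beyond the four entries listed in the directory structure above.

```python
from pathlib import Path

# Entries listed in the directory structure above, relative to BASE.
EXPECTED = ["master", "QLearnerComparison", "comparison.py", "plot.py"]


def missing_entries(base):
    """Return the expected entries that do not exist under `base`."""
    base = Path(base)
    return [name for name in EXPECTED if not (base / name).exists()]
```

Calling `missing_entries("BASE")` returns an empty list when all four entries are present; any returned name still has to be built or copied before comparison.py is started.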

Theoretical comparison

Besides the capability to deal with multiple and changing objectives by design, ONA demands less implicitly example-dependent parameter tuning than Q-Learning:

  1. ONA does not rely on learning rate decay. How much new evidence changes an existing belief depends only on the amount of evidence that already supports it, which makes high-confidence beliefs automatically more stable.
  2. ONA reduces motorbabbling by itself once the hypotheses it bases its decisions on are stable and predict successfully, and hence does not depend on a time-dependent reduction of the exploration rate either.
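The first point can be illustrated with the evidence-based truth values of NAL, where confidence is c = w / (w + k) for total evidence w. The following is a minimal sketch of revision under that definition, not ONA's actual C implementation; the evidential horizon `K = 1.0` and the function names are assumptions made for this example.

```python
K = 1.0  # evidential horizon; assumed value for this sketch


def to_evidence(frequency, confidence):
    """Convert (frequency, confidence) to (positive, total) evidence.

    Requires confidence < 1, since c = w / (w + K) never reaches 1.
    """
    total = K * confidence / (1.0 - confidence)
    return frequency * total, total


def revise(belief, observation):
    """Merge two truth values by adding up their evidence."""
    pos1, total1 = to_evidence(*belief)
    pos2, total2 = to_evidence(*observation)
    pos, total = pos1 + pos2, total1 + total2
    return pos / total, total / (total + K)


counter = (0.0, 0.5)                 # one contradicting observation
weak = revise((1.0, 0.5), counter)   # frequency drops to about 0.5
strong = revise((1.0, 0.9), counter) # frequency stays at about 0.9
```

The same contradicting observation halves the frequency of the weakly supported belief but barely moves the strongly supported one, without any time-dependent learning rate being involved.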

All time dependencies of hyperparameters are implicitly example-specific and hence have to be avoided when generality is evaluated. As time passes, a reduced learning rate makes the Q-Learner take longer to change its policy when new circumstances demand it. Additionally, reducing motorbabbling over time makes it increasingly unlikely that an alternative solution is attempted. Both are problematic if a good policy has not yet been found.

To ensure that the learner's hyperparameters generalize across tasks, the Q-Learning parameters were chosen via grid search (granularity 0.1) as the set with the highest competence product across the 4 examples; the product severely penalizes strong failure on any single example. The ONA parameters were likewise not varied across the examples. The grid search found the best hyperparameters for Q-Learning to be alpha=0.1, gamma=0.1, lambda=0.8, epsilon=0.1. The ONA parameters are the default config in ONA v0.8.5.
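The selection procedure can be sketched as follows. This is an illustration under assumptions, not the script that was actually used: the `evaluate` callback (returning a success ratio in [0, 1] for one parameter set on one example) is hypothetical, and whether the original grid included the endpoints 0.0 and 1.0 is not stated, so the grid below covers 0.1 to 0.9 only.

```python
import itertools

# Assumed grid with granularity 0.1; the endpoints are left out here.
GRID = [round(0.1 * i, 1) for i in range(1, 10)]


def competence_product(success_ratios):
    """Multiply per-example success ratios; one near-zero ratio sinks the score."""
    product = 1.0
    for ratio in success_ratios:
        product *= ratio
    return product


def grid_search(evaluate, examples, grid=GRID):
    """Return the parameter set with the highest competence product."""
    best_params, best_score = None, -1.0
    for alpha, gamma, lam, epsilon in itertools.product(grid, repeat=4):
        params = {"alpha": alpha, "gamma": gamma,
                  "lambda": lam, "epsilon": epsilon}
        score = competence_product(evaluate(params, ex) for ex in examples)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Using a product rather than a mean is the design choice that matters here: a parameter set that scores 0.9 on three examples and 0.0 on the fourth gets a product of 0.0, so it cannot win over a set that performs moderately well everywhere.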

Practical comparison

Example plots from two Pong variants, obtainable with the scripts mentioned above, are shown for the first 10K steps, where the differences in behavior are largest: (Q-learning parameters: Gamma = 0.8, Lambda = 0.1, Alpha = 0.1, Epsilon = 0.1, ONA parameters: default config of ONA v0.8.5)

The experiment was run over 5000 iterations, which, at the simulation speed used, corresponds to close to 300 opportunities to catch the ball on average. Both methods received the horizontal ball position relative to the bat, discretized into 3 values: left, center, right. The actions are ^left and ^right.
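A discretization of this kind might look like the sketch below. The function name, the coordinate convention and the `tolerance` that defines the "center" band are assumptions made for illustration; the thresholds used in the actual Pong examples are not given on this page.

```python
def relative_position(ball_x, bat_x, tolerance=1):
    """Discretize the ball position relative to the bat into three values."""
    if ball_x < bat_x - tolerance:
        return "left"
    if ball_x > bat_x + tolerance:
        return "right"
    return "center"
```

With only three input values and two actions, the state-action table of the Q-Learner has six entries, so differences in learning speed in this setting come from the learning mechanism rather than from the size of the state space.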

ONA vs. QLearning

This highlights ONA's quicker learning, which allows it to reach high success rates much earlier on average. The effect is visible in pong as well as in pong2:

ONA vs. QLearning2

In pong2, the performance after 10K steps turned out to be comparable, though again ONA learned faster. Please also note that in pong2 only ONA managed to learn the use of the ^stop operator, at least in 2 of 10 runs, which gave it a success ratio lead in these particular cases.

Unfortunately, with the same parameters, the QLearner struggled to find a good state-action mapping in Cartpole within 10K steps, except for two runs where it started to look promising, while ONA performed excellently in all runs:

QL fail

Lastly, in the ONA Alien example (a form of Space Invaders), ONA learns faster on average, while the end performance after 10K steps is comparable (as it tends to be when both techniques succeed, since they find similar behavior):

Alien comparison

Distinguishing properties of ONA

  • Clearly, ONA demands more computational resources, for various reasons. First, the correlations it mines are not restricted to those with reward as the consequent. This allows it to learn temporal patterns even in the absence of reward / goal fulfillment. Second, correlation is not the only thing its inference does, and the general inference mechanisms have their cost.

  • The NARS and RL decision theories are similar in that both tend to choose the actions most likely to lead to the desired outcome or reward, though the calculations and design philosophies vary greatly.

  • Another key distinction is that in ONA goals can change, and there can be multiple of them which are pursued simultaneously. This however is not the focus of this page, which concentrates on overlapping capability and how this capability is affected by the different methods.
