QLearning comparison

Patrick Hammer edited this page Jul 16, 2021 · 60 revisions

Theoretical comparison

Besides being designed to handle multiple and changing objectives, ONA requires less implicitly example-dependent parameter tuning than Q-Learning:

  1. ONA does not rely on learning rate decay. How much new evidence changes an existing belief depends only on the amount of evidence that already supports it, which makes high-confidence beliefs automatically more stable.
  2. ONA reduces motor babbling by itself once the hypotheses it bases its decisions on are stable and predict successfully, and hence does not depend on a time-dependent reduction of the exploration rate either.
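The first point can be illustrated with the NAL revision rule, which pools the evidence behind two truth values. The following is a minimal sketch, not ONA's actual code: it assumes an evidential horizon of `K = 1`, and the names `to_evidence` and `revise` are chosen here for illustration only.

```python
K = 1.0  # evidential horizon (assumed value for this sketch)

def to_evidence(frequency, confidence):
    """Convert a (frequency, confidence) truth value to (positive, total) evidence."""
    total = K * confidence / (1.0 - confidence)
    return frequency * total, total

def revise(belief, new):
    """Pool the evidence of two truth values and convert back."""
    pos1, total1 = to_evidence(*belief)
    pos2, total2 = to_evidence(*new)
    pos, total = pos1 + pos2, total1 + total2
    return pos / total, total / (total + K)

weak_belief = (1.0, 0.5)     # backed by 1 unit of evidence
strong_belief = (1.0, 0.9)   # backed by 9 units of evidence
contradiction = (0.0, 0.5)   # 1 unit of negative evidence

weak_after = revise(weak_belief, contradiction)
strong_after = revise(strong_belief, contradiction)
print("weak belief after revision:  ", weak_after)
print("strong belief after revision:", strong_after)
```

The same piece of contradicting evidence halves the frequency of the weakly supported belief (1.0 to 0.5) but moves the strongly supported one only to about 0.9. No time-dependent schedule is involved: the effective "learning rate" follows from the accumulated evidence alone.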

All time dependencies of hyperparameters are implicitly example-specific, and hence have to be avoided when generality is evaluated. As time passes, a reduction of the learning rate makes the Q-Learner take longer to change its policy when new circumstances demand it. Additionally, a reduction of motor babbling over time makes it increasingly unlikely that alternative solutions are attempted. Both are problematic if a good policy has not yet been found.
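The effect of a decayed learning rate can be sketched with a single Q-value that has to track a changed target. This is a toy illustration: the schedule `alpha0 / (1 + decay * t)`, the constants, and the function names are assumptions made here, not the setup used in the experiments.

```python
def decayed(value0, decay, t):
    """Hypothetical time-based decay schedule for a hyperparameter."""
    return value0 / (1.0 + decay * t)

def steps_to_adapt(t_change, alpha0=0.5, decay=0.01,
                   threshold=0.9, max_steps=100_000):
    """Count the updates a single Q-value needs to move from 0 to at least
    `threshold` when the target switches to 1.0 at time `t_change`."""
    q = 0.0
    for step in range(1, max_steps + 1):
        alpha = decayed(alpha0, decay, t_change + step - 1)
        q += alpha * (1.0 - q)  # Q-update toward the new target
        if q >= threshold:
            return step
    return None  # did not adapt within max_steps

early = steps_to_adapt(t_change=0)
late = steps_to_adapt(t_change=10_000)
print("updates needed when the change happens early:", early)
print("updates needed when the change happens late: ", late)
```

The later the environment changes, the more updates the learner needs to follow it, although nothing about the task itself has become harder. How fast the rate should decay therefore depends on the example.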

To ensure that the choice of hyperparameters generalizes across tasks, a single set of Q-Learning parameters was chosen via grid search (granularity 0.1): the set with the highest competence product across the 4 examples, a criterion that severely penalizes strong failure on any single example. The ONA parameters were likewise not varied across the examples. The grid search found the best hyperparameters for Q-Learning to be alpha=0.1, gamma=0.1, lambda=0.8, epsilon=0.1. The ONA parameters are the default config in ONA v0.8.5.
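The selection criterion can be sketched as follows. This is an illustrative outline only: `evaluate` stands for a hypothetical function that runs the Q-Learner with the given parameters on one example and returns a competence score in [0, 1], and the grid range of 0.0 to 1.0 is an assumption (only the step of 0.1 is given above).

```python
import itertools
import math

# Step 0.1 as stated above; the 0.0..1.0 range is an assumption.
GRID = [round(0.1 * i, 1) for i in range(11)]

def competence_product(scores):
    """Product of per-example scores: one near-zero score ruins the total."""
    return math.prod(scores)

def grid_search(evaluate, examples, grid=GRID):
    """Return the parameter set with the highest competence product.
    `evaluate(params, example)` is a hypothetical scoring callback."""
    best_params, best_score = None, -1.0
    for alpha, gamma, lam, epsilon in itertools.product(grid, repeat=4):
        params = {"alpha": alpha, "gamma": gamma,
                  "lambda": lam, "epsilon": epsilon}
        score = competence_product(evaluate(params, ex) for ex in examples)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Made-up scores showing why the product is used instead of the mean:
uneven = [0.9, 0.9, 0.9, 0.1]  # strong failure on one example
even = [0.6, 0.6, 0.6, 0.6]    # mediocre but consistent
print("mean:   ", sum(uneven) / 4, "vs", sum(even) / 4)
print("product:", competence_product(uneven), "vs", competence_product(even))
```

With these made-up scores the mean prefers the uneven parameter set, while the product prefers the consistent one, which is the behavior wanted when a single parameter set has to work on all examples.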

Summary of distinguished properties of ONA

  • ONA does not need a specific reduction of learning rate and exploration rate to work well for a particular example, hence needs less parameter tuning.

  • ONA demands more computational resources than a table-based Q-Learning implementation.

  • The recurring correlations ONA finds are not only about reward as consequent. This allows it to learn temporal patterns even in the absence of reward / goal fulfillment. Should an event become a goal in the future, or should its occurrence turn out to be necessary as an intermediate step to reach an outcome, ONA can immediately exploit the previously learned knowledge to make it happen.

  • ONA allows pursuing multiple goals / objectives simultaneously rather than a single maximum-reward outcome.

  • Goals in ONA can change at any time, and when that happens the system's knowledge can be used to achieve the new goals without having to re-learn a state-action mapping.

  • The NARS and RL decision theories share one property: both tend to choose the actions that are most likely to lead to the desired outcome or reward. In NARS, however, there is no guarantee of this, since multiple such outcomes can compete for attention simultaneously.
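The points about goal-independent knowledge and changing goals can be made concrete with a toy sketch. It is a strong simplification and not ONA's actual decision procedure: hypotheses are stored as (precondition, operation, consequence) with a truth value, and the operation whose hypothesis has the highest truth expectation for the current goal is picked. The event and operation names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    """Learned pattern: precondition, then operation, leads to consequence."""
    precondition: str
    operation: str
    consequence: str
    frequency: float
    confidence: float

    def expectation(self):
        # NAL truth expectation: confidence pulls the value away from 0.5.
        return self.confidence * (self.frequency - 0.5) + 0.5

def choose_operation(hypotheses, current_event, goal):
    """Pick the operation most expected to produce `goal` right now."""
    candidates = [h for h in hypotheses
                  if h.precondition == current_event and h.consequence == goal]
    if not candidates:
        return None  # nothing known yet; a real system would explore here
    return max(candidates, key=Hypothesis.expectation).operation

# Knowledge learned from observation, independent of any reward signal
# (all names and truth values below are made up for illustration).
knowledge = [
    Hypothesis("door_closed", "^push", "door_open", 0.9, 0.8),
    Hypothesis("door_closed", "^pull", "door_open", 0.4, 0.8),
    Hypothesis("door_closed", "^knock", "sound", 0.95, 0.7),
]

print(choose_operation(knowledge, "door_closed", goal="door_open"))
print(choose_operation(knowledge, "door_closed", goal="sound"))
```

Switching the goal from `door_open` to `sound` requires no new learning, because the hypotheses are indexed by their consequence rather than by a reward. A table-based Q-Learner would instead have to re-learn its state-action values under the new reward.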