Meta-reinforcement learning refers to embedding meta-learning (i.e., higher-order learning) mechanisms in reinforcement learning (RL) models.
I have used this approach to significantly improve several families of RL models:
- Value-based RL models (e.g., Rescorla-Wagner)
- State-action value-based RL models (e.g., Q-learning)
- RL models applied to time-series prediction

In this approach, learning rates are adjusted dynamically. The idea differs from, yet is similar in spirit to, earlier RL models, such as that of Pearce and Hall or, especially, that of Mackintosh.
The main difference is that here the learning rates are updated by continuously integrating information. As a result, these meta-reinforcement learning models can distinguish between good, bad, and ugly abstract feature representations according to how they predict reward: positive prediction, negative prediction, or noise, respectively.
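To make the idea of a dynamically adjusted learning rate concrete, here is a minimal sketch of a Rescorla-Wagner learner whose learning rate is itself learned. The update rule is a Pearce-Hall-style surprise tracker chosen purely for illustration; it is not the continuous-integration rule used in the models described here, and the names `rw_adaptive`, `eta`, and `alpha0` are hypothetical.

```python
import random

def rw_adaptive(rewards, eta=0.3, alpha0=0.5):
    """Rescorla-Wagner learning with a Pearce-Hall-style learning rate.

    rewards: sequence of rewards in [0, 1] observed for a single cue.
    eta:     how quickly the learning rate itself is updated (assumed name).
    alpha0:  initial learning rate (assumed name).
    Returns the final value estimate and the learning-rate trajectory.
    """
    value, alpha = 0.0, alpha0
    alphas = []
    for r in rewards:
        delta = r - value          # reward prediction error
        value += alpha * delta     # first-order (Rescorla-Wagner) update
        # Higher-order update: the learning rate tracks recent surprise.
        # The clamp keeps alpha in [0, 1] even if rewards leave [0, 1].
        alpha = (1 - eta) * alpha + eta * min(abs(delta), 1.0)
        alphas.append(alpha)
    return value, alphas

random.seed(0)
predictable = [1.0] * 200                                    # cue always rewarded
noisy = [float(random.random() < 0.5) for _ in range(200)]   # coin-flip reward

v_pred, a_pred = rw_adaptive(predictable)
v_noise, a_noise = rw_adaptive(noisy)
print("predictable cue: value", round(v_pred, 3), "final alpha", round(a_pred[-1], 3))
print("noisy cue:       value", round(v_noise, 3), "final alpha", round(a_noise[-1], 3))
```

With this rule the learning rate decays for the fully predictable cue and stays high for the noisy one. A purely surprise-driven rate therefore does not, by itself, separate noise from informative cues; the text above attributes that ability to the continuous-integration rule.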
This higher-order learning is used recursively in the models to prioritize information processing, whether that information concerns sensory input, appropriate actions, context, or the optimal path in a grid world. Notably, the Nash equilibrium of stochastic choice emerges naturally from this meta-reinforcement learning model in competitive mixed-strategy games such as the matching pennies task.
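For reference, the Nash equilibrium of matching pennies is for both players to choose heads with probability 0.5. The sketch below checks only this property of the game itself, assuming the usual payoffs of +1 to the matcher when the coins match and -1 otherwise. The helper `matcher_payoff` is hypothetical, and the code does not reproduce the learning model.

```python
def matcher_payoff(p, q):
    """Expected payoff to the matcher in matching pennies.

    p: probability that the matcher plays heads.
    q: probability that the opponent plays heads.
    Assumes +1 for a match and -1 for a mismatch (zero-sum game).
    """
    p_match = p * q + (1 - p) * (1 - q)
    return p_match - (1 - p_match)

# Against a 50/50 opponent, no strategy of the matcher gains anything.
for p in (0.0, 0.3, 0.5, 1.0):
    print("p =", p, "payoff vs q=0.5:", matcher_payoff(p, 0.5))

# Against a biased opponent, the matcher can exploit the bias.
print("p = 1.0 payoff vs q=0.7:", matcher_payoff(1.0, 0.7))
```

Algebraically the expected payoff is (2p - 1)(2q - 1), which is zero whenever either player mixes 50/50. Any deviation from 0.5 can be exploited by the other player, which is why stochastic, unbiased choice is the equilibrium.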