Auxiliary Losses
In most reinforcement learning settings, the agent receives some kind of unprocessed input, such as a sequence of image frames. It uses a feature extraction pipeline to pull relevant information out of those raw frames, and a policy network then maps the extracted features to actions for the task we want it to learn.
In reinforcement learning, the feedback signal can be so sparse that the agent never manages to extract useful features from the input frames. In this scenario, a successful approach is to give the agent additional learning objectives that exploit the strengths of supervised learning to build valuable feature extractors on those frames. Let's go through Google DeepMind's paper, Reinforcement Learning with Unsupervised Auxiliary Tasks. The basic sparse reward signal comes from an agent walking around a 3D maze looking for specified objects; it receives a reward whenever it finds one of them. Rather than relying on this relatively sparse feedback alone, the authors supplement training with three additional reward signals. The first auxiliary task is known as pixel control: reusing the primary feature extractor, the agent learns a separate policy that, given a frame from the environment, maximally changes the pixel intensities in particular regions of the input image. It may, for example, learn that looking up at the sky changes almost all of the pixel values in the input. In their proposed implementation:
- For pixel control, the input frame is divided into a small grid of cells, and each cell receives a visual change score. The auxiliary policy is trained to maximize the total visual change across all cells, which pushes the feature extractor to become sensitive to the game's overall dynamics.
- The second auxiliary task is Reward Prediction: the agent is given three recent frames from an episode and must predict the reward that will follow. This adds another learning objective that optimizes the feature extraction pipeline in a way we expect to be broadly useful for the end goal we actually care about.
- The third task is Value Function Replay, which estimates the value of the current state by predicting the total future reward the agent will collect from this moment onward. This is essentially what every off-policy algorithm, such as DQN, does all the time. It turns out that adding these relatively simple extra objectives to the training pipeline dramatically improves the learning agent's sample efficiency.
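The visual change score behind the pixel control task can be sketched as follows. The grid size, frame size, and plain frame-difference measure are assumptions of this sketch; the paper trains an auxiliary Q-head to maximally influence this quantity, which is omitted here.

```python
import numpy as np

def pixel_change_score(prev_frame, frame, grid=4):
    # Average absolute pixel change within each cell of a grid x grid
    # partition of the frame. This is the signal a pixel-control head
    # is trained to influence.
    diff = np.abs(frame - prev_frame)
    h, w = diff.shape
    ch, cw = h // grid, w // grid
    # Reshape into (grid, cell_h, grid, cell_w), then average each cell.
    cells = diff[:grid * ch, :grid * cw].reshape(grid, ch, grid, cw)
    return cells.mean(axis=(1, 3))

rng = np.random.default_rng(1)
prev_frame = rng.random((84, 84))
frame = prev_frame.copy()
frame[:21, :21] += 0.5               # a change confined to the top-left region
scores = pixel_change_score(prev_frame, frame, grid=4)
```

Here only the top-left cell registers a change, so a pixel-control policy maximizing the total score would learn which actions (for example, turning the camera) move pixels in that region.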
The addition of pixel control tasks appears to work particularly well in three-dimensional environments: learning to control your gaze direction, and how doing so affects your own visual input, is critical for learning any kind of effective behaviour.
Sparse Rewards in Reinforcement Learning
Prerequisite: Understanding Reinforcement Learning in-depth
In the previous articles, we covered reinforcement learning, its general paradigm, and the issues that arise in sparse reward settings. In this article, we'll dive a little further into more technical work aimed at solving the sparse reward problem. Because rewards are sparse, we don't have a target label that our network should produce for each input frame; the agent must learn from very sparse feedback and figure out on its own which action sequences led to the eventual reward. One option that has emerged from the research is to supplement the sparse extrinsic reward signal from the environment with additional dense reward signals that enhance the agent's learning. We'll go over some fascinating technical work that introduces concepts such as additional reward signals, curiosity-driven exploration, and hindsight experience replay.
A wide range of novel ideas has emerged in reinforcement learning research to overcome some of its most difficult challenges. One recent trend has been to supplement the environment's sparse extrinsic reward signal with additional feedback signals that aid the agent's learning. Many of these new concepts are variants of the same core theme.
Instead of a sparse reward signal that the agent only sees on rare occasions, we want to construct extra feedback signals that are very rich; in other words, we want to create something close to a supervised setting. The point of these extra rewards and feedback signals is that they are tied, in some way, to the task we want our agent to perform. We design these dense feedback signals so that whenever the agent pursues them, it is likely to gain information or feature extractors that will be useful for the final task (the sparse-reward task) we actually care about. It is impossible to explain all of these methodologies in depth in a single article, but we will sketch a few highly intriguing papers to give you an idea of the main directions research is now taking.
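As a concrete instance of turning a sparse signal into a dense, supervised one, here is a sketch of how training samples for a reward prediction task could be drawn from an episode. The 50/50 skew toward rewarding windows and the three-class target are modeled on the UNREAL reward prediction task, but the exact function and its parameters are assumptions of this sketch.

```python
import numpy as np

def reward_prediction_sample(frames, rewards, rng, stack=3):
    # Sample a window of `stack` consecutive frames and a classification
    # target for the reward on the *next* step:
    #   class 0 = zero reward, 1 = positive, 2 = negative.
    # Sampling is skewed 50/50 toward rewarding windows so the rare
    # positive class still shows up despite reward sparsity (an
    # assumption of this sketch; the paper also oversamples rewards).
    candidates = range(stack - 1, len(frames) - 1)
    rewarding = [t for t in candidates if rewards[t + 1] != 0]
    zero = [t for t in candidates if rewards[t + 1] == 0]
    pool = rewarding if (rewarding and rng.random() < 0.5) else zero
    t = pool[rng.integers(len(pool))]
    window = np.stack(frames[t - stack + 1: t + 1])
    r = rewards[t + 1]
    target = 0 if r == 0 else (1 if r > 0 else 2)
    return window, target

rng = np.random.default_rng(0)
frames = [rng.random((4, 4)) for _ in range(10)]
rewards = [0.0] * 10
rewards[6] = 1.0                     # a single sparse reward event
window, target = reward_prediction_sample(frames, rewards, rng)
```

Every sampled pair gives the shared encoder an ordinary supervised classification gradient, which is exactly the dense feedback signal the sparse environment reward fails to provide.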