Auxiliary Losses

Architecture of Reinforcement Learning Agent with Unsupervised Auxiliary Task

In most reinforcement learning settings, the agent receives some kind of unprocessed input, such as a sequence of images. The agent then uses a feature extraction pipeline to pull relevant information out of those raw input frames, followed by a policy network that uses the extracted features to perform the task we want it to learn.

In reinforcement learning, the feedback signal can be so sparse that the agent never manages to extract useful features from the input frames. In this scenario, a successful approach is to give the agent additional learning objectives that exploit the strengths of supervised learning to produce highly valuable feature extractors on those images. Let's walk through Google DeepMind's paper, Reinforcement Learning with Unsupervised Auxiliary Tasks. The basic sparse reward signal comes from an agent walking around a 3D maze looking for specific objects, and it receives a reward whenever it collects one of them. Rather than relying on this relatively sparse feedback signal alone, the authors supplement the training process with three additional reward signals. The first auxiliary task is called pixel control: using the same feature extraction pipeline, the agent learns a separate policy that, given a frame from the environment, tries to maximally change the pixel intensities in particular regions of the input image. It may, for example, learn that looking up at the sky changes practically all of the pixel values in the input. In their proposed implementation:

  • The input frame is divided into a grid of small cells, and each cell receives a visual-change score. The pixel-control policy is then trained to maximize the total visual change across all cells, with the goal of forcing the feature extractor to become more sensitive to the game's overall dynamics (a small sketch of the per-cell change computation follows this list).
  • The second auxiliary task is reward prediction: the agent is given three recent frames from the episode sequence and is asked to predict the reward that will follow. This adds another learning objective that tunes the feature extraction pipeline in a way that we expect to be generally useful for the end goal we care about.
  • The third task is value function replay, which estimates the value of being in the current state by predicting the total future reward the agent will receive from this moment onward. This is essentially what every off-policy algorithm, such as DQN, does all the time. It turns out that adding these relatively simple extra objectives to the training pipeline dramatically improves the learning agent's sample efficiency.
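To make the pixel-control objective a bit more concrete, here is a minimal NumPy sketch of the per-cell visual-change score described above. The frame size, the cell size, and the weighting coefficients in the closing comment are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def pixel_change_targets(prev_frame, frame, cell=4):
    """Average absolute pixel change per grid cell (the pixel-control pseudo-reward).

    Both frames are H x W grayscale arrays in [0, 1]; H and W are assumed to be
    divisible by `cell`. Returns an (H // cell, W // cell) grid of change scores.
    """
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    h, w = diff.shape
    # Group pixels into non-overlapping cell x cell blocks and average each block.
    return diff.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))

# The auxiliary pixel-control head is then trained (e.g. with n-step Q-learning)
# to maximise these per-cell change scores, and its loss is added to the main RL
# loss with small weighting coefficients (lambda_* values are assumptions), roughly:
#   total_loss = rl_loss + lambda_pc * pixel_control_loss \
#              + lambda_rp * reward_prediction_loss + lambda_vr * value_replay_loss
```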

The addition of pixel-control tasks appears to work particularly well in three-dimensional environments: learning to control your gaze direction, and how this affects your own visual input, is critical for learning any kind of effective behaviour.

Sparse Rewards in Reinforcement Learning

In the previous articles, we covered reinforcement learning, its general paradigm, and the problems that arise in sparse reward settings. In this article, we'll dig a little deeper into technical work aimed at solving the sparse reward problem. Dealing with sparse rewards means we don't know the target label our network should produce for each input frame, so the agent has to learn from very sparse feedback and figure out on its own which action sequences led to the eventual reward. One option that has emerged from the research is to supplement the sparse extrinsic reward signal received from the environment with additional dense reward signals that enhance the agent's learning. We'll go over some fascinating technical work that introduces concepts such as auxiliary reward signals, curiosity-driven exploration, and hindsight experience replay.

To overcome some of the most difficult challenges in reinforcement learning, a wide range of novel ideas has emerged in the research. One recent trend is to supplement the environment's sparse extrinsic reward signal with additional feedback signals that aid the agent's learning. Many of these new concepts are variants of the same core theme.

Instead of a sparse reward signal that the agent sees only on rare occasions, we want to construct extra feedback signals that are very rich, in effect creating something closer to a supervised setting. The point of those extra rewards and additional feedback signals is that they are tied, in some way, to the task we want our agent to perform. We want to design these dense feedback signals so that whenever the agent does well on them, it is likely to acquire knowledge or feature extractors that will also be useful for the final, sparsely rewarded task we actually care about. It is impossible to give an in-depth explanation of all the methodologies in a single article, but this one attempts to sketch a few highly intriguing papers that give an idea of the main directions research is currently taking.


Hindsight Experience Replay

Hindsight Experience Replay (HER) is a fantastic addition to the usual reward setup. HER is based on a very simple idea that turns out to be quite effective. Consider the following scenario: you want to teach a robotic arm to push an object on a table to a specified spot. The difficulty is that if you rely on random exploration, you are very unlikely to obtain many rewards, which makes the policy very hard to train. The usual workaround is to provide a dense reward, for example the object's Euclidean distance from the target point. You then get a very specific dense reward for every frame, which you can train on with ordinary gradient descent. The trouble is that, as we've seen, reward shaping isn't the best answer; we would prefer to work with a plain sparse reward, such as success or failure. The general idea behind hindsight experience replay is that we want to learn from every episode, even the ones that failed at the task we set out to learn. To do so, hindsight experience replay employs a deceptively simple trick to let the agent learn from a failed episode....
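As an illustration of that trick, here is a minimal sketch of HER-style goal relabelling for a goal-conditioned replay buffer. The transition format, the distance threshold in compute_reward, and the "relabel with the final achieved state" strategy are assumptions made for illustration; other relabelling strategies are possible.

```python
import numpy as np

def compute_reward(achieved_goal, goal, tol=0.05):
    """Sparse reward: 0 if the achieved goal is within `tol` of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) < tol else -1.0

def relabel_with_hindsight(episode, replay_buffer):
    """Store each transition twice: once with the original goal, and once
    pretending the goal was the state actually reached at the end of the episode.

    `episode` is a list of dicts with keys: state, action, next_state,
    achieved_goal (the position actually reached after the action), and goal
    (all NumPy arrays; this layout is assumed for illustration).
    """
    final_achieved = episode[-1]["achieved_goal"]
    for t in episode:
        # Original (likely failed) transition with the true goal.
        r = compute_reward(t["achieved_goal"], t["goal"])
        replay_buffer.append((t["state"], t["action"], r, t["next_state"], t["goal"]))
        # Hindsight transition: "where we ended up was the goal all along",
        # so the failed episode now contains rewarded, useful experience.
        r_h = compute_reward(t["achieved_goal"], final_achieved)
        replay_buffer.append((t["state"], t["action"], r_h, t["next_state"], final_achieved))
```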

Curiosity-Driven Exploration

The main notion is that you want to encourage your agent, in some way, to seek out new things it encounters in its environment. Most default reinforcement learning algorithms use epsilon-greedy exploration. This means that in the vast majority of cases the agent chooses the best action according to its current policy, but with a small probability epsilon it takes a random action instead. At the start of training this epsilon value is 100%, meaning the behaviour is fully random, and as training progresses the epsilon value decays until, at the end, the agent follows its policy entirely. The idea is that the agent learns to explore its surroundings through these random actions....
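For reference, here is a minimal sketch of epsilon-greedy action selection with a linearly decaying epsilon, the default exploration scheme described above. The decay schedule and the Q-value interface are illustrative assumptions.

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, num_actions):
    """With probability epsilon take a random action, otherwise the greedy one.

    `q_values` is a list or array of Q-value estimates for the current state.
    """
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(num_actions)                        # explore
    return max(range(num_actions), key=lambda a: q_values[a])       # exploit
```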