Hindsight Experience Replay

Hindsight Experience Replay (HER) is a clever addition to the usual reward setup, built on a very simple idea that works remarkably well. Consider the following scenario: you wish to teach a robotic arm to push an object on a table to a specified spot. The difficulty is that if you rely on random exploration, you’re very unlikely to collect many rewards, which makes the policy very difficult to train. The usual workaround is to provide a dense reward, for example the negative Euclidean distance between the object and the target point. You then get a specific dense reward for every frame, which you can optimize with ordinary gradient descent. The problem, as we’ve seen, is that reward shaping isn’t the best answer; we’d prefer to solve the task with a simple sparse reward, such as success or failure. The general idea behind hindsight experience replay is that we want to learn from every episode, even the ones that failed at the task we actually care about. To make that possible, HER uses a deceptively simple trick to let the agent learn from a failed episode.

The agent begins by pushing the object around on the table, attempting to reach position A, but because the policy isn’t very good yet, the object ends up at position B instead. Rather than simply telling the agent that it did something wrong and handing it a reward of 0, HER acts as if moving the object to position B was the goal all along and says, in effect, “Yes, very well done, this is how you move the object to position B.” You’re essentially turning a sparse reward setting into one that behaves much more like a dense reward setting.
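To make the relabeling trick concrete, here is a minimal Python sketch. The positions, the success tolerance, and the `sparse_reward` helper are illustrative assumptions, not values prescribed by the HER paper.

```python
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    # Sparse success/failure signal: 0 if close enough to the goal, else -1.
    return 0.0 if np.linalg.norm(achieved - goal) <= tol else -1.0

# A failed episode: the arm was asked to push the object to goal A,
# but the object ended up at position B instead.
goal_a = np.array([0.60, 0.20])
episode = [
    # (state, action, position the object actually reached after the step)
    (np.array([0.40, 0.40]), np.array([0.1, -0.1]), np.array([0.45, 0.35])),
    (np.array([0.45, 0.35]), np.array([0.1, -0.1]), np.array([0.50, 0.30])),
]
position_b = episode[-1][2]  # where the object really ended up

# Transitions labeled with the original goal A: every reward is -1.
original = [(s, goal_a, a, sparse_reward(p, goal_a)) for s, a, p in episode]

# Hindsight relabeling: pretend B was the goal all along and recompute rewards.
relabeled = [(s, position_b, a, sparse_reward(p, position_b)) for s, a, p in episode]

print([r for *_, r in original])   # [-1.0, -1.0] -> no useful learning signal
print([r for *_, r in relabeled])  # [-1.0,  0.0] -> the last step now "succeeds"
```

These relabeled transitions are exactly what HER adds to the replay buffer alongside the original ones, which is what the algorithm below formalizes.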

Start with a standard off-policy reinforcement learning algorithm and a strategy for sampling goals. Given a goal position, we run the current policy to obtain a trajectory and record the final position where the object ended up. Once the episode has concluded, we store all of its transitions in the replay buffer together with the goal that was given to the policy. We then sample a set of additional goals, substitute them into the stored transitions (recomputing the reward for each new goal), and store those relabeled transitions in the replay buffer as well. The great thing about this algorithm is that, once trained, you have a single policy network that can do different things depending on the goal you give it. If you wish to move the object to a different place, you don’t have to retrain the entire policy; simply change the goal vector and the policy will adjust accordingly. In the results plot from the HER paper, the blue curve corresponds to the variant in which the additional sampled goal is always the final state of the episode, i.e. the actual position where the object ended up after its sequence of actions. The red curve shows even better results when the additional goals are sampled from future states encountered later on the same trajectory (the “future” strategy). The concept is simple and the algorithm is easy to implement, but it addresses a basic difficulty in learning: we want to make the most of every experience we have.
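Because the goal is part of the policy’s input, a single trained network can serve any target position. Below is a hedged PyTorch-style sketch of such a goal-conditioned policy; the layer sizes and dimensions are arbitrary placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Policy that conditions on the goal: pi(a | s, g)."""
    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, obs, goal):
        # Concatenating the goal to the observation is what lets one network
        # push the object to whatever target you specify at test time.
        return self.net(torch.cat([obs, goal], dim=-1))

policy = GoalConditionedPolicy(obs_dim=10, goal_dim=2, act_dim=4)
obs = torch.randn(1, 10)
action_for_a = policy(obs, torch.tensor([[0.6, 0.2]]))  # goal A
action_for_b = policy(obs, torch.tensor([[0.3, 0.7]]))  # goal B, no retraining
```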

Algorithm from the HER research paper:

Given:

  • an off-policy RL algorithm $\mathbb{A}$,
  • a strategy $\mathbb{S}$ for sampling goals for replay,
  • a reward function $r : \mathcal{S} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathbb{R}$.

Initialize $\mathbb{A}$

Initialize replay buffer $R$

for episode = 1, M do

    Sample a goal $g$ and an initial state $s_0$.

    for t = 0, T-1 do

            Sample an action $a_t$ using the behavioral policy from $\mathbb{A}$:

                    $a_t \leftarrow \pi_b(s_t \,\|\, g)$    (where $\|$ denotes concatenation of state and goal)

            Execute the action $a_t$ and observe a new state $s_{t+1}$

     end for

    for t = 0, T-1 do

           $r_t := r(s_t, a_t, g)$

           Store the transition $(s_t \,\|\, g, a_t, r_t, s_{t+1} \,\|\, g)$ in $R$

           Sample a set of additional goals for replay $G := \mathbb{S}(\text{current episode})$

           for $g' \in G$ do

                         $r' := r(s_t, a_t, g')$

                        Store the transition $(s_t \,\|\, g', a_t, r', s_{t+1} \,\|\, g')$ in $R$

          end for

    end for

    for t = 1, N do

    Sample a minibatch $B$ from the replay buffer $R$

    Perform one step of optimization using $\mathbb{A}$ and minibatch $B$

    end for 

end for
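The episode-collection and relabeling loops above can be written compactly in Python. The sketch below is a rough illustration rather than a faithful reimplementation: it assumes a hypothetical goal-based environment whose observations expose "observation", "achieved_goal" and "desired_goal" fields (the convention used by goal-based Gym tasks), a hypothetical `replay_buffer.add(...)` method, and the “future” strategy for sampling the additional goals.

```python
import random

def collect_episode_with_her(env, policy, replay_buffer, reward_fn, k_future=4):
    """Roll out one episode, then store original and hindsight transitions."""
    obs = env.reset()
    goal = obs["desired_goal"]
    trajectory = []

    done = False
    while not done:
        action = policy(obs["observation"], goal)
        next_obs, _, done, _ = env.step(action)
        trajectory.append((obs, action, next_obs))
        obs = next_obs

    for t, (o, a, o2) in enumerate(trajectory):
        # Standard experience replay: keep the goal the agent was asked to reach.
        r = reward_fn(o2["achieved_goal"], goal)
        replay_buffer.add(o["observation"], goal, a, r, o2["observation"])

        # HER ("future" strategy): relabel with goals that were actually achieved
        # later in this trajectory, so failed episodes still contain successes.
        for _ in range(k_future):
            future_t = random.randint(t, len(trajectory) - 1)
            g_prime = trajectory[future_t][2]["achieved_goal"]
            r_prime = reward_fn(o2["achieved_goal"], g_prime)
            replay_buffer.add(o["observation"], g_prime, a, r_prime, o2["observation"])
```

After each collected episode, the off-policy learner simply samples minibatches from the replay buffer and takes its usual optimization steps, exactly as in the last inner loop of the pseudocode.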

Sparse Rewards in Reinforcement Learning

In the previous articles, we covered reinforcement learning, its general paradigm, and the problems that arise in sparse reward settings. In this article, we’ll dive a little further into some more technical work aimed at resolving the sparse reward problem. Because we’re dealing with sparse rewards, we don’t know the target label our network should produce for each input frame, so the agent must learn from very sparse feedback and figure out on its own which action sequences led to the eventual reward. One approach that has emerged from the research is to supplement the sparse extrinsic reward signal received from the environment with additional dense reward signals that enhance the agent’s learning. We’ll go over some fascinating technical work built around ideas such as auxiliary reward signals, curiosity-driven exploration, and hindsight experience replay.

To overcome some of the most difficult challenges in reinforcement learning, a wide range of novel ideas has emerged in recent research. One recent trend is to supplement the environment’s sparse extrinsic reward signal with additional feedback signals that aid the agent’s learning. Many of these new methods are variations on the same core theme.

Instead of a sparse reward signal that the agent only sees on rare occasions, we want to construct extra feedback signals that are very rich, bringing the problem much closer to a supervised setting. The point of those extra rewards and additional feedback signals is that they are tied, in some way, to the task we want our agent to perform. By producing these dense feedback signals, the agent is likely to acquire knowledge or feature extractors that will also be useful for the final sparse-reward task we are actually interested in. It is impossible to explain all of these methods in depth in a single article, but this article attempts to sketch a few intriguing papers that give an idea of the main directions this research is currently taking.
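The recipe common to these methods can be summed up in one line: add a weighted dense bonus to the sparse extrinsic reward. The helper below is only a schematic illustration; the particular bonus and the coefficient `beta` are assumptions made for the example, not values from any specific paper.

```python
def combined_reward(extrinsic_sparse, auxiliary_dense, beta=0.1):
    """Augment the rare extrinsic reward with a weighted dense auxiliary signal.

    `auxiliary_dense` stands in for whatever extra feedback is constructed
    (a curiosity bonus, an auxiliary-task score, ...); `beta` keeps the bonus
    from drowning out the true objective.
    """
    return extrinsic_sparse + beta * auxiliary_dense

# A step that earns no extrinsic reward still yields a non-zero training signal.
print(combined_reward(extrinsic_sparse=0.0, auxiliary_dense=0.5))  # 0.05
```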
