Bellman Equation in RL

The Bellman equation is the foundation of many algorithms in RL. It has many forms depending on the type of algorithm and the value being optimized.

The return from any state (or state-action pair) can be decomposed into two parts:

  • The immediate reward from the action that is taken to reach the next state.
  • The discounted return from that next state, obtained by following the same policy for all subsequent steps.

This recursive relationship is known as the Bellman Equation.

V(s) = E[R + γ·V(s′)]
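A minimal numeric sketch of this backup follows; the reward, discount factor, and next-state value are made-up illustration numbers, not taken from any specific environment.

```python
# A minimal numeric sketch of the backup V(s) = E[R + γ·V(s')].
# The reward, discount factor, and next-state value are made-up numbers.
gamma = 0.9        # discount factor γ
reward = 1.0       # immediate reward R received on leaving the current state
value_next = 5.0   # current estimate of V(s'), the next state's value

value_current = reward + gamma * value_next  # one-sample Bellman backup
print(value_current)  # 5.5
```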

Let us understand the Bellman equation with the help of the Q-value.

For example, in value-based model-free RL the Bellman equation operates on the Q-value. When we build a value function in an RL algorithm, we maintain and update a value called Q, i.e. the quality value, for each state-action pair.

Consider a simple environment where there are only 5 possible states and 4 possible actions. We can then build a look-up table with rows representing the states and columns representing the actions. Each value in the matrix represents the expected return obtained by taking that particular action when the agent is in that state. This value is known as the Q-value, and it is what the agent learns through its interaction with the environment. Once the agent has interacted with the environment for a sufficient amount of time, the values contained in the table will have converged to the optimal values. Based on these values, the agent decides its action.
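As a minimal sketch of this look-up table, the snippet below builds a 5 × 4 Q-table and reads off the greedy action for one state; the table starts at zero and the state index is purely illustrative.

```python
import numpy as np

# A minimal sketch of the look-up table described above: 5 states x 4 actions.
# The table starts at zero and is filled in as the agent interacts with the
# environment; the state index used below is purely illustrative.
n_states, n_actions = 5, 4
q_table = np.zeros((n_states, n_actions))  # rows: states, columns: actions

state = 2
q_values_for_state = q_table[state]               # Q(s, a) for every action a
best_action = int(np.argmax(q_values_for_state))  # greedy action for this state
print(best_action)
```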

Formally the Q-value of a state-action pair is denoted as Q(s, a), where:

  • s is the current state of the environment.
  • a is the action taken by the agent.

The Bellman equation expresses the relationship between the value of a state (or state-action pair) and the expected future rewards achievable from that state.

Whenever an agent interacts with an environment it gets two things in return – the immediate reward for that action and the successor state. Based on these two, it updates the Q-value of the current state-action pair:

Q(s, a) = E[R_{t+1} + γ · max_{a′} Q(s′, a′)]

The expected return from starting in state s, taking action a, and following the optimal policy afterwards is equal to the expected reward R_{t+1} we get by selecting action a in state s, plus the maximum expected discounted return achievable from any potential next state-action pair (s′, a′).
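As a sketch of how this relation can be turned into an update rule (the Q-learning style update the article implements later with OpenAI Gym), the snippet below uses the same hypothetical 5-state, 4-action table; the learning rate, discount factor, and sample transition are made-up illustration values.

```python
import numpy as np

# A sketch of turning the relation above into an update rule (Q-learning style).
# The learning rate, discount factor, table size, and sample transition below
# are all hypothetical illustration values.
alpha, gamma = 0.1, 0.9          # learning rate and discount factor
q_table = np.zeros((5, 4))       # the same 5-state, 4-action table as before

def q_learning_update(s, a, reward, s_next):
    # Bellman target: R_{t+1} + γ · max over a' of Q(s', a')
    target = reward + gamma * np.max(q_table[s_next])
    # Move Q(s, a) a small step towards the target
    q_table[s, a] += alpha * (target - q_table[s, a])

# One update after observing (state=0, action=1, reward=1.0, next state=2)
q_learning_update(s=0, a=1, reward=1.0, s_next=2)
```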

Model-Free Reinforcement Learning: An Overview

In reinforcement learning, we have agents that use a particular ‘algorithm’ to interact with the environment.

Now this algorithm can be model-based or model-free.

  • The traditional algorithms are model-based in the sense that they tend to develop a ‘model’ of the environment. This model captures the ‘transition probabilities’ and ‘rewards’ for each state and action, which the agent uses for planning its actions.
  • The newer algorithms are model-free in the sense that they do not develop a model of the environment. Instead, they develop a policy that guides the agent to take the best possible action in a given state without needing transition probabilities. In a model-free algorithm, the agent learns by trial and error through interaction with the environment (a small sketch contrasting the two follows this list).
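A small sketch contrasting the two settings is given below; the transition model P, reward table R, and all sizes are made-up illustration values.

```python
import numpy as np

# A small sketch contrasting the two settings; P, R, and all sizes are made up.
n_states, n_actions, gamma = 3, 2, 0.9

# Model-based: the agent has (or learns) the transition probabilities P(s'|s,a)
# and rewards R(s,a), and can plan by sweeping Bellman backups over the model.
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # hypothetical model
R = np.ones((n_states, n_actions))                            # hypothetical rewards
V = np.zeros(n_states)
V = np.max(R + gamma * (P @ V), axis=1)   # one value-iteration style sweep

# Model-free: no P or R are available. The agent only observes sampled
# (reward, next state) pairs from real interaction and updates its value
# estimates from those samples, as in the Q-learning update shown earlier.
```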

In this article we will first get a basic overview of RL, then discuss the difference between model-based and model-free algorithms in detail. We will then study the Bellman equation, which is the basis of model-free learning algorithms, and look at the differences between on-policy and off-policy, and value-based and policy-based, model-free algorithms. After an overview of the major model-free algorithms, we will finish with a simple implementation of the Q-learning algorithm using OpenAI Gym.
