Model-based vs Model Free Reinforcement Learning (RL)

The environment with which the agent interacts is a black box for the agent, i.e. the agent does not know how it operates internally. Based on what the agent tries to learn about this black box, we can divide RL into two categories.

Model-Based

  • In model-based RL, the agent tries to understand how the environment generates outcomes and rewards, and uses that understanding to build a ‘model’ that can simulate the environment.
  • This model is used to simulate possible future states s’ and outcomes, allowing the agent to plan and make decisions based on these simulations.
  • The agent can therefore estimate the reward of an action beforehand, without interacting with the environment, because it now has a model (a simulator) that behaves like the environment.
  • Ultimately the model learns the transition probabilities (the probability of moving from one state to the next, given an action) and which transitions produce good rewards.
  • For example, consider an agent playing computer chess. The agent can try to learn how the opponent is likely to respond if it moves a particular piece. Based on its interactions, the agent builds a model that captures the strategies and nuances of playing a game of chess from start to finish.
  • Example: Dynamic Programming Policy Evaluation (a minimal sketch follows this list).
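To make the model-based idea concrete, here is a minimal sketch of iterative policy evaluation on a tiny, made-up MDP. The states, transition probabilities, rewards, discount factor, and fixed policy below are all hypothetical values chosen only for illustration; the point is that the agent backs up values using a known model rather than by interacting with the environment.

```python
import numpy as np

# Hypothetical 3-state MDP: P[s][a] is a list of (probability, next_state, reward).
# These numbers are made up purely to illustrate having a known model.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(0.8, 2, 2.0), (0.2, 0, 0.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # state 2 is absorbing
}
policy = {0: 1, 1: 0, 2: 0}   # a fixed deterministic policy to evaluate
gamma = 0.9                   # assumed discount factor

# Iterative policy evaluation: repeatedly back up V(s) using the known model
V = np.zeros(len(P))
for _ in range(100):
    new_V = np.zeros_like(V)
    for s in P:
        a = policy[s]
        new_V[s] = sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
    if np.max(np.abs(new_V - V)) < 1e-6:  # stop when the values converge
        V = new_V
        break
    V = new_V

print(V)  # estimated value of each state under the fixed policy
```

No environment interaction happens anywhere in this loop; everything comes from the stored transition probabilities and rewards.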

Model-Free

  • Here the RL agent does not try to understand the environment dynamics. Instead, it builds a guide (a policy) that tells it the optimal behaviour in a given state, i.e. the best action to take in that state. The agent builds this policy by trial and error (a minimal interaction loop is sketched after this list).
  • Here the agent cannot predict or guess the future outcome of its action in advance; it becomes known to the agent only in real time, when it actually acts in the environment.
  • In model-free RL, the focus is on learning by observing the consequences of actions rather than attempting to understand the dynamics of the environment.
  • The agent does not estimate the transition probability distribution (and the reward function) associated with the environment.
  • This approach is particularly useful in situations where the underlying model is either unknown or too complex to be accurately represented.
  • For example, consider a card game: we have a handful of cards in our hand, and we must pick one card to play. Instead of reasoning about all possible future outcomes of playing each card, which is nearly impossible to model, the agent tries to learn which card is best to play given the current hand, based on its interactions with the environment.
  • Examples of such algorithms are SARSA, Policy Gradient, Q-learning, and Deep Q-Networks (DQN).
  • It is important to note that most of the focus now is on model-free RL, and it is what most people mean when they use the term ‘Reinforcement Learning’.
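Here is a minimal sketch of the model-free trial-and-error loop. It uses Gymnasium's FrozenLake-v1 environment and some illustrative hyperparameters (the environment choice and the values of alpha, gamma, and epsilon are assumptions, not from the original text). The agent never stores transition probabilities; it only updates value estimates from transitions it actually experiences.

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # assumed hyperparameters

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Pick an action by trial and error (epsilon-greedy), not by planning with a model
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        # The outcome is only known after actually acting in the environment
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Learn directly from the observed transition (no transition probabilities stored)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```

Contrast this with the model-based sketch above: there, values were computed from a stored model; here, every update comes from a sampled interaction.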

So, in the remainder of this article, we will focus our discussion only on model-free approaches to RL. Model-free algorithms are generally further characterized by whether they are value-based or policy-based, and on-policy or off-policy.

Value-Based and Policy-Based RL

  • Value-Based: Here we don’t store an explicit policy. Instead, the policy is derived from a value function. Value-based RL builds a Q-value function over state–action pairs; these Q-values quantify the expected return of taking a particular action in a given state. The policy derived from this function generally picks the action with the best Q-value.
  • Policy-Based: In the policy-based method we don’t build a value function. Instead, the method directly learns a policy that maps each state to the probability of taking each action so as to maximize the reward. The resulting policy can be either stochastic or deterministic. A short sketch contrasting the two approaches follows this list.
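The sketch below contrasts the two ideas using made-up array sizes and randomly initialized values (everything here is illustrative, not a real trained agent). In the value-based case the policy is read off a Q-table; in the policy-based case a parameterized distribution over actions is stored and sampled directly.

```python
import numpy as np

n_states, n_actions = 5, 3   # sizes chosen only for illustration

# Value-based: we store Q-values, not a policy; the policy is derived from them
Q = np.random.rand(n_states, n_actions)          # stand-in for learned Q(s, a) estimates
def value_based_action(state):
    return int(np.argmax(Q[state]))              # greedy action with respect to Q

# Policy-based: we store the policy itself as action probabilities per state
logits = np.random.rand(n_states, n_actions)     # stand-in for learned policy parameters
def policy_based_action(state):
    probs = np.exp(logits[state]) / np.sum(np.exp(logits[state]))  # softmax over actions
    return int(np.random.choice(n_actions, p=probs))               # stochastic policy

print(value_based_action(0), policy_based_action(0))
```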

Off-Policy and On-Policy Model-Free RL

This division is generally made for value-based methods, based on how the Q-values are updated.

  • Off-Policy: Here the behavioural policy, i.e. the policy used to pick actions, and the policy that is being learned (the target policy) are different. Suppose you are in state s and take an action that leads you to state s’. If we update our Q-function using the best possible action in s’, then we have off-policy RL. This will become clearer when we discuss the Q-learning algorithm below.
  • On-Policy: Here the behavioural policy and the target policy are the same. Suppose you are in state s and take an action that leads you to state s’. If we update our Q-function using the action actually taken in s’, then we have on-policy RL (see the update-rule sketch after this list).
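The difference is easiest to see in the two update rules. The sketch below uses illustrative function names and assumed default values for alpha and gamma: Q-learning (off-policy) backs up the best action available in s’, while SARSA (on-policy) backs up the action a’ actually chosen in s’ by the behaviour policy.

```python
# Off-policy (Q-learning): the target uses the best action in s',
# regardless of which action the behaviour policy actually takes next.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# On-policy (SARSA): the target uses the action a_next actually taken in s'
# by the same policy that is being learned.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```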

Model-Free Reinforcement Learning: An Overview

In Reinforcement learning, we have agents that use a particular ‘algorithm’ to interact with the environment.

Now this algorithm can be model-based or model-free.

  • The traditional algorithms are model-based in the sense that they develop a ‘model’ of the environment. This model captures the ‘transition probabilities’ and ‘rewards’ for each state and action, which the agent then uses to plan its actions.
  • The newer algorithms are model-free in the sense that they do not develop a model of the environment. Instead, they develop a policy that guides the agent to take the best possible action in a given state, without needing transition probabilities. In model-free algorithms, the agent learns by trial and error through interaction with the environment.

In this article we will first get a basic overview of RL, then discuss the difference between model-based and model-free algorithms in detail. We will then study the Bellman equation, which is the basis of model-free learning algorithms, and look at the differences between on-policy and off-policy, and value-based and policy-based, model-free algorithms. Then we will get an overview of the major model-free algorithms. Finally, we will do a simple implementation of the Q-learning algorithm using OpenAI Gym.
