Model-Free Algorithms

Let us discuss some popular model-free algorithms.

Q-learning

Q-learning is a classic RL algorithm that aims to learn the quality of actions in a given state. The Q-value represents the expected cumulative reward of taking a particular action in a specific state. We covered the Q-learning algorithm in detail when we discussed off-policy model-free algorithms.

Q-learning is off-policy: the behavior policy, i.e. the policy with which the agent picks its actions, and the target policy that it is trying to learn are different. The Bellman update equation is given by:

Q(s, a) ← Q(s, a) + α [ r + β max_a’ Q(s’, a’) − Q(s, a) ]

where the symbols are defined below (a short NumPy sketch of this update follows the list):

  • s – the current state of the environment.
  • a – the action taken by the agent.
  • α – the learning rate, a number that controls the size of the updates to the Q-values.
  • r – the reward received by the agent for taking action a in state s.
  • β – the discount factor, which decides how much importance is given to future rewards relative to immediate rewards when making a decision.
  • s’, a’ – the new state and an action taken in the new state.
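
Putting the update together, here is a minimal NumPy sketch of tabular Q-learning with an epsilon-greedy behavior policy. The table sizes, learning rate, discount factor, and exploration rate are illustrative assumptions for this sketch, not values from the text.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumptions for this sketch).
n_states, n_actions = 16, 4
alpha = 0.1      # learning rate (α)
beta = 0.99      # discount factor (β)
epsilon = 0.1    # exploration rate of the epsilon-greedy behavior policy

Q = np.zeros((n_states, n_actions))

def choose_action(s):
    # Behavior policy: epsilon-greedy over the current Q-values.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(s, a, r, s_next):
    # Off-policy target: bootstrap from the best action in the next state,
    # regardless of which action the behavior policy actually takes next.
    target = r + beta * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```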

SARSA

SARSA stands for State-Action-Reward-State-Action. The update equation for SARSA depends on the current state, current action, reward obtained, next state, and next action. SARSA operates by choosing an action according to the current epsilon-greedy policy and updating its Q-values accordingly.

In SARSA, the behavior policy and the target policy are the same. The Bellman update equation for the state-action value pair is given by:

Q(s, a) ← Q(s, a) + α [ r + β Q(s’, a’) − Q(s, a) ]

where a’ is the action actually taken in the next state s’ under the same epsilon-greedy policy (a short NumPy sketch of this update follows the notes below).

  • Note the difference with respect to Q-learning: instead of updating with the maximum Q-value of the next state, we update with the Q-value associated with the action the agent actually takes in the next state.
  • The behavior policy and the target policy are the same: whatever action the agent chooses, the Q-value update is done using the Q-value associated with that action.
  • The agent evaluates and updates its policy based on the data it collects while following that policy.
  • The data used for learning comes from the same policy that is being improved.
  • This type of on-policy learning is preferred when there is a cost associated with a wrong action.
    • For example, when training a robot to walk downstairs, placing a wrong foot carries the cost of physical damage.
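
The only change relative to the Q-learning sketch above is the target term: SARSA bootstraps from the action it actually takes next. A minimal sketch, reusing the same assumed table sizes and hyperparameters:

```python
import numpy as np

n_states, n_actions = 16, 4      # illustrative sizes (assumptions for this sketch)
alpha, beta, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s):
    # Both the behavior policy and the target policy: epsilon-greedy over Q.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy target: use the Q-value of the action actually taken in the
    # next state (a_next), not the maximum over actions.
    target = r + beta * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```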

DQN

DQN (Deep Q-Network) is based on the Q-learning algorithm and integrates deep learning with Q-learning. In tabular Q-learning we build a table over state-action pairs, which is not feasible when the number of state-action pairs becomes very large. So instead of using a table to store a Q-value for each state-action pair, a deep neural network is used to approximate the Q-function: the network plays the role of a function that outputs the quality value for a given state and action (in practice, it takes the state as input and outputs one Q-value per action).
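
As a concrete illustration, here is a minimal PyTorch sketch of the core idea: a small network that maps a state to one Q-value per action, trained toward a Q-learning-style target. The dimensions, architecture, and hyperparameters are assumptions for this sketch; a full DQN also adds a replay buffer and a separate target network, which are omitted here.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4      # illustrative dimensions (assumptions)
beta = 0.99                      # discount factor

# The network replaces the Q-table: state in, one Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    # states/next_states: (batch, state_dim) float tensors,
    # actions: (batch,) long tensor, rewards/dones: (batch,) float tensors.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Q-learning-style target: max over actions in the next state.
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + beta * next_q * (1 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```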

Actor Critic

Actor critic combines elements of both policy-based (actor) and value-based (critic) methods. The main idea is to have two components working together: an actor, which learns a policy to select actions, and a critic, which evaluates the chosen actions.

  • The actor is responsible for selecting actions based on the current policy, which is the policy-based part.
  • The critic evaluates the actions taken by the actor by estimating the value or advantage function, which is the value-based part; a short sketch of both components follows this list.
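
Here is a minimal PyTorch sketch of these two components working together, using a one-step advantage estimate. The network sizes and hyperparameters are assumptions for illustration, not part of the original text.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4      # illustrative dimensions (assumptions)
beta = 0.99

# Actor: maps a state to action preferences (the policy-based part).
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
# Critic: maps a state to a scalar state-value estimate (the value-based part).
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(state, action, reward, next_state, done):
    # state/next_state: (state_dim,) float tensors; action: int;
    # reward: float; done: 0.0 or 1.0.
    value = critic(state).squeeze(-1)
    with torch.no_grad():
        next_value = critic(next_state).squeeze(-1) * (1.0 - done)
    # Advantage: how much better the outcome was than the critic expected.
    advantage = reward + beta * next_value - value

    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -log_prob * advantage.detach()   # move the policy toward advantageous actions
    critic_loss = advantage.pow(2)                # regress the value estimate toward the target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```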

Model-Free Reinforcement Learning: An Overview

In reinforcement learning, an agent uses a particular ‘algorithm’ to interact with the environment.

Now this algorithm can be model-based or model-free.

  • The traditional algorithms are model-based in the sense that they develop a ‘model’ of the environment. This model captures the ‘transition probability’ and ‘rewards’ for each state and action, which the agent uses to plan its actions.
  • The newer algorithms are model-free in the sense that they do not develop a model of the environment. Instead, they learn a policy that guides the agent to take the best possible action in a given state without needing transition probabilities. In a model-free algorithm, the agent learns by trial and error while interacting with the environment (see the short interaction loop after this list).
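
For illustration, here is what learning by interacting looks like at the code level: a minimal loop using the gymnasium package with a random placeholder policy. The environment choice and the step count are assumptions for this sketch.

```python
import gymnasium as gym

# A model-free agent only ever sees sampled transitions like these; it never
# queries the environment for transition probabilities or a reward model.
env = gym.make("FrozenLake-v1")
state, info = env.reset(seed=0)

for _ in range(100):
    action = env.action_space.sample()   # placeholder policy: random actions
    next_state, reward, terminated, truncated, info = env.step(action)
    # a model-free algorithm (e.g. Q-learning) would update its values/policy here
    state = next_state
    if terminated or truncated:
        state, info = env.reset()

env.close()
```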

In this article we will first get a basic overview of RL and then discuss the difference between model-based and model-free algorithms in detail. We will then study the Bellman equation, which is the basis of model-free learning algorithms, and look at the differences between on-policy and off-policy, and between value-based and policy-based, model-free algorithms. After an overview of the major model-free algorithms, we will finish with a simple implementation of the Q-learning algorithm using OpenAI Gym.
