Implementation of Model Free RL

We will use Gymnasium (formerly known as OpenAI Gym) to build a model-free RL agent using the Q-learning algorithm (an off-policy method). To train an RL agent we need an environment the agent can interact with, and that is exactly what the Gym toolkit provides: a variety of ready-made environments for testing different reinforcement learning algorithms. Users can also create and register their own custom environments in Gymnasium, allowing them to test algorithms on tasks specific to their research or application.

Just as PyTorch and TensorFlow have become the standard frameworks for deep learning, Gymnasium has become the de facto standard for benchmarking and evaluating RL algorithms.

To install Gymnasium, we can use the command below:

!pip install gymnasium
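
The install command above is all we need for the Taxi-v3 example. As a side note on the custom environments mentioned earlier, the sketch below shows roughly how registration works; GoLeftEnv and the id 'GoLeft-v0' are hypothetical names and the environment is deliberately trivial, so treat this as an illustrative sketch rather than a reference implementation.

Python3

import gymnasium as gym
from gymnasium import spaces


class GoLeftEnv(gym.Env):
    """Toy 1-D environment: start at position 5, walk left to reach position 0."""

    def __init__(self):
        self.observation_space = spaces.Discrete(6)  # positions 0..5
        self.action_space = spaces.Discrete(2)       # 0 = left, 1 = right
        self.pos = 5

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = 5
        return self.pos, {}                          # observation, info dict

    def step(self, action):
        self.pos = max(0, self.pos - 1) if action == 0 else min(5, self.pos + 1)
        terminated = (self.pos == 0)
        reward = 10 if terminated else -1
        return self.pos, reward, terminated, False, {}


# Register the class so it can be created by name anywhere in the process
gym.register(id='GoLeft-v0', entry_point=GoLeftEnv)
env = gym.make('GoLeft-v0')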

1. Understand the environment

We will be using the Taxi-v3 environment from Gymnasium.

  • The Taxi-v3 environment is a classic reinforcement learning problem often used for learning and testing RL algorithms. In this environment, the agent controls a taxi navigating a grid world to pick up a passenger at one location and drop them off at another.
  • The setting is a 5 × 5 grid.
  • The taxi driver is shown with a yellow background.
  • There are walls represented by vertical lines.
  • The goal is to move the taxi to the passenger’s location (colored in blue), pick up the passenger, move to the passenger’s desired destination (colored in purple), and drop off the passenger.
  • The agent is rewarded as follows:
    • +20 for successfully dropping off the passenger.
    • -10 for unsuccessful attempts to pick up or drop off the passenger.
    • -1 for each step taken by the agent, aiming to encourage the agent to take an efficient route.

Python3

import gymnasium as gym
env = gym.make('Taxi-v3', render_mode='ansi')
env.reset()
 
print(env.render())

                    

Output:

Taxi-V3 environment

  • The gym.make('Taxi-v3', render_mode='ansi') line creates an instance of the Taxi-v3 environment. The render_mode='ansi' argument selects the ANSI rendering mode, a text-based mode suitable for displaying the environment in a console.
  • The env.reset() method is called to reset the environment to its initial state. This is typically done at the beginning of each episode to start fresh.
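
Before building the agent, it also helps to check the size of the problem. Taxi-v3 has a discrete observation space of 500 states (25 taxi positions × 5 passenger locations × 4 destinations) and 6 discrete actions (south, north, east, west, pickup, dropoff), which is small enough to store all Q-values in a simple table.

Python3

print(env.observation_space)     # Discrete(500)
print(env.action_space)          # Discrete(6)
print(env.observation_space.n, env.action_space.n)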

2. Create the Q-learning agent

  • Initialization (__init__ method):
    • env: The environment in which the agent operates.
    • learning_rate: The learning rate for updating Q-values.
    • initial_epsilon: The initial exploration rate.
    • epsilon_decay: The rate at which the exploration rate decreases.
    • final_epsilon: The minimum exploration rate.
    • discount_factor: The discount factor for future rewards.
  • get_action method:
    • With probability ε, it chooses a random action (exploration).
    • With probability 1-ε, it chooses the action with the highest Q-value for the current observation (exploitation).
    • ε is high initially and is gradually reduced during training, shifting the agent from exploration towards exploitation.
  • update method:
    • Updates the Q-value of the current state-action pair using the observed reward and the maximum Q-value of the next state (the exact update rule is written out after this list).
  • decay_epsilon method:
    • Decreases the exploration rate (epsilon) linearly until it reaches its final value.
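
For reference, the update method implements the standard off-policy Q-learning rule, which comes from the Bellman optimality equation:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') - Q(s, a) ]

where α is the learning rate, γ is the discount factor, s' is the next observation, and max_a' Q(s', a') is taken to be 0 when the episode has terminated.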

Python3

import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt
 
 
class QLearningAgent:
    def __init__(self, env, learning_rate, initial_epsilon,
                 epsilon_decay, final_epsilon, discount_factor=0.95):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
 
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
 
        # Initialize an empty dictionary of state-action values
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
 
    def get_action(self, obs) -> int:
        # Epsilon-greedy action selection
        if np.random.rand() < self.epsilon:
            return self.env.action_space.sample()      # explore: random action
        else:
            return int(np.argmax(self.q_values[obs]))  # exploit: best-known action
 
    def update(self, obs, action, reward, terminated, next_obs):
        # Q-learning (Bellman) update; a terminal state has no future value
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])
        temporal_difference = (reward + self.discount_factor * future_q_value
                               - self.q_values[obs][action])
        self.q_values[obs][action] += self.learning_rate * temporal_difference
 
    def decay_epsilon(self):
        """Decrease the exploration rate epsilon until it reaches its final value"""
        self.epsilon = max(self.final_epsilon,
                           self.epsilon - self.epsilon_decay)
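
Before training, a quick sanity check can confirm that the agent produces valid actions and that unseen states default to all-zero Q-values. This is a small usage sketch added for illustration; it is not part of the original walkthrough, and the hyperparameter values here are placeholders.

Python3

env = gym.make('Taxi-v3', render_mode='ansi')
agent = QLearningAgent(env, learning_rate=0.5, initial_epsilon=1.0,
                       epsilon_decay=1e-4, final_epsilon=0.0)

obs, _ = env.reset()
print(agent.get_action(obs))   # an integer action in [0, 5]
print(agent.q_values[obs])     # array of 6 zeros for a state never updated before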

                    

3. Define our training method

  • The train_agent function is responsible for training the provided Q-learning agent in a given environment over a specified number of episodes.
  • Each episode runs until the agent successfully drops off the passenger (termination) or hits the environment's step limit (truncation).
  • The function iterates through the specified number of episodes (episodes).
  • Episode Initialization:
    • It resets the environment and initializes variables to track episode progress.
  • Within each episode, the agent
    • Selects actions based on the method defined in our agent class,
    • Interacts with the environment,
    • updates Q-values and accumulates rewards until the episode terminates.
  • Epsilon Decay:
    • After each episode, the agent’s exploration rate (epsilon) is decayed using the decay_epsilon method.
    • Initially, the epsilon value is high leading to more exploration.
    • It is decayed linearly down to its final value (0 here) over the first half of the episodes.
  • Performance Tracking:
    • The total reward for each episode is stored in the rewards list.
    • Once at least eval_interval (100) episodes have been completed, we compute the average reward of the last 100 episodes and keep track of the best such average seen so far.
  • Print Progress:
    • We print the best average reward every eval_interval (100) episodes.
    • The function also returns the list of per-episode rewards, which we will use for plotting.

Python3

def train_agent(agent, env, episodes, eval_interval=100):
    rewards = []
    best_reward = -np.inf
    for i in range(episodes):
        obs, _ = env.reset()
        terminated = False
        truncated = False
        length = 0
        total_reward = 0
 
        while not (terminated or truncated):
 
            action = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
 
            agent.update(obs, action, reward, terminated, next_obs)
            obs = next_obs
            length = length+1
            total_reward += reward
 
        agent.decay_epsilon()
        rewards.append(total_reward)
 
        if i >= eval_interval:
            avg_return = np.mean(rewards[i-eval_interval: i])
            best_reward = max(avg_return, best_reward)
        if i % eval_interval == 0 and i > 0:
 
            print(f"Episode{i} -> best_reward={best_reward} ")
    return rewards

                    

4. Running our training method

  • Sets up parameters for training, such as the number of episodes, learning rate, discount factor, and exploration rates.
  • Creates the Taxi-v3 environment from OpenAI Gym.
  • Initializes a Q-learning agent (QLearningAgent) with the specified parameters.
  • Calls the train_agent function to train the agent using the specified environment and parameters.

Python3

episodes = 20000
learning_rate = 0.5
discount_factor = 0.95
initial_epsilon = 1
final_epsilon = 0
epsilon_decay = (initial_epsilon - final_epsilon) / (episodes / 2)  # reach final_epsilon after half the episodes
env = gym.make('Taxi-v3', render_mode='ansi')
agent = QLearningAgent(env, learning_rate, initial_epsilon,
                       epsilon_decay, final_epsilon)
 
returns = train_agent(agent, env, episodes)

                    

Output:

Episode100 -> best_reward=-224.3 
Episode200 -> best_reward=-116.22 
Episode300 -> best_reward=-40.75 
Episode400 -> best_reward=-14.89 
Episode500 -> best_reward=-3.9 
Episode600 -> best_reward=1.65 
Episode700 -> best_reward=2.13 
Episode800 -> best_reward=2.13 
Episode900 -> best_reward=3.3 
Episode1000 -> best_reward=4.32 
Episode1100 -> best_reward=6.03 
Episode1200 -> best_reward=6.28 
Episode1300 -> best_reward=7.15 
Episode1400 -> best_reward=7.62 
...

5. Plotting our returns

  • We can plot the reward obtained in each episode against the episode number.
  • We see the reward gradually increase from large negative values towards zero, ultimately reaching a positive value of around 8.6.

Python3

def plot_returns(returns):
    plt.plot(np.arange(len(returns)), returns)
    plt.title('Episode returns')
    plt.xlabel('Episode')
    plt.ylabel('Return')
    plt.show()
 
plot_returns(returns)

                    

Output:

Plot of rewards
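
Because individual episode returns are noisy, a rolling average often shows the learning trend more clearly. The helper below is a small added sketch (plot_smoothed_returns is not part of the original code) and assumes the window is smaller than the number of episodes.

Python3

def plot_smoothed_returns(returns, window=100):
    # Rolling mean over `window` episodes to smooth out per-episode noise
    smoothed = np.convolve(returns, np.ones(window) / window, mode='valid')
    plt.plot(np.arange(len(smoothed)), smoothed)
    plt.title(f'Average return over a {window}-episode window')
    plt.xlabel('Episode')
    plt.ylabel('Average return')
    plt.show()

plot_smoothed_returns(returns)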

6. Running our Agent

The run_agent function executes our trained agent in the Taxi-v3 environment and prints each step of its interaction.

  • agent.epsilon = 0: This line sets the exploration rate (epsilon) of the agent to zero, so that the agent exploits its learned policy without further exploration.
  • The while loop continues until either the episode terminates (terminated == True) or the interaction is truncated (truncated == True)
  • action = agent.get_action(obs): The agent selects an action based on its learned policy
  • env.render(): Renders the updated state after the agent’s action.
  • obs = next_obs: Updates the current state to the next state for the next iteration.

Python3

def run_agent(agent, env):
    agent.epsilon = 0    # No need to keep exploring
    obs, _ = env.reset() # get the current state
    env.render()
    terminated = truncated = False
 
    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        print(env.render())
         
        obs = next_obs
 
env = gym.make('Taxi-v3', render_mode='ansi')
run_agent(agent, env)

                    

Output:

Output of the agent action
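
To put a number on the trained policy rather than judging it from a single rollout, we can also average the return over several purely greedy episodes. The evaluate_agent helper below is an added sketch, not part of the original article code; with a well-trained agent the average return on Taxi-v3 is typically around 8.

Python3

def evaluate_agent(agent, env, episodes=100):
    agent.epsilon = 0                       # act greedily, no exploration
    totals = []
    for _ in range(episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        total_reward = 0
        while not (terminated or truncated):
            action = agent.get_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
        totals.append(total_reward)
    return np.mean(totals)

env = gym.make('Taxi-v3')
print(f"Average greedy return over 100 episodes: {evaluate_agent(agent, env):.2f}")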

