On-Policy Learning In Reinforcement Learning (RL)

On-policy methods are about learning from what you are currently doing. Imagine you’re trying to teach a robot to navigate a maze. In on-policy learning, the robot learns based on the actions it is currently taking. It’s like learning to cook by trying out different recipes yourself. It refers to learning the value of the policy being used by the agent, including the exploration steps. The policy directs the agent’s actions in every state, including the decision-making process while learning. The agent evaluates the outcomes of its present actions, refining its strategy incrementally. This method, much like mastering a skill through hands-on practice, allows the agent to adapt and improve its decision-making by directly engaging with the environment and learning from its own real-time interactions.

SARSA for On-Policy Learning

A prominent example of an on-policy method is SARSA, which stands for State-Action-Reward-State-Action. In SARSA, the agent updates its action-value estimates using the current state-action pair (S, A), the reward (R) received, and the next state-action pair (S', A') chosen by the same policy. The update is based on the observed transition without needing a model of the environment's dynamics. This approach is like learning on the job, where every step you take informs your next decision.

Mathematically, it can be represented as:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]

where,

  • Q(s_t, a_t) represents the action-value function, denoting the expected cumulative future rewards of taking action a_t in state s_t.
  • α is the learning rate, determining the step size for the update.
  • r_{t+1} is the reward received after taking action a_t in state s_t and transitioning to state s_{t+1}.
  • γ is the discount factor, weighing the importance of future rewards.
  • Q(s_{t+1}, a_{t+1}) is the Q-value for the next state-action pair.

The SARSA algorithm updates its Q-values based on the observed reward and the estimate of future rewards, promoting the learning of an optimal policy over successive iterations.
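
To make the update concrete, here is a minimal sketch of a single SARSA update. The learning rate, discount factor, reward, and the tiny Q-table are illustrative assumptions for this sketch, not values taken from the environment used later:

Python3

import numpy as np

# Illustrative values (assumptions for this sketch)
alpha, gamma = 0.1, 0.99
Q = np.zeros((2, 2))      # tiny Q-table: 2 states x 2 actions
s, a = 0, 1               # current state-action pair
r = 1.0                   # reward observed after taking a in s
s_next, a_next = 1, 0     # next state and the action the policy actually chose there

# SARSA update: move Q(s, a) toward r + gamma * Q(s_next, a_next)
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
print(Q[s, a])            # 0.1, since Q(s_next, a_next) is still zero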

On-Policy Learning Implementation

Let's use the OpenAI Gym library, which provides various environments for testing RL algorithms. Here we demonstrate the on-policy approach using SARSA on the FrozenLake environment.

Install necessary Python package

!pip install gym

Step 1: Import necessary packages

Python3

import gym
import numpy as np
import matplotlib.pyplot as plt


Step 2: Initialize Environment

Python3

env = gym.make('FrozenLake-v1')
env.reset()


Step 3: Initialize Q-table and Set Hyperparameters

The Q-table is a matrix where rows correspond to states in the environment and columns to possible actions. Initially, it’s filled with zeros.

Learning Process

The agent learns through episodes. In each episode, it starts in an initial state and continues until a terminal state is reached. During each step:

  • The agent selects an action based on the current policy, typically using an epsilon-greedy strategy (a mix of exploration and exploitation; a small sketch of this selection follows the list).
  • After performing the action, the agent observes the reward and the new state.
  • The Q-value of the current state-action pair is then updated using the observed reward and the Q-value of the next state-action pair chosen by the same policy.
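
The epsilon-greedy selection mentioned above can be written as a small helper. This is an illustrative sketch (the epsilon_greedy function and the demo Q-table are assumptions, not part of the original code); the training loop in Step 4 performs the same selection inline:

Python3

import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    # With probability epsilon explore a random action,
    # otherwise exploit the best-known action for this state
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return np.argmax(Q[state, :])

# Illustrative usage with a tiny all-zero Q-table
Q_demo = np.zeros((4, 2))
print(epsilon_greedy(Q_demo, state=0, epsilon=0.1, n_actions=2))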

Python3

Q = np.zeros([env.observation_space.n, env.action_space.n])  # one row per state, one column per action

# Hyperparameters
alpha = 0.1          # learning rate (step size of each update)
gamma = 0.99         # discount factor for future rewards
epsilon = 0.1        # exploration rate for the epsilon-greedy policy
num_episodes = 1000  # number of training episodes


Step 4: On-Policy Method (SARSA) Algorithm Implementation

The epsilon-greedy strategy is employed for exploration, and the Q-values are updated using the SARSA formula which considers the reward received and the estimated value of the next action according to the current policy. The code tracks rewards and steps per episode for analysis.

Policy Improvement: Over time, as the agent explores the environment and receives feedback (rewards), the Q-table (representing the policy) gets refined, ideally converging to an optimal policy.

Python3

rewards_sarsa = []
steps_per_episode = []

for i in range(num_episodes):
    state = env.reset()  # with gym >= 0.26 use: state, _ = env.reset()

    # Choose the first action with the epsilon-greedy behaviour policy
    if np.random.rand() < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])

    done = False
    total_reward = 0
    step_count = 0

    while not done:
        # with gym >= 0.26: new_state, reward, terminated, truncated, _ = env.step(action); done = terminated or truncated
        new_state, reward, done, _ = env.step(action)

        # SARSA is on-policy: the next action is chosen by the same epsilon-greedy
        # behaviour policy, and it is the action the agent will actually take next
        if np.random.rand() < epsilon:
            new_action = env.action_space.sample()
        else:
            new_action = np.argmax(Q[new_state, :])

        # SARSA update: move Q(s, a) toward r + gamma * Q(s', a')
        Q[state, action] += alpha * (reward + gamma * Q[new_state, new_action] - Q[state, action])

        state, action = new_state, new_action
        total_reward += reward
        step_count += 1

    rewards_sarsa.append(total_reward)
    steps_per_episode.append(step_count)


Step 5: Visualization

Python3

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(rewards_sarsa)
plt.title("Rewards per Episode - SARSA")
plt.xlabel("Episode")
plt.ylabel("Total Reward")
 
plt.subplot(1, 2, 2)
plt.plot(steps_per_episode)
plt.title("Steps per Episode - SARSA")
plt.xlabel("Episode")
plt.ylabel("Steps")
 
plt.tight_layout()
plt.show()
 
print("Training complete with SARSA")


Output:

Graphs

  • The rewards plot illustrates how the agent’s ability to accumulate rewards evolves over episodes, indicating learning efficiency. A rising trend signifies better strategy formulation (a smoothing sketch follows this list).
  • The steps plot demonstrates the agent’s efficiency in completing episodes, where a decreasing trend indicates quicker solutions, reflecting improved decision-making over time.
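
Because each FrozenLake episode yields a reward of either 0 or 1, the raw reward curve can look noisy. A moving average makes the trend easier to read; this is an illustrative addition (the 50-episode window is an assumption), reusing rewards_sarsa and the imports from the steps above:

Python3

plt.figure(figsize=(6, 4))
window = 50  # smoothing window (illustrative choice)
smoothed = np.convolve(rewards_sarsa, np.ones(window) / window, mode='valid')
plt.plot(smoothed)
plt.title("Moving Average of Rewards - SARSA")
plt.xlabel("Episode")
plt.ylabel("Average Reward (last 50 episodes)")
plt.show()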

The agent learns based on the current policy it is following, including the exploration steps. It evaluates and improves the policy it uses to make decisions.

Training complete with SARSA
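
As a quick check of what the learned Q-table represents, one can run the greedy policy (no exploration) for a few evaluation episodes and measure how often the agent reaches the goal. This is an illustrative sketch that reuses Q and env from the steps above; eval_episodes and the success counter are assumptions added here, and the same gym-version caveat from Step 4 applies:

Python3

eval_episodes = 100
successes = 0

for _ in range(eval_episodes):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(Q[state, :])            # purely greedy, no exploration
        state, reward, done, _ = env.step(action)
    successes += reward                            # FrozenLake returns reward 1 only on reaching the goal

print("Greedy policy success rate:", successes / eval_episodes)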

On-Policy vs Off-Policy Methods in Reinforcement Learning

In the world of Reinforcement Learning (RL), two primary approaches dictate how an agent (like a robot or a software program) learns from its environment: on-policy methods and off-policy methods. On-policy methods, such as SARSA, learn the value of the policy the agent is actually following, exploration steps included. Off-policy methods, such as Q-Learning, learn the value of a different target policy (typically the greedy one) from experience generated by the behaviour policy. Understanding this difference is crucial for grasping the fundamentals of RL.
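
The practical difference often comes down to a single line of the update: SARSA bootstraps from the action the behaviour policy actually takes next, while Q-Learning bootstraps from the greedy (maximum-value) action in the next state, regardless of what the agent actually does. A minimal sketch of the two update rules, reusing the variable names from the SARSA loop above:

Python3

# On-policy (SARSA): bootstrap from the action actually taken next
Q[state, action] += alpha * (reward + gamma * Q[new_state, new_action] - Q[state, action])

# Off-policy (Q-Learning): bootstrap from the best action in the next state,
# even if the behaviour policy would not have chosen it
Q[state, action] += alpha * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])

Apart from how the next action enters the update (and the fact that Q-Learning does not need to carry that action forward into the next step), the rest of the training loop can stay the same.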
