This blog post explains the On-policy Every-Visit Monte-Carlo algorithm and its implementation in MountainCar-v0 openai-gym environment. It is assumed that the reader has a basic understanding of Markov Decision Process (MDP), value iteration and policy iteration. Refer to Long-peak into RL- Lilian Weng, which explains all the basic concepts of RL.

For any value-based reinforcement learning (RL) algorithm, prediction (policy evaluation) and control (policy improvement) are the two main steps. Policy evaluation corresponds to finding the state-value for a given policy. The state-value corresponding to state $s$ for a policy $\pi$ is given by,

$v_{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=t_0}^T \gamma^{(t-t_0)} r_t|(s_{t_0}=s)\right]$

where $T$ is the episode length and $\gamma \in (0,1]$ is the discount factor. In Monte-Carlo method, the state-value $v_{\pi}(s)$ is estimated by averaging the returns over episodes. The return for state $s$ vistited at time step $t_0$ while following the policy $\pi$ is given by,

$G(s)|_{(s_{t_0} = s)} = \sum_{t=t_0}^T \gamma^{(t-t_0)} r_t$

In MC method the state-value is estimated as,

$\hat{v}_{\pi}(s) = \frac{\sum_{n=1}^{N}G_i(s)}{N}$

Since a state $s$ can be visited multiple times in a episode, the estimation of state-value function can be either performed using every-visit Monte Carlo method or first-visit Monte Carlo method. Hence, in the above equation $N$ is the number of episodes in which the state is visited at least once in first-visit MC method and in every-visit MC method, $N$ represents the number of times a state is visited including the possible multiple visits in a single episode. Here, $G_i(s)$ is the return obtained corresponding to the $i^{\rm th}$ visit to the state $s$. Following policy $\pi$ will only help us to estimate the state-value corresponding to policy $\pi$ and doesn’t improve the policy. To improve the policy we need to try different policies i.e., exploration has to be done. Hence we use $\epsilon$-soft policy for Monte-Carlo control. In $\epsilon$-soft policy, for a state $s$ the greedy action (action which gives the maximum state-action value) is chosen with probability $\left(1-\epsilon + \frac{\epsilon}{N_a}\right)$ and any other action is chosen with probability given by $\frac{\epsilon}{N_a}$, where $N_a$ is the number of different actions that can be chosen in a state $s$.

We run the Monte-Carlo algorithm on the MountainCar-v0 openai-gym environment. The description of the environment is described as follows:

Environment Description: A car has to travel in a one-dimensional (horizontal) track with the road being similar to a valley between mountains on either side of it. The goal is to reach the top of right side mountain starting in the valley.

State space: 2-D states representing (position, velocity)

Action space:

1. Push left (accelerate left),
2. Push right (accelerate right),
3. Do nothing

Reward:

• Reward $=-1$ for any action taken.
• Reward $=0$ on reaching the goal.

Initial state: $(x,0)$ with $x$ taking value between $[-0.6,-0.4]$ uniformally.

Termination:

• On reaching the goal i.e. $x>0.5$.
• Else if length of episode is $200$.

Remember: $x \in [-1.2, 0.6]$ and the max speed is $0.07$ with the state space being continuous.

From the environment description we can observe that the state space of MountainCar-v0 environment is continuous, whereas the Monte-Carlo RL methods are tabular methods which stores the state-action values $Q(s,a)$ for each state and action pair belonging to the finite-dimensional state and action space. Hence we use the following function to discretize the observations.

def discretize_obs(continuous_state):
pos_precision = 0.09
vel_precision = 0.0007

min_pos = -1.2
min_vel = -0.07

# max_pos = 0.6
# max_vel = 0.07

dis_pos = int((continuous_state[0] - min_pos) / pos_precision)
dis_vel = int((continuous_state[1] - min_vel) / vel_precision)

# print(dis_pos, dis_vel)
return np.array([dis_pos, dis_vel])


The code for First-visit Monte Carlo method is as follows:

class MonteCarlo:
def __init__(self, num_states, epsilon=1, n_actions=3):
self.epsilon = epsilon
self.epsilon_min = 0.01
self.eps_reduction = 0.99999  #(self.epsilon - self.epsilon_min)/400000

self.action_space = np.arange(n_actions)

self.Q = np.random.uniform(size=(num_states[0], num_states[1], n_actions))
self.Q[-1, :, :] = 0

self.count = np.zeros(shape=(num_states[0], num_states[1], n_actions))

self.gamma = 1
self.alpha = 0.001

def update_Qvalues(self, episode):
T = len(episode)
G = 0
states_visited = []
for i in range(T):
states_visited.append(list(episode[T-1-i][0]))

# print(states_visited)
for i in range(T):
# print(i,T)
state = episode[T-1-i][0]
action = episode[T-1-i][1]
reward = episode[T-1-i][2]
# print(list(state))

G =  self.gamma * G + reward

if list(state) not in states_visited[:T-2-i] :
self.count[state[0], state[1], action] += 1
self.Q[state[0], state[1], action] =   (G - self.Q[state[0], state[1], action])/self.count[state[0], state[1], action]
# print(G)

def choose_action(self, state):
if np.random.random() < self.epsilon:
action = np.random.choice(self.action_space)

else:
action = np.argmax(self.Q[state[0], state[1], :])
return action


The First-Vist MC agent as described above was not able to learn a optimal policy to reach the destination even after playing $10^{6}$ games/episodes. Increasing the number of timesteps in a episode to $500$ or $800$ also didn’t work. This may be due to the fact that the agent only receives $-1$ as the reward till it reaches the destination. Hence the agent is not able to find a policy that would reslut in reaching the destination. To overcome this, an additional reward was given whenever the agent took an action that resulted in the position of next state being greater than $0.05$. Since the car has to move back and forth to reach the destination, the actions that resulted in car position being less than $0.05$ were not penalized. An extra reward of $100$ was given whenever the agent reached the destination. It was observed that under this reward reshaping the agent was able to find a optimal policy resulting in the car reaching the destination. Concretely, the changes made to the reward structure was:

• If the position in next state is greater than $0.05$, then a additional reward of $\exp({\text {position of next state}\times5})$ was given. (There is no need to multiply position of next state by $5$. It works for any other value $>1$)
• If it reaches the destination then a extra reward of $100$ was given.
• The maximum number of steps in a episode was increased to $500$.

The corresponding code for training the agent is given below:


env = gym.make("MountainCar-v0")  # specify the environment you wish to have
env._max_episodes_steps = 500
env._max_episode_steps = 500

lr = 0.01  # learning rate
gamma = 0.99 # discount factor
n_games = 1000000
epsilon = 1 # Initial exploration factor
scores = []
avg_scores = []

num_states = (env.observation_space.high - env.observation_space.low)/([pos_precision, vel_precision])
num_states = num_states.astype(int)

agent = MonteCarlo(epsilon=epsilon, n_actions=env.action_space.n, num_states=num_states)

for i in range(n_games):
done = False
score = 0
observation = env.reset()  # initalizes the starting state of the car
discrete_obs = discretize_obs(observation)
episode = []

while not done:
if i%5001 == 0:
# print(f"Episode is {i}")
env.render()
action = agent.choose_action(discrete_obs)
next_observation, reward, done, info = env.step(action)
discrete_next = discretize_obs(next_observation)
# print(discrete_next, reward, action)

if next_observation[0] > 0.05:
reward += np.exp(next_observation[0]*5)
if done:
reward += 100
score += reward

episode.append((discrete_obs, action, reward))

discrete_obs = np.copy(discrete_next)

agent.update_Qvalues(episode)
agent.epsilon = np.max((agent.epsilon*agent.eps_reduction, agent.epsilon_min))
scores.append(score)
avg_scores.append(np.mean(scores))

if (i % 5000 == 0):
print(f"The average reward is {np.mean(scores[-100:])} and epsiode is {i}")

x = [i+1 for i in range(n_games)]
plt.plot(x, avg_scores)
plt.xlabel("Epsiodes")
plt.ylabel("Score")
env.close()


Once the agent was trained, it was run for a specifice number of episodes and most of the times the car reached the destination. The corresponding code is given as follows:

n_games = 20

for i in range(n_games):
done = False
score = 0
observation = env.reset()  # initalizes the starting state of the car
discrete_obs = discretize_obs(observation)
episodes_length = []
episode_length = 0
# episode = []
avg_scores = []

while not done:
env.render()

action = agent.choose_action(discrete_obs)
next_observation, reward, done, info = env.step(action)
discrete_next = discretize_obs(next_observation)
# print(discrete_next, reward, action)

if next_observation[0] > 0.05:
reward += np.exp(next_observation[0]*5)
if done:
reward += 100
score += reward

discrete_obs = np.copy(discrete_next)
episode_length += 1

episodes_length.append(episode_length)

scores.append(score)
avg_scores.append(np.mean(scores[-10:]))

print(f"The epsiode is {i} with episode length {episode_length} and total reward {score: .2f}")
x = [i+1 for i in range(n_games)]
plt.plot(x, avg_scores)
plt.xlabel("Epsiodes")
plt.ylabel("Score")

env.close()


One episode of the agent playing in the MountainCar-v0 environment is given below (In the .gif Episode refers to the time-step):

The complete code of first-vist MC agent for MountainCar-v0 can be found here.