Rank | Name | Score | Train time | Episodes | Git Link |
---|---|---|---|---|---|
1 | Educo | +11.3 | ~75 hours | 15,180 | educo |
After CartPole, we are now going to adjust that code to work for Pong. The whole process is also explained in the blog from Karpathy. The code that will be used here is based on the repository from Karpathy, with some small tweaks to make it converge slightly faster. There is a second rewrite of Karpathy's repository available here, which is the same code but a little cleaner than the original repository.
If you are more interested in the DQN version of this tutorial (or at least in using Keras), spinning up has this blog about using a Sequential model for the game instead of the policy gradient method.
Take a copy of the PGCartPole agent from the previous tutorial, PG CartPole, and think about what we have to change to make it work for Pong.
Of course the class is going to be called PGPong and not PGCartPole. You can use the refactor option (Shift + F6) in PyCharm to quickly rename all occurrences in the code.
In the previous CartPole example we only had to worry about two actions, 0 and 1 (left and right). For this game we have 6 actions; you can view them using these commands in the Python console:
import gym
env = gym.make('Pong-v0')
env.unwrapped.get_action_meanings()
# ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
To make the problem similar to our CartPole problem we are only going to use up (action 2, RIGHT) and down (action 3, LEFT). This means that we have to change the actions to the new values, 2 and 3, instead of 1 and 0.
The gradient that encourages the taken action still expects a label of 1 or 0, so there we have to map action 2 to 1 and action 3 to 0.
# grad that encourages the action that was taken to be taken
# (see http://cs231n.github.io/neural-networks-2/#losses if confused)
log_probability = (1 if action == 2 else 0) - action_probability # <----
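To see what this mapping does, here is a minimal sketch, assuming the sigmoid output of the model is read as the probability of moving up (action 2). The helper name `gradient_signal` is hypothetical and not part of the tutorial code.

```python
def gradient_signal(action, action_probability):
    """Return label - probability, with label 1 for up (action 2) and 0 for down (action 3)."""
    label = 1 if action == 2 else 0  # map the Pong actions 2/3 back to 1/0
    return label - action_probability


# The model says "up" with probability 0.7.
up_signal = gradient_signal(2, 0.7)    # positive: raise the probability of up
down_signal = gradient_signal(3, 0.7)  # negative: lower the probability of up
print(up_signal, down_signal)
```

Without the mapping, `action - action_probability` would always be positive for actions 2 and 3, so both actions would be pushed in the same direction.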
Whenever we receive an observation, we will now preprocess it first. The original observation is quite big and most of it is not useful, so by reducing its size we make it easier for our model to learn the right actions.
A second big addition is a difference map, which looks at the difference between consecutive observations. This is important because there is a time aspect in Pong. In CartPole we got all the information directly from the observation, namely the position and the velocity. For Pong, however, we have to create the velocity information ourselves, and a simple way of doing this is by looking at the difference between two images.
For this we are going to change the observation in the following ways:
Processing | Final image shape | NumPy code |
---|---|---|
Start image | (210, 160, 3) | obs |
Crop first axis to 35:195 | (160, 160, 3) | obs[35:195] |
Downsample by 2 | (80, 80, 3) | obs[::2, ::2] |
Take the R layer from RGB | (80, 80) | obs[:, :, 0] |
Erase background values (144 and 109) | (80, 80) | obs[obs == 144] = 0 and obs[obs == 109] = 0 |
Set all nonzero values to 1 (paddles, ball) | (80, 80) | obs[obs != 0] = 1 |
Convert values to float | (80, 80) | obs.astype(float) |
Flatten the observation | (6400,) | obs.flatten() or obs.ravel() |
Difference map | (6400,) | old_obs - obs |
Tip 1: You can combine the downsampling and picking the R layer in one command (obs[::2, ::2, 0]).
Tip 2: Perform the difference map separately from the others.
def pre_proccessing(self, obs):
    """ Preprocess a 210x160x3 uint8 frame into a 6400 (80x80) 1D float vector """
    obs = obs[35:195]  # crop
    obs = obs[::2, ::2, 0]  # downsample by factor of 2 and keep the R layer
    obs[obs == 144] = 0  # erase background (background type 1)
    obs[obs == 109] = 0  # erase background (background type 2)
    obs[obs != 0] = 1  # everything else (paddles, ball) just set to 1
    return obs.astype(float).ravel()  # np.float was removed from recent NumPy versions
Note 1: The preprocessor can be a staticmethod, but this does not really matter.
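A quick way to check the preprocessing without starting the game is to feed it a synthetic frame. The sketch below is an assumption-laden stand-in: `pre_processing` is a standalone copy of the method above (with an extra `.copy()` so the input frame is left untouched), and the paddle and ball pixel values (92 and 236) are made up for illustration.

```python
import numpy as np


def pre_processing(obs):
    """Standalone sketch of the preprocessing: 210x160x3 uint8 -> 6400 floats."""
    obs = obs[35:195]               # crop
    obs = obs[::2, ::2, 0].copy()   # downsample, keep R layer, do not touch the input
    obs[obs == 144] = 0             # erase background (type 1)
    obs[obs == 109] = 0             # erase background (type 2)
    obs[obs != 0] = 1               # paddles and ball become 1
    return obs.astype(float).ravel()


# Synthetic frame: background everywhere, plus a fake paddle and a fake ball.
frame = np.full((210, 160, 3), 144, dtype=np.uint8)
frame[100:108, 140:144] = 92    # hypothetical paddle color
frame[60:64, 80:82] = 236       # hypothetical ball color

processed = pre_processing(frame)
print(processed.shape, processed.dtype, np.unique(processed))
```

The result should have 6400 entries that are either 0 or 1, which matches the last rows of the table above.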
# In the reset phase
obs = env.reset()
old_obs = np.zeros_like(self.pre_proccessing(obs))
# Calculate forward policy
obs = self.pre_proccessing(obs)
obs, old_obs = old_obs - obs, obs
Note 2: Python handles multiple assignment properly, so there is no need for a temporary variable when swapping. Unlike C++ or Java, Python does not require:
temp = old_obs - obs
old_obs = obs
obs = temp
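To make the difference map concrete, here is a small sketch on a made-up one-dimensional "screen" of 5 pixels instead of 6400. The frames are invented for illustration; only the tuple assignment is taken from the snippet above.

```python
import numpy as np

old_obs = np.zeros(5)                        # reset phase: no previous frame yet
frames = [np.array([0., 1., 0., 0., 0.]),    # ball at index 1
          np.array([0., 0., 1., 0., 0.])]    # ball moved to index 2

diffs = []
for obs in frames:
    # The right-hand side is evaluated completely before anything is assigned,
    # so old_obs - obs still uses the previous frame.
    diff, old_obs = old_obs - obs, obs
    diffs.append(diff)

print(diffs[1])
```

In the second difference map the old ball position is positive and the new position is negative, so a single vector now carries the direction of movement.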
Now that we have a preprocessor, apply it in all the places that require an observation (run and evaluate). By changing the observation we also change the model inputs, so correct the input dimension and test whether everything still works.
The standard weight optimizer that we used in CartPole was Stochastic Gradient Descent, but this is not the only available optimizer. For Pong the RMSProp optimizer will be used, an optimizer that has two parts. One part updates the weights and the other part updates the cache. The cache makes sure that the effective learning rate becomes smaller when the recent gradients are large.
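RMSProp is a variant of Adagrad, which stands for adaptive gradients. A more detailed explanation can be found on Deep Learning Demystified. For now implement the following two update formulas:
$$c \leftarrow \rho \, c + (1 - \rho) \, g^2$$
$$w \leftarrow w + \alpha \, \frac{g}{\sqrt{c} + \epsilon}$$
Where:
$w$ are the weights of a layer (self.model[layer]), $g$ is the gradient accumulated over the batch (gradient_buffer[layer]), $c$ is the cache (rmsprop_cache[layer]), $\rho$ is the decay rate (decay_rate), $\alpha$ is the learning rate (learning_rate) and $\epsilon$ is a small constant (1e-5 in the code) that prevents division by zero. The gradient is added because we want to maximize the reward.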
# perform rmsprop parameter update every batch_size episodes
if episode % self.batch_size == 0:
    for layer, weights in self.model.items():
        gradient = gradient_buffer[layer]
        rmsprop_cache[layer] = self.decay_rate * rmsprop_cache[layer] \
                               + (1 - self.decay_rate) * gradient ** 2
        self.model[layer] += self.learning_rate * gradient / (np.sqrt(rmsprop_cache[layer]) + 1e-5)
        gradient_buffer[layer] = np.zeros_like(weights)
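The same update can be tried in isolation on a toy problem. The sketch below is only an illustration: the function `rmsprop_step` and the objective f(w) = -(w - 3)^2 are made up, and the learning rate is larger than in the tutorial so the toy problem finishes quickly.

```python
import numpy as np


def rmsprop_step(weights, gradient, cache, learning_rate=1e-3, decay_rate=0.99, eps=1e-5):
    """One RMSProp step in the ascent direction; returns the new weights and cache."""
    cache = decay_rate * cache + (1 - decay_rate) * gradient ** 2
    weights = weights + learning_rate * gradient / (np.sqrt(cache) + eps)
    return weights, cache


# Toy objective f(w) = -(w - 3)^2 with its maximum at w = 3.
w = np.array([0.0])
cache = np.zeros_like(w)
for _ in range(2000):
    gradient = -2 * (w - 3)  # derivative of the toy objective
    w, cache = rmsprop_step(w, gradient, cache, learning_rate=0.01)

print(w)  # should end up close to 3
```

Because the gradient is divided by the root of the cache, the step size depends mostly on how the current gradient compares to the recent ones, not on its raw magnitude.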
Since training will be a lot slower, it is a good idea to add a print statement just before the done check. For Pong specifically, we know that a point is scored whenever the reward is not equal to 0. The reward is +1 if we scored a point and -1 if we lost a point, so this updates the printed line every time a point is scored.
# Pong specific, a point is scored
if reward:
    print(f"\r\tEpisode {episode:6d}, score: {int(score): 3d}", end='')
Now add a \r to the final print statement in the bookkeeping (near running_reward), so it overwrites the line above and adds the running mean.
from collections import defaultdict
score = defaultdict(int)  # counts the points per reward value (+1 own, -1 enemy)
# Pong specific, a point is scored
if reward:
    score[reward] += 1
    msg = "\r\tEpisode {ep:6d}, own {own:2d}, enemy {enemy:2d}, total: {total: 2d}"
    print(msg.format(ep=episode, own=score[1], enemy=score[-1], total=score[1] - score[-1]), end='')
...
# To simplify bookkeeping
score = score[1] - score[-1]
running_reward = running_reward * 0.99 + score * 0.01
print(f"\rEpisode {episode:6d}, score: {score: 4.0f}, running mean: {running_reward: 6.2f}")
We already implemented saving the model in PG CartPole, but we didn't store the training progress itself, at least not the scores and the mean over time. Since CartPole only requires a few minutes to run, this was not important, but Pong takes quite a lot of time, so running multiple experiments takes a long time. Therefore it is a good idea to write the results to a file, so they can be analyzed and compared later on.
Note: The code below assumes the fancy printing; otherwise ignore the own and enemy scores.
# Above the init (class constant); needs `import time` at the top of the file
save_file = time.strftime('%Y-%b-%d-%a_%H.%M.%S') + '.csv'
# Add this near the bookkeeping
with open(self.save_file, 'a') as file:
    msg = "episode {ep:6d}, score {s: 2d}, own {o:2d}, enemy {e:2d}, mean {m: 6.2f}\n"
    file.write(msg.format(ep=episode, s=score[1] - score[-1], o=score[1], e=score[-1], m=running_reward))
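To analyze such a file later, each line has to be turned back into numbers. The following is one possible sketch, assuming the exact line format written above; the helper `parse_log_line` and its regular expression are hypothetical and not part of the tutorial code.

```python
import re

# A word followed by whitespace and an (optionally negative, optionally decimal) number.
PAIR = re.compile(r"(\w+)\s+(-?\d+(?:\.\d+)?)")


def parse_log_line(line):
    """Turn 'episode 1, score -21, ...' into a dict that maps each name to a float."""
    return {key: float(value) for key, value in PAIR.findall(line)}


# Build one line in the same format as the logging code and parse it back.
msg = "episode {ep:6d}, score {s: 2d}, own {o:2d}, enemy {e:2d}, mean {m: 6.2f}\n"
line = msg.format(ep=1, s=-21, o=0, e=21, m=-21.0)
record = parse_log_line(line)
print(record)
```

Applying `parse_log_line` to every line of a saved file gives a list of dictionaries that can be plotted or compared between runs.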
There have been quite a few changes, so make sure that every part is working. A good way of doing this is to test your code after every small change that you make. So when implementing the new printing, check that you can still run the code.
A big problem in programming is trying to change or implement too many things at once without verifying that your code is still doing what you expect. Hopefully everything is going well, but in case you have trouble getting some parts to work, check again that the inputs and outputs are what you expect (use the PyCharm debugger with breakpoints). If it is still not working, feel free to contact the education committee via the Slack channel ec-help-me.
import time
import gym
import numpy as np
import pickle
from collections import namedtuple, defaultdict


class PGPong:
    # hyperparameters
    HIDDEN_LAYER = 200  # number of hidden layer neurons
    INPUT_DIMENSION = 6400  # input dimension for the model
    batch_size = 10  # every how many episodes to do a param update?
    learning_rate = 1e-3
    gamma = 0.99  # discount factor for reward
    decay_rate = 0.99  # decay factor for RMSProp leaky sum of grad^2
    render = False
    save_model = True
    save_interval = 100
    save_file = time.strftime('%Y-%b-%d-%a_%H.%M.%S') + '.csv'
    # resume from previous checkpoint?
    resume = False
    save_path = 'Pong.pkl'
    transition_ = namedtuple('transition', ('state', 'hidden', 'probability', 'reward'))

    def __init__(self, game_name):
        self.game_name = game_name
        self.model = self.create_model()
        if self.resume:
            self.load(self.save_path)

    @staticmethod
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    @staticmethod
    def discount_rewards(r, gamma):
        """ take 1D float array of rewards and compute discounted reward """
        running_add = 0
        discounted_r = np.zeros_like(r)
        for idx in reversed(range(0, r.size)):
            running_add = running_add * gamma + r[idx]
            discounted_r[idx] = running_add
        return discounted_r

    def action(self, obs):
        action_probability, hidden = self.policy_forward(obs)
        action = 2 if action_probability >= 0.5 else 3
        return action

    def load(self, save_path):
        try:
            with open(save_path, 'rb') as file:
                self.model = pickle.load(file)
            print("The following model is loaded:", save_path)
        except FileNotFoundError:
            print("The following model could not be found:", save_path)

    def save(self, save_path):
        with open(save_path, 'wb') as file:
            pickle.dump(self.model, file)

    def create_model(self):
        model = dict()
        model['W1'] = np.random.randn(self.HIDDEN_LAYER, self.INPUT_DIMENSION) / np.sqrt(self.INPUT_DIMENSION)
        model['W2'] = np.random.randn(self.HIDDEN_LAYER) / np.sqrt(self.HIDDEN_LAYER)
        return model

    def policy_forward(self, obs):
        """ Return probability of taking action 2 (up), and the hidden state """
        hidden = np.dot(self.model['W1'], obs)
        hidden[hidden < 0] = 0  # ReLU nonlinearity
        log_probability = np.dot(self.model['W2'], hidden)
        probability = self.sigmoid(log_probability)
        return probability, hidden

    def policy_backward(self, episode_observations, episode_hidden, episode_probability):
        """ backward pass. (episode_hidden is the array of intermediate hidden states) """
        dW2 = np.dot(episode_hidden.T, episode_probability)
        dh = np.outer(episode_probability, self.model['W2'])
        dh[episode_hidden <= 0] = 0  # backprop relu
        dW1 = np.dot(dh.T, episode_observations)
        return {'W1': dW1, 'W2': dW2}

    def pre_proccessing(self, obs):
        """ Preprocess a 210x160x3 uint8 frame into a 6400 (80x80) 1D float vector """
        obs = obs[35:195]  # crop
        obs = obs[::2, ::2, 0]  # downsample by factor of 2 and keep the R layer
        obs[obs == 144] = 0  # erase background (background type 1)
        obs[obs == 109] = 0  # erase background (background type 2)
        obs[obs != 0] = 1  # everything else (paddles, ball) just set to 1
        return obs.astype(float).ravel()  # np.float was removed from recent NumPy versions

    def run(self, nr_games=100):
        env = gym.make(self.game_name)
        running_reward = -21  # self.evaluate(nr_games=5)

        # update buffers that add up gradients over a batch and rmsprop memory
        gradient_buffer = {k: np.zeros_like(v, dtype=float) for k, v in self.model.items()}
        rmsprop_cache = {k: np.zeros_like(v, dtype=float) for k, v in self.model.items()}

        for episode in range(1, nr_games + 1):
            score = defaultdict(int)
            memory = list()
            obs = env.reset()
            done = False
            while not done:
                # Render environment
                if self.render:
                    env.render()

                # Calculate forward policy
                obs = self.pre_proccessing(obs)
                action_probability, hidden = self.policy_forward(obs)
                action = 2 if np.random.uniform() < action_probability else 3
                next_obs, reward, done, info = env.step(action)

                # grad that encourages the action that was taken to be taken
                # (see http://cs231n.github.io/neural-networks-2/#losses if confused)
                probability = (1 if action == 2 else 0) - action_probability
                memory.append(self.transition_(obs, hidden, probability, reward))
                obs = next_obs

                # Pong specific, a point is scored
                if reward:
                    score[reward] += 1
                    msg = "\r\tEpisode {ep:6d}, own {own:2d}, enemy {enemy:2d}, total: {total: 2d}"
                    print(msg.format(ep=episode, own=score[1], enemy=score[-1], total=score[1] - score[-1]), end='')

            # Convert memory to a stack
            transition = self.transition_(*zip(*memory))
            observations = np.vstack(transition.state)
            hiddens = np.vstack(transition.hidden)
            probabilities = np.hstack(transition.probability)
            rewards = np.hstack(transition.reward)

            # Calculate discounted rewards
            discounter_reward = self.discount_rewards(rewards, self.gamma)
            # standardize the rewards to be unit normal (helps control the gradient estimator variance)
            discounter_reward -= np.mean(discounter_reward)
            discounter_reward /= np.std(discounter_reward)

            # modulate the gradient with advantage (PG magic happens right here.)
            probabilities *= discounter_reward
            grad = self.policy_backward(observations, hiddens, probabilities)

            # accumulate grad over batch
            for weight in self.model:
                gradient_buffer[weight] += np.array(grad[weight], dtype=float)

            # perform rmsprop parameter update every batch_size episodes
            if episode % self.batch_size == 0:
                for layer, weights in self.model.items():
                    gradient = gradient_buffer[layer]
                    rmsprop_cache[layer] = self.decay_rate * rmsprop_cache[layer] \
                                           + (1 - self.decay_rate) * gradient ** 2
                    self.model[layer] += self.learning_rate * gradient / (np.sqrt(rmsprop_cache[layer]) + 1e-5)
                    gradient_buffer[layer] = np.zeros_like(weights)

            # Write the results of this episode to the log file
            with open(self.save_file, 'a') as file:
                msg = "episode {ep:6d}, score {s: 2d}, own {o:2d}, enemy {e:2d}, mean {m: 6.2f}\n"
                file.write(msg.format(ep=episode, s=score[1] - score[-1], o=score[1], e=score[-1], m=running_reward))

            # Bookkeeping
            score = score[1] - score[-1]
            running_reward = running_reward * 0.99 + score * 0.01
            print(f"\rEpisode {episode:6d}, score: {score: 4.0f}, running mean: {running_reward: 6.2f}\t\t")

            if self.save_model and episode % self.save_interval == 0:
                self.save(self.save_path)

        env.close()

    def evaluate(self, nr_games=100):
        """ Evaluate the model results. """
        env = gym.make(self.game_name)
        collected_scores = []

        for episode in range(1, nr_games + 1):
            obs = env.reset()
            done = False
            score = 0

            while not done:
                # Get action from model
                obs = self.pre_proccessing(obs)
                action = self.action(obs)

                # update everything
                obs, reward, done, info = env.step(action)
                score += reward
                print(f"\r\tGame {episode:3d}/{nr_games:3d} score: {score}", end='')

            collected_scores.append(score)

        average = sum(collected_scores) / nr_games
        print(f"\n\nThe model played: {nr_games} games, with an average score of: {average: 5.2f}")
        return average


if __name__ == '__main__':
    agent = PGPong(game_name='Pong-v0')
    agent.run(nr_games=50_000)
    agent.save(agent.save_path)
    agent.evaluate(nr_games=10)
If all goes well this should start training reasonably fast, at least 100 games per 5 minutes at the start. When the games become closer, a single game takes a lot more time to play. The whole training process is therefore quite tedious and will take some days to complete. We do not ask you to run it for days, but if you want to see some progress, run it for a few hours. If everything is going well, the running mean should increase from -21 to around -18 in the first 1,000 games.
To give an idea of how a run could go, the two tables below show test runs for this lesson.
Episode | Mean | Time (hours) | Remark |
---|---|---|---|
1 | -21.00 | 0 | |
500 | -19.70 | ~1 | |
1,000 | -17.84 | ~2 | |
1,500 | -14.46 | ~3 | |
2,000 | -10.04 | ~4 | |
2,500 | -8.23 | ~6 | |
3,000 | -7.67 | ~9 | |
5,000 | -1.71 | ~17 | Overnight |
6,000 | -2.83 | ~22 | |
7,000 | -1.05 | ~27 | |
7,800 | 0.11 | ~32 | |
10,000 | 1.88 | ~45 | Overnight |
11,000 | 2.77 | ~48 | |
12,000 | 4.49 | ~51 | |
15,300 | 7.31 | ~72 | Overnight |
XX:XXX | 9.00 | XX | Human average |

Episode | Mean | Time (HH:MM) | Clock time / remark |
---|---|---|---|
1 | -21.00 | 00:00 | 17:48 |
500 | -19.85 | 00:10 | 17:58 |
1,000 | -17.91 | 00:37 | 18:15 |
1,500 | -13.02 | 01:19 | 18:57 |
2,000 | -11.49 | 02:13 | 19:51 |
2,500 | -8.93 | 03:28 | 21:06 |
3,000 | -6.81 | 05:00 | 22:38 |
5,000 | 2.23 | ~20:00 | Overnight |
7,000 | 4.02 | ~24:00 | |
This training can of course be sped up with several techniques. The best one is changing the hyperparameters so the algorithm converges faster, but since we do not know better values in advance, we have to try them out, and that takes a lot of time.
Another way is to run multiple instances of the game at the same time. This is called multithreading or multiprocessing. Here Python has a small problem: the Global Interpreter Lock, or GIL, lets only one thread execute Python bytecode at a time, so threading cannot spread a CPU-bound Python program over multiple cores. This is a known limitation, and a good workaround is to use the built-in multiprocessing module.
This multiprocessing option has already been solved for you, but we will get to that a bit later.
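Until then, here is a minimal sketch of the idea. It assumes a hypothetical stand-in function `play_games` that only produces random scores; it does not play Pong and is not the solution referred to above. It only shows how the built-in multiprocessing module spreads independent work over several processes.

```python
import multiprocessing as mp
import random


def play_games(seed, nr_games=1000):
    """Stand-in for playing Pong: returns the mean of some random 'scores'."""
    rng = random.Random(seed)  # every worker gets its own random generator
    scores = [rng.randint(-21, 21) for _ in range(nr_games)]
    return sum(scores) / nr_games


if __name__ == '__main__':
    # Each seed is handled by a separate process, so the GIL of one process
    # does not block the others.
    with mp.Pool(processes=4) as pool:
        results = pool.map(play_games, range(4))
    print(results)
```

The `if __name__ == '__main__'` guard matters here, because on some platforms the worker processes import the script again when they start.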
In this lesson we adapted the policy gradient algorithm for CartPole to work on Pong. One of the things that we added is the preprocessing, to simplify the observation for the model. This simplification occurs a lot in the machine learning world and is generally known as preprocessing your data. It is important that you can show that your preprocessing does what you claim it does. So in the next chapter we will take a look at how we can verify that it works correctly and make the preprocessing more general, so it can also be applied to other gym games.