The author selected Girls Who Code to receive a donation as part of the Write for DOnations program.
Introduction
Reinforcement learning is a subfield within control theory, which concerns controlling systems that change over time and broadly includes applications such as self-driving cars, robotics, and bots for games. Throughout this guide, you will use reinforcement learning to build a bot for Atari video games. This bot is not given access to internal information about the game. Instead, it’s only given access to the game’s rendered display and the reward for that display, meaning that it can only see what a human player would see.
In machine learning, a bot is formally known as an agent. In the case of this tutorial, an agent is a “player” in the system that acts according to a decision-making function, called a policy. The primary goal is to develop strong agents by arming them with strong policies. In other words, our aim is to develop intelligent bots by arming them with strong decision-making capabilities.
You will begin this tutorial by training a basic reinforcement learning agent that takes random actions when playing Space Invaders, the classic Atari arcade game, which will serve as your baseline for comparison. Following this, you will explore several other techniques — including Q-learning, deep Q-learning, and least squares — while building agents that play Space Invaders and Frozen Lake, a simple game environment included in Gym, a reinforcement learning toolkit released by OpenAI. By following this tutorial, you will gain an understanding of the fundamental concepts that govern one’s choice of model complexity in machine learning.
Prerequisites
To complete this tutorial, you will need:
- A server running Ubuntu 18.04, with at least 1GB of RAM. This server should have a non-root user with sudo privileges configured, as well as a firewall set up with UFW. You can set this up by following this Initial Server Setup Guide for Ubuntu 18.04.
- A Python 3 virtual environment, which you can achieve by reading our guide “How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 18.04 Server.”
Alternatively, if you are using a local machine, you can install Python 3 and set up a local programming environment by reading the appropriate tutorial for your operating system via our Python Installation and Setup Series.
Step 1 — Creating the Project and Installing Dependencies
In order to set up the development environment for your bots, you must download the game itself and the libraries needed for computation.
Begin by creating a workspace for this project named AtariBot:
- mkdir ~/AtariBot
Navigate to the new AtariBot directory:
- cd ~/AtariBot
Then create a new virtual environment for the project. You can name this virtual environment anything you’d like; here, we will name it ataribot:
- python3 -m venv ataribot
Activate your environment:
- source ataribot/bin/activate
On Ubuntu, as of version 16.04, OpenCV requires a few more packages to be installed in order to function. These include CMake — an application that manages software build processes — as well as a session manager, miscellaneous extensions, and digital image composition. Run the following command to install these packages:
- sudo apt-get install -y cmake libsm6 libxext6 libxrender-dev libz-dev
NOTE: If you’re following this guide on a local machine running MacOS, the only additional software you need to install is CMake. Install it using Homebrew (which you will have installed if you followed the prerequisite MacOS tutorial) by typing:
- brew install cmake
Next, use pip to install the wheel package, the reference implementation of the wheel packaging standard. A Python library, this package serves as an extension for building wheels and includes a command line tool for working with .whl files:
- python -m pip install wheel
In addition to wheel, you’ll need to install the following packages:
- Gym, a Python library that makes various games available for research, as well as all dependencies for the Atari games. Developed by OpenAI, Gym offers public benchmarks for each of the games so that the performance of various agents and algorithms can be uniformly evaluated.
- TensorFlow, a deep learning library. This library gives us the ability to run computations more efficiently. Specifically, it does this by building mathematical functions using TensorFlow’s abstractions, which can run on your GPU when one is available.
- OpenCV, the computer vision library mentioned previously.
- SciPy, a scientific computing library that offers efficient optimization algorithms.
- NumPy, a linear algebra library.
Install each of these packages with the following command. Note that this command specifies which version of each package to install:
- python -m pip install gym==0.9.5 tensorflow==1.5.0 tensorpack==0.8.0 numpy==1.14.0 scipy==1.1.0 opencv-python==3.4.1.15
Following this, use pip once more to install Gym’s Atari environments, which include a variety of Atari video games, such as Space Invaders:
- python -m pip install gym[atari]
If your installation of the gym[atari] package was successful, your output will end with the following:
Output
Installing collected packages: atari-py, Pillow, PyOpenGL
Successfully installed Pillow-5.4.1 PyOpenGL-3.1.0 atari-py-0.1.7
With these dependencies installed, you’re ready to move on and build an agent that plays randomly to serve as your baseline for comparison.
Step 2 — Creating a Baseline Random Agent with Gym
Now that the required software is on your server, you will set up an agent that will play a simplified version of the classic Atari game, Space Invaders. For any experiment, it is necessary to obtain a baseline to help you understand how well your model performs. Because this agent takes random actions at each frame, we’ll refer to it as our random, baseline agent. In this case, you will compare against this baseline agent to understand how well your agents perform in later steps.
With Gym, you maintain your own game loop. This means that you handle every step of the game’s execution: at every time step, you give gym a new action and ask gym for the game state. In this tutorial, the game state is the game’s appearance at a given time step, and is precisely what you would see if you were playing the game.
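Every Gym interaction follows this same pattern, which the rest of this step builds into a full script. As a minimal sketch (assuming the gym[atari] install from Step 1, and not part of the bot file you are about to create), the loop looks like this:

import gym

env = gym.make('SpaceInvaders-v0')     # create the game environment
state = env.reset()                    # ask Gym for the initial game state
done = False
while not done:
    action = env.action_space.sample()            # hand Gym a new action (random here)
    state, reward, done, info = env.step(action)  # receive the next state, reward, and end-of-game flag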
Using your preferred text editor, create a Python file named bot_2_random.py. Here, we’ll use nano:
- nano bot_2_random.py
Note: Throughout this guide, the bots’ names are aligned with the Step number in which they appear, rather than the order in which they are created. Hence, this bot is named bot_2_random.py rather than bot_1_random.py.
Start this script by adding the following highlighted lines. These lines include a comment block that explains what this script will do and two import statements that will import the packages this script will ultimately need in order to function:
/AtariBot/bot_2_random.py
"""Bot 2 -- Make a random, baseline agent for the SpaceInvaders game."""import gymimport random
Add a main function. In this function, create the game environment — SpaceInvaders-v0 — and then initialize the game using env.reset:
/AtariBot/bot_2_random.py
. . .
import gym
import random


def main():
    env = gym.make('SpaceInvaders-v0')
    env.reset()
Next, add a call to the env.step function. This function returns the following kinds of values:
- state: The new state of the game, after applying the provided action.
- reward: The increase in score that the state incurs. By way of example, this could be when a bullet has destroyed an alien, and the score increases by 50 points. Then, reward = 50. In playing any score-based game, the player’s goal is to maximize the score. This is synonymous with maximizing the total reward.
- done: Whether or not the episode has ended, which usually occurs when a player has lost all lives.
- info: Extraneous information that you’ll put aside for now.
You will use reward to count your total reward. You’ll also use done to determine when the player dies, which will be when done returns True.
Add the following game loop, which instructs the game to loop until the player dies:
/AtariBot/bot_2_random.py
. . .
def main():
    env = gym.make('SpaceInvaders-v0')
    env.reset()
    episode_reward = 0
    while True:
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print('Reward: %s' % episode_reward)
            break
Finally, run the main function. Include a __name__ check to ensure that main only runs when you invoke it directly with python bot_2_random.py. If you do not add the if check, main will always be triggered when the Python file is executed, even when you import the file. Consequently, it’s a good practice to place the code in a main function, executed only when __name__ == '__main__'.
/AtariBot/bot_2_random.py
. . .
def main():
    . . .
        if done:
            print('Reward: %s' % episode_reward)
            break


if __name__ == '__main__':
    main()
Save the file and exit the editor. If you’re using nano, do so by pressing CTRL+X, Y, then ENTER. Then, run your script by typing:
- python bot_2_random.py
Your program will output a number, akin to the following. Note that each time you run the file you will get a different result:
Output
Making new env: SpaceInvaders-v0
Reward: 210.0
These random results present an issue. In order to produce work that other researchers and practitioners can benefit from, your results and trials must be reproducible. To correct this, reopen the script file:
- nano bot_2_random.py
After import random, add random.seed(0). After env = gym.make('SpaceInvaders-v0'), add env.seed(0). Together, these lines “seed” the environment with a consistent starting point, ensuring that the results will always be reproducible. Your final file will match the following, exactly:
/AtariBot/bot_2_random.py
"""Bot 2 -- Make a random, baseline agent for the SpaceInvaders game."""import gymimport randomrandom.seed(0)def main(): env = gym.make('SpaceInvaders-v0') env.seed(0) env.reset() episode_reward = 0 while True: action = env.action_space.sample() _, reward, done, _ = env.step(action) episode_reward += reward if done: print('Reward: %s' % episode_reward) breakif __name__ == '__main__': main()
Save the file and close your editor, then run the script by typing the following in your terminal:
- python bot_2_random.py
This will output the following reward, exactly:
Output
Making new env: SpaceInvaders-v0
Reward: 555.0
This is your very first bot, although it’s rather unintelligent since it doesn’t account for the surrounding environment when it makes decisions. For a more reliable estimate of your bot’s performance, you could have the agent run for multiple episodes at a time, reporting the average reward across those episodes. To configure this, first reopen the file:
- nano bot_2_random.py
After random.seed(0), add the following highlighted line which tells the agent to play the game for 10 episodes:
/AtariBot/bot_2_random.py
. . .
random.seed(0)
num_episodes = 10
. . .
Right after env.seed(0), start a new list of rewards:
/AtariBot/bot_2_random.py
. . .
    env.seed(0)
    rewards = []
. . .
Nest all code from env.reset() to the end of main() in a for loop, iterating num_episodes times. Make sure to indent each line from env.reset() to break by four spaces:
/AtariBot/bot_2_random.py
. . .
def main():
    env = gym.make('SpaceInvaders-v0')
    env.seed(0)
    rewards = []

    for _ in range(num_episodes):
        env.reset()
        episode_reward = 0
        while True:
            ...
Right before break, currently the last line of the main game loop, add the current episode’s reward to the list of all rewards:
/AtariBot/bot_2_random.py
. . .
            if done:
                print('Reward: %s' % episode_reward)
                rewards.append(episode_reward)
                break
. . .
At the end of the main function, report the average reward:
/AtariBot/bot_2_random.py
. . .
def main():
    ...
                print('Reward: %s' % episode_reward)
                break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))
. . .
Your file will now align with the following. Please note that the following code block includes a few comments to clarify key parts of the script:
/AtariBot/bot_2_random.py
"""Bot 2 -- Make a random, baseline agent for the SpaceInvaders game."""import gymimport randomrandom.seed(0) # make results reproduciblenum_episodes = 10def main(): env = gym.make('SpaceInvaders-v0') # create the game env.seed(0) # make results reproducible rewards = [] for _ in range(num_episodes): env.reset() episode_reward = 0 while True: action = env.action_space.sample() _, reward, done, _ = env.step(action) # random action episode_reward += reward if done: print('Reward: %d' % episode_reward) rewards.append(episode_reward) break print('Average reward: %.2f' % (sum(rewards) / len(rewards)))if __name__ == '__main__': main()
Save the file, exit the editor, and run the script:
- python bot_2_random.py
This will print the following average reward, exactly:
Output
Making new env: SpaceInvaders-v0
. . .
Average reward: 163.50
We now have a more reliable estimate of the baseline score to beat. To create a superior agent, though, you will need to understand the framework for reinforcement learning. How can one make the abstract notion of “decision-making” more concrete?
Understanding Reinforcement Learning
In any game, the player’s goal is to maximize their score. In this guide, the player’s score is referred to as their reward. To maximize their reward, the player must be able to refine their decision-making abilities. Formally, a decision is the process of looking at the game, or observing the game’s state, and picking an action. Our decision-making function is called a policy; a policy accepts a state as input and “decides” on an action:
policy: state -> action
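To make this mapping concrete, here is a minimal sketch of a policy in Python: a function that accepts a state and returns an action. The action names are illustrative placeholders rather than Gym identifiers, and this particular policy ignores the state entirely, just like the random baseline agent from Step 2:

import random

ACTIONS = ['left', 'right', 'shoot']   # illustrative action names, not Gym identifiers

def random_policy(state):
    # A policy maps a state to an action; this one simply picks at random.
    return random.choice(ACTIONS)

print(random_policy('state0'))  # e.g. 'shoot'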
To build such a function, we will start with a specific set of algorithms in reinforcement learning called Q-learning algorithms. To illustrate these, consider the initial state of a game, which we’ll call state0: your spaceship and the aliens are all in their starting positions. Then, assume we have access to a magical “Q-table” which tells us how much reward each action will earn:
state | action | reward
--- | --- | ---
state0 | shoot | 10
state0 | right | 3
state0 | left | 3
The shoot action will maximize your reward, as it results in the reward with the highest value: 10. As you can see, a Q-table provides a straightforward way to make decisions, based on the observed state:
policy: state -> look at Q-table, pick action with greatest reward
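As a quick sketch, the hypothetical Q-table above could be written in Python as a dictionary keyed by (state, action) pairs, with the policy picking whichever action has the greatest recorded reward for the current state:

# Hypothetical Q-table from the example above; the states and actions are placeholders.
q_table = {
    ('state0', 'shoot'): 10,
    ('state0', 'right'): 3,
    ('state0', 'left'): 3,
}

def policy(state):
    # Look at every action available in this state and pick the one
    # with the greatest reward in the Q-table.
    actions = [a for (s, a) in q_table if s == state]
    return max(actions, key=lambda a: q_table[(state, a)])

print(policy('state0'))  # 'shoot'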
However, most games have too many states to list in a table. In such cases, the Q-learning agent learns a Q-function instead of a Q-table. We use this Q-function similarly to how we used the Q-table previously. Rewriting the table entries as functions gives us the following:
Q(state0, shoot) = 10
Q(state0, right) = 3
Q(state0, left) = 3
Given a particular state, it’s easy for us to make a decision: we simply look at each possible action and its reward, then take the action that corresponds with the highest expected reward. Reformulating the earlier policy more formally, we have:
policy: state -> argmax_{action} Q(state, action)
This satisfies the requirements of a decision-making function: given a state in the game, it decides on an action. However, this solution depends on knowing Q(state, action) for every state and action. To estimate Q(state, action), consider the following:
- Given many observations of an agent’s states, actions, and rewards, one can obtain an estimate of the reward for every state and action by taking a running average.
- Space Invaders is a game with delayed rewards: the player is rewarded when the alien is blown up and not when the player shoots. However, the player taking an action by shooting is the true impetus for the reward. Somehow, the Q-function must assign (state0, shoot) a positive reward.
These two insights are codified in the following equations:
Q(state, action) = (1 - learning_rate) * Q(state, action) + learning_rate * Q_target
Q_target = reward + discount_factor * max_{action'} Q(state', action')
These equations use the following definitions:
- state: the state at the current time step
- action: the action taken at the current time step
- reward: the reward for the current time step
- state': the new state at the next time step, the result of taking action from state
- action': all possible actions
- learning_rate: the learning rate
- discount_factor: the discount factor, how much reward “degrades” as we propagate it
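As a minimal sketch, these two equations translate into the following Python, applied here to a single hypothetical transition (the table size, state indices, and reward are placeholder values):

import numpy as np

learning_rate = 0.9
discount_factor = 0.8

Q = np.zeros((16, 4))   # a small Q-table: 16 states, 4 actions

# One hypothetical transition: take action 2 in state 0, earn a reward of 1.0,
# and land in state 4.
state, action, reward, state2 = 0, 2, 1.0, 4

# Q_target blends the immediate reward with the discounted best value of the next state.
Q_target = reward + discount_factor * np.max(Q[state2, :])

# Move the old estimate toward the target, weighted by the learning rate.
Q[state, action] = (1 - learning_rate) * Q[state, action] + learning_rate * Q_target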
For a complete explanation of these two equations, see this article on Understanding Q-Learning.
With this understanding of reinforcement learning in mind, all that remains is to actually run the game and obtain these Q-value estimates for a new policy.
Step 3 — Creating a Simple Q-learning Agent for Frozen Lake
Now that you have a baseline agent, you can begin creating new agents and compare them against the original. In this step, you will create an agent that uses Q-learning, a reinforcement learning technique used to teach an agent which action to take given a certain state. This agent will play a new game, FrozenLake. The setup for this game is described as follows on the Gym website:
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you’ll fall into the freezing water. At this time, there’s an international frisbee shortage, so it’s absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won’t always move in the direction you intend.
The surface is described using a grid like the following:
SFFF    (S: starting point, safe)
FHFH    (F: frozen surface, safe)
FFFH    (H: hole, fall to your doom)
HFFG    (G: goal, where the frisbee is located)
The player starts at the top left, denoted by S, and works its way to the goal at the bottom right, denoted by G. The available actions are right, left, up, and down, and reaching the goal results in a score of 1. There are a number of holes, denoted H, and falling into one immediately results in a score of 0.
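If you would like to look at the environment firsthand before modifying the bot, the short, optional sketch below creates FrozenLake-v0 in a Python shell, renders the grid, and prints the sizes of its state and action spaces:

import gym

env = gym.make('FrozenLake-v0')
env.reset()
env.render()                    # prints the SFFF / FHFH / FFFH / HFFG grid
print(env.observation_space.n)  # 16 discrete states, one per grid cell
print(env.action_space.n)       # 4 actions: left, down, right, up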
In this section, you will implement a simple Q-learning agent. Using what you’ve learned previously, you will create an agent that trades off between exploration and exploitation. In this context, exploration means the agent acts randomly, and exploitation means it uses its Q-values to choose what it believes to be the optimal action. You will also create a table to hold the Q-values, updating it incrementally as the agent acts and learns.
Make a copy of your script from Step 2:
- cp bot_2_random.py bot_3_q_table.py
Then open up this new file for editing:
- nano bot_3_q_table.py
Begin by updating the comment at the top of the file that describes the script’s purpose. Because this is only a comment, this change isn’t necessary for the script to function properly, but it can be helpful for keeping track of what the script does:
/AtariBot/bot_3_q_table.py
"""Bot 3 -- Build simple q-learning agent for FrozenLake""". . .
Before you make functional modifications to the script, you will need to import numpy for its linear algebra utilities. Right underneath import gym, add the highlighted line:
/AtariBot/bot_3_q_table.py
"""Bot 3 -- Build simple q-learning agent for FrozenLake"""import gymimport numpy as npimport randomrandom.seed(0) # make results reproducible. . .
Underneath random.seed(0), add a seed for numpy:
/AtariBot/bot_3_q_table.py
. . .
import random

random.seed(0)  # make results reproducible
np.random.seed(0)
. . .
Next, make the game states accessible. Update the env.reset() line to say the following, which stores the initial state of the game in the variable state:
/AtariBot/bot_3_q_table.py
. . .
    for _ in range(num_episodes):
        state = env.reset()
        . . .
Update the env.step(...) line to say the following, which stores the next state, state2. You will need both the current state and the next one — state2 — to update the Q-function.
/AtariBot/bot_3_q_table.py
. . .
        while True:
            action = env.action_space.sample()
            state2, reward, done, _ = env.step(action)
            . . .
After episode_reward += reward, add a line updating the variable state. This keeps the variable state updated for the next iteration, as you will expect state to reflect the current state:
/AtariBot/bot_3_q_table.py
. . .
        while True:
            . . .
            episode_reward += reward
            state = state2
            if done:
                . . .
In the if done block, delete the print statement which prints the reward for each episode. Instead, you’ll output the average reward over many episodes. The if done block will then look like this:
/AtariBot/bot_3_q_table.py
. . .
            if done:
                rewards.append(episode_reward)
                break
. . .
After these modifications your game loop will match the following:
/AtariBot/bot_3_q_table.py
. . .
    for _ in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        while True:
            action = env.action_space.sample()
            state2, reward, done, _ = env.step(action)
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                break
. . .
Next, add the ability for the agent to trade off between exploration and exploitation. Right before your main game loop (which starts with for...), create the Q-value table:
/AtariBot/bot_3_q_table.py
. . .
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        . . .
Then, rewrite the for loop to expose the episode number:
/AtariBot/bot_3_q_table.py
. . .
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(1, num_episodes + 1):
        . . .
Inside the while True: inner game loop, create noise. Noise, or meaningless random data, is sometimes introduced when training deep neural networks because it can improve both the performance and the accuracy of the model. Note that the higher the noise, the less the values in Q[state, :] matter. As a result, the higher the noise, the more likely that the agent acts independently of its knowledge of the game. In other words, higher noise encourages the agent to explore random actions:
/AtariBot/bot_3_q_table.py
. . .
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = env.action_space.sample()
            . . .
Note that as episode increases, the amount of noise decreases quadratically; for example, by episode 10 the noise is at most one-hundredth of its starting scale. As time goes on, the agent explores less and less because it can trust its own assessment of the game’s reward and can begin to exploit its knowledge.
Update the action line to have your agent pick actions according to the Q-value table, with some exploration built in:
/AtariBot/bot_3_q_table.py
. . .
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
            state2, reward, done, _ = env.step(action)
            . . .
Your main game loop will then match the following:
/AtariBot/bot_3_q_table.py
. . .
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(1, num_episodes + 1):
        state = env.reset()
        episode_reward = 0
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
            state2, reward, done, _ = env.step(action)
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                break
. . .
Next, you will update your Q-value table using the Bellman update equation, an equation widely used in machine learning to find the optimal policy within a given environment.
The Bellman equation incorporates two ideas that are highly relevant to this project. First, taking a particular action from a particular state many times will result in a good estimate for the Q-value associated with that state and action. To this end, you will increase the number of episodes this bot must play through in order to return a stronger Q-value estimate. Second, rewards must propagate through time, so that the original action is assigned a non-zero reward. This idea is clearest in games with delayed rewards; for example, in Space Invaders, the player is rewarded when the alien is blown up and not when the player shoots. However, the player shooting is the true impetus for a reward. Likewise, the Q-function must assign (state0, shoot) a positive reward.
First, update num_episodes to equal 4000:
/AtariBot/bot_3_q_table.py
. . .
np.random.seed(0)
num_episodes = 4000
. . .
Then, add the necessary hyperparameters to the top of the file in the form of two more variables:
/AtariBot/bot_3_q_table.py
. . .
num_episodes = 4000
discount_factor = 0.8
learning_rate = 0.9
. . .
Compute the new target Q-value, right after the line containing env.step(...):
/AtariBot/bot_3_q_table.py
. . .
            state2, reward, done, _ = env.step(action)
            Qtarget = reward + discount_factor * np.max(Q[state2, :])
            episode_reward += reward
            . . .
On the line directly after Qtarget, update the Q-value table using a weighted average of the old and new Q-values:
/AtariBot/bot_3_q_table.py
. . .
            Qtarget = reward + discount_factor * np.max(Q[state2, :])
            Q[state, action] = (1-learning_rate) * Q[state, action] + learning_rate * Qtarget
            episode_reward += reward
            . . .
Check that your main game loop now matches the following:
/AtariBot/bot_3_q_table.py
. . .
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(1, num_episodes + 1):
        state = env.reset()
        episode_reward = 0
        while True:
            noise = np.random.random((1, env.action_space.n)) / (episode**2.)
            action = np.argmax(Q[state, :] + noise)
            state2, reward, done, _ = env.step(action)
            Qtarget = reward + discount_factor * np.max(Q[state2, :])
            Q[state, action] = (1-learning_rate) * Q[state, action] + learning_rate * Qtarget
            episode_reward += reward
            state = state2
            if done:
                rewards.append(episode_reward)
                break
. . .
Our logic for training the agent is now complete. All that’s left is to add reporting mechanisms.
Even though Python does not enforce strict type checking, add types to your function declarations for cleanliness. At the top of the file, before the first line reading import gym, import the List type:
/AtariBot/bot_3_q_table.py
. . .