0. OpenAI Gym

September 23, 2018 • Busa Victor

This article is the first of a long series of articles about reinforcement learning. This series is intended for readers who already have some notions of machine learning and are comfortable with Python and TensorFlow. Don't worry, you don't need to be an expert in TensorFlow: you can just read my introductory article about TensorFlow [TODO here]. In this article, I will present what OpenAI Gym is, how we can install it and how we can use it. The code for this tutorial is available here.

OpenAI Gym

In machine learning and particularly in deep learning, once we have implemented our model (CNN, RNN, …), what we need to assess its quality is data: we feed our model with data and it learns from the examples it sees. In reinforcement learning, we don't need data. Instead, we need an environment with a set of rules and a set of functions. For example, a chessboard together with all the rules of chess forms an environment.

Creating an environment is quite complex and bothersome. If you create the environment for the chess game, you won't be able to reuse it for the game of Go, or for some Atari games. This is where OpenAI Gym comes into play: OpenAI Gym provides a set of virtual environments that you can use to test the quality of your agents. To understand how to use OpenAI Gym, I will focus on one of the most basic environments in this article: FrozenLake.

Installing OpenAI Gym

We will install OpenAI Gym on Anaconda to be able to code our agent in a Jupyter notebook, but OpenAI Gym can be installed on any regular Python installation.

To install OpenAI Gym:

  • Open a git bash and type git clone https://github.com/openai/gym

or download the repository as a zip archive from GitHub.

Then:

  1. Extract the contents of the zip (skip this step if you cloned the repository)
  2. Open an Anaconda prompt and go to the gym folder by typing: cd path/to/the/gym/folder
  3. Type pip install gym
  4. You're done!

If you type pip freeze you should see the gym package.
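You can also do a quick sanity check from a Python interpreter or a notebook cell (the version number below is only an example, yours may differ):

import gym                # if this import works, the installation succeeded
print(gym.__version__)    # e.g. 0.10.5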

Playing with OpenAI Gym

In this section, I will briefly present how to interact with the environments from OpenAI Gym. I will only focus on the FrozenLake-v0 environment in this article. The FrozenLake-v0 environment is (by default) a $4 \times 4$ grid that is represented as follows:

SFFF
FHFH
FFFH
HFFG

Where:

  • F represents a Frozen tile, that is to say that if the agent is on a frozen tile and chooses to go in a certain direction, he won't necessarily go in that direction.
  • H represents a Hole. If the agent falls into a hole, he dies and the game ends there.
  • G represents the Goal. If the agent reaches the goal, he wins the game.
  • S represents the Start state. This is where the agent is at the beginning of the game.

Figure 2 shows a friendlier visualization of the FrozenLake-v0 board game.

Figure 2: FrozenLake-v0 board game

To load the FrozenLake-v0 environment, you can simply write, in Python:

import gym # import the gym package
env = gym.make('FrozenLake-v0') # load the env

To reset the environment, write:

# reset the env; it returns the start state
s = env.reset()

To render an environment write:

env.render()

The result of the previous command is simply:

SFFF
FHFH
FFFH
HFFG

To see the number of states/actions write:

print(env.action_space) # Discrete(4)
print(env.observation_space) # Discrete(16)

This means that the FrozenLake-v0 environment has 4 discrete actions and 16 discrete states. To actually recover an int instead of a Discrete(int_value) object, you can add .n as follows:

print(env.action_space.n) # 4
print(env.observation_space.n) # 16

or you can use:

print(env.env.nA) # number of actions: 4
print(env.env.nS) # number of states: 16
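It is also useful to know how those 4 actions are encoded. In the FrozenLake source code (at least in the version I am reading), the actions are plain integers and the module exposes them as constants, so we can check the mapping directly; note that this snippet relies on an implementation detail of FrozenLake rather than on the generic Gym API:

from gym.envs.toy_text.frozen_lake import LEFT, DOWN, RIGHT, UP
print(LEFT, DOWN, RIGHT, UP) # 0 1 2 3 (Left, Down, Right, Up)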

To let our agent execute an action $a$, we can use:

# execute the action `a`. The environment gives us back 4 values:
# - the next state we end up in after executing the action
# - the reward we get from executing that action
# - whether or not the episode is over
# - an `info` dict (for FrozenLake it contains the probability of the
#   transition that was actually taken, under the key 'prob')
next_state, reward, done, info = env.step(a)

To randomly sample an action from the set of actions, we can write:

# a is a random action
a = env.action_space.sample()
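Putting the previous snippets together, here is a minimal sketch (my own, assuming the FrozenLake-v0 environment created above) of a full episode played by an agent that acts completely at random:

s = env.reset()                            # start a new episode
done = False
while not done:
    a = env.action_space.sample()          # pick a random action
    s, reward, done, info = env.step(a)    # execute it in the environment
env.render()                               # display the final board
print('final reward:', reward)             # 1.0 if the agent reached the Goal, 0.0 otherwise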

To retrieve all the information available from the environment for a particular state-action pair $(s, a)$, you can write:

env.env.P[s][a]

Where $s$ is the state and $a$ is the action. Hence env.env.P[s][a] gives you all the information available if you are in state $s$ and execute action $a$. For example:

env.env.P[0][0]

returns:

[(0.3333333333333333, 0, 0.0, False),
 (0.3333333333333333, 0, 0.0, False),
 (0.3333333333333333, 4, 0.0, False)]

It means that, from the state-action pair $(0,0)$:

  • you can remain in the same state ($0$) with probability $1/3$ and get a reward of $0.0$; state $0$ is not a terminal state (False)
  • you can again remain in the same state ($0$) with probability $1/3$ and get a reward of $0.0$ (two of the three possible slips lead back to state $0$, hence the duplicated entry)
  • you can go to state $4$ with probability $1/3$ and get a reward of $0.0$; state $4$ is not a terminal state (False)
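If you find the raw tuples hard to read, here is a small helper of my own (not part of Gym) that prints them in a more readable way; it only assumes that the environment exposes env.env.P as shown above:

def describe_transitions(env, s, a):
    # each entry of env.env.P[s][a] is a tuple (probability, next_state, reward, done)
    for prob, next_state, reward, done in env.env.P[s][a]:
        print('with prob {:.2f}: go to state {} (reward {}, terminal: {})'
              .format(prob, next_state, reward, done))

describe_transitions(env, 0, 0)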

Here is another example, to make sure you understand how to interpret the information returned by env.env.P[s][a]:

# state 15 is the Goal state, so state 14 is the state 
# at the left side of the Goal state.
env.env.P[14][1] # action 1 corresponds to Down

returns:

[(0.3333333333333333, 13, 0.0, False),
 (0.3333333333333333, 14, 0.0, False),
 (0.3333333333333333, 15, 1.0, True)]

It means that, from the state-action pair $(14,1)$:

  • you can go to state $13$ with probability $1/3$ and get a reward of $0.0$; state $13$ is not a terminal state (False)
  • you can remain in the same state ($14$) with probability $1/3$ and get a reward of $0.0$; state $14$ is not a terminal state (False)
  • you can go to state $15$ with probability $1/3$ and get a reward of $1.0$; state $15$ is a terminal state (True)

Customizing the environment

The FrozenLake-v0 environment contains by default 16 states ($4 \times 4$ grid) and the environment is stochastic, which means that if we tell our agent to execute action $a$, it will not necessarily execute it. Let's say we want our agent to execute the action Up. The agent will actually go Up with probability $1/3$, go Left with probability $1/3$ and go Right with probability $1/3$. The curious reader can see how the environment is implemented when the tiles are slippery (the environment is stochastic) here. What if we want the environment to be deterministic, that is to say, if I tell my agent to execute action $a$, I'm 100% sure that it will actually execute action $a$? To do so we can customize the environment, and that is exactly the topic of this section.
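Before customizing anything, we can verify this slippery behaviour empirically with a tiny experiment of my own (it is just a sanity check, not something required by Gym):

import gym
from collections import Counter

env = gym.make('FrozenLake-v0')
counts = Counter()
for _ in range(1000):
    env.reset()
    # from the start state, always ask for action 1 (Down)
    next_state, reward, done, info = env.step(1)
    counts[next_state] += 1

print(counts) # roughly one third of the episodes end up in each of the states 0, 1 and 4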

To register a new environment, we can write:

from gym.envs.registration import register
register(
    id='Deterministic-4x4-FrozenLake-v0', # name given to this new environment
    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv', # env entry point
    kwargs={'map_name': '4x4', 'is_slippery': False} # argument passed to the env
)

Here we specified that we want to load the $4 \times 4$ map, that we want the tiles to be non-slippery (deterministic environment), and that this new environment is registered under the name Deterministic-4x4-FrozenLake-v0.

Then we just have to load our new Deterministic-4x4-FrozenLake-v0 environment instead of the usual FrozenLake-v0 environment:

env = gym.make('Deterministic-4x4-FrozenLake-v0') # load the environment
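As a quick sanity check of my own (it assumes the env.env.P attribute presented earlier), we can verify that each state-action pair now has a single transition with probability $1$:

print(env.env.P[0][1]) # [(1.0, 4, 0.0, False)]: Down from the start state always leads to state 4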

For example, if we want to load the $8 \times 8$ grid with non-slippery tiles, we just have to register that environment as follows:

from gym.envs.registration import register
# map 8x8 with non slippery tiles
register(
    id='Deterministic-8x8-FrozenLake-v0',
    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',
    kwargs={'map_name': '8x8', 'is_slippery': False})

How can I know which arguments kwargs accepts, and which values are valid? The best way is to look at the source code here. From the source code we can see that the map_name argument only accepts the values 4x4 and 8x8. What if I want to create my own map? We can also see that FrozenLakeEnv accepts a desc argument (which stands for description) that allows us to define our own map. Let's see how we can accomplish such a thing with an example.

I define my own map using a list of strings that contain the F, G, H and S characters we have already seen:

my_desc = [
    "SFFFF",
    "FHFHH",
    "FHFFH",
    "HHFFG"
]

Then I register my environment by passing my_desc:

from gym.envs.registration import register

register(
    id='Stochastic-4x5-FrozenLake-v0', # the custom map has 4 rows and 5 columns
    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',
    kwargs={'desc': my_desc, 'is_slippery': True})

To make sure my environment was successfully loaded, I can display it using:

env = gym.make('Stochastic-4x5-FrozenLake-v0')
env.render()

and it actually outputs our environment:

SFFFF
FHFHH
FHFFH
HHFFG
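As a small sanity check of my own, the number of states should now match the size of the custom map ($4 \times 5 = 20$) while the number of actions is unchanged:

print(env.observation_space.n) # 20 states for the 4x5 map
print(env.action_space.n)      # still 4 actions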

What if I want to change the reward or the probability of a certain state-action pair? If you want to achieve such a thing, the easiest way is to create your own class that inherits from gym.envs.toy_text.frozen_lake.FrozenLakeEnv. Then, in the constructor of your class, you can redefine whatever you want. For example, let's create an environment that changes the reward for reaching an H state to $-5$ and the reward for reaching the G state to $10$. To do so we need to create a new Python file. I named it my_env.py and I put inside:

from gym.envs.toy_text.frozen_lake import FrozenLakeEnv


class CustomizedFrozenLake(FrozenLakeEnv):
    def __init__(self, **kwargs):
        super(CustomizedFrozenLake, self).__init__(**kwargs)

        for state in range(self.nS): # for all states
            for action in range(self.nA): # for all actions
                my_transitions = []
                for (prob, next_state, _, is_terminal) in self.P[state][action]:
                    # recover the (row, col) position of the next state on the grid
                    row = next_state // self.ncol
                    col = next_state - row * self.ncol
                    # self.desc is a numpy array of bytes, hence the b'...' literals
                    tile_type = self.desc[row, col]
                    if tile_type == b'H':
                        reward = -5   # falling into a hole now costs -5
                    elif tile_type == b'G':
                        reward = 10   # reaching the goal now gives +10
                    else:
                        reward = 0

                    my_transitions.append((prob, next_state, reward, is_terminal))
                # overwrite the original transitions with the modified rewards
                self.P[state][action] = my_transitions

Then, in my Jupyter notebook, I can register my new environment under the name Stochastic-8x8-CustomizedFrozenLake-v0 and load it using:

from gym.envs.registration import register

register(
    id='Stochastic-8x8-CustomizedFrozenLake-v0',
    entry_point='my_env:CustomizedFrozenLake',
    kwargs={'map_name': '8x8', 'is_slippery': True})

env = gym.make('Stochastic-8x8-CustomizedFrozenLake-v0')

To check that the reward was actually modified, I can display env.env.P[18][1], for example. It outputs:

[(0.3333333333333333, 17, 0, False),
 (0.3333333333333333, 26, 0, False),
 (0.3333333333333333, 19, -5, True)]

We can see that the reward we get for reaching state $19$ (which is a Hole state in the $8 \times 8$ grid) is now $-5$.
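Similarly, to check that the Goal reward was changed as well, a small loop of my own over the transition table finds every transition whose reward is $10$:

# scan the whole transition table for transitions that lead to the Goal (reward 10)
for s in range(env.env.nS):
    for a in range(env.env.nA):
        for prob, next_s, reward, done in env.env.P[s][a]:
            if reward == 10:
                print(s, a, next_s, reward, done)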

Conclusion

In this article, I presented OpenAI Gym, a package that you can use to load various environments to test your agents. I detailed some of the most important functions you should know to be ready to train an agent. To do so, I focused on the FrozenLake-v0 environment. Finally, we have also seen how we can customize our environments.
In the next article, we will load the FrozenLake-v0 environment and implement 2 algorithms to train our agent to reach the Goal state.