Reinforcement Learning Exercises
In this article I present solutions to a selection of reinforcement learning exercises taken from the book “Artificial Intelligence: A Modern Approach” (3rd edition).
(Caption of the book's Figure 17.1) (a) A simple 4×3 environment that presents the agent with a sequential decision problem. (b) Illustration of the transition model of the environment: the “intended” outcome occurs with probability 0.8, but with probability 0.2 the agent moves at right angles to the intended direction. A collision with a wall results in no movement. The two terminal states have reward +1 and -1, respectively, and all other states have a reward of -0.04.
Answer Well, this exercise doesn’t seem difficult, but we surely need quite some time to compute all the sequences with their respective probabilities. First of all, we are asked to compute the probability of being in each square at the end of the sequence [Up, Up, Right, Right, Right]. Hence we don’t need the Bellman equation here, as we only want to compute probabilities and not utilities.
Obviously, we can do something a little smarter than just enumerating all the possible paths. Indeed, we can compute the probability of each square at each time step and reuse these probabilities to compute the probabilities at the next time step.
For example, we compute the probability of reaching each position after one Up movement and then we reuse these probabilities (first column of Figure 17.1.2) to compute the probabilities at the second time step (second column of Figure 17.1.2):
So, for example, to compute 0.24 in position (1,2), we reuse the probabilities of the first column and multiply them by the transition probabilities given in Figure 17.1.
\[P((1,2) \text{ in step 2}) = P((1,1) \text{ in step 1}) \times P((1,2)|(1,1) \text{ in step 1}) + P((1,2) \text{ in step 1}) \times P((1,2)|(1,2) \text{ in step 1}) + P((2,1) \text{ in step 1}) \times P((1,2)|(2,1) \text{ in step 1}) \\ = 0.1 \times 0.8 + 0.8 \times (0.1 + 0.1) + 0.1 \times 0 = 0.24\]So finally we can come up with the full table, with the occupancy probabilities we were asked for in the last column:
The probabilities in the last column are the answer to the question.
This computation is related to the prediction task for an HMM in the sense that we only need the probabilities of the previous positions to compute the probabilities of the next positions.
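To make the forward propagation concrete, here is a minimal Python sketch (my own, not from the book) that reproduces the computation. The 4×3 layout of Figure 17.1 is hard-coded, and I assume that any probability mass entering a terminal state simply stays there:

```python
# Occupancy probabilities in the 4x3 world after [Up, Up, Right, Right, Right].
# Coordinates are (column, row); (2, 2) is the obstacle; (4, 3) and (4, 2) are terminal.
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3), (4, 2)}
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                 'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def move(state, direction):
    """Deterministic move; bumping into a wall or the obstacle means staying put."""
    dc, dr = MOVES[direction]
    target = (state[0] + dc, state[1] + dr)
    return target if target in STATES else state

def step(dist, action):
    """One forward step: intended direction w.p. 0.8, each perpendicular one w.p. 0.1."""
    new_dist = {s: 0.0 for s in STATES}
    for s, p in dist.items():
        if s in TERMINALS:
            new_dist[s] += p            # assumption: terminal states absorb their mass
            continue
        for d, q in [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]:
            new_dist[move(s, d)] += p * q
    return new_dist

dist = {s: (1.0 if s == (1, 1) else 0.0) for s in STATES}   # the agent starts in (1, 1)
for a in ['Up', 'Up', 'Right', 'Right', 'Right']:
    dist = step(dist, a)
for s in sorted(dist):
    print(s, round(dist[s], 4))
```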
In the squares where there are 4 arrows, the agent can decide to go in any of these directions.
Answer According to Figure 17.2.1, we can take whatever policy we want for the squares (1,1), (2,1), (3,1), (1,2), (1,3) and (2,3). For example, we can choose to always go Right there. We must note that, again, if we choose to go Right, the agent will go Right with probability 0.8 and it will go up or down with probability 0.1 each. For the remaining squares the policy must steer away from the terminal states; the matrix below corresponds to Down in (4,1) and Left in (3,2) and (3,3). Having said that, our transition matrix looks like:
So we have to solve the system:
\[\pi \begin{bmatrix} 0.1 & 0.8 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0\\ 0 & 0.2 & 0.8 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0.1 & 0.8 & 0 & 0.1 & 0 & 0 & 0\\ 0 & 0 & 0.1 & 0.9 & 0 & 0 & 0 & 0 & 0\\ 0.1 & 0 & 0 & 0 & 0.8 & 0 & 0.1 & 0 & 0\\ 0 & 0 & 0.1 & 0 & 0 & 0.8 & 0 & 0 & 0.1\\ 0 & 0 & 0 & 0 & 0.1 & 0 & 0.1 & 0.8 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.2 & 0.8\\ 0 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0.8 & 0.1\\ \end{bmatrix} = \pi\]with:
\(\pi = \begin{bmatrix} \pi_{11} & \pi_{21} & \pi_{31} & \pi_{41} & \pi_{12} & \pi_{32} & \pi_{13} & \pi_{23} & \pi_{33}\\ \end{bmatrix}\) and \(\sum\limits_{i,j} \pi_{ij} = 1\)
I leave it to you to solve this system; a small numerical sketch follows.
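For completeness, here is a quick NumPy sketch (my own) that solves $\pi P = \pi$ with $\sum_{i,j}\pi_{ij}=1$ for the matrix above, using the same state ordering as $\pi$:

```python
# Stationary distribution of the 9x9 chain above.
# State ordering: (1,1), (2,1), (3,1), (4,1), (1,2), (3,2), (1,3), (2,3), (3,3).
import numpy as np

P = np.array([
    [0.1, 0.8, 0,   0,   0.1, 0,   0,   0,   0  ],
    [0,   0.2, 0.8, 0,   0,   0,   0,   0,   0  ],
    [0,   0,   0.1, 0.8, 0,   0.1, 0,   0,   0  ],
    [0,   0,   0.1, 0.9, 0,   0,   0,   0,   0  ],
    [0.1, 0,   0,   0,   0.8, 0,   0.1, 0,   0  ],
    [0,   0,   0.1, 0,   0,   0.8, 0,   0,   0.1],
    [0,   0,   0,   0,   0.1, 0,   0.1, 0.8, 0  ],
    [0,   0,   0,   0,   0,   0,   0,   0.2, 0.8],
    [0,   0,   0,   0,   0,   0.1, 0,   0.8, 0.1],
])

# Stack the stationarity equations (P^T - I) pi = 0 with the normalization constraint.
A = np.vstack([P.T - np.eye(9), np.ones(9)])
b = np.concatenate([np.zeros(9), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(pi, 4))
```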
Answer To understand the problem, let’s write it mathematically. The utility function can be written:
\[U(s_0, a_0, s_1, ..., a_{n-1}, s_n) = \max\limits_{i=0}^{n-1} R(s_i, a_i, s_{i+1})\]A utility function yields stationary preferences if, whenever it ranks two sequences $[s_1, s_2, ...]$ and $[s_1', s_2', ...]$ in a given way, it ranks the tail sequences $[s_2, s_3, ...]$ and $[s_2', s_3', ...]$ the same way.
Obviously, if we take the sequences $[2, 1, 0, 0, ...]$ and $[2, 0, 0, 0, ...]$ then the utility function returns the same value for both: 2. But on the (next) sequences $[1, 0, 0, ...]$ and $[0, 0, 0, ...]$ it does not return the same value (1 versus 0), so this utility function does not result in stationary preferences between state sequences.
We can, nonetheless, still define $U^{\pi}(s)$ as the expected maximum reward obtained by using the policy $\pi$ starting in state $s$.
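A two-line check of the counterexample above (my own sketch):

```python
# The max-of-rewards utility ranks the two sequences equally, but ranks their tails
# differently, so preferences are not stationary.
seq_a, seq_b = [2, 1, 0, 0], [2, 0, 0, 0]
print(max(seq_a) == max(seq_b))          # True: same utility for the full sequences
print(max(seq_a[1:]) == max(seq_b[1:]))  # False: the tails get different utilities
```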
- a. Write the Bellman equations for these formulations
- b. Show how an MDP with reward function $R(s, a, s')$ can be transformed into a different MDP with reward function $R(s, a)$, such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
- c. Now do the same to convert MDPs with $R(s, a)$ into MDPs with $R(s)$.
Answer
- a. In the book, the Bellman equation is written as: \[U(s) = R(s) + \gamma \max\limits_{a \in A(s)}\sum\limits_{s'}P(s'|s,a)U(s')\]In this question we are asked to rewrite it using $R(s,a)$ and $R(s,a,s')$. If the reward depends on the action then, since we want to maximize the utility (see the $\max$ in the equation), the reward term has to move inside the maximization over actions, so we can rewrite the utility function as:
\[U(s) = \max\limits_{a \in A(s)}\left[R(s,a) + \gamma\sum\limits_{s'}P(s'|s,a)U(s')\right]\]We are then asked to rewrite it using $R(s, a, s')$. This time the reward depends on the originating state, the action, and the resulting state, so it has to move inside both the max over $a$ and the sum over $s'$:
\[U(s) = \max\limits_{a \in A(s)}\sum\limits_{s'}P(s'|s,a)[R(s,a, s') + \gamma U(s')]\]- b. The idea here is to define, for every $(s, a, s')$, a pre-state $pre(s,a,s')$ such that executing the action $a$ in state $s$ leads to the pre-state $pre(s, a, s')$ with probability $T(s,a,s')$, and from which a single action $b$ always leads to $s'$. Hence we can rewrite $U(s)$ as follows (a small construction sketch in code is given after part (c)):
\[U(s) = \max\limits_{a \in A(s)}\sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma U(s')] \\ = \max\limits_{a \in A(s)}\sum\limits_{s'}T'(s,a,pre(s,a,s'))\,\gamma'\left[R'(pre(s,a,s'),b) + \gamma' U(s')\right] \\ = \max\limits_{a \in A(s)}\left[R'(s,a) + \gamma'\sum\limits_{s'}T'(s,a,pre(s,a,s'))\,U'(pre(s,a,s'))\right]\]The second equality comes from the idea of the pre-state: we expand $U(s)$ by one recursive call so that both the pre-state and the state $s'$ appear. The third equality is the form the utility function must take in the new MDP, where the pre-state value is $U'(pre(s,a,s')) = R'(pre(s,a,s'),b) + \gamma' U(s')$. By comparing these relations, we can see that the equalities are satisfied if we define:
- $R'(s,a) = 0$
- $T'(pre(s,a,s'), b, s') = 1$
- $R'(pre(s,a,s'), b) = \gamma^{-1/2}R(s,a,s')$
- $\gamma' = \gamma^{1/2}$
- $T'(s, a, pre(s,a,s')) = T(s, a, s')$
- c. It’s the same principle: for each $(s,a)$ pair introduce an intermediate state whose state reward carries $\gamma^{-1/2}R(s,a)$, again with $\gamma' = \gamma^{1/2}$. I won’t detail the computation.
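Here is a small construction sketch for the transformation of part (b). It is my own code, with dictionary-based $T$ and $R$ as a hypothetical representation of the MDP, and the toy MDP at the bottom is made up purely for illustration:

```python
import math

def add_pre_states(T, R, gamma):
    """Pre-state construction from part (b): turn an MDP with reward R(s, a, s')
    into an equivalent one with reward R'(s, a) and discount gamma' = sqrt(gamma).
    T and R are dicts keyed by (s, a, s')."""
    g = math.sqrt(gamma)
    T2, R2 = {}, {}
    for (s, a, s2), p in T.items():
        pre = ('pre', s, a, s2)              # one fresh pre-state per (s, a, s') triple
        T2[(s, a, pre)] = p                  # T'(s, a, pre)   = T(s, a, s')
        T2[(pre, 'b', s2)] = 1.0             # T'(pre, b, s')  = 1
        R2[(s, a)] = 0.0                     # R'(s, a)        = 0
        R2[(pre, 'b')] = R[(s, a, s2)] / g   # R'(pre, b)      = gamma^(-1/2) R(s, a, s')
    return T2, R2, g

# Toy usage on a hypothetical 2-state MDP (states 0/1, single action 'go'):
T = {(0, 'go', 1): 1.0, (1, 'go', 0): 1.0}
R = {(0, 'go', 1): 5.0, (1, 'go', 0): -1.0}
T2, R2, g2 = add_pre_states(T, R, 0.9)
print(g2, R2)
```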
- a. Show that, for any functions f and g, $$| \max\limits_{a} f(a) − \max\limits_{a} g(a)| ≤ \max\limits_{a} |f(a) − g(a)|$$
- b. Write out an expression for $|(B U_i − B U_i')(s)|$ and then apply the result from (a) to complete the proof that the Bellman operator is a contraction.
Answer a. Without loss of generality we can assume that $\max\limits_{a} f(a) \geq \max\limits_{a} g(a)$. We can then rewrite:
\[| \max\limits_{a} f(a) − \max\limits_{a} g(a)| = \max\limits_{a} f(a) − \max\limits_{a} g(a)\]Now, let’s say that the maximum of $f$ is attained at $a_1$, or, put differently, $a_1 = \arg\max\limits_{a} f(a)$. We also define $a_2 = \arg\max\limits_{a} g(a)$. With these definitions we can then write:
\[| \max\limits_{a} f(a) − \max\limits_{a} g(a)| = \max\limits_{a} f(a) − \max\limits_{a} g(a) \\ = f(a_1) - g(a_2) \leq f(a_1) - g(a_1)\]because $\forall a \in A,\ g(a_2) \geq g(a)$, so in particular $g(a_1) \leq g(a_2)$. So finally we have:
\[| \max\limits_{a} f(a) − \max\limits_{a} g(a)| \leq f(a_1) - g(a_1) \\ = | f(a_1) - g(a_1) | \leq \max\limits_{a}|f(a) - g(a)|\]b. This question isn’t difficult, we just need to carefully notice that what we want to prove is that:
\[||B U_i - B U_i'|| \leq \gamma ||U_i - U_i'||\]and the question asks us to compute $ | (B U_i − B U_i')(s) | $ first. So let’s compute this quantity. We have:
\[|(B U_i − B U_i')(s)| = \left|R(s) + \gamma \max\limits_{a}\sum\limits_{s'}T(s,a,s')U_i(s') - R(s) - \gamma \max\limits_{a}\sum\limits_{s'}T(s,a,s')U_i'(s')\right| \\ = \gamma \left|\max\limits_{a}\sum\limits_{s'}T(s,a,s')U_i(s') - \max\limits_{a}\sum\limits_{s'}T(s,a,s')U_i'(s')\right| \\ \leq \gamma \max\limits_{a}\left|\sum\limits_{s'}T(s,a,s')U_i(s') - \sum\limits_{s'}T(s,a,s')U_i'(s')\right|\]
- The first equality comes from the definition of the Bellman operator.
- The second equality comes from the fact that $\gamma$ is positive (between 0 and 1).
- The first inequality comes from part (a).
If we let:
\[a^{*} = \arg \max \limits_{a} \left|\sum\limits_{s'}T(s,a,s')U_i(s') - \sum\limits_{s'}T(s,a,s')U_i'(s')\right|\]we can then write (dropping the max operator):
\[|(B U_i − B U_i')(s)| \leq \gamma\left|\sum\limits_{s'}T(s,a^{*},s')U_i(s') - \sum\limits_{s'}T(s,a^{*},s')U_i'(s')\right| \\ = \gamma\left|\sum\limits_{s'}T(s,a^{*},s')(U_i(s') - U_i'(s'))\right|\]Finally, we can now compute the max norm of $ B U_i - B U_i' $ to prove that the Bellman operator is a contraction:
\[||B U_i − B U_i'|| = \max_{s}|(B U_i − B U_i')(s)| \\ \leq \gamma \max_{s} \left|\sum\limits_{s'}T(s,a^{*},s')(U_i(s') - U_i'(s'))\right| \leq \gamma \max_{s'} |U_i(s') − U_i'(s')| = \gamma ||U_i - U_i'||\]where the last inequality comes from the fact that the $T(s, a^{*}, s')$ are probabilities summing to 1, so the sum is a convex combination of the differences $U_i(s') - U_i'(s')$ and its absolute value is bounded by their maximum.
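As a sanity check, here is a small numerical sketch (my own; a random MDP with state-only rewards) illustrating the contraction property on a concrete example:

```python
# Apply the Bellman backup B to two arbitrary value functions on a random MDP and
# verify that their max-norm distance shrinks by at least a factor of gamma.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random transition model T[s, a, s'] (rows normalized) and rewards R[s].
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random(n_states)

def bellman(U):
    """B U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s')."""
    return R + gamma * (T @ U).max(axis=1)

U1, U2 = rng.random(n_states) * 10, rng.random(n_states) * 10
lhs = np.abs(bellman(U1) - bellman(U2)).max()
rhs = gamma * np.abs(U1 - U2).max()
print(lhs <= rhs + 1e-12)   # True: ||B U1 - B U2|| <= gamma * ||U1 - U2||
```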
- a. Let $U_A(s)$ be the utility of state $s$ when it is $A$’s turn to move in s, and let $U_B(s)$ be the utility of state $s$ when it is $B$’s turn to move in s. All rewards and utilities are calculated from $A$’s point of view (just as in a minimax game tree). Write down Bellman equations defining $U_A$(s) and $U_B$(s).
- b. Explain how to do two-player value iteration with these equations, and define a suitable termination criterion.
- c. Consider the game described in Figure 17.7.1. Draw the state space (rather than the game tree), showing the moves by $A$ as solid lines and moves by $B$ as dashed lines. Mark each state with $R(s)$. You will find it helpful to arrange the states $(sA, sB)$ on a two-dimensional grid, using $sA$ and $sB$ as “coordinates.”
- d. Now apply two-player value iteration to solve this game, and derive the optimal policy.
Answer
- a. When it is $A$’s turn to move in $s$, $A$ reaches a new state $s'$ from $s$, and in this new state $s'$ it is $B$’s turn to move. Since everything is measured from $A$’s point of view, $A$ maximizes, so the utility function is written as: \[U_A(s) = R(s) + \max_{a}\sum\limits_{s'}T(s,a,s')U_B(s')\]As we want the utility $U_B$ from $A$’s point of view too, $A$ has to take into account that $B$ will want to minimize $A$’s utility. So we have: \[U_B(s) = R(s) + \min_{a}\sum\limits_{s'}T(s,a,s')U_A(s')\]
- b. To do two-player value iteration we simply apply the two Bellman updates above alternately, once for each player. The process terminates when two successive utility estimates (for the same player) are equal, or within a certain (fixed) epsilon.
- c, d. To solve these questions we need to iteratively apply the value iteration algorithm starting from the four final states: (4,3), (4,2) and (2,1), (3,1). Note that the state (4,1) is not a final state since, if A reaches 4, the game ends and B cannot then reach 1 (and vice versa). I won’t try to solve this by hand; I might update this exercise in the future to sketch out the first few steps.
Implement value iteration for this world for each value of $r$ below. Use discounted rewards with a discount factor of 0.99. Show the policy obtained in each case. Explain intuitively why the value of $r$ leads to each policy.
- a. $r = 100$
- b. $r = -3$
- c. $r = 0$
- d. $r= +3$
Answer It would be too cumbersome to run the value iteration algorithm on all 4 cases by hand, and the question asks to implement value iteration, so we should really write a program to solve this question (a hedged sketch is given at the end of this answer). Still, we can reason about the policy obtained in each case. I drew a figure with the different policies for the cases a, b, c, d:
a. If the reward in the red square is 100, the agent will want to stay in this square forever and hence avoid reaching the terminal state (in gray). As we are dealing with a stochastic environment (we go in the intended direction with probability 0.8 and in each perpendicular direction with probability 0.1), the arrows around the terminal gray state need to point away from it to avoid entering it accidentally.
b. If the reward in the red square is -3 then, as the reward of the white squares is -1 and the reward of the final square is +10, the agent will want to avoid the red square and reach the gray square as fast as possible. However, we don’t have a down arrow in (1,2), because a down arrow there would make the agent take a detour that can cost more than the -3 points of the red square.
c. Here the reward of the red square is 0, so, as the rewards of the white squares are -1, the agent will want to go through the red square on its way to the final gray square. It won’t want to stay in the red square, as the final square offers a +10 reward. That explains the direction of the arrows.
d. Here r = +3, so the agent will want to stay in the red square indefinitely (same explanation as in a).
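For reference, here is a hedged value-iteration sketch (my own code). The layout is my reading of the book's figure and should be checked against it: reward $r$ in the top-left square (1,3), a +10 terminal in the top-right square (3,3), and -1 everywhere else, with the usual 0.8/0.1/0.1 dynamics:

```python
GAMMA = 0.99

def value_iteration(r, eps=1e-6):
    states = [(c, row) for c in range(1, 4) for row in range(1, 4)]
    terminal = (3, 3)                      # assumed +10 terminal, top-right corner
    reward = {s: -1.0 for s in states}
    reward[(1, 3)] = r                     # assumed position of the square with reward r
    reward[terminal] = 10.0
    moves = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
    perp = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
            'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

    def go(s, d):
        t = (s[0] + moves[d][0], s[1] + moves[d][1])
        return t if t in states else s     # bumping into a wall means staying put

    def q(s, a, U):
        """Expected next-state utility of taking a in s (0.8 intended, 0.1 each side)."""
        return sum(p * U[go(s, d)]
                   for d, p in [(a, 0.8), (perp[a][0], 0.1), (perp[a][1], 0.1)])

    U = {s: 0.0 for s in states}
    while True:
        U_new = {s: reward[s] if s == terminal
                 else reward[s] + GAMMA * max(q(s, a, U) for a in moves)
                 for s in states}
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            policy = {s: max(moves, key=lambda a: q(s, a, U_new))
                      for s in states if s != terminal}
            return U_new, policy
        U = U_new

for r in (100, -3, 0, 3):
    U, policy = value_iteration(r)
    print(r, policy)
```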
Answer: This exercise is quite straightforward. We need to apply the Bellman equation in the two different situations. Let’s assume first that we want to go UP. We have:
\[U(s_{12}) = 0 + \gamma \left(\max_{a}\sum\limits_{s'} P(s'|s_{12},a)\, U(s')\right)\]If the agent goes Up it reaches the state $s_{13}$ with probability 1, so
\[\sum\limits_{s'} P(s'|s_{12},\text{Up})\, U(s') = 1 \times U(s_{13})\]And finally we have:
\[U_{up}(s) = \gamma U(s_{13}) = \gamma(50 + \gamma \sum\limits_{s'} P(s' | s_{13},a) U(s')) \\ = \gamma (50 + \gamma U(s_{23})) = 50 \gamma + \gamma^{2} U(s_{23}) \\ = 50 \gamma + \gamma^{2} (-1 + \gamma U(s_{33})) \\ = 50 \gamma - \gamma^{2} + \gamma^{3} (-1 + \gamma U(s_{43})) \\ = 50 \gamma - \gamma^{2} - \gamma^{3} + \gamma^{4} U(s_{43}) \\ = \text{...} \\ = 50 \gamma - \sum\limits_{i = 2}^{101} \gamma^{i} \\ = 50 \gamma - \gamma^{2} \sum\limits_{i=0}^{99} \gamma^{i} \\ = 50 \gamma - \gamma^{2} \frac{(1-\gamma^{100})}{1-\gamma}\]The last equality is the closed form of the finite geometric series (valid for $\gamma < 1$).
We use the Bellman equation to compute the utility if the agent goes DOWN. We obtain: \(U_{down}(s) = -50 \gamma + \gamma^{2} \frac{(1-\gamma^{100})}{1-\gamma}\)
We then need to solve the equation (with a computer):
\[50 \gamma - \gamma^{2} \frac{(1-\gamma^{100})}{1-\gamma} = -50 \gamma + \gamma^{2} \frac{(1-\gamma^{100})}{1-\gamma}\]to find the value of $\gamma$ at which the two actions are equally good.
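A quick numerical sketch (simple bisection, my own code) for the indifference point, using the closed forms derived above:

```python
# Find the gamma at which U_up(gamma) = U_down(gamma).
def diff(g):
    """U_up(gamma) - U_down(gamma) = 100*gamma - 2*gamma**2*(1 - gamma**100)/(1 - gamma)."""
    return 100 * g - 2 * g**2 * (1 - g**100) / (1 - g)

lo, hi = 0.5, 0.999            # diff(lo) > 0 and diff(hi) < 0, so a root lies in between
for _ in range(60):
    mid = (lo + hi) / 2
    if diff(mid) > 0:
        lo = mid
    else:
        hi = mid
print(round((lo + hi) / 2, 4))  # the indifference point; above it, going Down is preferred
```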
- In state 1, action $a$ moves the agent to state 2 with probability 0.8 and makes the agent stay put with probability 0.2.
- In state 2, action $a$ moves the agent to state 1 with probability 0.8 and makes the agent stay put with probability 0.2.
- In either state 1 or state 2, action $b$ moves the agent to state 3 with probability 0.1 and makes the agent stay put with probability 0.9.
- a. What can be determined qualitatively about the optimal policy in states 1 and 2?
- b. Apply policy iteration, showing each step in full, to determine the optimal policy and the values of states 1 and 2. Assume that the initial policy has action $b$ in both states.
- c. What happens to policy iteration if the initial policy has action $a$ in both states? Does discounting help? Does the optimal policy depend on the discount factor?
Answer a. If the agent is in state 1 it should do action $b$ to reach the terminal state (state 3), which has reward 0. If the agent is in state 2, it might prefer to do action $a$ in order to reach state 1 and then action $b$ from state 1 to reach the terminal state. Indeed, if the agent does action $b$ in state 2, it has a 0.1 chance of reaching state 3 and a 0.9 chance of staying in state 2 at a cost of -2 per step, whereas from state 1 each failed attempt to reach state 3 only costs -1. So there is a trade-off to compute.
b. We apply policy iteration to find out what is the best policy in each state. Initialization:
- $U = (u_1, u_2, u_3) = (-1, -2, 0)$
- $\pi = (b, b)$ (initial policy: action $b$ in both state 1 and state 2)
- $\gamma = 1$
Policy evaluation:
- $u_1 = -1 + \gamma \sum\limits_{s'} P(s'|1,\pi(1))\, u(s') = -1 + 0.1 u_3 + 0.9 u_1$
- $u_2 = -2 + 0.1 u_3 + 0.9 u_2$
- $u_3 = 0$
So $u_1 = -10$, $u_2 = -20$
policy update:
- state 1:
- action a: $\sum\limits_{i} T(1,a,i) u_i = 0.8 \times -20 + 0.2 \times -10 = -18$
- action b: $\sum\limits_{i} T(1,b,i) u_i = 0.1 \times 0 + 0.9 \times -10 = -9$. Since $-9 \geq -18$, $\pi = b$ in state 1.
- state 2:
- action a: $\sum\limits_{i} T(2,a,i) u_i = 0.8 \times -10 + 0.2 \times -20 = -12$
- action b: $\sum\limits_{i} T(2,b,i) u_i = 0.1 \times 0 + 0.9 \times -20 = -18$. Since $-12 \geq -18$, $\pi = a$ in state 2.
As the action has changed for state 2, we need to continue the policy iteration algorithm.
- $u_1 = -1 + 0.1 u_3 + 0.9 u_1$
- $u_2 = -2 + 0.8 u_1 + 0.2 u_2$
- $u_3 = 0$
- so $u_1 = -10$ and $u_2 = -12.5$
policy update:
- state 1:
- action a: $0.8 \times -12.5 + 0.2 \times -10 = -12$
- action b: $0.1 \times 0 + 0.9 \times -10 = -9$. Since $-9 \geq -12$, $\pi = b$ in state 1 (hasn’t changed!)
- state 2:
- action a: $0.8 \times -10 + 0.2 \times -12.5 = -10.5$
- action b: $0.1 \times 0 + 0.9 \times -12.5 = -11.25$. Since $-10.5 \geq -11.25$, $\pi = a$ in state 2 (hasn’t changed!)
As the action for neither state has changed, we stop the policy iteration here.
So finally, if we are in state 1 we choose action $b$ and if we are in state 2 we choose action $a$. This result matches the analysis of part a).
c. If the initial policy has action $a$ in both states and $\gamma = 1$, the policy-evaluation equations have no solution; we only need to write them down to see that they are inconsistent:
- $u_1 = -1 + 0.2 u_1 + 0.8 u_2$
- $u_2 = -2 + 0.8 u_1 + 0.2 u_2$
- $u_3 = 0$
The first equation gives $u_1 = u_2 - 1.25$ and the second gives $u_2 = u_1 - 2.5$, which is a contradiction: under this policy the agent never reaches the terminal state, so the undiscounted utilities diverge.
Discounting does help: it bounds the penalty. Intuitively, if $\gamma$ is near 0 then the costs incurred in the distant future are negligible, and with any $\gamma < 1$ the policy-evaluation equations above become solvable. Hence the action the agent chooses can depend on the choice of the value of $\gamma$.
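To double-check the numbers of part (b), here is a small policy-iteration sketch (my own code) for this three-state MDP:

```python
# Policy iteration on the three-state MDP of this exercise.
# State 3 is terminal; rewards are -1, -2, 0; gamma = 1 as in the text.
import numpy as np

STATES = [1, 2, 3]
ACTIONS = ['a', 'b']
R = {1: -1.0, 2: -2.0, 3: 0.0}
GAMMA = 1.0

def T(s, a):
    """Transition distribution over successor states for (s, a)."""
    if s == 3:
        return {3: 1.0}                      # terminal: absorbing with reward 0
    if a == 'a':
        other = 2 if s == 1 else 1
        return {other: 0.8, s: 0.2}
    return {3: 0.1, s: 0.9}                  # action b

def evaluate(policy):
    """Solve the linear system u = R + gamma * T_pi u for the current policy."""
    A = np.eye(3)
    b = np.array([R[s] for s in STATES])
    for i, s in enumerate(STATES):
        if s == 3:
            continue
        for s2, p in T(s, policy[s]).items():
            A[i, STATES.index(s2)] -= GAMMA * p
    return dict(zip(STATES, np.linalg.solve(A, b)))

# Note: with the initial policy (a, a) and gamma = 1, evaluate() would hit a singular
# system, which is exactly the inconsistency discussed in part (c).
policy = {1: 'b', 2: 'b'}
while True:
    u = evaluate(policy)
    new_policy = {s: max(ACTIONS, key=lambda a: sum(p * u[s2] for s2, p in T(s, a).items()))
                  for s in (1, 2)}
    print(u, new_policy)
    if new_policy == policy:
        break
    policy = new_policy
```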
Answer I think the question is a bit unclear, even with the book at hand. So we will make the assumption that the sensor measures the number of adjacent walls, which happens to be 2 in all the squares except those of the third column (where it is 1). Moreover, we will assume that this is a noisy sensor: it returns the true value (i.e. the exact number of adjacent walls) with probability 0.9 and a wrong value with probability 0.1.
As we are dealing with a POMDP and we want the next belief state given the previous belief state, we will use the filtering formula: \[b'(s') = \alpha\, P(o|s') \sum\limits_{s} P(s'|s,a)\, b(s)\]
For example, if we end up in square $(1,1)$ (equivalently, in state $s_{11}$), we must have been in state $(1,1)$, $(1,2)$ or $(2,1)$ before. So, using the formula with these notations, we have:
\[b_1(s_{11}) = \alpha P(o|s_{11}) \sum\limits_{s} P(s_{11}|s,a)b(s) \\ = \alpha P(o|s_{11})\left[P(s_{11}|s_{12},a) \times \frac{1}{9} + P(s_{11}|s_{11},a) \times \frac{1}{9} + P(s_{11}|s_{21},a) \times \frac{1}{9}\right] \\ = \alpha \times 0.1 \times \frac{1}{9} \times [0.1 + 0.9 + 0.8] \\ = 0.02 \alpha\]Similarly, for state $s_{12}$ (square $(1,2)$), we have:
\[b_1(s_{12}) = \alpha P(o|s_{12}) \sum\limits_{s} P(s_{12}|s,a)b(s) \\ = \alpha P(o|s_{12})\left[P(s_{12}|s_{12},a) \times \frac{1}{9} + P(s_{12}|s_{11},a) \times \frac{1}{9} + P(s_{12}|s_{13},a) \times \frac{1}{9}\right] \\ = \alpha \times 0.1 \times \frac{1}{9} \times [0.8 + 0.1 + 0.1] \\ = \frac{0.1}{9} \alpha\]where $\alpha$ is a normalizing constant chosen so that $\sum\limits_{s}b_1(s) = 1$. Hence, to find $\alpha$ we need to solve:
\[b_1(s_{11}) + b_1(s_{12}) + b_1(s_{21}) + ... + b_1(s_{43}) = 1\]i.e.
\[\alpha \times \left[0.02 + \frac{0.1}{9} + ...\right] = 1\]I won’t do all the computations here: what matters is the method, not the arithmetic itself (a short sketch that carries it out is given below).
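Here is a sketch (my own code) that carries the belief update through for all squares, under the assumptions above: a uniform initial belief over the nine nonterminal squares, the action Left (which is consistent with the transition probabilities used above), a wall-count sensor that is right with probability 0.9 and wrong with probability 0.1, and the same sensor rule applied to the terminal squares (one of several possible readings):

```python
# Belief update b1(s') = alpha * P(o | s') * sum_s P(s' | s, a) * b0(s) in the 4x3 world.
GRID = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 2), (4, 3)}
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
        'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def go(s, d):
    t = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return t if t in GRID else s                 # walls and the obstacle block movement

def transition(s, a):
    """P(. | s, a): intended direction w.p. 0.8, each perpendicular one w.p. 0.1."""
    dist = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        dist[go(s, d)] = dist.get(go(s, d), 0.0) + p
    return dist

def sensor(o, s):
    """P(o | s): true wall count is 2 everywhere except the third column (1); the sensor
    reports it with prob. 0.9 and any wrong value with prob. 0.1 (assumption)."""
    true_count = 1 if s[0] == 3 else 2
    return 0.9 if o == true_count else 0.1

def belief_update(b0, action, observation):
    b1 = {s: 0.0 for s in GRID}
    for s, p in b0.items():                      # predict: push the belief through T
        for s2, q in transition(s, action).items():
            b1[s2] += p * q
    for s in GRID:                               # update: weight by the sensor likelihood
        b1[s] *= sensor(observation, s)
    alpha = 1.0 / sum(b1.values())               # normalize
    return {s: alpha * v for s, v in b1.items()}

nonterminal = [s for s in GRID if s not in TERMINALS]
b0 = {s: (1.0 / len(nonterminal) if s in nonterminal else 0.0) for s in GRID}
b1 = belief_update(b0, 'Left', 1)                # assumed: moved Left, observed one wall
for s in sorted(b1):
    print(s, round(b1[s], 3))
```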
Answer In an environment with sensing, the time complexity of a depth-$d$ lookahead search is $O(|A|^d \cdot |E|^d)$, where $|A|$ is the number of actions, $|E|$ the number of possible observations, and $d$ the search depth. In a sensorless environment we don’t have to branch on observations, so the time complexity is simply $O(|A|^d)$.
To do that we need to write down the mathematical definitions of a dominant strategy equilibrium and a Nash equilibrium.
A strategy s for a player p dominates strategy s’ if the outcome for s is better for p than the outcome for s’,
for every choice of strategies by the other player(s).
So, mathematically, a strategy profile $(s_1, \ldots, s_n)$ is a dominant strategy equilibrium if, for every player $p$, every alternative strategy $s_p'$ of $p$, and every combination $s_{-p}$ of strategies of the other players,
\[outcome_p(s_p, s_{-p}) > outcome_p(s_p', s_{-p}).\tag{1}\]In a Nash equilibrium, on the other hand, we only require that each strategy $s_p$ is optimal against the particular combination $s_{-p}^{*}$ actually played by the opponents: for every player $p$ and every alternative strategy $s_p'$,
\[outcome_p(s_p, s_{-p}^{*}) \geq outcome_p(s_p', s_{-p}^{*}).\tag{2}\]So (1) $\Rightarrow$ (2).
Yet, (2) does not necessarily imply (1), as illustrated by the Blu-ray/DVD example in the book.
Answer We will create a table for this game with reward +1 for the winner and -1 for the loser (reward 0 for a draw). The table is straightforward and it is antisymmetric:
To find the mixed-strategy solution to this game we need to find the probabilities of playing Rock (r), Paper (p), Scissors (s), Fire (f) and Water (w), such that $r + p + s + f + w = 1$. To do so, we first focus on what we need to compute if player A plays Rock. We can draw a graph like this:
According to Figure 17.17.2, if A chooses Rock then the expected payoff to B (equivalently, the expected loss of A), given B's mixed strategy $(r, p, s, f, w)$, is: $+1 \times p + (-1) \times s + 1 \times f + (-1) \times w + 0 \times r$. We do the same for all choices of A and we end up with a system of equations:
\[\text{A chooses R:} +p -s +f -w \\ \text{A chooses P:} -r +s +f -w \\ \text{A chooses S:} +r -p +f -w \\ \text{A chooses F:} -r -p -s +w \\ \text{A chooses W:} +r +p +s -f \\\]At the mixed equilibrium, B's probabilities must make A indifferent between its five pure choices, so these five expressions are all equal (and, by the symmetry of the game, equal to 0); together with $r+p+s+f+w=1$ this gives a linear system. Solving it, we find $r=p=s=\frac{1}{9}$ and $f=w=\frac{1}{3}$.
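To verify this, here is a short NumPy sketch (my own) that solves the indifference system directly:

```python
# Choose (r, p, s, f, w) so that the five expressions above are all equal to some common
# value v, plus the normalization constraint: a 6x6 linear system in (r, p, s, f, w, v).
import numpy as np

# Columns: r, p, s, f, w, v  (coefficients taken from the system above).
A = np.array([
    [ 0,  1, -1,  1, -1, -1],   # A chooses R
    [-1,  0,  1,  1, -1, -1],   # A chooses P
    [ 1, -1,  0,  1, -1, -1],   # A chooses S
    [-1, -1, -1,  0,  1, -1],   # A chooses F
    [ 1,  1,  1, -1,  0, -1],   # A chooses W
    [ 1,  1,  1,  1,  1,  0],   # r + p + s + f + w = 1
], dtype=float)
b = np.array([0, 0, 0, 0, 0, 1], dtype=float)
r, p, s, f, w, v = np.linalg.solve(A, b)
print(r, p, s, f, w, v)   # 1/9, 1/9, 1/9, 1/3, 1/3 and v = 0
```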