Deriving the Bellman equation for value and Q functions
Now let us see how to derive the Bellman equations for the value and Q functions.
You can skip this section if you are not interested in the mathematics; that said, the derivation is quite elegant.
First, we define $P_{ss'}^{a}$ as the transition probability of moving from state $s$ to $s'$ while performing an action $a$:

$$P_{ss'}^{a} = \Pr\left(s_{t+1} = s' \mid s_t = s, a_t = a\right)$$

We define $R_{ss'}^{a}$ as the expected reward received by moving from state $s$ to $s'$ while performing an action $a$:

$$R_{ss'}^{a} = \mathbb{E}\left(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right) \tag{5}$$

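It can help to see these two quantities concretely. The following sketch stores $P_{ss'}^{a}$ and $R_{ss'}^{a}$ as NumPy arrays for a hypothetical 2-state, 2-action MDP; all of the numbers are purely illustrative, not from the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used purely for illustration.
# P[a, s, s2] = probability of moving from state s to state s2 under action a.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.0, 1.0]],
])

# R[a, s, s2] = expected reward received on the transition (s, a) -> s2.
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.0, 1.0],
     [0.0, 0.0]],
])

# Sanity check: each row of P[a] is a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```

Indexing by `[a, s, s2]` makes the sums over actions and next states in the derivation below read off directly from the arrays.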
We know that the value function can be represented as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] \tag{6}$$

We can rewrite the value function by taking the first reward out:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right]$$

The expectation in the value function specifies the expected return if we are in the state $s$, performing actions according to the policy $\pi$.
So, we can rewrite the expectation of the first reward explicitly by summing over all possible actions and next states as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\,\mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right]$$

In the RHS, we will substitute $R_{ss'}^{a}$ from equation (5) as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}R_{ss'}^{a}$$

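As a numerical sanity check on this identity, the following sketch (same hypothetical 2-state, 2-action MDP as in the earlier sketch; illustrative numbers only) compares the closed-form expected immediate reward against a Monte Carlo estimate obtained by sampling actions and transitions:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s2]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])   # R[a, s, s2]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a]: uniform random policy

# Closed form: E_pi[r_{t+1} | s_t = s] = sum_a pi(s,a) sum_s2 P[a,s,s2] R[a,s,s2]
expected_r = np.einsum('sa,ast,ast->s', pi, P, R)

# Monte Carlo estimate for state 0: sample an action from pi, then a next
# state from P, and record the expected reward of that transition.
rng = np.random.default_rng(0)
s = 0
rewards = []
for _ in range(100_000):
    a = rng.choice(2, p=pi[s])
    s2 = rng.choice(2, p=P[a, s])
    rewards.append(R[a, s, s2])

# The sample mean should be close to the closed-form value for state 0.
assert abs(np.mean(rewards) - expected_r[0]) < 0.02
```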
Similarly, we expand the expectation of the remaining discounted return $\gamma \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2}$ (the tail of the return $R_t$ defined in equation (2)) by conditioning on the next state $s'$ as follows:

$$\mathbb{E}_{\pi}\left[\gamma \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\,\gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]$$

So, our final expectation equation becomes:

$$\mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right)\right] \tag{7}$$

Now we will substitute our expectation (7) into the value function (6) as follows:

$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right)\right]$$

By equation (6), the expectation $\mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right)$ is simply the value of the next state, $V^{\pi}(s')$. Substituting it, our final value function looks like the following:

$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$

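The Bellman equation for the value function gives an immediate way to evaluate a policy numerically: start from $V = 0$ and repeatedly apply the right-hand side as an update until it stops changing. The following sketch (same hypothetical 2-state, 2-action MDP as in the earlier sketches; illustrative numbers only) implements this iterative policy evaluation:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s2]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])   # R[a, s, s2]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a]: uniform random policy
gamma = 0.9                                # discount factor

def bellman_backup(V):
    """One application of the right-hand side of the Bellman equation:
    V(s) = sum_a pi(s,a) sum_s2 P[a,s,s2] * (R[a,s,s2] + gamma * V(s2))."""
    return np.einsum('sa,ast,ast->s', pi, P, R + gamma * V)

V = np.zeros(2)
for _ in range(500):
    V = bellman_backup(V)

# At convergence, V is a fixed point of the Bellman backup.
assert np.allclose(V, bellman_backup(V))
```

Because the backup is a contraction (each sweep shrinks the error by a factor of $\gamma$), the iteration converges to $V^{\pi}$ regardless of the starting point.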
In a very similar fashion, we can derive the Bellman equation for the Q function; the final equation is as follows:

$$Q^{\pi}(s,a) = \sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma \sum_{a'}\pi(s',a')\,Q^{\pi}(s',a')\right]$$

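The Q-function equation can be evaluated the same way. The following sketch (same hypothetical MDP and illustrative numbers as before) iterates the Q backup and then recovers the state values via the consistency relation $V^{\pi}(s) = \sum_a \pi(s,a)\,Q^{\pi}(s,a)$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s2]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])   # R[a, s, s2]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a]: uniform random policy
gamma = 0.9                                # discount factor

def q_backup(Q):
    """One application of the Bellman equation for Q:
    Q(s,a) = sum_s2 P[a,s,s2] * (R[a,s,s2] + gamma * sum_a2 pi(s2,a2) Q(s2,a2))."""
    next_v = (pi * Q).sum(axis=1)          # value of each next state under pi
    return np.einsum('ast,ast->sa', P, R + gamma * next_v)

Q = np.zeros((2, 2))
for _ in range(500):
    Q = q_backup(Q)

# Averaging Q over the policy's action probabilities recovers V^pi.
V = (pi * Q).sum(axis=1)
assert np.allclose(Q, q_backup(Q))
```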
Now that we have Bellman equations for both the value and Q functions, we will see how to find the optimal policies.