Deriving the Bellman equation for value and Q functions
Now let us see how to derive the Bellman equations for the value and Q functions.
You can skip this section if you are not interested in the mathematics; that said, the derivation is quite elegant.
First, we define $P_{ss'}^{a}$ as the transition probability of moving from state $s$ to $s'$ while performing an action $a$:

$$P_{ss'}^{a} = \Pr\left(s_{t+1} = s' \mid s_t = s, a_t = a\right)$$

We define $R_{ss'}^{a}$ as the expected reward received by moving from state $s$ to $s'$ while performing an action $a$:

$$R_{ss'}^{a} = \mathbb{E}\left(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right) \tag{5}$$

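It can help to see these two quantities concretely. The following sketch stores $P_{ss'}^{a}$ and $R_{ss'}^{a}$ as NumPy arrays for a hypothetical 2-state, 2-action MDP; all of the numbers are purely illustrative, not from the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used purely for illustration.
# P[a, s, s2] = probability of moving from state s to state s2 under action a.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.0, 1.0]],
])

# R[a, s, s2] = expected reward received on the transition (s, a) -> s2.
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.0, 1.0],
     [0.0, 0.0]],
])

# Sanity check: each row of P[a] is a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```

Indexing by `[a, s, s2]` makes the sums over actions and next states in the derivation below read off directly from the arrays.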
We know that the value function can be represented as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] \tag{6}$$

We can rewrite the value function by taking the first reward out:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right]$$

The expectation in the value function specifies the expected return if we are in the state $s$, performing actions according to the policy $\pi$.
So, we can rewrite the expectation of the first reward explicitly by summing over all possible actions and next states as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\,\mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right]$$

In the RHS, we will substitute $R_{ss'}^{a}$ from equation (5) as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}R_{ss'}^{a}$$

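As a numerical sanity check on this identity, the following sketch (same hypothetical 2-state, 2-action MDP as in the earlier sketch; illustrative numbers only) compares the closed-form expected immediate reward against a Monte Carlo estimate obtained by sampling actions and transitions:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s2]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])   # R[a, s, s2]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a]: uniform random policy

# Closed form: E_pi[r_{t+1} | s_t = s] = sum_a pi(s,a) sum_s2 P[a,s,s2] R[a,s,s2]
expected_r = np.einsum('sa,ast,ast->s', pi, P, R)

# Monte Carlo estimate for state 0: sample an action from pi, then a next
# state from P, and record the expected reward of that transition.
rng = np.random.default_rng(0)
s = 0
rewards = []
for _ in range(100_000):
    a = rng.choice(2, p=pi[s])
    s2 = rng.choice(2, p=P[a, s])
    rewards.append(R[a, s, s2])

# The sample mean should be close to the closed-form value for state 0.
assert abs(np.mean(rewards) - expected_r[0]) < 0.02
```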
Similarly, we expand the expectation of the remaining discounted return $\gamma \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2}$ (the tail of the return $R_t$ defined in equation (2)) by conditioning on the next state $s'$ as follows:

$$\mathbb{E}_{\pi}\left[\gamma \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\,\gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]$$

So, our final expectation equation becomes:

$$\mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right)\right] \tag{7}$$

Now we will substitute our expectation (7) into the value function (6) as follows:

$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right)\right]$$

By equation (6), the expectation $\mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right)$ is simply the value of the next state, $V^{\pi}(s')$. Substituting it, our final value function looks like the following:

$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$

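The Bellman equation for the value function gives an immediate way to evaluate a policy numerically: start from $V = 0$ and repeatedly apply the right-hand side as an update until it stops changing. The following sketch (same hypothetical 2-state, 2-action MDP as in the earlier sketches; illustrative numbers only) implements this iterative policy evaluation:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s2]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])   # R[a, s, s2]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a]: uniform random policy
gamma = 0.9                                # discount factor

def bellman_backup(V):
    """One application of the right-hand side of the Bellman equation:
    V(s) = sum_a pi(s,a) sum_s2 P[a,s,s2] * (R[a,s,s2] + gamma * V(s2))."""
    return np.einsum('sa,ast,ast->s', pi, P, R + gamma * V)

V = np.zeros(2)
for _ in range(500):
    V = bellman_backup(V)

# At convergence, V is a fixed point of the Bellman backup.
assert np.allclose(V, bellman_backup(V))
```

Because the backup is a contraction (each sweep shrinks the error by a factor of $\gamma$), the iteration converges to $V^{\pi}$ regardless of the starting point.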
In a very similar fashion, we can derive the Bellman equation for the Q function; the final equation is as follows:

$$Q^{\pi}(s,a) = \sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma \sum_{a'}\pi(s',a')\,Q^{\pi}(s',a')\right]$$

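The Q-function equation can be evaluated the same way. The following sketch (same hypothetical MDP and illustrative numbers as before) iterates the Q backup and then recovers the state values via the consistency relation $V^{\pi}(s) = \sum_a \pi(s,a)\,Q^{\pi}(s,a)$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s2]
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])   # R[a, s, s2]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # pi[s, a]: uniform random policy
gamma = 0.9                                # discount factor

def q_backup(Q):
    """One application of the Bellman equation for Q:
    Q(s,a) = sum_s2 P[a,s,s2] * (R[a,s,s2] + gamma * sum_a2 pi(s2,a2) Q(s2,a2))."""
    next_v = (pi * Q).sum(axis=1)          # value of each next state under pi
    return np.einsum('ast,ast->sa', P, R + gamma * next_v)

Q = np.zeros((2, 2))
for _ in range(500):
    Q = q_backup(Q)

# Averaging Q over the policy's action probabilities recovers V^pi.
V = (pi * Q).sum(axis=1)
assert np.allclose(Q, q_backup(Q))
```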
Now that we have Bellman equations for both the value and Q functions, we will see how to find the optimal policies.