Bellemare et al. (2017) proposed modelling the full distribution of the random returns received by an RL agent, instead of the typical approach of modelling only their expectation. The typical approach uses the Bellman expectation equation for the state-action value function $Q(s, a)$, which is written as
$$Q(s, a) = \mathbb{E}_{\pi}\left[R(s, a) + \gamma Q(s', a')\right], \tag{2.18}$$
where $R(s, a)$ is the immediate reward received in state $s$ under action $a$, $\gamma$ is the discount factor, $Q(s', a')$ is the state-action value in the next state $s'$ when taking action $a'$, and $\mathbb{E}$ denotes the expectation. The authors remove the expectation from the Bellman equation and consider instead the full distribution of the random variable $Z^{\pi}$:
$$Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(s', a'), \tag{2.19}$$
where $R(s, a)$ is the immediate reward received in state $s$ under action $a$, $\gamma$ is the discount factor, and $Z(s', a')$ is the random return in the next state $s'$ when taking action $a'$. The symbol $\overset{D}{=}$ states that the two sides are equal in distribution. $Z$ is therefore a mapping from state-action pairs to distributions over returns. It is called the value distribution.
Based on this distributional approach, Bellemare et al. (2017) proposed an algorithm that approximates the value distribution with probability masses placed on a discrete support vector $z$ parameterized by the number of atoms $N_{atoms} \in \mathbb{N}^{+}$. The atoms of the discrete distribution may be seen as its "canonical returns".
They are consecutive, evenly spaced values in $z$. The discrete support $z_i$ is given by
$$z_i = V_{min} + i\,\Delta z, \tag{2.20}$$
where $V_{min} \in \mathbb{R}$ is the minimum (starting) value of the support vector $z$, whose values are evenly spaced, and $i$ is the position in the vector. $\Delta z$ is the spacing between consecutive atoms, defined by
$$\Delta z = \frac{V_{max} - V_{min}}{N_{atoms} - 1}, \tag{2.21}$$
where $N_{atoms}$ is the number of atoms, $V_{max} \in \mathbb{R}$ is the maximum value of the discrete support, and $V_{min}$ is as defined for Equation 2.20. In addition, the atom probabilities $p_i(s, a)$ of a distribution $Z^{\pi}(s, a)$ can be computed as
$$p_i(s, a) = \frac{e^{z_i}}{\sum_{j=0}^{N-1} e^{z_j}}, \tag{2.22}$$
where $e$ is Euler's number, $z$ is the support vector that holds the "canonical" values of the distribution, $N$ is the number of atoms, $i$ is the position in $z$ of the atom whose probability is calculated, and $j$ is the summation index running over all atom positions from 0 to $N_{atoms} - 1$. Equation 2.22 is known as the softmax function. It takes the vector $z$ and normalizes it into a probability distribution whose probabilities are proportional to the exponentials of the input values. Figure 2.6 shows an illustrative example of generating the support vector $z$ and calculating the probability of each atom.
Figure 2.6: Example of support vector $z_i$ and its probability distribution
Suppose the number of atoms $N_{atoms}$ is 10, so the support vector $z_i$ contains ten positions. The example sets the vector's boundaries to a minimum (starting) value $V_{min}$ of -5 and a maximum (final) value $V_{max}$ of +5. To fill the remaining positions, the spacing $\Delta z$ of the vector $z$ is first calculated by:
$$\Delta z = \frac{V_{max} - V_{min}}{N - 1} = \frac{5 - (-5)}{10 - 1} \approx 1.11.$$
Adopting $\Delta z = 1.11$, each remaining position is calculated by multiplying $\Delta z$ by the position index and adding the minimum value $V_{min}$, according to Equation 2.20.
To illustrate this computation, the values at positions one and two of the support vector are, respectively,
$$z_1 = -5 + (1 \times 1.11) = -3.89, \qquad z_2 = -5 + (2 \times 1.11) = -2.78.$$
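The support construction of Equations 2.20 and 2.21 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation; the helper name `make_support` and its argument names are hypothetical, and the values are those of the worked example (V_min = -5, V_max = +5, 10 atoms).

```python
import numpy as np

def make_support(v_min, v_max, n_atoms):
    """Return the evenly spaced support z and the atom spacing delta_z.

    Hypothetical helper written for this example only.
    """
    delta_z = (v_max - v_min) / (n_atoms - 1)     # Equation 2.21
    z = v_min + np.arange(n_atoms) * delta_z      # Equation 2.20, i = 0..N-1
    return z, delta_z

z, delta_z = make_support(-5.0, 5.0, 10)
print(round(delta_z, 2))      # spacing between consecutive atoms
print(np.round(z, 2))         # the ten atom values, from -5 to +5
```

Because the code keeps the unrounded spacing 10/9, later atoms can differ from the hand calculation in the second decimal place, where the spacing was first rounded to 1.11.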
To calculate the probability $p_i(s, a)$ of each atom of the vector $z_i$, the softmax function in Equation 2.22 is applied to it. In the numerator, the atom at position $i$ serves as the exponent of Euler's number. The denominator sums the same exponential over all atom positions. To illustrate how the function is applied, the probabilities of the values in positions four and five are given by:
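$$p_4(s, a) = \frac{e^{-0.56}}{\sum_{j=0}^{9} e^{z_j}} = \frac{e^{-0.56}}{221.242} \approx 0.00259 = 0.259\%, \qquad p_5(s, a) = \frac{e^{0.55}}{\sum_{j=0}^{9} e^{z_j}} = \frac{e^{0.55}}{221.242} \approx 0.00788 = 0.788\%.$$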
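The softmax step of this example can be checked numerically with a short NumPy sketch. It follows the text in applying the softmax to the support values themselves. The function name `softmax` and the max-subtraction trick (a common numerical-stability measure that does not change the result) are additions for illustration.

```python
import numpy as np

def softmax(x):
    """Normalise a vector into probabilities proportional to exp(x)."""
    shifted = x - np.max(x)        # stability trick; leaves the result unchanged
    e = np.exp(shifted)
    return e / e.sum()

z = np.linspace(-5.0, 5.0, 10)     # support of the worked example
denominator = np.exp(z).sum()      # sum over j of e^{z_j}
p = softmax(z)                     # Equation 2.22 for every position i

print(round(denominator, 3))       # denominator shared by all atoms
print(round(p[4], 5), round(p[5], 5))
```

The probabilities sum to one by construction, and the denominator agrees with the value 221.242 used in the text.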
The main idea of Categorical DQN is that the return distributions satisfy Equation 2.19.
For a given state $s$ and action $a$, the distribution of returns under the optimal policy $\pi^{*}$ should match the target distribution, defined by taking the distribution for the next state $s'$ and action $a^{*}_{t+1} = \pi^{*}(s_{t+1})$. Figure 2.7 shows the effects of the reward, the discount factor, and the projection step on the return distribution.
Figure 2.7: Operations using Bellman operator on Distribution of returns
In Figure 2.7, $P^{\pi} Z(s, a)$ is the next-state distribution under policy $\pi$, $\gamma$ is the discount factor, $R$ is the immediate reward, $\Phi$ is an L2-projection of the target distribution onto the support vector $z_i$, and $\mathcal{T}$ is the distributional Bellman optimality operator. Multiplying by the discount factor $\gamma$ shrinks the distribution towards 0, so the probability mass is concentrated towards the center, increasing the probability within a narrower range. Adding the reward shifts the distribution along the x-axis. Lastly, the projected Bellman update $\Phi$ is shown in the last panel of the figure. This projection may be used as the target to calculate the loss in the Bellman update. To adapt this variant of the Bellman update to the DQN architecture for a given experience tuple $(s, a, r, s')$, first, the Q-value for the next state $Q(s', a')$ is calculated:
$$Q(s', a') = \sum_{i=0}^{N-1} z_i\, p_i(s', a'). \tag{2.23}$$
That is, the Q-value in the next state $s'$ is the inner product of the support vector $z_i$ with the probability vector $p_i(s', a')$, summed over the atoms from 0 to $N - 1$. The greedy action $a^{*}$ is the action with the highest Q-value. Figure 2.8 shows an illustration of feeding a state into the neural network to retrieve the distribution of the greedy action $a^{*}$.
Figure 2.8: Neural Network architecture in Categorical DQN
Suppose the array $Z(s', a^{*})$ selected in Figure 2.8 is the distribution of the greedy action. Its support is the vector $z_i$, and the Q-value of that distribution is calculated as the inner product of the support vector $z_i$ and the probabilities $p_i(s', a^{*})$, written as:
$$Q(s', a^{*}) = \sum_{i=0}^{9} z_i\, p_i(s', a^{*}) = 0.68.$$
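The Q-value computation of Equation 2.23 and the selection of the greedy action can be sketched as below. The network outputs are stand-ins: `logits` is a hypothetical array of random per-atom scores for three actions, so the resulting numbers do not reproduce the 0.68 of Figure 2.8. Normalising each action's scores with a row-wise softmax is an assumption of this sketch.

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 10)              # support, Equations 2.20 and 2.21

# Hypothetical network outputs for the next state: one row of
# unnormalised scores per action (3 actions x 10 atoms).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 10))

# Row-wise softmax turns each row into atom probabilities p_i(s', a).
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

q_values = probs @ z                        # Equation 2.23 for every action
greedy_action = int(np.argmax(q_values))    # a* = argmax over actions of Q(s', a)
p_star = probs[greedy_action]               # probabilities of Z(s', a*)

print(q_values, greedy_action)
```

Each Q-value is a probability-weighted average of the atoms, so it always lies between V_min and V_max.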
To calculate the loss of the neural network, a new projected support vector $z_j$ is created with evenly spaced values, according to Equations 2.20 and 2.21. The projected support $\hat{\mathcal{T}} z_j$ is computed, as well as three additional vector variables $b_j$, $l$, and $u$, written respectively as:
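$$\hat{\mathcal{T}} z_j = \left[ r + \gamma z_j \right]_{V_{min}}^{V_{max}}, \tag{2.24}$$
$$b_j = \frac{\hat{\mathcal{T}} z_j - V_{min}}{\Delta z}, \tag{2.25}$$
$$l = \lfloor b_j \rfloor, \tag{2.26}$$
$$u = \lceil b_j \rceil. \tag{2.27}$$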
The variable $b_j$ contains the real-valued index position of each value of $\hat{\mathcal{T}} z_j$ with respect to the support $z_j$. It is defined as the projected support $\hat{\mathcal{T}} z_j$ minus the minimum value of the distribution $V_{min}$, divided by $\Delta z$, both defined above. The lower variable $l$ is the floor of $b_j$, that is, the greatest integer less than or equal to $b_j$. The upper variable $u$ is the ceiling of $b_j$, that is, the least integer greater than or equal to $b_j$. Figure 2.9 illustrates the calculation of the abovementioned variables.
Figure 2.9: Computation of the projection of $\hat{\mathcal{T}} z_j$ onto the support $z_i$, and variables $b_j$, $l$, and $u$ derived from it
Suppose that the distribution $Z(s', a^{*})$ was taken from Figure 2.8. To calculate the neural network's loss, a new support vector must be created; suppose that the first array $z_i$ in Figure 2.9 is the created support vector. The result of applying Equation 2.24 to obtain $\hat{\mathcal{T}} z_j$ is shown in the second array. With the immediate reward $r_t$ from the environment set to 4 and the discount factor $\gamma$ set to 0.99, the calculation is $\hat{\mathcal{T}} z_j = [r + \gamma z_j]_{V_{min}}^{V_{max}} = [4 + 0.99 \times z_j]_{-5}^{+5}$. The values of the projection must lie in $[V_{min}, V_{max}]$, which in this example is the range [-5, +5]. The last four positions of $\hat{\mathcal{T}} z_j$ exceeded $V_{max}$; hence they were clipped and replaced by the maximum value.
Next, the variable $b_j$ is calculated using Equation 2.25 from the previous $\hat{\mathcal{T}} z_j$ array with $\Delta z = 1.11$. It holds the real-valued index position of each value of $\hat{\mathcal{T}} z_j$ with respect to the support $z_i$. Then the vectors $l$ and $u$ are calculated; they hold the integer indexes neighboring $b_j$. $l$ contains the greatest integer less than or equal to the respective position of $b_j$, and $u$ holds the least integer greater than or equal to the respective position of $b_j$. Lastly, $m$ is the projected distribution of $Z(s', a^{*})$. The vector $p_j(s', a^{*})$ holds the probability masses of $Z(s', a^{*})$. Using the variables $u$, $l$, and $b_j$, each probability $p_j(s', a^{*})$ is split between its two neighboring atoms. The projected distribution $m$ is used as the target for the network.
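The shifted and clipped support and the index variables of this example can be reproduced with the NumPy sketch below. The variable names are chosen for illustration, and `np.clip` stands in for the bracket that bounds the values to [V_min, V_max].

```python
import numpy as np

v_min, v_max, n_atoms = -5.0, 5.0, 10
delta_z = (v_max - v_min) / (n_atoms - 1)     # Equation 2.21
z = v_min + np.arange(n_atoms) * delta_z      # Equation 2.20

r, gamma = 4.0, 0.99                          # reward and discount of the example

tz = np.clip(r + gamma * z, v_min, v_max)     # Equation 2.24: shift, shrink, clip
b = (tz - v_min) / delta_z                    # Equation 2.25: real-valued index
l = np.floor(b).astype(int)                   # Equation 2.26: lower neighbour
u = np.ceil(b).astype(int)                    # Equation 2.27: upper neighbour

print(np.round(tz, 2))
print(np.round(b, 2))
print(l, u)
```

For the first atom, r + gamma * z_0 = 4 - 4.95 = -0.95, which gives b_0 of about 3.65 and therefore the neighbouring indexes 3 and 4.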
Equations 2.28 and 2.29 define how the probability mass is distributed:
$$m_l \leftarrow m_l + p_j(s', a^{*})\,(u - b_j), \tag{2.28}$$
$$m_u \leftarrow m_u + p_j(s', a^{*})\,(b_j - l), \tag{2.29}$$
where $p_j(s', a^{*})$ is the probability associated with $\hat{\mathcal{T}} z_j$; $u$, $l$, and $b_j$ are the variables defined above; and $m$ is the vector that holds the probability masses of the projected distribution.
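A minimal sketch of the full projection step, combining Equations 2.24 to 2.29, is given below. The function name `project_distribution` and the uniform next-state probabilities are hypothetical. Two details are additions not stated in the text: when b_j is an exact integer (so that l = u and both weights in Equations 2.28 and 2.29 would be zero), the whole mass is assigned to that single atom, and b is clipped to the valid index range to guard against floating-point overshoot.

```python
import numpy as np

def project_distribution(p_next, z, r, gamma, v_min, v_max):
    """Project the shifted distribution back onto the fixed support z."""
    n_atoms = z.shape[0]
    delta_z = (v_max - v_min) / (n_atoms - 1)
    tz = np.clip(r + gamma * z, v_min, v_max)       # Equation 2.24
    b = (tz - v_min) / delta_z                      # Equation 2.25
    b = np.clip(b, 0, n_atoms - 1)                  # floating-point guard (addition)
    l = np.floor(b).astype(int)                     # Equation 2.26
    u = np.ceil(b).astype(int)                      # Equation 2.27

    m = np.zeros(n_atoms)
    for j in range(n_atoms):
        if l[j] == u[j]:
            # b_j is an integer: keep all of the mass on that atom (addition).
            m[l[j]] += p_next[j]
        else:
            m[l[j]] += p_next[j] * (u[j] - b[j])    # Equation 2.28
            m[u[j]] += p_next[j] * (b[j] - l[j])    # Equation 2.29
    return m

z = np.linspace(-5.0, 5.0, 10)
p_next = np.full(10, 0.1)        # hypothetical uniform probabilities for Z(s', a*)
m = project_distribution(p_next, z, r=4.0, gamma=0.99, v_min=-5.0, v_max=5.0)
print(np.round(m, 3), m.sum())
```

Because each probability is split linearly between its two neighbours, the projected vector still sums to one and has the same mean as the clipped, shifted distribution.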
Lastly, cross-entropy, a measure of the difference between two probability distributions, is used as the loss function. The rest follows the same architecture as the standard DQN algorithm. Cross-entropy is defined as
$$L(\theta) = -\sum_{i} m_i \log p_i(s, a), \tag{2.30}$$
where the sum runs over the atoms, $m_i$ is the projected distribution vector of probability masses, and $p_i(s, a)$ is the probability vector of the distribution $Z(s, a)$, generated by feeding the state $s$ to the network and taking action $a$. The logarithm (log) is the natural logarithm, whose base is Euler's number $e$.
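The cross-entropy loss can be sketched in NumPy as follows. The function name `cross_entropy`, the three-atom vectors, and the small constant `eps` (used only to avoid taking the logarithm of zero) are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(m, p, eps=1e-12):
    """Cross-entropy -sum_i m_i * ln(p_i) between target m and prediction p."""
    p_safe = np.clip(p, eps, 1.0)       # avoid log(0); eps is an assumption
    return float(-np.sum(m * np.log(p_safe)))

# Hypothetical three-atom target and prediction, for illustration only.
m = np.array([0.0, 0.5, 0.5])           # projected target distribution
p = np.array([0.2, 0.3, 0.5])           # predicted probabilities p_i(s, a)

loss = cross_entropy(m, p)
print(loss)
```

The loss is smallest when the predicted probabilities equal the target, in which case it reduces to the entropy of the target distribution.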