Reinforcement Learning
with Selective Perception and Hidden State
By Andrew K. McCallum Presented by Jerod Weinman
University of Massachusetts-Amherst
Outline
Motivations
Utile Distinctions
U-TREE Algorithm
Driving Experiment
Conclusions
Extensions
The Set Up...
Agents face opposite yet intertwined problems regarding internal state space:
Too many distinctions → Selective Perception
Too few distinctions → Short-Term Memory
Most RL algorithms depend on knowledge engineers to design state-space.
Opposite and Related?
Selective perception creates hidden state on purpose.
Short term memory, which alleviates hidden state, allows agents selective perception.
The “black magic” of RL applications has been engineering state distinctions.
Motivating Statements
Learning closed-loop behaviors is useful.
Selective perception provides an efficient interface.
Environment interfaces suffer from hidden
state; selective perception can make it worse.
Non-Markov hidden state problems can be solved with memory.
Motivating Statements
Learning selective perception and using memory is difficult.
Experience is expensive.
Agents must handle noisy perceptions and actions.
Final performance is to be balanced against training time.
Utile Distinctions
State-space should be dependent on the task at hand.
Learning should be proportional to task difficulty, not world complexity.
Perceptual aliasing: multiple world states may map to the same percept.
Agents should only make distinctions needed to predict future reward.
Distinctions for Learning
Theorem: The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy.
Describe an environment and task for which an optimal policy may be calculated.
Find a minimum set of state distinctions adequate for representing that policy.
Recalculate the policy in the reduced internal state space; the result is a non-optimal policy.
Utile Distinctions
Why? If states s_1 and s_2 are aliased, then
the path through s_1 may have slightly lower immediate reward than the path through s_2, yet
the utility of s_1 may be higher than that of s_2, i.e. U(s_1) > U(s_2).
Optimal policy calculations don't care which aliased state the agent goes through, so the calculation chooses the lower-cost (higher immediate reward) path, which runs through the state with lower utility.
Utile Distinction Test
Distinguishes states that have different policy actions or different utilities
Merges states that have the same policy action and same utility
U-TREE Overview
Treats percepts as multi-dimensional vectors of features
Allows the agent to ignore certain dimensions of perception
Internal state space can be smaller than space of all percepts
Combines instance-based learning with utile distinctions
Agent builds a tree for making state distinctions.
U-TREE Overview
Non-leaf nodes branch on present or past percepts and actions.
Training instances are deposited in leaves.
The suffix tree is like an order-n Markov model with varying n.
Factored state representation captures only necessary state distinctions.
Value function approximation is achieved by representing the value function with a structure more compact than a mapping from all world states.
Example
Agent and Environment
Finite set of actions, A. Scalar range of possible rewards, R ⊂ ℝ.
Finite set of observations, O.
At time t, the agent executes action a_t and then receives observation o_{t+1} and reward r_{t+1}.
Agent and Environment
The set of observations is the set of all values of a perceptual vector with perceptual features o^1, o^2, ..., o^n.
Each feature o^i is an element of a finite set O^i of possible values.
The value of dimension i at time t is o^i_t, so an observation is written o_t = ⟨o^1_t, o^2_t, ..., o^n_t⟩.
O is the Cartesian product of all feature sets, so |O| = ∏_i |O^i|.
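Since the deck contains no code, here is a minimal Python sketch of the factored observation space just defined; the feature names and values are invented for illustration and are not from the original experiments.

```python
from itertools import product

# Hypothetical perceptual features; each feature o^i takes values from a finite set O^i.
FEATURES = {
    "hear_horn":   [0, 1],
    "gaze_object": ["truck", "shoulder", "road"],
    "gaze_side":   ["left", "right"],
}

# |O| is the size of the Cartesian product of the feature sets: prod_i |O^i|.
n_observations = 1
for values in FEATURES.values():
    n_observations *= len(values)

# One observation o_t assigns a value to every feature.
example_obs = next(iter(product(*FEATURES.values())))
print(n_observations, example_obs)   # 12 (0, 'truck', 'left')
```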
Instance Chain
The agent records raw experiences in a transition instance,
T_t = ⟨T_{t-1}, a_{t-1}, o_t, r_t⟩.
Tree nodes add a distinction based on
a history index, j, indicating the number of steps backwards in time, and
a perception or action dimension, i.
Every node is uniquely identified by the set of labels on the path from the root, its conjunction s.
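A minimal Python sketch of the transition-instance chain and node distinctions just described; the class and field names are illustrative choices, not McCallum's notation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instance:
    """Transition instance T_t = <T_{t-1}, a_{t-1}, o_t, r_t>."""
    prev: Optional["Instance"]   # T_{t-1}: previous instance in the chain (None at the start)
    action: int                  # a_{t-1}: action that preceded this observation
    obs: Tuple[int, ...]         # o_t: vector of perceptual feature values
    reward: float                # r_t

@dataclass
class Distinction:
    """A node label: look history_index steps back at one perception or action dimension."""
    history_index: int           # j: 0 = current instance, 1 = one step back, ...
    dimension: int               # i: index into obs, or -1 to branch on the action
```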
U-TREE
An instance T_t is deposited in the leaf node whose conjunction is satisfied by T_t and its predecessors.
T(s) is the set of instances associated with leaf s.
L(T_t) specifies the leaf to which instance T_t belongs.
Below the official leaves, fringe nodes are added that provide “hypothesis” distinctions.
If these distinctions help predict future reward, the fringe nodes are promoted to “official” distinctions.
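Continuing the sketch, a hypothetical find_leaf walks the tree by evaluating each node's distinction against the instance chain, and deposit implements T(s) ← T(s) ∪ {T_t}; the Node class and the omission of fringe handling are simplifying assumptions.

```python
class Node:
    """U-Tree node; internal nodes branch on a Distinction, leaves store instances and Q-values."""
    def __init__(self, distinction=None):
        self.distinction = distinction   # None at a leaf
        self.children = {}               # distinction value -> child Node (one child per value)
        self.instances = []              # T(s), kept only at leaves
        self.q = {}                      # Q(s, a), kept only at leaves

def read_value(inst, d):
    """Value of distinction d for instance inst: walk back d.history_index steps."""
    for _ in range(d.history_index):
        if inst.prev is None:            # not enough history recorded yet
            return None
        inst = inst.prev
    return inst.action if d.dimension == -1 else inst.obs[d.dimension]

def find_leaf(root, inst):
    """L(T_t): the leaf whose conjunction is satisfied by inst and its predecessors."""
    node = root
    while node.distinction is not None:
        node = node.children[read_value(inst, node.distinction)]
    return node

def deposit(root, inst):
    leaf = find_leaf(root, inst)
    leaf.instances.append(inst)          # T(s) <- T(s) U {T_t}
    return leaf
```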
Hidden State Space
U-Tree leaves correspond to internal states of the agent.
Deep branches represent finely distinguished regions of state space.
Shallow branches represent broadly distinguished regions of state space.
Q(s, a) is the learned estimate of expected future discounted reward for a state-action pair.
All Q-values indicate expected values for the next step in the future.
U-TREE Algorithm
1. Begin with a tree that represents no distinctions: a single root node s with T(s) = {}.
2. The agent takes a step in the environment.
(a) Record the transition
T_t = ⟨T_{t-1}, a_{t-1}, o_t, r_t⟩.
(b) Associate T_t with the leaf s whose conjunction it satisfies:
T(s) ← T(s) ∪ {T_t}.
U-TREE Algorithm
3. Perform one sweep of value iteration with the leaves acting as states (a sketch of one sweep follows below):
Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s' | s, a) U(s')
where
R(s, a) = ( Σ_{T_i ∈ T(s, a)} r_i ) / |T(s, a)|
Pr(s' | s, a) = |{ T_i ∈ T(s, a) : L(T_{i+1}) = s' }| / |T(s, a)|
U(s') = max_{a'} Q(s', a')
and T(s, a) is the set of instances in leaf s associated with action a.
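A Python sketch of the sweep in step 3, built on the Node and Instance classes sketched earlier; GAMMA and the successor helper are assumptions for illustration, not part of the original notation.

```python
from collections import defaultdict

GAMMA = 0.9   # discount factor (illustrative value)

def value_iteration_sweep(leaves, successor):
    """One sweep of value iteration with the leaves acting as states.

    successor(inst) returns the next instance T_{i+1} in the chain (or None);
    each leaf carries .instances (T(s)) and .q (Q(s, .)).
    """
    U = {s: max(s.q.values(), default=0.0) for s in leaves}    # U(s) = max_a Q(s, a)
    leaf_of = {id(i): s for s in leaves for i in s.instances}  # cache of L(T_i)

    for s in leaves:
        by_action = defaultdict(list)                          # T(s, a)
        for inst in s.instances:
            by_action[inst.action].append(inst)
        for a, T_sa in by_action.items():
            R = sum(i.reward for i in T_sa) / len(T_sa)        # R(s, a)
            # sum_{s'} Pr(s'|s,a) U(s'), estimated from the successor leaves L(T_{i+1})
            nxt = [successor(i) for i in T_sa]
            nxt = [n for n in nxt if n is not None and id(n) in leaf_of]
            exp_U = sum(U[leaf_of[id(n)]] for n in nxt) / len(nxt) if nxt else 0.0
            s.q[a] = R + GAMMA * exp_U                         # Q(s, a) update
```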
U-TREE Algorithm
4. After every G steps, test whether utility predictions differ enough to warrant new distinctions in the internal state space.
(a) Compare distributions of future discounted reward associated with the same action from different nodes.
(b) The fringe could be expanded by all possible permutations of observations and actions up to a fixed depth d with a maximum history index j, yielding an enormous branch factor on the order of j|D|^d, where D is the set of perceptual dimensions and actions.
U-TREE Algorithm
4. Test for utile distinctions.
(c) Possible expansion pruning methods
i. Don’t expand leaves containing zero (or few) instances.
ii. Don’t expand leaves whose instances exhibit little deviation in utility.
iii. Order the terms in the conjunction (i.e.
perceptual dimensions, action) for expansion.
U-TREE Algorithm
4. Test for utile distinctions.
(d) The expected future discounted reward of instance T_i is
Q(T_i) = r_i + γ U(L(T_{i+1})).
(e) When a deep fringe node is promoted, all of its uncles and great-uncles are promoted too.
(A sketch of this test follows below.)
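A sketch of the test in step 4: compute Q(T_i) = r_i + γ U(L(T_{i+1})) for the instances under each candidate fringe child and compare the resulting distributions. McCallum's thesis uses a Kolmogorov-Smirnov test, approximated here with scipy.stats.ks_2samp; the significance threshold and the pooling over actions are simplifications of the per-action comparison described above.

```python
from scipy.stats import ks_2samp

GAMMA = 0.9
P_THRESHOLD = 0.05   # illustrative significance level

def instance_utilities(instances, successor, leaf_of, U):
    """Q(T_i) = r_i + gamma * U(L(T_{i+1})) for each stored instance."""
    qs = []
    for inst in instances:
        nxt = successor(inst)
        if nxt is not None and id(nxt) in leaf_of:
            qs.append(inst.reward + GAMMA * U[leaf_of[id(nxt)]])
    return qs

def distinction_is_utile(fringe_children, successor, leaf_of, U):
    """Promote a fringe split if two children predict significantly different
    distributions of future discounted reward (pooled over actions for brevity)."""
    dists = [instance_utilities(c.instances, successor, leaf_of, U)
             for c in fringe_children]
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            if len(dists[i]) > 1 and len(dists[j]) > 1:
                _, p = ks_2samp(dists[i], dists[j])
                if p < P_THRESHOLD:
                    return True
    return False
```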
U-TREE Algorithm
5. Choose the next action based on the Q-values of the corresponding leaf:
a_t = argmax_a Q(L(T_t), a).
Alternatively, explore by choosing a random action with probability ε (sketched below).
6. Set t ← t + 1. Go to step 2.
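A sketch of the action selection in step 5; EPSILON and the action list are illustrative, and find_leaf is the helper sketched earlier.

```python
import random

EPSILON = 0.1   # exploration probability (illustrative)

def choose_action(root, inst, actions):
    """a_t = argmax_a Q(L(T_t), a), or a random action with probability epsilon."""
    if random.random() < EPSILON:
        return random.choice(actions)
    leaf = find_leaf(root, inst)
    if not leaf.q:                       # no Q estimates yet for this leaf: explore
        return random.choice(actions)
    return max(leaf.q, key=leaf.q.get)   # greedy action from the leaf's Q-values
```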
Take the Agents Driving
Driving Experiment
Actions include gaze directions and shifting into the gazed-at lane.
The sensory system includes hearing and several gaze-dependent gauges.
2,592 sensor states
3,536 world states not counting the agent's sensory system; 21,216 world states otherwise.
Trying to solve the task with only perceptual distinctions would be disastrous.
Driving Experiment
Over 5,000 time steps with only slower cars:
A hand-written policy (32 leaves) makes 99 collisions.
Random actions make 788 collisions.
U-Tree, trained for 10,000 time steps with a decreasing exploration policy (51 leaves), makes 67 collisions.
Driving Experiment
Over 5,000 time steps with slower and faster cars:
Random actions make 1,260 collisions and spend 775 steps being honked at.
U-Tree, trained for 18,000 steps with a decreasing exploration policy (143 leaves), makes 280 collisions and spends 176 steps being honked at.
Discussion
“Chicken and egg” problem among distinctions, utility, and policy: each estimate depends on the others.
Difficulty with long memories
Difficulty with large conjunctions
Difficulty with hard-to-find rewards
Difficulty with loops in the environment
Discussion
Success with large perception spaces
Success with hidden state
Success with noise
Success with expensive experience
Applicable to general RL domains
Extensions
Better Statistical Tests
Utile-Clustered Branches
Information-Theoretic Splitting
Eliminate the Fringe
Options