Reinforcement Learning
with Selective Perception and Hidden State
By Andrew K. McCallum Presented by Jerod Weinman
University of Massachusetts-Amherst
Outline
Motivations
Utile Distinctions
U-TREE Algorithm
Driving Experiment
Conclusions
Extensions
The Set Up...
Agents face opposite yet intertwined problems regarding internal state space:
Too many distinctions → Selective Perception
Too few distinctions → Short-Term Memory
Most RL algorithms depend on knowledge engineers to design state-space.
Opposite and Related?
Selective perception creates hidden state on purpose.
Short term memory, which alleviates hidden state, allows agents selective perception.
The “black magic” of RL applications has been engineering state distinctions.
Motivating Statements
Learning closed-loop behaviors is useful.
Selective perception provides an efficient interface.
Environment interfaces suffer from hidden
state; selective perception can make it worse.
Non-Markov hidden state problems can be solved with memory.
Motivating Statements
Learning selective perception and using memory is difficult.
Experience is expensive.
Agents must handle noisy perceptions and actions.
Final performance is to be balanced against training time.
Utile Distinctions
State-space should be dependent on the task at hand.
Learning should be proportional to task difficulty, not world complexity.
Perceptual aliasing: multiple world states may map to the same percept.
Agents should only make distinctions needed to predict future reward.
Distinctions for Learning
Theorem: The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy.
Describe an environment and task for which an optimal policy may be calculated.
Find a minimum set of state distinctions adequate for representing that policy.
Recalculate the policy in the reduced internal state space; the result is a non-optimal policy.
Utile Distinctions
Why? If states s_1 and s_2 are aliased, then
the path through s_1 may have slightly lower immediate reward than the path through s_2, yet
the utility of s_1 may be higher than that of s_2, i.e. U(s_1) > U(s_2).
Optimal policy calculations don't care which aliased state the agent goes through, so the calculation chooses the lower-cost (higher immediate reward) path, which runs through the state with lower utility.
Utile Distinction Test
Distinguishes states that have different policy actions or different utilities
Merges states that have the same policy action and same utility
U-TREE Overview
Treats percepts as multi-dimensional vectors of features
Allows the agent to ignore certain dimensions of perception
Internal state space can be smaller than space of all percepts
Combines instance-based learning with utile distinctions
Agent builds a tree for making state distinctions.
U-TREE Overview
Non-leaf nodes branch on present or past percepts and actions.
Training instances are deposited in leaves.
The suffix tree is like an order-n Markov model with varying n.
Factored state representation captures only necessary state distinctions.
Value function approximation is achieved by representing the value function with a structure more compact than a mapping from all world states.
Example
Agent and Environment
Finite set of actions, A. Scalar range of possible rewards, R ⊂ ℝ.
Finite set of observations, O.
At time t, the agent executes action a_t and then receives observation o_{t+1} and reward r_{t+1}.
Agent and Environment
The set of observations is the set of all values of a perceptual vector with perceptual features o^1, o^2, ..., o^n.
Each feature o^i is an element of a finite set O^i of possible values.
The value of dimension i at time t is o^i_t, so an observation is written o_t = ⟨o^1_t, o^2_t, ..., o^n_t⟩.
O is the Cartesian product of all feature sets, so |O| = ∏_i |O^i|.
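Since the deck contains no code, here is a minimal Python sketch of the factored observation space just defined; the feature names and values are invented for illustration and are not from the original experiments.

```python
from itertools import product

# Hypothetical perceptual features; each feature o^i takes values from a finite set O^i.
FEATURES = {
    "hear_horn":   [0, 1],
    "gaze_object": ["truck", "shoulder", "road"],
    "gaze_side":   ["left", "right"],
}

# |O| is the size of the Cartesian product of the feature sets: prod_i |O^i|.
n_observations = 1
for values in FEATURES.values():
    n_observations *= len(values)

# One observation o_t assigns a value to every feature.
example_obs = next(iter(product(*FEATURES.values())))
print(n_observations, example_obs)   # 12 (0, 'truck', 'left')
```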
Instance Chain
The agent records raw experiences in a transition instance,
T_t = ⟨T_{t-1}, a_{t-1}, o_t, r_t⟩.
Tree nodes add a distinction based on
a history index, j, indicating the number of steps backwards in time, and
a perception or action dimension, i.
Every node is uniquely identified by the set of labels on the path from the root, its conjunction s.
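A minimal Python sketch of the transition-instance chain and node distinctions just described; the class and field names are illustrative choices, not McCallum's notation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instance:
    """Transition instance T_t = <T_{t-1}, a_{t-1}, o_t, r_t>."""
    prev: Optional["Instance"]   # T_{t-1}: previous instance in the chain (None at the start)
    action: int                  # a_{t-1}: action that preceded this observation
    obs: Tuple[int, ...]         # o_t: vector of perceptual feature values
    reward: float                # r_t

@dataclass
class Distinction:
    """A node label: look history_index steps back at one perception or action dimension."""
    history_index: int           # j: 0 = current instance, 1 = one step back, ...
    dimension: int               # i: index into obs, or -1 to branch on the action
```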
U-TREE
An instance T_t is deposited in the leaf node whose conjunction is satisfied by T_t and its predecessors.
T(s) is the set of instances associated with leaf s.
L(T_t) specifies the leaf to which instance T_t belongs.
Below the official leaves, fringe nodes are added that provide “hypothesis” distinctions.
If these distinctions help predict future reward, the fringe nodes are promoted to “official” distinctions.
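Continuing the sketch, a hypothetical find_leaf walks the tree by evaluating each node's distinction against the instance chain, and deposit implements T(s) ← T(s) ∪ {T_t}; the Node class and the omission of fringe handling are simplifying assumptions.

```python
class Node:
    """U-Tree node; internal nodes branch on a Distinction, leaves store instances and Q-values."""
    def __init__(self, distinction=None):
        self.distinction = distinction   # None at a leaf
        self.children = {}               # distinction value -> child Node (one child per value)
        self.instances = []              # T(s), kept only at leaves
        self.q = {}                      # Q(s, a), kept only at leaves

def read_value(inst, d):
    """Value of distinction d for instance inst: walk back d.history_index steps."""
    for _ in range(d.history_index):
        if inst.prev is None:            # not enough history recorded yet
            return None
        inst = inst.prev
    return inst.action if d.dimension == -1 else inst.obs[d.dimension]

def find_leaf(root, inst):
    """L(T_t): the leaf whose conjunction is satisfied by inst and its predecessors."""
    node = root
    while node.distinction is not None:
        node = node.children[read_value(inst, node.distinction)]
    return node

def deposit(root, inst):
    leaf = find_leaf(root, inst)
    leaf.instances.append(inst)          # T(s) <- T(s) U {T_t}
    return leaf
```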
Hidden State Space
U-Tree leaves correspond to internal states of the agent.
Deep branches represent finely distinguished regions of state space.
Shallow branches represent broadly distinguished regions of state space.
Q(s, a) is the learned estimate of expected future discounted reward for a state-action pair.
All Q-values indicate expected values for the next step in the future.
U-TREE Algorithm
1. Begin with a tree that represents no distinctions: a single root node s with T(s) = {}.
2. The agent takes a step in the environment.
(a) Record the transition
T_t = ⟨T_{t-1}, a_{t-1}, o_t, r_t⟩.
(b) Associate T_t with the leaf s whose conjunction it satisfies:
T(s) ← T(s) ∪ {T_t}.
U-TREE Algorithm
3. Perform one sweep of value iteration with the leaves acting as states (a sketch of one sweep follows below):
Q(s, a) ← R(s, a) + γ Σ_{s'} Pr(s' | s, a) U(s')
where
R(s, a) = ( Σ_{T_i ∈ T(s, a)} r_i ) / |T(s, a)|
Pr(s' | s, a) = |{ T_i ∈ T(s, a) : L(T_{i+1}) = s' }| / |T(s, a)|
U(s') = max_{a'} Q(s', a')
and T(s, a) is the set of instances in leaf s associated with action a.
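A Python sketch of the sweep in step 3, built on the Node and Instance classes sketched earlier; GAMMA and the successor helper are assumptions for illustration, not part of the original notation.

```python
from collections import defaultdict

GAMMA = 0.9   # discount factor (illustrative value)

def value_iteration_sweep(leaves, successor):
    """One sweep of value iteration with the leaves acting as states.

    successor(inst) returns the next instance T_{i+1} in the chain (or None);
    each leaf carries .instances (T(s)) and .q (Q(s, .)).
    """
    U = {s: max(s.q.values(), default=0.0) for s in leaves}    # U(s) = max_a Q(s, a)
    leaf_of = {id(i): s for s in leaves for i in s.instances}  # cache of L(T_i)

    for s in leaves:
        by_action = defaultdict(list)                          # T(s, a)
        for inst in s.instances:
            by_action[inst.action].append(inst)
        for a, T_sa in by_action.items():
            R = sum(i.reward for i in T_sa) / len(T_sa)        # R(s, a)
            # sum_{s'} Pr(s'|s,a) U(s'), estimated from the successor leaves L(T_{i+1})
            nxt = [successor(i) for i in T_sa]
            nxt = [n for n in nxt if n is not None and id(n) in leaf_of]
            exp_U = sum(U[leaf_of[id(n)]] for n in nxt) / len(nxt) if nxt else 0.0
            s.q[a] = R + GAMMA * exp_U                         # Q(s, a) update
```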
U-TREE Algorithm
4. After every G steps, test whether utility predictions differ enough to warrant new distinctions in the internal state space.
(a) Compare distributions of future discounted reward associated with the same action from different nodes.
(b) The fringe could be expanded by all possible permutations of observations and actions up to a fixed depth d with a maximum history index j, yielding an enormous branch factor on the order of j|D|^d, where D is the set of perceptual dimensions and actions.
U-TREE Algorithm
4. Test for utile distinctions.
(c) Possible expansion pruning methods
i. Don’t expand leaves containing zero (or few) instances.
ii. Don’t expand leaves whose instances exhibit little deviation in utility.
iii. Order the terms in the conjunction (i.e.
perceptual dimensions, action) for expansion.
U-TREE Algorithm
4. Test for utile distinctions.
(d) The expected future discounted reward of instance T_i is
Q(T_i) = r_i + γ U(L(T_{i+1})).
(e) When a deep fringe node is promoted, all of its uncles and great-uncles are promoted too.
(A sketch of this test follows below.)
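A sketch of the test in step 4: compute Q(T_i) = r_i + γ U(L(T_{i+1})) for the instances under each candidate fringe child and compare the resulting distributions. McCallum's thesis uses a Kolmogorov-Smirnov test, approximated here with scipy.stats.ks_2samp; the significance threshold and the pooling over actions are simplifications of the per-action comparison described above.

```python
from scipy.stats import ks_2samp

GAMMA = 0.9
P_THRESHOLD = 0.05   # illustrative significance level

def instance_utilities(instances, successor, leaf_of, U):
    """Q(T_i) = r_i + gamma * U(L(T_{i+1})) for each stored instance."""
    qs = []
    for inst in instances:
        nxt = successor(inst)
        if nxt is not None and id(nxt) in leaf_of:
            qs.append(inst.reward + GAMMA * U[leaf_of[id(nxt)]])
    return qs

def distinction_is_utile(fringe_children, successor, leaf_of, U):
    """Promote a fringe split if two children predict significantly different
    distributions of future discounted reward (pooled over actions for brevity)."""
    dists = [instance_utilities(c.instances, successor, leaf_of, U)
             for c in fringe_children]
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            if len(dists[i]) > 1 and len(dists[j]) > 1:
                _, p = ks_2samp(dists[i], dists[j])
                if p < P_THRESHOLD:
                    return True
    return False
```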
U-TREE Algorithm
5. Choose the next action based on the Q-values of the corresponding leaf:
a_t = argmax_a Q(L(T_t), a).
Alternatively, explore by choosing a random action with probability ε (sketched below).
6. Set t ← t + 1. Go to step 2.
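A sketch of the action selection in step 5; EPSILON and the action list are illustrative, and find_leaf is the helper sketched earlier.

```python
import random

EPSILON = 0.1   # exploration probability (illustrative)

def choose_action(root, inst, actions):
    """a_t = argmax_a Q(L(T_t), a), or a random action with probability epsilon."""
    if random.random() < EPSILON:
        return random.choice(actions)
    leaf = find_leaf(root, inst)
    if not leaf.q:                       # no Q estimates yet for this leaf: explore
        return random.choice(actions)
    return max(leaf.q, key=leaf.q.get)   # greedy action from the leaf's Q-values
```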
Take the Agents Driving
Driving Experiment
Actions include gaze directions and shifting into the gazed-at lane.
The sensory system includes hearing and several gaze-dependent gauges.
2,592 sensor states
3,536 world states not counting the agent's sensory system; 21,216 world states otherwise.
Trying to solve the task with only perceptual distinctions would be disastrous.
Driving Experiment
Over 5,000 time steps with only slower cars:
A hand-written policy (32 leaves) makes 99 collisions.
Random actions make 788 collisions.
U-Tree, trained for 10,000 time steps with a decreasing exploration policy (51 leaves), makes 67 collisions.
Driving Experiment
Over 5,000 time steps with slower and faster cars:
Random actions make 1,260 collisions and spend 775 steps being honked at.
U-Tree, trained for 18,000 steps with a decreasing exploration policy (143 leaves), makes 280 collisions and spends 176 steps being honked at.
Discussion
“Chicken and egg” problem among distinctions, utility, and policy: each estimate depends on the others.
Difficulty with long memories
Difficulty with large conjunctions
Difficulty with hard-to-find rewards
Difficulty with loops in the environment
Discussion
Success with large perception spaces
Success with hidden state
Success with noise
Success with expensive experience
Applicable to general RL domains
Extensions
Better Statistical Tests
Utile-Clustered Branches
Information-Theoretic Splitting
Eliminate the Fringe
Options