
Algorithm 5.1: Proposed Approach

    s is the current state;
 1  repeat for each step of episode
 2      Choose a from s using a policy derived from Q;
 3      if |a| > 1 then
 4          Negotiate with the agents of a;
 5          if negotiation failed then
 6              a ← my component of a;
 7      Perform action a, observe next state s' and reward r;
 8      Update Q-table using Procedure 5.2;
 9      if E(max_a'(s', a')) < E_min then
10          s_n ← s';
11          while |s_n| == |s| or |s_n| < |CommunicationArea| do
12              agent ← NearestAgent ∈ CommunicationArea;
13              if state-component from agent ∉ s_n then
14                  Add to s_n a state-component from agent;
15          if Q(s_n, a') ∉ Q-table then
16              Add Q(s_n, a') to Q-table;
17          else if |s_n| == |CommunicationArea| then
18              a_n ← a';
19              while |a_n| == |a'| do
20                  agent ← NearestAgent ∈ CommunicationArea;
21                  if action from agent ∉ a_n then
22                      Add to a_n the action;
23              if Q(s', a_n) ∉ Q-table then
24                  Add Q(s', a_n) to Q-table;
25              a' ← a_n;
26      s ← s';
27  until s is terminal;

At the beginning, the agent has information only about the environment states it perceives and the set of possible actions in this environment. When the agent perceives that its information about the environment is generating errors (the precision E(s,a) is lower than a limit E_min), it tries to obtain information from other agents in order to create a new state in its knowledge base (Q-table). At this point a group is needed, since the agent must communicate in order to obtain more information about the environment. When the agent's knowledge about the state of the environment has already been expanded and it still has significant errors, it starts to create joint actions with the nearest agents. To perform these new joint actions, the agent must negotiate with the agents that are needed to perform the joint action (line 5). This negotiation mechanism must be defined according to the group formation method used to create the communication and action groups.

Procedure 5.2: updateQTable(state vector: s, action vector: a, state vector: s')

 1  foreach combination i of state-components of s do
 2      foreach combination j of action-components of a do
 3          foreach combination k of state-components of s' do
 4              if Q(i,j) ∈ Q-table then
 5                  $Q(i,j) \leftarrow Q(i,j) + \alpha \left( r \frac{|j|}{|a|} + \gamma \max_{a'} Q(k,a') - Q(i,j) \right)$;
 6                  Update E(i,j);
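The update in Procedure 5.2 can be illustrated with a minimal Python sketch. Several choices here are assumptions made only for illustration: the Q-table is a dictionary keyed by (state tuple, action tuple), "combination" is read as every non-empty ordered sub-tuple, the reward r is passed in explicitly, next-state combinations with no stored entry contribute 0 to the maximum, and the "Update E(i,j)" step is left out because the text does not define E.

```python
from itertools import combinations


def sub_tuples(components):
    """All non-empty sub-tuples of `components`, preserving order."""
    return [c for n in range(1, len(components) + 1)
            for c in combinations(components, n)]


def update_q_table(q, s, a, s_next, r, alpha=0.1, gamma=0.9):
    """Apply a Procedure 5.2-style update to every stored pair (i, j)."""
    for i in sub_tuples(s):
        for j in sub_tuples(a):
            if (i, j) not in q:
                continue  # only entries already in the Q-table are updated
            share = r * len(j) / len(a)  # reward scaled by |j| / |a|
            for k in sub_tuples(s_next):
                # Best stored value for next-state combination k
                # (0.0 when k has no entry yet: an assumption of this sketch).
                best_next = max(
                    (v for (st, _), v in q.items() if st == k), default=0.0)
                q[(i, j)] += alpha * (share + gamma * best_next - q[(i, j)])
    return q


# Hypothetical usage: a single-agent entry and a joint entry.
q = {(("o'1",), ("a'1",)): 0.0,
     (("o'1", "o2"), ("a'1", "a2")): 0.0}
update_q_table(q, ("o'1", "o2"), ("a'1", "a2"), ("o1", "o2"), r=0.1)
```

In this reading, an entry whose action covers only one of the two components receives half of the reward, while the joint entry receives all of it.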

How to define the precision (or error) function E(s,a) still needs to be studied: it should indicate low precision only when the value of the pair (s,a) is changing too much for a stationary problem. This value is relative to the number of times that the pair (s,a) has been visited. The limit E_min is a decreasing function of time with a lower bound, which can be defined according to the required precision. E_min must always start at the highest possible value, so that the agent does not enlarge its Q-table until it has acted on its own information for a sufficient time. The rate at which E_min decreases must be defined according to the characteristics of the problem.
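As one possible shape for the limit, the sketch below uses a geometric decay clipped at a lower bound. The function name `e_min` and the parameters `start`, `floor`, and `decay` are hypothetical; the text only requires a decreasing function of time with a lower limit, so any other monotone schedule would fit equally well.

```python
def e_min(t, start=1.0, floor=0.2, decay=0.99):
    """Limit at step t: starts at `start`, decays geometrically,
    and never drops below `floor` (all three values are assumptions)."""
    return max(floor, start * decay ** t)


# Hypothetical usage: the limit over the first few hundred steps.
schedule = [e_min(t) for t in range(0, 500, 100)]
```

The decay rate plays the role of the problem-dependent rate mentioned above: a value closer to 1 keeps the limit near its starting value for longer.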

Table 5.1: Complete Transition Description

State        Action       New State    Reward for A1 and A2
<o1,o2>      <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o'1,o2>     −0.1, +0.1
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1
<o'1,o2>     <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o1,o2>      +0.1, +0.2
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1
<o1,o'2>     <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o'1,o2>     −0.1, +0.1
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1
<o'1,o'2>    <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o'1,o2>     −0.1, +0.1
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1

When the agent needs to perform a joint action, it tries to negotiate with the agents involved in the composed action. If the negotiation fails completely, it acts using its best action. The negotiation step is very important, and any kind of coalition or team formation approach described in Chapter 3 can be used here. The most adequate kind of group formation can be chosen according to the problem's constraints on communication and response time.

The pseudo-code of the proposed approach is presented in Algorithm 5.1.
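The negotiation fallback can be sketched as follows. The function `resolve_action` and the `negotiate` callback are hypothetical names; the callback stands in for whichever coalition or team formation mechanism is chosen, and the fallback follows the algorithm's wording by keeping only the agent's own component of the joint action.

```python
def resolve_action(me, chosen, negotiate):
    """Return the action that will actually be executed.

    chosen:    dict mapping agent name -> action component; it is a joint
               action when it holds more than one agent.
    negotiate: hypothetical callable(partners, chosen) -> bool that
               returns True when every partner agrees.
    """
    if len(chosen) > 1:
        partners = [agent for agent in chosen if agent != me]
        if not negotiate(partners, chosen):
            # Negotiation failed: keep only my own component.
            return {me: chosen[me]}
    return chosen


# Hypothetical usage: A1 proposes a joint action and A2 refuses.
joint = {"A1": "a'1", "A2": "a2"}
executed = resolve_action("A1", joint, lambda partners, action: False)
```

Passing the negotiation mechanism as a callback keeps the learning loop independent of the group formation method, which the text leaves open.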

As an example, consider a scenario where Table 5.1 represents the transition table for a distributed MDP with two agents (A1 and A2), where all the transition probabilities are 1 and the agents receive different rewards. Both agents can receive two different observations (o and o') and have two possible actions (a and a'). The index of each observation and each action indicates the agent responsible for performing the action or receiving the observation, so "o2" represents the observation "o" for agent A2 and "a'1" represents the action "a'" for agent A1.

Table 5.2: Independent Transition Description for Agent A1

State    Action    New State        Reward for A1
<o1>     <a1>      <o1>             +0.1
<o1>     <a'1>     <o'1>            +0.1
<o'1>    <a1>      <o1>             +0.1
<o'1>    <a'1>     <o'1> or <o1>    +0.1 or −0.1

Table 5.3: Independent Transition Description for Agent A2

State    Action    New State    Reward for A2
<o2>     <a2>      <o2>         +0.1
<o2>     <a'2>     <o'2>        +0.1
<o'2>    <a2>      <o2>         +0.1
<o'2>    <a'2>     <o'2>        +0.2 or −0.1

The states, in the complete representation, are composed of the agents' joint observations, and the actions are composed of the agents' joint actions. If both agents used an independent single-agent RL approach, they would consider this environment non-stationary, since they can observe neither the complete state description nor the effect of the other agent's actions on the system. Agent A1 would consider unpredictable the transition from <o'1> to <o'1> or to <o1> when performing action a'1. The independent transition table for agent A1 is given in Table 5.2 and the one for agent A2 in Table 5.3.
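The loss of predictability for agent A1 can be checked mechanically by projecting the joint transitions of Table 5.1 onto what A1 alone observes. The sketch below is only an illustration: the encoding of Table 5.1 as a regular pattern plus one exceptional row, and the helper names `table_5_1` and `local_view`, are assumptions of this example.

```python
from collections import defaultdict


def table_5_1():
    """Joint transitions: (state, action) -> (next state, rewards)."""
    t = {}
    for o1 in ("o", "o'"):
        for o2 in ("o", "o'"):
            for a1 in ("a", "a'"):
                for a2 in ("a", "a'"):
                    # Regular rows: action a leads to o with +0.1,
                    # action a' leads to o' with -0.1, per agent.
                    nxt = tuple("o" if a == "a" else "o'" for a in (a1, a2))
                    rew = tuple(0.1 if a == "a" else -0.1 for a in (a1, a2))
                    t[((o1, o2), (a1, a2))] = (nxt, rew)
    # The one row that breaks the pattern: state <o'1,o2>, action <a'1,a2>.
    t[(("o'", "o"), ("a'", "a"))] = (("o", "o"), (0.1, 0.2))
    return t


def local_view(table, agent):
    """Outcomes one agent can see (index 0 for A1) per local (obs, action)."""
    seen = defaultdict(set)
    for (s, a), (nxt, rew) in table.items():
        seen[(s[agent], a[agent])].add((nxt[agent], rew[agent]))
    return seen


# Local pairs of A1 that have more than one possible outcome.
ambiguous = {pair: outcomes
             for pair, outcomes in local_view(table_5_1(), 0).items()
             if len(outcomes) > 1}
```

For A1, the only local pair with more than one outcome is observation o' with action a', which is exactly the transition described above as unpredictable.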

Q-table at t=0        Q-table at t=n          Q-table at t=m
Q(<o1>, <a1>)         Q(<o1>, <a1>)           Q(<o1>, <a1>)
Q(<o1>, <a'1>)        Q(<o1>, <a'1>)          Q(<o1>, <a'1>)
Q(<o'1>, <a1>)        Q(<o'1>, <a1>)          Q(<o'1>, <a1>)
Q(<o'1>, <a'1>)       Q(<o'1>, <a'1>)         Q(<o'1>, <a'1>)
                      Q(<o'1,o2>, <a'1>)      Q(<o'1,o2>, <a'1>)
                                              Q(<o'1,o2>, <a'1,a2>)

Figure 5.1: Q-table changing over time for Agent A1

To illustrate this example, Figure 5.1 shows the Q-table of agent A1 and how it changes over time when the agent uses the proposed algorithm. At the beginning (t=0), the agent has a Q-table with its own possible observations and actions. When the agent perceives that the state <o'1> has an error, it adds the observation from agent A2, creating a new state <o'1,o2> (at t=n) and including this new state in its Q-table. As agent A1 continues, it perceives that even this new state still has an error, so it includes the other agent's action at this state to obtain a better prediction; thus, at t=m, agent A1 adds a Q-value with a composed state and a joint action. Using this new Q-table, the agent at state <o'1,o2> can either act alone or use a joint action. The joint action depends on a negotiation with agent A2, which is responsible for performing action a2. If the negotiation fails, the agent acts using its single action.
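The three snapshots of Figure 5.1 can be reproduced with a small sketch in which the Q-table is a dictionary keyed by (state tuple, action tuple). The helper names `expand_state` and `expand_action` and the initial value 0.0 for new entries are assumptions made for illustration; the text does not say how new entries are initialised.

```python
def expand_state(q, state, action, extra_obs, init=0.0):
    """Add an entry whose state also holds another agent's observation."""
    new_state = state + (extra_obs,)
    q.setdefault((new_state, action), init)
    return new_state


def expand_action(q, state, action, extra_action, init=0.0):
    """Add an entry whose action also holds another agent's action."""
    new_action = action + (extra_action,)
    q.setdefault((state, new_action), init)
    return new_action


# t = 0: only A1's own observations and actions.
q = {((o,), (a,)): 0.0 for o in ("o1", "o'1") for a in ("a1", "a'1")}

# t = n: state <o'1> shows an error, so A2's observation o2 is added.
s_n = expand_state(q, ("o'1",), ("a'1",), "o2")

# t = m: the expanded state still shows an error, so A2's action a2 is added.
a_n = expand_action(q, s_n, ("a'1",), "a2")
```

After both steps the table holds the four original entries plus the composed-state entry and the joint-action entry, so the agent at <o'1,o2> has both a single and a joint option.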

Source: Denise de Oliveira, "Towards Joint Learning in Multiagent Systems Through Opportunistic Coordination", Ph.D. Thesis Proposal, Universidade Federal do Rio Grande do Sul, Instituto de Informática, Programa de Pós-Graduação em Computação. Advisor: Prof. Dr. Ana Lúcia C. Bazzan. Porto Alegre, November 2007 (pages 53-56).