
Algorithm 5.1: Proposed Approach

 1  s is the current state;
 2  repeat for each step of episode
 3      Choose a from s using a policy derived from Q;
 4      if |a| > 1 then
 5          Negotiate with the agents of a;
 6          if negotiation failed then
 7              a ← my component of a;
 8      Perform action a, observe next state s' and reward r;
 9      Update Q-table using Procedure 5.2;
10      if E(max_a'(s', a')) < Emin then
11          sn ← s';
12          while |sn| == |s| or |sn| < |CommunicationArea| do
13              agent ← NearestAgentCommunicationArea;
14              if state-component from agent ∉ sn then
15                  Add to sn a state-component from agent;
16          if Q(sn, a') ∉ Q-table then
17              Add Q(sn, a') to Q-table;
18          else if |sn| == |CommunicationArea| then
19              an ← a';
20              while |an| == |a'| do
21                  agent ← NearestAgentCommunicationArea;
22                  if action from agent ∉ an then
23                      Add to an the action;
24              if Q(s', an) ∉ Q-table then
25                  Add Q(s', an) to Q-table;
26              a' ← an;
27      s ← s';
28  until s is terminal;

At the beginning, the agent only has information about the environment states it perceives and the set of possible actions in this environment. When the agent perceives that its information about the environment is generating errors (the precision E(s,a) is lower than a limit Emin), it tries to obtain information from another agent in order to create a new state in its knowledge base (Q-table). At this point a group is needed, since the agent must communicate in order to obtain more information about the environment. When the agent's knowledge about the state of the environment has already been expanded and it still has significant errors, it starts to create joint actions with the nearest agents. To perform these new joint actions, the agent must negotiate with the agents that are needed to perform the joint action (line 5). This negotiation mechanism must be defined according to the group formation method used to create the communication and action groups.

Procedure updateQTable(state vector: s, action vector: a, state vector: s')

1  foreach combination i of state-components of s do
2      foreach combination j of action-components of a do
3          foreach combination k of state-components of s' do
4              if Q(i,j) ∈ Q-table then
5                  $Q(i,j) \leftarrow Q(i,j) + \alpha \left( r \frac{|j|}{|a|} \max_{a'} Q(k,a') - Q(i,j) \right)$;
6                  Update E(i,j);
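The procedure updateQTable can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the Q-table is assumed to be a dictionary keyed by (state tuple, action tuple), a "combination" is read as every non-empty sub-tuple of a vector, and the target is assumed to be the usual additive Q-learning one, with the reward scaled by |j|/|a| and a hypothetical discount factor gamma that the procedure as printed does not show. The update of E(i,j) is left out because its definition is still open.

```python
from itertools import combinations


def sub_vectors(vec):
    """All non-empty, order-preserving sub-tuples of vec."""
    return [c for n in range(1, len(vec) + 1) for c in combinations(vec, n)]


def update_q_table(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update every known Q(i, j) for sub-vectors i of s and j of a.

    q maps (state_tuple, action_tuple) -> value.
    gamma and the additive target are assumptions of this sketch.
    """
    for i in sub_vectors(s):
        for j in sub_vectors(a):
            if (i, j) not in q:
                continue  # only entries already in the Q-table are updated
            share = r * len(j) / len(a)  # reward scaled by |j| / |a|
            for k in sub_vectors(s_next):
                # Best known value in sub-state k (0.0 if k is not in the table).
                best_next = max(
                    (v for (state, _), v in q.items() if state == k),
                    default=0.0,
                )
                target = share + gamma * best_next
                q[(i, j)] += alpha * (target - q[(i, j)])


q = {(("o1",), ("a1",)): 0.0}
update_q_table(q, ("o1",), ("a1",), r=0.1, s_next=("o1",), alpha=0.5, gamma=0.0)
```

Because the innermost loop runs once per combination k, a single call may apply several small updates to the same entry Q(i,j) when s' has more than one component.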

It remains to be studied how to define this precision (or error) function E(s,a) so that it indicates low precision only when the value of the pair (s,a) is changing too much for a stationary problem. This value is relative to the number of times that the pair (s,a) has been visited. The limit Emin is a decreasing function of time with a lower bound, which can be defined according to the required precision. Emin must always start at the highest possible value, so that the agent does not enlarge its Q-table until it has acted on its own information for a sufficient time. The rate at which Emin decreases must be defined according to the characteristics of the problem.
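As an illustration only, one possible shape for Emin is a geometric decay from a starting value towards a floor. The function name e_min and the parameters e_start, e_floor and decay are hypothetical; the text only requires a decreasing function of time with a lower bound and a problem-dependent rate.

```python
def e_min(t, e_start=1.0, e_floor=0.2, decay=0.99):
    """Threshold at step t.

    Starts at e_start for t = 0 and decays geometrically towards e_floor.
    All three parameters are assumptions to be tuned per problem.
    """
    return e_floor + (e_start - e_floor) * decay ** t


# The threshold at a few time steps, from the start to far in the future.
schedule = [e_min(t) for t in (0, 10, 100, 1000)]
```

A smaller decay value makes the threshold reach its floor sooner, while e_floor plays the role of the lower bound chosen according to the required precision.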

Table 5.1: Complete Transition Description

State        Action       New State    Reward for A1 and A2
<o1,o2>      <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o'1,o2>     −0.1, +0.1
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1
<o'1,o2>     <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o1,o2>      +0.1, +0.2
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1
<o1,o'2>     <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o'1,o2>     −0.1, +0.1
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1
<o'1,o'2>    <a1,a2>      <o1,o2>      +0.1, +0.1
             <a1,a'2>     <o1,o'2>     +0.1, −0.1
             <a'1,a2>     <o'1,o2>     −0.1, +0.1
             <a'1,a'2>    <o'1,o'2>    −0.1, −0.1

When the agent needs to perform a joint action, it tries to negotiate with the agents involved in the composed action. If the negotiation fails completely, it acts using its own best action. The negotiation step is very important, and here any kind of coalition or team formation approach described in Chapter 3 can be used. The most suitable kind of group formation can be chosen according to the problem's constraints on communication and response time.

The pseudo-code of the proposed approach is presented in Algorithm 5.1.
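The negotiate-or-fall-back step can be sketched as a small Python function. This is a sketch under assumptions: actions are represented as tuples of components, my_index is the position of the agent's own component, and negotiate is a hypothetical callback standing in for whichever group formation mechanism is chosen.

```python
def resolve_action(joint_action, my_index, negotiate):
    """Return the action the agent will actually perform.

    joint_action: tuple of action components, one per involved agent.
    my_index: position of this agent's own component in joint_action.
    negotiate: hypothetical callback, returns True if all partners agree.
    """
    if len(joint_action) > 1 and not negotiate(joint_action):
        # Negotiation failed: keep only this agent's own component.
        return (joint_action[my_index],)
    return joint_action


# Agent A1 wants the joint action <a'1, a2>, but A2 refuses.
performed = resolve_action(("a'1", "a2"), 0, lambda action: False)
```

A single-component action never triggers the callback, so an agent that has not yet created joint actions behaves like an independent learner.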

As an example, consider a scenario where Table 5.1 represents the transition table for a distributed MDP with two agents (A1 and A2), where all the transition probabilities are 1 and the agents receive different rewards. Both agents are able to receive two different observations (o and o') and have two possible actions (a and a'). The index of each observation and each action indicates the agent responsible for performing the action or receiving the observation, so "o2" represents the observation "o" for agent A2 and "a'1" represents the action "a'" for agent A1.

Table 5.2: Independent Transition Description for Agent A1

State     Action    New State         Reward for A1
<o1>      <a1>      <o1>              +0.1
<o1>      <a'1>     <o'1>             +0.1
<o'1>     <a1>      <o1>              +0.1
<o'1>     <a'1>     <o'1> or <o1>     +0.1 or −0.1

Table 5.3: Independent Transition Description for Agent A2

State     Action    New State    Reward for A2
<o2>      <a2>      <o2>         +0.1
<o2>      <a'2>     <o'2>        +0.1
<o'2>     <a2>      <o2>         +0.1
<o'2>     <a'2>     <o'2>        +0.2 or −0.1

In the complete representation, the states are composed of the agents' joint observations and the actions are composed of the agents' joint actions. If both agents used an independent single-agent RL approach, they would consider this environment non-stationary, since they can observe neither the complete state description nor the effect of the other agent's actions on the system. Agent A1 would consider unpredictable the transition from <o'1> to <o'1> or to <o1> when performing action a'1. The independent transition table for agent A1 is given in Table 5.2 and the one for agent A2 in Table 5.3.
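The apparent non-stationarity can be checked by projecting the joint transitions of Table 5.1 onto what a single agent observes. The sketch below is illustrative: the encoding of observations and actions as the strings "o", "o'", "a", "a'" and the helper names default_outcome and local_view are assumptions introduced here.

```python
from collections import defaultdict
from itertools import product

OBS, ACTS = ("o", "o'"), ("a", "a'")


def default_outcome(joint_action):
    # In every row of Table 5.1 but one, action a leads to o with +0.1
    # and action a' leads to o' with -0.1, independently for each agent.
    new_state = tuple("o" if act == "a" else "o'" for act in joint_action)
    rewards = tuple(0.1 if act == "a" else -0.1 for act in joint_action)
    return new_state, rewards


# (joint state, joint action) -> (new joint state, (reward A1, reward A2))
table = {
    (s, a): default_outcome(a)
    for s in product(OBS, OBS)
    for a in product(ACTS, ACTS)
}
# The exceptional row: in <o'1,o2>, <a'1,a2> leads to <o1,o2> with +0.1, +0.2.
table[(("o'", "o"), ("a'", "a"))] = (("o", "o"), (0.1, 0.2))


def local_view(table, agent):
    """Outcomes seen by one agent (0 for A1, 1 for A2) from its own components."""
    seen = defaultdict(set)
    for (s, a), (s_next, r) in table.items():
        seen[(s[agent], a[agent])].add((s_next[agent], r[agent]))
    return dict(seen)


# Local (observation, action) pairs of A1 with more than one possible outcome.
ambiguous_a1 = {
    pair for pair, outcomes in local_view(table, 0).items() if len(outcomes) > 1
}
```

Under this encoding, the only pair with more than one outcome for A1 is (o', a'), which corresponds to the last row of Table 5.2.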

Q-table at t=0        Q-table at t=n          Q-table at t=m
Q(<o1>, <a1>)         Q(<o1>, <a1>)           Q(<o1>, <a1>)
Q(<o1>, <a'1>)        Q(<o1>, <a'1>)          Q(<o1>, <a'1>)
Q(<o'1>, <a1>)        Q(<o'1>, <a1>)          Q(<o'1>, <a1>)
Q(<o'1>, <a'1>)       Q(<o'1>, <a'1>)         Q(<o'1>, <a'1>)
                      Q(<o'1,o2>, <a'1>)      Q(<o'1,o2>, <a'1>)
                                              Q(<o'1,o2>, <a'1,a2>)

Figure 5.1: Q-table changing over time for Agent A1

To illustrate this example, Figure 5.1 shows the Q-table of agent A1 and how it changes over time when the agent uses the proposed algorithm. At the beginning (t=0), the agent has a Q-table with its own possible observations and actions. When the agent perceives that the state <o'1> has an error, it adds the observation from agent A2, creating a new state <o'1,o2> (at t=n) and including this new state in its Q-table. As agent A1 continues, it perceives that even this new state still has an error, so it includes the action of the other agent in this state to obtain a better prediction. Thus, at t=m, agent A1 includes a Q-value with a composed state and a joint action. Using this new Q-table, the agent in state <o'1,o2> can either act alone or use a joint action. The joint action depends on a negotiation with agent A2, which is responsible for performing action a2. If the negotiation fails, the agent acts using its single action.
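The growth of the Q-table shown in Figure 5.1 can be reproduced with a small Python sketch. It assumes a dictionary keyed by (state tuple, action tuple) and an initial value of 0.0 for new entries; the helper names initial_q_table, expand_state and expand_action are hypothetical, and the choice of initial value is not specified in the text.

```python
def initial_q_table(observations, actions):
    """Q-table at t=0: one entry per own observation and own action."""
    return {((o,), (a,)): 0.0 for o in observations for a in actions}


def expand_state(q, state, action, extra_obs):
    """Add Q(state + extra_obs, action) if it is not in the table yet."""
    new_state = state + (extra_obs,)
    q.setdefault((new_state, action), 0.0)  # 0.0 is an assumed initial value
    return new_state


def expand_action(q, state, action, extra_act):
    """Add Q(state, action + extra_act) if it is not in the table yet."""
    new_action = action + (extra_act,)
    q.setdefault((state, new_action), 0.0)
    return new_action


q = initial_q_table(["o1", "o'1"], ["a1", "a'1"])  # t = 0: four entries
s_n = expand_state(q, ("o'1",), ("a'1",), "o2")     # t = n: composed state
a_n = expand_action(q, s_n, ("a'1",), "a2")         # t = m: joint action
```

Existing entries are never overwritten, so the values learned for the single-agent pairs remain available when the agent falls back to acting alone.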
