- In SARSA, this is done by choosing another action a’ following the same current ε-greedy policy and using r + γQ(s’, a’) as the target.
SARSA is called on-policy learning because the new action a’ is chosen using the same ε-greedy policy as the action a, the one that generated s’.
(the policy derived from Q)
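As a minimal sketch of this update (assuming a tabular Q stored as a NumPy array with integer state and action indices; the function name and the α = 0.5 step size are my own choices, not from the original):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """One SARSA update: the target r + gamma * Q[s', a'] uses the
    action a_next actually chosen by the same eps-greedy policy
    (this is what makes SARSA on-policy)."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```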
- In Q-learning, this is done by choosing the greedy action, i.e. the action that maximizes the Q-value function at the new state, and using r + γ max_a Q(s’, a) as the target.
Q-learning is called off-policy learning because the new action is taken greedily, not using the current ε-greedy policy as in SARSA.
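The corresponding sketch for Q-learning (same hypothetical tabular setup as assumed above; note that no a’ argument is needed, since the target maximizes over actions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """One Q-learning update: the target r + gamma * max_a Q[s', a]
    uses the greedy action at s_next, regardless of which action the
    behaviour policy will actually take (off-policy)."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```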
After this update, we repeat the process from the new state s’ until we reach the end state.
- The new action a’ is not actually taken, but only chosen to compute the target in the updating equation: r + γQ(s’, a’) for SARSA and r + γ max_a Q(s’, a) for Q-learning.
- The action a determines the next state s’ we will consider for updating. In both cases a is chosen using the current ε-greedy policy, which guarantees exploration.
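A possible sketch of the ε-greedy action selection used in both algorithms (the helper name and the NumPy `default_rng` choice are mine):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """With probability epsilon pick a uniformly random action
    (exploration), otherwise pick the greedy action argmax_a Q[s, a]."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore
    return int(np.argmax(Q[s]))              # exploit
```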
- SARSA will learn the optimal ε-greedy policy, i.e. its Q-value function will converge to the optimal Q-value function, but only within the space of ε-greedy policies (as long as each state-action pair is visited infinitely often). We expect that in the limit of ε decaying to 0, SARSA converges to the overall optimal policy. I quote here a paragraph from the ‘Reinforcement Learning: An Introduction’ book by Sutton & Barto, section 6.4:
The convergence properties of the Sarsa algorithm depend on the nature of the policy’s dependence on Q. For example, one could use ε-greedy or ε-soft policies. According to Satinder Singh (personal communication), Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with ε-greedy policies by setting ε = 1/t), but this result has not yet been published in the literature.
- Q-learning, on the other hand, will converge to the optimal policy.
To illustrate the difference between the two methods, we consider a grid-world example of cliff walking, which is mentioned in the Sutton & Barto book as well, but here I try to explain it in more detail.
The grid-world is of size 4×12, where the cliff occupies the 10 cells on the last row between the start and goal states. Each step that makes the agent fall into the cliff receives a reward of -100; every other step receives -1.
We will take ε = 0.1 in our experiments. At each step, the agent has a 10% chance of not following the greedy action (the action with the highest Q-value).
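The environment dynamics described above can be sketched as a single step function (the row/column indexing convention and the reset-to-start behaviour after falling into the cliff are my own assumptions; the rewards match the description):

```python
def cliff_step(row, col, action):
    """One step in the 4x12 cliff grid. Rows 0..3 top to bottom;
    the cliff is row 3, columns 1..10; start at (3, 0), goal at (3, 11).
    Returns (new_row, new_col, reward, done)."""
    moves = {"UP": (-1, 0), "RIGHT": (0, 1), "DOWN": (1, 0), "LEFT": (0, -1)}
    dr, dc = moves[action]
    r = min(max(row + dr, 0), 3)    # clip to stay inside the grid
    c = min(max(col + dc, 0), 11)
    if r == 3 and 1 <= c <= 10:     # fell into the cliff: -100, back to start
        return 3, 0, -100, False
    if (r, c) == (3, 11):           # reached the goal
        return r, c, -1, True
    return r, c, -1, False          # ordinary step
```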
- As noted before, SARSA learns an ε-greedy optimal policy, and this policy is described by the safe path in the figure, which shows the greedy direction of this optimal policy. Why does the agent prefer to go further away from the cliff? Suppose we are on the 3rd row: the LEFT and RIGHT actions keep us on this row, and there is then a 10%/4 = 2.5% chance that an exploration step will send us into the cliff. Going DOWN is an immediate disaster. Remember that in SARSA we choose the next action based on the same ε-greedy policy. With respect to any ε-greedy policy whose ε is not too small, the cells just above the cliff are dangerous and will have small value functions.
- Q-learning, on the other hand, directly learns the optimal policy, which is described by the optimal path. During learning, since the agent always explores, in Q-learning the agent has more chances of falling into the cliff, because it prefers to stay in the third row (the greedy direction). However, this only affects which states we visit during learning; Q-learning still learns the optimal policy and the optimal value function. Remember that the target in the updating equation contains max_a Q(s’, a).
This can still be confusing, and I am not sure I fully understand it, but I think the concepts and differences are profound even though the updating equations look very similar.
So let’s look in detail at the values of the resulting state-action value function [Q(s, UP), Q(s, RIGHT), Q(s, DOWN), Q(s, LEFT)] at different states (cells) in the grid-world. Denote by Q_SARSA and Q_QL the state-action values obtained from SARSA and Q-learning respectively (after 10000 simulations, with ε = 0.1).
- At the cell just above the goal, the Q-learning values converge perfectly to the optimal state-action value function at this state. In SARSA, the values are more negative, since the underlying policy is ε-greedy.
- At the cell one step to the left of the cell above, the same pattern holds: the Q-learning values match the optimal ones, while the SARSA values are more negative.