Q Learning and SARSA Algorithms

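Q update rule (as used in the Q learning code in this post):

Q(s, a) ← (1 - α) Q(s, a) + α [r + γ max_a' Q(s', a')]

Here s is the current state, a the action taken, r the reward received, s' the new state, α the learning rate, and γ the discount factor.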

ε-greedy policy

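During Q learning updates, every step requires choosing a suitable action based on the current state and the Q function. This raises the question of how to balance "experience" (exploitation) against "exploration". If the agent acts purely on experience, always choosing the action with the largest value in q(state, :), it is likely to stay confined to what it already knows and struggle to discover new, more valuable behavior. But if the agent only explores, that is, acts completely at random, most of its actions will be worthless, so the Q function is learned very slowly.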

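A fairly simple way to balance "experience" and "exploration" is to choose actions with an ε-greedy policy. Fix a small value of ε in advance (for example ε = 0.1). With probability 1 - ε the agent acts according to the learned Q function (its existing experience); with the remaining probability ε it acts at random in order to gather new experience. For example, with ε = 0.1, in 90% of cases the agent directly picks the action that maximizes q(state, action), and in the remaining 10% of cases it picks an action at random.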
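The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the listing used later in this post: the function signature, the `rng` argument, and the Q values in `q_row` are assumptions made up for the example. Note that the random branch can also land on the greedy action, so with 4 actions and ε = 0.1 the greedy action ends up being picked in roughly 0.9 + 0.1/4 = 0.925 of all steps.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of the Q table."""
    if rng.random() < epsilon:
        # Explore: every action, including the greedy one, is equally likely.
        return int(rng.integers(len(q_row)))
    # Exploit: take the action with the largest estimated value.
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)
q_row = np.array([0.0, 0.5, 0.2, 0.1])  # made-up values; action 1 is greedy
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
print("share of greedy picks:", greedy_share)
```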

Off-policy and on-policy

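Reinforcement learning methods can be divided into two classes: off-policy and on-policy. Q learning is a classic off-policy method, while SARSA is an on-policy method. So how should off-policy and on-policy be understood here?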

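In Q learning, the update of the Q function depends on q[new_state, :].max(). The action that maximizes the Q function is picked out of q[new_state, :] and used to update the Q function. Call this action max_action. Note that the agent may never actually execute max_action: at the next step the action is chosen with the epsilon-greedy method, so max_action may or may not be selected. SARSA is different. It updates the Q function with q[new_state, new_action] together with the reward, and in the next loop iteration the agent is guaranteed to execute new_action.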
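To make the difference concrete, here is a small sketch that applies both update rules once to the same transition. The function names and the 2-state Q table are hypothetical; alpha and gamma match the values used in this post's listings. The next action handed to SARSA is deliberately the exploratory (bad) one.

```python
import numpy as np

alpha, gamma = 0.1, 0.9  # same values as the listings in this post

def q_learning_update(q, s, a, r, s2):
    # Off-policy target: best value in the next state, whatever is done next.
    target = r + gamma * q[s2].max()
    q[s, a] = (1 - alpha) * q[s, a] + alpha * target

def sarsa_update(q, s, a, r, s2, a2):
    # On-policy target: value of the action that will actually be taken next.
    target = r + gamma * q[s2, a2]
    q[s, a] = (1 - alpha) * q[s, a] + alpha * target

# Hypothetical 2-state table; in state 1, action 0 is good and action 3 is bad.
q_ql = np.zeros((2, 4))
q_ql[1] = [1.0, 0.0, 0.0, -1.0]
q_sa = q_ql.copy()

q_learning_update(q_ql, s=0, a=2, r=0.0, s2=1)
sarsa_update(q_sa, s=0, a=2, r=0.0, s2=1, a2=3)  # exploratory next action
print(q_ql[0, 2], q_sa[0, 2])
```

Starting from the same table, the Q learning entry moves up to about 0.09 while the SARSA entry moves down to about -0.09, because only SARSA sees the value of the action that is really taken next.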

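Q learning is called an off-policy algorithm because the action it uses when updating the Q function (max_action) may never be taken by the agent: the policy being evaluated in the update (greedy) differs from the policy that generates behavior (epsilon-greedy). SARSA is called an on-policy method because the action it uses in the update (new_action) is always the one the agent goes on to take, so the same epsilon-greedy policy is both followed and evaluated. This is the main difference between on-policy and off-policy methods.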

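Compared with Q learning, SARSA is more "timid". Q learning updates the Q value with q[new_state, :].max(); in other words, it considers only the largest value obtainable from the new state and ignores the risk that the new state carries. Q learning is therefore the more aggressive of the two. SARSA, by contrast, updates the Q value with q[new_state, new_action] only. In the maze problem used here, SARSA takes into account the negative return that can come from being close to a trap, so it is more inclined to stay where it is, which makes it harder for it to find the "treasure".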
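A rough way to see where the "timidity" comes from is to average the SARSA target over the epsilon-greedy choice of new_action and compare it with the Q learning target. The sketch below does that arithmetic for a made-up state next to a trap; the function names and Q values are assumptions, and it ignores the extra rule in this post's listings that forces a random action when a row is all zeros.

```python
import numpy as np

epsilon, gamma = 0.1, 0.9

def q_learning_target(q_row, reward):
    # Always assumes the best action will be taken in the new state.
    return reward + gamma * q_row.max()

def average_sarsa_target(q_row, reward):
    # Average of q[new_state, new_action] when new_action is epsilon-greedy:
    # greedy with probability 1 - epsilon, uniformly random with probability epsilon.
    avg_next = (1 - epsilon) * q_row.max() + epsilon * q_row.mean()
    return reward + gamma * avg_next

# Hypothetical state next to a trap: one action is worth 1, one costs 10.
near_trap = np.array([1.0, 0.0, 0.0, -10.0])
ql = q_learning_target(near_trap, 0.0)
sa = average_sarsa_target(near_trap, 0.0)
print("Q learning target:", ql, "average SARSA target:", sa)
```

With these numbers the Q learning target is 0.9, while the SARSA target averages about 0.6075, since the occasional exploratory step into the trap is priced in. States near traps therefore look less attractive to SARSA.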

# Q learning on the maze environment
from __future__ import print_function
import time

import numpy as np

from env import env  # the author's maze environment module (not shown here)

epsilon = 0.1   # exploration probability
alpha = 0.1     # learning rate
gamma = 0.9     # discount factor
max_step = 30   # step limit per episode
np.random.seed(0)


def epsilon_greedy(q, state):
    # Act randomly with probability epsilon, or when nothing has been
    # learned for this state yet (all q values are still zero).
    if (np.random.uniform() > 1 - epsilon) or ((q[state, :] == 0).all()):
        action = np.random.randint(0, 4)  # 0~3
    else:
        action = q[state, :].argmax()
    return action


e = env()
q = np.zeros((e.state_num, 4))
for i in range(200):
    e = env()  # start a new episode
    while (e.is_end is False) and (e.step < max_step):
        action = epsilon_greedy(q, e.present_state)
        state = e.present_state
        reward = e.interact(action)
        new_state = e.present_state
        # Off-policy update: the target uses the best action in new_state.
        q[state, action] = (1 - alpha) * q[state, action] + \
            alpha * (reward + gamma * q[new_state, :].max())
        e.print_map()
        time.sleep(0.1)
    print('episode:', i, 'total step:', e.step, 'total reward:', e.total_reward)
    time.sleep(2)

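# SARSA on the maze environment
from __future__ import print_function
import time

import numpy as np

from env import env  # the author's maze environment module (not shown here)

epsilon = 0.1   # exploration probability
alpha = 0.1     # learning rate
gamma = 0.9     # discount factor
max_step = 50   # step limit per episode
np.random.seed(1)


def epsilon_greedy(q, state):
    # Act randomly with probability epsilon, or when nothing has been
    # learned for this state yet (all q values are still zero).
    if (np.random.uniform() > 1 - epsilon) or ((q[state, :] == 0).all()):
        action = np.random.randint(0, 4)  # 0~3
    else:
        action = q[state, :].argmax()
    return action


e = env()
q = np.zeros((e.state_num, 4))
for i in range(200):
    e = env()  # start a new episode
    # The first action is chosen before the loop; later ones come from new_action.
    action = epsilon_greedy(q, e.present_state)
    while (e.is_end is False) and (e.step < max_step):
        state = e.present_state
        reward = e.interact(action)
        new_state = e.present_state
        new_action = epsilon_greedy(q, e.present_state)
        # On-policy update: the target uses the action that will be taken next.
        q[state, action] = (1 - alpha) * q[state, action] + \
            alpha * (reward + gamma * q[new_state, new_action])
        action = new_action
        e.print_map()
        time.sleep(0.1)
    print('episode:', i, 'total step:', e.step, 'total reward:', e.total_reward)
    time.sleep(2)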
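Both listings import env from a module that is not included in this post. For readers who want to try them anyway, below is a hypothetical stand-in that exposes the attributes and methods the listings use (state_num, present_state, step, total_reward, is_end, interact, print_map). It is not the author's maze: the corridor layout, the rewards, and the meaning of the four actions are all assumptions chosen to keep the sketch short.

```python
class Env:
    """Hypothetical stand-in for the maze: a corridor of 6 cells.

    Cell 0 is a trap (reward -1), cell 5 holds the treasure (reward +1).
    Actions: 0 = left, 1 = right, 2 and 3 = stay (kept so there are 4 actions).
    """
    state_num = 6

    def __init__(self):
        self.present_state = 2   # assumed start cell
        self.step = 0
        self.total_reward = 0
        self.is_end = False

    def interact(self, action):
        """Apply one action and return the reward for it."""
        if self.is_end:
            raise RuntimeError("episode already finished")
        move = {0: -1, 1: 1}.get(action, 0)
        self.present_state += move
        self.step += 1
        reward = 0
        if self.present_state == 0:
            reward, self.is_end = -1, True   # fell into the trap
        elif self.present_state == self.state_num - 1:
            reward, self.is_end = 1, True    # reached the treasure
        self.total_reward += reward
        return reward

    def print_map(self):
        # x = trap, o = treasure, A = agent
        cells = ["x"] + ["."] * (self.state_num - 2) + ["o"]
        cells[self.present_state] = "A"
        print("".join(cells))


e = Env()
rewards = [e.interact(1) for _ in range(3)]  # walk right: 2 -> 3 -> 4 -> 5
e.print_map()
```

To use it with the listings as written, one option is to save the class in env.py and add the line env = Env at the end, since the listings call env().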
