Theory basics

Notes:

1. Setting the hyperparameter `samples`: the size of the Q-table is grid_size * grid_size * action_size, and each Q(s, a) has to be visited about t = 20–50 times before it starts to converge, so the total number of updates needed is at least |q_table| * t. If each episode takes `step` steps on average, then `samples` has to be at least |q_table| * t / step. Roughly speaking, the number of episodes should be at least 50–200 times the Q-table size (a rough numeric estimate for the 5×5 grid used below is sketched at the end of this post).
2. `alpha` must not be too small, otherwise learning barely makes progress. For a small environment like this GridWorld, a value around 0.05~0.2 works well. It must not be too large either; 0.5, for example, is too large: the Q-values oscillate violently and the policy never stabilizes.

Code (runnable)

Differences:

1. SARSA starts from one specific start state and runs until it reaches the goal state, so only the policy along these episodes becomes optimal; for the other states it is not necessarily optimal.
2. SARSA is an iterative algorithm: every time an action value is updated, the policy is updated right away.

The script imports `GridWorldEnv` and `drow_policy` from `env.py` and `utils.py` (not shown in this post); a minimal stand-in for the environment interface is sketched at the end of this post.

```python
import numpy as np

from env import GridWorldEnv
from utils import drow_policy


class Sarsa(object):
    def __init__(self, env: GridWorldEnv, gamma=0.9, alpha=0.001, epsilon=0.1,
                 samples=1, start_state=(0, 0)):
        """
        :param env: the grid world configuration
        :param gamma: discount rate
        :param alpha: learning rate
        :param epsilon: epsilon for the epsilon-greedy policy update
        :param samples: number of episodes sampled from the start state to the goal
        :param start_state: start state
        """
        self.env = env
        self.action_space_size = self.env.num_actions  # up / down / left / right / stay
        self.state_space_size = self.env.num_states
        self.reward_list = self.env.reward_list
        self.gamma = gamma
        self.samples = samples
        self.alpha = alpha
        self.epsilon = epsilon
        self.start_state = self.env.state_id(start_state[0], start_state[1])
        self.policy = np.ones((self.state_space_size, self.action_space_size)) / self.action_space_size
        self.qvalues = np.zeros((self.state_space_size, self.action_space_size))

    def solve(self):
        for i in range(self.samples):
            s = self.start_state
            a = np.random.choice(self.action_space_size, p=self.policy[s])
            while s not in self.env.terminal:
                next_s, next_r, _ = self.env.step(s, a)
                # sample a_{t+1} from the current policy pi_t(s_{t+1})
                next_a = np.random.choice(self.action_space_size, p=self.policy[next_s])

                # update the q-value for (s_t, a_t):
                # q_{t+1}(s_t, a_t) = q_t(s_t, a_t)
                #     - alpha_t(s_t, a_t) * [q_t(s_t, a_t) - (r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1}))]
                td_target = next_r + self.gamma * self.qvalues[next_s][next_a]
                td_error = td_target - self.qvalues[s][a]  # the minus sign is factored out
                self.qvalues[s][a] += self.alpha * td_error

                # epsilon-greedy policy update for s_t
                best_a = np.argmax(self.qvalues[s])
                self.policy[s] = self.epsilon / self.action_space_size
                self.policy[s, best_a] = 1 - self.epsilon + self.epsilon / self.action_space_size

                s, a = next_s, next_a


if __name__ == "__main__":
    env = GridWorldEnv(
        size=5,
        forbidden=[(1, 2), (3, 3)],
        terminal=[(4, 4)],
        r_boundary=-1,
        r_other=-0.04,
        r_terminal=1,
        r_forbidden=-1,
        r_stay=-0.1
    )
    # Note: samples should be fairly large, otherwise each state is visited with low probability
    vi = Sarsa(env=env, gamma=0.9, alpha=0.01, epsilon=0.1, samples=5000, start_state=(0, 0))
    vi.solve()
    print("\n action values: ")
    print(vi.qvalues)
    drow_policy(vi.policy, env)
```

Results

Running the script prints the learned action values and draws the resulting policy with `drow_policy`.
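To make the `samples` heuristic above concrete, here is a rough estimate for the 5×5 grid configured in the script. The values of `visits_per_entry` and `avg_steps` are assumptions chosen for illustration, not numbers measured from the environment.

```python
# Back-of-the-envelope estimate for the samples hyperparameter (sketch with assumed numbers).
grid_size = 5
action_size = 5          # up / down / left / right / stay
visits_per_entry = 30    # assumed: each Q(s, a) should be updated roughly 20-50 times
avg_steps = 10           # assumed average episode length under an epsilon-greedy policy

q_table_size = grid_size * grid_size * action_size   # 125 entries
total_updates = q_table_size * visits_per_entry      # 3750 updates needed in total
min_episodes = total_updates // avg_steps            # ~375 episodes as a crude lower bound

print(q_table_size, total_updates, min_episodes)
```

This bound is much smaller than the samples=5000 used in the script; because the start state is fixed and exploration is epsilon-greedy, states far from the greedy path are updated far less often than the average, so a noticeably larger value is still reasonable.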
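If `GridWorldEnv` and `drow_policy` are not available, the `Sarsa` class can still be exercised against a small stand-in environment. The sketch below only mirrors the interface the class actually touches (`num_actions`, `num_states`, `reward_list`, `terminal`, `state_id`, `step`); the action ordering, the row-major state numbering, and the reward handling for forbidden cells are assumptions, not the original `env.py`.

```python
class MiniGridWorldEnv:
    """Minimal stand-in for GridWorldEnv (assumed interface, not the original env.py)."""

    # assumed action order: up, down, left, right, stay
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]

    def __init__(self, size=5, forbidden=(), terminal=(), r_boundary=-1,
                 r_other=-0.04, r_terminal=1, r_forbidden=-1, r_stay=-0.1):
        self.size = size
        self.num_actions = len(self.MOVES)
        self.num_states = size * size
        self.forbidden = {self.state_id(r, c) for r, c in forbidden}
        self.terminal = {self.state_id(r, c) for r, c in terminal}
        self.reward_list = [r_boundary, r_other, r_terminal, r_forbidden, r_stay]
        self.rewards = {"boundary": r_boundary, "other": r_other, "terminal": r_terminal,
                        "forbidden": r_forbidden, "stay": r_stay}

    def state_id(self, row, col):
        # row-major numbering of the grid cells (assumed layout)
        return row * self.size + col

    def step(self, s, a):
        """Return (next_state, reward, done) for taking action a in state s."""
        row, col = divmod(s, self.size)
        dr, dc = self.MOVES[a]
        nr, nc = row + dr, col + dc
        if not (0 <= nr < self.size and 0 <= nc < self.size):
            return s, self.rewards["boundary"], False        # bounce back off the wall
        next_s = self.state_id(nr, nc)
        if next_s in self.terminal:
            return next_s, self.rewards["terminal"], True
        if next_s in self.forbidden:
            return next_s, self.rewards["forbidden"], False  # assumed: enterable but penalized
        if (dr, dc) == (0, 0):
            return next_s, self.rewards["stay"], False
        return next_s, self.rewards["other"], False
```

With this stub, `Sarsa(env=MiniGridWorldEnv(size=5, forbidden=[(1, 2), (3, 3)], terminal=[(4, 4)]), samples=5000, start_state=(0, 0))` should run end to end; only the final `drow_policy` call needs to be dropped or replaced by a plain `print(vi.policy)`.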