Citation: WANG Xue-song, ZHANG Zheng, CHENG Yu-hu, ZHANG Yi-yang. Recursive Least Squares Policy Iteration Based on Geodesic Gaussian Basis Function[J]. INFORMATION AND CONTROL, 2009, 38(4): 406-411.

Recursive Least Squares Policy Iteration Based on Geodesic Gaussian Basis Function


    Abstract: An appropriate selection of basis functions directly influences the learning performance of a policy iteration method during value function approximation. To better describe the topological relationship of the environment, the geodesic distance is substituted for the Euclidean distance used in an ordinary Gaussian function, and a policy iteration reinforcement learning method based on geodesic Gaussian basis functions is proposed. First, a graph describing the environment is built from sample data generated by a Markov decision process (MDP). Second, geodesic Gaussian basis functions are defined on the graph, and the geodesic distance is approximated by the shortest path obtained with the shortest path faster algorithm (SPFA). Then, the state-action value function of the learning system is assumed to be a linearly weighted sum of the given geodesic Gaussian basis functions, and a recursive least squares method is used to update the weights in an online, incremental manner. Finally, policy improvement is carried out based on the estimated state-action values. Simulation results on 10×10 and 20×20 mazes illustrate the validity of the proposed policy iteration method.
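    The method summarized above combines two computational pieces that a short sketch can illustrate: basis functions shaped by graph shortest-path (geodesic) distances, and a recursive least squares update of the linear weights of the state-action value function. The sketch below is a minimal illustration, not the authors' implementation: it assumes a wall-free n×n grid as the maze graph, hand-picked basis centres, plain breadth-first search in place of the shortest path faster algorithm (equivalent here, since all edges have unit weight), state-only basis functions with one weight vector per action, and arbitrary values for sigma and the RLS initialization constant delta; the construction of the regression target from sampled transitions during policy evaluation is omitted.

```python
# Minimal sketch (assumptions, not the authors' code): geodesic Gaussian basis
# functions on a wall-free grid maze, plus a recursive least squares (RLS)
# update of the linear Q-function weights. Grid size, centre placement, sigma
# and the RLS constant delta are illustrative choices only.
import numpy as np
from collections import deque


def grid_graph(n):
    """Adjacency list of a 4-connected n x n grid; states are 0 .. n*n-1."""
    adj = {s: [] for s in range(n * n)}
    for r in range(n):
        for c in range(n):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    adj[r * n + c].append(rr * n + cc)
    return adj


def geodesic_distances(adj, source):
    """Shortest-path (geodesic) distance from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def geodesic_gaussian_features(adj, centers, sigma):
    """phi_i(s) = exp(-d_G(s, c_i)^2 / (2 sigma^2)) with graph distance d_G."""
    phi = np.zeros((len(adj), len(centers)))
    for i, c in enumerate(centers):
        for s, d in geodesic_distances(adj, c).items():
            phi[s, i] = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    return phi


class RLSQ:
    """Q(s, a) = w_a . phi(s): one weight vector per action, updated by RLS."""

    def __init__(self, n_features, n_actions, delta=1.0):
        self.w = np.zeros((n_actions, n_features))
        self.P = [np.eye(n_features) / delta for _ in range(n_actions)]

    def update(self, phi_s, a, target):
        """One incremental RLS step toward the scalar regression target."""
        P = self.P[a]
        Pphi = P @ phi_s
        gain = Pphi / (1.0 + phi_s @ Pphi)
        self.w[a] += gain * (target - self.w[a] @ phi_s)
        self.P[a] = P - np.outer(gain, Pphi)


# Usage on an assumed 10 x 10 maze with 20 basis centres and sigma = 3:
adj = grid_graph(10)
phi = geodesic_gaussian_features(adj, centers=list(range(0, 100, 5)), sigma=3.0)
agent = RLSQ(n_features=20, n_actions=4)
agent.update(phi[0], a=2, target=1.0)  # one sampled (state, action, Q-target) triple
```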

     
