Recursive Least Squares Policy Iteration Based on Geodesic Gaussian Basis Function

Wang Xuesong (王雪松); Zhang Zheng (张政); Cheng Yuhu (程玉虎); Zhang Yiyang (张依阳)

Abstract

In policy iteration reinforcement learning, the choice of basis functions for value function approximation directly affects learning performance. To better describe the topology of the environment, the Euclidean distance in the ordinary Gaussian function is replaced by the geodesic distance, and a policy iteration reinforcement learning method based on geodesic Gaussian basis functions is proposed. First, a graph describing the environment is built from sample data generated by a Markov decision process (MDP). Second, geodesic Gaussian basis functions are defined on the graph, and the geodesic distance is approximated by the shortest path computed with the shortest path faster algorithm. Then, the state-action value function of the reinforcement learning system is assumed to be a weighted combination of the given geodesic Gaussian basis functions, and a recursive least squares method is used to update the weights online and incrementally. Finally, the policy is improved based on the estimated value function. Simulation results on 10×10 and 20×20 maze problems verify the effectiveness of the proposed policy iteration method.
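The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sketch assumes a graph built directly from a small grid maze rather than from MDP samples, unit edge weights, and a Gaussian of the form exp(-d²/(2σ²)). The weight update is a common recursive LSTD-Q form, which may differ from the exact recursion used in the paper. The names `grid_graph`, `spfa`, `geodesic_gaussian`, and `RLS` are hypothetical.

```python
from collections import deque
import math


def grid_graph(grid):
    """Build a 4-neighbour graph over free cells ('.'); '#' marks a wall.
    Assumption: the paper builds its graph from sampled transitions instead."""
    adj = {}
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "#":
                continue
            adj[(r, c)] = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < len(grid) and 0 <= nc < len(row) and grid[nr][nc] != "#":
                    adj[(r, c)].append(((nr, nc), 1.0))  # unit edge weight
    return adj


def spfa(adj, source):
    """Shortest path faster algorithm: queue-based Bellman-Ford relaxation."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    queue = deque([source])
    in_queue = {source}
    while queue:
        u = queue.popleft()
        in_queue.discard(u)
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                if v not in in_queue:  # enqueue each node at most once at a time
                    queue.append(v)
                    in_queue.add(v)
    return dist


def geodesic_gaussian(dist_from_center, state, sigma):
    """Gaussian basis value using shortest-path distance instead of Euclidean."""
    d = dist_from_center[state]
    return math.exp(-d * d / (2.0 * sigma ** 2))


class RLS:
    """Recursive least squares for linear value weights (recursive LSTD-Q form).
    Assumption: P starts as delta * I, acting as a small regularizer."""

    def __init__(self, n, delta=100.0):
        self.w = [0.0] * n
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]

    def update(self, phi, phi_next, reward, gamma):
        n = len(self.w)
        d = [phi[i] - gamma * phi_next[i] for i in range(n)]
        P_phi = [sum(self.P[i][j] * phi[j] for j in range(n)) for i in range(n)]
        d_P = [sum(d[i] * self.P[i][j] for i in range(n)) for j in range(n)]
        denom = 1.0 + sum(d[i] * P_phi[i] for i in range(n))
        err = reward - sum(d[i] * self.w[i] for i in range(n))
        for i in range(n):
            self.w[i] += P_phi[i] / denom * err
            for j in range(n):
                self.P[i][j] -= P_phi[i] * d_P[j] / denom


# A wall separates the two top corners, so their geodesic distance (6 steps)
# is much larger than their Euclidean distance (2 cells).
grid = [".#.",
        ".#.",
        "..."]
adj = grid_graph(grid)
dist = spfa(adj, (0, 0))  # shortest-path distances from the centre (0, 0)
phi_far = geodesic_gaussian(dist, (0, 2), sigma=2.0)
```

In this toy maze the basis function centred at (0, 0) takes a much smaller value at (0, 2) than an ordinary Gaussian would, because the wall forces a long path between the two cells. This is the topological effect that motivates the geodesic distance. For policy evaluation, `RLS.update` would be called once per sample with the feature vectors of the current and next state-action pairs.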
