Recursive Least Squares Policy Iteration Based on Geodesic Gaussian Basis Function

WANG Xue-song, ZHANG Zheng, CHENG Yu-hu, ZHANG Yi-yang

Citation: WANG Xue-song, ZHANG Zheng, CHENG Yu-hu, ZHANG Yi-yang. Recursive Least Squares Policy Iteration Based on Geodesic Gaussian Basis Function[J]. INFORMATION AND CONTROL, 2009, 38(4): 406-411.


Foundation items: Program for New Century Excellent Talents in University, Ministry of Education (NCET-08-0836); National Natural Science Foundation of China (60804022); Natural Science Foundation of Jiangsu Province (BK2008126); Specialized Research Fund for the Doctoral Program of Higher Education (20070290537, 200802901506); Open Project of the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences (20070102)
    Author biographies:

    WANG Xue-song (1974- ), female, Ph.D., associate professor. Her research interests include machine learning and the optimization and control of complex systems.
    ZHANG Zheng (1983- ), female, M.S. candidate. Her research interest is reinforcement learning.
    CHENG Yu-hu (1973- ), male, Ph.D., associate professor. His research interests include machine learning and intelligent robots.

  • CLC number: TP18


  • Abstract: An appropriate selection of basis functions directly influences the learning performance of a policy iteration method during value function approximation. To better describe the topological relationships of an environment, the geodesic distance is substituted for the Euclidean distance used in an ordinary Gaussian function, and a policy iteration reinforcement learning method based on geodesic Gaussian basis functions is proposed. First, a graph describing the environment is built from sample data generated by a Markov decision process (MDP). Second, geodesic Gaussian basis functions are defined on the graph, and shortest paths obtained by the shortest path faster algorithm (SPFA) are used to approximate geodesic distances. Then, the state-action value function of the learning system is assumed to be a linearly weighted sum of the given geodesic Gaussian basis functions, and a recursive least squares method is used to update the weights in an online, incremental manner. Finally, policy improvement is carried out based on the estimated state-action values. Simulation results on 10×10 and 20×20 mazes illustrate the validity of the proposed policy iteration method.
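To make the steps in the abstract concrete, here is a minimal Python sketch of the pipeline on an open 10×10 grid. Everything specific in it is an assumption for illustration, not taken from the paper: the wall-free grid, the subgrid of basis centers, the width SIGMA, and the names (neighbors, features, RLSQ, greedy_action). Breadth-first search stands in for the SPFA shortest-path algorithm used in the paper, since on a unit-cost grid graph BFS already returns exact geodesic distances; the weight update follows the standard RLS-TD recursion rather than the paper's exact equations.

```python
# Minimal sketch of geodesic-Gaussian-basis policy iteration (assumed details noted).
from collections import deque

import numpy as np

N = 10                                         # 10x10 maze, as in the experiments
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def neighbors(s):
    """States reachable from s in one step. The paper builds this graph from
    MDP sample transitions; an open grid with no walls is assumed here."""
    r, c = s
    for dr, dc in ACTIONS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < N and 0 <= nc < N:
            yield (nr, nc)

def geodesic_distances(source):
    """Shortest-path (geodesic) distance from source to every state. BFS is
    exact on a unit-cost graph; the paper's SPFA also handles weighted edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        s = queue.popleft()
        for t in neighbors(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return dist

# Basis-function centers on a coarse subgrid (an assumed layout), with one
# precomputed geodesic distance map per center.
CENTERS = [(r, c) for r in range(0, N, 3) for c in range(0, N, 3)]
DIST_MAPS = [geodesic_distances(c) for c in CENTERS]
SIGMA = 2.0                                    # Gaussian width, hand-picked

def features(s, a):
    """Geodesic Gaussian features for a state-action pair: a Gaussian of the
    geodesic (not Euclidean) distance to each center, replicated per action
    so the linear architecture can represent Q(s, a)."""
    phi = np.zeros(len(CENTERS) * len(ACTIONS))
    base = ACTIONS.index(a) * len(CENTERS)
    for k, dist in enumerate(DIST_MAPS):
        phi[base + k] = np.exp(-dist[s] ** 2 / (2.0 * SIGMA ** 2))
    return phi

class RLSQ:
    """Q(s, a) = w . phi(s, a), with w updated online by recursive least
    squares on the TD relation r ~ (phi - gamma * phi') . w."""

    def __init__(self, n, delta=1.0, gamma=0.95):
        self.w = np.zeros(n)
        self.P = np.eye(n) / delta             # inverse-covariance estimate
        self.gamma = gamma

    def update(self, phi, reward, phi_next):
        u = phi - self.gamma * phi_next        # TD feature difference
        p_phi = self.P @ phi
        gain = p_phi / (1.0 + u @ p_phi)       # RLS gain vector
        self.w += gain * (reward - u @ self.w) # correct w by the TD residual
        self.P -= np.outer(gain, u @ self.P)   # rank-one update of P

def greedy_action(s, q):
    """Policy improvement: act greedily w.r.t. the estimated Q."""
    return max(ACTIONS, key=lambda a: q.w @ features(s, a))

# One sampled transition (s, a, r, s'), evaluated with the greedy next action.
q = RLSQ(len(CENTERS) * len(ACTIONS))
s, a, reward, s_next = (0, 0), (0, 1), -1.0, (0, 1)
q.update(features(s, a), reward, features(s_next, greedy_action(s_next, q)))
```

A full agent would interleave many such updates with episodes of sampling and repeated policy improvement; the single-transition usage at the end only shows the data flow through the basis functions and the RLS update.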
Publication history
  • Received: 2008-11-29
  • Published: 2009-08-19
