Abstract:
The choice of basis functions directly influences the learning performance of a policy iteration method during value function approximation. To better capture the topology of the environment, the geodesic distance is substituted for the Euclidean distance used in an ordinary Gaussian function, and a policy iteration reinforcement learning method based on geodesic Gaussian basis functions is proposed. First, a graph of the environment is built from sample data generated by a Markov decision process (MDP). Second, geodesic Gaussian basis functions are defined on this graph, with the geodesic distance approximated by the shortest path computed by the shortest path faster algorithm (SPFA). The state-action value function of the learning system is then represented as a linearly weighted sum of the geodesic Gaussian basis functions, and the weights are updated online and incrementally with a recursive least squares method. Finally, policy improvement is carried out based on the estimated state-action values. Simulation results on 10×10 and 20×20 mazes demonstrate the validity of the proposed policy iteration method.
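The following is a minimal sketch (not the authors' code) of how geodesic Gaussian basis functions can be built on a state graph: SPFA computes shortest-path distances, which replace the Euclidean distance inside an ordinary Gaussian. The edge list, the basis centers, and the width sigma are illustrative assumptions; the abstract does not specify how the graph is constructed from the MDP samples.

```python
from collections import deque
import numpy as np

def spfa_distances(n_states, edges, source):
    """Shortest-path distances from `source` on an undirected weighted graph,
    computed with the shortest path faster algorithm (queue-based Bellman-Ford)."""
    adj = [[] for _ in range(n_states)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = [float("inf")] * n_states
    dist[source] = 0.0
    in_queue = [False] * n_states
    q = deque([source])
    in_queue[source] = True
    while q:
        u = q.popleft()
        in_queue[u] = False
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                if not in_queue[v]:
                    q.append(v)
                    in_queue[v] = True
    return np.array(dist)

def geodesic_gaussian_features(n_states, edges, centers, sigma):
    """phi_i(s) = exp(-d_G(s, c_i)^2 / (2*sigma^2)), where d_G is the graph
    shortest-path (geodesic) distance rather than the Euclidean distance."""
    feats = np.zeros((n_states, len(centers)))
    for i, c in enumerate(centers):
        d = spfa_distances(n_states, edges, c)
        feats[:, i] = np.exp(-d**2 / (2.0 * sigma**2))
    return feats
```

Unreachable states receive an infinite distance and hence a zero feature value, so the basis respects walls and disconnected regions of a maze in a way an ordinary Gaussian on grid coordinates cannot.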
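A second sketch, under assumed details, of the online recursive least squares (RLS) update for the weights w in Q(s,a) = phi(s,a)^T w. The forgetting factor and the one-step bootstrapped target are illustrative choices; the abstract states only that the weights are updated online and incrementally by recursive least squares.

```python
import numpy as np

class RLSQApproximator:
    def __init__(self, n_features, lam=1.0, delta=1.0):
        self.w = np.zeros(n_features)        # weights of the linear Q-model
        self.P = np.eye(n_features) / delta  # inverse-correlation matrix estimate
        self.lam = lam                       # forgetting factor (1.0 = no forgetting)

    def q(self, phi_sa):
        """Approximate state-action value for the feature vector phi(s, a)."""
        return float(phi_sa @ self.w)

    def update(self, phi_sa, target):
        """One incremental RLS step pulling phi(s,a)^T w toward `target`."""
        Px = self.P @ phi_sa
        k = Px / (self.lam + phi_sa @ Px)          # gain vector
        self.w += k * (target - phi_sa @ self.w)   # weight correction
        self.P = (self.P - np.outer(k, Px)) / self.lam
```

In a policy evaluation step, `target` could be the one-step bootstrap r + gamma * q(phi(s', a')) under the current policy, after which policy improvement selects actions greedily with respect to the estimated state-action values.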