Investigating the behaviour of Q(λ)

Jeremy Wyatt

Year of publication: 1996
Citations: 2

Abstract

There is a spectrum of methods for learning robot control. At one end there are model-free methods (e.g. Q-learning, AHC), and at the other there are model-based methods (e.g. dynamic programming). The advantage of one technique is the weakness of the other. Model-based methods use experience effectively, but are computationally expensive; model-free methods are computationally cheap, but require an order of magnitude more experience. In the middle ground there are now methods such as Q-DYNA and prioritised sweeping, which use a learned model to speed up temporal credit assignment. The optimal trade-off depends on the relative cost of experience and computation for a particular task. Unfortunately, we frequently do not know the cost balance for a particular task. Hence the goal of this work is to understand more about the sorts of methods that might work well across a wide variety of cost ratios, and in particular how model-free methods might be extended. In this paper we examine the behaviour of one such model-free algorithm, Q(λ). This algorithm shows promise because it combines the best features of Sutton's TD(λ) algorithm (1988) with those of Watkins' Q-learning (1989). Despite being a model-free algorithm, it has been reported to outperform prioritised sweeping, the current best method for learning a policy and a model at the same time. Here we look at how its performance is affected by the choice between replacing and accumulating traces, and at the problem of exploration sensitivity. (3 pages)
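
To make the distinction between replacing and accumulating traces concrete, the following is a minimal sketch of a single tabular Q(λ) update. It assumes a Watkins-style variant, in which traces are cut after a non-greedy action; the abstract does not say which variant of Q(λ) the paper studies, so this should not be read as the author's implementation. The function name `q_lambda_step`, the array layout (`Q` and `E` indexed by state and action) and the default parameter values are all illustrative assumptions.

```python
import numpy as np


def q_lambda_step(Q, E, s, a, r, s_next, a_next, *,
                  alpha=0.1, gamma=0.9, lam=0.8, trace="replacing"):
    """Apply one Watkins-style Q(lambda) update in place (illustrative sketch).

    Q and E are float arrays of shape (n_states, n_actions) holding the
    action values and the eligibility traces. Returns the TD error.
    """
    # Decide, before Q changes, whether the next action is greedy.
    best_next = Q[s_next].max()
    next_is_greedy = Q[s_next, a_next] == best_next

    # One-step Q-learning error, bootstrapping from the greedy next value.
    delta = r + gamma * best_next - Q[s, a]

    # The two trace types differ only in how a revisited pair is marked.
    if trace == "accumulating":
        E[s, a] += 1.0   # repeated visits build the trace up past 1
    elif trace == "replacing":
        E[s, a] = 1.0    # repeated visits reset the trace to exactly 1
    else:
        raise ValueError(f"unknown trace type: {trace!r}")

    # Every state-action pair is updated in proportion to its trace.
    Q += alpha * delta * E

    # Decay traces after a greedy action; cut them after an exploratory one.
    if next_is_greedy:
        E *= gamma * lam
    else:
        E[:] = 0.0
    return delta
```

Under these assumptions, revisiting the same state-action pair twice leaves a larger trace with `trace="accumulating"` than with `trace="replacing"`, which is the difference the abstract refers to. The trace cut after a non-greedy action also hints at why such an algorithm can be sensitive to the amount of exploration.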

Keywords

Computer science; Reinforcement learning; Task (project management); Robot; Computation; Variety (cybernetics); Artificial intelligence; Relation (database); Machine learning; Dynamic programming
