Author

Aberdeen, Douglas

Date
Description
Partially observable Markov decision processes (POMDPs) are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling POMDPs. In the most general case, POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memoryless policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.
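The abstract's core idea, gradient ascent on the long-term average reward for an agent that must also learn its own memory, can be illustrated with a small sketch. The code below is not taken from the thesis; it is a minimal, IState-GPOMDP-style Monte-Carlo gradient estimator for a hypothetical toy POMDP, where the sizes, dynamics, step size, and trajectory length are all assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

N_OBS, N_ISTATE, N_ACT = 1, 2, 2  # toy sizes, assumed for illustration only

# Finite-state controller parameters: action logits and internal-state
# (memory) transition logits, both conditioned on (internal state, observation).
theta_act = np.zeros((N_ISTATE, N_OBS, N_ACT))
theta_mem = np.zeros((N_ISTATE, N_OBS, N_ISTATE))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step_world(s, a):
    """Toy hidden dynamics (assumed): the hidden state flips every step
    regardless of the action, and the reward is +1 when the action matches
    the hidden state. The single, uninformative observation means a
    memoryless policy averages at most 0.5 reward; remembering parity gives 1.0."""
    r = 1.0 if a == s else 0.0
    return 1 - s, r, 0  # next hidden state, reward, (constant) observation

def gpomdp_estimate(T=20000, beta=0.95):
    """Single-trajectory Monte-Carlo estimate of the gradient of the long-term
    average reward, with a beta-discounted eligibility trace accumulating the
    scores of both the action choice and the internal-state (memory) choice."""
    g_act = np.zeros_like(theta_act); z_act = np.zeros_like(theta_act)
    g_mem = np.zeros_like(theta_mem); z_mem = np.zeros_like(theta_mem)
    s, i, o, total_r = 0, 0, 0, 0.0
    for _ in range(T):
        p_mem = softmax(theta_mem[i, o])
        i_next = rng.choice(N_ISTATE, p=p_mem)   # sample next internal state
        p_act = softmax(theta_act[i_next, o])
        a = rng.choice(N_ACT, p=p_act)           # sample action

        # Decay the traces and add the score (log-likelihood gradient) of each choice.
        z_act *= beta; z_mem *= beta
        z_act[i_next, o] += np.eye(N_ACT)[a] - p_act
        z_mem[i, o] += np.eye(N_ISTATE)[i_next] - p_mem

        s, r, o = step_world(s, a)
        g_act += r * z_act
        g_mem += r * z_mem
        total_r += r
        i = i_next
    return g_act / T, g_mem / T, total_r / T

# Plain stochastic gradient ascent on the average reward: the agent must learn
# both what to do (theta_act) and what to remember (theta_mem).
for epoch in range(30):
    g_act, g_mem, avg_r = gpomdp_estimate()
    theta_act += 1.0 * g_act
    theta_mem += 1.0 * g_mem
    print(f"epoch {epoch:2d}  average reward {avg_r:.3f}")

Because the observation carries no information in this toy problem, any improvement above an average reward of 0.5 must come from the learned internal-state transitions, which is the "what to remember" part of the problem; the high variance and slow progress of such from-scratch memory learning is exactly the difficulty the abstract alludes to.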
GUID
oai:openresearch-repository.anu.edu.au:1885/48180
Identifier
oai:openresearch-repository.anu.edu.au:1885/48180
Identifiers
b21435406
http://hdl.handle.net/1885/48180
10.25911/5d7a2b73cbb88
https://openresearch-repository.anu.edu.au/bitstream/1885/48180/1/02whole.pdf.jpg
Publication Date
Titles
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes