Abstract
Existing off-policy evaluation (OPE) methods typically estimate the value of a policy, but in real-world applications OPE is often used to compare and rank candidate policies before they are deployed; this is known as the offline policy ranking problem. Although policies can be ranked by OPE point estimates, estimating the full distribution of outcomes is more informative for policy ranking and selection. This paper introduces Probabilistic Offline Policy Ranking (POPR), which works with expert trajectories. POPR brings rigorous statistical inference capabilities to offline evaluation, enabling probabilistic comparisons of candidate policies before deployment. We empirically demonstrate that POPR is effective for evaluating RL policies across various environments.
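The kind of distributional comparison the abstract describes can be illustrated with a minimal Monte Carlo sketch. This is not the thesis's POPR algorithm; the return distributions and all names here are hypothetical, standing in for whatever outcome distributions an OPE procedure would estimate for each candidate policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sampled outcome (return) distributions for two candidate
# policies, e.g. posterior samples of each policy's value produced by
# some offline evaluation procedure.
returns_a = rng.normal(loc=1.0, scale=0.5, size=10_000)
returns_b = rng.normal(loc=0.8, scale=0.5, size=10_000)

# Probabilistic ranking: rather than comparing two point estimates,
# estimate P(policy A outperforms policy B) from paired samples.
p_a_beats_b = float(np.mean(returns_a > returns_b))
print(f"P(A > B) ~ {p_a_beats_b:.3f}")
```

A point estimate would only report that A's mean return exceeds B's; the probabilistic comparison additionally quantifies how confident that ranking is, which is the advantage of estimating the full outcome distribution.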
Degree
MS
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Schwantes, Trevor F., "POPR: Probabilistic Offline Policy Ranking with Expert Data" (2023). Theses and Dissertations. 10350.
https://scholarsarchive.byu.edu/etd/10350
Date Submitted
2023-04-26
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd13188
Keywords
offline reinforcement learning, off-policy evaluation, off-policy ranking
Language
English