Science Cast

Generalization in offline RL: The structure is more important than the amount of pessimism

Max Weltevrede, July 3, 2026 1:50am



Connected to paper. This paper is a preprint and has not been certified by peer review.

Generalization in offline RL: The structure is more important than the amount of pessimism

arXiv, PDF. July 2, 2026 12:00am

Authors

Max Weltevrede, Matthijs T. J. Spaan, Wendelin Böhmer

Abstract

While pessimism counteracts overestimation bias in offline reinforcement learning (RL), excessive conservatism has been associated with hindering certain forms of generalization. In this paper, however, we demonstrate that being overly pessimistic does not inherently prevent optimal generalization in contextual MDPs (CMDPs). Instead, we argue that successful generalization depends not on the amount of pessimism, but on whether the pessimistic structure respects the underlying symmetries of the optimal solution. We prove that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the dataset coverage. As such, enforcing a symmetric value function can be non-trivial and might require techniques such as data augmentation (DA). Inspired by our theoretical results, we argue that DA is best applied through a consistency loss during policy extraction, rather than through the common practice of (regular) offline training on an augmented dataset. We validate this empirically using IQL and CQL on a rotationally symmetric reacher environment.
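To make the idea of a consistency loss concrete, here is a minimal toy sketch. It is not the paper's implementation: the abstract does not give the exact loss, so the squared-error form, the 2D setting in which actions rotate together with states, and the names `rotate`, `consistency_loss`, `symmetric_policy` and `asymmetric_policy` are all assumptions made for illustration. The sketch only shows how such a term could penalize a policy whose outputs do not respect a rotational symmetry.

```python
import math
import random

def rotate(v, theta):
    """Rotate a 2D vector v = (x, y) counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def consistency_loss(policy, states, thetas):
    """Mean squared gap between policy(rotate(s)) and rotate(policy(s)).

    Assumed form: zero exactly when the policy commutes with the sampled
    rotations on the given states.
    """
    total = 0.0
    for s, theta in zip(states, thetas):
        a_of_rotated = policy(rotate(s, theta))   # act in the augmented state
        rotated_a = rotate(policy(s), theta)      # augment the original action
        total += (a_of_rotated[0] - rotated_a[0]) ** 2
        total += (a_of_rotated[1] - rotated_a[1]) ** 2
    return total / len(states)

def symmetric_policy(s):
    # Hypothetical policy: step toward the origin; commutes with rotations.
    return (-s[0], -s[1])

def asymmetric_policy(s):
    # Hypothetical policy: always push along +x, regardless of the state.
    return (1.0, 0.0)

rng = random.Random(0)
states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(64)]
# Sample rotation angles away from 0 and 2*pi so every rotation is non-trivial.
thetas = [rng.uniform(0.1, 2 * math.pi - 0.1) for _ in range(64)]

loss_sym = consistency_loss(symmetric_policy, states, thetas)
loss_asym = consistency_loss(asymmetric_policy, states, thetas)
print(f"symmetric policy loss:  {loss_sym:.3e}")
print(f"asymmetric policy loss: {loss_asym:.3e}")
```

In this toy setting the rotation-equivariant policy incurs essentially zero loss, while the fixed-direction policy is penalized. In an actual offline RL pipeline, a term of this kind would presumably be added, with some weight, to the policy extraction objective rather than used on its own.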
