A Simple Framework for Generalization in Visual RL under Dynamic Scene Perturbations

NeurIPS 2024

Wonil Song1, Hyesong Choi2, Kwanghoon Sohn1, Dongbo Min2

1Yonsei University, Seoul, Korea, 2Ewha Womans University, Seoul, Korea

Abstract

In the rapidly evolving domain of vision-based deep reinforcement learning (RL), a pivotal challenge is to generalize to dynamic environmental changes reflected in visual observations. Our work delves into the intricacies of this problem, identifying two key issues that appear in previous approaches for visual RL generalization: (i) imbalanced saliency and (ii) observational overfitting. Imbalanced saliency is a phenomenon where an RL agent disproportionately identifies salient features across the consecutive frames in a frame stack. Observational overfitting occurs when the agent focuses on certain background regions rather than task-relevant objects. To address these issues, we present a simple yet effective framework for generalization in visual RL (SimGRL) under dynamic scene perturbations. First, to mitigate the imbalanced saliency problem, we introduce an architectural modification to the image encoder that stacks frames at the feature level rather than the image level. Second, to alleviate the observational overfitting problem, we propose a novel technique called shifted random overlay augmentation, specifically designed to learn robust representations that can handle dynamically changing visual scenes. Extensive experiments demonstrate the superior generalization capability of SimGRL, which achieves state-of-the-art performance on benchmarks including the DeepMind Control Suite.

Motivation

Using a gradient-based attribution mask, we investigate why generalization degrades in challenging environments by examining the salient regions across the consecutive stacked frames given as input to the RL agent. From this analysis, we empirically identify two phenomena as key causes of the performance degradation: (i) what we refer to as imbalanced saliency and (ii) observational overfitting [1].
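As a rough illustration, an attribution mask of this kind can be obtained by thresholding the magnitude of the critic's gradient with respect to the input observation. Below is a minimal PyTorch sketch; `critic`, `obs`, `action`, and the `quantile` threshold are illustrative placeholders, not the paper's exact procedure.

```python
import torch

def attribution_mask(critic, obs, action, quantile=0.95):
    # `critic`: Q-network; `obs`: stacked-frame batch (N, C, H, W);
    # `action`: action batch. All are hypothetical placeholders.
    obs = obs.clone().detach().requires_grad_(True)
    critic(obs, action).sum().backward()
    saliency = obs.grad.abs()  # per-pixel gradient magnitude
    # Keep the top (1 - quantile) most salient locations per sample.
    thresh = torch.quantile(saliency.flatten(1), quantile, dim=1)
    return (saliency >= thresh.view(-1, 1, 1, 1)).float()
```

Inspecting this mask frame by frame reveals, for instance, whether saliency concentrates on a single frame of the stack (imbalanced saliency) or on background regions (observational overfitting).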

Method

1. Feature-Level Frame Stack

To alleviate the imbalanced saliency problem, we modify the encoder structure from an image-level frame stack to a feature-level frame stack: each frame is encoded individually by a shared encoder, and the resulting features are stacked, instead of stacking raw frames along the channel dimension before encoding.
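A minimal sketch of this modification (layer sizes and the class name are illustrative, not the paper's exact architecture): a shared per-frame encoder processes each frame separately, and the per-frame feature maps are concatenated along the channel dimension before the remaining layers.

```python
import torch.nn as nn

class FeatureLevelStackEncoder(nn.Module):
    # Input: (N, K*3, H, W), i.e., K stacked RGB frames.
    def __init__(self, num_frames=3, feat_dim=32):
        super().__init__()
        self.num_frames = num_frames
        # Shared encoder applied to each frame separately.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=1), nn.ReLU(),
        )
        # Layers applied after the per-frame features are stacked.
        self.head = nn.Sequential(
            nn.Conv2d(num_frames * feat_dim, feat_dim, 3, stride=1), nn.ReLU(),
        )

    def forward(self, obs):
        n, ck, h, w = obs.shape
        frames = obs.view(n * self.num_frames, ck // self.num_frames, h, w)
        feats = self.frame_encoder(frames)            # encode each frame
        feats = feats.view(n, -1, *feats.shape[-2:])  # stack at feature level
        return self.head(feats)
```

The intuition is that sharing the early convolutional weights across frames encourages the encoder to attend to every frame in the stack equally, rather than letting one frame dominate.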

2. Shifted Random Overlay Augmentation

To alleviate the observational overfitting problem and make the encoder robust to dynamic backgrounds, we propose a new data augmentation called shifted random overlay, which blends a random image onto the observation and shifts it across the stacked frames so that the training backgrounds mimic dynamically changing scenes.
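One plausible PyTorch sketch of such an augmentation is given below: a distractor image is alpha-blended onto every frame in the stack, with a per-frame spatial shift so the overlaid background appears to move across consecutive frames. The blend weight, shift schedule, and names are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def shifted_random_overlay(obs, overlay, alpha=0.5, max_shift=8):
    # `obs`: (N, K*3, H, W) stacked frames in [0, 1];
    # `overlay`: (N, 3, H, W) random distractor images.
    n, ck, h, w = obs.shape
    num_frames = ck // 3
    # One random "velocity" per call; shifting the overlay by i * v for
    # frame i makes the blended background move over the frame stack.
    vx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    vy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    frames = []
    for i in range(num_frames):
        shifted = torch.roll(overlay, shifts=(vy * i, vx * i), dims=(2, 3))
        frame = obs[:, 3 * i: 3 * (i + 1)]
        frames.append((1 - alpha) * frame + alpha * shifted)
    return torch.cat(frames, dim=1)
```

Unlike a static overlay, the shifted version pushes the encoder to treat a moving background as task-irrelevant, mirroring the conditions encountered at test time under dynamic scene perturbations.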

SimGRL

We build SimGRL on top of the SVEA [2] baseline by incorporating the two regularizations described above.
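For context, SVEA trains the critic on both clean and augmented observations against a shared TD target computed from clean frames. A minimal sketch of how the pieces could fit together, assuming SVEA's default equal weighting (here `critic` would internally use the feature-level frame-stack encoder, and `augment` would be the shifted random overlay):

```python
import torch.nn.functional as F

def simgrl_critic_loss(critic, obs, action, target_q, augment):
    # Both the clean and the augmented observation regress onto the
    # same TD target, stabilizing Q-learning under augmentation (SVEA).
    q_clean = critic(obs, action)
    q_aug = critic(augment(obs), action)
    return 0.5 * F.mse_loss(q_clean, target_q) + 0.5 * F.mse_loss(q_aug, target_q)
```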

Results

DMControl-GB

DistractingCS

Robotic Manipulation

Demonstrations

DMControl-GB

DistractingCS

Each demonstration shows results for intensity levels $\in$ {0.05, 0.1, 0.15, 0.2, 0.3}.

Robotic Manipulation

References

[1] Song et al. “Observational Overfitting in Reinforcement Learning.” ICLR (2020).

[2] Hansen et al. “Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation.” NeurIPS (2021).

Citation

@inproceedings{songsimple,
  title={A Simple Framework for Generalization in Visual RL under Dynamic Scene Perturbations},
  author={Song, Wonil and Choi, Hyesong and Sohn, Kwanghoon and Min, Dongbo},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}