offline 2 online | 重要性采样，把 offline + online 数据化为 on-policy samples

论文标题：Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble
CoRL 2021，4 个 weak accept。
pdf：https://arxiv.org/pdf/2107.00591.pdf
html：https://ar5iv.labs.arxiv.org/html/2107.00591
open review：https://openreview.net/forum?id=AlJXhEI6J5W
GitHub：https://github.com/shlee94/Off2OnRL

0 abstract

Recent advance in deep offline reinforcement learning (RL) has made it possible to train strong robotic agents from offline datasets. However, depending on the quality of the trained agents and the application being considered, it is often desirable to fine-tune such agents via further online interactions. In this paper, we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good initial policy obtained via offline RL. To address this issue, we first propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset. Furthermore, we leverage multiple Q-functions trained pessimistically offline, thereby preventing overoptimism concerning unfamiliar actions at novel states during the initial training phase. We show that the proposed method improves sample-efficiency and final performance of the fine-tuned robotic agents on various locomotion and manipulation tasks.

background：
- 我们希望做 offline 2 online： offline RL 训练 robotic agent 已经很强大了。然而，1. agent 质量不好，2. task 希望定制，通常希望通过进一步的 online 交互，来 fine-tune 这些 agent。
- gap：我们发现，state-action distribution shift 可能导致 fine-tune 过程中的 bootstrapping error，从而破坏 offline RL 的良好初始策略。
method：
- ① 提出了一种平衡的（balanced）重放方案（replay scheme），优先考虑 online sample，同时也鼓励使用 offline dataset 中的 near-on-policy samples。
- ② 利用了离线悲观训练的（trained pessimistically offline）多个 Q function，从而防止在初始训练阶段，对新状态下的不熟悉 action 过度乐观。
- near on-policy experience replay（？） + pessimistic Q-ensemble。
results：
- proposed method 提高了 fine-tune robotic agent 在各种 locomotion 和 manipulation task 上的样本效率和最终性能。

主要思想

motivation：
- 我们希望选取更加 on-policy 的 transition，更新 sac 的 actor-critic。
- 这样可以避免 state-action distribution shift，即，使用当前 value function 的 unseen action 更新 value function，会导致 bootstrapping error，毁掉整个 Q function。
观察：
- 发现在 online fine-tune 刚开始时，使用 offline samples 学习缓慢，使用 online samples 有 bootstrapping error 的风险；
- 因此在 offline online 之间权衡，用 density function 的估计度量 on-policy-ness，选取 offline + online 里最 on-policy 的 samples，做 sac fine-tune。
- （story 是这样讲的，但感觉 importance sampling 的故事更好）。
offline online 权衡：
- 在 offline + online buffer 的采样概率，应当与 $d^{on}(s,a) / d^{off}(s,a)$ 成正比（importance sampling）。
（原来 near-on-policy 的最初的 motivation，是避免 distribution shift 嘛…… 有端联想 QPA）

1 intro

off-policy RL 不能直接做 offline 2 online：
- off-policy RL 的应用场景，貌似适合 offline 2 online，因为 off-policy RL 可以同时利用 offline 和 online 样本。
- 然而，由于 distribution shift，传统 off-policy RL 进行 online 微调的效果不好，因为 agent 可能会遇到 offline dataset 中 unseen 的 state-aciton。
- 对于这种 OOD 的 online sample，Q function 无法提供准确的值估计，使用这种 sample 进行 Q update 会导致严重的 bootstrapping error，最终导致 policy update 在任意方向（in an arbitrary direction），破坏 offline RL 的良好初始策略。
解决 state-action distribution shift，引入一种 balanced replay scheme：
- 除了 online 样本外，还为 agent 的 Q update 提供 offline dataset 中的 near-on-policy 样本。具体的，我们训了一个 NN，来测量 offline samples 的在线性（online-ness），然后根据该 measure，对样本进行优先级排序。
解决对 unseen action 的 over-estimation，使用 pessimistic Q-ensemble：
- 发现一类特定的 offline RL 算法，它们训练的悲观 Q function 是 offline 2 online 的绝佳起点。使用 Q 函数隐式约束策略，使其在初始微调阶段保持在行为策略附近（？）。

2 background

回顾 Conservative Q-learning（保守的 Q-learning）：
- （policy evaluation）CQL 的 critic 更新的 loss function： $\frac12E_{(s,a,s')\sim B}[Q-B^{\pi_\phi}Q]^2+\alpha_0E_{s\sim B}[\log\sum_a\exp Q(s,a)-E_{a\sim \pi_\beta}[Q(s,a)]]$ ，第二项拉低了 unseen action 的 Q value，拉高的 seen action 的 Q value。
- CQL 的 policy update 与 SAC 相同。

3 offline RL + fine-tuning 的关键挑战： state-action distribution shift

主要在讨论 state-action distribution shift。

policy evaluation 使用样本（offline / online）的 trade-off：
- online sample 的 fine-tune 效率高，但因为 unseen / OOD state-action 所以危险；offline sample 效率低但安全。
- 若全部使用 online sample 来 fine-tune，会导致性能一开始大幅度下降。若随机采样 offline + online，则因为使用的 online 样本不够多，导致 value 更新缓慢，学习缓慢。
发现 pessimistic Q function 更适合作为 offline 起点。为了更悲观一点，我们采用 pessimistic Q-ensemble。

4 method

整点知乎博客：

4.1 balanced experience replay

https://blog.csdn.net/sinat_37422398/article/details/127292692

直接看这一篇的 4.1 即可，对如何估计 density ratio $w(s,a)=d^{on}(s,a)/d^{off}(s,a)$ 讲得比较清楚。可以与原文交替看。感觉是神秘的 trick。

4.2 pessimistic Q-ensemble

https://blog.csdn.net/sinat_37422398/article/details/127292692

直接看这一篇的 4.2 即可。在 MuJoco task 里的 ensemble size N = 5。

4.3 算法概括

offline RL： CQL ensemble，得到 offline 的 policy / value 训练起点。
online fine-tune：
- 初始化 density ratio estimation network ψ；初始化 online buffer $B^{on}$ 为空集、prioritized buffer $B^{priority}$ 为空集、default priority value p0 ← 1.0。
- 对 offline buffer B^{off} 中的每个 transition τi，将 {(τi, p0)} 加入 $B^{priority}$。
- 赋值 p0 ← P0。
- for each iteration do：
  - 收集 online training samples：使用 $\pi_\phi$ 收集一个 transition τ，将其加入 $B^{on}$，将 {(τ, p0)} 加入 $B^{priority}$ 。
  - 更新 density ratio network：从 $B^{on},B^{off}$ 里采样 B 个 transition，计算 loss function $L^{DR}(\tau)$ ，更新 ψ。
  - 更新 policy 和 value network：计算 $L^{SAC}_{critic}(\theta), L^{SAC}_{actor}(\phi)$ ，更新 θ φ。
  - 更新 priority values： for j = 1, ..., B，将 online 采样得到的 transition τj 的 priority 更新为 $\frac{w(s_i,a_j)}{E_{\tau^{off}\sim B^{off}}[w(s,a)]}$ ，更新 p0 ← $\max(p_0,\frac{w(s_i,a_j)}{E_{\tau^{off}\sim B^{off}}[w(s,a)]})$ 。

大致思想：

offline RL 采用 CQL ensemble 作为训练起点，提供 actor 和 critic。
对于 online fine-tune，维护一个 priority buffer（offline + online），其中采样概率正比于 priority，priority 为 $d_{on}(s,a)/d_{off}(s,a)$ 。（简单的 importance sampling）
从 priority buffer 里生成 samples，直接继续训 sac。
“policy 生成 (s,a) 的概率”，使用神秘计算方式；在每一轮 online 交互后，还会更新 density ratio estimation network。

6 experiment

实验环境：
- D4RL 的 MuJoco locomotion tasks（halfcheetah, hopper, walker2d）。
- 三个 sparse-reward pixel-based manipulation tasks，貌似是在 simulator 上的，没有真机。这个实验篇幅较短，D4RL 实验的分析篇幅长。
baselines：
- AWAC： Advantage Weighted Actor Critic，offline 2 online RL 方法，用于训练策略以模仿具有高优势估计的行动。
- BCQ-ft： Batch-Constrained deep Q-learning（BCQ），offline RL 算法，通过使用一个 conditional VAE，对 behavior policy 进行建模，来更新策略。通过应用与 offline training 相同的更新规则，将 BCQ 扩展到 online fine-tune setting。
- SAC-ft： SAC 的 fine-tune 版本。
- SAC：从头开始训练 SAC，无法访问 offline dataset；相当于，训练的初始策略是 random policy。
results：
- proposed method 在所有任务上表现良好，而 AWAC 和 BCQ-ft 的性能高度依赖于 offline dataset 的质量，在 random dataset 的 performance 甚至不如 SAC。
- 这是因为，AWAC 和 BCQ-ft 对 offline 和 online setting 都采用相同的正则化、悲观更新规则，无论是显式（BCQ-ft）还是隐式（AWAC），这会导致微调速度缓慢。相反，我们的方法依赖于悲观的初始化，因此可以享受更快的微调，同时不牺牲初始训练的稳定性。