Offline (Batch) Reinforcement Learning

Key takeaways

Offline RL (a.k.a. batch RL) learns a policy from a fixed, pre-collected dataset — with zero new interaction with the environment.
The core enemy is distributional shift: querying the value function on actions the dataset never contains produces wildly over-optimistic, un-correctable estimates.
Every method is some flavor of staying close to the data: constrain the policy (BCQ, TD3+BC), be pessimistic about unseen actions (CQL), never query unseen actions at all (IQL), or skip RL entirely and model the data as a sequence (Decision Transformer).
It matters wherever online trial-and-error is unsafe or expensive — robotics, healthcare, autonomous driving, recommendation — and it underpins how RL is applied to LLMs.

What is offline reinforcement learning?

Offline reinforcement learning — also called batch RL or data-driven RL — learns a decision-making policy from a fixed dataset of past experience, without ever interacting with the environment during training. You are handed a log of transitions (state, action, reward, next state) collected by some other policy (or several, or humans), and you must extract the best possible behavior from it. No exploration, no fresh rollouts, no resets — just the data.
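As a rough illustration of what "just the data" means in practice, the sketch below stores a log as immutable transition records and draws minibatches from it. The `Transition` class, the `sample_minibatch` helper, and the toy three-row log are all hypothetical names and values chosen for this example, not part of any particular library.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One logged step: (state, action, reward, next state, episode-ended flag)."""
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool


def sample_minibatch(dataset, batch_size, rng):
    """Draw a minibatch uniformly, with replacement, from the frozen log."""
    return [rng.choice(dataset) for _ in range(batch_size)]


# A toy three-transition log; a tuple is used so the dataset cannot grow.
dataset = (
    Transition((0.0,), 1, 0.0, (1.0,), False),
    Transition((1.0,), 0, 0.0, (0.0,), False),
    Transition((1.0,), 1, 1.0, (2.0,), True),
)
rng = random.Random(0)
batch = sample_minibatch(dataset, batch_size=4, rng=rng)
```

Nothing in this loop ever calls an environment: every training step sees only rows that were already in `dataset`.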

This is a sharp departure from standard (“online”) reinforcement learning, where the agent’s whole job is to act, observe the consequence, and improve in a tight loop. Offline RL severs that loop. The promise is enormous: the same way supervised learning turned static image and text corpora into models, offline RL aims to turn static logged decision data into good policies — letting RL benefit from the kind of large, reusable datasets that drove the rest of deep learning.

Online RL learns inside a feedback loop with the environment; offline RL learns once from a frozen dataset, then deploys. The missing arrow back to the environment is the entire difficulty.

▶ CS 285 Lecture 15: Offline Reinforcement Learning — Sergey Levine (Berkeley, the canonical lecture)

Why it’s hard: distributional shift

If offline RL were just “run Q-learning on a saved buffer,” it would be trivial: we already train deep Q-networks from replay buffers. The catch is that those buffers are constantly refreshed with new data from the current policy. Freeze the data and naive value-based methods can fail badly, often producing a policy worse than the one that collected the data.

The culprit is distributional shift. Q-learning’s update bootstraps from the value of the best next action:

Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')

That max happily picks actions a' that never appear in the dataset. The Q-function has no evidence about them, so its estimates there are arbitrary, and because the max selects whichever errors happen to be largest, the backed-up targets are biased upward. Worse, the policy is then trained to seek out exactly those over-valued out-of-distribution (OOD) actions, the bootstrapping target propagates the inflated values backward through the Bellman equation, and the errors compound. Online RL self-corrects this (try the action, get the real low reward, fix the estimate). Offline RL cannot: there is no environment to set it straight.

Off-support over-estimation: inside the data's action support the learned Q-value tracks the truth, but on actions the dataset never contains it extrapolates upward. The argmax policy chases that phantom peak.
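The upward bias of the max can be seen without any neural network. The sketch below assumes every action has a true value of exactly 0 and that the estimates are corrupted by zero-mean Gaussian noise; the function name `mean_max_estimate` and the noise model are assumptions made for illustration only.

```python
import random
import statistics


def mean_max_estimate(n_actions, noise_std, n_trials, seed=0):
    """Average of max_a Q_hat(s, a) when the true Q is 0 for every action."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_trials):
        # Each estimate is the truth (0.0) plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        maxima.append(max(estimates))
    return statistics.fmean(maxima)


bias_one_action = mean_max_estimate(n_actions=1, noise_std=1.0, n_trials=5000)
bias_ten_actions = mean_max_estimate(n_actions=10, noise_std=1.0, n_trials=5000)
```

With a single action the average stays near the true value of 0, while with ten noisy candidates the max is clearly positive even though no action is actually better. Bootstrapping then feeds that inflated number into the next target.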

Go deeper: why importance sampling alone doesn’t save you

A natural instinct is to correct the mismatch between the data-collecting behavior policy π_β and the new policy π with importance sampling — reweighting returns by the ratio of their probabilities. Over a full trajectory those ratios multiply across every timestep, so the variance of the estimator explodes exponentially with the horizon. For anything but very short episodes the estimates are uselessly noisy. This is why modern offline RL leans on value-based constraints and pessimism rather than pure importance-weighted policy evaluation. The Levine et al. survey treats this in depth.
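A small calculation makes the variance blow-up concrete. The sketch assumes, purely for simplicity, that both policies ignore the state, so the per-step ratios are independent and identically distributed; the function names and the example probabilities are hypothetical.

```python
def step_ratio_second_moment(pi, beta):
    """E_beta[(pi(a) / beta(a))^2] for one timestep, given action probabilities."""
    return sum(p * p / b for p, b in zip(pi, beta))


def trajectory_weight_variance(pi, beta, horizon):
    """Variance of the product of per-step ratios over `horizon` i.i.d. steps.

    The product has mean 1, so Var = E[w^2] - 1 = (per-step second moment)^H - 1.
    """
    return step_ratio_second_moment(pi, beta) ** horizon - 1.0


behavior = (0.5, 0.5)   # the logging policy chooses uniformly between two actions
target = (0.9, 0.1)     # the new policy strongly prefers the first action
variances = {h: trajectory_weight_variance(target, behavior, h) for h in (1, 10, 50)}
```

Here the per-step second moment is 1.64, so the variance of the trajectory weight is 0.64 at one step, roughly 140 at ten steps, and on the order of 10^10 at fifty steps.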

The data: behavior policy and dataset quality

Offline performance is bounded by what’s in the dataset, but not simply capped at the data’s quality. The headline win of offline RL over plain behavior cloning (imitation) is trajectory stitching: if the data contains a good way to get from A to B and a separate good way from B to C, value learning can compose them into an A-to-C policy that no single logged trajectory ever demonstrated. Cloning can only copy whole trajectories; offline RL can recombine their parts.
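Stitching can be illustrated with a tiny tabular example. The three-transition log below is made up for this sketch: one trajectory goes from A to B and then into a dead end D, and a separate trajectory starts at B and reaches the rewarding state C. The `fitted_q` helper is a simplified tabular Q-iteration whose max only ranges over actions logged at the next state, which is an assumption of this sketch rather than any specific published algorithm.

```python
from collections import defaultdict

# Hypothetical log: (state, action, reward, next_state, done).
DATASET = [
    ("A", "right", 0.0, "B", False),  # trajectory 1, step 1
    ("B", "left", 0.0, "D", True),    # trajectory 1, step 2: dead end
    ("B", "right", 1.0, "C", True),   # trajectory 2: starts at B, reaches the goal
]


def fitted_q(dataset, gamma=0.9, sweeps=20):
    """Tabular Q-iteration restricted to actions that appear in the log."""
    logged_actions = defaultdict(set)
    for state, action, _, _, _ in dataset:
        logged_actions[state].add(action)
    q = defaultdict(float)
    for _ in range(sweeps):
        for state, action, reward, next_state, done in dataset:
            next_actions = logged_actions.get(next_state, ())
            if done or not next_actions:
                bootstrap = 0.0
            else:
                bootstrap = max(q[(next_state, a)] for a in next_actions)
            q[(state, action)] = reward + gamma * bootstrap
    return q, logged_actions


def greedy_action(q, logged_actions, state):
    """Best logged action at `state` according to the learned Q-table."""
    return max(logged_actions[state], key=lambda a: q[(state, a)])


q, logged_actions = fitted_q(DATASET)
plan = [greedy_action(q, logged_actions, s) for s in ("A", "B")]
```

The greedy plan is "right" at A and "right" at B, reaching C, even though no single logged trajectory ever went from A to C. Reward information from the second trajectory flows back through B into the value of the first trajectory's opening move.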

Datasets come in flavors, and the right algorithm depends on which you have:

Expert / near-optimal

Logged by a good policy. Behavior cloning is a strong baseline here; offline RL’s edge is smaller. Tight policy constraints work well.

Mixed / suboptimal

A heterogeneous blend of good and bad trajectories — the realistic case. This is where stitching pays off most, and where pessimistic value methods (CQL, IQL) shine over cloning.

The D4RL benchmark (Fu et al., 2020) standardized exactly these regimes, with random, medium, medium-replay, and medium-expert splits for MuJoCo locomotion plus further domains such as AntMaze navigation, and it remains the default yardstick for the field.

How offline RL works: the recipe

1. Collect (or inherit) a fixed dataset

Gather logged transitions D = {(s, a, r, s')} from one or more behavior policies π_β: past production systems, human demonstrators, scripted controllers, or old RL agents. After this point the environment is off-limits.

2. Choose how to stay close to the data

Pick a strategy for handling distributional shift: policy constraint (force π to resemble π_β), value pessimism (penalize Q-values on unseen actions), in-sample learning (never query OOD actions at all), or sequence modeling (imitate the data as a sequence, conditioned on a desired return).

3. Train the value function and/or policy on D

Run the chosen objective entirely on the static dataset, sampling minibatches as in supervised learning. The conservatism from Step 2 keeps the learned values honest where data is thin.

4. Evaluate, then deploy (or finetune online)

Estimating performance without the environment, known as off-policy evaluation (OPE), is itself hard and unreliable. Many pipelines therefore allow a small, careful online finetuning phase after offline pretraining to validate and sharpen the policy.

The main algorithm families

Four broad strategies dominate. They all answer the same question — how do we avoid trusting the value function off-support? — in different ways.

Family	Core idea	Representative methods	Trade-off
Policy constraint	Keep `π` close to the behavior policy `π_β`	BCQ, BEAR, TD3+BC	Simple and stable; can’t exceed `π_β` by much if the constraint is tight
Value pessimism	Lower Q-values on OOD actions so the policy avoids them	CQL, MOPO (model-based)	Strong on mixed data; a conservatism coefficient to tune
In-sample / implicit	Learn values using only dataset actions — never query a `max` over unseen ones	IQL	Very stable, no OOD queries; expectile/temperature to tune
Sequence modeling	Skip RL: model trajectories autoregressively, condition on desired return	Decision Transformer, Trajectory Transformer	Elegant, scales; weaker stitching, sensitive to target return

Policy constraint: BCQ and TD3+BC

Batch-Constrained deep Q-learning (BCQ) — the method that first crisply diagnosed the OOD problem — restricts the policy to actions a generative model deems likely under the data, so the max in the Bellman backup only ranges over plausible actions. TD3+BC is the minimalist heir: take the standard TD3 actor-critic and add one behavior-cloning regularizer to the policy loss:

\pi = \arg\max_{\pi}\; \mathbb{E}_{(s,a)\sim D}\Big[\,\lambda\, Q\big(s,\pi(s)\big)\;-\;\big(\pi(s)-a\big)^2\,\Big]

The first term maximizes value; the second pulls the chosen action toward the logged action. One extra line, a single coefficient λ, and it’s competitive with far more complex methods — a celebrated example of “a simple baseline done right.”
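The objective above can be written as a loss over a minibatch in a few lines. This is a framework-free sketch, not the reference implementation: the function name and argument layout are hypothetical, and gradients are left to whatever autodiff library is in use. It follows the TD3+BC paper's convention of setting λ by dividing a constant α (2.5 in the paper) by the batch-average magnitude of Q, and it assumes the squared error is summed over action dimensions and averaged over the batch.

```python
def td3_bc_actor_loss(q_of_policy_actions, policy_actions, data_actions, alpha=2.5):
    """Negated TD3+BC objective for one minibatch (lower is better).

    q_of_policy_actions: Q(s_i, pi(s_i)) for each sampled state
    policy_actions:      pi(s_i), one action vector per sample
    data_actions:        the logged action a_i for the same states
    """
    n = len(policy_actions)
    # Normalize lambda by the average |Q| so alpha is comparable across tasks.
    # In a real implementation this scale is treated as a constant (no gradient).
    mean_abs_q = sum(abs(q) for q in q_of_policy_actions) / n
    lam = alpha / max(mean_abs_q, 1e-8)
    value_term = lam * sum(q_of_policy_actions) / n
    bc_term = sum(
        sum((p - a) ** 2 for p, a in zip(pa, da))
        for pa, da in zip(policy_actions, data_actions)
    ) / n
    return -value_term + bc_term


loss = td3_bc_actor_loss(
    q_of_policy_actions=[2.0, 2.0],
    policy_actions=[(0.0,), (1.0,)],
    data_actions=[(0.0,), (0.0,)],
)
```

In this toy batch λ is 1.25, the value term is 2.5, and the behavior-cloning term is 0.5, giving a loss of -2.0. Moving the second policy action back toward the logged action would shrink the penalty.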

Value pessimism: CQL

Conservative Q-Learning (CQL) adds a regularizer that pushes down Q-values for actions the current policy would pick and pushes up Q-values for actions actually in the data, on top of the usual Bellman error:

\min_{Q}\; \alpha\Big(\mathbb{E}_{s\sim D,\,a\sim\pi}[Q(s,a)] - \mathbb{E}_{(s,a)\sim D}[Q(s,a)]\Big) \;+\; \tfrac{1}{2}\,\mathbb{E}_{D}\big[(Q - \hat{\mathcal{B}}Q)^2\big]

The net effect: the learned Q-function lower-bounds the true value of the policy, so the policy can’t be lured toward over-estimated OOD actions. CQL became a default baseline because it’s robust across dataset qualities; the price is tuning the conservatism weight α.
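For discrete actions, the regularizer in the first term can be computed directly from a table of Q-values. The sketch below is a minimal illustration of that term only (the Bellman error is omitted); the function name and argument layout are hypothetical. Practical CQL implementations often use a log-sum-exp variant of the push-down term instead of an expectation under the current policy.

```python
def cql_penalty(q_values, policy_probs, data_actions, alpha=1.0):
    """alpha * (E_{s~D, a~pi}[Q(s, a)] - E_{(s, a)~D}[Q(s, a)]) on a minibatch.

    q_values[i][a]:     Q(s_i, a) for every discrete action a
    policy_probs[i][a]: pi(a | s_i) under the current policy
    data_actions[i]:    index of the action actually logged at s_i
    """
    n = len(data_actions)
    # Term that gets pushed down: Q on actions the current policy would pick.
    policy_value = sum(
        sum(p * q for p, q in zip(probs, qs))
        for probs, qs in zip(policy_probs, q_values)
    ) / n
    # Term that gets pushed up: Q on actions that are actually in the data.
    data_value = sum(qs[a] for qs, a in zip(q_values, data_actions)) / n
    return alpha * (policy_value - data_value)


# The policy prefers action 1 (Q = 5), but the log only contains action 0 (Q = 1).
off_support = cql_penalty([[1.0, 5.0]], [[0.0, 1.0]], [0])
# When the policy agrees with the data, the penalty vanishes.
on_support = cql_penalty([[1.0, 5.0]], [[1.0, 0.0]], [0])
```

Minimizing this penalty alongside the Bellman error lowers the unsupported value of 5 relative to the logged action's value, which is the pessimism the section describes.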

In-sample learning: IQL

Implicit Q-Learning (IQL) sidesteps OOD queries entirely. Instead of max_{a'} Q(s', a') over all actions, it estimates the value of the best in-dataset action using expectile regression — approximating an upper expectile of the Q-distribution over actions that actually appear. Because it never evaluates the Q-function on an action the data doesn’t contain, there is nothing off-support to over-estimate. The policy is then extracted by advantage-weighted regression (behavior cloning, weighted toward high-advantage actions). IQL is prized for stability and for finetuning gracefully into the online setting.
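Expectile regression is easiest to understand on a handful of numbers. The sketch below fits a single scalar V to a list of Q-values for in-dataset actions at one state, using plain gradient descent on the asymmetric squared loss; the helper names, the step size, and the sample values are assumptions for illustration. With τ = 0.5 the result is the ordinary mean, and as τ approaches 1 it moves toward the largest sample, which is how IQL approximates a max without leaving the data.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2, with diff = Q - V."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(q_samples, tau, lr=0.1, steps=2000):
    """Find the scalar V minimizing the mean expectile loss over q_samples."""
    v = sum(q_samples) / len(q_samples)
    for _ in range(steps):
        grad = 0.0
        for q in q_samples:
            diff = q - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # derivative of the loss w.r.t. V
        v -= lr * grad / len(q_samples)
    return v


q_samples = [0.0, 1.0, 2.0, 3.0, 10.0]  # Q-values of actions seen at one state
v_mean = fit_expectile(q_samples, tau=0.5)
v_high = fit_expectile(q_samples, tau=0.9)
v_near_max = fit_expectile(q_samples, tau=0.99)
```

The three fits increase with τ while never exceeding the best sample in the list, so the target stays anchored to actions that actually appear.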

Go deeper: model-based offline RL (MOPO, COMBO)

A different route learns a dynamics model from the dataset and plans or generates rollouts inside it, but a model is also unreliable off-support. MOPO handles this by penalizing reward in proportion to model uncertainty, and MOReL by ending rollouts with a large penalty wherever the model is judged unreliable; both build a pessimistic MDP the agent is safe to optimize. COMBO combines model-based rollouts with CQL-style value conservatism. The theme is identical to model-free offline RL (be pessimistic where you have no data), just applied to the learned model instead of the value function. See model-based RL.

Sequence modeling: Decision Transformer

The most conceptually radical approach throws out value learning altogether. Decision Transformer (Chen et al., 2021) casts offline RL as conditional sequence modeling: feed a Transformer the sequence of (return-to-go, state, action) tokens and train it, GPT-style, to predict the next action. At test time you condition on a high desired return and let the model autoregressively produce actions to hit it. No Bellman backups, no bootstrapping, no OOD max — and therefore no over-estimation, because it never estimates a value off-support. The cost: it’s closer to smart imitation and struggles to “stitch” as aggressively as value methods on some tasks.
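The data preparation behind this approach is simple enough to sketch. The helpers below compute the undiscounted return-to-go for each timestep and interleave it with states and actions in the order described above; the function names and the tagged-tuple token format are hypothetical, and a real model would embed these values rather than store them as Python tuples.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each timestep to the end of the episode."""
    result, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        result.append(running)
    result.reverse()
    return result


def interleave_tokens(rewards, states, actions):
    """Flatten an episode into (return-to-go, state, action) triples in time order."""
    tokens = []
    for rtg, state, action in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("rtg", rtg), ("state", state), ("action", action)])
    return tokens


def next_target_return(target_return, observed_reward):
    """At test time, subtract each received reward from the remaining target."""
    return target_return - observed_reward


rtg = returns_to_go([1.0, 0.0, 2.0])
tokens = interleave_tokens([1.0, 0.0, 2.0], ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

For rewards of 1, 0, and 2 the returns-to-go are 3, 2, and 2. At test time the first token is set to the desired return instead, and `next_target_return` shows how that target is decremented as rewards arrive.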

Off-policy evaluation: the silent hard problem

Training a policy offline is only half the job — you also have to know whether it’s any good before deploying it, and you still can’t touch the environment. Off-policy evaluation (OPE) tries to estimate a policy’s return from logged data alone, using importance sampling, learned value functions (fitted Q-evaluation), or doubly-robust hybrids. In practice OPE estimates are noisy and easy to fool, which is why model selection and hyperparameter tuning remain among offline RL’s thorniest, most under-appreciated obstacles. When the stakes allow it, teams hedge by validating with a small, monitored online rollout before full deployment.
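As one concrete instance of OPE, the sketch below computes the ordinary and the weighted (self-normalized) importance-sampling estimates of a new policy's return from logged trajectories. It assumes the behavior policy's action probabilities are known, which is often not true of real logs; the function names and the toy one-step problem are invented for this example.

```python
def importance_sampling_estimates(trajectories, pi, beta, gamma=1.0):
    """Return (ordinary, weighted) importance-sampling estimates of pi's return.

    Each trajectory is a list of (state, action, reward) tuples; `pi` and `beta`
    map (state, action) to a probability under the new and the logging policy.
    """
    weights, returns = [], []
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for t, (state, action, reward) in enumerate(trajectory):
            weight *= pi(state, action) / beta(state, action)
            ret += (gamma ** t) * reward
        weights.append(weight)
        returns.append(ret)
    weighted_sum = sum(w * g for w, g in zip(weights, returns))
    ordinary = weighted_sum / len(trajectories)
    total_weight = sum(weights)
    weighted = weighted_sum / total_weight if total_weight > 0 else float("nan")
    return ordinary, weighted


def uniform_policy(state, action):
    return 0.5


def always_action_one(state, action):
    return 1.0 if action == 1 else 0.0


# Toy one-step problem: action 1 pays 1.0, action 0 pays nothing.
logged = [[("s0", 0, 0.0)], [("s0", 1, 1.0)], [("s0", 1, 1.0)], [("s0", 0, 0.0)]]
ordinary, weighted = importance_sampling_estimates(logged, always_action_one, uniform_policy)
```

Both estimates recover the new policy's true return of 1.0 in this balanced toy log. With long horizons or a policy far from the logging policy, most weights collapse toward zero and a few dominate, which is the noise the section warns about.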

Where offline RL is used

Domain	Why offline	Example signal
Healthcare	Experimenting on patients is unethical; rich logged records exist (e.g. ICU data)	Sepsis treatment policies learned from MIMIC critical-care logs
Robotics	Real-world rollouts are slow, costly, and risk hardware	Learning manipulation skills from large offline interaction datasets
Autonomous driving	An untrained policy can’t experiment in live traffic	Policies learned from fleet logs of human driving
Recommendation & ads	Online exploration costs revenue and user trust	Reusing historical interaction logs to improve ranking policies
LLM post-training	Re-querying the model/raters is expensive	Offline preference methods learn from fixed comparison data

Offline RL and LLMs

Offline RL is also a useful lens on the alignment of large language models. DPO and related direct-preference methods are, in effect, offline RL: they optimize a policy against a frozen dataset of human preference comparisons with no fresh sampling, and their stability relative to on-policy PPO echoes the offline-vs-online tension. By contrast, GRPO and RLVR-style training are predominantly on-policy: they generate fresh samples at each step. The offline RL literature’s hard-won lessons about distributional shift and conservatism help explain why purely offline preference tuning can over-optimize against stale data. See RLHF for the broader picture.
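The offline character of DPO is visible in its loss, which needs only log-probabilities of the two logged responses under the current policy and a frozen reference model. The sketch below computes that loss for a single preference pair; the function and argument names are hypothetical, and `beta` here is DPO's temperature, unrelated to the behavior policy π_β used elsewhere in this article.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair.

    Every input is the summed log-probability of a full response; the reference
    values come from a frozen copy of the model and are never updated.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


at_reference = dpo_loss(-3.0, -3.0, -3.0, -3.0)
prefers_chosen = dpo_loss(-1.0, -5.0, -2.0, -2.0)
prefers_rejected = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

When the policy equals the reference the loss is log 2, and it falls as the policy raises the chosen response's likelihood relative to the rejected one. No new text is sampled at any point: the comparison data stays fixed, exactly as in the offline setting.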

A short history

2005

Fitted Q & batch RL

Ernst et al.’s fitted Q-iteration and the broader “batch RL” line establish learning value functions from a fixed set of transitions.

2019

BCQ names the problem

Fujimoto et al. show off-policy deep RL collapses on static data and trace it to OOD actions — coining the batch-constrained view.

2020

The tutorial + D4RL

Levine, Kumar, Tucker and Fujimoto publish the field-defining survey; D4RL gives everyone a shared benchmark.

2020

CQL

Kumar et al. introduce conservative Q-learning — pessimistic value bounds become a dominant recipe.

2021

IQL, TD3+BC & Decision Transformer

In-sample learning (IQL), a one-line BC baseline (TD3+BC), and sequence modeling (Decision Transformer) arrive within months of each other: two new paradigms and a minimalist take on policy constraints.

2023–26

Offline → online & LLMs

Offline-to-online finetuning matures; DPO-style methods bring offline RL thinking to LLM alignment; new work (e.g. geometric pessimism) pushes real-world deployment.

Limitations and open problems

You can’t exceed the data’s reach — offline RL can stitch and recombine, but it cannot discover behaviors the dataset never hints at. Exploration is fundamentally off the table.
Off-policy evaluation is unreliable — choosing the best policy or hyperparameters without the environment remains largely unsolved.
Conservatism is a knob, not a solution — too little and you over-estimate; too much and you collapse to behavior cloning. Right-sizing it per dataset is fiddly.
Distribution shift at deployment — even a well-trained policy meets states its data never covered once it acts in the world, where its choices compound.

Researcher takes

Sergey Levine, who co-authored the field’s defining survey, stresses that imitation learning and offline RL look similar but are solving genuinely different problems — the distinction (and why offline RL can beat cloning) is the conceptual heart of the area.

Levine poses the counterintuitive thesis at the heart of conservative offline RL: even when behavioral cloning is handed optimal demonstrations, value-based offline RL can still beat it. The argument turns on stitching and the structure of the environment (sparse rewards, long horizons) rather than mere data quality, reframing the offline-vs-imitation debate.

Frequently asked questions

How is offline RL different from behavior cloning?

Behavior cloning is supervised imitation: copy the logged actions. Offline RL uses rewards to do better than the data: by learning value functions it can stitch good fragments of different trajectories into a policy no single trajectory demonstrated, and it can down-weight the bad decisions in a mixed dataset. On near-optimal data the gap is usually small; on mixed data offline RL typically has the edge.

Is offline RL just off-policy RL on a saved replay buffer?

No — that’s the trap. Off-policy methods like DQN reuse old data but keep adding fresh data that corrects their mistakes. Freeze the buffer and the same algorithms blow up, because they over-estimate the value of out-of-distribution actions with no way to self-correct. Offline RL is the set of techniques built to survive that frozen setting.

Which algorithm should I start with?

For continuous control, IQL and TD3+BC are the usual first picks — stable, simple, strong on the D4RL benchmark. CQL is the standard pessimistic baseline. If you want a non-value approach or have large diverse data, try a Decision Transformer. Match the method to your data quality: tight constraints for expert data, pessimism for mixed data.

Can I fine-tune an offline policy online afterward?

Yes, and it’s a popular recipe: pretrain offline to get a safe, competent starting policy, then do a short, monitored online finetuning phase to validate and sharpen it. IQL in particular is known for moving from offline to online training without destabilizing.

Key papers

Offline RL: Tutorial, Review, and Perspectives on Open Problems — Levine, Kumar, Tucker, Fujimoto, 2020 — the definitive survey.
Off-Policy Deep RL Without Exploration (BCQ) — Fujimoto et al., 2019 — names the OOD-action problem.
Conservative Q-Learning (CQL) — Kumar et al., 2020 — pessimistic value bounds.
Offline RL with Implicit Q-Learning (IQL) — Kostrikov et al., 2021 — in-sample, no OOD queries.
A Minimalist Approach to Offline RL (TD3+BC) — Fujimoto & Gu, 2021 — one-line BC regularizer.
Decision Transformer — Chen et al., 2021 — RL as conditional sequence modeling.
D4RL: Datasets for Deep Data-Driven RL — Fu et al., 2020 — the standard benchmark.

What is reinforcement learning? · Value functions · Q-learning · Deep Q-networks · Model-based RL · DPO & preference optimization · RLHF
