Deep Reinforcement Learning Tutorial: Spinning Up (translated from the Chinese edition)¶
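Introduction¶
What This Is¶
Welcome to Spinning Up in Deep RL (deep reinforcement learning)! This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning.
Reinforcement learning (RL) is a machine learning approach for teaching agents to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning.
This module contains a variety of helpful resources.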
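Why We Built This¶
One of the most common questions we hear is:
If I want to contribute to AI safety, how do I get started?
At OpenAI, we believe that deep learning generally, and deep RL specifically, will play central roles in future AI technology. To ensure that AI is safe, we have to come up with safety strategies and algorithms that are compatible with this paradigm. As a result, we encourage everyone who asks this question to study these fields.
Deep learning now has many resources that help people get up to speed quickly, but deep RL is considerably harder to break into. To begin with, a student needs a background in math, coding, and deep learning. Beyond that, they need a high-level view of the field: what topics are studied in it, why those topics matter, and what has already been done. They also need careful instruction on how to connect algorithm theory to algorithm code.
The high-level view is hard to come by because the field is so new. There is not yet a standard deep RL textbook, so most of the knowledge is locked up in papers and lecture series, which are hard to parse and digest. Learning to implement deep RL algorithms is also typically painful, because:
 many algorithm papers omit key details, deliberately or not,
 and some widely known implementations are hard to read, which makes it difficult to see how the code lines up with the algorithm.
Great repos such as rllab, Baselines, and rllib make it easier for researchers who are already in the field to make progress. But they build algorithms into frameworks in ways that involve many trade-offs, which makes the code hard to learn from. Consequently, deep RL has a high barrier to entry for new researchers, practitioners, and hobbyists alike.
Our package is designed to serve as the missing middle step for people who would like to understand deep RL or contribute to it, but who don't have a clear sense of what to study or how to turn algorithms into code. We have tried to make it as helpful a launching point as possible.
That said, practitioners are not the only people who can benefit from these materials. Solving AI safety will require people with a wide range of hands-on experience and perspectives, and much of the relevant expertise has no connection to engineering or computer science. Everyone involved in this discussion still needs to know enough about the technology to make informed decisions, and several pieces of Spinning Up address that need.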
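How This Serves Our Mission¶
OpenAI's mission is to ensure the safe development of artificial general intelligence (AGI) and the broad distribution of the benefits of AI. Teaching tools like Spinning Up help us make progress on both of these objectives.
To begin with, we move closer to broad distribution of benefits any time we help people understand what AI is and how it works. This empowers people to think critically about the many issues that arise as AI becomes more important in our lives.
We also need people to join us in making sure that AGI is safe. Because the field is so new, people with this skill set are currently in short supply. If you become an expert through this project, you can make a real difference to AI safety.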
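Code Design Philosophy¶
The algorithm implementations in Spinning Up are designed to be
 as simple as possible while still being reasonably good,
 and highly consistent with each other, to expose the similarities between algorithms.
They are almost completely self-contained, with virtually no code shared between them (except for logging, saving, loading, and MPI utilities), so you can study each algorithm separately without digging through a chain of dependencies. The implementations are patterned to stay as close to pseudocode as possible, to minimize the gap between theory and code.
They are all structured similarly, so once you understand one, moving on to the next is easy.
We tried to minimize the number of tricks used in each implementation, and to minimize the differences between otherwise-similar algorithms. As an example of removed tricks: we omit the regularization terms and the observation normalization present in the original Soft Actor-Critic code. As an example of removed differences: our implementations of DDPG, TD3, and SAC all follow a convention from the original TD3 code, where all gradient descent updates are performed at the ends of episodes (instead of throughout the episode).
All algorithms are "reasonably good" in the sense that they achieve roughly the intended performance, but they do not necessarily match the best reported results. Be careful, therefore, when using these implementations for scientific benchmarking comparisons. Details on each implementation's performance can be found on our benchmarks page.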
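Support Plan¶
We plan to support Spinning Up to ensure that it serves as a practical resource for learning deep reinforcement learning. The exact nature of long-term (multi-year) support is yet to be determined, but in the short run, we commit to:
High-bandwidth support for the first three weeks after release (November 8, 2018 to November 29, 2018).
 We'll move quickly on bug fixes, question-answering, and changes to the docs that clear up ambiguities.
 We'll work hard to streamline the user experience, to make self-study with Spinning Up as easy as possible.
Six months after release, we will review the state of the package based on feedback from the community, and announce plans for the future, including a long-term roadmap.
Additionally, as discussed in the blog post, we will use Spinning Up in the curriculum for our upcoming cohorts of Scholars and Fellows. Any changes and updates will be made public immediately.
Installation¶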
Spinning Up requires Python3, OpenAI Gym, and OpenMPI.
Spinning Up is currently only supported on Linux and OSX. It may be possible to install on Windows, though this hasn’t been extensively tested. [1]
You Should Know
Many examples and benchmarks in Spinning Up refer to RL environments that use the MuJoCo physics engine. MuJoCo is a proprietary software that requires a license, which is free to trial and free for students, but otherwise is not free. As a result, installing it is optional, but because of its importance to the research community—it is the de facto standard for benchmarking deep RL algorithms in continuous control—it is preferred.
Don’t worry if you decide not to install MuJoCo, though. You can definitely get started in RL by running RL algorithms on the Classic Control and Box2d environments in Gym, which are totally free to use.
[1]  It looks like at least one person has figured out a workaround for running on Windows. If you try another way and succeed, please let us know how you did it! 
Installing Python¶
We recommend installing Python through Anaconda. Anaconda is a library that includes Python and many useful packages for Python, as well as an environment manager called conda that makes package management simple.
Follow the installation instructions for Anaconda here. Download and install Anaconda 3.x (at time of writing, 3.6). Then create a conda env for organizing packages used in Spinning Up:
conda create -n spinningup python=3.6
To use Python from the environment you just created, activate the environment with:
source activate spinningup
You Should Know
If you’re new to python environments and package management, this stuff can quickly get confusing or overwhelming, and you’ll probably hit some snags along the way. (Especially, you should expect problems like, “I just installed this thing, but it says it’s not found when I try to use it!”) You may want to read through some clean explanations about what package management is, why it’s a good idea, and what commands you’ll typically have to execute to correctly use it.
FreeCodeCamp has a good explanation worth reading. There’s a shorter description on Towards Data Science which is also helpful and informative. Finally, if you’re an extremely patient person, you may want to read the (dry, but very informative) documentation page from Conda.
Warning
As of November 2018, there appears to be a bug which prevents the Tensorflow pip package from working in Python 3.7. To track, see this Github issue for Tensorflow. As a result, in order to use Spinning Up (which requires Tensorflow), you should use Python 3.6.
Installing Spinning Up¶
git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .
You Should Know
Spinning Up defaults to installing everything in Gym except the MuJoCo environments. In case you run into any trouble with the Gym installation, check out the Gym github page for help. If you want the MuJoCo environments, see the optional installation arguments below.
Check Your Install¶
To see if you've successfully installed Spinning Up, try running PPO in the LunarLander-v2 environment with
python -m spinup.run ppo --hid "[32,32]" --env LunarLander-v2 --exp_name installtest --gamma 0.999
This might run for around 10 minutes, and you can leave it going in the background while you continue reading through documentation. This won't train the agent to completion, but will run it for long enough that you can see some learning progress when the results come in.
After it finishes training, watch a video of the trained policy with
python -m spinup.run test_policy data/installtest/installtest_s0
And plot the results with
python -m spinup.run plot data/installtest/installtest_s0
Installing MuJoCo (Optional)¶
First, go to the mujoco-py github page. Follow the installation instructions in the README, which describe how to install the MuJoCo physics engine and the mujoco-py package (which allows the use of MuJoCo from Python).
You Should Know
In order to use the MuJoCo simulator, you will need to get a MuJoCo license. Free 30-day licenses are available to anyone, and free 1-year licenses are available to full-time students.
Once you have installed MuJoCo, install the corresponding Gym environments with
pip install gym[mujoco,robotics]
And then check that things are working by running PPO in the Walker2d-v2 environment with
python -m spinup.run ppo --hid "[32,32]" --env Walker2d-v2 --exp_name mujocotest
Key Algorithms and Their Implementations¶
What's Included¶
The following algorithms are implemented in Spinning Up:
 Vanilla Policy Gradient (VPG)
 Trust Region Policy Optimization (TRPO)
 Proximal Policy Optimization (PPO)
 Deep Deterministic Policy Gradient (DDPG)
 Twin Delayed DDPG (TD3)
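 Soft Actor-Critic (SAC)
They are all implemented with multi-layer perceptron (MLP) actor-critics, which makes them suitable for fully-observed, non-image-based RL environments such as the Gym Mujoco environments.
Why These Algorithms?¶
We chose the core algorithms in this package to reflect the progress of ideas in the recent history of the field. Currently, the policy learning algorithms that perform best on reliability (stability) and sample efficiency are PPO and SAC. Their design and practical use also expose the trade-off between stability and sample efficiency.
The On-Policy Algorithms¶
Vanilla Policy Gradient (VPG) is the most basic, entry-level algorithm in deep RL, and it predates deep RL by a long way. Its core ideas go back to the late 1980s and early 1990s. It started a line of research that later led to stronger algorithms such as TRPO (2015) and PPO (2017).
All of these algorithms are on-policy: they do not use old data, which makes them comparatively weak on sample efficiency. There is a good reason for this, though: they directly optimize the objective we care about, policy performance. This family trades sample efficiency for stability, and the progression of techniques from VPG to TRPO to PPO has worked to make up the deficit in sample efficiency.
The Off-Policy Algorithms¶
DDPG is as foundational as VPG, although it is much younger. The theory of deterministic policy gradients (DPG), which underlies DDPG, was published in 2014. DDPG is closely related to Q-learning: it concurrently learns a Q-function and a policy, and each is updated to improve the other.
DDPG and Q-Learning are off-policy algorithms. They are able to reuse old data efficiently by exploiting Bellman's equations for optimality (also known as the dynamic programming equations).
The problem is that doing a good job of satisfying Bellman's equations does not guarantee good policy performance. Empirically, it can give good performance and very good sample efficiency, but the absence of guarantees makes algorithms in this class less stable. TD3 and SAC are descendants of DDPG that use a variety of new ideas to mitigate these issues.
Code Format¶
All implementations in Spinning Up follow a standard template. Each algorithm is split into two files:
 an algorithm file, which contains the core logic of the algorithm,
 and a core file, which contains various utilities needed to run the algorithm.
The Algorithm File¶
The algorithm file starts with a class definition for an experience buffer, which stores information from agent-environment interactions.
Next, there is a single function which runs the algorithm and performs the following tasks:
 Logger setup
 Random seed setting
 Environment instantiation
 Making placeholders for the computation graph
 Building the actor-critic computation graph via the actor_critic function passed to the algorithm function
 Instantiating the experience buffer
 Building the loss functions and some other algorithm-specific functions
 Making training ops
 Making the TF Session and initializing parameters
 Setting up model saving through the logger
 Defining functions needed for running the main loop of the algorithm (e.g. the core update function, get action function, and test agent function, depending on the algorithm)
 Running the main loop of the algorithm:
  Run the agent in the environment
  Periodically update the parameters of the agent according to the main equations of the algorithm
  Log key performance metrics and save the agent
Finally, there is code that reads settings from the command line (an ArgumentParser).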
Running Experiments¶
One of the best ways to get a feel for deep RL is to run the algorithms and see how they perform on different tasks. The Spinning Up code library makes small-scale (local) experiments easy to do, and in this section, we'll discuss two ways to run them: either from the command line, or through function calls in scripts.
Launching from the Command Line¶
Spinning Up ships with spinup/run.py, a convenient tool that lets you easily launch any algorithm (with any choices of hyperparameters) from the command line. It also serves as a thin wrapper over the utilities for watching trained policies and plotting, although we will not discuss that functionality on this page (for those details, see the pages on experiment outputs and plotting).
The standard way to run a Spinning Up algorithm from the command line is
python -m spinup.run [algo name] [experiment flags]
eg:
python -m spinup.run ppo --env Walker2d-v2 --exp_name walker
You Should Know
If you are using ZShell: ZShell interprets square brackets as special characters. Spinning Up uses square brackets in a few ways for command line arguments; make sure to escape them, or try the solution recommended here if you want to escape them by default.
Detailed Quickstart Guide
python -m spinup.run ppo --exp_name ppo_ant --env Ant-v2 --clip_ratio 0.1 0.2
    --hid[h] [32,32] [64,32] --act tf.nn.tanh --seed 0 10 20 --dt
    --data_dir path/to/data
runs PPO in the Ant-v2 Gym environment, with various settings controlled by the flags.
clip_ratio, hid, and act are flags to set some algorithm hyperparameters. You can provide multiple values for hyperparameters to run multiple experiments. Check the docs to see what hyperparameters you can set (see the PPO documentation page).
hid and act are special shortcut flags for setting the hidden sizes and activation function for the neural networks trained by the algorithm.
The seed flag sets the seed for the random number generator. RL algorithms have high variance, so try multiple seeds to get a feel for how performance varies.
The dt flag ensures that the save directory names will have timestamps in them (otherwise they don't, unless you set FORCE_DATESTAMP=True in spinup/user_config.py).
The data_dir flag allows you to set the save folder for results. The default value is set by DEFAULT_DATA_DIR in spinup/user_config.py, which will be a subfolder data in the spinningup folder (unless you change it).
Save directory names are based on exp_name and any flags which have multiple values. Instead of the full flag, a shorthand will appear in the directory name. Shorthands can be provided by the user in square brackets after the flag, like --hid[h]; otherwise, shorthands are substrings of the flag (clip_ratio becomes cli). To illustrate, the save directory for the run with clip_ratio=0.1, hid=[32,32], and seed=10 will be:
path/to/data/YY-MM-DD_ppo_ant_cli0-1_h32-32/YY-MM-DD_HH-MM-SS-ppo_ant_cli0-1_h32-32_seed10
Setting Hyperparameters from the Command Line¶
Every hyperparameter in every algorithm can be controlled directly from the command line. If kwarg is a valid keyword arg for the function call of an algorithm, you can set values for it with the flag --kwarg. To find out what keyword args are available, see either the docs page for an algorithm, or try
python -m spinup.run [algo name] --help
to see a readout of the docstring.
You Should Know
Values pass through eval() before being used, so you can describe some functions and objects directly from the command line. For example:
python -m spinup.run ppo --env Walker2d-v2 --exp_name walker --act tf.nn.elu
sets tf.nn.elu as the activation function.
You Should Know
There's some nice handling for kwargs that take dict values. Instead of having to provide
--key dict(v1=value_1, v2=value_2)
you can give
--key:v1 value_1 --key:v2 value_2
to get the same result.
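To make the colon notation concrete, here is a minimal sketch of how flat `key:subkey` names could be folded back into nested dicts. The helper name `unflatten_colon_keys` is hypothetical and written for illustration only; it is not taken from the Spinning Up source, which may handle this differently.

```python
def unflatten_colon_keys(flat):
    """Fold {'key:v1': 1, 'key:v2': 2} into {'key': {'v1': 1, 'v2': 2}}.

    Hypothetical helper for illustration; names without a colon are
    copied through unchanged.
    """
    nested = {}
    for name, value in flat.items():
        parts = name.split(":")
        node = nested
        # Walk (and create) one dict level per prefix before the last colon.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


flat_args = {
    "ac_kwargs:hidden_sizes": [64, 64],
    "ac_kwargs:activation": "tanh",
    "gamma": 0.99,
}
print(unflatten_colon_keys(flat_args))
```

The nested result has the same shape as the `ac_kwargs=dict(...)` argument you would pass when calling an algorithm function from a script.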
Launching Multiple Experiments at Once¶
You can launch multiple experiments, to be executed in series, by simply providing more than one value for a given argument. (An experiment for each possible combination of values will be launched.)
For example, to launch otherwise-equivalent runs with different random seeds (0, 10, and 20), do:
python -m spinup.run ppo --env Walker2d-v2 --exp_name walker --seed 0 10 20
Experiments don’t launch in parallel because they soak up enough resources that executing several at the same time wouldn’t get a speedup.
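The "one experiment per combination of values" rule is a Cartesian product over the value lists. The sketch below shows the idea with the standard library; `expand_grid` is a hypothetical name used only for this illustration, and the real launcher may order or filter configurations differently.

```python
import itertools


def expand_grid(params):
    """Return one config dict per combination of the listed values.

    `params` maps a hyperparameter name to the list of values given for it.
    Hypothetical helper for illustration only.
    """
    keys = list(params)
    value_lists = [params[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


# Two clip ratios times three seeds gives six runs, executed one after another.
grid = expand_grid({"clip_ratio": [0.1, 0.2], "seed": [0, 10, 20]})
for config in grid:
    print(config)
```

Because the count multiplies, adding one more multi-valued flag can lengthen a serial run considerably.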
Special Flags¶
A few flags receive special treatment.
Environment Flag¶

--env, --env_name
string. The name of an environment in the OpenAI Gym. All Spinning Up algorithms are implemented as functions that accept env_fn as an argument, where env_fn must be a callable function that builds a copy of the RL environment. Since the most common use case is Gym environments, though, all of which are built through gym.make(env_name), we allow you to just specify env_name (or env for short) at the command line, which gets converted to a lambda-function that builds the correct gym environment.
Shortcut Flags¶
Some algorithm arguments are relatively long, and we enabled shortcuts for them:

--hid, --ac_kwargs:hidden_sizes
list of ints. Sets the sizes of the hidden layers in the neural networks (policies and value functions).

--act, --ac_kwargs:activation
tf op. The activation function for the neural networks in the actor and critic.
These flags are valid for all current Spinning Up algorithms.
Config Flags¶
These flags are not hyperparameters of any algorithm, but change the experimental configuration in some way.

--cpu, --num_cpu
int. If this flag is set, the experiment is launched with this many processes, one per cpu, connected by MPI. Some algorithms are amenable to this sort of parallelization but not all. An error will be raised if you try setting num_cpu > 1 for an incompatible algorithm. You can also set --num_cpu auto, which will automatically use as many CPUs as are available on the machine.

--exp_name
string. The experiment name. This is used in naming the save directory for each experiment. The default is “cmd” + [algo name].

--data_dir
path. Set the base save directory for this experiment or set of experiments. If none is given, the DEFAULT_DATA_DIR in spinup/user_config.py will be used.

--datestamp
bool. Include date and time in the name for the save directory of the experiment.
Where Results are Saved¶
Results for a particular experiment (a single run of a configuration of hyperparameters) are stored in
data_dir/[outer_prefix]exp_name[suffix]/[inner_prefix]exp_name[suffix]_s[seed]
where
 data_dir is the value of the --data_dir flag (defaults to DEFAULT_DATA_DIR from spinup/user_config.py if --data_dir is not given),
 the outer_prefix is a YY-MM-DD_ timestamp if the --datestamp flag is raised, otherwise nothing,
 the inner_prefix is a YY-MM-DD_HH-MM-SS- timestamp if the --datestamp flag is raised, otherwise nothing,
 and suffix is a special string based on the experiment hyperparameters.
How is Suffix Determined?¶
Suffixes are only included if you run multiple experiments at once, and they only include references to hyperparameters that differ across experiments, except for random seed. The goal is to make sure that results for similar experiments (ones which share all params except seed) are grouped in the same folder.
Suffixes are constructed by combining shorthands for hyperparameters with their values, where a shorthand is either 1) constructed automatically from the hyperparameter name or 2) supplied by the user. The user can supply a shorthand by writing in square brackets after the kwarg flag.
For example, consider:
python -m spinup.run ddpg --env Hopper-v2 --hid[h] [300] [128,128] --act tf.nn.tanh tf.nn.relu
Here, the --hid flag is given a user-supplied shorthand, h. The --act flag is not given a shorthand by the user, so one will be constructed for it automatically.
The suffixes produced in this case are:
_h128-128_ac-actrelu
_h128-128_ac-acttanh
_h300_ac-actrelu
_h300_ac-acttanh
Note that the h was given by the user. The ac-act shorthand was constructed from ac_kwargs:activation (the true name for the act flag).
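The text gives two examples of automatic shorthands: `clip_ratio` becomes `cli`, and the shorthand for `ac_kwargs:activation` is built from `ac` and `act`. One rule that reproduces both is sketched below. Treat it as an assumption: it takes the first three characters of each colon-separated piece, keeps only letters and digits, and joins the pieces with a dash. The function name is hypothetical, and the library's actual rule may differ.

```python
import string


def default_shorthand(key):
    """Build a short tag from a hyperparameter name (illustrative rule only).

    Assumption: first three characters of each colon-separated piece,
    keeping letters and digits, pieces joined with a dash.
    """
    valid = string.ascii_letters + string.digits

    def shear(piece):
        return "".join(ch for ch in piece[:3] if ch in valid)

    return "-".join(shear(piece) for piece in key.split(":"))


print(default_shorthand("clip_ratio"))
print(default_shorthand("ac_kwargs:activation"))
```

Supplying your own shorthand in square brackets, as with `hid[h]`, avoids depending on whatever the automatic rule produces.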
Extra¶
You Don’t Actually Need to Know This One
Each individual algorithm is located in a file spinup/algos/ALGO_NAME/ALGO_NAME.py, and these files can be run directly from the command line with a limited set of arguments (some of which differ from what's available to spinup/run.py). The command line support in the individual algorithm files is essentially vestigial, however, and this is not a recommended way to perform experiments.
This documentation page will not describe those command line calls, and will only describe calls through spinup/run.py.
Launching from Scripts¶
Each algorithm is implemented as a python function, which can be imported directly from the spinup package, eg
>>> from spinup import ppo
See the documentation page for each algorithm for a complete account of possible arguments. These methods can be used to set up specialized custom experiments, for example:
from spinup import ppo
import tensorflow as tf
import gym
env_fn = lambda : gym.make('LunarLander-v2')
ac_kwargs = dict(hidden_sizes=[64,64], activation=tf.nn.relu)
logger_kwargs = dict(output_dir='path/to/output_dir', exp_name='experiment_name')
ppo(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=5000, epochs=250, logger_kwargs=logger_kwargs)
Using ExperimentGrid¶
It’s often useful in machine learning research to run the same algorithm with many possible hyperparameters. Spinning Up ships with a simple tool for facilitating this, called ExperimentGrid.
Consider the example in spinup/examples/bench_ppo_cartpole.py:
from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo
import tensorflow as tf

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--cpu', type=int, default=4)
    parser.add_argument('--num_runs', type=int, default=3)
    args = parser.parse_args()

    # One grid entry per hyperparameter; list values are swept over.
    eg = ExperimentGrid(name='ppo-bench')
    eg.add('env_name', 'CartPole-v0', '', True)
    eg.add('seed', [10*i for i in range(args.num_runs)])
    eg.add('epochs', 10)
    eg.add('steps_per_epoch', 4000)
    eg.add('ac_kwargs:hidden_sizes', [(32,), (64,64)], 'hid')
    eg.add('ac_kwargs:activation', [tf.tanh, tf.nn.relu], '')
    eg.run(ppo, num_cpu=args.cpu)

After making the ExperimentGrid object, parameters are added to it with
eg.add(param_name, values, shorthand, in_name)
where in_name forces a parameter to appear in the experiment name, even if it has the same value across all experiments.
After all parameters have been added,
eg.run(thunk, **run_kwargs)
runs all experiments in the grid (one experiment per valid configuration), by providing the configurations as kwargs to the function thunk. ExperimentGrid.run uses a function named call_experiment to launch thunk, and **run_kwargs specify behaviors for call_experiment. See the documentation page for details.
Except for the absence of shortcut kwargs (you can't use hid for ac_kwargs:hidden_sizes in ExperimentGrid), the basic behavior of ExperimentGrid is the same as running things from the command line. (In fact, spinup.run uses an ExperimentGrid under the hood.)
Experiment Outputs¶
In this section we’ll cover
 what outputs come from Spinning Up algorithm implementations,
 what formats they’re stored in and how they’re organized,
 where they are stored and how you can change that,
 and how to load and run trained policies.
You Should Know
Spinning Up implementations currently have no way to resume training for partially-trained agents. If you consider this feature important, please let us know, or consider it a hacking project!
Algorithm Outputs¶
Each algorithm is set up to save a training run’s hyperparameter configuration, learning progress, trained agent and value functions, and a copy of the environment if possible (to make it easy to load up the agent and environment simultaneously). The output directory contains the following:
Output Directory Structure
simple_save/
    A directory containing everything needed to restore the trained agent and value functions. (Details below.)
config.json
    A dict containing an as-complete-as-possible description of the args and kwargs you used to launch the training function. If you passed in something which can't be serialized to JSON, it should get handled gracefully by the logger, and the config file will represent it with a string. Note: this is meant for record-keeping only. Launching an experiment from a config file is not currently supported.
progress.txt
    A tab-separated value file containing records of the metrics recorded by the logger throughout training, eg Epoch, AverageEpRet, etc.
vars.pkl
    A pickle file containing anything about the algorithm state which should get stored. Currently, all algorithms only use this to save a copy of the environment.
You Should Know
Sometimes environment-saving fails because the environment can't be pickled, and vars.pkl is empty. This is known to be a problem for Gym Box2D environments in older versions of Gym, which can't be saved in this manner.
The simple_save directory contains:
Simple_Save Directory Structure
variables/
    A directory containing outputs from the Tensorflow Saver. See documentation for Tensorflow SavedModel.
model_info.pkl
    A dict containing information (map from key to tensor name) which helps us unpack the saved model after loading.
saved_model.pb
    A protocol buffer, needed for a Tensorflow SavedModel.
You Should Know
The only file in here that you should ever have to use “by hand” is the config.json file. Our agent testing utility will load things from the simple_save/ directory and vars.pkl file, and our plotter interprets the contents of progress.txt, and those are the correct tools for interfacing with these outputs. But there is no tooling for config.json: it's just there so that if you forget what hyperparameters you ran an experiment with, you can double-check.
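Since config.json has no dedicated tooling, reading it by hand takes only a few lines of standard-library Python. The sketch below is illustrative: the helper names are hypothetical, and which keys appear in the file depends on the algorithm and the arguments you launched it with.

```python
import json
import os


def load_config(output_dir):
    """Read the config.json stored in an experiment output directory."""
    with open(os.path.join(output_dir, "config.json")) as f:
        return json.load(f)


def show_hyperparameters(config, keys):
    """Pick out a few entries; keys that were not recorded map to None."""
    return {k: config.get(k) for k in keys}


# Example usage (the path is a placeholder, not a real run):
# config = load_config("path/to/output_directory")
# print(show_hyperparameters(config, ["seed", "gamma"]))
```

Values that could not be serialized to JSON appear as strings, so treat the file as a record rather than something to feed back into a launcher.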
Save Directory Location¶
Experiment results will, by default, be saved in the same directory as the Spinning Up package, in a folder called data:
spinningup/
    data/
        ...
    docs/
        ...
    spinup/
        ...
    LICENSE
    setup.py
You can change the default results directory by modifying DEFAULT_DATA_DIR in spinup/user_config.py.
Loading and Running Trained Policies¶
If Environment Saves Successfully¶
For cases where the environment is successfully saved alongside the agent, it’s a cinch to watch the trained agent act in the environment using:
python -m spinup.run test_policy path/to/output_directory
There are a few flags for options:

-l L, --len=L, default=0
int. Maximum length of test episode / trajectory / rollout. The default of 0 means no maximum episode length: episodes only end when the agent has reached a terminal state in the environment. (Note: setting L=0 will not prevent Gym envs wrapped by TimeLimit wrappers from ending when they reach their preset maximum episode length.)

-n N, --episodes=N, default=100
int. Number of test episodes to run the agent for.

-nr, --norender
Do not render the test episodes to the screen. In this case, test_policy will only print the episode returns and lengths. (Use case: the renderer slows down the testing process, and you just want to get a fast sense of how the agent is performing, so you don't particularly care to watch it.)

-i I, --itr=I, default=-1
int. This is an option for a special case which is not supported by algorithms in this package as-shipped, but which they are easily modified to do. Use case: Sometimes it's nice to watch trained agents from many different points in training (eg watch at iteration 50, 100, 150, etc.). The logger can do this: it can save snapshots of the agent from those different points, so they can be run and watched later. In this case, you use this flag to specify which iteration to run. But again: spinup algorithms by default only save snapshots of the most recent agent, overwriting the old snapshots.
The default value of this flag means “use the latest snapshot.”
To modify an algo so it does produce multiple snapshots, find the following lines (which are present in all of the algorithms):
if (epoch % save_freq == 0) or (epoch == epochs-1):
    logger.save_state({'env': env}, None)
and tweak them to
if (epoch % save_freq == 0) or (epoch == epochs-1):
    logger.save_state({'env': env}, epoch)
Make sure to then also set save_freq to something reasonable (because if it defaults to 1, for instance, you'll flood your output directory with one simple_save folder for each snapshot, which adds up fast).

-d, --deterministic
Another special case, which is only used for SAC. The Spinning Up SAC implementation trains a stochastic policy, but is evaluated using the deterministic mean of the action distribution. test_policy will default to using the stochastic policy trained by SAC, but you should set the deterministic flag to watch the deterministic mean policy (the correct evaluation policy for SAC). This flag is not used for any other algorithms.
Environment Not Found Error¶
If the environment wasn't saved successfully, you can expect test_policy.py to crash with
Traceback (most recent call last):
File "spinup/utils/test_policy.py", line 88, in <module>
run_policy(env, get_action, args.len, args.episodes, not(args.norender))
File "spinup/utils/test_policy.py", line 50, in run_policy
"page on Experiment Outputs for how to handle this situation."
AssertionError: Environment not found!
It looks like the environment wasn't saved, and we can't run the agent in it. :(
Check out the readthedocs page on Experiment Outputs for how to handle this situation.
In this case, watching your agent perform is slightly more of a pain but not impossible, as long as you can recreate your environment easily. Try the following in IPython:
>>> from spinup.utils.test_policy import load_policy, run_policy
>>> import your_env
>>> _, get_action = load_policy('/path/to/output_directory')
>>> env = your_env.make()
>>> run_policy(env, get_action)
Logging data to /tmp/experiments/1536150702/progress.txt
Episode 0 EpRet 163.830 EpLen 93
Episode 1 EpRet 346.164 EpLen 99
...
Using Trained Value Functions¶
The test_policy.py tool doesn't help you look at trained value functions, and if you want to use those, you will have to do some digging by hand. Check the documentation for the restore_tf_graph function for details on how.
Plotting Results¶
Spinning Up ships with a simple plotting utility for interpreting results. Run it with:
python -m spinup.run plot [path/to/output_directory ...] [--legend [LEGEND ...]]
    [--xaxis XAXIS] [--value [VALUE ...]] [--count] [--smooth S]
    [--select [SEL ...]] [--exclude [EXC ...]]
Positional Arguments:

logdir
strings. As many log directories (or prefixes to log directories, which the plotter will autocomplete internally) as you'd like to plot from. Logdirs will be searched recursively for experiment outputs.
You Should Know
The internal autocompleting is really handy! Suppose you have run several experiments, with the aim of comparing performance between different algorithms, resulting in a log directory structure of:
data/
    bench_algo1/
        bench_algo1-seed0/
        bench_algo1-seed10/
    bench_algo2/
        bench_algo2-seed0/
        bench_algo2-seed10/
You can easily produce a graph comparing algo1 and algo2 with:
python spinup/utils/plot.py data/bench_algo
relying on the autocomplete to find both data/bench_algo1 and data/bench_algo2.
Optional Arguments:

-l, --legend=[LEGEND ...]
strings. Optional way to specify legend for the plot. The plotter legend will automatically use the exp_name from the config.json file, unless you tell it otherwise through this flag. This only works if you provide a name for each directory that will get plotted. (Note: this may not be the same as the number of logdir args you provide! Recall that the plotter looks for autocompletes of the logdir args: there may be more than one match for a given logdir prefix, and you will need to provide a legend string for each one of those matches, unless you have removed some of them as candidates via selection or exclusion rules (below).)

-x, --xaxis=XAXIS, default='TotalEnvInteracts'
string. Pick what column from data is used for the x-axis.

-y, --value=[VALUE ...], default='Performance'
strings. Pick what columns from data to graph on the y-axis. Submitting multiple values will produce multiple graphs. Defaults to Performance, which is not an actual output of any algorithm. Instead, Performance refers to either AverageEpRet, the correct performance measure for the on-policy algorithms, or AverageTestEpRet, the correct performance measure for the off-policy algorithms. The plotter will automatically figure out which of AverageEpRet or AverageTestEpRet to report for each separate logdir.

--count
Optional flag. By default, the plotter shows y-values which are averaged across all results that share an exp_name, which is typically a set of identical experiments that only vary in random seed. But if you'd like to see all of those curves separately, use the --count flag.

-s, --smooth=S, default=1
int. Smooth data by averaging it over a fixed window. This parameter says how wide the averaging window will be.

--select=[SEL ...]
strings. Optional selection rule: the plotter will only show curves from logdirs that contain all of these substrings.

--exclude=[EXC ...]
strings. Optional exclusion rule: plotter will only show curves from logdirs that do not contain these substrings.
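The smoothing option described above averages each point over a fixed window. A minimal NumPy sketch of that idea follows. It is an illustration under stated assumptions rather than the plotter's own code: the window is centered, it is clipped to the length of the data, and the edges are normalized by the number of points that actually fall inside the window.

```python
import numpy as np


def smooth(values, window):
    """Centered moving average with edge-aware normalization (sketch)."""
    values = np.asarray(values, dtype=float)
    window = min(int(window), len(values))  # clip so output length matches input
    if window <= 1:
        return values
    kernel = np.ones(window)
    # Sum of the points in each window ...
    sums = np.convolve(values, kernel, mode="same")
    # ... divided by how many points each window really covered.
    counts = np.convolve(np.ones_like(values), kernel, mode="same")
    return sums / counts


returns = [0.0, 3.0, 6.0]
print(smooth(returns, 3))
```

A window of 1 leaves the data untouched, which matches the documented default of no smoothing.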
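Part 1: Key Concepts in RL¶
Welcome to our introduction to reinforcement learning! Here, we aim to acquaint you with
 the common notation,
 a high-level explanation of what RL algorithms do (we mostly avoid the question of how they do it),
 and the core math that underlies the algorithms.
In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.
What Can RL Do?¶
RL methods have been successful in many areas. For example, they have been used to teach computers to control robots in simulation and in the real world.
RL is also famous for powering breakthrough AIs for sophisticated strategy games, most notably Go and Dota, for teaching computers to play Atari games, and for training simulated robots to follow human instructions.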
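Key Concepts and Terminology¶
The main characters of RL are the agent and the environment. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.
The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.
To make the later sections easier to follow, we introduce some terminology:
 states and observations,
 action spaces,
 policies,
 trajectories,
 different formulations of return,
 the RL optimization problem,
 and value functions.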
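States and Observations¶
A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.
In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities.
When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.
You Should Know
RL notation sometimes puts the symbol for state, $s$, in places where it would be technically more appropriate to write the symbol for observation, $o$. Specifically, this happens when talking about how the agent decides on an action: the notation often signals that the action is conditioned on the state, when in practice the action is conditioned on the observation, because the agent does not have access to the state.
In our guide, we follow standard conventions for notation, and it should usually be clear from context which is meant. If something is unclear, please raise an issue! Our goal is to teach, not to confuse.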
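Action Spaces¶
Different environments allow different kinds of actions. The set of all valid actions in a given environment is called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like those where the agent controls a robot in a physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors.
This distinction has profound consequences for methods in deep RL. Some families of algorithms can only be directly applied in one case, and would have to be substantially reworked for the other.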
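Policies¶
A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by $\mu$:
$$a_t = \mu(s_t),$$
or it may be stochastic, in which case it is usually denoted by $\pi$:
$$a_t \sim \pi(\cdot | s_t).$$
Because the policy is essentially the agent's brain, the words "policy" and "agent" are often used interchangeably, for example when we say "the policy is trying to maximize reward."
In deep RL, we deal with parameterized policies: policies whose outputs are computable functions that depend on a set of parameters (for example the weights and biases of a neural network), which we can adjust with an optimization algorithm to change the agent's behavior.
We often denote the parameters of such a policy by $\theta$ or $\phi$, and write this as a subscript on the policy symbol to highlight the connection:
$$a_t = \mu_{\theta}(s_t), \qquad a_t \sim \pi_{\theta}(\cdot | s_t).$$
Deterministic Policies¶
Example: Deterministic Policies. Here is a simple example of a deterministic policy for a continuous action space in TensorFlow: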
obs = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
net = mlp(obs, hidden_dims=(64,64), activation=tf.tanh)
actions = tf.layers.dense(net, units=act_dim, activation=None)
where mlp is a function that stacks multiple dense layers on top of each other with the given sizes and activation.
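For readers without TensorFlow 1.x at hand, the same structure can be sketched in plain NumPy. This is an illustration only, not the project's `mlp`: the helper names `init_mlp` and `mlp_forward`, the initialization scale, and the dimensions are all assumptions made for the example.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    return [
        (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(params, x, activation=np.tanh):
    """Dense hidden layers with an activation, then a linear output layer."""
    for weights, bias in params[:-1]:
        x = activation(x @ weights + bias)
    weights, bias = params[-1]
    return x @ weights + bias  # no activation: actions are unbounded reals


rng = np.random.default_rng(0)
obs_dim, act_dim = 3, 2  # illustrative sizes
params = init_mlp([obs_dim, 64, 64, act_dim], rng)
obs = rng.normal(size=(5, obs_dim))  # a batch of 5 observations
actions = mlp_forward(params, obs)   # same obs always gives the same actions
print(actions.shape)
```

The policy is deterministic because no sampling happens between observation and action: the network output is the action.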
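Stochastic Policies¶
The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies.
Categorical policies are used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.
Two computations are centrally important for using and training stochastic policies:
 sampling actions from the policy,
 and computing log likelihoods of particular actions, $\log \pi_{\theta}(a|s)$.
In what follows, we describe how to do these for both kinds of policy.
Categorical Policies
A categorical policy is like a classifier over discrete actions. You build the neural network for a categorical policy the same way you would for a classifier: the input is the observation, followed by some number of layers (convolutional, densely-connected, or otherwise, depending on the kind of input), and then one final linear layer that gives the logits for each action, followed by a softmax to convert the logits into probabilities.
Sampling. Given the probabilities for each action, frameworks like TensorFlow have built-in tools for sampling. See the documentation for tf.distributions.Categorical or tf.multinomial.
Log-Likelihood. Denote the last layer of probabilities as $P_{\theta}(s)$. It is a vector with as many entries as there are actions, so we can treat the actions as indices for the vector. The log likelihood for an action $a$ can then be obtained by indexing into the vector:
$$\log \pi_{\theta}(a|s) = \log \left[P_{\theta}(s)\right]_a.$$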
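Both categorical computations, sampling and the log-likelihood, can be sketched in NumPy. The logits below are made-up numbers and the function names are illustrative; in practice the logits would come from the final linear layer of the policy network.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def categorical_sample(logits, rng):
    """Draw one action index according to the softmax probabilities."""
    probs = np.exp(log_softmax(logits))
    return int(rng.choice(len(probs), p=probs))


def categorical_log_likelihood(logits, action):
    """Log-probability of `action`: index into the log-probability vector."""
    return log_softmax(logits)[action]


logits = np.array([1.0, 2.0, 0.5])  # illustrative values for three actions
rng = np.random.default_rng(0)
action = categorical_sample(logits, rng)
print(action, categorical_log_likelihood(logits, action))
```

Working with log-softmax directly, rather than taking the log of a softmax output, avoids underflow when one logit dominates.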
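Diagonal Gaussian Policies
A multivariate Gaussian distribution (or multivariate normal distribution) is described by a mean vector, $\mu$, and a covariance matrix, $\Sigma$. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector.
A diagonal Gaussian policy always has a neural network that maps from observations to mean actions, $\mu_{\theta}(s)$. There are two different ways that the covariance matrix is typically represented.
The first way: There is a single vector of log standard deviations, $\log \sigma$, which is not a function of state: the $\log \sigma$ are standalone parameters. (Our implementations of VPG, TRPO, and PPO do it this way.)
The second way: There is a neural network that maps from states to log standard deviations, $\log \sigma_{\theta}(s)$. It may optionally share some layers with the mean network.
Note that in both cases we output log standard deviations instead of standard deviations directly. This is because log stds are free to take on any values in $(-\infty, \infty)$, while stds must be nonnegative. It’s easier to train parameters if you don’t have to enforce those kinds of constraints. The standard deviations can be obtained immediately from the log standard deviations by exponentiating them, so we do not lose anything by representing them this way.
Sampling. Given the mean action $\mu_{\theta}(s)$ and standard deviation $\sigma_{\theta}(s)$, and a vector $z$ of noise from a spherical Gaussian ($z \sim \mathcal{N}(0, I)$), an action sample can be computed with
$$a = \mu_{\theta}(s) + \sigma_{\theta}(s) \odot z,$$
where $\odot$ denotes the elementwise product of two vectors. Standard frameworks have built-in ways to compute the noise vectors, such as tf.random_normal. Alternatively, you can provide the mean and standard deviation directly to a tf.distributions.Normal object and use that to sample.
Log-Likelihood. The log-likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_{\theta}(s)$ and standard deviation $\sigma = \sigma_{\theta}(s)$, is given by
$$\log \pi_{\theta}(a|s) = -\frac{1}{2}\left(\sum_{i=1}^k \left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2\pi \right).$$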
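The two computations described above (sampling and log-likelihood) can be sketched without any deep learning framework. The following is a minimal illustration in plain Python, not the Spinning Up implementation: the function names are made up for this example, and logits, means, and log standard deviations are passed in directly rather than produced by a neural network.

```python
import math
import random

def softmax(logits):
    """Convert logits into action probabilities (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_sample(logits, rng):
    """Sample an action index from the categorical distribution implied by logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def categorical_log_prob(logits, action):
    """Log-likelihood of an action: index into the vector of probabilities."""
    return math.log(softmax(logits)[action])

def gaussian_sample(mu, log_std, rng):
    """a = mu + sigma * z (elementwise), with z drawn from a spherical Gaussian."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]

def gaussian_log_prob(action, mu, log_std):
    """Log-likelihood of a k-dimensional action under a diagonal Gaussian."""
    k = len(action)
    total = sum(((a - m) / math.exp(ls)) ** 2 + 2.0 * ls
                for a, m, ls in zip(action, mu, log_std))
    return -0.5 * (total + k * math.log(2.0 * math.pi))

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
act = categorical_sample([0.0, 1.0, 2.0], rng)
cont_act = gaussian_sample([0.0, 0.0], [-0.5, -0.5], rng)
```

Working with log standard deviations means `math.exp` always yields a positive standard deviation, which is the reason given above for that parameterization.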
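Trajectories¶
A trajectory $\tau$ is a sequence of states and actions in the world,
$$\tau = (s_0, a_0, s_1, a_1, ...).$$
The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$:
$$s_0 \sim \rho_0(\cdot).$$
State transitions (what happens to the world between the state at time $t$, $s_t$, and the state at $t+1$, $s_{t+1}$) are governed by the natural laws of the environment, and depend only on the most recent action, $a_t$. They can be either deterministic,
$$s_{t+1} = f(s_t, a_t),$$
or stochastic,
$$s_{t+1} \sim P(\cdot|s_t, a_t).$$
Actions come from an agent according to its policy.
You Should Know
Trajectories are also frequently called episodes or rollouts.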
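Reward and Return¶
The reward function $R$ is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:
$$r_t = R(s_t, a_t, s_{t+1}),$$
although frequently this is simplified to just a dependence on the current state, $r_t = R(s_t)$, or the state-action pair, $r_t = R(s_t, a_t)$.
The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this can mean a few different things. We’ll notate all of these cases with $R(\tau)$, and it will either be clear from context which case we mean, or it won’t matter (because the same equations will apply to all cases).
One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of $T$ steps:
$$R(\tau) = \sum_{t=0}^T r_t.$$
Another kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained. This formulation includes a discount factor $\gamma \in (0,1)$:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t.$$
Why would we ever want a discount factor, though? Don’t we just want to get all the rewards? We do, but the discount factor is both intuitively appealing and mathematically convenient. On an intuitive level: a reward now is better than a reward later. Mathematically: an infinite-horizon sum of rewards may not converge to a finite value, but with a discount factor and under reasonable conditions, the infinite sum converges.
You Should Know
While the line between these two formulations of return is quite stark in RL formalism, deep RL practice tends to blur it a fair bit. For instance, we frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating value functions.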
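As a small worked example of the two notions of return, the sketch below computes both for a list of rewards. The function names are illustrative, and an infinite-horizon sum is necessarily truncated to the rewards actually observed.

```python
def finite_horizon_return(rewards):
    """Undiscounted return: the plain sum of rewards in a fixed window."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return: reward at step t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
undiscounted = finite_horizon_return(rewards)   # 1 + 1 + 1
discounted = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25

# With a constant reward of 1 and gamma = 0.9, the discounted sum stays bounded
# (it approaches 1 / (1 - gamma)), while the undiscounted sum grows without limit.
long_discounted = discounted_return([1.0] * 1000, 0.9)
```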
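The RL Problem¶
Whatever the choice of return measure (infinite-horizon discounted or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.
To talk about expected return, we first have to talk about probability distributions over trajectories.
Let’s suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a $T$-step trajectory is:
$$P(\tau|\pi) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t) \pi(a_t | s_t).$$
The expected return (for whichever measure), denoted by $J(\pi)$, is then:
$$J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = \underset{\tau\sim \pi}{E}\left[R(\tau)\right].$$
The central optimization problem in RL can then be expressed by
$$\pi^* = \arg \max_{\pi} J(\pi),$$
with $\pi^*$ being the optimal policy.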
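Value Functions¶
It’s often useful to know the value of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.
There are four main functions of note here.
 The On-Policy Value Function, $V^{\pi}(s) = E_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$, which gives the expected return if you start in state $s$ and always act according to policy $\pi$.
 The On-Policy Action-Value Function, $Q^{\pi}(s,a) = E_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$, which gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$.
 The Optimal Value Function, $V^*(s) = \max_{\pi} E_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$, which gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment.
 The Optimal Action-Value Function, $Q^*(s,a) = \max_{\pi} E_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$, which gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment.
You Should Know
When we talk about value functions, if we do not make reference to time-dependence, we only mean expected infinite-horizon discounted return. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time’s up?
You Should Know
There are two key connections between the value function and the action-value function that come up pretty often:
$$V^{\pi}(s) = \underset{a\sim \pi}{E}\left[Q^{\pi}(s,a)\right],$$
and
$$V^*(s) = \max_a Q^* (s,a).$$
These relations follow pretty directly from the definitions just given: can you prove them?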
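The Optimal Q-Function and the Optimal Action¶
There is an important connection between the optimal action-value function $Q^*(s,a)$ and the action selected by the optimal policy. By definition, $Q^*(s,a)$ gives the expected return for starting in state $s$, taking (arbitrary) action $a$, and then acting according to the optimal policy forever after.
The optimal policy in $s$ will select whichever action maximizes the expected return from starting in $s$. As a result, if we have $Q^*$, we can directly obtain the optimal action, $a^*(s)$, via
$$a^*(s) = \arg \max_a Q^* (s,a).$$
Note: there may be multiple actions which maximize $Q^*(s,a)$, in which case all of them are optimal, and the optimal policy may randomly select any of them. But there is always an optimal policy which deterministically selects an action.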
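Bellman Equations¶
All four of the value functions obey special self-consistency equations called Bellman equations. The basic idea behind the Bellman equations is this:
The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.
The Bellman equations for the on-policy value functions are
$$V^{\pi}(s) = \underset{a \sim \pi,\; s'\sim P}{E}\left[r(s,a) + \gamma V^{\pi}(s')\right],$$
$$Q^{\pi}(s,a) = \underset{s'\sim P}{E}\left[r(s,a) + \gamma \underset{a'\sim \pi}{E}\left[Q^{\pi}(s',a')\right]\right],$$
where $s' \sim P$ is shorthand for $s' \sim P(\cdot |s,a)$, indicating that the next state $s'$ is sampled from the environment’s transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot|s)$; and $a' \sim \pi$ is shorthand for $a' \sim \pi(\cdot|s')$.
The Bellman equations for the optimal value functions are
$$V^*(s) = \max_a \underset{s'\sim P}{E}\left[r(s,a) + \gamma V^*(s')\right],$$
$$Q^*(s,a) = \underset{s'\sim P}{E}\left[r(s,a) + \gamma \max_{a'} Q^*(s',a')\right].$$
The crucial difference between the Bellman equations for the on-policy value functions and the optimal value functions is the absence or presence of the $\max$ over actions. Its inclusion reflects the fact that whenever the agent gets to choose its action, in order to act optimally, it has to pick whichever action leads to the highest value.
You Should Know
The term “Bellman backup” comes up quite frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward-plus-next-value.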
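To make the Bellman backup concrete, here is a minimal sketch that repeatedly applies it on a made-up two-state, two-action MDP with deterministic transitions. The MDP, the discount factor, and all names are assumptions chosen for illustration. Iterating the on-policy backup converges to the value function of a fixed policy; iterating the backup with the max over actions converges to the optimal value function.

```python
GAMMA = 0.5
NEXT_STATE = [[0, 1], [1, 0]]        # NEXT_STATE[s][a]: deterministic transitions
REWARD = [[0.0, 1.0], [2.0, 0.0]]    # REWARD[s][a]

def q_from_v(v):
    """Right-hand side of the Bellman equation for each (s, a): r(s,a) + gamma * V(s')."""
    return [[REWARD[s][a] + GAMMA * v[NEXT_STATE[s][a]] for a in range(2)]
            for s in range(2)]

def policy_backup(v, policy):
    """On-policy backup: average the backed-up values under pi(a|s)."""
    q = q_from_v(v)
    return [sum(policy[s][a] * q[s][a] for a in range(2)) for s in range(2)]

def optimal_backup(v):
    """Optimal backup: take the max over actions instead of the average."""
    q = q_from_v(v)
    return [max(q[s]) for s in range(2)]

def iterate(backup, n_iters=200):
    """Apply a backup repeatedly, starting from all-zero values."""
    v = [0.0, 0.0]
    for _ in range(n_iters):
        v = backup(v)
    return v

uniform = [[0.5, 0.5], [0.5, 0.5]]   # uniformly random policy
v_pi = iterate(lambda v: policy_backup(v, uniform))
v_star = iterate(optimal_backup)
q_star = q_from_v(v_star)
greedy = [max(range(2), key=lambda a: q_star[s][a]) for s in range(2)]
```

In this toy problem the greedy actions read off from `q_star` are "move" in state 0 and "stay" in state 1, which is the argmax relationship between the optimal Q-function and the optimal action.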
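Advantage Functions¶
Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.
The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after. Mathematically, the advantage function is defined by
$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).$$
You Should Know
We’ll discuss this more later, but the advantage function is crucially important to policy gradient methods.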
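A tiny numeric illustration of this definition, using made-up Q-values and action probabilities for a single state: the state value is the policy-weighted average of the Q-values, and the advantages are the Q-values measured relative to it.

```python
def state_value(q_row, policy_row):
    """V(s) as the expectation of Q(s, a) over actions drawn from pi(.|s)."""
    return sum(p * q for p, q in zip(policy_row, q_row))

def advantages(q_row, policy_row):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(q_row, policy_row)
    return [q - v for q in q_row]

q_row = [1.0, 3.0, 2.0]            # hypothetical Q(s, a) for three actions
policy_row = [0.5, 0.25, 0.25]     # hypothetical pi(a|s)
adv = advantages(q_row, policy_row)

# Under the policy itself, the advantage averages out to zero.
mean_adv = sum(p * a for p, a in zip(policy_row, adv))
```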
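Part 2: Kinds of RL Algorithms¶
Now that we’ve gone through the basics of RL terminology and notation, we can cover some richer material: the landscape of algorithms in modern RL, and the kinds of trade-offs that go into algorithm design.
A Taxonomy of RL Algorithms¶
We’ll start with a disclaimer: it’s really quite hard to draw an accurate, all-encompassing taxonomy of algorithms in the modern RL space, because the material does not fit well into a tree structure. Also, to make something that fits in one article and is reasonably digestible, we have to omit quite a bit of more advanced material (exploration, transfer learning, meta learning, etc). That said, our goals here are
 to highlight the most foundational design choices in deep RL algorithms about what to learn and how to learn it,
 to expose the trade-offs in those choices,
 and to introduce a few prominent modern algorithms in that context.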
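Model-Free vs Model-Based RL¶
One of the most important branching points in an RL algorithm is the question of whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transitions and rewards.
The main upside to having a model is that it allows the agent to plan by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. A particularly famous example of this approach is AlphaZero. When this works, it can result in a substantial improvement in sample efficiency over methods that don’t have a model.
The main downside is that a ground-truth model of the environment is usually not available to the agent. If an agent wants to use a model in this case, it has to learn the model purely from experience, which creates several challenges. The biggest challenge is that bias in the model can be exploited by the agent, resulting in an agent which performs well with respect to the learned model, but behaves sub-optimally (or terribly) in the real environment. Model-learning is fundamentally hard, so even intense effort, being willing to throw lots of time and compute at it, can fail to pay off.
Algorithms which use a model are called model-based methods, and those that don’t are called model-free. While model-free methods forego the potential gains in sample efficiency from using a model, they tend to be easier to implement and tune. As of the time of writing (September 2018), model-free methods are more popular and have been more extensively developed and tested than model-based methods.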
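What to Learn¶
Another critical branching point in an RL algorithm is the question of what to learn. The list of usual suspects includes
 policies, either stochastic or deterministic,
 action-value functions (Q-functions),
 value functions,
 and/or environment models.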
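What to Learn in Model-Free RL¶
There are two main approaches to representing and training agents with model-free RL:
Policy Optimization. Methods in this family represent a policy explicitly as $\pi_{\theta}(a|s)$. They optimize the parameters $\theta$ either directly by gradient ascent on the performance objective $J(\pi_{\theta})$, or indirectly, by maximizing local approximations of $J(\pi_{\theta})$. This optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy. Policy optimization also usually involves learning an approximator $V_{\phi}(s)$ for the on-policy value function $V^{\pi}(s)$, which gets used in figuring out how to update the policy.
Examples of policy optimization methods include A2C / A3C [2] and PPO [3].
Q-Learning. Methods in this family learn an approximator $Q_{\theta}(s,a)$ for the optimal action-value function, $Q^*(s,a)$. Typically they use an objective function based on the Bellman equation. This optimization is almost always performed off-policy, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained. The corresponding policy is obtained via the connection between $Q^*$ and $\pi^*$: the actions taken by the Q-learning agent are given by
$$a(s) = \arg \max_a Q_{\theta}(s,a).$$
Examples of Q-learning methods include DQN [8] and C51 [9].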
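To show what "an objective based on the Bellman equation" looks like in the simplest possible setting, here is a sketch of a single tabular Q-learning update. This is the classical tabular rule rather than a deep method like DQN, and the table values, learning rate, and discount factor are arbitrary choices for illustration.

```python
def greedy_action(q_table, state):
    """Act according to a(s) = argmax_a Q(s, a)."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])

def q_learning_update(q_table, s, a, r, s_next, done, gamma=0.9, lr=0.5):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    bootstrap = 0.0 if done else max(q_table[s_next])
    target = r + gamma * bootstrap
    q_table[s][a] += lr * (target - q_table[s][a])
    return target

# Two states, two actions; the transition (s=0, a=1, r=1, s'=1) could have been
# collected by any behavior policy, which is what makes the update off-policy.
q_table = [[0.0, 0.0], [0.0, 2.0]]
target = q_learning_update(q_table, s=0, a=1, r=1.0, s_next=1, done=False)
best = greedy_action(q_table, 0)
```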
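Trade-offs Between Policy Optimization and Q-Learning. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training $Q_{\theta}$ to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.
Interpolating Between Policy Optimization and Q-Learning. Serendipitously, policy optimization and Q-learning are not incompatible (and under some circumstances, it turns out, equivalent), and there exists a range of algorithms that live in between the two extremes. Algorithms that live on this spectrum are able to carefully trade off between the strengths and weaknesses of either side. Examples include DDPG [5] and SAC [7].
[1]  For more information about how and why Q-learning methods can fail, see 1) this classic paper by Tsitsiklis and van Roy, 2) the (much more recent) review by Szepesvari (in section 4.3.2), and 3) chapter 11 of Sutton and Barto, especially section 11.3 (on “the deadly triad” of function approximation, bootstrapping, and off-policy data, together causing instability in value-learning algorithms). 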
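What to Learn in Model-Based RL¶
Unlike model-free RL, model-based RL does not fall into a small number of easy-to-define clusters of methods: there are many overlapping ways of using models. We’ll give a few examples, but the list is far from exhaustive. In each case, the model may either be given or learned.
Background: Pure Planning. The most basic approach never explicitly represents the policy, and instead uses pure planning techniques like model-predictive control (MPC) to select actions. In MPC, each time the agent observes the environment, it computes a plan which is optimal with respect to the model, where the plan describes all actions to take over some fixed window of time after the present. (Future rewards beyond the horizon may be considered by the planning algorithm through the use of a learned value function.) The agent then executes the first action of the plan, and immediately discards the rest of it. It computes a new plan each time it prepares to interact with the environment, to avoid using an action from a plan with a shorter-than-desired planning horizon.
 The MBMF work explores MPC with learned environment models on some standard benchmark tasks for deep RL.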
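The plan-execute-discard loop of MPC can be sketched on a toy problem. Everything below is an assumption made for illustration: a one-dimensional integer world, a known model that doubles as the real environment, and exhaustive search over short action sequences standing in for a real planner.

```python
from itertools import product

ACTIONS = (-1, 0, 1)
GOAL = 3

def model_step(state, action):
    """Assumed known model: move along a line, reward is minus the distance to GOAL."""
    next_state = state + action
    reward = -abs(GOAL - next_state)
    return next_state, reward

def plan(state, horizon):
    """Return the best action sequence over a fixed horizon by exhaustive search."""
    best_plan, best_return = None, float("-inf")
    for candidate in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in candidate:
            s, r = model_step(s, a)
            total += r
        if total > best_return:
            best_plan, best_return = candidate, total
    return best_plan

def run_mpc(start, n_steps, horizon=3):
    """Replan at every step, execute only the first action, discard the rest."""
    state, visited = start, [start]
    for _ in range(n_steps):
        first_action = plan(state, horizon)[0]
        # In this toy, the real environment is identical to the model.
        state, _ = model_step(state, first_action)
        visited.append(state)
    return visited

trajectory = run_mpc(start=0, n_steps=5)
```

With a learned model instead of a perfect one, the same loop would plan against the model's predictions but step the real environment, which is where model bias can hurt.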
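Expert Iteration. A straightforward follow-on to pure planning involves using and learning an explicit representation of the policy, $\pi_{\theta}(a|s)$. The agent uses a planning algorithm (like Monte Carlo Tree Search) in the model, generating candidate actions for the plan by sampling from its current policy. The planning algorithm produces an action which is better than what the policy alone would have produced, hence it is an “expert” relative to the policy. The policy is afterwards updated to produce an action more like the planning algorithm’s output.
Data Augmentation for Model-Free Methods. Use a model-free RL algorithm to train a policy or Q-function, but either 1) augment real experiences with fictitious ones in updating the agent, or 2) use only fictitious experience for updating the agent.
 See MBVE for an example of augmenting real experiences with fictitious ones.
 See World Models for an example of using purely fictitious experience to train the agent, which they call “training in the dream.”
Embedding Planning Loops into Policies. Another approach embeds the planning procedure directly into a policy as a subroutine, so that complete plans become side information for the policy, while training the output of the policy with any standard model-free algorithm. The key concept is that in this framework, the policy can learn to choose how and when to use the plans. This makes model bias less of a problem, because if the model is bad for planning in some states, the policy can simply learn to ignore it.
 See I2A for an example of agents being endowed with this style of imagination.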
Links to Algorithms in Taxonomy¶
[2]  A2C / A3C (Asynchronous Advantage Actor-Critic): Mnih et al, 2016 
[3]  PPO (Proximal Policy Optimization): Schulman et al, 2017 
[4]  TRPO (Trust Region Policy Optimization): Schulman et al, 2015 
[5]  DDPG (Deep Deterministic Policy Gradient): Lillicrap et al, 2015 
[6]  TD3 (Twin Delayed DDPG): Fujimoto et al, 2018 
[7]  SAC (Soft Actor-Critic): Haarnoja et al, 2018 
[8]  DQN (Deep Q-Networks): Mnih et al, 2013 
[9]  C51 (Categorical 51-Atom DQN): Bellemare et al, 2017 
[10]  QR-DQN (Quantile Regression DQN): Dabney et al, 2017 
[11]  HER (Hindsight Experience Replay): Andrychowicz et al, 2017 
[12]  World Models: Ha and Schmidhuber, 2018 
[13]  I2A (Imagination-Augmented Agents): Weber et al, 2017 
[14]  MBMF (Model-Based RL with Model-Free Fine-Tuning): Nagabandi et al, 2017 
[15]  MBVE (Model-Based Value Expansion): Feinberg et al, 2018 
[16]  AlphaZero: Silver et al, 2017 
Part 3: Intro to Policy Optimization¶
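In this section, we’ll discuss the mathematical foundations of policy optimization algorithms, and connect the material to sample code. We will cover three key results in the theory of policy gradients:
 the simplest equation describing the gradient of policy performance with respect to policy parameters,
 a rule which allows us to drop useless terms from that expression,
 and a rule which allows us to add useful terms to that expression.
In the end, we’ll tie those results together and describe the advantage-based expression for the policy gradient: the version we use in our Vanilla Policy Gradient implementation.
Deriving the Simplest Policy Gradient¶
Here, we consider the case of a stochastic, parameterized policy, $\pi_{\theta}$. We aim to maximize the expected return $J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[R(\tau)\right]$. For the purposes of this derivation, we’ll take $R(\tau)$ to give the finite-horizon undiscounted return, but the derivation for the infinite-horizon discounted return setting is almost identical.
We would like to optimize the policy by gradient ascent, e.g.
$$\theta_{k+1} = \theta_k + \alpha \left. \nabla_{\theta} J(\pi_{\theta}) \right|_{\theta_k}.$$
The gradient of policy performance, $\nabla_{\theta} J(\pi_{\theta})$, is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms. (Examples include Vanilla Policy Gradient and TRPO. PPO is often referred to as a policy gradient algorithm, though this is slightly inaccurate.)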
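To actually use this algorithm, we need an expression for the policy gradient which we can numerically compute. This involves two steps: 1) deriving the analytical gradient of policy performance, which turns out to have the form of an expected value, and then 2) forming a sample estimate of that expected value, which can be computed with data from a finite number of agent-environment interaction steps.
In this subsection, we’ll find the simplest form of that expression. In later subsections, we’ll show how to improve on the simplest form to get the version we actually use in standard policy gradient implementations.
We’ll begin by laying out a few facts which are useful for deriving the analytical gradient.
1. Probability of a Trajectory. The probability of a trajectory $\tau = (s_0, a_0, ..., s_{T+1})$ given that actions come from $\pi_{\theta}$ is
$$P(\tau|\theta) = \rho_0 (s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t) \pi_{\theta}(a_t |s_t).$$
2. The Log-Derivative Trick. The log-derivative trick is based on a simple rule from calculus: the derivative of $\log x$ with respect to $x$ is $1/x$. When rearranged and combined with chain rule, we get:
$$\nabla_{\theta} P(\tau | \theta) = P(\tau | \theta) \nabla_{\theta} \log P(\tau | \theta).$$
3. Log-Probability of a Trajectory. The log-prob of a trajectory is just
$$\log P(\tau|\theta) = \log \rho_0 (s_0) + \sum_{t=0}^{T} \left( \log P(s_{t+1}|s_t, a_t) + \log \pi_{\theta}(a_t |s_t)\right).$$
4. Gradients of Environment Functions. The environment has no dependence on $\theta$, so gradients of $\rho_0(s_0)$, $P(s_{t+1}|s_t, a_t)$, and $R(\tau)$ are zero.
5. Grad-Log-Prob of a Trajectory. The gradient of the log-prob of a trajectory is thus
$$\nabla_{\theta} \log P(\tau | \theta) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t).$$
Putting it all together, we derive the following:
Derivation for Basic Policy Gradient
$$\begin{aligned}
\nabla_{\theta} J(\pi_{\theta}) &= \nabla_{\theta} \underset{\tau \sim \pi_{\theta}}{E}\left[R(\tau)\right] \\
&= \nabla_{\theta} \int_{\tau} P(\tau|\theta) R(\tau) && \text{(expand expectation)} \\
&= \int_{\tau} \nabla_{\theta} P(\tau|\theta) R(\tau) && \text{(bring gradient under integral)} \\
&= \int_{\tau} P(\tau|\theta) \nabla_{\theta} \log P(\tau|\theta) R(\tau) && \text{(log-derivative trick)} \\
&= \underset{\tau \sim \pi_{\theta}}{E}\left[\nabla_{\theta} \log P(\tau|\theta) R(\tau)\right] && \text{(return to expectation form)} \\
&= \underset{\tau \sim \pi_{\theta}}{E}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau)\right] && \text{(expression for grad-log-prob)}
\end{aligned}$$
This is an expectation, which means that we can estimate it with a sample mean. If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,...,N}$ where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_{\theta}$, the policy gradient can be estimated with
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau),$$
where $|\mathcal{D}|$ is the number of trajectories in $\mathcal{D}$ (here, $N$).
This last expression is the simplest version of the computable expression we desired. Assuming that we have represented our policy in a way which allows us to calculate $\nabla_{\theta} \log \pi_{\theta}(a|s)$, and if we are able to run the policy in the environment to collect the trajectory dataset, we can compute the policy gradient and take an update step.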
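The sample estimate can be worked through by hand for a policy simple enough that its grad-log-prob has a closed form. The sketch below assumes a tabular softmax policy with one row of logits per state, for which the gradient of the log-probability of action a in state s with respect to that row is one-hot(a) minus the action probabilities. The data, names, and step size are made up for illustration, and no automatic differentiation is involved.

```python
import math

def softmax(logits):
    """Action probabilities from one row of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for a tabular softmax policy."""
    probs = softmax(theta[s])
    grad = [[0.0] * len(row) for row in theta]
    for j in range(len(probs)):
        grad[s][j] = (1.0 if j == a else 0.0) - probs[j]
    return grad

def policy_gradient_estimate(theta, trajectories):
    """Average over trajectories of sum_t grad-log-prob(a_t|s_t) * R(tau).

    Each trajectory is (steps, ret): a list of (state, action) pairs and R(tau).
    """
    g = [[0.0] * len(row) for row in theta]
    for steps, ret in trajectories:
        for s, a in steps:
            glp = grad_log_prob(theta, s, a)
            for i in range(len(g)):
                for j in range(len(g[i])):
                    g[i][j] += glp[i][j] * ret
    n = len(trajectories)
    return [[x / n for x in row] for row in g]

theta = [[0.0, 0.0], [0.0, 0.0]]                       # 2 states, 2 actions
data = [([(0, 1), (1, 0)], 2.0), ([(0, 0), (1, 0)], 0.0)]
g_hat = policy_gradient_estimate(theta, data)

# One gradient ascent step with an arbitrary step size of 0.1.
theta_new = [[t + 0.1 * g for t, g in zip(t_row, g_row)]
             for t_row, g_row in zip(theta, g_hat)]
```

After the step, the action taken in state 0 during the high-return trajectory becomes more likely, which is the intended effect of ascending the estimated gradient.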
Implementing the Simplest Policy Gradient¶
We give a short Tensorflow implementation of this simple version of the policy gradient algorithm in spinup/examples/pg_math/1_simple_pg.py. (It can also be viewed on github.) It is only 122 lines long, so we highly recommend reading through it in depth. While we won’t go through the entirety of the code here, we’ll highlight and explain a few important pieces.
1. Making the Policy Network.
# make core of policy network
obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
logits = mlp(obs_ph, sizes=hidden_sizes+[n_acts])

# make action selection op (outputs int actions, sampled from policy)
actions = tf.squeeze(tf.multinomial(logits=logits,num_samples=1), axis=1)

This block builds a feedforward neural network categorical policy. (See the Stochastic Policies section in Part 1 for a refresher.) The logits tensor can be used to construct log-probabilities and probabilities for actions, and the actions tensor samples actions based on the probabilities implied by logits.
2. Making the Loss Function.
# make loss function whose gradient, for the right data, is policy gradient
weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)
action_masks = tf.one_hot(act_ph, n_acts)
log_probs = tf.reduce_sum(action_masks * tf.nn.log_softmax(logits), axis=1)
loss = -tf.reduce_mean(weights_ph * log_probs)

In this block, we build a “loss” function for the policy gradient algorithm. When the right data is plugged in, the gradient of this loss is equal to the policy gradient. The right data means a set of (state, action, weight) tuples collected while acting according to the current policy, where the weight for a state-action pair is the return from the episode to which it belongs. (Although as we will show in later subsections, there are other values you can plug in for the weight which also work correctly.)
You Should Know
Even though we describe this as a loss function, it is not a loss function in the typical sense from supervised learning. There are two main differences from standard loss functions.
1. The data distribution depends on the parameters. A loss function is usually defined on a fixed data distribution which is independent of the parameters we aim to optimize. Not so here, where the data must be sampled on the most recent policy.
2. It doesn’t measure performance. A loss function usually evaluates the performance metric that we care about. Here, we care about expected return, $J(\pi_{\theta})$, but our “loss” function does not approximate this at all, even in expectation. This “loss” function is only useful to us because, when evaluated at the current parameters, with data generated by the current parameters, it has the negative gradient of performance.
But after that first step of gradient descent, there is no more connection to performance. This means that minimizing this “loss” function, for a given batch of data, has no guarantee whatsoever of improving expected return. You can send this loss to $-\infty$ and policy performance could crater; in fact, it usually will. Sometimes a deep RL researcher might describe this outcome as the policy “overfitting” to a batch of data. This is descriptive, but should not be taken literally because it does not refer to generalization error.
We raise this point because it is common for ML practitioners to interpret a loss function as a useful signal during training: “if the loss goes down, all is well.” In policy gradients, this intuition is wrong, and you should only care about average return. The loss function means nothing.
You Should Know
The approach used here to make the log_probs tensor (creating an action mask, and using it to select out particular log probabilities) only works for categorical policies. It does not work in general.
3. Running One Epoch of Training.
# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # render first episode of each epoch
    finished_rendering_this_epoch = False

    # collect experience by acting in the environment with current policy
    while True:

        # rendering
        if not(finished_rendering_this_epoch):
            env.render()

        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = sess.run(actions, {obs_ph: obs.reshape(1,-1)})[0]
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # won't render again this epoch
            finished_rendering_this_epoch = True

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    batch_loss, _ = sess.run([loss, train_op],
                             feed_dict={
                                obs_ph: np.array(batch_obs),
                                act_ph: np.array(batch_acts),
                                weights_ph: np.array(batch_weights)
                             })
    return batch_loss, batch_rets, batch_lens

The train_one_epoch() function runs one “epoch” of policy gradient, which we define to be
 the experience collection step (L62-97 of the file), where the agent acts for some number of episodes in the environment using the most recent policy, followed by
 a single policy gradient update step (L99-105 of the file).
The main loop of the algorithm just repeatedly calls train_one_epoch().
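Expected Grad-Log-Prob Lemma¶
In this subsection, we will derive an intermediate result which is extensively used throughout the theory of policy gradients. We will call it the Expected Grad-Log-Prob (EGLP) lemma. [1]
EGLP Lemma. Suppose that $P_{\theta}$ is a parameterized probability distribution over a random variable, $x$. Then:
$$\underset{x \sim P_{\theta}}{E}\left[\nabla_{\theta} \log P_{\theta}(x)\right] = 0.$$
Proof
Recall that all probability distributions are normalized:
$$\int_x P_{\theta}(x) = 1.$$
Take the gradient of both sides of the normalization condition:
$$\nabla_{\theta} \int_x P_{\theta}(x) = \nabla_{\theta} 1 = 0.$$
Use the log derivative trick to get:
$$\begin{aligned}
0 &= \nabla_{\theta} \int_x P_{\theta}(x) \\
&= \int_x \nabla_{\theta} P_{\theta}(x) \\
&= \int_x P_{\theta}(x) \nabla_{\theta} \log P_{\theta}(x) \\
\therefore 0 &= \underset{x \sim P_{\theta}}{E}\left[\nabla_{\theta} \log P_{\theta}(x)\right].
\end{aligned}$$
[1]  The author of this article is not aware of this lemma being given a standard name anywhere in the literature. But given how often it comes up, it seems pretty worthwhile to give it some kind of name for ease of reference. 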
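The lemma is easy to check numerically for a distribution whose grad-log-prob is known in closed form. The sketch below assumes a softmax distribution over three outcomes, parameterized by its logits, where the gradient of the log-probability of outcome x with respect to logit j is (1 if j == x else 0) minus the probability of j. The logit values are arbitrary.

```python
import math

def softmax(logits):
    """Probabilities of each outcome under a softmax distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_grad_log_prob(logits):
    """Exact expectation over x of the gradient of log P(x) w.r.t. the logits."""
    probs = softmax(logits)
    n = len(logits)
    expectation = [0.0] * n
    for x in range(n):              # sum over outcomes, weighted by P(x)
        for j in range(n):          # gradient component for logit j
            grad_j = (1.0 if j == x else 0.0) - probs[j]
            expectation[j] += probs[x] * grad_j
    return expectation

eglp = expected_grad_log_prob([0.3, -1.2, 2.0])   # every entry is zero up to rounding
```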
Don’t Let the Past Distract You¶
Examine our most recent expression for the policy gradient:
$$\nabla_{\theta} J(\pi_{\theta}) = \underset{\tau \sim \pi_{\theta}}{E}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau)\right].$$
Taking a step with this gradient pushes up the log-probabilities of each action in proportion to $R(\tau)$, the sum of all rewards ever obtained. But this doesn’t make much sense.
Agents should really only reinforce actions on the basis of their consequences. Rewards obtained before taking an action have no bearing on how good that action was: only rewards that come after.
It turns out that this intuition shows up in the math, and we can show that the policy gradient can also be expressed by
$$\nabla_{\theta} J(\pi_{\theta}) = \underset{\tau \sim \pi_{\theta}}{E}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})\right].$$
In this form, actions are only reinforced based on rewards obtained after they are taken.
We’ll call this form the “reward-to-go policy gradient,” because the sum of rewards after a point in a trajectory,
$$\hat{R}_t \doteq \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}),$$
is called the reward-to-go from that point, and this policy gradient expression depends on the reward-to-go from state-action pairs.
You Should Know
But how is this better? A key problem with policy gradients is how many sample trajectories are needed to get a low-variance sample estimate for them. The formula we started with included terms for reinforcing actions proportional to past rewards, all of which had zero mean, but nonzero variance: as a result, they would just add noise to sample estimates of the policy gradient. By removing them, we reduce the number of sample trajectories needed.
An (optional) proof of this claim can be found here, and it ultimately depends on the EGLP lemma.
Implementing Reward-to-Go Policy Gradient¶
We give a short Tensorflow implementation of the reward-to-go policy gradient in spinup/examples/pg_math/2_rtg_pg.py. (It can also be viewed on github.)
The only thing that has changed from 1_simple_pg.py is that we now use different weights in the loss function. The code modification is very slight: we add a new function, and change two other lines. The new function is:
def reward_to_go(rews):
    n = len(rews)
    rtgs = np.zeros_like(rews)
    for i in reversed(range(n)):
        rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0)
    return rtgs

And then we tweak the old L86-87 of 1_simple_pg.py from:
# the weight for each logprob(a|s) is R(tau)
batch_weights += [ep_ret] * ep_len

to (L93-94 of 2_rtg_pg.py):
# the weight for each logprob(a_t|s_t) is reward-to-go from t
batch_weights += list(reward_to_go(ep_rews))
Baselines in Policy Gradients¶
An immediate consequence of the EGLP lemma is that for any function $b$ which only depends on state,
$$\underset{a_t \sim \pi_{\theta}}{E}\left[\nabla_{\theta} \log \pi_{\theta}(a_t|s_t) b(s_t)\right] = 0.$$
This allows us to add or subtract any number of terms like this from our expression for the policy gradient, without changing it in expectation:
$$\nabla_{\theta} J(\pi_{\theta}) = \underset{\tau \sim \pi_{\theta}}{E}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \left(\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\right)\right].$$
Any function $b$ used in this way is called a baseline.
The most common choice of baseline is the on-policy value function $V^{\pi}(s_t)$. Recall that this is the average return an agent gets if it starts in state $s_t$ and then acts according to policy $\pi$ for the rest of its life.
Empirically, the choice $b(s_t) = V^{\pi}(s_t)$ has the desirable effect of reducing variance in the sample estimate for the policy gradient. This results in faster and more stable policy learning. It is also appealing from a conceptual angle: it encodes the intuition that if an agent gets what it expected, it should “feel” neutral about it.
You Should Know
In practice, $V^{\pi}(s_t)$ cannot be computed exactly, so it has to be approximated. This is usually done with a neural network, $V_{\phi}(s_t)$, which is updated concurrently with the policy (so that the value network always approximates the value function of the most recent policy).
The simplest method for learning $V_{\phi}$, used in most implementations of policy optimization algorithms (including VPG, TRPO, PPO, and A2C), is to minimize a mean-squared-error objective:
$$\phi_k = \arg \min_{\phi} \underset{s_t, \hat{R}_t \sim \pi_k}{E}\left[\left( V_{\phi}(s_t) - \hat{R}_t \right)^2\right],$$
where $\pi_k$ is the policy at epoch $k$. This is done with one or more steps of gradient descent, starting from the previous value parameters $\phi_{k-1}$.
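The mean-squared-error fit can be sketched with a value function far simpler than a neural network. The example below assumes a linear value function of a scalar state with two parameters, hand-written gradients, and made-up reward-to-go targets; the learning rate and number of steps are arbitrary.

```python
def value(phi, s):
    """Linear value function: V_phi(s) = phi[0] + phi[1] * s."""
    return phi[0] + phi[1] * s

def mse(phi, states, targets):
    """Mean-squared error between predicted values and reward-to-go targets."""
    return sum((value(phi, s) - r) ** 2 for s, r in zip(states, targets)) / len(states)

def fit_value_function(phi_prev, states, targets, lr=0.05, n_steps=1000):
    """Gradient descent on the MSE objective, starting from the previous parameters."""
    phi = list(phi_prev)
    n = len(states)
    for _ in range(n_steps):
        g0 = sum(2.0 * (value(phi, s) - r) for s, r in zip(states, targets)) / n
        g1 = sum(2.0 * (value(phi, s) - r) * s for s, r in zip(states, targets)) / n
        phi[0] -= lr * g0
        phi[1] -= lr * g1
    return phi

states = [0.0, 1.0, 2.0, 3.0]
targets = [6.0, 5.0, 4.0, 3.0]     # hypothetical reward-to-go, here exactly 6 - s
phi_prev = [0.0, 0.0]
phi_new = fit_value_function(phi_prev, states, targets)
```

Starting each fit from the previous parameters, as the text describes, means only a few gradient steps are typically taken per epoch; the many steps here are only to make the toy fit converge.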
Other Forms of the Policy Gradient¶
What we have seen so far is that the policy gradient has the general form
$$\nabla_{\theta} J(\pi_{\theta}) = \underset{\tau \sim \pi_{\theta}}{E}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \Phi_t\right],$$
where $\Phi_t$ could be any of
$$\Phi_t = R(\tau),$$
or
$$\Phi_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}),$$
or
$$\Phi_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t).$$
All of these choices lead to the same expected value for the policy gradient, despite having different variances. It turns out that there are two more valid choices of weights $\Phi_t$ which are important to know.
1. On-Policy Action-Value Function. The choice
$$\Phi_t = Q^{\pi_{\theta}}(s_t, a_t)$$
is also valid. See this page for an (optional) proof of this claim.
2. The Advantage Function. Recall that the advantage of an action, defined by $A^{\pi}(s_t,a_t) = Q^{\pi}(s_t,a_t) - V^{\pi}(s_t)$, describes how much better or worse it is than other actions on average (relative to the current policy). This choice,
$$\Phi_t = A^{\pi_{\theta}}(s_t, a_t)$$
is also valid. The proof is that it’s equivalent to using $\Phi_t = Q^{\pi_{\theta}}(s_t, a_t)$ and then using a value function baseline, which we are always free to do.
You Should Know
The formulation of policy gradients with advantage functions is extremely common, and there are many different ways of estimating the advantage function used by different algorithms.
You Should Know
For a more detailed treatment of this topic, you should read the paper on Generalized Advantage Estimation (GAE), which goes into depth about different choices of $\Phi_t$ in the background sections.
That paper then goes on to describe GAE, a method for approximating the advantage function in policy optimization algorithms which enjoys widespread use. For instance, Spinning Up’s implementations of VPG, TRPO, and PPO make use of it. As a result, we strongly advise you to study it.
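As a rough sketch of what a GAE-style estimator computes, the function below follows the recursion from the GAE paper as we understand it: a one-step TD residual is formed at each timestep and then accumulated backwards with weight gamma times lambda. The function name, argument layout, and default hyperparameters are assumptions for illustration, not Spinning Up's implementation.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory.

    values[t] is an estimate of V(s_t); last_value is the estimate for the state
    after the final step (0.0 if the episode terminated there).
    """
    n = len(rewards)
    adv = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        next_value = values[t + 1] if t + 1 < n else last_value
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # discounted sum of residuals
        adv[t] = running
    return adv

rewards = [1.0, 2.0, 3.0]

# With gamma = lam = 1 and a zero value function, the estimate is the reward-to-go.
rtg_like = gae_advantages(rewards, [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)

# With lam = 0, each estimate is just the one-step TD residual.
one_step = gae_advantages(rewards, [1.0, 1.0, 1.0], gamma=1.0, lam=0.0)
```

The two limiting cases show how the lambda parameter interpolates between a high-variance estimate (the full reward-to-go minus a baseline) and a lower-variance one that leans on the learned value function.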
Recap¶
In this chapter, we described the basic theory of policy gradient methods and connected some of the early results to code examples. The interested student should continue from here by studying how the later results (value function baselines and the advantage formulation of policy gradients) translate into Spinning Up’s implementation of Vanilla Policy Gradient.
Spinning Up as a Deep RL Researcher¶
By Joshua Achiam, October 13th, 2018
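If you’re an aspiring deep RL researcher, you’ve probably heard all kinds of things about deep RL by this point. You know that it’s hard and it doesn’t always work. That even when you’re following a recipe, reproducibility is a challenge. And that if you’re starting from scratch, the learning curve is incredibly steep. There are a lot of great resources out there, but the material is new enough that there’s not a clear, well-charted path to mastery. The goal of this column is to help you get past the initial hurdle, and give you a clear sense of how to spin up as a deep RL researcher. In particular, it outlines a useful curriculum for building foundational knowledge, interleaved with directions that may be suitable for research.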
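The Right Background¶
Build up a solid mathematical background. From probability and statistics, feel comfortable with random variables, Bayes’ theorem, chain rule of probability, expected values, standard deviations, and importance sampling. From multivariate calculus, understand gradients and (optionally, but it’ll help) Taylor series expansions.
Build up a general knowledge of deep learning. You don’t need to know every single special trick and architecture, but the basics help. Know about standard architectures (MLPs, LSTMs, GRUs, conv layers, resnets, attention mechanisms), common regularizers (weight decay, dropout), normalization (batch norm, layer norm, weight norm), and optimizers (SGD, momentum SGD, Adam, and others). Know what the reparameterization trick is.
Become familiar with at least one deep learning library. Tensorflow or PyTorch would be a good place to start. You don’t need to know how to do everything, but you should feel pretty confident in implementing a simple program to do supervised learning.
Get comfortable with the main concepts and terminology in RL. Know what states, actions, trajectories, policies, rewards, value functions, and action-value functions are. If you’re unfamiliar, read the introduction material in this project; it’s also worth checking out the RL-Intro from the OpenAI Hackathon, or the overview by Lilian Weng. Optionally, if you’re the sort of person who enjoys mathematical theory, study up on the math of monotonic improvement theory (which forms the basis for advanced policy gradient algorithms), or classical RL algorithms (which, despite being superseded by deep RL algorithms, contain valuable insights that sometimes drive new research).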
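Learn by Doing¶
Write your own implementations. You should implement as many of the core deep RL algorithms from scratch as you can, with the aim of writing the shortest correct implementation of each. This is by far the best way to develop an understanding of how they work, as well as intuitions for their specific performance characteristics.
Simplicity is critical. You should organize your efforts so that you implement the simplest algorithms first, and only gradually introduce complexity. If you start off trying to build something with too many moving parts, odds are good that it will break and you’ll lose weeks trying to debug it. This is a common failure mode for people who are new to deep RL, and if you find yourself stuck in it, don’t be discouraged, but do try to change tack and work on a simpler algorithm instead, before returning to the more complex thing later.
Which algorithms? You should probably start with vanilla policy gradient (also called REINFORCE), DQN, A2C (the synchronous version of A3C), PPO (the variant with the clipped objective), and DDPG, approximately in that order. The simplest versions of all of these can be written in just a few hundred lines of code (ballpark 250-300), and some of them even less (for example, a no-frills version of VPG can be written in about 80 lines). Write single-threaded code before you try writing parallelized versions of these algorithms. (Do try to parallelize at least one.)
Focus on understanding. Writing working RL code requires clear, detail-oriented understanding of the algorithms. This is because broken RL code almost always fails silently, where the code appears to run fine except that the agent never learns how to solve the task. Usually the problem is that something is being calculated with the wrong equation, or on the wrong distribution, or data is being piped into the wrong place. Sometimes the only way to find these bugs is to read the code with a critical eye, know exactly what it should be doing, and find where it deviates from the correct behavior. Developing that knowledge requires you to engage with both academic literature and other existing implementations (when possible), so a good amount of your time should be spent on that reading.
What to look for in papers: When implementing an algorithm based on a paper, read that paper thoroughly.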
But don’t overfit to paper details. Sometimes, the paper prescribes the use of more tricks than are strictly necessary, so be a bit wary of this, and try out simplifications where possible. For example, the original DDPG paper suggests a complex neural network architecture and initialization scheme, as well as batch normalization. These aren’t strictly necessary, and some of the best-reported results for DDPG use simpler networks. As another example, the original A3C paper uses asynchronous updates from the various actor-learners, but it turns out that synchronous updates work about as well.
Don’t overfit to existing implementations either. Study existing implementations for inspiration, but be careful not to overfit to the engineering details of those implementations. RL libraries frequently make choices for abstraction that are good for code reuse between algorithms, but which are unnecessary if you’re only writing a single algorithm or supporting a single use case.
Iterate fast in simple environments. To debug your implementations, try them with simple environments where learning should happen quickly, like CartPole-v0, InvertedPendulum-v0, FrozenLake-v0, and HalfCheetah-v2 (with a short time horizon: only 100 or 250 steps instead of the full 1000) from the OpenAI Gym. Don’t try to run an algorithm in Atari or a complex Humanoid environment if you haven’t first verified that it works on the simplest possible toy task. Your ideal experiment turnaround-time at the debug stage is <5 minutes (on your local machine) or slightly longer but not much. These small-scale experiments don’t require any special hardware, and can be run without too much trouble on CPUs.
If it doesn’t work, assume there’s a bug. Spend a lot of effort searching for bugs before you resort to tweaking hyperparameters: usually it’s a bug. Bad hyperparameters can significantly degrade RL performance, but if you’re using hyperparameters similar to the ones in papers and standard implementations, those will probably not be the issue. Also worth keeping in mind: sometimes things will work in one environment even when you have a breaking bug, so make sure to test in more than one environment once your results look promising.
Measure everything. Do a lot of instrumenting to see what’s going on under-the-hood. The more stats about the learning process you read out at each iteration, the easier it is to debug; after all, you can’t tell it’s broken if you can’t see that it’s breaking. I personally like to look at the mean/std/min/max for cumulative rewards, episode lengths, and value function estimates, along with the losses for the objectives, and the details of any exploration parameters (like mean entropy for stochastic policy optimization, or current epsilon for epsilon-greedy as in DQN). Also, watch videos of your agent’s performance every now and then; this will give you some insights you wouldn’t get otherwise.
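A minimal sketch of the kind of per-iteration readout described above, using only the Python standard library. The helper name and the key format are made up for this example; real projects typically route such statistics through a logger.

```python
import statistics

def summarize(name, values):
    """Return mean/std/min/max of a list of per-episode measurements."""
    return {
        name + "/mean": statistics.mean(values),
        name + "/std": statistics.pstdev(values),   # population standard deviation
        name + "/min": min(values),
        name + "/max": max(values),
    }

# Hypothetical measurements from one epoch of training.
epoch_stats = {}
epoch_stats.update(summarize("ep_ret", [1.0, 2.0, 3.0]))
epoch_stats.update(summarize("ep_len", [10, 12, 14]))
```

Printing a dictionary like `epoch_stats` once per epoch makes it easier to spot, for instance, returns that stay flat while the loss keeps moving.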
Scale experiments when things work. After you have an implementation of an RL algorithm that seems to work correctly in the simplest environments, test it out on harder environments. Experiments at this stage will take longer, on the order of somewhere between a few hours and a couple of days, depending. Specialized hardware (like a beefy GPU or a 32-core machine) might be useful at this point, and you should consider looking into cloud computing resources like AWS or GCE.
Keep these habits! These habits are worth keeping beyond the stage where you’re just learning about deep RL—they will accelerate your research!
Developing a Research Project¶
Once you feel reasonably comfortable with the basics in deep RL, you should start pushing on the boundaries and doing research. To get there, you’ll need an idea for a project.
Start by exploring the literature to become aware of topics in the field. There are a wide range of topics you might find interesting: sample efficiency, exploration, transfer learning, hierarchy, memory, model-based RL, meta learning, and multi-agent, to name a few. If you’re looking for inspiration, or just want to get a rough sense of what’s out there, check out Spinning Up’s key papers list. Find a paper that you enjoy on one of these subjects (something that inspires you) and read it thoroughly. Use the related work section and citations to find closely-related papers and do a deep dive in the literature. You’ll start to figure out where the unsolved problems are and where you can make an impact.
Approaches to idea-generation: There are many different ways to start thinking about ideas for projects, and the frame you choose influences how the project might evolve and what risks it will face. Here are a few examples:
Frame 1: Improving on an Existing Approach. This is the incrementalist angle, where you try to get performance gains in an established problem setting by tweaking an existing algorithm. Reimplementing prior work is super helpful here, because it exposes you to the ways that existing algorithms are brittle and could be improved. A novice will find this the most accessible frame, but it can also be worthwhile for researchers at any level of experience. While some researchers find incrementalism less exciting, some of the most impressive achievements in machine learning have come from work of this nature.
Because projects like these are tied to existing methods, they are by nature narrowly scoped and can wrap up quickly (a few months), which may be desirable (especially when starting out as a researcher). But this also sets up the risks: it’s possible that the tweaks you have in mind for an algorithm may fail to improve it, in which case, unless you come up with more tweaks, the project is just over and you have no clear signal on what to do next.
Frame 2: Focusing on Unsolved Benchmarks. Instead of thinking about how to improve an existing method, you aim to succeed on a task that no one has solved before. For example: achieving perfect generalization from training levels to test levels in the Sonic domain or Gym Retro. When you hammer away at an unsolved task, you might try a wide variety of methods, including prior approaches and new ones that you invent for the project. It is possible for a novice to approach this kind of problem, but there will be a steeper learning curve.
Projects in this frame have a broad scope and can go on for a while (several months to a year-plus). The main risk is that the benchmark is unsolvable without a substantial breakthrough, meaning that it would be easy to spend a lot of time without making any progress on it. But even if a project like this fails, it often leads the researcher to many new insights that become fertile soil for the next project.
Frame 3: Create a New Problem Setting. Instead of thinking about existing methods or current grand challenges, think of an entirely different conceptual problem that hasn’t been studied yet. Then, figure out how to make progress on it. For projects along these lines, a standard benchmark probably doesn’t exist yet, and you will have to design one. This can be a huge challenge, but it’s worth embracing—great benchmarks move the whole field forward.
Problems in this frame come up when they come up—it’s hard to go looking for them.
Avoid reinventing the wheel. When you come up with a good idea that you want to start testing, that’s great! But while you’re still in the early stages with it, do the most thorough check you can to make sure it hasn’t already been done. It can be pretty disheartening to get halfway through a project, and only then discover that there’s already a paper about your idea. It’s especially frustrating when the work is concurrent, which happens from time to time! But don’t let that deter you, and definitely don’t let it motivate you to plant flags with not-quite-finished research and overclaim the merits of the partial work. Do good research and finish out your projects with complete and thorough investigations, because that’s what counts, and by far what matters most in the long run.
Doing Rigorous Research in RL¶
Now you’ve come up with an idea, and you’re fairly certain it hasn’t been done. You use the skills you’ve developed to implement it and you start testing it out on standard domains. It looks like it works! But what does that mean, and how well does it have to work to be important? This is one of the hardest parts of research in deep RL. In order to validate that your proposal is a meaningful contribution, you have to rigorously prove that it actually gets a performance benefit over the strongest possible baseline algorithm—whatever currently achieves SOTA (state of the art) on your test domains. If you’ve invented a new test domain, so there’s no previous SOTA, you still need to try out whatever the most reliable algorithm in the literature is that could plausibly do well in the new test domain, and then you have to beat that.
Set up fair comparisons. If you implement your baseline from scratch—as opposed to comparing against another paper’s numbers directly—it’s important to spend as much time tuning your baseline as you spend tuning your own algorithm. This will make sure that comparisons are fair. Also, do your best to hold “all else equal” even if there are substantial differences between your algorithm and the baseline. For example, if you’re investigating architecture variants, keep the number of model parameters approximately equal between your model and the baseline. Under no circumstances handicap the baseline! It turns out that the baselines in RL are pretty strong, and getting big, consistent wins over them can be tricky or require some good insight in algorithm design.
Remove stochasticity as a confounder. Beware of random seeds making things look stronger or weaker than they really are, so run everything for many random seeds (at least 3, but if you want to be thorough, do 10 or more). This is really important and deserves a lot of emphasis: deep RL seems fairly brittle with respect to random seed in a lot of common use cases. There’s potentially enough variance that two different groups of random seeds can yield learning curves with differences so significant that they look like they don’t come from the same distribution at all (see figure 10 here).
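To make the advice about random seeds concrete, here is a minimal sketch of summarizing learning curves across seeds by their per-epoch mean and sample standard deviation. The function name `aggregate_over_seeds` and the numbers in `curves` are made up for illustration; they are not results from any real experiment.

```python
import statistics

def aggregate_over_seeds(curves):
    """curves: list of per-seed learning curves (equal-length lists of returns).

    Returns (means, stds): one mean and one sample std dev per epoch.
    """
    if len(curves) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    means, stds = [], []
    for returns_at_epoch in zip(*curves):  # one tuple of per-seed returns per epoch
        means.append(statistics.mean(returns_at_epoch))
        stds.append(statistics.stdev(returns_at_epoch))
    return means, stds

# Hypothetical returns for three seeds over four epochs (illustrative only).
curves = [
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 44.0],
    [8.0, 22.0, 27.0, 36.0],
]
means, stds = aggregate_over_seeds(curves)
for epoch, (m, s) in enumerate(zip(means, stds)):
    print(f"epoch {epoch}: mean={m:.1f} std={s:.1f}")
```

Plotting the mean with a shaded band of one standard deviation makes it easier to see whether two methods differ by more than their seed-to-seed variation.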
Run high-integrity experiments. Don’t just take the results from the best or most interesting runs to use in your paper. Instead, launch new, final experiments for all of the methods that you intend to compare (if you are comparing against your own baseline implementations), and precommit to report on whatever comes out of that. This is to enforce a weak form of preregistration: you use the tuning stage to come up with your hypotheses, and you use the final runs to come up with your conclusions.
Check each claim separately. Another critical aspect of doing research is to run an ablation analysis. Any method you propose is likely to have several key design decisions—like architecture choices or regularization techniques, for instance—each of which could separately impact performance. The claim you’ll make in your work is that those design decisions collectively help, but this is really a bundle of several claims in disguise: one for each such design element. By systematically evaluating what would happen if you were to swap them out with alternate design choices, or remove them entirely, you can figure out how to correctly attribute credit for the benefits your method confers. This lets you make each separate claim with a measure of confidence, and increases the overall strength of your work.
Closing Thoughts¶
Deep RL is an exciting, fast-moving field, and we need as many people as possible to go through the open problems and make progress on them. Hopefully, you feel a bit more prepared to be a part of it after reading this! And whenever you’re ready, let us know.
PS: Other Resources¶
Consider reading through these other informative articles about growing as a researcher or engineer in this field:
Advice for Short-term Machine Learning Research Projects, by Tim Rocktäschel, Jakob Foerster and Greg Farquhar.
ML Engineering for AI Safety & Robustness: a Google Brain Engineer’s Guide to Entering the Field, by Catherine Olsson and 80,000 Hours.
References¶
[1]  Deep Reinforcement Learning Doesn’t Work Yet, Alex Irpan, 2018
[2]  Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control, Islam et al, 2017 
[3]  Deep Reinforcement Learning that Matters, Henderson et al, 2017 
[4]  Lessons Learned Reproducing a Deep Reinforcement Learning Paper, Matthew Rahtz, 2018 
[5]  UCL Course on RL 
[6]  Berkeley Deep RL Course 
[7]  Deep RL Bootcamp 
[8]  Nuts and Bolts of Deep RL, John Schulman 
[9]  Stanford Deep Learning Tutorial: Multi-Layer Neural Network
[10]  The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy, 2015 
[11]  LSTM: A Search Space Odyssey, Greff et al, 2015 
[12]  Understanding LSTM Networks, Chris Olah, 2015 
[13]  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al, 2014 (GRU paper) 
[14]  Conv Nets: A Modular Perspective, Chris Olah, 2014 
[15]  Stanford CS231n, Convolutional Neural Networks for Visual Recognition 
[16]  Deep Residual Learning for Image Recognition, He et al, 2015 (ResNets) 
[17]  Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al, 2014 (Attention mechanisms) 
[18]  Attention Is All You Need, Vaswani et al, 2017 
[19]  A Simple Weight Decay Can Improve Generalization, Krogh and Hertz, 1992 
[20]  Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al, 2014 
[21]  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe and Szegedy, 2015 
[22]  Layer Normalization, Ba et al, 2016 
[23]  Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, Salimans and Kingma, 2016 
[24]  Stanford Deep Learning Tutorial: Stochastic Gradient Descent 
[25]  Adam: A Method for Stochastic Optimization, Kingma and Ba, 2014 
[26]  An overview of gradient descent optimization algorithms, Sebastian Ruder, 2016 
[27]  Auto-Encoding Variational Bayes, Kingma and Welling, 2013 (Reparameterization trick)
[28]  Tensorflow 
[29]  PyTorch 
[30]  Spinning Up in Deep RL: Introduction to RL, Part 1 
[31]  RL-Intro Slides from OpenAI Hackathon, Josh Achiam, 2018
[32]  A (Long) Peek into Reinforcement Learning, Lilian Weng, 2018 
[33]  Optimizing Expectations, John Schulman, 2016 (Monotonic improvement theory) 
[34]  Algorithms for Reinforcement Learning, Csaba Szepesvari, 2009 (Classic RL Algorithms) 
[35]  Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al, 2016 
[36]  Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013 (DQN) 
[37]  OpenAI Baselines: ACKTR & A2C 
[38]  Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016 (A3C) 
[39]  Proximal Policy Optimization Algorithms, Schulman et al, 2017 (PPO) 
[40]  Continuous Control with Deep Reinforcement Learning, Lillicrap et al, 2015 (DDPG) 
[41]  RL-Intro Policy Gradient Sample Code, Josh Achiam, 2018
[42]  OpenAI Baselines 
[43]  rllab 
[44]  OpenAI Gym 
[45]  OpenAI Retro Contest 
[46]  OpenAI Gym Retro 
[47]  Center for Open Science, explaining what preregistration means in the context of scientific experiments. 
Key Papers in Deep RL¶
What follows is a list of papers in deep RL that are worth reading. This is far from comprehensive, but should provide a useful starting point for someone looking to do research in the field.
1. Model-Free RL¶
a. Deep Q-Learning¶
[1]  Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN.
[2]  Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning.
[3]  Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. Algorithm: Dueling DQN.
[4]  Deep Reinforcement Learning with Double Q-learning, Hasselt et al, 2015. Algorithm: Double DQN.
[5]  Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER). 
[6]  Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. Algorithm: Rainbow DQN. 
b. Policy Gradients¶
[7]  Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. Algorithm: A3C. 
[8]  Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO. 
[9]  High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.
[10]  Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty.
[11]  Emergence of Locomotion Behaviours in Rich Environments, Heess et al, 2017. Algorithm: PPO-Penalty.
[12]  Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. Algorithm: ACKTR.
[13]  Sample Efficient Actor-Critic with Experience Replay, Wang et al, 2016. Algorithm: ACER.
[14]  Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.
c. Deterministic Policy Gradients¶
[15]  Deterministic Policy Gradient Algorithms, Silver et al, 2014. Algorithm: DPG. 
[16]  Continuous Control With Deep Reinforcement Learning, Lillicrap et al, 2015. Algorithm: DDPG. 
[17]  Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, 2018. Algorithm: TD3.
d. Distributional RL¶
[18]  A Distributional Perspective on Reinforcement Learning, Bellemare et al, 2017. Algorithm: C51. 
[19]  Distributional Reinforcement Learning with Quantile Regression, Dabney et al, 2017. Algorithm: QR-DQN.
[20]  Implicit Quantile Networks for Distributional Reinforcement Learning, Dabney et al, 2018. Algorithm: IQN. 
[21]  Dopamine: A Research Framework for Deep Reinforcement Learning, Anonymous, 2018. Contribution: Introduces Dopamine, a code repository containing implementations of DQN, C51, IQN, and Rainbow. Code link. 
e. Policy Gradients with Action-Dependent Baselines¶
[22]  Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic, Gu et al, 2016. Algorithm: Q-Prop.
[23]  Action-dependent Control Variates for Policy Optimization via Stein’s Identity, Liu et al, 2017. Algorithm: Stein Control Variates.
[24]  The Mirage of Action-Dependent Baselines in Reinforcement Learning, Tucker et al, 2018. Contribution: interestingly, critiques and reevaluates claims from earlier papers (including Q-Prop and Stein control variates) and finds important methodological errors in them.
f. Path-Consistency Learning¶
[25]  Bridging the Gap Between Value and Policy Based Reinforcement Learning, Nachum et al, 2017. Algorithm: PCL. 
[26]  Trust-PCL: An Off-Policy Trust Region Method for Continuous Control, Nachum et al, 2017. Algorithm: Trust-PCL.
g. Other Directions for Combining Policy-Learning and Q-Learning¶
[27]  Combining Policy Gradient and Q-learning, O’Donoghue et al, 2016. Algorithm: PGQL.
[28]  The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning, Gruslys et al, 2017. Algorithm: Reactor.
[29]  Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning, Gu et al, 2017. Algorithm: IPG.
[30]  Equivalence Between Policy Gradients and Soft Q-Learning, Schulman et al, 2017. Contribution: Reveals a theoretical link between these two families of RL algorithms.
h. Evolutionary Algorithms¶
[31]  Evolution Strategies as a Scalable Alternative to Reinforcement Learning, Salimans et al, 2017. Algorithm: ES. 
2. Exploration¶
a. Intrinsic Motivation¶
[32]  VIME: Variational Information Maximizing Exploration, Houthooft et al, 2016. Algorithm: VIME. 
[33]  Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.
[34]  Count-Based Exploration with Neural Density Models, Ostrovski et al, 2017. Algorithm: PixelCNN-based Pseudocounts.
[35]  #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, Tang et al, 2016. Algorithm: Hash-based Counts.
[36]  EX2: Exploration with Exemplar Models for Deep Reinforcement Learning, Fu et al, 2017. Algorithm: EX2. 
[37]  Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).
[38]  Large-Scale Study of Curiosity-Driven Learning, Burda et al, 2018. Contribution: Systematic analysis of how surprisal-based intrinsic motivation performs in a wide variety of environments.
[39]  Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND. 
b. Unsupervised RL¶
[40]  Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC. 
[41]  Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN. 
[42]  Variational Option Discovery Algorithms, Achiam et al, 2018. Algorithm: VALOR. 
3. Transfer and Multitask RL¶
[43]  Progressive Neural Networks, Rusu et al, 2016. Algorithm: Progressive Networks. 
[44]  Universal Value Function Approximators, Schaul et al, 2015. Algorithm: UVFA. 
[45]  Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, 2016. Algorithm: UNREAL. 
[46]  The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously, Cabi et al, 2017. Algorithm: IU Agent. 
[47]  PathNet: Evolution Channels Gradient Descent in Super Neural Networks, Fernando et al, 2017. Algorithm: PathNet. 
[48]  Mutual Alignment Transfer Learning, Wulfmeier et al, 2017. Algorithm: MATL. 
[49]  Learning an Embedding Space for Transferable Robot Skills, Hausman et al, 2018. 
[50]  Hindsight Experience Replay, Andrychowicz et al, 2017. Algorithm: Hindsight Experience Replay (HER). 
4. Hierarchy¶
[51]  Strategic Attentive Writer for Learning Macro-Actions, Vezhnevets et al, 2016. Algorithm: STRAW.
[52]  FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al, 2017. Algorithm: Feudal Networks 
[53]  Data-Efficient Hierarchical Reinforcement Learning, Nachum et al, 2018. Algorithm: HIRO.
5. Memory¶
[54]  Model-Free Episodic Control, Blundell et al, 2016. Algorithm: MFEC.
[55]  Neural Episodic Control, Pritzel et al, 2017. Algorithm: NEC. 
[56]  Neural Map: Structured Memory for Deep Reinforcement Learning, Parisotto and Salakhutdinov, 2017. Algorithm: Neural Map. 
[57]  Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne et al, 2018. Algorithm: MERLIN.
[58]  Relational Recurrent Neural Networks, Santoro et al, 2018. Algorithm: RMC. 
6. Model-Based RL¶
a. Model is Learned¶
[59]  Imagination-Augmented Agents for Deep Reinforcement Learning, Weber et al, 2017. Algorithm: I2A.
[60]  Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning, Nagabandi et al, 2017. Algorithm: MBMF.
[61]  Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning, Feinberg et al, 2018. Algorithm: MVE.
[62]  Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion, Buckman et al, 2018. Algorithm: STEVE.
[63]  Model-Ensemble Trust-Region Policy Optimization, Kurutach et al, 2018. Algorithm: ME-TRPO.
[64]  Model-Based Reinforcement Learning via Meta-Policy Optimization, Clavera et al, 2018. Algorithm: MB-MPO.
[65]  Recurrent World Models Facilitate Policy Evolution, Ha and Schmidhuber, 2018. 
b. Model is Given¶
[66]  Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver et al, 2017. Algorithm: AlphaZero.
[67]  Thinking Fast and Slow with Deep Learning and Tree Search, Anthony et al, 2017. Algorithm: ExIt. 
7. Meta-RL¶
[68]  RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning, Duan et al, 2016. Algorithm: RL^2. 
[69]  Learning to Reinforcement Learn, Wang et al, 2016. 
[70]  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al, 2017. Algorithm: MAML.
[71]  A Simple Neural Attentive Meta-Learner, Mishra et al, 2018. Algorithm: SNAIL.
8. Scaling RL¶
[72]  Accelerated Methods for Deep Reinforcement Learning, Stooke and Abbeel, 2018. Contribution: Systematic analysis of parallelization in deep RL across algorithms. 
[73]  IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, Espeholt et al, 2018. Algorithm: IMPALA.
[74]  Distributed Prioritized Experience Replay, Horgan et al, 2018. Algorithm: Ape-X.
[75]  Recurrent Experience Replay in Distributed Reinforcement Learning, Anonymous, 2018. Algorithm: R2D2. 
[76]  RLlib: Abstractions for Distributed Reinforcement Learning, Liang et al, 2017. Contribution: A scalable library of RL algorithm implementations. Documentation link. 
9. RL in the Real World¶
[77]  Benchmarking Reinforcement Learning Algorithms on Real-World Robots, Mahmood et al, 2018.
[78]  Learning Dexterous In-Hand Manipulation, OpenAI, 2018.
[79]  QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation, Kalashnikov et al, 2018. Algorithm: QT-Opt.
[80]  Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform, Gauci et al, 2018. 
10. Safety¶
[81]  Concrete Problems in AI Safety, Amodei et al, 2016. Contribution: establishes a taxonomy of safety problems, serving as an important jumping-off point for future research. We need to solve these!
[82]  Deep Reinforcement Learning From Human Preferences, Christiano et al, 2017. Algorithm: LFP. 
[83]  Constrained Policy Optimization, Achiam et al, 2017. Algorithm: CPO. 
[84]  Safe Exploration in Continuous Action Spaces, Dalal et al, 2018. Algorithm: DDPG+Safety Layer. 
[85]  Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, Saunders et al, 2017. Algorithm: HIRL. 
[86]  Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning, Eysenbach et al, 2017. Algorithm: Leave No Trace. 
11. Imitation Learning and Inverse Reinforcement Learning¶
[87]  Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy, Ziebart, 2010. Contributions: Crisp formulation of maximum entropy IRL.
[88]  Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Finn et al, 2016. Algorithm: GCL. 
[89]  Generative Adversarial Imitation Learning, Ho and Ermon, 2016. Algorithm: GAIL. 
[90]  DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic.
[91]  Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow, Peng et al, 2018. Algorithm: VAIL. 
[92]  One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL, Le Paine et al, 2018. Algorithm: MetaMimic.
12. Reproducibility, Analysis, and Critique¶
[93]  Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al, 2016. Contribution: rllab. 
[94]  Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control, Islam et al, 2017. 
[95]  Deep Reinforcement Learning that Matters, Henderson et al, 2017. 
[96]  Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods, Henderson et al, 2018. 
[97]  Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?, Ilyas et al, 2018. 
[98]  Simple Random Search Provides a Competitive Approach to Reinforcement Learning, Mania et al, 2018. 
13. Bonus: Classic Papers in RL Theory or Review¶
[99]  Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contributions: Established policy gradient theorem and showed convergence of policy gradient algorithm for arbitrary policy classes. 
[100]  An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. Contributions: Variety of convergence results and counterexamples for value-learning methods in RL.
[101]  Reinforcement Learning of Motor Skills with Policy Gradients, Peters and Schaal, 2008. Contributions: Thorough review of policy gradient methods at the time, many of which are still serviceable descriptions of deep RL methods. 
[102]  Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. Contributions: Early roots for monotonic improvement theory, later leading to theoretical justification for TRPO and other algorithms. 
[103]  A Natural Policy Gradient, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL. 
[104]  Algorithms for Reinforcement Learning, Szepesvari, 2009. Contributions: Unbeatable reference on RL before deep RL, containing foundations and theoretical background. 
Exercises¶
Problem Set 1: Basics of Implementation¶
Exercise 1.1: Gaussian Log-Likelihood
Path to Exercise. spinup/exercises/problem_set_1/exercise1_1.py
Path to Solution. spinup/exercises/problem_set_1_solutions/exercise1_1_soln.py
Instructions. Write a function which takes in Tensorflow symbols for the means and log stds of a batch of diagonal Gaussian distributions, along with a Tensorflow placeholder for (previously-generated) samples from those distributions, and returns a Tensorflow symbol for computing the log likelihoods of those samples.
You may find it useful to review the formula given in this section of the RL introduction.
Implement your solution in exercise1_1.py, and run that file to automatically check your work.
Evaluation Criteria. Your solution will be checked by comparing outputs against a known-good implementation, using a batch of random inputs.
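As a numeric sanity check while working on this exercise, the standard log-likelihood of a single sample $x$ under a $k$-dimensional diagonal Gaussian with mean $\mu$ and log standard deviation $\log \sigma$ is $-\frac{1}{2}\left(\sum_i \left[\frac{(x_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right] + k \log 2\pi\right)$. The plain-Python sketch below evaluates it for one sample; the function name `diag_gaussian_logp` is hypothetical, and this is not the Tensorflow solution the exercise asks for.

```python
import math

def diag_gaussian_logp(x, mu, log_std):
    """Log-likelihood of one sample x under a diagonal Gaussian.

    x, mu, log_std are equal-length lists of floats (one entry per dimension).
    """
    k = len(x)
    total = 0.0
    for xi, mi, lsi in zip(x, mu, log_std):
        z = (xi - mi) / math.exp(lsi)  # standardized residual
        total += z * z + 2.0 * lsi     # squared residual plus 2*log(sigma)
    return -0.5 * (total + k * math.log(2.0 * math.pi))

# A standard normal evaluated at its mean: log(1/sqrt(2*pi)).
print(diag_gaussian_logp([0.0], [0.0], [0.0]))
```

Comparing a few hand-picked inputs against a reference like this can help localize a sign or broadcasting mistake before running the automatic checker.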
Exercise 1.2: Policy for PPO
Path to Exercise. spinup/exercises/problem_set_1/exercise1_2.py
Path to Solution. spinup/exercises/problem_set_1_solutions/exercise1_2_soln.py
Instructions. Implement an MLP diagonal Gaussian policy for PPO.
Implement your solution in exercise1_2.py, and run that file to automatically check your work.
Evaluation Criteria. Your solution will be evaluated by running for 20 epochs in the InvertedPendulum-v2 Gym environment, and this should take in the ballpark of 3-5 minutes (depending on your machine, and other processes you are running in the background). The bar for success is reaching an average score of over 500 in the last 5 epochs, or getting to a score of 1000 (the maximum possible score) in the last 5 epochs.
Exercise 1.3: Computation Graph for TD3
Path to Exercise. spinup/exercises/problem_set_1/exercise1_3.py
Path to Solution. spinup/algos/td3/td3.py
Instructions. Implement the core computation graph for the TD3 algorithm.
As starter code, you are given the entirety of the TD3 algorithm except for the computation graph. Find “YOUR CODE HERE” to begin.
You may find it useful to review the pseudocode in our page on TD3.
Implement your solution in exercise1_3.py, and run that file to see the results of your work. There is no automatic checking for this exercise.
Evaluation Criteria. Evaluate your code by running exercise1_3.py with HalfCheetah-v2, InvertedPendulum-v2, and one other Gym MuJoCo environment of your choosing (set via the --env flag). It is set up to use smaller neural networks (hidden sizes [128,128]) than typical for TD3, with a maximum episode length of 150, and to run for only 10 epochs. The goal is to see significant learning progress relatively quickly (in terms of wall clock time). Experiments will likely take on the order of ~10 minutes.
Use the --use_soln flag to run Spinning Up’s TD3 instead of your implementation. Anecdotally, within 10 epochs, the score in HalfCheetah should go over 300, and the score in InvertedPendulum should max out at 150.
Problem Set 2: Algorithm Failure Modes¶
Exercise 2.1: Value Function Fitting in TRPO
Path to Exercise. (Not applicable, there is no code for this one.)
Path to Solution. Solution available here.
Many factors can impact the performance of policy gradient algorithms, but few more drastically than the quality of the learned value function used for advantage estimation.
In this exercise, you will compare results between runs of TRPO where you put lots of effort into fitting the value function (train_v_iters=80), versus where you put very little effort into fitting the value function (train_v_iters=0).
Instructions. Run the following command:
python -m spinup.run trpo --env Hopper-v2 --train_v_iters[v] 0 80 --exp_name ex2-1 --epochs 250 --steps_per_epoch 4000 --seed 0 10 20 --dt
and plot the results. (These experiments might take ~10 minutes each, and this command runs six of them.) What do you find?
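The count of six experiments follows from the grid in the command: two values of train_v_iters times three seeds. The sketch below only reproduces that counting with `itertools.product`; the `grid` dictionary and `variants` list are illustrative names, and this is not how the Spinning Up launcher is implemented.

```python
from itertools import product

# Hyperparameter values taken from the command in this exercise.
grid = {
    "train_v_iters": [0, 80],
    "seed": [0, 10, 20],
}

keys = list(grid)
# One dict per experiment: every combination of the listed values.
variants = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

for variant in variants:
    print(variant)
print(len(variants), "experiments in total")
```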
Exercise 2.2: Silent Bug in DDPG
Path to Exercise. spinup/exercises/problem_set_2/exercise2_2.py
Path to Solution. Solution available here.
The hardest part of writing RL code is dealing with bugs, because failures are frequently silent. The code will appear to run correctly, but the agent’s performance will degrade relative to a bug-free implementation, sometimes to the extent that it never learns anything.
In this exercise, you will observe a bug in vivo and compare results against correct code.
Instructions. Run exercise2_2.py, which will launch DDPG experiments with and without a bug. The non-bugged version runs the default Spinning Up implementation of DDPG, using a default method for creating the actor and critic networks. The bugged version runs the same DDPG code, except uses a bugged method for creating the networks.
There will be six experiments in all (three random seeds for each case), and each should take in the ballpark of 10 minutes. When they’re finished, plot the results. What is the difference in performance with and without the bug?
Without referencing the correct actor-critic code (which is to say, don’t look in DDPG’s core.py file), try to figure out what the bug is and explain how it breaks things.
Hint. To figure out what’s going wrong, think about how the DDPG code implements the DDPG computation graph. Specifically, look at this excerpt:
# Bellman backup for Q function
backup = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)
# DDPG losses
pi_loss = -tf.reduce_mean(q_pi)
q_loss = tf.reduce_mean((q-backup)**2)
How could a bug in the actor-critic code have an impact here?
Bonus. Are there any choices of hyperparameters which would have hidden the effects of the bug?
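For readers who want to check the arithmetic in the hint’s excerpt by hand, here is a plain-Python sketch of the same quantities, assuming every argument is a flat list with one entry per sampled transition and that the policy loss is the negative mean of Q (so that minimizing it pushes Q up). The function name `ddpg_targets_and_losses` is hypothetical and the input numbers are made up; it is not part of Spinning Up.

```python
def ddpg_targets_and_losses(r, d, q, q_pi, q_pi_targ, gamma=0.99):
    """Elementwise Bellman backup and mean losses over a batch of transitions.

    r: rewards, d: done flags (0.0 or 1.0), q: Q(s, a) for the stored actions,
    q_pi: Q(s, pi(s)), q_pi_targ: target-network Q(s', pi_targ(s')).
    """
    # Target for the Q function; treated as a constant during optimization.
    backup = [ri + gamma * (1.0 - di) * qt for ri, di, qt in zip(r, d, q_pi_targ)]
    # Mean squared Bellman error.
    q_loss = sum((qi - bi) ** 2 for qi, bi in zip(q, backup)) / len(q)
    # Policy loss: negative mean of Q under the current policy.
    pi_loss = -sum(q_pi) / len(q_pi)
    return backup, q_loss, pi_loss

backup, q_loss, pi_loss = ddpg_targets_and_losses(
    r=[1.0, 0.0], d=[0.0, 1.0], q=[2.0, 0.5],
    q_pi=[1.0, 3.0], q_pi_targ=[2.0, 5.0], gamma=0.5,
)
print(backup, q_loss, pi_loss)
```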
Challenges¶
Write Code from Scratch
As we suggest in the essay, try reimplementing various deep RL algorithms from scratch.
Requests for Research
If you feel comfortable with writing deep learning and deep RL code, consider trying to make progress on any of OpenAI’s standing requests for research:
Benchmarks for Spinning Up Implementations¶
We benchmarked the Spinning Up algorithm implementations in five environments from the MuJoCo Gym task suite: HalfCheetah, Hopper, Walker2d, Swimmer, and Ant.
Experiment Details¶
Random seeds. The on-policy algorithms (VPG, TRPO, PPO) were run for 3 random seeds each, and the off-policy algorithms (DDPG, TD3, SAC) were run for 10 random seeds each. Graphs show the average (solid line) and std dev (shaded) of performance over random seed over the course of training.
Performance metric. Performance for the on-policy algorithms is measured as the average trajectory return across the batch collected at each epoch. Performance for the off-policy algorithms is measured once every 10,000 steps by running the deterministic policy (or, in the case of SAC, the mean policy) without action noise for ten trajectories, and reporting the average return over those test trajectories.
Network architectures. The on-policy algorithms use networks of size (64, 32) with tanh units for both the policy and the value function. The off-policy algorithms use networks of size (400, 300) with relu units.
Batch size. The on-policy algorithms collected 4000 steps of agent-environment interaction per batch update. The off-policy algorithms used minibatches of size 100 at each gradient descent step.
All other hyperparameters are left at default settings for the Spinning Up implementations. See algorithm pages for details.
Vanilla Policy Gradient¶
Background¶
(Previously: Introduction to RL, Part 3)
The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
Quick Facts¶
 VPG is an on-policy algorithm.
 VPG can be used for environments with either discrete or continuous action spaces.
 The Spinning Up implementation of VPG supports parallelization with MPI.
Key Equations¶
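Let $\pi_\theta$ denote a policy with parameters $\theta$, and $J(\pi_\theta)$ denote the expected finite-horizon undiscounted return of the policy. The gradient of $J(\pi_\theta)$ is
$$\nabla_\theta J(\pi_\theta) = \mathop{\mathrm{E}}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, A^{\pi_\theta}(s_t, a_t) \right],$$
where $\tau$ is a trajectory and $A^{\pi_\theta}$ is the advantage function for the current policy.
The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:
$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_{\theta_k})$$
Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.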
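As an illustration of a discounted advantage estimate, here is a minimal plain-Python sketch of Generalized Advantage Estimation, the estimator this VPG implementation relies on (see the `gamma` and `lam` parameters documented below). The function names `discount_cumsum` and `gae_advantages` and the list-based interface are assumptions made for this sketch, not the Spinning Up code itself.

```python
def discount_cumsum(xs, discount):
    """out[t] = xs[t] + discount * xs[t+1] + discount**2 * xs[t+2] + ..."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-Lambda advantages for one trajectory.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value bootstraps the value after the final step (0.0 if terminal).
    """
    vals = list(values) + [last_value]
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    deltas = [rewards[t] + gamma * vals[t + 1] - vals[t] for t in range(len(rewards))]
    # Advantages are the (gamma * lam)-discounted sums of the residuals.
    return discount_cumsum(deltas, gamma * lam)

print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5, lam=1.0))
```

With `lam=1.0` and an all-zero value function, the advantages reduce to the discounted reward-to-go, which is a convenient case to verify by hand.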
Exploration vs. Exploitation¶
VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
Documentation¶

spinup.vpg(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, pi_lr=0.0003, vf_lr=0.001, train_v_iters=80, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10)
Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
   Symbol    Shape             Description
   pi        (batch, act_dim)  Samples actions from policy given states.
   logp      (batch,)          Gives log probability, according to the policy, of taking actions a_ph in states x_ph.
   logp_pi   (batch,)          Gives log probability, according to the policy, of the action sampled by pi.
   v         (batch,)          Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!)
 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to VPG.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 pi_lr (float) – Learning rate for policy optimizer.
 vf_lr (float) – Learning rate for value function optimizer.
 train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
 lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
Saved Model Contents¶
The computation graph saved by the logger includes:
   Key   Value
   x     Tensorflow placeholder for state input.
   pi    Samples an action from the agent, conditioned on states in x.
   v     Gives value estimate for states in x.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
References¶
Relevant Papers¶
 Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al. 2000
 Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs, Schulman 2016(a)
 Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al. 2016
 High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016(b)
Why These Papers?¶
Sutton 2000 is included because it is a timeless classic of reinforcement learning theory, and contains references to the earlier work which led to modern policy gradients. Schulman 2016(a) is included because Chapter 2 contains a lucid introduction to the theory of policy gradient algorithms, including pseudocode. Duan 2016 is a clear, recent benchmark paper that shows how vanilla policy gradient in the deep RL setting (e.g. with neural network policies and Adam as the optimizer) compares with other deep RL algorithms. Schulman 2016(b) is included because our implementation of VPG makes use of Generalized Advantage Estimation for computing the policy gradient.
Trust Region Policy Optimization¶
Background¶
(Previously: Background for VPG)
TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-Divergence, a measure of (something like, but not exactly) distance between probability distributions.
This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can have very large differences in performance—so a single bad step can collapse the policy performance. This makes it dangerous to use large step sizes with vanilla policy gradients, thus hurting its sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.
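To see why KL-divergence is only "something like" a distance, the following sketch computes it for two small discrete distributions and shows that swapping the arguments changes the value. The helper `kl_divergence` and the example distributions are illustrative; the sketch assumes `q` is nonzero wherever `p` is nonzero.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print("KL(p || q) =", kl_divergence(p, q))
print("KL(q || p) =", kl_divergence(q, p))
print("KL(p || p) =", kl_divergence(p, p))
```

The divergence is zero when the two distributions coincide and positive otherwise, but it is not symmetric, so it is not a true metric.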
Quick Facts¶
 TRPO is an on-policy algorithm.
 TRPO can be used for environments with either discrete or continuous action spaces.
 The Spinning Up implementation of TRPO supports parallelization with MPI.
Key Equations¶
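Let $\pi_{\theta}$ denote a policy with parameters $\theta$. The theoretical TRPO update is:
$$\theta_{k+1} = \arg\max_{\theta} \; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta \| \theta_k) \leq \delta$$
where $\mathcal{L}(\theta_k, \theta)$ is the surrogate advantage, a measure of how policy $\pi_{\theta}$ performs relative to the old policy $\pi_{\theta_k}$ using data from the old policy:
$$\mathcal{L}(\theta_k, \theta) = \underset{s,a \sim \pi_{\theta_k}}{\mathrm{E}}\left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a) \right],$$
and $\bar{D}_{KL}(\theta \| \theta_k)$ is an average KL-divergence between policies across states visited by the old policy:
$$\bar{D}_{KL}(\theta \| \theta_k) = \underset{s \sim \pi_{\theta_k}}{\mathrm{E}}\left[ D_{KL}\left( \pi_{\theta}(\cdot|s) \,\|\, \pi_{\theta_k}(\cdot|s) \right) \right].$$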
You Should Know
The objective and constraint are both zero when $\theta = \theta_k$. Furthermore, the gradient of the constraint with respect to $\theta$ is zero when $\theta = \theta_k$. Proving these facts requires some subtle command of the relevant math; it’s an exercise worth doing, whenever you feel ready!
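To make the two quantities above concrete, here is a minimal sketch that estimates the surrogate advantage and the mean KL-divergence from a small batch, for a policy over discrete actions. The function names, the batch layout, and all the numbers are assumptions made for this illustration; they are not taken from the Spinning Up code.

```python
import math

def surrogate_advantage(new_probs, old_probs, actions, advantages):
    """Sample mean of pi_theta(a|s) / pi_theta_k(a|s) * A(s, a)."""
    total = 0.0
    for p_new, p_old, a, adv in zip(new_probs, old_probs, actions, advantages):
        total += p_new[a] / p_old[a] * adv
    return total / len(actions)

def mean_kl(new_probs, old_probs):
    """Sample mean over states of D_KL(pi_theta(.|s) || pi_theta_k(.|s))."""
    total = 0.0
    for p_new, p_old in zip(new_probs, old_probs):
        total += sum(pn * math.log(pn / po) for pn, po in zip(p_new, p_old))
    return total / len(new_probs)

# Hypothetical batch: 3 states, 2 discrete actions, data collected by the old policy.
old_probs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]]   # pi_theta_k(.|s)
new_probs = [[0.6, 0.4], [0.7, 0.3], [0.3, 0.7]]   # pi_theta(.|s)
actions = [0, 1, 1]                                # actions taken by the old policy
advantages = [1.0, -0.5, 0.2]                      # made-up advantage estimates

print(surrogate_advantage(new_probs, old_probs, actions, advantages))
print(mean_kl(new_probs, old_probs))
print(mean_kl(old_probs, old_probs))  # the KL term vanishes when the policies coincide
```

Note that the surrogate advantage is only zero at the old parameters in expectation (with exact advantages); a small sample like this one will generally not average to zero.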
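The theoretical TRPO update isn’t the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around $\theta_k$:
$$\mathcal{L}(\theta_k, \theta) \approx g^T (\theta - \theta_k)$$
$$\bar{D}_{KL}(\theta \| \theta_k) \approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)$$
resulting in an approximate optimization problem,
$$\theta_{k+1} = \arg\max_{\theta} \; g^T (\theta - \theta_k) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.$$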
You Should Know
By happy coincidence, the gradient $g$ of the surrogate advantage function with respect to $\theta$, evaluated at $\theta = \theta_k$, is exactly equal to the policy gradient, $\nabla_{\theta} J(\pi_{\theta})$! Try proving this, if you feel comfortable diving into the math.
This approximate problem can be analytically solved by the methods of Lagrangian duality [1], yielding the solution:
$$\theta_{k+1} = \theta_k + \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g.$$
If we were to stop here, and just use this final result, the algorithm would be exactly calculating the Natural Policy Gradient. A problem is that, due to the approximation errors introduced by the Taylor expansion, this may not satisfy the KL constraint, or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,
$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g,$$
where $\alpha \in (0,1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer such that $\pi_{\theta_{k+1}}$ satisfies the KL constraint and produces a positive surrogate advantage.
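The backtracking rule can be sketched in a few lines. This is a simplified illustration, not the Spinning Up implementation: the `surrogate` and `kl` callables, the flat-list parameter representation, and the toy one-dimensional problem are all assumptions of this example. The keyword names mirror the `backtrack_coeff` and `backtrack_iters` hyperparameters.

```python
def backtracking_line_search(theta_k, full_step, surrogate, kl, delta,
                             backtrack_coeff=0.8, backtrack_iters=10):
    """Return (theta, j) for the smallest j whose shrunken step is acceptable.

    surrogate(theta) and kl(theta) are caller-supplied functions (hypothetical
    here) that evaluate the surrogate advantage and the mean KL-divergence of
    theta relative to theta_k.
    """
    for j in range(backtrack_iters):
        scale = backtrack_coeff ** j
        theta = [t + scale * s for t, s in zip(theta_k, full_step)]
        if kl(theta) <= delta and surrogate(theta) > 0:
            return theta, j
    # No acceptable step was found: keep the old parameters.
    return list(theta_k), None

# Toy 1-D problem in which the proposed full step violates the KL limit.
theta_k = [0.0]
full_step = [1.0]
surrogate = lambda th: th[0] - th[0] ** 2   # made-up surrogate advantage
kl = lambda th: 0.5 * th[0] ** 2            # made-up quadratic "KL"

theta, j = backtracking_line_search(theta_k, full_step, surrogate, kl, delta=0.1)
print(theta, j)
```

In this toy problem the first four candidate steps exceed the KL limit, so the search settles on the step scaled by the fourth power of the backtracking coefficient.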
Lastly: computing and storing the matrix inverse, $H^{-1}$, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve $Hx = g$ for $x = H^{-1} g$, requiring only a function which can compute the matrix-vector product $Hx$ instead of computing and storing the whole matrix $H$ directly. This is not too hard to do: we set up a symbolic operation to calculate
$$Hx = \nabla_{\theta} \left( \left( \nabla_{\theta} \bar{D}_{KL}(\theta \| \theta_k) \right)^T x \right),$$
which gives us the correct output without computing the whole matrix.
[1]  See Convex Optimization by Boyd and Vandenberghe, especially chapters 2 through 5. 
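The conjugate gradient idea described above can be illustrated without any deep learning library. The sketch below is a textbook conjugate gradient routine that only ever touches the matrix through a product function `hvp(v)`. The small explicit matrix stands in for the KL Hessian purely for demonstration (an assumption of this example); in TRPO the product would come from differentiating the KL-divergence rather than from a stored matrix.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only the product function hvp(v) = H v."""
    x = [0.0] * len(g)
    r = list(g)            # residual g - H x (x starts at zero)
    p = list(g)            # current search direction
    r_dot = dot(r, r)
    for _ in range(iters):
        if r_dot < tol:    # residual is negligible: stop early
            break
        Hp = hvp(p)
        alpha = r_dot / dot(p, Hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        r_dot_new = dot(r, r)
        p = [ri + (r_dot_new / r_dot) * pi for ri, pi in zip(r, p)]
        r_dot = r_dot_new
    return x

# Toy symmetric positive-definite stand-in for the Hessian H.
H = [[4.0, 1.0], [1.0, 3.0]]

def hvp(v):
    return [dot(row, v) for row in H]

g = [1.0, 2.0]
x = conjugate_gradient(hvp, g)              # approximates H^{-1} g
delta = 0.01
scale = math.sqrt(2 * delta / dot(g, x))    # sqrt(2 delta / (g^T H^{-1} g))
full_step = [scale * xi for xi in x]        # proposed theta_{k+1} - theta_k
print(x, full_step)
```

By construction, the resulting step sits exactly on the boundary of the quadratic approximation to the KL constraint, which is why the real algorithm still needs the line search to check the true constraint.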
Exploration vs. Exploitation¶
TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
Documentation¶

spinup.trpo(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=0.001, train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10, backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10, algo='trpo')[source]¶
Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
   Symbol     Shape              Description
   pi         (batch, act_dim)   Samples actions from policy given states.
   logp       (batch,)           Gives log probability, according to the policy, of taking actions a_ph in states x_ph.
   logp_pi    (batch,)           Gives log probability, according to the policy, of the action sampled by pi.
   info       N/A                A dict of any intermediate quantities (from calculating the policy or log probabilities) which are needed for analytically computing KL divergence (e.g., sufficient statistics of the distributions).
   info_phs   N/A                A dict of placeholders for old values of the entries in info.
   d_kl       ()                 A symbol for computing the mean KL divergence between the current policy (pi) and the old policy (as specified by the inputs to info_phs) over the batch of states given in x_ph.
   v          (batch,)           Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!)
 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TRPO.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 delta (float) – KL-divergence limit for TRPO / NPG update. (Should be small for stability. Values like 0.01, 0.05.)
 vf_lr (float) – Learning rate for value function optimizer.
 train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
 damping_coeff (float) – Artifact for numerical stability, should be smallish. Adjusts Hessian-vector product calculation:
$$Hv \rightarrow (\alpha I + H)v$$
where $\alpha$ is the damping coefficient. Probably don’t play with this hyperparameter.
 cg_iters (int) – Number of iterations of conjugate gradient to perform. Increasing this will lead to a more accurate approximation to $H^{-1} g$, and possibly slightly-improved performance, but at the cost of slowing things down. Also probably don’t play with this hyperparameter.
 backtrack_iters (int) – Maximum number of steps allowed in the backtracking line search. Since the line search usually doesn’t backtrack, and usually only steps back once when it does, this hyperparameter doesn’t often matter.
 backtrack_coeff (float) – How far back to step during backtracking line search. (Always between 0 and 1, usually above 0.5.)
 lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
 algo – Either ‘trpo’ or ‘npg’: this code supports both, since they are almost the same.
Saved Model Contents¶
The computation graph saved by the logger includes:
Key   Value
x     Tensorflow placeholder for state input.
pi    Samples an action from the agent, conditioned on states in x.
v     Gives value estimate for states in x.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
References¶
Relevant Papers¶
 Trust Region Policy Optimization, Schulman et al. 2015
 High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016
 Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford 2002
Why These Papers?¶
Schulman 2015 is included because it is the original paper describing TRPO. Schulman 2016 is included because our implementation of TRPO makes use of Generalized Advantage Estimation for computing the policy gradient. Kakade and Langford 2002 is included because it contains theoretical results which motivate and deeply connect to the theoretical foundations of TRPO.
Proximal Policy Optimization¶
Background¶
(Previously: Background for TRPO)
PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.
There are two primary variants of PPO: PPO-Penalty and PPO-Clip.
PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it’s scaled appropriately.
PPO-Clip doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.
Here, we’ll focus only on PPO-Clip (the primary variant used at OpenAI).
Quick Facts¶
 PPO is an on-policy algorithm.
 PPO can be used for environments with either discrete or continuous action spaces.
 The Spinning Up implementation of PPO supports parallelization with MPI.
Key Equations¶
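PPO-clip updates policies via
$$\theta_{k+1} = \arg\max_{\theta} \underset{s,a \sim \pi_{\theta_k}}{\mathrm{E}}\left[ L(s,a,\theta_k,\theta) \right],$$
typically taking multiple steps of (usually minibatch) SGD to maximize the objective. Here $L$ is given by
$$L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; \text{clip}\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_k}}(s,a) \right),$$
in which $\epsilon$ is a (small) hyperparameter which roughly says how far away the new policy is allowed to go from the old.
This is a pretty complex expression, and it’s hard to tell at first glance what it’s doing, or how it helps keep the new policy close to the old policy. As it turns out, there’s a considerably simplified version [1] of this objective which is a bit easier to grapple with (and is also the version we implement in our code):
$$L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; g(\epsilon, A^{\pi_{\theta_k}}(s,a)) \right),$$
where
$$g(\epsilon, A) = \begin{cases} (1 + \epsilon) A & A \geq 0 \\ (1 - \epsilon) A & A < 0. \end{cases}$$
To figure out what intuition to take away from this, let’s look at a single state-action pair $(s,a)$, and think of cases.
Advantage is positive: Suppose the advantage for that state-action pair is positive, in which case its contribution to the objective reduces to
$$L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 + \epsilon) \right) A^{\pi_{\theta_k}}(s,a).$$
Because the advantage is positive, the objective will increase if the action becomes more likely, that is, if $\pi_{\theta}(a|s)$ increases. But the min in this term puts a limit to how much the objective can increase. Once $\pi_{\theta}(a|s) > (1+\epsilon) \pi_{\theta_k}(a|s)$, the min kicks in and this term hits a ceiling of $(1+\epsilon) A^{\pi_{\theta_k}}(s,a)$. Thus: the new policy does not benefit by going far away from the old policy.
Advantage is negative: Suppose the advantage for that state-action pair is negative, in which case its contribution to the objective reduces to
$$L(s,a,\theta_k,\theta) = \max\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 - \epsilon) \right) A^{\pi_{\theta_k}}(s,a).$$
Because the advantage is negative, the objective will increase if the action becomes less likely, that is, if $\pi_{\theta}(a|s)$ decreases. But the max in this term puts a limit to how much the objective can increase. Once $\pi_{\theta}(a|s) < (1-\epsilon) \pi_{\theta_k}(a|s)$, the max kicks in and this term hits a ceiling of $(1-\epsilon) A^{\pi_{\theta_k}}(s,a)$. Thus, again: the new policy does not benefit by going far away from the old policy.
What we have seen so far is that clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter $\epsilon$ corresponds to how far away the new policy can go from the old while still profiting the objective.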
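A quick numerical check can help build confidence that the clipped objective and its simplified form agree, and that the ceilings described in the two cases really appear. The sketch below works on a single state-action pair, taking the probability ratio and the advantage as plain floats; the function names and the grid of test values are assumptions of this example.

```python
def ppo_clip_objective(ratio, adv, eps=0.2):
    """Original form: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * adv, clipped_ratio * adv)

def ppo_clip_objective_simplified(ratio, adv, eps=0.2):
    """Simplified form: min(ratio * A, g(eps, A))."""
    g = (1 + eps) * adv if adv >= 0 else (1 - eps) * adv
    return min(ratio * adv, g)

# The two forms should agree for every ratio and advantage we try.
for ratio in [0.5, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5]:
    for adv in [-2.0, -0.5, 0.0, 0.5, 2.0]:
        a = ppo_clip_objective(ratio, adv)
        b = ppo_clip_objective_simplified(ratio, adv)
        assert abs(a - b) < 1e-12

# Positive advantage: raising the ratio past 1 + eps stops paying off.
print(ppo_clip_objective(1.5, 2.0), ppo_clip_objective(3.0, 2.0))
# Negative advantage: lowering the ratio below 1 - eps stops paying off.
print(ppo_clip_objective(0.5, -2.0), ppo_clip_objective(0.1, -2.0))
```

In both printed pairs the two values coincide, which is the ceiling behavior: once the ratio has left the clipping interval in the profitable direction, moving further gives no extra objective.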
You Should Know
While this kind of clipping goes a long way towards ensuring reasonable policy updates, it is still possible to end up with a new policy which is too far from the old policy, and there are a bunch of tricks used by different PPO implementations to stave this off. In our implementation here, we use a particularly simple method: early stopping. If the mean KL-divergence of the new policy from the old grows beyond a threshold, we stop taking gradient steps.
When you feel comfortable with the basic math and implementation details, it’s worth checking out other implementations to see how they handle this issue!
[1]  See this note for a derivation of the simplified form of the PPO-Clip objective. 
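The early stopping trick mentioned above amounts to a small control-flow pattern, sketched below. The `take_gradient_step` callable and the specific threshold of 1.5 times `target_kl` are assumptions of this illustration; check the actual source for the exact rule it uses.

```python
def train_policy(take_gradient_step, train_pi_iters=80, target_kl=0.01):
    """Run up to train_pi_iters policy updates, stopping early on large KL.

    take_gradient_step is a caller-supplied function (hypothetical here) that
    performs one gradient step on the policy and returns the current mean
    KL-divergence between the new and old policies.
    Returns the number of gradient steps actually taken.
    """
    for i in range(train_pi_iters):
        kl = take_gradient_step()
        if kl > 1.5 * target_kl:   # assumed threshold for this sketch
            return i + 1
    return train_pi_iters

# Fake optimizer whose KL grows by 0.004 with every step.
fake_kls = iter([0.004 * k for k in range(1, 81)])
steps_taken = train_policy(lambda: next(fake_kls))
print(steps_taken)
```

This is why `train_pi_iters` is documented as a maximum: the optimizer may take far fewer steps in an epoch where the policy moves quickly.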
Exploration vs. Exploitation¶
PPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
Documentation¶

spinup.ppo(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, clip_ratio=0.2, pi_lr=0.0003, vf_lr=0.001, train_pi_iters=80, train_v_iters=80, lam=0.97, max_ep_len=1000, target_kl=0.01, logger_kwargs={}, save_freq=10)[source]¶
Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
   Symbol     Shape              Description
   pi         (batch, act_dim)   Samples actions from policy given states.
   logp       (batch,)           Gives log probability, according to the policy, of taking actions a_ph in states x_ph.
   logp_pi    (batch,)           Gives log probability, according to the policy, of the action sampled by pi.
   v          (batch,)           Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!)
 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to PPO.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 clip_ratio (float) – Hyperparameter for clipping in the policy objective. Roughly: how far can the new policy go from the old policy while still profiting (improving the objective function)? The new policy can still go farther than the clip_ratio says, but it doesn’t help on the objective anymore. (Usually small, 0.1 to 0.3.)
 pi_lr (float) – Learning rate for policy optimizer.
 vf_lr (float) – Learning rate for value function optimizer.
 train_pi_iters (int) – Maximum number of gradient descent steps to take on policy loss per epoch. (Early stopping may cause optimizer to take fewer than this.)
 train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
 lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 target_kl (float) – Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
Saved Model Contents¶
The computation graph saved by the logger includes:
Key   Value
x     Tensorflow placeholder for state input.
pi    Samples an action from the agent, conditioned on states in x.
v     Gives value estimate for states in x.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
References¶
Relevant Papers¶
 Proximal Policy Optimization Algorithms, Schulman et al. 2017
 High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016
 Emergence of Locomotion Behaviours in Rich Environments, Heess et al. 2017
Why These Papers?¶
Schulman 2017 is included because it is the original paper describing PPO. Schulman 2016 is included because our implementation of PPO makes use of Generalized Advantage Estimation for computing the policy gradient. Heess 2017 is included because it presents a large-scale empirical analysis of behaviors learned by PPO agents in complex environments (although it uses PPO-penalty instead of PPO-clip).
Other Public Implementations¶
 Baselines
 ModularRL (Caution: this implements PPO-penalty instead of PPO-clip.)
 rllab (Caution: this implements PPO-penalty instead of PPO-clip.)
 rllib (Ray)
Deep Deterministic Policy Gradient¶
Background¶
(Previously: Introduction to RL Part 1: The Optimal Q-Function and the Optimal Action)
Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
This approach is closely connected to Q-learning, and is motivated the same way: if you know the optimal action-value function $Q^*(s,a)$, then in any given state, the optimal action $a^*(s)$ can be found by solving
$$a^*(s) = \arg\max_a Q^*(s,a).$$
DDPG interleaves learning an approximator to $Q^*(s,a)$ with learning an approximator to $a^*(s)$, and it does so in a way which is specifically adapted for environments with continuous action spaces. But what does it mean that DDPG is adapted specifically for environments with continuous action spaces? It relates to how we compute the max over actions in $\max_a Q^*(s,a)$.
When there are a finite number of discrete actions, the max poses no problem, because we can just compute the Q-values for each action separately and directly compare them. (This also immediately gives us the action which maximizes the Q-value.) But when the action space is continuous, we can’t exhaustively evaluate the space, and solving the optimization problem is highly non-trivial. Using a normal optimization algorithm would make calculating $\max_a Q^*(s,a)$ a painfully expensive subroutine. And since it would need to be run every time the agent wants to take an action in the environment, this is unacceptable.
Because the action space is continuous, the function $Q^*(s,a)$ is presumed to be differentiable with respect to the action argument. This allows us to set up an efficient, gradient-based learning rule for a policy $\mu(s)$ which exploits that fact. Then, instead of running an expensive optimization subroutine each time we wish to compute $\max_a Q(s,a)$, we can approximate it with $\max_a Q(s,a) \approx Q(s,\mu(s))$. See the Key Equations section for details.
Quick Facts¶
 DDPG is an off-policy algorithm.
 DDPG can only be used for environments with continuous action spaces.
 DDPG can be thought of as being deep Q-learning for continuous action spaces.
 The Spinning Up implementation of DDPG does not support parallelization.
Key Equations¶
Here, we’ll explain the math behind the two parts of DDPG: learning a Q-function, and learning a policy.
The Q-Learning Side of DDPG¶
First, let’s recap the Bellman equation describing the optimal action-value function, $Q^*(s,a)$. It’s given by
$$Q^*(s,a) = \underset{s' \sim P}{\mathrm{E}}\left[ r(s,a) + \gamma \max_{a'} Q^*(s', a') \right]$$
where $s' \sim P$ is shorthand for saying that the next state, $s'$, is sampled by the environment from a distribution $P(\cdot | s,a)$.
This Bellman equation is the starting point for learning an approximator to $Q^*(s,a)$. Suppose the approximator is a neural network $Q_{\phi}(s,a)$, with parameters $\phi$, and that we have collected a set $\mathcal{D}$ of transitions $(s,a,r,s',d)$ (where $d$ indicates whether state $s'$ is terminal). We can set up a mean-squared Bellman error (MSBE) function, which tells us roughly how closely $Q_{\phi}$ comes to satisfying the Bellman equation:
$$L(\phi, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_{\phi}(s,a) - \left( r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a') \right) \right)^2 \right]$$
Here, in evaluating $(1-d)$, we’ve used a Python convention of evaluating True to 1 and False to zero. Thus, when d==True (which is to say, when $s'$ is a terminal state), the Q-function should show that the agent gets no additional rewards after the current state. (This choice of notation corresponds to what we later implement in code.)
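To see the role of the done flag concretely, here is a minimal sketch of the Bellman target and the MSBE for a toy tabular Q-function with two discrete actions. The tabular representation, the state names, and the numbers are assumptions made purely so that the max over actions is trivial to compute; DDPG itself replaces that max with a learned policy.

```python
def bellman_target(r, d, next_q_values, gamma=0.99):
    """r + gamma * (1 - d) * max_a' Q(s', a'); d may be a bool or 0/1."""
    return r + gamma * (1 - d) * max(next_q_values)

def msbe(batch, q, gamma=0.99):
    """Mean-squared Bellman error of a tabular Q (dict: state -> list of Q-values)."""
    errors = [(q[s][a] - bellman_target(r, d, q[s2], gamma)) ** 2
              for s, a, r, s2, d in batch]
    return sum(errors) / len(errors)

# Hypothetical two-state problem with two actions per state.
q = {"A": [1.0, 2.0], "B": [0.0, 4.0]}
batch = [
    ("A", 1, 1.0, "B", False),   # non-terminal: the target bootstraps from Q(B, .)
    ("B", 0, 3.0, "A", True),    # terminal: the target is just the reward
]

print(bellman_target(1.0, False, q["B"], gamma=0.5))   # 1 + 0.5 * 4
print(bellman_target(3.0, True, q["A"], gamma=0.5))    # terminal, so just 3
print(msbe(batch, q, gamma=0.5))
```

Because Python treats True as 1 in arithmetic, the factor (1 - d) switches off the bootstrap term for terminal transitions without any explicit branching.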
Q-learning algorithms for function approximators, such as DQN (and all its variants) and DDPG, are largely based on minimizing this MSBE loss function. There are two main tricks employed by all of them which are worth describing, and then a specific detail for DDPG.
Trick One: Replay Buffers. All standard algorithms for training a deep neural network to approximate $Q^*(s,a)$ make use of an experience replay buffer. This is the set $\mathcal{D}$ of previous experiences. In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of experiences, but it may not always be good to keep everything. If you only use the very-most recent data, you will overfit to that and things will break; if you use too much experience, you may slow down your learning. This may take some tuning to get right.
You Should Know
We’ve mentioned that DDPG is an off-policy algorithm: this is as good a point as any to highlight why and how. Observe that the replay buffer should contain old experiences, even though they might have been obtained using an outdated policy. Why are we able to use these at all? The reason is that the Bellman equation doesn’t care which transition tuples are used, or how the actions were selected, or what happens after a given transition, because the optimal Q-function should satisfy the Bellman equation for all possible transitions. So any transitions that we’ve ever experienced are fair game when trying to fit a Q-function approximator via MSBE minimization.
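A replay buffer of the kind described above can be sketched with the standard library alone. This is a minimal illustration under assumed names (`ReplayBuffer`, `store`, `sample_batch`), not the array-based buffer a real implementation would likely use for speed.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s2, d) transitions."""

    def __init__(self, size):
        # A bounded deque silently drops the oldest transition once full.
        self.buffer = deque(maxlen=size)

    def store(self, s, a, r, s2, d):
        self.buffer.append((s, a, r, s2, d))

    def sample_batch(self, batch_size):
        # Uniform sampling without replacement; transitions collected under
        # old policies are just as eligible as recent ones.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(size=3)
for t in range(5):
    # Made-up transitions: state t, action 0.0, reward 1.0, next state t + 1.
    buf.store(t, 0.0, 1.0, t + 1, False)

print(len(buf))                 # only the 3 most recent transitions remain
print(buf.sample_batch(2))
```

The `size` argument plays the role of `replay_size`: too small and the agent overfits to recent data, too large and stale experience can slow learning.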
Trick Two: Target Networks. Q-learning algorithms make use of target networks. The term
$$r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a')$$
is called the target, because when we minimize the MSBE loss, we are trying to make the Q-function be more like this target. Problematically, the target depends on the same parameters we are trying to train: $\phi$. This makes MSBE minimization unstable. The solution is to use a set of parameters which comes close to $\phi$, but with a time delay; that is to say, a second network, called the target network, which lags the first. The parameters of the target network are denoted $\phi_{\text{targ}}$.
In DQN-based algorithms, the target network is just copied over from the main network every some-fixed-number of steps. In DDPG-style algorithms, the target network is updated once per main network update by polyak averaging:
$$\phi_{\text{targ}} \leftarrow \rho \phi_{\text{targ}} + (1 - \rho) \phi,$$
where $\rho$ is a hyperparameter between 0 and 1 (usually close to 1). (This hyperparameter is called polyak in our code.)
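Polyak averaging is simple enough to show on a plain list of numbers. In the sketch below, parameters are represented as flat lists of floats, which is an assumption for illustration only; in a real implementation the same update would be applied to every weight tensor.

```python
def polyak_update(target_params, main_params, polyak=0.995):
    """Move each target parameter a small step towards its main-network value."""
    return [polyak * t + (1 - polyak) * m
            for t, m in zip(target_params, main_params)]

main = [1.0, -2.0]      # made-up main-network parameters, held fixed here
target = [0.0, 0.0]     # made-up target-network parameters

for step in range(1000):
    target = polyak_update(target, main)

print(target)   # after many updates the target has nearly caught up with main
```

With polyak close to 1, each update closes only a small fraction of the gap, so the target network trails the main network smoothly instead of jumping.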
DDPG Detail: Calculating the Max Over Actions in the Target. As mentioned earlier: computing the maximum over actions in the target is a challenge in continuous action spaces. DDPG deals with this by using a target policy network to compute an action which approximately maximizes $Q_{\phi_{\text{targ}}}$. The target policy network is found the same way as the target Q-function: by polyak averaging the policy parameters over the course of training.
Putting it all together, Q-learning in DDPG is performed by minimizing the following MSBE loss with stochastic gradient descent:
$$L(\phi, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_{\phi}(s,a) - \left( r + \gamma (1 - d) Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s')) \right) \right)^2 \right],$$
where $\mu_{\theta_{\text{targ}}}$ is the target policy.
The Policy Learning Side of DDPG¶
Policy learning in DDPG is fairly simple. We want to learn a deterministic policy $\mu_{\theta}(s)$ which gives the action that maximizes $Q_{\phi}(s,a)$. Because the action space is continuous, and we assume the Q-function is differentiable with respect to action, we can just perform gradient ascent (with respect to policy parameters only) to solve
$$\max_{\theta} \underset{s \sim \mathcal{D}}{\mathrm{E}}\left[ Q_{\phi}(s, \mu_{\theta}(s)) \right].$$
Note that the Q-function parameters are treated as constants here.
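The policy update can be illustrated on a toy problem where everything is differentiable by hand. In this sketch the Q-function is a fixed quadratic whose best action is twice the state, and the policy is linear with a single parameter; both choices, and the hand-written chain rule, are assumptions made to keep the example free of any autodiff library.

```python
def q_grad_action(s, a):
    """dQ/da for the toy Q(s, a) = -(a - 2 s)^2, whose maximizer is a = 2 s."""
    return -2.0 * (a - 2.0 * s)

def policy(theta, s):
    """Toy deterministic policy mu_theta(s) = theta * s."""
    return theta * s

theta = 0.0
states = [0.5, 1.0, -1.5]   # made-up batch of states sampled from the buffer
lr = 0.1

for _ in range(200):
    # Chain rule: d/dtheta Q(s, mu_theta(s)) = dQ/da * dmu/dtheta, with dmu/dtheta = s.
    grad = sum(q_grad_action(s, policy(theta, s)) * s for s in states) / len(states)
    theta += lr * grad       # gradient ascent; the Q-function itself is not changed

print(theta)   # approaches 2.0, so the policy recovers the maximizing action a = 2 s
```

Only the policy parameter is updated in the loop, which mirrors the remark above that the Q-function parameters are treated as constants during policy learning.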
Exploration vs. Exploitation¶
DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make DDPG policies explore better, we add noise to their actions at training time. The authors of the original DDPG paper recommended time-correlated OU noise, but more recent results suggest that uncorrelated, mean-zero Gaussian noise works perfectly well. Since the latter is simpler, it is preferred. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep noise scale fixed throughout.)
At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.
You Should Know
Our DDPG implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal DDPG exploration.
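The exploration scheme described in this section (uniform random actions at first, then the deterministic policy plus Gaussian noise, clipped to the valid range) can be sketched as follows. The scalar action, the symmetric limit `act_limit`, and the `mu` callable are simplifying assumptions of this example.

```python
import random

def get_action(mu, s, t, act_limit, start_steps=10000, act_noise=0.1):
    """Choose a training-time action for step t (scalar action for simplicity)."""
    if t < start_steps:
        # Early in training: ignore the policy and sample uniformly.
        return random.uniform(-act_limit, act_limit)
    # Afterwards: deterministic action plus mean-zero Gaussian noise...
    a = mu(s) + random.gauss(0.0, act_noise)
    # ...clipped back into the valid action range.
    return max(-act_limit, min(a, act_limit))

mu = lambda s: 0.25 * s     # hypothetical deterministic policy

random.seed(0)
print(get_action(mu, 1.0, t=0, act_limit=1.0))        # uniform random phase
print(get_action(mu, 1.0, t=20000, act_limit=1.0))    # noisy policy action
print(get_action(mu, 1.0, t=20000, act_limit=1.0, act_noise=0.0))  # test-time style
```

Setting `act_noise` to zero recovers the noise-free behavior used at test time.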
Documentation¶

spinup.ddpg(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, max_ep_len=1000, logger_kwargs={}, save_freq=1)[source]¶
Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
   Symbol   Shape              Description
   pi       (batch, act_dim)   Deterministically computes actions from policy given states.
   q        (batch,)           Gives the current estimate of Q* for states in x_ph and actions in a_ph.
   q_pi     (batch,)           Gives the composition of q and pi for states in x_ph: q(x, pi(x)).
 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to DDPG.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs to run and train agent.
 replay_size (int) – Maximum length of replay buffer.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:
$$\theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1 - \rho) \theta$$
where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)
 pi_lr (float) – Learning rate for policy.
 q_lr (float) – Learning rate for Q-networks.
 batch_size (int) – Minibatch size for SGD.
 start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
 act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
Saved Model Contents¶
The computation graph saved by the logger includes:
Key   Value
x     Tensorflow placeholder for state input.
a     Tensorflow placeholder for action input.
pi    Deterministically computes an action from the agent, conditioned on states in x.
q     Gives action-value estimate for states in x and actions in a.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
References¶
Relevant Papers¶
 Deterministic Policy Gradient Algorithms, Silver et al. 2014
 Continuous Control With Deep Reinforcement Learning, Lillicrap et al. 2016
Why These Papers?¶
Silver 2014 is included because it establishes the theory underlying deterministic policy gradients (DPG). Lillicrap 2016 is included because it adapts the theoretically-grounded DPG algorithm to the deep RL setting, giving DDPG.
Twin Delayed DDPG¶
Background¶
(Previously: Background for DDPG)
While DDPG can achieve great performance sometimes, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm which addresses this issue by introducing three critical tricks:
Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.
Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
Together, these three tricks result in substantially improved performance over baseline DDPG.
Quick Facts¶
 TD3 is an off-policy algorithm.
 TD3 can only be used for environments with continuous action spaces.
 The Spinning Up implementation of TD3 does not support parallelization.
Key Equations¶
TD3 concurrently learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by mean square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. To show exactly how TD3 does this and how it differs from normal DDPG, we’ll work from the innermost part of the loss function outwards.
First: target policy smoothing. Actions used to form the Q-learning target are based on the target policy, $\mu_{\theta_{\text{targ}}}$, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions, $a$, satisfy $a_{Low} \leq a \leq a_{High}$). The target actions are thus:
$$a'(s') = \text{clip}\left( \mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon, -c, c), \; a_{Low}, \; a_{High} \right), \quad \epsilon \sim \mathcal{N}(0, \sigma)$$
Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can happen in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then have brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which target policy smoothing is designed to do.
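The double clipping in target policy smoothing is easy to get wrong, so a small sketch may help. It treats the action as a scalar and takes the target policy as a plain callable, both of which are assumptions for illustration; the keyword names mirror the `target_noise` and `noise_clip` hyperparameters.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(x, hi))

def smoothed_target_action(mu_targ, s2, act_low, act_high,
                           target_noise=0.2, noise_clip=0.5):
    """Target action: target policy output plus clipped noise, clipped to the valid range."""
    # Inner clip: bound the noise itself so one sample cannot move the action far.
    eps = clip(random.gauss(0.0, target_noise), -noise_clip, noise_clip)
    # Outer clip: keep the perturbed action inside the valid action range.
    return clip(mu_targ(s2) + eps, act_low, act_high)

mu_targ = lambda s2: 0.9    # hypothetical target policy output near the upper limit

random.seed(0)
samples = [smoothed_target_action(mu_targ, None, -1.0, 1.0) for _ in range(5)]
print(samples)   # every value lies within 0.5 of 0.9 and never exceeds 1.0
```

Averaging the Q-target over such perturbed actions is what discourages the policy from latching onto a narrow, spurious peak in the Q-function.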
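Next: clipped double-Q learning. Both Q-functions use a single target, calculated using whichever of the two Q-functions gives a smaller target value:
$$y(r,s',d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', a'(s')),$$
and then both are learned by regressing to this target:
$$L(\phi_1, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_{\phi_1}(s,a) - y(r,s',d) \right)^2 \right],$$
$$L(\phi_2, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_{\phi_2}(s,a) - y(r,s',d) \right)^2 \right].$$
Using the smaller Q-value for the target, and regressing towards that, helps fend off overestimation in the Q-function.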
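The clipped double-Q target is a one-line change from the single-Q target, as the following sketch shows. The two target Q-functions are stand-in lambdas with made-up values, and the smoothed target action is passed in as a plain number; these are assumptions of this illustration.

```python
def td3_target(r, d, q1_targ, q2_targ, s2, a2, gamma=0.99):
    """Shared regression target: r + gamma * (1 - d) * min(Q1_targ, Q2_targ)."""
    return r + gamma * (1 - d) * min(q1_targ(s2, a2), q2_targ(s2, a2))

def squared_error(q_value, y):
    """Per-sample loss used for each of the two Q-functions."""
    return (q_value - y) ** 2

# Hypothetical target Q-functions that disagree; the first one is overly optimistic.
q1_targ = lambda s, a: 10.0
q2_targ = lambda s, a: 4.0

y = td3_target(r=1.0, d=False, q1_targ=q1_targ, q2_targ=q2_targ,
               s2=None, a2=0.0, gamma=0.5)
print(y)                          # built from the smaller estimate: 1 + 0.5 * 4
print(squared_error(10.0, y))     # loss for a main Q1 prediction of 10
print(squared_error(4.0, y))      # loss for a main Q2 prediction of 4
```

Both Q-functions regress towards the same value y, so the optimistic estimate is pulled down rather than propagated through the targets.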
Lastly: the policy is learned just by maximizing $Q_{\phi_1}$:
$$\max_{\theta} \underset{s \sim \mathcal{D}}{\mathrm{E}}\left[ Q_{\phi_1}(s, \mu_{\theta}(s)) \right],$$
which is pretty much unchanged from DDPG. However, in TD3, the policy is updated less frequently than the Q-functions are. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.
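The delayed update schedule can be sketched as a loop in which the Q-functions are updated at every iteration and the policy (together with the target networks) only on every `policy_delay`-th one. The callables below are hypothetical placeholders that just count calls; the modulo-based schedule is one plausible way to realize the delay and is an assumption of this sketch.

```python
def run_updates(num_updates, update_q, update_pi_and_targets, policy_delay=2):
    """Update the Q-functions every iteration, the policy and targets less often."""
    for j in range(num_updates):
        update_q()
        if j % policy_delay == 0:
            update_pi_and_targets()

counts = {"q": 0, "pi": 0}

def update_q():
    counts["q"] += 1      # placeholder for one gradient step on both Q-functions

def update_pi_and_targets():
    counts["pi"] += 1     # placeholder for a policy step plus polyak target updates

run_updates(10, update_q, update_pi_and_targets, policy_delay=2)
print(counts)   # twice as many Q-function updates as policy updates
```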
Exploration vs. Exploitation¶
TD3 trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make TD3 policies explore better, we add noise to their actions at training time, typically uncorrelated mean-zero Gaussian noise. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep noise scale fixed throughout.)
At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.
You Should Know
Our TD3 implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal TD3 exploration.
Documentation¶

spinup.td3(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, target_noise=0.2, noise_clip=0.5, policy_delay=2, max_ep_len=1000, logger_kwargs={}, save_freq=1)[source]¶
Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
   Symbol   Shape              Description
   pi       (batch, act_dim)   Deterministically computes actions from policy given states.
   q1       (batch,)           Gives one estimate of Q* for states in x_ph and actions in a_ph.
   q2       (batch,)           Gives another estimate of Q* for states in x_ph and actions in a_ph.
   q1_pi    (batch,)           Gives the composition of q1 and pi for states in x_ph: q1(x, pi(x)).
 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TD3.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs to run and train agent.
 replay_size (int) – Maximum length of replay buffer.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 polyak (float) –
Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:
where is polyak. (Always between 0 and 1, usually close to 1.)
 pi_lr (float) – Learning rate for policy.
 q_lr (float) – Learning rate for Qnetworks.
 batch_size (int) – Minibatch size for SGD.
 start_steps (int) – Number of steps for uniformrandom action selection, before running real policy. Helps exploration.
 act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
 target_noise (float) – Stddev for smoothing noise added to target policy.
 noise_clip (float) – Limit for absolute value of target policy smoothing noise.
 policy_delay (int) – Policy will only be updated once every policy_delay times for each update of the Qnetworks.
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
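Two of the parameters above are easier to read next to a few lines of code. The sketch below shows polyak averaging (target parameters move toward main parameters as targ ← polyak · targ + (1 − polyak) · main) and the clipping applied to target policy smoothing noise. It works on plain Python floats; the function names and the symmetric action limit are assumptions for illustration, not the Tensorflow ops used by the implementation.

```python
def polyak_update(target_params, main_params, polyak=0.995):
    """Move each target parameter a small step toward its main counterpart."""
    return [polyak * t + (1.0 - polyak) * m
            for t, m in zip(target_params, main_params)]


def smoothed_target_action(target_action, noise, noise_clip=0.5, act_limit=1.0):
    """Add clipped smoothing noise to a (scalar) target action.

    `noise` would normally be drawn from a Gaussian with stddev target_noise;
    it is passed in here so the clipping logic is easy to inspect.
    """
    clipped = max(-noise_clip, min(noise_clip, noise))
    return max(-act_limit, min(act_limit, target_action + clipped))
```

With polyak close to 1 the target networks change slowly, which is the intent of the default 0.995.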
Saved Model Contents¶
The computation graph saved by the logger includes:
Key | Value
x | Tensorflow placeholder for state input.
a | Tensorflow placeholder for action input.
pi | Deterministically computes an action from the agent, conditioned on states in x.
q1 | Gives one action-value estimate for states in x and actions in a.
q2 | Gives the other action-value estimate for states in x and actions in a.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
Soft Actor-Critic¶
Background¶
(Previously: Background for TD3)
Soft Actor Critic (SAC) is an algorithm which optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. It isn’t a direct successor to TD3 (having been published roughly concurrently), but it incorporates the clipped double-Q trick, and due to the inherent stochasticity of the policy in SAC, it also winds up benefiting from something like target policy smoothing.
A central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on. It can also prevent the policy from prematurely converging to a bad local optimum.
Quick Facts¶
 SAC is an off-policy algorithm.
 The version of SAC implemented here can only be used for environments with continuous action spaces.
 An alternate version of SAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.
 The Spinning Up implementation of SAC does not support parallelization.
Key Equations¶
To explain Soft Actor Critic, we first have to introduce the entropy-regularized reinforcement learning setting. In entropy-regularized RL, there are slightly different equations for value functions.
Entropy-Regularized Reinforcement Learning¶
Entropy is a quantity which, roughly speaking, says how random a random variable is. If a coin is weighted so that it almost always comes up heads, it has low entropy; if it’s evenly weighted and has a half chance of either outcome, it has high entropy.
Let $x$ be a random variable with probability mass or density function $P$. The entropy $H$ of $x$ is computed from its distribution $P$ according to
$$H(P) = \mathop{\mathrm{E}}_{x \sim P}\left[-\log P(x)\right].$$
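The coin example can be checked numerically with the standard definition of entropy, H(P) = E[-log P(x)]. The short sketch below handles a discrete distribution given as a list of probabilities; the helper name entropy and the choice of natural logarithms (nats) are assumptions for the example.

```python
import math


def entropy(probs):
    """Shannon entropy H(P) = E_{x~P}[-log P(x)] in nats, for a discrete P."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


fair = entropy([0.5, 0.5])        # evenly weighted coin
weighted = entropy([0.99, 0.01])  # coin that almost always comes up heads
```

The fair coin attains the maximum possible entropy for two outcomes, log 2, while the weighted coin's entropy is much closer to zero.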
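In entropy-regularized reinforcement learning, the agent gets a bonus reward at each time step proportional to the entropy of the policy at that timestep. This changes the RL problem to:
$$\pi^* = \arg\max_{\pi} \mathop{\mathrm{E}}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot|s_t)\right)\Big)\right],$$
where $\alpha > 0$ is the trade-off coefficient. (Note: we’re assuming an infinite-horizon discounted setting here, and we’ll do the same for the rest of this page.) We can now define the slightly different value functions in this setting. $V^{\pi}$ is changed to include the entropy bonuses from every timestep:
$$V^{\pi}(s) = \mathop{\mathrm{E}}_{\tau \sim \pi}\left[\left.\sum_{t=0}^{\infty} \gamma^t \Big(R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot|s_t)\right)\Big) \right| s_0 = s\right]$$
$Q^{\pi}$ is changed to include the entropy bonuses from every timestep except the first:
$$Q^{\pi}(s,a) = \mathop{\mathrm{E}}_{\tau \sim \pi}\left[\left.\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\left(\pi(\cdot|s_t)\right) \right| s_0 = s, a_0 = a\right]$$
With these definitions, $V^{\pi}$ and $Q^{\pi}$ are connected by:
$$V^{\pi}(s) = \mathop{\mathrm{E}}_{a \sim \pi}\left[Q^{\pi}(s,a)\right] + \alpha H\left(\pi(\cdot|s)\right)$$
and the Bellman equation for $Q^{\pi}$ is
$$Q^{\pi}(s,a) = \mathop{\mathrm{E}}_{s' \sim P,\, a' \sim \pi}\left[R(s,a,s') + \gamma\Big(Q^{\pi}(s',a') + \alpha H\left(\pi(\cdot|s')\right)\Big)\right] = \mathop{\mathrm{E}}_{s' \sim P}\left[R(s,a,s') + \gamma V^{\pi}(s')\right].$$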
You Should Know
The way we’ve set up the value functions in the entropy-regularized setting is a little bit arbitrary, and actually we could have done it differently (e.g., make $Q^{\pi}$ include the entropy bonus at the first timestep). The choice of definition may vary slightly across papers on the subject.
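Soft Actor-Critic¶
SAC concurrently learns a policy $\pi_{\theta}$, two Q-functions $Q_{\phi_1}, Q_{\phi_2}$, and a value function $V_{\psi}$.
Learning Q. The Q-functions are learned by minimizing the mean-squared Bellman error (MSBE), using a target value network to form the Bellman backups. They both use the same target, like in TD3, and have loss functions:
$$L(\phi_i, \mathcal{D}) = \mathop{\mathrm{E}}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\Big(Q_{\phi_i}(s,a) - \left(r + \gamma (1 - d) V_{\psi_{\text{targ}}}(s')\right)\Big)^2\right].$$
The target value network, like the target networks in DDPG and TD3, is obtained by polyak averaging the value network parameters over the course of training.
Learning V. The value function is learned by exploiting (a sample-based approximation of) the connection between $Q^{\pi}$ and $V^{\pi}$. Before we go into the learning rule, let’s first rewrite the connection equation by using the definition of entropy to obtain:
$$V^{\pi}(s) = \mathop{\mathrm{E}}_{a \sim \pi}\left[Q^{\pi}(s,a)\right] + \alpha H\left(\pi(\cdot|s)\right) = \mathop{\mathrm{E}}_{a \sim \pi}\left[Q^{\pi}(s,a) - \alpha \log \pi(a|s)\right].$$
The RHS is an expectation over actions, so we can approximate it by sampling from the policy:
$$V^{\pi}(s) \approx Q^{\pi}(s,\tilde{a}) - \alpha \log \pi(\tilde{a}|s), \qquad \tilde{a} \sim \pi(\cdot|s).$$
SAC sets up a mean-squared-error loss for $V_{\psi}$ based on this approximation. But what Q-value do we use? SAC uses clipped double-Q like TD3 for learning the value function, and takes the minimum Q-value between the two approximators. So the SAC loss for value function parameters is:
$$L(\psi, \mathcal{D}) = \mathop{\mathrm{E}}_{s \sim \mathcal{D},\, \tilde{a} \sim \pi_{\theta}}\left[\Big(V_{\psi}(s) - \left(\min_{i=1,2} Q_{\phi_i}(s,\tilde{a}) - \alpha \log \pi_{\theta}(\tilde{a}|s)\right)\Big)^2\right].$$
Importantly, we do not use actions from the replay buffer here: these actions are sampled fresh from the current version of the policy.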
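The two regression targets described above are simple to state in code. The sketch below computes them for a single transition using plain floats: the Q-functions regress toward r + γ(1 − d)·V_targ(s'), and the value function regresses toward min(Q1, Q2) − α·log π, both evaluated at a freshly sampled action. The function names are hypothetical, and the inputs stand in for network outputs.

```python
def q_backup(r, d, v_targ_next, gamma=0.99):
    """Shared target for both Q-functions; d is 1.0 if s' is terminal."""
    return r + gamma * (1.0 - d) * v_targ_next


def v_backup(q1_pi, q2_pi, logp_pi, alpha=0.2):
    """Target for the value function, using clipped double-Q.

    q1_pi, q2_pi, and logp_pi are evaluated at an action sampled fresh from
    the current policy, not at the action stored in the replay buffer.
    """
    return min(q1_pi, q2_pi) - alpha * logp_pi
```

Each loss is then the mean squared difference between the network output and its target over a minibatch.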
Learning the Policy. The policy should, in each state, act to maximize the expected future return plus expected future entropy. That is, it should maximize $V^{\pi}(s)$, which we expand out (as before) into
$$\mathop{\mathrm{E}}_{a \sim \pi}\left[Q^{\pi}(s,a) - \alpha \log \pi(a|s)\right].$$
The way we optimize the policy makes use of the reparameterization trick, in which a sample from $\pi_{\theta}(\cdot|s)$ is drawn by computing a deterministic function of state, policy parameters, and independent noise. To illustrate: following the authors of the SAC paper, we use a squashed Gaussian policy, which means that samples are obtained according to
$$\tilde{a}_{\theta}(s, \xi) = \tanh\left(\mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi\right), \qquad \xi \sim \mathcal{N}(0, I).$$
You Should Know
This policy has two key differences from the policies we use in the other policy optimization algorithms:
1. The squashing function. The $\tanh$ in the SAC policy ensures that actions are bounded to a finite range. This is absent in the VPG, TRPO, and PPO policies. It also changes the distribution: before the $\tanh$ the SAC policy is a factored Gaussian like the other algorithms’ policies, but after the $\tanh$ it is not. (You can still compute the log-probabilities of actions in closed form, though: see the paper appendix for details.)
2. The way standard deviations are parameterized. In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work. (Can you think of why? Or better yet: run an experiment to verify?)
The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on parameters):
$$\mathop{\mathrm{E}}_{a \sim \pi_{\theta}}\left[Q^{\pi_{\theta}}(s,a) - \alpha \log \pi_{\theta}(a|s)\right] = \mathop{\mathrm{E}}_{\xi \sim \mathcal{N}}\left[Q^{\pi_{\theta}}(s,\tilde{a}_{\theta}(s,\xi)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s,\xi)|s)\right]$$
To get the policy loss, the final step is that we need to substitute $Q^{\pi_{\theta}}$ with one of our function approximators. The same as in TD3, we use $Q_{\phi_1}$. The policy is thus optimized according to
$$\max_{\theta} \mathop{\mathrm{E}}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}}\left[Q_{\phi_1}(s,\tilde{a}_{\theta}(s,\xi)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s,\xi)|s)\right],$$
which is almost the same as the DDPG and TD3 policy optimization, except for the stochasticity and entropy term.
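To make the squashed Gaussian concrete, here is a small pure-Python sketch of drawing a reparameterized sample and computing its log-probability. It uses the usual change-of-variables correction for tanh, subtracting log(1 − tanh(u)²) per action dimension. The function name, the list-based inputs, and the small 1e-6 constant added for numerical safety are assumptions for this example; in SAC, mu and log_std would be outputs of the policy network.

```python
import math
import random


def squashed_gaussian_sample(mu, log_std, rng=random):
    """Return (action, log_prob) for a tanh-squashed diagonal Gaussian."""
    action, logp = [], 0.0
    for m, ls in zip(mu, log_std):
        std = math.exp(ls)
        xi = rng.gauss(0.0, 1.0)   # independent noise, free of policy params
        u = m + std * xi           # pre-squash Gaussian sample
        a = math.tanh(u)           # squash into (-1, 1)
        # Log-density of u under N(m, std^2), written in terms of xi.
        logp += -0.5 * xi * xi - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for tanh: subtract log(1 - tanh(u)^2).
        logp -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, logp
```

Because the noise xi is sampled independently of the parameters, the action is a deterministic function of (mu, log_std, xi), which is what lets gradients flow through the sample in an autodiff framework.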
Exploration vs. Exploitation¶
SAC trains a stochastic policy with entropy regularization, and explores in an on-policy way. The entropy regularization coefficient $\alpha$ explicitly controls the explore-exploit tradeoff, with higher $\alpha$ corresponding to more exploration, and lower $\alpha$ corresponding to more exploitation. The right coefficient (the one which leads to the stablest / highest-reward learning) may vary from environment to environment, and could require careful tuning.
At test time, to see how well the policy exploits what it has learned, we remove stochasticity and use the mean action instead of a sample from the distribution. This tends to improve performance over the original stochastic policy.
You Should Know
Our SAC implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal SAC exploration.
Documentation¶

spinup.sac(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, lr=0.001, alpha=0.2, batch_size=100, start_steps=10000, max_ep_len=1000, logger_kwargs={}, save_freq=1)[source]¶
Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
Symbol | Shape | Description
mu | (batch, act_dim) | Computes mean actions from policy given states.
pi | (batch, act_dim) | Samples actions from policy given states.
logp_pi | (batch,) | Gives log probability, according to the policy, of the action sampled by pi. Critical: must be differentiable with respect to policy parameters all the way through action sampling.
q1 | (batch,) | Gives one estimate of Q* for states in x_ph and actions in a_ph.
q2 | (batch,) | Gives another estimate of Q* for states in x_ph and actions in a_ph.
q1_pi | (batch,) | Gives the composition of q1 and pi for states in x_ph: q1(x, pi(x)).
q2_pi | (batch,) | Gives the composition of q2 and pi for states in x_ph: q2(x, pi(x)).
v | (batch,) | Gives the value estimate for states in x_ph.
 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to SAC.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs to run and train agent.
 replay_size (int) – Maximum length of replay buffer.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:
$$\theta_{\text{targ}} \leftarrow \rho \, \theta_{\text{targ}} + (1 - \rho) \, \theta$$
where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)
 lr (float) – Learning rate (used for both policy and value learning).
 alpha (float) – Entropy regularization coefficient. (Equivalent to inverse of reward scale in the original SAC paper.)
 batch_size (int) – Minibatch size for SGD.
 start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
Saved Model Contents¶
The computation graph saved by the logger includes:
Key | Value
x | Tensorflow placeholder for state input.
a | Tensorflow placeholder for action input.
mu | Deterministically computes mean action from the agent, given states in x.
pi | Samples an action from the agent, conditioned on states in x.
q1 | Gives one action-value estimate for states in x and actions in a.
q2 | Gives the other action-value estimate for states in x and actions in a.
v | Gives the value estimate for states in x.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
Note: for SAC, the correct evaluation policy is given by mu and not by pi. The policy pi may be thought of as the exploration policy, while mu is the exploitation policy.
Logger¶
Using a Logger¶
Spinning Up ships with basic logging tools, implemented in the classes Logger and EpochLogger. The Logger class contains most of the basic functionality for saving diagnostics, hyperparameter configurations, the state of a training run, and the trained model. The EpochLogger class adds a thin layer on top of that to make it easy to track the average, standard deviation, min, and max value of a diagnostic over each epoch and across MPI workers.
You Should Know
All Spinning Up algorithm implementations use an EpochLogger.
Examples¶
First, let’s look at a simple example of how an EpochLogger keeps track of a diagnostic value:
>>> from spinup.utils.logx import EpochLogger
>>> epoch_logger = EpochLogger()
>>> for i in range(10):
...     epoch_logger.store(Test=i)
...
>>> epoch_logger.log_tabular('Test', with_min_and_max=True)
>>> epoch_logger.dump_tabular()
-------------------------------------
|     AverageTest |             4.5 |
|         StdTest |            2.87 |
|         MaxTest |               9 |
|         MinTest |               0 |
-------------------------------------
The store method is used to save all values of Test to the epoch_logger’s internal state. Then, when log_tabular is called, it computes the average, standard deviation, min, and max of Test over all of the values in the internal state. The internal state is wiped clean after the call to log_tabular (to prevent leakage into the statistics at the next epoch). Finally, dump_tabular is called to write the diagnostics to file and to stdout.
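The store / log_tabular / dump_tabular cycle can be mimicked in a few lines of plain Python. The class below, TinyEpochLogger, is a hypothetical stand-in written for illustration and is not the Spinning Up EpochLogger: it has no file output and no MPI support. It uses the population standard deviation, which is consistent with the 2.87 shown in the example output.

```python
import math


class TinyEpochLogger:
    """Hypothetical stand-in that mimics the store / log_tabular pattern."""

    def __init__(self):
        self.epoch_dict = {}   # key -> list of values stored this epoch
        self.row = {}          # statistics waiting to be dumped

    def store(self, **kwargs):
        for key, value in kwargs.items():
            self.epoch_dict.setdefault(key, []).append(value)

    def log_tabular(self, key, with_min_and_max=False):
        vals = self.epoch_dict.pop(key)   # popping wipes the stored values
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        self.row['Average' + key] = mean
        self.row['Std' + key] = std
        if with_min_and_max:
            self.row['Max' + key] = max(vals)
            self.row['Min' + key] = min(vals)

    def dump_tabular(self):
        row, self.row = self.row, {}
        for name, value in row.items():
            print(f'{name:>15} | {value:.3g}')
        return row
```

Storing the integers 0 through 9 under the key Test and logging with with_min_and_max=True reproduces the statistics in the table above.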
Next, let’s look at a full training procedure with the logger embedded, to highlight configuration and model saving as well as diagnostic logging:
import numpy as np
import tensorflow as tf
import time
from spinup.utils.logx import EpochLogger


def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)


# Simple script for training an MLP on MNIST.
def train_mnist(steps_per_epoch=100, epochs=5,
                lr=1e-3, layers=2, hidden_size=64,
                logger_kwargs=dict(), save_freq=1):

    logger = EpochLogger(**logger_kwargs)
    logger.save_config(locals())

    # Load and preprocess MNIST data
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 28*28) / 255.0

    # Define inputs & main outputs from computation graph
    x_ph = tf.placeholder(tf.float32, shape=(None, 28*28))
    y_ph = tf.placeholder(tf.int32, shape=(None,))
    logits = mlp(x_ph, hidden_sizes=[hidden_size]*layers + [10], activation=tf.nn.relu)
    predict = tf.argmax(logits, axis=1, output_type=tf.int32)

    # Define loss function, accuracy, and training op
    y = tf.one_hot(y_ph, 10)
    loss = tf.losses.softmax_cross_entropy(y, logits)
    acc = tf.reduce_mean(tf.cast(tf.equal(y_ph, predict), tf.float32))
    train_op = tf.train.AdamOptimizer().minimize(loss)

    # Prepare session
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    # Setup model saving
    logger.setup_tf_saver(sess, inputs={'x': x_ph},
                                outputs={'logits': logits, 'predict': predict})

    start_time = time.time()

    # Run main training loop
    for epoch in range(epochs):
        for t in range(steps_per_epoch):
            idxs = np.random.randint(0, len(x_train), 32)
            feed_dict = {x_ph: x_train[idxs],
                         y_ph: y_train[idxs]}
            outs = sess.run([loss, acc, train_op], feed_dict=feed_dict)
            logger.store(Loss=outs[0], Acc=outs[1])

        # Save model
        if (epoch % save_freq == 0) or (epoch == epochs-1):
            logger.save_state(state_dict=dict(), itr=None)

        # Log info about epoch
        logger.log_tabular('Epoch', epoch)
        logger.log_tabular('Acc', with_min_and_max=True)
        logger.log_tabular('Loss', average_only=True)
        logger.log_tabular('TotalGradientSteps', (epoch+1)*steps_per_epoch)
        logger.log_tabular('Time', time.time()-start_time)
        logger.dump_tabular()

if __name__ == '__main__':
    train_mnist()

In this example, observe that
 On line 19, logger.save_config is used to save the hyperparameter configuration to a JSON file.
 On lines 42 and 43, logger.setup_tf_saver is used to prepare the logger to save the key elements of the computation graph.
 On line 54, diagnostics are saved to the logger’s internal state via logger.store.
 On line 58, the computation graph is saved once per epoch via logger.save_state.
 On lines 61-66, logger.log_tabular and logger.dump_tabular are used to write the epoch diagnostics to file. Note that the keys passed into logger.log_tabular are the same as the keys passed into logger.store.
Logging and MPI¶
You Should Know
Several algorithms in RL are easily parallelized by using MPI to average gradients and/or other key quantities. The Spinning Up loggers are designed to be well-behaved when using MPI: things will only get written to stdout and to file from the process with rank 0. But information from other processes isn’t lost if you use the EpochLogger: everything which is passed into EpochLogger via store, regardless of which process it’s stored in, gets used to compute average/std/min/max values for a diagnostic.
Logger Classes¶

class spinup.utils.logx.Logger(output_dir=None, output_fname='progress.txt', exp_name=None)[source]¶
A general-purpose logger.
Makes it easy to save diagnostics, hyperparameter configurations, the state of a training run, and the trained model.

__init__(output_dir=None, output_fname='progress.txt', exp_name=None)[source]¶
Initialize a Logger.
Parameters:
 output_dir (string) – A directory for saving results to. If None, defaults to a temp directory of the form /tmp/experiments/somerandomnumber.
 output_fname (string) – Name for the tab-separated-value file containing metrics logged throughout a training run. Defaults to progress.txt.
 exp_name (string) – Experiment name. If you run multiple training runs and give them all the same exp_name, the plotter will know to group them. (Use case: if you run the same hyperparameter configuration with multiple random seeds, you should give them all the same exp_name.)

dump_tabular()[source]¶
Write all of the diagnostics from the current iteration.
Writes both to stdout, and to the output file.

log_tabular(key, val)[source]¶
Log a value of some diagnostic.
Call this only once for each diagnostic quantity, each iteration. After using log_tabular to store values for each diagnostic, make sure to call dump_tabular to write them out to file and stdout (otherwise they will not get saved anywhere).

save_config(config)[source]¶
Log an experiment configuration.
Call this once at the top of your experiment, passing in all important config vars as a dict. This will serialize the config to JSON, while handling anything which can’t be serialized in a graceful way (writing as informative a string as possible).
Example use:
logger = EpochLogger(**logger_kwargs)
logger.save_config(locals())

save_state(state_dict, itr=None)[source]¶
Saves the state of an experiment.
To be clear: this is about saving state, not logging diagnostics. All diagnostic logging is separate from this function. This function will save whatever is in state_dict (usually just a copy of the environment) and the most recent parameters for the model you previously set up saving for with setup_tf_saver.
Call with any frequency you prefer. If you only want to maintain a single state and overwrite it at each call with the most recent version, leave itr=None. If you want to keep all of the states you save, provide unique (increasing) values for ‘itr’.
Parameters:
 state_dict (dict) – Dictionary containing essential elements to describe the current state of training.
 itr – An int, or None. Current iteration of training.

setup_tf_saver(sess, inputs, outputs)[source]¶
Set up easy model saving for tensorflow.
Call once, after defining your computation graph but before training.
Parameters:
 sess – The Tensorflow session in which you train your computation graph.
 inputs (dict) – A dictionary that maps from keys of your choice to the tensorflow placeholders that serve as inputs to the computation graph. Make sure that all of the placeholders needed for your outputs are included!
 outputs (dict) – A dictionary that maps from keys of your choice to the outputs from your computation graph.


class spinup.utils.logx.EpochLogger(*args, **kwargs)[source]¶
Bases: spinup.utils.logx.Logger
A variant of Logger tailored for tracking average values over epochs.
Typical use case: there is some quantity which is calculated many times throughout an epoch, and at the end of the epoch, you would like to report the average / std / min / max value of that quantity.
With an EpochLogger, each time the quantity is calculated, you would use
epoch_logger.store(NameOfQuantity=quantity_value)
to load it into the EpochLogger’s state. Then at the end of the epoch, you would use
epoch_logger.log_tabular(NameOfQuantity, **options)
to record the desired values.

log_tabular(key, val=None, with_min_and_max=False, average_only=False)[source]¶
Log a value or possibly the mean/std/min/max values of a diagnostic.
Parameters:
 key (string) – The name of the diagnostic. If you are logging a diagnostic whose state has previously been saved with store, the key here has to match the key you used there.
 val – A value for the diagnostic. If you have previously saved values for this key via store, do not provide a val here.
 with_min_and_max (bool) – If true, log min and max values of the diagnostic over the epoch.
 average_only (bool) – If true, do not log the standard deviation of the diagnostic over the epoch.
Loading Saved Graphs¶

spinup.utils.logx.restore_tf_graph(sess, fpath)[source]¶
Loads graphs saved by Logger.
Will output a dictionary whose keys and values are from the ‘inputs’ and ‘outputs’ dict you specified with logger.setup_tf_saver().
Parameters:
 sess – A Tensorflow session.
 fpath – Filepath to save directory.
Returns: A dictionary mapping from keys to tensors in the computation graph loaded from fpath.
When you use this method to restore a graph saved by a Spinning Up implementation, you can minimally expect it to include the following:
Key | Value
x | Tensorflow placeholder for state input.
pi | Samples an action from the agent, conditioned on states in x.
The relevant value functions for an algorithm are also typically stored. For details of what else gets saved by a given algorithm, see its documentation page.
Plotter¶
See the page on plotting results for documentation of the plotter.
MPI Tools¶
Core MPI Utilities¶

spinup.utils.mpi_tools.mpi_fork(n, bind_to_core=False)[source]¶
Relaunches the current script with workers linked by MPI.
Also, terminates the original process that launched it.
Taken almost without modification from the Baselines function of the same name.
Parameters:
 n (int) – Number of processes to split into.
 bind_to_core (bool) – Bind each MPI process to a core.

spinup.utils.mpi_tools.mpi_statistics_scalar(x, with_min_and_max=False)[source]¶
Get mean/std and optional min/max of scalar x across MPI processes.
Parameters:
 x – An array containing samples of the scalar to produce statistics for.
 with_min_and_max (bool) – If true, return min and max of x in addition to mean and std.
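To see what "statistics across MPI processes" means without running MPI, the sketch below simulates several workers as a list of lists. Each inner list plays the role of one process's local samples, and the Python sums stand in for all-reduce operations. The function name combined_statistics is hypothetical, and this is an illustration of the idea rather than the Spinning Up implementation.

```python
import math


def combined_statistics(per_worker_samples, with_min_and_max=False):
    """Mean/std (and optionally min/max) over samples held by several workers."""
    # Stage 1: global sum and global count give the global mean.
    global_sum = sum(sum(xs) for xs in per_worker_samples)
    global_n = sum(len(xs) for xs in per_worker_samples)
    mean = global_sum / global_n
    # Stage 2: global sum of squared deviations gives the global std.
    global_sq = sum(sum((x - mean) ** 2 for x in xs)
                    for xs in per_worker_samples)
    std = math.sqrt(global_sq / global_n)
    if with_min_and_max:
        lo = min(min(xs) for xs in per_worker_samples if xs)
        hi = max(max(xs) for xs in per_worker_samples if xs)
        return mean, std, lo, hi
    return mean, std
```

Splitting the integers 0 through 9 across two simulated workers gives the same mean and standard deviation as computing them on a single list, which is the property that makes per-process logging safe.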
MPI + Tensorflow Utilities¶
The spinup.utils.mpi_tf module contains a few tools to make it easy to use the AdamOptimizer across many MPI processes. This is a bit hacky; if you’re looking for something more sophisticated and general-purpose, consider horovod.

class spinup.utils.mpi_tf.MpiAdamOptimizer(**kwargs)[source]¶
Adam optimizer that averages gradients across MPI processes.
The compute_gradients method is taken from Baselines MpiAdamOptimizer. For documentation on method arguments, see the Tensorflow docs page for the base AdamOptimizer.
Run Utils¶
ExperimentGrid¶
Spinning Up ships with a tool called ExperimentGrid for making hyperparameter ablations easier. This is based on (but simpler than) the rllab tool called VariantGenerator.

class spinup.utils.run_utils.ExperimentGrid(name='')[source]¶
Tool for running many experiments given hyperparameter ranges.

add(key, vals, shorthand=None, in_name=False)[source]¶
Add a parameter (key) to the grid config, with potential values (vals).
By default, if a shorthand isn’t given, one is automatically generated from the key using the first three letters of each colon-separated term. To disable this behavior, change DEFAULT_SHORTHAND in the spinup/user_config.py file to False.
Parameters:
 key (string) – Name of parameter.
 vals (value or list of values) – Allowed values of parameter.
 shorthand (string) – Optional, shortened name of parameter. For example, maybe the parameter steps_per_epoch is shortened to steps.
 in_name (bool) – When constructing variant names, force the inclusion of this parameter into the name.
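The automatic shorthand rule (first three letters of each colon-separated term) can be sketched as a one-line helper. The function name default_shorthand and the '-' used to join the shortened terms are assumptions for this illustration; the documentation above does not specify how the pieces are joined.

```python
def default_shorthand(key, sep='-'):
    """Build a shorthand from the first three letters of each colon-separated term.

    The joining separator `sep` is an assumption made for this sketch.
    """
    return sep.join(term[:3] for term in key.split(':'))
```

Under these assumptions a key such as 'steps_per_epoch' would shorten to 'ste', while a nested key such as 'ac_kwargs:hidden_sizes' would shorten to 'ac_-hid'.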

run(thunk, num_cpu=1, data_dir=None, datestamp=False)[source]¶
Run each variant in the grid with function ‘thunk’.
Note: ‘thunk’ must be either a callable function, or a string. If it is a string, it must be the name of a parameter whose values are all callable functions.
Uses call_experiment to actually launch each experiment, and gives each variant a name using self.variant_name().
Maintenance note: the args for ExperimentGrid.run should track closely to the args for call_experiment. However, seed is omitted because we presume the user may add it as a parameter in the grid.

variant_name(variant)[source]¶
Given a variant (dict of valid param/value pairs), make an exp_name.
A variant’s name is constructed as the grid name (if you’ve given it one), plus param names (or shorthands if available) and values separated by underscores.
Note: if seed is a parameter, it is not included in the name.

variants()[source]¶
Makes a list of dicts, where each dict is a valid config in the grid.
There is special handling for variant parameters whose names take the form 'full:param:name'.
The colons are taken to indicate that these parameters should have a nested dict structure. E.g., if there are two params,
Key | Val
'base:param:one' | 1
'base:param:two' | 2
the variant dict will have the structure
variant = { base: { param : { one : 1, two : 2 } } }
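The colon handling and the grid expansion can be sketched together in plain Python. Both function names below (unflatten and grid_variants) are hypothetical, and the grid is represented as a dict from parameter name to list of values for simplicity. This illustrates the behavior described above and is not the ExperimentGrid source.

```python
import itertools


def unflatten(flat):
    """Turn {'a:b:c': 1} style keys into nested dicts {'a': {'b': {'c': 1}}}."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(':')
        node = nested
        for name in parents:
            node = node.setdefault(name, {})   # descend, creating dicts as needed
        node[leaf] = value
    return nested


def grid_variants(grid):
    """List every combination of values in `grid`, one nested dict per variant."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [unflatten(dict(zip(keys, combo))) for combo in combos]
```

A grid with two seeds and three values of a nested parameter therefore yields six variants, each with the nested parameter tucked under its parent keys.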

Calling Experiments¶

spinup.utils.run_utils.call_experiment(exp_name, thunk, seed=0, num_cpu=1, data_dir=None, datestamp=False, **kwargs)[source]¶
Run a function (thunk) with hyperparameters (kwargs), plus configuration.
This wraps a few pieces of functionality which are useful when you want to run many experiments in sequence, including logger configuration and splitting into multiple processes for MPI.
There’s also a Spinning Up-specific convenience added into executing the thunk: if env_name is one of the kwargs passed to call_experiment, it’s assumed that the thunk accepts an argument called env_fn, and that the env_fn should make a gym environment with the given env_name.
The way the experiment is actually executed is slightly complicated: the function is serialized to a string, and then run_entrypoint.py is executed in a subprocess call with the serialized string as an argument. run_entrypoint.py unserializes the function call and executes it. We choose to do it this way, instead of just calling the function directly here, to avoid leaking state between successive experiments.
Parameters:
 exp_name (string) – Name for experiment.
 thunk (callable) – A python function.
 seed (int) – Seed for random number generators.
 num_cpu (int) – Number of MPI processes to split into. Also accepts ‘auto’, which will set up as many procs as there are cpus on the machine.
 data_dir (string) – Used in configuring the logger, to decide where to store experiment results. Note: if left as None, data_dir will default to DEFAULT_DATA_DIR from spinup/user_config.py.
 **kwargs – All kwargs to pass to thunk.

spinup.utils.run_utils.setup_logger_kwargs(exp_name, seed=None, data_dir=None, datestamp=False)[source]¶
Sets up the output_dir for a logger and returns a dict for logger kwargs.
If no seed is given and datestamp is false,
output_dir = data_dir/exp_name
If a seed is given and datestamp is false,
output_dir = data_dir/exp_name/exp_name_s[seed]
If datestamp is true, amend to
output_dir = data_dir/YY-MM-DD_exp_name/YY-MM-DD_HH-MM-SS_exp_name_s[seed]
You can force datestamp=True by setting FORCE_DATESTAMP=True in spinup/user_config.py.
Parameters:
 exp_name (string) – Name for experiment.
 seed (int) – Seed for random number generators used by experiment.
 data_dir (string) – Path to folder where results should be saved. Default is the DEFAULT_DATA_DIR in spinup/user_config.py.
 datestamp (bool) – Whether to include a date and timestamp in the name of the save directory.
Returns: logger_kwargs, a dict containing output_dir and exp_name.
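The three directory patterns can be sketched with the standard library. The function below, sketch_output_dir, is a hypothetical illustration of the naming scheme described above and is not the library function: the default data_dir value, the now argument (added so the timestamp can be fixed), and the exact separators in the datestamped form are assumptions.

```python
import os.path as osp
import time


def sketch_output_dir(exp_name, seed=None, data_dir='data', datestamp=False,
                      now=None):
    """Compose an output directory following the documented patterns."""
    now = time.localtime() if now is None else now
    # Optional date prefix on the experiment folder.
    day = time.strftime('%y-%m-%d_', now) if datestamp else ''
    path = osp.join(data_dir, day + exp_name)
    if seed is not None:
        # Per-seed subfolder, with an optional date-and-time prefix.
        stamp = time.strftime('%y-%m-%d_%H-%M-%S_', now) if datestamp else ''
        path = osp.join(path, stamp + exp_name + '_s' + str(seed))
    return path
```

Grouping all seeds of one experiment under a shared exp_name folder is what lets a plotter treat them as repeated runs of the same configuration.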
Acknowledgements¶
We gratefully acknowledge the contributions of the many people who helped get this project off of the ground, including people who beta tested the software, gave feedback on the material, improved dependencies of Spinning Up code in service of this release, or otherwise supported the project. Given the number of people who were involved at various points, this list of names may not be exhaustive. (If you think you should have been listed here, please do not hesitate to reach out.)
In no particular order, thank you Alex Ray, Amanda Askell, Ben Garfinkel, Christy Dennison, Coline Devin, Daniel Zeigler, Dylan Hadfield-Menell, Ge Yang, Greg Khan, Jack Clark, Jonas Rothfuss, Larissa Schiavo, Leandro Castelao, Lilian Weng, Maddie Hall, Matthias Plappert, Miles Brundage, Peter Zokhov, and Pieter Abbeel.
We are also grateful to Pieter Abbeel’s group at Berkeley, and the Center for Human-Compatible AI, for giving feedback on presentations about Spinning Up.
About the Author¶
Spinning Up in Deep RL was primarily developed by Josh Achiam, a research scientist on the OpenAI Safety Team and PhD student at UC Berkeley advised by Pieter Abbeel. Josh studies topics related to safety in deep reinforcement learning, and has previously published work on safe exploration.
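About the Translator¶
While studying reinforcement learning, I came across Spinning Up, a project provided by the OpenAI team. I thought it was excellent, so I decided to translate some of its technical articles for readers.
The translation is still in progress, and I will finish the rest as soon as I can.
My ability as a translator is limited, and the translation surely contains many mistakes. Corrections are welcome: please open an issue or a PR, and I will fix them promptly.
GitHub repository for this project: Spinning UP
Personal homepage: 一时博客: https://hellogod.cn/
Finally, a small plug: you are welcome to follow the WeChat official account 一时博客.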