Welcome to q2!¶
q2 is a reinforcement learning framework and command line tool. It consists of a set of interfaces that help organize research and development and a set of command line tools that unify your workflow. It also comes “batteries-included” with a set of helpful components and utilities that are frequently used in reinforcement learning work.
q2 aims to improve your reinforcement learning work in the following ways:
- Reduce boilerplate: no more copy-pasted code or reinventing the wheel
- Easier collaboration if your colleagues are familiar with q2
- Faster iteration time: try out a new reinforcement learning idea in only three steps
q2 is completely open-source and under active development, so if you find something missing, please reach out on GitHub!
Get started¶
Installation¶
You will need Python 3.5 or greater, because q2 makes extensive use of Python’s type annotations.
Then simply run:
pip install q2
First time usage¶
To support your work, q2 needs a certain project structure. For a new project, you can generate this structure by running the following command from within the project directory:
q2 init
This will create four folders and one file:
agents/
environments/
objectives/
regimens/
objects.yaml
You can then run:
q2 generate --help
to see how to go about generating new agents, environments, objectives, regimens and more.
To see q2 in action right away, start a training session with q2’s built-in random agent in OpenAI Gym’s CartPole-v1 environment by running:
q2 train random --env gym.CartPole-v1 --episodes 10 --render
You should see a window open up rendering the CartPole-v1 environment with the random agent playing. The basic syntax of the train command is:
q2 train <agent> --env <environment> --episodes <num_episodes> --render
Run q2 train --help to see all the options.
Contents¶
Tutorial¶
In this tutorial you will use q2 to implement a Deep-Q Network that can solve the cart-pole problem. The goal is to familiarize you with q2, show you how it can speed up and simplify development, and maybe even teach you a bit about reinforcement learning to boot.
In the cart-pole problem, the agent must balance an unstable vertical pole on top of a cart that can roll horizontally. It’s a classic problem in dynamics and fortunately, OpenAI Gym offers an implementation of it in the CartPole-v1 environment, which you can access through q2. You can think of the cart-pole problem as a video game with two buttons, left and right. Your goal is to keep the pole from falling over for as long as possible using only those two buttons.
At each time step, the environment yields an observation that corresponds to the position and velocity of the cart, and the angular position and angular velocity of the pole. The agent then selects either left or right, which adds a constant increment to the cart’s velocity in the left or right direction respectively. For every time step that it survives, the agent gets +1 reward.
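If you’d like to poke at the environment before writing any q2 code, here is a minimal sketch that inspects it through OpenAI Gym directly (this is plain Gym, not q2, and it assumes the classic Gym API in which step returns a 4-tuple):
import gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 pushes the cart left, 1 pushes it right

state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())
print(reward)  # +1.0 for every time step survived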
Now that you understand the problem, let’s build an agent that solves it!
Setup¶
First up, there’s some basic setup to do. You’ll need Python 3.5 or greater. You can check what version you have at the command line with:
python3 --version
Next, make a new folder called q2_tutorial
for this project:
mkdir q2_tutorial
cd q2_tutorial
Then make a virtual environment to keep this project’s dependencies isolated from the rest of your system:
python3 -m venv env
. ./env/bin/activate
And finally, install q2:
pip install q2
q2 offers a command line interface that automates mundane tasks so you can spend more time on the problem. First, have q2 set up your project structure automatically:
q2 init
Then, test that q2 is working properly by starting a training session on the cart-pole problem with the built-in random agent:
q2 train random --env gym.CartPole-v1 --episodes 10 --render
You should see the cart-pole environment rendered on the screen, with the random agent controlling the cart. As you might guess from the name, the random agent just chooses a random action every time. In this case, that would be a random string of lefts (0) and rights (1). This turns out not to be a very good strategy. It’s time to build our own agent, to see if we can do better.

The random agent playing gym.CartPole-v1.
Generate¶
q2 can generate a new agent from a template for you. This helps you avoid rewriting boilerplate and gets you started with a working agent based on the random agent. We’re going to build a Deep-Q Network, or DQN for short, so use q2 to generate an agent called dqn:
q2 generate agent dqn
Now you should have a file at q2_tutorial/agents/dqn.py. Open it up in an editor. You should see a Python class definition that begins:
class Dqn(Agent):
    ...
There are six methods defined: __init__, load, save, act, step, and learn. Here’s a run-down of what each method is for (a skeleton sketch follows the list):
__init__
- This is where you will define the neural network and set up other objects that the agent needs.
act(state, training_flag) -> action
- This method is called whenever an action is needed. It takes an environment state (e.g. cart-pole position and velocity) and a boolean flag that tells the agent whether this is training or testing. It should return a valid action.
step(state, action, reward, next_state, episode_end)
- This method is the agent’s chance to do some work between steps of the environment. It is called once every time step and can be used to do things like run the learning step or append to internal buffers.
learn(states, actions, rewards, next_states, episode_ends) -> loss
- Runs the learning step using the training data provided. This is where you will run a gradient descent step.
load, save
- These methods are responsible for saving and loading the Tensorflow checkpoint file, so that your agent remembers what it has learned between training sessions. You will not need to edit them during this tutorial.
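As a rough sketch, the generated class has roughly this shape (method bodies elided here; the exact template q2 generates may differ slightly, and the signatures below mirror the full source listed at the end of this tutorial):
import numpy as np
import tensorflow as tf
from gym import Space
from q2.agents import Agent

class Dqn(Agent):
    def __init__(self, observation_space:Space, action_space:Space):
        ...  # define the network and any other objects the agent needs

    def load(self, sess:tf.Session):
        ...  # restore a TensorFlow checkpoint, if one exists

    def save(self, sess:tf.Session):
        ...  # write a TensorFlow checkpoint

    def act(self, sess:tf.Session, state:np.array, train:bool) -> np.array:
        ...  # return a valid action for the given state

    def step(self, sess, state, action, reward, next_state, done):
        ...  # per-timestep bookkeeping; usually calls self.learn(...)

    def learn(self, sess, states, actions, rewards, next_states, episode_ends) -> float:
        ...  # run one gradient step and return the loss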
Deep-Q Networks¶
Before we start implementing, here’s a primer on Deep-Q Networks. A Deep-Q Network is fundamentally just a neural network that learns to predict how much reward it will get if it takes action A when the environment is in state S. Given a good approximation to this reward function (which is usually denoted Q(S, A), hence the “Q” in DQN), it’s easy to implement a good agent: just choose the action that is predicted to produce the most reward!
So at each time step the agent does something and gets a certain amount of reward. The DQN looks at the state that the environment was in, the action that was taken and makes a prediction for what the reward will be. It then compares its prediction with the actual reward that was given and updates itself based on its error. When it comes time to choose the next action, our agent simply runs the DQN and chooses the action that is predicted to produce the most reward. That’s it.
Well, there is one small complication: the reward that we care about maximizing is actually the total future reward, not just the reward on the next time step. For example, we want our agent to learn how to wait in order to take 10 marshmallows tomorrow rather than 1 today. To make that happen, the DQN is actually going to be learning to predict the total future reward of an action.
After each time step, the agent will use the DQN twice: once to make a prediction about the state that it just acted on, and once to predict the total future reward based on the next environment state that was just entered. Then, we add the actual reward from the current time step to the total future reward predicted for the upcoming state to get our target “ground-truth” total future reward. This is then fed to the DQN to compare with its prediction and update. [1]
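To make this concrete, here is a tiny numeric sketch of that target computation (the Q-values are made up for illustration; gamma is the discount factor, which you will meet again in the learn method below):
import numpy as np

gamma = 0.99                    # discount factor
reward = 1.0                    # reward actually received on this time step
next_q = np.array([0.3, 0.7])   # hypothetical Q(next_state, a) for actions 0 and 1

# Target "ground-truth" total future reward for the action that was taken:
target = reward + gamma * next_q.max()   # 1.0 + 0.99 * 0.7 = 1.693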
Build¶
You’re now ready to define a model. First you’ll create a two-layer neural
network in __init__
using Tensorflow:
def __init__(self, action_space, observation_space):
if not isinstance(action_space, Discrete):
raise TypeError("Invalid environment, this agent only works" +
"with Discrete action spaces.")
self.action_space = action_space
self.name = type(self).__name__ + "Agent"
self.checkpoint_name = 'checkpoints/' + self.name + '.ckpt'
# The training regimen pulls messages from the agent to be displayed
# during training.
self.message = ""
# It's a good idea to keep track of training loss
self.losses = list()
# Model parameters
hidden_nodes = 128
learning_rate = 1e-4
# Model definition
with tf.variable_scope(self.name):
# Input placeholders
self.state = tf.placeholder(tf.float32,
[None, *observation_space.shape], name='state')
self.target = tf.placeholder(tf.float32, [None], name='target')
self.action = tf.placeholder(tf.int32,
[None, *action_space.shape], name='action')
# Transformed inputs
self.action_vector = tf.one_hot(self.action, action_space.n)
self.state_flat = tf.layers.flatten(self.state)
# Hidden layers
self.hidden0 = tf.contrib.layers.fully_connected(self.state_flat,
hidden_nodes)
self.hidden1 = tf.contrib.layers.fully_connected(self.hidden0,
hidden_nodes)
# Outputs
self.value = tf.contrib.layers.fully_connected(self.hidden1,
action_space.n, activation_fn=None)
self.predicted_reward = tf.reduce_sum(tf.multiply(self.value,
self.action_vector), axis=1)
# Learning
self.loss = tf.reduce_mean(tf.square(self.target -
self.predicted_reward))
self.opt = tf.train.AdamOptimizer(
learning_rate=learning_rate).minimize(self.loss)
So far this is a fairly standard model definition in Tensorflow. You’ve defined a computational graph that will be run later during act and learn to produce a prediction of the total future reward to be had for each possible action. Next, implementing act is straightforward. You just compute the value for each action and then choose the best one:
def act(self,
        sess:tf.Session,
        state:np.array,
        train:bool,
        ) -> np.array:
    # self.value holds the predicted rewards for each action
    value = sess.run(self.value, feed_dict={
        self.state: state.reshape((1, *state.shape))
    })
    best_action = np.argmax(value)
    return best_action
learn is where you compute the total future reward based on the new state of the environment, and then feed that to the DQN as the target towards which to optimise:
def learn(self,
          sess:tf.Session,
          states:np.array,
          actions:np.array,
          rewards:np.array,
          next_states:np.array,
          episode_ends:np.array
          ) -> float:
    # Discount factor
    gamma = 0.99
    # Compute ground-truth total future expected value based on actual
    # rewards using the Bellman equation
    future_values = sess.run(self.value, feed_dict={
        self.state: next_states,
    })
    # Expected future value is 0 if episode has ended
    future_values[episode_ends] = np.zeros(future_values.shape[1:])
    # The Bellman equation
    targets = rewards + gamma * np.max(future_values, axis=1)
    loss, _ = sess.run([self.loss, self.opt], feed_dict={
        self.state: states,
        self.target: targets,
        self.action: actions,
    })
    return loss
Finally, step is where you run the learning step. For now there is nothing else that needs to be done here:
def step(self,
         sess:tf.Session,
         state:np.array,
         action:np.array,
         reward:float,
         next_state:np.array,
         done:bool
         ):
    # For this simple agent, all we need to do here is run the
    # learning step.
    loss = self.learn(sess, [state], [action], [reward], [next_state],
        [done])
    self.losses.append(loss)
    self.message = "Loss: {:.2f}".format(loss)
You now have a fully functioning DQN agent! Try it out against the cart-pole environment in a training session:
q2 train dqn --env gym.CartPole-v1 --episodes 10 --render
Once again you should see the cart-pole environment rendered on the screen, only this time your Dqn agent is playing.
Extend¶
With the basic implementation above, you probably observed that the agent always goes to one side as quickly as it can. This is a very common failure mode for RL agents. In our case, the initial weights of the DQN came out slightly favouring either left or right. The agent chose that action and then received a reward of +1 for surviving the time step, which caused the DQN to increase its confidence in that action, leading to a runaway self-reinforcing process in which it only ever outputs the same action.
Exploration¶
One way to remedy this is to break the loop by injecting some randomness into the agent’s actions. q2 comes with some useful tools for this out of the box. At the top of the file, import a decaying noise generator like so:
from q2.agents.noise import DecayProcess
DecayProcess generates a stream of 1s and 0s, with 1s showing up less and less frequently as the process goes on. We can use this to add some randomness to our agent’s behaviour that starts out big and slowly disappears, letting the agent have more and more control. Go back down to __init__ and add a line to instantiate the DecayProcess:
def __init__(...):
    ...
    # Agents need to trade off between exploring and exploiting. This decay
    # process starts the agent off with a high initial exploration tendency
    # and gradually reduces it over time.
    self.noise = DecayProcess(
        explore_start=1.0, explore_stop=0.1, final_frame=1e4)
    ...
We’ll make use of this when choosing the next action. Add these lines to the start of the definition of act:
def act(...):
    # Decide whether to "explore" i.e. take a completely random action
    if self.noise.sample() == 1 and train:
        return self.action_space.sample()
    ...
Finally, in order for the process to decay it needs to be stepped every time that the agent is stepped. Modify the end of step like so:
def step(...):
    ...
    self.noise.step()
    self.message = "Loss: {:.2f}\tExplore: {:.2f}".format(
        loss, self.noise.epsilon)
Now run a training session with your agent again! You should observe it mixing up its actions much more often.
Replay buffer¶
At this point, if you just left the agent running for a few thousand episodes it would solve this environment. However, at the moment the agent is learning very inefficiently. At each time step it looks at what just happened and tries to learn from it. This means that the variance in the gradient will be high, and the network will take a winding, inefficient path down the objective landscape. Additionally, the fact that the network is learning from events in the order that they happened means that it is vulnerable to loops in the learning process that might prevent it from converging.
We can fix this by adding one last component to the agent: a replay buffer. The agent will record each step of the environment to a buffer, and at each step it will sample from this buffer to get training data for the learning step. This breaks potential feedback loops because learning can happen out of order. It also reduces variance in the gradient step by averaging over multiple data points. Once again, q2 comes with a helper to make implementing this easy. At the top of the file, add:
from q2.agents.history import History
Then in __init__, add this line:
def __init__(...):
    ...
    # In the learning step, we will sample from a history of
    # the last 1000 training steps seen.
    self.history = History(1000)
    ...
And add these lines to the start of step, just before the learning step:
def step(...):
    # Add the current step to the history buffer
    self.history.step(state, action, reward, next_state, done)
    # Sample history for learning
    batch_size = 10
    states, actions, rewards, next_states, dones = self.history.sample(
        batch_size)
    ...
Finally, modify the call to self.learn to use the batch of data:
    loss = self.learn(sess, states, actions, rewards, next_states, dones)
That’s all! Run the agent again and observe how much faster the loss drops. Finally, try running the training for 500 episodes like so:
q2 train dqn --env gym.CartPole-v1 --epochs 5 --episodes 100
Once it’s done, you can run a test session in which the agent doesn’t explore at all:
q2 train dqn --env gym.CartPole-v1 --episodes 10 --test --render
If all went well, the agent should be noticeably better at cart-pole than when it started. Try running the random agent again to compare.
That’s the end of this tutorial. Hopefully you see how q2 makes developing RL agents easier and faster. For some next steps, try modifying this agent to learn other environments. Or try messing with the model parameters and architecture to see if you can get it to solve cart-pole faster: OpenAI Gym defines solving cart-pole as consistently achieving an episode score above 195.
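If you want to check that criterion yourself, one way to read “consistently” is an average score of at least 195 over the last 100 episodes. A minimal sketch (episode_scores is a hypothetical list of your agent’s per-episode totals; q2 does not produce this variable for you):
import numpy as np

# Hypothetical list of the agent's total reward per episode.
episode_scores = [200.0] * 100

# Consider cart-pole solved if the average of the last 100 episodes is >= 195.
solved = len(episode_scores) >= 100 and np.mean(episode_scores[-100:]) >= 195
print("Solved!" if solved else "Keep training.")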
Source code¶
The complete source code for the agent you developed is available below for reference:
import tensorflow as tf
import numpy as np

from gym import Space
from gym.spaces import Discrete, Box, MultiBinary

from q2.agents import Agent
from q2.agents.noise import DecayProcess
from q2.agents.history import History


class Dqn(Agent):
    def __init__(self, observation_space:Space, action_space:Space):
        if not isinstance(action_space, Discrete):
            raise TypeError("Invalid environment, this agent only works with Discrete action spaces.")
        self.action_space = action_space
        self.name = type(self).__name__ + "Agent"
        self.checkpoint_name = 'checkpoints/' + self.name + '.ckpt'
        # The training regimen pulls messages from the agent to be displayed during training
        self.message = ""
        # It's a good idea to keep track of training loss
        self.losses = list()
        # Model parameters
        hidden_nodes = 128
        learning_rate = 1e-4
        # In the learning step, we will sample from a history of training steps seen.
        self.history = History(1000)
        # Agents need to trade off between exploring and exploiting. This decay process starts
        # the agent off with a high initial exploration tendency and gradually reduces it over
        # time.
        self.noise = DecayProcess(explore_start=1.0, explore_stop=0.1, final_frame=1e4)
        # Model definition
        with tf.variable_scope(self.name):
            # Input placeholders
            self.state = tf.placeholder(tf.float32, [None, *observation_space.shape], name='state')
            self.target = tf.placeholder(tf.float32, [None], name='target')
            self.action = tf.placeholder(tf.int32, [None, *action_space.shape], name='action')
            # Transformed inputs
            self.action_vector = tf.one_hot(self.action, action_space.n)
            self.state_flat = tf.layers.flatten(self.state)
            # Hidden layers
            self.hidden0 = tf.contrib.layers.fully_connected(self.state_flat, hidden_nodes)
            self.hidden1 = tf.contrib.layers.fully_connected(self.hidden0, hidden_nodes)
            # Outputs
            self.value = tf.contrib.layers.fully_connected(self.hidden1, action_space.n,
                activation_fn=None)
            self.predicted_reward = tf.reduce_sum(tf.multiply(self.value, self.action_vector), axis=1)
            # Learning
            self.loss = tf.reduce_mean(tf.square(self.target - self.predicted_reward))
            self.opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(self.loss)

    def load(self, sess:tf.Session):
        train_vars = tf.trainable_variables(scope=self.name)
        saver = tf.train.Saver(train_vars)
        try:
            saver.restore(sess, self.checkpoint_name)
            print("Checkpoint loaded")
        except (tf.errors.InvalidArgumentError, tf.errors.NotFoundError):
            print("Checkpoint file not found, skipping load")

    def save(self, sess:tf.Session):
        train_vars = tf.trainable_variables(scope=self.name)
        saver = tf.train.Saver(train_vars)
        saver.save(sess, self.checkpoint_name)

    def act(self,
            sess:tf.Session,
            state:np.array,
            train:bool,
            ) -> np.array:
        # Decide whether to "explore" i.e. take a completely random action
        if self.noise.sample() == 1 and train:
            return self.action_space.sample()
        value = sess.run(self.value, feed_dict={
            self.state: state.reshape((1, *state.shape))
        })
        best_action = np.argmax(value)
        return best_action

    def step(self,
             sess:tf.Session,
             state:np.array,
             action:np.array,
             reward:float,
             next_state:np.array,
             done:bool
             ):
        # Add the current step to the history buffer
        self.history.step(state, action, reward, next_state, done)
        # Sample history for learning
        batch_size = 10
        states, actions, rewards, next_states, dones = self.history.sample(batch_size)
        # Run the learning step on the sampled batch
        loss = self.learn(sess, states, actions, rewards, next_states, dones)
        self.losses.append(loss)
        self.noise.step()
        self.message = "Loss: {:.2f}\tExplore: {:.2f}".format(loss, self.noise.epsilon)

    def learn(self,
              sess:tf.Session,
              states:np.array,
              actions:np.array,
              rewards:np.array,
              next_states:np.array,
              episode_ends:np.array
              ) -> float:
        # Discount factor
        gamma = 0.99
        # Compute ground-truth total future expected value based on actual rewards
        # using the Bellman equation
        future_values = sess.run(self.value, feed_dict={
            self.state: next_states,
        })
        # Expected future value is 0 if episode has ended
        future_values[episode_ends] = np.zeros(future_values.shape[1:])
        # The Bellman equation
        targets = rewards + gamma * np.max(future_values, axis=1)
        loss, _ = sess.run([self.loss, self.opt], feed_dict={
            self.state: states,
            self.target: targets,
            self.action: actions,
        })
        return loss
Footnotes
[1] In reinforcement learning this idea is known as the Bellman equation.
Command line tool¶
q2 offers three basic commands:
init
generate
train
You can run:
q2 --help
to see usage on the command line.
Commands¶
init
Initialize a new project
A one-time command that you should run at the start of a new project. It will set up the necessary directory structure for you (if some or all of this structure already exists, it won’t overwrite it).
generate
Create new objects easily
The basic syntax is:
q2 generate <type> <name>
where <type> can be agent, environment, objective or regimen. When you run the command, an appropriate template is pulled up and rendered with the name you specified, then written to the appropriate folder within your q2 project. You can then edit the newly generated object by opening up the generated file.
train
Run a training session
Begins a training session in which your agent will interact with an environment and learn interesting new behaviours. The basic syntax is:
q2 train <agent> --env <environment> --regimen <regimen> \
    --episodes <num_episodes> [--render]
First some setup happens, then the specified regimen is instantiated and control is handed over to it. All regimens perform at least the basic process of successively stepping the agent and the environment and logging basic information, but a lot more can also be happening. See the Regimen section for more details (TODO link). When the training is done, certain outputs may have been generated, including training loss data and Tensorflow checkpoint files.
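For example, a full invocation using the tutorial’s dqn agent might look like this (assuming the default regimen can be selected by the name online):
q2 train dqn --env gym.CartPole-v1 --regimen online --episodes 100 --render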
Project structure¶
To support your work, q2 expects a certain project structure, consisting of four folders and one file:
agents/
environments/
objectives/
regimens/
objects.yaml
q2 will set up this structure for you when you run the command:
q2 init
from within your project directory.
Each of the directories contains your agents, environments, objectives and regimens respectively. q2 uses objects.yaml to keep track of your stuff so that the command line tool knows where to look for it. It is a YAML file that contains a reference to each user-defined object with some supporting information and metadata. You shouldn’t need to modify it directly, but it does need to be checked into source control.
Agents¶
Agents generate actions in response to states of the environment. Usually they also learn certain behaviours in response to a reward signal. Agents are the primary object of interest to reinforcement learning researchers, and you’ll probably spend most of your time working on them. In q2, an agent is any Python class that inherits the q2.Agent abstract base class [1], which simply means that it has to implement a certain set of methods:
class q2.agents.Agent¶
An abstract base class (interface) that specifies what an agent must implement. You will fill in your own definitions for each of these methods.
__init__(observation_space, action_space)¶
This is where you define the model and set up any other objects that the agent needs.
act(state, training_flag) → action¶
This method is called whenever an action is needed. It takes an environment state (e.g. cart-pole position and velocity) and a boolean flag that tells the agent whether this is training or testing. It should return a valid action.
Note: this method should be as close to a pure function as possible, always returning the agent’s best guess for what to do next given the state. Side-effects like updating internal state should be handled by step.
step(state, action, reward, next_state, episode_end)¶
This method is the agent’s chance to do some work between steps of the environment. It is called once every time step and should be used to do things like run the learning step, or update an internal buffer of environment observations.
learn(states, actions, rewards, next_states, episode_ends) → loss¶
This should run a learning step to update the agent’s model using the batch of training data passed as arguments. It will typically be called from within step, but q2 requires you to implement it separately so that training regimens can control when the agent learns.
load(), save()¶
These two methods are responsible for loading and saving anything the agent needs to keep around between training sessions, e.g. TensorFlow checkpoint files.
q2 will call these methods for you during training sessions. For example, in the default training regimen (online), act is called when the environment is ready for the agent to supply its next action, step is called after the environment has updated to its next state, and load and save are called at the beginning and end of the training session respectively [2].
When you generate a new agent with q2 generate agent <agent_name>, these methods are stubbed out for you with an implementation drawn from the random agent (which simply chooses a random action at each time step). This is so that you can see a basic implementation and hack on it instead of starting from scratch.
Implementation checklist¶
Here is the basic implementation checklist for creating an agent that uses deep learning:
- Run q2 generate agent my_agent to generate a new agent from template.
- In __init__, define your network. From here on I will assume you named the tensor that represents the agent’s output actions as self.action, and the tensor that represents the agent’s learning step as self.optimize.
- In act, call action = sess.run(self.action, feed_dict={...}) with the appropriate inputs and return the action at the end.
- In learn, call loss, _ = sess.run([self.loss, self.optimize], feed_dict={...}) with the appropriate inputs and return the loss at the end.
- In step, handle the new environment state and call self.learn(...) somewhere.
- The default implementations of save and load are good enough for most purposes; I recommend that you leave them as they are until there is a reason to change them.
Footnotes
[1] See q2/agents/agent.py for the full abstract base class.
[2] learn is only explicitly called by the regimen during offline training.
Environments¶
Environments evolve over time as the agent interacts with them, typically according to predetermined rules, so that the environment presents consistent behaviour. They may also define end conditions that cause the episode to end.
q2 comes with a set of environments from OpenAI Gym and OpenAI Retro. On the command line, run:
q2 train random --env gym.CartPole-v1 --episodes 10 --render
to see OpenAI Gym’s cart-pole environment in action.
Any OpenAI Gym environment can be referenced from the command line tool as gym.<environment>, and similarly any OpenAI Retro game can be specified as retro.<game>. Note that some of the Retro environments require special ROMs [1].
However, eventually you may want to design your own training environment. To do so, you need to make a Python class that inherits from the abstract base class q2.environments.Environment, which means that you will implement the following interface:
class q2.environments.Environment¶
An abstract base class (interface) that specifies what the environment must implement. You will fill in your own definitions for each of these methods.
reset() → state¶
Reset the environment to its initial state and return it.
step(action) → next_state, reward, done, info¶
Given the agent’s choice of action, move the environment forward one time step and return a 4-tuple that includes the new state, the reward as a float, whether the environment has entered an end state as a bool, and an arbitrary info dict which may be used to output debugging information.
The agent’s action must match the spec in action_space and the new state returned must match the spec in observation_space.
render()¶
Optional to implement; ignore it if you don’t care about seeing the environment on the screen. If you do want to implement it, a good practice is to create a new window the first time render is called, drawing to it on every subsequent call and then closing the window from within close().
close()¶
Shut down the environment, freeing any resources (e.g. rendering windows).
observation_space, action_space¶
These attributes must be set to instances of gym.Space. They provide a specification of the allowable actions and environment states, which agents may use to define their internal models.
All states returned by the environment must match the spec in observation_space.
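As a sketch of what a custom environment might look like, here is a minimal, hypothetical countdown environment (the class and its behaviour are made up for illustration, and the import path is assumed from the q2.environments.Environment name above):
import numpy as np
from gym.spaces import Box, Discrete

from q2.environments import Environment

class CountDown(Environment):
    def __init__(self):
        # Observation: a single counter value; actions: 0 or 1.
        self.observation_space = Box(low=0.0, high=10.0, shape=(1,), dtype=np.float32)
        self.action_space = Discrete(2)
        self.counter = 10.0

    def reset(self):
        self.counter = 10.0
        return np.array([self.counter], dtype=np.float32)

    def step(self, action):
        # Reward the agent for choosing action 1; end when the counter runs out.
        self.counter -= 1.0
        reward = 1.0 if action == 1 else 0.0
        done = self.counter <= 0.0
        return np.array([self.counter], dtype=np.float32), reward, done, {}

    def render(self):
        print("counter:", self.counter)

    def close(self):
        pass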
Footnotes
[1] See https://github.com/openai/retro#roms for details.
Objectives¶
Objectives are pipes that take in the reward signal generated by the environment and transform it into an objective for the agent to optimize toward. This enables more complex training behaviour, but is not always necessary. q2 comes with a Passthrough objective that simply passes the reward that the environment yielded directly to the agent. Passthrough is used by default unless you override it with your own.
Run:
q2 generate objective my_objective
to start working on your own objective. You need to implement the following interface:
class q2.objectives.Objective¶
An abstract base class (interface) that specifies what an objective must implement. You will fill in your own definitions for each of these methods.
reset()¶
This is called by the training regimen before each episode. It gives you the opportunity to wipe any accumulated state.
step(state, action, reward) → float¶
Accepts a state and reward from the environment and an action from the agent, and returns the new objective for the agent to learn.
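For example, here is a minimal, hypothetical objective that simply clips the environment’s reward to the range [-1, 1] before it reaches the agent (the import path is assumed from the q2.objectives.Objective name above):
from q2.objectives import Objective

class ClippedReward(Objective):
    def reset(self):
        # This objective keeps no state between episodes, so there is nothing to wipe.
        pass

    def step(self, state, action, reward) -> float:
        # Clip the raw environment reward to [-1, 1].
        return float(max(-1.0, min(1.0, reward)))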
Regimens¶
Regimens are training algorithms. At its core, a training algorithm is responsible for setting up the environment and the agent and successively calling step on each one, while passing actions and states back and forth as appropriate. q2’s Regimen class implements this basic functionality and provides hooks for you to customize any other behaviour as desired.
Run:
q2 generate regimen my_regimen
to generate a new regimen from template, then fill in the implementations of whichever event hooks you need.
class q2.regimens.Regimen¶
before_training()¶
after_training()¶
Called once at the start and end of training respectively.
before_epoch(epoch:int)¶
after_epoch(epoch:int)¶
Called before and after each epoch.
before_episode(episode:int)¶
after_episode(episode:int)¶
Called before and after each episode.
before_step(step)¶
after_step(step)¶
Called before and after each step of the environment and agent.
on_error(step, exception)¶
Called when an exception occurs. If this method returns True then propagation of the exception is stopped, which can be useful when certain exceptions are expected to occur.
plugins() → List[Plugin]¶
A list of plugins to be used by your regimen.
log(msg:str)¶
Add a message to the logging output for the current timestep. This method is implemented by q2 and provided as a convenience; you should not override it with your own implementation.
agent: Agent
env: Environment
sess: tf.Session
The TensorFlow session.
objective: Objective
action_space¶
The action_space of the environment.
observation_space¶
The observation_space of the environment.
agent_constructor: Type[Agent]
Callable that constructs a new agent.
env_maker: Maker
Callable that creates a new environment.
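As a sketch, a custom regimen that only adds some extra logging might look like this (the class is hypothetical and the import path is assumed from the q2.regimens.Regimen name above; any hooks you don’t override keep the default behaviour):
from q2.regimens import Regimen

class Chatty(Regimen):
    def before_episode(self, episode:int):
        # Announce each episode in the training log.
        self.log("Starting episode {}".format(episode))

    def after_episode(self, episode:int):
        self.log("Finished episode {}".format(episode))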
Plugins¶
When implementing your own regimens, you might find that you want to re-use the same morsels of useful behaviour in multiple different regimens. You can achieve this by implementing a Plugin. For example, q2 comes with a DisplayFramerate plugin that lets any regimen display a nice framerate message in the logs without polluting the core logic of the regimen.
The interface of a plugin is identical to that of a Regimen, except that each method takes a Regimen as its first argument, which it can inspect, interact with or modify. Note that the plugin event hooks are always called before the regimen’s hooks, so the regimen always has final say over any state before the next step is run. You should write your plugins to account for the possibility that the regimen itself might change some state before the next event happens.
You can use a plugin by adding it to the list returned by your regimen’s plugins() method.
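For example, a hypothetical plugin that counts completed episodes might look like the following (the Plugin import path and base class are assumptions based on the plugins() → List[Plugin] signature above):
from q2.regimens import Plugin

class EpisodeCounter(Plugin):
    def before_training(self, regimen):
        self.count = 0

    def after_episode(self, regimen, episode:int):
        # Plugins receive the regimen as their first argument, so they can
        # use its logging facilities.
        self.count += 1
        regimen.log("Episodes completed: {}".format(self.count))
To enable it, a regimen would return an instance from its plugins() method, e.g. return [EpisodeCounter()].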
Project goals¶
Reinforcement learning research is surprisingly hard to reproduce, especially if you want to use a new technique within a larger system. It is also difficult to iterate on, because every researcher has their own bespoke setup, which often means a handful of messy scripts full of tightly-coupled methods and a ton of unhandled corner cases. I dreamed of cloning another researcher’s project and immediately feeling confident enough to start hacking on their algorithms, so I created q2 to make that possible.
In the recent past, the field of web development faced a similar problem, with every developer having their own tools and techniques that they copy from project to project, making it hard for people to collaborate and learn each other’s ideas. Their problem has largely been solved thanks to tools and frameworks like Ruby on Rails, AngularJS and React/create-react-app, many of which drew inspiration from manifestos like the Twelve Factor App. These tools encouraged greater standardization and common patterns across projects, solidifying good practices and reducing cognitive load.
My hypothesis is that machine learning and especially reinforcement learning face a similar problem. Fortunately, there are common use cases and patterns that we can optimize. The basic workflow that q2 targets is as follows:
- Clone a project or spin up a new one with q2 init and q2 generate.
- Build, tinker, iterate!
- Run a training session with q2 train ...
The goal of q2 is to make steps 1 and 3 completely seamless so that you can spend as much time as possible in step 2, confident that with q2 supporting you, all of your effort can be focused on the learning task at hand.
Planned and upcoming features¶
- Better integration with Keras, PyTorch and other Tensorflow API wrappers
- Web interface to your agent zoo
- Improved deployment story: q2 should support common deployment patterns
- Show better training metrics, explore integrating with Tensorboard
Bugs and feature requests¶
If you find something wrong, please create an issue on the GitHub page.
About the author¶
I’m a software engineer and I hack on RL projects in my spare time. My personal website and blog can be found at tdb-alcorn.github.io.