Welcome to DeeR’s documentation!

DeeR (Deep Reinforcement) is a Python library for training an agent to behave in a given environment so as to maximize a cumulative sum of rewards (see What is deep reinforcement learning?).

Here are key advantages of the library:

  • You have access within a single library to techniques such as double Q-learning, prioritized experience replay, deep deterministic policy gradient (DDPG), etc.
  • The package provides a general framework in which observations can be made up of any number of elements: scalars, vectors and frames. The belief state on which the agent builds the Q-function or the policy can use a history of any length for each element provided in the observation.
  • You can easily add a validation phase that allows you to stop the training process before overfitting. This is useful when the environment depends on scarce data (e.g. limited time series).

In addition, the framework is designed so that it is easy to

  • build any environment
  • modify any part of the learning process
  • use your favorite Python-based framework to code your own neural network architecture. The provided neural network architectures are based on Keras (or pure Theano), but you may easily use others.

It is a work in progress and input is welcome. Please submit any contribution via pull request.

What is new

Version 0.3

  • Integration of different exploration/exploitation policies and the possibility to easily build your own (see Policies)
  • Integration of DDPG for continuous action spaces (see Actor-critic learning)
  • The naming convention for this project and some interfaces have been updated. This may break backward compatibility; in that case, adapt your code to the new convention by looking at the API in this documentation or at the current version of the examples.
  • Additional automated tests

Version 0.2

  • Standalone Python package (you can simply do pip install deer)
  • Integration of new example environments: the pendulum on a cart, the PLE environment and the Gym environment
  • Double Q-learning and prioritized experience replay
  • Augmented documentation
  • First automated tests

Future extensions:

  • Add planning (e.g. MCTS-based planning for deterministic environments)
  • Several agents interacting in the same environment
  • ...

User Guide

Installation

Dependencies

This framework is tested to work under Python 2.7 and Python 3.5. It should also work with Python 3.3 and 3.4.

The required dependencies are NumPy >= 1.10 and joblib >= 0.9. You also need Theano >= 0.7 (Lasagne is optional), or you can write your own neural network using your favorite framework.

Matplotlib >= 1.1.1 is required for running some of the examples. Some examples also have specific dependencies (e.g. the Atari games require ALE >= 0.4).

User install instructions

You can install the framework with pip:

pip install deer

For the bleeding-edge version (recommended), you can simply use

pip install git+git://github.com/VinF/deer.git@master

Developer install instructions

As a developer, you can set yourself up with the bleeding-edge version of DeeR with:

git clone -b master https://github.com/VinF/deer.git

Assuming you already have a Python environment with pip, you can automatically install all the dependencies (except the specific dependencies you may need for some examples) with:

pip install -r requirements.txt

You can then install the framework as a package in develop mode, so that you can make modifications and test them without having to re-install the package:

python setup.py develop

Tutorial

What is deep reinforcement learning?

Deep reinforcement learning is the combination of two fields:

  • Reinforcement learning (RL) is the theory that allows an agent to learn a strategy so as to maximize a sum of cumulated (possibly delayed) rewards in any given environment. If you are not familiar with RL, you can get up to speed easily with the book by Sutton and Barto.
  • Deep learning is a branch of machine learning for regression and classification. It is particularly well suited to modeling high-level abstractions in data by using multiple processing layers composed of multiple non-linear transformations.

This combination makes it possible to learn complex tasks such as playing Atari games from high-dimensional sensory inputs. For more information, you can refer to one of the main papers in the domain: "Human-level control through deep reinforcement learning".

How can I get started?

First, make sure you have installed the package properly by following the steps described in Installation.

The general idea of this framework is that you need to instantiate an agent (along with a Q-network) and an environment. In order to perform an experiment, you also need to attach controllers to the agent, which control the training and the various parameters of your agent.

The environment has to be built specifically for each task, while Q-networks, the DQN agent and many controllers are provided within this package.
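
Schematically, an experiment therefore looks like the following sketch, where MyEnv stands for whatever Environment subclass you define (the Toy environment example below gives a complete, runnable version of this workflow):

import numpy as np

import deer.experiment.base_controllers as bc
from deer.agent import NeuralAgent
from deer.q_networks.q_net_theano import MyQNetwork

rng = np.random.RandomState(123456)

env = MyEnv(rng)                                    # your own Environment subclass (placeholder here)
qnetwork = MyQNetwork(environment=env, random_state=rng)
agent = NeuralAgent(env, qnetwork, random_state=rng)

agent.attach(bc.TrainerController())                # train the agent after every action it takes
agent.run(n_epochs=10, epoch_length=1000)           # run the experiment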

The best way to get started is to have a look at the Examples, in particular the first two environments, which are simple to understand.

If you find something that is not yet implemented and if you wish to contribute, you can check the section Development.

Any Question?

You can raise questions about the DeeR project on GitHub: https://github.com/VinF/deer/issues

Examples

You can find these examples at the root of the package. For each example, at least two files are provided:

  • A launcher file (whose name usually starts with run_).
  • An environment file (whose name usually ends with _env).

The launcher file performs several actions:

  • It instantiates the environment and the agent (along with a Q-network).
  • It binds controllers to the agent.
  • It finally runs the experiment.

Examples are better than precepts, and the best way to get started is to go through the following examples (the simplest ones are listed first).

Toy environment with time series

Description of the environment

This environment simulates the possibility of buying or selling a good. The agent can hold either one unit or zero units of that good. At each transaction with the market, the agent obtains a reward equal to the price of the good when selling it, and the opposite when buying. In addition, a penalty of 0.5 (negative reward) is added for each transaction.

The price pattern is made by repeating the following signal, plus a random constant between 0 and 3:

Toy example price pattern

Let's now see how this environment is built by looking at the file Toy_env.py in examples/toy_env/. It is important to note that any environment derives from the base class Environment, and you can refer to it in order to understand the required methods and their usage.
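
As a hedged sketch of what such a subclass looks like (the method names are those of the Environment interface documented in the API reference below, but the shapes and dynamics used here are illustrative placeholders, not the actual content of Toy_env.py):

import numpy as np

from deer.base_classes import Environment

class MyToyLikeEnv(Environment):
    """Illustrative skeleton only; see Toy_env.py for the real implementation."""

    def __init__(self, rng):
        self._rng = rng
        self._last_observation = 0.

    def reset(self, mode):
        # Called at the start of every episode; [mode] discriminates training (-1)
        # from validation/test modes.
        self._last_observation = 0.

    def act(self, action):
        # Apply [action], update the internal state and return the reward.
        self._last_observation = self._rng.uniform(0., 3.)
        return 1. if action == 1 else -1.

    def inputDimensions(self):
        # A single scalar observation, observed with a history of 6 time steps.
        return [(6,)]

    def nActions(self):
        return 2

    def observe(self):
        # Only the latest punctual observation is returned; the framework keeps the history.
        return [np.array(self._last_observation)]

    def inTerminalState(self):
        return False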

How to run

A minimalist way of running this example can be found in the file run_toy_env_simple.py in examples/toy_env/.

  • First, we need to import the agent, the Q-network, the environment and some controllers:
import numpy as np

from deer.agent import NeuralAgent
from deer.q_networks.q_net_theano import MyQNetwork
from Toy_env import MyEnv as Toy_env
import deer.experiment.base_controllers as bc
  • Then we instantiate the different elements as follows:
if __name__ == "__main__":

    rng = np.random.RandomState(123456)
    
    # --- Instantiate environment ---
    env = Toy_env(rng)

    # --- Instantiate qnetwork ---
    qnetwork = MyQNetwork(
        environment=env,
        random_state=rng)
    
    # --- Instantiate agent ---
    agent = NeuralAgent(
        env,
        qnetwork,
        random_state=rng)

    # --- Bind controllers to the agent ---
    # Before every training epoch, we want to print a summary of the agent's epsilon, discount and 
    # learning rate as well as the training epoch number.
    agent.attach(bc.VerboseController())

    # During training epochs, we want to train the agent after every action it takes.
    # In addition, after each training episode (as opposed to after every training step), we want to display
    # the average Bellman residual and the average of the V values obtained during the last episode.
    agent.attach(bc.TrainerController())

    # All previous controllers control the agent during the epochs it goes through. However, we want to interleave a
    # "test epoch" between each training epoch. We do not want these test epochs to interfere with the training of the
    # agent. Therefore, we will disable these controllers for the whole duration of the test epochs interleaved this
    # way, using the controllers_to_disable argument of the InterleavedTestEpochController. The value of this argument
    # is a list of the indexes of all controllers to disable, their index reflecting the order in which they were attached.
    agent.attach(bc.InterleavedTestEpochController(
        epoch_length=500, 
        controllers_to_disable=[0, 1]))
        
    # --- Run the experiment ---
    agent.run(n_epochs=100, epoch_length=1000)

Results

Navigate to the folder examples/toy_env/ in a terminal window. The example can then be run by using

python run_toy_env_simple.py

You can also use the full version of the launcher, which specifies the hyperparameters for better performance.

python run_toy_env.py

Every 10 epochs, a graph is saved in the 'toy_env' folder. At the end of training, you should observe this kind of behaviour for the test policy:

Toy example policy

In this graph, you can see that the agent has successfully learned to take advantage of the price pattern by buying when the price is low and selling when it is high. This example is of course easy because the pattern is very systematic, which allows the agent to learn it reliably. Note that the results shown are obtained on a validation set different from the training set, and we can see that learning generalizes well. For instance, buying at time steps 7 and 16 is the expected behaviour because, on average, it yields a profit given that the agent has no information on the future.

Using convolutions vs. LSTMs

So far, the neural network was built using a convolutional architecture, as follows:

Convolutions architecture

The neural network processes the time series with a set of convolutional layers. The outputs of the convolutions, as well as the other inputs, are followed by fully connected layers and the output layer.

When working with deep reinforcement learning, it is also possible to use LSTMs (see for instance this introduction to LSTMs).

If you want to use an LSTM architecture, you can import the following:

from deer.q_networks.q_net_keras import MyQNetwork
from deer.q_networks.NN_keras_LSTM import NN as NN_keras

and then instantiate the Q-network by specifying the neural_network argument as follows:

qnetwork = MyQNetwork(
    env,
    neural_network=NN_keras)

The pendulum on a cart

Description

This environment simulates the behavior of an inverted pendulum. The theoretical system and its equations are as described in Barto et al. (1983):

  • A cart of mass \(M\) that can move horizontally;
  • A pole of mass \(m\) and length \(l\) attached to the cart, with \(\theta \in [-\pi, 0]\) for the left-hand plane and \(\theta \in [0, \pi]\) for the right-hand plane. We assume that the cart moves on a rail and that the pole can swing below it.
Inverted Pendulum

The goal of the agent is to balance the pole above its supporting cart (\(\theta=0\)) by displacing the cart left or right; thus, two actions are possible. To do so, the environment communicates the following to the agent (a hedged sketch of the corresponding interface follows this list):

  • A vector (position, speed, angle, angular speed);
  • The reward associated with the action chosen by the agent.
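
As a hedged illustration of how this maps onto the Environment interface (the exact shapes are defined in the example's environment file and are not reproduced here):

def inputDimensions(self):
    # One 4-dimensional observation (position, speed, angle, angular speed),
    # with a history of length 1 -- illustrative shapes only.
    return [(1, 4)]

def nActions(self):
    # Two actions: push the cart left or right.
    return 2
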
Results

In a terminal window, go to the folder examples/pendulum. The example can then be run with

python run_pendulum.py

Here are the outputs of the agent after 20 and 70 learning epochs respectively, with 1000 steps in each. We can clearly see that the agent ultimately succeeds in controlling the inverted pendulum.

Note: an MP4 video is generated every PERIOD_BTW_SUMMARY_PERFS epochs, and you need the FFmpeg library (https://www.ffmpeg.org/) to do so. If you do not want to install this library or generate the videos, simply set PERIOD_BTW_SUMMARY_PERFS = -1.

http://vincent.francois-l.be/img_GeneralDeepQRL/output2.gif http://vincent.francois-l.be/img_GeneralDeepQRL/output7.gif

Details on the implementation

The main task in this environment is to implement act(self, action), which specifies how the cart-pole system behaves in response to an input action. First, we transcribe the physical laws that govern the motion of the pole and the cart. The simulation time step of the agent is \(\Delta_t=0.02\) seconds, but we discretize this value even further in act(self, action) in order to obtain dynamics that are closer to the exact differential equations. Second, we choose the reward function as the sum of the following terms (a sketch of this reward computation follows the list):

  • \(- |\theta|\) such that the agent receives 0 when the pole is standing up, and a negative reward proportional to the angle otherwise.
  • \(- \frac{|x|}{2}\) such that the agent receives a negative reward when it is far from \(x=0\).
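
A minimal sketch of this reward computation, assuming theta and x hold the current pole angle and cart position (the actual code lives in the example's environment file):

import numpy as np

def _reward(theta, x):
    # 0 when the pole is upright (theta = 0) and the cart is centered (x = 0),
    # increasingly negative otherwise.
    return -np.abs(theta) - np.abs(x) / 2.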

Gym environment

Some examples are also provided with the Gym environment.

Here is the resulting policy for the mountain car example:

http://vincent.francois-l.be/img_GeneralDeepQRL/gym_mountain_car.gif

Here is the resulting policy for the pendulum example:

http://vincent.francois-l.be/img_GeneralDeepQRL/gym_pendulum.gif

Two storage devices environment

Description of the environment

This example simulates the operation of a realistic microgrid (such as a smart home, for instance) that is not connected to the main utility grid (off-grid) and that is equipped with PV panels, a battery and hydrogen storage. The battery has the advantage that it is not limited in the instantaneous power it can provide or store. The hydrogen storage has the advantage that it can store a very large quantity of energy.

python run_MG_two_storage_devices

This example uses the environment defined in MG_two_storage_devices_env.py. The agent can choose either to store energy in the long-term storage or to take energy out of it, while the short-term storage handles the remaining lack or surplus of energy as best it can by discharging or charging itself, respectively. Whenever the short-term storage is empty and cannot meet the net demand, a penalty (negative reward) is incurred, equal to the value of loss load, set to 2 euro/kWh. A hedged sketch of these dynamics is given after the list of actions below.

The state of the agent is made up of a history of two to four punctual observations:

  • Charging state of the short term storage (0 is empty, 1 is full)
  • Production and consumption (0 is no production or consumption, 1 is maximal production or consumption)
  • (Distance to equinox)
  • (Predictions of future production: average production over the next 24 hours and 48 hours)

Two actions are possible for the agent:

  • Action 0 corresponds to discharging the long-term storage
  • Action 1 corresponds to charging the long-term storage
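
As a hedged sketch of these dynamics (the helper names net_demand, short_term_storage and long_term_storage are hypothetical; the real logic is in MG_two_storage_devices_env.py):

LOSS_LOAD_COST = 2.  # euro per kWh not supplied within the microgrid

def reward_sketch(action, net_demand, short_term_storage, long_term_storage):
    # Action 0 discharges the long-term (hydrogen) storage, action 1 charges it.
    if action == 0:
        net_demand -= long_term_storage.discharge()
    else:
        net_demand += long_term_storage.charge()
    # The short-term storage (battery) then absorbs the remaining lack or surplus of energy.
    unmet_demand = short_term_storage.balance(net_demand)
    # Any energy that could not be supplied is penalized at the value of loss load.
    return -LOSS_LOAD_COST * unmet_demand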

More information can be found in the paper (to be published): Deep Reinforcement Learning Solutions for Energy Microgrids Management, Vincent François-Lavet, David Taralla, Damien Ernst, Raphael Fonteneau.

Annex to the paper
PV production and consumption profiles

Solar irradiance varies throughout the year depending on the seasons, and it also varies throughout the day depending on the weather and the position of the sun in the sky relative to the PV panels. The main distinction between these profiles is the difference between summer and winter PV production. In particular, production varies by a factor of about 1:5 between winter and summer, as can be seen from the measurements of PV panel production for a residential customer located in Belgium in the figures below.

http://vincent.francois-l.be/img_GeneralDeepQRL/ProductionVSMonths_be.png

Total energy produced per month

http://vincent.francois-l.be/img_GeneralDeepQRL/ProductionVSTime_1janv_be.png

Typical production in winter

http://vincent.francois-l.be/img_GeneralDeepQRL/ProductionVSTime_1july_be.png

Typical production in summer

A simple residential consumption profile is considered, with an average daily consumption of 18 kWh (see the figure below).

http://vincent.francois-l.be/img_GeneralDeepQRL/ConsumptionVSTime_random.png

Representative residential consumption profile

Main microgrid parameters

Data used for the PV panels:
  • Cost \(c^{PV}\): \(1\ euro/W_p\)
  • Efficiency \(\eta^{PV}\): \(18\%\)
  • Lifetime \(L^{PV}\): 20 years

Data used for the \(LiFePO_4\) battery:
  • Cost \(c^B\): \(500\ euro/kWh\)
  • Discharge efficiency \(\eta_0^B\): \(90\%\)
  • Charge efficiency \(\zeta_0^B\): \(90\%\)
  • Maximum instantaneous power \(P^B\): \(> 10\ kW\)
  • Lifetime \(L^{B}\): 20 years

Data used for the hydrogen storage device:
  • Cost \(c^{H_2}\): \(14\ euro/W_p\)
  • Discharge efficiency \(\eta_0^{H_2}\): \(65\%\)
  • Charge efficiency \(\zeta_0^{H_2}\): \(65\%\)
  • Lifetime \(L^{H_2}\): 20 years

Data used for the reward function:
  • Cost endured per kWh not supplied within the microgrid \(k\): \(2\ euro/kWh\)
  • Revenue/cost per kWh of hydrogen produced/used \(k^{H_2}\): \(0.1\ euro/kWh\)

ALE environment

This environment is an interface to the ALE (Arcade Learning Environment), which simulates any Atari game. The hyperparameters used in the provided example aim to reproduce as closely as possible those of the following paper: Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

Some changes are still necessary to obtain the same performance.

PLE environment

This environment is an interface to the PLE environment (PyGame Learning Environment).

Development

DeeR is a work in progress and contributions are welcome via pull request.

For more information, you can check out this guide: Contributing to an open source project on GitHub.

You should also make sure that you install the repository appropriately for development (see Developer install instructions).

Guidelines for this project

Here are a few guidelines for this project.

  • Simplicity: Be easy to use but also easy to understand when one digs into the code. Any additional code should be justified by the usefulness of the feature.
  • Modularity: The user should be able to easily use their own code with any part of the deer framework (with the possible exception of the core of agent.py, which is coded in a very general way).

These guidelines come of course in addition to all good practices for open source development.

Naming convention for this project

  • All classes and methods mark word boundaries with medial capitalization: classes are written in UpperCamelCase and methods in lowerCamelCase. Example: "two words" is rendered as "TwoWords" for classes and "twoWords" for methods (a short illustration follows this list).
  • All attributes and variables have words separated by underscores. Example: "two words" is rendered as "two_words".
  • If a variable is intended to be ‘private’, it is prefixed by an underscore.
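
For instance (an illustrative snippet, not actual library code):

class ReplayBuffer(object):          # class name: UpperCamelCase
    def __init__(self, max_size):
        self._max_size = max_size    # 'private' attribute: underscore prefix
        self.current_size = 0        # attribute: words separated by underscores

    def addSample(self, sample):     # method name: lowerCamelCase
        self.current_size = min(self.current_size + 1, self._max_size)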

API reference

If you are looking for information on a specific function, class or method, this API is for you.

Agent

This module contains classes used to define the standard behavior of the agent. It relies on the controllers, the chosen training/test policy and the learning algorithm to specify its behavior in the environment.

NeuralAgent(environment, q_network[, ...]) The NeuralAgent class wraps a deep Q-network for training and testing in a given environment.
DataSet(env[, random_state, max_size, ...]) A replay memory consisting of circular buffers for observations, actions, rewards and terminals.

Detailed description

class deer.agent.NeuralAgent(environment, q_network, replay_memory_size=1000000, replay_start_size=None, batch_size=32, random_state=<mtrand.RandomState object>, exp_priority=0, train_policy=None, test_policy=None, only_full_history=True)

The NeuralAgent class wraps a deep Q-network for training and testing in a given environment.

Attach controllers to it in order to conduct an experiment (when to train the agent, when to test,...).

Parameters:

environment : object from class Environment

The environment in which the agent interacts

q_network : object from class QNetwork

The q_network associated to the agent

replay_memory_size : int

Size of the replay memory. Default : 1000000

replay_start_size : int

Number of observations (=number of time steps taken) in the replay memory before starting learning. Default: minimum possible according to environment.inputDimensions().

batch_size : int

Number of tuples taken into account for each iteration of gradient descent. Default : 32

random_state : numpy random number generator

Default : random seed.

exp_priority : float

The exponent that determines how much prioritization is used, default is 0 (uniform priority). One may check out Schaul et al. (2016) - Prioritized Experience Replay.

train_policy : object from class Policy

Policy followed when in training mode (mode -1)

test_policy : object from class Policy

Policy followed when in other modes than training (validation and test modes)

only_full_history : boolean

Whether to train the neural network only on full histories, or to pad with zeros the observations preceding the beginning of the episode
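
As an illustration, a hedged instantiation overriding some of these defaults (env, qnetwork and rng are assumed to have been created beforehand, as in the Toy environment example):

from deer.agent import NeuralAgent
from deer.policies import EpsilonGreedyPolicy

agent = NeuralAgent(
    env,
    qnetwork,
    replay_memory_size=100000,     # smaller replay memory than the default
    batch_size=32,
    exp_priority=1.,               # prioritized experience replay
    train_policy=EpsilonGreedyPolicy(qnetwork, env.nActions(), rng, 0.1),
    random_state=rng)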

Methods

attach(controller)
avgBellmanResidual() Returns the average training loss on the epoch
avgEpisodeVValue() Returns the average V value on the episode (on time steps where a non-random action has been taken)
bestAction() Returns the best Action
detach(controllerIdx)
discountFactor() Get the discount factor
dumpNetwork(fname[, nEpoch]) Dump the network
learningRate() Get the learning rate
mode()
overrideNextAction(action) Possibility to override the chosen action.
resumeTrainingMode()
run(n_epochs, epoch_length) This function encapsulates the whole process of the learning.
setControllersActive(toDisable, active) Activate controller
setDiscountFactor(df) Set the discount factor
setLearningRate(lr) Set the learning rate for the gradient descent
setNetwork(fname[, nEpoch]) Set values into the network
startMode(mode, epochLength)
summarizeTestPerformance()
totalRewardOverLastTest() Returns the average sum of rewards per episode and the number of episodes
train() This function selects a random batch of data (with self._dataset.randomBatch) and performs a Q-learning iteration (with self._network.train).
avgBellmanResidual()

Returns the average training loss on the epoch

avgEpisodeVValue()

Returns the average V value on the episode (on time steps where a non-random action has been taken)

bestAction()

Returns the best Action

discountFactor()

Get the discount factor

dumpNetwork(fname, nEpoch=-1)

Dump the network

Parameters:

fname : string

Name of the file where the network will be dumped

nEpoch : int

Epoch number (Optional)

learningRate()

Get the learning rate

overrideNextAction(action)

Possibility to override the chosen action. This should be used on the OnActionChosen signal.

run(n_epochs, epoch_length)

This function encapsulates the whole learning process. It starts by calling the controllers' "onStart" method, then runs a given number of epochs, where an epoch is made up of one or many episodes (run with agent._runEpisode) and ends once the number of steps reaches the argument "epoch_length". It finishes by calling the controllers' "end" method.

Parameters:

n_epochs : int

Number of epochs

epoch_length : int

Maximum number of steps for a given epoch

setControllersActive(toDisable, active)

Activate controller

setDiscountFactor(df)

Set the discount factor

setLearningRate(lr)

Set the learning rate for the gradient descent

setNetwork(fname, nEpoch=-1)

Set values into the network

Parameters:

fname : string

Name of the file where the values are

nEpoch : int

Epoch number (Optional)

totalRewardOverLastTest()

Returns the average sum of rewards per episode and the number of episodes

train()

This function selects a random batch of data (with self._dataset.randomBatch) and performs a Q-learning iteration (with self._network.train).

class deer.agent.DataSet(env, random_state=None, max_size=1000, use_priority=False, only_full_history=True)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.

Methods

actions() Get all actions currently in the replay memory, ordered by the time at which they were taken.
addSample(obs, action, reward, is_terminal, ...) Store a (observation[for all subjects], action, reward, is_terminal) tuple in the dataset.
observations() Get all observations currently in the replay memory, ordered by the time at which they were observed.
randomBatch(size, use_priority) Return corresponding states, actions, rewards, terminal status, and next_states for size randomly chosen transitions.
rewards() Get all rewards currently in the replay memory, ordered by the time at which they were received.
terminals() Get all terminals currently in the replay memory, ordered by the time at which they were observed.
updatePriorities(priorities, rndValidIndices)
actions()

Get all actions currently in the replay memory, ordered by the time at which they were taken.

addSample(obs, action, reward, is_terminal, priority)

Store a (observation[for all subjects], action, reward, is_terminal) tuple in the dataset.

Parameters:

obs : ndarray

An ndarray(dtype='object') where obs[s] corresponds to the observation made on subject s before the agent took action [action].

action : int

The action taken after having observed [obs].

reward : float

The reward associated with taking this [action].

is_terminal : bool

Tells whether [action] led to a terminal state (i.e. corresponded to a terminal transition).

priority : float

The priority to be associated with the sample

observations()

Get all observations currently in the replay memory, ordered by the time at which they were observed.

observations[s][i] corresponds to the observation made on subject s before the agent took actions()[i].

randomBatch(size, use_priority)

Return corresponding states, actions, rewards, terminal status, and next_states for size randomly chosen transitions. Note that if terminal[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each subject s.

Parameters:

size : int

Number of transitions to return.

Returns:

states : ndarray

An ndarray(size=number_of_subjects, dtype='object'), where states[s] is a 2+D matrix of dimensions size x s.memorySize x "shape of a given observation for this subject". States are drawn randomly from the data, with the only constraint that their histories are complete for each observed subject.

actions : ndarray

An ndarray(size=number_of_subjects, dtype=’int32’) where actions[i] is the action taken after having observed states[:][i].

rewards : ndarray

An ndarray(size=number_of_subjects, dtype=’float32’) where rewards[i] is the reward obtained for taking actions[i-1].

next_states : ndarray

Same structure as states, but next_states[s][i] is guaranteed to be the information concerning the state following the one described by states[s][i] for each subject s.

terminals : ndarray

An ndarray(size=number_of_subjects, dtype='bool') where terminals[i] is True if actions[i] led to a terminal state and False otherwise

Throws:

SliceError

If a batch of this size could not be built based on current data set (not enough data or all trajectories are too short).

rewards()

Get all rewards currently in the replay memory, ordered by the time at which they were received.

terminals()

Get all terminals currently in the replay memory, ordered by the time at which they were observed.

terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.

updatePriorities(priorities, rndValidIndices)

Controller

This module defines the base Controller class and some preset controllers that you can use to control the training and the various parameters of your agents.

Controllers can be attached to an agent using the agent’s attach(Controller) method. The order in which controllers are attached matters. Indeed, if controllers C1, C2 and C3 were attached in this order and C1 and C3 both listen to the onEpisodeEnd signal, the onEpisodeEnd() method of C1 will be called before the onEpisodeEnd() method of C3, whenever an episode ends.
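
For example, the snippet below (a hedged sketch; agent is assumed to be an already-created NeuralAgent) attaches several controllers whose attachment order also defines the indexes expected by the controllers_to_disable argument of InterleavedTestEpochController:

import deer.experiment.base_controllers as bc

agent.attach(bc.VerboseController())              # index 0
agent.attach(bc.TrainerController())              # index 1
agent.attach(bc.EpsilonController())              # index 2
# Disable the controllers attached at indexes 0, 1 and 2 during the test epochs.
agent.attach(bc.InterleavedTestEpochController(
    epoch_length=500,
    controllers_to_disable=[0, 1, 2]))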

Controller() A base controller that does nothing when receiving the various signals emitted by an agent.
LearningRateController([...]) A controller that modifies the learning rate periodically upon epochs end.
EpsilonController([initial_e, e_decays, ...]) A controller that modifies the probability “epsilon” of taking a random action periodically.
DiscountFactorController([...]) A controller that modifies the q-network discount periodically.
TrainerController([evaluate_on, ...]) A controller that makes the agent train on its current database periodically.
InterleavedTestEpochController([id, ...]) A controller that interleaves a test epoch between training epochs of the agent.
FindBestController([validationID, testID, ...]) A controller that finds the neural net performing best in validation mode (i.e.

Detailed description

class deer.experiment.base_controllers.Controller

A base controller that does nothing when receiving the various signals emitted by an agent. This class should be the base class of any controller you would want to define.

Methods

onActionChosen(agent, action) Called whenever the agent has chosen an action.
onActionTaken(agent) Called whenever the agent has taken an action on its environment.
onEnd(agent) Called when the agent has finished processing all its epochs, just before returning from its run() method.
onEpisodeEnd(agent, terminal_reached, reward) Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.
onEpochEnd(agent) Called whenever the agent ends an epoch, just after the last episode of this epoch was ended and after any onEpisodeEnd() signal was processed.
onStart(agent) Called when the agent is going to start working (before anything else).
setActive(active) Activate or deactivate this controller.
onActionChosen(agent, action)

Called whenever the agent has chosen an action.

This occurs after the agent state was updated with the new observation it made, but before it applied this action on the environment and before the total reward is updated.

onActionTaken(agent)

Called whenever the agent has taken an action on its environment.

This occurs after the agent applied this action on the environment and before terminality is evaluated. This is called only once, even in the case where the agent skips frames by taking the same action multiple times. In other words, this occurs just before the next observation of the environment.

onEnd(agent)

Called when the agent has finished processing all its epochs, just before returning from its run() method.

onEpisodeEnd(agent, terminal_reached, reward)

Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.

Parameters:

agent : NeuralAgent

The agent firing the event

terminal_reached : bool

Whether the episode ended because a terminal transition occurred. This could be False if the episode was stopped because its step budget was exhausted.

reward : float

The reward obtained on the last transition performed in this episode.

onEpochEnd(agent)

Called whenever the agent ends an epoch, just after the last episode of this epoch was ended and after any onEpisodeEnd() signal was processed.

Parameters:

agent : NeuralAgent

The agent firing the event

onStart(agent)

Called when the agent is going to start working (before anything else).

This corresponds to the moment where the agent’s run() method is called.

Parameters:

agent : NeuralAgent

The agent firing the event

setActive(active)

Activate or deactivate this controller.

A controller should not react to any signal it receives as long as it is deactivated. For instance, if a controller maintains a counter on how many episodes it has seen, this counter should not be updated when this controller is disabled.

class deer.experiment.base_controllers.LearningRateController(initial_learning_rate=0.005, learning_rate_decay=1.0, periodicity=1)

Bases: deer.experiment.base_controllers.Controller

A controller that modifies the learning rate periodically upon epochs end.

Parameters:

initial_learning_rate : float

The learning rate upon agent start

learning_rate_decay : float

The factor by which the previous learning rate is multiplied every [periodicity] epochs.

periodicity : int

How many epochs are necessary before an update of the learning rate occurs

Methods

onActionChosen(agent, action) Called whenever the agent has chosen an action.
onActionTaken(agent) Called whenever the agent has taken an action on its environment.
onEnd(agent) Called when the agent has finished processing all its epochs, just before returning from its run() method.
onEpisodeEnd(agent, terminal_reached, reward) Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.
onEpochEnd(agent)
onStart(agent)
setActive(active) Activate or deactivate this controller.
class deer.experiment.base_controllers.EpsilonController(initial_e=1.0, e_decays=10000, e_min=0.1, evaluate_on='action', periodicity=1, reset_every='none')

Bases: deer.experiment.base_controllers.Controller

A controller that modifies the probability “epsilon” of taking a random action periodically.

Parameters:

initial_e : float

Start epsilon

e_decays : int

How many updates are necessary for epsilon to reach e_min

e_min : float

End epsilon

evaluate_on : str

After what type of event epsilon should be updated periodically. Possible values: 'action', 'episode', 'epoch'.

periodicity : int

How many [evaluateOn] are necessary before an update of epsilon occurs

reset_every : str

After what type of event epsilon should be reset to its initial value. Possible values: ‘none’, ‘episode’, ‘epoch’.

Methods

onActionChosen(agent, action)
onActionTaken(agent) Called whenever the agent has taken an action on its environment.
onEnd(agent) Called when the agent has finished processing all its epochs, just before returning from its run() method.
onEpisodeEnd(agent, terminal_reached, reward)
onEpochEnd(agent)
onStart(agent)
setActive(active) Activate or deactivate this controller.
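
For example, a hedged snippet attaching an EpsilonController that anneals epsilon from 1.0 down to 0.1 over 10000 actions (agent is assumed to exist):

import deer.experiment.base_controllers as bc

# Decrease epsilon from 1.0 to 0.1 over 10000 actions, then keep it at 0.1.
agent.attach(bc.EpsilonController(
    initial_e=1.0,
    e_decays=10000,
    e_min=0.1,
    evaluate_on='action',
    periodicity=1,
    reset_every='none'))
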
class deer.experiment.base_controllers.DiscountFactorController(initial_discount_factor=0.9, discount_factor_growth=1.0, discount_factor_max=0.99, periodicity=1)

Bases: deer.experiment.base_controllers.Controller

A controller that modifies the q-network discount periodically. More information in: Francois-Lavet Vincent et al. (2015) - How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies (http://arxiv.org/abs/1512.02011).

Parameters:

initial_discount_factor : float

Start discount

discount_factor_growth : float

The factor by which the previous discount is multiplied every [periodicity] epochs.

discount_factor_max : float

Maximum reachable discount

periodicity : int

How many training epochs are necessary before an update of the discount occurs

Methods

onActionChosen(agent, action) Called whenever the agent has chosen an action.
onActionTaken(agent) Called whenever the agent has taken an action on its environment.
onEnd(agent) Called when the agent has finished processing all its epochs, just before returning from its run() method.
onEpisodeEnd(agent, terminal_reached, reward) Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.
onEpochEnd(agent)
onStart(agent)
setActive(active) Activate or deactivate this controller.
class deer.experiment.base_controllers.TrainerController(evaluate_on='action', periodicity=1, show_episode_avg_V_value=True, show_avg_Bellman_residual=True)

Bases: deer.experiment.base_controllers.Controller

A controller that makes the agent train on its current database periodically.

Parameters:

evaluate_on : str

After what type of event the agent should be trained periodically. Possible values: 'action', 'episode', 'epoch'. The first training will occur after the first occurrence of [evaluateOn].

periodicity : int

How many [evaluateOn] are necessary before a training occurs

show_avg_Bellman_residual : bool

Whether to show an informative message after each episode end (and after a training if [evaluateOn] is 'episode') about the average Bellman residual of this episode

show_episode_avg_V_value : bool

Whether to show an informative message after each episode end (and after a training if [evaluateOn] is ‘episode’) about the average V value of this episode

Methods

onActionChosen(agent, action) Called whenever the agent has chosen an action.
onActionTaken(agent)
onEnd(agent) Called when the agent has finished processing all its epochs, just before returning from its run() method.
onEpisodeEnd(agent, terminal_reached, reward)
onEpochEnd(agent)
onStart(agent)
setActive(active) Activate or deactivate this controller.
class deer.experiment.base_controllers.InterleavedTestEpochController(id=0, epoch_length=500, controllers_to_disable=[], periodicity=2, show_score=True, summarize_every=10)

Bases: deer.experiment.base_controllers.Controller

A controller that interleaves a test epoch between training epochs of the agent.

Parameters:

id : int

The identifier (>= 0) of the mode each test epoch triggered by this controller will belong to. Can be used to discriminate between datasets in your Environment subclass (this is the argument that will be given to your environment’s reset() method when starting the test epoch).

epoch_length : float

The total number of transitions that will occur during a test epoch. This means that this epoch could feature several episodes if a terminal transition is reached before this budget is exhausted.

controllers_to_disable : list of int

A list of controllers to disable when this controller wants to start a test epoch. These same controllers will be reactivated after this controller has finished dealing with its test epoch.

periodicity : int

How many epochs are necessary before a test epoch is run (this controller's test epochs included: "1 test epoch every [periodicity] epochs"). Minimum value: 2.

show_score : bool

Whether to print an informative message on stdout at the end of each test epoch, about the total reward obtained in the course of the test epoch.

summarize_every : int

How many of this controller’s test epochs are necessary before the attached agent’s summarizeTestPerformance() method is called. Give a value <= 0 for “never”. If > 0, the first call will occur just after the first test epoch.

Methods

onActionChosen(agent, action) Called whenever the agent has chosen an action.
onActionTaken(agent) Called whenever the agent has taken an action on its environment.
onEnd(agent) Called when the agent has finished processing all its epochs, just before returning from its run() method.
onEpisodeEnd(agent, terminal_reached, reward) Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.
onEpochEnd(agent)
onStart(agent)
setActive(active) Activate or deactivate this controller.
class deer.experiment.base_controllers.FindBestController(validationID=0, testID=None, unique_fname='nnet')

Bases: deer.experiment.base_controllers.Controller

A controller that finds the neural net performing best in validation mode (i.e. for mode = [validationID]) and computes the associated generalization score in test mode (i.e. for mode = [testID], and only if [testID] is not None). This controller should never be disabled by InterleavedTestEpochController instances, as it is meant to work in conjunction with them.

At each epoch end where this controller is active, it will look at the current mode the agent is in.

If the mode matches [validationID], it will take the total reward of the agent on this epoch and compare it to its current best score. If it is better, it will ask the agent to dump its current nnet on disk and update its current best score. In all cases, it saves the validation score obtained in a vector.

If the mode matches [testID], it saves the test (= generalization) score in another vector. Note that if [testID] is None, no test mode scores are ever recorded.

At the end of the experiment (onEnd), if active, this controller will print information about the epoch at which the best neural net was found, together with its generalization score (the latter only if [testID] is not None). Finally, it will dump a dictionary containing the data for the plots ({n: number of epochs elapsed, ts: test scores, vs: validation scores}). Note that if [testID] is None, the value dumped for the 'ts' key is [].

Parameters:

validationID : int

See synopsis

testID : int

See synopsis

unique_fname : str

A unique filename (basename for score and network dumps).

Methods

onActionChosen(agent, action) Called whenever the agent has chosen an action.
onActionTaken(agent) Called whenever the agent has taken an action on its environment.
onEnd(agent)
onEpisodeEnd(agent, terminal_reached, reward) Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.
onEpochEnd(agent)
onStart(agent) Called when the agent is going to start working (before anything else).
setActive(active) Activate or deactivate this controller.

Environment

Detailed description

class deer.base_classes.Environment

All your Environment classes should inherit this interface.

The environment defines the dynamics and the reward signal that the agent observes when interacting with it.

At every time step, the agent obtains from the environment a collection of observable elements. Observing the environment at time t thus corresponds to obtaining a punctual observation for each of these elements. Depending on the control problem to solve, it might be useful for the agent to act not only on the current punctual observations but on a history of the last punctual observations. In this framework, it is the environment that defines how many punctual observations of each element are to be considered.

Different “modes” are used in this framework to allow the environment to have different dynamics and/or reward signal. For instance, in training mode, only a part of the dynamics may be available so that it is possible to see how well the agent generalizes to a slightly different one.

Methods

act(action) Applies the agent action [action] on the environment.
end() Optional hook called at the end of all epochs
inTerminalState() Tells whether the environment reached a terminal state after the last transition (i.e.
inputDimensions() Gets the shape of the input space for this environment.
nActions() Gets the number of different actions that can be taken on this environment.
observationType(subject) Gets the innermost type (np.uint8, np.float32, ...) of [subject].
observe() Gets a list of punctual observations on all subjects composing this environment.
reset(mode) Resets the environment and puts it in mode [mode].
summarizePerformance(test_data_set) Optional hook that can be used to show a summary of the performance of the agent on the environment in the current mode.
act(action)

Applies the agent action [action] on the environment.

Parameters:

action : int

The action selected by the agent to operate on the environment. Should be an identifier between 0 (included) and nActions() (excluded).

end()

Optional hook called at the end of all epochs

inTerminalState()

Tells whether the environment reached a terminal state after the last transition (i.e. the last transition that occurred was terminal).

As the majority of the control tasks considered have no end (continuous control is assumed), by default this always returns False. But in the context of a video game, for instance, terminal states can occur, and in those cases this method should be overridden.

Returns: isTerminal : bool

inputDimensions()

Gets the shape of the input space for this environment.

This returns a list whose length is the number of subjects observed in the environment. Each element of the list is a tuple whose content and size depend on the type of data observed: the first integer is always the history size (or batch size) for observing this subject, and the rest describes the shape of a single observation on this subject:

  • () or (1,) means each observation on this subject is a single number and the history size is 1 (= no history)
  • (N,) means each observation on this subject is a single number and the history size is N
  • (N, M) means each observation on this subject is a vector of length M and the history size is N
  • (N, M1, M2) means each observation on this subject is a matrix with M1 rows and M2 columns and the history size is N
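
As a hedged illustration, an environment observing a stack of 4 greyscale frames of 32x32 pixels, a 10-dimensional sensor vector with a history of 3, and a single scalar with no history could return:

def inputDimensions(self):
    # [frames, sensor vector, scalar] -- illustrative shapes only
    return [(4, 32, 32), (3, 10), (1,)]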

nActions()

Gets the number of different actions that can be taken on this environment. It can be either an integer, in the case of a finite number of discrete actions, or a list of pairs [min_action_value, max_action_value] for a continuous action space.

observationType(subject)

Gets the innermost type (np.uint8, np.float32, ...) of [subject].

Parameters:

subject : int

The subject

observe()

Gets a list of punctual observations on all subjects composing this environment.

This returns a list where element i is a punctual observation on subject i. You will notice that the history of observations on this subject is not returned; only the very last observation. Each element is thus either a number, vector or matrix and not a succession of numbers, vectors and matrices.

See the documentation of inputDimensions() for more information about the shape of the observations according to their mathematical representation (number, vector or matrix).

reset(mode)

Resets the environment and puts it in mode [mode]. This function is called at the beginning of every new episode.

The [mode] can be used to discriminate, for instance, between an agent which is training and one which is being evaluated for a validation or generalization score. The mode the environment is in should only be redefined by resetting the environment with this method, meaning that the mode is preserved until the next call to reset().

Parameters:

mode : int

The mode to put the environment into. Mode “-1” is reserved and always means “training”.
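
As a hedged sketch, an environment could use [mode] to select which slice of its data is played, e.g. training data in mode -1 and held-out data otherwise (the attribute names below are illustrative):

def reset(self, mode):
    # Illustrative only: switch between training and validation data slices.
    if mode == -1:                      # training mode
        self._data = self._train_data
    else:                               # validation / test modes
        self._data = self._validation_data
    self._t = 0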

summarizePerformance(test_data_set)

Optional hook that can be used to show a summary of the performance of the agent on the environment in the current mode.

Parameters:

test_data_set : agent.DataSet

The dataset maintained by the agent in the current mode, which contains observations, actions taken and rewards obtained, as well as whether each transition was terminal or not. Refer to the documentation of agent.DataSet for more information.

Learning algorithms

Q-learning

deer.base_classes.QNetwork(environment, ...) All the Q-networks and actor-critic networks should inherit this interface.
deer.q_networks.q_net_theano.MyQNetwork(...) Deep Q-learning network using Theano
deer.q_networks.q_net_keras.MyQNetwork(...) Deep Q-learning network using Keras (with any backend)

Actor-critic learning

deer.q_networks.AC_net_keras.MyACNetwork(...) Actor-critic network (DDPG) using Keras with Tensorflow backend

Detailed description

class deer.base_classes.QNetwork(environment, batch_size)

All the Q-networks and actor-critic networks should inherit this interface.

Parameters:

environment : object from class Environment

The environment linked to the Q-network

batch_size : int

Number of tuples taken into account for each iteration of gradient descent

Methods

chooseBestAction
discountFactor
learningRate
qValues
setDiscountFactor
setLearningRate
train
chooseBestAction(state)

Get the best action for a belief state

discountFactor()

Getting the discount factor

learningRate()

Getting the learning rate

qValues(state)

Get the q value for one belief state

setDiscountFactor(df)

Setting the discount factor

Parameters:

df : float

The discount factor that has to be set

setLearningRate(lr)

Setting the learning rate

Parameters:

lr : float

The learning rate that has to be set

train(states, actions, rewards, nextStates, terminals)

This method performs the Bellman iteration for one batch of tuples.

class deer.q_networks.q_net_theano.MyQNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_delta=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.q_networks.NN_theano.NN'>)

Bases: deer.base_classes.QNetwork.QNetwork

Deep Q-learning network using Theano

Parameters:

environment : object from class Environment

rho : float

Parameter for rmsprop. Default : 0.9

rms_epsilon : float

Parameter for rmsprop. Default : 0.0001

momentum : float

Not implemented.

clip_delta : float

If > 0, the squared loss is linear past the clip point which keeps the gradient constant. Default : 0

freeze_interval : int

Period during which the target network is frozen and after which the target network is updated. Default : 1000

batch_size : int

Number of tuples taken into account for each iteration of gradient descent. Default : 32

update_rule: str

{sgd,rmsprop}. Default : rmsprop

random_state : numpy random number generator

Default : random seed.

double_Q : bool

Whether to activate double Q-learning (not implemented yet). Default : False. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.

neural_network : object

default is deer.q_networks.NN_theano

Methods

chooseBestAction
discountFactor
getAllParams
learningRate
qValues
setAllParams
setDiscountFactor
setLearningRate
train
chooseBestAction(state)

Get the best action for a belief state

Returns: The best action : int

qValues(state_val)

Get the q values for one belief state

Returns: The q values for the provided belief state

train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train one batch.

  1. Set shared variable in states_shared, next_states_shared, actions_shared, rewards_shared, terminals_shared
  2. perform batch training
Parameters:

states_val : list of batch_size * [list of max_num_elements* [list of k * [element 2D,1D or scalar]])

actions_val : b x 1 numpy array of integers

rewards_val : b x 1 numpy array

next_states_val : list of batch_size * [list of max_num_elements* [list of k * [element 2D,1D or scalar]])

terminals_val : b x 1 numpy boolean array

Returns:

Average loss of the batch training (RMSE)

Individual (square) losses for each tuple

class deer.q_networks.AC_net_keras.MyACNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_delta=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network_critic=<class 'deer.q_networks.NN_keras.NN'>, neural_network_actor=<class 'deer.q_networks.NN_keras.NN'>)

Bases: deer.base_classes.QNetwork.QNetwork

Actor-critic network (DDPG) using Keras with Tensorflow backend

Parameters:

environment : object from class Environment

rho : float

Parameter for rmsprop. Default : 0.9

rms_epsilon : float

Parameter for rmsprop. Default : 0.0001

momentum : float

Default : 0

clip_delta : float

Not implemented.

freeze_interval : int

Period during which the target network is frozen and after which the target network is updated. Default : 1000

batch_size : int

Number of tuples taken into account for each iteration of gradient descent. Default : 32

update_rule: str

{sgd,rmsprop}. Default : rmsprop

batch_accumulator : str

{sum,mean}. Default : sum

random_state : numpy random number generator

double_Q : bool, optional

Whether to activate double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.

neural_network : object, optional

default is deer.q_networks.NN_keras

Methods

chooseBestAction
clip_action
discountFactor
getAllParams
gradients
learningRate
qValues
setAllParams
setDiscountFactor
setLearningRate
train
chooseBestAction(state)

Get the best action for a belief state

Returns:

best_action : float

estim_value : float

train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train one batch.

  1. Set shared variable in states_shared, next_states_shared, actions_shared, rewards_shared, terminals_shared
  2. perform batch training
Parameters:

states_val : list of batch_size * [list of max_num_elements* [list of k * [element 2D,1D or scalar]])

actions_val : b x 1 numpy array of objects (lists of floats)

rewards_val : b x 1 numpy array

next_states_val : list of batch_size * [list of max_num_elements* [list of k * [element 2D,1D or scalar]])

terminals_val : b x 1 numpy boolean array (currently ignored)

Returns:

Average loss of the batch training

Individual losses for each tuple

class deer.q_networks.q_net_keras.MyQNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_delta=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.q_networks.NN_keras.NN'>)

Bases: deer.base_classes.QNetwork.QNetwork

Deep Q-learning network using Keras (with any backend)

Parameters:

environment : object from class Environment

rho : float

Parameter for rmsprop. Default : 0.9

rms_epsilon : float

Parameter for rmsprop. Default : 0.0001

momentum : float

Default : 0

clip_delta : float

Not implemented.

freeze_interval : int

Period during which the target network is frozen and after which the target network is updated. Default : 1000

batch_size : int

Number of tuples taken into account for each iteration of gradient descent. Default : 32

update_rule: str

{sgd,rmsprop}. Default : rmsprop

random_state : numpy random number generator

double_Q : bool, optional

Whether to activate double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.

neural_network : object, optional

default is deer.q_networks.NN_keras

Methods

chooseBestAction
discountFactor
getAllParams
learningRate
qValues
setAllParams
setDiscountFactor
setLearningRate
train
chooseBestAction(state)

Get the best action for a belief state

Returns: The best action : int

qValues(state_val)

Get the q values for one belief state

Returns: The q values for the provided belief state

train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train one batch.

  1. Set shared variable in states_shared, next_states_shared, actions_shared, rewards_shared, terminals_shared
  2. perform batch training
Parameters:

states_val : list of batch_size * [list of max_num_elements* [list of k * [element 2D,1D or scalar]])

actions_val : b x 1 numpy array of integers

rewards_val : b x 1 numpy array

next_states_val : list of batch_size * [list of max_num_elements* [list of k * [element 2D,1D or scalar]])

terminals_val : b x 1 numpy boolean array

Returns:

Average loss of the batch training (RMSE)

Individual (square) losses for each tuple

Policies

deer.base_classes.Policy(q_network, ...) Abstract class for all policies.
deer.policies.EpsilonGreedyPolicy(q_network, ...) The policy acts greedily with probability \(1-\epsilon\) and acts randomly otherwise.
deer.policies.LongerExplorationPolicy(...[, ...]) Simple alternative to \(\epsilon\)-greedy that can explore more efficiently for a broad class of realistic problems.

Detailed description

class deer.base_classes.Policy(q_network, n_actions, random_state)

Abstract class for all policies. A policy takes observations as input, and outputs an action.

Parameters:

q_network : object from class QNetwork

n_actions : int or list

Definition of the action space provided by Environment.nActions()

random_state : numpy random number generator

Methods

action
bestAction
randomAction
action(state)

Main method of the Policy class. It can be called by agent.py, given a state, and should return a valid action w.r.t. the environment given to the constructor.

bestAction(state)

Returns the best Action for the given state. This is an additional encapsulation for q-network.

randomAction()

Returns a random action
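
As a hedged sketch, a custom policy only needs to subclass Policy and implement action(state); for instance, an epsilon-greedy variant could look as follows (this is an illustration, not the library's own EpsilonGreedyPolicy described below). It can then be passed to the agent through the train_policy or test_policy arguments of NeuralAgent.

from deer.base_classes import Policy

class MyEpsilonGreedyPolicy(Policy):
    """Illustrative sketch of a custom policy."""

    def __init__(self, q_network, n_actions, random_state, epsilon):
        Policy.__init__(self, q_network, n_actions, random_state)
        self._rng = random_state        # kept locally to avoid relying on base-class attribute names
        self._epsilon = epsilon

    def action(self, state):
        # Act randomly with probability epsilon, greedily otherwise.
        if self._rng.rand() < self._epsilon:
            return self.randomAction()
        return self.bestAction(state)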

class deer.policies.EpsilonGreedyPolicy(q_network, n_actions, random_state, epsilon)

Bases: deer.base_classes.Policy.Policy

The policy acts greedily with probability \(1-\epsilon\) and acts randomly otherwise. It is now used as a default policy for the neural agent.

Parameters:

epsilon : float

Proportion of random steps

Methods

action
bestAction
epsilon
randomAction
setEpsilon
epsilon()

Get the epsilon for \(\epsilon\)-greedy exploration

setEpsilon(e)

Set the epsilon used for \(\epsilon\)-greedy exploration

class deer.policies.LongerExplorationPolicy(q_network, n_actions, random_state, epsilon, length=10)

Bases: deer.base_classes.Policy.Policy

Simple alternative to \(\epsilon\)-greedy that can explore more efficiently for a broad class of realistic problems.

Parameters:

epsilon : float

Proportion of random steps

length : int

Length of the exploration sequences that will be considered

Methods

action
bestAction
epsilon
randomAction
sampleUniformActionSequence
setEpsilon
epsilon()

Get the epsilon

setEpsilon(e)

Set the epsilon
