keras-gym¶
Plug-n-play Reinforcement Learning in Python
Create simple, reproducible RL solutions with OpenAI gym environments and Keras function approximators.
Documentation¶
Example Notebooks¶
Here we list a selection of Jupyter notebooks that help you get started by learning from examples.
Cartpole¶
In these notebooks we solve the CartPole environment.
Cartpole with SARSA¶
In this notebook we solve the CartPole-v0 environment using the SARSA algorithm. We’ll use a linear function approximator for our Q-function.
To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Atari 2600: Pong¶
These notebooks solve the Pong environment.
Atari 2600: Pong with DQN¶
In this notebook we solve the PongDeterministic-v4 environment using deep Q-learning (DQN). We’ll use a convolutional neural net (without pooling) as our function approximator for the Q-function, see AtariQ.
This notebook periodically generates GIFs, so that we can inspect how the training is progressing.
After a few hundred episodes, this is what you can expect:

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Atari 2600: Pong with PPO¶
In this notebook we solve the PongDeterministic-v4 environment using a TD actor-critic algorithm with PPO policy updates. We use convolutional neural nets (without pooling) as our function approximators for the state value function \(v(s)\) and the policy \(\pi(a|s)\), see AtariFunctionApproximator.
This notebook periodically generates GIFs, so that we can inspect how the training is progressing.
After a few hundred episodes, this is what you can expect:

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Non-Slippery Frozen Lake¶
In these notebooks we solve a non-slippery version of the FrozenLake-v0 environment.
Non-Slippery Frozen Lake with REINFORCE¶
In this notebook we solve a non-slippery version of the FrozenLake-v0 environment using the REINFORCE algorithm (Monte Carlo policy gradient). We’ll use a linear function approximator for our policy.
To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Non-Slippery Frozen Lake with Actor-Critic¶
In this notebook we solve a non-slippery version of the FrozenLake-v0 environment using the TD actor-critic algorithm with PPO policy updates. We’ll use a linear function approximator for our policy and our state value function.
To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Non-Slippery Frozen Lake with Soft Actor-Critic (SAC)¶
In this notebook we solve a non-slippery version of the FrozenLake-v0 environment using the Soft Actor-Critic algorithm (SAC). We’ll use a linear function approximator for our policy and our value functions.
To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Pendulum¶
These notebooks solve the Pendulum environment.
Pendulum with PPO¶
In this notebook we solve the Pendulum-v0 environment using a TD actor-critic algorithm with PPO policy updates.
We use a simple multi-layer perceptron as our function approximator for the state value function \(v(s)\) and the policy \(\pi(a|s)\), the latter implemented by GaussianPolicy.
This algorithm is slow to converge (if it does at all). You should start to see improvement in the average return after about 150k timesteps. Below you’ll see a particularly successful episode:

To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.
Function Approximators¶
The central object in this package is the keras_gym.FunctionApproximator, which provides an interface between a gym-type environment and function approximators like value functions and updateable policies.
FunctionApproximator class¶
The way we would define a function approximator is by specifying a body. For instance, the example below specifies a simple multi-layer perceptron:
import gym
import keras_gym as km
from tensorflow import keras


class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, S):
        X = keras.layers.Flatten()(S)
        X = keras.layers.Dense(units=4)(X)
        return X


# environment
env = gym.make(...)

# value function and its derived policy
function_approximator = MLP(env, lr=0.01)
This function_approximator can now be used to construct a value function or updateable policy, which we cover in the remainder of this page.
Predefined Function Approximators¶
Although it’s pretty easy to create a custom function approximator, keras-gym also provides some predefined function approximators. They are listed here.
Value Functions¶
Value functions estimate the expected (discounted) sum of future rewards. For instance, state value functions are defined as:
\[v(s)\ =\ \mathbb{E}\left\{R_t + \gamma\,R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s\right\}\]
Here, the \(R_t\) are the individual rewards we receive from the Markov Decision Process (MDP) at each time step.
In keras-gym we define a state value function as follows:
v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)
The function_approximator is discussed above. The other arguments set the discount factor \(\gamma\in[0,1]\) and the number of steps over which to bootstrap.
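As a quick sanity check, the resulting value function can be evaluated on a single state observation. A minimal sketch, assuming env and function_approximator are defined as in the example above:
s = env.reset()  # a single state observation
print(v(s))      # estimated state value v(s), a float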
Similar to state value functions, we can also define state-action value functions:
\[q(s,a)\ =\ \mathbb{E}\left\{R_t + \gamma\,R_{t+1} + \gamma^2 R_{t+2} + \dots\ \Big|\ S_t=s, A_t=a\right\}\]
keras-gym provides two distinct ways to define such a Q-function, which are referred to as type-I and type-II Q-functions. The difference between the two is in how the function approximator models the Q-function. A type-I Q-function models the Q-function as \((s, a)\mapsto q(s, a)\in\mathbb{R}\), whereas a type-II Q-function models it as \(s\mapsto q(s,.)\in\mathbb{R}^n\). Here, \(n\) is the number of actions, which means that this is only well-defined for discrete action spaces.
In keras-gym we define a type-I Q-function as follows:
q = km.QTypeI(function_approximator, update_strategy='sarsa')
and similarly for type-II:
q = km.QTypeII(function_approximator, update_strategy='sarsa')
The update_strategy argument specifies our bootstrapping target. Available choices are 'sarsa', 'q_learning' and 'double_q_learning'.
The main reason for using a Q-function is for value-based control. In other words, we typically want to derive a policy from the Q-function. This is pretty straightforward too:
pi = km.EpsilonGreedy(q, epsilon=0.1)
# the epsilon parameter may be updated dynamically
pi.set_epsilon(0.25)
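Putting these pieces together, a value-based control loop might look roughly like the sketch below. This is not taken verbatim from the package docs: it assumes the classic gym reset/step API and that the epsilon-greedy policy can be called on a single state observation, just like the updateable policies documented further down.
# hedged sketch: value-based control with SARSA-style updates
for episode in range(200):
    s = env.reset()
    done = False
    while not done:
        a = pi(s)                        # epsilon-greedy action (assumed callable)
        s_next, r, done, info = env.step(a)
        q.update(s, a, r, done)          # single-transition update, see QTypeI.update below
        s = s_next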
Updateable Policies¶
Besides value-based control in which we derive a policy from a Q-function, we can also do policy-based control. In policy-based methods we learn a policy directly as a probability distribution over the space of actions \(\pi(a|s)\).
The updateable policies for discrete action spaces are known as softmax policies:
\[\pi(a|s)\ =\ \frac{\exp z(s,a)}{\sum_{a'}\exp z(s,a')}\]
where the logits are defined over the real line, \(z(s,a)\in\mathbb{R}\).
In keras-gym we define a softmax policy as follows:
pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')
Similar to Q-functions, we can pick different update strategies. Available options for policies are 'vanilla', 'ppo' and 'cross_entropy'. These specify the objective function used in our policy updates.
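For example, a REINFORCE-style update of such a policy could look roughly like the following hedged sketch. It is not taken from the official docs; it assumes a discrete-action gym environment and uses the Monte Carlo return as a crude stand-in for the advantage.
# hedged sketch: Monte Carlo policy-gradient updates with a softmax policy
s = env.reset()
transitions, done = [], False

while not done:
    a = pi(s)                            # sample an action from pi(a|s)
    s_next, r, done, info = env.step(a)
    transitions.append((s, a, r))
    s = s_next

G = 0.0
for s, a, r in reversed(transitions):
    G = r + 0.9 * G                      # discounted return, gamma=0.9 as in the examples above
    pi.update(s, a, G)                   # update(s, a, advantage), see SoftmaxPolicy.update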
Actor-Critics¶
It’s often useful to combine a policy with a value function into what is called an actor-critic. The value function (critic) can be used to aid the update procedure for the policy (actor). The keras-gym package provides a simple way of constructing an actor-critic using the ActorCritic class:
# separate policy and value function
pi = km.SoftmaxPolicy(function_approximator, update_strategy='vanilla')
v = km.V(function_approximator, gamma=0.9, bootstrap_n=1)
# combine them into a single actor-critic
actor_critic = km.ActorCritic(pi, v)
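Once combined, the actor-critic exposes a single update method that takes care of both the policy and the value function. A minimal training-loop sketch (assuming the classic gym reset/step API; not copied from the official docs):
# hedged sketch: episodic training with the combined actor-critic
for episode in range(500):
    s = env.reset()
    done = False
    while not done:
        a, v_s = actor_critic(s)             # draw an action and estimate v(s) in one pass
        s_next, r, done, info = env.step(a)
        actor_critic.update(s, a, r, done)   # updates both actor and critic
        s = s_next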
Objects¶
FunctionApproximator class¶
keras_gym.FunctionApproximator | A generic function approximator.
class keras_gym.FunctionApproximator(env, optimizer=None, **optimizer_kwargs)[source]¶
A generic function approximator.
This is the central object that provides an interface between a gym-type environment and function approximators like value functions and updateable policies.
In order to create a valid function approximator, you need to implement the body method. For example, to implement a simple multi-layer perceptron function approximator you would do something like:
import gym
import keras_gym as km
from tensorflow.keras.layers import Flatten, Dense


class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, S):
        X = Flatten()(S)
        X = Dense(units=4)(X)
        return X


# environment
env = gym.make(...)

# generic function approximator
mlp = MLP(env, lr=0.001)

# policy and value function
pi, v = km.SoftmaxPolicy(mlp), km.V(mlp)
The default heads are simple (multi) linear regression layers, which can be overridden by your own implementation.
Parameters: - env : environment
A gym-style environment.
- optimizer : keras.optimizers.Optimizer, optional
If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.
- **optimizer_kwargs : keyword arguments
Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.
DEFAULT_OPTIMIZER¶
alias of tensorflow.python.keras.optimizer_v2.adam.Adam
body(self, S)[source]¶
This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.
Parameters: - S : nd Tensor, shape: [batch_size, …]
The input state observation.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
body_q1(self, S, A)[source]¶
This is similar to body(), except that it takes a state-action pair as input instead of only state observations.
Parameters: - S : nd Tensor, shape: [batch_size, …]
The input state observation.
- A : nd Tensor: shape: [batch_size, …]
The input actions.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
head_pi(self, X)[source]¶
This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.
Parameters: - X : nd Tensor, shape: [batch_size, …]
X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.
Returns: - *params : Tensor or tuple of Tensors, shape: [batch_size, …]
These constitute the raw policy distribution parameters.
head_q1(self, X)[source]¶
This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.
Returns: - Q_sa : 2d Tensor, shape: [batch_size, 1]
The output type-I Q-values \(q(s,a)\in\mathbb{R}\).
head_q2(self, X)[source]¶
This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.
Returns: - Q_s : 2d Tensor, shape: [batch_size, num_actions]
The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).
head_v(self, X)[source]¶
This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of the body() method.
Returns: - V : 2d Tensor, shape: [batch_size, 1]
The output state values \(v(s)\in\mathbb{R}\).
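The body() customization is not limited to multi-layer perceptrons: any stack of Keras layers can be used. For instance, a convolutional body for image observations might look like the hedged sketch below; the layer sizes are illustrative assumptions, not values prescribed by the package.
from tensorflow import keras
import keras_gym as km


class ConvNet(km.FunctionApproximator):
    """ convolutional feature extractor for image observations (illustrative) """
    def body(self, S):
        # depending on the env, S may first need to be cast/normalized to floats
        X = keras.layers.Conv2D(filters=16, kernel_size=8, strides=4, activation='relu')(S)
        X = keras.layers.Conv2D(filters=32, kernel_size=4, strides=2, activation='relu')(X)
        X = keras.layers.Flatten()(X)
        X = keras.layers.Dense(units=256, activation='relu')(X)
        return X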
Predefined Function Approximators¶
keras_gym.predefined.LinearFunctionApproximator | A linear function approximator.
keras_gym.predefined.AtariFunctionApproximator | A function approximator specifically designed for Atari 2600 environments.
keras_gym.predefined.ConnectFourFunctionApproximator | A function approximator specifically designed for the ConnectFour environment.
class keras_gym.predefined.LinearFunctionApproximator(env, interaction=None, optimizer=None, **optimizer_kwargs)¶
A linear function approximator.
Parameters: - env : environment
A gym-style environment.
- interaction : str or keras.layers.Layer, optional
The desired feature interactions that are fed to the linear regression model. Available predefined preprocessors can be chosen by passing a string, one of the following:
- ‘full_quadratic’
This option generates full-quadratic interactions, which include all linear, bilinear and quadratic terms. It does not include an intercept. Let \(b\) and \(n\) be the batch size and number of features. Then, the input shape is \((b, n)\) and the output shape is \((b, (n + 1)(n + 2)/2 - 1)\).
Note: This option requires the tensorflow backend.
- ‘elementwise_quadratic’
This option generates element-wise quadratic interactions, which only include linear and quadratic terms. It does not include bilinear terms or an intercept. Let \(b\) and \(n\) be the batch size and number of features. Then, the input shape is \((b, n)\) and the output shape is \((b, 2n)\).
Otherwise, a custom interaction layer can be passed as well. If left unspecified (interaction=None), the interaction layer is omitted altogether.
- optimizer : keras.optimizers.Optimizer, optional
If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.
- **optimizer_kwargs : keyword arguments
Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.
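As a concrete illustration, a linear function approximator with element-wise quadratic feature interactions could be constructed as in the hedged sketch below; the CartPole environment and the learning rate are example choices, not prescribed defaults.
import gym
import keras_gym as km

env = gym.make('CartPole-v0')

# linear model on top of element-wise quadratic features (illustrative settings)
linear = km.predefined.LinearFunctionApproximator(
    env, interaction='elementwise_quadratic', lr=0.01)

q = km.QTypeI(linear, update_strategy='sarsa')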
DEFAULT_OPTIMIZER¶
alias of tensorflow.python.keras.optimizer_v2.gradient_descent.SGD
-
body
(self, S)¶ This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the
body()
as an elaborate automatic feature extractor.Parameters: - S : nd Tensor: shape: [batch_size, …]
The input state observation.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
-
body_q1
(self, S, A)¶ This is similar to
body()
, except that it takes a state-action pair as input instead of only state observations.Parameters: - S : nd Tensor: shape: [batch_size, …]
The input state observation.
- A : nd Tensor: shape: [batch_size, …]
The input actions.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
-
head_pi
(self, X)¶ This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - *params : Tensor or tuple of Tensors, shape: [batch_size, …]
These constitute the raw policy distribution parameters.
-
head_q1
(self, X)¶ This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - Q_sa : 2d Tensor, shape: [batch_size, 1]
The output type-I Q-values \(q(s,a)\in\mathbb{R}\).
-
head_q2
(self, X)¶ This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - Q_s : 2d Tensor, shape: [batch_size, num_actions]
The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).
-
head_v
(self, X)¶ This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - V : 2d Tensor, shape: [batch_size, 1]
The output state values \(v(s)\in\mathbb{R}\).
class keras_gym.predefined.AtariFunctionApproximator(env, optimizer=None, **optimizer_kwargs)¶
A function approximator specifically designed for Atari 2600 environments.
Parameters: - env : environment
An Atari 2600 gym environment.
- optimizer : keras.optimizers.Optimizer, optional
If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.
- **optimizer_kwargs : keyword arguments
Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.
DEFAULT_OPTIMIZER¶
alias of tensorflow.python.keras.optimizer_v2.adam.Adam
-
body
(self, S)¶ This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the
body()
as an elaborate automatic feature extractor.Parameters: - S : nd Tensor: shape: [batch_size, …]
The input state observation.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
-
body_q1
(self, S, A)¶ This is similar to
body()
, except that it takes a state-action pair as input instead of only state observations.Parameters: - S : nd Tensor: shape: [batch_size, …]
The input state observation.
- A : nd Tensor: shape: [batch_size, …]
The input actions.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
-
head_pi
(self, X)¶ This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - *params : Tensor or tuple of Tensors, shape: [batch_size, …]
These constitute the raw policy distribution parameters.
-
head_q1
(self, X)¶ This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - Q_sa : 2d Tensor, shape: [batch_size, 1]
The output type-I Q-values \(q(s,a)\in\mathbb{R}\).
-
head_q2
(self, X)¶ This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - Q_s : 2d Tensor, shape: [batch_size, num_actions]
The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).
-
head_v
(self, X)¶ This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - V : 2d Tensor, shape: [batch_size, 1]
The output state values \(v(s)\in\mathbb{R}\).
class keras_gym.predefined.ConnectFourFunctionApproximator(env, optimizer=None, **optimizer_kwargs)¶
A function approximator specifically designed for the ConnectFour environment.
Parameters: - env : environment
A ConnectFour gym environment.
- optimizer : keras.optimizers.Optimizer, optional
If left unspecified (optimizer=None), the function approximator’s DEFAULT_OPTIMIZER is used. See keras documentation for more details.
- **optimizer_kwargs : keyword arguments
Keyword arguments for the optimizer. This is useful when you want to use the default optimizer with a different setting, e.g. changing the learning rate.
DEFAULT_OPTIMIZER¶
alias of tensorflow.python.keras.optimizer_v2.adam.Adam
-
body
(self, S)¶ This is the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the
body()
as an elaborate automatic feature extractor.Parameters: - S : nd Tensor: shape: [batch_size, …]
The input state observation.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
-
body_q1
(self, S, A)¶ This is similar to
body()
, except that it takes a state-action pair as input instead of only state observations.Parameters: - S : nd Tensor: shape: [batch_size, …]
The input state observation.
- A : nd Tensor: shape: [batch_size, …]
The input actions.
Returns: - X : nd Tensor, shape: [batch_size, …]
The intermediate keras tensor.
-
head_pi
(self, X)¶ This is the policy head. It returns logits, i.e. not probabilities. Use a softmax to turn the output into probabilities.
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - *params : Tensor or tuple of Tensors, shape: [batch_size, …]
These constitute the raw policy distribution parameters.
-
head_q1
(self, X)¶ This is the type-I Q-value head. It returns a scalar Q-value \(q(s,a)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - Q_sa : 2d Tensor, shape: [batch_size, 1]
The output type-I Q-values \(q(s,a)\in\mathbb{R}\).
-
head_q2
(self, X)¶ This is the type-II Q-value head. It returns a vector of Q-values \(q(s,.)\in\mathbb{R}^n\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - Q_s : 2d Tensor, shape: [batch_size, num_actions]
The output type-II Q-values \(q(s,.)\in\mathbb{R}^n\).
-
head_v
(self, X)¶ This is the state value head. It returns a scalar V-value \(v(s)\in\mathbb{R}\).
Parameters: - X : nd Tensor, shape: [batch_size, …]
X
is an intermediate tensor in the full forward-pass of the computation graph; it’s the output of the last layer of thebody()
method.
Returns: - V : 2d Tensor, shape: [batch_size, 1]
The output state values \(v(s)\in\mathbb{R}\).
Value Functions¶
keras_gym.V | A state value function \(s\mapsto v(s)\).
keras_gym.QTypeI | A type-I state-action value function \((s,a)\mapsto q(s,a)\).
keras_gym.QTypeII | A type-II state-action value function \(s\mapsto q(s,.)\).
class keras_gym.V(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False)[source]¶
A state value function \(s\mapsto v(s)\).
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
__call__(self, s, use_target_model=False)[source]¶
Evaluate the state value function.
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - V : float or array of floats
The estimated value of the state \(v(s)\).
-
batch_eval
(self, S, use_target_model=False)[source]¶ Evaluate the state value function on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - V : 1d array, dtype: float, shape: [batch_size]
The predicted state values.
-
batch_update
(self, S, Rn, In, S_next)[source]¶ Update the value function on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
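To make the Rn and In arguments of batch_update concrete, here is a small hedged sketch (a hypothetical helper, not part of the package) that computes the partial return \(R^{(n)}_t\) and the bootstrapping factor \(I^{(n)}_t\) from the next few observed rewards, following the formulas above.
def n_step_target_inputs(rewards, gamma=0.9, bootstrap=True):
    """ illustrative helper: compute (Rn, In) from the next n observed rewards """
    n = len(rewards)
    Rn = sum(gamma ** i * r for i, r in enumerate(rewards))  # non-bootstrapped part of the return
    In = gamma ** n if bootstrap else 0.0                    # 0 if the episode ended within n steps
    return Rn, In


Rn, In = n_step_target_inputs([1.0, 0.0, 1.0], gamma=0.9)    # n = 3 steps, episode still ongoing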
class keras_gym.QTypeI(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]¶
A type-I state-action value function \((s,a)\mapsto q(s,a)\).
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
- update_strategy : str, optional
The update strategy that we use to select the (would-be) next-action \(A_{t+n}\) in the bootstrapped target:
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]Options are:
- ‘sarsa’
Sample the next action, i.e. use the action that was actually taken.
- ‘q_learning’
Take the action with highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.
- ‘double_q_learning’
Same as ‘q_learning’, \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\), except that the value itself is computed using the target_model rather than the primary model, i.e.
\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]- ‘expected_sarsa’
Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|s)\,Q(S_{t+n}, a)\]
-
__call__
(self, s, a=None, use_target_model=False)¶ Evaluate the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action, optional
A single action.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : float or array of floats
If action
a
is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand,a
is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is[num_actions]
, which is only well-defined for discrete action spaces.
-
batch_eval
(self, S, A=None, use_target_model=False)[source]¶ Evaluate the Q-function on a batch of state (or state-action) observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : 1d array, dtype: int, shape: [batch_size], optional
A batch of actions that were taken.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : 1d or 2d array of floats
If action
A
is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand,A
is left unspecified, a vector representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is[batch_size, num_actions]
, which is only well-defined for discrete action spaces.
-
batch_update
(self, S, A, Rn, In, S_next, A_next=None)¶ Update the value function on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
A batch of (potential) next actions A_next. This argument is only used if
update_strategy='sarsa'
.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
bootstrap_target
(self, Rn, In, S_next, A_next=None)¶ Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+\gamma^nQ(S_{t+n}, A_{t+n})\).
Parameters: - Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]
A batch of (potential) next actions A_next. This argument is only used if
update_strategy='sarsa'
.
Returns: - Gn : 1d array, dtype: int, shape: [batch_size]
A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\) computed according to given
update_strategy
.
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
-
update
(self, s, a, r, done)¶ Update the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
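Taken together, these options combine naturally with a target model. A DQN-flavoured setup might look like the hedged sketch below; it is not verbatim from the docs, the experience stream is a placeholder and the synchronization period is an arbitrary choice.
# hedged sketch: Q-learning with a periodically synced target model
q = km.QTypeII(
    function_approximator,
    gamma=0.99,
    bootstrap_n=1,
    bootstrap_with_target_model=True,
    update_strategy='double_q_learning')

pi = km.EpsilonGreedy(q, epsilon=0.1)

for t, (s, a, r, done) in enumerate(transitions):   # transitions: any stream of experience (placeholder)
    q.update(s, a, r, done)
    if t % 1000 == 0:
        q.sync_target_model(tau=1.0)                 # hard update of the target weights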
class keras_gym.QTypeII(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, update_strategy='sarsa')[source]¶
A type-II state-action value function \(s\mapsto q(s,.)\).
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
- update_strategy : str, optional
The update strategy that we use to select the (would-be) next-action \(A_{t+n}\) in the bootstrapped target:
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n Q(S_{t+n}, A_{t+n})\]Options are:
- ‘sarsa’
Sample the next action, i.e. use the action that was actually taken.
- ‘q_learning’
Take the action with highest Q-value under the current estimate, i.e. \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\). This is an off-policy method.
- ‘double_q_learning’
Same as ‘q_learning’, \(A_{t+n} = \arg\max_aQ(S_{t+n}, a)\), except that the value itself is computed using the target_model rather than the primary model, i.e.
\[\begin{split}A_{t+n}\ &=\ \arg\max_aQ_\text{primary}(S_{t+n}, a)\\ G^{(n)}_t\ &=\ R^{(n)}_t + \gamma^n Q_\text{target}(S_{t+n}, A_{t+n})\end{split}\]- ‘expected_sarsa’
Similar to SARSA in that it’s on-policy, except that we take the expected Q-value rather than a sample of it, i.e.
\[G^{(n)}_t\ =\ R^{(n)}_t + \gamma^n\sum_a\pi(a|s)\,Q(S_{t+n}, a)\]
-
__call__
(self, s, a=None, use_target_model=False)¶ Evaluate the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action, optional
A single action.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : float or array of floats
If action
a
is provided, a single float representing \(q(s,a)\) is returned. If, on the other hand,a
is left unspecified, a vector representing \(q(s,.)\) is returned instead. The shape of the latter return value is[num_actions]
, which is only well-defined for discrete action spaces.
-
batch_eval
(self, S, A=None, use_target_model=False)[source]¶ Evaluate the Q-function on a batch of state (or state-action) observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : 1d array, dtype: int, shape: [batch_size], optional
A batch of actions that were taken.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - Q : 1d or 2d array of floats
If action
A
is provided, a 1d array representing a batch of \(q(s,a)\) is returned. If, on the other hand,A
is left unspecified, a vector representing a batch of \(q(s,.)\) is returned instead. The shape of the latter return value is[batch_size, num_actions]
, which is only well-defined for discrete action spaces.
-
batch_update
(self, S, A, Rn, In, S_next, A_next=None)¶ Update the value function on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\]- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
A batch of (potential) next actions A_next. This argument is only used if
update_strategy='sarsa'
.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
bootstrap_target
(self, Rn, In, S_next, A_next=None)¶ Get the bootstrapped target \(G^{(n)}_t=R^{(n)}_t+\gamma^nQ(S_{t+n}, A_{t+n})\).
Parameters: - Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as:
\[G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\]- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, dtype: int, shape: [batch_size, num_actions]
A batch of (potential) next actions A_next. This argument is only used if
update_strategy='sarsa'
.
Returns: - Gn : 1d array, dtype: int, shape: [batch_size]
A batch of bootstrap-estimated returns \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n},A_{t+n})\) computed according to given
update_strategy
.
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
-
update
(self, s, a, r, done)¶ Update the Q-function.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
Updateable Policies¶
keras_gym.GaussianPolicy | An updateable policy for environments with a continuous action space, i.e. a Box.
keras_gym.SoftmaxPolicy | Updateable policy for discrete action spaces.
class keras_gym.GaussianPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)[source]¶
An updateable policy for environments with a continuous action space, i.e. a Box. It models the policy \(\pi_\theta(a|s)\) as a normal distribution with conditional parameters \((\mu_\theta(s), \sigma_\theta(s))\).
Important: This requires that the env is wrapped with:
env = km.wrappers.BoxToReals(env)
This wrapper decompactifies the Box action space.
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- update_strategy : str, optional
The strategy for updating our policy. This typically determines the loss function that we use for our policy function approximator.
Options are:
- ‘vanilla’
Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:
\[J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ -\mathcal{A}_t\,\log\pi_\theta(A_t|S_t) \right\}\]where \(\mathcal{A}_t=\mathcal{A}(S_t,A_t)\) is the advantage at time step \(t\).
- ‘ppo’
Proximal policy optimization uses a clipped proximal loss:
\[J(\theta)\ =\ \hat{\mathbb{E}}_t \left\{ \min\Big( \rho_t(\theta)\,\mathcal{A}_t\,,\ \tilde{\rho}_t(\theta)\,\mathcal{A}_t \Big) \right\}\]where \(\rho_t(\theta)\) is the probability ratio:
\[\rho_t(\theta)\ =\ \frac {\pi_\theta(A_t|S_t)} {\pi_{\theta_\text{old}}(A_t|S_t)}\]and \(\tilde{\rho}_t(\theta)\) is its clipped version:
\[\tilde{\rho}_t(\theta)\ =\ \text{clip}\big( \rho_t(\theta), 1-\epsilon, 1+\epsilon\big)\]- ‘cross_entropy’
Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):
\[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
- ppo_clip_eps : float, optional
The clipping parameter \(\epsilon\) in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.
- entropy_beta : float, optional
The coefficient of the entropy bonus term in the policy objective.
-
__call__
(self, s, use_target_model=False)¶ Draw an action from the current policy \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
-
batch_eval
(self, S, use_target_model=False)¶ Evaluate the policy on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - A : nd array, shape: [batch_size, …]
A batch of sampled actions.
-
batch_update
(self, S, A, Adv)¶ Update the policy on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd array, shape: [batch_size, …]
A batch of actions taken by the behavior policy.
- Adv : 1d array, dtype: float, shape: [batch_size]
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
dist_params
(self, s, use_target_model=False)¶ Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - *params : tuple of arrays
The raw distribution parameters.
-
greedy
(self, s, use_target_model=False)¶ Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
-
policy_loss_with_metrics
(self, Adv, A=None)¶ This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).
This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.
Parameters: - Adv : 1d Tensor, shape: [batch_size]
A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken under the behavior policy. For some choices of policy loss, e.g.
update_strategy='sac'
this input is ignored.
Returns: - loss, metrics : (Tensor, dict of Tensors)
The policy loss along with some metrics, which is a dict of type
{name <str>: metric <Tensor>}
. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors withndim=0
.The
loss
is passed to a keras Model usingtrain_model.add_loss(loss)
. Similarly, each metric in the metric dict is passed to the model usingtrain_model.add_metric(metric, name=name, aggregation='mean')
.
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
-
update
(self, s, a, advantage)¶ Update the policy.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- advantage : float
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
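Putting the pieces for continuous control together, a Gaussian policy might be set up roughly as follows. This is a hedged sketch: Pendulum-v0 and the PPO settings are example choices, and MLP stands for any FunctionApproximator subclass such as the one defined earlier on this page.
import gym
import keras_gym as km

env = gym.make('Pendulum-v0')
env = km.wrappers.BoxToReals(env)        # required wrapper, see the note above

mlp = MLP(env, lr=0.001)                 # any FunctionApproximator subclass (assumed defined)
pi = km.GaussianPolicy(mlp, update_strategy='ppo', ppo_clip_eps=0.2, entropy_beta=0.01)

a = pi(env.reset())                      # sample a continuous action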
class keras_gym.SoftmaxPolicy(function_approximator, update_strategy='vanilla', ppo_clip_eps=0.2, entropy_beta=0.01, random_seed=None)[source]¶
Updateable policy for discrete action spaces.
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- update_strategy : str, callable, optional
The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the
policy_loss_with_metrics()
method.Provided options are:
- ‘vanilla’
Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:
\[J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)\]- ‘ppo’
Proximal policy optimization uses a clipped proximal loss:
\[J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big) \,\mathcal{A}(s,a)\Big)\]where \(r(\theta)\) is the probability ratio:
\[r(\theta)\ =\ \frac {\pi(a|s,\theta)} {\pi(a|s,\theta_\text{old})}\]- ‘cross_entropy’
Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):
\[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
- ppo_clip_eps : float, optional
The clipping parameter \(\epsilon\) in the PPO clipped surrogate loss. This option is only applicable if update_strategy='ppo'.
- entropy_beta : float, optional
The coefficient of the entropy bonus term in the policy objective.
- random_seed : int, optional
Sets the random state to get reproducible results.
-
__call__
(self, s, use_target_model=False)[source]¶ Draw an action from the current policy \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
-
batch_eval
(self, S, use_target_model=False)¶ Evaluate the policy on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - A : nd array, shape: [batch_size, …]
A batch of sampled actions.
-
batch_update
(self, S, A, Adv)¶ Update the policy on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd array, shape: [batch_size, …]
A batch of actions taken by the behavior policy.
- Adv : 1d array, dtype: float, shape: [batch_size]
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
dist_params
(self, s, use_target_model=False)¶ Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - *params : tuple of arrays
The raw distribution parameters.
-
greedy
(self, s, use_target_model=False)¶ Draw the greedy action, i.e. \(\arg\max_a\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns: - a : action
A single action proposed under the current policy.
-
policy_loss_with_metrics
(self, Adv, A=None)¶ This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).
This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.
Parameters: - Adv : 1d Tensor, shape: [batch_size]
A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken under the behavior policy. For some choices of policy loss, e.g.
update_strategy='sac'
this input is ignored.
Returns: - loss, metrics : (Tensor, dict of Tensors)
The policy loss along with some metrics, which is a dict of type
{name <str>: metric <Tensor>}
. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors withndim=0
.The
loss
is passed to a keras Model usingtrain_model.add_loss(loss)
. Similarly, each metric in the metric dict is passed to the model usingtrain_model.add_metric(metric, name=name, aggregation='mean')
.
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
-
update
(self, s, a, advantage)¶ Update the policy.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- advantage : float
A value for the advantage \(\mathcal{A}(s,a) = q(s,a) - v(s)\). This might be a sampled and/or estimated version of the true advantage.
Actor-Critics¶
keras_gym.ActorCritic | A generic actor-critic, combining an updateable policy with a value function.
keras_gym.SoftActorCritic | Implementation of a soft actor-critic (SAC), which uses entropy regularization in the value function as well as in its policy updates.
class keras_gym.ActorCritic(policy, v_func, value_loss_weight=1.0)[source]¶
A generic actor-critic, combining an updateable policy with a value function.
The added value of using an ActorCritic to combine a policy with a value function is that it avoids having to feed in S (potentially very large) three times at training time. Instead, it only feeds it in once.
Parameters: - policy : Policy object
- v_func : value-function object
A state value function \(v(s)\).
- value_loss_weight : float, optional
Relative weight to give to the value-function loss:
loss = policy_loss + value_loss_weight * value_loss
-
__call__
(self, s)¶ Draw an action from the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).
Parameters: - s : state observation
A single state observation.
Returns: - a, v : tuple (1d array of floats, float)
Returns a pair representing \((a, v(s))\).
-
batch_eval
(self, S, use_target_model=False)¶ Evaluate the actor-critic on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns:
-
batch_update
(self, S, A, Rn, In, S_next, A_next=None)¶ Update both actor and critic on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots \gamma^{n-1}\,R_{t+n-1}\]In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\).
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
A batch of (potential) next actions A_next. This argument is only used if
update_strategy='sarsa'
.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
dist_params
(self, s)¶ Get the distribution parameters under the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).
Parameters: - s : state observation
A single state observation.
Returns: - dist_params, v : tuple (1d array of floats, float)
Returns a pair representing the distribution parameters of \(\pi(a|s)\) and the estimated state value \(v(s)\).
classmethod from_func(function_approximator, gamma=0.9, bootstrap_n=1, bootstrap_with_target_model=False, entropy_beta=0.01, update_strategy='vanilla', random_seed=None)[source]¶
Create an instance directly from a FunctionApproximator object.
Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- bootstrap_with_target_model : bool, optional
Whether to use the target_model when constructing a bootstrapped target. If False (default), the primary predict_model is used.
- entropy_beta : float, optional
The coefficient of the entropy bonus term in the policy objective.
- update_strategy : str, callable, optional
The strategy for updating our policy. This determines the loss function that we use for our policy function approximator. If you wish to use a custom policy loss, you can override the
policy_loss_with_metrics()
method.Provided options are:
- ‘vanilla’
Plain vanilla policy gradient. The corresponding (surrogate) loss function that we use is:
\[J(\theta)\ =\ -\mathcal{A}(s,a)\,\ln\pi(a|s,\theta)\]- ‘ppo’
Proximal policy optimization uses a clipped proximal loss:
\[J(\theta)\ =\ \min\Big( r(\theta)\,\mathcal{A}(s,a)\,,\ \text{clip}\big( r(\theta), 1-\epsilon, 1+\epsilon\big) \,\mathcal{A}(s,a)\Big)\]where \(r(\theta)\) is the probability ratio:
\[r(\theta)\ =\ \frac {\pi(a|s,\theta)} {\pi(a|s,\theta_\text{old})}\]- ‘cross_entropy’
Straightforward categorical cross-entropy (from logits). This loss function does not make use of the advantages Adv. Instead, it minimizes the cross entropy between the behavior policy \(\pi_b(a|s)\) and the learned policy \(\pi_\theta(a|s)\):
\[J(\theta)\ =\ \hat{\mathbb{E}}_t\left\{ -\sum_a \pi_b(a|S_t)\, \log \pi_\theta(a|S_t) \right\}\]
- random_seed : int, optional
Sets the random state to get reproducible results.
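In other words, from_func builds both the policy and the value function from a single shared function approximator. A hedged usage sketch, with illustrative hyperparameter values:
# hedged sketch: build an actor-critic directly from a function approximator
actor_critic = km.ActorCritic.from_func(
    function_approximator, gamma=0.99, bootstrap_n=1, update_strategy='ppo')

a, v_s = actor_critic(env.reset())       # draw an action and estimate v(s) in one pass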
-
greedy
(self, s)¶ Draw a greedy action \(a=\arg\max_{a'}\pi(a'|s)\) and get the expected value \(v(s)\).
Parameters: - s : state observation
A single state observation.
Returns: - a, v : tuple (1d array of floats, float)
Returns a pair representing \((a, v(s))\).
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
-
update
(self, s, a, r, done)¶ Update both actor and critic.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
class keras_gym.SoftActorCritic(policy, v_func, q_func1, q_func2, value_loss_weight=1.0)[source]¶
Implementation of a soft actor-critic (SAC), which uses entropy regularization in the value function as well as in its policy updates.
Parameters: - policy : a policy object
An updateable policy object \(\pi(a|s)\).
- v_func : v-function object
A state value function. This is used as the entropy-regularized value function (critic).
- q_func1 : q-function object
A type-I state-action value function. This is used as the target for both the policy (actor) and the state value function (critic).
- q_func2 : q-function object
Same as q_func1. SAC uses two q-functions to avoid overfitting due to overly optimistic value estimates.
- value_loss_weight : float, optional
Relative weight to give to the value-function loss:
loss = policy_loss + value_loss_weight * value_loss
-
__call__
(self, s)¶ Draw an action from the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).
Parameters: - s : state observation
A single state observation.
Returns: - a, v : tuple (1d array of floats, float)
Returns a pair representing \((a, v(s))\).
-
batch_eval
(self, S, use_target_model=False)¶ Evaluate the actor-critic on a batch of state observations.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- use_target_model : bool, optional
Whether to use the target_model internally. If False (default), the predict_model is used.
Returns:
-
batch_update
(self, S, A, Rn, In, S_next, A_next=None)[source]¶ Update both actor and critic on a batch of transitions.
Parameters: - S : nd array, shape: [batch_size, …]
A batch of state observations.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken.
- Rn : 1d array, dtype: float, shape: [batch_size]
A batch of partial returns. For example, in n-step bootstrapping this is given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it’s the non-bootstrapped part of the n-step return.
- In : 1d array, dtype: float, shape: [batch_size]
A batch bootstrapping factor. For instance, in n-step bootstrapping this is given by \(I^{(n)}_t=\gamma^n\) if the episode is ongoing and \(I^{(n)}_t=0\) otherwise. This allows us to write the bootstrapped target as \(G^{(n)}_t=R^{(n)}_t+I^{(n)}_tQ(S_{t+n}, A_{t+n})\).
- S_next : nd array, shape: [batch_size, …]
A batch of next-state observations.
- A_next : 2d Tensor, shape: [batch_size, …]
A batch of (potential) next actions A_next. This argument is only used if
update_strategy='sarsa'
.
Returns: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
dist_params
(self, s)¶ Get the distribution parameters under the current policy \(\pi(a|s)\) and get the expected value \(v(s)\).
Parameters: - s : state observation
A single state observation.
Returns: - dist_params, v : tuple (1d array of floats, float)
Returns a pair representing the distribution parameters of \(\pi(a|s)\) and the estimated state value \(v(s)\).
-
classmethod
from_func
(function_approximator, gamma=0.9, bootstrap_n=1, q_type=None, entropy_beta=0.01, random_seed=None)[source]¶ Create instance directly from a
FunctionApproximator
object.Parameters: - function_approximator : FunctionApproximator object
The main function approximator.
- gamma : float, optional
The discount factor for discounting future rewards.
- bootstrap_n : positive int, optional
The number of steps in n-step bootstrapping. It specifies the number of steps over which we’re willing to delay bootstrapping. Large \(n\) corresponds to Monte Carlo updates and \(n=1\) corresponds to TD(0).
- q_type : 1 or 2, optional
Whether to model the q-function as type-I or type-II. This defaults to type-II for discrete action spaces and type-I otherwise.
- entropy_beta : float, optional
The coefficient of the entropy bonus term in the policy objective.
- random_seed : int, optional
Sets the random state to get reproducible results.
-
greedy
(self, s)¶ Draw a greedy action \(a=\arg\max_{a'}\pi(a'|s)\) and get the expected value \(v(s)\).
Parameters: - s : state observation
A single state observation.
Returns: - a, v : tuple (1d array of floats, float)
Returns a pair representing \((a, v(s))\).
-
sync_target_model
(self, tau=1.0)¶ Synchronize the target model with the primary model.
Parameters: - tau : float between 0 and 1, optional
The amount of exponential smoothing to apply in the target update:
\[w_\text{target}\ \leftarrow\ (1 - \tau)\,w_\text{target} + \tau\,w_\text{primary}\]
-
update
(self, s, a, r, done)¶ Update both actor and critic.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
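To see how these methods fit together, here is a minimal end-to-end sketch. The environment, the predefined linear function approximator and the hyperparameters are illustrative choices only; the sketch relies solely on the from_func(), __call__() and update() methods documented above.
import gym
import keras_gym as km

env = gym.make('CartPole-v0')
# a predefined function approximator shared by the actor and the critics (illustrative choice)
func = km.predefined.LinearFunctionApproximator(env, lr=0.01)
sac = km.SoftActorCritic.from_func(func, gamma=0.99, bootstrap_n=1)

for episode in range(250):
    s = env.reset()
    done = False
    while not done:
        a, v = sac(s)              # draw an action from pi(a|s) and get the value estimate v(s)
        s_next, r, done, info = env.step(a)
        sac.update(s, a, r, done)  # update both actor and critic
        s = s_next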
Policies¶
In reinforcement learning (RL), a policy can either be derived from a state-action value function or it can be learned directly as an updateable policy. These two approaches are called value-based and policy-based RL, respectively. The way we update our policies differs quite a bit between the two approaches.
For value-based RL, we have algorithms like TD(0), Monte Carlo and everything in between. The optimization problem that we use to update our function approximator is typically ordinary least-squares regression (or Huber loss).
In policy-based RL, on the other hand, we update our function approximators using direct policy gradient techniques. This makes the optimization problem quite different from ordinary supervised learning.
Below we list all policy objects provided by keras-gym.
Updateable Policies and Actor-Critics¶
For updateable policies and actor-critics, have a look at the relevant function approximator section.
Value-Based Policies¶
These policies are derived from a Q-function object. See example below:
import gym
import keras_gym as km
# the cart-pole MDP
env = gym.make('CartPole-v0')
# use linear function approximator for q(s,a)
func = km.predefined.LinearFunctionApproximator(env, lr=0.01)
q = km.Q(func, update_strategy='q_learning')
pi = km.policies.EpsilonGreedy(q, epsilon=0.1)
# get some dummy state observation
s = env.reset()
# draw an action, given state s
a = pi(s)
Special Policies¶
We’ve also got some special policies, which are policies that don’t depend on
any learned function approximator. The two main examples that are available
right now are RandomPolicy
and
UserInputPolicy
. The latter
allows you to pick the actions yourself as the episode runs.
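For example, a RandomPolicy only needs the environment; a minimal sketch of a random-play episode (the CartPole environment is purely illustrative):
import gym
import keras_gym as km

env = gym.make('CartPole-v0')
pi = km.policies.RandomPolicy(env, random_seed=13)

s = env.reset()
done = False
while not done:
    a = pi(s)                       # sample an action uniformly at random
    s, r, done, info = env.step(a)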
Objects¶
Value-Based Policies¶
keras_gym.policies.EpsilonGreedy |
Value-based policy to select actions using epsilon-greedy strategy. |
-
class
keras_gym.policies.
EpsilonGreedy
(q_function, epsilon=0.1, random_seed=None)[source]¶ Value-based policy to select actions using epsilon-greedy strategy.
Parameters: - q_function : callable
A state-action value function object.
- epsilon : float between 0 and 1
The probability of selecting an action uniformly at random.
- random_seed : int, optional
Sets the random state to get reproducible results.
-
__call__
(self, s)[source]¶ Draw an action from the current policy \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
Returns: - a : action
A single action proposed under the current policy.
-
dist_params
(self, s)[source]¶ Get the parameters of the (conditional) probability distribution \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
Returns: - params : nd array
An array containing the distribution parameters.
Special Policies¶
keras_gym.policies.RandomPolicy |
A policy that selects actions uniformly at random. |
keras_gym.policies.UserInputPolicy |
A policy that prompts the user to take an action. |
-
class
keras_gym.policies.
RandomPolicy
(env, random_seed=None)[source]¶ A policy that selects actions uniformly at random.
Parameters: - env : gym environment
The gym environment is used to sample from the action space.
- random_seed : int, optional
Sets the random state to get reproducible results.
-
__call__
(self, s)[source]¶ Draw an action from the current policy \(\pi(a|s)\).
Parameters: - s : state observation
A single state observation.
Returns: - a : action
A single action proposed under the current policy.
-
class
keras_gym.policies.
UserInputPolicy
(env, render_before_prompt=False)[source]¶ A policy that prompts the user to take an action.
Parameters: - env : gym environment
The gym environment is used to sample from the action space.
- render_before_prompt : bool, optional
Whether to render the env before prompting the user to pick an action.
Probability Distributions¶
This is a collection of probability distributions that can be used as part of a computation graph.
All methods are differentiable, including the sample()
method via the
reparametrization trick or variations thereof. This means that they may be used
in constructing loss functions that require quantities like (cross)entropy or
KL-divergence.
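As a small sketch (the logits below are arbitrary and only serve as an illustration), a CategoricalDist can be constructed from a batch of logits and then queried for these differentiable quantities:
from tensorflow.keras import backend as K
import keras_gym as km

logits = K.constant([[1.0, 0.5, -1.0, 0.0]])           # batch_size=1, num_categories=4
dist = km.proba_dists.CategoricalDist(logits)

entropy = dist.entropy()                               # 1d Tensor, shape: [batch_size]
sample = dist.sample()                                 # near one-hot, differentiable variates
logp = dist.log_proba(K.constant([[0., 1., 0., 0.]]))  # log-probability of a specific (one-hot) variate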
Objects¶
Differentiable Probability Distributions¶
keras_gym.proba_dists.CategoricalDist |
Differentiable implementation of a categorical distribution. |
keras_gym.proba_dists.NormalDist |
Implementation of a normal distribution. |
-
class
keras_gym.proba_dists.
CategoricalDist
(logits, boltzmann_tau=0.2, name='categorical_dist', random_seed=None)[source]¶ Differentiable implementation of a categorical distribution.
Parameters: - logits : 2d Tensor, dtype: float, shape: [batch_size, num_categories]
A batch of logits \(z\in \mathbb{R}^n\) with \(n=\)
num_categories
.- boltzmann_tau : float, optional
The Boltzmann temperature that is used in generating near one-hot propensities in
sample()
. A smaller number means closer to deterministic, one-hot encoded samples. A larger number means better numerical stability. A good value for \(\tau\) is one that offers a good trade-off between these two desired properties.- name : str, optional
Name scope of the distribution.
- random_seed : int, optional
To get reproducible results.
-
cross_entropy
(self, other)[source]¶ Compute the cross-entropy of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:
\[\text{CE}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\,\log p_\text{other}\]Parameters: - other : probability dist
The
other
probability dist must be of the same type asself
.
Returns: - cross_entropy : 1d Tensor, shape: [batch_size]
The cross-entropy.
-
entropy
(self)[source]¶ Compute the entropy of the probability distribution.
Parameters: - x : nd Tensor, shape: [batch_size, …]
A batch of specific variates.
Returns: - entropy : 1d Tensor, shape: [batch_size]
The entropy of the probability distribution.
-
kl_divergence
(self, other)[source]¶ Compute the Kullback-Leibler divergence of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:
\[\text{KL}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\, \log\frac{p_\text{other}}{p_\text{self}}\]Parameters: - other : probability dist
The
other
probability dist must be of the same type asself
.
Returns: - kl_divergence : 1d Tensor, shape: [batch_size]
The KL-divergence.
-
log_proba
(self, x)[source]¶ Compute the log-probability associated with specific variates.
Parameters: - x : nd Tensor, shape: [batch_size, …]
A batch of specific variates.
Returns: - log_proba : 1d Tensor, shape: [batch_size]
The log-probabilities.
-
sample
(self)[source]¶ Sample from the probability distribution. In order to return a differentiable sample, this method uses the approach outlined in [ArXiv:1611.01144].
Returns: - sample : 2d array, shape: [batch_size, num_categories]
The sampled variates. The returned arrays are near one-hot encoded versions of deterministic variates.
-
class
keras_gym.proba_dists.
NormalDist
(mu, logvar, name='normal_dist', random_seed=None)[source]¶ Implementation of a normal distribution.
Parameters: - mu : 1d Tensor, dtype: float, shape: [batch_size, n]
A batch of vectors of means \(\mu\in\mathbb{R}^n\).
- logvar : 1d Tensor, dtype: float, shape: [batch_size, n]
A batch of vectors of log-variances \(\log(\sigma^2)\in\mathbb{R}^n\)
- name : str, optional
Name scope of the distribution.
- random_seed : int, optional
To get reproducible results.
-
cross_entropy
(self, other)[source]¶ Compute the cross-entropy of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:
\[\text{CE}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\,\log p_\text{other}\]Parameters: - other : probability dist
The
other
probability dist must be of the same type asself
.
Returns: - cross_entropy : 1d Tensor, shape: [batch_size]
The cross-entropy.
-
entropy
(self)[source]¶ Compute the entropy of the probability distribution.
Parameters: - x : nd Tensor, shape: [batch_size, …]
A batch of specific variates.
Returns: - entropy : 1d Tensor, shape: [batch_size]
The entropy of the probability distribution.
-
kl_divergence
(self, other)[source]¶ Compute the Kullback-Leibler divergence of a probability distribution \(p_\text{other}\) relative to the current probability distribution \(p_\text{self}\), symbolically:
\[\text{KL}[p_\text{self}, p_\text{other}]\ =\ -\sum p_\text{self}\, \log\frac{p_\text{other}}{p_\text{self}}\]Parameters: - other : probability dist
The
other
probability dist must be of the same type asself
.
Returns: - kl_divergence : 1d Tensor, shape: [batch_size]
The KL-divergence.
Caching¶
In RL we often make use of data caching. This might be short-term caching, over the course of an episode, or it might be long-term caching as is done in experience replay.
Short-term Caching¶
Our short-term caching objects allow us to cache experience within an episode.
For instance MonteCarloCache
caches all transitions collected over an entire episode and then gives us back
the \(\gamma\)-discounted returns when the episode
finishes.
Another short-term caching object is NStepCache
, which keeps an \(n\)-sized sliding window
of transitions that allows us to do \(n\)-step bootstrapping.
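A rough usage sketch of NStepCache (the random behavior policy is a stand-in; in practice you would typically pop() transitions as soon as they become available and feed them to a value function's batch update):
import gym
import keras_gym as km

env = gym.make('CartPole-v0')
cache = km.caching.NStepCache(env, n=4, gamma=0.99)

s = env.reset()
done = False
while not done:
    a = env.action_space.sample()   # stand-in for a real behavior policy
    s_next, r, done, info = env.step(a)
    cache.add(s, a, r, done)        # record the transition in the sliding window
    s = s_next

# flush whatever remains in the cache at the end of the episode
S, A, Rn, In, S_next, A_next = cache.flush()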
Experience Replay Buffer¶
At the moment, we only have one long-term caching object, which is the
ExperienceReplayBuffer
.
This object can hold an arbitrary number of transitions; the only constraint is
the amount of available memory on your machine.
The way we learn from the experience stored in the
ExperienceReplayBuffer
is
by sampling from it and then feeding the batch of transitions to our
function approximator.
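A rough sketch of that workflow, using the Atari preprocessing wrappers described later on (the capacity, the random behavior policy and the commented-out batch update are illustrative assumptions):
import gym
import keras_gym as km

env = gym.make('PongDeterministic-v4')
env = km.wrappers.ImagePreprocessor(env, height=105, width=80, grayscale=True)
env = km.wrappers.FrameStacker(env, num_frames=4)

buffer = km.caching.ExperienceReplayBuffer(env, capacity=100000, batch_size=32, gamma=0.99)

for ep in range(10):
    s = env.reset()
    done = False
    while not done:
        a = env.action_space.sample()             # stand-in for a real behavior policy
        s_next, r, done, info = env.step(a)
        buffer.add(s, a, r, done, episode_id=ep)  # episode_id keeps sampled transitions consistent
        s = s_next

# once the buffer holds enough transitions, sample batches and learn from them
S, A, Rn, In, S_next, A_next = buffer.sample()
# q.batch_update(S, A, Rn, In, S_next, A_next)    # assumed: a Q-function exposing a batch update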
Objects¶
Short-term Caching¶
keras_gym.caching.MonteCarloCache |
Short-term cache that collects an entire episode and returns the \(\gamma\)-discounted returns when flushed. |
keras_gym.caching.NStepCache |
A convenient helper class for n-step bootstrapping. |
-
class
keras_gym.caching.
MonteCarloCache
(env, gamma)[source]¶ -
add
(self, s, a, r, done)[source]¶ Add a transition to the experience cache.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
-
flush
(self)[source]¶ Flush all transitions from the cache.
Returns: - S, A, G : tuple of arrays
The returned tuple represents a batch of preprocessed transitions:
-
-
class
keras_gym.caching.
NStepCache
(env, n, gamma)[source]¶ A convenient helper class for n-step bootstrapping.
Parameters: - env : gym environment
The main gym environment. This is needed to determine
num_actions
.- n : positive int
The number of steps over which to bootstrap.
- gamma : float between 0 and 1
The amount by which to discount future rewards.
-
add
(self, s, a, r, done)[source]¶ Add a transition to the experience cache.
Parameters: - s : state observation
A single state observation.
- a : action
A single action.
- r : float
A single observed reward.
- done : bool
Whether the episode has finished.
-
flush
(self)[source]¶ Flush all transitions from the cache.
Returns: - S, A, Rn, In, S_next, A_next : tuple of arrays
The returned tuple represents a batch of preprocessed transitions:
These are typically used for bootstrapped updates, e.g. minimizing the bootstrapped MSE:
\[\left( R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n},A_{t+n}) - Q(S_t,A_t) \right)^2\]
-
pop
(self)[source]¶ Pop a single transition from the cache.
Returns: - S, A, Rn, In, S_next, A_next : tuple of arrays, batch_size=1
The returned tuple represents a batch of preprocessed transitions:
These are typically used for bootstrapped updates, e.g. minimizing the bootstrapped MSE:
\[\left( R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n},A_{t+n}) - Q(S_t,A_t) \right)^2\]
Experience Replay¶
keras_gym.caching.ExperienceReplayBuffer |
A simple numpy implementation of an experience replay buffer. |
-
class
keras_gym.caching.
ExperienceReplayBuffer
(env, capacity, batch_size=32, bootstrap_n=1, gamma=0.99, random_seed=None)[source]¶ A simple numpy implementation of an experience replay buffer. This is written primarily with computer game environments (Atari) in mind.
It implements a generic experience replay buffer for environments in which individual observations (frames) are stacked to represent the state.
Parameters: - env : gym environment
The main gym environment. This is needed to infer the number of stacked frames
num_frames
as well as the number of actionsnum_actions
.- capacity : positive int
The capacity of the experience replay buffer. DQN typically uses
capacity=1000000
.- batch_size : positive int, optional
The desired batch size of the sample.
- bootstrap_n : positive int
The number of steps over which to delay bootstrapping, i.e. n-step bootstrapping.
- gamma : float between 0 and 1
Reward discount factor.
- random_seed : int or None
To get reproducible results.
-
add
(self, s, a, r, done, episode_id)[source]¶ Add a transition to the experience replay buffer.
Parameters: - s : state
A single state observation.
- a : action
A single action.
- r : float
The observed rewards associated with this transition.
- done : bool
Whether the episode has finished.
- episode_id : int
The episode in which the transition took place. This is needed for generating consistent samples.
-
classmethod
from_value_function
(value_function, capacity, batch_size=32)[source]¶ Create a new instance by extracting some settings from a Q-function.
The settings that are extracted from the value function are:
gamma
,bootstrap_n
andnum_frames
. The latter is taken from the value function’senv
attribute.Parameters: - value_function : value-function object
A state value function or a state-action value function.
- capacity : positive int
The capacity of the experience replay buffer. DQN typically uses
capacity=1000000
.- batch_size : positive int, optional
The desired batch size of the sample.
Returns: - experience_replay_buffer
A new instance.
-
sample
(self)[source]¶ Get a batch of transitions to be used for bootstrapped updates.
Returns: - S, A, Rn, In, S_next, A_next : tuple of arrays
The returned tuple represents a batch of preprocessed transitions:
These are typically used for bootstrapped updates, e.g. minimizing the bootstrapped MSE:
\[\left( R^{(n)}_t + I^{(n)}_t\,\sum_aP(a|S_{t+n})\,Q(S_{t+n},a) - \sum_aP(a|S_t)\,Q(S_t,a) \right)^2\]
Planning¶
keras-gym provides planning methods. The only planning method that is currently implemented is the variant of Monte Carlo tree search (MCTS) that is used in AlphaZero. The goal is to implement more planning methods in the near future.
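A rough usage sketch (the actor_critic object is assumed to be an already-trained keras-gym actor-critic for a self-play environment such as ConnectFourEnv; the number of searches per move is arbitrary):
import keras_gym as km

# `actor_critic` is an assumed, pre-trained keras-gym actor-critic
node = km.planning.MCTSNode(actor_critic, random_seed=13)

done = False
while not done:
    node.search(n=128)               # run 128 select/expand/backup searches
    s, a, pi, r, done = node.play()  # play one move based on the visit counts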
Objects¶
Monte Carlo Tree Search¶
keras_gym.planning.MCTSNode |
Implementation of Monte Carlo tree search used in AlphaZero. |
-
class
keras_gym.planning.
MCTSNode
(actor_critic, state_id=None, tau=1.0, v_resign=0.999, c_puct=1.0, random_seed=None)[source]¶ Implementation of Monte Carlo tree search used in AlphaZero.
Parameters: - state_id : str
The state id of the env, which allows us to set the env to the correct state.
- actor_critic : ActorCritic object
The actor-critic that is used to evaluate leaf nodes.
- tau : float, optional
The temperature parameter used in the ‘behavior’ policy:
\[\pi(a|s)\ =\ \frac{N(s,a)^{1/\tau}}{\sum_{a'}N(s,a')^{1/\tau}}\]- v_resign : float, optional
The value we use to determine whether a player should resign before a game ends. Namely, the player will resign if the predicted value drops below \(v(s) < v_\text{resign}\).
- c_puct : float, optional
A hyperparameter that determines how to balance exploration and exploitation. It appears in the selection criterion during the search phase:
\[a\ =\ \arg\max_{a'}\left( Q(s,a) + U(s,a) \right)\]where
\[\begin{split}Q(s,a)\ &=\ \frac1{N(s)} \sum_{s'\in\text{desc}(s,a)} v(s') \\ U(s,a)\ &=\ \color{red}{c_\text{puct}}\,P(s, a)\, \frac{\sqrt{N(s)}}{1+N(s,a)}\end{split}\]Here \(\text{desc}(s,a)\) denotes the set of all the previously evaluated descendant states of the state \(s\) that can be reached by taking action \(a\). The value and prior probabilities \(v(s)\) and \(P(s,a)\) are generated by the actor-critic. Also, we use the short-hand notation for the combined state-action visit counts:
\[N(s)\ =\ \sum_{a'} N(s,a')\]Note that this is not exactly the state visit count, which would be \(N(s) + 1\) due to the initial selection and expansion of the root node itself.
- random_seed : int, optional
Sets the random state to get reproducible results.
Attributes: - env : gym-style environment
The main environment of the game.
- state : state observation
The current state of the environment.
- num_actions : int
The number of actions of the environment, i.e. regardless of whether these actions are actually available in the current state.
- is_root : bool
Whether the current node is a root node, i.e. whether it has a parent node.
- is_leaf : bool
Whether the current node is a leaf node. A leaf node is typically a node that was previously unexplored, but it may also be a terminal state node.
- is_terminal : bool
Whether the current state is a terminal state.
- parent_node : MCTSNode object
The parent node. This is used to traverse back up the tree.
- parent_action : int
Which action led to the current state from the parent state. This is used to inform the parent which child is responsible for the update in the backup phase of the search procedure.
- children : dict
A dictionary that contains all the child states accessible from the current state, format:
{action <int>: child <MCTSNode>}
.- N : 1d array, dtype: int, shape: [num_actions]
The state-action visit count \(N(s,a)\).
- P : 1d array, dtype: float, shape: [num_actions]
The prior probabilities over the space of actions \(P(s,a)\), which are generated by the actor-critic function approximator.
- U : 1d array, dtype: float, shape: [num_actions]
The UCT exploration term, which is a vector over the space of actions:
\[U(s,a)\ =\ c_\text{puct}\,P(s,a)\, \frac{\sqrt{N(s)}}{1+N(s,a)}\]- Q : 1d array, dtype: float, shape: [num_actions]
The UCT exploitation term, which is a vector over the space of actions:
\[Q(s,a)\ =\ \frac{W(s,a)}{N(s, a)}\]- W : 1d array, dtype: float, shape: [num_actions]
This is the accumulator for the numerator of the UCT exploitation term \(Q(s,a)\). It is a sum of all of the values generated by starting from \(s\), taking action \(a\):
\[W(s,a)\ =\ v(s) + \sum_{s'\in\text{desc}(s,a)} v(s')\]
Here \(\text{desc}(s,a)\) denotes the set of all the previously evaluated descendant states of the state \(s\) that can be reached by taking action \(a\). The prior values \(v(s)\) and \(v(s')\) are generated by the actor-critic function approximator.
- D : 1d array, dtype: bool, shape: [num_actions]
This contains the
done
flags for each child state, i.e. whether each child state is a terminal state.
-
backup
(self, v)[source]¶ Back-up the newly found leaf node value up the tree.
Parameters: - v : float
The value of the newly expanded leaf node.
-
expand
(self)[source]¶ Expand tree, i.e. promote leaf node to a non-leaf node.
Returns: - v : float
The value of the leaf node as predicted by the actor-critic.
-
play
(self, tau=None)[source]¶ Play one move/action.
Parameters: - tau : float, optional
The temperature parameter used in the ‘behavior’ policy:
\[\pi(a|s)\ =\ \frac{N(s,a)^{1/\tau}}{\sum_{a'}N(s,a')^{1/\tau}}\]If left unspecified,
tau
defaults to the instance setting.
Returns: - s, a, pi, r, done : tuple
The return values represent the following quantities:
- s : state observation
The state \(s\) from which the action was taken.
- a : action
The specific action \(a\) taken from that state \(s\).
- pi : 1d array, dtype: float, shape: [num_actions]
The action probabilities \(\pi(.|s)\) that were used.
- r : float
The reward received in the transition \((s,a)\to s_\text{next}\)
- done : bool
A flag that indicates that either the game has finished or the actor-critic predicted a value that is below the cutoff value \(v(s) < v_\text{resign}\).
-
search
(self, n=512)[source]¶ Perform \(n\) searches.
Each search consists of three consecutive steps:
select()
,expand()
andbackup()
.Parameters: - n : int, optional
The number of searches to perform.
Wrappers¶
OpenAI gym provides a nice modular interface to extend existing environments using environment wrappers. Here we list some wrappers that are used throughout the keras-gym package.
Preprocessors¶
The default preprocessor tries to create a feature vector from any environment state observation on a best-effort basis. For instance, if the observation space is discrete \(s\in\{0, 1, \dots, n-1\}\), it will create a one-hot encoded vector such that the wrapped environment yields state observations \(s\in\mathbb{R}^n\).
import gym
import keras_gym as km
env = gym.make('FrozenLake-v0')
env = km.wrappers.DefaultPreprocessor(env)
s = env.unwrapped.reset() # s == 0
s = env.reset() # s == [1, 0, 0, ..., 0]
Other preprocessors that are particularly useful when dealing with video input
are ImagePreprocessor
and
FrameStacker
. For instance, for
Atari 2600 environments we usually apply preprocessing as follows:
env = gym.make('PongDeterministic-v4')
env = km.wrappers.ImagePreprocessor(env, height=105, width=80, grayscale=True)
env = km.wrappers.FrameStacker(env, num_frames=4)
s = env.unwrapped.reset() # s.shape == (210, 160, 3)
s = env.reset() # s.shape == (105, 80, 4)
The first wrapper downscales and converts each input frame to grayscale. The second wrapper then stacks consecutive frames together, which allows the function approximator to learn velocities/accelerations as well as positions for each input pixel.
Monitors¶
Another type of environment wrapper is a monitor, which is used to keep track
of the progress of the training process. At the moment, keras-gym only
provides a generic train monitor called TrainMonitor.
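Wrapping an environment with TrainMonitor is a one-liner; a minimal sketch (the tensorboard directory is an arbitrary example path):
import gym
import keras_gym as km

env = gym.make('CartPole-v0')
env = km.wrappers.TrainMonitor(env)  # logs diagnostics at the end of each episode
# or, with tensorboard logging:
# env = km.wrappers.TrainMonitor(env, tensorboard_dir='./data/tensorboard')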
Objects¶
Preprocessors¶
keras_gym.wrappers.BoxActionsToReals |
This wrapper decompactifies a Box action space to the reals. |
keras_gym.wrappers.ImagePreprocessor |
Preprocessor for images. |
keras_gym.wrappers.FrameStacker |
Stack multiple frames into one state observation. |
-
class
keras_gym.wrappers.
BoxActionsToReals
(env)[source]¶ This wrapper decompactifies a
Box
action space to the reals. This is required in order to be able to use aGaussianPolicy
.In practice, the wrapped environment expects the input action \(a_\text{real}\in\mathbb{R}^n\) and then it compactifies it back to a Box of the right size:
\[a_\text{box}\ =\ \text{low} + (\text{high}-\text{low}) \times\text{sigmoid}(a_\text{real})\]
Technically, the transformed space is still a Box, but that’s only because we assume that the values lie between large but finite bounds, \(a_\text{real}\in[-10^{15}, 10^{15}]^n\).
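In plain numpy terms, the compactification above amounts to something like the following standalone sketch (not the wrapper's actual implementation):
import numpy as np

def squash_to_box(a_real, low, high):
    # low + (high - low) * sigmoid(a_real), applied element-wise
    return low + (high - low) / (1.0 + np.exp(-a_real))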
-
close
(self)¶ Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
-
render
(self, mode='human', **kwargs)¶ Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
- human: render to the current display or terminal and return nothing. Usually for human consumption.
- rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
- ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
- Note:
- Make sure that your class’s metadata ‘render.modes’ key includes
- the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Args:
- mode (str): the mode to render with
Example:
class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}
    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
-
reset
(self, **kwargs)¶ Resets the state of the environment and returns an initial observation.
- Returns:
- observation (object): the initial observation.
-
seed
(self, seed=None)¶ Sets the seed for this env’s random number generator(s).
- Note:
- Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns:
- list<bigint>: Returns the list of seeds used in this env’s random
- number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
-
step
(self, a)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Args:
- action (object): an action provided by the agent
- Returns:
- observation (object): agent’s observation of the current environment
- reward (float): amount of reward returned after previous action
- done (bool): whether the episode has ended, in which case further step() calls will return undefined results
- info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
-
unwrapped
¶ Completely unwrap this env.
- Returns:
- gym.Env: The base non-wrapped gym.Env instance
-
-
class
keras_gym.wrappers.
ImagePreprocessor
(env, height, width, grayscale=True, assert_input_shape=None)[source]¶ Preprocessor for images.
This preprocessing is adapted from this blog post:
Parameters: - env : gym environment
A gym environment.
- height : positive int
Output height (number of pixels).
- width : positive int
Output width (number of pixels).
- grayscale : bool, optional
Whether to convert RGB image to grayscale.
- assert_input_shape : shape tuple, optional
If provided, the preprocessor will assert the given input shape.
-
close
(self)¶ Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
-
render
(self, mode='human', **kwargs)¶ Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
- human: render to the current display or terminal and return nothing. Usually for human consumption.
- rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
- ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
- Note:
- Make sure that your class’s metadata ‘render.modes’ key includes
- the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Args:
- mode (str): the mode to render with
Example:
class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}
    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
-
reset
(self)[source]¶ Resets the state of the environment and returns an initial observation.
- Returns:
- observation (object): the initial observation.
-
seed
(self, seed=None)¶ Sets the seed for this env’s random number generator(s).
- Note:
- Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns:
- list<bigint>: Returns the list of seeds used in this env’s random
- number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
-
step
(self, a)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Args:
- action (object): an action provided by the agent
- Returns:
- observation (object): agent’s observation of the current environment
- reward (float): amount of reward returned after previous action
- done (bool): whether the episode has ended, in which case further step() calls will return undefined results
- info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
-
unwrapped
¶ Completely unwrap this env.
- Returns:
- gym.Env: The base non-wrapped gym.Env instance
-
class
keras_gym.wrappers.
FrameStacker
(env, num_frames=4)[source]¶ Stack multiple frames into one state observation.
Parameters: - env : gym environment
A gym environment.
- num_frames : positive int, optional
Number of frames to stack in order to build a state feature vector.
-
close
(self)¶ Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
-
render
(self, mode='human', **kwargs)¶ Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
- human: render to the current display or terminal and return nothing. Usually for human consumption.
- rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
- ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
- Note:
- Make sure that your class’s metadata ‘render.modes’ key includes
- the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Args:
- mode (str): the mode to render with
Example:
class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}
    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
-
reset
(self)[source]¶ Resets the state of the environment and returns an initial observation.
- Returns:
- observation (object): the initial observation.
-
seed
(self, seed=None)¶ Sets the seed for this env’s random number generator(s).
- Note:
- Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns:
- list<bigint>: Returns the list of seeds used in this env’s random
- number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
-
step
(self, a)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Args:
- action (object): an action provided by the agent
- Returns:
- observation (object): agent’s observation of the current environment
- reward (float): amount of reward returned after previous action
- done (bool): whether the episode has ended, in which case further step() calls will return undefined results
- info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
-
unwrapped
¶ Completely unwrap this env.
- Returns:
- gym.Env: The base non-wrapped gym.Env instance
Monitors¶
keras_gym.wrappers.TrainMonitor |
Environment wrapper for monitoring the training process. |
-
class
keras_gym.wrappers.
TrainMonitor
(env, tensorboard_dir=None)[source]¶ Environment wrapper for monitoring the training process.
This wrapper logs some diagnostics at the end of each episode and it also gives us some handy attributes (listed below).
Parameters: - env : gym environment
A gym environment.
- tensorboard_dir : str, optional
If provided, TrainMonitor will log all diagnostics to be viewed in tensorboard. To view these, point tensorboard to the same dir:
$ tensorboard --logdir {tensorboard_dir}
Attributes: - T : positive int
Global step counter. This is not reset by
env.reset()
, useenv.reset_global()
instead.- ep : positive int
Global episode counter. This is not reset by
env.reset()
, useenv.reset_global()
instead.- t : positive int
Step counter within an episode.
- G : float
The return, i.e. amount of reward accumulated from the start of the current episode.
- avg_G : float
The average return G, averaged over the past 100 episodes.
- dt_ms : float
The average wall time of a single step, in milliseconds.
-
close
(self)¶ Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
-
record_losses
(self, losses)[source]¶ Record losses during the training process.
These are used to print more diagnostics.
Parameters: - losses : dict
A dict of losses/metrics, of type
{name <str>: value <float>}
.
-
render
(self, mode='human', **kwargs)¶ Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
- human: render to the current display or terminal and return nothing. Usually for human consumption.
- rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
- ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
- Note:
- Make sure that your class’s metadata ‘render.modes’ key includes
- the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Args:
- mode (str): the mode to render with
Example:
class MyEnv(Env):
    metadata = {'render.modes': ['human', 'rgb_array']}
    def render(self, mode='human'):
        if mode == 'rgb_array':
            return np.array(...)  # return RGB frame suitable for video
        elif mode == 'human':
            ...  # pop up a window and render
        else:
            super(MyEnv, self).render(mode=mode)  # just raise an exception
-
reset
(self)[source]¶ Resets the state of the environment and returns an initial observation.
- Returns:
- observation (object): the initial observation.
-
seed
(self, seed=None)¶ Sets the seed for this env’s random number generator(s).
- Note:
- Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns:
- list<bigint>: Returns the list of seeds used in this env’s random
- number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
-
step
(self, a)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Args:
- action (object): an action provided by the agent
- Returns:
- observation (object): agent’s observation of the current environment
- reward (float): amount of reward returned after previous action
- done (bool): whether the episode has ended, in which case further step() calls will return undefined results
- info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
-
unwrapped
¶ Completely unwrap this env.
- Returns:
- gym.Env: The base non-wrapped gym.Env instance
Environments¶
This is a collection of environments currently not included in OpenAI Gym.
Self-Play Environments¶
These environments are typically games. They are implemented in such a way that
they can be played from a single-player perspective. The environment switches the
current player and opponent between turns. The way to picture this is that
the environment swaps the colors of all pieces between turns, so that the agent
always gets the perspective of the player whose turn it is. The first such
environment we include is the ConnectFourEnv
.
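A bare-bones interaction sketch (the chosen column is arbitrary):
import keras_gym as km

env = km.envs.ConnectFourEnv()
s = env.reset()                  # board from the current player's perspective, plus the action mask
s, r, done, info = env.step(3)   # drop a token into column 3 (zero-based, counted from the left)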
Objects¶
Self-Play Environments¶
keras_gym.envs.ConnectFourEnv |
An adversarial environment for playing the Connect-Four game. |
-
class
keras_gym.envs.
ConnectFourEnv
[source]¶ An adversarial environment for playing the Connect-Four game.
Attributes: - action_space : gym.spaces.Discrete(7)
The action space.
- observation_space : MultiDiscrete(nvec)
The state observation space, representing the position of the current player’s tokens (
s[1:,:,0]
) and the other player’s tokens (s[1:,:,1]
) as well as a mask over the space of actions, indicating which actions are available to the current player (s[0,:,0]
) or the other player (s[0,:,1]
).Note: The “current” player is relative to whose turn it is, which means that the entries
s[:,:,0]
ands[:,:,1]
swap between turns.- max_time_steps : int
Maximum number of timesteps within each episode.
- available_actions : array of int
Array of available actions. This list shrinks when columns saturate.
- win_reward : 1.0
The reward associated with a win.
- loss_reward : -1.0
The reward associated with a loss.
- draw_reward : 0.0
The reward associated with a draw.
-
close
(self)¶ Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
-
reset
(self)[source]¶ Reset the environment to the starting position.
Returns: - s : 3d-array, shape: [num_rows + 1, num_cols, num_players]
A state observation, representing the position of the current player’s tokens (
s[1:,:,0]
) and the other player’s tokens (s[1:,:,1]
) as well as a mask over the space of actions, indicating which actions are available to the current player (s[0,:,0]
) or the other player (s[0,:,1]
).Note: The “current” player is relative to whose turn it is, which means that the entries
s[:,:,0]
ands[:,:,1]
swap between turns.
-
seed
(self, seed=None)¶ Sets the seed for this env’s random number generator(s).
- Note:
- Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.
- Returns:
- list<bigint>: Returns the list of seeds used in this env’s random
- number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.
-
step
(self, a)[source]¶ Take one step in the MDP, following the single-player convention from gym.
Parameters: - a : int, options: {0, 1, 2, 3, 4, 5, 6}
The action to be taken. The action is the zero-based count of the possible insertion slots, starting from the left of the board.
Returns: - s_next : array, shape [6, 7, 2]
A next-state observation, representing the position of the current player’s tokens (
s[1:,:,0]
) and the other player’s tokens (s[1:,:,1]
) as well as a mask over the space of actions, indicating which actions are available to the current player (s[0,:,0]
) or the other player (s[0,:,1]
).Note: The “current” player is relative to whose turn it is, which means that the entries
s[:,:,0]
ands[:,:,1]
swap between turns.- r : float
Reward associated with the transition \((s, a)\to s_\text{next}\).
Note: Since “current” player is relative to whose turn it is, you need to be careful about aligning the rewards with the correct state or state-action pair. In particular, this reward \(r\) is the one associated with the \(s\) and \(a\), i.e. not aligned with \(s_\text{next}\).
- done : bool
Whether the episode is done.
- info : dict or None
A dict with some extra information (or None).
-
unwrapped
¶ Completely unwrap this env.
- Returns:
- gym.Env: The base non-wrapped gym.Env instance
Loss Functions¶
This is a collection of custom keras-compatible loss functions that are used throughout this package.
Note
These functions generally require the Tensorflow backend.
Value Losses¶
These loss functions can be applied to learning a value function. Most of the losses are actually already provided by keras. The value-function losses included here are minor adaptations of the available keras losses.
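For instance, a value loss such as RootMeanSquaredError can be passed to a Keras model in the same way as a built-in loss; a minimal sketch (the model is an arbitrary stand-in for a value network):
from tensorflow import keras
import keras_gym as km

# an arbitrary stand-in for a small state value network
model = keras.Sequential([keras.layers.Dense(16, activation='relu'), keras.layers.Dense(1)])
model.compile(optimizer='adam', loss=km.losses.RootMeanSquaredError())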
Policy Losses¶
The way policy losses are implemented is slightly different from value losses
due to their non-standard structure. A policy loss is implemented in a method
on updateable policy objects (see below). If you need to implement a
custom policy loss, you can override this policy_loss_with_metrics()
method.
-
BaseUpdateablePolicy.
policy_loss_with_metrics
(self, Adv, A=None)[source]¶ This method constructs the policy loss as a scalar-valued Tensor, together with a dictionary of metrics (also scalars).
This method may be overridden to construct a custom policy loss and/or to change the accompanying metrics.
Parameters: - Adv : 1d Tensor, shape: [batch_size]
A batch of advantages.
- A : nd Tensor, shape: [batch_size, …]
A batch of actions taken under the behavior policy. For some choices of policy loss, e.g.
update_strategy='sac'
this input is ignored.
Returns: - loss, metrics : (Tensor, dict of Tensors)
The policy loss along with some metrics, which is a dict of type
{name <str>: metric <Tensor>}
. The loss and each of the metrics (dict values) are scalar Tensors, i.e. Tensors withndim=0
.The
loss
is passed to a keras Model usingtrain_model.add_loss(loss)
. Similarly, each metric in the metric dict is passed to the model usingtrain_model.add_metric(metric, name=name, aggregation='mean')
.
Objects¶
Value Losses¶
keras_gym.losses.ProjectedSemiGradientLoss |
Loss function for type-II Q-function. |
keras_gym.losses.RootMeanSquaredError |
Root-mean-squared error (RMSE) loss. |
keras_gym.losses.LoglossSign |
Logloss implemented for predicted logits \(z\in\mathbb{R}\) and ground truth \(y\pm1\). |
-
class
keras_gym.losses.
ProjectedSemiGradientLoss
(G, base_loss=<tensorflow.python.keras.losses.Huber object>)[source]¶ Loss function for type-II Q-function.
This loss function projects the predictions \(q(s, .)\) onto the actions for which we actually received a feedback signal.
Parameters: - G : 1d Tensor, dtype: float, shape: [batch_size]
The returns that we wish to fit our value function on.
- base_loss : keras loss function, optional
Keras loss function. Default:
huber_loss
.
-
__call__
(self, A, Q_pred, sample_weight=None)[source]¶ Compute the projected MSE.
Parameters: - A : 2d Tensor, dtype: int, shape: [batch_size, num_actions]
A batch of (one-hot encoded) discrete actions A.
- Q_pred : 2d Tensor, shape: [batch_size, num_actions]
The predicted values \(q(s,.)\), a.k.a.
y_pred
.- sample_weight : Tensor, dtype: float, optional
Tensor whose rank is either 0 or is broadcastable to
y_true
.sample_weight
acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. Ifsample_weight
is a tensor of size[batch_size]
, then the total loss for each sample of the batch is rescaled by the corresponding element in thesample_weight
vector.
Returns: - loss : 0d Tensor (scalar)
The batch loss.
-
call
(self, y_true, y_pred)¶ Invokes the Loss instance.
- Args:
- y_true: Ground truth values, with the same shape as ‘y_pred’. y_pred: The predicted values.
-
class
keras_gym.losses.
RootMeanSquaredError
(delta=1.0, name='root_mean_squared_error')[source]¶ Root-mean-squared error (RMSE) loss.
Parameters: - name : str, optional
Optional name for the op.
-
__call__
(self, y_true, y_pred, sample_weight=None)[source]¶ Compute the RMSE loss.
Parameters: - y_true : Tensor, shape: [batch_size, …]
Ground truth values.
- y_pred : Tensor, shape: [batch_size, …]
The predicted values.
- sample_weight : Tensor, dtype: float, optional
Tensor whose rank is either 0, or the same rank as
y_true
, or is broadcastable toy_true
.sample_weight
acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. Ifsample_weight
is a tensor of size[batch_size]
, then the total loss for each sample of the batch is rescaled by the corresponding element in thesample_weight
vector. If the shape of sample_weight matches the shape ofy_pred
, then the loss of each measurable element ofy_pred
is scaled by the corresponding value ofsample_weight
.
Returns: - loss : 0d Tensor (scalar)
The batch loss.
-
call
(self, y_true, y_pred)¶ Invokes the Loss instance.
- Args:
- y_true: Ground truth values, with the same shape as ‘y_pred’. y_pred: The predicted values.
-
class
keras_gym.losses.
LoglossSign
[source]¶ Logloss implemented for predicted logits \(z\in\mathbb{R}\) and ground truth \(y\pm1\).
\[L\ =\ \log\left( 1 + \exp(-y\,z) \right)\]-
__call__
(self, y_true, z_pred, sample_weight=None)[source]¶ Parameters: - y_true : Tensor, shape: [batch_size, …]
Ground truth values \(y\pm1\).
- z_pred : Tensor, shape: [batch_size, …]
The predicted logits \(z\in\mathbb{R}\).
- sample_weight : Tensor, dtype: float, optional
Not yet implemented.
#TODO: implement this
-
call
(self, y_true, y_pred)¶ Invokes the Loss instance.
- Args:
- y_true: Ground truth values, with the same shape as ‘y_pred’. y_pred: The predicted values.
-
Utilities¶
The helper functions are organized by what objects they act on. The three categories are tensor helpers, numpy-array helpers and miscellaneous.
Objects¶
Miscellaneous Utilities¶
keras_gym.utils.enable_logging |
Enable logging output. |
keras_gym.utils.generate_gif |
Store a gif from the episode frames. |
keras_gym.utils.get_env_attr |
Get the given attribute from a potentially wrapped environment. |
keras_gym.utils.get_transition |
Generate a transition from the environment. |
keras_gym.utils.has_env_attr |
Check if a potentially wrapped environment has a given attribute. |
keras_gym.utils.is_policy |
Check whether an object is an (updateable) policy. |
keras_gym.utils.is_qfunction |
Check whether an object is a state-action value function, or Q-function. |
keras_gym.utils.is_vfunction |
Check whether an object is a state value function, or V-function. |
keras_gym.utils.render_episode |
Run a single episode with env.render() calls at each time step. |
keras_gym.utils.set_tf_loglevel |
Set the logging level for Tensorflow logger. |
-
keras_gym.utils.
enable_logging
(level=20, level_tf=40)[source]¶ Enable logging output.
This executes the following two lines of code:
import logging
logging.basicConfig(level=logging.INFO)
set_tf_loglevel(logging.ERROR)
Note that
set_tf_loglevel()
is another keras-gym utility function.Parameters: - level : int, optional
Log level for native python logging. For instance, if you’d like to see more verbose logging messages you might set
level=logging.DEBUG
.- level_tf : int, optional
Log level for tensorflow-specific logging (logs coming from the C++ layer).
-
keras_gym.utils.
generate_gif
(env, policy, filepath, resize_to=None, duration=50)[source]¶ Store a gif from the episode frames.
Parameters: - env : gym environment
The environment to record from.
- policy : keras-gym policy object
The policy that is used to take actions.
- filepath : str
Location of the output gif file.
- resize_to : tuple of ints, optional
The size of the output frames,
(width, height)
. Notice the ordering: first width, then height. This is the convention PIL uses.- duration : float, optional
Time between frames in the animated gif, in milliseconds.
-
keras_gym.utils.
get_env_attr
(env, attr, default='__ERROR__', max_depth=100)[source]¶ Get the given attribute from a potentially wrapped environment.
Note that the wrapped envs are traversed from the outside in. Once the attribute is found, the search stops. This means that an inner wrapped env may carry the same (possibly conflicting) attribute. This situation is not resolved by this function.
Parameters: - env : gym environment
A potentially wrapped environment.
- attr : str
The attribute name.
- max_depth : positive int, optional
The maximum depth of wrappers to traverse.
-
keras_gym.utils.
get_transition
(env)[source]¶ Generate a transition from the environment.
This basically does a single step on the environment and then closes it.
Parameters: - env : gym environment
A gym environment.
Returns: - s, a, r, s_next, a_next, done, info : tuple
A single transition. Note that the order and the number of items returned are different from what
env.reset()
returns.
-
keras_gym.utils.
has_env_attr
(env, attr, max_depth=100)[source]¶ Check if a potentially wrapped environment has a given attribute.
Parameters: - env : gym environment
A potentially wrapped environment.
- attr : str
The attribute name.
- max_depth : positive int, optional
The maximum depth of wrappers to traverse.
-
keras_gym.utils.
is_policy
(obj, check_updateable=False)[source]¶ Check whether an object is an (updateable) policy.
Parameters: - obj
Object to check.
- check_updateable : bool, optional
If the obj is a policy, also check whether or not the policy is updateable.
Returns: - bool
Whether
obj
is a (updateable) policy.
-
keras_gym.utils.
is_qfunction
(obj, qtype=None)[source]¶ Check whether an object is a state-action value function, or Q-function.
Parameters: - obj
Object to check.
- qtype : 1 or 2, optional
If provided, check for this specific Q-function type (type-I or type-II).
Returns: - bool
Whether
obj
is a (type-I/II) Q-function.
-
keras_gym.utils.
is_vfunction
(obj)[source]¶ Check whether an object is a state value function, or V-function.
Parameters: - obj
Object to check.
Returns: - bool
Whether
obj
is a V-function.
-
keras_gym.utils.
render_episode
(env, policy, step_delay_ms=0)[source]¶ Run a single episode with env.render() calls at each time step.
Parameters: - env : gym environment
A gym environment.
- policy : callable
A policy objects that is used to pick actions:
a = policy(s)
.- step_delay_ms : non-negative float
The number of milliseconds to wait between consecutive timesteps. This can be used to slow down the rendering.
Numpy-Array Utilities¶
keras_gym.utils.argmax |
This is a little hack to ensure that argmax breaks ties randomly, which is something that numpy.argmax() doesn’t do. |
keras_gym.utils.argmin |
This is a little hack to ensure that argmin breaks ties randomly, which is something that numpy.argmin() doesn’t do. |
keras_gym.utils.box_to_reals_np |
Transform array values from a Box space to the reals. |
keras_gym.utils.box_to_unit_interval_np |
Rescale array values from Box space to the unit interval. |
keras_gym.utils.check_numpy_array |
This helper function is mostly for internal use. |
keras_gym.utils.clipped_logit_np |
A safe implementation of the logit function \(x\mapsto\log(x/(1-x))\). |
keras_gym.utils.feature_vector |
Create a feature vector out of a state observation \(s\) or an action \(a\). |
keras_gym.utils.idx |
Given a numpy array, return its corresponding integer index array. |
keras_gym.utils.log_softmax |
Compute the log-softmax. |
keras_gym.utils.one_hot |
Create a dense one-hot encoded vector. |
keras_gym.utils.project_onto_actions_np |
Project tensor onto specific actions taken: numpy implementation. |
keras_gym.utils.reals_to_box_np |
Transform array values from the reals to a Box space. |
keras_gym.utils.softmax |
Compute the softmax (normalized point-wise exponential). |
keras_gym.utils.unit_interval_to_box_np |
Rescale array values from the unit interval to a Box space. |
-
keras_gym.utils.
argmax
(arr, axis=-1, random_state=None)[source]¶ This is a little hack to ensure that argmax breaks ties randomly, which is something that
numpy.argmax()
doesn’t do.Note: random tie breaking is only done for 1d arrays; for multidimensional inputs, we fall back to the numpy version.
Parameters: - a : array_like
Input array.
- axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
- random_state : int or RandomState
This can either be a random seed (int) or an instance of
numpy.random.RandomState
.
Returns: - index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
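A quick illustration of the random tie-breaking (the returned index is 1 or 2, chosen at random):
import numpy as np
from keras_gym.utils import argmax

q_values = np.array([0.3, 0.7, 0.7, 0.1])
a = argmax(q_values)     # returns 1 or 2, chosen uniformly among the tied maxima
b = np.argmax(q_values)  # always returns 1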
-
keras_gym.utils.
argmin
(arr, axis=None, random_state=None)[source]¶ This is a little hack to ensure that argmin breaks ties randomly, which is something that
numpy.argmin()
doesn’t do.Note: random tie breaking is only done for 1d arrays; for multidimensional inputs, we fall back to the numpy version.
Parameters: - a : array_like
Input array.
- axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
- random_state : int or RandomState
This can either be a random seed (int) or an instance of
numpy.random.RandomState
.
Returns: - index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
keras_gym.utils.box_to_reals_np(arr, space, epsilon=1e-15)[source]¶
Transform array values from a Box space to the reals. This is done by first mapping the Box values to the unit interval \(x\in[0, 1]\) and then feeding them to the clipped_logit_np() function.
Parameters:
- arr : nd array
  A numpy array containing a single instance or a batch of elements of a Box space.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
- epsilon : float, optional
  The cut-off value used by clipped_logit_np().
Returns:
- out : nd array, same shape as input
  A numpy array with the transformed values. The output values are real-valued.
keras_gym.utils.box_to_unit_interval_np(arr, space)[source]¶
Rescale array values from a Box space to the unit interval. This is essentially just min-max scaling:
\[x\ \mapsto\ \frac{x-x_\text{low}}{x_\text{high}-x_\text{low}}\]
Parameters:
- arr : nd array
  A numpy array containing a single instance or a batch of elements of a Box space.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
Returns:
- out : nd array, same shape as input
  A numpy array with the transformed values. The output values lie on the unit interval \([0, 1]\).
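A small sketch of the min-max scaling on a toy Box space (the space and values are arbitrary; the expected output follows from the formula above):

import numpy as np
import gym
from keras_gym.utils import box_to_unit_interval_np

space = gym.spaces.Box(low=np.array([-1.0, 0.0]), high=np.array([1.0, 10.0]))
x = np.array([0.0, 5.0])
box_to_unit_interval_np(x, space)   # expected: array([0.5, 0.5])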
keras_gym.utils.check_numpy_array(arr, ndim=None, ndim_min=None, dtype=None, shape=None, axis_size=None, axis=None)[source]¶
This helper function is mostly for internal use. It is used to check a few common properties of a numpy array.
Raises:
- NumpyArrayCheckError
  If one of the checks fails, it raises a NumpyArrayCheckError.

keras_gym.utils.clipped_logit_np(x, epsilon=1e-15)[source]¶
A safe implementation of the logit function \(x\mapsto\log(x/(1-x))\). It clips the arguments of the log function from below so as to avoid evaluating it at 0:
\[\text{logit}_\epsilon(x)\ =\ \log(\max(\epsilon, x)) - \log(\max(\epsilon, 1 - x))\]
Parameters:
- x : nd array
  Input numpy array whose entries lie on the unit interval, \(x_i\in [0, 1]\).
- epsilon : float, optional
  The small number with which to clip the arguments of the logarithm from below.
Returns:
- z : nd array, dtype: float, shape: same as input
  The output logits whose entries lie on the real line, \(z_i\in\mathbb{R}\).
keras_gym.utils.feature_vector(x, space)[source]¶
Create a feature vector out of a state observation \(s\) or an action \(a\). This is used in the DefaultPreprocessor.
Parameters:
- x : state or action
  A state observation \(s\) or an action \(a\).
- space : gym space
  A gym space, e.g. gym.spaces.Box, gym.spaces.Discrete, etc.

keras_gym.utils.idx(arr, axis=0)[source]¶
Given a numpy array, return its corresponding integer index array.
Parameters:
- arr : array
  Input array.
- axis : int, optional
  The axis along which we'd like to get an index.
Returns:
- index : 1d array, shape: arr.shape[axis]
  An index array [0, 1, 2, ...].
keras_gym.utils.log_softmax(arr, axis=-1)[source]¶
Compute the log-softmax.
Note: This is the numpy implementation.
Parameters:
- arr : numpy array
  The input array.
- axis : int, optional
  The axis along which to normalize; default is -1 (the last axis).
Returns:
- out : array of same shape
  The entries may be interpreted as log-probabilities.
keras_gym.utils.one_hot(i, n, dtype='float')[source]¶
Create a dense one-hot encoded vector.
Parameters:
- i : int or 1d array of ints
  The index of the non-zero entry.
- n : int
  The dimensionality of the dense vector. Note that n must be greater than i.
- dtype : str or datatype
  The output data type, default is 'float'.
Returns:
- x : 1d array of length n
  The dense one-hot encoded vector.
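A tiny usage sketch (the expected output is shown in the comment):

from keras_gym.utils import one_hot

one_hot(2, 5)   # expected: array([0., 0., 1., 0., 0.])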
keras_gym.utils.project_onto_actions_np(Y, A)[source]¶
Project tensor onto specific actions taken: numpy implementation.
Note: This only applies to discrete action spaces.
Parameters:
- Y : 2d array, shape: [batch_size, num_actions]
  The tensor to project down.
- A : 1d array, shape: [batch_size]
  The batch of actions used to project.
Returns:
- Y_projected : 1d array, shape: [batch_size]
  The tensor projected onto the actions taken.
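A sketch of the intended semantics: per batch item, pick the entry of Y that corresponds to the action actually taken (the values below are made up):

import numpy as np
from keras_gym.utils import project_onto_actions_np

Y = np.array([[1., 2., 3.],
              [4., 5., 6.]])    # shape: [batch_size=2, num_actions=3]
A = np.array([2, 0])            # actions taken for each item in the batch
project_onto_actions_np(Y, A)   # expected: array([3., 4.]), i.e. Y[np.arange(2), A]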
keras_gym.utils.reals_to_box_np(arr, space)[source]¶
Transform array values from the reals to a Box space. This is done by first applying the logistic sigmoid to map the reals onto the unit interval and then applying unit_interval_to_box_np() to rescale to the Box space.
Parameters:
- arr : nd array
  A numpy array containing a single instance or a batch of elements of a Box space, encoded as logits.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
Returns:
- out : nd array, same shape as input
  A numpy array with the transformed values. The output values are contained in the provided Box space.
keras_gym.utils.softmax(arr, axis=-1)[source]¶
Compute the softmax (normalized point-wise exponential).
Note: This is the numpy implementation.
Parameters:
- arr : numpy array
  The input array.
- axis : int, optional
  The axis along which to normalize; default is -1 (the last axis).
Returns:
- out : array of same shape
  The entries of the output array are non-negative and normalized, which makes them good candidates for modeling probabilities.
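A brief sketch of the normalization behavior (the input logits are arbitrary):

import numpy as np
from keras_gym.utils import softmax

logits = np.array([[1.0, 2.0, 3.0]])
p = softmax(logits, axis=-1)
p.shape          # (1, 3): same shape as the input
p.sum(axis=-1)   # expected: array([1.]), i.e. each row is a valid probability vector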
keras_gym.utils.unit_interval_to_box_np(arr, space)[source]¶
Rescale array values from the unit interval to a Box space. This is essentially inverted min-max scaling:
\[x\ \mapsto\ x_\text{low} + (x_\text{high} - x_\text{low})\,x\]
Parameters:
- arr : nd array
  A numpy array containing a single instance or a batch of elements of a Box space, scaled to the unit interval.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
Returns:
- out : nd array, same shape as input
  A numpy array with the transformed values. The output values are contained in the provided Box space.
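Taken together with box_to_reals_np and reals_to_box_np above, these transforms are (approximate) inverses of each other. A small round-trip sketch, with an arbitrary one-dimensional Box space:

import numpy as np
import gym
from keras_gym.utils import box_to_reals_np, reals_to_box_np

space = gym.spaces.Box(low=np.array([-2.0]), high=np.array([2.0]))
x = np.array([0.5])
z = box_to_reals_np(x, space)      # unconstrained, real-valued logits
x_rec = reals_to_box_np(z, space)  # mapped back inside the Box space
np.allclose(x, x_rec)              # expected: True (up to the epsilon clipping)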
Tensor Utilities¶
- keras_gym.utils.box_to_reals_tf : Transform Tensor values from a Box space to the reals.
- keras_gym.utils.box_to_unit_interval_tf : Rescale Tensor values from a Box space to the unit interval.
- keras_gym.utils.check_tensor : This helper function is mostly for internal use.
- keras_gym.utils.diff_transform_matrix : A helper function that implements discrete differentiation for stacked state observations.
- keras_gym.utils.log_softmax_tf : Compute the log-softmax.
- keras_gym.utils.project_onto_actions_tf : Project tensor onto specific actions taken: tensorflow implementation.
- keras_gym.utils.unit_interval_to_box_tf : Rescale Tensor values from the unit interval to a Box space.
keras_gym.utils.box_to_reals_tf(tensor, space, epsilon=1e-15)[source]¶
Transform Tensor values from a Box space to the reals. This is done by first mapping the Box values to the unit interval \(x\in[0, 1]\) and then feeding them to the clipped_logit_tf() function.
Parameters:
- tensor : nd Tensor
  A tensor containing a single instance or a batch of elements of a Box space.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
- epsilon : float, optional
  The cut-off value used by clipped_logit_tf().
Returns:
- out : nd Tensor, same shape as input
  A Tensor with the transformed values. The output values are real-valued.

keras_gym.utils.box_to_unit_interval_tf(tensor, space)[source]¶
Rescale Tensor values from a Box space to the unit interval. This is essentially just min-max scaling:
\[x\ \mapsto\ \frac{x-x_\text{low}}{x_\text{high}-x_\text{low}}\]
Parameters:
- tensor : nd Tensor
  A tensor containing a single instance or a batch of elements of a Box space.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
Returns:
- out : nd Tensor, same shape as input
  A Tensor with the transformed values. The output values lie on the unit interval \([0, 1]\).
keras_gym.utils.check_tensor(tensor, ndim=None, ndim_min=None, dtype=None, same_dtype_as=None, same_shape_as=None, same_as=None, int_shape=None, axis_size=None, axis=None)[source]¶
This helper function is mostly for internal use. It is used to check a few common properties of a Tensor.
Parameters:
- ndim : int or list of ints
  Check K.ndim(tensor).
- ndim_min : int
  Check if K.ndim(tensor) is at least ndim_min.
- dtype : Tensor dtype or list of Tensor dtypes
  Check tensor.dtype.
- same_dtype_as : Tensor
  Check if dtypes match.
- same_shape_as : Tensor
  Check if shapes match.
- same_as : Tensor
  Check if both dtypes and shapes match.
- int_shape : tuple of ints
  Check K.int_shape(tensor).
- axis_size : int
  Check size along axis, where axis is specified by the axis=... kwarg.
- axis : int
  The axis to check for size.
Raises:
- TensorCheckError
  If one of the checks fails, it raises a TensorCheckError.
keras_gym.utils.diff_transform_matrix(num_frames, dtype='float32')[source]¶
A helper function that implements discrete differentiation for stacked state observations.
Let's say we have a feature vector \(X\) consisting of four stacked frames, i.e. the shape would be: [batch_size, height, width, 4].
The corresponding diff-transform matrix with num_frames=4 is a \(4\times 4\) matrix given by:
\[\begin{split}M_\text{diff}^{(4)}\ =\ \begin{pmatrix} -1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ -3 & -2 & -1 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}\end{split}\]
such that the diff-transformed feature vector is readily computed as:
\[X_\text{diff}\ =\ X\, M_\text{diff}^{(4)}\]
The diff-transformation preserves the shape, but it reorganizes the frames in such a way that they look more like canonical variables. You can think of \(X_\text{diff}\) as the stacked variables \(x\), \(\dot{x}\), \(\ddot{x}\), etc. (in reverse order). These represent the position, velocity, acceleration, etc. of pixels in a single frame.
Parameters:
- num_frames : positive int
  The number of stacked frames in the original \(X\).
- dtype : keras dtype, optional
  The output data type.
Returns:
- M : 2d-Tensor, shape: [num_frames, num_frames]
  A square matrix that is intended to be multiplied from the left, e.g. X_diff = K.dot(X_orig, M), where we assume that the frames are stacked in axis=-1 of X_orig, in chronological order.
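A brief sketch of how the matrix can be applied to a batch of stacked frames, following the K.dot usage mentioned above (the frame dimensions and the random input are illustrative only):

import numpy as np
from tensorflow.keras import backend as K
from keras_gym.utils import diff_transform_matrix

M = diff_transform_matrix(num_frames=4)        # the 4x4 matrix shown above
X_orig = K.constant(np.random.rand(2, 84, 84, 4), dtype='float32')  # [batch, height, width, num_frames]
X_diff = K.dot(X_orig, M)                      # same shape, frames diff-transformed
K.int_shape(X_diff)                            # expected: (2, 84, 84, 4)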
keras_gym.utils.log_softmax_tf(Z, axis=-1)[source]¶
Compute the log-softmax.
Note: This is the tensorflow implementation.
Parameters:
- Z : Tensor
  The input logits.
- axis : int, optional
  The axis along which to normalize; default is -1 (the last axis).
Returns:
- out : Tensor of same shape as input
  The entries may be interpreted as log-probabilities.
keras_gym.utils.project_onto_actions_tf(Y, A)[source]¶
Project tensor onto specific actions taken: tensorflow implementation.
Note: This only applies to discrete action spaces.
Parameters:
- Y : 2d Tensor, shape: [batch_size, num_actions]
  The tensor to project down.
- A : 1d Tensor, shape: [batch_size]
  The batch of actions used to project.
Returns:
- Y_projected : 1d Tensor, shape: [batch_size]
  The tensor projected onto the actions taken.
keras_gym.utils.unit_interval_to_box_tf(tensor, space)[source]¶
Rescale Tensor values from the unit interval to a Box space. This is essentially inverted min-max scaling:
\[x\ \mapsto\ x_\text{low} + (x_\text{high} - x_\text{low})\,x\]
Parameters:
- tensor : nd Tensor
  A tensor containing a single instance or a batch of elements of a Box space, scaled to the unit interval.
- space : gym.spaces.Box
  The Box space. This is needed to determine the shape and size of the space.
Returns:
- out : nd Tensor, same shape as input
  A Tensor with the transformed values. The output values are contained in the provided Box space.
Glossary¶
In this package we make heavy use of function approximators using keras.Model objects. In Section 1 we list the available types of function approximators. A function approximator uses multiple keras models to support its full functionality. The different types of keras models are listed in Section 2. Finally, in Section 3 we list the different kinds of inputs and outputs that our keras models expect.
1. Function approximator types¶
- function approximator
  A function approximator is any object whose parameters can be updated from data, e.g. a value function or an updateable policy.
- body
  The body is what we call the part of the computation graph that may be shared between e.g. policy (actor) and value function (critic). It is typically the part of a neural net that does most of the heavy lifting. One may think of the body() as an elaborate automatic feature extractor.
- head
  The head is the part of the computation graph that actually generates the desired output format/shape. As its input, it takes the output of body. The different heads that the FunctionApproximator class provides are:
  - head_v : This is the state value head. It returns a batch of scalar values V.
  - head_q1 : This is the type-I Q-value head. It returns a batch of scalar values Q_sa.
  - head_q2 : This is the type-II Q-value head. It returns a batch of vectors Q_s.
  - head_pi : This is the policy head. It returns a batch of distribution parameters Z.
- forward_pass
  This is just the consecutive application of head after body (see the sketch below this list).
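For instance, here is a minimal sketch of a custom function approximator whose body is a small multi-layer perceptron; the layer sizes are arbitrary, and the heads are supplied by the FunctionApproximator base class:

import keras_gym as km
from tensorflow import keras

class MLP(km.FunctionApproximator):
    """ multi-layer perceptron with one hidden layer """
    def body(self, X):
        # the body acts as a shared feature extractor for all heads
        X = keras.layers.Flatten()(X)
        X = keras.layers.Dense(units=16, activation='relu')(X)
        return X

# this object can then be wrapped by a value function or policy, analogous to
# the Example section at the end of these docs, e.g. (hypothetical usage):
# func = MLP(env, lr=0.01)
# v = km.V(func)   # see the class docs for the exact keyword arguments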
In this package we have four distinct types of function approximators:
- state value function
  State value functions \(v(s)\) are implemented by V.
- type-I state-action value function
  This is the standard state-action value function \(q(s,a)\). It models the Q-function as
  \[(s, a) \mapsto q(s,a)\ \in\ \mathbb{R}\]
  This function approximator is implemented by QTypeI.
- type-II state-action value function
  This type of state-action value function is different from type-I in that it models the Q-function as
  \[s \mapsto q(s,.)\ \in\ \mathbb{R}^n\]
  where \(n\) is the number of actions. The type-II Q-function is implemented by QTypeII.
- updateable policy
  This function approximator represents a policy directly. It is implemented by e.g. SoftmaxPolicy.
- actor-critic
  This is a special function approximator that allows for the sharing of parts of the computation graph between a value function (critic) and a policy (actor).

Note
At the moment, type-II Q-functions and updateable policies are only implemented for environments with a Discrete action space.
2. Keras model types¶
Each function approximator takes multiple keras.Model objects. The different models are named according to the role they play in the function approximator object:
- train_model
  This keras.Model is used for training.
- predict_model
  This keras.Model is used for predicting.
- target_model
  This keras.Model is a kind of shadow copy of predict_model that is used in off-policy methods. For instance, in DQN we use it for reducing the variance of the bootstrapped target by synchronizing with predict_model only periodically (see the sketch below).
Note
The specific inputs depend on the type of function approximator you're using. These are provided in each individual class doc.
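For illustration, a minimal sketch of what such a periodic hard synchronization amounts to. This is not the library's own implementation; the hard_sync helper and the period variable are hypothetical, and we assume the models are exposed as predict_model and target_model attributes, as named above:

def hard_sync(predict_model, target_model):
    # copy the online weights into the shadow copy (a hard update)
    target_model.set_weights(predict_model.get_weights())

# e.g. inside a training loop, synchronize only every 'period' time steps:
# if t % period == 0:
#     hard_sync(q.predict_model, q.target_model)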
3. Keras model inputs/outputs¶
Each keras.Model
object expects specific inputs and outputs. These are
provided in each individual function approximator’s docs.
Below we list the different available arrays that we might use as inputs/outputs to our keras models.
- S
  A batch of (preprocessed) state observations. The shape is [batch_size, ...], where the ellipsis might be any number of dimensions.
- A
  A batch of actions taken, with shape [batch_size].
- P
  A batch of distribution parameters that allow us to construct action propensities according to the behavior/target policy \(b(a|s)\). For instance, the parameters of a keras_gym.SoftmaxPolicy (for discrete action spaces) are those of a categorical distribution. On the other hand, for continuous action spaces we use a keras_gym.GaussianPolicy, whose parameters are the parameters of the underlying normal distribution.
- Z
  Similar to P, this is a batch of distribution parameters. In contrast to P, however, Z represents the primary updateable policy \(\pi_\theta(a|s)\) instead of the behavior/target policy \(b(a|s)\).
- G
  A batch of (\(\gamma\)-discounted) returns, shape: [batch_size].
- Rn
  A batch of partial (\(\gamma\)-discounted) returns. For instance, in n-step bootstrapping these are given by:
  \[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
  In other words, it's the part of the n-step return without the bootstrapping term. The shape is [batch_size].
- In
  A batch of bootstrap factors. For instance, in n-step bootstrapping these are given by \(I^{(n)}_t=\gamma^n\) when bootstrapping and \(I^{(n)}_t=0\) otherwise. It is used in bootstrapped updates. For instance, the n-step bootstrapped target makes use of it as follows:
  \[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+1}, A_{t+1})\]
  The shape is [batch_size]. (See the sketch after this list for how Rn and In relate to the raw rewards.)
- S_next
  A batch of (preprocessed) next-state observations. This is typically used in bootstrapping (see In). The shape is [batch_size, ...], where the ellipsis might be any number of dimensions.
- A_next
  A batch of next-actions to be taken. These can be actions that were actually taken (on-policy), but they can also be any other would-be next-actions (off-policy). The shape is [batch_size].
- P_next
  A batch of action propensities according to the policy \(\pi(a|s)\).
- V
  A batch of V-values \(v(s)\) of shape [batch_size].
- Q_sa
  A batch of Q-values \(q(s,a)\) of shape [batch_size].
- Q_s
  A batch of Q-values \(q(s,.)\) of shape [batch_size, num_actions].
- Adv
  A batch of advantages \(\mathcal{A}(s,a) = q(s,a) - v(s)\), which has shape: [batch_size].
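As a concrete illustration of Rn and In, here is a small numpy sketch (not the library's implementation) that computes them for a finished episode, following the two formulas above:

import numpy as np

def nstep_targets(rewards, gamma=0.99, n=4):
    """Compute Rn (partial n-step returns) and In (bootstrap factors)."""
    T = len(rewards)
    Rn = np.zeros(T)
    In = np.zeros(T)
    for t in range(T):
        steps = min(n, T - t)
        # R_t + gamma * R_{t+1} + ... + gamma^(steps-1) * R_{t+steps-1}
        Rn[t] = sum(gamma ** k * rewards[t + k] for k in range(steps))
        # gamma^n while there is a state to bootstrap from, 0 at the end of the episode
        In[t] = gamma ** n if t + n < T else 0.0
    return Rn, In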
Release Notes¶
v0.2.17¶
- Made keras-gym compatible with tensorflow v2.0 (unfortunately had to disable eager mode)
- Added SoftActorCritic class
- Added frozen_lake/sac script and notebook
- Added atari/sac script, which is still WIP
v0.2.16¶
Major update: support Box action spaces.
- introduced keras_gym.proba_dists sub-module, which implements differentiable probability distributions (incl. differentiable sample() methods)
- removed policy-based losses in favor of BaseUpdateablePolicy.policy_loss_with_metrics(), which now uses the differentiable ProbaDist objects
- removed ConjointActorCritic (was redundant)
- changed how we implement target models: no longer rely on global namespaces; instead we use keras.models.clone_model()
- changed BaseFunctionApproximator.sync_target_model(): use model.{get,set}_weights()
- added script and notebook for Pendulum-v0 with PPO
v0.2.15¶
This is a relatively minor update. Just a couple of small bug fixes.
- fixed logging, which was broken by abseil (a dependency of tensorflow>=1.14)
- added enable_logging helper
- updated some docs
v0.2.13¶
This version is another major overhaul. In particular, the FunctionApproximator class is introduced, which offers a unified interface for all function approximator types, i.e. state(-action) value functions and updateable policies. This makes it a lot easier to create your own custom function approximator, whereby you only have to define your own forward-pass by creating a subclass of FunctionApproximator and providing a body method. Further flexibility is provided by allowing the head method(s) to be overridden.
- added FunctionApproximator class
- refactored value functions and policies to just be a wrapper around a FunctionApproximator object
- MILESTONE: got AlphaZero to work on ConnectFour (although this game is likely too simple to see the real power of AlphaZero - MCTS on its own works fine)
v0.2.12¶
- MILESTONE: got PPO working on Atari Pong
- added PolicyKLDivergence and PolicyEntropy
- added entropy_beta and ppo_clip_eps kwargs to updateable policies
v0.2.11¶
- optimized ActorCritic to avoid feeding in S three times instead of once
- removed all mention of bootstrap_model
- implemented PPO with ClippedSurrogateLoss
v0.2.10¶
This is the second overhaul, a complete rewrite in fact. There was just too much of the old scikit-gym structure that was standing in the way of progress.
The main thing that changed in this version is that I ditched the notion of an algorithm. Instead, function approximators carry their own "update strategy". In the case of Q-functions, this is 'sarsa', 'q_learning', etc., while policies have the options 'vanilla', 'ppo', etc.
Value functions carry another property that was previously attributed to algorithm objects. This is the bootstrap-n, i.e. the number of steps over which to delay bootstrapping.
This new structure accommodates modularity much better than the old structure did.
- removed algorithms, replaced by ‘bootstrap_n’ and ‘update_strategy’ settings on function approximators
- implemented ExperienceReplayBuffer
- milestone: added DQN implementation for Atari 2600 envs
- other than that... too much to mention. It really was a complete rewrite
v0.2.9¶
- changed definitions of Q-functions to GenericQ and GenericQTypeII
- added option for efficient bootstrapped updating (bootstrap_model argument in value functions, see example usage: NStepBootstrapV)
- renamed ValuePolicy to ValueBasedPolicy
v0.2.8¶
- implemented base class for updateable policy objects
- implemented first example of updateable policy: GenericSoftmaxPolicy
- implemented predefined softmax policy: LinearSoftmaxPolicy
- added first policy gradient algorithm: Reinforce
- added REINFORCE example notebook
- updated documentation
v0.2.7¶
This was a MAJOR overhaul in which I ported everything from scikit-learn to Keras. The reason for this is that I was stuck on the implementation of policy gradient methods due to the lack of flexibility of the scikit-learn ecosystem. I chose Keras as a replacement: it's nice and modular like scikit-learn, but in addition it's much more flexible. In particular, the ability to provide custom loss functions has been the main selling point. Another selling point was that some environments require more sophisticated neural nets than a simple MLP, and these are readily available in Keras.
- added compatibility wrapper for scikit-learn function approximators
- ported all value functions to use keras.Model
- ported predefined models LinearV and LinearQ to keras
- ported algorithms to keras
- ported all notebooks to keras
- changed the name of the package to keras-gym and the root module to keras_gym
- added propensity score outputs to policy objects
- created a stub for directly updateable policies
v0.2.6¶
- refactored BaseAlgorithm to simplify implementation (at the cost of more code, but it’s worth it)
- refactored notebooks: they are now bundled by environment / algo type
- added n-step bootstrap algorithms:
NStepQLearning
NStepSarsa
NStepExpectedSarsa
v0.2.5¶
- added algorithm: keras_gym.algorithms.ExpectedSarsa
- added object: keras_gym.utils.ExperienceCache
- rewrote MonteCarlo to use ExperienceCache
v0.2.4¶
- added algorithm:
keras_gym.algorithms.MonteCarlo
v0.2.3¶
- added algorithm:
keras_gym.algorithms.Sarsa
v0.2.2¶
- changed doc theme from sklearn to readthedocs
v0.2.1¶
- first working implementation of value function + policy + algorithm
- added first working example in a notebook
- added algorithm:
keras_gym.algorithms.QLearning
Example¶
To get started, check out the Example Notebooks for examples.
Here’s one of the examples from the notebooks, in which we solve the
CartPole-v0
environment with the SARSA algorithm, using a simple
linear function approximator for our Q-function:
import gym
import keras_gym as km
from tensorflow import keras

# the cart-pole MDP
env = gym.make('CartPole-v0')

class Linear(km.FunctionApproximator):
    """ linear function approximator """
    def body(self, X):
        # body is trivial, only flatten and then pass to head (one dense layer)
        return keras.layers.Flatten()(X)

# value function and its derived policy
func = Linear(env, lr=0.001)
q = km.QTypeI(func, update_strategy='sarsa')
policy = km.EpsilonGreedy(q)

# static parameters
num_episodes = 200
num_steps = env.spec.max_episode_steps

# used for early stopping
num_consecutive_successes = 0

# train
for ep in range(num_episodes):
    s = env.reset()
    policy.epsilon = 0.1 if ep < 10 else 0.01

    for t in range(num_steps):
        a = policy(s)
        s_next, r, done, info = env.step(a)

        q.update(s, a, r, done)

        if done:
            if t == num_steps - 1:
                num_consecutive_successes += 1
                print("num_consecutive_successes: {}"
                      .format(num_consecutive_successes))
            else:
                num_consecutive_successes = 0
                print("failed after {} steps".format(t))
            break

        s = s_next

    if num_consecutive_successes == 10:
        break

# run env one more time to render
km.render_episode(env, policy, step_delay_ms=25)