TorchRL Documentation¶
Agents¶
BaseAgent¶
class torchrl.agents.BaseAgent(batcher, optimizer, *, gamma=0.99, log_dir='runs')[source]¶
Bases: abc.ABC
Basic TorchRL agent. Encapsulates an environment and a model.
Parameters:
- env (torchrl.envs) – A torchrl environment.
- gamma (float) – Discount factor on future rewards (Default is 0.99).
- log_dir (str) – Directory where logs will be written (Default is 'runs').
step()[source]¶
This method is called at each iteration of the training loop and defines the training procedure.
_check_termination()[source]¶
Check if the training loop reached the end.
Returns: bool – True if done, False otherwise.
_register_model(name, model)[source]¶
Save a torchrl model to the internal memory.
Parameters:
- name (str) – Desired name for the model.
- model (torchrl.models) – The model to register.
train(*, max_iters=-1, max_episodes=-1, max_steps=-1, log_freq=1, eval_env=None, eval_freq=None)[source]¶
Defines the training loop of the algorithm, calling step() at every iteration.
select_action(state, step)[source]¶
Receive a state and use the model to select an action.
Parameters: state (numpy.ndarray) – The environment state.
Returns: action – The selected action.
Return type: int or numpy.ndarray
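Example
The pseudocode below is a conceptual sketch (not the actual implementation) of how train() drives the methods above: it repeatedly calls step() and stops once _check_termination() signals the end.

def train_loop(agent, log_freq=1):
    # Conceptual sketch of what BaseAgent.train() does, based on the descriptions above.
    iteration = 0
    while not agent._check_termination():
        agent.step()  # one training iteration, defined by the concrete agent
        iteration += 1
        if iteration % log_freq == 0:
            pass  # logging happens here (see write_logs and torchrl.utils.logger.Logger)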
PGAgent¶
class torchrl.agents.PGAgent(batcher, *, policy_model, value_model=None, normalize_advantages=True, advantage=<torchrl.utils.estimators.advantage.estimators.GAE object>, vtarget=<torchrl.utils.estimators.value.estimators.FromAdvantage object>, **kwargs)[source]¶
Bases: torchrl.agents.base_agent.BaseAgent
Policy Gradient agent, compatible with all PG models.
This agent encapsulates a policy_model and optionally a value_model; it defines the steps needed for the training loop (see step()) and calculates all the values necessary to train the model(s).
Parameters:
- env (torchrl.envs) – A torchrl environment.
- policy_model (torchrl.models) – Should be a subclass of torchrl.models.BasePGModel.
- value_model (torchrl.models) – Should be an instance of torchrl.models.ValueModel (Default is None).
- normalize_advantages (bool) – If True, normalize the advantages per batch.
- advantage (torchrl.utils.estimators.advantage) – Class used for calculating the advantages.
- vtarget (torchrl.utils.estimators.value) – Class used for calculating the states' target values.
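Example
A hypothetical end-to-end setup; the batcher and model construction are placeholders (their constructors are not documented in this section), so treat this as a sketch rather than the exact API.

import torchrl

env = torchrl.envs.GymEnv('CartPole-v1')
batcher = ...          # a torchrl batcher wrapping the env (API not shown here)
policy_model = ...     # e.g. a subclass of torchrl.models.BasePGModel
value_model = ...      # e.g. a torchrl.models.ValueModel

agent = torchrl.agents.PGAgent(
    batcher,
    policy_model=policy_model,
    value_model=value_model,
    normalize_advantages=True,
)
agent.train(max_steps=1000000, log_freq=10)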
Models¶
BaseModel¶
class torchrl.models.BaseModel(model, batcher, *, cuda_default=True)[source]¶
Bases: torchrl.nn.container.ModuleExtended, abc.ABC
Basic TorchRL model. Takes two Config objects that identify the body(ies) and head(s) of the model.
Parameters:
- model (nn.Module) – A pytorch model.
- batcher (torchrl.batcher) – A torchrl batcher.
- num_epochs (int) – How many times to train over the entire dataset (Default is 1).
- num_mini_batches (int) – How many mini-batches to split the batch into (Default is 1, so the whole batch is used at once).
- opt_fn (torch.optim) – The optimizer reference function (the constructor, not the instance) (Default is Adam).
- opt_params (dict) – Parameters for the optimizer (Default is an empty dict).
- clip_grad_norm (float) – Max norm of the gradients; if float('inf'), no clipping is done (Default is float('inf')).
- loss_coef (float) – Used when sharing networks; should balance the contribution of the grads of each model.
- cuda_default (bool) – If True and CUDA is available, use it (Default is True).
batch_keys¶
The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, by making sure only the requested keys are sampled.
register_losses¶
Append losses to self.losses; these losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
static output_layer(input_shape, action_info)[source]¶
The final layer of the model; it will be appended to the model head.
Examples
The output of most PG models has the same dimension as the action, but the output of the value models is rank 1. This is where that is defined.
forward(x)[source]¶
Defines the computation performed at every call.
Parameters: x (numpy.ndarray) – The environment state.
attach_logger(logger)[source]¶
Register a logger to this model.
Parameters: logger (torchrl.utils.logger) – The logger to attach.
write_logs(batch)[source]¶
Write logs to the terminal and to a tf log file.
Parameters: batch (Batch) – Some logs might need the batch for calculation.
classmethod from_config(config, batcher=None, body=None, head=None, **kwargs)[source]¶
Creates a model from a configuration file.
Parameters:
- config (Config) – Should contain at least a network definition (nn_config section).
- env (torchrl.envs) – A torchrl environment (Default is None, in which case it must be present in the config).
- kwargs (Keyword arguments) – Extra arguments that will be passed to the class constructor.
Returns: A TorchRL model.
Return type: torchrl.models
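Example
An illustrative subclass; the hook names follow the methods documented above, but the batch keys ('state_t', 'vtarget') and the exact use of self.losses are assumptions made for this sketch.

import torch.nn as nn
import torch.nn.functional as F
from torchrl.models import BaseModel

class MSEValueHead(BaseModel):
    @property
    def batch_keys(self):
        # Only these keys will be sampled from the dataloader.
        return ['state_t', 'vtarget']

    @staticmethod
    def output_layer(input_shape, action_info):
        # Value-style heads output a rank-1 tensor (one scalar per state).
        return nn.Linear(input_shape[0], 1)

    def register_losses(self, batch):
        # Losses appended here are used later at optimizer_step().
        preds = self.forward(batch['state_t']).squeeze(-1)
        self.losses.append(F.mse_loss(preds, batch['vtarget']))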
ValueModel¶
class torchrl.models.ValueModel(model, batcher, **kwargs)[source]¶
Bases: torchrl.models.base_model.BaseModel
A standard regression model; it can be used to estimate the value of states or Q values.
Parameters: clip_range (float) – Similar to PPOClip, limits the change between the new and old value function.
batch_keys¶
The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, by making sure only the requested keys are sampled.
register_losses()[source]¶
Append losses to self.losses; these losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
write_logs(batch)[source]¶
Write logs to the terminal and to a tf log file.
Parameters: batch (Batch) – Some logs might need the batch for calculation.
BasePGModel¶
class torchrl.models.BasePGModel(model, batcher, *, entropy_coef=0, **kwargs)[source]¶
Bases: torchrl.models.base_model.BaseModel
Base class for all Policy Gradient Models.
entropy_loss(batch)[source]¶
Adds an entropy cost to the loss function, with the intent of encouraging exploration.
Parameters: batch (Batch) – The batch should contain all the information necessary to compute the gradients.
create_dist(parameters)[source]¶
Specify how the policy distributions should be created. The type of the distribution depends on the environment.
Parameters: parameters (np.array) – The parameters used to create a distribution (continuous or discrete, depending on the type of the environment).
write_logs(batch)[source]¶
Write logs to the terminal and to a tf log file.
Parameters: batch (Batch) – Some logs might need the batch for calculation.
VanillaPGModel¶
class torchrl.models.VanillaPGModel(model, batcher, *, entropy_coef=0, **kwargs)[source]¶
Bases: torchrl.models.base_pg_model.BasePGModel
The classical Policy Gradient algorithm.
batch_keys¶
The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, by making sure only the requested keys are sampled.
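Example
A minimal sketch of the loss the classical algorithm minimizes: the negative log-probability of the chosen actions weighted by the advantages (the tensors stand in for values the batcher would provide).

import torch

def vanilla_pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Maximize E[log pi(a|s) * A], i.e. minimize its negative.
    return -(log_probs * advantages).mean()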
A2CModel¶
class torchrl.models.A2CModel(model, batcher, *, entropy_coef=0, **kwargs)[source]¶
Bases: torchrl.models.vanilla_pg_model.VanillaPGModel
A2C is just a parallel implementation of the actor-critic algorithm, so to reproduce A2C simply create a list of envs and pass it to torchrl.envs.ParallelEnv (see the sketch below).
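Example
A hypothetical sketch of the setup described above; the exact ParallelEnv constructor is not documented in this section, the assumption is simply that it wraps a list of environments.

from torchrl.envs import GymEnv, ParallelEnv

# Sixteen copies of the same environment, stepped in parallel by ParallelEnv.
envs = [GymEnv('CartPole-v1') for _ in range(16)]
parallel_env = ParallelEnv(envs)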
SurrogatePGModel¶
class torchrl.models.SurrogatePGModel(model, batcher, *, entropy_coef=0, **kwargs)[source]¶
Bases: torchrl.models.base_pg_model.BasePGModel
The Surrogate Policy Gradient algorithm instead maximizes a "surrogate" objective, given by:
\[L^{CPI}(\theta) = \hat{E}_t \left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} \hat{A} \right]\]
batch_keys¶
The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, by making sure only the requested keys are sampled.
register_losses()[source]¶
Append losses to self.losses; these losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
surrogate_pg_loss(batch)[source]¶
The surrogate PG loss, as described above.
Parameters: batch (Batch) – The batch used to compute the loss.
calculate_prob_ratio(new_log_probs, old_log_probs)[source]¶
Calculates the probability ratio between two policies.
Parameters:
- new_log_probs (torch.Tensor) – Log-probabilities under the current policy.
- old_log_probs (torch.Tensor) – Log-probabilities under the old policy.
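Example
Since both arguments are log-probabilities, the ratio reduces to exponentiating their difference; a minimal sketch of that calculation:

import torch

def prob_ratio(new_log_probs: torch.Tensor, old_log_probs: torch.Tensor) -> torch.Tensor:
    # pi_new(a|s) / pi_old(a|s) = exp(log pi_new - log pi_old)
    return (new_log_probs - old_log_probs).exp()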
PPOClipModel¶
class torchrl.models.PPOClipModel(model, batcher, ppo_clip_range=0.2, **kwargs)[source]¶
Bases: torchrl.models.surrogate_pg_model.SurrogatePGModel
Proximal Policy Optimization as described in https://arxiv.org/pdf/1707.06347.pdf.
register_losses()[source]¶
Append losses to self.losses; these losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
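Example
A sketch of the clipped surrogate loss from the paper, showing how ppo_clip_range bounds the probability ratio; it mirrors the formula L^CLIP = E[min(r A, clip(r, 1 - eps, 1 + eps) A)], not necessarily the exact implementation used here.

import torch

def ppo_clip_loss(prob_ratio: torch.Tensor, advantages: torch.Tensor,
                  ppo_clip_range: float = 0.2) -> torch.Tensor:
    clipped_ratio = prob_ratio.clamp(1 - ppo_clip_range, 1 + ppo_clip_range)
    # Negative sign because the surrogate objective is maximized.
    return -torch.min(prob_ratio * advantages, clipped_ratio * advantages).mean()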
PPOAdaptiveModel¶
class torchrl.models.PPOAdaptiveModel(model, batcher, *, kl_target=0.01, kl_penalty=1.0, **kwargs)[source]¶
Bases: torchrl.models.surrogate_pg_model.SurrogatePGModel
Proximal Policy Optimization as described in https://arxiv.org/pdf/1707.06347.pdf.
Parameters: num_epochs (int) – How many times to train over the entire dataset (Default is 10).
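Example
A sketch of the KL-penalized objective and the adaptive coefficient update from the paper; kl_penalty plays the role of beta, the 1.5 and 2 factors follow the paper, and this is not necessarily the exact implementation used here.

import torch

def kl_penalized_loss(prob_ratio: torch.Tensor, advantages: torch.Tensor,
                      kl: torch.Tensor, kl_penalty: float) -> torch.Tensor:
    # Negative of the surrogate objective plus the KL penalty (we minimize this loss).
    return -(prob_ratio * advantages).mean() + kl_penalty * kl.mean()

def update_kl_penalty(kl_penalty: float, kl: float, kl_target: float = 0.01) -> float:
    # Grow beta when the measured KL overshoots the target, shrink it when it undershoots.
    if kl > 1.5 * kl_target:
        return kl_penalty * 2
    if kl < kl_target / 1.5:
        return kl_penalty / 2
    return kl_penalty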
torchrl.envs¶
The environment is the world that the agent interacts with; it could be a game, a physics engine, or anything you would like. It should receive and execute an action and return to the agent the next observation and a reward.
BaseEnv¶
class torchrl.envs.BaseEnv(env_name)[source]¶
Bases: abc.ABC
Abstract base class used for implementing new environments.
Includes some basic functionalities, like the option to use a running mean and standard deviation for normalizing states.
Parameters:
- env_name (str) – The environment name.
- fixed_normalize_states (bool) – If True, use the state min and max value to normalize the states (Default is False).
- running_normalize_states (bool) – If True, use the running mean and std to normalize the states (Default is False).
- scale_reward (bool) – If True, use the running std to scale the rewards (Default is False).
get_state_info()[source]¶
Returns a dict containing information about the state space.
The dict should contain two keys: shape, indicating the state shape, and dtype, indicating the state type.
Example
A state space containing 4 continuous values:
return dict(shape=(4,), dtype='continuous')
get_action_info()[source]¶
Returns a dict containing information about the action space.
The dict should contain two keys: shape, indicating the action shape, and dtype, indicating the action type. If dtype is int, a discrete action space is assumed.
Example
An action space containing 4 float numbers:
return dict(shape=(4,), dtype='float')
simulator¶
Returns the name of the simulator being used as a string.
reset()[source]¶
Resets the environment to an initial state.
Returns: A numpy array with the state information.
Return type: numpy.ndarray
step(action)[source]¶
Receives an action and executes it on the environment.
Parameters: action (int or float or numpy.ndarray) – The action to be executed in the environment. It should be an int for discrete environments and a float for continuous ones. It is also possible to execute multiple actions (if the environment supports it), in which case it should be a numpy.ndarray.
Returns:
- next_state (numpy.ndarray) – A numpy array with the state information.
- reward (float) – The reward.
- done (bool) – Flag indicating the termination of the episode.
- info (dict) – Dict containing additional information about the state.
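Example
An illustrative custom environment implementing the abstract interface above; the internals are invented for the sketch, and the real BaseEnv may require hooks not documented in this section.

import numpy as np
from torchrl.envs import BaseEnv

class CoinFlipEnv(BaseEnv):
    def get_state_info(self):
        return dict(shape=(1,), dtype='continuous')

    def get_action_info(self):
        return dict(shape=(2,), dtype='int')  # int dtype implies a discrete action space

    @property
    def simulator(self):
        return 'toy'

    def reset(self):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        next_state = np.random.randn(1).astype(np.float32)
        reward = float(action == 0)  # reward 1 for action 0, else 0
        done = True                  # one-step episodes
        return next_state, reward, done, {}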
GymEnv¶
class torchrl.envs.GymEnv(env_name, **kwargs)[source]¶
Bases: torchrl.envs.base_env.BaseEnv
Creates and wraps a gym environment.
Parameters: env_name (str) – The environment name.
simulator¶
Returns the name of the simulator being used as a string.
reset()[source]¶
Calls the reset method on the gym environment.
Returns: state – A numpy array with the state information.
Return type: numpy.ndarray
step(action)[source]¶
Calls the step method on the gym environment.
Parameters: action (int or float or numpy.ndarray) – The action to be executed in the environment. It should be an int for discrete environments and a float for continuous ones. It is also possible to execute multiple actions (if the environment supports it), in which case it should be a numpy.ndarray.
Returns:
- next_state (numpy.ndarray) – A numpy array with the state information.
- reward (float) – The reward.
- done (bool) – Flag indicating the termination of the episode.
get_state_info()[source]¶
Dictionary containing the shape and type of the state space. If it is continuous, also contains the minimum and maximum value.
get_action_info()[source]¶
Dictionary containing the shape and type of the action space. If it is continuous, also contains the minimum and maximum value.
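Example
Minimal usage, assuming the named gym environment is installed and registered; per the documentation above, step() returns (next_state, reward, done).

from torchrl.envs import GymEnv

env = GymEnv('CartPole-v1')
state = env.reset()
next_state, reward, done = env.step(0)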
RoboschoolEnv¶
class torchrl.envs.RoboschoolEnv(*args, **kwargs)[source]¶
Bases: torchrl.envs.gym_env.GymEnv
Support for gym Roboschool.
get_action_info()¶
Dictionary containing the shape and type of the action space. If it is continuous, also contains the minimum and maximum value.
static get_space_info(space)¶
Gets the shape of the possible types of states in gym.
Parameters: space (gym.spaces) – Space object that describes the valid actions and observations.
Returns: Dictionary containing the space shape and type.
Return type: dict
get_state_info()¶
Dictionary containing the shape and type of the state space. If it is continuous, also contains the minimum and maximum value.
reset()¶
Calls the reset method on the gym environment.
Returns: state – A numpy array with the state information.
Return type: numpy.ndarray
simulator¶
Returns the name of the simulator being used as a string.
step(action)¶
Calls the step method on the gym environment.
Parameters: action (int or float or numpy.ndarray) – The action to be executed in the environment. It should be an int for discrete environments and a float for continuous ones. It is also possible to execute multiple actions (if the environment supports it), in which case it should be a numpy.ndarray.
Returns:
- next_state (numpy.ndarray) – A numpy array with the state information.
- reward (float) – The reward.
- done (bool) – Flag indicating the termination of the episode.
Containers¶
ModuleExtended¶
SequentialExtended¶
FlattenLinear¶
torchrl.utils¶
Config¶
Configuration object used by other modules. Can be saved and imported as a YAML file.
class torchrl.utils.config.Config(*args, **kwargs)[source]¶
Bases: object
Configuration object used for initializing an Agent. It maintains the order in which the attributes have been set.
Parameters: configs (Keyword arguments) – Additional parameters that will be stored.
Returns: An object containing all configuration details (with possibly nested Config).
Return type: Config object
as_dict()[source]¶
Returns all object attributes as a nested OrderedDict.
Returns: Nested OrderedDict containing all object attributes.
Return type: dict
new_section(name, **configs)[source]¶
Creates a new Config object and adds it as an attribute of this instance.
Parameters:
- name (str) – Name of the new section.
- configs (Keyword arguments) – Parameters that will be stored in this section; accepts nested parameters.
Examples
Simple use case:
config.new_section('new_section_name', attr1=value1, attr2=value2, ...)
Nested parameters:
config.new_section('new_section_name', attr1=Config(attr1=value1, attr2=value2))
It’s possible to access the variable like so:
config.new_section_name.attr1
save(file_path)[source]¶
Saves current configuration to a JSON file. The configuration is stored as a nested dictionary (maintaining the order).
Parameters: file_path (str) – Path to write the file.
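Example
A small sketch combining the calls documented above; the section and attribute names are arbitrary.

from torchrl.utils.config import Config

config = Config(env_name='CartPole-v1')
config.new_section('nn_config', body=None, head=None)
print(config.as_dict())
config.save('experiment_config')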
Memories¶
SimpleMemory¶
DefaultMemory¶
class torchrl.utils.memories.DefaultMemory(*args, **kwargs)[source]¶
Bases: collections.defaultdict
A defaultdict whose keys can be accessed as attributes.
Logger¶
class torchrl.utils.logger.Logger(log_dir=None, *, debug=False, log_freq=1)[source]¶
Common logger used by all agents; aggregates values and prints a nice table.
Parameters: log_dir (str) – Path to write the logs file.
add_log(name, value, precision=2)[source]¶
Registers a value to a name; this function can be called multiple times and the values will be averaged when logging.
add_tf_only_log(name, value, precision=2)[source]¶
Registers a value to a name; this function can be called multiple times and the values will be averaged when logging. The logs are not displayed on the console, only written to the file.
add_histogram(name, values)[source]¶
Registers a histogram that can be viewed in tensorboard.
Parameters:
- name (str) – Name displayed when printing the table.
- values (torch.Tensor) – Values to log.
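Example
A minimal usage sketch based on the methods documented above; how often the table is printed depends on log_freq and the Logger internals.

import torch
from torchrl.utils.logger import Logger

logger = Logger(log_dir='runs/experiment_1')
logger.add_log('reward/episode', 200.0)
logger.add_tf_only_log('policy/grad_norm', 0.5, precision=4)
logger.add_histogram('policy/logits', torch.randn(32, 4))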
Net Builder¶
auto_input_shape¶
get_module_list¶
nn_from_config¶
torchrl.utils.net_builder.nn_from_config(config, state_info, action_info, body=None, head=None)[source]¶
Creates a pytorch model following the instructions of config.
Parameters:
- config (Config) – The configuration object that should contain the basic network structure.
- state_info (dict) – Dict containing information about the environment states (e.g. shape).
- action_info (dict) – Dict containing information about the environment actions (e.g. shape).
- body (Module) – If given, used instead of creating a new body (Default is None).
- head (Module) – If given, used instead of creating a new head (Default is None).
Returns: A torchrl NN (basically a pytorch NN with extended functionalities).
Return type: torchrl.SequentialExtended
Utils¶
get_obj¶
env_from_config¶
torchrl.utils.utils.env_from_config(config)[source]¶
Tries to create an environment from a configuration object.
Parameters: config (Config) – Configuration object containing the environment function.
Returns: env – A torchrl environment.
Return type: torchrl.envs
Raises: AttributeError – If no env is defined in the config object.
join_transitions¶
explained_var¶
torchrl.utils.utils.explained_var(target, preds)[source]¶
Calculates the explained variance between two datasets. Useful for estimating the quality of the value function.
Parameters:
- target (np.array) – Target dataset.
- preds (np.array) – Predictions array.
Returns: The explained variance.
Return type:
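Example
For reference, the standard definition of explained variance, not necessarily the exact implementation used here: a value of 1 means perfect predictions, while 0 means the predictions are no better than the mean of the targets.

import numpy as np

def explained_variance(target: np.ndarray, preds: np.ndarray) -> float:
    var_target = np.var(target)
    if var_target == 0:
        return float('nan')
    return float(1 - np.var(target - preds) / var_target)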