PyText Documentation

PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces and abstractions for model components, and by using PyTorch’s capabilities of exporting models for inference via the optimized Caffe2 execution engine. We use PyText at Facebook to iterate quickly on new modeling ideas and then seamlessly ship them at scale.

Core PyText Features:

How To Use

Please follow the tutorial series in Getting Started to get a sense of how to train a basic model and deploy to production.

After that, you can explore more built-in models and training methods in Training More Advanced Models.

If you want to use PyText as a library and build your own models, please check the tutorial in Extending PyText.

Note

All the demo configs and test data for the tutorials can be found in source code. You can either install PyText from source or download the files manually from GitHub.

Installation

PyText requires Python 3.6+

PyText is available in the Python Package Index via

$ pip install pytext-nlp

The easiest way to get started on most systems is to create a virtualenv:

$ python3 -m venv pytext_venv
$ source pytext_venv/bin/activate
(pytext_venv) $ pip install pytext-nlp

This will install a version of PyTorch depending on your system. See PyTorch for more information. If you are using macOS or Windows, this likely will not include GPU support by default; if you are using Linux, you should automatically get a version of PyTorch compatible with CUDA 9.0.

If you need a different version of PyTorch, follow the instructions on the PyTorch website to install the appropriate version of PyTorch before installing PyText.
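To see which PyTorch build ended up in your environment, a quick check like the following can help. This is a minimal sketch and not part of PyText: describe_torch is a hypothetical helper name, and it relies only on torch.__version__ and torch.cuda.is_available().

```python
import importlib.util


def describe_torch() -> str:
    """Report the installed PyTorch version and whether CUDA is usable."""
    # Avoid an ImportError if PyTorch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    return f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


print(describe_torch())
```

If CUDA is reported as unavailable on a machine with a GPU, installing a matching PyTorch build first, as described above, is the usual next step.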

OS Dependencies

If you’re having issues getting things to run, these guides might help.

On MacOS

Install brew, then run the command:

$ brew install cmake protobuf

On Windows

Coming Soon!

On Linux

For Ubuntu/Debian distros, you might need to run the following command:

$ sudo apt-get install protobuf-compiler libprotoc-dev

For rpm-based distros, you might need to run the following command:

$ sudo yum install protobuf-devel

Install From Source

$ git clone git@github.com:facebookresearch/pytext.git
$ cd pytext
$ source activation_venv
(pytext_venv) $ pip install torch # go to https://pytorch.org for platform specific installs
(pytext_venv) $ ./install_deps

Once that is installed, you can run the unit tests. We recommend using pytest as a runner.

(pytext_venv) $ pip install -U pytest
(pytext_venv) $ pytest
# If you want to measure test coverage, we recommend `pytest-cov`
(pytext_venv) $ pip install -U pytest-cov
(pytext_venv) $ pytest --cov=pytext

To resume development in an already checked-out repo:

$ cd pytext
$ source activation_venv

To exit the virtual environment:

(pytext_venv) $ deactivate

Cloud VM Setup

This guide covers all the setup work you need to do in order to easily install PyText on a cloud VM. Note that while these instructions worked when they were written, they may become incorrect or out of date. If they do, please send us a Pull Request!

After following these instructions, you should be ready to follow either the Installation instructions or the Install From Source instructions.

Amazon Web Services

Coming Soon

Google Cloud Engine

If you have problems launching your VM, make sure you have a non-zero GPU quota; see the Google Cloud documentation on quotas for details.

This guide uses Google’s Deep Learning VM as a base.

Setting Up Your VM

  • Click “Launch on Compute Engine”
  • Configure the VM:
    • The default 2CPU K80 setup is fine for most tutorials; if you need more, change it here.
    • For Framework, select one of the Base images, rather than one with a framework pre-installed. Note which version of CUDA you choose for later.
    • When you’re ready, click “Deploy”
    • When your VM is done loading, you can SSH into it from the GCE Console
  • Install Python 3.6 (based on this RoseHosting blog post):
    • $ sudo nano /etc/apt/sources.list
    • add deb http://ftp.de.debian.org/debian testing main to the list
    • $ echo 'APT::Default-Release "stable";' | sudo tee -a /etc/apt/apt.conf.d/00local
    • $ sudo apt-get update
    • $ sudo apt-get -t testing install python3.6
    • $ sudo apt-get install python3.6-venv protobuf-compiler libprotoc-dev

Microsoft Azure

This guide uses the Azure Ubuntu Server 18.04 LTS image as a base

Setting Up Your VM

  • From the Azure Dashboard, select “Virtual Machines” and then click “add”
  • Give your VM a name and select the region you want it in, keeping in mind that GPU servers are not present in all regions
  • For this tutorial, you should select “Ubuntu Server 18.04 LTS” as your image
  • Click “Change size” in order to select a GPU server.
    • Note that the default filters won’t show GPU servers; we recommend clearing all filters except “family” and setting “family” to GPU
    • For this tutorial, we will use the NC6 VM Size, but this should work on the larger and faster VMs as well
  • Make sure you set up SSH access; we recommend using a public key rather than a password.
    • Don’t forget to “allow selected ports” and select SSH
  • Install the Nvidia driver and CUDA (based on https://askubuntu.com/a/1036265)
    • sudo add-apt-repository ppa:graphics-drivers/ppa
    • sudo apt update
    • sudo apt-get install ubuntu-drivers-common
    • sudo ubuntu-drivers autoinstall
    • reboot: sudo shutdown -r now
    • sudo apt install nvidia-cuda-toolkit gcc-6
  • Install OS dependencies: sudo apt-get install python3-venv protobuf-compiler libprotoc-dev

Train your first model

Once you’ve installed PyText you can start training your first model!

This tutorial series is an overview of using PyText, and will cover the main concepts PyText uses to interact with the world. It won’t deal with modifying the code (e.g. hacking on new model architectures). By the end, you should have a high-quality text classification model that can be used in production.

You can use PyText as a library either in your own scripts or in a Jupyter notebook, but the fastest way to start training is through the PyText command line tool. This tool will automatically be in your path when you install PyText!

(pytext) $ pytext

Usage: pytext [OPTIONS] COMMAND [ARGS]...

  Configs can be passed by file or directly from json. If neither --config-
  file or --config-json is passed, attempts to read the file from stdin.

  Example:

    pytext train < demo/configs/docnn.json

Options:
  --config-file TEXT
  --config-json TEXT
  --help              Show this message and exit.

Commands:
  export   Convert a pytext model snapshot to a caffe2 model.
  predict  Start a repl executing examples against a caffe2 model.
  test     Test a trained model snapshot.
  train    Train a model and save the best snapshot.

Background

Fundamentally, “machine learning” means learning a function automatically. Your training, evaluation, and test datasets are examples of inputs and their corresponding outputs which show how that function behaves. A model is an implementation of that function. To train a model means to make a statistical implementation of that function that uses the training data as a rubric. To predict using a model means to take a trained implementation and apply it to new inputs, thus predicting what the result of the idealized function would be on those inputs.

Having more examples to train on usually corresponds to more accurate and better-generalizing models. This can mean thousands, millions, or even billions of examples, depending on the task (function) you’re trying to learn.
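To make the train/predict vocabulary concrete, here is a toy sketch in plain Python. It is not how PyText models work internally: the word-overlap "model", the function names train and predict, and most of the example sentences are assumptions chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Build a 'model': per-label word counts learned from (text, label) pairs."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(text.lower().split())
    return model


def predict(model, text):
    """Return the label whose training words overlap most with the input."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


# Made-up training examples: inputs paired with the outputs we want to learn.
training_data = [
    ("set an alarm for 7 am", "alarm/set_alarm"),
    ("turn on all my alarms", "alarm/set_alarm"),
    ("what is the weather today", "weather/find"),
    ("will it rain tomorrow", "weather/find"),
]
model = train(training_data)
print(predict(model, "is it going to rain"))
```

Even in this toy setting, adding more varied training sentences widens the set of inputs the function handles sensibly, which is the effect described above.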

PyText Configs

Training a state-of-the-art PyText model on a dataset is primarily about configuration. Picking your training dataset, your model parameters, your training parameters, and so on, is a central part of building high-quality text models.

Configuration is a central part of every component within PyText, and the config system that we provide allows all of these configurations to be expressed easily in JSON format. PyText ships with a number of example configurations that can train the built-in models, and we have a system for automatically documenting the default configurations and possible configuration values.
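Because a config is plain JSON, it can also be generated or edited programmatically. The sketch below builds one with Python's standard json module. The key layout and the version number mirror the demo docnn.json config that ships with PyText and may differ between releases; make_docnn_config is a hypothetical helper, not a PyText API.

```python
import json


def make_docnn_config(train_file, eval_file, test_file, version=8):
    """Return a DocNN document-classification config as a plain dict."""
    return {
        "version": version,
        "task": {
            "DocumentClassificationTask": {
                "data": {
                    "source": {
                        "TSVDataSource": {
                            "field_names": ["label", "slots", "text"],
                            "train_filename": train_file,
                            "test_filename": test_file,
                            "eval_filename": eval_file,
                        }
                    }
                },
                "model": {
                    "DocModel": {"representation": {"DocNNRepresentation": {}}}
                },
            }
        },
    }


config = make_docnn_config(
    "tests/data/train_data_tiny.tsv",
    "tests/data/test_data_tiny.tsv",
    "tests/data/test_data_tiny.tsv",
)
# The JSON text printed here is what you would save to a config file.
print(json.dumps(config, indent=2))
```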

PyText Modes

  • train - Using a configuration, initialize a model and train it. Save the best model found as a model snapshot. This snapshot is something that can be loaded back in to PyText and trained further, tested, or exported.
  • test - Load a trained model snapshot and evaluate its performance against a test set.
  • export - Save the model as a serialized Caffe2 model, which is a stable model representation that can be loaded in production. (PyTorch model snapshots aren’t very durable; if you update parts of your runtime environment, they may be invalidated).
  • predict - Provide a simple REPL which lets you run inputs through your exported Caffe2 model and get a tangible sense for how your model will behave.
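If you prefer to drive these modes from a script, a thin wrapper around the command line tool is one option. The sketch below only assembles argument lists from the options shown in the usage text above (--config-file before the command, --output-path for export); build_pytext_command and run_pytext are hypothetical helper names, not part of PyText.

```python
import subprocess

MODES = ("train", "test", "export", "predict")


def build_pytext_command(mode, config_file, extra_args=()):
    """Build an argument list such as: pytext --config-file CONFIG train."""
    if mode not in MODES:
        raise ValueError(f"unknown PyText mode: {mode!r}")
    return ["pytext", "--config-file", config_file, mode, *extra_args]


def run_pytext(mode, config_file, extra_args=()):
    """Run one mode and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(
        build_pytext_command(mode, config_file, extra_args), check=True
    )


# Print the commands for a typical train -> test -> export sequence.
steps = [
    ("train", []),
    ("test", []),
    ("export", ["--output-path", "exported_model.c2"]),
]
for mode, extra in steps:
    print(" ".join(build_pytext_command(mode, "demo/configs/docnn.json", extra)))
```

Calling run_pytext with the same arguments would execute each step, assuming the pytext tool is on your path.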

Train your first model

To get our feet wet, let’s run one of the demo configurations included with PyText.

(pytext) $ cat demo/configs/docnn.json
{
  "version": 8,
  "task": {
    "DocumentClassificationTask": {
      "data": {
        "source": {
          "TSVDataSource": {
            "field_names": ["label", "slots", "text"],
            "train_filename": "tests/data/train_data_tiny.tsv",
            "test_filename": "tests/data/test_data_tiny.tsv",
            "eval_filename": "tests/data/test_data_tiny.tsv"
          }
        }
      },
      "model": {
        "DocModel": {
          "representation": {
            "DocNNRepresentation": {}
          }
        }
      }
    }
  }
}

This config will train a document classification model (DocNN) to detect the “class” of a series of commands given to a smart assistant. Let’s take a quick look at the dataset:

(pytext) $ head -2 tests/data/train_data_tiny.tsv
alarm/modify_alarm      16:24:datetime,39:57:datetime   change my alarm tomorrow to wake me up 30 minutes earlier
alarm/set_alarm         Turn on all my alarms
(pytext) $ wc -l tests/data/train_data_tiny.tsv
    10 tests/data/train_data_tiny.tsv

As you can see, the dataset is quite small, so don’t get your hopes up on accuracy! We included this dataset for running unit tests against our models. PyText uses data in a tab separated format, as specified in the config by TSVDataSource. The order of the columns can be configured, but here we use the default. The first column is the “class”, the output label that we’re trying to predict. The second column is word-level tags, which we’re not trying to predict yet, so ignore them for now. The last column here is the input text, which is the command whose class (the first column) the model tries to predict.
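To inspect such a file outside PyText, the standard csv module is enough. The sketch below assumes the three-column layout described above and reads the word-level tags as start:end:name character offsets into the text, which is consistent with the first sample row (16:24 covers "tomorrow"). The helper names parse_slots and read_rows are hypothetical.

```python
import csv
import io

# The two sample rows shown above, with explicit tab separators.
SAMPLE_TSV = (
    "alarm/modify_alarm\t16:24:datetime,39:57:datetime\t"
    "change my alarm tomorrow to wake me up 30 minutes earlier\n"
    "alarm/set_alarm\t\tTurn on all my alarms\n"
)


def parse_slots(slots_field):
    """Turn 'start:end:name,...' into a list of (start, end, name) tuples."""
    slots = []
    for item in filter(None, slots_field.split(",")):
        start, end, name = item.split(":", 2)
        slots.append((int(start), int(end), name))
    return slots


def read_rows(tsv_file):
    """Yield (label, slots, text) for each three-column line of a TSV file."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for label, slots_field, text in reader:
        yield label, parse_slots(slots_field), text


for label, slots, text in read_rows(io.StringIO(SAMPLE_TSV)):
    print(label, [(name, text[start:end]) for start, end, name in slots])
```

For a real file, pass an open file object instead of the io.StringIO wrapper.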

Let’s train the model!

(pytext) $ pytext train < demo/configs/docnn.json
... [snip]

Stage.TEST
Epoch:1
loss: 1.646484
Accuracy: 50.00

Soft Metrics:
+--------------------------+-------------------+---------+
| Label                    | Average precision | ROC AUC |
+--------------------------+-------------------+---------+
|       alarm/modify_alarm |               nan |   0.000 |
|          alarm/set_alarm |             1.000 |   1.000 |
|       alarm/snooze_alarm |               nan |   0.000 |
| alarm/time_left_on_alarm |             0.333 |   0.333 |
|    reminder/set_reminder |             1.000 |   1.000 |
|  reminder/show_reminders |               nan |   0.000 |
|             weather/find |               nan |   0.000 |
+--------------------------+-------------------+---------+

Recall at Precision
+--------------------------+---------+---------+---------+---------+---------+
| Label                    | R@P 0.2 | R@P 0.4 | R@P 0.6 | R@P 0.8 | R@P 0.9 |
+--------------------------+---------+---------+---------+---------+---------+
| alarm/modify_alarm       |   0.000 |   0.000 |   0.000 |   0.000 |   0.000 |
| alarm/set_alarm          |   1.000 |   1.000 |   1.000 |   1.000 |   1.000 |
| alarm/snooze_alarm       |   0.000 |   0.000 |   0.000 |   0.000 |   0.000 |
| alarm/time_left_on_alarm |   1.000 |   0.000 |   0.000 |   0.000 |   0.000 |
| reminder/set_reminder    |   1.000 |   1.000 |   1.000 |   1.000 |   1.000 |
| reminder/show_reminders  |   0.000 |   0.000 |   0.000 |   0.000 |   0.000 |
| weather/find             |   0.000 |   0.000 |   0.000 |   0.000 |   0.000 |
+--------------------------+---------+---------+---------+---------+---------+
saving result to file /tmp/test_out.txt

The model ran over the training set 10 times. This output is the result of evaluating the model on the test set and tracking how well it did. If you’re not familiar with these accuracy measurements, here is a quick summary:

  • Precision - Out of every time the model guessed this label, the fraction of times it was right.
  • Recall - The number of times the model correctly identified this label, out of every time it shows up in the test set. If this number is low for a label, the model should be predicting this label more.
  • F1 - The harmonic mean of recall and precision.
  • Support - The number of times this label shows up in the test set.
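These definitions can be checked by hand on a small example. The sketch below computes them in plain Python for a hypothetical set of five test examples and predictions; it illustrates the definitions and is not PyText's own metric code.

```python
def per_label_metrics(gold, predicted):
    """Return {label: (precision, recall, f1, support)} for paired label lists."""
    metrics = {}
    for label in sorted(set(gold)):
        # Correct guesses: the model predicted this label and it was the gold label.
        true_pos = sum(g == p == label for g, p in zip(gold, predicted))
        guessed = predicted.count(label)
        support = gold.count(label)
        precision = true_pos / guessed if guessed else 0.0
        recall = true_pos / support
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[label] = (precision, recall, f1, support)
    return metrics


# Hypothetical gold labels and model guesses for five test examples.
gold = [
    "reminder/set_reminder",
    "alarm/time_left_on_alarm",
    "alarm/show_alarms",
    "alarm/set_alarm",
    "alarm/set_alarm",
]
predicted = ["reminder/set_reminder"] * 4 + ["alarm/time_left_on_alarm"]

for label, (p, r, f1, support) in per_label_metrics(gold, predicted).items():
    print(f"{label:26} P={p:.2f} R={r:.2f} F1={f1:.2f} support={support}")
```

Here reminder/set_reminder is guessed four times but is right only once, so its precision is 0.25 while its recall is 1.0.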

As you can see, the training results were pretty bad. We ran over the data 10 times, and in that time managed to learn how to predict only one of the labels in the test set successfully. In fact, many of the labels were never predicted at all! With 10 examples, that’s not too surprising. See the next tutorial to run on a real dataset and get more usable results.

Execute your first model

In Train your first model, we learnt how to train a small, simple model. We can continue this tutorial with that model here. This procedure can be used for any PyText model by supplying the matching config. For example, the much more powerful model from Train Intent-Slot model on ATIS Dataset can be executed using this same procedure.

Evaluate the model

We want to run the model on our test dataset and see how well it performs. Some results have been abbreviated for clarity.

(pytext) $ pytext test < demo/configs/docnn.json

Stage.TEST
loss: 2.059336
Accuracy: 20.00

Macro P/R/F1 Scores:
    Label                       Precision   Recall      F1          Support

    reminder/set_reminder       25.00       100.00      40.00       1
    alarm/time_left_on_alarm    0.00        0.00        0.00        1
    alarm/show_alarms           0.00        0.00        0.00        1
    alarm/set_alarm             0.00        0.00        0.00        2
    Overall macro scores        6.25        25.00       10.00

Soft Metrics:
    Label       Average precision
    alarm/set_alarm 50.00
    alarm/time_left_on_alarm    20.00
    reminder/set_reminder   25.00
    alarm/show_alarms   20.00
    weather/find    nan
    alarm/modify_alarm  nan
    alarm/snooze_alarm  nan
    reminder/show_reminders nan
    Label       Recall at precision 0.2
    alarm/set_alarm 100.00
    Label       Recall at precision 0.4
    alarm/set_alarm 100.00
    Label       Recall at precision 0.6
    alarm/set_alarm 0.00
    Label       Recall at precision 0.8
    alarm/set_alarm 0.00
    Label       Recall at precision 0.9
    alarm/set_alarm 0.00
    Label       Recall at precision 0.2
    alarm/time_left_on_alarm    100.00
    Label       Recall at precision 0.4
    alarm/time_left_on_alarm    0.00
    Label       Recall at precision 0.6
    alarm/time_left_on_alarm    0.00
... [snip]
    reminder/show_reminders 0.00
    Label       Recall at precision 0.6
    reminder/show_reminders 0.00
    Label       Recall at precision 0.8
    reminder/show_reminders 0.00
    Label       Recall at precision 0.9
    reminder/show_reminders 0.00

Export the model

When you save a PyTorch model, the snapshot uses pickle for serialization. This means that simple code changes (e.g. a word embedding update) can cause backward incompatibilities with your deployed model. To combat this, you can export your model into the Caffe2 format using the built-in ONNX integration. The exported Caffe2 model will have the same behavior regardless of changes in PyText or in your development code.

Exporting a model is pretty simple:

(pytext) $ pytext export --help
Usage: pytext export [OPTIONS]

  Convert a pytext model snapshot to a caffe2 model.

Options:
  --model TEXT        the pytext snapshot model file to load
  --output-path TEXT  where to save the exported model
  --help              Show this message and exit.

You can also pass in a configuration to infer some of these options. Let’s do that here, because your snapshot might be in a different place depending on how you’re following along!

(pytext) $ pytext export --output-path exported_model.c2 < demo/configs/docnn.json
...[snip]
Saving caffe2 model to: exported_model.c2

This file now contains all of the information needed to run your model.

There’s an important distinction between what a model does and what happens before/after the model is called, i.e. the preprocessing and postprocessing steps. PyText strives to do as little preprocessing as possible, but one step that is very often needed is tokenization of the input text. This will happen automatically with our prediction interface, and if this behavior ever changes, we’ll make sure that old models are still supported. The model file you export will always work, and you don’t necessarily need PyText to use it! Depending on your use case you can implement preprocessing yourself and call the model directly, but that’s outside the scope of this tutorial.

Make a simple app

Let’s put this all into practice! How might we make a simple web app that loads an exported model and does something meaningful with it?

To run the following code, you should

(pytext) $ pip install flask

Then we implement a minimal Flask web server.

import sys
import flask
import pytext

config_file = sys.argv[1]
model_file = sys.argv[2]

config = pytext.load_config(config_file)
predictor = pytext.create_predictor(config, model_file)

app = flask.Flask(__name__)

@app.route('/get_flight_info', methods=['GET', 'POST'])
def get_flight_info():
    text = flask.request.data.decode()

    # Pass the inputs to PyText's prediction API
    result = predictor({"text": text})

    # Results is a list of output blob names and their scores.
    # The blob names are different for joint models vs doc models
    # Since this tutorial is for both, let's check which one we should look at.
    doc_label_scores_prefix = (
        'scores:' if any(r.startswith('scores:') for r in result)
        else 'doc_scores:'
    )

    # For now let's just output the top document label!
    best_doc_label = max(
        (label for label in result if label.startswith(doc_label_scores_prefix)),
        key=lambda label: result[label][0],
    # Strip the doc label prefix here
    )[len(doc_label_scores_prefix):]

    return flask.jsonify({"question": f"Are you asking about {best_doc_label}?"})

app.run(host='0.0.0.0', port=8080, debug=True)
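The label-selection logic in the handler above can be exercised without a server or a model. The sketch below pulls it into a standalone function and runs it on a hand-written dictionary. The function name best_doc_label and the scores in fake_result are hypothetical; only the blob-name prefixes come from the code above.

```python
def best_doc_label(result):
    """Pick the highest-scoring document label from a prediction result.

    `result` maps blob names such as 'scores:<label>' or 'doc_scores:<label>'
    to lists of scores, as in the handler above.
    """
    prefix = (
        "scores:" if any(name.startswith("scores:") for name in result)
        else "doc_scores:"
    )
    candidates = [name for name in result if name.startswith(prefix)]
    best = max(candidates, key=lambda name: result[name][0])
    # Strip the prefix so only the label itself is returned.
    return best[len(prefix):]


# Made-up scores, shaped like the output of a joint model.
fake_result = {
    "doc_scores:flight": [-0.01],
    "doc_scores:airfare": [-4.6],
    "word_scores:B-toloc.city_name": [-9.2, -0.1],
}
print(best_doc_label(fake_result))
```

Keeping this logic in its own function also makes it easy to unit test separately from Flask.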

Execute the app

(pytext) $ python flask_app.py demo/configs/docnn.json exported_model.c2
* Serving Flask app "flask_app" (lazy loading)
* Environment: production
  WARNING: Do not use the development server in a production environment.
  Use a production WSGI server instead.
* Debug mode: on

Then in a separate terminal window

$ function ask_about() { curl http://localhost:8080/get_flight_info -H "Content-Type: text/plain" -d "$1"; }

$ ask_about 'I am looking for flights from San Francisco to Minneapolis'
{
  "question": "Are you asking about flight?"
}

$ ask_about 'How much does a trip to NY cost?'
{
  "question": "Are you asking about airfare?"
}

$ ask_about "Which airport should I go to?"
{
  "question": "Are you asking about airport?"
}
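The same requests can be sent from Python using only the standard library. This is a minimal sketch that assumes the app is listening on localhost:8080 as above; build_request and ask_about are hypothetical helper names.

```python
import json
import urllib.request

URL = "http://localhost:8080/get_flight_info"


def build_request(text, url=URL):
    """Build a POST request carrying the utterance as a plain-text body."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def ask_about(text, url=URL):
    """Send the utterance to the running app and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the app running, ask_about("Which airport should I go to?") should return a dictionary with a "question" key, mirroring the curl output above.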

Visualize Model Training with TensorBoard

Visualizations can help you better understand, debug and optimize your models during training. By default, all models trained using PyText can be visualized using TensorBoard (https://www.tensorflow.org/guide/summaries_and_tensorboard).

Here, we will explore how to visualize the model from the tutorial Train Intent-Slot model on ATIS Dataset.

1. Install TensorBoard visualization server

The TensorBoard web server is required to host your visualizations. To install it, run

$ pip install tensorboard

2. Verify TensorBoard events in current working directory

Complete the tutorial from Train Intent-Slot model on ATIS Dataset if you have not done so. Once that is done, you should be able to see a TensorBoard events file in the working directory where you trained your model. The file path will be something like <WORKING_DIR>/runs/<DATETIME>_<MACHINE_NAME>/events.out.tfevents.<TIMESTAMP>.<MACHINE_NAME>.
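One way to check for the events file from Python is to search for the path pattern described above. This is a small sketch: latest_events_folder is a hypothetical helper, and it assumes the runs/<DATETIME>_<MACHINE_NAME>/events.out.tfevents.* layout holds in your working directory.

```python
import glob
import os


def latest_events_folder(working_dir="."):
    """Return the run folder holding the newest TensorBoard events file, or None."""
    pattern = os.path.join(working_dir, "runs", "*", "events.out.tfevents.*")
    event_files = glob.glob(pattern)
    if not event_files:
        return None
    # Pick the most recently modified events file and report its folder.
    newest = max(event_files, key=os.path.getmtime)
    return os.path.dirname(newest)


print(latest_events_folder() or "no events files found under ./runs")
```

The folder it prints is the one to pass to TensorBoard as the log directory.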

3. Launch the visualization server

To launch the visualization server, run:

$ tensorboard --logdir=$EVENTS_FOLDER

$EVENTS_FOLDER is the folder containing the events file in 2., which is something like <WORKING_DIR>/runs/<DATETIME>_<MACHINE_NAME>.

Note: The TensorBoard web server might fail to run if TensorFlow is not installed. This dependency is not ideal, but if you see ModuleNotFoundError: No module named 'tensorflow' when running the above command, you can install TensorFlow using:

$ pip install tensorflow

4. View your visualizations

After launching the visualization server, you can view your visualizations in a web browser at http://localhost:6006.

PyText visualizes the training metrics as scalars, test metrics as texts, and also the shape of the neural network architecture graph. Below are some screenshots of what you will see:

Training Metrics:

PyText TensorBoard training metrics

Test Metrics:

PyText TensorBoard test metrics

Model Graph:

PyText TensorBoard model graph

Use PyText models in your app

Once you have a PyText model exported to Caffe2, you can host it on a simple web server in the cloud. Then your applications (web/mobile) can make requests to this server and use the returned predictions from the model.

In this tutorial, we’ll take the intent-slot model trained in Train Intent-Slot model on ATIS Dataset, and host it on a Flask server running on an Amazon EC2 instance. Then we’ll write an iOS app which can identify city names in users’ messages by querying the server.

1. Set up an EC2 instance

Amazon EC2 is a service which lets you host servers in the cloud for any arbitrary purpose. Use the official documentation to sign up, create an IAM profile and a key pair. Sign in to the EC2 Management Console and launch a new instance with the default Amazon Linux 2 AMI. In the Configure Security Group step, Add a Rule with type HTTP and port 80.

Connect to your instance using the steps in the EC2 documentation. Once you’re logged in, install the required dependencies:

$ cd ~
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
$ chmod +x miniconda.sh
$ ./miniconda.sh -b -p ~/miniconda
$ rm -f miniconda.sh
$ source ~/miniconda/bin/activate

$ conda install -y protobuf
$ conda install -y boto3 flask future numpy pip
$ conda install -y pytorch -c pytorch

$ sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to 8080

We’ll make the server listen on port 8080 (an arbitrary choice) and redirect requests coming to port 80 (HTTP), since running a server on the latter requires administrative privileges.

2. Implement and test the server

Upload your trained model (models/atis_joint_model.c2) and the server files (demo/flask_server/*) to the instance using scp.

The server handles a GET request with a text field by running it through the model and returning the output as JSON.

@app.route('/')
def predict():
    return json.dumps(atis.predict(request.args.get('text', '')))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

The code in demo/flask_server/atis.py does the pre-processing (tokenization) and post-processing (extract spans of city names) specific to the ATIS model.
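To give a feel for the post-processing step, here is a sketch of turning per-word slot labels into character spans of city names. It is not the code from demo/flask_server/atis.py: the function name city_name_spans, the whitespace tokenization, and the hand-written labels are all assumptions made for illustration.

```python
def city_name_spans(text, labels):
    """Return (start, end, slot) character spans for city-name slots.

    `labels` holds one BIO tag per whitespace-separated token of `text`,
    e.g. 'B-toloc.city_name', 'I-toloc.city_name' or 'O'.
    """
    spans = []
    offset = 0
    open_slot = None  # slot of the city name the previous token belonged to
    for token, label in zip(text.split(), labels):
        start = text.index(token, offset)
        end = offset = start + len(token)
        prefix, _, slot = label.partition("-")
        if not slot.endswith("city_name"):
            open_slot = None
            continue
        if prefix == "I" and open_slot == slot:
            # Continuation of the city name in progress: extend its span.
            spans[-1] = (spans[-1][0], end, slot)
        else:
            spans.append((start, end, slot))
        open_slot = slot
    return spans


text = "Flights from Seattle to San Francisco"
labels = ["O", "O", "B-fromloc.city_name", "O",
          "B-toloc.city_name", "I-toloc.city_name"]
for start, end, slot in city_name_spans(text, labels):
    print(slot, (start, end), text[start:end])
```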

Run the server using

$ python server.py

Test it out by finding your IPv4 Public IP on the EC2 Management Console page and pointing your browser to it. The server will respond with the character spans of the city names, for example:

_images/flask_www.png

3. Implement the iOS app

Install Xcode and CocoaPods if you haven’t already.

We use the open-source MessageKit to bootstrap our iOS app. Clone the app from our sister repository, and run:

$ pod install
$ open PyTextATIS.xcworkspace

The comments in ViewController.swift explain the modifications over the base code. Change the IP address in that file to your instance’s and run the app!

PyText ATIS iOS Demo

Serve Models in Production

We have seen how to use PyText models in an app using Flask in the previous tutorial, but the server implementation still requires a Python runtime. Caffe2 models are designed to perform well even in production scenarios with high requirements for performance and scalability.

In this tutorial, we will implement a Thrift server in C++, in order to extract the maximum performance from our exported Caffe2 intent-slot model trained on the ATIS dataset. We will also prepare a Docker image which can be deployed to your cloud provider of choice.

The full source code for the implemented server in this tutorial can be found in the demos directory.

To complete this tutorial, you will need to have Docker installed.

1. Create a Dockerfile and install dependencies

The first step is to prepare our Docker image with the necessary dependencies. In an empty folder, create a Dockerfile with the following contents:

Dockerfile

FROM ubuntu:16.04

# Install Caffe2 + dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
  build-essential \
  git \
  libgoogle-glog-dev \
  libgtest-dev \
  libiomp-dev \
  libleveldb-dev \
  liblmdb-dev \
  libopencv-dev \
  libopenmpi-dev \
  libsnappy-dev \
  openmpi-bin \
  openmpi-doc \
  python-dev \
  python-pip
RUN pip install --upgrade pip
RUN pip install setuptools wheel
RUN pip install future numpy protobuf typing hypothesis pyyaml
RUN apt-get install -y --no-install-recommends \
      libgflags-dev \
      cmake
RUN git clone https://github.com/pytorch/pytorch.git
WORKDIR pytorch
RUN git submodule update --init --recursive
RUN python setup.py install

# Install Thrift + dependencies
WORKDIR /
RUN apt-get update && apt-get install -y \
  libboost-dev \
  libboost-test-dev \
  libboost-program-options-dev \
  libboost-filesystem-dev \
  libboost-thread-dev \
  libevent-dev \
  automake \
  libtool \
  curl \
  flex \
  bison \
  pkg-config \
  libssl-dev
RUN curl https://www-us.apache.org/dist/thrift/0.11.0/thrift-0.11.0.tar.gz --output thrift-0.11.0.tar.gz
RUN tar -xvf thrift-0.11.0.tar.gz
WORKDIR thrift-0.11.0
RUN ./bootstrap.sh
RUN ./configure
RUN make
RUN make install

2. Add Thrift API

Thrift is a software library for developing scalable cross-language services. It comes with a client code generation engine enabling services to be interfaced across the network on multiple languages or devices. We will use Thrift to create a service which serves our model.

Our C++ server will expose a very simple API that receives a sentence/utterance as a string and returns a map of label names (string) -> scores (list<double>). For document scores, the list will only contain one score, and for word scores, the list will contain one score per word. The corresponding Thrift spec for the API is below:

predictor.thrift

namespace cpp predictor_service

service Predictor {
   // Returns list of scores for each label
   map<string,list<double>> predict(1:string doc),
}

3. Implement server code

Now, we will write our server’s code. The first thing our server needs to be able to do is to load the model from a file path into the Caffe2 workspace and initialize it. We do that in the constructor of our PredictorHandler thrift server class:

server.cpp

class PredictorHandler : virtual public PredictorIf {
  private:
    NetDef mPredictNet;
    Workspace mWorkspace;

    NetDef loadAndInitModel(Workspace& workspace, string& modelFile) {
      auto db = unique_ptr<DBReader>(new DBReader("minidb", modelFile));
      auto metaNetDef = runGlobalInitialization(move(db), &workspace);
      const auto predictInitNet = getNet(
        *metaNetDef.get(),
        PredictorConsts::default_instance().predict_init_net_type()
      );
      CAFFE_ENFORCE(workspace.RunNetOnce(predictInitNet));

      auto predictNet = NetDef(getNet(
        *metaNetDef.get(),
        PredictorConsts::default_instance().predict_net_type()
      ));
      CAFFE_ENFORCE(workspace.CreateNet(predictNet));

      return predictNet;
    }
...
  public:
    PredictorHandler(string &modelFile): mWorkspace("workspace") {
      mPredictNet = loadAndInitModel(mWorkspace, modelFile);
    }
...
}

Now that our model is loaded, we need to implement the predict API method which is our main interface to clients. The implementation needs to do the following:

  1. Pre-process the input sentence into tokens
  2. Feed the input as tensors to the model
  3. Run the model
  4. Extract and populate the results into the response

server.cpp

class PredictorHandler : virtual public PredictorIf {
...
  public:
    void predict(map<string, vector<double>>& _return, const string& doc) {
      // Pre-process: tokenize input doc
      vector<string> tokens;
      string docCopy = doc;
      tokenize(tokens, docCopy);

      // Feed input to model as tensors
      Tensor valTensor = TensorCPUFromValues<string>(
        {static_cast<int64_t>(1), static_cast<int64_t>(tokens.size())}, {tokens}
      );
      BlobGetMutableTensor(mWorkspace.CreateBlob("tokens_vals_str:value"), CPU)
        ->CopyFrom(valTensor);
      Tensor lensTensor = TensorCPUFromValues<int>(
        {static_cast<int64_t>(1)}, {static_cast<int>(tokens.size())}
      );
      BlobGetMutableTensor(mWorkspace.CreateBlob("tokens_lens"), CPU)
        ->CopyFrom(lensTensor);

      // Run the model
      CAFFE_ENFORCE(mWorkspace.RunNet(mPredictNet.name()));

      // Extract and populate results into the response
      for (int i = 0; i < mPredictNet.external_output().size(); i++) {
        string label = mPredictNet.external_output()[i];
        _return[label] = vector<double>();
        Tensor scoresTensor = mWorkspace.GetBlob(label)->Get<Tensor>();
        for (int j = 0; j < scoresTensor.numel(); j++) {
          float score = scoresTensor.data<float>()[j];
          _return[label].push_back(score);
        }
      }
    }
...
}

The full source code for server.cpp can be found here.

Note: The source code in the demo also implements a REST proxy for the Thrift server to make it easy to test and make calls over simple HTTP. It is not covered in this tutorial, since the Thrift protocol is what we’ll use in production.

4. Build and compile scripts

To build our server, we need to provide necessary headers during compile time and the required dependent libraries during link time: libthrift.so, libcaffe2.so, libprotobuf.so and libc10.so. The Makefile below does this:

Makefile

CPPFLAGS += -g -std=c++11 -std=c++14 \
  -I./gen-cpp \
  -I/pytorch -I/pytorch/build \
      -I/pytorch/aten/src/ \
      -I/pytorch/third_party/protobuf/src/
CLIENT_LDFLAGS += -lthrift
SERVER_LDFLAGS += -L/pytorch/build/lib -lthrift -lcaffe2 -lprotobuf -lc10

# ...

server: server.o gen-cpp/Predictor.o
      g++ $^ $(SERVER_LDFLAGS) -o $@

clean:
      rm -f *.o server

In our Dockerfile, we also add some steps to copy our local files into the docker image, compile the app, and add the necessary library search paths.

Dockerfile

# Copy local files to /app
COPY . /app
WORKDIR /app

# Compile app
RUN thrift -r --gen cpp predictor.thrift
RUN make

# Add library search paths
RUN echo '/pytorch/build/lib/' >> /etc/ld.so.conf.d/local.conf
RUN echo '/usr/local/lib/' >> /etc/ld.so.conf.d/local.conf
RUN ldconfig

5. Test/Run the server

This section assumes that your local files match those found here.

Now that you have implemented your server, we will run the following commands to take it for a test run. In your server folder:

  1. Build the image:
$ docker build -t predictor_service .

If successful, you should see the message “Successfully tagged predictor_service:latest”.

  2. Run the server. We use models/atis_joint_model.c2 as the local path to our model file (add your trained model there):
$ docker run -it -p 8080:8080 predictor_service:latest ./server models/atis_joint_model.c2

If successful, you should see the message “Server running. Thrift port: 9090, REST port: 8080”

  3. Test our server by sending a test utterance “Flights from Seattle to San Francisco”:
$ curl -G "http://localhost:8080" --data-urlencode "doc=Flights from Seattle to San Francisco"

If successful, you should see the scores printed out on the console. On further inspection, the doc score for “flight”, the 3rd word score for “B-fromloc.city_name” corresponding to “Seattle”, the 5th word score for “B-toloc.city_name” corresponding to “San”, and the 6th word score for “I-toloc.city_name” corresponding to “Francisco” should be close to 0.

doc_scores:flight:-2.07426e-05
word_scores:B-fromloc.city_name:-14.5363 -12.8977 -0.000172928 -12.9868 -9.94603 -16.0366
word_scores:B-toloc.city_name:-15.2309 -15.9051 -9.89932 -12.077 -0.000134 -8.52712
word_scores:I-toloc.city_name:-13.1989 -16.8094 -15.9375 -12.5332 -10.7318 -0.000501401
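To check the response programmatically rather than by eye, the lines above can be parsed and the best-scoring token located for each slot label. The following is a minimal Python sketch. It assumes the response is plain text in exactly the name:label:scores layout printed above and that the utterance is tokenized on whitespace; the helper name parse_scores is ours and is not part of the demo.

```python
def parse_scores(response_text):
    """Parse lines like 'word_scores:B-toloc.city_name:-15.2 -0.0001'
    into {(kind, label): [float, ...]}. Layout assumed from the sample output."""
    scores = {}
    for line in response_text.strip().splitlines():
        kind, label, values = line.strip().split(":", 2)
        scores[(kind, label)] = [float(v) for v in values.split()]
    return scores


sample = """\
doc_scores:flight:-2.07426e-05
word_scores:B-fromloc.city_name:-14.5363 -12.8977 -0.000172928 -12.9868 -9.94603 -16.0366
word_scores:B-toloc.city_name:-15.2309 -15.9051 -9.89932 -12.077 -0.000134 -8.52712
word_scores:I-toloc.city_name:-13.1989 -16.8094 -15.9375 -12.5332 -10.7318 -0.000501401
"""

# Assumption: one score per whitespace-separated token of the utterance.
tokens = "Flights from Seattle to San Francisco".split()
parsed = parse_scores(sample)

best_token = {}
for (kind, label), values in parsed.items():
    if kind == "word_scores":
        # Scores are log probabilities, so the value closest to 0 wins.
        best_index = max(range(len(values)), key=values.__getitem__)
        best_token[label] = tokens[best_index]

print(best_token)
```

For the sample above, each slot label maps to the token named in the text: “Seattle”, “San” and “Francisco”.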

Congratulations! You have now built your own server that can serve your PyText models in production!

We also provide a Docker image on Docker Hub with this example, which you can freely use and adapt to your needs.

Config Files Explained

PyText Models and training Tasks contain many components, and each component expects many parameters to define its behavior. PyText uses a config to specify those parameters. The config can be loaded from a JSON file, which is what we describe here.

Structure of a Config File

A typical config file only contains the parameters specific to your project. Here’s a fully working JSON file, and it does not need to be more complicated than this:

{
    "task": {
        "DocumentClassificationTask": {
            "data": {
                "source": {
                    "TSVDataSource": {
                        "field_names": ["label", "text"],
                        "train_filename": "my/data/train.tsv",
                        "eval_filename": "my/data/eval.tsv",
                        "test_filename": "my/data/test.tsv"
                    }
                }
            },
            "model": {
                "embedding": {
                    "embed_dim": 200
                }
            }
        }
    },
    "version": 15
}

At the top level, the most important settings are the “task” and the “version”. “task” defines the Task component to be used, which specifies where to get the “data”, which “model” to train, which “trainer” to use, and which “metric_reporter” will present the results.

Each of those parameters can be a Component that is specified by its class name, or omitted to use the default class with its default parameters. In the example above, we specify TSVDataSource to use this class, but we skip the model class name because we want to use the default DocModel.

The “version” number helps PyText maintain backwards compatibility. PyText will use config adapters to internally try and update the configs to match the latest component parameters so you don’t have to keep changing your configs at each PyText update. To manually update your config to the latest version, you can use the update-config command.

Parameters in Config File

Parameters are either a component or a value. In the config above, we see that “field_names” expects a list of strings, “train_filename” expects a string, and “embed_dim” expects an integer.

“source” and “model”, however, expect a component. As we’ve seen in the previous section, we can optionally specify the class name of a component if we decide to use a component that is not the default. We can tell whether a key is a class name or a parameter name by looking at its first letter: class names start with an upper-case letter. For “source” we decided to specify TSVDataSource, but for “model” we did not, and let DocumentClassificationTask use its default DocModel. We could have specified the class name like this, and that would be equivalent:

"model": {
    "DocModel": {
        "embedding": {
            "embed_dim": 200
        }
    }
}

The default representation for DocModel is BiLSTMDocAttention. We did not specify “representation” before because we were happy with this default. If we decide to use DocNNRepresentation instead, we modify the config like this:

"model": {
    "embedding": {
        "embed_dim": 200
    },
    "representation": {
        "DocNNRepresentation": {
        }
    }
}

In this example we just want to change the class of “representation” and use its default parameters, so we don’t need to specify any of them and we can leave its parameters set empty {}.
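The upper-case convention for class names can also be checked mechanically. The sketch below is an illustration in plain Python, not a PyText API: the helper component_classes is a name we made up, and it simply walks a parsed config and reports every key that starts with an upper-case letter together with the parameter path it was found under.

```python
import json


def component_classes(config, path=""):
    """Yield (parameter_path, class_name) for every key that starts upper-case."""
    if isinstance(config, dict):
        for key, value in config.items():
            if key[:1].isupper():
                yield path, key
                # A class name selects a component; it is not part of the path.
                yield from component_classes(value, path)
            else:
                child = f"{path}.{key}" if path else key
                yield from component_classes(value, child)


config = json.loads("""
{
  "task": {
    "DocumentClassificationTask": {
      "data": {"source": {"TSVDataSource": {"field_names": ["label", "text"]}}},
      "model": {"representation": {"DocNNRepresentation": {}}}
    }
  }
}
""")

found = list(component_classes(config))
for parameter_path, class_name in found:
    print(f"{parameter_path} -> {class_name}")
```

Note that “model” does not appear in the result: its class name was omitted, so the default class is used.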

To explore more component parameters and their possible values, you can use the help-config command or browse the class documentation.

Changing a Config File

Users typically start with an existing config file, or create one using the gen-default-config command, and then edit it to tweak the parameters.

The file generated by gen-default-config is very large, because it contains the default value of every parameter for every component. Any of those parameters can be omitted from the config file, because PyText can recover their default values.

In general, remove from your config file every parameter you don’t intend to override, and keep only those you override now or might want to tweak later.

For example, TSVDataSource can use a different “delimiter”, but in most cases we want the default “\t” for tab-separated values (TSV) files, so the config above does not specify “delimiter”: “\t”. If we wanted to load a CSV file, we could override this default by adding our own “delimiter” to our config. Since CSV fields can be quoted, unlike TSV fields, we would also override “quoted”, whose default is false, with true:

"TSVDataSource": {
    "delimiter": ",",
    "quoted": true,
    "field_names": ["label", "text"],
    "train_filename": "my/data/train.csv",
    "eval_filename": "my/data/eval.csv",
    "test_filename": "my/data/test.csv"
}
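To make the effect of these two options concrete, here is a small sketch using Python’s standard csv module. It only mirrors what “delimiter” and “quoted” mean; it is not how TSVDataSource is implemented, and the helper read_rows and the sample rows are made up for illustration.

```python
import csv
import io


def read_rows(text, field_names, delimiter="\t", quoted=False):
    """Read delimited text into dicts keyed by field_names."""
    # With quoted=False, quote characters are treated as ordinary text.
    quoting = csv.QUOTE_MINIMAL if quoted else csv.QUOTE_NONE
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=quoting)
    return [dict(zip(field_names, row)) for row in reader]


# Defaults: tab-separated, unquoted.
tsv_rows = read_rows("positive\tgreat movie\n", ["label", "text"])

# CSV: the quoted field may itself contain the delimiter.
csv_rows = read_rows(
    'negative,"slow, dull plot"\n', ["label", "text"], delimiter=",", quoted=True
)

print(tsv_rows)
print(csv_rows)
```

Without quoted set to true, the comma inside the second field would be read as a field separator.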

The config at the top of this page is a fully working example. It could be simplified even further by removing the “model” section if you don’t want to change any of the model parameters; here it is kept in order to override “embed_dim”.

JSON Format Primer

A few notes about the JSON syntax and the differences with python:

  • field names and string values should all be quoted with “double-quotes”
  • booleans are lower case: true, false
  • no trailing comma (after the last value of a block)
  • empty value is: null
  • indentation is optional but recommended for readability
  • the first character must be { and the last one must be }
  • obviously all brackets must be balanced: {}, []
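A quick way to validate a config against these rules is to load it with Python’s standard json module, as in the sketch below. The helper is_valid_json is our own name and the snippets are illustrative; the module itself maps true, false and null to True, False and None, and rejects trailing commas and single-quoted strings.

```python
import json

parsed = json.loads(
    '{"quoted": true, "exporter": null, "field_names": ["label", "text"]}'
)
print(parsed)  # JSON true/null arrive as Python True/None


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


trailing_comma_ok = is_valid_json('{"embed_dim": 200,}')
single_quotes_ok = is_valid_json("{'embed_dim': 200}")
print(trailing_comma_ok, single_quotes_ok)
```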

Config Commands

This page explains the usage of the commands help-config to explore PyText components, and gen-default-config to create a config file with custom components and parameters.

Exploring Config Options

You can explore PyText Components with the command help-config. This will print the documentation of the component, its full module name, its base class, as well as the list of its config parameters, their type and their default value.

$ pytext help-config LMTask
=== pytext.task.tasks.LMTask (NewTask) ===
    data = Data
    exporter = null
    features = FeatureConfig
    featurizer = SimpleFeaturizer
    metric_reporter: LanguageModelMetricReporter = LanguageModelMetricReporter
    model: LMLSTM = LMLSTM
    trainer = TaskTrainer

You can drill down to the component you’re interested in. For example, if you want to know more about the model LMLSTM, you can use the same command. Notice how PyText lists the possible values for Union types (for example with representation below.)

$ pytext help-config LMLSTM
=== pytext.models.language_models.lmlstm.LMLSTM (BaseModel) ===
"""
`LMLSTM` implements a word-level language model that uses LSTMs to
    represent the document.
"""
    ModelInput = LMLSTM.Config.ModelInput
    caffe2_format: (ExporterType)
         PREDICTOR (default)
         INIT_PREDICT
    decoder: (one of)
         None
         MLPDecoder (default)
    embedding: WordFeatConfig = WordEmbedding
    inputs: LMLSTM.Config.ModelInput = ModelInput
    output_layer: LMOutputLayer = LMOutputLayer
    representation: (one of)
         DeepCNNRepresentation
         BiLSTM (default)
    stateful: bool
    tied_weights: bool

PyText internally registers all the component classes, so we can look up and find any component using the class name or their aliases. For example somewhere in PyText we have import DeepCNNRepresentation as CNN, so we would normally look up DeepCNNRepresentation, but if we know that this class has an alias we can look up CNN instead, and print the information about this class:

$ pytext help-config CNN
=== pytext.models.representations.deepcnn.DeepCNNRepresentation (RepresentationBase) ===
"""
`DeepCNNRepresentation` implements CNN representation layer
    preceded by a dropout layer. CNN representation layer is based on the encoder
    in the architecture proposed by Gehring et. al. in Convolutional Sequence to
    Sequence Learning.

    Args:
        config (Config): Configuration object of type DeepCNNRepresentation.Config.
        embed_dim (int): The number of expected features in the input.
"""
    cnn: CNNParams = CNNParams
    dropout: float = 0.3

Creating a Config File

The command gen-default-config creates a JSON config file for a given Task using the default value for all the parameters. You must specify the class name of the Task. The JSON config is printed in the terminal, so you need to redirect it to a file of your choice (for example my_config.json) to be able to edit it and use it.

$ pytext gen-default-config LMTask > my_config.json
INFO - Applying task option: LMTask
...

In the help-config LMLSTM above, we see that representation is by default BiLSTM, but could also be DeepCNNRepresentation. (This can be because the type is declared as a Union of valid alternatives, or because the type is a base class.) Those two classes will have different parameters, so we can’t just edit the my_config.json and replace the class name.

We can specify which components to use by adding any number of class names to the command. Let’s create this config, and add DeepCNNRepresentation to our command. gen-default-config will look up this class name and find that it is a suitable representation component for the LMLSTM model in our LMTask.

$ pytext gen-default-config LMTask DeepCNNRepresentation > my_config.json
INFO - Applying task option: LMTask
INFO - Applying class option: task->model->representation = CNN
...

This also works with parameters which are not component class names. You can specify the parameter name and its value, and gen-default-config will automatically apply this parameter to the right component.

$ pytext gen-default-config LMTask epochs=200
INFO - Applying task option: LMTask
INFO - Applying parameter option to task.trainer.epochs : epochs=200
...

Sometimes the same parameter name is used by multiple components. In this case PyText prints the list of those parameters with their full config path. You can then simply use the last part of the path that is enough to differentiate them and pick the one you want. In the next example, we omit the prefix task.model. because we don’t need it to find where to apply our parameter representation.dropout.

$ pytext gen-default-config LMTask dropout=0.7 > my_config.json
INFO - Applying task option: LMTask
...
Exception: Multiple possibilities for dropout=0.7: task.model.representation.dropout, task.model.decoder.dropout

$ pytext gen-default-config LMTask representation.dropout=0.7 > my_config.json
INFO - Applying task option: LMTask
INFO - Applying parameter option to task.model.representation.dropout : representation.dropout=0.7
...
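The disambiguation rule can be pictured as suffix matching on dotted paths. The following sketch is our own illustration of that idea, not PyText’s code: the helpers leaf_paths and resolve are hypothetical, and the nested dict is a trimmed stand-in for a real config with placeholder values.

```python
def leaf_paths(config, prefix=""):
    """Yield the dotted path of every non-dict value in a nested dict."""
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path


def resolve(config, option_name):
    """Return every full path whose trailing components equal option_name."""
    wanted = option_name.split(".")
    return [
        path
        for path in leaf_paths(config)
        if path.split(".")[-len(wanted):] == wanted
    ]


# Placeholder values; only the structure matters here.
config = {
    "task": {
        "model": {
            "representation": {"dropout": 0.0},
            "decoder": {"dropout": 0.0},
        },
        "trainer": {"epochs": 1},
    }
}

ambiguous = resolve(config, "dropout")
unique = resolve(config, "representation.dropout")
print(ambiguous)
print(unique)
```

A name that resolves to more than one path is ambiguous; adding just enough of the path to leave a single match is sufficient.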

You can add any number and combination of those parameters. Please note that they are applied in order, so if you want to change a component class and some of its parameters, you must specify them in this order: component first, then parameters. If you don’t, your parameter changes will be ignored. For example, changing representation.dropout first and then overriding the representation component replaces the modified representation with a new CNN component whose parameters all have their default values.

Look at this bad example: you can verify that the representation dropout is 0.3 (the default value for CNN) and not 0.7 as we specified, because CNN was applied after and replaced the component that had its dropout modified first.

$ pytext gen-default-config LMTask representation.dropout=0.7 CNN > my_config.json
INFO - Applying task option: LMTask
INFO - Applying parameter option to task.model.representation.dropout : representation.dropout=0.7
INFO - Applying class option: task->model->representation = CNN
...
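Why the order matters can be reproduced with a toy model of the option processing. This is a sketch under our own assumptions, not the gen-default-config implementation: apply_options is a hypothetical helper, and the BiLSTM default below is a placeholder (only the CNN default of 0.3 comes from the help-config output above).

```python
# Placeholder defaults for illustration; only CNN's 0.3 is taken from the docs.
COMPONENT_DEFAULTS = {
    "BiLSTM": {"dropout": 0.4},
    "CNN": {"dropout": 0.3},
}


def apply_options(options):
    """Apply options left to right to a toy 'representation' slot."""
    representation = {"class": "BiLSTM", **COMPONENT_DEFAULTS["BiLSTM"]}
    for option in options:
        if "=" in option:
            # Parameter option: overrides a value on the current component.
            name, value = option.split("=", 1)
            representation[name.split(".")[-1]] = float(value)
        else:
            # Class option: replaces the whole component with fresh defaults.
            representation = {"class": option, **COMPONENT_DEFAULTS[option]}
    return representation


bad = apply_options(["representation.dropout=0.7", "CNN"])
good = apply_options(["CNN", "representation.dropout=0.7"])
print(bad)
print(good)
```

In the first call the dropout override is lost when the component is replaced; in the second it is applied to the new component.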

Now let’s combine everything:

$ pytext gen-default-config LMTask BlockShardedTSVDataSource CNN dilated=True epochs=200 representation.dropout=0.7 > my_config.json
INFO - Applying task option: LMTask
INFO - Applying class option: task->data->source = BlockShardedTSVDataSource
INFO - Applying class option: task->model->representation = CNN
INFO - Applying parameter option to task.model.representation.cnn.dilated : dilated=True
INFO - Applying parameter option to task.trainer.epochs : epochs=200
INFO - Applying parameter option to task.model.representation.dropout : representation.dropout=0.7
...

Updating a Config File

When there’s a new release of PyText, some component parameters might change because of bug fixes or new features. While PyText has config_adapters that can internally transform old configs to map them to the latest components, it is sometimes useful to update your config file to the current version. This can be done with the command update-config:

$ pytext update-config < my_config_old.json > my_config_new.json
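Conceptually, an upgrade of this kind can be thought of as a chain of small functions, each moving a config from one version to the next. The sketch below illustrates that idea only. It does not reproduce PyText’s config_adapters: the decorator, the version numbers and the parameter names (num_epochs, early_stop) are all invented for the example.

```python
ADAPTERS = {}


def adapter(from_version):
    """Register a function that upgrades a config from from_version by one step."""
    def register(func):
        ADAPTERS[from_version] = func
        return func
    return register


@adapter(1)
def v1_to_v2(config):
    # Hypothetical change: a parameter was renamed.
    config = dict(config)
    config["epochs"] = config.pop("num_epochs")
    return config


@adapter(2)
def v2_to_v3(config):
    # Hypothetical change: a new parameter gets an explicit default.
    return {**config, "early_stop": False}


def upgrade(config, latest_version=3):
    """Apply adapters one version at a time until the config is current."""
    config = dict(config)
    while config["version"] < latest_version:
        version = config["version"]
        config = ADAPTERS[version](config)
        config["version"] = version + 1
    return config


old = {"version": 1, "num_epochs": 10}
new = upgrade(old)
print(new)
```

The “version” field is what tells such a chain where to start, which is why it is worth keeping in every config file.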

Train Intent-Slot model on ATIS Dataset

OBSOLETE This documentation is using the old API and needs to be updated with the new classes configs.

Intent detection and Slot filling are two common tasks in Natural Language Understanding for personal assistants. Given a user’s “utterance” (e.g. Set an alarm for 10 pm), we detect its intent (set_alarm) and tag the slots required to fulfill the intent (10 pm).

The two tasks can be modeled as text classification and sequence labeling, respectively. We can train two separate models, but training a joint model has been shown to perform better.

In this tutorial, we will train a joint intent-slot model in PyText on the ATIS (Airline Travel Information System) dataset. Note that to download the dataset, you will need a Kaggle account for which you can sign up for free.

1. Prepare the data

The in-built PyText data-handler expects the data to be stored in a tab-separated file that contains the intent label, slot label and the raw utterance.

Download the data locally and use the script below to preprocess it into the format PyText expects:

$ unzip <download_dir>/atis.zip -d <download_dir>/atis
$ python3 demo/atis_joint_model/data_processor.py \
  --download-folder <download_dir>/atis --output-directory demo/atis_joint_model/

The script will also randomly split the training data into training and validation sets. All the pre-processed data will be written to the output-directory argument specified in the command.

An alternative approach here would be to write a custom data-handler for your custom data format, but that is beyond the scope of this tutorial.

2. Download Pre-trained word embeddings

Word embeddings are the vector representations of the different words understood by your model. Pre-trained word embeddings can significantly improve the accuracy of your model, since they have been trained on vast amounts of data. In this tutorial, we’ll use GloVe embeddings, which can be downloaded by:

$ curl https://nlp.stanford.edu/data/wordvecs/glove.6B.zip > demo/atis_joint_model/glove.6B.zip
$ unzip demo/atis_joint_model/glove.6B.zip -d demo/atis_joint_model

The downloaded file size is ~800 MB.

3. Train the model

To train a PyText model, you need to pick the right task and model architecture, among other parameters. Default values are available for many parameters and can give reasonable results in most cases. The following is a sample config which can train a joint intent-slot model

{
  "config": {
    "task": {
      "IntentSlotTask": {
        "data": {
          "Data": {
            "source": {
              "TSVDataSource": {
                "field_names": [
                  "label",
                  "slots",
                  "text",
                  "doc_weight",
                  "word_weight"
                ],
                "train_filename": "demo/atis_joint_model/atis.processed.train.csv",
                "eval_filename": "demo/atis_joint_model/atis.processed.val.csv",
                "test_filename": "demo/atis_joint_model/atis.processed.test.csv"
              }
            },
            "batcher": {
              "PoolingBatcher": {
                "train_batch_size": 128,
                "eval_batch_size": 128,
                "test_batch_size": 128,
                "pool_num_batches": 10000
              }
            },
            "sort_key": "tokens",
            "in_memory": true
          }
        },
        "model": {
          "representation": {
            "BiLSTMDocSlotAttention": {
              "pooling": {
                "SelfAttention": {}
              }
            }
          },
          "output_layer": {
            "doc_output": {
              "loss": {
                "CrossEntropyLoss": {}
              }
            },
            "word_output": {
              "CRFOutputLayer": {}
            }
          },
          "word_embedding": {
            "embed_dim": 100,
            "pretrained_embeddings_path": "demo/atis_joint_model/glove.6B.100d.txt"
          }
        },
        "trainer": {
          "epochs": 20,
          "optimizer": {
            "Adam": {
              "lr": 0.001
            }
          }
        }
      }
    }
  }
}

We explain some of the parameters involved:

  • IntentSlotTask trains a joint model for document classification and word tagging.
  • The Model has multiple layers:
      • We use a BiLSTM model with attention as the representation layer. The pooling attribute decides the attention technique used.
      • We use different loss functions for document classification (Cross Entropy Loss) and slot filling (CRF layer).
  • Pre-trained word embeddings are provided within the word_embedding attribute.

To train the PyText model,

(pytext) $ pytext train < sample_config.json

4. Tune the model and get final results

Tuning the model’s hyper-parameters is key to obtaining the best model accuracy. Using hyper-parameter sweeps on learning rate, number of layers, dimension and dropout of the BiLSTM, etc., we can achieve an F1 score of ~95% on slot labels, which is close to the state of the art. The fine-tuned model config is available at demo/atis_joint_model/atis_joint_config.json

To train the model using the fine-tuned model config,

(pytext) $ pytext train < demo/atis_joint_model/atis_joint_config.json

5. Generate predictions

Let’s make the model run on some sample utterances! You can input one by running

(pytext) $ pytext --config-file demo/atis_joint_model/atis_joint_config.json \
  predict --exported-model /tmp/atis_joint_model.c2 <<< '{"text": "flights from colorado"}'

The response from the model contains the log probabilities for the different intents and slots, with the correct intent and slot hopefully having the highest values.

In the following snippet of the model’s response, we see that the intent doc_scores:flight and, for the third word “colorado”, the slot word_scores:fromloc.city_name have the highest scores.

{
 ....
 'doc_scores:flight': array([-0.00016726], dtype=float32),
 'doc_scores:ground_service+ground_fare': array([-25.865768], dtype=float32),
 'doc_scores:meal': array([-17.864975], dtype=float32),
 ..,
 'word_scores:airline_name': array([[-12.158762],
       [-15.142928],
       [ -8.991585]], dtype=float32),
 'word_scores:fromloc.city_name': array([[-1.5084317e+01],
       [-1.3880151e+01],
       [-1.4416825e-02]], dtype=float32),
 'word_scores:fromloc.state_code': array([[-17.824356],
       [-17.89767 ],
       [ -9.848984]], dtype=float32),
 'word_scores:meal': array([[-15.079164],
       [-17.229427],
       [-17.529446]], dtype=float32),
 'word_scores:transport_type': array([[-14.722928],
       [-16.700478],
       [-13.4414  ]], dtype=float32),
 ...
}
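Because the scores are log probabilities, exponentiating them recovers ordinary probabilities, and the best label is simply the one with the largest score. The sketch below does this for values copied from the response above; it uses only the Python standard library and treats the utterance as the three whitespace-separated tokens.

```python
import math

# Intent log probabilities copied from the response above.
doc_scores = {
    "flight": -0.00016726,
    "ground_service+ground_fare": -25.865768,
    "meal": -17.864975,
}

best_intent = max(doc_scores, key=doc_scores.get)
probabilities = {label: math.exp(score) for label, score in doc_scores.items()}
print(best_intent, round(probabilities[best_intent], 5))

# Per-token log probabilities for the slot fromloc.city_name, one per word.
tokens = ["flights", "from", "colorado"]
fromloc_city = [-1.5084317e01, -1.3880151e01, -1.4416825e-02]

best_index = max(range(len(fromloc_city)), key=fromloc_city.__getitem__)
print(tokens[best_index], round(math.exp(fromloc_city[best_index]), 3))
```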

Hierarchical intent and slot filling

In this tutorial, we will train a semantic parser for task-oriented dialog by modeling hierarchical intents and slots (Gupta et al., Semantic Parsing for Task Oriented Dialog using Hierarchical Representations, EMNLP 2018). The underlying model used in the paper is the Recurrent Neural Network Grammar (Dyer et al., Recurrent Neural Network Grammar, NAACL 2016). RNNG is a neural constituency parser that explicitly models the compositional tree structure of the words and phrases in an utterance.

1. Fetch the dataset

Download the dataset to a local directory. We will refer to this as base_dir in the next section.

$ curl -o top-dataset-semantic-parsing.zip -L https://fb.me/semanticparsingdialog
$ unzip top-dataset-semantic-parsing.zip

2. Prepare configuration file

Prepare the configuration file for training. A sample config file can be found in your PyText repository at demo/configs/rnng.json. If you haven’t set up PyText, please follow Installation, then make the following changes in the config:

  • Set train_path to base_dir/top-dataset-semantic-parsing/train.tsv.
  • Set eval_path to base_dir/top-dataset-semantic-parsing/eval.tsv.
  • Set test_path to base_dir/top-dataset-semantic-parsing/test.tsv.

3. Train a model with the downloaded dataset

Train the model using the command below

(pytext) $ pytext train < demo/configs/rnng.json

The output will look like:

Merged Intent and Slot Metrics
P = 24.03 R = 31.90, F1 = 27.41.

This will take about an hour. If you want to train faster on a smaller dataset, generate a subset of the dataset using the commands below and update the paths in demo/configs/rnng.json:

$ head -n 1000 base_dir/top-dataset-semantic-parsing/train.tsv > base_dir/top-dataset-semantic-parsing/train_small.tsv
$ head -n 100 base_dir/top-dataset-semantic-parsing/eval.tsv > base_dir/top-dataset-semantic-parsing/eval_small.tsv
$ head -n 100 base_dir/top-dataset-semantic-parsing/test.tsv > base_dir/top-dataset-semantic-parsing/test_small.tsv

If you now train the model with smaller datasets, the output will look like:

Merged Intent and Slot Metrics
P = 24.03 R = 31.90, F1 = 27.41.

4. Test the model interactively against input utterances.

Load the model using the command below

(pytext) $ pytext predict-py --model-file=/tmp/model.pt
please input a json example, the names should be the same with column_to_read in model training config:

This will give you a REPL prompt. You can repeatedly enter an utterance to get back the model’s prediction. Enter each utterance in the JSON format shown below. Once done, press Ctrl+D.

{"text": "order coffee from starbucks"}

You should see an output like:

[{'prediction': [7, 0, 5, 0, 1, 0, 3, 0, 1, 1],
'score': [
        0.44425372408062447,
        0.8018286800064633,
        0.6880680051949267,
        0.9891564979506277,
        0.9999506231665385,
        0.9992705616574005,
        0.34512090135492923,
        0.9999979545618913,
        0.9999998668826438,
        0.9999998686418744]}]

We have also provided a pre-trained model which you may download here

Multitask training with disjoint datasets

In this tutorial, we will jointly train a classification task with a language modeling task in a multitask setting. The models will share the embedding and representation layers.

We will use the following datasets:

  1. Binarized Stanford Sentiment Treebank (SST-2), which is part of the GLUE benchmark. This dataset contains segments from movie reviews labeled with their binary sentiment.
  2. WikiText-2, a medium-size language modeling dataset with text extracted from Wikipedia.

1. Fetch and prepare the dataset

Download the dataset in a local directory. We will refer to this as base_dir in the next section.

$ curl "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip" -o wikitext-2-v1.zip
$ unzip wikitext-2-v1.zip
$ curl "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8" -o SST-2.zip
$ unzip SST-2.zip

Remove headers from SST-2 data:

$ cd base_dir/SST-2
$ sed -i '1d' train.tsv
$ sed -i '1d' dev.tsv

Remove empty lines from WikiText:

$ cd base_dir/wikitext-2
$ sed -i '/^\s*$/d' train.tsv
$ sed -i '/^\s*$/d' valid.tsv
$ sed -i '/^\s*$/d' test.tsv

2. Train a base model

Prepare the configuration file for training. A sample config file for the base document classification model can be found in your PyText repository at demo/configs/sst2.json. If you haven’t set up PyText, please follow Installation, then make the following changes in the config:

  • Set train_path to base_dir/SST-2/train.tsv.
  • Set eval_path to base_dir/SST-2/dev.tsv.
  • Set test_path to base_dir/SST-2/dev.tsv.

The test set labels for this task are not openly available, therefore we will use the dev set for both evaluation and testing. Train the model using the command below.

(pytext) $ pytext train < demo/configs/sst2.json

The output will look like:

Stage.EVAL
loss: 0.472868
Accuracy: 85.67

3. Configure for multitasking

The example configuration for this tutorial is at demo/configs/multitask_sst_lm.json. The main configuration is under tasks, which is a dictionary mapping each task name to its task config:

"task_weights": {
  "SST2": 1,
  "LM": 1
},
"tasks": {
  "SST2": {
    "DocClassificationTask": { ... }
  },
  "LM": {
    "LMTask": { ... }
  }
}

You can also modify task_weights to weight the loss for each task. The sub-tasks can be configured as you would in a single task setting, with the exception of changes described in the next sections.

4. Specify which parameters to share

Parameter sharing is specified at module level with the shared_module_key parameter, which is an arbitrary string. Modules with identical shared_module_key share parameters.

Here we will share the BiLSTM module. Under the SST task, we set

"representation": {
  "BiLSTMDocAttention": {
    "lstm": {
      "shared_module_key": "SHARED_LSTM"
    }
  }
}

Under the LM task, we set

"representation": {
  "shared_module_key": "SHARED_LSTM"
},

In this case, BiLSTMDocAttention.lstm of DocClassificationTask and representation of LMTask are both of type BiLSTM, therefore parameter sharing is possible.
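One way to picture shared_module_key is as a lookup table consulted when modules are created: the first request for a key builds the module, and later requests for the same key get the same object back. The sketch below illustrates that idea in plain Python. It is not PyText’s implementation; Module and create_module are stand-ins invented for the example.

```python
class Module:
    """Stand-in for a neural network module; holds a small parameter list."""

    def __init__(self, name):
        self.name = name
        self.parameters = [0.0, 0.0]


_shared_modules = {}


def create_module(name, shared_module_key=None):
    """Return a new module, or the one already created for this key."""
    if shared_module_key is None:
        return Module(name)
    if shared_module_key not in _shared_modules:
        _shared_modules[shared_module_key] = Module(name)
    return _shared_modules[shared_module_key]


sst_lstm = create_module("BiLSTM", shared_module_key="SHARED_LSTM")
lm_lstm = create_module("BiLSTM", shared_module_key="SHARED_LSTM")
private_lstm = create_module("BiLSTM")

# An update made while training one task is visible to the other task.
sst_lstm.parameters[0] = 1.5
print(lm_lstm.parameters[0], private_lstm.parameters[0])
```

This also shows why the two modules must be of the same type: both tasks end up using one and the same object.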

5. Share the embedding layer

The embedding is also a module, and can be similarly shared. This is configured under the features section. However, we need to ensure that we use the same vocabulary for both tasks, by specifying a pre-built vocabulary file. First create the vocabulary from the classification task data:

$ cd base_dir/SST-2
$ cat train.tsv dev.tsv | tr ' ' '\n' | sort | uniq > sst_vocab.txt
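If you prefer to build the vocabulary in Python, the sketch below is one possible variant. Unlike the shell pipeline, it reads only the text column, so labels never end up in the vocabulary. It assumes each line has the form text, tab, label; the helper build_vocab and the sample lines are made up for illustration.

```python
def build_vocab(lines, text_column=0):
    """Collect the sorted set of whitespace-separated tokens from one TSV column."""
    vocab = set()
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        vocab.update(columns[text_column].split())
    return sorted(vocab)


# Made-up sample lines standing in for train.tsv and dev.tsv.
train = ["a charming film\t1\n", "a dull film\t0\n"]
dev = ["charming and funny\t1\n"]

vocab = build_vocab(train + dev)
print("\n".join(vocab))
```

To use it on the real files, pass the lines of train.tsv and dev.tsv and write the result to sst_vocab.txt, one token per line.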

Then point to this file in configuration:

"features": {
    "shared_module_key": "SHARED_EMBEDDING",
    "word_feat": {
      "vocab_file": "base_dir/SST-2/sst_vocab.txt",
      "vocab_size": 15000,
      "vocab_from_train_data": false
    }
  }

6. Train the model

You can train the model with

(pytext) $ pytext train < demo/configs/multitask_sst_lm.json

The output will look like

Stage.EVAL
loss: 0.455871
Accuracy: 86.12

Not a great improvement, but we used a very primitive language modeling task (bi-directional with no masking) for the purposes of this tutorial. Happy multitasking!

Data Parallel Distributed Training

Distributed training enables one to easily parallelize computations across processes and clusters of machines. To do so, it leverages message passing semantics, allowing each process to communicate data to any of the other processes.

PyText uses DistributedDataParallel to synchronize gradients and torch.multiprocessing to spawn multiple processes. Each process sets up the distributed environment with NCCL as the default backend, initializes the process group, and finally executes the given run function. The module is replicated on each machine and each device (i.e., in every process), and each replica handles a portion of the input partitioned by PyText’s DataHandler. For more on distributed training in PyTorch, refer to Writing distributed applications with PyTorch.
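The idea that each replica handles its own portion of the input can be sketched without any GPU. The example below uses a simple round-robin split by process rank. This scheme is an assumption chosen for illustration and is not necessarily how PyText’s DataHandler partitions data; the helper shard is our own.

```python
def shard(examples, rank, world_size):
    """Return the examples handled by the process with the given rank."""
    return examples[rank::world_size]


examples = list(range(10))
world_size = 4

shards = [shard(examples, rank, world_size) for rank in range(world_size)]
for rank, portion in enumerate(shards):
    print(rank, portion)
```

Every example is assigned to exactly one process, so together the replicas cover the whole dataset once per epoch.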

In this tutorial, we will train a DocNN model on a single node with 8 GPUs using the SST dataset.

1. Requirement

Distributed training is only available for GPUs, so you’ll need a GPU-equipped server or virtual machine to run this tutorial.

Notes:
  • This demo uses a local temporary file for initializing the distributed process group, which means it only works on a single node. Please make sure to set distributed_world_size to a value less than or equal to the number of GPUs available on the server.
  • For distributed training on clusters of machines, you can use a shared file accessible to all the hosts (ex: file:///mnt/nfs/sharedfile) or the TCP init method. More info on distributed initialization.
  • In demo/configs/distributed_docnn.json, set distributed_world_size to 1 to disable distributed training, and set use_cuda_if_available to false to disable training on GPU.

2. Fetch the dataset

Download the SST dataset (The Stanford Sentiment Treebank) to a local directory. We will refer to this as base_dir in the next section.

$ unzip SST-2.zip && cd SST-2
$ sed 1d train.tsv | head -1000 > train_tiny.tsv
$ sed 1d dev.tsv | head -100 > eval_tiny.tsv

3. Prepare configuration file

Prepare the configuration file for training. A sample config file can be found in your PyText repository at demo/configs/distributed_docnn.json. If you haven’t set up PyText, please follow Installation.

The two parameters that are used for distributed training are:

  • distributed_world_size: total number of GPUs used for distributed training, e.g. if set to 40 with every server having 8 GPU, 5 servers will be fully used.
  • use_cuda_if_available: set to true for training on GPUs.
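The arithmetic behind distributed_world_size is simple enough to check in a few lines. The sketch below reproduces the 40-GPU example and, as an additional assumption of ours, maps each global rank to a server and a local GPU by filling servers one after another; the function names are hypothetical.

```python
import math


def servers_needed(distributed_world_size, gpus_per_server):
    """Number of servers required to host the requested number of GPUs."""
    return math.ceil(distributed_world_size / gpus_per_server)


def placement(rank, gpus_per_server):
    """Map a global rank to (server index, local GPU index).

    Assumes ranks fill one server completely before moving to the next.
    """
    return divmod(rank, gpus_per_server)


print(servers_needed(40, 8))
print(placement(39, 8))
```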

For this tutorial, please change the following in the config file.

  • Set train_path to base_dir/train_tiny.tsv.
  • Set eval_path to base_dir/eval_tiny.tsv.
  • Set test_path to base_dir/eval_tiny.tsv.

4. Train model with the downloaded dataset

Train the model using the command below

(pytext) $ pytext train < demo/configs/distributed_docnn.json

XLM-RoBERTa

Introduction

XLM-R (XLM-RoBERTa, Unsupervised Cross-lingual Representation Learning at Scale) is a scaled cross-lingual sentence encoder. It is trained on 2.5TB of data in 100 languages, filtered from Common Crawl. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.

Pre-trained models

Model          Description                              #params  vocab size  Download
xlmr.base.v0   XLM-R using the BERT-base architecture   250M     250k        xlm.base.v0.tar.gz
xlmr.large.v0  XLM-R using the BERT-large architecture  560M     250k        xlm.large.v0.tar.gz

(Note: The above models are still being trained. We will update the weights once training is complete. The results below are based on the above checkpoints.)

Results

XNLI (Conneau et al., 2018):

Model average en fr es de el bg ru tr ar vi th zh hi sw ur
roberta.large.mnli (TRANSLATE-TEST) 77.8 91.3 82.9 84.3 81.2 81.7 83.1 78.3 76.8 76.6 74.2 74.1 77.5 70.9 66.7 66.8
xlmr.large.v0 (TRANSLATE-TRAIN-ALL) 82.4 88.7 85.2 85.6 84.6 83.6 85.5 82.4 81.6 80.9 83.4 80.9 83.3 79.8 75.9 74.3

MLQA (Lewis et al., 2018)

Model           average       en            es            de            ar            hi            vi            zh
BERT-large      -             80.2 / 67.4   -             -             -             -             -             -
mBERT           57.7 / 41.6   77.7 / 65.2   64.3 / 46.6   57.9 / 44.3   45.7 / 29.8   43.8 / 29.7   57.1 / 38.6   57.5 / 37.3
xlmr.large.v0   70.0 / 52.2   80.1 / 67.7   73.2 / 55.1   68.3 / 53.7   62.8 / 43.7   68.3 / 51.0   70.5 / 50.1   67.1 / 44.4

Citation

@article{
    title = {Unsupervised Cross-lingual Representation Learning at Scale},
    author = {Alexis Conneau and Kartikay Khandelwal
        and Naman Goyal and Vishrav Chaudhary and Guillaume Wenzek
        and Francisco Guzm\'an and Edouard Grave and Myle Ott
        and Luke Zettlemoyer and Veselin Stoyanov
    },
    journal={},
    year = {2019},
}

Architecture Overview

PyText is designed to help users build end-to-end pipelines for training and inference. A number of default pipelines are implemented for popular tasks and can be used as-is. Users are free to extend or replace one or more of a pipeline’s components.

The following figure describes the relationship between the major components of PyText:

_images/pytext.png

Note: some models might implement a single “encoder_decoder” component while others implement two components: a representation and a decoder.

Model

The Model class is the central concept in PyText. It defines the neural network architecture. PyText provides models for common NLP jobs. Users can implement their custom model in two ways:

  • subclassing Model will give you most of the functions for the common architecture embedding -> representation -> decoder -> output_layer.
  • if you need more flexibility, you can subclass the more basic BaseModel which makes no assumptions about architectures, allowing you to implement any model.

Most PyText models implement Model and use the following architecture:

- model
  - model_input
    - tensorizers
  - embeddings
  - encoder+decoder
  - output_layer
    - loss
    - prediction
  • model_input: defines how the input strings will be transformed into tensors. This is done by input-specific “Tensorizers”. For example, the TokenTensorizer takes a sentence, tokenizes it, and looks up each token in its vocabulary to create the corresponding tensor. (The vocabulary is created during initialization by doing a first pass on the inputs.) In addition to the inputs, we also define here how to handle other data that can be found in the input files, such as the “labels” (arguably an output, but true labels are used as an input during training).
  • embeddings: this step transforms the tensors created by model_input into embeddings. Each model_input (tensorizer) will be associated with a compatible embedding class (for example: WordEmbedding, or CharacterEmbedding). (see pytext/models/embeddings/)
  • representation: also called “encoder”, this can be one of the provided classes, such as those using a CNN (for example DocNNRepresentation), those using an LSTM (for example BiLSTMDocAttention), or any other type of representation. The parameters will depend on the representation selected. (see pytext/models/representations/)
  • decoder: this is typically an MLP (Multi-Layer Perceptron). If you use the default MLPDecoder, hidden_dims is the most useful parameter, which is an array containing the number of nodes in each hidden layer. (see pytext/models/decoders/)
  • output_layer: this is where the human-understandable output of the model is defined. For example, a document classification can automatically use the “labels” vocabulary defined in model_input as outputs. output_layer also defines the loss function to use during training. (see pytext/models/output_layers/)

Task: training definition

To train the model, we define a Task, which tells PyText how to load the data, which model to use, how to train it, and how to compute metrics.

The Task is defined with the following information:

  • data: defines where to find and how to handle the data: see data_source and batcher.
  • data -> data_source: The format of the input data (training, eval and testing) can differ a lot depending on the source. PyText provides TSVDataSource to read from the common tab-separated files. Users can easily write their own custom implementation if their files have a different format.
  • data -> batcher: The batcher is responsible for grouping the input data into batches that will be processed one at a time. train_batch_size, eval_batch_size and test_batch_size can be increased to reduce the running time (at the cost of higher memory requirements). The default Batcher takes the input sequentially, which is adequate in most cases. Alternatively, PoolingBatcher shuffles the inputs so that the data is not processed in its original order, since ordered data could introduce a bias in the results.
  • trainer: This defines a number of useful options for the training runs, like number of epochs, whether to report_train_metrics only during eval, and the random_seed to use.
  • metric_reporter: different models will need to report different metrics. (For example, common metrics for document classification are precision, recall, f1 score.) Each PyText task can use a corresponding default metric reporter class, but users might want to use alternatives or implement their own.
  • exporter: defines how to export the model so it can be used in production. PyText currently exports to Caffe2 (via ONNX) or to TorchScript.
  • model: (see above)
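
The difference between sequential and shuffled batching described above can be sketched in plain Python. Neither function below is PyText’s Batcher or PoolingBatcher, whose actual strategies may differ; the names and the seed parameter are hypothetical and only illustrate the idea.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sequential_batches(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Group rows into batches in their original order."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def shuffled_batches(rows: Sequence[T], batch_size: int, seed: int = 0) -> Iterator[List[T]]:
    """Shuffle a copy of the rows first, then batch them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    yield from sequential_batches(shuffled, batch_size)


rows = list(range(10))
print(list(sequential_batches(rows, 4)))  # batches follow the file order
print(list(shuffled_batches(rows, 4)))    # same rows, mixed across batches
```

A larger batch_size yields fewer batches per epoch, which is where the running-time gain and the extra memory use both come from.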

How Data is Consumed

  1. data_source: Defines where the data can be found (for example: one training file, one eval file, and one test file) and the schema (field names). The data_source class will read each entry one by one (for example: each line in a TSV file) and convert each one into a row, which is a python dict of field name to entry value. Values are converted automatically if their type is specified.
  2. tensorizer: Defines how rows are transformed into tensors. Tensorizers listed in the model will use one or more fields in the row to create a tensor or a tuple of tensors. To do that, some tensorizers will split the field values using a tokenizer that can be overridden in the config. Tensorizers typically have a vocabulary that allows them to map words or labels to numbers, and it’s built during the initialization phase by scanning the data once. (Alternatively, it can be loaded from file.)
  3. model -> arrange_model_inputs(): At this point, we have a python dict of tensorizer name to tensor or tuple of tensors. Model has the method arrange_model_inputs(), which flattens this python dict into a list of tensors or tuples of tensors in the right order for the Model’s forward method.
  4. model -> forward(): This is where the magic happens. Input tensors are passed to the embeddings’ forward methods, then the results are passed to the encoder/decoder forward methods, and finally the output layer produces a prediction.
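
The four steps above can be traced with a toy pipeline in plain Python. This is a sketch with stand-in functions, not PyText code: lists of integers play the role of tensors, and forward() uses placeholder arithmetic instead of a neural network.

```python
from typing import Dict, List, Tuple

# Step 1: a data source yields rows as dicts of field name -> value.
rows = [
    {"text": "turn it up", "label": "volume"},
    {"text": "play jazz", "label": "music"},
]


def build_vocab(values: List[List[str]]) -> Dict[str, int]:
    """Step 2a: scan the data once and assign an index to each distinct item."""
    vocab: Dict[str, int] = {}
    for items in values:
        for item in items:
            vocab.setdefault(item, len(vocab))
    return vocab


token_vocab = build_vocab([row["text"].split() for row in rows])
label_vocab = build_vocab([[row["label"]] for row in rows])


def numberize(row: Dict[str, str]) -> Dict[str, object]:
    """Step 2b: map one row to numbers, keyed by (toy) tensorizer name."""
    return {
        "tokens": [token_vocab[w] for w in row["text"].split()],
        "labels": label_vocab[row["label"]],
    }


def arrange_model_inputs(tensor_dict: Dict[str, object]) -> Tuple[object, ...]:
    """Step 3: flatten the dict into positional arguments for forward()."""
    return (tensor_dict["tokens"],)


def forward(tokens: List[int]) -> int:
    """Step 4: stand-in for embeddings -> encoder/decoder -> output layer."""
    return sum(tokens) % len(label_vocab)  # placeholder arithmetic, not a model


for row in rows:
    numbers = numberize(row)
    prediction = forward(*arrange_model_inputs(numbers))
    print(row["text"], "->", numbers["tokens"], "-> predicted label index", prediction)
```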

Config Example

We only specify the options we want to override. Everything else will use the default values. A typical config might look like this:

{
  "task": {
    "MyTask": {
      "data": {
        "source": {
          "TSVDataSource": {
            "field_names": ["label", "slots", "text"],
            "train_filename": "data/my_train_data.tsv",
            "test_filename": "data/my_test_data.tsv",
            "eval_filename": "data/my_eval_data.tsv"
          }
        }
      }
    }
  }
}
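
The idea of overriding only some options can be pictured as a recursive merge of the user config into the defaults. The sketch below is an assumption made for illustration: PyText’s own config loading is not shown in this section and may work differently, and the default values are made up.

```python
import copy


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` merged in, recursing into nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical default values, for illustration only.
defaults = {"trainer": {"epochs": 10, "random_seed": 0}, "data": {"batch_size": 16}}
overrides = {"trainer": {"epochs": 3}}
merged = apply_overrides(defaults, overrides)
print(merged)
```

Only trainer.epochs changes; every option that the override does not mention keeps its default.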

Code Example

class MyTask(NewTask):
    class Config(NewTask.Config):
        model: MyModel.Config = MyModel.Config()

class MyModel(Model):
    class Config(Model.Config):
        class ModelInput(Model.Config.ModelInput):
            tokens: TokenTensorizer.Config = TokenTensorizer.Config()
            labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config()

        inputs: ModelInput = ModelInput()
        embedding: WordEmbedding.Config = WordEmbedding.Config()

        representation: Union[
            BiLSTMSlotAttention.Config,
            BSeqCNNRepresentation.Config,
            PassThroughRepresentation.Config,
        ] = BiLSTMSlotAttention.Config()
        output_layer: Union[
            WordTaggingOutputLayer.Config, CRFOutputLayer.Config
        ] = WordTaggingOutputLayer.Config()
        decoder: MLPDecoder.Config = MLPDecoder.Config()

    @classmethod
    def from_config(cls, config, tensorizers):
        vocab = tensorizers["tokens"].vocab
        embedding = create_module(config.embedding, vocab=vocab)

        labels = tensorizers["labels"].vocab
        representation = create_module(
            config.representation, embed_dim=embedding.embedding_dim
        )
        decoder = create_module(
            config.decoder,
            in_dim=representation.representation_dim,
            out_dim=len(labels),
        )
        output_layer = create_module(config.output_layer, labels=labels)
        return cls(embedding, representation, decoder, output_layer)

    def arrange_model_inputs(self, tensor_dict):
        tokens, seq_lens, _ = tensor_dict["tokens"]
        return (tokens, seq_lens)

    def arrange_targets(self, tensor_dict):
        return tensor_dict["labels"]

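    def forward(
        self,
        tokens: torch.Tensor,
        seq_lens: torch.Tensor,
    ) -> List[torch.Tensor]:
        # forward() receives the tuple returned by arrange_model_inputs(), unpacked.
        # The embedding passed to cls(...) in from_config() is assumed to be
        # stored as self.embedding by the Model base class.
        embeddings = [self.embedding(tokens)]

        final_embedding = torch.cat(embeddings, -1)
        representation = self.representation(final_embedding, seq_lens)

        return self.decoder(representation)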

Custom Data Format

PyText’s default reader is TSVDataSource, which reads datasets in TSV format (tab-separated values). In many cases, your data will be in a different format. You could write a pre-processing script to convert your data to TSV, but it is more convenient to implement your own DataSource component so that PyText can read your data directly, without any preprocessing.

This tutorial explains how to implement a simple DataSource that can read the ATIS data and perform a classification task using the “intent” labels.

1. Download the data

Download the ATIS (Airline Travel Information System) dataset and unzip it in a directory. Note that to download the dataset, you will need a Kaggle account for which you can sign up for free. The zip file is about 240KB.

$ unzip <download_dir>/atis.zip -d <download_dir>/atis

2. The data format

The ATIS dataset has a few defining characteristics:

  1. it has a train set and a test set, but no eval set
  2. the data is split into a “dict” file, which is a vocab file containing the words or labels, and the train and test sets, which only contain integers representing the word indexes.
  3. sentences always start with the token 178 = BOS (Beginning Of Sentence) and end with the token 179 = EOS (End Of Sentence).
$ tail atis/atis.dict.vocab.csv
y
year
yes
yn
york
you
your
yx
yyz
zone
$ tail atis/atis.test.query.csv
178 479 0 545 851 264 882 429 851 915 330 179
178 479 902 851 264 180 428 444 736 521 301 851 915 330 179
178 818 581 207 827 204 616 915 330 179
178 479 0 545 851 264 180 428 444 299 851 619 937 301 654 887 200 435 621 740 179
178 818 581 207 827 204 482 827 619 937 301 229 179
178 688 423 207 827 429 444 299 851 218 203 482 827 619 937 301 229 826 236 621 740 253 130 689 179
178 423 581 180 428 444 299 851 218 203 482 827 619 937 301 229 179
178 479 0 545 851 431 444 589 851 297 654 212 200 179
178 479 932 545 851 264 180 730 870 428 444 511 301 851 297 179
178 423 581 180 428 826 427 444 587 851 810 179

Our DataSource must therefore resolve the words from the vocab files to rebuild the sentences and labels as strings. It must also set aside a subset of either the train or the test dataset to create the eval dataset. Since the test set is pretty small, we’ll use the train set for that purpose and randomly take a small fraction (say 25%) to create the eval set. Finally, we can safely remove the first and last tokens of every query (BOS and EOS), as they don’t add any value for classification.

The ATIS dataset also has information for slot tagging that we’ll ignore because we only care about classification in this tutorial.

3. DataSource

PyText defines a DataSource to read the data. It expects each row of data to be represented as a python dict where the keys are the column names and the values are the column values, properly typed.

Most of the time, the dataset will come as strings and the casting to the proper types can be inferred automatically from the other components in the config. To make the implementation of a new DataSource easier, PyText provides the class RootDataSource that does this type lookup for you. Most users should use RootDataSource as a base class.

4. Implementing AtisIntentDataSource

We will write all the code for our AtisIntentDataSource in the file my_classifier/source.py.

First, let’s write the utilities that will help us read the data: a function to load the vocab files, and the generator that uses them to rebuild the sentences and labels. We return pytext.data.utils.UNK for unknown words. We store the indexes as strings to avoid casting from and to ints when reading the inputs:

def load_vocab(file_path):
    """
    Given a file, prepare the vocab dictionary where each line is the value and
    (line_no - 1) is the key
    """
    vocab = {}
    with open(file_path, "r") as file_contents:
        for idx, word in enumerate(file_contents):
            vocab[str(idx)] = word.strip()
    return vocab

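def reader(file_path, vocab):
    # UNK comes from pytext.data.utils (from pytext.data.utils import UNK).
    # BOS/EOS are not removed here because this reader is also used for the
    # intent files; _iter_rows strips them from the queries.
    with open(file_path, "r") as file_contents:
        for line in file_contents:
            yield " ".join(
                vocab.get(s.strip(), UNK)
                for s in line.split()
            )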

Then we declare the DataSource class itself: AtisIntentDataSource. It inherits from RootDataSource, which gives us the automatic lookup of data types. We declare all the config parameters that will be useful, and give sensible default values so that the general case where users provide only path and field_names will likely work. We load the vocab files for queries and intents only once in the constructor and keep them in memory for the entire run:

class AtisIntentDataSource(RootDataSource):

    def __init__(
        self,
        path="my_directory",
        field_names=None,
        validation_split=0.25,
        random_seed=12345,
        # Filenames can be overridden if necessary
        intent_filename="atis.dict.intent.csv",
        vocab_filename="atis.dict.vocab.csv",
        test_queries_filename="atis.test.query.csv",
        test_intent_filename="atis.test.intent.csv",
        train_queries_filename="atis.train.query.csv",
        train_intent_filename="atis.train.intent.csv",
        **kwargs,
    ):
        super().__init__(**kwargs)

        field_names = field_names or ["text", "label"]
        assert len(field_names) == 2, \
           "AtisIntentDataSource only handles 2 field_names: {}".format(field_names)

        self.random_seed = random_seed
        # The constructor parameter is validation_split; _selector reads it as eval_split.
        self.eval_split = validation_split

        # Load the vocab dict in memory for the readers
        self.words = load_vocab(os.path.join(path, vocab_filename))
        self.intents = load_vocab(os.path.join(path, intent_filename))

        self.query_field = field_names[0]
        self.intent_field = field_names[1]

        self.test_queries_filepath = os.path.join(path, test_queries_filename)
        self.test_intent_filepath = os.path.join(path, test_intent_filename)
        self.train_queries_filepath = os.path.join(path, train_queries_filename)
        self.train_intent_filepath = os.path.join(path, train_intent_filename)

To generate the eval data set, we need to randomly select some of the rows in training, but in a consistent and repeatable way. This is not strictly needed, and the training would work if the selection were completely random, but having a consistent sequence helps with debugging and gives comparable results from training to training. To do that, we use the same seed for a new random number generator each time we start reading the train data set. The function below can be used for either training or eval and ensures that those two sets are complements of each other, with the ratio determined by eval_split. It returns a function that, called once per row, returns True or False depending on whether the row should be included:

def _selector(self, select_eval):
    """
    This selector ensures that the same pseudo-random sequence is
    always used from the beginning. The `select_eval` parameter
    guarantees that the training set and eval set are exact complements.
    """
    rng = Random(self.random_seed)
    def fn():
        return select_eval ^ (rng.random() >= self.eval_split)
    return fn
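
To see why the two selections are complements, the same logic can be run outside the class. The sketch below copies the selector into a standalone function; make_selector and its default arguments are hypothetical names chosen for this illustration.

```python
from random import Random


def make_selector(select_eval: bool, eval_split: float = 0.25, seed: int = 12345):
    """Standalone copy of the selector logic, without the DataSource class."""
    rng = Random(seed)

    def fn() -> bool:
        # XOR flips the train decision, so eval picks exactly the rows train skips.
        return select_eval ^ (rng.random() >= eval_split)

    return fn


n_rows = 1000
train_sel = make_selector(select_eval=False)
eval_sel = make_selector(select_eval=True)
train_mask = [train_sel() for _ in range(n_rows)]
eval_mask = [eval_sel() for _ in range(n_rows)]

# Every row lands in exactly one of the two sets.
assert all(t != e for t, e in zip(train_mask, eval_mask))
print("eval fraction:", sum(eval_mask) / n_rows)
```

Because both selectors start a fresh generator from the same seed, they draw the same sequence of numbers and therefore disagree on every row.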

Next, we write the function that iterates through both the reader for the queries (sentences) and the reader for the intents (labels) simultaneously. It yields each row in the form of a python dictionary, where the keys are the field_names. We can pass an optional function to select a subset of the rows (i.e. _selector defined above); the default is to select all the rows:

def _iter_rows(self, query_reader, intent_reader, select_fn=lambda: True):
    for query_str, intent_str in zip(query_reader, intent_reader):
        if select_fn():
            yield {
                # in ATIS every row starts/ends with BOS/EOS: remove them
                self.query_field: query_str[4:-4],
                self.intent_field: intent_str,
            }

Finally, we tie everything together by implementing the 3 API methods of RootDataSource. Each of those methods should return a generator that can iterate through the specific dataset entirely. For the test dataset, we simply return all the rows read from test_queries_filepath and test_intent_filepath, using the corresponding vocab:

def raw_test_data_generator(self):
    return iter(self._iter_rows(
        query_reader=reader(
            self.test_queries_filepath,
            self.words,
        ),
        intent_reader=reader(
            self.test_intent_filepath,
            self.intents,
        ),
    ))

For the eval and train datasets, we read the same files train_queries_filepath and train_intent_filepath, but we select some of the rows for eval and the rest for train:

def raw_train_data_generator(self):
    return iter(self._iter_rows(
        query_reader=reader(
            self.train_queries_filepath,
            self.words,
        ),
        intent_reader=reader(
            self.train_intent_filepath,
            self.intents,
        ),
        select_fn=self._selector(select_eval=False),
    ))

def raw_eval_data_generator(self):
    return iter(self._iter_rows(
        query_reader=reader(
            self.train_queries_filepath,
            self.words,
        ),
        intent_reader=reader(
            self.train_intent_filepath,
            self.intents,
        ),
        select_fn=self._selector(select_eval=True),
    ))

RootDataSource needs to know how it should transform the values in the dictionaries created by the raw generators into the types matching the tensorizers used in the model. Fortunately, RootDataSource already provides a number of type conversion functions like the one below, so we don’t need to do it for strings. If we did need to do it, we would declare one like this for AtisIntentDataSource:

@AtisIntentDataSource.register_type(str)
def load_string(s):
    return s

The full source code for this tutorial can be found in demo/datasource/source.py, which includes the imports needed.

5. Testing AtisIntentDataSource

For rapid dev-test cycles, we add a simple main code printing the generated data in the terminal:

if __name__ == "__main__":
    import sys
    src = AtisIntentDataSource(
        path=sys.argv[1],
        field_names=["query", "intent"],
        schema={},
    )
    for row in src.raw_train_data_generator():
        print("TRAIN", row)
    for row in src.raw_eval_data_generator():
        print("EVAL", row)
    for row in src.raw_test_data_generator():
        print("TEST", row)

We test our class to make sure we’re getting the right data.

$ python3 my_classifier/source.py atis | head -n 3
TRAIN {'query': 'what flights are available from pittsburgh to baltimore on thursday morning', 'intent': 'flight'}
TRAIN {'query': 'cheapest airfare from tacoma to orlando', 'intent': 'airfare'}
TRAIN {'query': 'round trip fares from pittsburgh to philadelphia under 1000 dollars', 'intent': 'airfare'}

$ python3 my_classifier/source.py atis | cut -d " " -f 1 | uniq -c
3732 TRAIN
1261 EVAL
 893 TEST
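
As a quick sanity check, the eval share of the training file should be close to the validation_split of 0.25. The arithmetic below uses only the counts printed above.

```python
train_rows, eval_rows = 3732, 1261  # TRAIN and EVAL counts printed above
eval_fraction = eval_rows / (train_rows + eval_rows)
print(f"eval fraction: {eval_fraction:.4f}")
```

The share is not exactly 0.25 because each row is assigned by an independent pseudo-random draw.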

6. Training the Model

First, let’s generate a config that uses our new AtisIntentDataSource:

$ pytext --include my_classifier gen-default-config DocumentClassificationTask AtisIntentDataSource > my_classifier/config.json
Including: my_classifier
... importing module: my_classifier.source
... importing: <class 'my_classifier.source.AtisIntentDataSource'>
INFO - Applying option: task->data->source = AtisIntentDataSource

This default config contains all the parameters with their default values. We edit it to remove the parameters we don’t care about and to change the ones we do. We only want to run 3 epochs for now. The result looks like this:

$ cat my_classifier/config.json
{
  "debug_path": "my_classifier.debug",
  "export_caffe2_path": "my_classifier.caffe2.predictor",
  "export_onnx_path": "my_classifier.onnx",
  "save_snapshot_path": "my_classifier.pt",
  "task": {
    "DocumentClassificationTask": {
      "data": {
        "Data": {
          "source": {
            "AtisIntentDataSource": {
              "field_names": ["text", "label"],
              "path": "atis",
              "random_seed": 12345,
              "validation_split": 0.25
            }
          }
        }
      },
      "metric_reporter": {
        "output_path": "my_classifier.out"
      },
      "trainer": {
        "epochs": 3
      }
    }
  },
  "test_out_path": "my_classifier_test.out",
  "version": 12
}

And, at last, we can train the model

$ pytext --include my_classifier train < my_classifier/config.json

Notes

In the current version of PyText, we need to explicitly declare a few more things, like the Config class (that looks like the __init__ parameters) and the from_config method:

class Config(RootDataSource.Config):
    path: str = "."
    field_names: List[str] = ["text", "label"]
    validation_split: float = 0.25
    random_seed: int = 12345
    # Filenames can be overridden if necessary
    intent_filename: str = "atis.dict.intent.csv"
    vocab_filename: str = "atis.dict.vocab.csv"
    test_queries_filename: str = "atis.test.query.csv"
    test_intent_filename: str = "atis.test.intent.csv"
    train_queries_filename: str = "atis.train.query.csv"
    train_intent_filename: str = "atis.train.intent.csv"

# Config mimics the constructor
# This will be the default in future pytext.
@classmethod
def from_config(cls, config: Config, schema: Dict[str, Type]):
    return cls(schema=schema, **config._asdict())
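
The pattern of a Config that mirrors the constructor can be tried without PyText. In the sketch below, ToyConfig and ToySource are hypothetical stand-ins: a typing.NamedTuple plays the role of the Config class only because it also provides _asdict().

```python
from typing import List, NamedTuple


class ToyConfig(NamedTuple):
    """Stand-in for a Config class; PyText's real Config base class is not used here."""
    path: str = "."
    field_names: List[str] = ["text", "label"]
    validation_split: float = 0.25


class ToySource:
    def __init__(self, schema, path, field_names, validation_split):
        self.schema = schema
        self.path = path
        self.field_names = field_names
        self.validation_split = validation_split

    @classmethod
    def from_config(cls, config: ToyConfig, schema):
        # Each config field becomes the constructor keyword of the same name.
        return cls(schema=schema, **config._asdict())


src = ToySource.from_config(ToyConfig(path="atis"), schema={})
print(src.path, src.field_names, src.validation_split)
```

Keeping the field names identical to the __init__ parameters is what lets from_config stay a one-liner.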

Custom Tensorizer

Tensorizer is the class that prepares your data coming out of the data source and transforms it into tensors suitable for processing. Each tensorizer knows how to prepare the input data from specific columns. In order to do that, the tensorizer (after initialization, such as creating or loading the vocabulary for look-ups) executes the following steps:

  1. Its Config defines which column name(s) the tensorizer will look at
  2. numberize() takes one row and transforms the strings into numbers
  3. tensorize() takes a batch of rows and creates the tensors

PyText provides a number of tensorizers for the most common cases. However, if you have your own custom features that don’t have a suitable Tensorizer, you will need to write your own class. Fortunately it’s quite easy: you simply need to create a class that inherits from Tensorizer (or one of its subclasses), and implement a few functions.

First, you need a Config inner class, a from_config class method, and the constructor __init__. These only declare the member variables.

The tensorizer should declare its schema by defining a column_schema property that returns a list of tuples, one for each field/column read from the data source. Each tuple specifies the name of the column and the type of the data. Because the type is specified, the data source will automatically parse the inputs and pass objects of those types to the tensorizers. You don’t need to parse your own inputs.

For example, SeqTokenTensorizer reads one column from the input data. The data is formatted like a json list of strings: ["where do you wanna meet?", "MPK"]. The schema declaration is like this:

@property
def column_schema(self):
    return [(self.column, List[str])]
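
To make the List[str] declaration concrete, here is a sketch of the kind of conversion a data source has to perform on such a column. The function parse_string_list is hypothetical; the conversion function PyText actually registers for this type is not shown in this section.

```python
import json
from typing import List


def parse_string_list(raw: str) -> List[str]:
    """Parse a raw column value holding a JSON list of strings."""
    value = json.loads(raw)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"expected a JSON list of strings, got: {raw!r}")
    return value


print(parse_string_list('["where do you wanna meet?", "MPK"]'))
```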

Another example with GazetteerTensorizer: it needs 2 columns, one string for the text itself, and one for the gazetteer features formatted like a complex json object. (The Gazetteer type is registered in the data source to automatically convert the raw strings from the input to this type.) The schema declaration is like this:

Gazetteer = List[Dict[str, Dict[str, float]]]

@property
def column_schema(self):
    return [(self.text_column, str), (self.dict_column, Gazetteer)]

Example Implementation

Let’s implement a simple word tensorizer that creates a tensor with the word indexes from a vocabulary.

class MyWordTensorizer(Tensorizer):

    class Config(Tensorizer.Config):
        #: The name of the text column to read from the data source.
        column: str = "text"

    @classmethod
    def from_config(cls, config: Config):
        return cls(column=config.column)

    def __init__(self, column):
        self.column = column
        # The vocabulary is built later by initialize().
        self.vocab = None

    @property
    def column_schema(self):
        return [(self.column, str)]

Next we need to build the vocabulary by reading the training data and counting the words. Since multiple tensorizers might need to read the data, we parallelize the reading part and the tensorizers use the pattern row = yield to read their inputs. In this simple example, our “tokenize” function is just going to split on spaces.

def _tokenize(self, row):
    raw_text = row[self.column]
    return raw_text.split()

def initialize(self):
    """Build vocabulary based on training corpus."""
    vocab_builder = VocabBuilder()

    try:
        while True:
            row = yield
            words = self._tokenize(row)
            vocab_builder.add_all(words)
    except GeneratorExit:
        self.vocab = vocab_builder.make_vocab()
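
The row = yield pattern is a plain Python generator used as a coroutine: the caller pushes rows in with send() and signals the end with close(). The standalone sketch below shows the mechanics with a simple word collector; collect_vocab and the result dict are hypothetical and not part of PyText.

```python
from collections import Counter


def collect_vocab(result: dict):
    """Coroutine that receives rows through send() until it is closed."""
    words = Counter()
    try:
        while True:
            row = yield  # pauses here until the caller sends the next row
            words.update(row["text"].split())
    except GeneratorExit:
        # close() raises GeneratorExit at the paused yield: finalize here.
        result["vocab"] = sorted(words)


result = {}
gen = collect_vocab(result)
next(gen)  # advance to the first `yield`, same as send(None)
for row in [{"text": "turn it up"}, {"text": "turn it down"}]:
    gen.send(row)
gen.close()
print(result["vocab"])
```

Several such coroutines can be fed from a single pass over the data, which is the reason for this design.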

The most important method is numberize, which takes a row and transforms it into a list of numbers. The exact meaning of those numbers is arbitrary and depends on the design of the model. In our case, we look up the word indexes in the vocabulary.

def numberize(self, row):
    """Look up tokens in vocabulary to get their corresponding index"""
    words = self._tokenize(row)
    idx = self.vocab.lookup_all(words)
    # LSTM representations need the length of the sequence
    return idx, len(idx)

Because LSTM-based representations need the length of the sequence to only consider the useful values and ignore the padding, we also return the length of each sequence.

Finally, the last function will create properly padded torch.Tensors from the batches produced by numberize. Numberized results can be cached for performance. We have a separate function to tensorize them because they are shuffled and batched differently (at each epoch), and then they will need different padding (because padding dimensions depend on the batch).

def tensorize(self, batch):
    tokens, seq_lens = zip(*batch)
    return (
        pad_and_tensorize(tokens, self.vocab.get_pad_index()),
        pad_and_tensorize(seq_lens),
    )
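
The padding step can be illustrated without torch. The sketch below pads plain lists to the longest sequence in the batch; pad_batch is a hypothetical helper, and it is only assumed here that pad_and_tensorize pads in the same way before building the tensor.

```python
from typing import List, Sequence


def pad_batch(sequences: Sequence[Sequence[int]], pad_index: int) -> List[List[int]]:
    """Right-pad every sequence with pad_index up to the longest length in the batch."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]


batch = [([2, 3, 4, 5], 4), ([6, 7, 8], 3)]  # (token indexes, seq_len) pairs
tokens, seq_lens = zip(*batch)               # split the pairs into two tuples
print(pad_batch(tokens, pad_index=1))
print(list(seq_lens))
```

Since the padded width depends on the longest row in each batch, padding has to happen after batching rather than at numberize time.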

LSTM-based representations implemented in Torch also need the batches to be sorted by descending sequence length, so we add a sort function.

def sort_key(self, row):
    # LSTM representations need the batches to be sorted by descending seq_len
    return row[1]

The full code is in demo/examples/tensorizer.py

Testing

We can test our tensorizer with the following code that initializes the vocab, then tries the numberize function:

rows = [
    {"text": "I want some coffee"},
    {"text": "Turn it up"},
]
tensorizer = MyWordTensorizer(column="text")

# Vocabulary starts with 0 and 1 for Unknown and Padding.
# The rest of the vocabulary is built by the rows in order.
init = tensorizer.initialize()
init.send(None)  # start the loop
for row in rows:
    init.send(row)
init.close()

# Verify numberize.
numberized_rows = (tensorizer.numberize(r) for r in rows)
words, seq_len = next(numberized_rows)
assert words == [2, 3, 4, 5]
assert seq_len == 4  # "I want some coffee" has 4 words
words, seq_len = next(numberized_rows)
assert words == [6, 7, 8]
assert seq_len == 3  # "Turn it up" has 3 words

# test again, this time also make the tensors
numberized_rows = (tensorizer.numberize(r) for r in rows)
words_tensors, seq_len_tensors = tensorizer.tensorize(numberized_rows)
# Notice the padding (1) of the 2nd tensor to match the dimension
assert words_tensors.equal(torch.tensor([[2, 3, 4, 5], [6, 7, 8, 1]]))
assert seq_len_tensors.equal(torch.tensor([4, 3]))

Using External Dense Features

Sometimes you want to add external features to augment the inputs to your model. For example, if you want to classify a text that has an image associated with it, you might want to process the image separately and use features of this image along with the text to help the classifier. Those features are added to the input data as one extra field (column) and should look like a list of floats (json).

Let’s look at a simple example, first without the dense feature, then we add dense features.

Example: Simple Model

First, here’s an example of a simple classifier that uses just the text and no dense features. (This is only showing the relevant parts of the model code for simplicity.)

class MyModel(Model):
  class Config(Model.Config):
    class ModelInput(Model.Config.InputConfig):
      tokens: TokenTensorizer.Config = TokenTensorizer.Config()
      labels: LabelTensorizer.Config = LabelTensorizer.Config()

    inputs: ModelInput = ModelInput()
    token_embedding: WordEmbedding.Config = WordEmbedding.Config()

    representation: RepresentationBase.Config = DocNNRepresentation.Config()
    decoder: DecoderBase.Config = MLPDecoder.Config()
    output_layer: OutputLayerBase.Config = ClassificationOutputLayer.Config()

  @classmethod
  def from_config(cls, config, tensorizers):
    token_embedding = create_module(config.token_embedding, tensorizer=tensorizers["tokens"])
    representation = create_module(config.representation, embed_dim=token_embedding.embedding_dim)
    labels = tensorizers["labels"].vocab
    decoder = create_module(
        config.decoder,
        in_dim=representation.representation_dim,
        out_dim=len(labels),
    )
    output_layer = create_module(config.output_layer, labels=labels)
    return cls(token_embedding, representation, decoder, output_layer)

  def arrange_model_inputs(self, tensor_dict):
    return (tensor_dict["tokens"],)

  def forward(
      self,
      tokens_in: Tuple[torch.Tensor, torch.Tensor],
  ) -> List[torch.Tensor]:
        # tokens_in is the (tokens, seq_lens) tuple produced by the tokens tensorizer
        word_tokens, seq_lens = tokens_in
        embedding_out = self.embedding(word_tokens)
        representation_out = self.representation(embedding_out, seq_lens)
        return self.decoder(representation_out)

Example: Simple Model With Dense Features

To use the dense features, you will typically write your model to use them directly in the decoder, bypassing the embeddings and representation stages that process the text part of your inputs. Here’s the same example again, this time with the dense features added (see lines marked with <--).

class MyModel(Model):
  class Config(Model.Config):
    class ModelInput(Model.Config.InputConfig):
      tokens: TokenTensorizer.Config = TokenTensorizer.Config()
      dense: FloatListTensorizer.Config = FloatListTensorizer.Config()  # <--
      labels: LabelTensorizer.Config = LabelTensorizer.Config()

    inputs: ModelInput = ModelInput()
    token_embedding: WordEmbedding.Config = WordEmbedding.Config()

    representation: RepresentationBase.Config = DocNNRepresentation.Config()
    decoder: DecoderBase.Config = MLPDecoder.Config()
    output_layer: OutputLayerBase.Config = ClassificationOutputLayer.Config()

  @classmethod
  def from_config(cls, config, tensorizers):
    token_embedding = create_module(config.token_embedding, tensorizer=tensorizers["tokens"])
    representation = create_module(config.representation, embed_dim=token_embedding.embedding_dim)
    dense_dim = tensorizers["dense"].out_dim    # <--
    labels = tensorizers["labels"].vocab
    decoder = create_module(
        config.decoder,
        in_dim=representation.representation_dim + dense_dim,    # <--
        out_dim=len(labels),
    )
    output_layer = create_module(config.output_layer, labels=labels)
    return cls(token_embedding, representation, decoder, output_layer)

  def arrange_model_inputs(self, tensor_dict):
    return (tensor_dict["tokens"], tensor_dict["dense"])  # <--

  def forward(
      self,
      tokens_in: Tuple[torch.Tensor, torch.Tensor],
      dense_in: torch.Tensor,    # <--
  ) -> List[torch.Tensor]:
    # the tokens tensorizer yields the token ids and the sequence lengths
    word_tokens, seq_lens = tokens_in
    embedding_out = self.embedding(word_tokens)
    representation_out = self.representation(embedding_out, seq_lens)
    # append the dense features to the text representation of each row
    representation_out = torch.cat((representation_out, dense_in), 1)    # <--
    return self.decoder(representation_out)
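
The part that is easy to get wrong here is the dimension bookkeeping: the decoder's in_dim has to equal representation_dim + dense_dim because the two tensors are concatenated along dimension 1. The sketch below mimics that concatenation with plain Python lists instead of tensors. The helper concat_features and the sample values are illustrative only and are not part of PyText.

```python
from typing import List


def concat_features(
    representation: List[List[float]], dense: List[List[float]]
) -> List[List[float]]:
    """Concatenate per-row features, like torch.cat((a, b), 1) on 2-D inputs."""
    if len(representation) != len(dense):
        raise ValueError("both inputs must have the same batch size")
    return [rep_row + dense_row for rep_row, dense_row in zip(representation, dense)]


# A batch of 2 rows with representation_dim = 3 and dense_dim = 2 (made-up values).
representation_out = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dense_in = [[1.0, 0.0], [0.0, 1.0]]

combined = concat_features(representation_out, dense_in)
decoder_in_dim = len(combined[0])  # representation_dim + dense_dim
print(decoder_in_dim)
```

If the decoder were created with in_dim=representation_dim only, the first linear layer would receive rows that are dense_dim elements too wide.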

Creating A New Model

PyText uses a Model class as a central place to define components for data processing, model training, etc. and wire up those components.

In this tutorial, we will create a word tagging model for the ATIS dataset. The format of the ATIS dataset is explained in the Custom Data Format, so we will not repeat it here. We are going to create a similar data source that uses the slot tagging information rather than the intent information. We won’t describe in detail how this data source is created but you can look at the Custom Data Format, and the full source code for this tutorial in demo/my_tagging for more information.

This model will predict a “slot”, also called “tag” or “label”, for each word in the utterance, using the IOB2 format, where the O tag is used for Outside (no match), B- for Beginning and I- for Inside (continuation). Here’s an example:

{
  "text": "please list the flights from newark to los angeles",
  "slots": "O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name"
}
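
To see how the IOB2 tags line up with the words, the following sketch groups the tagged words of the example above back into slot spans. The function iob2_to_spans is a hypothetical helper written for this illustration, not a PyText API. It treats an I- tag that does not continue the currently open span as Outside, which is one possible way to handle malformed sequences.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(text: str, slots: str) -> List[Tuple[str, str]]:
    """Return (slot_name, phrase) pairs from space-separated words and IOB2 tags."""
    words, tags = text.split(), slots.split()
    if len(words) != len(tags):
        raise ValueError("expected exactly one tag per word")

    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    phrase: List[str] = []

    def close_span() -> None:
        # Store the currently open span, if there is one.
        if label is not None:
            spans.append((label, " ".join(phrase)))

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            close_span()
            label, phrase = tag[2:], [word]
        elif tag.startswith("I-") and label == tag[2:]:
            phrase.append(word)
        else:
            # "O", or an I- tag that does not continue the open span.
            close_span()
            label, phrase = None, []
    close_span()
    return spans


example = {
    "text": "please list the flights from newark to los angeles",
    "slots": "O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name",
}
print(iob2_to_spans(example["text"], example["slots"]))
```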

1. The Components

The first step is to specify the components used in this model by listing them in the Config class, the corresponding from_config function, and the constructor __init__.

Thanks to the modular nature of PyText, we can simply use many included common components, such as TokenTensorizer, WordEmbedding, BiLSTMSlotAttention and MLPDecoder. Since we’re also using the common pattern of embedding -> representation -> decoder -> output_layer, we use Model as a base class, so we don’t need to write __init__.

ModelInput defines how the data that is read will be transformed into tensors. This is done using a Tensorizer. These components take one or several columns (often strings) from each input row and create the corresponding numeric features in a properly padded tensor. The tensorizers will be initialized first, and in this step they will often parse the training data to create their Vocabulary.

In our case, the utterance is in the column “text” (which is the default column name for this tensorizer), and is composed of tokens (words), so we can use the TokenTensorizer. The Vocabulary will be created from all the utterances.

The slots are also composed of tokens: the IOB2 tags. We can also use TokenTensorizer for the column “slots”. This Vocabulary will be the list of IOB2 tags found in the “slots” column of the training data. This is a different column name, so we specify it.

class MyTaggingModel(Model):
    class Config(ConfigBase):
        class ModelInput(Model.Config.ModelInput):
            tokens: TokenTensorizer.Config = TokenTensorizer.Config()
            slots: TokenTensorizer.Config = TokenTensorizer.Config(column="slots")

        inputs: ModelInput = ModelInput()
        embedding: WordEmbedding.Config = WordEmbedding.Config()
        representation: BiLSTMSlotAttention.Config = BiLSTMSlotAttention.Config()
        decoder: MLPDecoder.Config = MLPDecoder.Config()
        output_layer: MyTaggingOutputLayer.Config = MyTaggingOutputLayer.Config()
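
To make the role of the Vocabulary concrete, here is a minimal sketch of how a vocabulary could be collected from one column of the training rows. It is not the TokenTensorizer implementation (which may also add special tokens, for example for padding). The helper build_vocab and the second sample row are hypothetical.

```python
from typing import Dict, Iterable, List


def build_vocab(rows: Iterable[Dict[str, str]], column: str) -> List[str]:
    """Collect the distinct space-separated tokens of one column, in first-seen order."""
    index: Dict[str, int] = {}
    for row in rows:
        for token in row[column].split():
            index.setdefault(token, len(index))
    return list(index)


rows = [
    {
        "text": "please list the flights from newark to los angeles",
        "slots": "O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name",
    },
    # Hypothetical extra row, only here to show that known tags are not duplicated.
    {"text": "flights to boston", "slots": "O O B-toloc.city_name"},
]

slot_vocab = build_vocab(rows, column="slots")
token_vocab = build_vocab(rows, column="text")
print(slot_vocab)
print(len(token_vocab))
```

The length of the slots vocabulary is what the decoder later uses as its output dimension.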

2. from_config method

from_config is where the components are created with the proper parameters. Some come from the Config (passed by the user in JSON format), some use the default values, and others are dictated by the model’s architecture so that the different components fit with each other. For example, the representation layer needs to know the dimension of the embeddings it will receive, and the decoder needs to know the dimension of the representation layer before it and the size of the slots vocab to output.

In this model, we only need one embedding: the one of the tokens. The slots don’t have embeddings because while they are listed as input (in ModelInput), they are actually outputs and they will be used in the output layer. (During training, they are inputs as true values.)

@classmethod
def from_config(cls, config, tensorizers):
    embedding = create_module(config.embedding, tensorizer=tensorizers["tokens"])
    representation = create_module(
        config.representation, embed_dim=embedding.embedding_dim
    )
    slots = tensorizers["slots"].vocab
    decoder = create_module(
        config.decoder,
        in_dim=representation.representation_dim,
        out_dim=len(slots),
    )
    output_layer = MyTaggingOutputLayer(slots, CrossEntropyLoss(None))
    # call __init__ constructor from super class Model
    return cls(embedding, representation, decoder, output_layer)

3. Forward method

The forward method contains the execution logic calling each of those components and passing the results of one to the next. It will be called for every row transformed into tensors.

TokenTensorizer returns the tensor for the tokens themselves and also the sequence length, which is the number of tokens in each utterance. This is because we need to pad the tensors in a batch to give them all the same dimensions, and LSTM-based representations need to differentiate the padding from the actual tokens.

def forward(
    self,
    word_tokens: torch.Tensor,
    seq_lens: torch.Tensor,
) -> List[torch.Tensor]:
    # fetch embeddings for the tokens in the utterance
    embedding = self.embedding(word_tokens)

    # pass the embeddings to the BiLSTMSlotAttention layer.
    # LSTM-based representations also need seq_lens.
    representation = self.representation(embedding, seq_lens)

    # some LSTM representations return extra tensors, we don't use those.
    if isinstance(representation, tuple):
        representation = representation[0]

    # finally run the results through the decoder
    return self.decoder(representation)
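
The reason forward receives seq_lens alongside the tokens can be illustrated without PyText: once rows of different lengths are padded to a common length, the original lengths are the only way to tell padding from real tokens. The sketch below uses plain lists. The helper pad_batch and the pad index value are assumptions made for the example.

```python
from typing import List, Tuple


def pad_batch(batch: List[List[int]], pad_idx: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad every row to the longest row and return (padded_rows, seq_lens)."""
    seq_lens = [len(row) for row in batch]
    max_len = max(seq_lens, default=0)
    padded = [row + [pad_idx] * (max_len - len(row)) for row in batch]
    return padded, seq_lens


# Two utterances already converted to token ids (made-up ids), padded with id 1.
padded, seq_lens = pad_batch([[5, 6, 7], [8]], pad_idx=1)
print(padded)
print(seq_lens)
```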

4. Complete MyTaggingModel

To finish this class, we need to define a few more functions.

All the inputs are placed in a python dict where the key is the name of the tensorizer as defined in ModelInput, and the value is the tensor for this input row.

First, we define how the inputs will be passed to the forward function in arrange_model_inputs. In our case, the only input passed to the forward function is the tensors from the “tokens” input. As explained above, TokenTensorizer returns 2 tensors: the tokens and the sequence length. (Actually it returns 3 tensors; we’ll ignore the third one, the token ranges, in this tutorial.)

Then we define arrange_targets, which is doing something similar for the targets, which are passed to the loss function during training. In our case, it’s the “slots” tensorizer doing that. The padding value can be passed to the loss function (unlike LSTM representations), so we only need the first tensor.

def arrange_model_inputs(self, tensor_dict):
    tokens, seq_lens, _ = tensor_dict["tokens"]
    return (tokens, seq_lens)

def arrange_targets(self, tensor_dict):
    slots, _, _ = tensor_dict["slots"]
    return slots

5. Output Layer

So far, our model is using the same components as any other model, including a common classification model, except for two things: the BiLSTMSlotAttention and the output layer.

BiLSTMSlotAttention is a multi-layer bidirectional LSTM based representation with attention over slots. The implementation of this representation is outside the scope of this tutorial, and this component is already included in PyText, so we’ll just use it.

The output layer is simple enough to write ourselves, and it demonstrates a few important notions in PyText, like how the loss function is tied to the output layer. We implement it like this:

class MyTaggingOutputLayer(OutputLayerBase):

    class Config(OutputLayerBase.Config):
        loss: CrossEntropyLoss.Config = CrossEntropyLoss.Config()

    @classmethod
    def from_config(cls, config, vocab, pad_token):
        return cls(
            vocab,
            create_loss(config.loss, ignore_index=pad_token),
        )

    def get_loss(self, logit, target, context, reduce=True):
        # flatten the logit from [batch_size, seq_lens, dim] to
        # [batch_size * seq_lens, dim]
        return self.loss_fn(logit.view(-1, logit.size()[-1]), target.view(-1), reduce)

    def get_pred(self, logit, *args, **kwargs):
        preds = torch.max(logit, 2)[1]
        scores = F.log_softmax(logit, 2)
        return preds, scores
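
The two ideas in get_loss, flattening the [batch_size, seq_lens, dim] logits and skipping padded positions through ignore_index, can be sketched in plain Python. This is an illustration of the arithmetic under the assumption of mean reduction over the non-ignored targets. It is not the CrossEntropyLoss code used by PyText, and all names and values below are made up for the example.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def flat_cross_entropy(
    logits: List[List[List[float]]], targets: List[List[int]], ignore_index: int
) -> float:
    """Mean cross-entropy over all positions whose target is not ignore_index."""
    # Flatten [batch, seq, dim] -> [batch * seq, dim] and [batch, seq] -> [batch * seq].
    flat_logits = [position for sequence in logits for position in sequence]
    flat_targets = [target for sequence in targets for target in sequence]
    losses = [
        -log_softmax(row)[target]
        for row, target in zip(flat_logits, flat_targets)
        if target != ignore_index
    ]
    return sum(losses) / len(losses) if losses else 0.0


def argmax_preds(logits: List[List[List[float]]]) -> List[List[int]]:
    """Index of the highest logit at every position, like torch.max(logit, 2)[1]."""
    return [[row.index(max(row)) for row in sequence] for sequence in logits]


PAD = 0  # assumed index of the padding token in the slots vocab
logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
targets = [[1, 2], [2, PAD]]  # the last position is padding and is skipped

loss = flat_cross_entropy(logits, targets, ignore_index=PAD)
print(loss)
```

With uniform logits over three classes, each counted position contributes log(3), and the padded position does not change the mean.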

6. Metric Reporter

Next we need to write a MetricReporter to calculate metrics and report model training/test results:

The MetricReporter base class aggregates all the output from Trainer, including predictions, scores and targets. The default aggregation behavior is concatenating the tensors from each batch and converting them to a list. If you want different aggregation behavior, you can override it with your own implementation. Here we use the compute_classification_metrics method provided in pytext.metrics to get the precision/recall/F1 scores. PyText ships with a few common metric calculation methods, but you can easily incorporate other libraries, such as sklearn.

In the __init__ method, we can pass a list of Channel to report the results to any output stream. We use a simple ConsoleChannel that prints everything to stdout and a TensorBoardChannel that outputs metrics to TensorBoard:

class MyTaggingMetricReporter(MetricReporter):

    @classmethod
    def from_config(cls, config, vocab):
        return MyTaggingMetricReporter(
            channels=[ConsoleChannel(), TensorBoardChannel()],
            label_names=vocab
        )

    def __init__(self, label_names, channels):
        super().__init__(channels)
        self.label_names = label_names

    def calculate_metric(self):
        return compute_classification_metrics(
            list(
                itertools.chain.from_iterable(
                    (
                        LabelPrediction(s, p, e)
                        for s, p, e in zip(scores, pred, expect)
                    )
                    for scores, pred, expect in zip(
                        self.all_scores, self.all_preds, self.all_targets
                    )
                )
            ),
            self.label_names,
            self.calculate_loss(),
        )
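
For readers who want to see what the reporter computes, the sketch below flattens per-row predictions and targets in the same way as the nested generator above and derives per-label precision, recall and F1. It is a simplified stand-in written for this tutorial and does not reproduce compute_classification_metrics. The function name and the sample data are assumptions.

```python
import itertools
from collections import Counter
from typing import Dict, List, Tuple


def per_label_prf(preds: List[str], targets: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for flat prediction/target lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, target in zip(preds, targets):
        if pred == target:
            tp[target] += 1
        else:
            fp[pred] += 1    # predicted this label, but it was wrong
            fn[target] += 1  # missed the true label
    results = {}
    for label in sorted(set(preds) | set(targets)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


# Two rows of two words each (made-up tags).
all_preds = [["O", "B-x"], ["O", "O"]]
all_targets = [["O", "B-x"], ["B-x", "O"]]

flat_preds = list(itertools.chain.from_iterable(all_preds))
flat_targets = list(itertools.chain.from_iterable(all_targets))
metrics = per_label_prf(flat_preds, flat_targets)
print(metrics)
```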

7. Task

Finally, we declare a task by inheriting from NewTask. This base class specifies the training parameters of the model: the data source and batcher, the trainer class (most models will use the default one), and the metric reporter.

Since our metric reporter needs to be initialized with a specific vocab, we need to define the classmethod create_metric_reporter so that PyText can construct it properly.

class MyTaggingTask(NewTask):
    class Config(NewTask.Config):
        model: MyTaggingModel.Config = MyTaggingModel.Config()
        metric_reporter: MyTaggingMetricReporter.Config = MyTaggingMetricReporter.Config()

    @classmethod
    def create_metric_reporter(cls, config, tensorizers):
        return MyTaggingMetricReporter(
            channels=[ConsoleChannel(), TensorBoardChannel()],
            label_names=list(tensorizers["slots"].vocab),
        )

8. Generate sample config and train the model

Save all your files in the same directory. For example, I saved all my files in my_tagging/. Now you can tell PyText to include your classes with the parameter --include my_tagging.

Now that we have a fully functional Task, we can generate a default JSON config for it by using the pytext CLI tool.

(pytext) $ pytext --include my_tagging gen-default-config MyTaggingTask > my_config.json

Tweak the config as you like, for instance change the number of epochs. Most importantly, specify the path to your ATIS dataset. Then train the model with:

(pytext) $ pytext --include my_tagging train < my_config.json
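
If you prefer to tweak the generated config from a script rather than by hand, a small helper that updates a nested key is enough. The sketch below works on an in-memory dict. The key paths it uses (task, MyTaggingTask, trainer, epochs) are assumptions for illustration only, so check the layout of your own my_config.json before relying on them.

```python
import copy
import json
from typing import Any, Dict, List


def set_nested(config: Dict[str, Any], path: List[str], value: Any) -> Dict[str, Any]:
    """Return a copy of config with the value at the given key path replaced."""
    updated = copy.deepcopy(config)
    node = updated
    for key in path[:-1]:
        node = node[key]  # raises KeyError on a mistyped path instead of creating it
    node[path[-1]] = value
    return updated


# Hypothetical stand-in for the contents of my_config.json.
config = {"task": {"MyTaggingTask": {"trainer": {"epochs": 10}}}}

updated = set_nested(config, ["task", "MyTaggingTask", "trainer", "epochs"], 20)
print(json.dumps(updated, indent=2))
```

In practice you would load the file with json.load, apply the change, and write it back with json.dump before running the train command.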

Hacking PyText

Using your own classes in PyText

Most people just want to create their own components and use them to load their data, train models, etc. In this case, you just need to put all your .py files in a directory and include it with the option --include <my directory>. PyText will be able to find your code and import your classes. This works with PyText from pip install or from github sources.

Example with Custom Data Source

Changing PyText

Why would you want to change PyText? Maybe you want to fix one of the github issues, or you want to experiment with your own changes that you can’t simply include and you would like to see included in PyText’s future releases. In this case you need to download the sources and submit them back to github. Since getting your changes ready and integrated can take some time, you might need to keep your sources up to date. Here’s how to do it.

Installation

First, make a copy of the PyText repo into your github account. For that (you need a github account), go to the PyText repo and click the Fork button at top-right of the page.

Once the fork is complete, clone your fork onto your computer by clicking the “Clone or download” button and copying the URL. Then, in a terminal, use the git clone command to clone the repo in the directory of your choice.

$ git clone https://github.com/<your_account>/pytext.git

To be able to update your github fork with the latest changes from Facebook’s PyText sources, you need to add it as a “remote” with this command. (This can be done later.) The name “upstream” is what’s typically used, but you can use any name you want for your remotes.

$ git remote add upstream https://github.com/facebookresearch/pytext.git

Now you should have 2 remotes: origin is your own github fork, and upstream is facebook’s github repo.

Now you can install the PyText dependencies in a virtual environment. (This means the dependencies will be installed inside the directory pytext_venv under pytext/, not in your machine’s system directory.) Notice the (pytext_venv) in the terminal prompt when it’s active.

$ cd pytext
$ source activation_venv
(pytext_venv) $ ./install_deps

To exit the virtual environment:

(pytext_venv) $ deactivate

Writing Code

After you’ve made some code changes, you need to create a branch to commit your code. Do not commit your code in your master branch! Give your branch a name that represents what your experiment is about. Then add your changes and commit them.

$ git checkout -b <my_experiment>
$ git status -sb
... # list of files you changed
$ git add <file1> <file2>
$ git diff --cached  # see the code changes you added
# ... maybe keep changing and run git add again
$ git commit  # save your changes
$ git show # optional, look at the code changes
$ git push --set-upstream origin <my_experiment>  # send your branch to your github fork

At this point you should be able to see your branch in your github repo and create a Pull Request to Facebook’s github if you want to submit it for review and later have it integrated.

Keeping Up-to-Date

To resume development in an already cloned repo, you might need to re-activate the virtual environment:

$ cd pytext
$ source activation_venv

If you need to update your github repo with the latest changes in the Facebook upstream repo, fetch the changes with this command, merge your master with those changes, and push the changes to your github fork. In order to do that, you can’t have any pending changes, so make sure you commit your current work to a branch.

$ git fetch upstream
$ git checkout master
$ git merge upstream/master
$ git push

Important: never commit changes in your master. Doing this would prevent further updates. Instead, always commit changes to a branch. (See Writing Code above for more on this.)

Finally, you might need to rebase your branches to the latest master. Check out the branch, rebase it, and (optionally) push it again to your github fork.

$ git checkout <my_experiment>
$ git rebase master
$ git push  # optional

Modifying your Pull Request

Many times you will need to modify your code and submit your pull request again. Maybe you found a bug that you need to fix, or you want to integrate some feedback you got in the pull request, or after you rebased your branch you had to solve a conflict.

If you’re going to change your pull request, it’s always a good idea to start by rebasing your branch on the latest upstream/master (see above).

After making your changes, amend to your existing commit rather than creating a new commit on top of it. This is to ensure your changes are in a single clean commit that does not contain your failed experiments. At this point, you will have a branch <my_experiment>, and the branch you pushed to your github forked origin/<my_experiment>. Then you will need to force the push to replace the github branch with your changes. The pull request will be automatically updated upstream.

$ git commit --amend
$ git push --force

Addendum

One commit or multiple commits?

For most contributions, you will want to keep your pull request as a single, clean commit. It’s better to amend the same commit rather than keeping the entire history of intermediate edits.

If your change is more involved, it might be better to create multiple commits, as long as each commit does one thing and is self contained.

Code Quality

In order to get your pull request integrated with PyText, it needs to pass the tests and be reviewed. Each pull request automatically runs the CircleCI tests, and they must all be green for your pull request to be accepted. These tests include building the documentation, running the unit tests under Python 3.6 and 3.7, and running the linter black to verify code formatting. You can run the linter yourself after installing it with pip install black.

If all the tests are green, people will start reviewing your changes. (You too can review other pull requests and make comments and suggestions.) If reviewers ask questions or make suggestions, try your best to answer them with comments or code changes.

A very common reason to reject a pull request is lack of unit testing. Make sure your code is covered by unit tests (add your own tests) so that it works now and keeps working in the future when other people make changes to your code!

Creating Documentation

Whether you want to add documentation for your feature in code, or just change the existing documentation, you will need to test it locally. First install extra dependencies needed to build the documentation:

$ pip install --upgrade -r docs_requirements.txt
$ pip install --upgrade -r pytext/docs/requirements.txt

Then you can build the documentation

$ cd pytext/docs
$ make html

Finally you can look at the documentation produced with a URL like this file:///<path_to_pytext_sources>/pytext/docs/build/html/hacking_pytext.html

Useful git alias

One of the most useful commands for git is one that prints the commits and branches like a tree. This is a complex command that is most useful when stored as an alias, so we’re giving it here.

$ git config --global alias.lg "log --pretty=tformat:'%C(yellow)%h %Cgreen(%ad)%Cred%d %Creset%s %C(bold blue)<%cn>%Creset' --decorate --date=short --date=local --graph --all"

$ # try it
$ git lg

pytext

config

field_config

FeatureConfig

Component: Module

class pytext.config.field_config.FeatureConfig[source]

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
word_feat: WordEmbedding.Config = WordEmbedding.Config()
seq_word_feat: Optional[WordEmbedding.Config] = None
dict_feat: Optional[DictEmbedding.Config] = None
char_feat: Optional[CharacterEmbedding.Config] = None
dense_feat: Optional[FloatVectorConfig] = None
contextual_token_embedding: Optional[ContextualTokenEmbedding.Config] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "word_feat": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "seq_word_feat": null,
    "dict_feat": null,
    "char_feat": null,
    "dense_feat": null,
    "contextual_token_embedding": null
}
FloatVectorConfig
class pytext.config.field_config.FloatVectorConfig[source]

Bases: ConfigBase

All Attributes (including base classes)

dim: int = 0
export_input_names: list[str] = ['float_vec_vals']
dim_error_check: bool = False

Default JSON

{
    "dim": 0,
    "export_input_names": [
        "float_vec_vals"
    ],
    "dim_error_check": false
}

module_config

CNNParams
class pytext.config.module_config.CNNParams[source]

Bases: ConfigBase

All Attributes (including base classes)

kernel_num: int = 100
kernel_sizes: list[int] = [3, 4]
weight_norm: bool = False
dilated: bool = False
causal: bool = False

Default JSON

{
    "kernel_num": 100,
    "kernel_sizes": [
        3,
        4
    ],
    "weight_norm": false,
    "dilated": false,
    "causal": false
}

pytext_config

PyTextConfig
class pytext.config.pytext_config.PyTextConfig[source]

Bases: ConfigBase

All Attributes (including base classes)

task: Union[TaskBase.Config, Task_Deprecated.Config, _NewTask.Config, NewTask.Config, DisjointMultitask.Config, NewDisjointMultitask.Config, QueryDocumentPairwiseRankingTask.Config, EnsembleTask.Config, DocumentClassificationTask.Config, DocumentRegressionTask.Config, NewBertClassificationTask.Config, NewBertPairClassificationTask.Config, BertPairRegressionTask.Config, WordTaggingTask.Config, IntentSlotTask.Config, LMTask.Config, MaskedLMTask.Config, PairwiseClassificationTask.Config, RoBERTaNERTask.Config, SeqNNTask.Config, SquadQATask.Config, SemanticParsingTask.Config]
use_cuda_if_available: bool = True
use_fp16: bool = False
distributed_world_size: int = 1
gpu_streams_for_distributed_training: int = 1
load_snapshot_path: str = ''
save_snapshot_path: str = '/tmp/model.pt'
use_config_from_snapshot: bool = True
auto_resume_from_snapshot: bool = False
export_caffe2_path: Optional[str] = None
export_onnx_path: str = '/tmp/model.onnx'
export_torchscript_path: Optional[str] = None
torchscript_quantize: Optional[bool] = False
modules_save_dir: str = ''
save_module_checkpoints: bool = False
save_all_checkpoints: bool = False
use_tensorboard: bool = True
random_seed: Optional[int] = None
Seed value to seed torch, python, and numpy random generators.
use_deterministic_cudnn: bool = False
Whether to allow CuDNN to behave deterministically.
report_eval_results: bool = False
include_dirs: Optional[list[str]] = None
version: int
use_cuda_for_testing: bool = True
test_out_path: str = '/tmp/test_out.txt'
debug_path: str = '/tmp/model.debug'

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

data

batch_sampler

AlternatingRandomizedBatchSampler.Config

Component: AlternatingRandomizedBatchSampler

class AlternatingRandomizedBatchSampler.Config[source]

Bases: Component.Config

All Attributes (including base classes)

unnormalized_iterator_probs: dict[str, float]
second_unnormalized_iterator_probs: dict[str, float]

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

BaseBatchSampler.Config

Component: BaseBatchSampler

class BaseBatchSampler.Config

Bases: Component.Config

All Attributes (including base classes)

Subclasses
  • EvalBatchSampler.Config

Default JSON

{}
EvalBatchSampler.Config

Component: EvalBatchSampler

class EvalBatchSampler.Config

Bases: BaseBatchSampler.Config

All Attributes (including base classes)

Default JSON

{}
RandomizedBatchSampler.Config

Component: RandomizedBatchSampler

class RandomizedBatchSampler.Config[source]

Bases: Component.Config

All Attributes (including base classes)

unnormalized_iterator_probs: dict[str, float]

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
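
The name unnormalized_iterator_probs suggests weights that are normalized into probabilities before an iterator is picked for each batch. The sketch below shows that idea only. It is an assumption about the intent of the parameter, not the RandomizedBatchSampler implementation, and the task names are made up.

```python
import random
from typing import Dict, List


def normalize(unnormalized: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so that they sum to 1."""
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in unnormalized.items()}


def sample_iterator_names(unnormalized: Dict[str, float], count: int, seed: int = 0) -> List[str]:
    """Draw `count` iterator names according to the normalized probabilities."""
    probs = normalize(unnormalized)
    names = list(probs)
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    return rng.choices(names, weights=[probs[name] for name in names], k=count)


# Hypothetical task names: "tagging" should be drawn about three times as often.
weights = {"intent": 1.0, "tagging": 3.0}
print(normalize(weights))
print(sample_iterator_names(weights, count=8))
```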

RoundRobinBatchSampler.Config

Component: RoundRobinBatchSampler

class RoundRobinBatchSampler.Config[source]

Bases: Component.Config

All Attributes (including base classes)

iter_to_set_epoch: str = ''

Default JSON

{
    "iter_to_set_epoch": ""
}

bert_tensorizer

BERTTensorizer.Config

Component: BERTTensorizer

class BERTTensorizer.Config[source]

Bases: BERTTensorizerBase.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['text']
tokenizer: Tokenizer.Config = WordPieceTokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = '/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
max_seq_len: int = 256
Subclasses
  • SquadForBERTTensorizer.Config
  • SquadForBERTTensorizerForKD.Config

Default JSON

{
    "is_input": true,
    "columns": [
        "text"
    ],
    "tokenizer": {
        "WordPieceTokenizer": {
            "basic_tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            },
            "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
        }
    },
    "base_tokenizer": null,
    "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
    "max_seq_len": 256
}
BERTTensorizerBase.Config

Component: BERTTensorizerBase

class BERTTensorizerBase.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['text']
tokenizer: Tokenizer.Config = Tokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = ''
max_seq_len: int = 256
Subclasses
  • BERTTensorizer.Config
  • RoBERTaTensorizer.Config
  • RoBERTaTokenLevelTensorizer.Config
  • SquadForBERTTensorizer.Config
  • SquadForBERTTensorizerForKD.Config
  • SquadForRoBERTaTensorizer.Config

Default JSON

{
    "is_input": true,
    "columns": [
        "text"
    ],
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    },
    "base_tokenizer": null,
    "vocab_file": "",
    "max_seq_len": 256
}

data

Batcher.Config

Component: Batcher

class Batcher.Config[source]

Bases: Component.Config

All Attributes (including base classes)

train_batch_size: int = 16
Make batches of this size when possible. If there’s not enough data, might generate some smaller batches.
eval_batch_size: int = 16
test_batch_size: int = 16
Subclasses
  • PoolingBatcher.Config
  • DynamicPoolingBatcher.Config
  • ExponentialDynamicPoolingBatcher.Config
  • LinearDynamicPoolingBatcher.Config

Default JSON

{
    "train_batch_size": 16,
    "eval_batch_size": 16,
    "test_batch_size": 16
}
Data.Config

Component: Data

class Data.Config[source]

Bases: Component.Config

All Attributes (including base classes)

source: DataSource.Config = TSVDataSource.Config()
Specify where training/test/eval data come from. The default value will not provide any data.
batcher: Batcher.Config = PoolingBatcher.Config()
How training examples are split into batches for the optimizer.
sort_key: Optional[str] = None
in_memory: Optional[bool] = True
Cache numberized results in memory; turn this off when CPU memory bound.
Subclasses
  • PackedLMData.Config

Default JSON

{
    "source": {
        "TSVDataSource": {
            "column_mapping": {},
            "train_filename": null,
            "test_filename": null,
            "eval_filename": null,
            "field_names": null,
            "delimiter": "\t",
            "quoted": false,
            "drop_incomplete_rows": false
        }
    },
    "batcher": {
        "PoolingBatcher": {
            "train_batch_size": 16,
            "eval_batch_size": 16,
            "test_batch_size": 16,
            "pool_num_batches": 10000,
            "num_shuffled_pools": 1
        }
    },
    "sort_key": null,
    "in_memory": true
}
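
In the default JSON above, fields that accept several component types (source, batcher) are written as an object with a single key naming the chosen component, whose value holds that component's parameters. The sketch below shows one way such a shape could be resolved into an object. The registry, the dataclass and resolve_component are hypothetical and do not describe how PyText itself parses configs.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class PoolingBatcherSettings:
    """Hypothetical stand-in that mirrors the PoolingBatcher fields shown above."""
    train_batch_size: int = 16
    eval_batch_size: int = 16
    test_batch_size: int = 16
    pool_num_batches: int = 10000
    num_shuffled_pools: int = 1


REGISTRY = {"PoolingBatcher": PoolingBatcherSettings}


def resolve_component(config: Dict[str, Dict[str, Any]], registry: Dict[str, type]) -> Any:
    """Turn {"ComponentName": {...params}} into an instance of the registered class."""
    if len(config) != 1:
        raise ValueError("expected exactly one component name as the key")
    (name, params), = config.items()
    if name not in registry:
        raise KeyError(f"unknown component: {name}")
    return registry[name](**params)


batcher = resolve_component(
    {"PoolingBatcher": {"train_batch_size": 32, "pool_num_batches": 100}}, REGISTRY
)
print(batcher)
```

Parameters that are omitted from the JSON fall back to the defaults of the chosen component.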
PoolingBatcher.Config

Component: PoolingBatcher

class PoolingBatcher.Config[source]

Bases: Batcher.Config

All Attributes (including base classes)

train_batch_size: int = 16
eval_batch_size: int = 16
test_batch_size: int = 16
pool_num_batches: int = 10000
Size of a pool expressed in number of batches
num_shuffled_pools: int = 1
How many pool-sized chunks to load at a time for shuffling
Subclasses
  • DynamicPoolingBatcher.Config
  • ExponentialDynamicPoolingBatcher.Config
  • LinearDynamicPoolingBatcher.Config

Default JSON

{
    "train_batch_size": 16,
    "eval_batch_size": 16,
    "test_batch_size": 16,
    "pool_num_batches": 10000,
    "num_shuffled_pools": 1
}
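
One plausible reading of pool_num_batches is that examples are read in pools of train_batch_size * pool_num_batches items, shuffled inside each pool, and then cut into batches. The sketch below implements that reading only. It ignores num_shuffled_pools and is not the PoolingBatcher source. The function name is an assumption.

```python
import random
from typing import Iterable, Iterator, List


def pooled_batches(
    examples: Iterable[int], batch_size: int, pool_num_batches: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield batches built from shuffled pools of batch_size * pool_num_batches items."""
    rng = random.Random(seed)
    pool_size = batch_size * pool_num_batches

    def flush(pool: List[int]) -> Iterator[List[int]]:
        rng.shuffle(pool)  # shuffle only within the current pool
        for start in range(0, len(pool), batch_size):
            yield pool[start:start + batch_size]

    pool: List[int] = []
    for example in examples:
        pool.append(example)
        if len(pool) == pool_size:
            yield from flush(pool)
            pool = []
    if pool:
        # The final pool may be smaller, so its last batch may be smaller too.
        yield from flush(pool)


batches = list(pooled_batches(range(10), batch_size=2, pool_num_batches=2))
print(batches)
```

Because shuffling never crosses a pool boundary, the first two batches always contain exactly the first four examples, in some order.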

data_handler

DataHandler.Config

Component: DataHandler

class DataHandler.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

columns_to_read: list[str] = []
shuffle: bool = True
sort_within_batch: bool = True
train_path: str = 'train.tsv'
eval_path: str = 'eval.tsv'
test_path: str = 'test.tsv'
train_batch_size: int = 128
eval_batch_size: int = 128
test_batch_size: int = 128
column_mapping: dict[str, str] = {}
Subclasses
  • DisjointMultitaskDataHandler.Config

Default JSON

{
    "columns_to_read": [],
    "shuffle": true,
    "sort_within_batch": true,
    "train_path": "train.tsv",
    "eval_path": "eval.tsv",
    "test_path": "test.tsv",
    "train_batch_size": 128,
    "eval_batch_size": 128,
    "test_batch_size": 128,
    "column_mapping": {}
}

disjoint_multitask_data

DisjointMultitaskData.Config

Component: DisjointMultitaskData

class DisjointMultitaskData.Config[source]

Bases: Component.Config

All Attributes (including base classes)

sampler: BaseBatchSampler.Config = RoundRobinBatchSampler.Config()
test_key: Optional[str] = None

Default JSON

{
    "sampler": {
        "RoundRobinBatchSampler": {
            "iter_to_set_epoch": ""
        }
    },
    "test_key": null
}

disjoint_multitask_data_handler

DisjointMultitaskDataHandler.Config

Component: DisjointMultitaskDataHandler

class DisjointMultitaskDataHandler.Config[source]

Bases: DataHandler.Config

Configuration class for DisjointMultitaskDataHandler.

upsample

If True, keep cycling over each iterator in round-robin order, so iterators with fewer batches get more passes. If False, make a single pass over each iterator; iterators that run out sit idle. The False setting is used for evaluation. Default: True.

Type: bool
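
The two upsample modes can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than PyText's implementation: the function name round_robin is hypothetical, each task is assumed to have at least one batch, and in the upsampling case one pass is assumed to end once the longest task has been seen in full.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple


def round_robin(
    batches_by_task: Dict[str, List[str]], upsample: bool
) -> Iterator[Tuple[str, str]]:
    """Hypothetical sketch of round-robin scheduling over non-empty per-task batch lists."""
    if upsample:
        # Assumption: stop after the longest task has been fully seen once.
        longest = max(len(b) for b in batches_by_task.values())
        cyclers = {name: cycle(b) for name, b in batches_by_task.items()}
        for _ in range(longest):
            for name, it in cyclers.items():
                yield name, next(it)  # shorter tasks wrap around and repeat
    else:
        # Single pass: a task that runs out is dropped and sits idle.
        iters = {name: iter(b) for name, b in batches_by_task.items()}
        while iters:
            for name in list(iters):
                try:
                    yield name, next(iters[name])
                except StopIteration:
                    del iters[name]


tasks = {"a": ["a0", "a1", "a2"], "b": ["b0"]}
print(list(round_robin(tasks, upsample=True)))
print(list(round_robin(tasks, upsample=False)))
```

With upsampling, the single batch of task "b" is repeated on every round; without it, "b" contributes once and the remaining rounds contain only task "a".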

All Attributes (including base classes)

columns_to_read: list[str] = []
shuffle: bool = True
sort_within_batch: bool = True
train_path: str = 'train.tsv'
eval_path: str = 'eval.tsv'
test_path: str = 'test.tsv'
train_batch_size: int = 128
eval_batch_size: int = 128
test_batch_size: int = 128
column_mapping: dict[str, str] = {}
upsample: bool = True

Default JSON

{
    "columns_to_read": [],
    "shuffle": true,
    "sort_within_batch": true,
    "train_path": "train.tsv",
    "eval_path": "eval.tsv",
    "test_path": "test.tsv",
    "train_batch_size": 128,
    "eval_batch_size": 128,
    "test_batch_size": 128,
    "column_mapping": {},
    "upsample": true
}

dynamic_pooling_batcher

BatcherSchedulerConfig

Component: Module

class pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig[source]

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
start_batch_size: int = 32
end_batch_size: int = 256
epoch_period: int = 10
step_size: int = 1
Subclasses
  • ExponentialBatcherSchedulerConfig

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "start_batch_size": 32,
    "end_batch_size": 256,
    "epoch_period": 10,
    "step_size": 1
}
DynamicPoolingBatcher.Config

Component: DynamicPoolingBatcher

class DynamicPoolingBatcher.Config[source]

Bases: PoolingBatcher.Config

All Attributes (including base classes)

train_batch_size: int = 16
eval_batch_size: int = 16
test_batch_size: int = 16
pool_num_batches: int = 10000
num_shuffled_pools: int = 1
scheduler_config: BatcherSchedulerConfig = BatcherSchedulerConfig()
Subclasses
  • ExponentialDynamicPoolingBatcher.Config
  • LinearDynamicPoolingBatcher.Config

Default JSON

{
    "train_batch_size": 16,
    "eval_batch_size": 16,
    "test_batch_size": 16,
    "pool_num_batches": 10000,
    "num_shuffled_pools": 1,
    "scheduler_config": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "start_batch_size": 32,
        "end_batch_size": 256,
        "epoch_period": 10,
        "step_size": 1
    }
}
ExponentialBatcherSchedulerConfig

Component: Module

class pytext.data.dynamic_pooling_batcher.ExponentialBatcherSchedulerConfig[source]

Bases: BatcherSchedulerConfig

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
start_batch_size: int = 32
end_batch_size: int = 256
epoch_period: int = 10
step_size: int = 1
gamma: float = 5

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "start_batch_size": 32,
    "end_batch_size": 256,
    "epoch_period": 10,
    "step_size": 1,
    "gamma": 5
}
ExponentialDynamicPoolingBatcher.Config

Component: ExponentialDynamicPoolingBatcher

class ExponentialDynamicPoolingBatcher.Config[source]

Bases: DynamicPoolingBatcher.Config

All Attributes (including base classes)

train_batch_size: int = 16
eval_batch_size: int = 16
test_batch_size: int = 16
pool_num_batches: int = 10000
num_shuffled_pools: int = 1
scheduler_config: ExponentialBatcherSchedulerConfig = BatcherSchedulerConfig()

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

LinearDynamicPoolingBatcher.Config

Component: LinearDynamicPoolingBatcher

class LinearDynamicPoolingBatcher.Config

Bases: DynamicPoolingBatcher.Config

All Attributes (including base classes)

train_batch_size: int = 16
eval_batch_size: int = 16
test_batch_size: int = 16
pool_num_batches: int = 10000
num_shuffled_pools: int = 1
scheduler_config: BatcherSchedulerConfig = BatcherSchedulerConfig()

Default JSON

{
    "train_batch_size": 16,
    "eval_batch_size": 16,
    "test_batch_size": 16,
    "pool_num_batches": 10000,
    "num_shuffled_pools": 1,
    "scheduler_config": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "start_batch_size": 32,
        "end_batch_size": 256,
        "epoch_period": 10,
        "step_size": 1
    }
}
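
This reference does not spell out how the scheduler fields turn into a batch size, so the sketch below shows only one plausible reading: the batch size grows from start_batch_size to end_batch_size over epoch_period epochs, changing every step_size epochs, either linearly or by a factor of gamma per step. Both function names and both formulas are assumptions made for illustration; the actual PyText computation may differ.

```python
import math


def linear_batch_size(
    epoch: int,
    start_batch_size: int = 32,
    end_batch_size: int = 256,
    epoch_period: int = 10,
    step_size: int = 1,
) -> int:
    """Assumed linear schedule: equal increments every step_size epochs."""
    steps_done = min(epoch, epoch_period) // step_size
    total_steps = epoch_period / step_size
    delta = end_batch_size - start_batch_size
    return math.ceil(start_batch_size + delta * steps_done / total_steps)


def exponential_batch_size(
    epoch: int,
    start_batch_size: int = 32,
    end_batch_size: int = 256,
    epoch_period: int = 10,
    step_size: int = 1,
    gamma: float = 5.0,
) -> int:
    """Assumed exponential schedule: multiply by gamma per step, capped at the end size."""
    steps_done = min(epoch, epoch_period) // step_size
    return min(end_batch_size, math.ceil(start_batch_size * gamma ** steps_done))


for epoch in (0, 1, 5, 10):
    print(epoch, linear_batch_size(epoch), exponential_batch_size(epoch))
```

Under these assumptions the default gamma of 5 reaches end_batch_size after two steps, while the linear schedule takes the full epoch_period.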

featurizer

featurizer
Featurizer.Config

Component: Featurizer

class Featurizer.Config

Bases: Component.Config

All Attributes (including base classes)

Default JSON

{}
simple_featurizer
SimpleFeaturizer.Config

Component: SimpleFeaturizer

class SimpleFeaturizer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

sentence_markers: Optional[tuple[str, str]] = None
lowercase_tokens: bool = True
split_regex: str = '\\s+'
convert_to_bytes: bool = False

Default JSON

{
    "sentence_markers": null,
    "lowercase_tokens": true,
    "split_regex": "\\s+",
    "convert_to_bytes": false
}
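
The effect of split_regex, lowercase_tokens, and sentence_markers can be approximated with the standard re module. This is a minimal sketch, not the SimpleFeaturizer code: the function name simple_tokenize is hypothetical, and dropping empty tokens is an assumption.

```python
import re
from typing import List, Optional, Tuple


def simple_tokenize(
    text: str,
    split_regex: str = r"\s+",
    lowercase_tokens: bool = True,
    sentence_markers: Optional[Tuple[str, str]] = None,
) -> List[str]:
    """Hypothetical helper: split text on a regex, optionally lowercasing and adding markers."""
    if lowercase_tokens:
        text = text.lower()
    # Assumption: empty strings produced by leading/trailing separators are dropped.
    tokens = [tok for tok in re.split(split_regex, text) if tok]
    if sentence_markers is not None:
        begin, end = sentence_markers
        tokens = [begin, *tokens, end]
    return tokens


print(simple_tokenize("  Hello   World "))
print(simple_tokenize("Who's there?", split_regex=r"\W+"))
print(simple_tokenize("Hello World", sentence_markers=("<s>", "</s>")))
```

Note that the JSON form of the default, "\\s+", is the same regular expression as the Python raw string r"\s+".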

packed_lm_data

PackedLMData.Config

Component: PackedLMData

class PackedLMData.Config[source]

Bases: Data.Config

All Attributes (including base classes)

source: DataSource.Config = TSVDataSource.Config()
batcher: Batcher.Config = PoolingBatcher.Config()
sort_key: Optional[str] = None
in_memory: Optional[bool] = True
max_seq_len: int = 128

Default JSON

{
    "source": {
        "TSVDataSource": {
            "column_mapping": {},
            "train_filename": null,
            "test_filename": null,
            "eval_filename": null,
            "field_names": null,
            "delimiter": "\t",
            "quoted": false,
            "drop_incomplete_rows": false
        }
    },
    "batcher": {
        "PoolingBatcher": {
            "train_batch_size": 16,
            "eval_batch_size": 16,
            "test_batch_size": 16,
            "pool_num_batches": 10000,
            "num_shuffled_pools": 1
        }
    },
    "sort_key": null,
    "in_memory": true,
    "max_seq_len": 128
}
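
In the default JSON above, a nested component such as source or batcher is written as an object with a single key naming the component class, whose value holds that component's parameters. A small sketch of reading and overriding such a node follows; the helper name component_name_and_params and the file name my_train.tsv are hypothetical.

```python
from typing import Any, Dict, Tuple


def component_name_and_params(node: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Split a one-key component node such as {"TSVDataSource": {...}}."""
    if len(node) != 1:
        raise ValueError("expected exactly one component name, got %r" % sorted(node))
    ((name, params),) = node.items()
    return name, params


# A trimmed copy of the default JSON, written as a Python dict.
config = {
    "source": {"TSVDataSource": {"train_filename": None, "delimiter": "\t"}},
    "batcher": {"PoolingBatcher": {"train_batch_size": 16, "pool_num_batches": 10000}},
    "max_seq_len": 128,
}

name, params = component_name_and_params(config["source"])
params["train_filename"] = "my_train.tsv"  # hypothetical file name
print(name, params)
```

Because params is the same dict object stored in config, the override is visible through config["source"] as well.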

roberta_tensorizer

RoBERTaTensorizer.Config

Component: RoBERTaTensorizer

class RoBERTaTensorizer.Config[source]

Bases: BERTTensorizerBase.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['text']
tokenizer: Tokenizer.Config = GPT2BPETokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = 'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt'
max_seq_len: int = 256
Subclasses
  • RoBERTaTokenLevelTensorizer.Config
  • SquadForRoBERTaTensorizer.Config

Default JSON

{
    "is_input": true,
    "columns": [
        "text"
    ],
    "tokenizer": {
        "GPT2BPETokenizer": {
            "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
            "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
        }
    },
    "base_tokenizer": null,
    "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
    "max_seq_len": 256
}
RoBERTaTokenLevelTensorizer.Config

Component: RoBERTaTokenLevelTensorizer

class RoBERTaTokenLevelTensorizer.Config[source]

Bases: RoBERTaTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['text']
tokenizer: Tokenizer.Config = GPT2BPETokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = 'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt'
max_seq_len: int = 256
labels_columns: list[str] = ['label']
labels: list[str] = []

Default JSON

{
    "is_input": true,
    "columns": [
        "text"
    ],
    "tokenizer": {
        "GPT2BPETokenizer": {
            "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
            "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
        }
    },
    "base_tokenizer": null,
    "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
    "max_seq_len": 256,
    "labels_columns": [
        "label"
    ],
    "labels": []
}

sources

conllu
CoNLLUNERDataSource.Config

Component: CoNLLUNERDataSource

class CoNLLUNERDataSource.Config

Bases: CoNLLUPOSDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
language: Optional[str] = None
train_filename: Optional[str] = None
test_filename: Optional[str] = None
eval_filename: Optional[str] = None
field_names: Optional[list[str]] = None
delimiter: str = '\t'

Default JSON

{
    "column_mapping": {},
    "language": null,
    "train_filename": null,
    "test_filename": null,
    "eval_filename": null,
    "field_names": null,
    "delimiter": "\t"
}
CoNLLUPOSDataSource.Config

Component: CoNLLUPOSDataSource

class CoNLLUPOSDataSource.Config[source]

Bases: RootDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
language: Optional[str] = None
Name of the language. If not set, languages will be empty.
train_filename: Optional[str] = None
Filename of training set. If not set, iteration will be empty.
test_filename: Optional[str] = None
Filename of testing set. If not set, iteration will be empty.
eval_filename: Optional[str] = None
Filename of eval set. If not set, iteration will be empty.
field_names: Optional[list[str]] = None
Field names for the TSV. If this is not set, the first line of each file will be assumed to be a header containing the field names.
delimiter: str = '\t'
The column delimiter. The CoNLL-U file default is \t.
Subclasses
  • CoNLLUNERDataSource.Config

Default JSON

{
    "column_mapping": {},
    "language": null,
    "train_filename": null,
    "test_filename": null,
    "eval_filename": null,
    "field_names": null,
    "delimiter": "\t"
}
data_source
DataSource.Config

Component: DataSource

class DataSource.Config

Bases: Component.Config

All Attributes (including base classes)

Subclasses
  • RowShardedDataSource.Config
  • ShardedDataSource.Config
  • SquadDataSource.Config
  • SquadDataSourceForKD.Config

Default JSON

{}
RootDataSource.Config

Component: RootDataSource

class RootDataSource.Config[source]

Bases: Component.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
An optional column mapping, allowing the columns in the raw data source to not map directly to the column names in the schema. This mapping will remap names from the raw data source to names in the schema.
Subclasses
  • CoNLLUNERDataSource.Config
  • CoNLLUPOSDataSource.Config
  • PandasDataSource.Config
  • SessionPandasDataSource.Config
  • SessionDataSource.Config
  • BlockShardedTSVDataSource.Config
  • MultilingualTSVDataSource.Config
  • SessionTSVDataSource.Config
  • TSVDataSource.Config

Default JSON

{
    "column_mapping": {}
}
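
The remapping described for column_mapping amounts to renaming keys on each raw row. A minimal sketch, assuming rows are plain dicts and that unmapped columns keep their original names (the function name remap_columns and the column names are hypothetical):

```python
from typing import Any, Dict


def remap_columns(row: Dict[str, Any], column_mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename raw column names to schema names; unmapped columns are kept as-is."""
    return {column_mapping.get(name, name): value for name, value in row.items()}


raw_row = {"utterance": "set an alarm", "label": "alarm/set"}
print(remap_columns(raw_row, {"utterance": "text"}))
```

With the default empty mapping, every row passes through unchanged.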
RowShardedDataSource.Config

Component: RowShardedDataSource

class RowShardedDataSource.Config

Bases: ShardedDataSource.Config

All Attributes (including base classes)

Default JSON

{}
ShardedDataSource.Config

Component: ShardedDataSource

class ShardedDataSource.Config

Bases: DataSource.Config

All Attributes (including base classes)

Subclasses
  • RowShardedDataSource.Config

Default JSON

{}
pandas
PandasDataSource.Config

Component: PandasDataSource

class PandasDataSource.Config

Bases: RootDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
Subclasses
  • SessionPandasDataSource.Config

Default JSON

{
    "column_mapping": {}
}
SessionPandasDataSource.Config

Component: SessionPandasDataSource

class SessionPandasDataSource.Config

Bases: PandasDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}

Default JSON

{
    "column_mapping": {}
}
session
SessionDataSource.Config

Component: SessionDataSource

class SessionDataSource.Config

Bases: RootDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}

Default JSON

{
    "column_mapping": {}
}
squad
SquadDataSource.Config

Component: SquadDataSource

class SquadDataSource.Config[source]

Bases: DataSource.Config

All Attributes (including base classes)

train_filename: Optional[str] = 'train-v2.0.json'
test_filename: Optional[str] = 'dev-v2.0.json'
eval_filename: Optional[str] = 'dev-v2.0.json'
ignore_impossible: bool = True
max_character_length: int = 1048576
min_overlap: float = 0.1
delimiter: str = '\t'
quoted: bool = False
Subclasses
  • SquadDataSourceForKD.Config

Default JSON

{
    "train_filename": "train-v2.0.json",
    "test_filename": "dev-v2.0.json",
    "eval_filename": "dev-v2.0.json",
    "ignore_impossible": true,
    "max_character_length": 1048576,
    "min_overlap": 0.1,
    "delimiter": "\t",
    "quoted": false
}
SquadDataSourceForKD.Config

Component: SquadDataSourceForKD

class SquadDataSourceForKD.Config

Bases: SquadDataSource.Config

All Attributes (including base classes)

train_filename: Optional[str] = 'train-v2.0.json'
test_filename: Optional[str] = 'dev-v2.0.json'
eval_filename: Optional[str] = 'dev-v2.0.json'
ignore_impossible: bool = True
max_character_length: int = 1048576
min_overlap: float = 0.1
delimiter: str = '\t'
quoted: bool = False

Default JSON

{
    "train_filename": "train-v2.0.json",
    "test_filename": "dev-v2.0.json",
    "eval_filename": "dev-v2.0.json",
    "ignore_impossible": true,
    "max_character_length": 1048576,
    "min_overlap": 0.1,
    "delimiter": "\t",
    "quoted": false
}
tsv
BlockShardedTSVDataSource.Config

Component: BlockShardedTSVDataSource

class BlockShardedTSVDataSource.Config

Bases: TSVDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
train_filename: Optional[str] = None
test_filename: Optional[str] = None
eval_filename: Optional[str] = None
field_names: Optional[list[str]] = None
delimiter: str = '\t'
quoted: bool = False
drop_incomplete_rows: bool = False

Default JSON

{
    "column_mapping": {},
    "train_filename": null,
    "test_filename": null,
    "eval_filename": null,
    "field_names": null,
    "delimiter": "\t",
    "quoted": false,
    "drop_incomplete_rows": false
}
MultilingualTSVDataSource.Config

Component: MultilingualTSVDataSource

class MultilingualTSVDataSource.Config[source]

Bases: TSVDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
train_filename: Optional[str] = None
test_filename: Optional[str] = None
eval_filename: Optional[str] = None
field_names: Optional[list[str]] = None
delimiter: str = '\t'
quoted: bool = False
drop_incomplete_rows: bool = False
data_source_languages: dict[str, list[str]] = {'train': ['en'], 'eval': ['en'], 'test': ['en']}
language_columns: list[str] = ['language']

Default JSON

{
    "column_mapping": {},
    "train_filename": null,
    "test_filename": null,
    "eval_filename": null,
    "field_names": null,
    "delimiter": "\t",
    "quoted": false,
    "drop_incomplete_rows": false,
    "data_source_languages": {
        "train": [
            "en"
        ],
        "eval": [
            "en"
        ],
        "test": [
            "en"
        ]
    },
    "language_columns": [
        "language"
    ]
}
SessionTSVDataSource.Config

Component: SessionTSVDataSource

class SessionTSVDataSource.Config

Bases: TSVDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
train_filename: Optional[str] = None
test_filename: Optional[str] = None
eval_filename: Optional[str] = None
field_names: Optional[list[str]] = None
delimiter: str = '\t'
quoted: bool = False
drop_incomplete_rows: bool = False

Default JSON

{
    "column_mapping": {},
    "train_filename": null,
    "test_filename": null,
    "eval_filename": null,
    "field_names": null,
    "delimiter": "\t",
    "quoted": false,
    "drop_incomplete_rows": false
}
TSVDataSource.Config

Component: TSVDataSource

class TSVDataSource.Config[source]

Bases: RootDataSource.Config

All Attributes (including base classes)

column_mapping: dict[str, str] = {}
train_filename: Optional[str] = None
Filename of training set. If not set, iteration will be empty.
test_filename: Optional[str] = None
Filename of testing set. If not set, iteration will be empty.
eval_filename: Optional[str] = None
Filename of eval set. If not set, iteration will be empty.
field_names: Optional[list[str]] = None
Field names for the TSV. If this is not set, the first line of each file will be assumed to be a header containing the field names.
delimiter: str = '\t'
The column delimiter passed to Python’s csv library. Change to “,” for csv.
quoted: bool = False
Whether the columns can use quotes to include delimiters or not. Rows with unclosed quotes will be merged with \n inside. Change to True for quoted csv.
drop_incomplete_rows: bool = False
Subclasses
  • BlockShardedTSVDataSource.Config
  • MultilingualTSVDataSource.Config
  • SessionTSVDataSource.Config

Default JSON

{
    "column_mapping": {},
    "train_filename": null,
    "test_filename": null,
    "eval_filename": null,
    "field_names": null,
    "delimiter": "\t",
    "quoted": false,
    "drop_incomplete_rows": false
}
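
Since delimiter is passed to Python's csv library, the interaction of field_names, delimiter, and quoted can be tried out with csv.DictReader directly. The sketch below is an approximation under stated assumptions, not the TSVDataSource code: the function name read_rows is hypothetical, and mapping quoted to csv.QUOTE_MINIMAL or csv.QUOTE_NONE is an assumed reading of that flag.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional


def read_rows(
    lines: Iterable[str],
    field_names: Optional[List[str]] = None,
    delimiter: str = "\t",
    quoted: bool = False,
) -> List[Dict[str, str]]:
    """Hypothetical helper: parse delimited text into one dict per row."""
    reader = csv.DictReader(
        lines,
        fieldnames=field_names,  # None means the first line is used as the header
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL if quoted else csv.QUOTE_NONE,
    )
    return [dict(row) for row in reader]


tsv = io.StringIO("text\tlabel\nhello world\tgreet\n")
print(read_rows(tsv))

quoted_csv = io.StringIO('"hi, there",greet\n')
print(read_rows(quoted_csv, field_names=["text", "label"], delimiter=",", quoted=True))
```

The second call shows why quoted matters for csv input: without quote handling, the comma inside the first field would be treated as a column separator.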

squad_for_bert_tensorizer

SquadForBERTTensorizer.Config

Component: SquadForBERTTensorizer

class SquadForBERTTensorizer.Config[source]

Bases: BERTTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['question', 'doc']
tokenizer: Tokenizer.Config = WordPieceTokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = '/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
max_seq_len: int = 256
answers_column: str = 'answers'
answer_starts_column: str = 'answer_starts'
Subclasses
  • SquadForBERTTensorizerForKD.Config

Default JSON

{
    "is_input": true,
    "columns": [
        "question",
        "doc"
    ],
    "tokenizer": {
        "WordPieceTokenizer": {
            "basic_tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            },
            "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
        }
    },
    "base_tokenizer": null,
    "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
    "max_seq_len": 256,
    "answers_column": "answers",
    "answer_starts_column": "answer_starts"
}
SquadForBERTTensorizerForKD.Config

Component: SquadForBERTTensorizerForKD

class SquadForBERTTensorizerForKD.Config[source]

Bases: SquadForBERTTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['question', 'doc']
tokenizer: Tokenizer.Config = WordPieceTokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = '/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
max_seq_len: int = 256
answers_column: str = 'answers'
answer_starts_column: str = 'answer_starts'
start_logits_column: str = 'start_logits'
end_logits_column: str = 'end_logits'
has_answer_logits_column: str = 'has_answer_logits'
pad_mask_column: str = 'pad_mask'
segment_labels_column: str = 'segment_labels'

Default JSON

{
    "is_input": true,
    "columns": [
        "question",
        "doc"
    ],
    "tokenizer": {
        "WordPieceTokenizer": {
            "basic_tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            },
            "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
        }
    },
    "base_tokenizer": null,
    "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
    "max_seq_len": 256,
    "answers_column": "answers",
    "answer_starts_column": "answer_starts",
    "start_logits_column": "start_logits",
    "end_logits_column": "end_logits",
    "has_answer_logits_column": "has_answer_logits",
    "pad_mask_column": "pad_mask",
    "segment_labels_column": "segment_labels"
}
SquadForRoBERTaTensorizer.Config

Component: SquadForRoBERTaTensorizer

class SquadForRoBERTaTensorizer.Config[source]

Bases: RoBERTaTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
columns: list[str] = ['question', 'doc']
tokenizer: Tokenizer.Config = GPT2BPETokenizer.Config()
base_tokenizer: Optional[Tokenizer.Config] = None
vocab_file: str = 'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt'
max_seq_len: int = 256
answers_column: str = 'answers'
answer_starts_column: str = 'answer_starts'

Default JSON

{
    "is_input": true,
    "columns": [
        "question",
        "doc"
    ],
    "tokenizer": {
        "GPT2BPETokenizer": {
            "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
            "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
        }
    },
    "base_tokenizer": null,
    "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
    "max_seq_len": 256,
    "answers_column": "answers",
    "answer_starts_column": "answer_starts"
}

squad_tensorizer

SquadTensorizer.Config

Component: SquadTensorizer

class SquadTensorizer.Config[source]

Bases: TokenTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
tokenizer: Tokenizer.Config = Tokenizer.Config(split_regex='\\W+')
add_bos_token: bool = False
add_eos_token: bool = False
use_eos_token_for_bos: bool = False
max_seq_len: Optional[int] = None
vocab: VocabConfig = VocabConfig()
vocab_file_delimiter: str = ' '
doc_column: str = 'doc'
ques_column: str = 'question'
answers_column: str = 'answers'
answer_starts_column: str = 'answer_starts'
max_ques_seq_len: int = 64
max_doc_seq_len: int = 256
Subclasses
  • SquadTensorizerForKD.Config

Default JSON

{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\W+",
            "lowercase": true
        }
    },
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false,
    "max_seq_len": null,
    "vocab": {
        "build_from_data": true,
        "size_from_data": 0,
        "vocab_files": []
    },
    "vocab_file_delimiter": " ",
    "doc_column": "doc",
    "ques_column": "question",
    "answers_column": "answers",
    "answer_starts_column": "answer_starts",
    "max_ques_seq_len": 64,
    "max_doc_seq_len": 256
}
SquadTensorizerForKD.Config

Component: SquadTensorizerForKD

class SquadTensorizerForKD.Config[source]

Bases: SquadTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
tokenizer: Tokenizer.Config = Tokenizer.Config(split_regex='\\W+')
add_bos_token: bool = False
add_eos_token: bool = False
use_eos_token_for_bos: bool = False
max_seq_len: Optional[int] = None
vocab: VocabConfig = VocabConfig()
vocab_file_delimiter: str = ' '
doc_column: str = 'doc'
ques_column: str = 'question'
answers_column: str = 'answers'
answer_starts_column: str = 'answer_starts'
max_ques_seq_len: int = 64
max_doc_seq_len: int = 256
start_logits_column: str = 'start_logits'
end_logits_column: str = 'end_logits'
has_answer_logits_column: str = 'has_answer_logits'
pad_mask_column: str = 'pad_mask'
segment_labels_column: str = 'segment_labels'

Default JSON

{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\W+",
            "lowercase": true
        }
    },
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false,
    "max_seq_len": null,
    "vocab": {
        "build_from_data": true,
        "size_from_data": 0,
        "vocab_files": []
    },
    "vocab_file_delimiter": " ",
    "doc_column": "doc",
    "ques_column": "question",
    "answers_column": "answers",
    "answer_starts_column": "answer_starts",
    "max_ques_seq_len": 64,
    "max_doc_seq_len": 256,
    "start_logits_column": "start_logits",
    "end_logits_column": "end_logits",
    "has_answer_logits_column": "has_answer_logits",
    "pad_mask_column": "pad_mask",
    "segment_labels_column": "segment_labels"
}

tensorizers

AnnotationNumberizer.Config

Component: AnnotationNumberizer

class AnnotationNumberizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'seqlogical'

Default JSON

{
    "is_input": true,
    "column": "seqlogical"
}
ByteTensorizer.Config

Component: ByteTensorizer

class ByteTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
The name of the text column to parse from the data source.
lower: bool = True
max_seq_len: Optional[int] = None
add_bos_token: Optional[bool] = False
add_eos_token: Optional[bool] = False
use_eos_token_for_bos: Optional[bool] = False

Default JSON

{
    "is_input": true,
    "column": "text",
    "lower": true,
    "max_seq_len": null,
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false
}
ByteTokenTensorizer.Config

Component: ByteTokenTensorizer

class ByteTokenTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
The name of the text column to parse from the data source.
tokenizer: Tokenizer.Config = Tokenizer.Config()
The tokenizer to use to split input text into tokens.
max_seq_len: Optional[int] = None
The max token length for input text.
max_byte_len: int = 15
The max byte length for a token.
offset_for_non_padding: int = 0
Offset to add to all non-padding bytes
add_bos_token: bool = False
add_eos_token: bool = False
use_eos_token_for_bos: bool = False

Default JSON

{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    },
    "max_seq_len": null,
    "max_byte_len": 15,
    "offset_for_non_padding": 0,
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false
}
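
The roles of max_byte_len and offset_for_non_padding can be pictured with a small sketch that turns each token into a fixed-width row of byte values. The function name token_bytes, the use of UTF-8, and the padding value 0 are assumptions for illustration and are not taken from the ByteTokenTensorizer source.

```python
from typing import List, Tuple


def token_bytes(
    tokens: List[str],
    max_byte_len: int = 15,
    offset_for_non_padding: int = 0,
    pad_value: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Hypothetical helper: per-token byte ids, truncated and padded to max_byte_len."""
    rows, lengths = [], []
    for token in tokens:
        raw = list(token.encode("utf-8"))[:max_byte_len]  # truncate long tokens
        lengths.append(len(raw))
        shifted = [b + offset_for_non_padding for b in raw]  # shift real bytes only
        rows.append(shifted + [pad_value] * (max_byte_len - len(raw)))
    return rows, lengths


rows, lengths = token_bytes(["hi", "\u00e9"], max_byte_len=3, offset_for_non_padding=1)
print(rows, lengths)
```

Under these assumptions, a non-zero offset keeps a real byte value of 0 distinguishable from padding.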
CharacterTokenTensorizer.Config

Component: CharacterTokenTensorizer

class CharacterTokenTensorizer.Config[source]

Bases: TokenTensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
tokenizer: Tokenizer.Config = Tokenizer.Config()
add_bos_token: bool = False
add_eos_token: bool = False
use_eos_token_for_bos: bool = False
max_seq_len: Optional[int] = None
vocab: VocabConfig = VocabConfig()
vocab_file_delimiter: str = ' '
max_char_length: int = 20
The max character length for a token.

Default JSON

{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    },
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false,
    "max_seq_len": null,
    "vocab": {
        "build_from_data": true,
        "size_from_data": 0,
        "vocab_files": []
    },
    "vocab_file_delimiter": " ",
    "max_char_length": 20
}
FloatListTensorizer.Config

Component: FloatListTensorizer

class FloatListTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str
The name of the label column to parse from the data source.
error_check: bool = False
dim: Optional[int] = None
normalize: bool = False

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

FloatTensorizer.Config

Component: FloatTensorizer

class FloatTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str
The name of the column to parse from the data source.

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

GazetteerTensorizer.Config

Component: GazetteerTensorizer

class GazetteerTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
text_column: str = 'text'
dict_column: str = 'dict'
tokenizer: Tokenizer.Config = Tokenizer.Config()
The tokenizer used to split text and create dict tensors of the same size.

Default JSON

{
    "is_input": true,
    "text_column": "text",
    "dict_column": "dict",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    }
}
LabelListTensorizer.Config

Component: LabelListTensorizer

class LabelListTensorizer.Config

Bases: LabelTensorizer.Config

All Attributes (including base classes)

is_input: bool = False
column: str = 'label'
allow_unknown: bool = False
pad_in_vocab: bool = False
label_vocab: Optional[list[str]] = None

Default JSON

{
    "is_input": false,
    "column": "label",
    "allow_unknown": false,
    "pad_in_vocab": false,
    "label_vocab": null
}
LabelTensorizer.Config

Component: LabelTensorizer

class LabelTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = False
column: str = 'label'
The name of the label column to parse from the data source.
allow_unknown: bool = False
Whether to allow for unknown labels at test/prediction time
pad_in_vocab: bool = False
Whether the vocab should include a pad token. Usually False when the label is used as a target.
label_vocab: Optional[list[str]] = None
The label values, if known. Will skip initialization step if provided.
Subclasses
  • LabelListTensorizer.Config
  • SoftLabelTensorizer.Config

Default JSON

{
    "is_input": false,
    "column": "label",
    "allow_unknown": false,
    "pad_in_vocab": false,
    "label_vocab": null
}
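
How label_vocab and allow_unknown might interact can be sketched as follows. The assumptions here are that a provided label_vocab is used as-is instead of being collected from data, and that unknown labels map to a reserved entry when allow_unknown is set. The function names and the UNKNOWN placeholder string are hypothetical and not taken from PyText.

```python
from typing import Dict, Iterable, List, Optional

UNKNOWN = "__UNKNOWN__"  # hypothetical placeholder for unseen labels


def build_label_index(
    seen_labels: Iterable[str],
    label_vocab: Optional[List[str]] = None,
    allow_unknown: bool = False,
) -> Dict[str, int]:
    """Use label_vocab when given; otherwise collect labels from the data."""
    labels = list(label_vocab) if label_vocab is not None else sorted(set(seen_labels))
    if allow_unknown and UNKNOWN not in labels:
        labels.append(UNKNOWN)
    return {label: i for i, label in enumerate(labels)}


def numberize(label: str, index: Dict[str, int], allow_unknown: bool = False) -> int:
    """Map a label to its id, falling back to UNKNOWN only when allowed."""
    if label in index:
        return index[label]
    if allow_unknown:
        return index[UNKNOWN]
    raise KeyError("unknown label: %r" % label)


index = build_label_index([], label_vocab=["neg", "pos"], allow_unknown=True)
print(index, numberize("meh", index, allow_unknown=True))
```

With allow_unknown left at False, an unseen label raises an error instead of being mapped silently.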
MetricTensorizer.Config

Component: MetricTensorizer

class MetricTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = False
names: list[str]
indexes: list[int]
Subclasses
  • NtokensTensorizer.Config

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

NtokensTensorizer.Config

Component: NtokensTensorizer

class NtokensTensorizer.Config

Bases: MetricTensorizer.Config

All Attributes (including base classes)

is_input: bool = False
names: list[str]
indexes: list[int]

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

NumericLabelTensorizer.Config

Component: NumericLabelTensorizer

class NumericLabelTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = False
column: str = 'label'
The name of the label column to parse from the data source.
rescale_range: Optional[list[float]] = None
If provided, the range of values the raw label can take. Label values will be rescaled to lie within [0, 1].

Default JSON

{
    "is_input": false,
    "column": "label",
    "rescale_range": null
}
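
As a worked example of rescale_range, the sketch below assumes ordinary min-max scaling with rescale_range given as [low, high]. The function name rescale_label and the exact formula are assumptions; the reference only states that labels end up within [0, 1].

```python
from typing import List, Optional


def rescale_label(label: float, rescale_range: Optional[List[float]] = None) -> float:
    """Assumed min-max rescaling of a raw label into [0, 1]."""
    if rescale_range is None:
        return label  # default: leave the label untouched
    low, high = rescale_range
    return (label - low) / (high - low)


# A 1-to-5 rating scale mapped onto [0, 1].
for rating in (1.0, 3.0, 5.0):
    print(rating, rescale_label(rating, [1.0, 5.0]))
```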
SeqTokenTensorizer.Config

Component: SeqTokenTensorizer

class SeqTokenTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text_seq'
max_seq_len: Optional[int] = None
add_bos_token: bool = False
Sentence markers (beginning and end of sentence); this note applies to the three bos/eos options.
add_eos_token: bool = False
use_eos_token_for_bos: bool = False
add_bol_token: bool = False
List markers (beginning and end of list); this note applies to the three bol/eol options.
add_eol_token: bool = False
use_eol_token_for_bol: bool = False
tokenizer: Tokenizer.Config = Tokenizer.Config()
The tokenizer to use to split input text into tokens.

Default JSON

{
    "is_input": true,
    "column": "text_seq",
    "max_seq_len": null,
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false,
    "add_bol_token": false,
    "add_eol_token": false,
    "use_eol_token_for_bol": false,
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    }
}
SlotLabelTensorizer.Config

Component: SlotLabelTensorizer

class SlotLabelTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = False
slot_column: str = 'slots'
The name of the slot label column to parse from the data source.
text_column: str = 'text'
The name of the text column to parse from the data source. We need this to be able to generate tensors which correspond to input text.
tokenizer: Tokenizer.Config = Tokenizer.Config()
The tokenizer to use to split input text into tokens. This should be configured in a way which yields tokens consistent with the tokens input to or output by a model, so that the labels generated by this tensorizer will match the indices of the model’s tokens.
allow_unknown: bool = False
Whether to allow for unknown labels at test/prediction time
Subclasses
  • SlotLabelTensorizerExpansible.Config

Default JSON

{
    "is_input": false,
    "slot_column": "slots",
    "text_column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    },
    "allow_unknown": false
}
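
To see why the tokenizer must agree with the model's tokens, consider a sketch that assigns one slot label per token. The (start, end, label) character-span format, the "NoLabel" placeholder, and the function name are assumptions made for this example, not PyText's actual data format. If the tokenizer here split the text differently from the model's tokenizer, the resulting label list would no longer line up with the model's token indices.

```python
import re
from typing import List, Tuple


def slot_labels_per_token(
    text: str,
    slots: List[Tuple[int, int, str]],
    no_label: str = "NoLabel",
) -> List[str]:
    """Return one label per whitespace-delimited token.

    `slots` holds (start, end, label) character spans with an exclusive
    end. A token takes a slot's label only if the span fully covers it.
    """
    labels = []
    # Runs of non-whitespace are the tokens a "\s+" split would produce.
    for match in re.finditer(r"\S+", text):
        label = no_label
        for slot_start, slot_end, slot_label in slots:
            if match.start() >= slot_start and match.end() <= slot_end:
                label = slot_label
                break
        labels.append(label)
    return labels


print(slot_labels_per_token("play jazz music", [(5, 9, "genre")]))
# ['NoLabel', 'genre', 'NoLabel']
```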
SlotLabelTensorizerExpansible.Config

Component: SlotLabelTensorizerExpansible

class SlotLabelTensorizerExpansible.Config

Bases: SlotLabelTensorizer.Config

All Attributes (including base classes)

is_input: bool = False
slot_column: str = 'slots'
text_column: str = 'text'
tokenizer: Tokenizer.Config = Tokenizer.Config()
allow_unknown: bool = False

Default JSON

{
    "is_input": false,
    "slot_column": "slots",
    "text_column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    },
    "allow_unknown": false
}
SoftLabelTensorizer.Config

Component: SoftLabelTensorizer

class SoftLabelTensorizer.Config[source]

Bases: LabelTensorizer.Config

All Attributes (including base classes)

is_input: bool = False
column: str = 'label'
allow_unknown: bool = False
pad_in_vocab: bool = False
label_vocab: Optional[list[str]] = None
probs_column: str = 'target_probs'
logits_column: str = 'target_logits'
labels_column: str = 'target_labels'

Default JSON

{
    "is_input": false,
    "column": "label",
    "allow_unknown": false,
    "pad_in_vocab": false,
    "label_vocab": null,
    "probs_column": "target_probs",
    "logits_column": "target_logits",
    "labels_column": "target_labels"
}
Tensorizer.Config

Component: Tensorizer

class Tensorizer.Config[source]

Bases: Component.Config

All Attributes (including base classes)

is_input: bool = True
Subclasses
  • BERTTensorizer.Config
  • BERTTensorizerBase.Config
  • RoBERTaTensorizer.Config
  • RoBERTaTokenLevelTensorizer.Config
  • SquadForBERTTensorizer.Config
  • SquadForBERTTensorizerForKD.Config
  • SquadForRoBERTaTensorizer.Config
  • SquadTensorizer.Config
  • SquadTensorizerForKD.Config
  • AnnotationNumberizer.Config
  • ByteTensorizer.Config
  • ByteTokenTensorizer.Config
  • CharacterTokenTensorizer.Config
  • FloatListTensorizer.Config
  • FloatTensorizer.Config
  • GazetteerTensorizer.Config
  • LabelListTensorizer.Config
  • LabelTensorizer.Config
  • MetricTensorizer.Config
  • NtokensTensorizer.Config
  • NumericLabelTensorizer.Config
  • SeqTokenTensorizer.Config
  • SlotLabelTensorizer.Config
  • SlotLabelTensorizerExpansible.Config
  • SoftLabelTensorizer.Config
  • TokenTensorizer.Config
  • UidTensorizer.Config

Default JSON

{
    "is_input": true
}
TokenTensorizer.Config

Component: TokenTensorizer

class TokenTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
The name of the text column to parse from the data source.
tokenizer: Tokenizer.Config = Tokenizer.Config()
The tokenizer to use to split input text into tokens.
add_bos_token: bool = False
add_eos_token: bool = False
use_eos_token_for_bos: bool = False
max_seq_len: Optional[int] = None
vocab: VocabConfig = VocabConfig()
vocab_file_delimiter: str = ' '
Subclasses
  • SquadTensorizer.Config
  • SquadTensorizerForKD.Config
  • CharacterTokenTensorizer.Config

Default JSON

{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true
        }
    },
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false,
    "max_seq_len": null,
    "vocab": {
        "build_from_data": true,
        "size_from_data": 0,
        "vocab_files": []
    },
    "vocab_file_delimiter": " "
}
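
Nested entries such as "tokenizer" are keyed by the component name ("Tokenizer") in the default JSON above. The sketch below shows one way to derive a customized config by merging overrides into those defaults with plain dictionary operations. It does not use PyText's own config loader, and the names DEFAULT_TOKEN_TENSORIZER and with_overrides are hypothetical.

```python
import copy
import json

# The default JSON shown above, written as a Python dict.
DEFAULT_TOKEN_TENSORIZER = {
    "is_input": True,
    "column": "text",
    "tokenizer": {"Tokenizer": {"split_regex": "\\s+", "lowercase": True}},
    "add_bos_token": False,
    "add_eos_token": False,
    "use_eos_token_for_bos": False,
    "max_seq_len": None,
    "vocab": {"build_from_data": True, "size_from_data": 0, "vocab_files": []},
    "vocab_file_delimiter": " ",
}


def with_overrides(defaults: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` into a deep copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


config = with_overrides(
    DEFAULT_TOKEN_TENSORIZER,
    {"max_seq_len": 64, "tokenizer": {"Tokenizer": {"lowercase": False}}},
)
print(json.dumps(config, indent=4))
```

Only the overridden keys change; untouched keys such as "split_regex" keep their default values, and the defaults dictionary itself is left unmodified.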
UidTensorizer.Config

Component: UidTensorizer

class UidTensorizer.Config[source]

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'uid'
allow_unknown: bool = True

Default JSON

{
    "is_input": true,
    "column": "uid",
    "allow_unknown": true
}
VocabConfig

Component: Component

class pytext.data.tensorizers.VocabConfig[source]

Bases: Component.Config

All Attributes (including base classes)

build_from_data: bool = True
Whether to add tokens from training data to vocab.
size_from_data: int = 0
Add size_from_data most frequent tokens in training data to vocab (if this is 0, add all tokens from training data).
vocab_files: list[VocabFileConfig] = []

Default JSON

{
    "build_from_data": true,
    "size_from_data": 0,
    "vocab_files": []
}
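
The size_from_data rule can be sketched with a frequency count: keep the most frequent tokens, or all of them when the value is 0. This is an illustration of the documented behavior rather than PyText's implementation; the function name tokens_from_data is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def tokens_from_data(tokens: Iterable[str], size_from_data: int = 0) -> List[str]:
    """Return the tokens to add to the vocab, most frequent first.

    A size_from_data of 0 keeps every distinct token.
    """
    counts = Counter(tokens)
    limit = size_from_data if size_from_data > 0 else None
    return [token for token, _ in counts.most_common(limit)]


corpus = "the cat sat on the mat the end".split()
print(tokens_from_data(corpus, size_from_data=1))  # ['the']
print(len(tokens_from_data(corpus)))               # 6
```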
VocabFileConfig

Component: Component

class pytext.data.tensorizers.VocabFileConfig[source]

Bases: Component.Config

All Attributes (including base classes)

filepath: str = ''
File containing tokens to add to vocab (first whitespace-separated entry per line)
skip_header_line: bool = False
Whether to skip the first line of the file (e.g. if it is a header line)
lowercase_tokens: bool = False
Whether to lowercase each of the tokens in the file
size_limit: int = 0
The max number of tokens to add to vocab

Default JSON

{
    "filepath": "",
    "skip_header_line": false,
    "lowercase_tokens": false,
    "size_limit": 0
}
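
A sketch of how these four options could interact when reading a vocab file. It reads from any text stream instead of opening filepath so that the example is self-contained. The function name read_vocab_tokens is hypothetical, and treating a size_limit of 0 as "no limit" is an assumption, since this reference does not say what 0 means.

```python
import io
from typing import List, TextIO


def read_vocab_tokens(
    stream: TextIO,
    skip_header_line: bool = False,
    lowercase_tokens: bool = False,
    size_limit: int = 0,
) -> List[str]:
    """Collect the first whitespace-separated entry of each line."""
    tokens: List[str] = []
    for line_number, line in enumerate(stream):
        if skip_header_line and line_number == 0:
            continue
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        token = fields[0].lower() if lowercase_tokens else fields[0]
        tokens.append(token)
        # Assumption: a size_limit of 0 means "no limit".
        if size_limit > 0 and len(tokens) >= size_limit:
            break
    return tokens


sample = io.StringIO("token count\nHello 10\nWorld 7\nfoo 1\n")
print(read_vocab_tokens(sample, skip_header_line=True, lowercase_tokens=True, size_limit=2))
# ['hello', 'world']
```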

tokenizers

tokenizer
BERTInitialTokenizer.Config

Component: BERTInitialTokenizer

class BERTInitialTokenizer.Config[source]

Bases: Tokenizer.Config

Config for this class.

All Attributes (including base classes)

split_regex: str = '\\s+'
lowercase: bool = True

Default JSON

{
    "split_regex": "\\s+",
    "lowercase": true
}
DoNothingTokenizer.Config

Component: DoNothingTokenizer

class DoNothingTokenizer.Config[source]

Bases: Component.Config

All Attributes (including base classes)

do_nothing: str = ''

Default JSON

{
    "do_nothing": ""
}
GPT2BPETokenizer.Config

Component: GPT2BPETokenizer

class GPT2BPETokenizer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

bpe_encoder_path: str = 'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json'
bpe_vocab_path: str = 'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe'

Default JSON

{
    "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
    "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
SentencePieceTokenizer.Config

Component: SentencePieceTokenizer

class SentencePieceTokenizer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

sp_model_path: str = ''

Default JSON

{
    "sp_model_path": ""
}
Tokenizer.Config

Component: Tokenizer

class Tokenizer.Config[source]

Bases: Component.Config

All Attributes (including base classes)

split_regex: str = '\\s+'
A regular expression for the tokenizer to split on. Tokens are the segments between the regular expression matches. Each token's start index is inclusive (the first character of the unmatched segment), and its end index is exclusive (the first character of the matched split region that follows).
lowercase: bool = True
Whether token values should be lowercased or not.
Subclasses
  • BERTInitialTokenizer.Config

Default JSON

{
    "split_regex": "\\s+",
    "lowercase": true
}
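
The splitting and index convention described for split_regex can be sketched with Python's re module. This is an illustration of the documented behavior, not PyText's implementation; the Token tuple and tokenize function here are defined only for this example.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    value: str
    start: int  # inclusive index into the original text
    end: int    # exclusive index into the original text


def tokenize(text: str, split_regex: str = r"\s+", lowercase: bool = True) -> List[Token]:
    """Return the segments between matches of split_regex, with offsets."""
    tokens: List[Token] = []
    position = 0
    for match in re.finditer(split_regex, text):
        if match.start() > position:
            segment = text[position:match.start()]
            tokens.append(Token(segment.lower() if lowercase else segment, position, match.start()))
        position = match.end()
    if position < len(text):
        segment = text[position:]
        tokens.append(Token(segment.lower() if lowercase else segment, position, len(text)))
    return tokens


for token in tokenize("Hello  World"):
    print(token)
# Token(value='hello', start=0, end=5)
# Token(value='world', start=7, end=12)
```

The end index of the first token (5) is the position of the first whitespace character, which is the start of the matched split region.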
WordPieceTokenizer.Config

Component: WordPieceTokenizer

class WordPieceTokenizer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

basic_tokenizer: BERTInitialTokenizer.Config = BERTInitialTokenizer.Config()
wordpiece_vocab_path: str = '/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'

Default JSON

{
    "basic_tokenizer": {
        "split_regex": "\\s+",
        "lowercase": true
    },
    "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}

exporters

custom_exporters

DenseFeatureExporter.Config

Component: DenseFeatureExporter

class DenseFeatureExporter.Config

Bases: ModelExporter.Config

All Attributes (including base classes)

export_logits: bool = False
export_raw_to_metrics: bool = False

Default JSON

{
    "export_logits": false,
    "export_raw_to_metrics": false
}
InitPredictNetExporter.Config

Component: InitPredictNetExporter

class InitPredictNetExporter.Config

Bases: ModelExporter.Config

All Attributes (including base classes)

export_logits: bool = False
export_raw_to_metrics: bool = False

Default JSON

{
    "export_logits": false,
    "export_raw_to_metrics": false
}

exporter

ModelExporter.Config

Component: ModelExporter

class ModelExporter.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

export_logits: bool = False
export_raw_to_metrics: bool = False
Subclasses
  • DenseFeatureExporter.Config
  • InitPredictNetExporter.Config

Default JSON

{
    "export_logits": false,
    "export_raw_to_metrics": false
}

loss

loss

AUCPRHingeLoss.Config

Component: AUCPRHingeLoss

class AUCPRHingeLoss.Config[source]

Bases: ConfigBase

precision_range_lower

The lower bound of the precision range over which to compute AUC. Must be nonnegative, less than or equal to precision_range_upper, and at most 1.0.

Type: float

precision_range_upper

The upper bound of the precision range over which to compute AUC. Must be nonnegative, greater than or equal to precision_range_lower, and at most 1.0.

Type: float

num_classes

The number of classes (also known as labels).

Type: int

num_anchors

The number of grid points used to approximate the Riemann sum.

Type: int

All Attributes (including base classes)

precision_range_lower: float = 0.0
precision_range_upper: float = 1.0
num_classes: int = 1
num_anchors: int = 20

Default JSON

{
    "precision_range_lower": 0.0,
    "precision_range_upper": 1.0,
    "num_classes": 1,
    "num_anchors": 20
}
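
The constraints on the precision range, and the role of num_anchors as a grid over that range, can be sketched as follows. The choice of evenly spaced interior points is an assumption made for illustration; the exact grid placement used by AUCPRHingeLoss is not stated in this reference. The function name precision_anchors is hypothetical.

```python
from typing import List


def precision_anchors(
    precision_range_lower: float = 0.0,
    precision_range_upper: float = 1.0,
    num_anchors: int = 20,
) -> List[float]:
    """Validate the precision range and return an evenly spaced grid.

    Assumption: the grid consists of num_anchors interior points,
    excluding both endpoints of the range.
    """
    if not 0.0 <= precision_range_lower <= precision_range_upper <= 1.0:
        raise ValueError(
            "require 0 <= precision_range_lower <= precision_range_upper <= 1"
        )
    if num_anchors < 1:
        raise ValueError("num_anchors must be at least 1")
    step = (precision_range_upper - precision_range_lower) / (num_anchors + 1)
    return [precision_range_lower + i * step for i in range(1, num_anchors + 1)]


anchors = precision_anchors(0.0, 1.0, num_anchors=4)
print([round(p, 2) for p in anchors])  # [0.2, 0.4, 0.6, 0.8]
```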
BinaryCrossEntropyLoss.Config

Component: BinaryCrossEntropyLoss

class BinaryCrossEntropyLoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

reweight_negative: bool = True
reduce: bool = True

Default JSON

{
    "reweight_negative": true,
    "reduce": true
}
CosineEmbeddingLoss.Config

Component: CosineEmbeddingLoss

class CosineEmbeddingLoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

margin: float = 0.0

Default JSON

{
    "margin": 0.0
}
CrossEntropyLoss.Config

Component: CrossEntropyLoss

class CrossEntropyLoss.Config

Bases: Loss.Config

All Attributes (including base classes)

Default JSON

{}
KLDivergenceBCELoss.Config

Component: KLDivergenceBCELoss

class KLDivergenceBCELoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

temperature: float = 1.0
hard_weight: float = 0.0

Default JSON

{
    "temperature": 1.0,
    "hard_weight": 0.0
}
KLDivergenceCELoss.Config

Component: KLDivergenceCELoss

class KLDivergenceCELoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

temperature: float = 1.0
hard_weight: float = 0.0

Default JSON

{
    "temperature": 1.0,
    "hard_weight": 0.0
}
LabelSmoothedCrossEntropyLoss.Config

Component: LabelSmoothedCrossEntropyLoss

class LabelSmoothedCrossEntropyLoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

beta: float = 0.1
from_logits: bool = True
use_entropy: bool = False

Default JSON

{
    "beta": 0.1,
    "from_logits": true,
    "use_entropy": false
}
Loss.Config

Component: Loss

class Loss.Config

Bases: Component.Config

All Attributes (including base classes)

Subclasses
  • CrossEntropyLoss.Config
  • MultiLabelSoftMarginLoss.Config
  • NLLLoss.Config

Default JSON

{}
MAELoss.Config

Component: MAELoss

class MAELoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{}
MSELoss.Config

Component: MSELoss

class MSELoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{}
MultiLabelSoftMarginLoss.Config

Component: MultiLabelSoftMarginLoss

class MultiLabelSoftMarginLoss.Config

Bases: Loss.Config

All Attributes (including base classes)

Default JSON

{}
NLLLoss.Config

Component: NLLLoss

class NLLLoss.Config

Bases: Loss.Config

All Attributes (including base classes)

Default JSON

{}
PairwiseRankingLoss.Config

Component: PairwiseRankingLoss

class PairwiseRankingLoss.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

margin: float = 1.0

Default JSON

{
    "margin": 1.0
}

metric_reporters

classification_metric_reporter

ClassificationMetricReporter.Config

Component: ClassificationMetricReporter

class ClassificationMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
model_select_metric: ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>
target_label: Optional[str] = None
text_column_names: list[str] = ['text']
These column names correspond to raw input data columns. Text in these columns (usually just one column) will be concatenated and output in the IntentModelChannel as an evaluation TSV file.
additional_column_names: list[str] = []
These column names correspond to raw input data columns that will be read by data_source into context and included in the run_model output file along with the other saved results.
recall_at_precision_thresholds: list[float] = [0.2, 0.4, 0.6, 0.8, 0.9]
Subclasses
  • MultiLabelClassificationMetricReporter.Config

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "model_select_metric": "accuracy",
    "target_label": null,
    "text_column_names": [
        "text"
    ],
    "additional_column_names": [],
    "recall_at_precision_thresholds": [
        0.2,
        0.4,
        0.6,
        0.8,
        0.9
    ]
}
MultiLabelClassificationMetricReporter.Config

Component: MultiLabelClassificationMetricReporter

class MultiLabelClassificationMetricReporter.Config

Bases: ClassificationMetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
model_select_metric: ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>
target_label: Optional[str] = None
text_column_names: list[str] = ['text']
additional_column_names: list[str] = []
recall_at_precision_thresholds: list[float] = [0.2, 0.4, 0.6, 0.8, 0.9]

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "model_select_metric": "accuracy",
    "target_label": null,
    "text_column_names": [
        "text"
    ],
    "additional_column_names": [],
    "recall_at_precision_thresholds": [
        0.2,
        0.4,
        0.6,
        0.8,
        0.9
    ]
}

compositional_metric_reporter

CompositionalMetricReporter.Config

Component: CompositionalMetricReporter

class CompositionalMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
text_column_name: str = 'tokenized_text'

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "text_column_name": "tokenized_text"
}

disjoint_multitask_metric_reporter

DisjointMultitaskMetricReporter.Config

Component: DisjointMultitaskMetricReporter

class DisjointMultitaskMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
use_subtask_select_metric: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "use_subtask_select_metric": false
}

intent_slot_detection_metric_reporter

IntentSlotMetricReporter.Config

Component: IntentSlotMetricReporter

class IntentSlotMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}

language_model_metric_reporter

LanguageModelMetricReporter.Config

Component: LanguageModelMetricReporter

class LanguageModelMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
aggregate_metrics: bool = True
perplexity_type: PerplexityType = <PerplexityType.MEDIAN: 'median'>
Subclasses
  • MaskedLMMetricReporter.Config

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "aggregate_metrics": true,
    "perplexity_type": "median"
}
MaskedLMMetricReporter.Config

Component: MaskedLMMetricReporter

class MaskedLMMetricReporter.Config

Bases: LanguageModelMetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
aggregate_metrics: bool = True
perplexity_type: PerplexityType = <PerplexityType.MEDIAN: 'median'>

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "aggregate_metrics": true,
    "perplexity_type": "median"
}

metric_reporter

MetricReporter.Config

Component: MetricReporter

class MetricReporter.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
Subclasses
  • ClassificationMetricReporter.Config
  • MultiLabelClassificationMetricReporter.Config
  • CompositionalMetricReporter.Config
  • DisjointMultitaskMetricReporter.Config
  • IntentSlotMetricReporter.Config
  • LanguageModelMetricReporter.Config
  • MaskedLMMetricReporter.Config
  • PureLossMetricReporter.Config
  • PairwiseRankingMetricReporter.Config
  • RegressionMetricReporter.Config
  • SquadMetricReporter.Config
  • NERMetricReporter.Config
  • SequenceTaggingMetricReporter.Config
  • WordTaggingMetricReporter.Config

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}
PureLossMetricReporter.Config

Component: PureLossMetricReporter

class PureLossMetricReporter.Config

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}

pairwise_ranking_metric_reporter

PairwiseRankingMetricReporter.Config

Component: PairwiseRankingMetricReporter

class PairwiseRankingMetricReporter.Config

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}

regression_metric_reporter

RegressionMetricReporter.Config

Component: RegressionMetricReporter

class RegressionMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}

squad_metric_reporter

SquadMetricReporter.Config

Component: SquadMetricReporter

class SquadMetricReporter.Config[source]

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False
n_best_size: int = 5
max_answer_length: int = 16
ignore_impossible: bool = True
false_label: str = 'False'

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false,
    "n_best_size": 5,
    "max_answer_length": 16,
    "ignore_impossible": true,
    "false_label": "False"
}

word_tagging_metric_reporter

NERMetricReporter.Config

Component: NERMetricReporter

class NERMetricReporter.Config

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}
SequenceTaggingMetricReporter.Config

Component: SequenceTaggingMetricReporter

class SequenceTaggingMetricReporter.Config

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}
WordTaggingMetricReporter.Config

Component: WordTaggingMetricReporter

class WordTaggingMetricReporter.Config

Bases: MetricReporter.Config

All Attributes (including base classes)

output_path: str = '/tmp/test_out.txt'
pep_format: bool = False

Default JSON

{
    "output_path": "/tmp/test_out.txt",
    "pep_format": false
}

models

bert_classification_models

BertModelInput
class pytext.models.bert_classification_models.BertModelInput

Bases: ModelInput

All Attributes (including base classes)

tokens: BERTTensorizer.Config = BERTTensorizer.Config(max_seq_len=128)
dense: Optional[FloatListTensorizer.Config] = None
labels: LabelTensorizer.Config = LabelTensorizer.Config()
num_tokens: NtokensTensorizer.Config = NtokensTensorizer.Config(names=['tokens'], indexes=[2])

Default JSON

{
    "tokens": {
        "BERTTensorizer": {
            "is_input": true,
            "columns": [
                "text"
            ],
            "tokenizer": {
                "WordPieceTokenizer": {
                    "basic_tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    },
                    "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
            "max_seq_len": 128
        }
    },
    "dense": null,
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    },
    "num_tokens": {
        "is_input": false,
        "names": [
            "tokens"
        ],
        "indexes": [
            2
        ]
    }
}
BertPairwiseModel.Config

Component: BertPairwiseModel

class BertPairwiseModel.Config[source]

Bases: BasePairwiseModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens1": {
            "BERTTensorizer": {
                "is_input": true,
                "columns": [
                    "text1"
                ],
                "tokenizer": {
                    "WordPieceTokenizer": {
                        "basic_tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        },
                        "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                "max_seq_len": 128
            }
        },
        "tokens2": {
            "BERTTensorizer": {
                "is_input": true,
                "columns": [
                    "text2"
                ],
                "tokenizer": {
                    "WordPieceTokenizer": {
                        "basic_tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        },
                        "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                "max_seq_len": 128
            }
        },
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        },
        "num_tokens": {
            "is_input": false,
            "names": [
                "tokens1",
                "tokens2"
            ],
            "indexes": [
                2,
                2
            ]
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "ClassificationOutputLayer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        }
    },
    "encode_relations": true,
    "encoder": {
        "HuggingFaceBertSentenceEncoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
            "load_weights": true
        }
    },
    "shared_encoder": true
}
ModelInput
class pytext.models.bert_classification_models.ModelInput

Bases: ModelInputBase

All Attributes (including base classes)

tokens1: BERTTensorizer.Config = BERTTensorizer.Config(columns=['text1'], max_seq_len=128)
tokens2: BERTTensorizer.Config = BERTTensorizer.Config(columns=['text2'], max_seq_len=128)
labels: LabelTensorizer.Config = LabelTensorizer.Config()
num_tokens: NtokensTensorizer.Config = NtokensTensorizer.Config(names=['tokens1', 'tokens2'], indexes=[2, 2])

Default JSON

{
    "tokens1": {
        "BERTTensorizer": {
            "is_input": true,
            "columns": [
                "text1"
            ],
            "tokenizer": {
                "WordPieceTokenizer": {
                    "basic_tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    },
                    "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
            "max_seq_len": 128
        }
    },
    "tokens2": {
        "BERTTensorizer": {
            "is_input": true,
            "columns": [
                "text2"
            ],
            "tokenizer": {
                "WordPieceTokenizer": {
                    "basic_tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    },
                    "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
            "max_seq_len": 128
        }
    },
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    },
    "num_tokens": {
        "is_input": false,
        "names": [
            "tokens1",
            "tokens2"
        ],
        "indexes": [
            2,
            2
        ]
    }
}
NewBertModel.Config

Component: NewBertModel

class NewBertModel.Config[source]

Bases: BaseModel.Config

All Attributes (including base classes)

Subclasses
  • NewBertRegressionModel.Config
  • BertSquadQAModel.Config
  • RoBERTa.Config

Default JSON

{
    "inputs": {
        "tokens": {
            "BERTTensorizer": {
                "is_input": true,
                "columns": [
                    "text"
                ],
                "tokenizer": {
                    "WordPieceTokenizer": {
                        "basic_tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        },
                        "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                "max_seq_len": 128
            }
        },
        "dense": null,
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        },
        "num_tokens": {
            "is_input": false,
            "names": [
                "tokens"
            ],
            "indexes": [
                2
            ]
        }
    },
    "encoder": {
        "HuggingFaceBertSentenceEncoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
            "load_weights": true
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    }
}

bert_regression_model

InputConfig
class pytext.models.bert_regression_model.InputConfig

Bases: ConfigBase

All Attributes (including base classes)

tokens: BERTTensorizer.Config = BERTTensorizer.Config(columns=['text1', 'text2'], max_seq_len=128)
labels: NumericLabelTensorizer.Config = NumericLabelTensorizer.Config()

Default JSON

{
    "tokens": {
        "BERTTensorizer": {
            "is_input": true,
            "columns": [
                "text1",
                "text2"
            ],
            "tokenizer": {
                "WordPieceTokenizer": {
                    "basic_tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    },
                    "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
            "max_seq_len": 128
        }
    },
    "labels": {
        "is_input": false,
        "column": "label",
        "rescale_range": null
    }
}
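
The `tokens` default above reads two columns, `text1` and `text2`, and caps the packed sequence at `max_seq_len` = 128. The sketch below illustrates one common way a sentence pair is packed for a BERT-style encoder. The whitespace split, the `[CLS]`/`[SEP]` layout, the longest-first truncation and the helper name `pack_pair` are all assumptions made for illustration; this is not the `BERTTensorizer` implementation, which uses a WordPiece tokenizer.

```python
from typing import List, Tuple


def pack_pair(text1: str, text2: str, max_seq_len: int = 128) -> Tuple[List[str], List[int]]:
    """Pack two texts as [CLS] a [SEP] b [SEP], truncated to max_seq_len tokens.

    Hypothetical helper: whitespace tokenization stands in for WordPiece.
    """
    if max_seq_len < 3:
        raise ValueError("max_seq_len must leave room for [CLS] and two [SEP] tokens")
    a = text1.lower().split()
    b = text2.lower().split()
    budget = max_seq_len - 3  # room for [CLS] and two [SEP]
    # Assumed policy: trim the longer side first until the pair fits.
    while len(a) + len(b) > budget:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], text1 and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids
```

Under these assumptions, two 200-token inputs come back as exactly 128 tokens, with both texts trimmed to roughly equal length.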
NewBertRegressionModel.Config

Component: NewBertRegressionModel

class NewBertRegressionModel.Config[source]

Bases: NewBertModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "BERTTensorizer": {
                "is_input": true,
                "columns": [
                    "text1",
                    "text2"
                ],
                "tokenizer": {
                    "WordPieceTokenizer": {
                        "basic_tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        },
                        "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                "max_seq_len": 128
            }
        },
        "labels": {
            "is_input": false,
            "column": "label",
            "rescale_range": null
        }
    },
    "encoder": {
        "HuggingFaceBertSentenceEncoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
            "load_weights": true
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {},
        "squash_to_unit_range": false
    }
}

decoders

decoder_base
DecoderBase.Config

Component: DecoderBase

class DecoderBase.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
Subclasses
  • IntentSlotModelDecoder.Config
  • MLPDecoder.Config
  • MLPDecoderQueryResponse.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
intent_slot_model_decoder
IntentSlotModelDecoder.Config

Component: IntentSlotModelDecoder

class IntentSlotModelDecoder.Config[source]

Bases: DecoderBase.Config

Configuration class for IntentSlotModelDecoder.

use_doc_probs_in_word

Whether to use intent probabilities for predicting slots.

Type: bool

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
use_doc_probs_in_word: bool = False
doc_decoder: MLPDecoder.Config = MLPDecoder.Config()
word_decoder: MLPDecoder.Config = MLPDecoder.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "use_doc_probs_in_word": false,
    "doc_decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "word_decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    }
}
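
The `use_doc_probs_in_word` flag is described above as using intent probabilities when predicting slots. A minimal sketch of that idea, assuming the intent (document) logits are turned into probabilities with a softmax and appended to every word representation before the word decoder runs. The function names and the plain-list data layout are hypothetical and chosen only to keep the example dependency-free.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_decoder_inputs(
    word_reprs: List[List[float]],
    doc_logits: List[float],
    use_doc_probs_in_word: bool = False,
) -> List[List[float]]:
    """Return per-word inputs for the slot (word) decoder.

    Assumption: when the flag is on, each word vector is extended with the
    intent probability vector, so its width grows by the number of intents.
    """
    if not use_doc_probs_in_word:
        return [list(w) for w in word_reprs]
    doc_probs = softmax(doc_logits)
    return [list(w) + doc_probs for w in word_reprs]
```

With the flag enabled, the word decoder's input width becomes the word representation width plus the number of intent labels, so the two decoders are no longer independent.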
mlp_decoder
MLPDecoder.Config

Component: MLPDecoder

class MLPDecoder.Config[source]

Bases: DecoderBase.Config

Configuration class for MLPDecoder.

hidden_dims

Dimensions of the outputs of hidden layers.

Type: List[int]

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
hidden_dims: list[int] = []
out_dim: Optional[int] = None
layer_norm: bool = False
dropout: float = 0.0
activation: Activation = <Activation.RELU: 'relu'>

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "hidden_dims": [],
    "out_dim": null,
    "layer_norm": false,
    "dropout": 0.0,
    "activation": "relu"
}
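
To make the roles of `hidden_dims` and `out_dim` concrete, the sketch below lists the linear-layer shapes an MLP would have for a given input width. It assumes one linear layer per entry in `hidden_dims`, followed by a final projection only when `out_dim` is set; `mlp_layer_shapes` is a hypothetical helper, and activation, layer norm and dropout are left out.

```python
from typing import List, Optional, Tuple


def mlp_layer_shapes(
    in_dim: int, hidden_dims: List[int], out_dim: Optional[int]
) -> List[Tuple[int, int]]:
    """Return (input_width, output_width) for each linear layer in order."""
    shapes = []
    for dim in hidden_dims:
        shapes.append((in_dim, dim))
        in_dim = dim  # the next layer consumes this layer's output
    if out_dim is not None:
        # Assumed: the final projection exists only when out_dim is given.
        shapes.append((in_dim, out_dim))
    return shapes


print(mlp_layer_shapes(64, [128, 32], 5))
```

With the defaults shown above (`hidden_dims` empty), the decoder in this sketch reduces to a single projection from the representation width to `out_dim`.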
mlp_decoder_query_response
MLPDecoderQueryResponse.Config

Component: MLPDecoderQueryResponse

class MLPDecoderQueryResponse.Config[source]

Bases: DecoderBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
hidden_dims: list[int] = []

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "hidden_dims": []
}

disjoint_multitask_model

DisjointMultitaskModel.Config

Component: DisjointMultitaskModel

class DisjointMultitaskModel.Config

Bases: Model.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()
Subclasses
  • NewDisjointMultitaskModel.Config

Default JSON

{
    "inputs": {}
}
NewDisjointMultitaskModel.Config

Component: NewDisjointMultitaskModel

class NewDisjointMultitaskModel.Config

Bases: DisjointMultitaskModel.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()

Default JSON

{
    "inputs": {}
}

doc_model

ByteModelInput
class pytext.models.doc_model.ByteModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "dense": null,
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    },
    "token_bytes": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "max_seq_len": null,
        "max_byte_len": 15,
        "offset_for_non_padding": 0,
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false
    }
}
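
The `token_bytes` input above carries `max_byte_len` and `offset_for_non_padding`. The following sketch shows one plausible reading of those two fields: each token is encoded as UTF-8, cut to `max_byte_len` bytes, shifted by the offset, and padded to a fixed width. The pad value of 0, the fixed-width padding and the name `tokens_to_bytes` are assumptions for illustration only.

```python
from typing import List, Tuple


def tokens_to_bytes(
    text: str,
    max_byte_len: int = 15,
    offset_for_non_padding: int = 0,
    pad_id: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Turn whitespace tokens into fixed-width rows of byte ids.

    Returns the padded rows and the unpadded byte length of each token.
    """
    rows, lengths = [], []
    for token in text.lower().split():
        raw = token.encode("utf-8")[:max_byte_len]
        # Shift real bytes so they can be told apart from padding if needed.
        ids = [b + offset_for_non_padding for b in raw]
        lengths.append(len(ids))
        rows.append(ids + [pad_id] * (max_byte_len - len(ids)))
    return rows, lengths
```

A nonzero offset matters in this sketch because the byte value 0 would otherwise collide with the padding id.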
ByteTokensDocumentModel.Config

Component: ByteTokensDocumentModel

class ByteTokensDocumentModel.Config[source]

Bases: DocModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "dense": null,
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        },
        "token_bytes": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "max_seq_len": null,
            "max_byte_len": 15,
            "offset_for_non_padding": 0,
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    },
    "byte_embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "sparse": false,
        "cnn": {
            "kernel_num": 100,
            "kernel_sizes": [
                3,
                4
            ],
            "weight_norm": false,
            "dilated": false,
            "causal": false
        },
        "highway_layers": 0,
        "projection_dim": null,
        "export_input_names": [
            "char_vals"
        ],
        "vocab_from_train_data": true,
        "max_word_length": 20,
        "min_freq": 1
    }
}
DocModel.Config

Component: DocModel

class DocModel.Config[source]

Bases: Model.Config

All Attributes (including base classes)

Subclasses
  • ByteTokensDocumentModel.Config
  • DocRegressionModel.Config
  • PersonalizedDocModel.Config
  • SeqNNModel.Config

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "dense": null,
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    }
}
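
Default JSON blocks such as the one above are long, but a hand-written config typically only needs the fields that differ from the defaults. The sketch below illustrates that idea with a recursive merge over plain dictionaries. `apply_overrides` is a hypothetical helper, not a PyText function, and the `defaults` dictionary is a small excerpt of the keys shown above.

```python
import copy
import json


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Excerpt of the DocModel defaults; only a few keys are reproduced here.
defaults = {
    "embedding": {"embed_dim": 100, "freeze": False},
    "decoder": {"hidden_dims": [], "dropout": 0.0, "activation": "relu"},
}
overrides = {"embedding": {"embed_dim": 300}, "decoder": {"hidden_dims": [64]}}

config = apply_overrides(defaults, overrides)
print(json.dumps(config, indent=4))
```

A naive merge like this does not handle fields keyed by component name (for example `representation`, whose single key selects `BiLSTMDocAttention`): replacing the component would need the whole sub-dictionary swapped rather than merged.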
DocRegressionModel.Config

Component: DocRegressionModel

class DocRegressionModel.Config[source]

Bases: DocModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "dense": null,
        "labels": {
            "is_input": false,
            "column": "label",
            "rescale_range": null
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {},
        "squash_to_unit_range": false
    }
}
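
Two fields in the regression config above deal with value ranges: `rescale_range` on the label input and `squash_to_unit_range` on the output layer. A minimal sketch of how such options usually work, assuming `rescale_range` is a `[low, high]` pair mapped linearly onto [0, 1] and that squashing means applying a sigmoid to the raw score. Both function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def rescale_label(label: float, rescale_range: Optional[Sequence[float]] = None) -> float:
    """Map a label from [low, high] to [0, 1]; leave it untouched if no range is set."""
    if rescale_range is None:
        return float(label)
    low, high = rescale_range
    return (float(label) - low) / (high - low)


def squash(score: float, squash_to_unit_range: bool = False) -> float:
    """Optionally pass a raw regression score through a sigmoid."""
    if not squash_to_unit_range:
        return score
    return 1.0 / (1.0 + math.exp(-score))
```

Under these assumptions the two options pair naturally: labels rescaled to the unit interval can be compared directly with squashed model outputs.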
ModelInput
class pytext.models.doc_model.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "dense": null,
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    }
}
PersonalizedDocModel.Config

Component: PersonalizedDocModel

class PersonalizedDocModel.Config[source]

Bases: DocModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "dense": null,
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        },
        "uid": {
            "is_input": true,
            "column": "uid",
            "allow_unknown": true
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    },
    "user_embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    }
}
PersonalizedModelInput
class pytext.models.doc_model.PersonalizedModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "dense": null,
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    },
    "uid": {
        "is_input": true,
        "column": "uid",
        "allow_unknown": true
    }
}
RegressionModelInput
class pytext.models.doc_model.RegressionModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "dense": null,
    "labels": {
        "is_input": false,
        "column": "label",
        "rescale_range": null
    }
}

embeddings

char_embedding
CharacterEmbedding.Config

Component: CharacterEmbedding

class CharacterEmbedding.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
embed_dim: int = 100
sparse: bool = False
cnn: CNNParams = CNNParams()
highway_layers: int = 0
projection_dim: Optional[int] = None
export_input_names: list[str] = ['char_vals']
vocab_from_train_data: bool = True
max_word_length: int = 20
min_freq: int = 1

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "embed_dim": 100,
    "sparse": false,
    "cnn": {
        "kernel_num": 100,
        "kernel_sizes": [
            3,
            4
        ],
        "weight_norm": false,
        "dilated": false,
        "causal": false
    },
    "highway_layers": 0,
    "projection_dim": null,
    "export_input_names": [
        "char_vals"
    ],
    "vocab_from_train_data": true,
    "max_word_length": 20,
    "min_freq": 1
}
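
The `cnn` and `projection_dim` fields together determine how wide each word's character-based vector is. The sketch below works that out under a common character-CNN design, assumed here rather than taken from the source: every kernel size contributes `kernel_num` max-pooled channels, the results are concatenated, and an optional projection sets the final width. `char_embedding_output_dim` is a hypothetical helper.

```python
from typing import Optional, Sequence


def char_embedding_output_dim(
    kernel_num: int = 100,
    kernel_sizes: Sequence[int] = (3, 4),
    projection_dim: Optional[int] = None,
) -> int:
    """Width of the per-word vector produced by a character CNN."""
    # Assumed: one group of kernel_num channels per kernel size, concatenated.
    conv_dim = kernel_num * len(kernel_sizes)
    return projection_dim if projection_dim is not None else conv_dim


print(char_embedding_output_dim())                   # defaults from the JSON above
print(char_embedding_output_dim(projection_dim=64))  # with a projection layer
```

In this sketch `embed_dim` does not appear because it is the width of the individual character vectors fed into the convolution, not of the pooled output.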
contextual_token_embedding
ContextualTokenEmbedding.Config

Component: ContextualTokenEmbedding

class ContextualTokenEmbedding.Config

Bases: ConfigBase

All Attributes (including base classes)

embed_dim: int = 0
model_paths: Optional[dict[str, str]] = None
export_input_names: list[str] = ['contextual_token_embedding']

Default JSON

{
    "embed_dim": 0,
    "model_paths": null,
    "export_input_names": [
        "contextual_token_embedding"
    ]
}
dict_embedding
DictEmbedding.Config

Component: DictEmbedding

class DictEmbedding.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
embed_dim: int = 100
sparse: bool = False
pooling: PoolingType = <PoolingType.MEAN: 'mean'>
export_input_names: list[str] = ['dict_vals', 'dict_weights', 'dict_lens']
vocab_from_train_data: bool = True
mobile: bool = False

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "embed_dim": 100,
    "sparse": false,
    "pooling": "mean",
    "export_input_names": [
        "dict_vals",
        "dict_weights",
        "dict_lens"
    ],
    "vocab_from_train_data": true,
    "mobile": false
}
embedding_base
EmbeddingBase.Config

Component: EmbeddingBase

class EmbeddingBase.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
Subclasses
  • EmbeddingList.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
embedding_list
EmbeddingList.Config

Component: EmbeddingList

class EmbeddingList.Config

Bases: EmbeddingBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
word_embedding
WordEmbedding.Config

Component: WordEmbedding

class WordEmbedding.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
embed_dim: int = 100
embedding_init_strategy: EmbedInitStrategy = <EmbedInitStrategy.RANDOM: 'random'>
embedding_init_range: Optional[list[float]] = None
export_input_names: list[str] = ['tokens_vals']
pretrained_embeddings_path: str = ''
vocab_file: str = ''
vocab_size: int = 0
vocab_from_train_data: bool = True
vocab_from_all_data: bool = False
vocab_from_pretrained_embeddings: bool = False
lowercase_tokens: bool = True
min_freq: int = 1
mlp_layer_dims: Optional[list[int]] = []
padding_idx: Optional[int] = None
cpu_only: bool = False
skip_header: bool = True
delimiter: str = ' '

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "embed_dim": 100,
    "embedding_init_strategy": "random",
    "embedding_init_range": null,
    "export_input_names": [
        "tokens_vals"
    ],
    "pretrained_embeddings_path": "",
    "vocab_file": "",
    "vocab_size": 0,
    "vocab_from_train_data": true,
    "vocab_from_all_data": false,
    "vocab_from_pretrained_embeddings": false,
    "lowercase_tokens": true,
    "min_freq": 1,
    "mlp_layer_dims": [],
    "padding_idx": null,
    "cpu_only": false,
    "skip_header": true,
    "delimiter": " "
}
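
Several of the fields above (`lowercase_tokens`, `min_freq`, `vocab_size`) shape the vocabulary built from training data. The following sketch shows how such options typically interact. It assumes whitespace tokenization, that `min_freq` is an inclusive count threshold, and that `vocab_size` = 0 means no size limit; `build_vocab` is a hypothetical helper, not the library's vocabulary builder.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(
    texts: Iterable[str],
    lowercase_tokens: bool = True,
    min_freq: int = 1,
    vocab_size: int = 0,
) -> List[str]:
    """Return tokens ordered by descending frequency, then alphabetically."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        if lowercase_tokens:
            tokens = [t.lower() for t in tokens]
        counts.update(tokens)
    kept = [(tok, n) for tok, n in counts.items() if n >= min_freq]
    kept.sort(key=lambda item: (-item[1], item[0]))
    if vocab_size > 0:  # assumed: 0 disables the cap
        kept = kept[:vocab_size]
    return [tok for tok, _ in kept]
```

For example, with `min_freq=2` the texts "The cat", "the dog" and "a cat" keep only `cat` and `the`, since lowercasing merges "The" and "the".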

ensembles

bagging_doc_ensemble
BaggingDocEnsembleModel.Config

Component: BaggingDocEnsembleModel

class BaggingDocEnsembleModel.Config[source]

Bases: EnsembleModel.Config

Configuration class for NewBaggingDocEnsemble. These attributes are used by Ensemble.from_config() to construct an instance of NewBaggingDocEnsemble.

models

List of document classification model configurations.

Type: List[NewDocModel.Config]

All Attributes (including base classes)

models: list[DocModel.Config]
sample_rate: float = 1.0

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
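
Because `models` has no default, a config for this component has to be written by hand. The sketch below assembles one as a plain dictionary and serializes it to JSON. The top-level keys `models` and `sample_rate` come from the attribute list above, and the `decoder.hidden_dims` key comes from the DocModel default JSON. The helper name is hypothetical, and whether each sub-model may be given as a partial config that falls back to defaults is an assumption.

```python
import json
from typing import List


def doc_model_overrides(hidden_dims: List[int]) -> dict:
    """Partial DocModel config that only changes the decoder's hidden layers."""
    return {"decoder": {"hidden_dims": hidden_dims}}


ensemble_config = {
    # Two sub-models that differ only in decoder depth (illustrative values).
    "models": [doc_model_overrides([32]), doc_model_overrides([64, 32])],
    "sample_rate": 0.8,
}

text = json.dumps(ensemble_config, indent=4)
print(text)
```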

bagging_intent_slot_ensemble
BaggingIntentSlotEnsembleModel.Config

Component: BaggingIntentSlotEnsembleModel

class BaggingIntentSlotEnsembleModel.Config[source]

Bases: EnsembleModel.Config

Configuration class for BaggingIntentSlotEnsemble. These attributes are used by Ensemble.from_config() to construct an instance of BaggingIntentSlotEnsemble.

models

List of intent-slot model configurations.

Type: List[IntentSlotModel.Config]

output_layer

Output layer of intent-slot model responsible for computing loss and predictions.

Type: IntentSlotOutputLayer

All Attributes (including base classes)

models: list[IntentSlotModel.Config]
sample_rate: float = 1.0
use_crf: bool = False

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

ensemble
EnsembleModel.Config

Component: EnsembleModel

class EnsembleModel.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

models: list[Any]
sample_rate: float = 1.0
Subclasses
  • BaggingDocEnsembleModel.Config
  • BaggingIntentSlotEnsembleModel.Config

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
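
The `sample_rate` attribute suggests bagging: each sub-model is trained on its own random sample of the training data. A minimal sketch of that sampling step, assuming draws are made with replacement and that `sample_rate` is the fraction of the training set size drawn per sub-model. `bagging_sample` is a hypothetical helper, not the library's trainer code.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def bagging_sample(
    examples: Sequence[T], sample_rate: float = 1.0, seed: Optional[int] = None
) -> List[T]:
    """Draw int(len(examples) * sample_rate) examples with replacement."""
    rng = random.Random(seed)
    k = int(len(examples) * sample_rate)
    return [rng.choice(examples) for _ in range(k)]


# One independent sample per sub-model; seeds are fixed only for repeatability.
data = list(range(10))
samples = [bagging_sample(data, sample_rate=0.5, seed=i) for i in range(3)]
```

Even at `sample_rate` = 1.0, sampling with replacement gives each sub-model a different view of the data, since some examples repeat and others are left out.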

joint_model

IntentSlotModel.Config

Component: IntentSlotModel

class IntentSlotModel.Config[source]

Bases: Model.Config

All Attributes (including base classes)

Subclasses
  • ContextualIntentSlotModel.Config

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "word_labels": {
            "is_input": false,
            "slot_column": "slots",
            "text_column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "allow_unknown": true
        },
        "doc_labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": true,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        },
        "doc_weight": null,
        "word_weight": null
    },
    "word_embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocSlotAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "BiLSTM": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm_dim": 32,
                    "num_layers": 1,
                    "bidirectional": true,
                    "pack_sequence": true
                }
            },
            "pooling": null,
            "slot_attention": null,
            "doc_mlp_layers": 0,
            "word_mlp_layers": 0
        }
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "doc_output": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        },
        "word_output": {
            "WordTaggingOutputLayer": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "loss": {
                    "CrossEntropyLoss": {}
                },
                "label_weights": {},
                "ignore_pad_in_loss": true
            }
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "use_doc_probs_in_word": false,
        "doc_decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "word_decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        }
    },
    "default_doc_loss_weight": 0.2,
    "default_word_loss_weight": 0.5
}
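
The last two keys, `default_doc_loss_weight` and `default_word_loss_weight`, suggest that the document (intent) loss and the word (slot) loss are combined as a weighted sum. The sketch below illustrates that reading only. It is an assumption, not PyText's implementation, and the function name `combined_loss` is hypothetical.

```python
def combined_loss(doc_loss, word_loss, doc_weight=0.2, word_weight=0.5):
    """Weighted sum of the two task losses (assumed combination rule)."""
    return doc_weight * doc_loss + word_weight * word_loss


# With the default weights, the slot loss contributes more than the intent loss.
total = combined_loss(doc_loss=1.0, word_loss=2.0)
print(total)
```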
ModelInput
class pytext.models.joint_model.ModelInput

Bases: ModelInput

All Attributes (including base classes)

tokens: TokenTensorizer.Config = TokenTensorizer.Config()
word_labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config(allow_unknown=True)
doc_labels: LabelTensorizer.Config = LabelTensorizer.Config(allow_unknown=True)
doc_weight: Optional[FloatTensorizer.Config] = None
word_weight: Optional[FloatTensorizer.Config] = None

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "word_labels": {
        "is_input": false,
        "slot_column": "slots",
        "text_column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "allow_unknown": true
    },
    "doc_labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": true,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    },
    "doc_weight": null,
    "word_weight": null
}

language_models

lmlstm
LMLSTM.Config

Component: LMLSTM

class LMLSTM.Config[source]

Bases: BaseModel.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()
embedding: WordEmbedding.Config = WordEmbedding.Config()
representation: Union[BiLSTM.Config, DeepCNNRepresentation.Config] = BiLSTM.Config(bidirectional=False)
decoder: Optional[MLPDecoder.Config] = MLPDecoder.Config()
output_layer: LMOutputLayer.Config = LMOutputLayer.Config()
tied_weights: bool = False
stateful: bool = False
caffe2_format: ExporterType = <ExporterType.PREDICTOR: 'predictor'>

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": true,
            "add_eos_token": true,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTM": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm_dim": 32,
            "num_layers": 1,
            "bidirectional": false,
            "pack_sequence": true
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {}
    },
    "tied_weights": false,
    "stateful": false,
    "caffe2_format": "predictor"
}
ModelInput
class pytext.models.language_models.lmlstm.ModelInput

Bases: ModelInput

All Attributes (including base classes)

tokens: Optional[TokenTensorizer.Config] = TokenTensorizer.Config(add_bos_token=True, add_eos_token=True)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": true,
        "add_eos_token": true,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    }
}
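
To make the tokenizer settings above concrete, here is a rough sketch of what `split_regex`, `lowercase`, `add_bos_token` and `add_eos_token` plausibly do to one input string. It uses only Python's `re` module. The `<bos>` and `<eos>` markers are placeholders chosen for this example, not the token strings PyText uses, and the function `tokenize` is hypothetical.

```python
import re

BOS = "<bos>"  # placeholder marker, not PyText's actual token string
EOS = "<eos>"  # placeholder marker, not PyText's actual token string


def tokenize(text, split_regex=r"\s+", lowercase=True,
             add_bos_token=True, add_eos_token=True):
    """Split on the regex, optionally lowercase, and wrap with BOS/EOS."""
    if lowercase:
        text = text.lower()
    # re.split yields empty strings for leading/trailing separators; drop them.
    tokens = [tok for tok in re.split(split_regex, text) if tok]
    if add_bos_token:
        tokens = [BOS] + tokens
    if add_eos_token:
        tokens = tokens + [EOS]
    return tokens


print(tokenize("  Set an Alarm  for 7am "))
```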

masked_lm

InputConfig
class pytext.models.masked_lm.InputConfig

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "BERTTensorizerBase": {
            "is_input": true,
            "columns": [
                "text"
            ],
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "base_tokenizer": null,
            "vocab_file": "",
            "max_seq_len": 128
        }
    }
}
MaskedLanguageModel.Config

Component: MaskedLanguageModel

class MaskedLanguageModel.Config[source]

Bases: BaseModel.Config

All Attributes (including base classes)

inputs: InputConfig = InputConfig()
encoder: TransformerSentenceEncoderBase.Config = TransformerSentenceEncoder.Config()
decoder: MLPDecoder.Config = MLPDecoder.Config()
output_layer: LMOutputLayer.Config = LMOutputLayer.Config()
mask_prob: float = 0.15
mask_bos: bool = False
masking_strategy: MaskingStrategy = <MaskingStrategy.RANDOM: 'random'>
tie_weights: bool = True

Default JSON

{
    "inputs": {
        "tokens": {
            "BERTTensorizerBase": {
                "is_input": true,
                "columns": [
                    "text"
                ],
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "",
                "max_seq_len": 128
            }
        }
    },
    "encoder": {
        "TransformerSentenceEncoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "dropout": 0.1,
            "attention_dropout": 0.1,
            "activation_dropout": 0.1,
            "ffn_embedding_dim": 3072,
            "num_encoder_layers": 6,
            "num_attention_heads": 8,
            "num_segments": 2,
            "use_position_embeddings": true,
            "offset_positions_by_padding": true,
            "apply_bert_init": true,
            "encoder_normalize_before": true,
            "activation_fn": "relu",
            "projection_dim": 0,
            "max_seq_len": 128,
            "multilingual": false,
            "freeze_embeddings": false,
            "n_trans_layers_to_freeze": 0,
            "use_torchscript": false
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {}
    },
    "mask_prob": 0.15,
    "mask_bos": false,
    "masking_strategy": "random",
    "tie_weights": true
}
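
As an illustration of `mask_prob`, `mask_bos` and the `random` masking strategy, the sketch below masks each token independently with probability `mask_prob` and leaves the beginning-of-sentence token alone unless `mask_bos` is set. This is an assumed reading of those parameters, not PyText's implementation. The `[MASK]` and `<bos>` strings and the function `mask_tokens` are placeholders for this example.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol for this sketch


def mask_tokens(tokens, mask_prob=0.15, mask_bos=False, bos_token="<bos>", rng=None):
    """Return (masked_tokens, targets); targets is None where nothing was masked."""
    if rng is None:
        rng = random.Random()
    masked, targets = [], []
    for tok in tokens:
        eligible = mask_bos or tok != bos_token
        if eligible and rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)  # the model would be trained to predict this
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


tokens = ["<bos>", "the", "cat", "sat"]
print(mask_tokens(tokens, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the masking reproducible, which is convenient when comparing runs.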

model

BaseModel.Config

Component: BaseModel

class BaseModel.Config[source]

Bases: Component.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()

Subclasses
  • BertPairwiseModel.Config
  • NewBertModel.Config
  • NewBertRegressionModel.Config
  • DisjointMultitaskModel.Config
  • NewDisjointMultitaskModel.Config
  • ByteTokensDocumentModel.Config
  • DocModel.Config
  • DocRegressionModel.Config
  • PersonalizedDocModel.Config
  • IntentSlotModel.Config
  • LMLSTM.Config
  • MaskedLanguageModel.Config
  • Model.Config
  • BasePairwiseModel.Config
  • PairwiseModel.Config
  • BertSquadQAModel.Config
  • DrQAModel.Config
  • QueryDocPairwiseRankingModel.Config
  • RoBERTa.Config
  • RoBERTaWordTaggingModel.Config
  • ContextualIntentSlotModel.Config
  • SeqNNModel.Config
  • WordTaggingLiteModel.Config
  • WordTaggingModel.Config

Default JSON

{
    "inputs": {}
}
Model.Config

Component: Model

class Model.Config[source]

Bases: BaseModel.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()

Subclasses
  • DisjointMultitaskModel.Config
  • NewDisjointMultitaskModel.Config
  • ByteTokensDocumentModel.Config
  • DocModel.Config
  • DocRegressionModel.Config
  • PersonalizedDocModel.Config
  • IntentSlotModel.Config
  • ContextualIntentSlotModel.Config
  • SeqNNModel.Config
  • WordTaggingLiteModel.Config
  • WordTaggingModel.Config

Default JSON

{
    "inputs": {}
}
ModelInput
class pytext.models.model.ModelInput[source]

Bases: ModelInputBase

All Attributes (including base classes)

Default JSON

{}

module

Module.Config

Component: Module

class Module.Config

Bases: ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Subclasses
  • FeatureConfig
  • BatcherSchedulerConfig
  • ExponentialBatcherSchedulerConfig
  • DecoderBase.Config
  • IntentSlotModelDecoder.Config
  • MLPDecoder.Config
  • MLPDecoderQueryResponse.Config
  • CharacterEmbedding.Config
  • DictEmbedding.Config
  • EmbeddingBase.Config
  • EmbeddingList.Config
  • WordEmbedding.Config
  • PairwiseCosineDistanceOutputLayer.Config
  • BinaryClassificationOutputLayer.Config
  • ClassificationOutputLayer.Config
  • MultiLabelOutputLayer.Config
  • MulticlassOutputLayer.Config
  • RegressionOutputLayer.Config
  • IntentSlotOutputLayer.Config
  • LMOutputLayer.Config
  • OutputLayerBase.Config
  • PairwiseRankingOutputLayer.Config
  • SquadOutputLayer.Config
  • CRFOutputLayer.Config
  • WordTaggingOutputLayer.Config
  • DotProductSelfAttention.Config
  • MultiplicativeAttention.Config
  • SequenceAlignedAttention.Config
  • AugmentedLSTM.Config
  • BiLSTM.Config
  • BiLSTMDocAttention.Config
  • BiLSTMDocSlotAttention.Config
  • BiLSTMSlotAttention.Config
  • BSeqCNNRepresentation.Config
  • ContextualIntentSlotRepresentation.Config
  • DeepCNNRepresentation.Config
  • DocNNRepresentation.Config
  • HuggingFaceBertSentenceEncoder.Config
  • JointCNNRepresentation.Config
  • SharedCNNRepresentation.Config
  • OrderedNeuronLSTM.Config
  • OrderedNeuronLSTMLayer.Config
  • PassThroughRepresentation.Config
  • LastTimestepPool.Config
  • MaxPool.Config
  • MeanPool.Config
  • NoPool.Config
  • PureDocAttention.Config
  • RepresentationBase.Config
  • SeqRepresentation.Config
  • SparseTransformerSentenceEncoder.Config
  • StackedBidirectionalRNN.Config
  • TransformerSentenceEncoder.Config
  • TransformerSentenceEncoderBase.Config
  • RoBERTaEncoder.Config
  • RoBERTaEncoderBase.Config
  • RoBERTaEncoderJit.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
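
Since every Module.Config subclass inherits these four keys, a common need is to override one or two of them while keeping the rest at their defaults. The following standard-library sketch merges overrides into a copy of the default JSON shown above. The helper `with_overrides` and the file name `encoder.pt` are hypothetical, and the sketch says nothing about how PyText itself resolves defaults.

```python
import copy
import json

MODULE_DEFAULTS = {
    "load_path": None,
    "save_path": None,
    "freeze": False,
    "shared_module_key": None,
}


def with_overrides(defaults, overrides):
    """Return a deep copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            # Catch typos early instead of silently adding unknown keys.
            raise KeyError(f"unknown config key: {key}")
        if isinstance(value, dict) and isinstance(merged[key], dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


frozen = with_overrides(MODULE_DEFAULTS, {"freeze": True, "load_path": "encoder.pt"})
print(json.dumps(frozen, indent=4))
```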

output_layers

distance_output_layer
PairwiseCosineDistanceOutputLayer.Config

Component: PairwiseCosineDistanceOutputLayer

class PairwiseCosineDistanceOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[BinaryCrossEntropyLoss.Config, CosineEmbeddingLoss.Config, MAELoss.Config, MSELoss.Config, NLLLoss.Config] = CosineEmbeddingLoss.Config()
score_threshold: float = 0.9
score_type: OutputScore = <OutputScore.norm_cosine: 2>
label_weights: Optional[dict[str, float]] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CosineEmbeddingLoss": {
            "margin": 0.0
        }
    },
    "score_threshold": 0.9,
    "score_type": 2,
    "label_weights": null
}
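
One plausible reading of `score_type` = `norm_cosine` together with `score_threshold` is that the cosine similarity is rescaled from [-1, 1] to [0, 1] and then compared against the threshold. The sketch below shows that reading in plain Python. Both the rescaling formula and the comparison are assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm  # raises ZeroDivisionError for zero vectors


def norm_cosine(u, v):
    """Assumed normalization: map cosine from [-1, 1] to [0, 1]."""
    return (cosine(u, v) + 1.0) / 2.0


def predict(u, v, score_threshold=0.9):
    """Treat the pair as a match when the normalized score reaches the threshold."""
    return norm_cosine(u, v) >= score_threshold


print(norm_cosine([1.0, 0.0], [0.0, 1.0]), predict([1.0, 0.0], [0.0, 1.0]))
```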
doc_classification_output_layer
BinaryClassificationOutputLayer.Config

Component: BinaryClassificationOutputLayer

class BinaryClassificationOutputLayer.Config

Bases: ClassificationOutputLayer.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
label_weights: Optional[dict[str, float]] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CrossEntropyLoss": {}
    },
    "label_weights": null
}
ClassificationOutputLayer.Config

Component: ClassificationOutputLayer

class ClassificationOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
label_weights: Optional[dict[str, float]] = None

Subclasses
  • BinaryClassificationOutputLayer.Config
  • MultiLabelOutputLayer.Config
  • MulticlassOutputLayer.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CrossEntropyLoss": {}
    },
    "label_weights": null
}
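
`label_weights` maps label names to floats. A typical use of such a mapping is to expand it into a per-class weight vector aligned with the label vocabulary, with unlisted labels keeping a neutral weight. The sketch below assumes exactly that. The default weight of 1.0 and the helper `weight_vector` are illustrative assumptions rather than documented PyText behavior.

```python
def weight_vector(label_vocab, label_weights=None, default=1.0):
    """Expand a {label: weight} mapping into a list ordered like label_vocab."""
    label_weights = label_weights or {}
    return [float(label_weights.get(label, default)) for label in label_vocab]


# Up-weight a rare class; the other classes keep the default weight.
vocab = ["negative", "neutral", "positive"]
print(weight_vector(vocab, {"neutral": 2.5}))
```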
MultiLabelOutputLayer.Config

Component: MultiLabelOutputLayer

class MultiLabelOutputLayer.Config

Bases: ClassificationOutputLayer.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
label_weights: Optional[dict[str, float]] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CrossEntropyLoss": {}
    },
    "label_weights": null
}
MulticlassOutputLayer.Config

Component: MulticlassOutputLayer

class MulticlassOutputLayer.Config

Bases: ClassificationOutputLayer.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
label_weights: Optional[dict[str, float]] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CrossEntropyLoss": {}
    },
    "label_weights": null
}
doc_regression_output_layer
RegressionOutputLayer.Config

Component: RegressionOutputLayer

class RegressionOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: MSELoss.Config = MSELoss.Config()
squash_to_unit_range: bool = False

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {},
    "squash_to_unit_range": false
}
intent_slot_output_layer
IntentSlotOutputLayer.Config

Component: IntentSlotOutputLayer

class IntentSlotOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
doc_output: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
word_output: Union[WordTaggingOutputLayer.Config, CRFOutputLayer.Config] = WordTaggingOutputLayer.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "doc_output": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    },
    "word_output": {
        "WordTaggingOutputLayer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": {},
            "ignore_pad_in_loss": true
        }
    }
}
lm_output_layer
LMOutputLayer.Config

Component: LMOutputLayer

class LMOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: CrossEntropyLoss.Config = CrossEntropyLoss.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {}
}
output_layer_base
OutputLayerBase.Config

Component: OutputLayerBase

class OutputLayerBase.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Subclasses
  • PairwiseCosineDistanceOutputLayer.Config
  • BinaryClassificationOutputLayer.Config
  • ClassificationOutputLayer.Config
  • MultiLabelOutputLayer.Config
  • MulticlassOutputLayer.Config
  • RegressionOutputLayer.Config
  • IntentSlotOutputLayer.Config
  • LMOutputLayer.Config
  • PairwiseRankingOutputLayer.Config
  • SquadOutputLayer.Config
  • CRFOutputLayer.Config
  • WordTaggingOutputLayer.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
pairwise_ranking_output_layer
PairwiseRankingOutputLayer.Config

Component: PairwiseRankingOutputLayer

class PairwiseRankingOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: PairwiseRankingLoss.Config = PairwiseRankingLoss.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "margin": 1.0
    }
}
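
The `margin` value above is the kind of parameter used by a pairwise hinge (margin ranking) loss, which penalizes a pair whenever the positive item does not outscore the negative item by at least the margin. The sketch below shows that standard formulation. Whether PyText's PairwiseRankingLoss uses exactly this form is an assumption, and the function name is hypothetical.

```python
def pairwise_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once pos_score exceeds neg_score by at least margin."""
    return max(0.0, margin - (pos_score - neg_score))


print(pairwise_ranking_loss(2.0, 0.5))  # gap of 1.5 already exceeds the margin
print(pairwise_ranking_loss(0.5, 0.2))  # gap of 0.3 falls short of the margin
```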
squad_output_layer
SquadOutputLayer.Config

Component: SquadOutputLayer

class SquadOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[CrossEntropyLoss.Config, KLDivergenceCELoss.Config] = CrossEntropyLoss.Config()
ignore_impossible: bool = True
pos_loss_weight: float = 0.5
has_answer_loss_weight: float = 0.5
false_label: str = 'False'
max_answer_len: int = 30
hard_weight: float = 0.0

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CrossEntropyLoss": {}
    },
    "ignore_impossible": true,
    "pos_loss_weight": 0.5,
    "has_answer_loss_weight": 0.5,
    "false_label": "False",
    "max_answer_len": 30,
    "hard_weight": 0.0
}
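
A parameter such as `max_answer_len` is commonly used at prediction time to restrict the candidate answer spans. The sketch below picks the (start, end) pair with the highest combined score among spans of at most `max_answer_len` tokens. It illustrates that common technique under the assumption that the parameter is used this way. It is not taken from PyText's source, and `best_span` is a hypothetical name.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) maximizing start + end score within the length cap."""
    best, best_score = None, float("-inf")
    for start, s_score in enumerate(start_scores):
        # A span [start, end] has end - start + 1 tokens.
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


starts = [0.1, 2.0, 0.3, 0.0]
ends = [0.0, 0.5, 0.2, 3.0]
print(best_span(starts, ends, max_answer_len=2))
print(best_span(starts, ends))
```

With the cap at 2 tokens, the highest-scoring start cannot be paired with the highest-scoring end, so a shorter span is chosen.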
word_tagging_output_layer
CRFOutputLayer.Config

Component: CRFOutputLayer

class CRFOutputLayer.Config

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
WordTaggingOutputLayer.Config

Component: WordTaggingOutputLayer

class WordTaggingOutputLayer.Config[source]

Bases: OutputLayerBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
label_weights: dict[str, float] = {}
ignore_pad_in_loss: Optional[bool] = True

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "loss": {
        "CrossEntropyLoss": {}
    },
    "label_weights": {},
    "ignore_pad_in_loss": true
}

pair_classification_model

BasePairwiseModel.Config

Component: BasePairwiseModel

class BasePairwiseModel.Config[source]

Bases: BaseModel.Config

All Attributes (including base classes)

Subclasses
  • BertPairwiseModel.Config
  • PairwiseModel.Config
  • QueryDocPairwiseRankingModel.Config

Default JSON

{
    "inputs": {},
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "ClassificationOutputLayer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        }
    },
    "encode_relations": true
}
ModelInput
class pytext.models.pair_classification_model.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens1": {
        "is_input": true,
        "column": "text1",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "tokens2": {
        "is_input": true,
        "column": "text2",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    }
}
PairwiseModel.Config

Component: PairwiseModel

class PairwiseModel.Config[source]

Bases: BasePairwiseModel.Config

encode_relations

If false, return the concatenation of the two representations; if true, also concatenate their pairwise absolute difference and pairwise elementwise product (following arXiv:1705.02364). Default: true.

Type: bool

tied_representation

Whether to use the same representation, with tied weights, for all input subrepresentations. Default: true.
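
The effect of `encode_relations` on the combined feature vector can be sketched with plain Python lists standing in for the two representation tensors. The function name `combine` is hypothetical, and the order of the concatenated parts is an assumption made for this illustration.

```python
def combine(rep1, rep2, encode_relations=True):
    """Concatenate two representations, optionally adding |u - v| and u * v."""
    if len(rep1) != len(rep2):
        raise ValueError("representations must have the same dimension")
    combined = list(rep1) + list(rep2)
    if encode_relations:
        combined += [abs(a - b) for a, b in zip(rep1, rep2)]  # absolute difference
        combined += [a * b for a, b in zip(rep1, rep2)]       # elementwise product
    return combined


u, v = [1.0, 2.0], [3.0, 5.0]
print(combine(u, v, encode_relations=False))
print(combine(u, v))
```

With `encode_relations` enabled, the decoder input is four times the size of a single representation instead of twice its size.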

All Attributes (including base classes)

Subclasses
  • QueryDocPairwiseRankingModel.Config

Default JSON

{
    "inputs": {
        "tokens1": {
            "is_input": true,
            "column": "text1",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "tokens2": {
            "is_input": true,
            "column": "text2",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "ClassificationOutputLayer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        }
    },
    "encode_relations": true,
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    },
    "shared_representations": true
}

qna

bert_squad_qa
BertSquadQAModel.Config

Component: BertSquadQAModel

class BertSquadQAModel.Config[source]

Bases: NewBertModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "squad_input": {
            "SquadForBERTTensorizer": {
                "is_input": true,
                "columns": [
                    "question",
                    "doc"
                ],
                "tokenizer": {
                    "WordPieceTokenizer": {
                        "basic_tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        },
                        "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                "max_seq_len": 256,
                "answers_column": "answers",
                "answer_starts_column": "answer_starts"
            }
        },
        "has_answer": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "has_answer",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        }
    },
    "encoder": {
        "HuggingFaceBertSentenceEncoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
            "load_weights": true
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "ignore_impossible": true,
        "pos_loss_weight": 0.5,
        "has_answer_loss_weight": 0.5,
        "false_label": "False",
        "max_answer_len": 30,
        "hard_weight": 0.0
    },
    "pos_decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": 2,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "has_ans_decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": 2,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "is_kd": false
}
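
The path defaults above (`wordpiece_vocab_path`, `vocab_file`, `bert_cpt_dir`) all point under `/mnt/vol/...`, so they will probably need to be overridden when the model is configured on another machine. The following is a minimal sketch of overlaying overrides onto a nested default dictionary using only the standard library. The `deep_update` helper and the `/path/to/bert` directory are hypothetical and are not part of PyText.

```python
import json

# Abbreviated copy of the defaults shown above; only a few keys are kept.
DEFAULTS = json.loads("""
{
    "inputs": {"squad_input": {"SquadForBERTTensorizer": {
        "max_seq_len": 256,
        "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
    }}},
    "encoder": {"HuggingFaceBertSentenceEncoder": {
        "embedding_dim": 768,
        "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/"
    }}
}
""")


def deep_update(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively (hypothetical helper)."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged


BERT_DIR = "/path/to/bert"  # hypothetical local checkpoint directory

OVERRIDES = {
    "inputs": {"squad_input": {"SquadForBERTTensorizer": {
        "vocab_file": BERT_DIR + "/vocab.txt",
    }}},
    "encoder": {"HuggingFaceBertSentenceEncoder": {
        "bert_cpt_dir": BERT_DIR + "/",
    }},
}

config = deep_update(DEFAULTS, OVERRIDES)
print(json.dumps(config, indent=4))
```

Keys that are not mentioned in the overrides, such as `max_seq_len` and `embedding_dim`, keep their default values.
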
ModelInput
class pytext.models.qna.bert_squad_qa.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "squad_input": {
        "SquadForBERTTensorizer": {
            "is_input": true,
            "columns": [
                "question",
                "doc"
            ],
            "tokenizer": {
                "WordPieceTokenizer": {
                    "basic_tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    },
                    "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
            "max_seq_len": 256,
            "answers_column": "answers",
            "answer_starts_column": "answer_starts"
        }
    },
    "has_answer": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "has_answer",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    }
}
dr_qa
DrQAModel.Config

Component: DrQAModel

class DrQAModel.Config[source]

Bases: BaseModel.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()
dropout: float = 0.4
embedding: WordEmbedding.Config = WordEmbedding.Config(embed_dim=300, pretrained_embeddings_path='/mnt/vol/pytext/users/kushall/pretrained/glove.840B.300d.txt', vocab_from_pretrained_embeddings=True)
ques_rnn: StackedBidirectionalRNN.Config = StackedBidirectionalRNN.Config(dropout=0.4)
doc_rnn: StackedBidirectionalRNN.Config = StackedBidirectionalRNN.Config(dropout=0.4)
output_layer: SquadOutputLayer.Config = SquadOutputLayer.Config()
is_kd: bool = False

Default JSON

{
    "inputs": {
        "squad_input": {
            "SquadTensorizer": {
                "is_input": true,
                "column": "text",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\W+",
                        "lowercase": true
                    }
                },
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " ",
                "doc_column": "doc",
                "ques_column": "question",
                "answers_column": "answers",
                "answer_starts_column": "answer_starts",
                "max_ques_seq_len": 64,
                "max_doc_seq_len": 256
            }
        },
        "has_answer": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "has_answer",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        }
    },
    "dropout": 0.4,
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 300,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "/mnt/vol/pytext/users/kushall/pretrained/glove.840B.300d.txt",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": true,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "ques_rnn": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_size": 32,
        "num_layers": 1,
        "dropout": 0.4,
        "bidirectional": true,
        "rnn_type": "lstm",
        "concat_layers": true
    },
    "doc_rnn": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_size": 32,
        "num_layers": 1,
        "dropout": 0.4,
        "bidirectional": true,
        "rnn_type": "lstm",
        "concat_layers": true
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "ignore_impossible": true,
        "pos_loss_weight": 0.5,
        "has_answer_loss_weight": 0.5,
        "false_label": "False",
        "max_answer_len": 30,
        "hard_weight": 0.0
    },
    "is_kd": false
}
ModelInput
class pytext.models.qna.dr_qa.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "squad_input": {
        "SquadTensorizer": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\W+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " ",
            "doc_column": "doc",
            "ques_column": "question",
            "answers_column": "answers",
            "answer_starts_column": "answer_starts",
            "max_ques_seq_len": 64,
            "max_doc_seq_len": 256
        }
    },
    "has_answer": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "has_answer",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    }
}
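
The tokenizer default here is `"split_regex": "\\W+"`, while several other inputs in this reference use `"\\s+"`. The sketch below illustrates the practical difference between the two patterns with Python's `re` module. It assumes the tokenizer simply lowercases the text and splits on the pattern; `split_tokens` is an illustrative helper, not the PyText tokenizer.

```python
import json
import re


def split_tokens(text, split_regex, lowercase=True):
    """Split `text` on `split_regex`, dropping empty strings (illustrative only)."""
    if lowercase:
        text = text.lower()
    return [tok for tok in re.split(split_regex, text) if tok]


# In JSON the backslash is escaped, so "\\W+" decodes to the regex \W+.
non_word_regex = json.loads(r'{"split_regex": "\\W+"}')["split_regex"]
whitespace_regex = json.loads(r'{"split_regex": "\\s+"}')["split_regex"]

question = "Who wrote Hamlet's soliloquy?"
non_word_tokens = split_tokens(question, non_word_regex)
whitespace_tokens = split_tokens(question, whitespace_regex)

print(non_word_tokens)    # ['who', 'wrote', 'hamlet', 's', 'soliloquy']
print(whitespace_tokens)  # ['who', 'wrote', "hamlet's", 'soliloquy?']
```

Splitting on non-word characters strips punctuation and breaks contractions apart, whereas splitting on whitespace keeps punctuation attached to the neighboring token.
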

query_document_pairwise_ranking_model

ModelInput
class pytext.models.query_document_pairwise_ranking_model.ModelInput

Bases: ModelInput

All Attributes (including base classes)

pos_response: TokenTensorizer.Config = TokenTensorizer.Config(column='pos_response')
neg_response: TokenTensorizer.Config = TokenTensorizer.Config(column='neg_response')
query: TokenTensorizer.Config = TokenTensorizer.Config(column='query')

Default JSON

{
    "pos_response": {
        "is_input": true,
        "column": "pos_response",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "neg_response": {
        "is_input": true,
        "column": "neg_response",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "query": {
        "is_input": true,
        "column": "query",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    }
}
QueryDocPairwiseRankingModel.Config

Component: QueryDocPairwiseRankingModel

class QueryDocPairwiseRankingModel.Config[source]

Bases: PairwiseModel.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()
decoder: MLPDecoderQueryResponse.Config = MLPDecoderQueryResponse.Config()
output_layer: PairwiseRankingOutputLayer.Config = PairwiseRankingOutputLayer.Config()
encode_relations: bool = True
embedding: WordEmbedding.Config = WordEmbedding.Config()
representation: Union[BiLSTMDocAttention.Config, DocNNRepresentation.Config] = BiLSTMDocAttention.Config()
shared_representations: bool = True
decoder_output_dim: int = 64

Default JSON

{
    "inputs": {
        "pos_response": {
            "is_input": true,
            "column": "pos_response",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "neg_response": {
            "is_input": true,
            "column": "neg_response",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "query": {
            "is_input": true,
            "column": "query",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": []
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "margin": 1.0
        }
    },
    "encode_relations": true,
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    },
    "shared_representations": true,
    "decoder_output_dim": 64
}
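
The `margin` value under `output_layer.loss` is the kind of parameter used by a pairwise margin (hinge) ranking loss. The sketch below assumes the standard formulation `max(0, margin - (pos_score - neg_score))`, averaged over the batch. How the model scores a query against each response is not shown in this section, so the scores and the function name here are purely illustrative.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss that is zero once each positive outscores its negative by `margin`."""
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# The positive response scores 0.5 higher than the negative: still inside the margin.
close_pair = margin_ranking_loss([0.75], [0.25], margin=1.0)
# The positive response scores 1.5 higher than the negative: no penalty.
separated_pair = margin_ranking_loss([2.0], [0.5], margin=1.0)

print(close_pair)      # 0.5
print(separated_pair)  # 0.0
```
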

representations

attention
DotProductSelfAttention.Config

Component: DotProductSelfAttention

class DotProductSelfAttention.Config[source]

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
input_dim: int = 32

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "input_dim": 32
}
MultiplicativeAttention.Config

Component: MultiplicativeAttention

class MultiplicativeAttention.Config[source]

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
p_hidden_dim: int = 32
q_hidden_dim: int = 32
normalize: bool = False

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "p_hidden_dim": 32,
    "q_hidden_dim": 32,
    "normalize": false
}
SequenceAlignedAttention.Config

Component: SequenceAlignedAttention

class SequenceAlignedAttention.Config[source]

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
proj_dim: int = 32

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "proj_dim": 32
}
augmented_lstm
AugmentedLSTM.Config

Component: AugmentedLSTM

class AugmentedLSTM.Config[source]

Bases: RepresentationBase.Config, ConfigBase

Configuration class for AugmentedLSTM.

dropout

Variational dropout probability to use. Defaults to 0.0.

Type:float
lstm_dim

Number of features in the hidden state of the LSTM. Defaults to 32.

Type:int
num_layers

Number of recurrent layers. For example, setting num_layers=2 stacks two LSTMs, with the second LSTM taking the outputs of the first as its input and computing the final result. Defaults to 1.

Type:int
bidirectional

If True, becomes a bidirectional LSTM. Defaults to False.

Type:bool
use_highway

If True, a highway network is appended to the outputs of the LSTM. Defaults to True.

Type:bool
use_bias

If True, a bias is used in the LSTM calculations; otherwise it is omitted. Defaults to False.

Type:bool

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.0
lstm_dim: int = 32
use_highway: bool = True
bidirectional: bool = False
num_layers: int = 1
use_bias: bool = False

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.0,
    "lstm_dim": 32,
    "use_highway": true,
    "bidirectional": false,
    "num_layers": 1,
    "use_bias": false
}
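
For readers unfamiliar with the `use_highway` option, the following sketch shows the general idea of a highway connection: a gate in the range 0 to 1 blends a transformed value with a carried (skip) value. It is a generic, element-wise illustration in plain Python; the exact form used inside AugmentedLSTM is not described in this section, and `highway_combine` is a hypothetical name.

```python
def highway_combine(gate, transformed, carried):
    """Blend two vectors element-wise: gate * transformed + (1 - gate) * carried."""
    return [g * t + (1.0 - g) * c for g, t, c in zip(gate, transformed, carried)]


# A gate of 1.0 passes the transformed value, 0.0 passes the carried value,
# and 0.5 averages the two.
blended = highway_combine(
    gate=[1.0, 0.0, 0.5],
    transformed=[2.0, 2.0, 2.0],
    carried=[5.0, 5.0, 4.0],
)
print(blended)  # [2.0, 5.0, 3.0]
```
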
bilstm
BiLSTM.Config

Component: BiLSTM

class BiLSTM.Config[source]

Bases: RepresentationBase.Config, ConfigBase

Configuration class for BiLSTM.

dropout

Dropout probability to use. Defaults to 0.4.

Type:float
lstm_dim

Number of features in the hidden state of the LSTM. Defaults to 32.

Type:int
num_layers

Number of recurrent layers. For example, setting num_layers=2 stacks two LSTMs, with the second LSTM taking the outputs of the first as its input and computing the final result. Defaults to 1.

Type:int
bidirectional

If True, becomes a bidirectional LSTM. Defaults to True.

Type:bool

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
lstm_dim: int = 32
num_layers: int = 1
bidirectional: bool = True
pack_sequence: bool = True

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "lstm_dim": 32,
    "num_layers": 1,
    "bidirectional": true,
    "pack_sequence": true
}
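
To make the interaction between `lstm_dim`, `num_layers`, and `bidirectional` concrete, the sketch below computes the input and output width of each stacked layer. It follows the `torch.nn.LSTM` convention, in which a bidirectional layer emits `2 * hidden_size` features and each layer after the first consumes the output of the previous one. That BiLSTM follows this convention is an assumption, and `lstm_layer_dims` is a hypothetical helper; the input width of 100 matches the default `embed_dim` shown elsewhere in this reference.

```python
def lstm_layer_dims(input_dim, lstm_dim=32, num_layers=1, bidirectional=True):
    """Return a list of (input_width, output_width) pairs, one per stacked layer."""
    num_directions = 2 if bidirectional else 1
    dims = []
    layer_input = input_dim
    for _ in range(num_layers):
        layer_output = lstm_dim * num_directions
        dims.append((layer_input, layer_output))
        # The next layer takes this layer's outputs as its inputs.
        layer_input = layer_output
    return dims


default_dims = lstm_layer_dims(100)
stacked_dims = lstm_layer_dims(100, num_layers=2)
unidirectional_dims = lstm_layer_dims(100, num_layers=2, bidirectional=False)

print(default_dims)         # [(100, 64)]
print(stacked_dims)         # [(100, 64), (64, 64)]
print(unidirectional_dims)  # [(100, 32), (32, 32)]
```
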
bilstm_doc_attention
BiLSTMDocAttention.Config

Component: BiLSTMDocAttention

class BiLSTMDocAttention.Config[source]

Bases: RepresentationBase.Config

Configuration class for BiLSTMDocAttention.

dropout

Dropout probability to use. Defaults to 0.4.

Type:float
lstm

Config for the BiLSTM.

Type:BiLSTM.Config
pooling

Config for the underlying pooling module.

Type:ConfigBase
mlp_decoder

Config for the non-linear projection module.

Type:MLPDecoder.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
lstm: BiLSTM.Config = BiLSTM.Config()
pooling: Union[SelfAttention.Config, MaxPool.Config, MeanPool.Config, NoPool.Config, LastTimestepPool.Config] = SelfAttention.Config()
mlp_decoder: Optional[MLPDecoder.Config] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "lstm": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "lstm_dim": 32,
        "num_layers": 1,
        "bidirectional": true,
        "pack_sequence": true
    },
    "pooling": {
        "SelfAttention": {
            "attn_dimension": 64,
            "dropout": 0.4
        }
    },
    "mlp_decoder": null
}
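
In the default JSON throughout this reference, a field whose type is a Union of component configs (such as `pooling` here) is wrapped in an object keyed by the component name, while a field with a single config type (such as `lstm`) is written inline. The sketch below shows one way to switch a Union-typed field to a different component in a config dictionary. `select_component` is a hypothetical helper, and passing an empty object for `MaxPool` is an assumption, since its config attributes are not listed in this section.

```python
import json

# Abbreviated copy of the default JSON above.
default_config = json.loads("""
{
    "dropout": 0.4,
    "pooling": {
        "SelfAttention": {"attn_dimension": 64, "dropout": 0.4}
    },
    "mlp_decoder": null
}
""")


def select_component(config, field, component, params=None):
    """Return a copy of `config` with `field` set to {component: params}."""
    updated = dict(config)
    updated[field] = {component: dict(params or {})}
    return updated


max_pool_config = select_component(default_config, "pooling", "MaxPool")
print(json.dumps(max_pool_config, indent=4))
```
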
bilstm_doc_slot_attention
BiLSTMDocSlotAttention.Config

Component: BiLSTMDocSlotAttention

class BiLSTMDocSlotAttention.Config[source]

Bases: RepresentationBase.Config, ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
lstm: Union[BiLSTM.Config, OrderedNeuronLSTM.Config, AugmentedLSTM.Config] = BiLSTM.Config()
pooling: Union[SelfAttention.Config, MaxPool.Config, MeanPool.Config, NoneType] = None
slot_attention: Optional[SlotAttention.Config] = None
doc_mlp_layers: int = 0
word_mlp_layers: int = 0

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "lstm": {
        "BiLSTM": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm_dim": 32,
            "num_layers": 1,
            "bidirectional": true,
            "pack_sequence": true
        }
    },
    "pooling": null,
    "slot_attention": null,
    "doc_mlp_layers": 0,
    "word_mlp_layers": 0
}
bilstm_slot_attn
BiLSTMSlotAttention.Config

Component: BiLSTMSlotAttention

class BiLSTMSlotAttention.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
lstm: BiLSTM.Config = BiLSTM.Config()
slot_attention: SlotAttention.Config = SlotAttention.Config()
mlp_decoder: Optional[MLPDecoder.Config] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "lstm": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "lstm_dim": 32,
        "num_layers": 1,
        "bidirectional": true,
        "pack_sequence": true
    },
    "slot_attention": {
        "attn_dimension": 64,
        "attention_type": "no_attention"
    },
    "mlp_decoder": null
}
biseqcnn
BSeqCNNRepresentation.Config

Component: BSeqCNNRepresentation

class BSeqCNNRepresentation.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
cnn: CNNParams = CNNParams()
fwd_bwd_context_len: int = 5
surrounding_context_len: int = 2

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "cnn": {
        "kernel_num": 100,
        "kernel_sizes": [
            3,
            4
        ],
        "weight_norm": false,
        "dilated": false,
        "causal": false
    },
    "fwd_bwd_context_len": 5,
    "surrounding_context_len": 2
}
contextual_intent_slot_rep
ContextualIntentSlotRepresentation.Config

Component: ContextualIntentSlotRepresentation

class ContextualIntentSlotRepresentation.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
sen_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
seq_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
joint_representation: Union[BiLSTMDocSlotAttention.Config, JointCNNRepresentation.Config] = BiLSTMDocSlotAttention.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "sen_representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "cnn": {
            "kernel_num": 100,
            "kernel_sizes": [
                3,
                4
            ],
            "weight_norm": false,
            "dilated": false,
            "causal": false
        }
    },
    "seq_representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "cnn": {
            "kernel_num": 100,
            "kernel_sizes": [
                3,
                4
            ],
            "weight_norm": false,
            "dilated": false,
            "causal": false
        }
    },
    "joint_representation": {
        "BiLSTMDocSlotAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "BiLSTM": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm_dim": 32,
                    "num_layers": 1,
                    "bidirectional": true,
                    "pack_sequence": true
                }
            },
            "pooling": null,
            "slot_attention": null,
            "doc_mlp_layers": 0,
            "word_mlp_layers": 0
        }
    }
}
deepcnn
DeepCNNRepresentation.Config

Component: DeepCNNRepresentation

class DeepCNNRepresentation.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
cnn: CNNParams = CNNParams()
dropout: float = 0.3
activation: Activation = <Activation.GLU: 'glu'>
separable: bool = False
bottleneck: int = 0
pooling_type: PoolingType = <PoolingType.NONE: 'none'>

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "cnn": {
        "kernel_num": 100,
        "kernel_sizes": [
            3,
            4
        ],
        "weight_norm": false,
        "dilated": false,
        "causal": false
    },
    "dropout": 0.3,
    "activation": "glu",
    "separable": false,
    "bottleneck": 0,
    "pooling_type": "none"
}
docnn
DocNNRepresentation.Config

Component: DocNNRepresentation

class DocNNRepresentation.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
cnn: CNNParams = CNNParams()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "cnn": {
        "kernel_num": 100,
        "kernel_sizes": [
            3,
            4
        ],
        "weight_norm": false,
        "dilated": false,
        "causal": false
    }
}
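
With the default `cnn` settings, each entry in `kernel_sizes` contributes `kernel_num` feature maps. Assuming the representation max-pools every feature map over time and concatenates the results (the usual design for this kind of CNN text encoder, which this section does not confirm), the output width is `kernel_num * len(kernel_sizes)`. Both helpers below are illustrative only.

```python
def docnn_output_dim(kernel_num=100, kernel_sizes=(3, 4)):
    """Width of the concatenated, pooled output under the assumption stated above."""
    return kernel_num * len(kernel_sizes)


def max_over_time(feature_maps):
    """Reduce each feature map (activations over time) to its maximum value."""
    return [max(feature_map) for feature_map in feature_maps]


default_dim = docnn_output_dim()
pooled = max_over_time([[0.1, 0.7, 0.3], [0.5, 0.2, 0.4]])

print(default_dim)  # 200
print(pooled)       # [0.7, 0.5]
```
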
huggingface_bert_sentence_encoder
HuggingFaceBertSentenceEncoder.Config

Component: HuggingFaceBertSentenceEncoder

class HuggingFaceBertSentenceEncoder.Config[source]

Bases: TransformerSentenceEncoderBase.Config, ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False
bert_cpt_dir: str = '/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/'
load_weights: bool = True

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false,
    "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
    "load_weights": true
}
jointcnn_rep
JointCNNRepresentation.Config

Component: JointCNNRepresentation

class JointCNNRepresentation.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
doc_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
word_representation: Union[BSeqCNNRepresentation.Config, DeepCNNRepresentation.Config] = BSeqCNNRepresentation.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "doc_representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "cnn": {
            "kernel_num": 100,
            "kernel_sizes": [
                3,
                4
            ],
            "weight_norm": false,
            "dilated": false,
            "causal": false
        }
    },
    "word_representation": {
        "BSeqCNNRepresentation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "cnn": {
                "kernel_num": 100,
                "kernel_sizes": [
                    3,
                    4
                ],
                "weight_norm": false,
                "dilated": false,
                "causal": false
            },
            "fwd_bwd_context_len": 5,
            "surrounding_context_len": 2
        }
    }
}
SharedCNNRepresentation.Config

Component: SharedCNNRepresentation

class SharedCNNRepresentation.Config[source]

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
word_representation: Union[BSeqCNNRepresentation.Config, DeepCNNRepresentation.Config] = DeepCNNRepresentation.Config()
pooling_type: PoolingType = <PoolingType.MAX: 'max'>

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "word_representation": {
        "DeepCNNRepresentation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "cnn": {
                "kernel_num": 100,
                "kernel_sizes": [
                    3,
                    4
                ],
                "weight_norm": false,
                "dilated": false,
                "causal": false
            },
            "dropout": 0.3,
            "activation": "glu",
            "separable": false,
            "bottleneck": 0,
            "pooling_type": "none"
        }
    },
    "pooling_type": "max"
}
ordered_neuron_lstm
OrderedNeuronLSTM.Config

Component: OrderedNeuronLSTM

class OrderedNeuronLSTM.Config

Bases: RepresentationBase.Config, ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
lstm_dim: int = 32
num_layers: int = 1

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "lstm_dim": 32,
    "num_layers": 1
}
OrderedNeuronLSTMLayer.Config

Component: OrderedNeuronLSTMLayer

class OrderedNeuronLSTMLayer.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
pass_through
PassThroughRepresentation.Config

Component: PassThroughRepresentation

class PassThroughRepresentation.Config

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
pooling
BoundaryPool.Config

Component: BoundaryPool

class BoundaryPool.Config

Bases: ConfigBase

All Attributes (including base classes)

boundary_type: str = 'first'

Default JSON

{
    "boundary_type": "first"
}
LastTimestepPool.Config

Component: LastTimestepPool

class LastTimestepPool.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
MaxPool.Config

Component: MaxPool

class MaxPool.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
MeanPool.Config

Component: MeanPool

class MeanPool.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
NoPool.Config

Component: NoPool

class NoPool.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
SelfAttention.Config

Component: SelfAttention

class SelfAttention.Config

Bases: ConfigBase

All Attributes (including base classes)

attn_dimension: int = 64
dropout: float = 0.4

Default JSON

{
    "attn_dimension": 64,
    "dropout": 0.4
}
pure_doc_attention
PureDocAttention.Config

Component: PureDocAttention

class PureDocAttention.Config

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
dropout: float = 0.4
pooling: Union[SelfAttention.Config, MaxPool.Config, MeanPool.Config, NoPool.Config, BoundaryPool.Config] = SelfAttention.Config()
mlp_decoder: Optional[MLPDecoder.Config] = None

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "dropout": 0.4,
    "pooling": {
        "SelfAttention": {
            "attn_dimension": 64,
            "dropout": 0.4
        }
    },
    "mlp_decoder": null
}
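The pooling field accepts one of five Config classes. Below is a minimal sketch of checking a pooling override against that list before writing it into a config file. The helper check_pooling is hypothetical and only inspects the wrapper key; it does not validate the nested settings.

```python
# Class names taken from the Union type of the "pooling" attribute above.
ALLOWED_POOLING = {"SelfAttention", "MaxPool", "MeanPool", "NoPool", "BoundaryPool"}


def check_pooling(pooling: dict) -> str:
    """Return the chosen pooling class name, or raise if it is not allowed."""
    if len(pooling) != 1:
        raise ValueError("pooling must contain exactly one class-name key")
    (name,) = pooling  # unpack the single key
    if name not in ALLOWED_POOLING:
        raise ValueError(f"unknown pooling type: {name}")
    return name


# Example: switch from the default SelfAttention to BoundaryPool.
print(check_pooling({"BoundaryPool": {"boundary_type": "first"}}))
```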
representation_base
RepresentationBase.Config

Component: RepresentationBase

class RepresentationBase.Config

Bases: Module.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None

Subclasses
  • AugmentedLSTM.Config
  • BiLSTM.Config
  • BiLSTMDocAttention.Config
  • BiLSTMDocSlotAttention.Config
  • BiLSTMSlotAttention.Config
  • BSeqCNNRepresentation.Config
  • ContextualIntentSlotRepresentation.Config
  • DeepCNNRepresentation.Config
  • DocNNRepresentation.Config
  • HuggingFaceBertSentenceEncoder.Config
  • JointCNNRepresentation.Config
  • SharedCNNRepresentation.Config
  • OrderedNeuronLSTM.Config
  • PassThroughRepresentation.Config
  • PureDocAttention.Config
  • SeqRepresentation.Config
  • SparseTransformerSentenceEncoder.Config
  • TransformerSentenceEncoder.Config
  • TransformerSentenceEncoderBase.Config
  • RoBERTaEncoder.Config
  • RoBERTaEncoderBase.Config
  • RoBERTaEncoderJit.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null
}
seq_rep
SeqRepresentation.Config

Component: SeqRepresentation

class SeqRepresentation.Config

Bases: RepresentationBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
doc_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
seq_representation: Union[BiLSTMDocAttention.Config, DocNNRepresentation.Config] = BiLSTMDocAttention.Config()

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "doc_representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "cnn": {
            "kernel_num": 100,
            "kernel_sizes": [
                3,
                4
            ],
            "weight_norm": false,
            "dilated": false,
            "causal": false
        }
    },
    "seq_representation": {
        "BiLSTMDocAttention": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": true,
                "pack_sequence": true
            },
            "pooling": {
                "SelfAttention": {
                    "attn_dimension": 64,
                    "dropout": 0.4
                }
            },
            "mlp_decoder": null
        }
    }
}
slot_attention
SlotAttention.Config

Component: SlotAttention

class SlotAttention.Config

Bases: ConfigBase

All Attributes (including base classes)

attn_dimension: int = 64
attention_type: SlotAttentionType = <SlotAttentionType.NO_ATTENTION: 'no_attention'>

Default JSON

{
    "attn_dimension": 64,
    "attention_type": "no_attention"
}
sparse_transformer_sentence_encoder
SparseTransformerSentenceEncoder.Config

Component: SparseTransformerSentenceEncoder

class SparseTransformerSentenceEncoder.Config

Bases: TransformerSentenceEncoder.Config, ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False
dropout: float = 0.1
attention_dropout: float = 0.1
activation_dropout: float = 0.1
ffn_embedding_dim: int = 3072
num_encoder_layers: int = 6
num_attention_heads: int = 8
num_segments: int = 2
use_position_embeddings: bool = True
offset_positions_by_padding: bool = True
apply_bert_init: bool = True
encoder_normalize_before: bool = True
activation_fn: str = 'relu'
projection_dim: int = 0
max_seq_len: int = 128
multilingual: bool = False
freeze_embeddings: bool = False
n_trans_layers_to_freeze: int = 0
use_torchscript: bool = False
project_representation: bool = False
is_bidirectional: bool = True
stride: int = 32
expressivity: int = 8

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false,
    "dropout": 0.1,
    "attention_dropout": 0.1,
    "activation_dropout": 0.1,
    "ffn_embedding_dim": 3072,
    "num_encoder_layers": 6,
    "num_attention_heads": 8,
    "num_segments": 2,
    "use_position_embeddings": true,
    "offset_positions_by_padding": true,
    "apply_bert_init": true,
    "encoder_normalize_before": true,
    "activation_fn": "relu",
    "projection_dim": 0,
    "max_seq_len": 128,
    "multilingual": false,
    "freeze_embeddings": false,
    "n_trans_layers_to_freeze": 0,
    "use_torchscript": false,
    "project_representation": false,
    "is_bidirectional": true,
    "stride": 32,
    "expressivity": 8
}
stacked_bidirectional_rnn
StackedBidirectionalRNN.Config

Component: StackedBidirectionalRNN

class StackedBidirectionalRNN.Config

Bases: Module.Config

Configuration class for StackedBidirectionalRNN.

hidden_size

Number of features in the hidden state of the RNN. Defaults to 32.

Type: int
num_layers

Number of recurrent layers. E.g., setting num_layers=2 stacks two RNNs, with the second RNN taking the outputs of the first and computing the final result. Defaults to 1.

Type: int
dropout

Dropout probability to use. Defaults to 0.0.

Type: float
bidirectional

If True, becomes a bidirectional RNN. Defaults to True.

Type: bool
rnn_type

Which RNN type to use. Options: “rnn”, “lstm”, “gru”. Defaults to “lstm”.

Type: str
concat_layers

Whether to concatenate the outputs of each layer of the stacked RNN. Defaults to True.

Type: bool

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
hidden_size: int = 32
num_layers: int = 1
dropout: float = 0.0
bidirectional: bool = True
rnn_type: RnnType = <RnnType.LSTM: 'lstm'>
concat_layers: bool = True

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "hidden_size": 32,
    "num_layers": 1,
    "dropout": 0.0,
    "bidirectional": true,
    "rnn_type": "lstm",
    "concat_layers": true
}
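The interaction of hidden_size, num_layers, bidirectional and concat_layers determines how wide the output features are. Below is a minimal sketch of that arithmetic. It assumes that concat_layers joins the per-layer outputs along the feature dimension and that a bidirectional RNN doubles the feature size, which is how stacked RNNs are commonly implemented but is not confirmed by the text above. StackedRNNSettings and output_dim are hypothetical names, not part of PyText.

```python
from dataclasses import dataclass


@dataclass
class StackedRNNSettings:
    """Hypothetical stand-in for the documented fields; not the PyText class."""

    hidden_size: int = 32
    num_layers: int = 1
    bidirectional: bool = True
    concat_layers: bool = True


def output_dim(cfg: StackedRNNSettings) -> int:
    """Estimate the per-token output feature size under the stated assumptions."""
    directions = 2 if cfg.bidirectional else 1
    # Assumption: with concat_layers, every layer's output is kept and joined.
    layers_kept = cfg.num_layers if cfg.concat_layers else 1
    return cfg.hidden_size * directions * layers_kept


print(output_dim(StackedRNNSettings()))              # defaults: 32 * 2 * 1
print(output_dim(StackedRNNSettings(num_layers=2)))  # two layers concatenated
```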
transformer_sentence_encoder
TransformerSentenceEncoder.Config

Component: TransformerSentenceEncoder

class TransformerSentenceEncoder.Config

Bases: TransformerSentenceEncoderBase.Config, ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False
dropout: float = 0.1
attention_dropout: float = 0.1
activation_dropout: float = 0.1
ffn_embedding_dim: int = 3072
num_encoder_layers: int = 6
num_attention_heads: int = 8
num_segments: int = 2
use_position_embeddings: bool = True
offset_positions_by_padding: bool = True
apply_bert_init: bool = True
encoder_normalize_before: bool = True
activation_fn: str = 'relu'
projection_dim: int = 0
max_seq_len: int = 128
multilingual: bool = False
freeze_embeddings: bool = False
n_trans_layers_to_freeze: int = 0
use_torchscript: bool = False

Subclasses
  • SparseTransformerSentenceEncoder.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false,
    "dropout": 0.1,
    "attention_dropout": 0.1,
    "activation_dropout": 0.1,
    "ffn_embedding_dim": 3072,
    "num_encoder_layers": 6,
    "num_attention_heads": 8,
    "num_segments": 2,
    "use_position_embeddings": true,
    "offset_positions_by_padding": true,
    "apply_bert_init": true,
    "encoder_normalize_before": true,
    "activation_fn": "relu",
    "projection_dim": 0,
    "max_seq_len": 128,
    "multilingual": false,
    "freeze_embeddings": false,
    "n_trans_layers_to_freeze": 0,
    "use_torchscript": false
}
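When changing embedding_dim or num_attention_heads, it helps to check that the two still fit together. Below is a minimal sketch, assuming the usual multi-head attention layout in which the embedding is split evenly across heads. The helper head_dim is hypothetical and not part of PyText.

```python
def head_dim(embedding_dim: int, num_attention_heads: int) -> int:
    """Per-head feature size, assuming an even split across attention heads."""
    if embedding_dim % num_attention_heads != 0:
        raise ValueError(
            f"embedding_dim={embedding_dim} is not divisible by "
            f"num_attention_heads={num_attention_heads}"
        )
    return embedding_dim // num_attention_heads


# Documented defaults: embedding_dim=768, num_attention_heads=8.
print(head_dim(768, 8))
```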
transformer_sentence_encoder_base
TransformerSentenceEncoderBase.Config

Component: TransformerSentenceEncoderBase

class TransformerSentenceEncoderBase.Config

Bases: RepresentationBase.Config, ConfigBase

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False

Subclasses
  • HuggingFaceBertSentenceEncoder.Config
  • SparseTransformerSentenceEncoder.Config
  • TransformerSentenceEncoder.Config
  • RoBERTaEncoder.Config
  • RoBERTaEncoderBase.Config
  • RoBERTaEncoderJit.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false
}

roberta

InputConfig
class pytext.models.roberta.InputConfig

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "columns": [
            "text"
        ],
        "tokenizer": {
            "GPT2BPETokenizer": {
                "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
                "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
            }
        },
        "base_tokenizer": null,
        "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
        "max_seq_len": 256
    },
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    }
}
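The default InputConfig reads tokens from a column named text and labels from a column named label. Below is a minimal sketch of writing and reading a tab-separated file with those two columns, using only the Python standard library. The sample rows are made up, the header row is an assumption of this sketch, and how a PyText data source maps file columns to these names is not covered here.

```python
import csv
import io

# Made-up rows using the two column names from the default InputConfig.
rows = [
    {"label": "positive", "text": "a great movie"},
    {"label": "negative", "text": "not worth the time"},
]

# Write the rows as tab-separated values into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "text"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read them back, keyed by column name.
buffer.seek(0)
loaded = list(csv.DictReader(buffer, delimiter="\t"))
for row in loaded:
    print(row["label"], "|", row["text"])
```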
RoBERTa.Config

Component: RoBERTa

class RoBERTa.Config

Bases: NewBertModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "columns": [
                "text"
            ],
            "tokenizer": {
                "GPT2BPETokenizer": {
                    "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
                    "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
            "max_seq_len": 256
        },
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        }
    },
    "encoder": {
        "RoBERTaEncoderJit": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "pretrained_encoder": {
                "load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
                "save_path": null,
                "freeze": false,
                "shared_module_key": null
            }
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    }
}
RoBERTaEncoder.Config

Component: RoBERTaEncoder

class RoBERTaEncoder.Config

Bases: RoBERTaEncoderBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False
vocab_size: int = 50265
num_encoder_layers: int = 12
num_attention_heads: int = 12
model_path: str = 'manifold://pytext_training/tree/static/models/roberta_base_torch.pt'
is_finetuned: bool = False

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false,
    "vocab_size": 50265,
    "num_encoder_layers": 12,
    "num_attention_heads": 12,
    "model_path": "manifold://pytext_training/tree/static/models/roberta_base_torch.pt",
    "is_finetuned": false
}
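The default model_path uses a manifold:// location, which appears to be an internal storage path and may not resolve in other environments. Below is a minimal sketch of an override fragment that points RoBERTaEncoder at a local checkpoint instead. The path /path/to/roberta_base_torch.pt is a placeholder, and wrapping the settings under a RoBERTaEncoder key assumes the encoder field is Union-typed, as the RoBERTa.Config Default JSON suggests.

```python
import json

encoder_override = {
    "encoder": {
        "RoBERTaEncoder": {
            # Placeholder path: replace with a checkpoint available locally.
            "model_path": "/path/to/roberta_base_torch.pt",
            # The remaining values repeat the documented defaults.
            "vocab_size": 50265,
            "num_encoder_layers": 12,
            "num_attention_heads": 12,
            "embedding_dim": 768,
        }
    }
}

print(json.dumps(encoder_override, indent=4))
```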
RoBERTaEncoderBase.Config

Component: RoBERTaEncoderBase

class RoBERTaEncoderBase.Config

Bases: TransformerSentenceEncoderBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False

Subclasses
  • RoBERTaEncoder.Config
  • RoBERTaEncoderJit.Config

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false
}
RoBERTaEncoderJit.Config

Component: RoBERTaEncoderJit

class RoBERTaEncoderJit.Config

Bases: RoBERTaEncoderBase.Config

All Attributes (including base classes)

load_path: Optional[str] = None
save_path: Optional[str] = None
freeze: bool = False
shared_module_key: Optional[str] = None
output_dropout: float = 0.4
embedding_dim: int = 768
pooling: PoolingMethod = <PoolingMethod.CLS_TOKEN: 'cls_token'>
export: bool = False
pretrained_encoder: Module.Config = Module.Config(load_path='manifold://pytext_training/tree/static/models/roberta_public.pt1')

Default JSON

{
    "load_path": null,
    "save_path": null,
    "freeze": false,
    "shared_module_key": null,
    "output_dropout": 0.4,
    "embedding_dim": 768,
    "pooling": "cls_token",
    "export": false,
    "pretrained_encoder": {
        "load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
        "save_path": null,
        "freeze": false,
        "shared_module_key": null
    }
}
RoBERTaWordTaggingModel.Config

Component: RoBERTaWordTaggingModel

class RoBERTaWordTaggingModel.Config

Bases: BaseModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "columns": [
                "text"
            ],
            "tokenizer": {
                "GPT2BPETokenizer": {
                    "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
                    "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
                }
            },
            "base_tokenizer": null,
            "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
            "max_seq_len": 256,
            "labels_columns": [
                "label"
            ],
            "labels": []
        }
    },
    "encoder": {
        "RoBERTaEncoderJit": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "output_dropout": 0.4,
            "embedding_dim": 768,
            "pooling": "cls_token",
            "export": false,
            "pretrained_encoder": {
                "load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
                "save_path": null,
                "freeze": false,
                "shared_module_key": null
            }
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": {},
        "ignore_pad_in_loss": true
    }
}
WordTaggingInputConfig
class pytext.models.roberta.WordTaggingInputConfig

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "columns": [
            "text"
        ],
        "tokenizer": {
            "GPT2BPETokenizer": {
                "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
                "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
            }
        },
        "base_tokenizer": null,
        "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
        "max_seq_len": 256,
        "labels_columns": [
            "label"
        ],
        "labels": []
    }
}

semantic_parsers

rnng
rnng_parser
AblationParams
class pytext.models.semantic_parsers.rnng.rnng_parser.AblationParams

Bases: ConfigBase

Ablation parameters.

use_buffer

Whether to use the buffer LSTM.

Type: bool
use_stack

Whether to use the stack LSTM.

Type: bool
use_action

Whether to use the action LSTM.

Type: bool
use_last_open_NT_feature

Whether to use the last open non-terminal as a one-hot feature when computing the representation for the action classifier.

Type: bool

All Attributes (including base classes)

use_buffer: bool = True
use_stack: bool = True
use_action: bool = True
use_last_open_NT_feature: bool = False

Default JSON

{
    "use_buffer": true,
    "use_stack": true,
    "use_action": true,
    "use_last_open_NT_feature": false
}
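An ablation study typically disables one component at a time. Below is a minimal sketch that generates the three single-LSTM ablation settings from the defaults above. The helper single_ablations is hypothetical, and the resulting dictionaries are meant to be placed under the "ablation" key of a parser config.

```python
import json

# Defaults copied from the Default JSON above.
DEFAULTS = {
    "use_buffer": True,
    "use_stack": True,
    "use_action": True,
    "use_last_open_NT_feature": False,
}


def single_ablations(defaults: dict):
    """Yield (disabled_key, settings) pairs, each turning off exactly one LSTM."""
    for key in ("use_buffer", "use_stack", "use_action"):
        settings = dict(defaults)  # copy so the defaults stay untouched
        settings[key] = False
        yield key, settings


for disabled, settings in single_ablations(DEFAULTS):
    print(disabled, json.dumps(settings))
```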
ModelInput
class pytext.models.semantic_parsers.rnng.rnng_parser.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "tokenized_text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "actions": {
        "is_input": true,
        "column": "seqlogical"
    }
}
RNNGConstraints
class pytext.models.semantic_parsers.rnng.rnng_parser.RNNGConstraints

Bases: ConfigBase

Constraints when computing valid actions.

intent_slot_nesting

For intent-slot models, the top-level non-terminal must be an intent; an intent can only have slot non-terminals as children, and vice versa.

Type: bool
ignore_loss_for_unsupported

If the data has an “unsupported” label, that is, a label containing the substring “unsupported”, do not compute loss.

Type: bool
no_slots_inside_unsupported

If the data has an “unsupported” label, that is, a label containing the substring “unsupported”, do not predict slots inside this label.

Type: bool

All Attributes (including base classes)

intent_slot_nesting: bool = True
ignore_loss_for_unsupported: bool = False
no_slots_inside_unsupported: bool = True

Default JSON

{
    "intent_slot_nesting": true,
    "ignore_loss_for_unsupported": false,
    "no_slots_inside_unsupported": true
}
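The constraints above can be read as simple predicates over non-terminal labels. Below is a minimal sketch of those rules, not the parser's actual implementation. It assumes, purely for illustration, that intent labels start with "IN:" and slot labels with "SL:". All function names are hypothetical, and the substring test is case-sensitive only because the description does not say otherwise.

```python
from typing import Optional

# Assumption for illustration: label prefixes that mark intents and slots.
INTENT_PREFIX = "IN:"
SLOT_PREFIX = "SL:"


def is_intent(label: str) -> bool:
    return label.startswith(INTENT_PREFIX)


def is_slot(label: str) -> bool:
    return label.startswith(SLOT_PREFIX)


def is_unsupported(label: str) -> bool:
    """Substring test described for the two *_unsupported options."""
    return "unsupported" in label


def nesting_ok(parent: Optional[str], child: str) -> bool:
    """intent_slot_nesting: the root is an intent; intents and slots alternate."""
    if parent is None:  # top level
        return is_intent(child)
    if is_intent(parent):
        return is_slot(child)
    if is_slot(parent):
        return is_intent(child)
    return False


def may_open(parent: str, child: str) -> bool:
    """no_slots_inside_unsupported: block slots under an unsupported label."""
    if is_slot(child) and is_unsupported(parent):
        return False
    return nesting_ok(parent, child)


print(nesting_ok(None, "IN:get_weather"))
print(may_open("IN:unsupported_weather", "SL:location"))
```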
RNNGParser.Config

Component: RNNGParser

class RNNGParser.Config

Bases: RNNGParserBase.Config

All Attributes (including base classes)

version: int = 2
lstm: BiLSTM.Config = BiLSTM.Config()
ablation: AblationParams = AblationParams()
constraints: RNNGConstraints = RNNGConstraints()
max_open_NT: int = 10
dropout: float = 0.1
beam_size: int = 1
top_k: int = 1
compositional_type: CompositionalType = <CompositionalType.BLSTM: 'blstm'>
inputs: ModelInput = ModelInput()
embedding: WordEmbedding.Config = WordEmbedding.Config()

Default JSON

{
    "version": 2,
    "lstm": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "lstm_dim": 32,
        "num_layers": 1,
        "bidirectional": true,
        "pack_sequence": true
    },
    "ablation": {
        "use_buffer": true,
        "use_stack": true,
        "use_action": true,
        "use_last_open_NT_feature": false
    },
    "constraints": {
        "intent_slot_nesting": true,
        "ignore_loss_for_unsupported": false,
        "no_slots_inside_unsupported": true
    },
    "max_open_NT": 10,
    "dropout": 0.1,
    "beam_size": 1,
    "top_k": 1,
    "compositional_type": "blstm",
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "tokenized_text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "actions": {
            "is_input": true,
            "column": "seqlogical"
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    }
}
RNNGParserBase.Config

Component: RNNGParserBase

class RNNGParserBase.Config

Bases: ConfigBase

All Attributes (including base classes)

version: int = 2
lstm: BiLSTM.Config = BiLSTM.Config()
ablation: AblationParams = AblationParams()
constraints: RNNGConstraints = RNNGConstraints()
max_open_NT: int = 10
dropout: float = 0.1
beam_size: int = 1
top_k: int = 1
compositional_type: CompositionalType = <CompositionalType.BLSTM: 'blstm'>

Subclasses
  • RNNGParser.Config

Default JSON

{
    "version": 2,
    "lstm": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "dropout": 0.4,
        "lstm_dim": 32,
        "num_layers": 1,
        "bidirectional": true,
        "pack_sequence": true
    },
    "ablation": {
        "use_buffer": true,
        "use_stack": true,
        "use_action": true,
        "use_last_open_NT_feature": false
    },
    "constraints": {
        "intent_slot_nesting": true,
        "ignore_loss_for_unsupported": false,
        "no_slots_inside_unsupported": true
    },
    "max_open_NT": 10,
    "dropout": 0.1,
    "beam_size": 1,
    "top_k": 1,
    "compositional_type": "blstm"
}

seq_models

contextual_intent_slot
ContextualIntentSlotModel.Config

Component: ContextualIntentSlotModel

class ContextualIntentSlotModel.Config

Bases: IntentSlotModel.Config

All Attributes (including base classes)

inputs: ModelInput = ModelInput()
word_embedding: WordEmbedding.Config = WordEmbedding.Config()
representation: ContextualIntentSlotRepresentation.Config = ContextualIntentSlotRepresentation.Config()
output_layer: IntentSlotOutputLayer.Config = IntentSlotOutputLayer.Config()
decoder: IntentSlotModelDecoder.Config = IntentSlotModelDecoder.Config()
default_doc_loss_weight: float = 0.2
default_word_loss_weight: float = 0.5
seq_embedding: Optional[WordEmbedding.Config] = WordEmbedding.Config()

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "word_labels": {
            "is_input": false,
            "slot_column": "slots",
            "text_column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "allow_unknown": true
        },
        "doc_labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": true,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        },
        "doc_weight": null,
        "word_weight": null,
        "seq_tokens": {
            "is_input": true,
            "column": "text_seq",
            "max_seq_len": null,
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "add_bol_token": false,
            "add_eol_token": false,
            "use_eol_token_for_bol": false,
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            }
        }
    },
    "word_embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "sen_representation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "cnn": {
                "kernel_num": 100,
                "kernel_sizes": [
                    3,
                    4
                ],
                "weight_norm": false,
                "dilated": false,
                "causal": false
            }
        },
        "seq_representation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "cnn": {
                "kernel_num": 100,
                "kernel_sizes": [
                    3,
                    4
                ],
                "weight_norm": false,
                "dilated": false,
                "causal": false
            }
        },
        "joint_representation": {
            "BiLSTMDocSlotAttention": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm": {
                    "BiLSTM": {
                        "load_path": null,
                        "save_path": null,
                        "freeze": false,
                        "shared_module_key": null,
                        "dropout": 0.4,
                        "lstm_dim": 32,
                        "num_layers": 1,
                        "bidirectional": true,
                        "pack_sequence": true
                    }
                },
                "pooling": null,
                "slot_attention": null,
                "doc_mlp_layers": 0,
                "word_mlp_layers": 0
            }
        }
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "doc_output": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        },
        "word_output": {
            "WordTaggingOutputLayer": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "loss": {
                    "CrossEntropyLoss": {}
                },
                "label_weights": {},
                "ignore_pad_in_loss": true
            }
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "use_doc_probs_in_word": false,
        "doc_decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "word_decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        }
    },
    "default_doc_loss_weight": 0.2,
    "default_word_loss_weight": 0.5,
    "seq_embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    }
}
ModelInput
class pytext.models.seq_models.contextual_intent_slot.ModelInput

Bases: ModelInput

All Attributes (including base classes)

tokens: TokenTensorizer.Config = TokenTensorizer.Config()
word_labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config(allow_unknown=True)
doc_labels: LabelTensorizer.Config = LabelTensorizer.Config(allow_unknown=True)
doc_weight: Optional[FloatTensorizer.Config] = None
word_weight: Optional[FloatTensorizer.Config] = None
seq_tokens: Optional[SeqTokenTensorizer.Config] = SeqTokenTensorizer.Config()

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "word_labels": {
        "is_input": false,
        "slot_column": "slots",
        "text_column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "allow_unknown": true
    },
    "doc_labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": true,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    },
    "doc_weight": null,
    "word_weight": null,
    "seq_tokens": {
        "is_input": true,
        "column": "text_seq",
        "max_seq_len": null,
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "add_bol_token": false,
        "add_eol_token": false,
        "use_eol_token_for_bol": false,
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        }
    }
}
seqnn
ModelInput
class pytext.models.seq_models.seqnn.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text_seq",
        "max_seq_len": null,
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "add_bol_token": false,
        "add_eol_token": false,
        "use_eol_token_for_bol": false,
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        }
    },
    "dense": null,
    "labels": {
        "LabelTensorizer": {
            "is_input": false,
            "column": "label",
            "allow_unknown": false,
            "pad_in_vocab": false,
            "label_vocab": null
        }
    }
}
SeqNNModel.Config

Component: SeqNNModel

class SeqNNModel.Config[source]

Bases: DocModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text_seq",
            "max_seq_len": null,
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "add_bol_token": false,
            "add_eol_token": false,
            "use_eol_token_for_bol": false,
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            }
        },
        "dense": null,
        "labels": {
            "LabelTensorizer": {
                "is_input": false,
                "column": "label",
                "allow_unknown": false,
                "pad_in_vocab": false,
                "label_vocab": null
            }
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "doc_representation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "cnn": {
                "kernel_num": 100,
                "kernel_sizes": [
                    3,
                    4
                ],
                "weight_norm": false,
                "dilated": false,
                "causal": false
            }
        },
        "seq_representation": {
            "BiLSTMDocAttention": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm_dim": 32,
                    "num_layers": 1,
                    "bidirectional": true,
                    "pack_sequence": true
                },
                "pooling": {
                    "SelfAttention": {
                        "attn_dimension": 64,
                        "dropout": 0.4
                    }
                },
                "mlp_decoder": null
            }
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    }
}
SeqNNModel_Deprecated.Config

Component: SeqNNModel_Deprecated

class SeqNNModel_Deprecated.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{
    "representation": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "doc_representation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "cnn": {
                "kernel_num": 100,
                "kernel_sizes": [
                    3,
                    4
                ],
                "weight_norm": false,
                "dilated": false,
                "causal": false
            }
        },
        "seq_representation": {
            "BiLSTMDocAttention": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm_dim": 32,
                    "num_layers": 1,
                    "bidirectional": true,
                    "pack_sequence": true
                },
                "pooling": {
                    "SelfAttention": {
                        "attn_dimension": 64,
                        "dropout": 0.4
                    }
                },
                "mlp_decoder": null
            }
        }
    },
    "output_layer": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "loss": {
            "CrossEntropyLoss": {}
        },
        "label_weights": null
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    }
}

word_model

ByteModelInput
class pytext.models.word_model.ByteModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "token_bytes": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "max_seq_len": null,
        "max_byte_len": 15,
        "offset_for_non_padding": 0,
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false
    },
    "labels": {
        "is_input": false,
        "slot_column": "slots",
        "text_column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "allow_unknown": false
    }
}
ModelInput
class pytext.models.word_model.ModelInput

Bases: ModelInput

All Attributes (including base classes)

Default JSON

{
    "tokens": {
        "is_input": true,
        "column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "add_bos_token": false,
        "add_eos_token": false,
        "use_eos_token_for_bos": false,
        "max_seq_len": null,
        "vocab": {
            "build_from_data": true,
            "size_from_data": 0,
            "vocab_files": []
        },
        "vocab_file_delimiter": " "
    },
    "labels": {
        "is_input": false,
        "slot_column": "slots",
        "text_column": "text",
        "tokenizer": {
            "Tokenizer": {
                "split_regex": "\\s+",
                "lowercase": true
            }
        },
        "allow_unknown": false
    }
}
WordTaggingLiteModel.Config

Component: WordTaggingLiteModel

class WordTaggingLiteModel.Config[source]

Bases: WordTaggingModel.Config

All Attributes (including base classes)

Default JSON

{
    "inputs": {
        "token_bytes": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "max_seq_len": null,
            "max_byte_len": 15,
            "offset_for_non_padding": 0,
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false
        },
        "labels": {
            "is_input": false,
            "slot_column": "slots",
            "text_column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "allow_unknown": false
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "sparse": false,
        "cnn": {
            "kernel_num": 100,
            "kernel_sizes": [
                3,
                4
            ],
            "weight_norm": false,
            "dilated": false,
            "causal": false
        },
        "highway_layers": 0,
        "projection_dim": null,
        "export_input_names": [
            "char_vals"
        ],
        "vocab_from_train_data": true,
        "max_word_length": 20,
        "min_freq": 1
    },
    "representation": {
        "PassThroughRepresentation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null
        }
    },
    "output_layer": {
        "WordTaggingOutputLayer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": {},
            "ignore_pad_in_loss": true
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    }
}
WordTaggingModel.Config

Component: WordTaggingModel

class WordTaggingModel.Config[source]

Bases: Model.Config

All Attributes (including base classes)

Subclasses
  • WordTaggingLiteModel.Config

Default JSON

{
    "inputs": {
        "tokens": {
            "is_input": true,
            "column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "add_bos_token": false,
            "add_eos_token": false,
            "use_eos_token_for_bos": false,
            "max_seq_len": null,
            "vocab": {
                "build_from_data": true,
                "size_from_data": 0,
                "vocab_files": []
            },
            "vocab_file_delimiter": " "
        },
        "labels": {
            "is_input": false,
            "slot_column": "slots",
            "text_column": "text",
            "tokenizer": {
                "Tokenizer": {
                    "split_regex": "\\s+",
                    "lowercase": true
                }
            },
            "allow_unknown": false
        }
    },
    "embedding": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "embed_dim": 100,
        "embedding_init_strategy": "random",
        "embedding_init_range": null,
        "export_input_names": [
            "tokens_vals"
        ],
        "pretrained_embeddings_path": "",
        "vocab_file": "",
        "vocab_size": 0,
        "vocab_from_train_data": true,
        "vocab_from_all_data": false,
        "vocab_from_pretrained_embeddings": false,
        "lowercase_tokens": true,
        "min_freq": 1,
        "mlp_layer_dims": [],
        "padding_idx": null,
        "cpu_only": false,
        "skip_header": true,
        "delimiter": " "
    },
    "representation": {
        "PassThroughRepresentation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null
        }
    },
    "output_layer": {
        "WordTaggingOutputLayer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": {},
            "ignore_pad_in_loss": true
        }
    },
    "decoder": {
        "load_path": null,
        "save_path": null,
        "freeze": false,
        "shared_module_key": null,
        "hidden_dims": [],
        "out_dim": null,
        "layer_norm": false,
        "dropout": 0.0,
        "activation": "relu"
    }
}

optimizer

fp16_optimizer

FP16Optimizer.Config

Component: FP16Optimizer

class FP16Optimizer.Config

Bases: Optimizer.Config

All Attributes (including base classes)

Subclasses
  • FP16OptimizerApex.Config
  • FP16OptimizerFairseq.Config
  • MemoryEfficientFP16OptimizerFairseq.Config

Default JSON

{}
FP16OptimizerApex.Config

Component: FP16OptimizerApex

class FP16OptimizerApex.Config[source]

Bases: FP16Optimizer.Config

All Attributes (including base classes)

opt_level: str = 'O2'
init_loss_scale: Optional[int] = None
min_loss_scale: Optional[float] = None

Default JSON

{
    "opt_level": "O2",
    "init_loss_scale": null,
    "min_loss_scale": null
}
FP16OptimizerFairseq.Config

Component: FP16OptimizerFairseq

class FP16OptimizerFairseq.Config[source]

Bases: FP16Optimizer.Config

All Attributes (including base classes)

init_loss_scale: int = 128
scale_window: Optional[int] = None
scale_tolerance: float = 0.0
threshold_loss_scale: Optional[float] = None
min_loss_scale: float = 0.0001

Default JSON

{
    "init_loss_scale": 128,
    "scale_window": null,
    "scale_tolerance": 0.0,
    "threshold_loss_scale": null,
    "min_loss_scale": 0.0001
}
MemoryEfficientFP16OptimizerFairseq.Config

Component: MemoryEfficientFP16OptimizerFairseq

class MemoryEfficientFP16OptimizerFairseq.Config[source]

Bases: FP16Optimizer.Config

All Attributes (including base classes)

init_loss_scale: int = 128
scale_window: Optional[int] = None
scale_tolerance: float = 0.0
threshold_loss_scale: Optional[float] = None
min_loss_scale: float = 0.0001

Default JSON

{
    "init_loss_scale": 128,
    "scale_window": null,
    "scale_tolerance": 0.0,
    "threshold_loss_scale": null,
    "min_loss_scale": 0.0001
}

lamb

Lamb.Config

Component: Lamb

class Lamb.Config[source]

Bases: Optimizer.Config

All Attributes (including base classes)

lr: float = 0.001
weight_decay: float = 1e-05
eps: float = 1e-08
min_trust: Optional[float] = None

Default JSON

{
    "lr": 0.001,
    "weight_decay": 1e-05,
    "eps": 1e-08,
    "min_trust": null
}

optimizers

Adagrad.Config

Component: Adagrad

class Adagrad.Config[source]

Bases: Optimizer.Config

All Attributes (including base classes)

lr: float = 0.01
weight_decay: float = 1e-05

Default JSON

{
    "lr": 0.01,
    "weight_decay": 1e-05
}
Adam.Config

Component: Adam

class Adam.Config[source]

Bases: Optimizer.Config

All Attributes (including base classes)

lr: float = 0.001
weight_decay: float = 1e-05
eps: float = 1e-08

Default JSON

{
    "lr": 0.001,
    "weight_decay": 1e-05,
    "eps": 1e-08
}
AdamW.Config

Component: AdamW

class AdamW.Config[source]

Bases: Optimizer.Config

All Attributes (including base classes)

lr: float = 0.001
weight_decay: float = 0.01
eps: float = 1e-08

Default JSON

{
    "lr": 0.001,
    "weight_decay": 0.01,
    "eps": 1e-08
}
Optimizer.Config

Component: Optimizer

class Optimizer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

Subclasses
  • FP16Optimizer.Config
  • FP16OptimizerApex.Config
  • FP16OptimizerFairseq.Config
  • MemoryEfficientFP16OptimizerFairseq.Config
  • Lamb.Config
  • Adagrad.Config
  • Adam.Config
  • AdamW.Config
  • SGD.Config
  • RAdam.Config
  • StochasticWeightAveraging.Config

Default JSON

{}
SGD.Config

Component: SGD

class SGD.Config[source]

Bases: Optimizer.Config

All Attributes (including base classes)

lr: float = 0.001
momentum: float = 0.0

Default JSON

{
    "lr": 0.001,
    "momentum": 0.0
}

radam

RAdam.Config

Component: RAdam

class RAdam.Config[source]

Bases: Optimizer.Config

All Attributes (including base classes)

lr: float = 0.001
weight_decay: float = 1e-05
eps: float = 1e-08

Default JSON

{
    "lr": 0.001,
    "weight_decay": 1e-05,
    "eps": 1e-08
}

scheduler

BatchScheduler.Config

Component: BatchScheduler

class BatchScheduler.Config

Bases: Scheduler.Config

All Attributes (including base classes)

Subclasses
  • PolynomialDecayScheduler.Config
  • SchedulerWithWarmup.Config
  • WarmupScheduler.Config

Default JSON

{}
CosineAnnealingLR.Config

Component: CosineAnnealingLR

class CosineAnnealingLR.Config[source]

Bases: Scheduler.Config

All Attributes (including base classes)

t_max: int = 1000
Maximum number of iterations.
eta_min: float = 0
Minimum learning rate.

Default JSON

{
    "t_max": 1000,
    "eta_min": 0
}
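
As a rough illustration of how t_max and eta_min interact, the sketch below evaluates the standard closed-form cosine annealing schedule. It assumes this scheduler follows that standard form; base_lr (the optimizer's initial learning rate) and the function name are hypothetical and not part of this config.

```python
import math


def cosine_annealing_lr(base_lr: float, step: int,
                        t_max: int = 1000, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing (illustrative sketch, not PyText code).

    The rate starts at base_lr for step 0 and reaches eta_min at step t_max.
    base_lr is an assumed input taken from the optimizer.
    """
    cosine = (1 + math.cos(math.pi * step / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cosine


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_annealing_lr(0.1, step))
```

With the defaults shown above, the rate falls smoothly from base_lr to 0 over 1000 iterations.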
CyclicLR.Config

Component: CyclicLR

class CyclicLR.Config[source]

Bases: Scheduler.Config

All Attributes (including base classes)

base_lr: float = 0.001
max_lr: float = 0.002
step_size_up: int = 2000
step_size_down: Optional[int] = None
mode: str = 'triangular'
gamma: float = 1.0
scale_mode: str = 'cycle'
cycle_momentum: bool = True
base_momentum: float = 0.8
max_momentum: float = 0.9
last_epoch: int = -1

Default JSON

{
    "base_lr": 0.001,
    "max_lr": 0.002,
    "step_size_up": 2000,
    "step_size_down": null,
    "mode": "triangular",
    "gamma": 1.0,
    "scale_mode": "cycle",
    "cycle_momentum": true,
    "base_momentum": 0.8,
    "max_momentum": 0.9,
    "last_epoch": -1
}
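
The following sketch shows how base_lr, max_lr, step_size_up and step_size_down could shape the learning rate in triangular mode. It assumes the usual triangular cyclic policy, where the scaling factor stays at 1 and gamma is not used; momentum cycling is left out, and the function name is hypothetical.

```python
from typing import Optional


def triangular_cyclic_lr(step: int, base_lr: float = 0.001, max_lr: float = 0.002,
                         step_size_up: int = 2000,
                         step_size_down: Optional[int] = None) -> float:
    """Triangular cyclic learning rate (illustrative sketch, not PyText code)."""
    # When step_size_down is not given, the descent mirrors the ascent.
    if step_size_down is None:
        step_size_down = step_size_up
    position = step % (step_size_up + step_size_down)  # position inside the cycle
    if position <= step_size_up:
        scale = position / step_size_up                         # rising edge
    else:
        scale = 1 - (position - step_size_up) / step_size_down  # falling edge
    return base_lr + (max_lr - base_lr) * scale


if __name__ == "__main__":
    for step in (0, 1000, 2000, 3000, 4000):
        print(step, triangular_cyclic_lr(step))
```

With the default values, one full cycle takes 4000 batch updates: 2000 up to max_lr and 2000 back down to base_lr.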
ExponentialLR.Config

Component: ExponentialLR

class ExponentialLR.Config[source]

Bases: Scheduler.Config

All Attributes (including base classes)

gamma: float = 0.1
Multiplicative factor of learning rate decay.

Default JSON

{
    "gamma": 0.1
}
LmFineTuning.Config

Component: LmFineTuning

class LmFineTuning.Config[source]

Bases: Scheduler.Config

All Attributes (including base classes)

cut_frac: float = 0.1
Fraction of iterations during which the learning rate increases. Defaults to 0.1.
ratio: int = 32
How many times smaller the lowest LR is than the maximum LR eta_max; the lowest LR is eta_max / ratio.
non_pretrained_param_groups: int = 2
Number of param_groups, starting from the end, that were not pretrained. The default value is 2, since the base Model class typically supplies the optimizer with one param_group from the embedding and one param_group from its other components.
lm_lr_multiplier: float = 1.0
Factor to multiply lr for all pretrained layers by.
lm_use_per_layer_lr: bool = False
Whether to make each pretrained layer’s lr one-half as large as that of the next (higher) layer.
lm_gradual_unfreezing: bool = True
Whether to unfreeze layers one by one (per epoch).
last_epoch: int = -1
Despite its name, last_epoch counts batch updates, not epochs: last_batch_update = current_epoch_number * num_batches_per_epoch + batch_id. It is incremented by 1 after each batch update.

Default JSON

{
    "cut_frac": 0.1,
    "ratio": 32,
    "non_pretrained_param_groups": 2,
    "lm_lr_multiplier": 1.0,
    "lm_use_per_layer_lr": false,
    "lm_gradual_unfreezing": true,
    "last_epoch": -1
}
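
The cut_frac and ratio parameters match those of the slanted triangular learning rate schedule used for language model fine-tuning. The sketch below assumes that schedule applies here; total_steps, eta_max and both function names are hypothetical inputs, and gradual unfreezing and per-layer rates are not modelled.

```python
import math


def last_batch_update(epoch: int, num_batches_per_epoch: int, batch_id: int) -> int:
    """Batch-update counter described for last_epoch above."""
    return epoch * num_batches_per_epoch + batch_id


def slanted_triangular_lr(step: int, total_steps: int, eta_max: float,
                          cut_frac: float = 0.1, ratio: int = 32) -> float:
    """Slanted triangular schedule (illustrative sketch, not PyText code).

    The rate rises linearly for the first cut_frac of the updates and then
    decays linearly back to eta_max / ratio.
    """
    cut = math.floor(total_steps * cut_frac)
    if step < cut:
        p = step / cut
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio


if __name__ == "__main__":
    step = last_batch_update(epoch=2, num_batches_per_epoch=100, batch_id=50)
    print(step, slanted_triangular_lr(step, total_steps=1000, eta_max=0.01))
```

The short warm-up followed by a long decay is what distinguishes this shape from a symmetric triangular cycle.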
PolynomialDecayScheduler.Config

Component: PolynomialDecayScheduler

class PolynomialDecayScheduler.Config[source]

Bases: BatchScheduler.Config

All Attributes (including base classes)

warmup_steps: int = 0
Number of training steps over which to increase the learning rate.
total_steps: int
Number of training steps for learning rate decay.
end_learning_rate: float
Final learning rate after total_steps of training.
power: float = 1.0
Exponent used in the polynomial decay calculation.

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
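
Since no default JSON is generated for this config, a small sketch may help show how the four parameters could combine. It assumes a linear warm-up followed by polynomial decay towards end_learning_rate; the exact PyText behaviour may differ, and base_lr and the function name are hypothetical.

```python
def polynomial_decay_lr(step: int, base_lr: float, total_steps: int,
                        end_learning_rate: float, warmup_steps: int = 0,
                        power: float = 1.0) -> float:
    """Warm-up plus polynomial decay (illustrative sketch, not PyText code).

    Assumes total_steps > warmup_steps. base_lr is the optimizer's initial
    learning rate, an assumed input.
    """
    if warmup_steps > 0 and step <= warmup_steps:
        # Linear warm-up from 0 to base_lr.
        return base_lr * step / warmup_steps
    if step <= total_steps:
        # Remaining fraction of the decay phase, raised to `power`.
        remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
        return (base_lr - end_learning_rate) * remaining ** power + end_learning_rate
    return end_learning_rate


if __name__ == "__main__":
    for step in (50, 100, 550, 1000, 2000):
        print(step, polynomial_decay_lr(step, 0.01, 1000, 0.001, warmup_steps=100))
```

With power set to 1.0 the decay is linear; larger values front-load the decay, and the rate stays at end_learning_rate once total_steps is passed.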

ReduceLROnPlateau.Config

Component: ReduceLROnPlateau

class ReduceLROnPlateau.Config[source]

Bases: Scheduler.Config

All Attributes (including base classes)

lower_is_better: bool = True
Indicates the desired direction of the monitored quantity. If true, the learning rate is reduced when the monitored quantity stops decreasing; if false, when it stops increasing.
factor: float = 0.1
Factor by which the learning rate will be reduced. new_lr = lr * factor
patience: int = 5
Number of epochs with no improvement after which learning rate will be reduced
min_lr: float = 0
Lower bound on the learning rate of all param groups
threshold: float = 0.0001
Threshold for measuring the new optimum, to only focus on significant changes.
threshold_is_absolute: bool = True
If true, the threshold is absolute (abs mode); otherwise it is relative (rel mode). In rel mode, dynamic_threshold = best * (1 + threshold) in max mode or best * (1 - threshold) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode.
cooldown: int = 0
Number of epochs to wait before resuming normal operation after lr has been reduced.

Default JSON

{
    "lower_is_better": true,
    "factor": 0.1,
    "patience": 5,
    "min_lr": 0,
    "threshold": 0.0001,
    "threshold_is_absolute": true,
    "cooldown": 0
}
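
The interplay of threshold, patience and factor can be sketched as below. This is a simplified model under stated assumptions: min mode corresponds to lower_is_better being true, cooldown is ignored, and the function names are hypothetical rather than PyText APIs.

```python
from typing import List, Optional


def is_improvement(current: float, best: float, lower_is_better: bool = True,
                   threshold: float = 0.0001,
                   threshold_is_absolute: bool = True) -> bool:
    """Compare a new metric value against the dynamic threshold."""
    if threshold_is_absolute:
        target = best - threshold if lower_is_better else best + threshold
    else:
        target = best * (1 - threshold) if lower_is_better else best * (1 + threshold)
    return current < target if lower_is_better else current > target


def reduce_on_plateau(metrics: List[float], lr: float, factor: float = 0.1,
                      patience: int = 5, min_lr: float = 0.0,
                      **threshold_kwargs) -> List[float]:
    """Return the learning rate in effect after each epoch (cooldown omitted)."""
    best: Optional[float] = None
    bad_epochs = 0
    history = []
    for metric in metrics:
        if best is None or is_improvement(metric, best, **threshold_kwargs):
            best, bad_epochs = metric, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)  # new_lr = lr * factor, floored at min_lr
            bad_epochs = 0
        history.append(lr)
    return history


if __name__ == "__main__":
    print(reduce_on_plateau([1.0] * 7, lr=0.1))
```

In this sketch the rate is cut only once the number of epochs without improvement exceeds patience, so with the default of 5 the reduction happens on the sixth stagnant epoch.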
Scheduler.Config

Component: Scheduler

class Scheduler.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

Subclasses
  • BatchScheduler.Config
  • CosineAnnealingLR.Config
  • CyclicLR.Config
  • ExponentialLR.Config
  • LmFineTuning.Config
  • PolynomialDecayScheduler.Config
  • ReduceLROnPlateau.Config
  • SchedulerWithWarmup.Config
  • StepLR.Config
  • WarmupScheduler.Config

Default JSON

{}
SchedulerWithWarmup.Config

Component: SchedulerWithWarmup

class SchedulerWithWarmup.Config[source]

Bases: BatchScheduler.Config

All Attributes (including base classes)

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

StepLR.Config

Component: StepLR

class StepLR.Config[source]

Bases: Scheduler.Config

All Attributes (including base classes)

step_size: int = 30
Period of learning rate decay.
gamma: float = 0.1
Multiplicative factor of learning rate decay.

Default JSON

{
    "step_size": 30,
    "gamma": 0.1
}
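
As a minimal sketch of what step_size and gamma mean, the function below applies the usual step decay rule, multiplying the rate by gamma once every step_size epochs. It assumes the standard step decay form; base_lr and the function name are hypothetical.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 30, gamma: float = 0.1) -> float:
    """Step decay (illustrative sketch, not PyText code)."""
    # Integer division counts how many decay periods have fully elapsed.
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    for epoch in (0, 29, 30, 60):
        print(epoch, step_lr(0.1, epoch))
```

Setting step_size to 1 in this sketch gives a decay on every epoch, which is the behaviour the gamma parameter of ExponentialLR.Config describes.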
WarmupScheduler.Config

Component: WarmupScheduler

class WarmupScheduler.Config[source]

Bases: BatchScheduler.Config

All Attributes (including base classes)

warmup_steps: int = 10000
Number of training steps over which to increase the learning rate.
inverse_sqrt_decay: bool = False
Whether to perform inverse square-root decay after the warmup phase.

Default JSON

{
    "warmup_steps": 10000,
    "inverse_sqrt_decay": false
}
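
The two parameters above can be illustrated with the following sketch. It assumes a linear warm-up to base_lr and, when inverse_sqrt_decay is enabled, a decay proportional to 1 / sqrt(step) that is continuous at the end of the warm-up; base_lr and the function name are hypothetical, and the exact PyText formula may differ.

```python
import math


def warmup_lr(step: int, base_lr: float, warmup_steps: int = 10000,
              inverse_sqrt_decay: bool = False) -> float:
    """Linear warm-up with optional inverse square-root decay (illustrative sketch).

    Assumes warmup_steps > 0 and step >= 0.
    """
    if step <= warmup_steps:
        return base_lr * step / warmup_steps
    if inverse_sqrt_decay:
        # Equals base_lr at step == warmup_steps, then shrinks as 1 / sqrt(step).
        return base_lr * math.sqrt(warmup_steps / step)
    return base_lr


if __name__ == "__main__":
    for step in (5000, 10000, 40000):
        print(step, warmup_lr(step, 0.001), warmup_lr(step, 0.001, inverse_sqrt_decay=True))
```

Without inverse_sqrt_decay the rate simply stays at base_lr once the warm-up is complete.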

sparsifiers

blockwise_sparsifier
BlockwiseMagnitudeSparsifier.Config

Component: BlockwiseMagnitudeSparsifier

class BlockwiseMagnitudeSparsifier.Config[source]

Bases: L0_projection_sparsifier.Config

All Attributes (including base classes)

sparsity: float = 0.9
starting_epoch: int = 2
frequency: int = 1
layerwise_pruning: bool = True
accumulate_mask: bool = False
block_size: int = 16
columnwise_blocking: bool = False

Default JSON

{
    "sparsity": 0.9,
    "starting_epoch": 2,
    "frequency": 1,
    "layerwise_pruning": true,
    "accumulate_mask": false,
    "block_size": 16,
    "columnwise_blocking": false
}
sparsifier
CRF_L1_SoftThresholding.Config

Component: CRF_L1_SoftThresholding

class CRF_L1_SoftThresholding.Config

Bases: CRF_SparsifierBase.Config

All Attributes (including base classes)

starting_epoch: int = 1
frequency: int = 1
lambda_l1: float = 0.001

Default JSON

{
    "starting_epoch": 1,
    "frequency": 1,
    "lambda_l1": 0.001
}
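
The lambda_l1 attribute is the threshold of an L1 soft-thresholding step. As a minimal sketch, the standard soft-thresholding operator shrinks each weight toward zero by lambda_l1 and zeroes anything smaller in magnitude. Which CRF parameters the component applies this to, and when, is not stated here; the helper name soft_threshold is illustrative.

```python
def soft_threshold(weight: float, lambda_l1: float = 0.001) -> float:
    """Standard soft-thresholding operator (hypothetical helper)."""
    # Shrink the magnitude by lambda_l1, clamping at zero.
    magnitude = max(abs(weight) - lambda_l1, 0.0)
    # Restore the original sign.
    return magnitude if weight >= 0 else -magnitude


weights = [0.5, -0.5, 0.0005, -0.0005]
print([soft_threshold(w) for w in weights])
```

Weights whose magnitude is below lambda_l1 become exactly zero, which is what produces sparsity.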
CRF_MagnitudeThresholding.Config

Component: CRF_MagnitudeThresholding

class CRF_MagnitudeThresholding.Config

Bases: CRF_SparsifierBase.Config

All Attributes (including base classes)

starting_epoch: int = 1
frequency: int = 1
sparsity: float = 0.9
grouping: str = 'row'

Default JSON

{
    "starting_epoch": 1,
    "frequency": 1,
    "sparsity": 0.9,
    "grouping": "row"
}

CRF_SparsifierBase.Config

Component: CRF_SparsifierBase

class CRF_SparsifierBase.Config

Bases: Sparsifier.Config

All Attributes (including base classes)

starting_epoch: int = 1
frequency: int = 1

Subclasses
  • CRF_L1_SoftThresholding.Config
  • CRF_MagnitudeThresholding.Config

Default JSON

{
    "starting_epoch": 1,
    "frequency": 1
}

L0_projection_sparsifier.Config

Component: L0_projection_sparsifier

class L0_projection_sparsifier.Config

Bases: Sparsifier.Config

All Attributes (including base classes)

sparsity: float = 0.9
starting_epoch: int = 2
frequency: int = 1
layerwise_pruning: bool = True
accumulate_mask: bool = False

Subclasses
  • BlockwiseMagnitudeSparsifier.Config

Default JSON

{
    "sparsity": 0.9,
    "starting_epoch": 2,
    "frequency": 1,
    "layerwise_pruning": true,
    "accumulate_mask": false
}

Sparsifier.Config

Component: Sparsifier

class Sparsifier.Config

Bases: ConfigBase

All Attributes (including base classes)

Subclasses
  • BlockwiseMagnitudeSparsifier.Config
  • CRF_L1_SoftThresholding.Config
  • CRF_MagnitudeThresholding.Config
  • CRF_SparsifierBase.Config
  • L0_projection_sparsifier.Config

Default JSON

{}

swa

StochasticWeightAveraging.Config

Component: StochasticWeightAveraging

class StochasticWeightAveraging.Config

Bases: Optimizer.Config

All Attributes (including base classes)

optimizer: Union[SGD.Config, Adam.Config, AdamW.Config, Adagrad.Config, RAdam.Config, Lamb.Config] = SGD.Config()
start: int = 10
frequency: int = 5
swa_learning_rate: Optional[float] = 0.05

Default JSON

{
    "optimizer": {
        "SGD": {
            "lr": 0.001,
            "momentum": 0.0
        }
    },
    "start": 10,
    "frequency": 5,
    "swa_learning_rate": 0.05
}
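
The start and frequency attributes control which weight snapshots enter the average. The sketch below shows the general idea of stochastic weight averaging as a running mean over snapshots taken at step start and every frequency steps after it. Whether the component counts optimizer steps or epochs, and the zero-based indexing used here, are assumptions. The helper name swa_average is illustrative.

```python
def swa_average(snapshots, start=10, frequency=5):
    """Running mean of weight snapshots taken at start, start + frequency, ... (hypothetical helper)."""
    average, count = None, 0
    for step, weights in enumerate(snapshots):
        # Skip steps before the start and steps that are off the averaging grid.
        if step < start or (step - start) % frequency != 0:
            continue
        count += 1
        if average is None:
            average = list(weights)
        else:
            # Incremental mean update: avg += (w - avg) / count.
            average = [a + (w - a) / count for a, w in zip(average, weights)]
    return average


# Toy run: the single "weight" at step s equals s, so steps 10, 15 and 20 are averaged.
snapshots = [[float(step)] for step in range(21)]
print(swa_average(snapshots))
```

The function returns None when no snapshot at or after start is available.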

task

disjoint_multitask

DisjointMultitask.Config

Component: DisjointMultitask

class DisjointMultitask.Config

Bases: TaskBase.Config

All Attributes (including base classes)

features: FeatureConfig = FeatureConfig()
featurizer: Featurizer.Config = SimpleFeaturizer.Config()
data_handler: DisjointMultitaskDataHandler.Config = DisjointMultitaskDataHandler.Config()
trainer: Trainer.Config = Trainer.Config()
exporter: Optional[ModelExporter.Config] = None
tasks: dict[str, Task_Deprecated.Config]
task_weights: dict[str, float] = {}
target_task_name: Optional[str] = None
metric_reporter: DisjointMultitaskMetricReporter.Config = DisjointMultitaskMetricReporter.Config()

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

NewDisjointMultitask.Config

Component: NewDisjointMultitask

class NewDisjointMultitask.Config

Bases: _NewTask.Config

All Attributes (including base classes)

data: DisjointMultitaskData.Config = DisjointMultitaskData.Config()
trainer: TaskTrainer.Config = TaskTrainer.Config()
tasks: dict[str, NewTask.Config] = {}
task_weights: dict[str, float] = {}
target_task_name: Optional[str] = None
metric_reporter: DisjointMultitaskMetricReporter.Config = DisjointMultitaskMetricReporter.Config()

Default JSON

{
    "data": {
        "sampler": {
            "RoundRobinBatchSampler": {
                "iter_to_set_epoch": ""
            }
        },
        "test_key": null
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "tasks": {},
    "task_weights": {},
    "target_task_name": null,
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false,
        "use_subtask_select_metric": false
    }
}
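
The default sampler above is a RoundRobinBatchSampler. As a rough illustration of round-robin sampling across disjoint tasks, the sketch below takes one batch from each task in turn. The stopping rule (stop as soon as any task runs out of batches) is a simplifying assumption and may not match how the sampler defines an epoch. The helper name round_robin is illustrative.

```python
from itertools import cycle


def round_robin(batches_by_task):
    """Yield (task_name, batch) pairs, alternating between tasks (hypothetical helper)."""
    iterators = {name: iter(batches) for name, batches in batches_by_task.items()}
    # Visit the tasks in insertion order, over and over.
    for name in cycle(iterators):
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # Assumed stopping rule: end once any task is exhausted.
            return


batches = {"intent": [1, 2], "slots": [10, 20, 30]}
print(list(round_robin(batches)))
```

Each yielded pair carries the task name so that the trainer can route the batch to the matching task.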

new_task

NewTask.Config

Component: NewTask

class NewTask.Config

Bases: _NewTask.Config

All Attributes (including base classes)

Subclasses
  • BertPairRegressionTask.Config
  • DocumentClassificationTask.Config
  • DocumentRegressionTask.Config
  • EnsembleTask.Config
  • IntentSlotTask.Config
  • LMTask.Config
  • MaskedLMTask.Config
  • NewBertClassificationTask.Config
  • NewBertPairClassificationTask.Config
  • PairwiseClassificationTask.Config
  • QueryDocumentPairwiseRankingTask.Config
  • RoBERTaNERTask.Config
  • SemanticParsingTask.Config
  • SeqNNTask.Config
  • SquadQATask.Config
  • WordTaggingTask.Config

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

_NewTask.Config

Component: _NewTask

class _NewTask.Config

Bases: ConfigBase

All Attributes (including base classes)

Subclasses
  • NewDisjointMultitask.Config
  • NewTask.Config
  • BertPairRegressionTask.Config
  • DocumentClassificationTask.Config
  • DocumentRegressionTask.Config
  • EnsembleTask.Config
  • IntentSlotTask.Config
  • LMTask.Config
  • MaskedLMTask.Config
  • NewBertClassificationTask.Config
  • NewBertPairClassificationTask.Config
  • PairwiseClassificationTask.Config
  • QueryDocumentPairwiseRankingTask.Config
  • RoBERTaNERTask.Config
  • SemanticParsingTask.Config
  • SeqNNTask.Config
  • SquadQATask.Config
  • WordTaggingTask.Config

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    }
}
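
A task config usually changes only a few of the defaults listed above. The sketch below shows one way to think about that: recursively merging a small set of overrides onto a dictionary of defaults. The merge_config helper is hypothetical and not part of PyText, the defaults dictionary is a trimmed copy of the JSON above, and the override values are made up for illustration.

```python
import copy
import json


def merge_config(defaults, overrides):
    """Recursively overlay `overrides` on `defaults` (hypothetical helper, not a PyText API)."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so sibling keys keep their defaults.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed copy of the default JSON shown above.
defaults = {
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "optimizer": {"Adam": {"lr": 0.001, "weight_decay": 1e-05, "eps": 1e-08}},
        }
    }
}
# Illustrative overrides: fewer epochs and a larger learning rate.
overrides = {"trainer": {"TaskTrainer": {"epochs": 3, "optimizer": {"Adam": {"lr": 0.01}}}}}

config = merge_config(defaults, overrides)
print(json.dumps(config, indent=4))
```

Keys that are not overridden, such as weight_decay and eps, keep their default values, and the defaults dictionary itself is left unchanged.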

task

TaskBase.Config

Component: TaskBase

class TaskBase.Config

Bases: ConfigBase

All Attributes (including base classes)

features: FeatureConfig = FeatureConfig()
featurizer: Featurizer.Config = SimpleFeaturizer.Config()
data_handler: DataHandler.Config
trainer: Trainer.Config = Trainer.Config()
exporter: Optional[ModelExporter.Config] = None

Subclasses
  • DisjointMultitask.Config
  • Task_Deprecated.Config

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

Task_Deprecated.Config

Component: Task_Deprecated

class Task_Deprecated.Config

Bases: TaskBase.Config

All Attributes (including base classes)

features: FeatureConfig = FeatureConfig()
featurizer: Featurizer.Config = SimpleFeaturizer.Config()
data_handler: DataHandler.Config
trainer: Trainer.Config = Trainer.Config()
exporter: Optional[ModelExporter.Config] = None

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

tasks

BertPairRegressionTask.Config

Component: BertPairRegressionTask

class BertPairRegressionTask.Config

Bases: DocumentRegressionTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "BERTTensorizer": {
                    "is_input": true,
                    "columns": [
                        "text1",
                        "text2"
                    ],
                    "tokenizer": {
                        "WordPieceTokenizer": {
                            "basic_tokenizer": {
                                "split_regex": "\\s+",
                                "lowercase": true
                            },
                            "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                        }
                    },
                    "base_tokenizer": null,
                    "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                    "max_seq_len": 128
                }
            },
            "labels": {
                "is_input": false,
                "column": "label",
                "rescale_range": null
            }
        },
        "encoder": {
            "HuggingFaceBertSentenceEncoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "output_dropout": 0.4,
                "embedding_dim": 768,
                "pooling": "cls_token",
                "export": false,
                "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
                "load_weights": true
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {},
            "squash_to_unit_range": false
        }
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false
    }
}

DocumentClassificationTask.Config

Component: DocumentClassificationTask

class DocumentClassificationTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Subclasses
  • NewBertClassificationTask.Config
  • NewBertPairClassificationTask.Config

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "DocModel": {
            "inputs": {
                "tokens": {
                    "is_input": true,
                    "column": "text",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "add_bos_token": false,
                    "add_eos_token": false,
                    "use_eos_token_for_bos": false,
                    "max_seq_len": null,
                    "vocab": {
                        "build_from_data": true,
                        "size_from_data": 0,
                        "vocab_files": []
                    },
                    "vocab_file_delimiter": " "
                },
                "dense": null,
                "labels": {
                    "LabelTensorizer": {
                        "is_input": false,
                        "column": "label",
                        "allow_unknown": false,
                        "pad_in_vocab": false,
                        "label_vocab": null
                    }
                }
            },
            "embedding": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "embed_dim": 100,
                "embedding_init_strategy": "random",
                "embedding_init_range": null,
                "export_input_names": [
                    "tokens_vals"
                ],
                "pretrained_embeddings_path": "",
                "vocab_file": "",
                "vocab_size": 0,
                "vocab_from_train_data": true,
                "vocab_from_all_data": false,
                "vocab_from_pretrained_embeddings": false,
                "lowercase_tokens": true,
                "min_freq": 1,
                "mlp_layer_dims": [],
                "padding_idx": null,
                "cpu_only": false,
                "skip_header": true,
                "delimiter": " "
            },
            "representation": {
                "BiLSTMDocAttention": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm": {
                        "load_path": null,
                        "save_path": null,
                        "freeze": false,
                        "shared_module_key": null,
                        "dropout": 0.4,
                        "lstm_dim": 32,
                        "num_layers": 1,
                        "bidirectional": true,
                        "pack_sequence": true
                    },
                    "pooling": {
                        "SelfAttention": {
                            "attn_dimension": 64,
                            "dropout": 0.4
                        }
                    },
                    "mlp_decoder": null
                }
            },
            "decoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "hidden_dims": [],
                "out_dim": null,
                "layer_norm": false,
                "dropout": 0.0,
                "activation": "relu"
            },
            "output_layer": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "loss": {
                    "CrossEntropyLoss": {}
                },
                "label_weights": null
            }
        }
    },
    "metric_reporter": {
        "ClassificationMetricReporter": {
            "output_path": "/tmp/test_out.txt",
            "pep_format": false,
            "model_select_metric": "accuracy",
            "target_label": null,
            "text_column_names": [
                "text"
            ],
            "additional_column_names": [],
            "recall_at_precision_thresholds": [
                0.2,
                0.4,
                0.6,
                0.8,
                0.9
            ]
        }
    }
}
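
The default tokenizer settings above are split_regex "\\s+" and lowercase true. As a minimal sketch of what those two settings imply, the function below lowercases the text and splits it on runs of whitespace with Python's re module. It is an illustration only. The helper name tokenize is hypothetical, and the real Tokenizer may do more, such as tracking character offsets.

```python
import re


def tokenize(text: str, split_regex: str = r"\s+", lowercase: bool = True):
    """Split text on a regular expression, optionally lowercasing first (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    # re.split leaves empty strings at leading or trailing separators, so drop them.
    return [token for token in re.split(split_regex, text) if token]


print(tokenize("  Order me a  Coffee "))
```

Note that the JSON value "\\s+" is the escaped form of the regular expression \s+, which is why the Python default uses a raw string.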
DocumentRegressionTask.Config

Component: DocumentRegressionTask

class DocumentRegressionTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Subclasses
  • BertPairRegressionTask.Config

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "is_input": true,
                "column": "text",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " "
            },
            "dense": null,
            "labels": {
                "is_input": false,
                "column": "label",
                "rescale_range": null
            }
        },
        "embedding": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "embed_dim": 100,
            "embedding_init_strategy": "random",
            "embedding_init_range": null,
            "export_input_names": [
                "tokens_vals"
            ],
            "pretrained_embeddings_path": "",
            "vocab_file": "",
            "vocab_size": 0,
            "vocab_from_train_data": true,
            "vocab_from_all_data": false,
            "vocab_from_pretrained_embeddings": false,
            "lowercase_tokens": true,
            "min_freq": 1,
            "mlp_layer_dims": [],
            "padding_idx": null,
            "cpu_only": false,
            "skip_header": true,
            "delimiter": " "
        },
        "representation": {
            "BiLSTMDocAttention": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm_dim": 32,
                    "num_layers": 1,
                    "bidirectional": true,
                    "pack_sequence": true
                },
                "pooling": {
                    "SelfAttention": {
                        "attn_dimension": 64,
                        "dropout": 0.4
                    }
                },
                "mlp_decoder": null
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {},
            "squash_to_unit_range": false
        }
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false
    }
}

EnsembleTask.Config

Component: EnsembleTask

class EnsembleTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Warning

This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.

IntentSlotTask.Config

Component: IntentSlotTask

class IntentSlotTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "IntentSlotModel": {
            "inputs": {
                "tokens": {
                    "is_input": true,
                    "column": "text",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "add_bos_token": false,
                    "add_eos_token": false,
                    "use_eos_token_for_bos": false,
                    "max_seq_len": null,
                    "vocab": {
                        "build_from_data": true,
                        "size_from_data": 0,
                        "vocab_files": []
                    },
                    "vocab_file_delimiter": " "
                },
                "word_labels": {
                    "is_input": false,
                    "slot_column": "slots",
                    "text_column": "text",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "allow_unknown": true
                },
                "doc_labels": {
                    "LabelTensorizer": {
                        "is_input": false,
                        "column": "label",
                        "allow_unknown": true,
                        "pad_in_vocab": false,
                        "label_vocab": null
                    }
                },
                "doc_weight": null,
                "word_weight": null
            },
            "word_embedding": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "embed_dim": 100,
                "embedding_init_strategy": "random",
                "embedding_init_range": null,
                "export_input_names": [
                    "tokens_vals"
                ],
                "pretrained_embeddings_path": "",
                "vocab_file": "",
                "vocab_size": 0,
                "vocab_from_train_data": true,
                "vocab_from_all_data": false,
                "vocab_from_pretrained_embeddings": false,
                "lowercase_tokens": true,
                "min_freq": 1,
                "mlp_layer_dims": [],
                "padding_idx": null,
                "cpu_only": false,
                "skip_header": true,
                "delimiter": " "
            },
            "representation": {
                "BiLSTMDocSlotAttention": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm": {
                        "BiLSTM": {
                            "load_path": null,
                            "save_path": null,
                            "freeze": false,
                            "shared_module_key": null,
                            "dropout": 0.4,
                            "lstm_dim": 32,
                            "num_layers": 1,
                            "bidirectional": true,
                            "pack_sequence": true
                        }
                    },
                    "pooling": null,
                    "slot_attention": null,
                    "doc_mlp_layers": 0,
                    "word_mlp_layers": 0
                }
            },
            "output_layer": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "doc_output": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "loss": {
                        "CrossEntropyLoss": {}
                    },
                    "label_weights": null
                },
                "word_output": {
                    "WordTaggingOutputLayer": {
                        "load_path": null,
                        "save_path": null,
                        "freeze": false,
                        "shared_module_key": null,
                        "loss": {
                            "CrossEntropyLoss": {}
                        },
                        "label_weights": {},
                        "ignore_pad_in_loss": true
                    }
                }
            },
            "decoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "use_doc_probs_in_word": false,
                "doc_decoder": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "hidden_dims": [],
                    "out_dim": null,
                    "layer_norm": false,
                    "dropout": 0.0,
                    "activation": "relu"
                },
                "word_decoder": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "hidden_dims": [],
                    "out_dim": null,
                    "layer_norm": false,
                    "dropout": 0.0,
                    "activation": "relu"
                }
            },
            "default_doc_loss_weight": 0.2,
            "default_word_loss_weight": 0.5
        }
    },
    "metric_reporter": {
        "IntentSlotMetricReporter": {
            "output_path": "/tmp/test_out.txt",
            "pep_format": false
        }
    }
}
LMTask.Config

Component: LMTask

class LMTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "is_input": true,
                "column": "text",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "add_bos_token": true,
                "add_eos_token": true,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " "
            }
        },
        "embedding": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "embed_dim": 100,
            "embedding_init_strategy": "random",
            "embedding_init_range": null,
            "export_input_names": [
                "tokens_vals"
            ],
            "pretrained_embeddings_path": "",
            "vocab_file": "",
            "vocab_size": 0,
            "vocab_from_train_data": true,
            "vocab_from_all_data": false,
            "vocab_from_pretrained_embeddings": false,
            "lowercase_tokens": true,
            "min_freq": 1,
            "mlp_layer_dims": [],
            "padding_idx": null,
            "cpu_only": false,
            "skip_header": true,
            "delimiter": " "
        },
        "representation": {
            "BiLSTM": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm_dim": 32,
                "num_layers": 1,
                "bidirectional": false,
                "pack_sequence": true
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {}
        },
        "tied_weights": false,
        "stateful": false,
        "caffe2_format": "predictor"
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false,
        "aggregate_metrics": true,
        "perplexity_type": "median"
    }
}
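As a rough illustration of what the `tokens` input defaults above imply (`split_regex` of `\s+`, `lowercase` set to true, and both `add_bos_token` and `add_eos_token` set to true), the following sketch tokenizes one line of text. It is not PyText's tokenizer: the function name `tokenize` and the `<bos>` / `<eos>` marker strings are placeholders chosen for this example.

```python
import re

# Placeholder marker strings; PyText's actual special tokens may differ.
BOS = "<bos>"
EOS = "<eos>"


def tokenize(text, split_regex=r"\s+", lowercase=True,
             add_bos_token=True, add_eos_token=True):
    """Split text on split_regex, optionally lowercasing and adding markers."""
    if lowercase:
        text = text.lower()
    # Drop empty strings produced by leading or trailing whitespace.
    tokens = [tok for tok in re.split(split_regex, text) if tok]
    if add_bos_token:
        tokens.insert(0, BOS)
    if add_eos_token:
        tokens.append(EOS)
    return tokens


print(tokenize("The  quick Brown fox"))
```

Setting `add_bos_token=False` and `add_eos_token=False` in this sketch mirrors the defaults used by the pairwise tasks further down, where no sentence markers are added.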
MaskedLMTask.Config

Component: MaskedLMTask

class MaskedLMTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "PackedLMData": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true,
            "max_seq_len": 128
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "BERTTensorizerBase": {
                    "is_input": true,
                    "columns": [
                        "text"
                    ],
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "base_tokenizer": null,
                    "vocab_file": "",
                    "max_seq_len": 128
                }
            }
        },
        "encoder": {
            "TransformerSentenceEncoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "output_dropout": 0.4,
                "embedding_dim": 768,
                "pooling": "cls_token",
                "export": false,
                "dropout": 0.1,
                "attention_dropout": 0.1,
                "activation_dropout": 0.1,
                "ffn_embedding_dim": 3072,
                "num_encoder_layers": 6,
                "num_attention_heads": 8,
                "num_segments": 2,
                "use_position_embeddings": true,
                "offset_positions_by_padding": true,
                "apply_bert_init": true,
                "encoder_normalize_before": true,
                "activation_fn": "relu",
                "projection_dim": 0,
                "max_seq_len": 128,
                "multilingual": false,
                "freeze_embeddings": false,
                "n_trans_layers_to_freeze": 0,
                "use_torchscript": false
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {}
        },
        "mask_prob": 0.15,
        "mask_bos": false,
        "masking_strategy": "random",
        "tie_weights": true
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false,
        "aggregate_metrics": true,
        "perplexity_type": "median"
    }
}
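The model defaults above set `mask_prob` to 0.15, `masking_strategy` to `random`, and `mask_bos` to false. One plausible reading is that each token is replaced by a mask token independently with probability 0.15, and that the first position is left alone. The sketch below shows that reading only; it is not PyText's implementation, and `random_mask`, the `<mask>` string, and the assumption that position 0 holds the beginning-of-sequence token are all made up for the example.

```python
import random

MASK = "<mask>"  # placeholder; the real mask token depends on the vocabulary


def random_mask(tokens, mask_prob=0.15, mask_bos=False, rng=None):
    """Return (masked_tokens, mask_flags), masking each position independently."""
    if rng is None:
        rng = random.Random()
    masked, flags = [], []
    for i, tok in enumerate(tokens):
        # Assumption: position 0 holds the beginning-of-sequence token.
        can_mask = mask_bos or i != 0
        hit = can_mask and rng.random() < mask_prob
        masked.append(MASK if hit else tok)
        flags.append(hit)
    return masked, flags


tokens = ["<bos>", "the", "cat", "sat", "down"]
masked, flags = random_mask(tokens, rng=random.Random(0))
print(masked)
```

Passing a seeded `random.Random` keeps the sketch reproducible; with the default `mask_prob`, roughly one token in seven is masked on average.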
NewBertClassificationTask.Config

Component: NewBertClassificationTask

class NewBertClassificationTask.Config

Bases: DocumentClassificationTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "BERTTensorizer": {
                    "is_input": true,
                    "columns": [
                        "text"
                    ],
                    "tokenizer": {
                        "WordPieceTokenizer": {
                            "basic_tokenizer": {
                                "split_regex": "\\s+",
                                "lowercase": true
                            },
                            "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                        }
                    },
                    "base_tokenizer": null,
                    "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                    "max_seq_len": 128
                }
            },
            "dense": null,
            "labels": {
                "LabelTensorizer": {
                    "is_input": false,
                    "column": "label",
                    "allow_unknown": false,
                    "pad_in_vocab": false,
                    "label_vocab": null
                }
            },
            "num_tokens": {
                "is_input": false,
                "names": [
                    "tokens"
                ],
                "indexes": [
                    2
                ]
            }
        },
        "encoder": {
            "HuggingFaceBertSentenceEncoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "output_dropout": 0.4,
                "embedding_dim": 768,
                "pooling": "cls_token",
                "export": false,
                "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
                "load_weights": true
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        }
    },
    "metric_reporter": {
        "ClassificationMetricReporter": {
            "output_path": "/tmp/test_out.txt",
            "pep_format": false,
            "model_select_metric": "accuracy",
            "target_label": null,
            "text_column_names": [
                "text"
            ],
            "additional_column_names": [],
            "recall_at_precision_thresholds": [
                0.2,
                0.4,
                0.6,
                0.8,
                0.9
            ]
        }
    }
}
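The defaults above point `wordpiece_vocab_path`, `vocab_file`, and `bert_cpt_dir` at paths under `/mnt/vol/`, which are unlikely to exist on other machines, so these fields usually need overriding. The sketch below overlays a small set of overrides on a defaults dictionary. The helper `deep_merge` and the path `/path/to/bert/` are hypothetical and not part of PyText, and `defaults` is an abbreviated fragment of the JSON above.

```python
import copy
import json


def deep_merge(defaults, overrides):
    """Return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Abbreviated fragment of the default "model" section shown above.
defaults = {
    "encoder": {
        "HuggingFaceBertSentenceEncoder": {
            "embedding_dim": 768,
            "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
            "load_weights": True,
        }
    }
}

# Hypothetical local checkpoint directory.
overrides = {
    "encoder": {
        "HuggingFaceBertSentenceEncoder": {"bert_cpt_dir": "/path/to/bert/"}
    }
}

model_config = deep_merge(defaults, overrides)
print(json.dumps(model_config, indent=4))
```

Only the overridden leaf changes; sibling keys such as `embedding_dim` keep their default values, and `defaults` itself is left unmodified.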
NewBertPairClassificationTask.Config

Component: NewBertPairClassificationTask

class NewBertPairClassificationTask.Config

Bases: DocumentClassificationTask.Config

All Attributes (including base classes)

data: Data.Config = Data.Config()
trainer: TaskTrainer.Config = TaskTrainer.Config()
model: NewBertModel.Config = NewBertModel.Config(inputs=BertModelInput(tokens=BERTTensorizer.Config(columns=['text1', 'text2'], max_seq_len=128)))
metric_reporter: ClassificationMetricReporter.Config = ClassificationMetricReporter.Config(text_column_names=['text1', 'text2'])

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "BERTTensorizer": {
                    "is_input": true,
                    "columns": [
                        "text1",
                        "text2"
                    ],
                    "tokenizer": {
                        "WordPieceTokenizer": {
                            "basic_tokenizer": {
                                "split_regex": "\\s+",
                                "lowercase": true
                            },
                            "wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
                        }
                    },
                    "base_tokenizer": null,
                    "vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
                    "max_seq_len": 128
                }
            },
            "dense": null,
            "labels": {
                "LabelTensorizer": {
                    "is_input": false,
                    "column": "label",
                    "allow_unknown": false,
                    "pad_in_vocab": false,
                    "label_vocab": null
                }
            },
            "num_tokens": {
                "is_input": false,
                "names": [
                    "tokens"
                ],
                "indexes": [
                    2
                ]
            }
        },
        "encoder": {
            "HuggingFaceBertSentenceEncoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "output_dropout": 0.4,
                "embedding_dim": 768,
                "pooling": "cls_token",
                "export": false,
                "bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
                "load_weights": true
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        }
    },
    "metric_reporter": {
        "ClassificationMetricReporter": {
            "output_path": "/tmp/test_out.txt",
            "pep_format": false,
            "model_select_metric": "accuracy",
            "target_label": null,
            "text_column_names": [
                "text1",
                "text2"
            ],
            "additional_column_names": [],
            "recall_at_precision_thresholds": [
                0.2,
                0.4,
                0.6,
                0.8,
                0.9
            ]
        }
    }
}
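Taken together, the `TSVDataSource` defaults (`delimiter` of `\t`, `quoted` set to false), the tensorizer `columns` of `text1` and `text2`, and the label `column` of `label` suggest input rows with three tab-separated fields. The sketch below reads such rows with Python's standard `csv` module. It is not PyText's data source: the column order in `FIELD_NAMES` is an assumption (the default config leaves `field_names` as null), and the sample rows are made up.

```python
import csv
import io

# Assumed column order; the default config leaves field_names unset.
FIELD_NAMES = ["label", "text1", "text2"]

# Made-up sample data standing in for a training file.
SAMPLE_TSV = (
    "entailment\tA man is eating.\tSomeone is eating.\n"
    "contradiction\tA man is eating.\tNobody is eating.\n"
)


def read_rows(handle, field_names=FIELD_NAMES, delimiter="\t", quoted=False):
    """Yield one dict per TSV row, keyed by field name."""
    # With quoted=False, quote characters are treated as ordinary text.
    quoting = csv.QUOTE_MINIMAL if quoted else csv.QUOTE_NONE
    reader = csv.DictReader(
        handle, fieldnames=field_names, delimiter=delimiter, quoting=quoting
    )
    for row in reader:
        yield dict(row)


rows = list(read_rows(io.StringIO(SAMPLE_TSV)))
print(rows[0]["text1"], "|", rows[0]["text2"])
```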
PairwiseClassificationTask.Config

Component: PairwiseClassificationTask

class PairwiseClassificationTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "PairwiseModel": {
            "inputs": {
                "tokens1": {
                    "is_input": true,
                    "column": "text1",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "add_bos_token": false,
                    "add_eos_token": false,
                    "use_eos_token_for_bos": false,
                    "max_seq_len": null,
                    "vocab": {
                        "build_from_data": true,
                        "size_from_data": 0,
                        "vocab_files": []
                    },
                    "vocab_file_delimiter": " "
                },
                "tokens2": {
                    "is_input": true,
                    "column": "text2",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "add_bos_token": false,
                    "add_eos_token": false,
                    "use_eos_token_for_bos": false,
                    "max_seq_len": null,
                    "vocab": {
                        "build_from_data": true,
                        "size_from_data": 0,
                        "vocab_files": []
                    },
                    "vocab_file_delimiter": " "
                },
                "labels": {
                    "LabelTensorizer": {
                        "is_input": false,
                        "column": "label",
                        "allow_unknown": false,
                        "pad_in_vocab": false,
                        "label_vocab": null
                    }
                }
            },
            "decoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "hidden_dims": [],
                "out_dim": null,
                "layer_norm": false,
                "dropout": 0.0,
                "activation": "relu"
            },
            "output_layer": {
                "ClassificationOutputLayer": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "loss": {
                        "CrossEntropyLoss": {}
                    },
                    "label_weights": null
                }
            },
            "encode_relations": true,
            "embedding": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "embed_dim": 100,
                "embedding_init_strategy": "random",
                "embedding_init_range": null,
                "export_input_names": [
                    "tokens_vals"
                ],
                "pretrained_embeddings_path": "",
                "vocab_file": "",
                "vocab_size": 0,
                "vocab_from_train_data": true,
                "vocab_from_all_data": false,
                "vocab_from_pretrained_embeddings": false,
                "lowercase_tokens": true,
                "min_freq": 1,
                "mlp_layer_dims": [],
                "padding_idx": null,
                "cpu_only": false,
                "skip_header": true,
                "delimiter": " "
            },
            "representation": {
                "BiLSTMDocAttention": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm": {
                        "load_path": null,
                        "save_path": null,
                        "freeze": false,
                        "shared_module_key": null,
                        "dropout": 0.4,
                        "lstm_dim": 32,
                        "num_layers": 1,
                        "bidirectional": true,
                        "pack_sequence": true
                    },
                    "pooling": {
                        "SelfAttention": {
                            "attn_dimension": 64,
                            "dropout": 0.4
                        }
                    },
                    "mlp_decoder": null
                }
            },
            "shared_representations": true
        }
    },
    "metric_reporter": {
        "ClassificationMetricReporter": {
            "output_path": "/tmp/test_out.txt",
            "pep_format": false,
            "model_select_metric": "accuracy",
            "target_label": null,
            "text_column_names": [
                "text1",
                "text2"
            ],
            "additional_column_names": [],
            "recall_at_precision_thresholds": [
                0.2,
                0.4,
                0.6,
                0.8,
                0.9
            ]
        }
    }
}
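The `recall_at_precision_thresholds` list above names the precision levels (0.2 through 0.9) at which recall is reported. A common definition of recall at precision p is the highest recall reached at any score cutoff whose precision is at least p. The sketch below computes that for binary labels under this assumed definition; the way `ClassificationMetricReporter` computes it may differ, and the function name and sample data are made up. It also assumes the scores are distinct.

```python
def recall_at_precision(scores, labels, thresholds):
    """Map each precision threshold to the best recall that meets it.

    scores: predicted scores for the positive class (assumed distinct).
    labels: 1 for positive examples, 0 for negative ones.
    """
    total_pos = sum(labels)
    # Sweep cutoffs from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = {t: 0.0 for t in thresholds}
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        precision = true_pos / rank
        recall = true_pos / total_pos if total_pos else 0.0
        for t in thresholds:
            if precision >= t:
                best[t] = max(best[t], recall)
    return best


result = recall_at_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    labels=[1, 0, 1, 1],
    thresholds=[0.2, 0.4, 0.6, 0.8, 0.9],
)
print(result)
```

In this sample, accepting all four examples gives precision 0.75 and recall 1.0, so every threshold up to 0.6 reports full recall, while the 0.8 and 0.9 thresholds are met only by the single top-scored example.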
QueryDocumentPairwiseRankingTask.Config

Component: QueryDocumentPairwiseRankingTask

class QueryDocumentPairwiseRankingTask.Config

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "pos_response": {
                "is_input": true,
                "column": "pos_response",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " "
            },
            "neg_response": {
                "is_input": true,
                "column": "neg_response",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " "
            },
            "query": {
                "is_input": true,
                "column": "query",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " "
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": []
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "margin": 1.0
            }
        },
        "encode_relations": true,
        "embedding": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "embed_dim": 100,
            "embedding_init_strategy": "random",
            "embedding_init_range": null,
            "export_input_names": [
                "tokens_vals"
            ],
            "pretrained_embeddings_path": "",
            "vocab_file": "",
            "vocab_size": 0,
            "vocab_from_train_data": true,
            "vocab_from_all_data": false,
            "vocab_from_pretrained_embeddings": false,
            "lowercase_tokens": true,
            "min_freq": 1,
            "mlp_layer_dims": [],
            "padding_idx": null,
            "cpu_only": false,
            "skip_header": true,
            "delimiter": " "
        },
        "representation": {
            "BiLSTMDocAttention": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "lstm": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm_dim": 32,
                    "num_layers": 1,
                    "bidirectional": true,
                    "pack_sequence": true
                },
                "pooling": {
                    "SelfAttention": {
                        "attn_dimension": 64,
                        "dropout": 0.4
                    }
                },
                "mlp_decoder": null
            }
        },
        "shared_representations": true,
        "decoder_output_dim": 64
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false
    }
}

RoBERTaNERTask.Config

Component: RoBERTaNERTask

class RoBERTaNERTask.Config[source]

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "is_input": true,
                "columns": [
                    "text"
                ],
                "tokenizer": {
                    "GPT2BPETokenizer": {
                        "bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
                        "bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
                    }
                },
                "base_tokenizer": null,
                "vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
                "max_seq_len": 256,
                "labels_columns": [
                    "label"
                ],
                "labels": []
            }
        },
        "encoder": {
            "RoBERTaEncoderJit": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "output_dropout": 0.4,
                "embedding_dim": 768,
                "pooling": "cls_token",
                "export": false,
                "pretrained_encoder": {
                    "load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null
                }
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": {},
            "ignore_pad_in_loss": true
        }
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false
    }
}

SemanticParsingTask.Config

Component: SemanticParsingTask

class SemanticParsingTask.Config[source]

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "real_trainer": {
            "TaskTrainer": {
                "epochs": 10,
                "early_stop_after": 0,
                "max_clip_norm": null,
                "report_train_metrics": true,
                "target_time_limit_seconds": null,
                "do_eval": true,
                "load_best_model_after_train": true,
                "num_samples_to_log_progress": 1000,
                "num_accumulated_batches": 1,
                "num_batches_per_epoch": null,
                "optimizer": {
                    "Adam": {
                        "lr": 0.001,
                        "weight_decay": 1e-05,
                        "eps": 1e-08
                    }
                },
                "scheduler": null,
                "sparsifier": null,
                "fp16_args": {
                    "FP16OptimizerFairseq": {
                        "init_loss_scale": 128,
                        "scale_window": null,
                        "scale_tolerance": 0.0,
                        "threshold_loss_scale": null,
                        "min_loss_scale": 0.0001
                    }
                }
            }
        },
        "num_workers": 1
    },
    "model": {
        "version": 2,
        "lstm": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "dropout": 0.4,
            "lstm_dim": 32,
            "num_layers": 1,
            "bidirectional": true,
            "pack_sequence": true
        },
        "ablation": {
            "use_buffer": true,
            "use_stack": true,
            "use_action": true,
            "use_last_open_NT_feature": false
        },
        "constraints": {
            "intent_slot_nesting": true,
            "ignore_loss_for_unsupported": false,
            "no_slots_inside_unsupported": true
        },
        "max_open_NT": 10,
        "dropout": 0.1,
        "beam_size": 1,
        "top_k": 1,
        "compositional_type": "blstm",
        "inputs": {
            "tokens": {
                "is_input": true,
                "column": "tokenized_text",
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                },
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "max_seq_len": null,
                "vocab": {
                    "build_from_data": true,
                    "size_from_data": 0,
                    "vocab_files": []
                },
                "vocab_file_delimiter": " "
            },
            "actions": {
                "is_input": true,
                "column": "seqlogical"
            }
        },
        "embedding": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "embed_dim": 100,
            "embedding_init_strategy": "random",
            "embedding_init_range": null,
            "export_input_names": [
                "tokens_vals"
            ],
            "pretrained_embeddings_path": "",
            "vocab_file": "",
            "vocab_size": 0,
            "vocab_from_train_data": true,
            "vocab_from_all_data": false,
            "vocab_from_pretrained_embeddings": false,
            "lowercase_tokens": true,
            "min_freq": 1,
            "mlp_layer_dims": [],
            "padding_idx": null,
            "cpu_only": false,
            "skip_header": true,
            "delimiter": " "
        }
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false,
        "text_column_name": "tokenized_text"
    }
}

SeqNNTask.Config

Component: SeqNNTask

class SeqNNTask.Config[source]

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "inputs": {
            "tokens": {
                "is_input": true,
                "column": "text_seq",
                "max_seq_len": null,
                "add_bos_token": false,
                "add_eos_token": false,
                "use_eos_token_for_bos": false,
                "add_bol_token": false,
                "add_eol_token": false,
                "use_eol_token_for_bol": false,
                "tokenizer": {
                    "Tokenizer": {
                        "split_regex": "\\s+",
                        "lowercase": true
                    }
                }
            },
            "dense": null,
            "labels": {
                "LabelTensorizer": {
                    "is_input": false,
                    "column": "label",
                    "allow_unknown": false,
                    "pad_in_vocab": false,
                    "label_vocab": null
                }
            }
        },
        "embedding": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "embed_dim": 100,
            "embedding_init_strategy": "random",
            "embedding_init_range": null,
            "export_input_names": [
                "tokens_vals"
            ],
            "pretrained_embeddings_path": "",
            "vocab_file": "",
            "vocab_size": 0,
            "vocab_from_train_data": true,
            "vocab_from_all_data": false,
            "vocab_from_pretrained_embeddings": false,
            "lowercase_tokens": true,
            "min_freq": 1,
            "mlp_layer_dims": [],
            "padding_idx": null,
            "cpu_only": false,
            "skip_header": true,
            "delimiter": " "
        },
        "representation": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "doc_representation": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "dropout": 0.4,
                "cnn": {
                    "kernel_num": 100,
                    "kernel_sizes": [
                        3,
                        4
                    ],
                    "weight_norm": false,
                    "dilated": false,
                    "causal": false
                }
            },
            "seq_representation": {
                "BiLSTMDocAttention": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "dropout": 0.4,
                    "lstm": {
                        "load_path": null,
                        "save_path": null,
                        "freeze": false,
                        "shared_module_key": null,
                        "dropout": 0.4,
                        "lstm_dim": 32,
                        "num_layers": 1,
                        "bidirectional": true,
                        "pack_sequence": true
                    },
                    "pooling": {
                        "SelfAttention": {
                            "attn_dimension": 64,
                            "dropout": 0.4
                        }
                    },
                    "mlp_decoder": null
                }
            }
        },
        "decoder": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "hidden_dims": [],
            "out_dim": null,
            "layer_norm": false,
            "dropout": 0.0,
            "activation": "relu"
        },
        "output_layer": {
            "load_path": null,
            "save_path": null,
            "freeze": false,
            "shared_module_key": null,
            "loss": {
                "CrossEntropyLoss": {}
            },
            "label_weights": null
        }
    },
    "metric_reporter": {
        "ClassificationMetricReporter": {
            "output_path": "/tmp/test_out.txt",
            "pep_format": false,
            "model_select_metric": "accuracy",
            "target_label": null,
            "text_column_names": [
                "text_seq"
            ],
            "additional_column_names": [],
            "recall_at_precision_thresholds": [
                0.2,
                0.4,
                0.6,
                0.8,
                0.9
            ]
        }
    }
}

SquadQATask.Config

Component: SquadQATask

class SquadQATask.Config[source]

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "DrQAModel": {
            "inputs": {
                "squad_input": {
                    "SquadTensorizer": {
                        "is_input": true,
                        "column": "text",
                        "tokenizer": {
                            "Tokenizer": {
                                "split_regex": "\\W+",
                                "lowercase": true
                            }
                        },
                        "add_bos_token": false,
                        "add_eos_token": false,
                        "use_eos_token_for_bos": false,
                        "max_seq_len": null,
                        "vocab": {
                            "build_from_data": true,
                            "size_from_data": 0,
                            "vocab_files": []
                        },
                        "vocab_file_delimiter": " ",
                        "doc_column": "doc",
                        "ques_column": "question",
                        "answers_column": "answers",
                        "answer_starts_column": "answer_starts",
                        "max_ques_seq_len": 64,
                        "max_doc_seq_len": 256
                    }
                },
                "has_answer": {
                    "LabelTensorizer": {
                        "is_input": false,
                        "column": "has_answer",
                        "allow_unknown": false,
                        "pad_in_vocab": false,
                        "label_vocab": null
                    }
                }
            },
            "dropout": 0.4,
            "embedding": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "embed_dim": 300,
                "embedding_init_strategy": "random",
                "embedding_init_range": null,
                "export_input_names": [
                    "tokens_vals"
                ],
                "pretrained_embeddings_path": "/mnt/vol/pytext/users/kushall/pretrained/glove.840B.300d.txt",
                "vocab_file": "",
                "vocab_size": 0,
                "vocab_from_train_data": true,
                "vocab_from_all_data": false,
                "vocab_from_pretrained_embeddings": true,
                "lowercase_tokens": true,
                "min_freq": 1,
                "mlp_layer_dims": [],
                "padding_idx": null,
                "cpu_only": false,
                "skip_header": true,
                "delimiter": " "
            },
            "ques_rnn": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "hidden_size": 32,
                "num_layers": 1,
                "dropout": 0.4,
                "bidirectional": true,
                "rnn_type": "lstm",
                "concat_layers": true
            },
            "doc_rnn": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "hidden_size": 32,
                "num_layers": 1,
                "dropout": 0.4,
                "bidirectional": true,
                "rnn_type": "lstm",
                "concat_layers": true
            },
            "output_layer": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "loss": {
                    "CrossEntropyLoss": {}
                },
                "ignore_impossible": true,
                "pos_loss_weight": 0.5,
                "has_answer_loss_weight": 0.5,
                "false_label": "False",
                "max_answer_len": 30,
                "hard_weight": 0.0
            },
            "is_kd": false
        }
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false,
        "n_best_size": 5,
        "max_answer_length": 16,
        "ignore_impossible": true,
        "false_label": "False"
    }
}

WordTaggingTask.Config

Component: WordTaggingTask

class WordTaggingTask.Config[source]

Bases: NewTask.Config

All Attributes (including base classes)

Default JSON

{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "column_mapping": {},
                    "train_filename": null,
                    "test_filename": null,
                    "eval_filename": null,
                    "field_names": null,
                    "delimiter": "\t",
                    "quoted": false,
                    "drop_incomplete_rows": false
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 16,
                    "eval_batch_size": 16,
                    "test_batch_size": 16,
                    "pool_num_batches": 10000,
                    "num_shuffled_pools": 1
                }
            },
            "sort_key": null,
            "in_memory": true
        }
    },
    "trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "model": {
        "WordTaggingModel": {
            "inputs": {
                "tokens": {
                    "is_input": true,
                    "column": "text",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "add_bos_token": false,
                    "add_eos_token": false,
                    "use_eos_token_for_bos": false,
                    "max_seq_len": null,
                    "vocab": {
                        "build_from_data": true,
                        "size_from_data": 0,
                        "vocab_files": []
                    },
                    "vocab_file_delimiter": " "
                },
                "labels": {
                    "is_input": false,
                    "slot_column": "slots",
                    "text_column": "text",
                    "tokenizer": {
                        "Tokenizer": {
                            "split_regex": "\\s+",
                            "lowercase": true
                        }
                    },
                    "allow_unknown": false
                }
            },
            "embedding": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "embed_dim": 100,
                "embedding_init_strategy": "random",
                "embedding_init_range": null,
                "export_input_names": [
                    "tokens_vals"
                ],
                "pretrained_embeddings_path": "",
                "vocab_file": "",
                "vocab_size": 0,
                "vocab_from_train_data": true,
                "vocab_from_all_data": false,
                "vocab_from_pretrained_embeddings": false,
                "lowercase_tokens": true,
                "min_freq": 1,
                "mlp_layer_dims": [],
                "padding_idx": null,
                "cpu_only": false,
                "skip_header": true,
                "delimiter": " "
            },
            "representation": {
                "PassThroughRepresentation": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null
                }
            },
            "output_layer": {
                "WordTaggingOutputLayer": {
                    "load_path": null,
                    "save_path": null,
                    "freeze": false,
                    "shared_module_key": null,
                    "loss": {
                        "CrossEntropyLoss": {}
                    },
                    "label_weights": {},
                    "ignore_pad_in_loss": true
                }
            },
            "decoder": {
                "load_path": null,
                "save_path": null,
                "freeze": false,
                "shared_module_key": null,
                "hidden_dims": [],
                "out_dim": null,
                "layer_norm": false,
                "dropout": 0.0,
                "activation": "relu"
            }
        }
    },
    "metric_reporter": {
        "output_path": "/tmp/test_out.txt",
        "pep_format": false
    }
}

trainers

ensemble_trainer

EnsembleTrainer.Config

Component: EnsembleTrainer

class EnsembleTrainer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

Default JSON

{
    "real_trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    }
}

hogwild_trainer

HogwildTrainer.Config

Component: HogwildTrainer

class HogwildTrainer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

real_trainer: TaskTrainer.Config = TaskTrainer.Config()
num_workers: int = 1

Default JSON

{
    "real_trainer": {
        "TaskTrainer": {
            "epochs": 10,
            "early_stop_after": 0,
            "max_clip_norm": null,
            "report_train_metrics": true,
            "target_time_limit_seconds": null,
            "do_eval": true,
            "load_best_model_after_train": true,
            "num_samples_to_log_progress": 1000,
            "num_accumulated_batches": 1,
            "num_batches_per_epoch": null,
            "optimizer": {
                "Adam": {
                    "lr": 0.001,
                    "weight_decay": 1e-05,
                    "eps": 1e-08
                }
            },
            "scheduler": null,
            "sparsifier": null,
            "fp16_args": {
                "FP16OptimizerFairseq": {
                    "init_loss_scale": 128,
                    "scale_window": null,
                    "scale_tolerance": 0.0,
                    "threshold_loss_scale": null,
                    "min_loss_scale": 0.0001
                }
            }
        }
    },
    "num_workers": 1
}
HogwildTrainer_Deprecated.Config

Component: HogwildTrainer_Deprecated

class HogwildTrainer_Deprecated.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

real_trainer: Trainer.Config = Trainer.Config()
num_workers: int = 1

Default JSON

{
    "real_trainer": {
        "epochs": 10,
        "early_stop_after": 0,
        "max_clip_norm": null,
        "report_train_metrics": true,
        "target_time_limit_seconds": null,
        "do_eval": true,
        "load_best_model_after_train": true,
        "num_samples_to_log_progress": 1000,
        "num_accumulated_batches": 1,
        "num_batches_per_epoch": null,
        "optimizer": {
            "Adam": {
                "lr": 0.001,
                "weight_decay": 1e-05,
                "eps": 1e-08
            }
        },
        "scheduler": null,
        "sparsifier": null,
        "fp16_args": {
            "FP16OptimizerFairseq": {
                "init_loss_scale": 128,
                "scale_window": null,
                "scale_tolerance": 0.0,
                "threshold_loss_scale": null,
                "min_loss_scale": 0.0001
            }
        }
    },
    "num_workers": 1
}

trainer

TaskTrainer.Config

Component: TaskTrainer

class TaskTrainer.Config[source]

Bases: Trainer.Config

Make mypy happy

All Attributes (including base classes)

epochs: int = 10
early_stop_after: int = 0
max_clip_norm: Optional[float] = None
report_train_metrics: bool = True
target_time_limit_seconds: Optional[int] = None
do_eval: bool = True
load_best_model_after_train: bool = True
num_samples_to_log_progress: int = 1000
num_accumulated_batches: int = 1
num_batches_per_epoch: Optional[int] = None
optimizer: Optimizer.Config = Adam.Config()
scheduler: Optional[Scheduler.Config] = None
sparsifier: Optional[Sparsifier.Config] = None
fp16_args: FP16Optimizer.Config = FP16OptimizerFairseq.Config()

Default JSON

{
    "epochs": 10,
    "early_stop_after": 0,
    "max_clip_norm": null,
    "report_train_metrics": true,
    "target_time_limit_seconds": null,
    "do_eval": true,
    "load_best_model_after_train": true,
    "num_samples_to_log_progress": 1000,
    "num_accumulated_batches": 1,
    "num_batches_per_epoch": null,
    "optimizer": {
        "Adam": {
            "lr": 0.001,
            "weight_decay": 1e-05,
            "eps": 1e-08
        }
    },
    "scheduler": null,
    "sparsifier": null,
    "fp16_args": {
        "FP16OptimizerFairseq": {
            "init_loss_scale": 128,
            "scale_window": null,
            "scale_tolerance": 0.0,
            "threshold_loss_scale": null,
            "min_loss_scale": 0.0001
        }
    }
}
Trainer.Config

Component: Trainer

class Trainer.Config[source]

Bases: ConfigBase

All Attributes (including base classes)

epochs: int = 10
Training epochs
early_stop_after: int = 0
Number of epochs without improvement in the eval metric after which training stops.
max_clip_norm: Optional[float] = None
Clip gradient norm if set
report_train_metrics: bool = True
Whether metrics on training data should be computed and reported.
target_time_limit_seconds: Optional[int] = None
Target time limit for training; the default (None) means no time limit.
do_eval: bool = True
Whether to do evaluation and model selection based on it.
load_best_model_after_train: bool = True
num_samples_to_log_progress: int = 1000
Number of samples for logging training progress.
num_accumulated_batches: int = 1
Number of forward and backward passes to run before each gradient update; actual_batch_size = batch_size x num_accumulated_batches.
num_batches_per_epoch: Optional[int] = None
Define an epoch as a fixed number of batches. Subsequent epochs will continue to iterate through the data, cycling through it when they reach the end. If not set, exactly one pass through the dataset is used as one epoch. This setting only affects training epochs; test and eval always run over their entire datasets.
optimizer: Optimizer.Config = Adam.Config()
Config for the optimizer used in parameter updates.
scheduler: Optional[Scheduler.Config] = None
sparsifier: Optional[Sparsifier.Config] = None
fp16_args: FP16Optimizer.Config = FP16OptimizerFairseq.Config()
Define arguments for fp16 training. An fp16_optimizer is created that wraps the original optimizer; it scales the loss during the backward pass, while the master weights are maintained on the original optimizer. https://arxiv.org/abs/1710.03740
Subclasses
  • TaskTrainer.Config
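
The behavior described for num_batches_per_epoch and num_accumulated_batches can be illustrated with a small, framework-free sketch. The helper names iter_epochs and effective_batch_size are hypothetical and are not part of PyText; the code only mirrors the attribute descriptions above (fixed-size epochs that cycle through the data, and the actual batch size formula).

```python
from itertools import cycle, islice
from typing import Iterator, List, Optional


def iter_epochs(
    batches: List[str], epochs: int, num_batches_per_epoch: Optional[int] = None
) -> Iterator[List[str]]:
    """Yield the batches seen in each epoch (hypothetical helper, not PyText API)."""
    if num_batches_per_epoch is None:
        # Default: one epoch is exactly one pass through the dataset.
        for _ in range(epochs):
            yield list(batches)
        return
    # Fixed-size epochs: keep one endless stream so the next epoch resumes
    # where the previous one stopped, wrapping around at the end of the data.
    stream = cycle(batches)
    for _ in range(epochs):
        yield list(islice(stream, num_batches_per_epoch))


def effective_batch_size(batch_size: int, num_accumulated_batches: int) -> int:
    """actual_batch_size = batch_size x num_accumulated_batches."""
    return batch_size * num_accumulated_batches


epochs_seen = list(iter_epochs(["b0", "b1", "b2"], epochs=2, num_batches_per_epoch=2))
print(epochs_seen)
print(effective_batch_size(32, 4))
```

With three batches and two batches per epoch, the second epoch starts at the third batch and wraps around to the first one.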

Default JSON

{
    "epochs": 10,
    "early_stop_after": 0,
    "max_clip_norm": null,
    "report_train_metrics": true,
    "target_time_limit_seconds": null,
    "do_eval": true,
    "load_best_model_after_train": true,
    "num_samples_to_log_progress": 1000,
    "num_accumulated_batches": 1,
    "num_batches_per_epoch": null,
    "optimizer": {
        "Adam": {
            "lr": 0.001,
            "weight_decay": 1e-05,
            "eps": 1e-08
        }
    },
    "scheduler": null,
    "sparsifier": null,
    "fp16_args": {
        "FP16OptimizerFairseq": {
            "init_loss_scale": 128,
            "scale_window": null,
            "scale_tolerance": 0.0,
            "threshold_loss_scale": null,
            "min_loss_scale": 0.0001
        }
    }
}
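
As a rough illustration of how a config that lists only a few fields relates to the defaults above, the following sketch overlays user overrides on a subset of the default JSON. The merge_overrides helper is hypothetical; it is a plain dictionary merge written for this example, not the way PyText parses configs.

```python
import json
from copy import deepcopy

# Subset of the default Trainer.Config JSON shown above.
DEFAULT_TRAINER = {
    "epochs": 10,
    "early_stop_after": 0,
    "num_accumulated_batches": 1,
    "optimizer": {"Adam": {"lr": 0.001, "weight_decay": 1e-05, "eps": 1e-08}},
    "scheduler": None,
}


def merge_overrides(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on a copy of `defaults` (illustrative only)."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


trainer = merge_overrides(
    DEFAULT_TRAINER, {"epochs": 3, "optimizer": {"Adam": {"lr": 0.01}}}
)
print(json.dumps(trainer, indent=4))
```

This naive merge keeps the untouched Adam settings, but it would not handle switching to a different optimizer type, since the type name is itself a dictionary key.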
TrainerBase.Config

Component: TrainerBase

class TrainerBase.Config

Bases: Component.Config

All Attributes (including base classes)

Default JSON

{}

pytext package

Subpackages

pytext.common package

Submodules
pytext.common.constants module
class pytext.common.constants.BatchContext[source]

Bases: object

IGNORE_LOSS = 'ignore_loss'
INDEX = 'row_index'
TASK_NAME = 'task_name'
class pytext.common.constants.DFColumn[source]

Bases: object

ALIGNMENT = 'alignment'
CONTEXT_SEQUENCE = 'context_sequence'
DENSE_FEAT = 'dense_feat'
DICT_FEAT = 'dict_feat'
DOC_LABEL = 'doc_label'
DOC_WEIGHT = 'doc_weight'
LANGUAGE_ID = 'lang'
MODEL_FEATS = 'model_feats'
RAW_FEATS = 'raw_feats'
SEQLOGICAL = 'seqlogical'
SOURCE_FEATS = 'source_feats'
SOURCE_SEQUENCE = 'source_sequence'
TARGET_LABELS = 'target_labels'
TARGET_LOGITS = 'target_logits'
TARGET_PROBS = 'target_probs'
TARGET_SEQUENCE = 'target_sequence'
TARGET_TOKENS = 'target_tokens'
TOKEN_RANGE = 'token_range'
UTTERANCE = 'text'
WORD_LABEL = 'word_label'
WORD_WEIGHT = 'word_weight'
class pytext.common.constants.DatasetFieldName[source]

Bases: object

CHAR_FIELD = 'char_feat'
CONTEXTUAL_TOKEN_EMBEDDING = 'contextual_token_embedding'
DENSE_FIELD = 'dense_feat'
DICT_FIELD = 'dict_feat'
DOC_LABEL_FIELD = 'doc_label'
DOC_WEIGHT_FIELD = 'doc_weight'
LANGUAGE_ID_FIELD = 'lang'
NUM_TOKENS = 'num_tokens'
RAW_DICT_FIELD = 'sparsefeat'
RAW_SEQUENCE = 'raw_sequence'
RAW_WORD_LABEL = 'raw_word_label'
SEQ_FIELD = 'seq_word_feat'
SEQ_LENS = 'seq_lens'
SOURCE_SEQ_FIELD = 'source_sequence'
TARGET_SEQ_FIELD = 'target_sequence'
TARGET_SEQ_LENS = 'target_seq_lens'
TEXT_FIELD = 'word_feat'
TOKENS = 'tokens'
TOKEN_INDICES = 'token_indices'
TOKEN_RANGE = 'token_range'
UTTERANCE_FIELD = 'utterance'
WORD_LABEL_FIELD = 'word_label'
WORD_WEIGHT_FIELD = 'word_weight'
class pytext.common.constants.PackageFileName[source]

Bases: object

RAW_EMBED = 'pretrained_embed_raw'
SERIALIZED_EMBED = 'pretrained_embed_pt_serialized'
class pytext.common.constants.Padding[source]

Bases: object

DEFAULT_LABEL_PAD_IDX = -1
WORD_LABEL_PAD = 'PAD_LABEL'
WORD_LABEL_PAD_IDX = 0
class pytext.common.constants.RawExampleFieldName[source]

Bases: object

ROW_INDEX = 'row_index'
class pytext.common.constants.Stage[source]

Bases: enum.Enum

An enumeration.

EVAL = 'Evaluation'
TEST = 'Test'
TRAIN = 'Training'
class pytext.common.constants.VocabMeta[source]

Bases: object

EOS_SEQ = '</s_seq>'
EOS_TOKEN = '</s>'
INIT_SEQ = '<s_seq>'
INIT_TOKEN = '<s>'
PAD_SEQ = '<pad_seq>'
PAD_TOKEN = '<pad>'
UNK_NUM_TOKEN = '<unk>-NUM'
UNK_TOKEN = '<unk>'
pytext.common.utils module
pytext.common.utils.eprint(*args, **kwargs)[source]
Module contents

pytext.config package

Submodules
pytext.config.component module
class pytext.config.component.Component(config=None, *args, **kwargs)[source]

Bases: object

classmethod from_config(config, *args, **kwargs)[source]
class pytext.config.component.ComponentMeta[source]

Bases: type

class pytext.config.component.ComponentType[source]

Bases: enum.Enum

An enumeration.

BATCHER = 'batcher'
BATCH_SAMPLER = 'batch_sampler'
COLUMN = 'column'
DATA_HANDLER = 'data_handler'
DATA_SOURCE = 'data_source'
DATA_TYPE = 'data_type'
EXPORTER = 'exporter'
FEATURIZER = 'featurizer'
LOSS = 'loss'
METRIC_REPORTER = 'metric_reporter'
MODEL = 'model'
MODEL2 = 'model2'
MODULE = 'module'
OPTIMIZER = 'optimizer'
PREDICTOR = 'predictor'
SCHEDULER = 'scheduler'
SPARSIFIER = 'sparsifier'
TASK = 'task'
TENSORIZER = 'tensorizer'
TOKENIZER = 'tokenizer'
TRAINER = 'trainer'
class pytext.config.component.Registry[source]

Bases: object

classmethod add(component_type: pytext.config.component.ComponentType, cls_to_add: Type[CT_co], config_cls: Type[CT_co])[source]
classmethod configs(component_type: pytext.config.component.ComponentType) → Tuple[Type[CT_co], ...][source]
classmethod get(component_type: pytext.config.component.ComponentType, config_cls: Type[CT_co]) → Type[CT_co][source]
classmethod subconfigs(config_cls: Type[CT_co]) → Tuple[Type[CT_co], ...][source]
classmethod values(component_type: pytext.config.component.ComponentType) → Tuple[Type[CT_co], ...][source]
exception pytext.config.component.RegistryError[source]

Bases: Exception

pytext.config.component.create_component(component_type: pytext.config.component.ComponentType, config: Any, *args, **kwargs)[source]
pytext.config.component.create_data_handler(data_handler_config, *args, **kwargs)[source]
pytext.config.component.create_exporter(exporter_config, *args, **kwargs)[source]
pytext.config.component.create_featurizer(featurizer_config, *args, **kwargs)[source]
pytext.config.component.create_loss(loss_config, *args, **kwargs)[source]
pytext.config.component.create_metric_reporter(module_config, *args, **kwargs)[source]
pytext.config.component.create_model(model_config, *args, **kwargs)[source]
pytext.config.component.create_optimizer(optimizer_config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
pytext.config.component.create_predictor(predictor_config, *args, **kwargs)[source]
pytext.config.component.create_scheduler(scheduler_config, optimizer, *args, **kwargs)[source]
pytext.config.component.create_sparsifier(sparsifier_config, *args, **kwargs)[source]
pytext.config.component.create_trainer(trainer_config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
pytext.config.component.get_component_name(obj)[source]

Return the human-readable name of the class of obj. The name documents the type of a config field and can be used as a Union value in a JSON config.

pytext.config.component.register_tasks(task_cls: Union[Type[CT_co], List[Type[CT_co]]])[source]

Task classes are already added to the registry during declaration; pass them as parameters here just to make sure they’re imported.

pytext.config.config_adapter module
pytext.config.config_adapter.create_parameter(config, path_str, value)[source]
pytext.config.config_adapter.delete_parameter(config, path_str)[source]
pytext.config.config_adapter.deprecate(json_config, t)[source]
pytext.config.config_adapter.doc_model_deprecated(json_config)[source]

Rename DocModel to DocModel_Deprecated.

pytext.config.config_adapter.ensemble_task_deprecated(json_config)[source]

Rename tasks with new API consistently

pytext.config.config_adapter.find_dicts_containing_key(json_config, key)[source]
pytext.config.config_adapter.find_parameter(config, path_str)[source]
pytext.config.config_adapter.flatten_deprecated_ensemble_config(json_config)[source]
pytext.config.config_adapter.is_type_specifier(json_dict)[source]

If a config object is a class, it might have a level which is a type specifier, with one key corresponding to the name of whichever type it is. These types should not be explicitly named in the path.
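
A type-specifier level looks like {"Adam": {...}} in the default JSON blocks of this reference. The sketch below shows one way such a level could be detected. The heuristic (exactly one key, starting with an uppercase letter) and the name looks_like_type_specifier are assumptions made for illustration and may differ from the library's actual check.

```python
def looks_like_type_specifier(json_dict: dict) -> bool:
    """Assumed heuristic: exactly one key, and it starts with an uppercase letter."""
    if len(json_dict) != 1:
        return False
    key = next(iter(json_dict))
    return isinstance(key, str) and key[:1].isupper()


optimizer = {"Adam": {"lr": 0.001}}
print(looks_like_type_specifier(optimizer))          # type level: "Adam"
print(looks_like_type_specifier(optimizer["Adam"]))  # plain parameters
```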

pytext.config.config_adapter.lm_model_deprecated(json_config)[source]

Rename LM model to _Deprecated (LMTask is already deprecated in v5)

pytext.config.config_adapter.migrate_to_new_data_handler(task, columns)[source]
pytext.config.config_adapter.move_epoch_size(json_config)[source]
pytext.config.config_adapter.new_tasks_rename(json_config)[source]

Rename tasks with new API consistently

pytext.config.config_adapter.old_tasks_deprecated(json_config)[source]

Rename tasks with data_handler config to _Deprecated

pytext.config.config_adapter.register_adapter(from_version)[source]
pytext.config.config_adapter.remove_docclassificationtask_deprecated(json_config)[source]
pytext.config.config_adapter.remove_lmtask_deprecated(json_config)[source]
pytext.config.config_adapter.rename(json_config, old_name, new_name)[source]
pytext.config.config_adapter.rename_bitransformer_inputs(json_config)[source]

In “BiTransformer” model, rename input “characters” -> “bytes” and update subfields.

pytext.config.config_adapter.rename_fl_task(json_config)[source]
pytext.config.config_adapter.rename_parameter(config, old_path, new_path, transform=<function <lambda>>)[source]

A powerful tool for writing config adapters: it lets you specify JSON-style paths for an old and a new config parameter. For instance,

rename_parameter(config, "task.data.epoch_size", "task.trainer.num_batches_per_epoch")

will look through the config for task.data.epoch_size, including moving through explicitly specified types. If it’s specified, it will delete the value and set it in task.trainer.num_batches_per_epoch instead, creating trainer as an empty dictionary if necessary.
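
The path-based move described above can be sketched on plain nested dictionaries. This is a minimal re-creation under stated assumptions, not PyText's implementation: rename_path and _skip_type_specifier are hypothetical names, and a type-specifier level is assumed to be a dictionary with a single capitalized key.

```python
from typing import Any, Callable


def _skip_type_specifier(node: dict) -> dict:
    # Assumed heuristic: a single capitalized key names a type; step through it.
    if len(node) == 1:
        key = next(iter(node))
        if isinstance(key, str) and key[:1].isupper() and isinstance(node[key], dict):
            return node[key]
    return node


def rename_path(
    config: dict,
    old_path: str,
    new_path: str,
    transform: Callable[[Any], Any] = lambda x: x,
) -> bool:
    """Move the value at old_path to new_path; return True if something moved."""
    *parents, leaf = old_path.split(".")
    node = config
    for name in parents:
        node = _skip_type_specifier(node)
        if not isinstance(node.get(name), dict):
            return False  # the old parameter is not specified; nothing to do
        node = node[name]
    node = _skip_type_specifier(node)
    if leaf not in node:
        return False
    value = node.pop(leaf)

    *parents, leaf = new_path.split(".")
    node = config
    for name in parents:
        node = _skip_type_specifier(node)
        if not isinstance(node.get(name), dict):
            node[name] = {}  # create missing sections as empty dictionaries
        node = node[name]
    _skip_type_specifier(node)[leaf] = transform(value)
    return True


config = {"task": {"DocumentClassificationTask": {"data": {"epoch_size": 50}}}}
moved = rename_path(config, "task.data.epoch_size", "task.trainer.num_batches_per_epoch")
print(moved, config)
```

Note that the path never names DocumentClassificationTask: the type level is stepped through implicitly, and trainer is created because it did not exist.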

pytext.config.config_adapter.rename_tensorizer_vocab_params(json_config)[source]
pytext.config.config_adapter.upgrade_if_xlm(json_config)[source]

Make XLMModel Union changes for encoder and tokens config. Since they are now unions, insert the old class into the config if no class name is mentioned.

pytext.config.config_adapter.upgrade_one_version(json_config)[source]
pytext.config.config_adapter.upgrade_to_latest(json_config)[source]
pytext.config.config_adapter.v0_to_v1(json_config)[source]
pytext.config.config_adapter.v12_to_v13(json_config)[source]

remove_output_encoded_layers(json_config)

pytext.config.config_adapter.v1_to_v2(json_config)[source]
pytext.config.config_adapter.v2_to_v3(json_config)[source]

Optimizer and Scheduler configs used to be part of the task config; they now live in the trainer’s config.
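
The relocation described above can be pictured with a short sketch on a plain dictionary. The function name move_optimizer_and_scheduler is hypothetical, and the sketch assumes the optimizer and scheduler sections sit directly under task with no type-specifier level in between; the real adapter may handle more cases.

```python
def move_optimizer_and_scheduler(json_config: dict) -> dict:
    """Relocate task.optimizer / task.scheduler into task.trainer (illustrative)."""
    task = json_config.get("task", {})
    for section in ("optimizer", "scheduler"):
        if section in task:
            # Create the trainer section on demand, then move the value over.
            task.setdefault("trainer", {})[section] = task.pop(section)
    return json_config


old_config = {"task": {"optimizer": {"lr": 0.1}, "trainer": {"epochs": 5}}}
new_config = move_optimizer_and_scheduler(old_config)
print(new_config)
```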

pytext.config.config_adapter.v3_to_v4(json_config)[source]

The key for providing the path for the contextual token embedding has changed from pretrained_model_embedding to contextual_token_embedding. This affects the features section of the config.

pytext.config.config_adapter.v6_to_v7(json_config)[source]

Make LabelTensorizer expansible. If the labels field should be an instance of LabelTensorizer, convert it to `{LabelTensorizer: labels}`.

pytext.config.contextual_intent_slot module
class pytext.config.contextual_intent_slot.ExtraField[source]

Bases: object

DOC_WEIGHT = 'doc_weight'
RAW_WORD_LABEL = 'raw_word_label'
TOKEN_RANGE = 'token_range'
UTTERANCE = 'utterance'
WORD_WEIGHT = 'word_weight'
class pytext.config.contextual_intent_slot.ModelInput[source]

Bases: object

CHAR = 'char_feat'
CONTEXTUAL_TOKEN_EMBEDDING = 'contextual_token_embedding'
DENSE = 'dense_feat'
DICT = 'dict_feat'
SEQ = 'seq_word_feat'
TEXT = 'word_feat'
class pytext.config.contextual_intent_slot.ModelInputConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

char_feat = None
contextual_token_embedding = None
dense_feat = None
dict_feat = None
seq_word_feat = <pytext.config.field_config.WordFeatConfig object>
word_feat = <pytext.config.field_config.WordFeatConfig object>
pytext.config.doc_classification module
class pytext.config.doc_classification.ExtraField[source]

Bases: object

RAW_TEXT = 'text'
class pytext.config.doc_classification.ModelInput[source]

Bases: object

CHAR_FEAT = 'char_feat'
CONTEXTUAL_TOKEN_EMBEDDING = 'contextual_token_embedding'
DENSE_FEAT = 'dense_feat'
DICT_FEAT = 'dict_feat'
SEQ_LENS = 'seq_lens'
WORD_FEAT = 'word_feat'
class pytext.config.doc_classification.ModelInputConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

char_feat = None
contextual_token_embedding = None
dense_feat = None
dict_feat = None
word_feat = <pytext.config.field_config.WordFeatConfig object>
pytext.config.field_config module
pytext.config.field_config.CharFeatConfig[source]

alias of pytext.config.field_config.CharFeatConfig

pytext.config.field_config.ContextualTokenEmbeddingConfig[source]

alias of pytext.config.field_config.ContextualTokenEmbeddingConfig

pytext.config.field_config.DictFeatConfig[source]

alias of pytext.config.field_config.DictFeatConfig

class pytext.config.field_config.DocLabelConfig(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

export_output_names = ['doc_scores']
label_weights = {}
target_prob = False
class pytext.config.field_config.EmbedInitStrategy[source]

Bases: enum.Enum

An enumeration.

RANDOM = 'random'
ZERO = 'zero'
class pytext.config.field_config.FeatureConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

char_feat = None
contextual_token_embedding = None
dense_feat = None
dict_feat = None
seq_word_feat = None
word_feat = <pytext.config.field_config.WordFeatConfig object>
class pytext.config.field_config.FloatVectorConfig(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

dim = 0
dim_error_check = False
export_input_names = ['float_vec_vals']
class pytext.config.field_config.Target[source]

Bases: object

DOC_LABEL = 'doc_label'
TARGET_LABEL_FIELD = 'target_label'
TARGET_LOGITS_FIELD = 'target_logit'
TARGET_PROB_FIELD = 'target_prob'
pytext.config.field_config.WordFeatConfig[source]

alias of pytext.config.field_config.WordFeatConfig

class pytext.config.field_config.WordLabelConfig(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

export_output_names = ['word_scores']
use_bio_labels = False
pytext.config.module_config module
class pytext.config.module_config.Activation[source]

Bases: enum.Enum

An enumeration.

GELU = 'gelu'
GLU = 'glu'
LEAKYRELU = 'leakyrelu'
RELU = 'relu'
TANH = 'tanh'
class pytext.config.module_config.CNNParams(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

causal = False
dilated = False
kernel_num = 100
kernel_sizes = [3, 4]
weight_norm = False
class pytext.config.module_config.ExporterType[source]

Bases: enum.Enum

An enumeration.

INIT_PREDICT = 'init_predict'
PREDICTOR = 'predictor'
pytext.config.module_config.ModuleConfig[source]

alias of pytext.config.module_config.ModuleConfig

class pytext.config.module_config.PerplexityType[source]

Bases: enum.Enum

An enumeration.

EOS = 'eos'
MAX = 'max'
MEAN = 'mean'
MEDIAN = 'median'
MIN = 'min'
class pytext.config.module_config.PoolingType[source]

Bases: enum.Enum

An enumeration.

MAX = 'max'
MEAN = 'mean'
NONE = 'none'
class pytext.config.module_config.SlotAttentionType[source]

Bases: enum.Enum

An enumeration.

CONCAT = 'concat'
DOT = 'dot'
MULTIPLY = 'multiply'
NO_ATTENTION = 'no_attention'
pytext.config.pair_classification module
class pytext.config.pair_classification.ExtraField[source]

Bases: object

UTTERANCE_PAIR = 'utterance'
class pytext.config.pair_classification.ModelInput[source]

Bases: object

TEXT1 = 'text1'
TEXT2 = 'text2'
class pytext.config.pair_classification.ModelInputConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

text1 = <pytext.config.field_config.WordFeatConfig object>
text2 = <pytext.config.field_config.WordFeatConfig object>
pytext.config.pytext_config module
class pytext.config.pytext_config.ConfigBase(**kwargs)[source]

Bases: object

items()[source]
class pytext.config.pytext_config.ConfigBaseMeta[source]

Bases: type

annotations_and_defaults()[source]
class pytext.config.pytext_config.LogitsConfig(**kwargs)[source]

Bases: pytext.config.pytext_config.TestConfig

dump_raw_input = False
class pytext.config.pytext_config.PlaceHolder[source]

Bases: object

class pytext.config.pytext_config.PyTextConfig(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

auto_resume_from_snapshot = False
debug_path = '/tmp/model.debug'
distributed_world_size = 1
export_caffe2_path = None
export_onnx_path = '/tmp/model.onnx'
export_torchscript_path = None
gpu_streams_for_distributed_training = 1
include_dirs = None
load_snapshot_path = ''
modules_save_dir = ''
random_seed = None

Seed value to seed torch, python, and numpy random generators.

report_eval_results = False
save_all_checkpoints = False
save_module_checkpoints = False
save_snapshot_path = '/tmp/model.pt'
test_out_path = '/tmp/test_out.txt'
torchscript_quantize = False
use_config_from_snapshot = True
use_cuda_for_testing = True
use_cuda_if_available = True
use_deterministic_cudnn = False

Whether to allow CuDNN to behave deterministically.

use_fp16 = False
use_tensorboard = True
class pytext.config.pytext_config.TestConfig(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

field_names = None

Field names for the TSV. If this is not set, the first line of each file will be assumed to be a header containing the field names.

test_out_path = ''
test_path = 'test.tsv'
use_cuda_if_available = True
use_tensorboard = True
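
The header rule described for field_names (use the given names, otherwise treat the first line as a header) matches the behavior of Python's csv.DictReader, which makes for a compact illustration. The read_tsv helper is hypothetical and is not how PyText reads its data; it only demonstrates the rule.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional


def read_tsv(lines, field_names: Optional[List[str]] = None) -> Iterator[Dict[str, str]]:
    """Yield one dict per row; with field_names=None the first line is the header."""
    # csv.DictReader consumes the first row as the header when fieldnames is None.
    return csv.DictReader(lines, fieldnames=field_names, delimiter="\t")


with_header = io.StringIO("text\tdoc_label\nhello world\tgreeting\n")
without_header = io.StringIO("hello world\tgreeting\n")

rows_a = list(read_tsv(with_header))
rows_b = list(read_tsv(without_header, field_names=["text", "doc_label"]))
print(rows_a)
print(rows_b)
```

Both calls produce the same rows; the only difference is where the column names come from.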
pytext.config.query_document_pairwise_ranking module
class pytext.config.query_document_pairwise_ranking.ModelInput[source]

Bases: object

NEG_RESPONSE = 'neg_response'
POS_RESPONSE = 'pos_response'
QUERY = 'query'
class pytext.config.query_document_pairwise_ranking.ModelInputConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

neg_response = <pytext.config.field_config.WordFeatConfig object>
pos_response = <pytext.config.field_config.WordFeatConfig object>
query = <pytext.config.field_config.WordFeatConfig object>
pytext.config.serialize module
exception pytext.config.serialize.ConfigParseError[source]

Bases: Exception

exception pytext.config.serialize.EnumTypeError[source]

Bases: pytext.config.serialize.ConfigParseError

exception pytext.config.serialize.IncorrectTypeError[source]

Bases: Exception

exception pytext.config.serialize.MissingValueError[source]

Bases: pytext.config.serialize.ConfigParseError

exception pytext.config.serialize.UnionTypeError[source]

Bases: pytext.config.serialize.ConfigParseError

pytext.config.serialize.build_subclass_dict(subclasses)[source]
pytext.config.serialize.config_from_json(cls, json_obj, ignore_fields=())[source]
pytext.config.serialize.config_to_json(cls, config_obj)[source]
pytext.config.serialize.parse_config(config_json)[source]

Parse PyTextConfig object from parameter string or parameter file

pytext.config.serialize.pytext_config_from_json(json_obj, ignore_fields=(), auto_upgrade=True)[source]
pytext.config.utils module
pytext.config.utils.cast_str(to_type, value)[source]
pytext.config.utils.find_param(root, suffix, parent='')[source]

Recursively look at all fields in config to find where suffix would fit. This is used to change configs so that they don’t use default values. Return the list of field paths matching.
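
A minimal sketch of this kind of suffix search, written against plain nested dictionaries rather than PyText config classes. The name find_paths and the dotted-path format are assumptions for illustration; the library function may treat type names and defaults differently.

```python
from typing import Any, List


def find_paths(root: Any, suffix: str, parent: str = "") -> List[str]:
    """Return dotted paths in a nested dict whose tail matches `suffix`."""
    matches: List[str] = []
    if not isinstance(root, dict):
        return matches
    for key, value in root.items():
        path = f"{parent}.{key}" if parent else key
        # Match on whole path components so "lr" does not match "min_lr".
        if path == suffix or path.endswith("." + suffix):
            matches.append(path)
        matches.extend(find_paths(value, suffix, path))
    return matches


config = {"task": {"trainer": {"optimizer": {"Adam": {"lr": 0.001}}, "epochs": 10}}}
print(find_paths(config, "lr"))
print(find_paths(config, "trainer.epochs"))
```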

pytext.config.utils.is_component_class(obj)[source]
pytext.config.utils.replace_param(root, path_list, value)[source]
pytext.config.utils.resolve_optional(type_v)[source]

Deal with Optional implemented as Union[type, None]
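
Since Optional[T] is shorthand for Union[T, None], unwrapping it amounts to dropping NoneType from the union's arguments. The sketch below uses typing.get_origin and typing.get_args, which require Python 3.8 or later; unwrap_optional is a hypothetical name and not the library function.

```python
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(type_v):
    """Return T for Optional[T]; return any other type unchanged."""
    if get_origin(type_v) is Union:
        args = get_args(type_v)
        non_none = [arg for arg in args if arg is not type(None)]
        # Only Union[T, None] counts as Optional; leave wider unions alone.
        if len(non_none) == 1 and len(non_none) < len(args):
            return non_none[0]
    return type_v


print(unwrap_optional(Optional[int]))
print(unwrap_optional(float))
```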

Module contents

pytext.data package

Subpackages
pytext.data.data_structures package
Submodules
pytext.data.data_structures.annotation module
class pytext.data.data_structures.annotation.Annotation(annotation_string: str, utterance: str = '', brackets: str = '[]', combination_labels: bool = True, add_dict_feat: bool = False, accept_flat_intents_slots: bool = False)[source]

Bases: object

build_tree(accept_flat_intents_slots: bool = False)[source]
class pytext.data.data_structures.annotation.Intent(label)[source]

Bases: pytext.data.data_structures.annotation.Node

validate_node()[source]
class pytext.data.data_structures.annotation.Node(label)[source]

Bases: object

children_flat_str_spans()[source]
flat_str()[source]
get_info()[source]
get_token_indices()[source]
get_token_span()[source]

0-indexed, like array slicing: for the first 3 tokens, returns 0, 3.

list_ancestors()[source]
list_nonTerminals()[source]

Returns all Intent and Slot nodes subordinate to this node

list_terminals()[source]

Returns all Token nodes

list_tokens()[source]
validate_node()[source]
class pytext.data.data_structures.annotation.Node_Info(node)[source]

Bases: object

This class extracts the essential information for a node, for use in rules.

get_parent(node)[source]
get_same_span(node)[source]
class pytext.data.data_structures.annotation.Root[source]

Bases: pytext.data.data_structures.annotation.Node

validate_node()[source]
class pytext.data.data_structures.annotation.Slot(label)[source]

Bases: pytext.data.data_structures.annotation.Node

validate_node()[source]
class pytext.data.data_structures.annotation.Token(label, index)[source]

Bases: pytext.data.data_structures.annotation.Node

remove()[source]

Removes this token from the tree

validate_node()[source]
class pytext.data.data_structures.annotation.Token_Info(node)[source]

Bases: object

This class extracts the essential information for a token for use in rules.

get_parent(node)[source]
class pytext.data.data_structures.annotation.Tree(root: pytext.data.data_structures.annotation.Root, combination_labels: bool, utterance: str = '', validate_tree: bool = True)[source]

Bases: object

depth()[source]
flat_str()[source]
list_tokens()[source]
lotv_str()[source]

LOTV: Limited Output Token Vocabulary. We map the terminal tokens in the input to a constant output (SEQLOGICAL_LOTV_TOKEN) to make the parsing task easier for models where the decoding is decoupled from the input (e.g. seq2seq). This way, the model can focus on learning to predict the parse tree rather than waste effort learning to replicate terminal tokens.

print_tree()[source]
recursive_validation(node)[source]
to_actions()[source]
validate_tree()[source]

This is a method for checking that roots/intents/slots are nested correctly. Root( Intent( Slot( Intent( Slot, etc.) ) ) )
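
The nesting rule above (Root, then Intent, then Slot, then Intent, and so on) can be sketched as a recursive check. SketchNode and is_nested_correctly are hypothetical stand-ins for the library classes, and allowing tokens under any non-terminal is an assumption that may be more permissive than the actual validation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchNode:
    """Hypothetical stand-in for the Root/Intent/Slot/Token node classes."""

    kind: str  # one of "root", "intent", "slot", "token"
    children: List["SketchNode"] = field(default_factory=list)


# Which non-terminal kind may appear directly under each kind (assumed rule).
ALLOWED_CHILD = {"root": "intent", "intent": "slot", "slot": "intent"}


def is_nested_correctly(node: SketchNode) -> bool:
    """Check that non-terminals alternate Root > Intent > Slot > Intent > ..."""
    if node.kind == "token":
        return not node.children  # tokens are terminals
    for child in node.children:
        if child.kind != "token" and child.kind != ALLOWED_CHILD[node.kind]:
            return False
        if not is_nested_correctly(child):
            return False
    return True


token = SketchNode("token")
good = SketchNode("root", [SketchNode("intent", [token, SketchNode("slot", [token])])])
bad = SketchNode("root", [SketchNode("slot", [token])])
print(is_nested_correctly(good), is_nested_correctly(bad))
```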

class pytext.data.data_structures.annotation.TreeBuilder(combination_labels: bool = True)[source]

Bases: object

finalize_tree(validate_tree=True)[source]
update_tree(action, label)[source]
pytext.data.data_structures.annotation.escape_brackets(string: str) → str[source]
pytext.data.data_structures.annotation.is_intent_nonterminal(node_label: str) → bool[source]
pytext.data.data_structures.annotation.is_slot_nonterminal(node_label: str) → bool[source]
pytext.data.data_structures.annotation.is_unsupported(node_label: str) → bool[source]
pytext.data.data_structures.annotation.is_valid_nonterminal(node_label: str) → bool[source]
pytext.data.data_structures.annotation.list_from_actions(tokens_str: List[str], actions_vocab: List[str], actions_indices: List[int])[source]
pytext.data.data_structures.node module
class pytext.data.data_structures.node.Node(label: str, span: pytext.data.data_structures.node.Span, children: Optional[AbstractSet[Node]] = None, text: str = None)[source]

Bases: object

Node in an intent-slot tree, representing either an intent or a slot.

label

Label of the node.

Type: str
span

Span of the node.

Type: Span
children

Children of the node.

Type: set of Node
children
get_depth() → int[source]
label
span
text
class pytext.data.data_structures.node.Span[source]

Bases: tuple

Span of a node in an intent-slot tree.

start

Start position of the node.

end

End position of the node (exclusive).

end

Alias for field number 1

start

Alias for field number 0
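
Because the end position is exclusive, a span can be used directly as a slice. The sketch below re-creates the tuple layout listed above (start is field 0, end is field 1); SpanSketch is a hypothetical name, and treating the positions as character offsets into the utterance is an assumption.

```python
from typing import NamedTuple


class SpanSketch(NamedTuple):
    """Illustrative re-creation: start is field 0, end (exclusive) is field 1."""

    start: int
    end: int


text = "play some jazz"
span = SpanSketch(start=5, end=9)
covered = text[span.start:span.end]
print(covered)
```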

Module contents
pytext.data.featurizer package
Submodules
pytext.data.featurizer.featurizer module
class pytext.data.featurizer.featurizer.Featurizer(config, feature_config: pytext.config.field_config.FeatureConfig)[source]

Bases: pytext.config.component.Component

Featurizer is tasked with performing data preprocessing that should be shared between training and inference, namely, tokenization and gazetteer features alignment.

This is an interface whose featurize() method must be implemented so that the implemented interface can be used with the appropriate data handler.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord[source]
featurize_batch(input_record_list: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord][source]

Featurize a batch of instances/examples.

classmethod from_config(config, feature_config: pytext.config.field_config.FeatureConfig)[source]
get_sentence_markers(locale=None)[source]
class pytext.data.featurizer.featurizer.InputRecord[source]

Bases: tuple

Input data contract between Featurizer and DataHandler.

locale

Alias for field number 2

raw_gazetteer_feats

Alias for field number 1

raw_text

Alias for field number 0

class pytext.data.featurizer.featurizer.OutputRecord[source]

Bases: tuple

Output data contract between Featurizer and DataHandler.

characters

Alias for field number 5

contextual_token_embedding

Alias for field number 6

dense_feats

Alias for field number 7

gazetteer_feat_lengths

Alias for field number 3

gazetteer_feat_weights

Alias for field number 4

gazetteer_feats

Alias for field number 2

token_ranges

Alias for field number 1

tokens

Alias for field number 0
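
The field numbers listed for InputRecord and OutputRecord fix the order of the tuple fields. The sketch below rebuilds both records as stand-in named tuples and adds a toy featurize function. The default values, the type annotations, and the whitespace tokenization with lowercasing are assumptions for illustration, not the behaviour of SimpleFeaturizer.

```python
import re
from typing import List, NamedTuple, Optional, Sequence, Tuple


class InputRecord(NamedTuple):
    raw_text: str                  # field number 0
    raw_gazetteer_feats: str = ""  # field number 1
    locale: str = ""               # field number 2


class OutputRecord(NamedTuple):
    tokens: List[str]                                    # field number 0
    token_ranges: Optional[List[Tuple[int, int]]] = None  # field number 1
    gazetteer_feats: Optional[List[str]] = None           # field number 2
    gazetteer_feat_lengths: Optional[List[int]] = None    # field number 3
    gazetteer_feat_weights: Optional[List[float]] = None  # field number 4
    characters: Optional[List[List[str]]] = None          # field number 5
    contextual_token_embedding: Optional[List[List[float]]] = None  # field 6
    dense_feats: Optional[List[float]] = None             # field number 7


def featurize(record: InputRecord) -> OutputRecord:
    """Toy featurizer: split on whitespace and lowercase each token."""
    matches = list(re.finditer(r"\S+", record.raw_text))
    return OutputRecord(
        tokens=[m.group().lower() for m in matches],
        token_ranges=[(m.start(), m.end()) for m in matches],
    )


def featurize_batch(records: Sequence[InputRecord]) -> List[OutputRecord]:
    return [featurize(r) for r in records]


out = featurize(InputRecord("Order Coffee"))
```

Here `out.tokens` and `out[0]` refer to the same list, which is what "Alias for field number 0" means.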

pytext.data.featurizer.simple_featurizer module
class pytext.data.featurizer.simple_featurizer.SimpleFeaturizer(config, feature_config: pytext.config.field_config.FeatureConfig)[source]

Bases: pytext.data.featurizer.featurizer.Featurizer

Simple featurizer for basic tokenization and gazetteer feature alignment.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord[source]

Featurize one instance/example only.

featurize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord][source]

Featurize a batch of instances/examples.

get_sentence_markers(locale=None)[source]
tokenize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord[source]

Tokenize one instance/example only.

tokenize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord][source]
Module contents
class pytext.data.featurizer.Featurizer(config, feature_config: pytext.config.field_config.FeatureConfig)[source]

Bases: pytext.config.component.Component

Featurizer is tasked with performing data preprocessing that should be shared between training and inference, namely, tokenization and gazetteer features alignment.

This is an interface: subclasses must implement featurize() so that they can be used with the appropriate data handler.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord[source]
featurize_batch(input_record_list: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord][source]

Featurize a batch of instances/examples.

classmethod from_config(config, feature_config: pytext.config.field_config.FeatureConfig)[source]
get_sentence_markers(locale=None)[source]
class pytext.data.featurizer.InputRecord[source]

Bases: tuple

Input data contract between Featurizer and DataHandler.

locale

Alias for field number 2

raw_gazetteer_feats

Alias for field number 1

raw_text

Alias for field number 0

class pytext.data.featurizer.OutputRecord[source]

Bases: tuple

Output data contract between Featurizer and DataHandler.

characters

Alias for field number 5

contextual_token_embedding

Alias for field number 6

dense_feats

Alias for field number 7

gazetteer_feat_lengths

Alias for field number 3

gazetteer_feat_weights

Alias for field number 4

gazetteer_feats

Alias for field number 2

token_ranges

Alias for field number 1

tokens

Alias for field number 0

class pytext.data.featurizer.SimpleFeaturizer(config, feature_config: pytext.config.field_config.FeatureConfig)[source]

Bases: pytext.data.featurizer.featurizer.Featurizer

Simple featurizer for basic tokenization and gazetteer feature alignment.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord[source]

Featurize one instance/example only.

featurize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord][source]

Featurize a batch of instances/examples.

get_sentence_markers(locale=None)[source]
tokenize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord[source]

Tokenize one instance/example only.

tokenize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord][source]
pytext.data.sources package
Submodules
pytext.data.sources.conllu module
class pytext.data.sources.conllu.CoNLLUNERDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]

Bases: pytext.data.sources.conllu.CoNLLUPOSDataSource

Reads data in which examples are separated by empty lines and each line holds a word and its label. This data source supports datasets for NER tasks.

class pytext.data.sources.conllu.CoNLLUNERFile(file, delim, lang)[source]

Bases: object

class pytext.data.sources.conllu.CoNLLUPOSDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from CoNLL-U file.

classmethod from_config(config: pytext.data.sources.conllu.CoNLLUPOSDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

pytext.data.sources.data_source module
class pytext.data.sources.data_source.DataSource(schema: Dict[str, Type[CT_co]])[source]

Bases: pytext.config.component.Component

Data sources are simple components that stream data from somewhere using Python’s iteration interface. Each data source should expose three iterators: “train”, “test”, and “eval”. Each of these can be iterated over any number of times, and iterating over it yields dictionaries whose values are deserialized Python types.

Simply, these data sources exist as an interface to read through datasets in a pythonic way, with pythonic types, and abstract away the form that they are stored in.

eval = <pytext.data.sources.data_source.GeneratorIterator object>[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.data_source.GeneratorIterator(generator, *args, **kwargs)[source]

Bases: object

Create an object which can be iterated over multiple times from a generator call. Each iteration will call the generator and allow iterating over it. This is unsafe to use on generators which have side effects, such as file readers; it’s up to the callers to safely manage these scenarios.
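
A minimal sketch of this idea follows. The class name ReiterableGenerator and the count_up generator are hypothetical; the sketch only shows the technique of calling the generator function again on every iteration.

```python
class ReiterableGenerator:
    """Hypothetical stand-in: re-invokes a generator function per iteration."""

    def __init__(self, generator, *args, **kwargs):
        self.generator = generator
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # A fresh generator object is created for every pass.
        return self.generator(*self.args, **self.kwargs)


def count_up(n):
    for i in range(n):
        yield i


numbers = ReiterableGenerator(count_up, 3)
first_pass = list(numbers)
second_pass = list(numbers)  # a plain generator would be exhausted here
```

If the generator function has side effects, such as reading from an already open file, every pass repeats those side effects, which is why the caller has to manage that case.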

class pytext.data.sources.data_source.GeneratorMethodProperty(generator)[source]

Bases: object

Identify a generator method as a property. This will allow instances to iterate over the property multiple times, and not consume the generator. It accomplishes this by wrapping the generator and creating multiple generator instances if iterated over multiple times.
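
One way to get this behaviour is a descriptor that wraps the generator method. The sketch below is an assumption about how such a property could be built, not the PyText source; the names generator_property_sketch, _Reiterable, and ToySource are hypothetical.

```python
import functools


class _Reiterable:
    """Calls a zero-argument generator factory on every iteration."""

    def __init__(self, make_generator):
        self._make_generator = make_generator

    def __iter__(self):
        return self._make_generator()


class generator_property_sketch:
    """Descriptor that exposes a generator method as a re-iterable attribute."""

    def __init__(self, generator):
        self.generator = generator

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed on the class rather than on an instance.
            return self
        return _Reiterable(functools.partial(self.generator, obj))


class ToySource:
    def __init__(self, rows):
        self.rows = rows

    @generator_property_sketch
    def train(self):
        for row in self.rows:
            yield {"text": row}


source = ToySource(["a", "b"])
train = source.train
first_pass = list(train)
second_pass = list(train)
```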

class pytext.data.sources.data_source.RawExample[source]

Bases: dict

A wrapper class for a single example row with a dict interface. This is here for any logic we want row objects to have that dicts don’t do.

class pytext.data.sources.data_source.RootDataSource(schema: Dict[str, Type[CT_co]], column_mapping: Dict[str, str] = ())[source]

Bases: pytext.data.sources.data_source.DataSource

A data source which actually loads data from a location. This data source needs to be responsible for converting types based on a schema, because it should be the only part of the system that actually needs to understand details about the underlying storage system.

RootDataSource presents a simpler abstraction than DataSource where the rows are automatically converted to the right DataTypes.

A RootDataSource should implement raw_train_data_generator, raw_test_data_generator, and raw_eval_data_generator. These functions should yield dictionaries of raw objects which the loading system can convert using the schema loading functions.
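
The division of labour described above can be sketched as follows: the raw generator yields strings, and the base class converts them according to the schema. ToyRootDataSource, the LOADERS table, and the sample row are hypothetical stand-ins. The real class uses the DATA_SOURCE_TYPES mapping listed below, and only the train split is shown.

```python
import json
from typing import Any, Callable, Dict, List

# Assumed, simplified counterpart of DATA_SOURCE_TYPES.
LOADERS: Dict[Any, Callable[[str], Any]] = {
    str: str,
    int: int,
    float: float,
    List[int]: json.loads,
    List[float]: json.loads,
}


class ToyRootDataSource:
    def __init__(self, schema, column_mapping=None):
        self.schema = schema
        self.column_mapping = dict(column_mapping or {})

    def raw_train_data_generator(self):
        # A real source would read from a file or a table here.
        yield {"utterance": "hello", "label_id": "3", "scores": "[0.1, 0.9]"}

    def _convert(self, raw):
        # Rename source columns to schema names, then apply the loaders.
        renamed = {self.column_mapping.get(k, k): v for k, v in raw.items()}
        return {
            name: LOADERS[type_](renamed[name])
            for name, type_ in self.schema.items()
        }

    @property
    def train(self):
        return (self._convert(raw) for raw in self.raw_train_data_generator())


source = ToyRootDataSource(
    schema={"text": str, "label_id": int, "scores": List[float]},
    column_mapping={"utterance": "text"},
)
rows = list(source.train)
```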

DATA_SOURCE_TYPES = {<class 'str'>: <function load_text>, typing.Any: <function load_text>, typing.List[pytext.utils.data.Slot]: <function load_slots>, typing.List[int]: <function load_json>, typing.List[str]: <function load_json>, typing.List[typing.Dict[str, typing.Dict[str, float]]]: <function load_json>, typing.List[float]: <function load_float_list>, ~JSONString: <function load_json_string>, <class 'float'>: <function load_float>, <class 'int'>: <function load_int>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>[source]
load(value, schema_type)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

classmethod register_type(type)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.data_source.RowShardedDataSource(data_source: pytext.data.sources.data_source.DataSource, rank=0, world_size=1)[source]

Bases: pytext.data.sources.data_source.ShardedDataSource

Shards a given datasource by row.

train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train_unsharded = <pytext.data.sources.data_source.GeneratorIterator object>[source]
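
Sharding by row can be sketched with itertools.islice. The round-robin assignment used here (row i goes to the worker with rank i % world_size) is an assumption about the strategy, and shard_rows is a hypothetical helper rather than part of PyText.

```python
import itertools


def shard_rows(rows, rank=0, world_size=1):
    """Yield every world_size-th row, starting at index rank."""
    return itertools.islice(rows, rank, None, world_size)


all_rows = list(range(7))
shards = [list(shard_rows(all_rows, rank=r, world_size=3)) for r in range(3)]
```

With seven rows and three workers the shards have sizes 3, 2, and 2, so distributed training code may still need to handle workers that see a different number of rows.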
class pytext.data.sources.data_source.SafeFileWrapper(*args, **kwargs)[source]

Bases: object

A simple wrapper class for files which allows file descriptors to be managed with normal Python ref counts. Without using this, if you create a file in a from_config you will see a warning along the lines of “ResourceWarning: self._file is acquired but not always released”. This is because we’re opening a file outside a context manager (with statement). We want to do it this way because it lets us pass a file object to the DataSource, rather than a filename. This exposes a ton more flexibility and testability; passing filenames is one of the paths towards pain.

However, we don’t have a clear resource management system set up for configuration. from_config functions are the tool that we have to allow objects to specify how they should be created from a configuration, which generally should only happen from the command line, whereas in eg. a notebook you should build the objects with constructors directly. If building from constructors, you can just open a file and pass it, but from_config here needs to create a file object from a configured filename. Python files don’t close automatically, so you also need a system that will close them when the python interpreter shuts down. If you don’t, it will print a resource warning at runtime, as the interpreter manually closes the filehandles (although modern OSs are pretty okay with having open file handles, it’s hard for me to justify exactly why Python is so strict about this; I think one of the main reasons you might actually care is if you have a writeable file handle it might not have flushed properly when the C runtime exits, but Python doesn’t actually distinguish between writeable and non-writeable file handles).

This class is a wrapper that creates a system for (sort-of) safely closing the file handles before the runtime exits. It does this by closing the file when the object’s deleter is called. Although the Python standard doesn’t actually make any guarantees about when deleters are called, CPython is reference counted and so as an implementation detail will call a deleter whenever the last reference to it is removed, which generally will happen to all objects created during program execution as long as there aren’t reference cycles (I don’t actually know off-hand whether the cycle collection is run before shutdown, and anyway the cycles would have to include objects that the runtime itself maintains pointers to, which seems like you’d have to work hard to do and wouldn’t do accidentally). This isn’t true for other Python systems like PyPy or Jython which use generational garbage collection and so don’t actually always call destructors before the system shuts down, but again this is only really relevant for mutable files.

An alternative implementation would be to build a resource management system into PyText, something like a function that we use for opening system resources that registers the resources and then we make sure are all closed before system shutdown. That would probably technically be the right solution, but I didn’t really think of that first and also it’s a bit longer to implement.

If you are seeing resource warnings on your system, please file a github issue.
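
The mechanism described above can be sketched in a few lines. ClosingFileWrapper is a hypothetical stand-in, not the PyText class, and the prompt close after `del` relies on CPython reference counting, as the text explains.

```python
import os
import tempfile


class ClosingFileWrapper:
    """Hypothetical stand-in: closes the wrapped file in the deleter."""

    def __init__(self, *args, **kwargs):
        self._file = open(*args, **kwargs)

    def __del__(self):
        # Look in __dict__ directly so a failed open() cannot cause recursion.
        file = self.__dict__.get("_file")
        if file is not None:
            file.close()

    def __iter__(self):
        return iter(self._file)

    def __getattr__(self, name):
        if name == "_file":
            raise AttributeError(name)
        # Delegate everything else (read, seek, ...) to the wrapped file.
        return getattr(self._file, name)


with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("hello\tworld\n")
path = tmp.name

wrapper = ClosingFileWrapper(path, encoding="utf-8")
lines = list(wrapper)
underlying = wrapper._file
del wrapper  # on CPython the last reference is gone, so __del__ runs here
os.remove(path)
```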

class pytext.data.sources.data_source.ShardedDataSource(schema: Dict[str, Type[CT_co]])[source]

Bases: pytext.data.sources.data_source.DataSource

Base class for sharded data sources.

pytext.data.sources.data_source.generator_property

alias of pytext.data.sources.data_source.GeneratorMethodProperty

pytext.data.sources.data_source.load_float(f)[source]
pytext.data.sources.data_source.load_float_list(s)[source]
pytext.data.sources.data_source.load_int(x)[source]
pytext.data.sources.data_source.load_json(s)[source]
pytext.data.sources.data_source.load_json_string(s)[source]
pytext.data.sources.data_source.load_slots(s)[source]
pytext.data.sources.data_source.load_text(s)[source]
pytext.data.sources.pandas module
class pytext.data.sources.pandas.PandasDataSource(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from a pandas DataFrame.

Inputs:

train_df: DataFrame for training

eval_df: DataFrame for evaluation

test_df: DataFrame for test

schema: same as base DataSource; defines the list of output values with their types

column_mapping: maps the column names in DataFrame to the name defined in schema

raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

static raw_generator(df: Optional[pandas.core.frame.DataFrame])[source]
raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

class pytext.data.sources.pandas.SessionPandasDataSource(schema: Dict[str, Type[CT_co]], id_col: str, train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, column_mapping: Dict[str, str] = ())[source]

Bases: pytext.data.sources.pandas.PandasDataSource, pytext.data.sources.session.SessionDataSource

pytext.data.sources.session module
class pytext.data.sources.session.SessionDataSource(id_col, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

Data source for session-based data. The input data is organized in sessions, and each session may have multiple rows. The first column is always the session id. Raw input rows are consolidated by session id and returned as one session per example.

merge_session(session)[source]
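
Consolidating rows by session id can be sketched with itertools.groupby. The sketch assumes that rows of one session are contiguous and that merging collects each remaining column into a list. Both are assumptions, and the helper names mirror the documented ones only loosely.

```python
import itertools


def merge_session(session_id, rows, id_col):
    """Collect the non-id columns of one session into lists."""
    merged = {id_col: session_id}
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                merged.setdefault(key, []).append(value)
    return merged


def iter_sessions(rows, id_col):
    # groupby only groups adjacent rows, hence the contiguity assumption.
    for session_id, group in itertools.groupby(rows, key=lambda r: r[id_col]):
        yield merge_session(session_id, group, id_col)


raw_rows = [
    {"session_id": "s1", "text": "hi"},
    {"session_id": "s1", "text": "book a table"},
    {"session_id": "s2", "text": "bye"},
]
sessions = list(iter_sessions(raw_rows, id_col="session_id"))
```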
pytext.data.sources.squad module
class pytext.data.sources.squad.SquadDataSource(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]

Bases: pytext.data.sources.data_source.DataSource

Data source for SQuAD; download the data from https://rajpurkar.github.io/SQuAD-explorer/. Returns tuples of (doc, question, answer, answer_start, has_answer).

DEFAULT_SCHEMA = {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>
classmethod from_config(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]
process_file(fname)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.squad.SquadDataSourceForKD(**kwargs)[source]

Bases: pytext.data.sources.squad.SquadDataSource

SQuAD-like data along with soft labels (logits). Returns tuples of (doc, question, answer, answer_start, has_answer, start_logits, end_logits, has_answer_logits, pad_mask, segment_labels).

process_file(fname)[source]
pytext.data.sources.squad.process_squad(fname, ignore_impossible, max_character_length, min_overlap=0.1, delimiter='\t', quoted=False, is_kd=False)[source]
pytext.data.sources.squad.process_squad_json(fname, ignore_impossible, max_character_length, min_overlap)[source]
pytext.data.sources.squad.process_squad_tsv(fname, ignore_impossible, max_character_length, min_overlap, delimiter, quoted)[source]
pytext.data.sources.squad.process_squad_tsv_for_kd(fname, ignore_impossible, max_character_length, min_overlap, delimiter, quoted)[source]
pytext.data.sources.tsv module
class pytext.data.sources.tsv.BlockShardedTSV(file, field_names=None, delimiter='\t', quoted=False, block_id=0, num_blocks=1, drop_incomplete_rows=False)[source]

Bases: object

Take a TSV file, split into N pieces (by byte location) and return an iterator on one of the pieces. The pieces are equal by byte size, not by number of rows. Thus, care needs to be taken when using this for distributed training, otherwise number of batches for different workers might be different.
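
The following sketch shows one way to split a file by byte location, and why the pieces can hold different numbers of rows. The helper iter_block_rows and the rule that a row belongs to the block containing its first byte are assumptions for illustration, not the PyText implementation.

```python
import io
from typing import BinaryIO, Iterator, List


def iter_block_rows(
    file: BinaryIO, block_id: int = 0, num_blocks: int = 1, delimiter: str = "\t"
) -> Iterator[List[str]]:
    """Yield the rows whose first byte falls inside block block_id."""
    file.seek(0, io.SEEK_END)
    size = file.tell()
    block_size = -(-size // num_blocks)  # ceiling division
    start = block_id * block_size
    end = min(start + block_size, size)

    if start == 0:
        file.seek(0)
    else:
        # Step back one byte and finish that line, so reading resumes at
        # the first line that begins at or after start.
        file.seek(start - 1)
        file.readline()

    while file.tell() < end:
        line = file.readline()
        if not line:
            break
        yield line.decode("utf-8").rstrip("\n").split(delimiter)


data = b"a\t1\nbb\t2\nccc\t3\ndddd\t4\n"
blocks = [
    list(iter_block_rows(io.BytesIO(data), block_id=i, num_blocks=2))
    for i in range(2)
]
```

The 22-byte input is cut at byte 11, so the first block yields three rows and the second only one. This is the imbalance that the warning about distributed training refers to.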

class pytext.data.sources.tsv.BlockShardedTSVDataSource(rank=0, world_size=1, **kwargs)[source]

Bases: pytext.data.sources.tsv.TSVDataSource, pytext.data.sources.data_source.ShardedDataSource

train_unsharded = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.tsv.MultilingualTSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', data_source_languages={'eval': ['en'], 'test': ['en'], 'train': ['en']}, language_columns=['language'], **kwargs)[source]

Bases: pytext.data.sources.tsv.TSVDataSource

Data Source for multi-lingual data. The input data can have multiple text fields and each field can either have the same language or different languages. The data_source_languages dict contains the language information for each text field and this should match the number of language identifiers specified in language_columns.

eval = <pytext.data.sources.data_source.GeneratorIterator object>
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.tsv.SessionTSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, **kwargs)[source]

Bases: pytext.data.sources.tsv.TSVDataSource, pytext.data.sources.session.SessionDataSource

class pytext.data.sources.tsv.TSV(file, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False)[source]

Bases: object

class pytext.data.sources.tsv.TSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from TSV sources. Uses python’s csv library.

classmethod from_config(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
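
As a rough illustration of reading TSV rows with Python’s csv library, the sketch below yields one dictionary per row. The helper read_tsv is hypothetical, and mapping quoted=False to csv.QUOTE_NONE is an assumption about how the quoted flag is interpreted.

```python
import csv
import io


def read_tsv(file, field_names, delimiter="\t", quoted=False):
    """Yield one dict per row, keyed by field_names."""
    reader = csv.DictReader(
        file,
        fieldnames=field_names,
        delimiter=delimiter,
        # Without quoting, double quotes are kept as ordinary characters.
        quoting=csv.QUOTE_MINIMAL if quoted else csv.QUOTE_NONE,
    )
    for row in reader:
        yield dict(row)


data = 'greeting\thello "world"\nfarewell\tbye\n'
rows = list(read_tsv(io.StringIO(data), ["label", "text"]))
```

Passing a file object instead of a filename keeps the reader easy to test with io.StringIO, which matches the reasoning given for SafeFileWrapper.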

Module contents
class pytext.data.sources.DataSource(schema: Dict[str, Type[CT_co]])[source]

Bases: pytext.config.component.Component

Data sources are simple components that stream data from somewhere using Python’s iteration interface. Each data source should expose three iterators: “train”, “test”, and “eval”. Each of these can be iterated over any number of times, and iterating over it yields dictionaries whose values are deserialized Python types.

Simply, these data sources exist as an interface to read through datasets in a pythonic way, with pythonic types, and abstract away the form that they are stored in.

eval = <pytext.data.sources.data_source.GeneratorIterator object>[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.RawExample[source]

Bases: dict

A wrapper class for a single example row with a dict interface. This is here for any logic we want row objects to have that dicts don’t do.

class pytext.data.sources.SquadDataSource(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]

Bases: pytext.data.sources.data_source.DataSource

Data source for SQuAD; download the data from https://rajpurkar.github.io/SQuAD-explorer/. Returns tuples of (doc, question, answer, answer_start, has_answer).

DEFAULT_SCHEMA = {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>
classmethod from_config(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]
process_file(fname)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.TSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from TSV sources. Uses python’s csv library.

classmethod from_config(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

class pytext.data.sources.PandasDataSource(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from a pandas DataFrame.

Inputs:

train_df: DataFrame for training

eval_df: DataFrame for evaluation

test_df: DataFrame for test

schema: same as base DataSource; defines the list of output values with their types

column_mapping: maps the column names in DataFrame to the name defined in schema

raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

static raw_generator(df: Optional[pandas.core.frame.DataFrame])[source]
raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

class pytext.data.sources.CoNLLUNERDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]

Bases: pytext.data.sources.conllu.CoNLLUPOSDataSource

Reads data in which examples are separated by empty lines and each line holds a word and its label. This data source supports datasets for NER tasks.

pytext.data.test package
Submodules
pytext.data.test.batch_sampler_test module
class pytext.data.test.batch_sampler_test.BatchSamplerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_alternate_prob_batch_sampler()[source]
test_eval_batch_sampler()[source]
test_prob_batch_sampler()[source]
test_round_robin_batch_sampler()[source]
pytext.data.test.data_test module
class pytext.data.test.data_test.BatcherTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_batcher()[source]
test_pooling_batcher()[source]
class pytext.data.test.data_test.DataTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_create_batches()[source]
test_create_batches_different_tensorizers()[source]
test_create_batches_with_cache()[source]
test_create_data_no_batcher_provided()[source]
test_data_initializes_tensorsizers()[source]
test_data_iterate_multiple_times()[source]
test_fp16_padding()[source]
test_sort()[source]
pytext.data.test.dynamic_pooling_batcher_test module
class pytext.data.test.dynamic_pooling_batcher_test.DynamicPoolingBatcherTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

end_of_scheduler()[source]
test_batch_size_greater_than_data()[source]
test_exponential_scheduler()[source]
test_linear_scheduler()[source]
test_step_size()[source]
pytext.data.test.pandas_data_source_test module
class pytext.data.test.pandas_data_source_test.PandasDataSourceTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_create_data_source()[source]
test_empty_data()[source]
pytext.data.test.round_robin_batchiterator_test module
class pytext.data.test.round_robin_batchiterator_test.RoundRobinBatchIteratorTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_batch_iterator()[source]
pytext.data.test.simple_featurizer_test module
class pytext.data.test.simple_featurizer_test.SimpleFeaturizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_convert_to_bytes()[source]
test_split_with_regex()[source]
test_tokenize()[source]
test_tokenize_add_sentence_markers()[source]
test_tokenize_dont_lowercase()[source]
pytext.data.test.tensorizers_test module
class pytext.data.test.tensorizers_test.BERTTensorizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_bert_pair_tensorizer()[source]
test_bert_tensorizer()[source]
class pytext.data.test.tensorizers_test.ListTensorizersTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_create_label_list_tensors()[source]
test_initialize_list_tensorizers()[source]
test_label_list_tensors_no_pad_in_vocab()[source]
class pytext.data.test.tensorizers_test.LookupTokensTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_lookup_tokens()[source]
class pytext.data.test.tensorizers_test.RobertaTensorizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_roberta_tensorizer()[source]
class pytext.data.test.tensorizers_test.SquadForBERTTensorizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_squad_tensorizer()[source]
class pytext.data.test.tensorizers_test.SquadForRobertaTensorizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_squad_roberta_tensorizer()[source]
class pytext.data.test.tensorizers_test.SquadTensorizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_initialize()[source]
test_numberize_with_alphanumeric()[source]
test_numberize_with_wordpiece()[source]
test_tsv_numberize_with_alphanumeric()[source]
class pytext.data.test.tensorizers_test.TensorizersTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_annotation_num()[source]
test_byte_tensors_error_code()[source]
test_create_byte_tensors()[source]
test_create_byte_token_tensors()[source]
test_create_float_list_tensor()[source]
test_create_label_tensors()[source]
test_create_normalized_float_list_tensor()[source]
test_create_word_tensors()[source]
test_float_list_tensor_prepare_input()[source]
test_gazetteer_tensor()[source]
test_gazetteer_tensor_bad_json()[source]
test_initialize_label_tensorizer()[source]
test_initialize_tensorizers()[source]
test_initialize_token_tensorizer()[source]
test_seq_tensor()[source]
test_seq_tensor_with_bos_eos_eol_bol()[source]
pytext.data.test.tokenizers_test module
class pytext.data.test.tokenizers_test.GPT2BPETest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_gpt2_bpe_tokenizer()[source]
class pytext.data.test.tokenizers_test.SentencePieceTokenizerTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_tokenize()[source]
class pytext.data.test.tokenizers_test.TokenizeTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_split_with_regex()[source]
test_tokenize()[source]
test_tokenize_dont_lowercase()[source]
pytext.data.test.tsv_data_source_test module
class pytext.data.test.tsv_data_source_test.BlockShardedTSVDataSourceTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_bad_quoting()[source]

The text column of the first row of this file opens a quote but does not close it.

test_quoting()[source]

The text column of the first row of this file opens a quote but does not close it.

class pytext.data.test.tsv_data_source_test.SessionTSVDataSourceTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_read_session_data()[source]
class pytext.data.test.tsv_data_source_test.TSVDataSourceTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_bad_quoting()[source]

The text column of the first row of this file opens a quote but does not close it.

test_csv()[source]
test_iterate_training_data_multiple_times()[source]
test_quoting()[source]

The text column of the first row of this file opens a quote but does not close it.

test_read_data_source()[source]
test_read_data_source_with_column_remapping()[source]
test_read_data_source_with_utf8_issues()[source]
test_read_eval_data_source()[source]
test_read_test_data_source()[source]
pytext.data.test.utils_test module
class pytext.data.test.utils_test.PaddingTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

testPadding()[source]
testPaddingProvideShape()[source]
class pytext.data.test.utils_test.TargetTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_align_target_label()[source]
class pytext.data.test.utils_test.VocabularyTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

testBuildVocabulary()[source]
Module contents
pytext.data.tokenizers package
Submodules
pytext.data.tokenizers.tokenizer module
class pytext.data.tokenizers.tokenizer.BERTInitialTokenizer(basic_tokenizer)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Basic initial tokenization for BERT. This runs prior to word piece tokenization and performs whitespace tokenization, plus lower-casing and accent removal if specified.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.BERTInitialTokenizer.Config)[source]
tokenize(text)[source]

Tokenizes a piece of text.

class pytext.data.tokenizers.tokenizer.CppProcessorMixin[source]

Bases: object

Cpp processors like SentencePiece don’t pickle well; reload them.
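
The reload-on-unpickle idea can be sketched as follows. This only illustrates the general `__getstate__`/`__setstate__` pattern and is not PyText's implementation: the class name `ReloadingProcessor`, its `model_path` attribute and the `_load_processor` helper are hypothetical, and a lock stands in for a native object that cannot be pickled.

```python
import threading


class ReloadingProcessor:
    """Hypothetical holder of a native resource that cannot be pickled."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.processor = self._load_processor()

    def _load_processor(self):
        # Stand-in for loading a C++-backed object; locks cannot be pickled.
        return threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable member and keep only what is needed to rebuild it.
        state = self.__dict__.copy()
        del state["processor"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then reload the native resource.
        self.__dict__.update(state)
        self.processor = self._load_processor()
```

With these two hooks in place, the pickled state only carries the path, and the resource is recreated whenever the object is restored.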

class pytext.data.tokenizers.tokenizer.DoNothingTokenizer[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where tokenization is run beforehand.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.DoNothingTokenizer.Config)[source]
tokenize(input: List[str]) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
class pytext.data.tokenizers.tokenizer.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer for gpt-2 and RoBERTa.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.GPT2BPETokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
class pytext.data.tokenizers.tokenizer.PickleableGPT2BPEEncoder(encoder, bpe_merges, errors='replace')[source]

Bases: fairseq.data.encoders.gpt2_bpe_utils.Encoder

Fairseq’s encoder stores the regex module as a local reference on its encoders, which means they can’t be saved via pickle.dumps or torch.save. This subclass modifies the save/load logic so that the module is not stored, and restores the reference after re-inflating.

class pytext.data.tokenizers.tokenizer.SentencePieceTokenizer(sp_model_path: str = '')[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin

Sentence piece tokenizer.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.SentencePieceTokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
class pytext.data.tokenizers.tokenizer.Token(value, start, end)[source]

Bases: tuple

end

Alias for field number 2

start

Alias for field number 1

value

Alias for field number 0

class pytext.data.tokenizers.tokenizer.Tokenizer(split_regex='\s+', lowercase=True)[source]

Bases: pytext.config.component.Component

A simple regex-splitting tokenizer.
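
A minimal standalone sketch of regex-splitting tokenization is shown below. It does not import PyText and may differ from the library's behaviour in details; the function name `regex_tokenize` is hypothetical, and the local `Token` tuple merely mirrors the documented (value, start, end) field order.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    value: str  # field 0
    start: int  # field 1
    end: int    # field 2


def regex_tokenize(text: str, split_regex: str = r"\s+", lowercase: bool = True) -> List[Token]:
    """Split text on split_regex and record each piece's character offsets."""
    tokens = []
    start = 0
    for match in re.finditer(split_regex, text):
        if match.start() > start:
            piece = text[start:match.start()]
            tokens.append(Token(piece.lower() if lowercase else piece, start, match.start()))
        start = match.end()
    if start < len(text):
        # Trailing piece after the last separator.
        piece = text[start:]
        tokens.append(Token(piece.lower() if lowercase else piece, start, len(text)))
    return tokens
```

The offsets refer to the original string, so `text[token.start:token.end]` recovers the pre-lowercasing form of each token.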

classmethod from_config(config: pytext.data.tokenizers.tokenizer.Tokenizer.Config)[source]
tokenize(input: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
class pytext.data.tokenizers.tokenizer.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Word piece tokenizer for BERT models.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.WordPieceTokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
Module contents
class pytext.data.tokenizers.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer for gpt-2 and RoBERTa.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.GPT2BPETokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
class pytext.data.tokenizers.Token(value, start, end)[source]

Bases: tuple

end

Alias for field number 2

start

Alias for field number 1

value

Alias for field number 0

class pytext.data.tokenizers.Tokenizer(split_regex='\s+', lowercase=True)[source]

Bases: pytext.config.component.Component

A simple regex-splitting tokenizer.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.Tokenizer.Config)[source]
tokenize(input: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
class pytext.data.tokenizers.DoNothingTokenizer[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where tokenization is run beforehand.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.DoNothingTokenizer.Config)[source]
tokenize(input: List[str]) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
class pytext.data.tokenizers.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Word piece tokenizer for BERT models.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.WordPieceTokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
class pytext.data.tokenizers.CppProcessorMixin[source]

Bases: object

Cpp processors like SentencePiece don’t pickle well; reload them.

class pytext.data.tokenizers.SentencePieceTokenizer(sp_model_path: str = '')[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin

Sentence piece tokenizer.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.SentencePieceTokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
Submodules
pytext.data.batch_sampler module
class pytext.data.batch_sampler.AlternatingRandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.RandomizedBatchSampler

This sampler takes in a dictionary of iterators and returns batches, alternating between the keys and probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.AlternatingRandomizedBatchSampler.Config)[source]
class pytext.data.batch_sampler.BaseBatchSampler[source]

Bases: pytext.config.component.Component

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.config.component.Component.Config)[source]
class pytext.data.batch_sampler.EvalBatchSampler[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

Output: [A, B, C, D, a, b]

batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key.

Parameters:iterators – Dictionary of iterators
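
The sequential behaviour described above can be sketched in a few lines. This is an illustration rather than the library's code; the function name `sequential_batches` is hypothetical, and yielding `(key, batch)` pairs is an assumption about the output shape.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sequential_batches(iterators: Dict[str, Iterable]) -> Iterator[Tuple[str, object]]:
    """Exhaust each iterator in turn, so every batch is seen exactly once."""
    for name, iterator in iterators.items():
        for batch in iterator:
            yield name, batch
```

Applied to the two iterators from the example, the batches come out in the order A, B, C, D, a, b.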
class pytext.data.batch_sampler.RandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of iterators and returns batches according to the specified probabilities by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.

Example

Iterator A: [A, B, C, D], Iterator B: [a, b]

batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1} Epoch 1 = [a, b, a] Epoch 2 = [b, a, b]

Parameters:unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
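
A rough sketch of this sampling scheme, under stated assumptions, is given below. It is not PyText's implementation: the names `randomized_batches` and `take_epoch` are hypothetical, the inputs are assumed to be non-empty and re-iterable (for example lists), and a seeded `random.Random` is used only to keep the sketch reproducible.

```python
import itertools
import random
from typing import Dict, Iterator, List, Tuple


def randomized_batches(
    iterables: Dict[str, List],
    unnormalized_probs: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Endlessly yield (name, batch), picking the source by weight each step."""
    rng = random.Random(seed)
    names = list(iterables)
    # random.choices accepts unnormalized weights, so no explicit division is needed.
    weights = [unnormalized_probs[name] for name in names]
    live = {name: iter(iterables[name]) for name in names}
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            batch = next(live[name])
        except StopIteration:
            # Restart a source that ran out (assumes it is non-empty).
            live[name] = iter(iterables[name])
            batch = next(live[name])
        yield name, batch


def take_epoch(stream: Iterator, batches_per_epoch: int) -> list:
    """Cut one epoch of a fixed number of batches from the endless stream."""
    return list(itertools.islice(stream, batches_per_epoch))
```

With weights `{"A": 0, "B": 1}` only iterator B is ever chosen, so two consecutive epochs of three batches reproduce the `[a, b, a]` and `[b, a, b]` sequences from the example.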
batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.RandomizedBatchSampler.Config)[source]
class pytext.data.batch_sampler.RoundRobinBatchSampler(iter_to_set_epoch: Optional[str] = None)[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes a dictionary of Iterators and returns batches in a round robin fashion until the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.

If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.

If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]

iter_to_set_epoch = None Output: [A, a, B, b]

Parameters:iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
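
The sketch below reproduces the two example outputs above. It is an illustration only: the function name `round_robin_batches` is hypothetical, inputs are assumed to be non-empty and re-iterable, and dropping a partially collected round at the end of the epoch is a choice made here to match the documented outputs, so the library may treat that case differently.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def round_robin_batches(
    iterables: Dict[str, List],
    iter_to_set_epoch: Optional[str] = None,
) -> Iterator[Tuple[str, object]]:
    """Yield (name, batch) in round-robin order until the epoch ends."""
    live = {name: iter(it) for name, it in iterables.items()}
    while True:
        round_batches = []
        for name in iterables:
            try:
                batch = next(live[name])
            except StopIteration:
                if iter_to_set_epoch is None or name == iter_to_set_epoch:
                    # The epoch-defining iterator ran out: stop here and
                    # discard the partially collected round.
                    return
                # Any other iterator is restarted so it never runs out.
                live[name] = iter(iterables[name])
                batch = next(live[name])
            round_batches.append((name, batch))
        yield from round_batches
```

Collecting a full round before yielding keeps every emitted round complete, which is what makes the `None` case end after `[A, a, B, b]`.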
batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key until the target iterator reaches its end.

Parameters:iterators – Dictionary of iterators
classmethod from_config(config: pytext.data.batch_sampler.RoundRobinBatchSampler.Config)[source]
pytext.data.batch_sampler.extract_iterator_properties(input_iterator_probs: Dict[str, float])[source]

Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to generate iterator properties: iterator_names and iterator_probs.

pytext.data.batch_sampler.select_key_and_batch(iterator_names: Dict[str, str], iterator_probs: Dict[str, float], iter_dict: Dict[str, collections.abc.Iterator], iterators: Dict[str, collections.abc.Iterator])[source]

Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to select a key from iterator_names using iterator_probs and return a batch for the selected key using iter_dict and iterators.

pytext.data.bert_tensorizer module
class pytext.data.bert_tensorizer.BERTTensorizer(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBase

Tensorizer for BERT tasks. Works for single sentence, sentence pair, triples etc.

classmethod from_config(config: pytext.data.bert_tensorizer.BERTTensorizer.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).

class pytext.data.bert_tensorizer.BERTTensorizerBase(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, base_tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]

Bases: pytext.data.tensorizers.Tensorizer

Base Tensorizer class for all BERT style models including XLM, RoBERTa and XLM-R.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

initialize(vocab_builder=None, from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.
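
To show how such a coroutine is driven from the outside, here is a small self-contained sketch. The `VocabCollector` class and `run_initialize` helper are hypothetical and do not come from PyText; they only demonstrate the prime / send / close protocol that the pattern above relies on.

```python
from typing import Dict, Iterable, List


class VocabCollector:
    """Hypothetical component that builds a vocabulary in one pass."""

    def __init__(self, column: str = "text"):
        self.column = column
        self.vocab: List[str] = []

    def initialize(self):
        seen = set()
        try:
            while True:
                # Each row is sent in by the caller; nothing is stored.
                row = yield
                seen.update(row[self.column].split())
        except GeneratorExit:
            # Raised by close(): finalize and set instance variables.
            self.vocab = sorted(seen)


def run_initialize(collector: VocabCollector, rows: Iterable[Dict[str, str]]) -> None:
    coroutine = collector.initialize()
    next(coroutine)          # advance to the first `yield`
    for row in rows:
        coroutine.send(row)  # hand over one row at a time
    coroutine.close()        # triggers GeneratorExit inside initialize()
```

Because the rows are pushed in one at a time, the data source is read once and never has to be held in memory as a whole.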

numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

sort_key(row)[source]
tensorize(batch) → Tuple[torch.Tensor, ...][source]

Convert instance level vectors into batch level tensors.

tensorizer_script_impl = None
class pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]

Bases: pytext.data.tensorizers.TensorizerScriptImpl

forward(texts: Optional[List[List[str]]] = None, pre_tokenized: Optional[List[List[List[str]]]] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Wire up the tokenize(), numberize() and tensorize() functions for data processing. When exporting to TorchScript, the wrapper module should choose between texts and pre_tokenized based on the TorchScript tokenizer implementation (e.g. whether an external tokenizer such as Yoda is used).

numberize(per_sentence_tokens: List[List[Tuple[str, int, int]]]) → Tuple[List[int], List[int], int, List[int]][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

Parameters:per_sentence_tokens – list of tokens per sentence in one row; each token is represented by its token string, start and end indices.
Returns:

  • tokens: List[int], a list of token ids, concatenating the token ids of all sentences.
  • segment_labels: List[int], denotes which sentence each token belongs to.
  • seq_len: int, the number of tokens.
  • positions: List[int], token positions.

tensorize(tokens_2d: List[List[int]], segment_labels_2d: List[List[int]], seq_lens_1d: List[int], positions_2d: List[List[int]]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Convert instance level vectors into batch level tensors.

tokenize(row_text: Optional[List[str]], row_pre_tokenized: Optional[List[List[str]]]) → List[List[Tuple[str, int, int]]][source]

This function converts raw inputs into tokens; each token is represented by token(str), start and end indices in the raw inputs. There are two possible inputs to this function, depending on whether the tokenizer is implemented in TorchScript.

Case 1: The tokenizer has a full TorchScript implementation. The input will be a list of sentences (in most cases a single sentence or a pair).

Case 2: The tokenizer has a partial or no TorchScript implementation. In most cases the tokenizer will be hosted in Yoda, and the input will be a list of pre-processed tokens.

Returns:tokens per sentence; each token is represented by token(str), start and end indices.
Return type:per_sentence_tokens
torchscriptify()[source]
class pytext.data.bert_tensorizer.BERTTensorizerScriptImpl(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl

pytext.data.bert_tensorizer.build_fairseq_vocab(vocab_file: str, dictionary_class: fairseq.data.dictionary.Dictionary = <class 'fairseq.data.dictionary.Dictionary'>, special_token_replacements: Dict[str, pytext.data.utils.SpecialToken] = None, max_vocab: int = -1, min_count: int = -1) → pytext.data.utils.Vocabulary[source]

Function builds a PyText vocabulary for models pre-trained using Fairseq modules. The dictionary class can take any Fairseq Dictionary class and is used to load the vocab file.

pytext.data.data module
class pytext.data.data.BatchData(raw_data, numberized)[source]

Bases: tuple

numberized

Alias for field number 1

raw_data

Alias for field number 0

class pytext.data.data.Batcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]

Bases: pytext.config.component.Component

Batcher designed to batch rows of data, before padding.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

Group rows by batch_size. Assume iterable of dicts, yield dict of lists. The last batch will be of length len(iterable) % batch_size.
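
The grouping described above can be sketched as follows. This is a simplified illustration, not the library's code: the names `zip_rows` and `batchify_rows` are hypothetical, and all rows are assumed to share the same keys.

```python
import itertools
from typing import Dict, Iterable, Iterator, List


def zip_rows(rows: List[Dict]) -> Dict[str, List]:
    """Turn a list of dicts into a dict of lists (rows must share keys)."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def batchify_rows(rows: Iterable[Dict], batch_size: int) -> Iterator[Dict[str, List]]:
    """Group rows into chunks of batch_size; the last chunk may be shorter."""
    iterator = iter(rows)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield zip_rows(chunk)
```

For five rows and a batch size of two, this yields batches of length 2, 2 and 1.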

classmethod from_config(config: pytext.data.data.Batcher.Config)[source]
class pytext.data.data.Data(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]

Bases: pytext.config.component.Component

Data is an abstraction that handles all of the following:

  • Initialize model metadata parameters
  • Create batches of tensors for model training or prediction

It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.

The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.

add_row_indices(rows)[source]
batches(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]

Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, ie. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.

stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data source; this allows setting a different data source for testing a model.

Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.

cache(numberized_rows, stage)[source]
classmethod from_config(config: pytext.data.data.Data.Config, schema: Dict[str, Type[CT_co]], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], rank=0, world_size=1, init_tensorizers=True, **kwargs)[source]
numberize_rows(rows)[source]
class pytext.data.data.PoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1)[source]

Bases: pytext.data.data.Batcher

Batcher that loads a pool of data, sorts it, and batches it.

Shuffling is performed before pooling, by loading num_shuffled_pools worth of data, shuffling, and then splitting that up into pools.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
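
The four steps above can be sketched as a single generator. This is a simplified illustration under stated assumptions, not PyText's implementation: the function name `pooled_batches`, the `shuffle` flag and the seeded random generator are hypothetical, and rows are yielded as plain lists rather than dicts of lists.

```python
import itertools
import random
from typing import Callable, Iterable, Iterator, List, Optional


def pooled_batches(
    rows: Iterable,
    batch_size: int,
    pool_num_batches: int,
    num_shuffled_pools: int = 1,
    sort_key: Optional[Callable] = None,
    shuffle: bool = True,
    seed: int = 0,
) -> Iterator[List]:
    rng = random.Random(seed)
    pool_size = batch_size * pool_num_batches
    iterator = iter(rows)
    while True:
        # 1. Load several pools' worth of rows and shuffle them together.
        loaded = list(itertools.islice(iterator, pool_size * num_shuffled_pools))
        if not loaded:
            return
        if shuffle:
            rng.shuffle(loaded)
        # 2. Split the loaded rows into pools of batch_size * pool_num_batches.
        for pool_start in range(0, len(loaded), pool_size):
            pool = loaded[pool_start:pool_start + pool_size]
            # 3. Sort rows within the pool, if a sort key is given.
            if sort_key is not None:
                pool.sort(key=sort_key)
            batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
            # 4. Shuffle the order in which the batches are returned.
            if shuffle:
                rng.shuffle(batches)
            yield from batches
```

Sorting inside a pool groups rows of similar length into the same batch, while shuffling the batch order keeps training from always seeing short examples first.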
classmethod from_config(config: pytext.data.data.PoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
class pytext.data.data.RowData(raw_data, numberized)[source]

Bases: tuple

numberized

Alias for field number 1

raw_data

Alias for field number 0

pytext.data.data.generator_iterator(fn)[source]

Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments, and calling the generator with them anew each time __iter__ is called on the returned object.
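
The idea can be sketched with a small decorator. The names `ReusableGenerator`, `reusable` and `count_up` below are hypothetical stand-ins chosen for this illustration and are not the library's own classes.

```python
import functools


class ReusableGenerator:
    """Remembers a generator function and its call arguments."""

    def __init__(self, fn, args, kwargs):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # Call the generator function afresh on every iteration.
        return self.fn(*self.args, **self.kwargs)


def reusable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return ReusableGenerator(fn, args, kwargs)

    return wrapper


@reusable
def count_up(limit):
    for i in range(limit):
        yield i
```

A plain generator would be exhausted after the first loop; the wrapped object can be looped over repeatedly and produces the same items each time.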

pytext.data.data.pad_and_tensorize_batches(tensorizers, batches)[source]
pytext.data.data.zip_dicts(dicts)[source]
pytext.data.data_handler module
class pytext.data.data_handler.BatchIterator(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]

Bases: object

BatchIterator is a wrapper of TorchText.Iterator that provides flexibility to map batched data to a tuple of (input, target, context), and to perform additional steps such as dealing with distributed training.

Parameters:
  • batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and return a batch of data in __next__
  • processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
  • include_input (bool) – if input data should be returned, default is true
  • include_target (bool) – if target data should be returned, default is true
  • include_context (bool) – if context data should be returned, default is true
  • is_train (bool) – if the batch data is for training
  • num_batches (int) – total batches to generate. This parameter is for distributed training: PyTorch’s distributed training backend requires all parallel workers to have the same number of batches, so we work around this limitation by adding dummy batches at the end.
class pytext.data.data_handler.CommonMetadata[source]

Bases: object

class pytext.data.data_handler.DataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]

Bases: pytext.config.component.Component

DataHandler is the central place to prepare data for model training/testing. The class is responsible for:

  • Defining the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
  • Initializing global context, such as building the vocab and loading pretrained embeddings, storing the context as metadata, and providing functions to serialize/deserialize the metadata.

The data processing pipeline contains the following steps:

  • Read data from file into a list of raw data examples
  • Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
    • Invoke the featurizer, which contains data processing steps to apply at both training and inference time, e.g. tokenization
    • Use the raw data and results from the featurizer to do any preprocessing
  • Generate a TorchText.Dataset that contains the list of Examples. The Dataset also has a list of TorchText.Field, which defines how to do padding and numericalization while batching data.
  • Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch and _context_from_batch functions.
raw_columns

columns to read from data source. The order should match the data stored in that file.

Type:List[str]
featurizer

perform data preprocessing that should be shared between training and inference

Type:Featurizer
features

a dict of name -> field that is used to process data as model input

Type:Dict[str, Field]
labels

a dict of name -> field that is used to process data as training target

Type:Dict[str, Field]
extra_fields

fields that process any extra data used neither as model input nor target. This is None by default

Type:Dict[str, Field]
text_feature_name

name of the text field, used to define the default sort key of data

Type:str
shuffle

if the dataset should be shuffled, true by default

Type:bool
sort_within_batch

if data within same batch should be sorted, true by default

Type:bool
train_path

path of training data file

Type:str
eval_path

path of evaluation data file

Type:str
test_path

path of test data file

Type:str
train_batch_size

training batch size, 128 by default

Type:int
eval_batch_size

evaluation batch size, 128 by default

Type:int
test_batch_size

test batch size, 128 by default

Type:int
max_seq_len

maximum length of tokens to keep in sequence

Type:int
pass_index

if the original index of data in the batch should be passed along to downstream steps, default is true

Type:bool
gen_dataset(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.data.dataset.Dataset[source]

Generate a torchtext Dataset from raw in-memory data.

Returns:dataset (TorchText.Dataset)

gen_dataset_from_path(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.data.dataset.Dataset[source]

Generate a dataset from a file.

Returns:dataset (TorchText.Dataset)

get_eval_iter()[source]
get_predict_iter(data: Iterable[Dict[str, Any]], batch_size: Optional[int] = None)[source]
get_test_iter()[source]
get_test_iter_from_path(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_test_iter_from_raw_data(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1)[source]
get_train_iter_from_path(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]

Generate data batch iterator for training data. See _get_train_iter() for details

Parameters:
  • train_path (str) – file path of training data
  • batch_size (int) – batch size
  • rank (int) – used for distributed training, the rank of the current GPU; don’t set it to anything but 0 for non-distributed training
  • world_size (int) – used for distributed training, total number of GPUs
get_train_iter_from_raw_data(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
init_feature_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
init_metadata()[source]

Initialize metadata using data from configured path

init_metadata_from_path(train_path, eval_path, test_path)[source]

Initialize metadata using data from file

init_metadata_from_raw_data(*data)[source]

Initialize metadata using in memory data

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
load_metadata(metadata: pytext.data.data_handler.CommonMetadata)[source]

Load previously saved metadata

load_vocab(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]

Loads items into a set from a file containing one item per line. Items are added to the set from top of the file to bottom. So, the items in the file should be ordered by a preference (if any), e.g., it makes sense to order tokens in descending order of frequency in corpus.

Parameters:
  • vocab_file (str) – vocab file to load
  • vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
  • lowercase_tokens (bool) – if the tokens should be lowercased
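
A minimal sketch of this loading logic is shown below. It is not the method's actual code: the function name `load_vocab_items` is hypothetical, it reads from any iterable of lines instead of opening `vocab_file`, and stopping once the set holds `vocab_size` distinct items is an assumption about how the limit is applied.

```python
import io
from typing import Iterable, Set


def load_vocab_items(lines: Iterable[str], vocab_size: int, lowercase_tokens: bool = False) -> Set[str]:
    """Collect one item per line, top to bottom, up to vocab_size items."""
    vocab: Set[str] = set()
    for line in lines:
        if len(vocab) == vocab_size:
            break
        token = line.strip()
        if not token:
            continue  # skip blank lines
        if lowercase_tokens:
            token = token.lower()
        vocab.add(token)
    return vocab
```

Because items are taken from the top of the file, ordering the file by descending frequency means the cut-off keeps the most frequent tokens.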
metadata_to_save()[source]

Save metadata, pretrained_embeds_weight should be excluded

preprocess(data: Iterable[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example. This is the second step in the whole processing pipeline.

Returns:data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocessing steps for a single input row; subclasses should override it.

read_from_file(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]

Read data from a CSV file. The input file must contain tab-separated columns.

Parameters:
  • file_name (str) – csv file name
  • columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
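
The two accepted forms of columns_to_use can be illustrated with the standard csv module. This is a sketch, not the method's implementation: the function name `read_tsv_rows` is hypothetical, it reads from an iterable of lines rather than a file name, and disabling quote handling and filling missing columns with an empty string are assumptions made here.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, Union


def read_tsv_rows(
    lines: Iterable[str],
    columns_to_use: Union[Dict[str, int], List[str]],
) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-separated line, keyed by column name."""
    if isinstance(columns_to_use, dict):
        # Mapping of column name -> column index in the file.
        column_index = dict(columns_to_use)
    else:
        # A list of names is taken to follow the file's column order.
        column_index = {name: i for i, name in enumerate(columns_to_use)}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield {
            name: (fields[i] if i < len(fields) else "")
            for name, i in column_index.items()
        }
```

Passing a dict makes it possible to pick out only some columns, for example `{"text": 1}` to read just the second column.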
sort_key(example: torchtext.data.example.Example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

pytext.data.disjoint_multitask_data module
class pytext.data.disjoint_multitask_data.DisjointMultitaskData(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]

Bases: pytext.data.data.Data

Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskData.Config.
  • data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_dict

Data handlers to do roundrobin over.

Type:type
batches(stage: pytext.common.constants.Stage, data_source=None)[source]

Yield batches from each task, sampled according to a given sampler. This batcher additionally exposes a task name in the batch to allow the model to filter examples to the appropriate tasks.

classmethod from_config(config: pytext.data.disjoint_multitask_data.DisjointMultitaskData.Config, data_dict: Dict[str, pytext.data.data.Data], task_key: str = 'task_name', rank=0, world_size=1, init_tensorizers=True)[source]
pytext.data.disjoint_multitask_data_handler module
class pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
  • data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
  • target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_handlers

Data handlers to do roundrobin over.

Type:type
target_task_name

Used to select best epoch, and set batch_per_epoch.

Type:type
upsample

If upsample, keep cycling over each iterator in round-robin. Iterators with fewer batches will get more passes. If False, we do a single pass over each iterator; the ones which run out will sit idle. This is used for evaluation. Default True.

Type:bool
get_eval_iter() → pytext.data.data_handler.BatchIterator[source]
get_test_iter() → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1) → Tuple[pytext.data.data_handler.BatchIterator, ...][source]
init_metadata()[source]

Initialize metadata using data from configured path

load_metadata(metadata)[source]

Load previously saved metadata

metadata_to_save()[source]

Save metadata; pretrained_embeds_weight should be excluded.

class pytext.data.disjoint_multitask_data_handler.RoundRobinBatchIterator(iterators: Dict[str, pytext.data.data_handler.BatchIterator], upsample: bool = True, iter_to_set_epoch: Optional[str] = None)[source]

Bases: pytext.data.data_handler.BatchIterator

We take a dictionary of BatchIterators and do round robin over them in a cycle. The following describes the behavior for one epoch, using this example:

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

If upsample is True:

If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.

iter_to_set_epoch = “Iterator 1”

Output: [A, a, B, b, C, a, D, b]

If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.

Output: [A, a, B, b]

If upsample is False:

Iterate over batches from one epoch of each iterator, with the order among iterators uniformly shuffled.

Possible output: [a, A, B, C, b, D]
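The three cases described here can be reproduced with a small plain-Python sketch. It works on lists of batch names rather than real BatchIterator objects, and round_robin_epoch and its seed parameter are hypothetical names introduced for illustration only.

```python
import itertools
import random
from typing import Dict, List, Optional


def round_robin_epoch(
    iterators: Dict[str, List[str]],
    upsample: bool = True,
    iter_to_set_epoch: Optional[str] = None,
    seed: int = 0,
) -> List[str]:
    """Return the batches of one epoch (sketch, not PyText code)."""
    if not upsample:
        # One slot per batch, labeled by its iterator. Shuffling the slots
        # interleaves the iterators while keeping each one's internal order.
        slots = [name for name, batches in iterators.items() for _ in batches]
        random.Random(seed).shuffle(slots)
        sources = {name: iter(batches) for name, batches in iterators.items()}
        return [next(sources[name]) for name in slots]

    # Epoch length: the target iterator if given, else the shortest iterator.
    if iter_to_set_epoch is not None:
        rounds = len(iterators[iter_to_set_epoch])
    else:
        rounds = min(len(batches) for batches in iterators.values())
    cycles = {name: itertools.cycle(b) for name, b in iterators.items()}
    return [next(cycles[name]) for _ in range(rounds) for name in iterators]


iterators = {"Iterator 1": list("ABCD"), "Iterator 2": list("ab")}
with_target = round_robin_epoch(iterators, iter_to_set_epoch="Iterator 1")
without_target = round_robin_epoch(iterators)
single_pass = round_robin_epoch(iterators, upsample=False)
```

With the inputs above, with_target and without_target match the two upsampled outputs listed in this section. single_pass is one random interleaving that contains every batch exactly once.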

Parameters:
  • iterators (Dict[str, BatchIterator]) – Iterators to do roundrobin over.
  • upsample (bool) – If upsample, keep cycling over each iterator in round-robin. Iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, in random order. Evaluation will use upsample=False. Default True.
  • iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If upsample is True and this is not set, epoch size defaults to the length of the shortest iterator. If upsample is False, this argument is not used.
iterators

Iterators to do roundrobin over.

Type:Dict[str, BatchIterator]
upsample

Whether to upsample iterators with fewer batches.

Type:bool
iter_to_set_epoch

Name of iterator to define epoch size.

Type:str
classmethod cycle(iterator)[source]
pytext.data.dynamic_pooling_batcher module
class pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

end_batch_size = 256
epoch_period = 10
start_batch_size = 32
step_size = 1
class pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]

Bases: pytext.data.data.PoolingBatcher

Allows dynamic batch training. Extends the pooling batcher with a scheduler config, which specifies how the batch size should increase.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
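Steps 2 to 4 above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: it omits step 1 (the num_shuffled_pools shuffle), uses a fixed batch size, and pooled_batches with its shuffle and seed parameters is a hypothetical name, not part of PyText.

```python
import itertools
import random
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


def pooled_batches(
    rows: Iterable[Dict[str, Any]],
    batch_size: int,
    pool_num_batches: int,
    sort_key: Optional[Callable[[Dict[str, Any]], Any]] = None,
    shuffle: bool = True,
    seed: int = 0,
) -> Iterator[Dict[str, List[Any]]]:
    """From an iterable of dicts, yield dicts of lists (sketch)."""
    rng = random.Random(seed)
    rows = iter(rows)
    pool_size = batch_size * pool_num_batches
    while True:
        # Step 2: load one pool of examples.
        pool = list(itertools.islice(rows, pool_size))
        if not pool:
            return
        # Step 3: sort rows inside the pool, if a key is given.
        if sort_key is not None:
            pool.sort(key=sort_key)
        batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
        # Step 4: shuffle the order in which batches are returned.
        if shuffle:
            rng.shuffle(batches)
        for batch in batches:
            yield {key: [row[key] for row in batch] for key in batch[0]}


rows = [{"text": "x" * n} for n in (5, 1, 4, 2, 3, 6, 7)]
batches = list(
    pooled_batches(rows, batch_size=2, pool_num_batches=2,
                   sort_key=lambda row: len(row["text"]), shuffle=False)
)
lengths = [[len(t) for t in batch["text"]] for batch in batches]
```

Sorting within a pool groups rows of similar length into the same batch, while shuffling whole batches keeps the training order from being fully length-sorted.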
compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]
finished_dynamic() → bool[source]
classmethod from_config(config: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
step_epoch()[source]
class pytext.data.dynamic_pooling_batcher.ExponentialBatcherSchedulerConfig(**kwargs)[source]

Bases: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig

gamma = 5
class pytext.data.dynamic_pooling_batcher.ExponentialDynamicPoolingBatcher(*args, **kwargs)[source]

Bases: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher

Exponential Dynamic Batch Scheduler: scales up batch size by a factor of gamma

compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.ExponentialBatcherSchedulerConfig, curr_steps: int) → int[source]
finished_dynamic() → bool[source]
get_max_steps()[source]
class pytext.data.dynamic_pooling_batcher.LinearDynamicPoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]

Bases: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher

Linear Dynamic Batch Scheduler: scales up batch size linearly

compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]
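To make the difference between the two schedulers concrete, here is a hedged sketch of one plausible reading of "linearly" and "by a factor of gamma". The formulas are assumptions chosen for illustration and are not copied from PyText. The default arguments mirror the BatcherSchedulerConfig and ExponentialBatcherSchedulerConfig values listed above, and both function names are hypothetical.

```python
import math


def linear_batch_size(curr_steps, start_batch_size=32, end_batch_size=256,
                      epoch_period=10, step_size=1):
    """Assumed linear schedule: interpolate from start to end over the period."""
    total_steps = epoch_period / step_size
    estimate = start_batch_size + (end_batch_size - start_batch_size) * curr_steps / total_steps
    return min(math.ceil(estimate), end_batch_size)


def exponential_batch_size(curr_steps, start_batch_size=32, end_batch_size=256,
                           gamma=5):
    """Assumed exponential schedule: multiply by gamma each step, capped at end."""
    return min(math.ceil(start_batch_size * gamma ** curr_steps), end_batch_size)


linear = [linear_batch_size(step) for step in (0, 5, 10)]
exponential = [exponential_batch_size(step) for step in (0, 1, 2)]
```

Under these assumptions the linear schedule reaches end_batch_size only at the end of the period, whereas the exponential one hits the cap after two steps with the default gamma.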
pytext.data.packed_lm_data module
class pytext.data.packed_lm_data.PackedLMData(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, max_seq_len: int = 128, sort_key: Optional[str] = None, language: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True)[source]

Bases: pytext.data.data.Data

Special purpose Data object which assumes a single text tensorizer. Packs tokens into a square batch with no padding. Used for LM training. The object also takes in an optional language argument which is used for cross-lingual LM training.
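The idea of a padding-free "square" batch can be illustrated with a short sketch. It assumes that token ids from consecutive rows are concatenated and cut into rows of max_seq_len, and that a trailing remainder is dropped. Both choices are assumptions made for illustration, and pack_tokens is a hypothetical name.

```python
from typing import List


def pack_tokens(rows: List[List[int]], max_seq_len: int) -> List[List[int]]:
    """Concatenate token ids and cut them into full rows of max_seq_len."""
    flat = [token for row in rows for token in row]
    # Assumption: an incomplete final row is dropped so that no padding is needed.
    usable = len(flat) - len(flat) % max_seq_len
    return [flat[i:i + max_seq_len] for i in range(0, usable, max_seq_len)]


packed = pack_tokens([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=4)
```

Every row in packed has exactly max_seq_len tokens, so sentence boundaries no longer line up with row boundaries.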

classmethod from_config(config: pytext.data.packed_lm_data.PackedLMData.Config, schema: Dict[str, Type[CT_co]], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], language: Optional[str] = None, rank: int = 0, world_size: int = 1, init_tensorizers: Optional[bool] = True)[source]
numberize_rows(rows)[source]
pytext.data.roberta_tensorizer module
class pytext.data.roberta_tensorizer.RoBERTaTensorizer(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, base_tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBase

classmethod from_config(config: pytext.data.roberta_tensorizer.RoBERTaTensorizer.Config)[source]
class pytext.data.roberta_tensorizer.RoBERTaTokenLevelTensorizer(columns, tokenizer=None, vocab=None, max_seq_len=256, labels_columns=['label'], labels=[])[source]

Bases: pytext.data.roberta_tensorizer.RoBERTaTensorizer

Tensorizer for token-level classification tasks, such as NER and POS tagging, using RoBERTa. Here each token has an associated label and the tensorizer should output a label tensor as well. The input for this tensorizer comes from the CoNLLUNERDataSource data source.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.roberta_tensorizer.RoBERTaTokenLevelTensorizer.Config)[source]
numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

Numberize both the tokens and labels. Since we break up tokens, the label for anything other than the first sub-word is assigned the padding idx.
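The label alignment described here can be sketched without any model or tokenizer dependency. The sub-word splitter below is a toy stand-in (fixed three-character pieces), and align_labels, label_vocab and the pad index value are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple


def align_labels(
    tokens: List[str],
    labels: List[str],
    split: Callable[[str], List[str]],
    label_vocab: Dict[str, int],
    pad_idx: int,
) -> Tuple[List[str], List[int]]:
    """Give the label to the first sub-word of each token, pad_idx to the rest."""
    sub_words: List[str] = []
    label_ids: List[int] = []
    for token, label in zip(tokens, labels):
        pieces = split(token)
        sub_words.extend(pieces)
        label_ids.extend([label_vocab[label]] + [pad_idx] * (len(pieces) - 1))
    return sub_words, label_ids


def toy_split(token: str) -> List[str]:
    # Toy stand-in for a real sub-word tokenizer: pieces of three characters.
    return [token[i:i + 3] for i in range(0, len(token), 3)]


sub_words, label_ids = align_labels(
    ["Facebook", "is", "great"], ["B-ORG", "O", "O"],
    toy_split, {"O": 0, "B-ORG": 1}, pad_idx=-1,
)
```

Marking continuation pieces with the padding index lets a loss function skip them, so each original token contributes exactly one prediction.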

tensorize(batch) → Tuple[torch.Tensor, ...][source]

Convert instance level vectors into batch level tensors.

torchscriptify()[source]
pytext.data.squad_for_bert_tensorizer module
class pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer(answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizer

Produces BERT inputs and answer spans for Squad.

SPAN_PAD_IDX = -100
classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

numberize(row)[source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorize(batch)[source]

Convert instance level vectors into batch level tensors.

class pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizerForKD(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]

Bases: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer

classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizerForKD.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

numberize(row)[source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorize(batch)[source]

Convert instance level vectors into batch level tensors.

class pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer(columns: List[str] = ['question', 'doc'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, answers_column: str = 'answers', answer_starts_column: str = 'answer_starts')[source]

Bases: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer, pytext.data.roberta_tensorizer.RoBERTaTensorizer

Produces RoBERTa inputs and answer spans for Squad.

classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer.Config)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

torchscriptify()[source]
pytext.data.squad_tensorizer module
class pytext.data.squad_tensorizer.SquadTensorizer(doc_tensorizer: pytext.data.tensorizers.TokenTensorizer, ques_tensorizer: pytext.data.tensorizers.TokenTensorizer, doc_column: str = 'doc', ques_column: str = 'question', answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]

Bases: pytext.data.tensorizers.TokenTensorizer

Produces inputs and answer spans for Squad.

SPAN_PAD_IDX = -100
classmethod from_config(config: pytext.data.squad_tensorizer.SquadTensorizer.Config, **kwargs)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.squad_tensorizer.SquadTensorizerForKD(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]

Bases: pytext.data.squad_tensorizer.SquadTensorizer

classmethod from_config(config: pytext.data.squad_tensorizer.SquadTensorizerForKD.Config, **kwargs)[source]
numberize(row)[source]

Tokenize, look up in vocabulary.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

pytext.data.tensorizers module
class pytext.data.tensorizers.AnnotationNumberizer(column: str = 'seqlogical', vocab=None, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Not really a Tensorizer (since it does not create tensors) but technically serves the same function. This class parses Annotations in the format below and extracts the actions (type List[List[int]])

[IN:GET_ESTIMATED_DURATION How long will it take to [SL:METHOD_TRAVEL
drive ] from [SL:SOURCE Chicago ] to [SL:DESTINATION Mississippi ] ]

Extraction algorithm is handled by Annotation class. We only care about the list of actions, which before vocab index lookups would look like:

[
    IN:GET_ESTIMATED_DURATION, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT,
    SL:METHOD_TRAVEL, SHIFT, REDUCE,
    SHIFT,
    SL:SOURCE, SHIFT, REDUCE,
    SHIFT,
    SL:DESTINATION, SHIFT, REDUCE,
]
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.AnnotationNumberizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.ByteTensorizer(text_column, lower=True, max_seq_len=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Turn characters into a sequence of int8 bytes. One character will have one or more bytes depending on its encoding.
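As a small illustration of "one or more bytes per character", the sketch below converts text to byte values using UTF-8. The choice of UTF-8 is an assumption, and text_to_bytes is a hypothetical helper rather than the PyText method. Its lower and max_seq_len parameters only mimic the constructor arguments shown above.

```python
from typing import List, Optional


def text_to_bytes(text: str, lower: bool = True,
                  max_seq_len: Optional[int] = None) -> List[int]:
    """Return the UTF-8 byte values of the text (assumed encoding)."""
    if lower:
        text = text.lower()
    byte_ids = list(text.encode("utf-8"))
    if max_seq_len is not None:
        byte_ids = byte_ids[:max_seq_len]
    return byte_ids


byte_ids = text_to_bytes("Café")
```

The four characters yield five byte values because "é" is encoded as two bytes in UTF-8.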

NUM = 256
PAD_BYTE = 0
UNK_BYTE = 0
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.ByteTensorizer.Config)[source]
numberize(row)[source]

Convert text to characters.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.ByteTokenTensorizer(text_column, tokenizer=None, max_seq_len=None, max_byte_len=15, offset_for_non_padding=0, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Turn words into 2-dimensional tensors of int8 bytes. Words are padded to max_byte_len. Also computes sequence lengths (1-D tensor) and token lengths (2-D tensor). 0 is the pad byte.
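A sketch of the per-row layout described here, using nested lists instead of tensors. It assumes UTF-8 encoding and truncation of tokens longer than max_byte_len; token_bytes is a hypothetical helper, not the PyText implementation.

```python
from typing import List, Tuple


def token_bytes(
    tokens: List[str], max_byte_len: int = 15, pad_byte: int = 0
) -> Tuple[List[List[int]], int, List[int]]:
    """Return (padded byte rows, sequence length, per-token byte lengths)."""
    rows: List[List[int]] = []
    token_lens: List[int] = []
    for token in tokens:
        # Assumption: tokens longer than max_byte_len are truncated.
        encoded = list(token.encode("utf-8"))[:max_byte_len]
        token_lens.append(len(encoded))
        rows.append(encoded + [pad_byte] * (max_byte_len - len(encoded)))
    return rows, len(tokens), token_lens


rows, seq_len, token_lens = token_bytes(["hi", "café"], max_byte_len=6)
```

Stacking such rows across a batch is what gives the token lengths their second dimension.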

NUM_BYTES = 256
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.ByteTokenTensorizer.Config)[source]
numberize(row)[source]

Convert text to bytes, pad batch.

sort_key(row)[source]
tensorize(batch, pad_token=0)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.CharacterTokenTensorizer(max_char_length: int = 20, **kwargs)[source]

Bases: pytext.data.tensorizers.TokenTensorizer

Turn words into 2-dimensional tensors of ints based on their ASCII values. Words are padded to the maximum word length (also capped at max_char_length). Sequence lengths are the length of each token, 0 for pad token.

initialize(from_scratch=True)

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]

Convert text to characters, pad batch.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.FloatListTensorizer(column: str, error_check: bool, dim: Optional[int], normalize: bool, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize numeric labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.FloatListTensorizer.Config)[source]
initialize()[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.FloatTensorizer(column: str, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

A tensorizer for reading in scalars from the data.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.FloatTensorizer.Config)[source]
numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.GazetteerTensorizer(text_column: str = 'text', dict_column: str = 'dict', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Create 3 tensors for dict features.

  • idx: index of feature in token order.
  • weights: weight of feature in token order.
  • lens: number of features per token.

For each input token, there will be the same number of idx and weights entries (equal to the maximum number of features any token has in this row). The values in lens tell how many of these features are actually used per token.

Input format for the dict column is json and should be a list of dictionaries containing the “features” and their weight for each relevant “tokenIdx”. Example:

text: "Order coffee from Starbucks please"
dict: [
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

if we assume this vocab

vocab = {
    UNK: 0, PAD: 1,
    "drink/beverage": 2, "music/song": 3, "store/coffee_shop": 4
}

this example will result in these tensors:

idx =     [1,   1,   2,   3,   1,   1,   4,   1,   1,   1]
weights = [0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
lens =    [1,        2,        1,        1,        1]
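The example above can be reproduced with a short plain-Python sketch that returns lists instead of tensors. gazetteer_features is a hypothetical helper written for illustration. It assumes the PAD index is 1 and the UNK index is 0, as in the vocab shown, and that a token without features counts as having one (padding) entry.

```python
from typing import Dict, List, Tuple


def gazetteer_features(
    num_tokens: int,
    dict_feats: List[dict],
    vocab: Dict[str, int],
    pad_idx: int = 1,
    unk_idx: int = 0,
) -> Tuple[List[int], List[float], List[int]]:
    """Build flat idx and weights lists plus per-token lens (sketch)."""
    per_token = {d["tokenIdx"]: d["features"] for d in dict_feats}
    # Every token has at least one entry, even when it has no features.
    lens = [max(len(per_token.get(i, {})), 1) for i in range(num_tokens)]
    width = max(lens)
    idx: List[int] = []
    weights: List[float] = []
    for i in range(num_tokens):
        feats = list(per_token.get(i, {}).items())
        for name, weight in feats:
            idx.append(vocab.get(name, unk_idx))
            weights.append(weight)
        # Pad each token up to the widest token in the row.
        idx.extend([pad_idx] * (width - len(feats)))
        weights.extend([0.0] * (width - len(feats)))
    return idx, weights, lens


vocab = {"drink/beverage": 2, "music/song": 3, "store/coffee_shop": 4}
dict_feats = [
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}},
]
idx, weights, lens = gazetteer_features(5, dict_feats, vocab)
```

Because every token occupies the same number of slots, the flat idx and weights lists can be reshaped to (number of tokens, width) without ambiguity.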
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.GazetteerTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all dict features to create vocab.

numberize(row)[source]

Numberize dict features. Fill in for tokens with no features with PAD and weight 0.0. All tokens need to have at least one entry. Tokens with more than one feature will have multiple idx and weight added in sequence.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.LabelListTensorizer(label_column: str = 'label', *args, **kwargs)[source]

Bases: pytext.data.tensorizers.LabelTensorizer

LabelListTensorizer takes a list of labels as input and generates a tuple of tensors (label_idx, list_length).

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

numberize(row)[source]

Numberize labels.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.LabelTensorizer(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize labels. Label can be used as either input or target

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.LabelTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all labels and create a vocab map for them.

numberize(row)[source]

Numberize labels.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.MetricTensorizer(names: List[str], indexes: List[int], is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

A tensorizer which uses other tensorizers’ numberized data. Used mostly for metric reporting.

classmethod from_config(config: pytext.data.tensorizers.MetricTensorizer.Config)[source]
numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.NtokensTensorizer(names: List[str], indexes: List[int], is_input: bool = False)[source]

Bases: pytext.data.tensorizers.MetricTensorizer

A tensorizer which references another tensorizer’s numberized data to calculate the number of tokens. Used for calculating tokens per second.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.NumericLabelTensorizer(label_column: str = 'label', rescale_range: Optional[List[float]] = None, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize numeric labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.NumericLabelTensorizer.Config)[source]
numberize(row)[source]

Numberize labels.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.SeqTokenTensorizer(column: str = 'text_seq', tokenizer=None, add_bos_token: bool = False, add_eos_token: bool = False, use_eos_token_for_bos: bool = False, add_bol_token: bool = False, add_eol_token: bool = False, use_eol_token_for_bol: bool = False, max_seq_len=None, vocab=None, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Tensorize a sequence of sentences. The input is a list of strings, like this one:

["where do you wanna meet?", "MPK"]

if we assume this vocab

vocab = {
  UNK: 0, PAD: 1,
  'where': 2, 'do': 3, 'you': 4, 'wanna': 5, 'meet?': 6, 'mpk': 7
}

this example will result in these tensors:

idx = [[2, 3, 4, 5, 6], [7, 1, 1, 1, 1]]
seq_len = [2]

If you’re using BOS, EOS, BOL and EOL, the vocab will look like this:

vocab = {
  UNK: 0, PAD: 1,  BOS: 2, EOS: 3, BOL: 4, EOL: 5,
  'where': 6, 'do': 7, 'you': 8, 'wanna': 9, 'meet?': 10, 'mpk': 11
}

this example will result in these tensors:

idx = [
    [2,  4, 3, 1, 1,  1, 1],
    [2,  6, 7, 8, 9, 10, 3],
    [2, 11, 3, 1, 1,  1, 1],
    [2,  5, 3, 1, 1,  1, 1]
]
seq_len = [4]
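Both examples above can be reproduced with a plain-Python sketch that returns nested lists instead of tensors. It assumes whitespace tokenization and lowercasing, which is what the 'mpk' entry suggests. seq_token_ids and the "__UNK__"-style special token strings are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

UNK, PAD, BOS, EOS, BOL, EOL = (
    "__UNK__", "__PAD__", "__BOS__", "__EOS__", "__BOL__", "__EOL__"
)


def seq_token_ids(
    sentences: List[str],
    vocab: Dict[str, int],
    use_bos_eos: bool = False,
    use_bol_eol: bool = False,
) -> Tuple[List[List[int]], int]:
    """Return (padded ids per sentence, number of sentences)."""
    rows = [sentence.lower().split() for sentence in sentences]
    if use_bol_eol:
        # BOL and EOL are added as extra one-token sentences around the list.
        rows = [[BOL]] + rows + [[EOL]]
    if use_bos_eos:
        rows = [[BOS] + row + [EOS] for row in rows]
    width = max(len(row) for row in rows)
    idx = [
        [vocab.get(token, vocab[UNK]) for token in row]
        + [vocab[PAD]] * (width - len(row))
        for row in rows
    ]
    return idx, len(rows)


words = ["where", "do", "you", "wanna", "meet?", "mpk"]
plain_vocab = {UNK: 0, PAD: 1, **{w: i + 2 for i, w in enumerate(words)}}
full_vocab = {UNK: 0, PAD: 1, BOS: 2, EOS: 3, BOL: 4, EOL: 5,
              **{w: i + 6 for i, w in enumerate(words)}}

sentences = ["where do you wanna meet?", "MPK"]
plain_idx, plain_len = seq_token_ids(sentences, plain_vocab)
full_idx, full_len = seq_token_ids(sentences, full_vocab, True, True)
```

In the second call the list grows from two to four rows because the BOL and EOL markers are treated as sentences of their own.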
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.SeqTokenTensorizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

prepare_input(row)[source]

Tokenize, return tokenized_texts in raw text

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.SlotLabelTensorizer(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize word/slot labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.SlotLabelTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all labels and create a vocab map for them.

numberize(row)[source]

Turn slot labels and text into a list of token labels with the same length as the number of tokens in the text.
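A hedged sketch of this alignment, assuming slots are given as (start, end, label) character spans and that tokens are whitespace-separated. The slot format, the "NoLabel" placeholder and the token_labels helper are all assumptions made for illustration, not the PyText data format.

```python
from typing import List, Tuple


def token_labels(
    text: str, slots: List[Tuple[int, int, str]], no_label: str = "NoLabel"
) -> List[str]:
    """Return one label per whitespace token, using character-span slots."""
    labels: List[str] = []
    position = 0
    for token in text.split():
        start = text.index(token, position)
        end = start + len(token)
        position = end
        # A token takes the label of the first slot that fully covers it.
        label = next(
            (name for s, e, name in slots if s <= start and end <= e), no_label
        )
        labels.append(label)
    return labels


labels = token_labels(
    "Order coffee from Starbucks please",
    [(6, 12, "drink"), (18, 27, "store")],
)
```

The result has exactly one entry per token, so it can be numberized and padded in the same way as the token ids.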

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.SlotLabelTensorizerExpansible(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.SlotLabelTensorizer

Create a base SlotLabelTensorizer to support selecting different types in ModelInput.

class pytext.data.tensorizers.SoftLabelTensorizer(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, probs_column: str = 'target_probs', logits_column: str = 'target_logits', labels_column: str = 'target_labels', is_input: bool = False)[source]

Bases: pytext.data.tensorizers.LabelTensorizer

Handles numberizing labels for knowledge distillation. This still requires the same label column as LabelTensorizer for the “true” label, but also processes soft “probabilistic” labels generated from a teacher model, via three new columns.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.SoftLabelTensorizer.Config)[source]
numberize(row)[source]

Numberize hard and soft labels

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.Tensorizer(is_input: bool = True)[source]

Bases: pytext.config.component.Component

Tensorizers are a component that converts from batches of pytext.data.type.DataType instances to tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its model.

Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.Tensorizer.Config)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.
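The coroutine pattern outlined above can be exercised end to end with standard Python generators. The sketch below builds a frequency-ordered vocabulary from rows that are sent in one at a time. vocab_initializer and the result dict are hypothetical names used for illustration and are not PyText APIs.

```python
from collections import Counter
from typing import Dict, Generator, List


def vocab_initializer(
    column: str, out: Dict[str, List[str]]
) -> Generator[None, Dict[str, str], None]:
    """Coroutine that counts tokens from rows sent to it."""
    counts: Counter = Counter()
    try:
        while True:
            # Each row arrives through send(); nothing is stored but the counts.
            row = yield
            counts.update(row[column].lower().split())
    except GeneratorExit:
        # Raised by close(): finalize and publish the result.
        out["vocab"] = [token for token, _ in counts.most_common()]


result: Dict[str, List[str]] = {}
initializer = vocab_initializer("text", result)
next(initializer)  # advance to the first `yield`
for row in [{"text": "a b a"}, {"text": "b a c"}]:
    initializer.send(row)
initializer.close()  # triggers the GeneratorExit branch
```

Because the caller drives the iteration, several such coroutines can be fed from a single pass over the data source, which is the property the description above relies on.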

numberize(row)[source]
prepare_input(row)[source]

Return preprocessed input tensors/blob for caffe2 prediction net.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
torchscriptify()[source]
class pytext.data.tensorizers.TensorizerScriptImpl[source]

Bases: torch.nn.modules.module.Module

batch_size(texts: Optional[List[List[str]]], tokens: Optional[List[List[List[str]]]]) → int[source]
get_texts_by_index(texts: Optional[List[List[str]]], index: int) → Optional[List[str]][source]
get_tokens_by_index(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[List[str]]][source]
numberize(*args, **kwargs)[source]

This function receives the outputs of tokenize(), or is called directly from the PyText Tensorizer function numberize().

Override this function so that it is TorchScriptable; for example, declare concrete input arguments with type hints.

row_size(texts: Optional[List[List[str]]], tokens: Optional[List[List[List[str]]]]) → int[source]
set_device(device: str)[source]
tensorize(*args, **kwargs)[source]

This function receives a list (e.g. a batch) of outputs from numberize(), pads them, and converts them to output tensors.

Override this function so that it is TorchScriptable; for example, declare concrete input arguments with type hints.

tensorize_wrapper(*args, **kwargs)[source]

This function receives a list (e.g. a batch) of outputs from numberize(), pads them, and converts them to output tensors.

It is called in the PyText Tensorizer at training time. This function is not TorchScriptable because it depends on cuda.device().

tokenize(*args, **kwargs)[source]

This function receives the inputs from clients. There are usually two possible inputs: 1) a row of texts: List[str]; 2) a row of pre-processed tokens: List[List[str]].

Override this function so that it is TorchScriptable; for example, declare concrete input arguments with type hints.

torchscriptify()[source]
class pytext.data.tensorizers.TokenTensorizer(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Convert text to a list of tokens. Do this based on a tokenizer configuration, and build a vocabulary for numberization. Finally, pad the batch to create a square tensor of the correct size.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.TokenTensorizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

prepare_input(row)[source]

Tokenize, look up in vocabulary, return tokenized_texts in raw text

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.UidTensorizer(uid_column: str = 'uid', allow_unknown: bool = True, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize user IDs which can be either strings or tensors.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.UidTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all uids and create a vocab map for them.

numberize(row)[source]

Numberize uids.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.VocabConfig(**kwargs)[source]

Bases: pytext.config.component.Component.Config

build_from_data = True

Whether to add tokens from training data to vocab.

size_from_data = 0

Add size_from_data most frequent tokens in training data to vocab (if this is 0, add all tokens from training data).

vocab_files = []
class pytext.data.tensorizers.VocabFileConfig(**kwargs)[source]

Bases: pytext.config.component.Component.Config

filepath = ''

File containing tokens to add to vocab (first whitespace-separated entry per line)

lowercase_tokens = False

Whether to lowercase each of the tokens in the file

size_limit = 0

The max number of tokens to add to vocab

skip_header_line = False

Whether to skip the first line of the file (e.g. if it is a header line)

pytext.data.tensorizers.initialize_tensorizers(tensorizers, data_source, from_scratch=True)[source]

A utility function to stream a data source to the initialize functions of a dict of tensorizers.

pytext.data.tensorizers.lookup_tokens(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, vocab: pytext.data.utils.Vocabulary = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]
pytext.data.tensorizers.to_device(tensorizer_script_impl, device)[source]
pytext.data.tensorizers.tokenize(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]
pytext.data.utils module
class pytext.data.utils.SpecialToken[source]

Bases: str

class pytext.data.utils.VocabBuilder(delimiter=' ')[source]

Bases: object

Helper class for aggregating and building Vocabulary objects.

add(value) → None[source]

Count a single value in the vocabulary.

add_all(values) → None[source]

Count a value or nested container of values in the vocabulary.

add_from_file(file_pointer, skip_header_line, lowercase_tokens, size)[source]
make_vocab() → pytext.data.utils.Vocabulary[source]

Build a Vocabulary object from the values seen by the builder.

truncate_to_vocab_size(vocab_size) → None[source]
class pytext.data.utils.Vocabulary(vocab_list: List[str], counts: List[T] = None, replacements: Optional[Dict[str, str]] = None, unk_token: str = '__UNKNOWN__', pad_token: str = '__PAD__', bos_token: str = '__BEGIN_OF_SENTENCE__', eos_token: str = '__END_OF_SENTENCE__')[source]

Bases: object

A mapping from indices to vocab elements.

get_bos_index(value=None)[source]
get_eos_index(value=None)[source]
get_pad_index(value=None)[source]
get_unk_index(value=None)[source]
lookup_all(nested_values)[source]
lookup_all_internal(nested_values)[source]

Look up a value or nested container of values in the vocab index. The return value will have the same shape as the input, with all values replaced with their respective indices.
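
The shape-preserving lookup can be illustrated in plain Python. This is only a sketch of the idea, not the library's implementation: lookup_nested is a hypothetical helper, and both the position of the special tokens in vocab_list and the mapping of unknown tokens to the unknown-token index are assumptions made for the example.

```python
from typing import Any, Dict


def lookup_nested(values: Any, index: Dict[str, int], unk_index: int) -> Any:
    """Replace every string in a nested container with its vocab index."""
    if isinstance(values, str):
        # Assumption: tokens missing from the vocab map to the unknown index.
        return index.get(values, unk_index)
    return [lookup_nested(value, index, unk_index) for value in values]


# Hypothetical vocabulary layout, chosen only for this example.
vocab_list = ["__UNKNOWN__", "__PAD__", "hello", "world"]
index = {token: i for i, token in enumerate(vocab_list)}

ids = lookup_nested([["hello", "world"], ["hello", "there"]], index, unk_index=0)
print(ids)  # [[2, 3], [2, 0]]
```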

replace_tokens(replacements)[source]

Replace tokens in vocab with given replacement. Used for replacing special strings for special tokens. e.g. ‘[UNK]’ for UNK

pytext.data.utils.align_target_label(targets: List[float], labels: List[str], label_vocab: Dict[str, int]) → List[float][source]

Given targets that are ordered according to labels, align the targets to match the order of label_vocab.
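
As an illustration of what this alignment means, here is a minimal sketch in plain Python. The helper name align is hypothetical, and the sketch assumes that label_vocab maps every label to a distinct index in 0..len(label_vocab) - 1; it is not the library's code.

```python
from typing import Dict, List


def align(targets: List[float], labels: List[str], label_vocab: Dict[str, int]) -> List[float]:
    """Reorder targets so position i holds the target of the label with index i."""
    aligned = [0.0] * len(label_vocab)
    for target, label in zip(targets, labels):
        aligned[label_vocab[label]] = target
    return aligned


# Targets arrive ordered as ["b", "a"]; the vocab orders labels as a=0, b=1.
aligned = align([0.1, 0.9], ["b", "a"], {"a": 0, "b": 1})
print(aligned)  # [0.9, 0.1]
```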

pytext.data.utils.align_target_labels(targets_list: List[List[float]], labels_list: List[List[str]], label_vocab: Dict[str, int]) → List[List[float]][source]

Given targets_list that are ordered according to labels_list, align the targets to match the order of label_vocab.

pytext.data.utils.pad(nested_lists, pad_token, pad_shape=None)[source]

Pad the input lists with the pad token. If pad_shape is provided, pad to that shape, otherwise infer the input shape and pad out to a square tensor shape.
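
For the common two-dimensional case, the padding behavior can be sketched as follows. pad_2d is a hypothetical helper restricted to a list of flat lists; the real function accepts arbitrarily nested lists and a full pad_shape, and the truncation of over-long rows here is a choice of this sketch rather than documented behavior.

```python
from typing import Any, List, Optional


def pad_2d(rows: List[List[Any]], pad_token: Any, width: Optional[int] = None) -> List[List[Any]]:
    """Pad each row with pad_token so that all rows share one length."""
    if width is None:
        # No shape given: infer the width from the longest row.
        width = max((len(row) for row in rows), default=0)
    # Rows longer than `width` are truncated (an assumption of this sketch).
    return [list(row[:width]) + [pad_token] * (width - len(row)) for row in rows]


padded = pad_2d([[1, 2, 3], [4]], pad_token=0)
print(padded)  # [[1, 2, 3], [4, 0, 0]]
```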

pytext.data.utils.pad_and_tensorize(batch, pad_token=0, pad_shape=None, dtype=torch.int64)[source]
pytext.data.utils.shard(rows, rank, num_workers)[source]

Return only every num_workers-th example, for distributed training.
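
One way to picture this kind of sharding is a strided slice over the row stream. The sketch below uses itertools.islice and assumes that the worker with a given rank starts at offset rank; shard_rows is a hypothetical name and the offset convention is an assumption, not something this page states.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_rows(rows: Iterable[T], rank: int, num_workers: int) -> Iterator[T]:
    """Yield rows rank, rank + num_workers, rank + 2 * num_workers, ..."""
    return islice(rows, rank, None, num_workers)


worker_rows = list(shard_rows(range(10), rank=1, num_workers=3))
print(worker_rows)  # [1, 4, 7]
```

With this scheme the shards of all workers are disjoint and together cover every row, without any worker having to hold the full dataset in memory.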

pytext.data.utils.should_iter(i)[source]

Whether or not an object looks like a python iterable (not including strings).

pytext.data.xlm_constants module
pytext.data.xlm_dictionary module
class pytext.data.xlm_dictionary.Dictionary(id2word, word2id, counts)[source]

Bases: object

check_valid()[source]

Check that the dictionary is valid.

index(word, no_unk=False)[source]

Returns the index of the specified word.

static index_data(path, bin_path, dico)[source]

Index sentences with a dictionary.

max_vocab(max_vocab)[source]

Limit the vocabulary size.

min_count(min_count)[source]

Threshold on the word frequency counts.

static read_vocab(vocab_path)[source]

Create a dictionary from a vocabulary file.

pytext.data.xlm_tensorizer module
class pytext.data.xlm_tensorizer.XLMTensorizer(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, language_column: str = 'language', lang2id: Dict[str, int] = {'ar': 0, 'bg': 1, 'de': 2, 'el': 3, 'en': 4, 'es': 5, 'fr': 6, 'hi': 7, 'ru': 8, 'sw': 9, 'th': 10, 'tr': 11, 'ur': 12, 'vi': 13, 'zh': 14}, use_language_embeddings: bool = True, has_language_in_data: bool = False)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBase

Tensorizer for Cross-lingual LM tasks. Works for single sentence as well as sentence pair.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.xlm_tensorizer.XLMTensorizer.Config)[source]
get_lang_id(row: Dict[KT, VT], col: str) → int[source]
numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorizer_script_impl = None
class pytext.data.xlm_tensorizer.XLMTensorizerScriptImpl(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int, language_vocab: List[str], default_language: str)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl

forward(texts: Optional[List[List[str]]] = None, pre_tokenized: Optional[List[List[List[str]]]] = None, languages: Optional[List[List[str]]] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Wire up tokenize(), numberize() and tensorize() functions for data processing.

numberize(per_sentence_tokens: List[List[Tuple[str, int, int]]], per_sentence_languages: List[int]) → Tuple[List[int], List[int], int, List[int]][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

Parameters:
  • per_sentence_tokens – list of tokens per sentence in one row; each token is represented by its token string and its start and end indices.
Returns:
  • tokens: List[int], a list of token ids, concatenating the token ids of all sentences.
  • segment_labels: List[int], denotes which sentence each token belongs to.
  • seq_len: int, the length of tokens.
  • positions: List[int], token positions.

Module contents
class pytext.data.AlternatingRandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.RandomizedBatchSampler

This sampler takes in a dictionary of iterators and returns batches, alternating between the keys and probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.AlternatingRandomizedBatchSampler.Config)[source]
class pytext.data.Batcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]

Bases: pytext.config.component.Component

Batcher designed to batch rows of data, before padding.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

Group rows by batch_size. Assume iterable of dicts, yield dict of lists. The last batch will be of length len(iterable) % batch_size.
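
The "iterable of dicts in, dict of lists out" contract can be sketched in a few lines. This is an illustration only, assuming every row has the same keys; batchify_rows is a hypothetical name and the sketch ignores sort_key and stage.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batchify_rows(rows: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[Dict[str, List[Any]]]:
    """Group rows into chunks of batch_size and transpose each chunk."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Transpose a list of dicts into a dict of lists.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


rows = [{"text": f"row {i}", "label": i % 2} for i in range(5)]
batches = list(batchify_rows(rows, batch_size=2))
print([len(batch["text"]) for batch in batches])  # [2, 2, 1]
print(batches[-1])  # {'text': ['row 4'], 'label': [0]}
```

In this sketch, when the number of rows is an exact multiple of batch_size the final batch is simply a full batch, since no empty batch is emitted.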

classmethod from_config(config: pytext.data.data.Batcher.Config)[source]
class pytext.data.BaseBatchSampler[source]

Bases: pytext.config.component.Component

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.config.component.Component.Config)[source]
class pytext.data.BatchIterator(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]

Bases: object

BatchIterator is a wrapper of TorchText.Iterator that provides the flexibility to map batched data to a tuple of (input, target, context), and to perform additional steps such as dealing with distributed training.

Parameters:
  • batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and returns a batch of data in __next__
  • processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
  • include_input (bool) – if input data should be returned, default is true
  • include_target (bool) – if target data should be returned, default is true
  • include_context (bool) – if context data should be returned, default is true
  • is_train (bool) – if the batch data is for training
  • num_batches (int) – total number of batches to generate. This parameter is for distributed training: a limitation in PyTorch’s distributed training backend requires all parallel workers to have the same number of batches, and we work around it by adding dummy batches at the end
class pytext.data.CommonMetadata[source]

Bases: object

class pytext.data.Data(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]

Bases: pytext.config.component.Component

Data is an abstraction that handles all of the following:

  • Initialize model metadata parameters
  • Create batches of tensors for model training or prediction

It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.

The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.

add_row_indices(rows)[source]
batches(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]

Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, ie. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.

stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data_source; this allows setting a different data_source for testing a model.

Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.

cache(numberized_rows, stage)[source]
classmethod from_config(config: pytext.data.data.Data.Config, schema: Dict[str, Type[CT_co]], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], rank=0, world_size=1, init_tensorizers=True, **kwargs)[source]
numberize_rows(rows)[source]
class pytext.data.DataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]

Bases: pytext.config.component.Component

DataHandler is the central place to prepare data for model training/testing. The class is responsible for:

  • Defining a pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
  • Initializing global context, such as building the vocab and loading pretrained embeddings, storing the context as metadata, and providing functions to serialize/deserialize the metadata

The data processing pipeline contains the following steps:

  • Read data from file into a list of raw data examples
  • Convert each row of raw data to a TorchText Example. This logic happens in process_row function and will:
    • Invoke featurizer, which contains data processing steps to apply for both training and inference time, e.g: tokenization
    • Use the raw data and results from featurizer to do any preprocessing
  • Generate a TorchText.Dataset that contains the list of Example, the Dataset also has a list of TorchText.Field, which defines how to do padding and numericalization while batching data.
  • Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overwritten by _input_from_batch, _target_from_batch, _context_from_batch functions.
raw_columns

columns to read from data source. The order should match the data stored in that file.

Type:List[str]
featurizer

perform data preprocessing that should be shared between training and inference

Type:Featurizer
features

a dict of name -> field that is used to process data as model input

Type:Dict[str, Field]
labels

a dict of name -> field that is used to process data as training target

Type:Dict[str, Field]
extra_fields

fields that process any extra data used neither as model input nor target. This is None by default

Type:Dict[str, Field]
text_feature_name

name of the text field, used to define the default sort key of data

Type:str
shuffle

if the dataset should be shuffled, true by default

Type:bool
sort_within_batch

if data within same batch should be sorted, true by default

Type:bool
train_path

path of training data file

Type:str
eval_path

path of evaluation data file

Type:str
test_path

path of test data file

Type:str
train_batch_size

training batch size, 128 by default

Type:int
eval_batch_size

evaluation batch size, 128 by default

Type:int
test_batch_size

test batch size, 128 by default

Type:int
max_seq_len

maximum length of tokens to keep in sequence

Type:int
pass_index

if the original index of data in the batch should be passed along to downstream steps, default is true

Type:bool
gen_dataset(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.data.dataset.Dataset[source]

Generate a torchtext Dataset from raw in-memory data. Returns a dataset (TorchText.Dataset).

gen_dataset_from_path(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.data.dataset.Dataset[source]

Generate a dataset from a file. Returns a dataset (TorchText.Dataset).

get_eval_iter()[source]
get_predict_iter(data: Iterable[Dict[str, Any]], batch_size: Optional[int] = None)[source]
get_test_iter()[source]
get_test_iter_from_path(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_test_iter_from_raw_data(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1)[source]
get_train_iter_from_path(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]

Generate data batch iterator for training data. See _get_train_iter() for details

Parameters:
  • train_path (str) – file path of training data
  • batch_size (int) – batch size
  • rank (int) – used for distributed training, the rank of the current GPU; don’t set it to anything but 0 for non-distributed training
  • world_size (int) – used for distributed training, the total number of GPUs
get_train_iter_from_raw_data(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
init_feature_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
init_metadata()[source]

Initialize metadata using data from configured path

init_metadata_from_path(train_path, eval_path, test_path)[source]

Initialize metadata using data from file

init_metadata_from_raw_data(*data)[source]

Initialize metadata using in memory data

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
load_metadata(metadata: pytext.data.data_handler.CommonMetadata)[source]

Load previously saved metadata

load_vocab(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]

Loads items into a set from a file containing one item per line. Items are added to the set from top of the file to bottom. So, the items in the file should be ordered by a preference (if any), e.g., it makes sense to order tokens in descending order of frequency in corpus.

Parameters:
  • vocab_file (str) – vocab file to load
  • vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
  • lowercase_tokens (bool) – if the tokens should be lowercased
metadata_to_save()[source]

Save metadata, pretrained_embeds_weight should be excluded

preprocess(data: Iterable[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example; this is the second step in the whole processing pipeline. Returns data (Generator[Dict[str, Any]]).

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

preprocess steps for a single input row, sub class should override it

read_from_file(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]

Read data from csv file. Input file format is required to be tab-separated columns

Parameters:
  • file_name (str) – csv file name
  • columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
sort_key(example: torchtext.data.example.Example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

class pytext.data.DisjointMultitaskData(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]

Bases: pytext.data.data.Data

Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskData.Config.
  • data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_dict

Data handlers to do roundrobin over.

Type:type
batches(stage: pytext.common.constants.Stage, data_source=None)[source]

Yield batches from each task, sampled according to a given sampler. This batcher additionally exposes a task name in the batch to allow the model to filter examples to the appropriate tasks.

classmethod from_config(config: pytext.data.disjoint_multitask_data.DisjointMultitaskData.Config, data_dict: Dict[str, pytext.data.data.Data], task_key: str = 'task_name', rank=0, world_size=1, init_tensorizers=True)[source]
class pytext.data.DisjointMultitaskDataHandler(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
  • data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
  • target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_handlers

Data handlers to do roundrobin over.

Type:type
target_task_name

Used to select best epoch, and set batch_per_epoch.

Type:type
upsample

If upsample is True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, and the ones which run out will sit idle. This is used for evaluation. Default True.

Type:bool
get_eval_iter() → pytext.data.data_handler.BatchIterator[source]
get_test_iter() → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1) → Tuple[pytext.data.data_handler.BatchIterator, ...][source]
init_metadata()[source]

Initialize metadata using data from configured path

load_metadata(metadata)[source]

Load previously saved metadata

metadata_to_save()[source]

Save metadata, pretrained_embeds_weight should be excluded

class pytext.data.DynamicPoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]

Bases: pytext.data.data.PoolingBatcher

Allows dynamic batch training, extends pooling batcher with a scheduler config, which specifies how batch size should increase

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]
finished_dynamic() → bool[source]
classmethod from_config(config: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
step_epoch()[source]
class pytext.data.EvalBatchSampler[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

Output: [A, B, C, D, a, b]
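
The example above amounts to chaining the iterators one after another. A minimal sketch, under the assumption that each batch is yielded together with the key of the iterator it came from (eval_batches is a hypothetical name, not the library's code):

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def eval_batches(iterators: Dict[str, Iterable[Any]]) -> Iterator[Tuple[str, Any]]:
    """Exhaust each iterator in turn, so every batch is seen exactly once."""
    for name, iterator in iterators.items():
        for batch in iterator:
            yield name, batch


iterators = {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
output = [batch for _, batch in eval_batches(iterators)]
print(output)  # ['A', 'B', 'C', 'D', 'a', 'b']
```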

batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key.

Parameters:iterators – Dictionary of iterators
pytext.data.generator_iterator(fn)[source]

Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments, and calling the generator with them anew each time __iter__ is called on the returned object.
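
The idea of recording the call arguments so that the generator can be restarted can be sketched with a small wrapper class. ReplayableGenerator and replayable are hypothetical names used only for this illustration; they are not the classes PyText defines.

```python
import functools
from typing import Any, Callable, Iterator


class ReplayableGenerator:
    """Remembers a generator function and its arguments."""

    def __init__(self, fn: Callable[..., Iterator[Any]], args: tuple, kwargs: dict):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def __iter__(self) -> Iterator[Any]:
        # Call the generator function anew on every iteration request.
        return self.fn(*self.args, **self.kwargs)


def replayable(fn: Callable[..., Iterator[Any]]) -> Callable[..., ReplayableGenerator]:
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> ReplayableGenerator:
        return ReplayableGenerator(fn, args, kwargs)

    return wrapper


@replayable
def count_up(n: int) -> Iterator[int]:
    yield from range(n)


numbers = count_up(3)
print(list(numbers))  # [0, 1, 2]
print(list(numbers))  # [0, 1, 2]  (a bare generator would be empty here)
```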

class pytext.data.PoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1)[source]

Bases: pytext.data.data.Batcher

Batcher that loads a pool of data, sorts it, and batches it.

Shuffling is performed before pooling, by loading num_shuffled_pools worth of data, shuffling, and then splitting that up into pools.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
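
Steps 2 to 4 above can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: pooling_batches is a hypothetical name, step 1 (shuffling across num_shuffled_pools pools) is omitted for brevity, rows are returned as plain lists rather than dicts of lists, and the seed parameter exists only to make the sketch reproducible.

```python
import random
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Optional


def pooling_batches(
    rows: Iterable[Any],
    batch_size: int,
    pool_num_batches: int,
    sort_key: Optional[Callable[[Any], Any]] = None,
    seed: int = 0,
) -> Iterator[List[Any]]:
    rng = random.Random(seed)
    iterator = iter(rows)
    pool_size = batch_size * pool_num_batches
    while True:
        # Step 2: load one pool of rows.
        pool = list(islice(iterator, pool_size))
        if not pool:
            return
        # Step 3: sort the pool so each batch holds rows of similar size.
        if sort_key is not None:
            pool.sort(key=sort_key)
        batches = [pool[i : i + batch_size] for i in range(0, len(pool), batch_size)]
        # Step 4: shuffle the order in which batches are returned.
        rng.shuffle(batches)
        yield from batches


rows = ["aaaa", "a", "aaa", "aa", "aaaaa", "aa"]
batches = list(pooling_batches(rows, batch_size=2, pool_num_batches=2, sort_key=len))
for batch in batches:
    print(batch)
```

Sorting inside a pool keeps rows of similar length together, which reduces padding, while shuffling the batch order keeps training from always seeing short examples first.
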
classmethod from_config(config: pytext.data.data.PoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
class pytext.data.RandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of iterators and returns batches according to the specified probabilities by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.

Example

Iterator A: [A, B, C, D], Iterator B: [a, b]

batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1} Epoch 1 = [a, b, a] Epoch 2 = [b, a, b]

Parameters:unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
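
The sampling behavior in the example can be sketched with random.choices, which accepts unnormalized weights. This is an illustrative sketch only: randomized_batches is a hypothetical name, the num_batches argument stands in for batches_per_epoch, the inputs are assumed to be non-empty re-iterable containers such as lists, and the seed is there only for reproducibility.

```python
import random
from typing import Any, Dict, Iterator, List, Tuple


def randomized_batches(
    iterables: Dict[str, List[Any]],
    unnormalized_probs: Dict[str, float],
    num_batches: int,
    seed: int = 0,
) -> Iterator[Tuple[str, Any]]:
    rng = random.Random(seed)
    names = list(iterables)
    weights = [unnormalized_probs[name] for name in names]
    iterators = {name: iter(iterables[name]) for name in names}
    for _ in range(num_batches):
        # random.choices normalizes the weights internally.
        name = rng.choices(names, weights=weights)[0]
        try:
            batch = next(iterators[name])
        except StopIteration:
            # Restart an iterator that has run out.
            iterators[name] = iter(iterables[name])
            batch = next(iterators[name])
        yield name, batch


iterables = {"A": ["A", "B", "C", "D"], "B": ["a", "b"]}
sampled = [b for _, b in randomized_batches(iterables, {"A": 0, "B": 1}, num_batches=6)]
print(sampled)  # ['a', 'b', 'a', 'b', 'a', 'b']
```

Splitting these six batches into two epochs of three gives [a, b, a] and [b, a, b], matching the example above.
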
batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.RandomizedBatchSampler.Config)[source]
class pytext.data.RoundRobinBatchSampler(iter_to_set_epoch: Optional[str] = None)[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes a dictionary of Iterators and returns batches in a round robin fashion until the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.

If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.

If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]

iter_to_set_epoch = None Output: [A, a, B, b]

Parameters:iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
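
The two outputs in the example can be reproduced with the following sketch. It is written to match the documented example, not copied from the library: round_robin_batches is a hypothetical name, the inputs are assumed to be non-empty re-iterable containers such as lists, and each round is assembled completely before it is emitted so that a partial round is dropped when the epoch ends.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple


def round_robin_batches(
    iterables: Dict[str, List[Any]], iter_to_set_epoch: Optional[str] = None
) -> Iterator[Tuple[str, Any]]:
    iterators = {name: iter(items) for name, items in iterables.items()}
    while True:
        round_batches = []
        for name in iterables:
            try:
                batch = next(iterators[name])
            except StopIteration:
                if iter_to_set_epoch is None or name == iter_to_set_epoch:
                    return  # the epoch-defining iterator is exhausted
                # Other iterators are repeated so they never run out.
                iterators[name] = iter(iterables[name])
                batch = next(iterators[name])
            round_batches.append((name, batch))
        yield from round_batches


iterables = {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
with_target = [b for _, b in round_robin_batches(iterables, "Iterator 1")]
without_target = [b for _, b in round_robin_batches(iterables)]
print(with_target)     # ['A', 'a', 'B', 'b', 'C', 'a', 'D', 'b']
print(without_target)  # ['A', 'a', 'B', 'b']
```
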
batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key until the target iterator reaches its end.

Parameters:iterators – Dictionary of iterators
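
The documented round-robin order can be sketched as follows. This is an illustration of the behaviour described in the example, not the library's code: round_robin_batches is a hypothetical helper, and it assumes the inputs are non-empty sequences with a known length rather than arbitrary iterators.

```python
from itertools import cycle


def round_robin_batches(iterables, iter_to_set_epoch=None):
    """Return one epoch of (name, batch) pairs in round-robin order.

    Hypothetical helper; `iterables` maps names to non-empty sequences.
    """
    if iter_to_set_epoch is None:
        # No target: the shortest iterator defines the epoch.
        rounds = min(len(batches) for batches in iterables.values())
    else:
        rounds = len(iterables[iter_to_set_epoch])
    # cycle() repeats shorter iterators so they never run out.
    cyclers = {name: cycle(batches) for name, batches in iterables.items()}
    epoch = []
    for _ in range(rounds):
        for name, cycler in cyclers.items():
            epoch.append((name, next(cycler)))
    return epoch


iterables = {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
with_target = [b for _, b in round_robin_batches(iterables, "Iterator 1")]
without_target = [b for _, b in round_robin_batches(iterables)]
```

Here with_target follows the first output in the example and without_target follows the second.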
classmethod from_config(config: pytext.data.batch_sampler.RoundRobinBatchSampler.Config)[source]
class pytext.data.Tensorizer(is_input: bool = True)[source]

Bases: pytext.config.component.Component

Tensorizers are components that convert batches of pytext.data.type.DataType instances into tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its model.

Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.Tensorizer.Config)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.
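
To make the coroutine protocol concrete, here is a small self-contained sketch using only standard Python generators. The names vocab_initializer and run_initializer are hypothetical and are not part of PyText; how the library itself drives its tensorizers is not shown here.

```python
def vocab_initializer(column, result):
    """Coroutine that collects the distinct tokens seen in one column."""
    seen = set()
    try:
        while True:
            row = yield  # each row is a dict of column name -> value
            seen.update(row[column])
    except GeneratorExit:
        # Raised by close(); finalize the state gathered so far.
        result["vocab"] = sorted(seen)


def run_initializer(initializer, rows):
    """Drive the coroutine with a single pass over `rows`."""
    next(initializer)  # advance to the first `yield`
    for row in rows:
        initializer.send(row)
    initializer.close()


rows = [{"tokens": ["play", "that"]}, {"tokens": ["that", "track"]}]
result = {}
run_initializer(vocab_initializer("tokens", result), rows)
```

The caller owns the loop over the data source, so several such coroutines could be fed from the same single pass without keeping the dataset in memory.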

numberize(row)[source]
prepare_input(row)[source]

Return preprocessed input tensors/blob for caffe2 prediction net.

sort_key(row)[source]
tensorize(batch)[source]

A tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
torchscriptify()[source]

pytext.exporters package

Submodules
pytext.exporters.custom_exporters module
class pytext.exporters.custom_exporters.DenseFeatureExporter(config, input_names, dummy_model_input, vocab_map, output_names)[source]

Bases: pytext.exporters.exporter.ModelExporter

Exporter for models that have DenseFeatures as input to the decoder

classmethod get_feature_metadata(feature_config: pytext.config.field_config.FeatureConfig, feature_meta: Dict[str, pytext.fields.field.FieldMeta])[source]
class pytext.exporters.custom_exporters.InitPredictNetExporter(config, input_names, dummy_model_input, vocab_map, output_names)[source]

Bases: pytext.exporters.exporter.ModelExporter

Exporter for converting models to their caffe2 init and predict nets. Does not rely on c2_prepared, but rather splits the ONNX model into the init and predict nets directly.

export_to_caffe2(model, export_path: str, export_onnx_path: str = None) → List[str][source]

Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model.

Parameters:
  • model (Model) – pytorch model to export
  • export_path (str) – path to save the exported caffe2 model
  • export_onnx_path (str) – path to save the exported onnx model
Returns:final_output_names – list of caffe2 model output names
Return type:List[str]

get_export_paths(path)[source]
postprocess_output(init_net, predict_net, workspace, output_names: List[str], model)[source]

Postprocess the model output and generate additional blobs for human-readable prediction. By default, it uses the export function of the PyTorch model’s output layer to append additional operators to the Caffe2 net.

Parameters:
  • init_net (caffe2.python.Net) – caffe2 init net created by the current graph
  • predict_net (caffe2.python.Net) – caffe2 net created by the current graph
  • workspace (caffe2.python.workspace) – caffe2 current workspace
  • output_names (List[str]) – current output names of the caffe2 net
  • py_model (Model) – original pytorch model object
Returns:
  • result – list of blobs that will be added to the caffe2 model
  • final_output_names – list of output names of the blobs to add

prepend_operators(init_net, predict_net, input_names: List[str])[source]

Prepend operators to the converted Caffe2 net. Does nothing by default.

Parameters:
  • c2_prepared (Caffe2Rep) – caffe2 net rep
  • input_names (List[str]) – current input names to the caffe2 net
Returns:
  • c2_prepared (Caffe2Rep) – caffe2 net with prepended operators
  • input_names (List[str]) – list of input names for the new net

pytext.exporters.custom_exporters.get_exporter(name)[source]
pytext.exporters.custom_exporters.save_caffe2_pb_net(path, model)[source]
pytext.exporters.exporter module
class pytext.exporters.exporter.ModelExporter(config, input_names, dummy_model_input, vocab_map, output_names)[source]

Bases: pytext.config.component.Component

Model exporter exports a PyTorch model to Caffe2 model using ONNX

input_names

names of the input variables to the model forward function, in flattened form. For example, for forward(tokens, dict), where tokens is List[Tensor] and dict is a tuple of value and length (List[Tensor], List[Tensor]), the input names should look like [‘token’, ‘dict_value’, ‘dict_length’]

Type:List[Str]
dummy_model_input

dummy values to define the shape of input tensors, should exactly match the shape of the model forward function

Type:Tuple[torch.Tensor]
vocab_map

dict of input feature names to corresponding index_to_string array, e.g:

{
    "text": ["<UNK>", "W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8"],
    "dict": ["<UNK>", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"]
}
Type:Dict[str, List[str]]
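
As a small illustration of how an index_to_string array can be read, the sketch below decodes integer ids and builds the inverse lookup. Both helper functions are hypothetical and exist only for this example; how the exporter itself consumes vocab_map is not shown here.

```python
vocab_map = {
    "text": ["<UNK>", "W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8"],
    "dict": ["<UNK>", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"],
}


def indices_to_strings(feature, indices):
    """Map integer ids back to strings for one feature (hypothetical helper)."""
    index_to_string = vocab_map[feature]
    return [index_to_string[i] for i in indices]


def build_string_to_index(feature):
    """Invert the array to obtain a string -> id lookup."""
    return {token: i for i, token in enumerate(vocab_map[feature])}


decoded = indices_to_strings("text", [1, 3, 0])
text_stoi = build_string_to_index("text")
```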
output_names

names of output variables

Type:List[Str]
export_to_caffe2(model, export_path: str, export_onnx_path: str = None) → List[str][source]

Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model.

Parameters:
  • model (Model) – pytorch model to export
  • export_path (str) – path to save the exported caffe2 model
  • export_onnx_path (str) – path to save the exported onnx model
Returns:final_output_names – list of caffe2 model output names
Return type:List[str]

export_to_metrics(model, metric_channels)[source]

Exports the pytorch model to tensorboard as a graph.

Parameters:
  • model (Model) – pytorch model to export
  • metric_channels (List[Channel]) – outputs of model’s execution graph
classmethod from_config(config, feature_config: pytext.config.field_config.FeatureConfig, target_config: Union[pytext.config.pytext_config.ConfigBase, List[pytext.config.pytext_config.ConfigBase]], meta: pytext.data.data_handler.CommonMetadata, *args, **kwargs)[source]

Gather all the necessary metadata from configs and global metadata to be used in exporter

get_extra_params() → List[str][source]
Returns:list of blobs to be added as extra params to the caffe2 model
classmethod get_feature_metadata(feature_config: pytext.config.field_config.FeatureConfig, feature_meta: Dict[str, pytext.fields.field.FieldMeta])[source]
postprocess_output(init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, workspace: caffe2.python.workspace, output_names: List[str], py_model)[source]

Postprocess the model output and generate additional blobs for human-readable prediction. By default, it uses the export function of the PyTorch model’s output layer to append additional operators to the Caffe2 net.

Parameters:
  • init_net (caffe2.python.Net) – caffe2 init net created by the current graph
  • predict_net (caffe2.python.Net) – caffe2 net created by the current graph
  • workspace (caffe2.python.workspace) – caffe2 current workspace
  • output_names (List[str]) – current output names of the caffe2 net
  • py_model (Model) – original pytorch model object
Returns:
  • result – list of blobs that will be added to the caffe2 model
  • final_output_names – list of output names of the blobs to add

prepend_operators(c2_prepared: caffe2.python.onnx.backend_rep.Caffe2Rep, input_names: List[str]) → Tuple[caffe2.python.onnx.backend_rep.Caffe2Rep, List[str]][source]

Prepend operators to the converted Caffe2 net. Does nothing by default.

Parameters:
  • c2_prepared (Caffe2Rep) – caffe2 net rep
  • input_names (List[str]) – current input names to the caffe2 net
Returns:
  • c2_prepared (Caffe2Rep) – caffe2 net with prepended operators
  • input_names (List[str]) – list of input names for the new net

Module contents
class pytext.exporters.ModelExporter(config, input_names, dummy_model_input, vocab_map, output_names)[source]

Bases: pytext.config.component.Component

Model exporter exports a PyTorch model to Caffe2 model using ONNX

input_names

names of the input variables to the model forward function, in flattened form. For example, for forward(tokens, dict), where tokens is List[Tensor] and dict is a tuple of value and length (List[Tensor], List[Tensor]), the input names should look like [‘token’, ‘dict_value’, ‘dict_length’]

Type:List[Str]
dummy_model_input

dummy values to define the shape of input tensors, should exactly match the shape of the model forward function

Type:Tuple[torch.Tensor]
vocab_map

dict of input feature names to corresponding index_to_string array, e.g:

{
    "text": ["<UNK>", "W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8"],
    "dict": ["<UNK>", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"]
}
Type:Dict[str, List[str]]
output_names

names of output variables

Type:List[Str]
export_to_caffe2(model, export_path: str, export_onnx_path: str = None) → List[str][source]

Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model.

Parameters:
  • model (Model) – pytorch model to export
  • export_path (str) – path to save the exported caffe2 model
  • export_onnx_path (str) – path to save the exported onnx model
Returns:final_output_names – list of caffe2 model output names
Return type:List[str]

export_to_metrics(model, metric_channels)[source]

Exports the pytorch model to tensorboard as a graph.

Parameters:
  • model (Model) – pytorch model to export
  • metric_channels (List[Channel]) – outputs of model’s execution graph
classmethod from_config(config, feature_config: pytext.config.field_config.FeatureConfig, target_config: Union[pytext.config.pytext_config.ConfigBase, List[pytext.config.pytext_config.ConfigBase]], meta: pytext.data.data_handler.CommonMetadata, *args, **kwargs)[source]

Gather all the necessary metadata from configs and global metadata to be used in exporter

get_extra_params() → List[str][source]
Returns:list of blobs to be added as extra params to the caffe2 model
classmethod get_feature_metadata(feature_config: pytext.config.field_config.FeatureConfig, feature_meta: Dict[str, pytext.fields.field.FieldMeta])[source]
postprocess_output(init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, workspace: caffe2.python.workspace, output_names: List[str], py_model)[source]

Postprocess the model output and generate additional blobs for human-readable prediction. By default, it uses the export function of the PyTorch model’s output layer to append additional operators to the Caffe2 net.

Parameters:
  • init_net (caffe2.python.Net) – caffe2 init net created by the current graph
  • predict_net (caffe2.python.Net) – caffe2 net created by the current graph
  • workspace (caffe2.python.workspace) – caffe2 current workspace
  • output_names (List[str]) – current output names of the caffe2 net
  • py_model (Model) – original pytorch model object
Returns:
  • result – list of blobs that will be added to the caffe2 model
  • final_output_names – list of output names of the blobs to add

prepend_operators(c2_prepared: caffe2.python.onnx.backend_rep.Caffe2Rep, input_names: List[str]) → Tuple[caffe2.python.onnx.backend_rep.Caffe2Rep, List[str]][source]

Prepend operators to the converted Caffe2 net. Does nothing by default.

Parameters:
  • c2_prepared (Caffe2Rep) – caffe2 net rep
  • input_names (List[str]) – current input names to the caffe2 net
Returns:
  • c2_prepared (Caffe2Rep) – caffe2 net with prepended operators
  • input_names (List[str]) – list of input names for the new net

class pytext.exporters.DenseFeatureExporter(config, input_names, dummy_model_input, vocab_map, output_names)[source]

Bases: pytext.exporters.exporter.ModelExporter

Exporter for models that have DenseFeatures as input to the decoder

classmethod get_feature_metadata(feature_config: pytext.config.field_config.FeatureConfig, feature_meta: Dict[str, pytext.fields.field.FieldMeta])[source]
class pytext.exporters.InitPredictNetExporter(config, input_names, dummy_model_input, vocab_map, output_names)[source]

Bases: pytext.exporters.exporter.ModelExporter

Exporter for converting models to their caffe2 init and predict nets. Does not rely on c2_prepared, but rather splits the ONNX model into the init and predict nets directly.

export_to_caffe2(model, export_path: str, export_onnx_path: str = None) → List[str][source]

Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model.

Parameters:
  • model (Model) – pytorch model to export
  • export_path (str) – path to save the exported caffe2 model
  • export_onnx_path (str) – path to save the exported onnx model
Returns:final_output_names – list of caffe2 model output names
Return type:List[str]

get_export_paths(path)[source]
postprocess_output(init_net, predict_net, workspace, output_names: List[str], model)[source]

Postprocess the model output and generate additional blobs for human-readable prediction. By default, it uses the export function of the PyTorch model’s output layer to append additional operators to the Caffe2 net.

Parameters:
  • init_net (caffe2.python.Net) – caffe2 init net created by the current graph
  • predict_net (caffe2.python.Net) – caffe2 net created by the current graph
  • workspace (caffe2.python.workspace) – caffe2 current workspace
  • output_names (List[str]) – current output names of the caffe2 net
  • py_model (Model) – original pytorch model object
Returns:
  • result – list of blobs that will be added to the caffe2 model
  • final_output_names – list of output names of the blobs to add

prepend_operators(init_net, predict_net, input_names: List[str])[source]

Prepend operators to the converted Caffe2 net. Does nothing by default.

Parameters:
  • c2_prepared (Caffe2Rep) – caffe2 net rep
  • input_names (List[str]) – current input names to the caffe2 net
Returns:
  • c2_prepared (Caffe2Rep) – caffe2 net with prepended operators
  • input_names (List[str]) – list of input names for the new net

pytext.fields package

Submodules
pytext.fields.char_field module
class pytext.fields.char_field.CharFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • keyword arguments (Remaining) – Passed to the constructor of Vocab.
dummy_model_input = tensor([[[1, 1, 1]], [[1, 1, 1]]])
numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch: List[List[List[str]]]) → List[List[List[str]]][source]

Example of minibatch:

[[['p', 'l', 'a', 'y', '<PAD>', '<PAD>'],
  ['t', 'h', 'a', 't', '<PAD>', '<PAD>'],
  ['t', 'r', 'a', 'c', 'k', '<PAD>'],
  ['o', 'n', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
  ['r', 'e', 'p', 'e', 'a', 't']
 ], ...
]
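
The character-level padding shown in the example can be sketched for a single sentence as follows. This is a rough illustration only: pad_chars is a hypothetical helper, the cap at max_word_length is an assumption based on the constructor argument of that name, and the pad token is written as '<PAD>' to match the example rather than the constructor default.

```python
def pad_chars(sentence, pad_token="<PAD>", max_word_length=20):
    """Pad each word's character list to a common length (illustrative only)."""
    # Pad to the longest word in the sentence, capped at max_word_length.
    width = min(max(len(word) for word in sentence), max_word_length)
    padded = []
    for word in sentence:
        chars = list(word)[:width]
        padded.append(chars + [pad_token] * (width - len(chars)))
    return padded


padded = pad_chars(["play", "that", "track", "on", "repeat"])
```

Every inner list ends up with the same length, which is what allows the batch to be turned into a rectangular tensor.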
pytext.fields.contextual_token_embedding_field module
class pytext.fields.contextual_token_embedding_field.ContextualTokenEmbeddingField(**kwargs)[source]

Bases: pytext.fields.field.Field

numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]

Example of padded minibatch:

[[[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [3.1, 3.2, 3.3, 3.4, 3.5],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
]
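
A minimal sketch of the zero-vector padding shown in the example follows. The function pad_embeddings is hypothetical and is not the field's actual pad method; it assumes every token vector has the same dimension and that the first sequence is non-empty.

```python
def pad_embeddings(minibatch, pad_value=0.0):
    """Pad every sequence of token vectors to the longest one in the batch."""
    max_len = max(len(seq) for seq in minibatch)
    dim = len(minibatch[0][0])  # assumes a non-empty first sequence
    padded = []
    for seq in minibatch:
        # Append one fresh zero vector per missing token position.
        filler = [[pad_value] * dim for _ in range(max_len - len(seq))]
        padded.append(list(seq) + filler)
    return padded


batch = [
    [[0.1, 0.2], [1.1, 1.2], [2.1, 2.2]],
    [[0.1, 0.2]],
]
padded = pad_embeddings(batch)
```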
pytext.fields.dict_field module
class pytext.fields.dict_field.DictFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • keyword arguments (Remaining) – Passed to the constructor of Vocab.
dummy_model_input = (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))
numericalize(arr, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.

pytext.fields.field module
class pytext.fields.field.ActionField(**kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

class pytext.fields.field.DocLabelField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.Field(*args, **kwargs)[source]

Bases: torchtext.data.field.Field

classmethod from_config(config)[source]
get_meta() → pytext.fields.field.FieldMeta[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
pad_length(n)[source]

Override to make pad_length a multiple of 8 to support fp16 training.
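
Rounding a length up to a multiple of 8 can be sketched as below. The helper name pad_to_multiple_of_8 is hypothetical, and whether the library applies this rounding unconditionally or only when fp16 is enabled is not stated here.

```python
def pad_to_multiple_of_8(n):
    """Round n up to the nearest multiple of 8 (illustrative helper)."""
    remainder = n % 8
    return n if remainder == 0 else n + 8 - remainder


lengths = [pad_to_multiple_of_8(n) for n in (0, 5, 8, 9)]
```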

class pytext.fields.field.FieldMeta[source]

Bases: object

class pytext.fields.field.FloatField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.FloatVectorField(dim=0, dim_error_check=False, **kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.NestedField(*args, **kwargs)[source]

Bases: pytext.fields.field.Field, torchtext.data.field.NestedField

get_meta()[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
class pytext.fields.field.RawField(*args, is_target=False, **kwargs)[source]

Bases: torchtext.data.field.RawField

get_meta() → pytext.fields.field.FieldMeta[source]
class pytext.fields.field.SeqFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingNestedField

dummy_model_input = tensor([[[1]], [[1]]])
class pytext.fields.field.TextFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

dummy_model_input = tensor([[1], [1]])
class pytext.fields.field.VocabUsingField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.Field

Base class for all fields that need to build a vocabulary.

class pytext.fields.field.VocabUsingNestedField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField, pytext.fields.field.NestedField

Base class for all nested fields that need to build a vocabulary.

class pytext.fields.field.WordLabelField(use_bio_labels, **kwargs)[source]

Bases: pytext.fields.field.Field

get_meta()[source]
pytext.fields.field.create_fields(fields_config, field_cls_dict)[source]
pytext.fields.field.create_label_fields(label_configs, label_cls_dict)[source]
pytext.fields.text_field_with_special_unk module
class pytext.fields.text_field_with_special_unk.TextFeatureFieldWithSpecialUnk(*args, unkify_func=<function unkify>, **kwargs)[source]

Bases: pytext.fields.field.TextFeatureField

build_vocab(*args, min_freq=1, **kwargs)[source]

The code is exactly the same as torchtext.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed in torchtext.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.

numericalize(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]

The code is exactly the same as torchtext.data.Field.numericalize(), except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to get the index of an item from the vocab. This is needed because torchtext doesn’t allow custom UNKification. The TextFeatureFieldWithSpecialUnk constructor therefore accepts a function unkify_func() that can be used for UNKifying instead of assigning all UNKs a default value.
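
The idea of a lookup with custom UNKification can be sketched as follows. Everything in this block is an assumption made for illustration: simple_unkify, the UNK class names such as "<unk-NUM>", and the standalone get_idx function are hypothetical and do not reproduce the library's unkify function or _get_idx method.

```python
def simple_unkify(token):
    """Hypothetical unkify_func: pick an UNK class from the token's shape."""
    if token.isdigit():
        return "<unk-NUM>"
    if token[:1].isupper():
        return "<unk-CAP>"
    return "<unk>"


def get_idx(token, stoi, unkify_func=simple_unkify):
    """Look up a token id, falling back to the id of its UNK class."""
    if token in stoi:
        return stoi[token]
    return stoi[unkify_func(token)]


stoi = {"<unk>": 0, "<unk-NUM>": 1, "<unk-CAP>": 2, "play": 3}
ids = [get_idx(t, stoi) for t in ["play", "42", "Paris", "xyz"]]
```

Compared with a single default UNK id, this keeps some information about unseen tokens, such as whether they were numeric or capitalized.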

Module contents
pytext.fields.create_fields(fields_config, field_cls_dict)[source]
pytext.fields.create_label_fields(label_configs, label_cls_dict)[source]
class pytext.fields.ActionField(**kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

class pytext.fields.CharFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • keyword arguments (Remaining) – Passed to the constructor of Vocab.
dummy_model_input = tensor([[[1, 1, 1]], [[1, 1, 1]]])
numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch: List[List[List[str]]]) → List[List[List[str]]][source]

Example of minibatch:

[[['p', 'l', 'a', 'y', '<PAD>', '<PAD>'],
  ['t', 'h', 'a', 't', '<PAD>', '<PAD>'],
  ['t', 'r', 'a', 'c', 'k', '<PAD>'],
  ['o', 'n', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
  ['r', 'e', 'p', 'e', 'a', 't']
 ], ...
]
class pytext.fields.ContextualTokenEmbeddingField(**kwargs)[source]

Bases: pytext.fields.field.Field

numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]

Example of padded minibatch:

[[[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [3.1, 3.2, 3.3, 3.4, 3.5],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
]
class pytext.fields.DictFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • keyword arguments (Remaining) – Passed to the constructor of Vocab.
dummy_model_input = (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))
numericalize(arr, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.

class pytext.fields.DocLabelField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.Field(*args, **kwargs)[source]

Bases: torchtext.data.field.Field

classmethod from_config(config)[source]
get_meta() → pytext.fields.field.FieldMeta[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
pad_length(n)[source]

Override to make pad_length a multiple of 8 to support fp16 training.

class pytext.fields.FieldMeta[source]

Bases: object

class pytext.fields.FloatField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.FloatVectorField(dim=0, dim_error_check=False, **kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.RawField(*args, is_target=False, **kwargs)[source]

Bases: torchtext.data.field.RawField

get_meta() → pytext.fields.field.FieldMeta[source]
class pytext.fields.TextFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

dummy_model_input = tensor([[1], [1]])
class pytext.fields.VocabUsingField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.Field

Base class for all fields that need to build a vocabulary.

class pytext.fields.WordLabelField(use_bio_labels, **kwargs)[source]

Bases: pytext.fields.field.Field

get_meta()[source]
class pytext.fields.NestedField(*args, **kwargs)[source]

Bases: pytext.fields.field.Field, torchtext.data.field.NestedField

get_meta()[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
class pytext.fields.VocabUsingNestedField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField, pytext.fields.field.NestedField

Base class for all nested fields that need to build a vocabulary.

class pytext.fields.SeqFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingNestedField

dummy_model_input = tensor([[[1]], [[1]]])
class pytext.fields.TextFeatureFieldWithSpecialUnk(*args, unkify_func=<function unkify>, **kwargs)[source]

Bases: pytext.fields.field.TextFeatureField

build_vocab(*args, min_freq=1, **kwargs)[source]

The code is exactly the same as torchtext.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed in torchtext.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.

numericalize(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]

The code is exactly the same as torchtext.data.Field.numericalize(), except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to get the index of an item from the vocab. This is needed because torchtext doesn’t allow custom UNKification. The TextFeatureFieldWithSpecialUnk constructor therefore accepts a function, unkify_func(), that is used for UNKification instead of assigning all UNKs a default value.
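A minimal sketch of what a custom UNKification lookup might look like is given below. The `unkify` rule, the UNK class names and the `get_idx` helper are all hypothetical stand-ins; the behavior of the real unkify function and of self._get_idx() may differ.

```python
from typing import Callable, Dict


def unkify(token: str) -> str:
    """Hypothetical rule: bucket unknown tokens by a coarse surface shape."""
    if token.isdigit():
        return "<unk-num>"
    if token[:1].isupper():
        return "<unk-cap>"
    return "<unk>"


def get_idx(
    token: str,
    stoi: Dict[str, int],
    unkify_func: Callable[[str], str] = unkify,
) -> int:
    """Look up a token, falling back to the index of its UNK class."""
    if token in stoi:
        return stoi[token]
    return stoi[unkify_func(token)]


# Toy vocabulary; the UNK classes must themselves be vocabulary entries.
stoi = {"<unk>": 0, "<unk-num>": 1, "<unk-cap>": 2, "hello": 3}
indices = [get_idx(t, stoi) for t in ["hello", "2019", "Paris", "xyz"]]
```

Compared with a plain stoi lookup, unknown tokens are spread over several indices rather than collapsed into a single default one.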

pytext.loss package

Submodules
pytext.loss.loss module
class pytext.loss.loss.AUCPRHingeLoss(config, weights=None, *args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, pytext.loss.loss.Loss

Area under the precision-recall curve (AUCPR) hinge loss. Reference: “Scalable Learning of Non-Decomposable Objectives”, Section 5. TensorFlow implementation: https://github.com/tensorflow/models/tree/master/research/global_objectives

forward(logits, targets, reduce=True, size_average=True, weights=None)[source]
Parameters:
  • logits – Variable \((N, C)\) where C = number of classes
  • targets – Variable \((N)\) where each value is 0 <= targets[i] <= C-1
  • weights – Coefficients for the loss. Must be a Tensor of shape [N] or [N, C], where N = batch_size, C = number of classes.
  • size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch. Default: True
  • reduce (bool, optional) – By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per input/target element instead and ignores size_average. Default: True
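The interaction between the reduce and size_average flags can be sketched with plain Python lists. The helper name `reduce_losses` is hypothetical, and the per-example losses are taken as given; this shows only the reduction step, not the AUCPR hinge computation itself.

```python
from typing import List, Union


def reduce_losses(
    per_example: List[float],
    reduce: bool = True,
    size_average: bool = True,
) -> Union[float, List[float]]:
    """Apply the reduction described by the two flags."""
    if not reduce:
        # One loss per element; size_average is ignored in this case.
        return list(per_example)
    total = sum(per_example)
    return total / len(per_example) if size_average else total


losses = [1.0, 2.0, 3.0]
averaged = reduce_losses(losses)
summed = reduce_losses(losses, size_average=False)
unreduced = reduce_losses(losses, reduce=False, size_average=False)
```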
class pytext.loss.loss.BinaryCrossEntropyLoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.CosineEmbeddingLoss(config, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.CrossEntropyLoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.KLDivergenceBCELoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.KLDivergenceCELoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.LabelSmoothedCrossEntropyLoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.Loss(config=None, *args, **kwargs)[source]

Bases: pytext.config.component.Component

Base class for loss functions

class pytext.loss.loss.MAELoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

Mean absolute error or L1 loss, for regression tasks.

class pytext.loss.loss.MSELoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

Mean squared error or L2 loss, for regression tasks.
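For reference, the two regression losses above reduce to the following arithmetic. This sketch uses plain Python floats and assumes a mean reduction over the batch; the function names are illustrative and are not the PyText classes.

```python
from typing import Sequence


def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error (L1): average of |prediction - target|."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def mse(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error (L2): average of (prediction - target) squared."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


preds, targets = [1.0, 2.0, 4.0], [1.0, 3.0, 2.0]
l1 = mae(preds, targets)  # errors 0, 1, 2
l2 = mse(preds, targets)  # squared errors 0, 1, 4
```

The squared error penalizes the largest deviation more heavily, which is why the two losses can rank the same predictions differently.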

class pytext.loss.loss.MultiLabelSoftMarginLoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.NLLLoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.loss.PairwiseRankingLoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

Given embeddings for a query, a positive response and a negative response, computes the pairwise ranking hinge loss.

static get_similarities(embeddings)[source]
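A pairwise ranking hinge loss of this kind can be sketched as follows. The use of cosine similarity and the margin of 1.0 are assumptions made for illustration; the source does not state which similarity get_similarities() computes or which margin the loss uses.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pairwise_ranking_hinge(
    query: Sequence[float],
    pos: Sequence[float],
    neg: Sequence[float],
    margin: float = 1.0,  # hypothetical default
) -> float:
    """Zero once the positive response beats the negative one by the margin."""
    return max(0.0, margin - cosine(query, pos) + cosine(query, neg))


well_ranked = pairwise_ranking_hinge([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_ranked = pairwise_ranking_hinge([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```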
Module contents
class pytext.loss.AUCPRHingeLoss(config, weights=None, *args, **kwargs)[source]

Bases: torch.nn.modules.module.Module, pytext.loss.loss.Loss

Area under the precision-recall curve (AUCPR) hinge loss. Reference: “Scalable Learning of Non-Decomposable Objectives”, Section 5. TensorFlow implementation: https://github.com/tensorflow/models/tree/master/research/global_objectives

forward(logits, targets, reduce=True, size_average=True, weights=None)[source]
Parameters:
  • logits – Variable \((N, C)\) where C = number of classes
  • targets – Variable \((N)\) where each value is 0 <= targets[i] <= C-1
  • weights – Coefficients for the loss. Must be a Tensor of shape [N] or [N, C], where N = batch_size, C = number of classes.
  • size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch. Default: True
  • reduce (bool, optional) – By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per input/target element instead and ignores size_average. Default: True
class pytext.loss.Loss(config=None, *args, **kwargs)[source]

Bases: pytext.config.component.Component

Base class for loss functions

class pytext.loss.CrossEntropyLoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.CosineEmbeddingLoss(config, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.BinaryCrossEntropyLoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.MultiLabelSoftMarginLoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.KLDivergenceBCELoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.KLDivergenceCELoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.MAELoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

Mean absolute error or L1 loss, for regression tasks.

class pytext.loss.MSELoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

Mean squared error or L2 loss, for regression tasks.

class pytext.loss.NLLLoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

class pytext.loss.PairwiseRankingLoss(config=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

Given embeddings for a query, a positive response and a negative response, computes the pairwise ranking hinge loss.

static get_similarities(embeddings)[source]
class pytext.loss.LabelSmoothedCrossEntropyLoss(config, ignore_index=-100, weight=None, *args, **kwargs)[source]

Bases: pytext.loss.loss.Loss

pytext.metric_reporters package

Submodules
pytext.metric_reporters.channel module
class pytext.metric_reporters.channel.Channel(stages: Tuple[pytext.common.constants.Stage, ...] = (<Stage.TRAIN: 'Training'>, <Stage.EVAL: 'Evaluation'>, <Stage.TEST: 'Test'>))[source]

Bases: object

Channel defines how to format and report the result of a PyText job to an output stream.

stages

The stages in which the report is triggered. Defaults to all stages: train, eval and test.

close()[source]
export(model, input_to_model=None, **kwargs)[source]
report(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]

Defines how to format and report data to the output channel.

Parameters:
  • stage (Stage) – train, eval or test
  • epoch (int) – current epoch
  • metrics (Any) – all metrics
  • model_select_metric (double) – a single numeric metric to pick best model
  • loss (double) – average loss
  • preds (List[Any]) – list of predictions
  • targets (List[Any]) – list of targets
  • scores (List[Any]) – list of scores
  • context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
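To make the report() contract concrete, here is a small stand-in channel that keeps one summary line per call. It deliberately does not import PyText: the class `ListChannel` is hypothetical, stages are plain strings rather than Stage members, and the stage filter is placed inside report() for brevity (in PyText it may be applied before the channel is called). A real channel would subclass Channel.

```python
class ListChannel:
    """Stand-in channel that records one tab-separated line per report."""

    def __init__(self, stages=("Training", "Evaluation", "Test")):
        self.stages = stages
        self.lines = []

    def report(self, stage, epoch, metrics, model_select_metric, loss,
               preds, targets, scores, context, *args):
        # Skip stages this channel was not configured for.
        if stage not in self.stages:
            return
        self.lines.append(
            f"{stage}\tepoch={epoch}\tloss={loss:.4f}"
            f"\tselect={model_select_metric:.4f}\tn={len(preds)}"
        )

    def close(self):
        pass


channel = ListChannel(stages=("Evaluation",))
channel.report("Training", 1, {"accuracy": 0.4}, 0.4, 0.9, [1], [0], [0.6], {})
channel.report("Evaluation", 1, {"accuracy": 0.5}, 0.5, 0.25,
               [1, 0], [1, 1], [0.9, 0.4], {})
channel.close()
```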
class pytext.metric_reporters.channel.ConsoleChannel(stages: Tuple[pytext.common.constants.Stage, ...] = (<Stage.TRAIN: 'Training'>, <Stage.EVAL: 'Evaluation'>, <Stage.TEST: 'Test'>))[source]

Bases: pytext.metric_reporters.channel.Channel

Simple Channel that prints results to console.

report(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]

Defines how to format and report data to the output channel.

Parameters:
  • stage (Stage) – train, eval or test
  • epoch (int) – current epoch
  • metrics (Any) – all metrics
  • model_select_metric (double) – a single numeric metric to pick best model
  • loss (double) – average loss
  • preds (List[Any]) – list of predictions
  • targets (List[Any]) – list of targets
  • scores (List[Any]) – list of scores
  • context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
class pytext.metric_reporters.channel.FileChannel(stages, file_path)[source]

Bases: pytext.metric_reporters.channel.Channel

Simple Channel that writes results to a TSV file.

gen_content(metrics, loss, preds, targets, scores, context)[source]
get_title(context_keys=())[source]
report(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]

Defines how to format and report data to the output channel.

Parameters:
  • stage (Stage) – train, eval or test
  • epoch (int) – current epoch
  • metrics (Any) – all metrics
  • model_select_metric (double) – a single numeric metric to pick best model
  • loss (double) – average loss
  • preds (List[Any]) – list of predictions
  • targets (List[Any]) – list of targets
  • scores (List[Any]) – list of scores
  • context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
class pytext.metric_reporters.channel.TensorBoardChannel(summary_writer=None, metric_name='accuracy')[source]

Bases: pytext.metric_reporters.channel.Channel

TensorBoardChannel defines how to format and report the result of a PyText job to TensorBoard.

summary_writer

An instance of the TensorBoard SummaryWriter class, or an object that implements the same interface. https://pytorch.org/docs/stable/tensorboard.html

metric_name

The name of the default metric to display on the TensorBoard dashboard, defaults to “accuracy”

train_step

The training step count

add_scalars(prefix, metrics, epoch)[source]

Recursively flattens the metrics object and adds each field name and value as a scalar for the corresponding epoch using the summary writer.

Parameters:
  • prefix (str) – The tag prefix for the metric. Each field name in the metrics object will be prepended with the prefix.
  • metrics (Any) – The metrics object.
add_texts(tag, metrics)[source]

Recursively flattens the metrics object and adds each field name and value as a text using the summary writer. For example, if tag = “test”, and metrics = { accuracy: 0.7, scores: { precision: 0.8, recall: 0.6 } }, then under “tag=test” we will display “accuracy=0.7”, and under “tag=test/scores” we will display “precision=0.8” and “recall=0.6” in TensorBoard.

Parameters:
  • tag (str) – The tag name for the metric. If a field needs to be flattened further, it will be prepended as a prefix to the field name.
  • metrics (Any) – The metrics object.
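The recursive flattening described for add_texts can be sketched without TensorBoard. The helper `flatten_texts` is hypothetical and handles only nested dicts; the real metrics objects may be other structures (for example named tuples), and the real method writes to the summary writer instead of returning a dict.

```python
from typing import Any, Dict, List


def flatten_texts(tag: str, metrics: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group "name=value" strings under the tag they would be shown with."""
    grouped: Dict[str, List[str]] = {}
    for name, value in metrics.items():
        if isinstance(value, dict):
            # Nested field: the current tag becomes a prefix of the new tag.
            grouped.update(flatten_texts(f"{tag}/{name}", value))
        else:
            grouped.setdefault(tag, []).append(f"{name}={value}")
    return grouped


result = flatten_texts(
    "test",
    {"accuracy": 0.7, "scores": {"precision": 0.8, "recall": 0.6}},
)
```

For the example from the text, result maps "test" to ["accuracy=0.7"] and "test/scores" to ["precision=0.8", "recall=0.6"].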
close()[source]

Closes the summary writer.

export(model, input_to_model=None, **kwargs)[source]

Draws the neural network representation graph in TensorBoard.

Parameters:
  • model (Any) – the model object.
  • input_to_model (Any) – the input to the model (required for PyTorch models, since PyTorch builds the execution graph when the model is run).
report(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, meta, model, optimizer, *args)[source]

Defines how to format and report data to TensorBoard using the summary writer. In the current implementation, during the train/eval phase we recursively report each metric field as scalars, and during the test phase we report the final metrics to be displayed as texts.

Also visualizes the internal model states (weights, biases) as histograms in TensorBoard.

Parameters:
  • stage (Stage) – train, eval or test
  • epoch (int) – current epoch
  • metrics (Any) – all metrics
  • model_select_metric (double) – a single numeric metric to pick best model
  • loss (double) – average loss
  • preds (List[Any]) – list of predictions
  • targets (List[Any]) – list of targets
  • scores (List[Any]) – list of scores
  • context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
  • meta (Dict[str, Any]) – global metadata, such as target names
  • model (nn.Module) – the PyTorch neural network model
pytext.metric_reporters.classification_metric_reporter module
class pytext.metric_reporters.classification_metric_reporter.ClassificationMetricReporter(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
classmethod from_config_and_label_names(config, label_names: List[str])[source]
get_meta()[source]

Get global metadata that is not specific to any batch; the data will be passed along to channels.

get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

class pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric[source]

Bases: enum.Enum

An enumeration.

ACCURACY = 'accuracy'
LABEL_AVG_PRECISION = 'label_avg_precision'
LABEL_F1 = 'label_f1'
LABEL_ROC_AUC = 'label_roc_auc'
MACRO_F1 = 'macro_f1'
MCC = 'mcc'
NEGATIVE_LOSS = 'negative_loss'
ROC_AUC = 'roc_auc'
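Because every member has a string value, a metric name taken from a config can be mapped to a member by value. The sketch below re-declares the enumeration locally so that it runs without PyText; the member names and values are copied from the listing above, while the `name_from_config` variable is hypothetical.

```python
from enum import Enum


class ComparableClassificationMetric(Enum):
    """Local copy of the listing above, for illustration only."""
    ACCURACY = "accuracy"
    LABEL_AVG_PRECISION = "label_avg_precision"
    LABEL_F1 = "label_f1"
    LABEL_ROC_AUC = "label_roc_auc"
    MACRO_F1 = "macro_f1"
    MCC = "mcc"
    NEGATIVE_LOSS = "negative_loss"
    ROC_AUC = "roc_auc"


name_from_config = "macro_f1"  # hypothetical config value
metric = ComparableClassificationMetric(name_from_config)  # lookup by value
```

An unknown name raises ValueError, which surfaces typos in a config early.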
class pytext.metric_reporters.classification_metric_reporter.MultiLabelClassificationMetricReporter(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]

Bases: pytext.metric_reporters.classification_metric_reporter.ClassificationMetricReporter

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

pytext.metric_reporters.compositional_metric_reporter module
class pytext.metric_reporters.compositional_metric_reporter.CompositionalMetricReporter(actions_vocab, channels: List[pytext.metric_reporters.channel.Channel], text_column_name: str = 'tokenized_text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

create_frame_prediction_pairs()[source]
classmethod from_config(config, metadata: pytext.data.data_handler.CommonMetadata = None, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer] = None)[source]
gen_extra_context()[source]

Generate any extra intermediate context data for metric calculation

get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

static node_to_metrics_node(node: Union[pytext.data.data_structures.annotation.Intent, pytext.data.data_structures.annotation.Slot], start: int = 0) → pytext.metrics.intent_slot_metrics.Node[source]

The input start is the absolute start position in the utterance.

predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

static tree_from_tokens_and_indx_actions(token_str_list: List[str], actions_vocab: List[str], actions_indices: List[int], validate_tree: bool = True)[source]
static tree_to_metric_node(tree: pytext.data.data_structures.annotation.Tree) → pytext.metrics.intent_slot_metrics.Node[source]

Creates a Node from tree, assuming the utterance is a concatenation of the tokens joined by whitespace. The function does not necessarily reproduce the original utterance, as extra whitespace can be introduced.
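The whitespace assumption can be illustrated by computing character spans for tokens joined by single spaces. The helper `token_spans` is hypothetical and unrelated to the actual Node construction, and the end-exclusive span convention is an assumption.

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each token in " ".join(tokens)."""
    spans = []
    start = 0
    for token in tokens:
        end = start + len(token)
        spans.append((start, end))
        start = end + 1  # skip the single joining space
    return spans


tokens = ["set", "an", "alarm"]
utterance = " ".join(tokens)
spans = token_spans(tokens)
```

If the original utterance contained different spacing, these offsets would not line up with it, which is the caveat noted above.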

pytext.metric_reporters.disjoint_multitask_metric_reporter module
class pytext.metric_reporters.disjoint_multitask_metric_reporter.DisjointMultitaskMetricReporter(reporters: Dict[str, pytext.metric_reporters.metric_reporter.MetricReporter], loss_weights: Dict[str, float], target_task_name: Optional[str], use_subtask_select_metric: bool)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
add_channel(channel)[source]
batch_context(raw_batch, batch)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

lower_is_better = False
report_metric(model, stage, epoch, reset=True, print_to_channels=True, optimizer=None)[source]

Calculate metrics and average loss, and report all statistics to channels.

Parameters:
  • model (nn.Module) – the PyTorch neural network model.
  • stage (Stage) – training, evaluation or test
  • epoch (int) – current epoch
  • reset (bool) – whether all data should be reset after the report, default is True
  • print_to_channels (bool) – whether to report data to channels, default is True
report_realtime_metric(stage)[source]
pytext.metric_reporters.intent_slot_detection_metric_reporter module
class pytext.metric_reporters.intent_slot_detection_metric_reporter.IntentSlotMetricReporter(doc_label_names: List[str], word_label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel], slot_column_name: str = 'slots', text_column_name: str = 'text', token_tensorizer_name: str = 'tokens')[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

aggregate_preds(batch_preds, batch_context)[source]
aggregate_scores(batch_scores)[source]
aggregate_targets(batch_targets, batch_context)[source]
batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizers: Optional[Dict[KT, VT]] = None)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

get_raw_slot_str(raw_data_row)[source]
predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

pytext.metric_reporters.intent_slot_detection_metric_reporter.create_frame(text, intent_label, slot_names_str, byte_len)[source]
pytext.metric_reporters.intent_slot_detection_metric_reporter.frame_to_str(frame: pytext.metrics.intent_slot_metrics.Node)[source]
pytext.metric_reporters.language_model_metric_reporter module
class pytext.metric_reporters.language_model_metric_reporter.LanguageModelChannel(stages, file_path)[source]

Bases: pytext.metric_reporters.channel.FileChannel

gen_content(metrics, loss, preds, targets, scores, contexts)[source]
get_title(context_keys=())[source]
class pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter(channels, metadata, tensorizers, aggregate_metrics, perplexity_type, pep_format)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

LABELS_COLUMN = 'labels'
RAW_TEXT_COLUMN = 'text'
TOKENS_COLUMN = 'tokens'
UTTERANCE_COLUMN = 'utterance'
add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
aggregate_context(context)[source]
aggregate_scores(scores)[source]
batch_context(raw_batch, batch)[source]
calculate_loss() → float[source]

Calculate the average loss over all aggregated batches.

calculate_metric() → pytext.metrics.language_model_metrics.LanguageModelMetric[source]

Calculate metrics. Each subclass should implement this method.

compute_scores(logits, targets)[source]
classmethod from_config(config: pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter.Config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
get_model_select_metric(metrics) → float[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

lower_is_better = True
class pytext.metric_reporters.language_model_metric_reporter.MaskedLMMetricReporter(channels, metadata, tensorizers, aggregate_metrics, perplexity_type, pep_format)[source]

Bases: pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter

add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
calculate_loss() → float[source]

Calculate the average loss over all aggregated batches.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
report_realtime_metric(stage)[source]
pytext.metric_reporters.language_model_metric_reporter.get_perplexity_func(perplexity_type)[source]
pytext.metric_reporters.metric_reporter module
class pytext.metric_reporters.metric_reporter.MetricReporter(channels, pep_format=False)[source]

Bases: pytext.config.component.Component

MetricReporter is responsible for three things:

  1. Aggregate output from trainer, which includes model inputs, predictions, targets, scores, and loss.
  2. Calculate metrics using the aggregated output, and define how the metric is used to find best model
  3. Optionally report the metrics and aggregated output to various channels
lower_is_better

Whether a lower metric indicates better performance. Set to True for e.g. perplexity, and False for e.g. accuracy. Default is False

Type: bool
channels

A list of Channel objects that receive the metrics and the aggregated trainer output, then format and report them in any customized way.

Type: List[Channel]

MetricReporter is tightly coupled with metric aggregation and computation, which makes it hard for subclasses to reuse the parent's functionality and attributes. The next step is to decouple metric aggregation and computation from metric reporting.

add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
add_channel(channel)[source]
classmethod aggregate_data(all_data, new_batch)[source]

Aggregate a batch of data; this essentially converts tensors to lists of native Python data.
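A rough sketch of this kind of aggregation is shown below. To stay runnable without PyTorch it treats any object with a tolist() method as tensor-like and uses array.array from the standard library as a stand-in for a tensor. The function is a simplified, hypothetical version; the real classmethod may handle nested structures differently.

```python
from array import array
from typing import Any, List


def aggregate_data(all_data: List[Any], new_batch: Any) -> None:
    """Append one batch to all_data as native Python values."""
    if new_batch is None:
        return
    if hasattr(new_batch, "tolist"):  # tensor-like: convert to a list first
        new_batch = new_batch.tolist()
    all_data.extend(new_batch)


all_preds: List[int] = []
aggregate_data(all_preds, array("i", [1, 0]))  # stand-in for a tensor
aggregate_data(all_preds, [2])                 # already a native list
aggregate_data(all_preds, None)                # nothing to add
```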

aggregate_preds(batch_preds, batch_context=None)[source]
aggregate_scores(batch_scores)[source]
aggregate_targets(batch_targets, batch_context=None)[source]
batch_context(raw_batch, batch)[source]
calculate_loss()[source]

Calculate the average loss over all aggregated batches.
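One plausible way to average losses across batches is to weight each batch loss by its batch size, so that a small final batch does not count as much as a full one. The sketch below assumes this weighting; the source does not say how the average is computed, and the function name is hypothetical.

```python
from typing import Sequence


def average_loss(batch_losses: Sequence[float],
                 batch_sizes: Sequence[int]) -> float:
    """Average of per-batch mean losses, weighted by batch size."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Three examples with mean loss 1.0, then one example with loss 3.0.
avg = average_loss([1.0, 3.0], [3, 1])
```

An unweighted mean of the same two batch losses would be 2.0, whereas the weighted version gives 1.5.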

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

compare_metric(new_metric, old_metric)[source]

Check if new metric indicates better model performance

Returns: bool, True if the model with new_metric performs better
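The comparison depends on lower_is_better. A minimal sketch follows, written as a free function with lower_is_better passed in explicitly rather than read from the reporter. Treating ties as "not better" is an assumption.

```python
def compare_metric(new_metric: float, old_metric: float,
                   lower_is_better: bool = False) -> bool:
    """Return True if new_metric indicates a better model than old_metric."""
    if new_metric == old_metric:
        return False  # assumption: a tie is not an improvement
    return (new_metric < old_metric) == lower_is_better


accuracy_improved = compare_metric(0.9, 0.8)                        # higher wins
perplexity_improved = compare_metric(3.0, 5.0, lower_is_better=True)  # lower wins
perplexity_worse = compare_metric(5.0, 3.0, lower_is_better=True)
```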
gen_extra_context()[source]

Generate any extra intermediate context data for metric calculation

get_meta()[source]

Get global metadata that is not specific to any batch; the data will be passed along to channels.

get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

lower_is_better = False
predictions_to_report()[source]

Generate human readable predictions

report_metric(model, stage, epoch, reset=True, print_to_channels=True, optimizer=None)[source]

Calculate metrics and average loss, and report all statistics to channels.

Parameters:
  • model (nn.Module) – the PyTorch neural network model.
  • stage (Stage) – training, evaluation or test
  • epoch (int) – current epoch
  • reset (bool) – whether all data should be reset after the report, default is True
  • print_to_channels (bool) – whether to report data to channels, default is True
report_realtime_metric(stage)[source]
targets_to_report()[source]

Generate human readable targets

class pytext.metric_reporters.metric_reporter.PureLossMetricReporter(channels, pep_format=False)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, *args, **kwargs)[source]
lower_is_better = True
pytext.metric_reporters.pairwise_ranking_metric_reporter module
class pytext.metric_reporters.pairwise_ranking_metric_reporter.PairwiseRankingMetricReporter(channels, pep_format=False)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
static get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

pytext.metric_reporters.regression_metric_reporter module
class pytext.metric_reporters.regression_metric_reporter.RegressionMetricReporter(channels, pep_format=False)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizers=None)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default the metric itself is returned, but metrics are usually more complicated data structures.

lower_is_better = False
pytext.metric_reporters.squad_metric_reporter module
class pytext.metric_reporters.squad_metric_reporter.SquadFileChannel(stages, file_path)[source]

Bases: pytext.metric_reporters.channel.FileChannel

gen_content(metrics, loss, preds, targets, scores, contexts, *args)[source]
get_title(context_keys=())[source]
class pytext.metric_reporters.squad_metric_reporter.SquadMetricReporter(channels: List[pytext.metric_reporters.channel.Channel], n_best_size: int, max_answer_length: int, ignore_impossible: bool, has_answer_labels: List[str], tensorizer=None, false_label='False')[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

ANSWERS_COLUMN = 'answers'
DOC_COLUMN = 'doc'
QUES_COLUMN = 'question'
ROW_INDEX = 'id'
add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **contexts)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data; each value can be either a list that maps to each example, or a single value for the batch
aggregate_preds(new_batch, context=None)[source]
aggregate_scores(new_batch)[source]
aggregate_targets(new_batch, context=None)[source]
batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, *args, tensorizers=None, **kwargs)[source]
get_model_select_metric(metric: pytext.metrics.squad_metrics.SquadMetrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

pytext.metric_reporters.word_tagging_metric_reporter module
class pytext.metric_reporters.word_tagging_metric_reporter.NERMetricReporter(label_names: List[str], pad_idx: int, channels: List[pytext.metric_reporters.channel.Channel], use_bio_labels: bool = True)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric() → pytext.metrics.PRF1Metrics[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizer)[source]
static get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

class pytext.metric_reporters.word_tagging_metric_reporter.SequenceTaggingMetricReporter(label_names, pad_idx, channels)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizer)[source]
static get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

class pytext.metric_reporters.word_tagging_metric_reporter.Span(label, start, end)[source]

Bases: tuple

end

Alias for field number 2

label

Alias for field number 0

start

Alias for field number 1
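
The "Alias for field number N" entries indicate that Span is a named tuple, so each attribute is also reachable by its tuple index. A minimal sketch of that relationship, using a local stand-in (SpanSketch is a hypothetical name, and the values are illustrative only) instead of the PyText class:

```python
from collections import namedtuple

# Local stand-in with the same field order as the documented Span(label, start, end).
SpanSketch = namedtuple("SpanSketch", ["label", "start", "end"])

span = SpanSketch(label="PER", start=0, end=2)  # illustrative values only

# "Field number N" is the tuple index, so both access styles agree.
assert span[0] == span.label
assert span[1] == span.start
assert span[2] == span.end

# Named tuples also unpack positionally, in field order.
label, start, end = span
```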

class pytext.metric_reporters.word_tagging_metric_reporter.WordTaggingMetricReporter(label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel])[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

calculate_loss()[source]

Calculate the average loss over all aggregated batches.

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

process_pred(pred: List[int]) → List[str][source]

pred is a list of token label indices.

pytext.metric_reporters.word_tagging_metric_reporter.convert_bio_to_spans(bio_sequence: List[str]) → List[pytext.metric_reporters.word_tagging_metric_reporter.Span][source]

Process the output and convert to spans for evaluation.
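
The docstring does not spell out the conversion rules, so the following is only a sketch of one common way to turn a BIO tag sequence into spans. It assumes token-index offsets with an exclusive end, and it treats a stray "I-" tag as the start of a new span; PyText's actual behaviour may differ on both points. SpanSketch and bio_to_spans_sketch are hypothetical names.

```python
from collections import namedtuple
from typing import List

# Local stand-in mirroring the documented Span(label, start, end) fields.
SpanSketch = namedtuple("SpanSketch", ["label", "start", "end"])


def bio_to_spans_sketch(bio_sequence: List[str]) -> List[SpanSketch]:
    """Group BIO tags into (label, start, end) spans; end is exclusive (assumption)."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(bio_sequence):
        # A tag extends the open span only if it is "I-<same label>".
        if label is not None and tag == "I-" + label:
            continue
        # Otherwise close whatever span is open.
        if label is not None:
            spans.append(SpanSketch(label, start, i))
            label = None
        # "B-" opens a span; a stray "I-" is also treated as an opener (assumption).
        if tag.startswith("B-") or tag.startswith("I-"):
            label, start = tag[2:], i
    if label is not None:
        spans.append(SpanSketch(label, start, len(bio_sequence)))
    return spans


spans = bio_to_spans_sketch(["B-PER", "I-PER", "O", "B-LOC"])
# spans == [SpanSketch("PER", 0, 2), SpanSketch("LOC", 3, 4)]
```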

pytext.metric_reporters.word_tagging_metric_reporter.get_slots(word_names)[source]
Module contents
class pytext.metric_reporters.Channel(stages: Tuple[pytext.common.constants.Stage, ...] = (<Stage.TRAIN: 'Training'>, <Stage.EVAL: 'Evaluation'>, <Stage.TEST: 'Test'>))[source]

Bases: object

Channel defines how to format and report the result of a PyText job to an output stream.

stages

The stages in which the report is triggered. Defaults to all stages: train, eval, and test.

close()[source]
export(model, input_to_model=None, **kwargs)[source]
report(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]

Defines how to format and report data to the output channel.

Parameters:
  • stage (Stage) – train, eval or test
  • epoch (int) – current epoch
  • metrics (Any) – all metrics
  • model_select_metric (double) – a single numeric metric to pick best model
  • loss (double) – average loss
  • preds (List[Any]) – list of predictions
  • targets (List[Any]) – list of targets
  • scores (List[Any]) – list of scores
  • context (Dict[str, List[Any]]) – dict of any additional context data; each context is a list of data that maps to each example
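
To make the shape of these arguments concrete, here is a small stand-alone sketch that mimics the documented report signature without importing PyText. ListChannelSketch is a hypothetical class, not part of the library, and the line format it produces is an arbitrary choice made for illustration.

```python
class ListChannelSketch:
    """Stand-in that follows the documented Channel.report signature."""

    def __init__(self):
        self.lines = []  # formatted report lines, kept in memory

    def report(self, stage, epoch, metrics, model_select_metric, loss,
               preds, targets, scores, context, *args):
        # One summary line per report call.
        self.lines.append(
            f"{stage} epoch {epoch}: loss={loss:.4f}, select={model_select_metric}"
        )
        # preds, targets, scores and every context list are aligned per example.
        for i, (pred, target, score) in enumerate(zip(preds, targets, scores)):
            extras = {key: values[i] for key, values in context.items()}
            self.lines.append(
                f"  pred={pred} target={target} score={score} context={extras}"
            )


channel = ListChannelSketch()
channel.report(
    "Evaluation", 1, {"accuracy": 0.5}, 0.5, 0.25,
    ["a", "b"], ["a", "c"], [0.9, 0.4], {"text": ["hi", "yo"]},
)
```

A real channel would subclass pytext.metric_reporters.Channel and write to a file or another output stream instead of a list.
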
class pytext.metric_reporters.MetricReporter(channels, pep_format=False)[source]

Bases: pytext.config.component.Component

MetricReporter is responsible for three things:

  1. Aggregate output from trainer, which includes model inputs, predictions, targets, scores, and loss.
  2. Calculate metrics using the aggregated output, and define how the metric is used to find the best model
  3. Optionally report the metrics and aggregated output to various channels
lower_is_better

Whether a lower metric indicates better performance. Set to True for e.g. perplexity, and False for e.g. accuracy. Default is False

Type:bool
channels

A list of Channel objects that receive the metrics and the aggregated trainer output, then format and report them in any customized way.

Type:List[Channel]

MetricReporter is tightly coupled with metric aggregation and computation, which makes it hard for subclasses to reuse the parent's functionality and attributes. The next step is to decouple metric aggregation and computation from metric reporting.

add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data; each value can be either a list that maps to each example, or a single value for the batch
add_channel(channel)[source]
classmethod aggregate_data(all_data, new_batch)[source]

Aggregate a batch of data by converting tensors to lists of native Python data.

aggregate_preds(batch_preds, batch_context=None)[source]
aggregate_scores(batch_scores)[source]
aggregate_targets(batch_targets, batch_context=None)[source]
batch_context(raw_batch, batch)[source]
calculate_loss()[source]

Calculate the average loss over all aggregated batches.

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

compare_metric(new_metric, old_metric)[source]

Check if new metric indicates better model performance

Returns:bool, true if model with new_metric performs better
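
The comparison depends on the lower_is_better attribute described above. A minimal sketch of that logic, assuming both metrics have already been reduced to single numbers and that a tie does not count as an improvement (the tie handling is an assumption, and compare_metric_sketch is a hypothetical name):

```python
def compare_metric_sketch(new_metric: float, old_metric: float,
                          lower_is_better: bool = False) -> bool:
    """Return True if new_metric indicates a better model than old_metric."""
    if new_metric == old_metric:
        return False  # assumption: ties are not treated as improvements
    # (new < old) is an improvement exactly when lower values are better.
    return (new_metric < old_metric) == lower_is_better


# Accuracy-style metric: higher is better.
better_accuracy = compare_metric_sketch(0.91, 0.88)
# Perplexity-style metric: lower is better.
better_perplexity = compare_metric_sketch(35.2, 41.7, lower_is_better=True)
```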
gen_extra_context()[source]

Generate any extra intermediate context data for metric calculation

get_meta()[source]

Get global metadata that is not specific to any batch. The data will be passed along to channels.

get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

lower_is_better = False
predictions_to_report()[source]

Generate human readable predictions

report_metric(model, stage, epoch, reset=True, print_to_channels=True, optimizer=None)[source]

Calculate metrics and average loss, and report all statistics to channels.

Parameters:
  • model (nn.Module) – the PyTorch neural network model.
  • stage (Stage) – training, evaluation or test
  • epoch (int) – current epoch
  • reset (bool) – whether all data should be reset after reporting, default is True
  • print_to_channels (bool) – whether to report data to channels, default is True
report_realtime_metric(stage)[source]
targets_to_report()[source]

Generate human readable targets

class pytext.metric_reporters.ClassificationMetricReporter(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
classmethod from_config_and_label_names(config, label_names: List[str])[source]
get_meta()[source]

Get global metadata that is not specific to any batch. The data will be passed along to channels.

get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

class pytext.metric_reporters.MultiLabelClassificationMetricReporter(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]

Bases: pytext.metric_reporters.classification_metric_reporter.ClassificationMetricReporter

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

class pytext.metric_reporters.RegressionMetricReporter(channels, pep_format=False)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizers=None)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

lower_is_better = False
class pytext.metric_reporters.IntentSlotMetricReporter(doc_label_names: List[str], word_label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel], slot_column_name: str = 'slots', text_column_name: str = 'text', token_tensorizer_name: str = 'tokens')[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

aggregate_preds(batch_preds, batch_context)[source]
aggregate_scores(batch_scores)[source]
aggregate_targets(batch_targets, batch_context)[source]
batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizers: Optional[Dict[KT, VT]] = None)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

get_raw_slot_str(raw_data_row)[source]
predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

class pytext.metric_reporters.LanguageModelMetricReporter(channels, metadata, tensorizers, aggregate_metrics, perplexity_type, pep_format)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

LABELS_COLUMN = 'labels'
RAW_TEXT_COLUMN = 'text'
TOKENS_COLUMN = 'tokens'
UTTERANCE_COLUMN = 'utterance'
add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data; each value can be either a list that maps to each example, or a single value for the batch
aggregate_context(context)[source]
aggregate_scores(scores)[source]
batch_context(raw_batch, batch)[source]
calculate_loss() → float[source]

Calculate the average loss over all aggregated batches.

calculate_metric() → pytext.metrics.language_model_metrics.LanguageModelMetric[source]

Calculate metrics. Each subclass should implement this method.

compute_scores(logits, targets)[source]
classmethod from_config(config: pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter.Config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
get_model_select_metric(metrics) → float[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

lower_is_better = True
class pytext.metric_reporters.SquadMetricReporter(channels: List[pytext.metric_reporters.channel.Channel], n_best_size: int, max_answer_length: int, ignore_impossible: bool, has_answer_labels: List[str], tensorizer=None, false_label='False')[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

ANSWERS_COLUMN = 'answers'
DOC_COLUMN = 'doc'
QUES_COLUMN = 'question'
ROW_INDEX = 'id'
add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **contexts)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data; each value can be either a list that maps to each example, or a single value for the batch
aggregate_preds(new_batch, context=None)[source]
aggregate_scores(new_batch)[source]
aggregate_targets(new_batch, context=None)[source]
batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, *args, tensorizers=None, **kwargs)[source]
get_model_select_metric(metric: pytext.metrics.squad_metrics.SquadMetrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

class pytext.metric_reporters.WordTaggingMetricReporter(label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel])[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

calculate_loss()[source]

Calculate the average loss over all aggregated batches.

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata)[source]
get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

process_pred(pred: List[int]) → List[str][source]

pred is a list of token label indices.

class pytext.metric_reporters.CompositionalMetricReporter(actions_vocab, channels: List[pytext.metric_reporters.channel.Channel], text_column_name: str = 'tokenized_text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

create_frame_prediction_pairs()[source]
classmethod from_config(config, metadata: pytext.data.data_handler.CommonMetadata = None, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer] = None)[source]
gen_extra_context()[source]

Generate any extra intermediate context data for metric calculation

get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

static node_to_metrics_node(node: Union[pytext.data.data_structures.annotation.Intent, pytext.data.data_structures.annotation.Slot], start: int = 0) → pytext.metrics.intent_slot_metrics.Node[source]

The input start is the absolute start position in the utterance.

predictions_to_report()[source]

Generate human readable predictions

targets_to_report()[source]

Generate human readable targets

static tree_from_tokens_and_indx_actions(token_str_list: List[str], actions_vocab: List[str], actions_indices: List[int], validate_tree: bool = True)[source]
static tree_to_metric_node(tree: pytext.data.data_structures.annotation.Tree) → pytext.metrics.intent_slot_metrics.Node[source]

Creates a Node from a tree, assuming the utterance is a concatenation of the tokens separated by whitespace. The function does not necessarily reproduce the original utterance, as extra whitespace can be introduced.

class pytext.metric_reporters.PairwiseRankingMetricReporter(channels, pep_format=False)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

add_batch_stats(n_batches, preds, targets, scores, loss, m_input, **context)[source]

Aggregates a batch of output data (predictions, scores, targets/true labels and loss).

Parameters:
  • n_batches (int) – number of current batch
  • preds (torch.Tensor) – predictions of current batch
  • targets (torch.Tensor) – targets of current batch
  • scores (torch.Tensor) – scores of current batch
  • loss (double) – average loss of current batch
  • m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
  • context (Dict[str, Any]) – any additional context data; each value can be either a list that maps to each example, or a single value for the batch
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]
static get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

class pytext.metric_reporters.SequenceTaggingMetricReporter(label_names, pad_idx, channels)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizer)[source]
static get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

class pytext.metric_reporters.PureLossMetricReporter(channels, pep_format=False)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

calculate_metric()[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, *args, **kwargs)[source]
lower_is_better = True
class pytext.metric_reporters.NERMetricReporter(label_names: List[str], pad_idx: int, channels: List[pytext.metric_reporters.channel.Channel], use_bio_labels: bool = True)[source]

Bases: pytext.metric_reporters.metric_reporter.MetricReporter

batch_context(raw_batch, batch)[source]
calculate_metric() → pytext.metrics.PRF1Metrics[source]

Calculate metrics. Each subclass should implement this method.

classmethod from_config(config, tensorizer)[source]
static get_model_select_metric(metrics)[source]

Return a single numeric metric value used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.

pytext.metrics package

Submodules
pytext.metrics.intent_slot_metrics module
class pytext.metrics.intent_slot_metrics.AllMetrics[source]

Bases: tuple

Aggregated class for intent-slot related metrics.

top_intent_accuracy

Accuracy of the top-level intent.

frame_accuracy

Frame accuracy.

frame_accuracies_by_depth

Frame accuracies bucketized by depth of the gold tree.

bracket_metrics

Bracket metrics for intents and slots. For details, see the function compute_intent_slot_metrics().

tree_metrics

Tree metrics for intents and slots. For details, see the function compute_intent_slot_metrics().

loss

Cross entropy loss.

bracket_metrics

Alias for field number 4

frame_accuracies_by_depth

Alias for field number 3

frame_accuracy

Alias for field number 1

frame_accuracy_top_k

Alias for field number 2

loss

Alias for field number 6

print_metrics() → None[source]
top_intent_accuracy

Alias for field number 0

tree_metrics

Alias for field number 5

pytext.metrics.intent_slot_metrics.FrameAccuraciesByDepth = typing.Dict[int, pytext.metrics.intent_slot_metrics.FrameAccuracy]

Frame accuracies bucketized by depth of the gold tree.

class pytext.metrics.intent_slot_metrics.FrameAccuracy[source]

Bases: tuple

Frame accuracy for a collection of intent frame predictions.

Frame accuracy means the entire tree structure of the predicted frame matches that of the gold frame.

frame_accuracy

Alias for field number 1

num_samples

Alias for field number 0

class pytext.metrics.intent_slot_metrics.FramePredictionPair[source]

Bases: tuple

Pair of predicted and gold intent frames.

expected_frame

Alias for field number 1

predicted_frame

Alias for field number 0

class pytext.metrics.intent_slot_metrics.IntentSlotConfusions[source]

Bases: tuple

Aggregated class for intent and slot confusions.

intent_confusions

Confusion counts for intents.

slot_confusions

Confusion counts for slots.

intent_confusions

Alias for field number 0

slot_confusions

Alias for field number 1

class pytext.metrics.intent_slot_metrics.IntentSlotMetrics[source]

Bases: tuple

Precision/recall/F1 metrics for intents and slots.

intent_metrics

Precision/recall/F1 metrics for intents.

slot_metrics

Precision/recall/F1 metrics for slots.

overall_metrics

Combined precision/recall/F1 metrics for all nodes (merging intents and slots).

intent_metrics

Alias for field number 0

overall_metrics

Alias for field number 2

print_metrics() → None[source]
slot_metrics

Alias for field number 1

class pytext.metrics.intent_slot_metrics.IntentsAndSlots[source]

Bases: tuple

Collection of intents and slots in an intent frame.

intents

Alias for field number 0

slots

Alias for field number 1

class pytext.metrics.intent_slot_metrics.Node(label: str, span: pytext.data.data_structures.node.Span, children: Optional[AbstractSet[Node]] = None, text: str = None)[source]

Bases: pytext.data.data_structures.node.Node

Subclass of the base Node class, used for metric purposes. It is immutable so that hashing can be done on the class.

label

Label of the node.

Type:str
span

Span of the node.

Type:Span
children

frozenset of the node’s children, left empty when computing bracketing metrics.

Type:frozenset of Node
text

Text the node covers (=utterance[span.start:span.end])

Type:str
class pytext.metrics.intent_slot_metrics.NodesPredictionPair[source]

Bases: tuple

Pair of predicted and expected sets of nodes.

expected_nodes

Alias for field number 1

predicted_nodes

Alias for field number 0

pytext.metrics.intent_slot_metrics.compare_frames(predicted_frame: pytext.metrics.intent_slot_metrics.Node, expected_frame: pytext.metrics.intent_slot_metrics.Node, tree_based: bool, intent_per_label_confusions: Optional[pytext.metrics.PerLabelConfusions] = None, slot_per_label_confusions: Optional[pytext.metrics.PerLabelConfusions] = None) → pytext.metrics.intent_slot_metrics.IntentSlotConfusions[source]

Compares two intent frames and returns TP, FP, FN counts for intents and slots. Optionally collects the per label TP, FP, FN counts.

Parameters:
  • predicted_frame – Predicted intent frame.
  • expected_frame – Gold intent frame.
  • tree_based – Whether to get the tree-based confusions (if True) or bracket-based confusions (if False). For details, see the function compute_intent_slot_metrics().
  • intent_per_label_confusions – If provided, update the per label confusions for intents as well. Defaults to None.
  • slot_per_label_confusions – If provided, update the per label confusions for slots as well. Defaults to None.
Returns:

IntentSlotConfusions, containing confusion counts for intents and slots.

pytext.metrics.intent_slot_metrics.compute_all_metrics(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair], top_intent_accuracy: bool = True, frame_accuracy: bool = True, frame_accuracies_by_depth: bool = True, bracket_metrics: bool = True, tree_metrics: bool = True, overall_metrics: bool = False, all_predicted_frames: List[List[pytext.metrics.intent_slot_metrics.Node]] = None, calculated_loss: float = None) → pytext.metrics.intent_slot_metrics.AllMetrics[source]

Given a list of predicted and gold intent frames, computes intent-slot related metrics.

Parameters:
  • frame_pairs – List of predicted and gold intent frames.
  • top_intent_accuracy – Whether to compute top intent accuracy or not. Defaults to True.
  • frame_accuracy – Whether to compute frame accuracy or not. Defaults to True.
  • frame_accuracies_by_depth – Whether to compute frame accuracies by depth or not. Defaults to True.
  • bracket_metrics – Whether to compute bracket metrics or not. Defaults to True.
  • tree_metrics – Whether to compute tree metrics or not. Defaults to True.
  • overall_metrics – If bracket_metrics or tree_metrics is true, decides whether to compute overall (merging intents and slots) metrics for them. Defaults to False.
Returns:

AllMetrics which contains intent-slot related metrics.

pytext.metrics.intent_slot_metrics.compute_frame_accuracies_by_depth(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair]) → Dict[int, pytext.metrics.intent_slot_metrics.FrameAccuracy][source]

Given a list of predicted and gold intent frames, splits the predictions into buckets according to the depth of the gold trees, and computes frame accuracy for each bucket.

Parameters:frame_pairs – List of predicted and gold intent frames.
Returns:FrameAccuraciesByDepth, a map from depths to their corresponding frame accuracies.
pytext.metrics.intent_slot_metrics.compute_frame_accuracy(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair]) → float[source]

Computes frame accuracy given a list of predicted and gold intent frames.

Parameters:frame_pairs – List of predicted and gold intent frames.
Returns:Frame accuracy. For a prediction, frame accuracy is achieved if the entire tree structure of the predicted frame matches that of the gold frame.
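
The definition above amounts to an exact-match rate over frame pairs. A minimal sketch, assuming frames can be compared with == and using nested tuples as stand-ins for real frame trees (FramePairSketch and frame_accuracy_sketch are hypothetical names, and returning 0.0 for empty input is an assumption):

```python
from typing import Hashable, NamedTuple, Sequence


class FramePairSketch(NamedTuple):
    # Same field order as the documented FramePredictionPair.
    predicted_frame: Hashable
    expected_frame: Hashable


def frame_accuracy_sketch(frame_pairs: Sequence[FramePairSketch]) -> float:
    """Fraction of pairs whose predicted frame equals the gold frame exactly."""
    if not frame_pairs:
        return 0.0  # assumption: empty input yields 0.0 instead of raising
    matches = sum(
        1 for pair in frame_pairs if pair.predicted_frame == pair.expected_frame
    )
    return matches / len(frame_pairs)


# Frames written as (label, (children...)) tuples, purely for illustration.
gold = ("IN:get_weather", (("SL:location", ()),))
wrong = ("IN:get_weather", (("SL:date", ()),))
accuracy = frame_accuracy_sketch(
    [FramePairSketch(gold, gold), FramePairSketch(wrong, gold)]
)
# accuracy == 0.5
```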
pytext.metrics.intent_slot_metrics.compute_frame_accuracy_top_k(frame_pairs: List[pytext.metrics.intent_slot_metrics.FramePredictionPair], all_frames: List[List[pytext.metrics.intent_slot_metrics.Node]]) → float[source]
pytext.metrics.intent_slot_metrics.compute_intent_slot_metrics(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair], tree_based: bool, overall_metrics: bool = True) → pytext.metrics.intent_slot_metrics.IntentSlotMetrics[source]

Given a list of predicted and gold intent frames, computes precision, recall and F1 metrics for intents and slots, either in tree-based or bracket-based manner.

The following assumptions are made about intent frames:

  1. The root node is an intent.
  2. Children of intents are always slots, and children of slots are always intents.

For tree-based metrics, a node (an intent or slot) in the predicted frame is considered a true positive only if the subtree rooted at this node has an exact copy in the gold frame, otherwise it is considered a false positive. A false negative is a node in the gold frame that does not have an exact subtree match in the predicted frame.

For bracket-based metrics, a node in the predicted frame is considered a true positive if there is a node in the gold frame having the same label and span (but not necessarily the same children). The definitions of false positives and false negatives are similar to the above.

Parameters:
  • frame_pairs – List of predicted and gold intent frames.
  • tree_based – Whether to compute tree-based metrics (if True) or bracket-based metrics (if False).
  • overall_metrics – Whether to compute overall (merging intents and slots) metrics or not. Defaults to True.
Returns:

IntentSlotMetrics, containing precision/recall/F1 metrics for intents and slots.
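
The difference between the tree-based and bracket-based definitions can be illustrated with a small stand-alone sketch. It uses a simplified immutable node (NodeSketch, a hypothetical stand-in for the metrics Node), pools intents and slots together, and compares nodes as sets, so duplicate nodes collapse. All of these are simplifications and not PyText's implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class NodeSketch:
    label: str
    span: Tuple[int, int]
    children: FrozenSet["NodeSketch"] = frozenset()


def all_nodes(root: NodeSketch) -> Set[NodeSketch]:
    """Collect the root and every descendant."""
    nodes = {root}
    for child in root.children:
        nodes |= all_nodes(child)
    return nodes


def confusions_sketch(predicted: NodeSketch, expected: NodeSketch,
                      tree_based: bool) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) counts over nodes of the two frames."""
    pred_nodes, gold_nodes = all_nodes(predicted), all_nodes(expected)
    if not tree_based:
        # Bracket-based: only label and span matter, so drop the children.
        pred_nodes = {NodeSketch(n.label, n.span) for n in pred_nodes}
        gold_nodes = {NodeSketch(n.label, n.span) for n in gold_nodes}
    tp = len(pred_nodes & gold_nodes)
    return tp, len(pred_nodes) - tp, len(gold_nodes) - tp


gold = NodeSketch("IN:get_weather", (0, 20),
                  frozenset({NodeSketch("SL:location", (15, 20))}))
pred = NodeSketch("IN:get_weather", (0, 20),
                  frozenset({NodeSketch("SL:location", (14, 20))}))

tree_counts = confusions_sketch(pred, gold, tree_based=True)      # (0, 2, 2)
bracket_counts = confusions_sketch(pred, gold, tree_based=False)  # (1, 1, 1)
```

In this example the slot span is off by one character. Under the tree-based definition the root intent also counts as wrong because its subtree differs, while under the bracket-based definition the root still matches on label and span.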

pytext.metrics.intent_slot_metrics.compute_metric_at_k(references: List[pytext.metrics.intent_slot_metrics.Node], hypothesis: List[List[pytext.metrics.intent_slot_metrics.Node]], metric_fn: Callable[[pytext.metrics.intent_slot_metrics.Node, pytext.metrics.intent_slot_metrics.Node], bool] = <function <lambda>>) → List[float][source]

Computes a boolean metric at each position in the ranked list of hypotheses, and returns an average for each position over all examples. By default, metric_fn checks whether the frames are equal.
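
One possible reading of this description is sketched below. It evaluates metric_fn independently at each rank (not cumulatively over the top k), and it counts a missing hypothesis at a given rank as a miss. Both choices are assumptions about the intended behaviour, and metric_at_k_sketch is a hypothetical name.

```python
from typing import Any, Callable, List


def metric_at_k_sketch(
    references: List[Any],
    hypothesis: List[List[Any]],
    metric_fn: Callable[[Any, Any], bool] = lambda ref, hyp: ref == hyp,
) -> List[float]:
    """Average a boolean metric per rank position over all examples."""
    num_examples = len(references)
    if num_examples == 0:
        return []
    num_positions = max((len(ranked) for ranked in hypothesis), default=0)
    totals = [0] * num_positions
    for ref, ranked in zip(references, hypothesis):
        for k, hyp in enumerate(ranked):
            totals[k] += int(metric_fn(ref, hyp))
    # Shorter ranked lists contribute zero at the positions they lack.
    return [total / num_examples for total in totals]


per_position = metric_at_k_sketch(
    references=["a", "b"],
    hypothesis=[["a", "x"], ["y", "b"]],
)
# per_position == [0.5, 0.5]
```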

pytext.metrics.intent_slot_metrics.compute_prf1_metrics(nodes_pairs: Sequence[pytext.metrics.intent_slot_metrics.NodesPredictionPair]) → Tuple[pytext.metrics.AllConfusions, pytext.metrics.PRF1Metrics][source]

Computes precision/recall/F1 metrics given a list of predicted and expected sets of nodes.

Parameters:nodes_pairs – List of predicted and expected node sets.
Returns:A tuple, of which the first member contains the confusion information, and the second member contains the computed precision/recall/F1 metrics.
pytext.metrics.intent_slot_metrics.compute_top_intent_accuracy(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair]) → float[source]

Computes accuracy of the top-level intent.

Parameters:frame_pairs – List of predicted and gold intent frames.
Returns:Prediction accuracy of the top-level intent.
pytext.metrics.language_model_metrics module
class pytext.metrics.language_model_metrics.LanguageModelMetric[source]

Bases: tuple

Class for language model metrics.

perplexity_per_word

Average perplexity per word of the dataset.

perplexity_per_word

Alias for field number 0

print_metrics()[source]
pytext.metrics.language_model_metrics.compute_language_model_metric(loss_per_word: float) → pytext.metrics.language_model_metrics.LanguageModelMetric[source]
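
The signature above carries no description. A common definition, assumed here and not taken from the PyText source, is that per-word perplexity is the exponential of the average per-word cross-entropy loss measured in nats. A sketch under that assumption (perplexity_per_word_sketch is a hypothetical name):

```python
import math


def perplexity_per_word_sketch(loss_per_word: float) -> float:
    """Perplexity as exp(average negative log-likelihood per word), loss in nats."""
    return math.exp(loss_per_word)


# A model that is uniform over 50 words has loss ln(50) and perplexity 50.
uniform_perplexity = perplexity_per_word_sketch(math.log(50))
```

Under this definition lower perplexity is better, which is consistent with LanguageModelMetricReporter setting lower_is_better = True.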
pytext.metrics.squad_metrics module
class pytext.metrics.squad_metrics.SquadMetrics(num_examples, exact_matches, f1_score)[source]

Bases: tuple

exact_matches

Alias for field number 1

f1_score

Alias for field number 2

num_examples

Alias for field number 0

print_metrics() → None[source]
Module contents
class pytext.metrics.AllConfusions[source]

Bases: object

Aggregated class for per label confusions.

per_label_confusions

Per label confusion information.

confusions

Overall TP, FP and FN counts across the labels in per_label_confusions.

compute_metrics() → pytext.metrics.PRF1Metrics[source]
confusions
per_label_confusions
class pytext.metrics.ClassificationMetrics[source]

Bases: tuple

Metric class for various classification metrics.

accuracy

Overall accuracy of predictions.

macro_prf1_metrics

Macro precision/recall/F1 scores.

per_label_soft_scores

Per label soft metrics.

mcc

Matthews correlation coefficient.

roc_auc

Area under the Receiver Operating Characteristic curve.

loss

Training loss (only used for selecting best model, no need to print).

accuracy

Alias for field number 0

loss

Alias for field number 5

macro_prf1_metrics

Alias for field number 1

mcc

Alias for field number 3

per_label_soft_scores

Alias for field number 2

print_metrics(report_pep=False) → None[source]
print_pep()[source]
roc_auc

Alias for field number 4

class pytext.metrics.Confusions(TP: int = 0, FP: int = 0, FN: int = 0)[source]

Bases: object

Confusion information for a collection of predictions.

TP

Number of true positives.

FP

Number of false positives.

FN

Number of false negatives.

FN
FP
TP
compute_metrics() → pytext.metrics.PRF1Scores[source]
class pytext.metrics.LabelListPrediction[source]

Bases: tuple

Label list predictions of an example.

label_scores

Confidence scores that each label receives.

predicted_label

List of indices of the predicted label.

expected_label

List of indices of the true label.

expected_label

Alias for field number 2

label_scores

Alias for field number 0

predicted_label

Alias for field number 1

class pytext.metrics.LabelPrediction[source]

Bases: tuple

Label predictions of an example.

label_scores

Confidence scores that each label receives.

predicted_label

Index of the predicted label. This is usually the label with the highest confidence score in label_scores.

expected_label

Index of the true label.

expected_label

Alias for field number 2

label_scores

Alias for field number 0

predicted_label

Alias for field number 1

class pytext.metrics.MacroPRF1Metrics[source]

Bases: tuple

Aggregated metric class for macro precision/recall/F1 scores.

per_label_scores

Mapping from label string to the corresponding precision/recall/F1 scores.

macro_scores

Macro precision/recall/F1 scores across the labels in per_label_scores.

macro_scores

Alias for field number 1

per_label_scores

Alias for field number 0

print_metrics(indentation='') → None[source]
class pytext.metrics.MacroPRF1Scores[source]

Bases: tuple

Macro precision/recall/F1 scores (averages across each label).

num_labels

Number of distinct labels.

precision

Equally weighted average of precisions for each label.

recall

Equally weighted average of recalls for each label.

f1

Equally weighted average of F1 scores for each label.

f1

Alias for field number 3

num_labels

Alias for field number 0

precision

Alias for field number 1

recall

Alias for field number 2
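
As an illustration of the equally weighted averaging described for MacroPRF1Scores, the sketch below averages per-label (precision, recall, F1) triples. The helper name macro_average and the sample scores are hypothetical; this is not the PyText implementation.

```python
from typing import Dict, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, f1)


def macro_average(per_label_scores: Dict[str, Scores]) -> Tuple[int, float, float, float]:
    """Return (num_labels, precision, recall, f1), each score averaged with equal weight per label."""
    num_labels = len(per_label_scores)
    if num_labels == 0:
        return 0, 0.0, 0.0, 0.0
    precision = sum(s[0] for s in per_label_scores.values()) / num_labels
    recall = sum(s[1] for s in per_label_scores.values()) / num_labels
    # Macro F1 here is the mean of per-label F1 scores,
    # not the F1 of the macro precision and recall.
    f1 = sum(s[2] for s in per_label_scores.values()) / num_labels
    return num_labels, precision, recall, f1


macro = macro_average({"A": (0.8, 0.8, 0.8), "B": (0.5, 0.25, 1 / 3)})
```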

pytext.metrics.PRECISION_AT_RECALL_THRESHOLDS = [0.2, 0.4, 0.6, 0.8, 0.9]

Basic metric classes and functions for single-label prediction problems. Extending to multi-label support

class pytext.metrics.PRF1Metrics[source]

Bases: tuple

Metric class for all types of precision/recall/F1 scores.

per_label_scores

Map from label string to the corresponding precision/recall/F1 scores.

macro_scores

Macro precision/recall/F1 scores across the labels in per_label_scores.

micro_scores

Micro (regular) precision/recall/F1 scores for the same collection of predictions.

macro_scores

Alias for field number 1

micro_scores

Alias for field number 2

per_label_scores

Alias for field number 0

print_metrics() → None[source]
class pytext.metrics.PRF1Scores[source]

Bases: tuple

Precision/recall/F1 scores for a collection of predictions.

true_positives

Number of true positives.

false_positives

Number of false positives.

false_negatives

Number of false negatives.

precision

TP / (TP + FP).

recall

TP / (TP + FN).

f1

2 * TP / (2 * TP + FP + FN).

f1

Alias for field number 5

false_negatives

Alias for field number 2

false_positives

Alias for field number 1

precision

Alias for field number 3

recall

Alias for field number 4

true_positives

Alias for field number 0
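
The three formulas listed for PRF1Scores can be evaluated directly from the counts. In the sketch below, returning 0.0 when a denominator is zero is an assumption, and the helper names are hypothetical.

```python
from typing import Tuple


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def prf1_from_counts(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute (precision, recall, f1) from true/false positive and false negative counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)
    return precision, recall, f1


scores = prf1_from_counts(tp=9, fp=3, fn=5)
```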

class pytext.metrics.PairwiseRankingMetrics[source]

Bases: tuple

Metric class for pairwise ranking

num_examples

number of samples

Type:int
accuracy

how many times did we rank in the correct order

Type:float
average_score_difference

average score(higherRank) - score(lowerRank)

Type:float
accuracy

Alias for field number 1

average_score_difference

Alias for field number 2

num_examples

Alias for field number 0

print_metrics() → None[source]
class pytext.metrics.PerLabelConfusions[source]

Bases: object

Per label confusion information.

label_confusions_map

Map from label string to the corresponding confusion counts.

compute_metrics() → pytext.metrics.MacroPRF1Metrics[source]
label_confusions_map
update(label: str, item: str, count: int) → None[source]

Increase one of TP, FP or FN count for a label by certain amount.

Parameters:
  • label – Label to be modified.
  • item – Type of count to be modified, should be one of “TP”, “FP” or “FN”.
  • count – Amount to be added to the count.
Returns:

None
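
The bookkeeping that update describes can be sketched with a plain dictionary. The names label_counts and update_counts are hypothetical stand-ins, and rejecting unknown item values is an assumption rather than documented behavior.

```python
from collections import defaultdict
from typing import DefaultDict, Dict

# Maps each label to its TP / FP / FN counts.
label_counts: DefaultDict[str, Dict[str, int]] = defaultdict(
    lambda: {"TP": 0, "FP": 0, "FN": 0}
)


def update_counts(label: str, item: str, count: int) -> None:
    """Increase the TP, FP or FN count of a label by the given amount."""
    if item not in ("TP", "FP", "FN"):
        raise ValueError(f"item must be one of TP, FP, FN; got {item!r}")
    label_counts[label][item] += count


update_counts("greeting", "TP", 3)
update_counts("greeting", "FN", 1)
update_counts("farewell", "FP", 2)
```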

class pytext.metrics.RealtimeMetrics[source]

Bases: tuple

Realtime Metrics for tracking training progress and performance.

samples

number of samples

Type:int
tps

tokens per second

Type:float
ups

updates per second

Type:float
samples

Alias for field number 0

tps

Alias for field number 1

ups

Alias for field number 2

class pytext.metrics.RegressionMetrics[source]

Bases: tuple

Metrics for regression tasks.

num_examples

number of examples

Type:int
pearson_correlation

correlation between predictions and labels

Type:float
mse

mean-squared error between predictions and labels

Type:float
mse

Alias for field number 2

num_examples

Alias for field number 0

pearson_correlation

Alias for field number 1

print_metrics()[source]
class pytext.metrics.SoftClassificationMetrics[source]

Bases: tuple

Classification scores that are independent of thresholds.

average_precision

Alias for field number 0

decision_thresh_at_precision

Alias for field number 2

decision_thresh_at_recall

Alias for field number 4

precision_at_recall

Alias for field number 3

recall_at_precision

Alias for field number 1

roc_auc

Alias for field number 5

pytext.metrics.average_precision_score(y_true_sorted: numpy.ndarray, y_score_sorted: numpy.ndarray) → float[source]

Computes average precision, which summarizes the precision-recall curve as the precisions achieved at each threshold weighted by the increase in recall since the previous threshold.

Parameters:
  • y_true_sorted – Numpy array sorted according to decreasing confidence scores indicating whether each prediction is correct.
  • y_score_sorted – Numpy array of confidence scores for the predictions in decreasing order.
Returns:

Average precision score.

TODO: This is too slow, improve the performance
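
The definition above (precision at each threshold, weighted by the increase in recall) can be sketched in pure Python. This simplified version treats every prediction as its own threshold and therefore ignores tied scores, so it needs only the correctness flags; it is an illustration, not the PyText implementation.

```python
from typing import Sequence


def average_precision(y_true_sorted: Sequence[bool]) -> float:
    """Average precision for correctness flags sorted by decreasing confidence."""
    total_positives = sum(1 for flag in y_true_sorted if flag)
    if total_positives == 0:
        return 0.0  # Assumed convention when there are no positives.
    true_positives = 0
    score = 0.0
    for rank, is_correct in enumerate(y_true_sorted, start=1):
        if is_correct:
            true_positives += 1
            precision = true_positives / rank
            # Recall rises by 1 / total_positives at each correct prediction.
            score += precision / total_positives
    return score


ap = average_precision([True, False, True, False])
```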

pytext.metrics.compute_classification_metrics(predictions: Sequence[pytext.metrics.LabelPrediction], label_names: Sequence[str], loss: float, average_precisions: bool = True, recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → pytext.metrics.ClassificationMetrics[source]

A general function that computes classification metrics given a list of label predictions.

Parameters:
  • predictions – Label predictions, including the confidence score for each label.
  • label_names – Indexed label names.
  • average_precisions – Whether to compute average precisions for labels or not. Defaults to True.
  • recall_at_precision_thresholds – precision thresholds at which to calculate recall
  • precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns:

ClassificationMetrics which contains various classification metrics.

pytext.metrics.compute_matthews_correlation_coefficients(TP: int, FP: int, FN: int, TN: int) → float[source]

Computes Matthews correlation coefficient, a way to summarize all four counts (TP, FP, FN, TN) in the confusion matrix of binary classification.

Parameters:
  • TP – Number of true positives.
  • FP – Number of false positives.
  • FN – Number of false negatives.
  • TN – Number of true negatives.
Returns:

Matthews correlation coefficient, which is (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)).
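
For reference, the standard definition of the Matthews correlation coefficient is (TP * TN - FP * FN) divided by sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)). The sketch below implements that definition; returning 0.0 when the denominator is zero is an assumed convention, and the helper name mcc is hypothetical.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # Assumed convention for degenerate confusion matrices.
    return (tp * tn - fp * fn) / denominator


perfect = mcc(tp=5, fp=0, fn=0, tn=5)
inverted = mcc(tp=0, fp=5, fn=5, tn=0)
```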

pytext.metrics.compute_multi_label_classification_metrics(predictions: Sequence[pytext.metrics.LabelListPrediction], label_names: Sequence[str], loss: float, average_precisions: bool = True, recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → pytext.metrics.ClassificationMetrics[source]

A general function that computes classification metrics given a list of multi-label predictions.

Parameters:
  • predictions – multi-label predictions, including the confidence score for each label.
  • label_names – Indexed label names.
  • average_precisions – Whether to compute average precisions for labels or not. Defaults to True.
  • recall_at_precision_thresholds – precision thresholds at which to calculate recall
  • precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns:

ClassificationMetrics which contains various classification metrics.

pytext.metrics.compute_multi_label_soft_metrics(predictions: Sequence[pytext.metrics.LabelListPrediction], label_names: Sequence[str], recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → Dict[str, pytext.metrics.SoftClassificationMetrics][source]

Computes multi-label soft classification metrics

Parameters:
  • predictions – multi-label predictions, including the confidence score for each label.
  • label_names – Indexed label names.
  • recall_at_precision_thresholds – precision thresholds at which to calculate recall
  • precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns:

Dict from label strings to their corresponding soft metrics.

pytext.metrics.compute_pairwise_ranking_metrics(predictions: Sequence[int], scores: Sequence[float]) → pytext.metrics.PairwiseRankingMetrics[source]

Computes metrics for pairwise ranking given sequences of predictions and scores

Parameters:
  • predictions – 1 if ranking was correct, 0 if ranking was incorrect
  • scores – score(higher-ranked-sample) - score(lower-ranked-sample)
Returns:

PairwiseRankingMetrics object
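
Given the parameter descriptions above, the three fields of PairwiseRankingMetrics can be sketched as simple aggregates. Treating accuracy as the fraction of correctly ordered pairs is an assumption based on its float type; the helper name is hypothetical.

```python
from typing import Sequence, Tuple


def pairwise_ranking_summary(
    predictions: Sequence[int], scores: Sequence[float]
) -> Tuple[int, float, float]:
    """Return (num_examples, accuracy, average_score_difference)."""
    num_examples = len(predictions)
    if num_examples == 0:
        return 0, 0.0, 0.0
    accuracy = sum(predictions) / num_examples
    average_score_difference = sum(scores) / num_examples
    return num_examples, accuracy, average_score_difference


summary = pairwise_ranking_summary([1, 1, 0, 1], [0.5, 0.25, -0.25, 1.0])
```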

pytext.metrics.compute_prf1(tp: int, fp: int, fn: int) → Tuple[float, float, float][source]
pytext.metrics.compute_regression_metrics(predictions: Sequence[float], targets: Sequence[float]) → pytext.metrics.RegressionMetrics[source]

Computes metrics for regression tasks.

Parameters:
  • predictions – 1-D sequence of float predictions
  • targets – 1-D sequence of float labels
Returns:

RegressionMetrics object
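
The two scores in RegressionMetrics, Pearson correlation and mean-squared error, can be computed from first principles as sketched below. The helper name is hypothetical, and returning 0.0 for the correlation when either input is constant is an assumption.

```python
import math
from typing import Sequence, Tuple


def regression_summary(
    predictions: Sequence[float], targets: Sequence[float]
) -> Tuple[int, float, float]:
    """Return (num_examples, pearson_correlation, mse)."""
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    covariance = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    denominator = math.sqrt(var_p * var_t)
    # Assumption: report 0.0 when the correlation is undefined.
    pearson = covariance / denominator if denominator else 0.0
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    return n, pearson, mse


num_examples, pearson, mse = regression_summary([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```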

pytext.metrics.compute_roc_auc(predictions: Sequence[pytext.metrics.LabelPrediction], target_class: int = 0) → Optional[float][source]

Computes area under the Receiver Operating Characteristic curve, for binary classification. Implementation based off of (and explained at) https://www.ibm.com/developerworks/community/blogs/jfp/entry/Fast_Computation_of_AUC_ROC_score?lang=en.
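
For intuition, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition directly; it is quadratic in the number of examples and is not the faster algorithm the linked post describes. The function name and the plain score/label inputs are hypothetical simplifications of the LabelPrediction interface.

```python
from typing import Optional, Sequence


def roc_auc_pairwise(scores: Sequence[float], labels: Sequence[bool]) -> Optional[float]:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        return None  # AUC is undefined without both classes.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc_pairwise([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```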

pytext.metrics.compute_soft_metrics(predictions: Sequence[pytext.metrics.LabelPrediction], label_names: Sequence[str], recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → Dict[str, pytext.metrics.SoftClassificationMetrics][source]

Computes soft classification metrics given a list of label predictions.

Parameters:
  • predictions – Label predictions, including the confidence score for each label.
  • label_names – Indexed label names.
  • recall_at_precision_thresholds – precision thresholds at which to calculate recall
  • precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns:

Dict from label strings to their corresponding soft metrics.

pytext.metrics.precision_at_recall(y_true_sorted: numpy.ndarray, y_score_sorted: numpy.ndarray, thresholds: Sequence[float]) → Tuple[Dict[float, float], Dict[float, float]][source]

Computes precision at various recall levels

Parameters:
  • y_true_sorted – Numpy array sorted according to decreasing confidence scores indicating whether each prediction is correct.
  • y_score_sorted – Numpy array of confidence scores for the predictions in decreasing order.
  • thresholds – Sequence of floats indicating the requested recall thresholds
Returns:

Dictionary of maximum precision at requested recall thresholds. Dictionary of decision thresholds resulting in max precision at requested recall thresholds.
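
One way to picture this computation is to sweep a cut-off down the sorted predictions, track precision and recall at each cut, and keep the best precision whose recall meets each threshold. The sketch below does this with plain lists; it ignores tied scores and is an illustration under those assumptions, not the PyText implementation.

```python
from typing import Dict, Optional, Sequence, Tuple


def precision_at_recall_sketch(
    y_true_sorted: Sequence[bool],
    y_score_sorted: Sequence[float],
    thresholds: Sequence[float],
) -> Tuple[Dict[float, float], Dict[float, Optional[float]]]:
    """Max precision, and the score cut-off achieving it, per recall threshold."""
    total_positives = sum(1 for flag in y_true_sorted if flag)
    best_precision = {t: 0.0 for t in thresholds}
    decision_threshold: Dict[float, Optional[float]] = {t: None for t in thresholds}
    true_positives = 0
    for k, (is_correct, score) in enumerate(zip(y_true_sorted, y_score_sorted), start=1):
        if is_correct:
            true_positives += 1
        precision = true_positives / k
        recall = true_positives / total_positives if total_positives else 0.0
        for t in thresholds:
            if recall >= t and precision > best_precision[t]:
                best_precision[t] = precision
                decision_threshold[t] = score
    return best_precision, decision_threshold


precisions, cutoffs = precision_at_recall_sketch(
    [True, False, True, True], [0.9, 0.8, 0.6, 0.3], [0.5, 1.0]
)
```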

pytext.metrics.recall_at_precision(y_true_sorted: numpy.ndarray, y_score_sorted: numpy.ndarray, thresholds: Sequence[float]) → Dict[float, float][source]

Computes recall at various precision levels

Parameters:
  • y_true_sorted – Numpy array sorted according to decreasing confidence scores indicating whether each prediction is correct.
  • y_score_sorted – Numpy array of confidence scores for the predictions in decreasing order.
  • thresholds – Sequence of floats indicating the requested precision thresholds
Returns:

Dictionary of maximum recall at requested precision thresholds.

pytext.metrics.safe_division(n: Union[int, float], d: int) → float[source]
pytext.metrics.sort_by_score(y_true_list: Sequence[bool], y_score_list: Sequence[float])[source]

pytext.models package

Subpackages
pytext.models.decoders package
Submodules
pytext.models.decoders.decoder_base module
class pytext.models.decoders.decoder_base.DecoderBase(config: pytext.config.pytext_config.ConfigBase)[source]

Bases: pytext.models.module.Module

Base class for all decoder modules.

Parameters:config (ConfigBase) – Configuration object.
in_dim

Dimension of input Tensor passed to the decoder.

Type:int
out_dim

Dimension of output Tensor produced by the decoder.

Type:int
forward(*input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder()[source]

Returns the decoder module.

get_in_dim() → int[source]

Returns the dimension of the input Tensor that the decoder accepts.

get_out_dim() → int[source]

Returns the dimension of the output Tensor that the decoder emits.

pytext.models.decoders.intent_slot_model_decoder module
class pytext.models.decoders.intent_slot_model_decoder.IntentSlotModelDecoder(config: pytext.models.decoders.intent_slot_model_decoder.IntentSlotModelDecoder.Config, in_dim_doc: int, in_dim_word: int, out_dim_doc: int, out_dim_word: int)[source]

Bases: pytext.models.decoders.decoder_base.DecoderBase

IntentSlotModelDecoder implements the decoder layer for intent-slot models. Intent-slot models jointly predict intent and slots from an utterance. At the core these models learn to jointly perform document classification and word tagging tasks.

IntentSlotModelDecoder accepts arguments for decoding both document classification and word tagging tasks, namely, in_dim_doc and in_dim_word.

Parameters:
  • config (IntentSlotModelDecoder.Config) – Configuration object of type IntentSlotModelDecoder.Config.
  • in_dim_doc (int) – Dimension of input Tensor for projecting document representation.
  • in_dim_word (int) – Dimension of input Tensor for projecting word representation.
  • out_dim_doc (int) – Dimension of projected output Tensor for document classification.
  • out_dim_word (int) – Dimension of projected output Tensor for word tagging.
use_doc_probs_in_word

Whether to use intent probabilities for predicting slots.

Type:bool
doc_decoder

Document/intent decoder module.

Type:type
word_decoder

Word/slot decoder module.

Type:type
forward(x_d: torch.Tensor, x_w: torch.Tensor, dense: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder() → List[torch.nn.modules.module.Module][source]

Returns the document and word decoder modules.

pytext.models.decoders.mlp_decoder module
class pytext.models.decoders.mlp_decoder.MLPDecoder(config: pytext.models.decoders.mlp_decoder.MLPDecoder.Config, in_dim: int, out_dim: int = 0)[source]

Bases: pytext.models.decoders.decoder_base.DecoderBase

MLPDecoder implements a fully connected network and uses ReLU as the activation function. The module projects an input tensor to out_dim.

Parameters:
  • config (Config) – Configuration object of type MLPDecoder.Config.
  • in_dim (int) – Dimension of input Tensor passed to MLP.
  • out_dim (int) – Dimension of output Tensor produced by MLP. Defaults to 0.
mlp

Module that implements the MLP.

Type:type
out_dim

Dimension of the output of this module.

Type:type
hidden_dims

Dimensions of the outputs of hidden layers.

Type:List[int]
forward(*input) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder() → List[torch.nn.modules.module.Module][source]

Returns the MLP module that is used as a decoder.

pytext.models.decoders.mlp_decoder_query_response module
class pytext.models.decoders.mlp_decoder_query_response.MLPDecoderQueryResponse(config: pytext.models.decoders.mlp_decoder_query_response.MLPDecoderQueryResponse.Config, from_dim: int, to_dim: int)[source]

Bases: pytext.models.decoders.decoder_base.DecoderBase

Implements a ‘two-tower’ MLP: one for query and one for response. Used in search pairwise ranking: both pos_response and neg_response use the response-MLP.

forward(*x) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder() → List[torch.nn.modules.module.Module][source]

Returns the decoder module.

static get_mlp(from_dim: int, to_dim: int, hidden_dims: List[int])[source]
Module contents
class pytext.models.decoders.DecoderBase(config: pytext.config.pytext_config.ConfigBase)[source]

Bases: pytext.models.module.Module

Base class for all decoder modules.

Parameters:config (ConfigBase) – Configuration object.
in_dim

Dimension of input Tensor passed to the decoder.

Type:int
out_dim

Dimension of output Tensor produced by the decoder.

Type:int
forward(*input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder()[source]

Returns the decoder module.

get_in_dim() → int[source]

Returns the dimension of the input Tensor that the decoder accepts.

get_out_dim() → int[source]

Returns the dimension of the output Tensor that the decoder emits.

class pytext.models.decoders.MLPDecoder(config: pytext.models.decoders.mlp_decoder.MLPDecoder.Config, in_dim: int, out_dim: int = 0)[source]

Bases: pytext.models.decoders.decoder_base.DecoderBase

MLPDecoder implements a fully connected network and uses ReLU as the activation function. The module projects an input tensor to out_dim.

Parameters:
  • config (Config) – Configuration object of type MLPDecoder.Config.
  • in_dim (int) – Dimension of input Tensor passed to MLP.
  • out_dim (int) – Dimension of output Tensor produced by MLP. Defaults to 0.
mlp

Module that implements the MLP.

Type:type
out_dim

Dimension of the output of this module.

Type:type
hidden_dims

Dimensions of the outputs of hidden layers.

Type:List[int]
forward(*input) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder() → List[torch.nn.modules.module.Module][source]

Returns the MLP module that is used as a decoder.

class pytext.models.decoders.IntentSlotModelDecoder(config: pytext.models.decoders.intent_slot_model_decoder.IntentSlotModelDecoder.Config, in_dim_doc: int, in_dim_word: int, out_dim_doc: int, out_dim_word: int)[source]

Bases: pytext.models.decoders.decoder_base.DecoderBase

IntentSlotModelDecoder implements the decoder layer for intent-slot models. Intent-slot models jointly predict intent and slots from an utterance. At the core these models learn to jointly perform document classification and word tagging tasks.

IntentSlotModelDecoder accepts arguments for decoding both document classification and word tagging tasks, namely, in_dim_doc and in_dim_word.

Parameters:
  • config (IntentSlotModelDecoder.Config) – Configuration object of type IntentSlotModelDecoder.Config.
  • in_dim_doc (int) – Dimension of input Tensor for projecting document representation.
  • in_dim_word (int) – Dimension of input Tensor for projecting word representation.
  • out_dim_doc (int) – Dimension of projected output Tensor for document classification.
  • out_dim_word (int) – Dimension of projected output Tensor for word tagging.
use_doc_probs_in_word

Whether to use intent probabilities for predicting slots.

Type:bool
doc_decoder

Document/intent decoder module.

Type:type
word_decoder

Word/slot decoder module.

Type:type
forward(x_d: torch.Tensor, x_w: torch.Tensor, dense: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_decoder() → List[torch.nn.modules.module.Module][source]

Returns the document and word decoder modules.

pytext.models.embeddings package
Submodules
pytext.models.embeddings.char_embedding module
class pytext.models.embeddings.char_embedding.CharacterEmbedding(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], highway_layers: int, projection_dim: Optional[int], *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for character aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.

Implementation is loosely based on https://arxiv.org/abs/1508.06615.

Parameters:
  • num_embeddings (int) – Total number of characters (vocabulary size).
  • embed_dim (int) – Size of character embeddings to be passed to convolutions.
  • out_channels (int) – Number of output channels.
  • kernel_sizes (List[int]) – Kernel sizes of the character convolutions, one convolution layer per entry.
  • highway_layers (int) – Number of highway layers applied to pooled output.
  • projection_dim (int) – If specified, size of output embedding for token, via a linear projection from convolution output.
char_embed

Character embedding table.

Type:nn.Embedding
convs

Convolution layers that operate on character embeddings.

Type:nn.ModuleList
highway_layers

Highway layers on top of convolution output.

Type:nn.Module
projection

Final linear layer to token embedding.

Type:nn.Module
embedding_dim

Dimension of the final token embedding produced.

Type:int
forward(chars: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.

Parameters:
  • chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = out_channels * len(self.convs).

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.CharFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, vocab_size: Optional[int] = None)[source]

Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of CharacterEmbedding.

Return type:

type

class pytext.models.embeddings.char_embedding.Highway(input_dim: int, num_layers: int = 1)[source]

Bases: torch.nn.modules.module.Module

A Highway layer (https://arxiv.org/abs/1505.00387). Adapted from the AllenNLP implementation.

forward(x: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_parameters()[source]
pytext.models.embeddings.contextual_token_embedding module
class pytext.models.embeddings.contextual_token_embedding.ContextualTokenEmbedding(embedding_dim: int)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for providing token embeddings from a pretrained model.

forward(embedding: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.ContextualTokenEmbeddingConfig, *args, **kwargs)[source]
pytext.models.embeddings.dict_embedding module
class pytext.models.embeddings.dict_embedding.DictEmbedding(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, pad_index: int = 1, unk_index: int = 0, mobile: bool = False)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.sparse.Embedding

Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per token discrete features that the module learns embeddings for. Example: For the utterance Order coffee from Starbucks, the dictionary features could be

[
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

Thus, for a given token there can be more than one dictionary feature, each of which has a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation such that the module produces an embedding vector per token.
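
The weighting and pooling described here can be pictured with a toy example in plain Python. The sketch below scales each feature embedding by its confidence score and then pools across features; the embedding values, the function name embed_token, and the two pooling modes shown are assumptions made for illustration, not DictEmbedding's actual tensor code.

```python
from typing import Dict, List


def embed_token(
    features: Dict[str, float],
    table: Dict[str, List[float]],
    pooling: str = "mean",
) -> List[float]:
    """Weight each dictionary feature embedding by its score, then pool per dimension."""
    embed_dim = len(next(iter(table.values())))
    if not features:
        return [0.0] * embed_dim  # Token with no dictionary features.
    weighted = [
        [weight * value for value in table[name]] for name, weight in features.items()
    ]
    columns = list(zip(*weighted))  # One tuple of values per embedding dimension.
    if pooling == "max":
        return [max(column) for column in columns]
    return [sum(column) / len(column) for column in columns]


# Hypothetical 2-dimensional embedding table for the features in the example.
table = {
    "drink/beverage": [1.0, 0.0],
    "music/song": [0.0, 1.0],
    "store/coffee_shop": [0.5, 0.5],
}
coffee = embed_token({"drink/beverage": 0.8, "music/song": 0.2}, table)
starbucks = embed_token({"store/coffee_shop": 1.0}, table, pooling="max")
```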

Parameters:
  • num_embeddings (int) – Total number of dictionary features (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
pooling_type

Type of pooling for combining the dictionary feature embeddings.

Type:PoolingType
find_and_replace(tensor: torch.Tensor, find_val: int, replace_val: int) → torch.Tensor[source]

torch.where is not supported for mobile ONNX export. This hack provides a mobile-exportable equivalent of torch.where, at a higher computational cost.

forward(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.

Parameters:
  • feats (torch.Tensor) – Batch of sentences with dictionary feature ids. shape: [bsz, seq_len * max_feat_per_token]
  • weights (torch.Tensor) – Batch of sentences with dictionary feature weights for the dictionary features. shape: [bsz, seq_len * max_feat_per_token]
  • lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token. shape: [bsz, seq_len]
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = embed_dim passed to the constructor.

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.DictFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None)[source]

Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of DictEmbedding.

Return type:

type

pytext.models.embeddings.embedding_base module
class pytext.models.embeddings.embedding_base.EmbeddingBase(embedding_dim: int)[source]

Bases: pytext.models.module.Module

Base class for token level embedding modules.

Parameters:embedding_dim (int) – Size of embedding vector.
num_emb_modules

Number of ways to embed a token.

Type:int
embedding_dim

Size of embedding vector.

Type:int
get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize module parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer.

visualize(summary_writer: SummaryWriter)[source]

Overridden in sub classes to implement Tensorboard visualization of embedding space

pytext.models.embeddings.embedding_list module
class pytext.models.embeddings.embedding_list.EmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.container.ModuleList

There is more than one way to embed a token. This module holds a list of sub-embeddings and either concatenates their embedding tensors into a single Tensor or returns a tuple of Tensors that can be used by downstream modules.

Parameters:
  • embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
  • concat (bool) – Whether to concatenate the embedding vectors emitted from embeddings modules.
num_emb_modules

Number of flattened embeddings in embeddings, e.g. ((e1, e2), e3) has 3 in total.

Type:int
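The flattened count described above can be illustrated with a small pure-Python sketch. It is not PyText code: count_flattened is a hypothetical helper, and the strings e1, e2, e3 merely stand in for embedding modules.

```python
from typing import Any


def count_flattened(embeddings: Any) -> int:
    """Count leaf entries in an arbitrarily nested tuple/list of embeddings."""
    if isinstance(embeddings, (tuple, list)):
        return sum(count_flattened(child) for child in embeddings)
    # Anything that is not a container counts as one embedding module.
    return 1


# Strings stand in for embedding modules in this sketch.
nested = (("e1", "e2"), "e3")
total = count_flattened(nested)
```

Here total is 3, matching the ((e1, e2), e3) example.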
input_start_indices

List of indices of the sub-embeddings in the embedding list.

Type:List[int]
concat

Whether to concatenate the embedding vectors emitted from embeddings modules.

Type:bool
embedding_dim

Total embedding size. This is a single int or a tuple of ints, depending on the concat setting.

forward(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]

Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.

Parameters:*emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is either a Tensor or a tuple of Tensors.
Returns:If concat is True, a Tensor is returned by concatenating all embeddings. Otherwise all embeddings are returned in a tuple.
Return type:Union[torch.Tensor, Tuple[torch.Tensor]]
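The two return modes can be sketched for a single token with plain Python lists in place of Tensors. This is an illustration under stated assumptions, not the PyText implementation: combine, word_vec, and char_vec are hypothetical names, and the real module operates on batched Tensors.

```python
from typing import List, Sequence, Tuple, Union

Vector = List[float]


def combine(embedded: Sequence[Vector], concat: bool) -> Union[Vector, Tuple[Vector, ...]]:
    """Concatenate per-token vectors along the feature axis, or return them as a tuple."""
    if concat:
        merged: Vector = []
        for vec in embedded:
            merged.extend(vec)
        return merged
    return tuple(embedded)


word_vec = [0.1, 0.2, 0.3]  # hypothetical 3-dim word embedding for one token
char_vec = [0.5, 0.6]       # hypothetical 2-dim character embedding for the same token
joined = combine([word_vec, char_vec], concat=True)
separate = combine([word_vec, char_vec], concat=False)
```

With concat=True the result has 3 + 2 = 5 features; with concat=False the caller receives one vector per sub-embedding and must handle each size separately.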
get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize child embedding parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer. The param_groups from each child embedding are aligned at the first (lowest) param_group.

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

pytext.models.embeddings.word_embedding module
class pytext.models.embeddings.word_embedding.WordEmbedding(num_embeddings: int, embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, unk_token_idx: int = 0, mlp_layer_dims: List[int] = (), padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.

Note: Embedding weights for UNK token are always initialized to zeros.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
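As a rough illustration of the initialization options, the sketch below builds an embedding table from a uniform range and zeroes the UNK row, as the note above describes. It uses nested Python lists instead of torch.nn.Embedding; init_embedding_table and its seed argument are hypothetical, and the sketch does not cover pretrained weights or MLP layers.

```python
import random
from typing import List, Sequence


def init_embedding_table(
    num_embeddings: int,
    embedding_dim: int,
    init_range: Sequence[float],
    unk_token_idx: int = 0,
    seed: int = 0,
) -> List[List[float]]:
    """Draw every weight uniformly from init_range, then zero the UNK row."""
    low, high = init_range
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    table = [
        [rng.uniform(low, high) for _ in range(embedding_dim)]
        for _ in range(num_embeddings)
    ]
    table[unk_token_idx] = [0.0] * embedding_dim
    return table


table = init_embedding_table(num_embeddings=4, embedding_dim=3, init_range=(-1, 1))
```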
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

freeze()[source]
classmethod from_config(config: pytext.config.field_config.WordFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of WordEmbedding.

Return type:

WordEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

Module contents
class pytext.models.embeddings.EmbeddingBase(embedding_dim: int)[source]

Bases: pytext.models.module.Module

Base class for token level embedding modules.

Parameters:embedding_dim (int) – Size of embedding vector.
num_emb_modules

Number of ways to embed a token.

Type:int
embedding_dim

Size of embedding vector.

Type:int
get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize module parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer.

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.EmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.container.ModuleList

There is more than one way to embed a token. This module holds a list of sub-embeddings and either concatenates their embedding tensors into a single Tensor or returns a tuple of Tensors that can be used by downstream modules.

Parameters:
  • embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
  • concat (bool) – Whether to concatenate the embedding vectors emitted from embeddings modules.
num_emb_modules

Number of flattened embeddings in embeddings, e.g. ((e1, e2), e3) has 3 in total.

Type:int
input_start_indices

List of indices of the sub-embeddings in the embedding list.

Type:List[int]
concat

Whether to concatenate the embedding vectors emitted from embeddings modules.

Type:bool
embedding_dim

Total embedding size. This is a single int or a tuple of ints, depending on the concat setting.

forward(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]

Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.

Parameters:*emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is either a Tensor or a tuple of Tensors.
Returns:If concat is True, a Tensor is returned by concatenating all embeddings. Otherwise all embeddings are returned in a tuple.
Return type:Union[torch.Tensor, Tuple[torch.Tensor]]
get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize child embedding parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer. The param_groups from each child embedding are aligned at the first (lowest) param_group.

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.WordEmbedding(num_embeddings: int, embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, unk_token_idx: int = 0, mlp_layer_dims: List[int] = (), padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.

Note: Embedding weights for UNK token are always initialized to zeros.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

freeze()[source]
classmethod from_config(config: pytext.config.field_config.WordFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of WordEmbedding.

Return type:

WordEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.DictEmbedding(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, pad_index: int = 1, unk_index: int = 0, mobile: bool = False)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.sparse.Embedding

Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per token discrete features that the module learns embeddings for. Example: For the utterance Order coffee from Starbucks, the dictionary features could be

[
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

Thus, a given token can have more than one dictionary feature, each with a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation, such that the module produces one embedding vector per token.
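The weighting and pooling step can be sketched for the token coffee from the example above. This is a pure-Python illustration, not the PyText implementation: the 2-dimensional EMBEDDINGS values are made up, embed_token is a hypothetical helper, and the "mean" and "max" strings are assumed stand-ins for PoolingType values.

```python
from typing import Dict, List

# Hypothetical 2-dimensional embeddings for two dictionary features.
EMBEDDINGS: Dict[str, List[float]] = {
    "drink/beverage": [1.0, 0.0],
    "music/song": [0.0, 1.0],
}


def embed_token(features: Dict[str, float], pooling: str = "mean") -> List[float]:
    """Scale each feature embedding by its confidence, then pool across features."""
    weighted = [
        [weight * x for x in EMBEDDINGS[name]]
        for name, weight in features.items()
    ]
    columns = list(zip(*weighted))  # one tuple per embedding dimension
    if pooling == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling: {pooling}")


coffee = {"drink/beverage": 0.8, "music/song": 0.2}  # token 1 in the example above
mean_vec = embed_token(coffee, "mean")
max_vec = embed_token(coffee, "max")
```

Whichever pooling is chosen, the token ends up with a single vector of size embed_dim, regardless of how many dictionary features it carries.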

Parameters:
  • num_embeddings (int) – Total number of dictionary features (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
pooling_type

Type of pooling for combining the dictionary feature embeddings.

Type:PoolingType
find_and_replace(tensor: torch.Tensor, find_val: int, replace_val: int) → torch.Tensor[source]

torch.where is not supported for mobile ONNX export. This workaround provides a mobile-exportable equivalent of torch.where, at a higher computational cost.
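One common way to emulate a conditional replace without a where operator is to build a 0/1 mask from an equality test and combine it arithmetically. The sketch below shows that idea on a plain list of ints; it is an assumption about the general technique, not a copy of the PyText code, and the extra multiplications and additions show where the added cost comes from.

```python
from typing import List


def find_and_replace(values: List[int], find_val: int, replace_val: int) -> List[int]:
    """Replace every occurrence of find_val using only comparison and arithmetic."""
    out = []
    for v in values:
        mask = int(v == find_val)  # 1 where the value matches, else 0
        # Keep v where mask is 0, substitute replace_val where mask is 1.
        out.append(v * (1 - mask) + mask * replace_val)
    return out


replaced = find_and_replace([1, 0, 1, 3], find_val=1, replace_val=0)
```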

forward(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.

Parameters:
  • feats (torch.Tensor) – Batch of sentences with dictionary feature ids. shape: [bsz, seq_len * max_feat_per_token]
  • weights (torch.Tensor) – Batch of sentences with dictionary feature weights for the dictionary features. shape: [bsz, seq_len * max_feat_per_token]
  • lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token. shape: [bsz, seq_len]
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = embed_dim passed to the constructor.

Return type:

torch.Tensor
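To make the input shapes concrete, the following sketch splits one flat feats row of length seq_len * max_feat_per_token into per-token groups. The values are invented for illustration, group_by_token is a hypothetical helper, and id 1 is used as padding only because pad_index defaults to 1 in the constructor signature.

```python
from typing import List


def group_by_token(flat: List[int], seq_len: int) -> List[List[int]]:
    """Split one sentence's flat feature row into seq_len groups of equal size."""
    if seq_len <= 0 or len(flat) % seq_len != 0:
        raise ValueError("row length must be a positive multiple of seq_len")
    per_token = len(flat) // seq_len  # max_feat_per_token
    return [flat[i * per_token:(i + 1) * per_token] for i in range(seq_len)]


# One sentence with seq_len = 4 tokens and max_feat_per_token = 2.
feats_row = [1, 1, 7, 9, 1, 1, 4, 1]
lengths_row = [1, 2, 1, 1]  # illustrative feature counts, one per token
grouped = group_by_token(feats_row, seq_len=4)
```

The lengths row has one entry per token, so its length equals the number of groups.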

classmethod from_config(config: pytext.config.field_config.DictFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None)[source]

Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of DictEmbedding.

Return type:

DictEmbedding

class pytext.models.embeddings.CharacterEmbedding(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], highway_layers: int, projection_dim: Optional[int], *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for character aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.

Implementation is loosely based on https://arxiv.org/abs/1508.06615.

Parameters:
  • num_embeddings (int) – Total number of characters (vocabulary size).
  • embed_dim (int) – Size of character embeddings to be passed to convolutions.
  • out_channels (int) – Number of output channels.
  • kernel_sizes (List[int]) – Kernel sizes of the convolution layers applied to the character embeddings.
  • highway_layers (int) – Number of highway layers applied to pooled output.
  • projection_dim (int) – If specified, size of output embedding for token, via a linear projection from convolution output.
char_embed

Character embedding table.

Type:nn.Embedding
convs

Convolution layers that operate on character embeddings.

Type:nn.ModuleList
highway_layers

Highway layers on top of convolution output.

Type:nn.Module
projection

Final linear layer to token embedding.

Type:nn.Module
embedding_dim

Dimension of the final token embedding produced.

Type:int
forward(chars: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.

Parameters:
  • chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = out_channels * len(self.convs).

Return type:

torch.Tensor
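The convolution followed by max-pooling can be sketched for a single token in pure Python. This is a simplified illustration, not the PyText implementation: it omits bias terms, nonlinearities, highway layers, and the optional projection, it assumes the token has at least as many characters as the largest kernel, and all filter values are made up.

```python
from typing import List

Matrix = List[List[float]]  # rows are positions, columns are embedding dimensions


def conv_max_pool(char_embs: Matrix, kernel: Matrix) -> float:
    """Slide one filter over the characters and keep the largest response."""
    k = len(kernel)
    responses = []
    for start in range(len(char_embs) - k + 1):
        window = char_embs[start:start + k]
        responses.append(
            sum(w * x for krow, crow in zip(kernel, window) for w, x in zip(krow, crow))
        )
    return max(responses)


def token_embedding(char_embs: Matrix, filters_by_kernel_size: List[List[Matrix]]) -> List[float]:
    """Concatenate the pooled outputs of every filter of every kernel size."""
    out: List[float] = []
    for filters in filters_by_kernel_size:
        out.extend(conv_max_pool(char_embs, f) for f in filters)
    return out


# A 3-character token with 1-dimensional character embeddings.
chars = [[1.0], [2.0], [3.0]]
filters = [
    [[[1.0]], [[-1.0]]],                # kernel size 1, two output channels
    [[[1.0], [1.0]], [[1.0], [-1.0]]],  # kernel size 2, two output channels
]
embedding = token_embedding(chars, filters)
```

With two output channels and two kernel sizes, the resulting vector has 2 * 2 = 4 entries, in line with the out_channels * len(self.convs) size stated above.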

classmethod from_config(config: pytext.config.field_config.CharFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, vocab_size: Optional[int] = None)[source]

Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of CharacterEmbedding.

Return type:

CharacterEmbedding

class pytext.models.embeddings.ContextualTokenEmbedding(embedding_dim: int)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for providing token embeddings from a pretrained model.

forward(embedding: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.ContextualTokenEmbeddingConfig, *args, **kwargs)[source]
pytext.models.ensembles package
Submodules
pytext.models.ensembles.bagging_doc_ensemble module
class pytext.models.ensembles.bagging_doc_ensemble.BaggingDocEnsembleModel(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]

Bases: pytext.models.ensembles.ensemble.EnsembleModel

Ensemble class that uses bagging for ensembling document classification models.

forward(*args, **kwargs) → torch.Tensor[source]

Call forward() method of each document classification sub-model by passing all arguments and named arguments to the sub-models, collect the logits from them and average their values.

Returns:Logits from the ensemble.
Return type:torch.Tensor
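The averaging step can be sketched with plain lists standing in for each sub-model's logits. average_logits is a hypothetical helper and the numbers are invented; the real method runs each sub-model's forward() on the same inputs and averages Tensors.

```python
from typing import List, Sequence


def average_logits(per_model_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average logits element-wise across sub-models."""
    if not per_model_logits:
        raise ValueError("need at least one sub-model output")
    n = len(per_model_logits)
    # zip(*...) groups the i-th logit of every sub-model together.
    return [sum(col) / n for col in zip(*per_model_logits)]


# Logits over two classes from three hypothetical sub-models.
ensemble_logits = average_logits([[1.0, 3.0], [3.0, 5.0], [2.0, 4.0]])
```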
pytext.models.ensembles.bagging_intent_slot_ensemble module
class pytext.models.ensembles.bagging_intent_slot_ensemble.BaggingIntentSlotEnsembleModel(config: pytext.models.ensembles.bagging_intent_slot_ensemble.BaggingIntentSlotEnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]

Bases: pytext.models.ensembles.ensemble.EnsembleModel

Ensemble class that uses bagging for ensembling intent-slot models.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of BaggingIntentSlotEnsemble.
  • models (List[Model]) – List of intent-slot model objects.
use_crf

Whether to use CRF for word tagging task.

Type:bool
output_layer

Output layer of intent-slot model responsible for computing loss and predictions.

Type:IntentSlotOutputLayer
forward(*args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]

Call forward() method of each intent-slot sub-model by passing all arguments and named arguments to the sub-models, collect the logits from them and average their values.

Returns:Logits from the ensemble.
Return type:Tuple[torch.Tensor, torch.Tensor]
merge_sub_models() → None[source]

Merges all sub-models’ transition matrices when using CRF. Otherwise does nothing.

torchscriptify(tensorizers, traced_model)[source]
pytext.models.ensembles.ensemble module
class pytext.models.ensembles.ensemble.EnsembleModel(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]

Bases: pytext.models.model.Model

Base class for ensemble models.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of Ensemble.
  • models (List[Model]) – List of sub-model objects.
output_layer

Responsible for computing loss and predictions.

Type:OutputLayerBase
models

ModuleList container for sub-model objects.

Type:nn.ModuleList
arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], *args, **kwargs)[source]

Factory method to construct an instance of Ensemble or one of its derived classes from the module’s config object and tensorizers. It creates sub-models in the ensemble using the sub-model’s configuration.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of Ensemble.
  • tensorizers (Dict[str, Tensorizer]) – Tensorizers specifying all the parameters of the input features to the model.
Returns:

An instance of Ensemble.

Return type:

EnsembleModel

get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
merge_sub_models()[source]
save_modules(base_path: str = '', suffix: str = '') → None[source]

Saves the modules of all sub_models in the Ensemble.

Parameters:
  • base_path (str) – Path of base directory. Defaults to “”.
  • suffix (str) – Suffix to add to the file name to save. Defaults to “”.
torchscriptify(tensorizers, traced_model)[source]
vocab_to_export(tensorizers)[source]
Module contents
class pytext.models.ensembles.BaggingDocEnsembleModel(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]

Bases: pytext.models.ensembles.ensemble.EnsembleModel

Ensemble class that uses bagging for ensembling document classification models.

forward(*args, **kwargs) → torch.Tensor[source]

Call forward() method of each document classification sub-model by passing all arguments and named arguments to the sub-models, collect the logits from them and average their values.

Returns:Logits from the ensemble.
Return type:torch.Tensor
class pytext.models.ensembles.BaggingIntentSlotEnsembleModel(config: pytext.models.ensembles.bagging_intent_slot_ensemble.BaggingIntentSlotEnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]

Bases: pytext.models.ensembles.ensemble.EnsembleModel

Ensemble class that uses bagging for ensembling intent-slot models.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of BaggingIntentSlotEnsemble.
  • models (List[Model]) – List of intent-slot model objects.
use_crf

Whether to use CRF for word tagging task.

Type:bool
output_layer

Output layer of intent-slot model responsible for computing loss and predictions.

Type:IntentSlotOutputLayer
forward(*args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]

Call forward() method of each intent-slot sub-model by passing all arguments and named arguments to the sub-models, collect the logits from them and average their values.

Returns:Logits from the ensemble.
Return type:Tuple[torch.Tensor, torch.Tensor]
merge_sub_models() → None[source]

Merges all sub-models’ transition matrices when using CRF. Otherwise does nothing.

torchscriptify(tensorizers, traced_model)[source]
class pytext.models.ensembles.EnsembleModel(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]

Bases: pytext.models.model.Model

Base class for ensemble models.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of Ensemble.
  • models (List[Model]) – List of sub-model objects.
output_layer

Responsible for computing loss and predictions.

Type:OutputLayerBase
models

ModuleList container for sub-model objects.

Type:nn.ModuleList
arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], *args, **kwargs)[source]

Factory method to construct an instance of Ensemble or one of its derived classes from the module’s config object and tensorizers. It creates sub-models in the ensemble using the sub-model’s configuration.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of Ensemble.
  • tensorizers (Dict[str, Tensorizer]) – Tensorizers specifying all the parameters of the input features to the model.
Returns:

An instance of Ensemble.

Return type:

EnsembleModel

get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
merge_sub_models()[source]
save_modules(base_path: str = '', suffix: str = '') → None[source]

Saves the modules of all sub_models in the Ensemble.

Parameters:
  • base_path (str) – Path of base directory. Defaults to “”.
  • suffix (str) – Suffix to add to the file name to save. Defaults to “”.
torchscriptify(tensorizers, traced_model)[source]
vocab_to_export(tensorizers)[source]
pytext.models.language_models package
Submodules
pytext.models.language_models.lmlstm module
class pytext.models.language_models.lmlstm.LMLSTM(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase = <pytext.config.field_config.WordFeatConfig object>, representation: pytext.models.representations.representation_base.RepresentationBase = <pytext.models.representations.bilstm.BiLSTM.Config object>, decoder: pytext.models.decoders.decoder_base.DecoderBase = <pytext.models.decoders.mlp_decoder.MLPDecoder.Config object>, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase = <pytext.models.output_layers.lm_output_layer.LMOutputLayer.Config object>, stateful: bool = False, exporter: object = <class 'pytext.exporters.exporter.ModelExporter'>)[source]

Bases: pytext.models.model.BaseModel

LMLSTM implements a word-level language model that uses LSTMs to represent the document.

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
classmethod checkTokenConfig(tokens: Optional[pytext.data.tensorizers.TokenTensorizer.Config])[source]
cpu()[source]

Moves all model parameters and buffers to the CPU.

Returns:self
Return type:Module
forward(tokens: torch.Tensor, seq_len: torch.Tensor) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.language_models.lmlstm.LMLSTM.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
init_hidden(bsz: int) → Tuple[torch.Tensor, torch.Tensor][source]

Initialize the hidden states of the LSTM if the language model is stateful.

Parameters:bsz (int) – Batch size.
Returns:Initialized hidden state and cell state of the LSTM.
Return type:Tuple[torch.Tensor, torch.Tensor]
vocab_to_export(tensorizers)[source]
pytext.models.language_models.lmlstm.repackage_hidden(hidden: Union[torch.Tensor, Tuple[torch.Tensor, ...]]) → Union[torch.Tensor, Tuple[torch.Tensor, ...]][source]

Wraps hidden states in new Tensors, to detach them from their history.

Parameters:hidden (Union[torch.Tensor, Tuple[torch.Tensor, ...]]) – Tensor or a tuple of tensors to repackage.
Returns:Repackaged output.
Return type:Union[torch.Tensor, Tuple[torch.Tensor, ...]]
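The recursive structure of such a helper can be sketched without importing PyTorch. The function below is duck-typed: it calls .detach(), a method that torch.Tensor provides, on each leaf and recurses into tuples. repackage and FakeTensor are hypothetical names introduced so the sketch runs on its own; this is not the PyText source.

```python
from typing import Any


def repackage(hidden: Any) -> Any:
    """Detach a tensor-like object, or every element of a (nested) tuple of them."""
    if isinstance(hidden, tuple):
        return tuple(repackage(h) for h in hidden)
    # Duck-typed: torch.Tensor provides .detach(), and so does the stand-in below.
    return hidden.detach()


class FakeTensor:
    """Minimal stand-in so the sketch runs without PyTorch installed."""

    def __init__(self, value: float, history: tuple = ()):
        self.value = value
        self.history = history  # pretend autograd history

    def detach(self) -> "FakeTensor":
        # Same value, but no link back to earlier computation.
        return FakeTensor(self.value)


h = FakeTensor(0.5, history=("step1", "step2"))
c = FakeTensor(-0.3, history=("step1",))
new_h, new_c = repackage((h, c))
```

In a stateful language model, detaching the hidden and cell states between batches keeps the values while stopping gradients from flowing back through earlier batches.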
Module contents
pytext.models.output_layers package
Submodules
pytext.models.output_layers.distance_output_layer module
class pytext.models.output_layers.distance_output_layer.OutputScore[source]

Bases: enum.IntEnum

An enumeration.

norm_cosine = 2
raw_cosine = 1
sigmoid_cosine = 3
class pytext.models.output_layers.distance_output_layer.PairwiseCosineDistanceOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Union[pytext.loss.loss.BinaryCrossEntropyLoss, pytext.loss.loss.CosineEmbeddingLoss, pytext.loss.loss.MAELoss, pytext.loss.loss.MSELoss, pytext.loss.loss.NLLLoss] = None, score_threshold: bool = 0.9, score_type: pytext.models.output_layers.distance_output_layer.OutputScore = <OutputScore.norm_cosine: 2>)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

classmethod from_config(config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]
get_loss(logits: torch.Tensor, targets: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]

Compute and return the loss given logits and targets.

Parameters:
  • logits (torch.Tensor) – Logits returned by the model.
  • targets (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Model loss.

Return type:

torch.Tensor

get_pred(logits: torch.Tensor, targets: torch.Tensor, *args, **kwargs)[source]

Compute and return prediction and scores from the model.

Parameters:
  • logits (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

pytext.models.output_layers.distance_output_layer.get_norm_cosine_scores(cosine_sim_scores)[source]
pytext.models.output_layers.distance_output_layer.get_sigmoid_scores(cosine_sim_scores)[source]
pytext.models.output_layers.doc_classification_output_layer module
class pytext.models.output_layers.doc_classification_output_layer.BinaryClassificationOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

See OutputLayerBase.export_to_caffe2().

get_pred(logit, *args, **kwargs)[source]

See OutputLayerBase.get_pred().

torchscript_predictions()[source]
class pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for document classification models. It supports CrossEntropyLoss and BinaryCrossEntropyLoss per document.

Parameters:loss_fn (Union[CrossEntropyLoss, BinaryCrossEntropyLoss]) – The loss function to use for computing loss. Defaults to None.
loss_fn

The loss function to use for computing loss.

classmethod from_config(config: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer.Config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]
get_pred(logit, *args, **kwargs)[source]

Compute and return prediction and scores from the model.

Prediction is computed using argmax over the document label/target space.

Scores are sigmoid or softmax scores over the model logits depending on the loss component being used.

Parameters:logit (torch.Tensor) – Logits returned by DocModel.
Returns:Model prediction and scores.
Return type:Tuple[torch.Tensor, torch.Tensor]
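The prediction rule described above can be sketched for a single document with plain floats. This follows the description only (argmax for the prediction, softmax or sigmoid for the scores) and is not the PyText implementation; predict and its use_softmax flag are hypothetical, standing in for the choice of loss component.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one document's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(logits: List[float], use_softmax: bool = True) -> Tuple[int, List[float]]:
    """Return the argmax label index and per-label scores."""
    pred = max(range(len(logits)), key=logits.__getitem__)
    scores = softmax(logits) if use_softmax else [sigmoid(x) for x in logits]
    return pred, scores


pred, scores = predict([1.0, 3.0, 2.0])
binary_pred, binary_scores = predict([0.0, -1.0], use_softmax=False)
```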
class pytext.models.output_layers.doc_classification_output_layer.ClassificationScores(classes, score_function)[source]

Bases: torch.jit.ScriptModule

class pytext.models.output_layers.doc_classification_output_layer.MultiLabelOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

See OutputLayerBase.export_to_caffe2().

get_pred(logit, *args, **kwargs)[source]

See OutputLayerBase.get_pred().

torchscript_predictions()[source]
class pytext.models.output_layers.doc_classification_output_layer.MulticlassOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

See OutputLayerBase.export_to_caffe2().

get_pred(logit, *args, **kwargs)[source]

See OutputLayerBase.get_pred().

torchscript_predictions()[source]
pytext.models.output_layers.doc_regression_output_layer module
class pytext.models.output_layers.doc_regression_output_layer.RegressionOutputLayer(loss_fn: pytext.loss.loss.MSELoss, squash_to_unit_range: bool = False)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for doc regression models. Currently only supports Mean Squared Error loss.

Parameters:
  • loss_fn (MSELoss) – config for MSE loss
  • squash_to_unit_range (bool) – whether to squash the output to the range [0, 1] via a sigmoid.
classmethod from_config(config: pytext.models.output_layers.doc_regression_output_layer.RegressionOutputLayer.Config)[source]
get_loss(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]

Compute regression loss from logits and targets.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • target (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Model loss.

Return type:

torch.Tensor

get_pred(logit, *args, **kwargs)[source]

Compute predictions and scores from the model (unlike in classification, where prediction = “most likely class” and scores = “log probs”, here these are the same values). If squash_to_unit_range is True, fit prediction to [0, 1] via a sigmoid.

Parameters:logit (torch.Tensor) – Logits returned from the model.
Returns:Model prediction and scores.
Return type:Tuple[torch.Tensor, torch.Tensor]
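
As a rough illustration of the behavior described for get_pred, the sketch below returns the same values as both prediction and scores, and optionally passes them through a sigmoid. It is written in plain Python for self-containment; the function name regression_pred_sketch is hypothetical.

```python
import math
from typing import List, Tuple


def regression_pred_sketch(
    logits: List[float], squash_to_unit_range: bool = False
) -> Tuple[List[float], List[float]]:
    """Return (prediction, scores), which are the same values for regression."""
    if squash_to_unit_range:
        # Sigmoid maps every real value into the open interval (0, 1).
        values = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    else:
        values = list(logits)
    return values, values


preds, scores = regression_pred_sketch([0.0, 3.0], squash_to_unit_range=True)
```
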
pytext.models.output_layers.intent_slot_output_layer module
class pytext.models.output_layers.intent_slot_output_layer.IntentSlotOutputLayer(doc_output: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer, word_output: pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for joint intent classification and slot-filling models. Intent classification is a document classification problem and slot filling is a word tagging problem, so these terms are used interchangeably in the documentation.

Parameters:
  • doc_output (ClassificationOutputLayer) – Output layer for intent classification task. See ClassificationOutputLayer for details.
  • word_output (WordTaggingOutputLayer) – Output layer for slot filling task. See WordTaggingOutputLayer for details.
doc_output

Output layer for intent classification task.

Type:type
word_output

Output layer for slot filling task.

Type:type
export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: List[torch.Tensor], doc_out_name: str, word_out_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the intent slot output layer to Caffe2. See OutputLayerBase.export_to_caffe2() for details.

classmethod from_config(config: pytext.models.output_layers.intent_slot_output_layer.IntentSlotOutputLayer.Config, doc_labels: pytext.data.utils.Vocabulary, word_labels: pytext.data.utils.Vocabulary)[source]
get_loss(logits: Tuple[torch.Tensor, torch.Tensor], targets: Tuple[torch.Tensor, torch.Tensor], context: Dict[str, Any] = None, *args, **kwargs) → torch.Tensor[source]

Compute and return the averaged intent and slot-filling loss.

Parameters:
  • logits (Tuple[torch.Tensor, torch.Tensor]) – Logits returned by JointModel. It is a tuple containing logits for intent classification and slot filling.
  • targets (Tuple[torch.Tensor, torch.Tensor]) – Tuple of target Tensors containing true document label/target and true word labels/targets.
  • context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns:

Averaged intent and slot loss.

Return type:

torch.Tensor

get_pred(logits: Tuple[torch.Tensor, torch.Tensor], targets: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Compute and return prediction and scores from the model.

Parameters:
  • logits (Tuple[torch.Tensor, torch.Tensor]) – Logits returned by JointModel. It is a tuple containing logits for intent classification and slot filling.
  • targets (Optional[torch.Tensor]) – Not applicable. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

torchscript_predictions()[source]
class pytext.models.output_layers.intent_slot_output_layer.IntentSlotScores(doc_scores: torch.jit.ScriptModule, word_scores: torch.jit.ScriptModule)[source]

Bases: torch.nn.modules.module.Module

forward(logits: Tuple[torch.Tensor, torch.Tensor], context: Dict[str, torch.Tensor]) → Tuple[List[Dict[str, float]], List[List[Dict[str, float]]]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.output_layers.lm_output_layer module
class pytext.models.output_layers.lm_output_layer.LMOutputLayer(target_names: List[str], loss_fn: pytext.loss.loss.Loss = None, config=None, pad_token_idx=-100)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for language models. It supports CrossEntropyLoss per word.

Parameters:loss_fn (CrossEntropyLoss) – Cross-entropy loss component. Defaults to None.
loss_fn

Cross-entropy loss component for computing loss.

static calculate_perplexity(sequence_loss: torch.Tensor) → torch.Tensor[source]
classmethod from_config(config: pytext.models.output_layers.lm_output_layer.LMOutputLayer.Config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]
get_loss(logit: torch.Tensor, target: torch.Tensor, context: Dict[str, Any], reduce=True) → torch.Tensor[source]

Compute word prediction loss by comparing prediction of each word in the sentence with the true word.

Parameters:
  • logit (torch.Tensor) – Logit returned by LMLSTM.
  • targets (torch.Tensor) – Not applicable for language models.
  • context (Dict[str, Any]) – Not applicable. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Word prediction loss.

Return type:

torch.Tensor
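
The per-word comparison described above can be sketched as a cross-entropy over each non-padding position. This is a plain-Python illustration under stated assumptions, not PyText's implementation: it assumes that reduce=True means averaging over words, that positions whose target equals the pad index are skipped, and that perplexity is the exponential of the mean loss (the standard definition). All function names are hypothetical.

```python
import math
from typing import List, Union

PAD_IDX = -100  # mirrors the pad_token_idx default in the class signature


def word_losses(logits: List[List[float]], targets: List[int]) -> List[float]:
    """Cross-entropy of each non-padding word: -log softmax(logits)[target]."""
    losses = []
    for word_logits, target in zip(logits, targets):
        if target == PAD_IDX:
            continue  # assumption: padding positions contribute no loss
        peak = max(word_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in word_logits))
        losses.append(log_norm - word_logits[target])
    return losses


def lm_loss_sketch(
    logits: List[List[float]], targets: List[int], reduce: bool = True
) -> Union[float, List[float]]:
    """Mean loss over words when reduce is True, else the per-word losses."""
    losses = word_losses(logits, targets)
    if not reduce:
        return losses
    return sum(losses) / len(losses) if losses else 0.0


def perplexity_sketch(mean_loss: float) -> float:
    """Standard perplexity: exponential of the mean per-word loss."""
    return math.exp(mean_loss)


# Two real words with uniform logits over 4 classes, plus one padded position.
example_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
example_targets = [1, 3, PAD_IDX]
loss = lm_loss_sketch(example_logits, example_targets)
```

A model that is uniformly uncertain over 4 words has a loss of log 4 and therefore a perplexity of about 4, which is the usual reading of perplexity as an effective branching factor.
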

get_pred(logits: torch.Tensor, *args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]

Compute and return prediction and scores from the model. Prediction is computed using argmax over the word label/target space. Scores are softmax scores over the model logits.

Parameters:
  • logits (torch.Tensor) – Logits returned by LMLSTM.
  • targets (torch.Tensor) – True words.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

pytext.models.output_layers.output_layer_base module
class pytext.models.output_layers.output_layer_base.OutputLayerBase(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.module.Module

Base class for all output layers in PyText. The responsibilities of this layer are

  1. Implement how loss is computed from logits and targets.
  2. Implement how to get predictions from logits.
  3. Implement the Caffe2 operator for performing the above tasks. This is used when PyText exports a PyTorch model to Caffe2.
Parameters:loss_fn (type) – The loss function object to use for computing loss. Defaults to None.
loss_fn

The loss function object to use for computing loss.
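
The first two responsibilities listed above can be illustrated with a toy class. This is a hedged sketch in plain Python: ToyOutputLayer and zero_one_loss are hypothetical names, the class deliberately does not inherit from OutputLayerBase, it works on a single example rather than a batch of tensors, and Caffe2 export is left out.

```python
from typing import Callable, List, Optional, Tuple


def zero_one_loss(logit: List[float], target: int) -> float:
    """Hypothetical loss: 0.0 if the top-scoring label is the target, else 1.0."""
    top = max(range(len(logit)), key=lambda i: logit[i])
    return 0.0 if top == target else 1.0


class ToyOutputLayer:
    """Stand-in for an output layer; mirrors only the shape of the interface."""

    def __init__(
        self,
        target_names: Optional[List[str]] = None,
        loss_fn: Optional[Callable[[List[float], int], float]] = None,
    ) -> None:
        self.target_names = target_names
        self.loss_fn = loss_fn

    def get_loss(self, logit: List[float], target: int) -> float:
        # Responsibility 1: compute the loss from logits and targets.
        if self.loss_fn is None:
            raise ValueError("loss_fn must be set to compute a loss")
        return self.loss_fn(logit, target)

    def get_pred(self, logit: List[float]) -> Tuple[int, List[float]]:
        # Responsibility 2: derive a prediction (argmax) and scores from logits.
        pred = max(range(len(logit)), key=lambda i: logit[i])
        return pred, list(logit)


layer = ToyOutputLayer(target_names=["neg", "pos"], loss_fn=zero_one_loss)
pred, scores = layer.get_pred([0.1, 0.9])
```

Keeping the loss function as an injected object is what lets the same layer be reused with different losses.
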

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the output layer to Caffe2 by manually adding the necessary operators to the init_net and predict_net, and returns the list of external output blobs to be added to the model. By default this does nothing, so subclasses should override this method where necessary.

To learn about Caffe2 computation graphs and why we need two networks, init_net and predict_net/exec_net, read https://caffe2.ai/docs/intro-tutorial#null__nets-and-operators.

Parameters:
  • workspace (core.workspace) – Caffe2 workspace to use for adding the operator. See https://caffe2.ai/docs/workspace.html to learn about Caffe2 workspace.
  • init_net (core.Net) – Caffe2 init_net to add the operator to.
  • predict_net (core.Net) – Caffe2 predict_net to add the operator to.
  • model_out (torch.Tensor) – Output logit Tensor from the model.
  • output_name (str) – Name of model_out to use in Caffe2 net.
  • label_names (List[str]) – List of names of the targets/labels to expose from the Caffe2 net.
Returns:

List of output blobs that the output_layer generates.

Return type:

List[core.BlobReference]

get_loss(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]

Compute and return the loss given logits and targets.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • target (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Model loss.

Return type:

torch.Tensor

get_pred(logit: torch.Tensor, targets: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Compute and return prediction and scores from the model.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

pytext.models.output_layers.pairwise_ranking_output_layer module
class pytext.models.output_layers.pairwise_ranking_output_layer.PairwiseRankingOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

classmethod from_config(config)[source]
get_pred(logit, targets, context)[source]

Compute and return prediction and scores from the model.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

pytext.models.output_layers.squad_output_layer module
class pytext.models.output_layers.squad_output_layer.SquadOutputLayer(loss_fn: pytext.loss.loss.Loss, ignore_impossible: bool = True, pos_loss_weight: float = 0.5, has_answer_loss_weight: float = 0.5, has_answer_labels: Iterable[str] = ('False', 'True'), false_label: str = 'False', max_answer_len: int = 30, hard_weight: float = 0.0, is_kd: bool = False)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

classmethod from_config(config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[Iterable[str]] = None, is_kd: bool = False)[source]
get_loss(logits: Tuple[torch.Tensor, ...], targets: Tuple[torch.Tensor, ...], contexts: Optional[Dict[str, Any]] = None, *args, **kwargs) → torch.Tensor[source]

Compute and return the loss given logits and targets.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • target (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model loss.

Return type:

torch.Tensor

get_position_preds(start_pos_logits: torch.Tensor, end_pos_logits: torch.Tensor, max_span_length: int)[source]
get_pred(logits: torch.Tensor, targets: torch.Tensor, contexts: Dict[str, List[Any]]) → Tuple[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor]][source]

Compute and return prediction and scores from the model.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

pytext.models.output_layers.utils module
class pytext.models.output_layers.utils.OutputLayerUtils[source]

Bases: object

static gen_additional_blobs(predict_net: caffe2.python.core.Net, probability_out, model_out: torch.Tensor, output_name: str, label_names: List[str]) → List[caffe2.python.core.BlobReference][source]

Utility method to generate additional blobs for human readable result for models that use explicit labels.

pytext.models.output_layers.word_tagging_output_layer module
class pytext.models.output_layers.word_tagging_output_layer.CRFOutputLayer(num_tags, labels: pytext.data.utils.Vocabulary, *args)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for word tagging models that use Conditional Random Field.

Parameters:num_tags (int) – Total number of possible word tags.
num_tags

Total number of possible word tags.

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the CRF output layer to Caffe2. See OutputLayerBase.export_to_caffe2() for details.

classmethod from_config(config: pytext.config.component.ComponentMeta.__new__.<locals>.Config, labels: pytext.data.utils.Vocabulary)[source]
get_loss(logit: torch.Tensor, target: torch.Tensor, context: Dict[str, Any], reduce=True)[source]

Compute word tagging loss by using CRF.

Parameters:
  • logit (torch.Tensor) – Logit returned by WordTaggingModel.
  • target (torch.Tensor) – True word labels/targets.
  • context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Word tagging loss computed by the CRF.

Return type:

torch.Tensor

get_pred(logit: torch.Tensor, target: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None)[source]

Compute and return prediction and scores from the model.

Prediction is computed using CRF decoding.

Scores are softmax scores over the model logits, where the logits are rearranged so that the decoded word tag has the highest-valued logit. This is done because, with CRF, the highest-valued word tag for a given word may not be part of the overall decoded tag sequence. The logit values are rearranged so that argmax still works.

Parameters:
  • logit (torch.Tensor) – Logits returned by WordTaggingModel.
  • target (torch.Tensor) – Not applicable. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]
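
One way to rearrange logits as described above is to swap, for every word, its maximum logit with the logit at the decoded tag's position. The sketch below shows that swap in plain Python; it is an illustration under that assumption, not a statement of how PyText implements it. The decoded tags are taken as given (they would come from CRF decoding), and rearrange_logits_sketch is a hypothetical name.

```python
from typing import List


def rearrange_logits_sketch(
    logits: List[List[float]], decoded: List[int]
) -> List[List[float]]:
    """Swap each word's maximum logit into the position of its decoded tag."""
    rearranged = []
    for word_logits, tag in zip(logits, decoded):
        row = list(word_logits)  # copy so the input is left untouched
        top = max(range(len(row)), key=lambda i: row[i])
        row[tag], row[top] = row[top], row[tag]
        rearranged.append(row)
    return rearranged


# Word 0: raw argmax is tag 1, but the CRF decoded tag 2.
# Word 1: raw argmax already agrees with the decoded tag 0.
logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.4]]
decoded = [2, 0]
rearranged = rearrange_logits_sketch(logits, decoded)
```

After the swap, an argmax over each row reproduces the decoded tags, so softmax scores computed from the rearranged logits rank the decoded tag first.
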

torchscript_predictions()[source]
class pytext.models.output_layers.word_tagging_output_layer.CRFWordTaggingScores(classes: List[str], crf)[source]

Bases: pytext.models.output_layers.word_tagging_output_layer.WordTaggingScores

forward(logits: torch.Tensor, context: Dict[str, torch.Tensor]) → List[List[Dict[str, float]]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for word tagging models. It supports CrossEntropyLoss per word.

Parameters:loss_fn (CrossEntropyLoss) – Cross-entropy loss component. Defaults to None.
loss_fn

Cross-entropy loss component.

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the word tagging output layer to Caffe2.

classmethod from_config(config: pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer.Config, labels: pytext.data.utils.Vocabulary)[source]
get_loss(logit: torch.Tensor, target: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor]], context: Dict[str, Any], reduce: bool = True) → torch.Tensor[source]

Compute word tagging loss by comparing prediction of each word in the sentence with its true label/target.

Parameters:
  • logit (torch.Tensor) – Logit returned by WordTaggingModel.
  • target (torch.Tensor) – True word labels/targets.
  • context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Word tagging loss for all words in the sentence.

Return type:

torch.Tensor

get_pred(logit: torch.Tensor, *args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]

Compute and return prediction and scores from the model. Prediction is computed using argmax over the word label/target space. Scores are softmax scores over the model logits.

Parameters:logit (torch.Tensor) – Logits returned by WordTaggingModel.
Returns:Model prediction and scores.
Return type:Tuple[torch.Tensor, torch.Tensor]
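
For word tagging, the argmax and softmax are taken independently for every word in the sentence. A minimal plain-Python sketch for one sentence follows; tag_sentence_sketch is a hypothetical name and the nested lists stand in for a logits tensor of shape (words, tags).

```python
import math
from typing import List, Tuple


def tag_sentence_sketch(
    logits: List[List[float]]
) -> Tuple[List[int], List[List[float]]]:
    """Return (tag index per word, softmax scores per word) for one sentence."""
    preds: List[int] = []
    scores: List[List[float]] = []
    for word_logits in logits:
        peak = max(word_logits)
        exps = [math.exp(x - peak) for x in word_logits]  # stable softmax
        total = sum(exps)
        scores.append([e / total for e in exps])
        preds.append(max(range(len(word_logits)), key=lambda i: word_logits[i]))
    return preds, scores


preds, scores = tag_sentence_sketch([[0.0, 3.0], [2.0, -1.0]])
```
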
torchscript_predictions()[source]
class pytext.models.output_layers.word_tagging_output_layer.WordTaggingScores(classes)[source]

Bases: torch.nn.modules.module.Module

forward(logits: torch.Tensor, context: Optional[Dict[str, torch.Tensor]] = None) → List[List[Dict[str, float]]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Module contents
class pytext.models.output_layers.OutputLayerBase(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.module.Module

Base class for all output layers in PyText. The responsibilities of this layer are

  1. Implement how loss is computed from logits and targets.
  2. Implement how to get predictions from logits.
  3. Implement the Caffe2 operator for performing the above tasks. This is used when PyText exports a PyTorch model to Caffe2.
Parameters:loss_fn (type) – The loss function object to use for computing loss. Defaults to None.
loss_fn

The loss function object to use for computing loss.

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the output layer to Caffe2 by manually adding the necessary operators to the init_net and predict_net, and returns the list of external output blobs to be added to the model. By default this does nothing, so subclasses should override this method where necessary.

To learn about Caffe2 computation graphs and why we need two networks, init_net and predict_net/exec_net, read https://caffe2.ai/docs/intro-tutorial#null__nets-and-operators.

Parameters:
  • workspace (core.workspace) – Caffe2 workspace to use for adding the operator. See https://caffe2.ai/docs/workspace.html to learn about Caffe2 workspace.
  • init_net (core.Net) – Caffe2 init_net to add the operator to.
  • predict_net (core.Net) – Caffe2 predict_net to add the operator to.
  • model_out (torch.Tensor) – Output logit Tensor from the model.
  • output_name (str) – Name of model_out to use in Caffe2 net.
  • label_names (List[str]) – List of names of the targets/labels to expose from the Caffe2 net.
Returns:

List of output blobs that the output_layer generates.

Return type:

List[core.BlobReference]

get_loss(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]

Compute and return the loss given logits and targets.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • target (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Model loss.

Return type:

torch.Tensor

get_pred(logit: torch.Tensor, targets: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Compute and return prediction and scores from the model.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

class pytext.models.output_layers.CRFOutputLayer(num_tags, labels: pytext.data.utils.Vocabulary, *args)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for word tagging models that use Conditional Random Field.

Parameters:num_tags (int) – Total number of possible word tags.
num_tags

Total number of possible word tags.

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the CRF output layer to Caffe2. See OutputLayerBase.export_to_caffe2() for details.

classmethod from_config(config: pytext.config.component.ComponentMeta.__new__.<locals>.Config, labels: pytext.data.utils.Vocabulary)[source]
get_loss(logit: torch.Tensor, target: torch.Tensor, context: Dict[str, Any], reduce=True)[source]

Compute word tagging loss by using CRF.

Parameters:
  • logit (torch.Tensor) – Logit returned by WordTaggingModel.
  • target (torch.Tensor) – True word labels/targets.
  • context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Word tagging loss computed by the CRF.

Return type:

torch.Tensor

get_pred(logit: torch.Tensor, target: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None)[source]

Compute and return prediction and scores from the model.

Prediction is computed using CRF decoding.

Scores are softmax scores over the model logits, where the logits are rearranged so that the decoded word tag has the highest-valued logit. This is done because, with CRF, the highest-valued word tag for a given word may not be part of the overall decoded tag sequence. The logit values are rearranged so that argmax still works.

Parameters:
  • logit (torch.Tensor) – Logits returned by WordTaggingModel.
  • target (torch.Tensor) – Not applicable. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

torchscript_predictions()[source]
class pytext.models.output_layers.ClassificationOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for document classification models. It supports CrossEntropyLoss and BinaryCrossEntropyLoss per document.

Parameters:loss_fn (Union[CrossEntropyLoss, BinaryCrossEntropyLoss]) – The loss function to use for computing loss. Defaults to None.
loss_fn

The loss function to use for computing loss.

classmethod from_config(config: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer.Config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]
get_pred(logit, *args, **kwargs)[source]

Compute and return prediction and scores from the model.

Prediction is computed using argmax over the document label/target space.

Scores are sigmoid or softmax scores over the model logits depending on the loss component being used.

Parameters:logit (torch.Tensor) – Logits returned by DocModel.
Returns:Model prediction and scores.
Return type:Tuple[torch.Tensor, torch.Tensor]
class pytext.models.output_layers.RegressionOutputLayer(loss_fn: pytext.loss.loss.MSELoss, squash_to_unit_range: bool = False)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for doc regression models. Currently only supports Mean Squared Error loss.

Parameters:
  • loss_fn (MSELoss) – config for MSE loss
  • squash_to_unit_range (bool) – whether to squash the output to the range [0, 1] via a sigmoid.
classmethod from_config(config: pytext.models.output_layers.doc_regression_output_layer.RegressionOutputLayer.Config)[source]
get_loss(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]

Compute regression loss from logits and targets.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • target (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Model loss.

Return type:

torch.Tensor

get_pred(logit, *args, **kwargs)[source]

Compute predictions and scores from the model (unlike in classification, where prediction = “most likely class” and scores = “log probs”, here these are the same values). If squash_to_unit_range is True, fit prediction to [0, 1] via a sigmoid.

Parameters:logit (torch.Tensor) – Logits returned from the model.
Returns:Model prediction and scores.
Return type:Tuple[torch.Tensor, torch.Tensor]
class pytext.models.output_layers.WordTaggingOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

Output layer for word tagging models. It supports CrossEntropyLoss per word.

Parameters:loss_fn (CrossEntropyLoss) – Cross-entropy loss component. Defaults to None.
loss_fn

Cross-entropy loss component.

export_to_caffe2(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]

Exports the word tagging output layer to Caffe2.

classmethod from_config(config: pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer.Config, labels: pytext.data.utils.Vocabulary)[source]
get_loss(logit: torch.Tensor, target: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor]], context: Dict[str, Any], reduce: bool = True) → torch.Tensor[source]

Compute word tagging loss by comparing prediction of each word in the sentence with its true label/target.

Parameters:
  • logit (torch.Tensor) – Logit returned by WordTaggingModel.
  • target (torch.Tensor) – True word labels/targets.
  • context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Word tagging loss for all words in the sentence.

Return type:

torch.Tensor

get_pred(logit: torch.Tensor, *args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]

Compute and return prediction and scores from the model. Prediction is computed using argmax over the word label/target space. Scores are softmax scores over the model logits.

Parameters:logit (torch.Tensor) – Logits returned by WordTaggingModel.
Returns:Model prediction and scores.
Return type:Tuple[torch.Tensor, torch.Tensor]
torchscript_predictions()[source]
class pytext.models.output_layers.PairwiseRankingOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

classmethod from_config(config)[source]
get_pred(logit, targets, context)[source]

Compute and return prediction and scores from the model.

Parameters:
  • logit (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

class pytext.models.output_layers.PairwiseCosineDistanceOutputLayer(target_names: Optional[List[str]] = None, loss_fn: Union[pytext.loss.loss.BinaryCrossEntropyLoss, pytext.loss.loss.CosineEmbeddingLoss, pytext.loss.loss.MAELoss, pytext.loss.loss.MSELoss, pytext.loss.loss.NLLLoss] = None, score_threshold: bool = 0.9, score_type: pytext.models.output_layers.distance_output_layer.OutputScore = <OutputScore.norm_cosine: 2>)[source]

Bases: pytext.models.output_layers.output_layer_base.OutputLayerBase

classmethod from_config(config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]
get_loss(logits: torch.Tensor, targets: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]

Compute and return the loss given logits and targets.

Parameters:
  • logits (torch.Tensor) – Logits returned by the model.
  • targets (torch.Tensor) – True label/target to compute loss against.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
  • reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns:

Model loss.

Return type:

torch.Tensor

get_pred(logits: torch.Tensor, targets: torch.Tensor, *args, **kwargs)[source]

Compute and return prediction and scores from the model.

Parameters:
  • logits (torch.Tensor) – Logits returned by the model.
  • targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
  • context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns:

Model prediction and scores.

Return type:

Tuple[torch.Tensor, torch.Tensor]

class pytext.models.output_layers.OutputLayerUtils[source]

Bases: object

static gen_additional_blobs(predict_net: caffe2.python.core.Net, probability_out, model_out: torch.Tensor, output_name: str, label_names: List[str]) → List[caffe2.python.core.BlobReference][source]

Utility method to generate additional blobs for human readable result for models that use explicit labels.

pytext.models.qna package
Submodules
pytext.models.qna.bert_squad_qa module
class pytext.models.qna.bert_squad_qa.BertSquadQAModel(encoder: torch.nn.modules.module.Module, decoder: torch.nn.modules.module.Module, has_ans_decoder: torch.nn.modules.module.Module, output_layer: torch.nn.modules.module.Module, stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>, is_kd: bool = False)[source]

Bases: pytext.models.bert_classification_models.NewBertModel

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
forward(*inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.qna.bert_squad_qa.BertSquadQAModel.Config, tensorizers)[source]
pytext.models.qna.dr_qa module
class pytext.models.qna.dr_qa.DrQAModel(dropout: torch.nn.modules.module.Module, embedding: torch.nn.modules.module.Module, ques_rnn: torch.nn.modules.module.Module, doc_rnn: torch.nn.modules.module.Module, ques_self_attn: torch.nn.modules.module.Module, ques_aligned_doc_attn: torch.nn.modules.module.Module, start_attn: torch.nn.modules.module.Module, end_attn: torch.nn.modules.module.Module, doc_rep_pool: torch.nn.modules.module.Module, has_ans_decoder: torch.nn.modules.module.Module, output_layer: torch.nn.modules.module.Module, is_kd: bool = False)[source]

Bases: pytext.models.model.BaseModel

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
classmethod create_embedding(model_config: pytext.models.qna.dr_qa.DrQAModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
forward(doc_tokens: torch.Tensor, doc_seq_len: torch.Tensor, doc_mask: torch.Tensor, ques_tokens: torch.Tensor, ques_seq_len: torch.Tensor, ques_mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.qna.dr_qa.DrQAModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
Module contents
pytext.models.representations package
Subpackages
pytext.models.representations.transformer package
Submodules
pytext.models.representations.transformer.multihead_attention module
class pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of MultiheadAttention from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.

forward(query, key_padding_mask)[source]

Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape batch x source_length, where padding elements are indicated by 1s.
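
As a small illustration of the key_padding_mask convention (1 marks a padding element), the sketch below builds such a mask from token ids in plain Python. The helper name, the nested lists standing in for a ByteTensor, and the token ids are assumptions made for the example; the padding index of 1 mirrors the padding_idx default in the Transformer constructor signature.

```python
from typing import List


def make_key_padding_mask(tokens: List[List[int]], padding_idx: int = 1) -> List[List[int]]:
    """Hypothetical helper: tokens is [batch][source_length].

    Returns a batch x source_length mask where 1 marks a padding element.
    """
    return [[1 if tok == padding_idx else 0 for tok in row] for row in tokens]


# Invented token ids; the second row ends with one padding token.
batch = [[0, 713, 16, 2], [0, 9226, 2, 1]]
mask = make_key_padding_mask(batch)
```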

pytext.models.representations.transformer.positional_embedding module
class pytext.models.representations.transformer.positional_embedding.PositionalEmbedding(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on pad_index or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.

This is a TorchScriptable implementation of PositionalEmbedding from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

forward(input)[source]

Input is expected to be of size [batch_size x sequence_length].

max_positions()[source]

Maximum number of supported positions.

pytext.models.representations.transformer.positional_embedding.make_positions(tensor, pad_index: int)[source]

Replace non-padding symbols with their position numbers. Position numbers begin at pad_index+1. Padding symbols are ignored.
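
The numbering rule can be sketched in plain Python as follows. The function name and the nested-list input are hypothetical, and leaving padding entries at pad_index is an assumption of this sketch; the description above only says that padding symbols are ignored.

```python
from typing import List


def make_positions_sketch(tokens: List[List[int]], pad_index: int) -> List[List[int]]:
    """Hypothetical pure-Python analogue for [batch][seq_len] token ids.

    Non-padding tokens are numbered pad_index + 1, pad_index + 2, ...
    Padding tokens are skipped by the counter and, in this sketch, keep pad_index.
    """
    positions = []
    for row in tokens:
        count = 0
        out = []
        for tok in row:
            if tok == pad_index:
                out.append(pad_index)
            else:
                count += 1
                out.append(pad_index + count)
        positions.append(out)
    return positions


result = make_positions_sketch([[5, 6, 7, 1, 1], [1, 8, 9, 1, 4]], pad_index=1)
```

With pad_index=1, the first non-padding token of every row receives position 2, whether the padding sits on the left or on the right.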

pytext.models.representations.transformer.residual_mlp module
class pytext.models.representations.transformer.residual_mlp.GeLU[source]

Bases: torch.nn.modules.module.Module

Component class to wrap F.gelu.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.residual_mlp.ResidualMLP(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)[source]

Bases: torch.nn.modules.module.Module

A square MLP component which can learn a bias on an input vector. This MLP in particular defaults to using GeLU as its activation function (this can be changed by passing a different activation function), and retains a residual connection to its original input to help with gradient propagation.

Unlike pytext’s MLPDecoder it doesn’t currently allow adding a LayerNorm in between hidden layers.
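
To make the residual connection concrete, here is a minimal sketch of a one-hidden-layer residual block in plain Python. The function names, the single hidden layer, and the omission of dropout are simplifying assumptions; it is not the ResidualMLP implementation itself.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute weight @ x + bias for a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def residual_mlp(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Hypothetical block computing x + (W2 @ gelu(W1 @ x + b1) + b2).

    The output has the same size as the input ("square"), so the residual
    addition is well defined.
    """
    hidden = [gelu(h) for h in linear(w1, b1, x)]
    update = linear(w2, b2, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, -2.0]
zeros_3x2 = [[0.0] * 2 for _ in range(3)]  # input (2) -> hidden (3)
zeros_2x3 = [[0.0] * 3 for _ in range(2)]  # hidden (3) -> output (2)
# With all-zero weights the MLP contributes nothing and the input passes through.
identity_out = residual_mlp(x, zeros_3x2, [0.0] * 3, zeros_2x3, [0.0] * 2)
```

Because the block only adds an update to its input, an untrained or zeroed MLP leaves the input unchanged, which is what keeps gradients flowing through deep stacks.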

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.transformer.sentence_encoder module
class pytext.models.representations.transformer.sentence_encoder.SentenceEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of RoBERTa from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa model, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

This SentenceEncoder can load in the public RoBERTa weights directly with load_roberta_state_dict, which will translate the keys as they exist in the publicly released RoBERTa to the correct structure for this implementation. The default constructor value will have the same size and shape as that model.

To use RoBERTa with this, download the RoBERTa public weights as roberta.weights:

>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)

Within this you will still need to preprocess inputs using fairseq and the publicly released vocabs, and finally place this encoder in a model alongside say an MLP output layer to do classification.

extract_features(tokens)[source]
forward(tokens)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

load_roberta_state_dict(state_dict)[source]
pytext.models.representations.transformer.sentence_encoder.merge_input_projection(state)[source]

New checkpoints of fairseq multihead attention split in_projections into k, v, q projections. This function merges them back to make them compatible.

pytext.models.representations.transformer.sentence_encoder.remove_state_keys(state, keys_regex)[source]

Remove keys from state that match a regex

pytext.models.representations.transformer.sentence_encoder.rename_component_from_root(state, old_name, new_name)[source]

Rename keys from state using full python paths

pytext.models.representations.transformer.sentence_encoder.rename_state_keys(state, keys_regex, replacement)[source]

Rename keys from state that match a regex; replacement can use capture groups
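
The regex-based key helpers described above can be illustrated on an ordinary dictionary. The sketch below is an assumption about the general technique, not the PyText source: the helper names are hypothetical, the functions return new dictionaries rather than mutating state, and the key names are invented.

```python
import re
from typing import Any, Dict


def rename_keys_sketch(state: Dict[str, Any], keys_regex: str, replacement: str) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of state with matching keys rewritten.

    The replacement string may refer to capture groups of keys_regex.
    """
    pattern = re.compile(keys_regex)
    return {
        pattern.sub(replacement, key) if pattern.search(key) else key: value
        for key, value in state.items()
    }


def remove_keys_sketch(state: Dict[str, Any], keys_regex: str) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of state without keys matching the regex."""
    pattern = re.compile(keys_regex)
    return {key: value for key, value in state.items() if not pattern.search(key)}


# The key names below are invented for the example.
state = {"layers.0.fc1.weight": "w0", "layers.1.fc1.weight": "w1", "version": 2}
renamed = rename_keys_sketch(state, r"^layers\.(\d+)\.fc1\.(.*)$", r"blocks.\1.mlp.\2")
pruned = remove_keys_sketch(renamed, r"^version$")
```

The capture groups carry the layer index and the parameter name across the rename, so one pattern covers every layer.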

pytext.models.representations.transformer.sentence_encoder.translate_roberta_state_dict(state_dict)[source]

Translate the public RoBERTa weights to ones which match SentenceEncoder.

pytext.models.representations.transformer.transformer module
class pytext.models.representations.transformer.transformer.Transformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(tokens: torch.Tensor) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.transformer.TransformerLayer(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(input, key_padding_mask)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Module contents

This directory contains modules for implementing a productionized RoBERTa model. These modules implement the same Transformer components that are implemented in the fairseq library, however they’re distilled down to just the elements which are used in the final RoBERTa model, and within that are restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The SentenceEncoder specifically can be used to load model weights directly from the publicly released RoBERTa weights, and it will translate these weights to the corresponding values in this implementation.

class pytext.models.representations.transformer.MultiheadSelfAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of MultiheadAttention from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.

forward(query, key_padding_mask)[source]

Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape batch x source_length, where padding elements are indicated by 1s.

class pytext.models.representations.transformer.PositionalEmbedding(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on pad_index or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.

This is a TorchScriptable implementation of PositionalEmbedding from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

forward(input)[source]

Input is expected to be of size [batch_size x sequence_length].

max_positions()[source]

Maximum number of supported positions.

class pytext.models.representations.transformer.ResidualMLP(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)[source]

Bases: torch.nn.modules.module.Module

A square MLP component which can learn a bias on an input vector. This MLP in particular defaults to using GeLU as its activation function (this can be changed by passing a different activation function), and retains a residual connection to its original input to help with gradient propagation.

Unlike pytext’s MLPDecoder it doesn’t currently allow adding a LayerNorm in between hidden layers.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.SentenceEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of RoBERTa from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa model, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

This SentenceEncoder can load in the public RoBERTa weights directly with load_roberta_state_dict, which will translate the keys as they exist in the publicly released RoBERTa to the correct structure for this implementation. The default constructor value will have the same size and shape as that model.

To use RoBERTa with this, download the RoBERTa public weights as roberta.weights:

>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)

Within this you will still need to preprocess inputs using fairseq and the publicly released vocabs, and finally place this encoder in a model alongside say an MLP output layer to do classification.

extract_features(tokens)[source]
forward(tokens)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

load_roberta_state_dict(state_dict)[source]
class pytext.models.representations.transformer.Transformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(tokens: torch.Tensor) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.TransformerLayer(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(input, key_padding_mask)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Submodules
pytext.models.representations.attention module
class pytext.models.representations.attention.DotProductSelfAttention(input_dim)[source]

Bases: pytext.models.module.Module

Given vector w and token vectors {t_1, t_2, …, t_n}, compute self-attention weights to weigh the tokens: a_j = softmax(w . t_j)

forward(tokens, tokens_mask)[source]
Input:
x: batch_size * seq_len * input_dim
x_mask: batch_size * seq_len (1 for padding, 0 for true)
Output:
alpha: batch_size * seq_len
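
The formula a_j = softmax(w . t_j) together with the mask convention (1 for padding, 0 for true) can be sketched for a single example in plain Python. The function names are hypothetical, and the sketch assumes that at least one position is unpadded; it is not the module's implementation.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int]) -> List[float]:
    """Softmax over positions where mask == 0; padded positions (mask == 1) get weight 0."""
    valid = [s for s, m in zip(scores, mask) if m == 0]
    peak = max(valid)
    exps = [math.exp(s - peak) if m == 0 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_self_attention(w: List[float], tokens: List[List[float]], mask: List[int]) -> List[float]:
    """Hypothetical single-example version: a_j = softmax_j(w . t_j)."""
    scores = [sum(wi * ti for wi, ti in zip(w, t)) for t in tokens]
    return masked_softmax(scores, mask)


tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last token is padding
alpha = dot_product_self_attention(w=[1.0, 1.0], tokens=tokens, mask=[0, 0, 1])
```

The padded token has the largest raw score, yet it receives zero weight because the mask removes it before normalization.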
classmethod from_config(config: pytext.models.representations.attention.DotProductSelfAttention.Config)[source]
class pytext.models.representations.attention.MultiplicativeAttention(p_hidden_dim, q_hidden_dim, normalize)[source]

Bases: pytext.models.module.Module

Given sequence P and vector q, computes attention weights for each element in P by matching q with each element in P using multiplicative attention: a_i = softmax(p_i . W . q)

forward(p_seq: torch.Tensor, q: torch.Tensor, p_mask: torch.Tensor)[source]
Input:
p_seq: batch_size * p_seq_len * p_hidden_dim
q: batch_size * q_hidden_dim
p_mask: batch_size * p_seq_len (1 for padding, 0 for true)
Output:
attn_scores: batch_size * p_seq_len
classmethod from_config(config: pytext.models.representations.attention.MultiplicativeAttention.Config)[source]
class pytext.models.representations.attention.SequenceAlignedAttention(proj_dim)[source]

Bases: pytext.models.module.Module

Given sequences P and Q, computes attention weights for each element in P by matching Q with each element in P: a_i_j = softmax(p_i . q_j), where softmax is computed by summing over q_j

forward(p: torch.Tensor, q: torch.Tensor, q_mask: torch.Tensor)[source]
Input:
p: batch_size * p_seq_len * dim
q: batch_size * q_seq_len * dim
q_mask: batch_size * q_seq_len (1 for padding, 0 for true)
Output:
matched_seq: batch_size * p_seq_len * dim
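
A single-example sketch of the alignment a_i_j = softmax(p_i . q_j) follows, in plain Python. It is an illustration under stated assumptions: the function name is hypothetical, the learned projection implied by proj_dim is omitted, at least one element of q must be unpadded, and each output row is taken to be the attention-weighted sum of the q vectors.

```python
import math
from typing import List

Matrix = List[List[float]]


def sequence_aligned_attention(p: Matrix, q: Matrix, q_mask: List[int]) -> Matrix:
    """Hypothetical single-example version without the learned projection.

    For each p_i, computes a_i_j = softmax_j(p_i . q_j) over unpadded q_j
    (q_mask == 0) and returns sum_j a_i_j * q_j, one vector per element of p.
    """
    dim = len(q[0])
    matched = []
    for p_i in p:
        scores = [
            sum(a * b for a, b in zip(p_i, q_j)) if m == 0 else -math.inf
            for q_j, m in zip(q, q_mask)
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # exp(-inf) is 0.0 for padding
        total = sum(exps)
        weights = [e / total for e in exps]
        matched.append([sum(w * q_j[d] for w, q_j in zip(weights, q)) for d in range(dim)])
    return matched


p = [[1.0, 0.0], [0.0, 0.0]]
q = [[2.0, 0.0], [0.0, 2.0], [5.0, 5.0]]  # last row is padding
matched_seq = sequence_aligned_attention(p, q, q_mask=[0, 0, 1])
```

The output has one row per element of p, and the padded row of q never contributes to any of them.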
classmethod from_config(config: pytext.models.representations.attention.SequenceAlignedAttention.Config)[source]
pytext.models.representations.augmented_lstm module
class pytext.models.representations.augmented_lstm.AugmentedLSTM(config: pytext.models.representations.augmented_lstm.AugmentedLSTM.Config, embed_dim: int, padding_value: float = 0.0)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

AugmentedLSTM implements a generic AugmentedLSTM representation layer. AugmentedLSTM is an LSTM which optionally appends a highway network to the output layer. Furthermore, the dropout controls the level of variational dropout applied.

Parameters:
  • config (Config) – Configuration object of type AugmentedLSTM.Config.
  • embed_dim (int) – The number of expected features in the input.
  • padding_value (float) – Value for the padded elements. Defaults to 0.0.
padding_value

Value for the padded elements.

Type:float
forward_layers

A module list of unidirectional AugmentedLSTM layers moving forward in time.

Type:nn.ModuleList
backward_layers

A module list of unidirectional AugmentedLSTM layers moving backward in time.

Type:nn.ModuleList
representation_dim

The calculated dimension of the output features of AugmentedLSTM.

Type:int
forward(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Given an input batch of sequential data such as word embeddings, produces an AugmentedLSTM representation of the sequential input and new state tensors.

Parameters:
  • embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
  • seq_lengths (torch.Tensor) – Sequence lengths of each batch element.
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers x num_directions * nhid). Defaults to None.
Returns:

AugmentedLSTM representation of input and the state of the LSTM at t = seq_len. Shape of representation is (bsize x seq_len x representation_dim). Shape of each state is (bsize x num_layers * num_directions x nhid).

Return type:

Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

class pytext.models.representations.augmented_lstm.AugmentedLSTMCell(embed_dim: int, lstm_dim: int, use_highway: bool, use_bias: bool = True)[source]

Bases: torch.nn.modules.module.Module

AugmentedLSTMCell implements an AugmentedLSTM cell.

Parameters:
  • embed_dim (int) – The number of expected features in the input.
  • lstm_dim (int) – Number of features in the hidden state of the LSTM. Defaults to 32.
  • use_highway (bool) – If True we append a highway network to the outputs of the LSTM.
  • use_bias (bool) – If True we use a bias in our LSTM calculations, otherwise we don’t.

input_linearity

Fused weight matrix which computes a linear function over the input.

Type:nn.Module
state_linearity

Fused weight matrix which computes a linear function over the states.

Type:nn.Module
forward(x: torch.Tensor, states=typing.Tuple[torch.Tensor, torch.Tensor], variational_dropout_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Warning: DO NOT USE THIS LAYER DIRECTLY, INSTEAD USE the AugmentedLSTM class

Parameters:
  • x (torch.Tensor) – Input tensor of shape (bsize x input_dim).
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x nhid). Defaults to None.
Returns:

Returned states. Shape of each state is (bsize x nhid).

Return type:

Tuple[torch.Tensor, torch.Tensor]

reset_parameters()[source]
class pytext.models.representations.augmented_lstm.AugmentedLSTMUnidirectional(embed_dim: int, lstm_dim: int, go_forward: bool = True, recurrent_dropout_probability: float = 0.0, use_highway: bool = True, use_input_projection_bias: bool = True)[source]

Bases: torch.nn.modules.module.Module

AugmentedLSTMUnidirectional implements a one-layer, single-direction AugmentedLSTM layer. AugmentedLSTM is an LSTM which optionally appends a highway network to the output layer. Furthermore, the dropout controls the level of variational dropout applied.

Parameters:
  • embed_dim (int) – The number of expected features in the input.
  • lstm_dim (int) – Number of features in the hidden state of the LSTM. Defaults to 32.
  • go_forward (bool) – Whether to compute features left to right (forward) or right to left (backward).
  • recurrent_dropout_probability (float) – Variational dropout probability to use. Defaults to 0.0.
  • use_highway (bool) – If True we append a highway network to the outputs of the LSTM.
  • use_input_projection_bias (bool) – If True we use a bias in our LSTM calculations, otherwise we don’t.
cell

AugmentedLSTMCell that is applied at every timestep.

Type:AugmentedLSTMCell
forward(inputs: torch.nn.utils.rnn.PackedSequence, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.nn.utils.rnn.PackedSequence, Tuple[torch.Tensor, torch.Tensor]][source]

Warning: DO NOT USE THIS LAYER DIRECTLY, INSTEAD USE the AugmentedLSTM class

Given an input batch of sequential data such as word embeddings, produces a single layer unidirectional AugmentedLSTM representation of the sequential input and new state tensors.

Parameters:
  • inputs (PackedSequence) – Input tensor of shape (bsize x seq_len x input_dim).
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (1 x bsize x num_directions * nhid). Defaults to None.
Returns:

AugmentedLSTM representation of input and the state of the LSTM at t = seq_len. Shape of representation is (bsize x seq_len x representation_dim). Shape of each state is (1 x bsize x nhid).

Return type:

Tuple[PackedSequence, Tuple[torch.Tensor, torch.Tensor]]

get_dropout_mask(dropout_probability: float, tensor_for_masking: torch.Tensor) → torch.Tensor[source]
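
Variational dropout, as referred to in the class description, samples one dropout mask per sequence and reuses it at every timestep instead of resampling per step. The sketch below illustrates that idea in plain Python. The helper names are hypothetical, and whether get_dropout_mask scales kept units in exactly this way is an assumption of the sketch.

```python
import random
from typing import List


def sample_dropout_mask(dropout_probability: float, size: int, rng: random.Random) -> List[float]:
    """Hypothetical helper: sample one inverted-dropout mask of the given size.

    Kept units are scaled by 1 / (1 - p) so the expected activation is unchanged.
    """
    keep_scale = 1.0 / (1.0 - dropout_probability)
    return [keep_scale if rng.random() > dropout_probability else 0.0 for _ in range(size)]


def apply_variational_dropout(hidden_states: List[List[float]], mask: List[float]) -> List[List[float]]:
    """Apply the same mask to the hidden state at every timestep."""
    return [[h * m for h, m in zip(state, mask)] for state in hidden_states]


rng = random.Random(0)
mask = sample_dropout_mask(0.5, size=4, rng=rng)
states = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]  # two timesteps
dropped = apply_variational_dropout(states, mask)
```

The same hidden units are zeroed at both timesteps, which is the property that distinguishes variational dropout from per-step dropout.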
pytext.models.representations.bilstm module
class pytext.models.representations.bilstm.BiLSTM(config: pytext.models.representations.bilstm.BiLSTM.Config, embed_dim: int, padding_value: float = 0.0)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

BiLSTM implements a multi-layer bidirectional LSTM representation layer preceded by a dropout layer.

Parameters:
  • config (Config) – Configuration object of type BiLSTM.Config.
  • embed_dim (int) – The number of expected features in the input.
  • padding_value (float) – Value for the padded elements. Defaults to 0.0.
padding_value

Value for the padded elements.

Type:float
dropout

Dropout layer preceding the LSTM.

Type:nn.Dropout
lstm

LSTM layer that operates on the inputs.

Type:nn.LSTM
representation_dim

The calculated dimension of the output features of BiLSTM.

Type:int
forward(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation of the sequential input and new state tensors.

Parameters:
  • embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
  • seq_lengths (torch.Tensor) – Sequence lengths of each batch element.
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns:

Bidirectional LSTM representation of input and the state of the LSTM at t = seq_len. Shape of representation is (bsize x seq_len x representation_dim). Shape of each state is (bsize x num_layers * num_directions x nhid).

Return type:

Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

pytext.models.representations.bilstm_doc_attention module
class pytext.models.representations.bilstm_doc_attention.BiLSTMDocAttention(config: pytext.models.representations.bilstm_doc_attention.BiLSTMDocAttention.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

BiLSTMDocAttention implements a multi-layer bidirectional LSTM based representation for documents with or without pooling. The pooling can be max pooling, mean pooling or self attention.

Parameters:
  • config (Config) – Configuration object of type BiLSTMDocAttention.Config.
  • embed_dim (int) – The number of expected features in the input.
dropout

Dropout layer preceding the LSTM.

Type:nn.Dropout
lstm

Module that implements the LSTM.

Type:nn.Module
attention

Module that implements the attention or pooling.

Type:nn.Module
dense

Module that implements the non-linear projection over attended representation.

Type:nn.Module
representation_dim

The calculated dimension of the output features of the BiLSTMDocAttention representation.

Type:int
forward(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, *args, states: Tuple[torch.Tensor, torch.Tensor] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation with or without pooling of the sequential input and new state tensors.

Parameters:
  • embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
  • seq_lengths (torch.Tensor) – List of sequence lengths of each batch element.
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns:

Bidirectional LSTM representation of input and the state of the LSTM at t = seq_len.

Return type:

Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
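The max and mean pooling options named in the BiLSTMDocAttention description can be sketched without any framework. This is only an illustration for a single padded sequence: the helper names are hypothetical, and excluding padded positions via the sequence length is an assumption about how pooling should treat padding, not a statement about PyText's implementation.

```python
from typing import List

Vector = List[float]


def max_pool(outputs: List[Vector], length: int) -> Vector:
    """Element-wise maximum over the first `length` (non-padded) time steps."""
    valid = outputs[:length]
    return [max(column) for column in zip(*valid)]


def mean_pool(outputs: List[Vector], length: int) -> Vector:
    """Element-wise mean over the first `length` (non-padded) time steps."""
    valid = outputs[:length]
    return [sum(column) / length for column in zip(*valid)]


# Three time steps of a 2-dimensional LSTM output; the last row is padding.
padded_outputs = [[1.0, 4.0], [3.0, 2.0], [0.0, 0.0]]
print(max_pool(padded_outputs, 2))   # [3.0, 4.0]
print(mean_pool(padded_outputs, 2))  # [2.0, 3.0]
```

Either function collapses a (seq_len x dim) output into one fixed-size vector, which is what makes the pooled variant usable as a document representation.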

pytext.models.representations.bilstm_doc_slot_attention module
class pytext.models.representations.bilstm_doc_slot_attention.BiLSTMDocSlotAttention(config: pytext.models.representations.bilstm_doc_slot_attention.BiLSTMDocSlotAttention.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

BiLSTMDocSlotAttention implements a multi-layer bidirectional LSTM based representation with support for various attention mechanisms.

In default mode, when attention configuration is not provided, it behaves like a multi-layer LSTM encoder and returns the output features from the last layer of the LSTM, for each t. When document_attention configuration is provided, it produces a fixed-sized document representation. When slot_attention configuration is provided, it attends on the output of each cell of the LSTM module to produce a fixed-sized word representation.

Parameters:
  • config (Config) – Configuration object of type BiLSTMDocSlotAttention.Config.
  • embed_dim (int) – The number of expected features in the input.
dropout

Dropout layer preceding the LSTM.

Type:nn.Dropout
relu

An instance of the ReLU layer.

Type:nn.ReLU
lstm

Module that implements the LSTM.

Type:nn.Module
use_doc_attention

If True, indicates using document attention.

Type:bool
doc_attention

Module that implements document attention.

Type:nn.Module
self.projection_d

A sequence of dense layers for projection over document representation.

Type:nn.Sequential
use_word_attention

If True, indicates using word attention.

Type:bool
word_attention

Module that implements word attention.

Type:nn.Module
self.projection_w

A sequence of dense layers for projection over word representation.

Type:nn.Sequential
representation_dim

The calculated dimension of the output features of the BiLSTMDocSlotAttention representation.

Type:int
forward(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, *args, states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation with the appropriate attention.

Parameters:
  • embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
  • seq_lengths (torch.Tensor) – List of sequence lengths of each batch element.
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns:

Tensors containing the document and the word representation of the input.

Return type:

Tuple[torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

pytext.models.representations.bilstm_slot_attn module
class pytext.models.representations.bilstm_slot_attn.BiLSTMSlotAttention(config: pytext.models.representations.bilstm_slot_attn.BiLSTMSlotAttention.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

BiLSTMSlotAttention implements a multi-layer bidirectional LSTM based representation with attention over slots.

Parameters:
  • config (Config) – Configuration object of type BiLSTMSlotAttention.Config.
  • embed_dim (int) – The number of expected features in the input.
dropout

Dropout layer preceding the LSTM.

Type:nn.Dropout
lstm

Module that implements the LSTM.

Type:nn.Module
attention

Module that implements the attention.

Type:nn.Module
dense

Module that implements the non-linear projection over attended representation.

Type:nn.Module
representation_dim

The calculated dimension of the output features of the SlotAttention representation.

Type:int
forward(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, *args, states: torch.Tensor = None, **kwargs) → torch.Tensor[source]

Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation with or without Slot attention.

Parameters:
  • embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
  • seq_lengths (torch.Tensor) – List of sequence lengths of each batch element.
  • states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns:

Bidirectional LSTM representation of input with or without slot attention.

Return type:

torch.Tensor

pytext.models.representations.biseqcnn module
class pytext.models.representations.biseqcnn.BSeqCNNRepresentation(config: pytext.models.representations.biseqcnn.BSeqCNNRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

This class is an implementation of the paper https://arxiv.org/pdf/1606.07783. It is a bidirectional CNN model that captures context like RNNs do.

The module expects that input mini-batch is already padded.

TODO: Current implementation has a single layer conv-maxpool operation.

forward(inputs: torch.Tensor, *args) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.biseqcnn.ContextualWordConvolution(in_channels: int, out_channels: int, kernel_sizes: List[int])[source]

Bases: torch.nn.modules.module.Module

forward(words: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.contextual_intent_slot_rep module
class pytext.models.representations.contextual_intent_slot_rep.ContextualIntentSlotRepresentation(config: pytext.models.representations.contextual_intent_slot_rep.ContextualIntentSlotRepresentation.Config, embed_dim: Tuple[int, ...])[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

Representation for a contextual intent slot model

The inputs are two embeddings: word level embedding containing dictionary features, sequence (contexts) level embedding. See following diagram for the representation implementation that combines the two embeddings. Seq_representation is concatenated with word_embeddings.

+-----------+
| word_embed|--------------------------->+   +--------------------+
+-----------+                            |   | doc_representation |
+-----------+   +-------------------+    |-->+--------------------+
| seq_embed |-->| seq_representation|--->+   | word_representation|
+-----------+   +-------------------+        +--------------------+
                                              joint_representation
forward(word_seq_embed: Tuple[torch.Tensor, torch.Tensor], word_lengths: torch.Tensor, seq_lengths: torch.Tensor, *args) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
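The concatenation step in the ContextualIntentSlotRepresentation diagram can be illustrated with plain lists. This is a minimal sketch for one example: join_representations is a hypothetical name, and repeating the sequence-level vector at every word position is an assumption about how "Seq_representation is concatenated with word_embeddings" is realized.

```python
from typing import List

Vector = List[float]


def join_representations(word_embed: List[Vector], seq_representation: Vector) -> List[Vector]:
    """Append the sequence-level vector to every word-level embedding."""
    return [word + seq_representation for word in word_embed]


# Three words with 2-dimensional embeddings and a 1-dimensional context vector.
word_embed = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
seq_representation = [9.0]
joint = join_representations(word_embed, seq_representation)
print(joint)  # [[0.1, 0.2, 9.0], [0.3, 0.4, 9.0], [0.5, 0.6, 9.0]]
```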

pytext.models.representations.deepcnn module
class pytext.models.representations.deepcnn.DeepCNNRepresentation(config: pytext.models.representations.deepcnn.DeepCNNRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

DeepCNNRepresentation implements a CNN representation layer preceded by a dropout layer. The CNN representation layer is based on the encoder in the architecture proposed by Gehring et al. in Convolutional Sequence to Sequence Learning.

Parameters:
  • config (Config) – Configuration object of type DeepCNNRepresentation.Config.
  • embed_dim (int) – The number of expected features in the input.
forward(inputs: torch.Tensor, *args) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.deepcnn.SeparableConv1d(input_channels: int, output_channels: int, kernel_size: int, padding: int, dilation: int, bottleneck: int)[source]

Bases: torch.nn.modules.module.Module

Implements a 1d depthwise separable convolutional layer. In regular convolutional layers, the input channels are mixed with each other to produce each output channel. Depthwise separable convolutions decompose this process into two smaller convolutions – a depthwise and pointwise convolution.

The depthwise convolution spatially convolves each input channel separately, then the pointwise convolution projects this result into a new channel space. This process reduces the number of FLOPS used to compute a convolution and also exhibits a regularization effect. The general behavior – including the input parameters – is equivalent to nn.Conv1d.

bottleneck controls the behavior of the pointwise convolution. Instead of upsampling directly, we split the pointwise convolution into two pieces: the first convolution downsamples into a (sufficiently small) low dimension and the second convolution upsamples into the target (higher) dimension. Creating this bottleneck significantly cuts the number of parameters with minimal loss in performance.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
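To make the parameter savings described for SeparableConv1d concrete, the following sketch counts weights for a regular 1d convolution, a depthwise separable one, and a separable one with a bottleneck. The function names are hypothetical, bias terms are ignored, and the counts follow the textbook decomposition rather than PyText's actual layer construction.

```python
def regular_conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Every output channel mixes every input channel across the kernel."""
    return in_channels * out_channels * kernel_size


def separable_conv_params(
    in_channels: int, out_channels: int, kernel_size: int, bottleneck: int = 0
) -> int:
    """Depthwise filter per input channel plus a pointwise (1x1) projection."""
    depthwise = in_channels * kernel_size
    if bottleneck > 0:
        # Pointwise step split in two: downsample to `bottleneck`, then upsample.
        pointwise = in_channels * bottleneck + bottleneck * out_channels
    else:
        pointwise = in_channels * out_channels
    return depthwise + pointwise


print(regular_conv_params(256, 256, 3))                   # 196608
print(separable_conv_params(256, 256, 3))                 # 66304
print(separable_conv_params(256, 256, 3, bottleneck=32))  # 17152
```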

class pytext.models.representations.deepcnn.Trim1d(trim)[source]

Bases: torch.nn.modules.module.Module

Trims a 1d convolutional output. Used to implement history-padding by removing excess padding from the right.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
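One way to read the Trim1d description is as the second half of a causal convolution: pad both sides, convolve, then drop the extra outputs on the right. The pure-Python sketch below illustrates that idea; the function names and the choice of (kernel_size - 1) * dilation as padding are assumptions, not taken from the PyText source.

```python
from typing import List


def conv1d(signal: List[float], kernel: List[float], padding: int, dilation: int = 1) -> List[float]:
    """Plain 1d cross-correlation with symmetric zero padding."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(padded) - span)
    ]


def trim_right(values: List[float], trim: int) -> List[float]:
    """Remove `trim` elements from the right, as Trim1d is described to do."""
    return values[:-trim] if trim > 0 else values


def causal_conv1d(signal: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Output at step t depends only on inputs at steps <= t."""
    pad = (len(kernel) - 1) * dilation
    return trim_right(conv1d(signal, kernel, pad, dilation), pad)


print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0]))  # [1.0, 3.0, 5.0, 7.0]
```

After trimming, the output has the same length as the input, and changing a later input leaves earlier outputs untouched.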

pytext.models.representations.deepcnn.create_conv_package(index: int, activation: pytext.config.module_config.Activation, in_channels: int, out_channels: int, kernel_size: int, causal: bool, dilated: bool, separable: bool, bottleneck: int, weight_norm: bool)[source]

Creates a convolutional layer with the specified arguments.

Parameters:
  • index (int) – Index of a convolutional layer in the stack.
  • activation (Activation) – Activation function.
  • in_channels (int) – Number of input channels.
  • out_channels (int) – Number of output channels.
  • kernel_size (int) – Size of 1d convolutional filter.
  • causal (bool) – Whether the convolution is causal or not. If set, it accounts for the temporal ordering of the inputs.
  • dilated (bool) – Whether the convolution is dilated or not. If set, the receptive field of the convolutional stack grows exponentially.
  • separable (bool) – Whether to use depthwise separable convolutions or not (see SeparableConv1d).
  • bottleneck (int) – Bottleneck channel dimension for depthwise separable convolutions. See SeparableConv1d for an in-depth explanation.
  • weight_norm (bool) – Whether to add weight normalization to the regular convolutions or not.
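The remark that dilation makes the receptive field grow exponentially can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the layer at position index uses dilation 2 ** index; that schedule and the function name are not confirmed by this page.

```python
def receptive_field(num_layers: int, kernel_size: int, dilated: bool) -> int:
    """Number of input positions that influence one output position."""
    field = 1
    for index in range(num_layers):
        # Assumed schedule: dilation doubles with each layer when dilated.
        dilation = 2 ** index if dilated else 1
        field += (kernel_size - 1) * dilation
    return field


print(receptive_field(num_layers=4, kernel_size=3, dilated=False))  # 9
print(receptive_field(num_layers=4, kernel_size=3, dilated=True))   # 31
```

Without dilation the field grows linearly with depth; with the doubling schedule it roughly doubles per layer.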
pytext.models.representations.deepcnn.pool(pooling_type, words)[source]
pytext.models.representations.docnn module
class pytext.models.representations.docnn.DocNNRepresentation(config: pytext.models.representations.docnn.DocNNRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

CNN based representation of a document.

conv_and_pool(x, conv)[source]
forward(embedded_tokens: torch.Tensor, *args) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.huggingface_bert_sentence_encoder module
class pytext.models.representations.huggingface_bert_sentence_encoder.HuggingFaceBertSentenceEncoder(config: pytext.models.representations.huggingface_bert_sentence_encoder.HuggingFaceBertSentenceEncoder.Config, output_encoded_layers: bool, *args, **kwargs)[source]

Bases: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase

Generate sentence representation using the open source HuggingFace BERT model. This class implements loading the model weights from a pre-trained model file.

pytext.models.representations.jointcnn_rep module
class pytext.models.representations.jointcnn_rep.JointCNNRepresentation(config: pytext.models.representations.jointcnn_rep.JointCNNRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

forward(embedded_tokens: torch.Tensor, *args) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.jointcnn_rep.SharedCNNRepresentation(config: pytext.models.representations.jointcnn_rep.SharedCNNRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

forward(embedded_tokens: torch.Tensor, *args) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.ordered_neuron_lstm module
class pytext.models.representations.ordered_neuron_lstm.OrderedNeuronLSTM(config: pytext.models.representations.ordered_neuron_lstm.OrderedNeuronLSTM.Config, embed_dim: int, padding_value: Optional[float] = 0.0)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

forward(rep: torch.Tensor, seq_lengths: torch.Tensor, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.ordered_neuron_lstm.OrderedNeuronLSTMLayer(embed_dim: int, lstm_dim: int, padding_value: float, dropout: float)[source]

Bases: pytext.models.module.Module

forward(embedded_tokens: torch.Tensor, states: Tuple[torch.Tensor, torch.Tensor], seq_lengths: List[int]) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.pair_rep module
class pytext.models.representations.pair_rep.PairRepresentation(config: pytext.models.representations.pair_rep.PairRepresentation.Config, embed_dim: Tuple[int, ...])[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

Wrapper representation for a pair of inputs.

Takes a tuple of inputs: the left sentence, and the right sentence(s). Returns a representation of the pair of sentences, either as a concatenation of the two sentence embeddings or as a “siamese” representation which also includes their difference and elementwise product (arXiv:1705.02364). If more than two inputs are provided, the extra inputs are assumed to be extra “right” sentences, and the output will be the stacked pair representations of the left sentence together with all right sentences. This is more efficient than separately computing all these pair representations, because the left sentence will not need to be re-embedded multiple times.

forward(embeddings: Tuple[torch.Tensor, ...], *lengths) → torch.Tensor[source]

Computes the pair representations.

Parameters:
  • embeddings – token embeddings of the left sentence, followed by the token embeddings of the right sentence(s).
  • lengths – the corresponding sequence lengths.
Returns:

A tensor of shape (num_right_inputs, batch_size, rep_size), with the first dimension squeezed if one.
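The two pair encodings described for PairRepresentation can be sketched on plain vectors. The helper names are hypothetical, and using the absolute difference follows the cited paper (arXiv:1705.02364) rather than anything stated on this page; the feature order is likewise an assumption.

```python
from typing import List

Vector = List[float]


def concat_pair(left: Vector, right: Vector) -> Vector:
    """Plain concatenation of the two sentence embeddings."""
    return left + right


def siamese_pair(left: Vector, right: Vector) -> Vector:
    """Concatenation plus absolute difference and element-wise product."""
    difference = [abs(l - r) for l, r in zip(left, right)]
    product = [l * r for l, r in zip(left, right)]
    return left + right + difference + product


def stacked_pairs(left: Vector, rights: List[Vector]) -> List[Vector]:
    """Pair one left sentence with several right sentences, reusing `left`."""
    return [siamese_pair(left, right) for right in rights]


left, right = [1.0, 2.0], [3.0, 0.5]
print(concat_pair(left, right))   # [1.0, 2.0, 3.0, 0.5]
print(siamese_pair(left, right))  # [1.0, 2.0, 3.0, 0.5, 2.0, 1.5, 3.0, 1.0]
```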

pytext.models.representations.pass_through module
class pytext.models.representations.pass_through.PassThroughRepresentation(config: pytext.config.component.ComponentMeta.__new__.<locals>.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

forward(embedded_tokens: torch.Tensor, *args) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.pooling module
class pytext.models.representations.pooling.BoundaryPool(config: pytext.models.representations.pooling.BoundaryPool.Config, n_input: int)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.pooling.LastTimestepPool(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor, seq_lengths: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.pooling.MaxPool(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.pooling.MeanPool(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor, seq_lengths: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.pooling.NoPool(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.pooling.SelfAttention(config: pytext.models.representations.pooling.SelfAttention.Config, n_input: int)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights(init_range: float = 0.1) → None[source]
pytext.models.representations.pure_doc_attention module
class pytext.models.representations.pure_doc_attention.PureDocAttention(config: pytext.models.representations.pure_doc_attention.PureDocAttention.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

Pooling (e.g. max pooling or self attention) followed by an optional MLP.

forward(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor = None, *args) → Any[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.representation_base module
class pytext.models.representations.representation_base.RepresentationBase(config)[source]

Bases: pytext.models.module.Module

forward(*inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_representation_dim()[source]
pytext.models.representations.seq_rep module
class pytext.models.representations.seq_rep.SeqRepresentation(config: pytext.models.representations.seq_rep.SeqRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

Representation for a sequence of sentences. Each sentence will be embedded with a DocNN model, then all the sentences are embedded with another DocNN/BiLSTM model.

forward(embedded_seqs: torch.Tensor, seq_lengths: torch.Tensor, *args) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.slot_attention module
class pytext.models.representations.slot_attention.SlotAttention(config: pytext.models.representations.slot_attention.SlotAttention.Config, n_input: int, batch_first: bool = True)[source]

Bases: pytext.models.module.Module

forward(inputs: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.sparse_transformer_sentence_encoder module
class pytext.models.representations.sparse_transformer_sentence_encoder.SparseTransformerSentenceEncoder(config: pytext.models.representations.sparse_transformer_sentence_encoder.SparseTransformerSentenceEncoder.Config, output_encoded_layers: bool, padding_idx: int, vocab_size: int, *args, **kwarg)[source]

Bases: pytext.models.representations.transformer_sentence_encoder.TransformerSentenceEncoder

Implementation of the Transformer Sentence Encoder. This directly makes use of the TransformerSentenceEncoder module in Fairseq.

A few interesting config options:
  • encoder_normalize_before determines whether the layer norm is applied before or after self_attention. This is similar to the original implementation from Google.
  • activation_fn can be set to ‘gelu’ instead of the default of ‘relu’.
  • project_representation adds a linear projection + tanh to the pooled output in the style of BERT.
pytext.models.representations.stacked_bidirectional_rnn module
class pytext.models.representations.stacked_bidirectional_rnn.RnnType[source]

Bases: enum.Enum

An enumeration.

GRU = 'gru'
LSTM = 'lstm'
RNN = 'rnn'
class pytext.models.representations.stacked_bidirectional_rnn.StackedBidirectionalRNN(config: pytext.models.representations.stacked_bidirectional_rnn.StackedBidirectionalRNN.Config, input_size: int, padding_value: float = 0.0)[source]

Bases: pytext.models.module.Module

StackedBidirectionalRNN implements a multi-layer bidirectional RNN with an option to return outputs from all the layers of RNN.

Parameters:
  • config (Config) – Configuration object of type BiLSTM.Config.
  • embed_dim (int) – The number of expected features in the input.
  • padding_value (float) – Value for the padded elements. Defaults to 0.0.
padding_value

Value for the padded elements.

Type:float
dropout

Dropout layer preceding the LSTM.

Type:nn.Dropout
lstm

LSTM layer that operates on the inputs.

Type:nn.LSTM
representation_dim

The calculated dimension of the output features of BiLSTM.

Type:int
forward(tokens, tokens_mask)[source]
Parameters:
  • tokens – batch, max_seq_len, hidden_size
  • tokens_mask – batch, max_seq_len (1 for padding, 0 for true)
Output:
  • tokens_encoded – batch, max_seq_len, hidden_size * num_layers if concat_layers = True, else batch, max_seq_len, hidden_size
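The mask convention and output shape of StackedBidirectionalRNN.forward can be illustrated with two small helpers. Both names are hypothetical; the logic only restates what the parameter description says (1 marks padding, and the last dimension is multiplied by num_layers when concat_layers is set).

```python
from typing import List, Tuple


def lengths_from_mask(tokens_mask: List[List[int]]) -> List[int]:
    """Count real tokens per row, where 1 marks padding and 0 a true token."""
    return [len(row) - sum(row) for row in tokens_mask]


def encoded_shape(
    batch: int, max_seq_len: int, hidden_size: int, num_layers: int, concat_layers: bool
) -> Tuple[int, int, int]:
    """Shape of tokens_encoded as described for forward()."""
    last_dim = hidden_size * num_layers if concat_layers else hidden_size
    return (batch, max_seq_len, last_dim)


mask = [[0, 0, 0, 1], [0, 0, 1, 1]]
print(lengths_from_mask(mask))                    # [3, 2]
print(encoded_shape(2, 4, 8, 3, concat_layers=True))   # (2, 4, 24)
print(encoded_shape(2, 4, 8, 3, concat_layers=False))  # (2, 4, 8)
```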
pytext.models.representations.traced_transformer_encoder module
class pytext.models.representations.traced_transformer_encoder.TraceableTransformerWrapper(eager_encoder: fairseq.modules.transformer_sentence_encoder.TransformerSentenceEncoder)[source]

Bases: torch.nn.modules.module.Module

forward(tokens: torch.Tensor, segment_labels: torch.Tensor = None, positions: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.traced_transformer_encoder.TracedTransformerEncoder(eager_encoder: fairseq.modules.transformer_sentence_encoder.TransformerSentenceEncoder, tokens: torch.Tensor, segment_labels: torch.Tensor = None, positions: torch.Tensor = None)[source]

Bases: torch.nn.modules.module.Module

forward(tokens: torch.Tensor, segment_labels: torch.Tensor = None, positions: torch.Tensor = None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.transformer_sentence_encoder module
class pytext.models.representations.transformer_sentence_encoder.TransformerSentenceEncoder(config: pytext.models.representations.transformer_sentence_encoder.TransformerSentenceEncoder.Config, output_encoded_layers: bool, padding_idx: int, vocab_size: int, *args, **kwarg)[source]

Bases: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase

Implementation of the Transformer Sentence Encoder. This directly makes use of the TransformerSentenceEncoder module in Fairseq.

A few interesting config options:
  • encoder_normalize_before determines whether the layer norm is applied before or after self_attention. This is similar to the original implementation from Google.
  • activation_fn can be set to ‘gelu’ instead of the default of ‘relu’.
  • projection_dim adds a linear projection to projection_dim + tanh to the pooled output in the style of BERT.
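For readers unfamiliar with the two activation choices mentioned for activation_fn, here is a small standard-library sketch of both. It uses the exact erf-based definition of GELU; whether the underlying encoder uses this form or a tanh approximation is not stated here.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Gaussian error linear unit: x scaled by the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-1.0, 0.0, 1.0):
    print(value, relu(value), round(gelu(value), 4))
```

Unlike ReLU, GELU is smooth around zero and lets small negative values through with a reduced weight.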
load_state_dict(state_dict)[source]

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Parameters:
  • state_dict (dict) – a dict containing parameters and persistent buffers.
  • strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
Returns:

  • missing_keys is a list of str containing the missing keys
  • unexpected_keys is a list of str containing the unexpected keys

Return type:

NamedTuple with missing_keys and unexpected_keys fields
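The meaning of missing_keys and unexpected_keys can be shown with a plain set comparison. This is only a sketch of the bookkeeping: KeyReport and compare_keys are hypothetical names, and no tensors are copied here.

```python
from typing import Iterable, List, NamedTuple


class KeyReport(NamedTuple):
    missing_keys: List[str]
    unexpected_keys: List[str]


def compare_keys(expected: Iterable[str], provided: Iterable[str]) -> KeyReport:
    """Report keys the module expects but did not get, and keys it got but does not know."""
    expected_set, provided_set = set(expected), set(provided)
    missing = sorted(expected_set - provided_set)
    unexpected = sorted(provided_set - expected_set)
    return KeyReport(missing_keys=missing, unexpected_keys=unexpected)


report = compare_keys(
    expected=["embed.weight", "layer.0.weight", "layer.0.bias"],
    provided=["embed.weight", "layer.0.weight", "old_head.weight"],
)
print(report.missing_keys)     # ['layer.0.bias']
print(report.unexpected_keys)  # ['old_head.weight']
```

With strict set to True, either list being non-empty corresponds to the mismatch case described above.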

upgrade_state_dict_named(state_dict)[source]
pytext.models.representations.transformer_sentence_encoder_base module
class pytext.models.representations.transformer_sentence_encoder_base.PoolingMethod[source]

Bases: enum.Enum

Pooling methods are chosen from the “Feature-based Approach” section in https://arxiv.org/pdf/1810.04805.pdf

AVG_CONCAT_LAST_4_LAYERS = 'avg_concat_last_4_layers'
AVG_LAST_LAYER = 'avg_last_layer'
AVG_SECOND_TO_LAST_LAYER = 'avg_second_to_last_layer'
AVG_SUM_LAST_4_LAYERS = 'avg_sum_last_4_layers'
CLS_TOKEN = 'cls_token'
NO_POOL = 'no_pool'
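
As a rough illustration of what these names suggest, the sketch below pools a list of per-layer outputs for a single sentence using plain Python lists. The semantics are an assumption inferred from the enum names and from the output sizes documented for the base encoder (C, or 4C for the concatenating variant). The real implementation works on batched tensors and may treat padding differently. NO_POOL is not covered.

```python
from typing import List

Matrix = List[List[float]]  # T x C: one C-dimensional vector per token


def mean_over_tokens(layer: Matrix) -> List[float]:
    """Average the token vectors of one layer into a single C-dim vector."""
    return [sum(column) / len(layer) for column in zip(*layer)]


def pool(encoded_layers: List[Matrix], method: str) -> List[float]:
    """Pool a list of layer outputs (assumed semantics, single sentence)."""
    last = encoded_layers[-1]
    if method == "cls_token":
        return list(last[0])  # vector of the first token of the last layer
    if method == "avg_last_layer":
        return mean_over_tokens(last)
    if method == "avg_second_to_last_layer":
        return mean_over_tokens(encoded_layers[-2])
    means = [mean_over_tokens(layer) for layer in encoded_layers[-4:]]
    if method == "avg_sum_last_4_layers":
        return [sum(values) for values in zip(*means)]  # size C
    if method == "avg_concat_last_4_layers":
        return [x for mean in means for x in mean]  # size 4C
    raise ValueError(f"unsupported pooling method: {method}")


# Five toy layers, each with T=2 tokens and C=2 channels.
layers = [[[float(i)] * 2, [float(i + 2)] * 2] for i in range(5)]
print(pool(layers, "cls_token"))                      # [4.0, 4.0]
print(pool(layers, "avg_last_layer"))                 # [5.0, 5.0]
print(pool(layers, "avg_sum_last_4_layers"))          # [14.0, 14.0]
print(len(pool(layers, "avg_concat_last_4_layers")))  # 8
```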
class pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase(config: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase.Config, output_encoded_layers=False, *args, **kwargs)[source]

Bases: pytext.models.representations.representation_base.RepresentationBase

Base class for all Bi-directional Transformer based Sentence Encoders. All children of this class should implement an _encoder function which takes as input: tokens, [optional] segment labels and a pad mask and outputs both the sentence representation (output of _pool_encoded_layers) and the output states of all the intermediate Transformer layers as a list of tensors.

Input tuple consists of the following elements:
  1. tokens: torch tensor of size B x T which contains token ids
  2. pad_mask: torch tensor of size B x T generated with the condition tokens != self.vocab.get_pad_index()
  3. segment_labels: torch tensor of size B x T which contains the segment id of each token

Output tuple consists of the following elements:
  1. encoded_layers: List of torch tensors where each tensor has shape B x T x C and there are num_transformer_layers + 1 of these. Each tensor represents the output of the intermediate transformer layers with the 0th element being the input to the first transformer layer (token + segment + position embedding).
  2. [Optional] pooled_output: Output of the pooling operation associated with config.pooling_method applied to the encoded_layers. Size B x C (or B x 4C if pooling = AVG_CONCAT_LAST_4_LAYERS).

forward(input_tuple: Tuple[torch.Tensor, ...], *args) → Tuple[torch.Tensor, ...][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase.Config, output_encoded_layers=False, *args, **kwargs)[source]
Module contents
pytext.models.semantic_parsers package
Subpackages
pytext.models.semantic_parsers.rnng package
Submodules
pytext.models.semantic_parsers.rnng.rnng_constant module
pytext.models.semantic_parsers.rnng.rnng_data_structures module
class pytext.models.semantic_parsers.rnng.rnng_data_structures.CompositionalNN(lstm_dim: int)[source]

Bases: torch.jit.ScriptModule

Combines a list / sequence of embeddings into one using a biLSTM

class pytext.models.semantic_parsers.rnng.rnng_data_structures.CompositionalSummationNN(lstm_dim: int)[source]

Bases: torch.jit.ScriptModule

Simpler version of CompositionalNN

class pytext.models.semantic_parsers.rnng.rnng_data_structures.Element(node: Any)[source]

Bases: object

Generic element representing a token / non-terminal / sub-tree on a stack. Used to compute valid actions in the RNNG parser.

class pytext.models.semantic_parsers.rnng.rnng_data_structures.ParserState(parser=None)[source]

Bases: object

Maintains state of the Parser. Useful for beam search

copy()[source]
finished()[source]
class pytext.models.semantic_parsers.rnng.rnng_data_structures.StackLSTM(lstm: torch.nn.modules.rnn.LSTM)[source]

Bases: collections.abc.Sized, typing.Generic

The Stack LSTM from Dyer et al: https://arxiv.org/abs/1505.08075

copy()[source]
element_from_top(index: int) → pytext.models.semantic_parsers.rnng.rnng_data_structures.Element[source]
embedding() → torch.Tensor[source]
Shapes:
return value: (1, lstm_hidden_dim)
pop() → Tuple[torch.Tensor, pytext.models.semantic_parsers.rnng.rnng_data_structures.Element][source]

Pops and returns tuple of output embedding (1, lstm_hidden_dim) and element

push(expression: torch.Tensor, element: pytext.models.semantic_parsers.rnng.rnng_data_structures.Element) → None[source]
Shapes:
expression: (1, lstm_input_dim)
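
The key idea behind the push, pop and embedding methods is that the recurrent state is stored alongside each stack entry, so popping restores the summary of whatever is left. The sketch below shows that bookkeeping with a scalar stand-in for the LSTM cell. ToyStackRNN and its _cell update rule are hypothetical and are not the PyText implementation.

```python
from typing import Any, List, Tuple


class ToyStackRNN:
    """Stack whose summary state is restored on pop, in the spirit of a Stack LSTM."""

    def __init__(self, initial_state: float = 0.0) -> None:
        # _states[i] is the recurrent state after pushing i elements.
        self._states: List[float] = [initial_state]
        self._elements: List[Any] = []

    def __len__(self) -> int:
        return len(self._elements)

    def _cell(self, prev_state: float, x: float) -> float:
        # Stand-in for one LSTM step; any function of (prev_state, x) works here.
        return 0.5 * prev_state + x

    def push(self, x: float, element: Any) -> None:
        self._states.append(self._cell(self._states[-1], x))
        self._elements.append(element)

    def pop(self) -> Tuple[float, Any]:
        return self._states.pop(), self._elements.pop()

    def embedding(self) -> float:
        # Summary of everything currently on the stack.
        return self._states[-1]


stack = ToyStackRNN()
stack.push(1.0, "a")
stack.push(2.0, "b")
print(stack.embedding())  # 2.5
print(stack.pop())        # (2.5, 'b')
print(stack.embedding())  # 1.0
```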
pytext.models.semantic_parsers.rnng.rnng_parser module
class pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParser(ablation: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.AblationParams, constraints: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.RNNGConstraints, lstm_num_layers: int, lstm_dim: int, max_open_NT: int, dropout: float, actions_vocab, shift_idx: int, reduce_idx: int, ignore_subNTs_roots: List[int], valid_NT_idxs: List[int], valid_IN_idxs: List[int], valid_SL_idxs: List[int], embedding: pytext.models.embeddings.embedding_list.EmbeddingList, p_compositional)[source]

Bases: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
vocab_to_export(tensorizers)[source]
class pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase(ablation: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.AblationParams, constraints: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.RNNGConstraints, lstm_num_layers: int, lstm_dim: int, max_open_NT: int, dropout: float, actions_vocab, shift_idx: int, reduce_idx: int, ignore_subNTs_roots: List[int], valid_NT_idxs: List[int], valid_IN_idxs: List[int], valid_SL_idxs: List[int], embedding: pytext.models.embeddings.embedding_list.EmbeddingList, p_compositional)[source]

Bases: pytext.models.model.BaseModel

The Recurrent Neural Network Grammar (RNNG) parser from Dyer et al.: https://arxiv.org/abs/1602.07776 and Gupta et al.: https://arxiv.org/abs/1810.07942. RNNG is a neural constituency parsing algorithm that explicitly models the compositional structure of a sentence. It is able to learn the hierarchical relationships among the words and phrases in a given sentence, thereby learning the underlying tree structure. The paper proposes generative as well as discriminative approaches. In PyText we have implemented the discriminative approach for modeling intent-slot models. It is a top-down shift-reduce parser that can output trees with non-terminals (intent and slot labels) and terminals (tokens).
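
To make the shift-reduce mechanics concrete, here is a small sketch that replays a fixed action sequence and builds the resulting bracketed tree. It involves no neural network. The action names (SHIFT, REDUCE, and any other string treated as a non-terminal label) and the example labels IN:GET_WEATHER and SL:LOCATION are illustrative assumptions.

```python
from typing import List


def run_actions(tokens: List[str], actions: List[str]) -> str:
    """Replay top-down shift-reduce actions and return a bracketed tree."""
    buffer = list(tokens)
    stack: List[object] = []
    for action in actions:
        if action == "SHIFT":
            # Move the next token from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            # Pop children until the most recent open non-terminal marker.
            children: List[str] = []
            while not isinstance(stack[-1], tuple):
                children.append(stack.pop())
            _, label = stack.pop()
            stack.append("[" + " ".join([label] + children[::-1]) + "]")
        else:
            # Any other action opens a non-terminal with that label.
            stack.append(("OPEN", action))
    if buffer or len(stack) != 1 or isinstance(stack[0], tuple):
        raise ValueError("action sequence does not produce a single tree")
    return stack[0]


tree = run_actions(
    ["weather", "in", "paris"],
    ["IN:GET_WEATHER", "SHIFT", "SHIFT", "SL:LOCATION", "SHIFT", "REDUCE", "REDUCE"],
)
print(tree)  # [IN:GET_WEATHER weather in [SL:LOCATION paris]]
```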

contextualize(context)[source]

Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by DisjointMultitaskModel for changing the task that should be trained with a given iterator.

forward(tokens: torch.Tensor, seq_lens: torch.Tensor, dict_feat: Optional[Tuple[torch.Tensor, ...]] = None, actions: Optional[List[List[int]]] = None, contextual_token_embeddings: Optional[torch.Tensor] = None, beam_size=1, top_k=1) → List[Tuple[torch.Tensor, torch.Tensor]][source]

RNNG forward function.

Parameters:
  • tokens (torch.Tensor) – list of tokens
  • seq_lens (torch.Tensor) – list of sequence lengths
  • dict_feat (Optional[Tuple[torch.Tensor, ..]]) – dictionary or gazetteer features for each token
  • actions (Optional[List[List[int]]]) – Used only during training. Oracle actions for the instances.
Returns:

List of the top k tuples, each holding a predicted actions tensor and the corresponding scores tensor. Tensor shapes: (batch_size, action_length) and (batch_size, action_length, number_of_actions)

classmethod from_config(model_config, feature_config=None, metadata: pytext.data.data_handler.CommonMetadata = None, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer] = None)[source]
get_loss(logits: List[Tuple[torch.Tensor, torch.Tensor]], target_actions: torch.Tensor, context: torch.Tensor)[source]
Shapes:
logits[1]: action scores: (1, action_length, number_of_actions)
target_actions: (1, action_length)
get_param_groups_for_optimizer()[source]

This is called by code that looks for an instance of pytext.models.model.Model.

get_pred(logits: List[Tuple[torch.Tensor, torch.Tensor]], context=None, *args)[source]
Return Shapes:
preds: batch (1) * topk * action_len
scores: batch (1) * topk * (action_len * number_of_actions)
get_single_pred(logits: Tuple[torch.Tensor, torch.Tensor], *args)[source]
push_action(state: pytext.models.semantic_parsers.rnng.rnng_data_structures.ParserState, target_action_idx: int) → None[source]

Used for updating the state with a target next action

Parameters:
  • state (ParserState) – The state of the stack, buffer and action
  • target_action_idx (int) – Index of the action to process
save_modules(*args, **kwargs)[source]

Save each sub-module in separate files for reusing later.

valid_actions(state: pytext.models.semantic_parsers.rnng.rnng_data_structures.ParserState) → List[int][source]

Used for restricting the set of possible action predictions

Parameters:state (ParserState) – The state of the stack, buffer and action
Returns:indices of the valid actions
Return type:List[int]
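
The sketch below illustrates the kind of constraints such a method can enforce, using a simplified rule set in the spirit of top-down shift-reduce parsing. The rules, the function signature, and the use of action names instead of indices are all assumptions for illustration. The actual method works on a ParserState and may apply further task-specific constraints.

```python
from typing import List


def valid_actions(
    buffer_len: int,
    num_open_nt: int,
    top_is_open_nt: bool,
    max_open_nt: int,
    nt_actions: List[str],
) -> List[str]:
    """Return the actions allowed in a simplified top-down shift-reduce state."""
    valid: List[str] = []
    # SHIFT needs a token to move and an open non-terminal to attach it to.
    if buffer_len > 0 and num_open_nt > 0:
        valid.append("SHIFT")
    # REDUCE needs a non-empty open constituent, and must not close the root
    # while tokens are still waiting in the buffer.
    if num_open_nt > 0 and not top_is_open_nt and (num_open_nt > 1 or buffer_len == 0):
        valid.append("REDUCE")
    # Opening a non-terminal needs remaining tokens and room under the cap.
    if buffer_len > 0 and num_open_nt < max_open_nt:
        valid.extend(nt_actions)
    return valid


nts = ["IN:GET_WEATHER", "SL:LOCATION"]
print(valid_actions(3, 0, False, 10, nts))  # ['IN:GET_WEATHER', 'SL:LOCATION']
print(valid_actions(3, 1, True, 10, nts))   # ['SHIFT', 'IN:GET_WEATHER', 'SL:LOCATION']
print(valid_actions(0, 1, False, 10, nts))  # ['REDUCE']
```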
Module contents
Module contents
pytext.models.seq_models package
Submodules
pytext.models.seq_models.contextual_intent_slot module
class pytext.models.seq_models.contextual_intent_slot.ContextualIntentSlotModel(default_doc_loss_weight, default_word_loss_weight, *args, **kwargs)[source]

Bases: pytext.models.joint_model.IntentSlotModel

Joint Model for Intent classification and slot tagging with inputs of contextual information (sequence of utterances) and dictionary feature of the last utterance.

Training data should include:
  • doc_label (string): intent classification label of either the sequence of utterances or just the last sentence
  • word_label (string): slot tagging label of the last utterance in the format of start_idx:end_idx:slot_label; multiple slots are separated by a comma
  • text (list of string): sequence of utterances for training
  • dict_feat (dict): a dict of features that contains the feature of each word in the last utterance

Following is an example of raw columns from training data:

doc_label reply-where
word_label 10:20:restaurant_name
text ["dinner at 6?", "wanna try Tomi Sushi?"]
dict_feat
{"tokenFeatList": [{"tokenIdx": 2, "features": {"poi:eatery": 0.66}},
{"tokenIdx": 3, "features": {"poi:eatery": 0.66}}]}
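
The word_label format can be unpacked with a few lines of Python. This sketch assumes the indices are character offsets into the last utterance, with an inclusive start and an exclusive end. That reading is consistent with the example above, where 10:20 selects "Tomi Sushi", but it is an inference from the example rather than a documented guarantee. The helper name parse_word_label is hypothetical.

```python
from typing import List, Tuple


def parse_word_label(word_label: str) -> List[Tuple[int, int, str]]:
    """Split 'start:end:label,start:end:label' into (start, end, label) tuples."""
    slots = []
    for item in word_label.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first two colons only, so labels may contain colons.
        start, end, label = item.split(":", 2)
        slots.append((int(start), int(end), label))
    return slots


text = ["dinner at 6?", "wanna try Tomi Sushi?"]
for start, end, label in parse_word_label("10:20:restaurant_name"):
    print(label, repr(text[-1][start:end]))  # restaurant_name 'Tomi Sushi'
```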
arrange_model_inputs(tensor_dict)[source]
classmethod create_embedding(config, tensorizers)[source]
get_export_input_names(tensorizers)[source]
vocab_to_export(tensorizers)[source]
pytext.models.seq_models.seqnn module
class pytext.models.seq_models.seqnn.SeqNNModel(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.doc_model.DocModel

Classification model with a sequence of utterances as input. It uses a docnn model (CNN or LSTM) to generate a vector representation for each sequence, and then uses an LSTM or BLSTM to capture the dynamics and produce labels for each sequence.

arrange_model_inputs(tensor_dict)[source]
class pytext.models.seq_models.seqnn.SeqNNModel_Deprecated(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.model.Model

Classification model with a sequence of utterances as input. It uses a docnn model (CNN or LSTM) to generate a vector representation for each sequence, and then uses an LSTM or BLSTM to capture the dynamics and produce labels for each sequence.

DEPRECATED: Use SeqNNModel

Module contents
Submodules
pytext.models.bert_classification_models module
class pytext.models.bert_classification_models.BertPairwiseModel(encoder1, encoder2, decoder, output_layer, encode_relations)[source]

Bases: pytext.models.pair_classification_model.BasePairwiseModel

Bert Pairwise classification model

The model takes two sets of tokens (left and right), calculates their representations separately using a shared BERT encoder, and passes them to the decoder along with their absolute difference and elementwise product, all concatenated. Used, for example, for natural language inference.
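
The feature combination described above can be sketched on plain lists. The concatenation order used here (left, right, absolute difference, product) is an assumption, since the description does not specify it, and combine_pair is a hypothetical helper name.

```python
from typing import List


def combine_pair(left: List[float], right: List[float]) -> List[float]:
    """Concatenate two sentence vectors with their |difference| and product."""
    abs_diff = [abs(a - b) for a, b in zip(left, right)]
    product = [a * b for a, b in zip(left, right)]
    return list(left) + list(right) + abs_diff + product


features = combine_pair([1.0, 2.0], [3.0, -1.0])
print(features)  # [1.0, 2.0, 3.0, -1.0, 2.0, 3.0, 3.0, -2.0]
```

For encoder outputs of size C, the decoder input under this scheme has size 4C.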

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
forward(input_tuple1: Tuple[torch.Tensor, ...], input_tuple2: Tuple[torch.Tensor, ...]) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.bert_classification_models.BertPairwiseModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
save_modules(base_path: str = '', suffix: str = '')[source]

Save each sub-module in separate files for reusing later.

class pytext.models.bert_classification_models.NewBertModel(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]

Bases: pytext.models.model.BaseModel

BERT single sentence classification.

SUPPORT_FP16_OPTIMIZER = True
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
forward(encoder_inputs: Tuple[torch.Tensor, ...], *args) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.bert_classification_models.NewBertModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
pytext.models.bert_regression_model module
class pytext.models.bert_regression_model.NewBertRegressionModel(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]

Bases: pytext.models.bert_classification_models.NewBertModel

BERT single sentence (or concatenated sentences) regression.

classmethod from_config(config: pytext.models.bert_regression_model.NewBertRegressionModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
pytext.models.crf module
class pytext.models.crf.CRF(num_tags: int, ignore_index: int, default_label_pad_index: int)[source]

Bases: torch.nn.modules.module.Module

Compute the log-likelihood of the input assuming a conditional random field model.

Parameters:num_tags – The number of tags
decode(emissions: torch.Tensor, seq_lens: torch.Tensor) → torch.Tensor[source]

Given a set of emission probabilities, return the predicted tags.

Parameters:
  • emissions – Emission probabilities with expected shape of batch_size * seq_len * num_labels
  • seq_lens – Length of each input.
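
Decoding in a linear-chain CRF is typically done with the Viterbi algorithm. The sketch below runs it on a single unpadded sequence using plain lists. It is an illustration under simplifying assumptions (no batching, no padding, no dedicated start or end transitions) and is not the PyText implementation.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Best tag sequence; transitions[i][j] scores moving from tag i to tag j."""
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        step_scores, step_ptrs = [], []
        for j in range(num_tags):
            # Best previous tag for ending the current step in tag j.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            step_ptrs.append(best_i)
            step_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = step_scores
        backpointers.append(step_ptrs)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda j: scores[j])]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return path[::-1]


emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sticky = [[0.0, -3.0], [-3.0, 0.0]]  # switching tags is penalized
free = [[0.0, 0.0], [0.0, 0.0]]      # transitions have no effect
print(viterbi(emissions, sticky))  # [0, 0, 0]
print(viterbi(emissions, free))    # [0, 1, 1]
```

The two calls show how transition scores can override the per-token argmax of the emissions.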
export_to_caffe2(workspace, init_net, predict_net, logits_output_name)[source]

Exports the CRF layer to Caffe2 by manually adding the necessary operators to the init_net and predict_net.

Parameters:
  • init_net – caffe2 init net created by the current graph
  • predict_net – caffe2 net created by the current graph
  • workspace – caffe2 current workspace
  • output_names – current output names of the caffe2 net
  • py_model – original pytorch model object
Returns:

The updated predictions blob name

Return type:

string

forward(emissions: torch.Tensor, tags: torch.Tensor, reduce: bool = True) → torch.Tensor[source]

Compute log-likelihood of input.

Parameters:
  • emissions – Emission values for different tags for each input. The expected shape is batch_size * seq_len * num_labels. Padding should be on the right side of the input.
  • tags – Actual tags for each token in the input. Expected shape is batch_size * seq_len
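
For a linear-chain CRF, the log-likelihood of a tag sequence is its path score minus the log of the partition function, which the forward algorithm computes. The sketch below works on one unpadded sequence with plain lists. It omits batching, padding, the reduce option and any start or end transitions, so treat it as an illustration of the technique rather than the module's actual code.

```python
import itertools
import math
from typing import List


def log_sum_exp(values: List[float]) -> float:
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def path_score(emissions, transitions, tags) -> float:
    """Unnormalized score of one tag sequence."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions) -> float:
    """Forward algorithm: log of the summed exp-scores of all tag sequences."""
    num_tags = len(transitions)
    alphas = list(emissions[0])
    for emit in emissions[1:]:
        alphas = [
            log_sum_exp([alphas[i] + transitions[i][j] for i in range(num_tags)]) + emit[j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alphas)


def log_likelihood(emissions, transitions, tags) -> float:
    return path_score(emissions, transitions, tags) - log_partition(emissions, transitions)


emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]
# Sanity check: the probabilities of all 2**3 tag sequences sum to one.
total = sum(
    math.exp(log_likelihood(emissions, transitions, list(tags)))
    for tags in itertools.product(range(2), repeat=3)
)
print(round(total, 6))  # 1.0
```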
get_transitions()[source]
reset_parameters() → None[source]
set_transitions(transitions: torch.Tensor = None)[source]
pytext.models.disjoint_multitask_model module
class pytext.models.disjoint_multitask_model.DisjointMultitaskModel(models, loss_weights)[source]

Bases: pytext.models.model.Model

Wrapper model to train multiple PyText models that share parameters. Designed to be used for multi-tasking when the tasks have disjoint datasets.

Modules which have the same shared_module_key and type share parameters. Only need to configure the first such module in full in each case.

Parameters:models (type) – Dictionary of models of sub-tasks.
current_model

Current model to route the input batch to.

Type:type
contextualize(context)[source]

Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by DisjointMultitaskModel for changing the task that should be trained with a given iterator.

current_model
forward(*inputs) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_loss(logits, targets, context)[source]
get_pred(logits, targets=None, context=None, *args)[source]
save_modules(base_path, suffix='')[source]

Save each sub-module in separate files for reusing later.

class pytext.models.disjoint_multitask_model.NewDisjointMultitaskModel(models, loss_weights)[source]

Bases: pytext.models.disjoint_multitask_model.DisjointMultitaskModel

arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
pytext.models.distributed_model module
class pytext.models.distributed_model.DistributedModel(*args, **kwargs)[source]

Bases: torch.nn.parallel.distributed.DistributedDataParallel

Wrapper model class to train models in a distributed data-parallel manner. To train your module this way, wrap it as follows:

distributed_model = DistributedModel(
    module=model,
    device_ids=[device_id0, device_id1],
    output_device=device_id0,
    broadcast_buffers=False,
)

where model is the instance of the model class that you want to train in a distributed manner.

cpu()[source]

Moves all model parameters and buffers to the CPU.

Returns:self
Return type:Module
eval(stage=<Stage.TEST: 'Test'>)[source]

Override to set stage

load_state_dict(*args, **kwargs)[source]

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Parameters:
  • state_dict (dict) – a dict containing parameters and persistent buffers.
  • strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
Returns:

  • missing_keys is a list of str containing the missing keys
  • unexpected_keys is a list of str containing the unexpected keys

Return type:

NamedTuple with missing_keys and unexpected_keys fields

state_dict(*args, **kwargs)[source]

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names.

Returns:a dictionary containing a whole state of the module
Return type:dict

Example:

>>> module.state_dict().keys()
['bias', 'weight']
train(mode=True)[source]

Override to set stage

pytext.models.doc_model module
class pytext.models.doc_model.ByteTokensDocumentModel(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.doc_model.DocModel

DocModel that receives both word IDs and byte IDs as inputs (concatenating word and byte-token embeddings to represent input tokens).

arrange_model_inputs(tensor_dict)[source]
classmethod create_embedding(config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
get_export_input_names(tensorizers)[source]
torchscriptify(tensorizers, traced_model)[source]
class pytext.models.doc_model.DocModel(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.model.Model

DocModel that’s compatible with the new Model abstraction, which is responsible for describing which inputs it expects and arranging its input tensors.

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
classmethod create_decoder(config: pytext.models.doc_model.DocModel.Config, representation_dim: int, num_labels: int)[source]
classmethod create_embedding(config: pytext.models.doc_model.DocModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
classmethod from_config(config: pytext.models.doc_model.DocModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
torchscriptify(tensorizers, traced_model)[source]
vocab_to_export(tensorizers)[source]
class pytext.models.doc_model.DocRegressionModel(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.doc_model.DocModel

Model that’s compatible with the new Model abstraction, and is configured for regression tasks (specifically for labels, predictions, and loss).

classmethod from_config(config: pytext.models.doc_model.DocRegressionModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
class pytext.models.doc_model.PersonalizedDocModel(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase, user_embedding: Optional[pytext.models.embeddings.embedding_base.EmbeddingBase] = None)[source]

Bases: pytext.models.doc_model.DocModel

DocModel that includes a user embedding which learns user features to produce personalized prediction. In this class, user-embedding is fed directly to the decoder (i.e., does not go through the encoders).

arrange_model_inputs(tensor_dict)[source]
forward(*inputs) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
get_export_input_names(tensorizers)[source]
torchscriptify(tensorizers, traced_model)[source]
vocab_to_export(tensorizers)[source]
pytext.models.joint_model module
class pytext.models.joint_model.IntentSlotModel(default_doc_loss_weight, default_word_loss_weight, *args, **kwargs)[source]

Bases: pytext.models.model.Model

A joint intent-slot model. It is framed as a model that performs document classification and word tagging, where the embedding and text representation layers are shared between the two tasks.

The supported representation layers are based on bidirectional LSTM or CNN.

It can be instantiated just like any other Model.

This model uses the new data handling design based on tensorizers; that is the difference between it and JointModel.

arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
classmethod create_embedding(config, tensorizers)[source]
classmethod from_config(config, tensorizers)[source]
get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
get_weights_context(tensor_dict)[source]
vocab_to_export(tensorizers)[source]
pytext.models.masked_lm module
class pytext.models.masked_lm.MaskedLanguageModel(encoder: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase, decoder: pytext.models.decoders.mlp_decoder.MLPDecoder, output_layer: pytext.models.output_layers.lm_output_layer.LMOutputLayer, token_tensorizer: pytext.data.bert_tensorizer.BERTTensorizerBase, vocab: pytext.data.utils.Vocabulary, mask_prob: float = 0.15, mask_bos: float = False, masking_strategy: pytext.models.masking_utils.MaskingStrategy = <MaskingStrategy.RANDOM: 'random'>, stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>)[source]

Bases: pytext.models.model.BaseModel

Masked language model for BERT style pre-training.

SUPPORT_FP16_OPTIMIZER = True
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
forward(*inputs) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.masked_lm.MaskedLanguageModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
pytext.models.masking_utils module
class pytext.models.masking_utils.MaskingStrategy[source]

Bases: enum.Enum

An enumeration.

FREQUENCY = 'frequency_based'
RANDOM = 'random'
pytext.models.masking_utils.frequency_based_masking(tokens: torch.Tensor, token_sampling_weights: numpy.ndarray, mask_prob: float) → torch.Tensor[source]

Function to mask tokens based on frequency.

Inputs:
  1. tokens: Tensor with token ids of shape (batch_size x seq_len)
  2. token_sampling_weights: numpy array with shape (batch_size x seq_len)
    and each element representing the sampling weight associated with the corresponding token in tokens
  3. mask_prob: Probability of masking a particular token
Outputs:
mask: Tensor with same shape as input tokens (batch_size x seq_len)
with masked tokens represented by a 1 and everything else as 0.
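
One plausible reading of frequency-based masking is sketched below: mask a fixed share of positions, drawn without replacement, with probability proportional to each position's sampling weight. The selection rule, the rounding of the number of masked positions, and the helper name frequency_mask are assumptions. The sketch uses nested lists and the standard library instead of tensors and NumPy.

```python
import random
from typing import List


def frequency_mask(
    tokens: List[List[int]],
    weights: List[List[float]],
    mask_prob: float,
    rng: random.Random,
) -> List[List[int]]:
    """Mask about mask_prob of all positions, favoring high-weight positions."""
    positions = [(r, c) for r, row in enumerate(tokens) for c in range(len(row))]
    flat_weights = [weights[r][c] for r, c in positions]
    mask = [[0] * len(row) for row in tokens]
    for _ in range(int(len(positions) * mask_prob)):
        if sum(flat_weights) <= 0:
            break  # nothing left that may be masked
        idx = rng.choices(range(len(positions)), weights=flat_weights, k=1)[0]
        r, c = positions[idx]
        mask[r][c] = 1
        flat_weights[idx] = 0.0  # sample without replacement
    return mask


tokens = [[5, 6, 7, 8], [9, 10, 11, 12]]
weights = [[1.0, 0.0, 1.0, 2.0], [0.0, 3.0, 1.0, 1.0]]
mask = frequency_mask(tokens, weights, mask_prob=0.5, rng=random.Random(0))
print(sum(map(sum, mask)))  # 4
```

Positions with a sampling weight of zero are never selected in this sketch.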
pytext.models.masking_utils.random_masking(tokens: torch.Tensor, mask_prob: float) → torch.Tensor[source]

Function to mask tokens randomly.

Inputs:
  1. tokens: Tensor with token ids of shape (batch_size x seq_len)
  2. mask_prob: Probability of masking a particular token
Outputs:
mask: Tensor with same shape as input tokens (batch_size x seq_len)
with masked tokens represented by a 1 and everything else as 0.
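
A minimal sketch of random masking on nested lists follows, assuming each position is masked independently with probability mask_prob. Whether the actual function samples independently per position or masks a fixed number of positions is not stated above, so this is one possible reading. The helper name random_mask is hypothetical.

```python
import random
from typing import List


def random_mask(
    tokens: List[List[int]], mask_prob: float, rng: random.Random
) -> List[List[int]]:
    """Return a 0/1 mask with the same shape as tokens."""
    return [[1 if rng.random() < mask_prob else 0 for _ in row] for row in tokens]


tokens = [[5, 6, 7, 8], [9, 10, 11, 12]]
mask = random_mask(tokens, mask_prob=0.15, rng=random.Random(0))
print([len(row) for row in mask])  # [4, 4]
```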
pytext.models.model module
class pytext.models.model.BaseModel(stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>)[source]

Bases: torch.nn.modules.module.Module, pytext.config.component.Component

Base model class which inherits from nn.Module. Also has a stage flag to indicate it’s in train, eval, or test stage. This is because the built-in train/eval flag in PyTorch can’t distinguish eval and test, which is required to support some use cases.

SUPPORT_FP16_OPTIMIZER = False
arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
contextualize(context)[source]

Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by DisjointMultitaskModel for changing the task that should be trained with a given iterator.

eval(stage=<Stage.TEST: 'Test'>)[source]

Override to explicitly maintain the stage (train, eval, test).

get_loss(logit, target, context)[source]
get_param_groups_for_optimizer() → List[Dict[str, List[torch.nn.parameter.Parameter]]][source]

Returns a list of parameter groups of the format {“params”: param_list}. The parameter groups loosely correspond to layers and are ordered from low to high. Currently, only the embedding layer can provide multiple param groups, and other layers are put into one param group. The output of this method is passed to the optimizer so that schedulers can change learning rates by layer.
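
Because the groups are ordered from low to high, a scheduler can derive a per-group learning rate from a group's position in the list. The sketch below shows one such scheme, a geometric decay toward the lower layers. The helper name, the decay rule, and the use of strings in place of parameters are all illustrative assumptions. The "lr" key follows the usual PyTorch optimizer convention for per-group options.

```python
from typing import Dict, List


def layerwise_learning_rates(
    param_groups: List[Dict[str, list]], base_lr: float, decay: float
) -> List[dict]:
    """Give the highest group base_lr and scale each lower group by decay."""
    top = len(param_groups) - 1
    return [
        dict(group, lr=base_lr * decay ** (top - index))
        for index, group in enumerate(param_groups)
    ]


groups = [
    {"params": ["word_embedding"]},        # lowest layer
    {"params": ["char_embedding"]},
    {"params": ["representation", "decoder"]},  # everything else
]
for group in layerwise_learning_rates(groups, base_lr=0.1, decay=0.5):
    print(group["params"], group["lr"])
```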

get_pred(logit, target=None, context=None, *args)[source]
prepare_for_onnx_export_(**kwargs)[source]

Make model exportable via ONNX trace.

quantize()[source]

Quantize the model during export.

save_modules(base_path: str = '', suffix: str = '')[source]

Save each sub-module in separate files for reusing later.

train(mode=True)[source]

Override to explicitly maintain the stage (train, eval, test).

classmethod train_batch(model, batch, state=None)[source]
class pytext.models.model.Model(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.model.BaseModel

Generic single-task model class that expects four components:

  1. Embedding
  2. Representation
  3. Decoder
  4. Output Layer

Forward pass: embedding -> representation -> decoder -> output_layer

These four components have specific responsibilities as described below.

Embedding layer should implement the way to represent each token in the input text. It can be as simple as just token/word embedding or can be composed of multiple ways to represent a token, e.g., word embedding, character embedding, etc.

Representation layer should implement the way to encode the entire input text such that the output vector(s) can be used by the decoder to produce logits. There is no restriction on the number of inputs it should encode. There is also no restriction on the number of ways to encode the input.

Decoder layer should implement the way to consume the output of model’s representation and produce logits that can be used by the output layer to compute loss or generate predictions (and prediction scores/confidence)

Output layer should implement the way loss computation is done as well as the logic to generate predictions from the logits.

Let us discuss the joint intent-slot model as a case to go over these layers. The model predicts intent of input utterance and the slots in the utterance. (Refer to Train Intent-Slot model on ATIS Dataset for details about intent-slot model.)

  1. EmbeddingList layer is tasked with representing tokens. To do so we can use a learnable word embedding table in conjunction with a learnable character embedding table whose outputs are distilled to a token level representation using CNN and pooling. Note: This class is meant to be reused by all models. It acts as a container of all the different ways of representing a token/word.
  2. BiLSTMDocSlotAttention is tasked with encoding the embedded input string for intent classification and slot filling. In order to do that it has a shared bidirectional LSTM layer followed by separate attention layers for document level attention and word level attention. Finally it produces two vectors per utterance.
  3. IntentSlotModelDecoder accepts the two input vectors from BiLSTMDocSlotAttention and produces logits for intent classification and slot filling. Conditioned on a flag it can also use the probabilities from intent classification for slot filling.
  4. IntentSlotOutputLayer implements the logic behind computing loss and prediction, as well as how to export this layer to Caffe2. This is used by the model exporter as a post-processing Caffe2 operator.
Parameters:
  • embedding (EmbeddingBase) – Module that represents each token in the input text.
  • representation (RepresentationBase) – Module that encodes the entire embedded input text.
  • decoder (DecoderBase) – Module that consumes the representation output and produces logits.
  • output_layer (OutputLayerBase) – Module that computes the loss and generates predictions from the logits.
embedding
representation
decoder
output_layer
classmethod compose_embedding(sub_emb_module_dict: Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase], metadata) → pytext.models.embeddings.embedding_list.EmbeddingList[source]

Default implementation is to compose an instance of EmbeddingList with all the sub-embedding modules. You should override this class method if you want to implement a specific way to embed tokens/words.

Parameters:sub_emb_module_dict (Dict[str, EmbeddingBase]) – Named dictionary of embedding modules, each of which implements a way to embed/encode a token.
Returns:An instance of EmbeddingList.
Return type:EmbeddingList
classmethod create_embedding(feat_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata)[source]
classmethod create_sub_embs(emb_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata) → Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase][source]

Creates the embedding modules defined in the emb_config.

Parameters:
  • emb_config (FeatureConfig) – Object containing all the sub-embedding configurations.
  • metadata (CommonMetadata) – Object containing features and label metadata.
Returns:

Named dictionary of embedding modules.

Return type:

Dict[str, EmbeddingBase]

forward(*inputs) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.model.Model.Config, feat_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata)[source]
class pytext.models.model.ModelInputBase(**kwargs)[source]

Bases: pytext.config.pytext_config.ConfigBase

Base class for model inputs.

class pytext.models.model.ModelInputMeta[source]

Bases: pytext.config.pytext_config.ConfigBaseMeta

pytext.models.module module
class pytext.models.module.Module(config=None)[source]

Bases: torch.nn.modules.module.Module, pytext.config.component.Component

Generic module class that serves as base class for all PyText modules.

Parameters:config (type) – Module’s config object. The specific contents of this object depend on the module. Defaults to None.
freeze() → None[source]
pytext.models.module.create_module(module_config, *args, create_fn=<function _create_module_from_registry>, **kwargs)[source]

Create a module object given the module’s config object. It depends on the global shared module registry, so your module must be available to the registry. This means that your module must be imported somewhere in the code path during module creation (ideally in your model class) for the module to be visible to the registry.

Parameters:
  • module_config (type) – Module config object.
  • create_fn (type) – The function to use for creating the module. Use this parameter if your module creation requires custom code and pass your function here. Defaults to _create_module_from_registry().
Returns:

The module object created from module_config.

Return type:

type
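
The registry-based creation described above can be illustrated with a small, self-contained sketch. Everything here (the _REGISTRY dict, the register decorator, create_from_registry, ToyDecoder and its config) is hypothetical and only mimics the idea of looking a module up by its config type, with a create_fn hook for custom creation code; it is not PyText's actual registry implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical registry: maps a config class to the module class it builds.
_REGISTRY: Dict[type, type] = {}


def register(config_cls: type) -> Callable[[type], type]:
    """Class decorator that records a module class under its config class."""
    def decorator(module_cls: type) -> type:
        _REGISTRY[config_cls] = module_cls
        return module_cls
    return decorator


def create_from_registry(config: Any, *args: Any, **kwargs: Any) -> Any:
    """Default creation path: find the module class by the config's type."""
    try:
        module_cls = _REGISTRY[type(config)]
    except KeyError:
        raise LookupError(
            f"{type(config).__name__} is not registered; was its module imported?"
        ) from None
    return module_cls(config, *args, **kwargs)


def create(config: Any, *args: Any, create_fn=create_from_registry, **kwargs: Any) -> Any:
    """Callers may pass their own create_fn when creation needs custom code."""
    return create_fn(config, *args, **kwargs)


@dataclass
class ToyDecoderConfig:
    hidden_dim: int = 8


@register(ToyDecoderConfig)  # registration happens when this code is imported
class ToyDecoder:
    def __init__(self, config: ToyDecoderConfig, in_dim: int) -> None:
        self.shape = (in_dim, config.hidden_dim)


decoder = create(ToyDecoderConfig(hidden_dim=4), 16)
```

Because registration runs as a side effect of importing the module that defines ToyDecoder, a class that is never imported is never registered, which is the reason for the import requirement stated above.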

pytext.models.pair_classification_model module
class pytext.models.pair_classification_model.BasePairwiseModel(decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase, encode_relations: bool)[source]

Bases: pytext.models.model.BaseModel

A base classification model that scores a pair of texts.

Subclasses need to implement from_config, forward, and save_modules.

forward(input1: Tuple[torch.Tensor, ...], input2: Tuple[torch.Tensor, ...])[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.pair_classification_model.BasePairwiseModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
save_modules(base_path: str = '', suffix: str = '')[source]

Save each sub-module in separate files for reusing later.

class pytext.models.pair_classification_model.PairwiseModel(embeddings: torch.nn.modules.container.ModuleList, representations: torch.nn.modules.container.ModuleList, decoder: pytext.models.decoders.mlp_decoder.MLPDecoder, output_layer: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer, encode_relations: bool)[source]

Bases: pytext.models.pair_classification_model.BasePairwiseModel

A classification model that scores a pair of texts, for example, a model for natural language inference.

The model shares embedding space (so it doesn’t support pairs of texts where left and right are in different languages). It uses bidirectional LSTM or CNN to represent the two documents, and concatenates them along with their absolute difference and elementwise product. This concatenated pair representation is passed to a multi-layer perceptron to decode to label/target space.

See https://arxiv.org/pdf/1705.02364.pdf for more details.

It can be instantiated just like any other Model.
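
As a rough illustration of the pair representation described above, the following sketch builds the concatenated feature vector from two already-encoded documents. Plain Python lists stand in for tensors, the function name pair_features is hypothetical, and the ordering of the four parts is an assumption.

```python
from typing import List


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate u, v, |u - v| and the elementwise product u * v."""
    if len(u) != len(v):
        raise ValueError("both document representations must have the same size")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return u + v + abs_diff + product


# Two toy document representations of size 2 give a pair vector of size 8.
features = pair_features([1.0, -2.0], [3.0, 0.5])
```

The resulting vector is what would be handed to the multi-layer perceptron decoder; its size is four times the size of a single document representation.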

EMBEDDINGS = ['embedding']
INPUTS_PAIR = [['tokens1'], ['tokens2']]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
forward(input1: Tuple[torch.Tensor, ...], input2: Tuple[torch.Tensor, ...]) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.pair_classification_model.PairwiseModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
save_modules(base_path: str = '', suffix: str = '')[source]

Save each sub-module in separate files for reusing later.

pytext.models.query_document_pairwise_ranking_model module
class pytext.models.query_document_pairwise_ranking_model.QueryDocPairwiseRankingModel(embeddings: torch.nn.modules.container.ModuleList, representations: torch.nn.modules.container.ModuleList, decoder: pytext.models.decoders.mlp_decoder.MLPDecoder, output_layer: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer, encode_relations: bool)[source]

Bases: pytext.models.pair_classification_model.PairwiseModel

Pairwise ranking model. This model takes in a query and two responses (pos_response and neg_response). It passes representations of the query and the two responses to a decoder. pos_response should be ranked higher than neg_response; this is ensured by training with a ranking hinge loss function.
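
A ranking hinge loss can be sketched as follows. This is a generic margin-based formulation written for illustration; the function name, the scalar scores, and the margin value of 1.0 are assumptions, not values taken from PyText.

```python
def ranking_hinge_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Zero once pos_score beats neg_score by at least `margin`, linear otherwise."""
    return max(0.0, margin - (pos_score - neg_score))


well_ranked = ranking_hinge_loss(pos_score=2.0, neg_score=0.5)   # margin satisfied
mis_ranked = ranking_hinge_loss(pos_score=0.5, neg_score=2.0)    # wrong order
```

Minimizing this loss pushes the score of pos_response above the score of neg_response until the margin is satisfied, after which the pair no longer contributes to the gradient.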

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
forward(pos_response: Tuple[torch.Tensor, torch.Tensor], neg_response: Tuple[torch.Tensor, torch.Tensor], query: Tuple[torch.Tensor, torch.Tensor]) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.query_document_pairwise_ranking_model.QueryDocPairwiseRankingModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
pytext.models.roberta module
class pytext.models.roberta.RoBERTa(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]

Bases: pytext.models.bert_classification_models.NewBertModel

torchscriptify(tensorizers, traced_model)[source]

Using the traced model, create a ScriptModule which has a nicer API that includes generating tensors from simple data types, and returns classified values according to the output layer (e.g., as a dict mapping class name to score).

class pytext.models.roberta.RoBERTaEncoder(config: pytext.models.roberta.RoBERTaEncoder.Config, output_encoded_layers: bool, **kwarg)[source]

Bases: pytext.models.roberta.RoBERTaEncoderBase

A PyTorch RoBERTa implementation

class pytext.models.roberta.RoBERTaEncoderBase(config: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase.Config, output_encoded_layers=False, *args, **kwargs)[source]

Bases: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase

class pytext.models.roberta.RoBERTaEncoderJit(config: pytext.models.roberta.RoBERTaEncoderJit.Config, output_encoded_layers: bool, **kwarg)[source]

Bases: pytext.models.roberta.RoBERTaEncoderBase

A TorchScript RoBERTa implementation

class pytext.models.roberta.RoBERTaWordTaggingModel(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]

Bases: pytext.models.model.BaseModel

Single Sentence Token-level Classification Model using RoBERTa.

arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
forward(encoder_inputs: Tuple[torch.Tensor, ...], *args) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.roberta.RoBERTaWordTaggingModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
pytext.models.roberta.init_params(module)[source]

Initialize the RoBERTa weights for pre-training from scratch.

pytext.models.word_model module
class pytext.models.word_model.WordTaggingLiteModel(*args, **kwargs)[source]

Bases: pytext.models.word_model.WordTaggingModel

Also a word tagging model, but uses bytes as inputs to the model. By using bytes instead of words, the model does not need to store a word embedding table mapping words in the vocab to their embedding vector representations, but instead computes them on the fly using CharacterEmbedding. This produces an exported/serialized model that requires much less storage space as well as less memory during run/inference time.

arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
classmethod create_embedding(config, tensorizers)[source]
get_export_input_names(tensorizers)[source]
vocab_to_export(tensorizers)[source]
class pytext.models.word_model.WordTaggingModel(*args, **kwargs)[source]

Bases: pytext.models.model.Model

Word tagging model. It can be used for any task that requires predicting the tag for a word/token. For example, the following tasks can be modeled as word tagging tasks. This is not an exhaustive list.

  1. Part of speech tagging.
  2. Named entity recognition.
  3. Slot filling for task oriented dialog.

It can be instantiated just like any other Model.

arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
classmethod create_embedding(config, tensorizers)[source]
classmethod from_config(config, tensorizers)[source]
get_export_input_names(tensorizers)[source]
get_export_output_names(tensorizers)[source]
vocab_to_export(tensorizers)[source]
Module contents
class pytext.models.Model(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]

Bases: pytext.models.model.BaseModel

Generic single-task model class that expects four components:

  1. Embedding
  2. Representation
  3. Decoder
  4. Output Layer

Forward pass: embedding -> representation -> decoder -> output_layer

These four components have specific responsibilities as described below.

Embedding layer should implement the way to represent each token in the input text. It can be as simple as just token/word embedding or can be composed of multiple ways to represent a token, e.g., word embedding, character embedding, etc.

Representation layer should implement the way to encode the entire input text such that the output vector(s) can be used by decoder to produce logits. There is no restriction on the number of inputs it should encode. There is also no restriction on the number of ways to encode input.

Decoder layer should implement the way to consume the output of model’s representation and produce logits that can be used by the output layer to compute loss or generate predictions (and prediction scores/confidence).

Output layer should implement the way loss computation is done as well as the logic to generate predictions from the logits.

Let us use the joint intent-slot model as a case study to go over these layers. The model predicts the intent of an input utterance and the slots in that utterance. (Refer to Train Intent-Slot model on ATIS Dataset for details about the intent-slot model.)

  1. EmbeddingList layer is tasked with representing tokens. To do so we can use a learnable word embedding table in conjunction with a learnable character embedding table whose outputs are distilled to a token level representation using CNN and pooling. Note: This class is meant to be reused by all models. It acts as a container of all the different ways of representing a token/word.
  2. BiLSTMDocSlotAttention is tasked with encoding the embedded input string for intent classification and slot filling. In order to do that it has a shared bidirectional LSTM layer followed by separate attention layers for document level attention and word level attention. Finally it produces two vectors per utterance.
  3. IntentSlotModelDecoder accepts the two input vectors from BiLSTMDocSlotAttention and produces logits for intent classification and slot filling. Conditioned on a flag it can also use the probabilities from intent classification for slot filling.
  4. IntentSlotOutputLayer implements the logic behind computing loss and prediction, as well as how to export this layer to Caffe2. This is used by the model exporter as a post-processing Caffe2 operator.
Parameters:
  • embedding (EmbeddingBase) – Module that represents each token in the input text.
  • representation (RepresentationBase) – Module that encodes the entire embedded input text.
  • decoder (DecoderBase) – Module that consumes the representation output and produces logits.
  • output_layer (OutputLayerBase) – Module that computes the loss and generates predictions from the logits.
embedding
representation
decoder
output_layer
classmethod compose_embedding(sub_emb_module_dict: Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase], metadata) → pytext.models.embeddings.embedding_list.EmbeddingList[source]

Default implementation is to compose an instance of EmbeddingList with all the sub-embedding modules. You should override this class method if you want to implement a specific way to embed tokens/words.

Parameters:sub_emb_module_dict (Dict[str, EmbeddingBase]) – Named dictionary of embedding modules, each of which implements a way to embed/encode a token.
Returns:An instance of EmbeddingList.
Return type:EmbeddingList
classmethod create_embedding(feat_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata)[source]
classmethod create_sub_embs(emb_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata) → Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase][source]

Creates the embedding modules defined in the emb_config.

Parameters:
  • emb_config (FeatureConfig) – Object containing all the sub-embedding configurations.
  • metadata (CommonMetadata) – Object containing features and label metadata.
Returns:

Named dictionary of embedding modules.

Return type:

Dict[str, EmbeddingBase]

forward(*inputs) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.models.model.Model.Config, feat_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata)[source]
class pytext.models.BaseModel(stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>)[source]

Bases: torch.nn.modules.module.Module, pytext.config.component.Component

Base model class which inherits from nn.Module. Also has a stage flag to indicate it’s in train, eval, or test stage. This is because the built-in train/eval flag in PyTorch can’t distinguish eval and test, which is required to support some use cases.

SUPPORT_FP16_OPTIMIZER = False
arrange_model_context(tensor_dict)[source]
arrange_model_inputs(tensor_dict)[source]
arrange_targets(tensor_dict)[source]
caffe2_export(tensorizers, tensor_dict, path, export_onnx_path=None)[source]
contextualize(context)[source]

Add additional context into the model. context can be anything that helps maintain or update state. For example, it is used by DisjointMultitaskModel for changing the task that should be trained with a given iterator.

eval(stage=<Stage.TEST: 'Test'>)[source]

Override to explicitly maintain the stage (train, eval, test).

get_loss(logit, target, context)[source]
get_param_groups_for_optimizer() → List[Dict[str, List[torch.nn.parameter.Parameter]]][source]

Returns a list of parameter groups of the format {“params”: param_list}. The parameter groups loosely correspond to layers and are ordered from low to high. Currently, only the embedding layer can provide multiple param groups, and other layers are put into one param group. The output of this method is passed to the optimizer so that schedulers can change learning rates by layer.
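
To show why the {“params”: param_list} format is convenient for per-layer learning rates, here is a minimal sketch. The helper assign_layerwise_lr, the geometric decay rule, and the parameter names are all hypothetical; strings stand in for torch.nn.Parameter objects so the example runs without PyTorch.

```python
from typing import Any, Dict, List

ParamGroup = Dict[str, Any]


def assign_layerwise_lr(
    param_groups: List[ParamGroup], base_lr: float, decay: float
) -> List[ParamGroup]:
    """Return copies of the groups with an 'lr' that shrinks toward low layers."""
    top = len(param_groups) - 1
    return [
        dict(group, lr=base_lr * decay ** (top - index))
        for index, group in enumerate(param_groups)
    ]


# Groups are ordered from low to high, as described above.
groups = [
    {"params": ["word_embedding.weight"]},
    {"params": ["char_embedding.weight"]},
    {"params": ["representation.weight", "decoder.weight"]},
]
scheduled = assign_layerwise_lr(groups, base_lr=0.1, decay=0.5)
learning_rates = [group["lr"] for group in scheduled]
```

PyTorch optimizers accept a list of such dictionaries, and any extra key such as lr overrides the optimizer default for that group only.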

get_pred(logit, target=None, context=None, *args)[source]
prepare_for_onnx_export_(**kwargs)[source]

Make model exportable via ONNX trace.

quantize()[source]

Quantize the model during export.

save_modules(base_path: str = '', suffix: str = '')[source]

Save each sub-module in separate files for reusing later.

train(mode=True)[source]

Override to explicitly maintain the stage (train, eval, test).

classmethod train_batch(model, batch, state=None)[source]

pytext.optimizer package

Subpackages
pytext.optimizer.sparsifiers package
Submodules
pytext.optimizer.sparsifiers.blockwise_sparsifier module
class pytext.optimizer.sparsifiers.blockwise_sparsifier.BlockwiseMagnitudeSparsifier(sparsity, starting_epoch, frequency, block_size, columnwise_blocking, accumulate_mask, layerwise_pruning)[source]

Bases: pytext.optimizer.sparsifiers.sparsifier.L0_projection_sparsifier

Runs blockwise magnitude-based sparsification.

Parameters:
  • block_size – define the size of each block
  • columnwise_blocking – define columnwise block if true
  • starting_epoch – sparsification_condition returns true only after starting_epoch
  • frequency – sparsification_condition returns true only when the number of steps is a multiple of frequency
  • accumulate_mask – if true, the mask after each .sparsify() will be reused
  • sparsity – percentage of zeros among the UNPRUNED parameters.

Examples of how the sparsifier works:

2D matrix:

  [
    0  1  2  3  4
    5  6  7  8  9
   10 11 12 13 14
   15 16 17 18 19
   20 21 22 23 24
  ]

Define 3 X 1 blocks over the matrix, compute the l1 norm of each block, and sort the blocks. Retain the blocks with the largest absolute values until the sparsity threshold is met.
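
The blockwise procedure described above can be walked through on the 5 x 5 example matrix with a pure-Python sketch. This is an illustration under stated assumptions, not the library's implementation: blockwise_mask is a hypothetical helper, blocks are taken along each row with a shorter trailing block, and sparsity is interpreted as a fraction of all entries in this single matrix.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def blockwise_mask(matrix: Matrix, block_size: int, sparsity: float) -> List[List[int]]:
    """Return a 0/1 mask that zeroes the row-wise blocks with the smallest l1 norm."""
    blocks: List[Tuple[float, int, int, int]] = []  # (l1 norm, row, start, stop)
    for r, row in enumerate(matrix):
        for start in range(0, len(row), block_size):
            stop = min(start + block_size, len(row))
            blocks.append((sum(abs(x) for x in row[start:stop]), r, start, stop))
    blocks.sort(key=lambda block: block[0])  # weakest blocks first

    total = sum(len(row) for row in matrix)
    target = round(sparsity * total)  # number of entries to prune
    mask = [[1] * len(row) for row in matrix]
    pruned = 0
    for _, r, start, stop in blocks:
        if pruned >= target:
            break
        for c in range(start, stop):
            mask[r][c] = 0
        pruned += stop - start
    return mask


matrix = [[float(5 * r + c) for c in range(5)] for r in range(5)]
mask = blockwise_mask(matrix, block_size=3, sparsity=0.4)
```

With these inputs the four blocks with the smallest l1 norms are the two blocks of the first row and the two blocks of the second row, so those ten entries are masked out and the remaining rows are retained.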
classmethod from_config(config: pytext.optimizer.sparsifiers.blockwise_sparsifier.BlockwiseMagnitudeSparsifier.Config)[source]
get_current_sparsity(model)[source]
get_masks(model: torch.nn.modules.module.Module, pre_masks: List[torch.Tensor] = None) → List[torch.Tensor][source]

Note: this function only returns the masks; it does not sparsify or modify the weights.

Prunes x% of the weights among those marked “1” in pre_masks.

Parameters:
  • model – Model
  • pre_masks – list of FloatTensors where “1” means the weight is retained and “0” means the weight is pruned
Returns:

masks, the intersection of the new masks and pre_masks, so that an entry is “1” only if the weight is selected by both the new masking and pre_masks

Return type:

List[torch.Tensor]

get_sparsifiable_params(model, requires_name=False)[source]
pytext.optimizer.sparsifiers.sparsifier module
class pytext.optimizer.sparsifiers.sparsifier.CRF_L1_SoftThresholding(lambda_l1: float, starting_epoch: int, frequency: int)[source]

Bases: pytext.optimizer.sparsifiers.sparsifier.CRF_SparsifierBase

Implements l1 regularization:
min Loss(x, y, CRFparams) + lambda_l1 * ||CRFparams||_1

and solves the optimization problem via a (stochastic) proximal gradient-based method, i.e., soft-thresholding:

param_updated = sign(CRFparams) * max ( abs(CRFparams) - lambda_l1, 0)
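
The soft-thresholding update above can be checked on a few scalar values. The sketch below applies the formula elementwise to plain Python floats; the function name soft_threshold and the sample values are made up for illustration.

```python
from typing import List


def soft_threshold(x: float, lambda_l1: float) -> float:
    """sign(x) * max(abs(x) - lambda_l1, 0), applied to a single value."""
    sign = (x > 0) - (x < 0)
    return sign * max(abs(x) - lambda_l1, 0.0)


params: List[float] = [2.0, -1.5, 0.25, -0.5]
updated = [soft_threshold(p, lambda_l1=0.5) for p in params]
```

Values whose magnitude is at most lambda_l1 become exactly zero, and all other values move toward zero by lambda_l1, which is how the update produces sparse parameters.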

classmethod from_config(config: pytext.optimizer.sparsifiers.sparsifier.CRF_L1_SoftThresholding.Config)[source]
sparsify(state)[source]
class pytext.optimizer.sparsifiers.sparsifier.CRF_MagnitudeThresholding(sparsity, starting_epoch, frequency, grouping)[source]

Bases: pytext.optimizer.sparsifiers.sparsifier.CRF_SparsifierBase

Magnitude-based (equivalent to projection onto the l0 constraint set) sparsification of the CRF transition matrix. Preserves the top-k elements, either rowwise or columnwise, until the sparsity constraint is met.

classmethod from_config(config: pytext.optimizer.sparsifiers.sparsifier.CRF_MagnitudeThresholding.Config)[source]
sparsify(state)[source]
class pytext.optimizer.sparsifiers.sparsifier.CRF_SparsifierBase(config=None, *args, **kwargs)[source]

Bases: pytext.optimizer.sparsifiers.sparsifier.Sparsifier

get_sparsifiable_params(model: torch.nn.modules.module.Module)[source]
get_transition_sparsity(transition)[source]
sparsification_condition(state)[source]
class pytext.optimizer.sparsifiers.sparsifier.L0_projection_sparsifier(sparsity, starting_epoch, frequency, layerwise_pruning=True, accumulate_mask=False)[source]

Bases: pytext.optimizer.sparsifiers.sparsifier.Sparsifier

L0 projection-based (unstructured) sparsification

Parameters:
  • weights (torch.Tensor) – input weight matrix
  • sparsity (float32) – the desired sparsity [0-1]
apply_masks(model: pytext.models.model.Model, masks: List[torch.Tensor])[source]

Apply the given masks to zero out learnable weights in the model.

classmethod from_config(config: pytext.optimizer.sparsifiers.sparsifier.L0_projection_sparsifier.Config)[source]
get_masks(model: pytext.models.model.Model, pre_masks: List[torch.Tensor] = None) → List[torch.Tensor][source]

Note: this function only returns the masks; it does not sparsify or modify the weights.

Prunes x% of the weights among those marked “1” in pre_masks.

Parameters:
  • model – Model
  • pre_masks – list of FloatTensors where “1” means the weight is retained and “0” means the weight is pruned
Returns:

masks, the intersection of the new masks and pre_masks, so that an entry is “1” only if the weight is selected by both the new masking and pre_masks

Return type:

List[torch.Tensor]
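
The interaction between a new magnitude mask and pre_masks can be sketched on a flat list of weights. This is a simplified illustration: magnitude_mask is a hypothetical helper, a single list stands in for the per-parameter tensors, and rounding the number of pruned weights is an assumption.

```python
from typing import List, Optional


def magnitude_mask(
    weights: List[float], sparsity: float, pre_mask: Optional[List[int]] = None
) -> List[int]:
    """Prune the smallest-magnitude weights among those still marked 1."""
    if pre_mask is None:
        pre_mask = [1] * len(weights)
    alive = [i for i, m in enumerate(pre_mask) if m == 1]
    n_prune = round(sparsity * len(alive))
    # Indices of the n_prune smallest-magnitude weights that are still alive.
    to_prune = set(sorted(alive, key=lambda i: abs(weights[i]))[:n_prune])
    # Intersection: 1 only if retained before and retained by the new masking.
    return [1 if m == 1 and i not in to_prune else 0 for i, m in enumerate(pre_mask)]


weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3]
pre_mask = [1, 1, 0, 1, 1, 1]  # index 2 was pruned in an earlier round
new_mask = magnitude_mask(weights, sparsity=0.4, pre_mask=pre_mask)
```

Only the mask is computed here; the weights list is left untouched, matching the note that masks are returned without modifying the weights.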

get_sparsifiable_params(model: pytext.models.model.Model)[source]
sparsification_condition(state)[source]
sparsify(state)[source]

Obtain a mask and apply it to sparsify the model.

class pytext.optimizer.sparsifiers.sparsifier.Sparsifier(config=None, *args, **kwargs)[source]

Bases: pytext.config.component.Component

get_current_sparsity(model: pytext.models.model.Model) → float[source]
get_sparsifiable_params(*args, **kwargs)[source]
sparsification_condition(*args, **kwargs)[source]
sparsify(*args, **kwargs)[source]
Module contents
Submodules
pytext.optimizer.activations module
class pytext.optimizer.activations.GeLU[source]

Bases: torch.nn.modules.module.Module

Implements Gaussian Error Linear Units (GELUs).

Reference: Gaussian Error Linear Units (GELUs). Dan Hendrycks, Kevin Gimpel. Technical Report, 2017. https://arxiv.org/pdf/1606.08415.pdf
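
For reference, the exact GELU definition from the cited report, x * Phi(x) with Phi the standard normal CDF, can be written with the standard library alone. This is a scalar sketch for illustration; whether the class above uses this exact form or a tanh approximation is not stated here.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi expressed through the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


values = [gelu(x) for x in (-3.0, 0.0, 3.0)]
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero, which gives it a smooth ReLU-like shape.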

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.optimizer.activations.get_activation(name)[source]
pytext.optimizer.fairseq_fp16_utils module
class pytext.optimizer.fairseq_fp16_utils.Fairseq_FP16OptimizerMixin(*args, **kwargs)[source]

Bases: object

backward(loss)[source]

Computes the sum of gradients of the given tensor w.r.t. graph leaves.

Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.

classmethod build_fp32_params(params)[source]
clip_grad_norm(max_norm)[source]

Clips gradient norm and updates dynamic loss scaler.

load_state_dict(state_dict, optimizer_overrides=None)[source]

Load an optimizer state dict.

In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.

multiply_grads(c)[source]

Multiplies grads by a constant c.

state_dict()[source]

Return the optimizer’s state dict.

step(closure=None)[source]

Performs a single optimization step.

zero_grad()[source]

Clears the gradients of all optimized parameters.

class pytext.optimizer.fairseq_fp16_utils.Fairseq_MemoryEfficientFP16OptimizerMixin(*args, **kwargs)[source]

Bases: object

backward(loss)[source]

Computes the sum of gradients of the given tensor w.r.t. graph leaves.

Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.

clip_grad_norm(max_norm)[source]

Clips gradient norm and updates dynamic loss scaler.

load_state_dict(state_dict, optimizer_overrides=None)[source]

Load an optimizer state dict.

In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.

multiply_grads(c)[source]

Multiplies grads by a constant c.

state_dict()[source]

Return the optimizer’s state dict.

step(closure=None)[source]

Performs a single optimization step.

zero_grad()[source]

Clears the gradients of all optimized parameters.

pytext.optimizer.fp16_optimizer module
class pytext.optimizer.fp16_optimizer.DynamicLossScaler(init_scale, scale_factor, scale_window)[source]

Bases: object

check_overflow(params)[source]
check_overflow_(grad)[source]
unscale(grad)[source]
unscale_grads(param_groups)[source]
update_scale()[source]

According to overflow situation, adjust loss scale.

When an overflow happens, we decrease the scale by scale_factor. Setting a tolerance is another approach, depending on the case.

If there have been no overflows for scale_window consecutive steps, we increase the scale by scale_factor.
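
The scale adjustment rule above can be sketched as a small state machine. ToyLossScaler is a hypothetical class written for illustration; unlike the update_scale() documented above, it takes the overflow flag as an argument, and dividing or multiplying by scale_factor is an assumption about what decreasing or increasing "by scale_factor" means.

```python
from typing import List


class ToyLossScaler:
    def __init__(self, init_scale: float, scale_factor: float, scale_window: int) -> None:
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update_scale(self, overflow: bool) -> None:
        if overflow:
            # Overflow: shrink the scale and restart the counter.
            self.scale /= self.scale_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                # A full window without overflow: try a larger scale again.
                self.scale *= self.scale_factor
                self.good_steps = 0


scaler = ToyLossScaler(init_scale=65536.0, scale_factor=2.0, scale_window=3)
history: List[float] = []
for overflow in [True, False, False, False, False]:
    scaler.update_scale(overflow)
    history.append(scaler.scale)
```

In this run the single overflow halves the scale, and the scale is restored only after three consecutive steps without overflow.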

upscale(loss)[source]
class pytext.optimizer.fp16_optimizer.FP16Optimizer(fp32_optimizer)[source]

Bases: pytext.optimizer.optimizers.Optimizer

backward(loss)[source]
clip_grad_norm(max_norm, model)[source]
finalize() → bool[source]
load_state_dict(state_dict)[source]
param_groups
pre_export(model)[source]
state_dict()[source]
step(closure=None)[source]
zero_grad()[source]
class pytext.optimizer.fp16_optimizer.FP16OptimizerApex(fp32_optimizer: pytext.optimizer.optimizers.Optimizer, model: torch.nn.modules.module.Module, opt_level: str, init_loss_scale: Optional[int], min_loss_scale: Optional[float])[source]

Bases: pytext.optimizer.fp16_optimizer.FP16Optimizer

backward(loss)[source]
clip_grad_norm(max_norm, model)[source]
classmethod from_config(fp16_config: pytext.optimizer.fp16_optimizer.FP16OptimizerApex.Config, model: torch.nn.modules.module.Module, fp32_config: pytext.optimizer.optimizers.Optimizer.Config, *unused)[source]
load_state_dict(state_dict)[source]
pre_export(model)[source]
state_dict()[source]
step(closure=None)[source]
zero_grad()[source]
class pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated(init_optimizer, init_scale, scale_factor, scale_window)[source]

Bases: object

finalize()[source]
load_state_dict(state_dict)[source]
scale_loss(loss)[source]
state_dict()[source]
step()[source]

Realize weights update.

Update the grads from model to master. While iterating over the parameters, we check for overflow after converting the grads to float and copying them, and then unscale them.

If overflow doesn’t happen, call inner optimizer’s step() and copy back the updated weights from inner optimizer to model.

Update loss scale according to overflow checking result.
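
The control flow of such a step (check for overflow, unscale, then update or skip) can be sketched without any tensors. The function fp16_style_step is hypothetical, plain SGD stands in for the inner optimizer, and Python floats stand in for the float master copies.

```python
import math
from typing import List, Tuple


def fp16_style_step(
    weights: List[float], scaled_grads: List[float], loss_scale: float, lr: float
) -> Tuple[List[float], bool]:
    """Return (new_weights, overflowed); weights are unchanged on overflow."""
    grads = [float(g) for g in scaled_grads]  # copy grads to "master" precision
    if any(math.isinf(g) or math.isnan(g) for g in grads):
        return list(weights), True  # skip the update; caller lowers the scale
    unscaled = [g / loss_scale for g in grads]
    new_weights = [w - lr * g for w, g in zip(weights, unscaled)]
    return new_weights, False


ok_weights, ok_overflow = fp16_style_step([1.0, 2.0], [512.0, -1024.0], 1024.0, 0.5)
bad_weights, bad_overflow = fp16_style_step([1.0, 2.0], [math.inf, 1.0], 1024.0, 0.5)
```

The returned overflow flag is what a loss scaler would consume afterwards to decide whether to lower or raise the loss scale.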

zero_grad()[source]
class pytext.optimizer.fp16_optimizer.FP16OptimizerFairseq(fp16_params, fp32_optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]

Bases: fairseq.optim.fp16_optimizer._FP16OptimizerMixin, pytext.optimizer.fp16_optimizer.FP16Optimizer

Wrap an optimizer to support FP16 (mixed precision) training.

clip_grad_norm(max_norm, unused_model)[source]

Clips gradient norm and updates dynamic loss scaler.

classmethod from_config(fp16_config: pytext.optimizer.fp16_optimizer.FP16OptimizerFairseq.Config, model: torch.nn.modules.module.Module, fp32_config: pytext.optimizer.optimizers.Optimizer.Config, num_accumulated_batches: int)[source]
pre_export(model)[source]
class pytext.optimizer.fp16_optimizer.GeneratorFP16Optimizer(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]

Bases: pytext.optimizer.fp16_optimizer.PureFP16Optimizer

load_state_dict(state_dict)[source]

Load an optimizer state dict.

We prefer the configuration of the existing optimizer instance. After we load state dict to inner_optimizer, we create the copy of references of parameters again as in init().

step()[source]

Updates weights.

Effects:

Check for overflow. If there is none and inner_optimizer supports a memory-efficient step, unscale everything and call the memory-efficient step.

If inner_optimizer does not support it, change each parameter list in its param_groups into a generator of the tensors, then call the normal step; the data type conversion happens automatically in that function.

Whether or not an overflow occurred, the scale is updated as the last step.

class pytext.optimizer.fp16_optimizer.MemoryEfficientFP16OptimizerFairseq(fp16_params, optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]

Bases: fairseq.optim.fp16_optimizer._MemoryEfficientFP16OptimizerMixin, pytext.optimizer.fp16_optimizer.FP16Optimizer

Wrap the memory-efficient optimizer to support FP16 (mixed precision) training.

clip_grad_norm(max_norm, unused_model)[source]

Clips gradient norm and updates dynamic loss scaler.

classmethod from_config(fp16_config: pytext.optimizer.fp16_optimizer.MemoryEfficientFP16OptimizerFairseq.Config, model: torch.nn.modules.module.Module, fp32_config: pytext.optimizer.optimizers.Optimizer.Config, num_accumulated_batches: int)[source]
pre_export(model)[source]
class pytext.optimizer.fp16_optimizer.PureFP16Optimizer(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]

Bases: pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated

load_state_dict(state_dict)[source]

Load an optimizer state dict.

The configuration of the existing optimizer instance takes precedence. Applies the same logic as in init(): points the param_groups of the outer optimizer to those of the inner_optimizer.

scale_loss(loss)[source]

Scale the loss.

Parameters:loss (pytext.Loss) – loss function object
step()[source]

Updates the weights in the inner optimizer.

If the inner optimizer supports a memory-efficient step, checks for overflow, unscales, and calls the advanced step.

Otherwise, converts weights and grads to float and checks the grads for overflow during the iteration. If there is no overflow, unscales the grads and calls the inner optimizer’s step. If an overflow happens, does nothing except wait until the end, where weights and grads are converted back to half (the grads will be cleared in zero_grad).

pytext.optimizer.fp16_optimizer.convert_generator(params, scale)[source]

Create the generator for parameter tensors.

Each parameter is converted to float and unscaled before it is yielded. When the caller calls next() again, the parameter is converted back to half and processing of the next parameter starts.
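The generator behaviour can be sketched without tensors. FakeParam and convert_generator_sketch below are hypothetical stand-ins (a string tag replaces the real dtype, and the gradient is the value being unscaled, which is an assumption); the point is only that the code after yield runs when the consumer asks for the next item, so at most one parameter is in full precision at a time.

```python
class FakeParam:
    """Stand-in for a tensor: a value, a gradient and a precision tag."""

    def __init__(self, value, grad):
        self.value = value
        self.grad = grad
        self.dtype = "half"


def convert_generator_sketch(params, scale):
    for p in params:
        # Promote to full precision and undo the loss scaling.
        p.dtype = "float"
        p.grad = p.grad / scale
        yield p
        # Execution resumes here on the next call to next(): demote again.
        p.dtype = "half"
```

A consumer that iterates to exhaustion leaves every parameter back in half precision, because the generator body runs to completion.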

pytext.optimizer.fp16_optimizer.generate_params(param_groups)[source]
pytext.optimizer.fp16_optimizer.initialize(model, optimizer, opt_level, init_scale=65536, scale_factor=2.0, scale_window=2000, memory_efficient=False)[source]
pytext.optimizer.fp16_optimizer.master_params(optimizer)[source]
pytext.optimizer.fp16_optimizer.scale_loss(loss, optimizer, delay_unscale=False)[source]
pytext.optimizer.lamb module
class pytext.optimizer.lamb.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, min_trust=None)[source]

Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer

Implements the Lamb algorithm. This implementation was copied directly from pytorch/contrib (https://github.com/cybertronai/pytorch-lamb). The algorithm was proposed in “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).

Optionally supports minimum trust LAMB, as described in section 6.3 of “Single Headed Attention RNN: Stop Thinking With Your Head” (https://arxiv.org/abs/1911.11423).
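As a rough illustration of what a minimum trust option does, the sketch below computes a LAMB-style trust ratio (weight norm divided by update norm) and clamps it from below. The function name, the fallback to 1.0 for zero norms and the exact clamping rule are assumptions made for illustration; they are not taken from the PyText source.

```python
import math


def lamb_trust_ratio(weights, update, min_trust=None):
    """Layer-wise trust ratio with an optional lower bound (hypothetical helper)."""
    weight_norm = math.sqrt(sum(w * w for w in weights))
    update_norm = math.sqrt(sum(u * u for u in update))
    if weight_norm == 0.0 or update_norm == 0.0:
        # Assumed fallback: leave the step unscaled when a norm is zero.
        trust = 1.0
    else:
        trust = weight_norm / update_norm
    if min_trust is not None:
        # Minimum trust: never scale the step below this factor.
        trust = max(trust, min_trust)
    return trust
```

With min_trust left as None the ratio is used as computed, which matches the default in the constructor signature above.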

classmethod from_config(config: pytext.optimizer.lamb.Lamb.Config, model: torch.nn.modules.module.Module)[source]
step(closure=None)[source]

Performs a single optimization step.

Parameters:closure (callable, optional) – A closure that reevaluates the model and returns the loss.
pytext.optimizer.optimizers module
class pytext.optimizer.optimizers.Adagrad(parameters, lr, weight_decay)[source]

Bases: torch.optim.adagrad.Adagrad, pytext.optimizer.optimizers.Optimizer

classmethod from_config(config: pytext.optimizer.optimizers.Adagrad.Config, model: torch.nn.modules.module.Module)[source]
class pytext.optimizer.optimizers.Adam(parameters, lr, weight_decay, eps)[source]

Bases: torch.optim.adam.Adam, pytext.optimizer.optimizers.Optimizer

classmethod from_config(config: pytext.optimizer.optimizers.Adam.Config, model: torch.nn.modules.module.Module)[source]
class pytext.optimizer.optimizers.AdamW(parameters, lr, weight_decay, eps)[source]

Bases: torch.optim.adamw.AdamW, pytext.optimizer.optimizers.Optimizer

Adds PyText support for Decoupled Weight Decay Regularization for Adam, as described in the paper https://arxiv.org/abs/1711.05101. For more information, read the fast.ai blog post on this optimization method: https://www.fast.ai/2018/07/02/adam-weight-decay/
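The difference between L2 regularization and decoupled weight decay can be seen on a single scalar parameter. The sketch below is a textbook formulation under simplifying assumptions, not the torch.optim.AdamW code; all function names are hypothetical.

```python
import math


def _adam_update(p, grad, m, v, t, lr, betas, eps):
    """Plain Adam update for one scalar parameter (t starts at 1)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(p, grad, m, v, t, lr, weight_decay, betas=(0.9, 0.999), eps=1e-8):
    # L2 regularization: the penalty is folded into the gradient, so it is
    # rescaled by the adaptive denominator like any other gradient term.
    return _adam_update(p, grad + weight_decay * p, m, v, t, lr, betas, eps)


def adamw_step(p, grad, m, v, t, lr, weight_decay, betas=(0.9, 0.999), eps=1e-8):
    # Decoupled weight decay: shrink the weight directly, then apply Adam
    # to the unmodified gradient.
    p = p * (1 - lr * weight_decay)
    return _adam_update(p, grad, m, v, t, lr, betas, eps)
```

With a zero gradient, the decoupled variant shrinks the weight by exactly lr * weight_decay * p, whereas the L2 variant takes a first step of roughly lr regardless of how small weight_decay is, because the adaptive normalization cancels its magnitude.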

classmethod from_config(config: pytext.optimizer.optimizers.AdamW.Config, model: torch.nn.modules.module.Module)[source]
class pytext.optimizer.optimizers.Optimizer(config=None, *args, **kwargs)[source]

Bases: pytext.config.component.Component

backward(loss)[source]
clip_grad_norm(max_norm, model=None)[source]
finalize() → bool[source]
multiply_grads(c)[source]

Multiplies grads by a constant c.

params

Return an iterable of the parameters held by the optimizer.

pre_export(model)[source]
class pytext.optimizer.optimizers.SGD(parameters, lr, momentum)[source]

Bases: torch.optim.sgd.SGD, pytext.optimizer.optimizers.Optimizer

classmethod from_config(config: pytext.optimizer.optimizers.SGD.Config, model: torch.nn.modules.module.Module)[source]
pytext.optimizer.optimizers.learning_rates(optimizer)[source]
pytext.optimizer.radam module
class pytext.optimizer.radam.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]

Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer

Implements Rectified Adam (RAdam), as derived in the paper “On the Variance of the Adaptive Learning Rate and Beyond” (https://arxiv.org/abs/1908.03265).

This code is mostly a direct copy of the code provided by the authors: https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
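For orientation, the rectification term from the paper can be computed on its own. The sketch follows the formula as published, with the threshold of 4 on the approximated degrees of freedom; the referenced implementation may use a slightly different threshold, and radam_rectification is a hypothetical helper name.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return the rectification factor r_t, or None while it is undefined."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Variance of the adaptive learning rate is not tractable yet:
        # the update falls back to an un-adapted (momentum only) step.
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )
```

The factor is small in the first steps and approaches 1 as training progresses, which is what gives the method its built-in warmup behaviour.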

classmethod from_config(config: pytext.optimizer.radam.RAdam.Config, model: torch.nn.modules.module.Module)[source]
step(closure=None)[source]

Performs a single optimization step (parameter update).

Parameters:closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
pytext.optimizer.scheduler module
class pytext.optimizer.scheduler.BatchScheduler(config=None, *args, **kwargs)[source]

Bases: pytext.optimizer.scheduler.Scheduler

prepare(train_iter, total_epochs)[source]
class pytext.optimizer.scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)[source]

Bases: torch.optim.lr_scheduler.CosineAnnealingLR, pytext.optimizer.scheduler.BatchScheduler

Wrapper around torch.optim.lr_scheduler.CosineAnnealingLR. See the original documentation for more details.

classmethod from_config(config: pytext.optimizer.scheduler.CosineAnnealingLR.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
step_batch(metrics=None, epoch=None)[source]
class pytext.optimizer.scheduler.CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)[source]

Bases: torch.optim.lr_scheduler.CyclicLR, pytext.optimizer.scheduler.BatchScheduler

Wrapper around torch.optim.lr_scheduler.CyclicLR. See the original documentation for more details.

classmethod from_config(config: pytext.optimizer.scheduler.CyclicLR.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
step_batch(metrics=None, epoch=None)[source]
class pytext.optimizer.scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)[source]

Bases: torch.optim.lr_scheduler.ExponentialLR, pytext.optimizer.scheduler.Scheduler

Wrapper around torch.optim.lr_scheduler.ExponentialLR. See the original documentation for more details.

classmethod from_config(config: pytext.optimizer.scheduler.ExponentialLR.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
step_epoch(metrics=None, epoch=None)[source]
class pytext.optimizer.scheduler.LmFineTuning(optimizer, cut_frac=0.1, ratio=32, non_pretrained_param_groups=2, lm_lr_multiplier=1.0, lm_use_per_layer_lr=False, lm_gradual_unfreezing=True, last_epoch=-1)[source]

Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler

Fine-tuning methods from the paper “Universal Language Model Fine-tuning for Text Classification” (arXiv:1801.06146).

Specifically, modifies the training schedule using slanted triangular learning rates, discriminative fine-tuning (per-layer learning rates), and gradual unfreezing.
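The slanted triangular schedule from the paper can be written as a small function. This is a sketch of the published formula only; it ignores per-layer learning rates and gradual unfreezing, and slanted_triangular_lr and lr_max are hypothetical names. The defaults for cut_frac and ratio mirror the constructor signature above.

```python
import math


def slanted_triangular_lr(step, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at a given step (assumes total_steps * cut_frac >= 1)."""
    cut = math.floor(total_steps * cut_frac)  # step at which the peak is reached
    if step < cut:
        p = step / cut  # short linear increase
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

The rate starts at lr_max / ratio, peaks at lr_max after cut steps, and decays back to lr_max / ratio by total_steps.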

classmethod from_config(config: pytext.optimizer.scheduler.LmFineTuning.Config, optimizer)[source]
get_lr()[source]
step_batch(metrics=None, epoch=None)[source]
class pytext.optimizer.scheduler.PolynomialDecayScheduler(optimizer, warmup_steps, total_steps, end_learning_rate, power)[source]

Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler

Applies a polynomial decay with lr warmup to the learning rate.

It is commonly observed that a monotonically decreasing learning rate, whose degree of change is carefully chosen, results in a better performing model.

This scheduler linearly increases the learning rate from 0 to its final value at the beginning of training, over the number of steps given by warmup_steps. It then applies a polynomial decay function to each optimizer step, starting from the provided base_lrs, to reach end_learning_rate after total_steps.
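A minimal sketch of such a schedule for a single base learning rate is shown below. It is written from the description above rather than from the PyText source, so the exact boundary handling and the guard for warmup_steps of 0 are assumptions; polynomial_decay_lr is a hypothetical name.

```python
def polynomial_decay_lr(step, base_lr, warmup_steps, total_steps,
                        end_learning_rate, power):
    """Linear warmup followed by polynomial decay to end_learning_rate."""
    if warmup_steps > 0 and step <= warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    if step >= total_steps:
        return end_learning_rate
    # Fraction of the decay phase that is still ahead, in [0, 1].
    remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (base_lr - end_learning_rate) * remaining ** power + end_learning_rate
```

With power set to 1.0 the decay is linear; larger values drop the learning rate faster early in the decay phase.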

classmethod from_config(config: pytext.optimizer.scheduler.PolynomialDecayScheduler.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
get_lr()[source]
prepare(train_iter, total_epochs)[source]
step_batch()[source]
class pytext.optimizer.scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)[source]

Bases: torch.optim.lr_scheduler.ReduceLROnPlateau, pytext.optimizer.scheduler.Scheduler

Wrapper around torch.optim.lr_scheduler.ReduceLROnPlateau. See the original documentation for more details.

classmethod from_config(config: pytext.optimizer.scheduler.ReduceLROnPlateau.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
step_epoch(metrics, epoch)[source]
class pytext.optimizer.scheduler.Scheduler(config=None, *args, **kwargs)[source]

Bases: pytext.config.component.Component

Schedulers help in adjusting the learning rate during training. Scheduler is a wrapper class over schedulers, which can be either available in the torch library or custom implementations. Two kinds of learning rate scheduling are supported by this class: per-epoch scheduling and per-batch scheduling. In per-epoch scheduling, the learning rate is adjusted at the end of each epoch; in per-batch scheduling, it is adjusted after the forward and backward pass through one batch during training.

There are two main methods that need to be implemented by a Scheduler: step_epoch() is called at the end of each epoch, and step_batch() is called at the end of each batch in the training data.

The prepare() method can be used by BatchSchedulers to initialize any attributes they may need.
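The order in which these hooks fire can be sketched with a toy training loop. RecordingScheduler and run_training are hypothetical stand-ins that only record calls; the real trainer does considerably more, and the arguments passed to the hooks here are assumptions.

```python
class RecordingScheduler:
    """Hypothetical scheduler that records when each hook is called."""

    def __init__(self):
        self.calls = []
        self.steps_per_epoch = None
        self.total_epochs = None

    def prepare(self, train_iter, total_epochs):
        # A batch scheduler can derive its total step count here.
        self.steps_per_epoch = len(train_iter)
        self.total_epochs = total_epochs

    def step_batch(self, **kwargs):
        self.calls.append("batch")

    def step_epoch(self, **kwargs):
        self.calls.append("epoch")


def run_training(scheduler, batches, total_epochs):
    scheduler.prepare(batches, total_epochs)
    for epoch in range(total_epochs):
        for _batch in batches:
            # Forward pass, backward pass and optimizer step would go here.
            scheduler.step_batch()
        scheduler.step_epoch(metrics=None, epoch=epoch)
    return scheduler.calls
```

A per-epoch scheduler would leave step_batch() as a no-op, and a per-batch scheduler would do the same for step_epoch().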

prepare(train_iter, total_epochs)[source]
step_batch(**kwargs) → None[source]
step_epoch(**kwargs) → None[source]
class pytext.optimizer.scheduler.SchedulerWithWarmup(optimizer, warmup_scheduler, scheduler, switch_steps)[source]

Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler

Wraps another scheduler with a warmup phase. After the number of steps defined in warmup_scheduler.warmup_steps, the scheduler switches to the scheduler specified in scheduler.

warmup_scheduler: the configuration for the WarmupScheduler, which warms up the learning rate linearly over warmup_steps.

scheduler: the main scheduler that is applied after the warmup phase (once warmup_steps have passed).

classmethod from_config(config: pytext.optimizer.scheduler.SchedulerWithWarmup.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
get_lr()[source]
prepare(train_iter, total_epochs)[source]
step_batch()[source]
step_epoch(metrics, epoch)[source]
class pytext.optimizer.scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)[source]

Bases: torch.optim.lr_scheduler.StepLR, pytext.optimizer.scheduler.Scheduler

Wrapper around torch.optim.lr_scheduler.StepLR. See the original documentation for more details.

classmethod from_config(config: pytext.optimizer.scheduler.StepLR.Config, optimizer)[source]
step_epoch(metrics=None, epoch=None)[source]
class pytext.optimizer.scheduler.WarmupScheduler(optimizer, warmup_steps, inverse_sqrt_decay)[source]

Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler

Scheduler to linearly increase the learning rate from 0 to its final value over a number of steps:

lr = base_lr * current_step / warmup_steps

After the warm-up phase, the scheduler has the option of decaying the learning rate as the inverse square root of the number of training steps taken:

lr = base_lr * sqrt(warmup_steps) / sqrt(current_step)
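
The two formulas above translate directly into a small function. This is a sketch for a single learning rate; warmup_lr is a hypothetical name, and holding the rate at base_lr when inverse_sqrt_decay is off is an assumption.

```python
import math


def warmup_lr(base_lr, current_step, warmup_steps, inverse_sqrt_decay=False):
    """Linear warmup, optionally followed by inverse square root decay."""
    if current_step <= warmup_steps:
        return base_lr * current_step / warmup_steps
    if inverse_sqrt_decay:
        return base_lr * math.sqrt(warmup_steps) / math.sqrt(current_step)
    # Assumed behaviour without decay: stay at the final value.
    return base_lr
```

Both formulas give base_lr at current_step == warmup_steps, so the schedule is continuous at the switch.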
classmethod from_config(config: pytext.optimizer.scheduler.WarmupScheduler.Config, optimizer: pytext.optimizer.optimizers.Optimizer)[source]
get_lr()[source]
prepare(train_iter, total_epochs)[source]
step_batch()[source]
pytext.optimizer.swa module
class pytext.optimizer.swa.StochasticWeightAveraging(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]

Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer

add_param_group(param_group)[source]

Add a param group to the Optimizer’s param_groups.

This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters:param_group (dict) – Specifies what Tensors should be optimized along with group specific optimization options.
static bn_update(loader, model, device=None)[source]

Updates BatchNorm running_mean, running_var buffers in the model.

It performs one pass over data in loader to estimate the activation statistics for BatchNorm layers in the model.

Parameters:
  • loader (torch.utils.data.DataLoader) – dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
  • model (torch.nn.Module) – model for which we seek to update BatchNorm statistics.
  • device (torch.device, optional) – If set, data will be transferred to device before being passed into model.
finalize()[source]

Swaps the values of the optimized variables and swa buffers.

It’s meant to be called at the end of training to use the collected swa running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.

classmethod from_config(config: pytext.optimizer.swa.StochasticWeightAveraging.Config, model: torch.nn.modules.module.Module)[source]
load_state_dict(state_dict)[source]

Loads the optimizer state.

Parameters:state_dict (dict) – SWA optimizer state. Should be an object returned from a call to state_dict.
state_dict()[source]

Returns the state of SWA as a dict.

It contains three entries:
  • opt_state - a dict holding current optimization state of the base optimizer. Its content differs between optimizer classes.
  • swa_state - a dict containing current state of SWA. For each optimized variable it contains swa_buffer keeping the running average of the variable
  • param_groups - a dict containing all parameter groups
step(closure=None)[source]

Performs a single optimization step.

In automatic mode also updates SWA running averages.

update_swa()[source]

Updates the SWA running averages of all optimized parameters.
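A running average of this kind is typically maintained incrementally, and the final swap exchanges the averaged values with the live parameters. The sketch below shows both ideas on plain lists of floats; the function names and the n_avg counter are hypothetical and not taken from the PyText source.

```python
def update_running_average(swa_buffer, params, n_avg):
    """Fold the current parameters into the average of n_avg earlier snapshots."""
    # Incremental mean: new_avg = avg + (x - avg) / (n + 1)
    updated = [buf + (p - buf) / (n_avg + 1) for buf, p in zip(swa_buffer, params)]
    return updated, n_avg + 1


def swap(params, swa_buffer):
    """Exchange live parameters and averaged values, as done when finalizing."""
    return swa_buffer, params
```

Calling swap twice restores the original assignment, which is why the swap has to be repeated before training continues after an evaluation.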

update_swa_group(group)[source]

Updates the SWA running averages for the given parameter group.

Parameters:param_group (dict) – Specifies for what parameter group SWA running averages should be updated

Examples

>>> # automatic mode
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>             {'params': [y], 'lr': 1e-3}], lr=1e-2, momentum=0.9)
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()
Module contents

pytext.task package

Submodules
pytext.task.disjoint_multitask module
class pytext.task.disjoint_multitask.DisjointMultitask(target_task_name, exporters, **kwargs)[source]

Bases: pytext.task.task.TaskBase

Modules which have the same shared_module_key and type share parameters. Only the first instance of such module should be configured in tasks list.

export(multitask_model, export_path, metric_channels, export_onnx_path=None)[source]

Wrapper method to export PyTorch model to Caffe2 model using Exporter.

Parameters:
  • export_path (str) – file path of exported caffe2 model
  • metric_channels – output the PyTorch model’s execution graph to
  • export_onnx_path (str) – file path of exported onnx model
classmethod from_config(task_config: pytext.task.disjoint_multitask.DisjointMultitask.Config, metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]

Create the task from config, and optionally load metadata/model_state. This function will create components including DataHandler, Trainer, MetricReporter, Exporter, and wire them up.

Parameters:
  • task_config (Task.Config) – the config of the current task
  • metadata – saved global context of this task, e.g: vocabulary, will be generated by DataHandler if it’s None
  • model_state – saved model parameters, will be loaded into model when given
class pytext.task.disjoint_multitask.NewDisjointMultitask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task._NewTask

Multitask training based on underlying subtasks. To share parameters between modules from different tasks, specify the same shared_module_key. Only the first instance of each shared module should be configured in tasks list. Only the multitask trainer (not the per-task trainers) is used.

export(model, export_path, metric_channels=None, export_onnx_path=None)[source]

Wrapper method to export PyTorch model to Caffe2 model using Exporter.

Parameters:
  • export_path (str) – file path of exported caffe2 model
  • metric_channels (List[Channel]) – outputs of model’s execution graph
  • export_onnx_path (str) – file path of exported onnx model
classmethod from_config(task_config: pytext.task.disjoint_multitask.NewDisjointMultitask.Config, unused_metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]

Create the task from config, and optionally load metadata/model_state. This function will create components including DataHandler, Trainer, MetricReporter, Exporter, and wire them up.

Parameters:
  • task_config (Task.Config) – the config of the current task
  • metadata – saved global context of this task, e.g: vocabulary, will be generated by DataHandler if it’s None
  • model_state – saved model parameters, will be loaded into model when given
torchscript_export(model, export_path, quantize=False)[source]
pytext.task.new_task module
class pytext.task.new_task.NewTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task._NewTask

pytext.task.new_task.create_schema(tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], extra_schema: Optional[Dict[str, Type[CT_co]]] = None) → Dict[str, Type[CT_co]][source]
pytext.task.new_task.create_tensorizers(model_inputs: Union[pytext.models.model.BaseModel.Config.ModelInput, Dict[str, pytext.data.tensorizers.Tensorizer.Config]]) → Dict[str, pytext.data.tensorizers.Tensorizer][source]
pytext.task.serialize module
class pytext.task.serialize.CheckpointManager[source]

Bases: object

CheckpointManager is a class abstraction for managing a training job’s checkpoints with different IO and storage, using two functions: save() and load().

DELIMITER = '-'
generate_checkpoint_path(config: pytext.config.pytext_config.PyTextConfig, identifier: str)[source]
get_latest_checkpoint_path() → str[source]

Returns the most recently saved checkpoint path as a str.

Returns: checkpoint_path (str)

get_post_training_snapshot_path() → str[source]
list() → List[str][source]

Returns all existing checkpoint paths as a list of str.

Returns: checkpoint_path_list (List[str]), list elements are in the same order as the checkpoints were saved

load(load_path: str, overwrite_config=None)[source]

Loads a checkpoint from disk.

Parameters:load_path (str) – the file path of the checkpoint to load

Returns: task (Task), config (PyTextConfig) and training_state (TrainingState)

save(config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None, identifier: str = None) → str[source]

Saves a checkpoint to the given path. config, model and training_state together represent the checkpoint. When identifier is None, this function saves the post-training snapshot.
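To make the contract of save(), load() and the path helpers concrete, here is a toy in-memory manager. It is not a PyText subclass: the simplified signatures, the dict-based storage and the path format (save path, DELIMITER, identifier) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class InMemoryCheckpointManager:
    """Hypothetical manager that keeps checkpoints in a dict instead of on disk."""

    DELIMITER = "-"

    def __init__(self):
        self._store: Dict[str, dict] = {}
        self._paths: List[str] = []  # checkpoint paths in saving order
        self._post_training_path: Optional[str] = None

    def generate_checkpoint_path(self, save_snapshot_path, identifier):
        # Assumed format: the identifier is appended as a suffix.
        return f"{save_snapshot_path}{self.DELIMITER}{identifier}"

    def save(self, save_snapshot_path, state, identifier=None):
        if identifier is None:
            # No identifier: this is the post-training snapshot.
            path = save_snapshot_path
            self._post_training_path = path
        else:
            path = self.generate_checkpoint_path(save_snapshot_path, identifier)
            self._paths.append(path)
        self._store[path] = state
        return path

    def load(self, load_path):
        return self._store[load_path]

    def list(self):
        return self._paths.copy()

    def get_latest_checkpoint_path(self):
        return self._paths[-1] if self._paths else None

    def get_post_training_snapshot_path(self):
        return self._post_training_path
```

A manager for a different storage system would keep the same method names and replace only the dict operations with its own IO.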

pytext.task.serialize.get_latest_checkpoint_path(dir_path: Optional[str] = None) → str[source]

Get the latest checkpoint path.

Parameters:dir_path – the dir to scan for existing checkpoint files. Default: if None, the latest checkpoint path saved in memory will be returned

Returns: checkpoint_path

pytext.task.serialize.get_post_training_snapshot_path() → str[source]
pytext.task.serialize.load(load_path: str, overwrite_config=None)[source]

Load task, config and training state from a saved snapshot. By default, it will construct the task using the saved config, then load metadata and model state.

If overwrite_task is specified, it will construct the task using overwrite_task, then load metadata and model state.

pytext.task.serialize.load_checkpoint(f: io.IOBase, overwrite_config=None)[source]
pytext.task.serialize.load_v1(state)[source]
pytext.task.serialize.load_v2(state)[source]
pytext.task.serialize.load_v3(state, overwrite_config=None)[source]
pytext.task.serialize.register_snapshot_loader(version)[source]
pytext.task.serialize.save(config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None, identifier: Optional[str] = None) → str[source]

Save all stateful information of a training task to a specified file-like object. This saves the original config, model state, metadata, and the training state if training is not completed.

Parameters:
  • identifier (str) – used to identify a checkpoint within a training job, used as a suffix for save path
  • config (PytextConfig) – contains all raw parameter/hyper-parameters for training task
  • model (Model) – actual model in training
  • training_state (TrainingState) – stateful information during training

Returns: identifier (str) – if identifier is not specified, will save to config.save_snapshot_path to be consistent to post-training snapshot; if specified, will be used to save checkpoint during training, identifier is used to identify checkpoints in the same training

pytext.task.serialize.save_checkpoint(f: io.IOBase, config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None) → str[source]
pytext.task.serialize.set_checkpoint_manager(manager: pytext.task.serialize.CheckpointManager) → None[source]
pytext.task.task module
class pytext.task.task.TaskBase(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]

Bases: pytext.config.component.Component

Task is the central place to define and wire up components for data processing, model training, metric reporting, etc. Task class has a Config class containing the config of each component in a descriptive way.

export(model, export_path, metric_channels=None, export_onnx_path=None)[source]

Wrapper method to export PyTorch model to Caffe2 model using Exporter.

Parameters:
  • export_path (str) – file path of exported caffe2 model
  • metric_channels (List[Channel]) – outputs of model’s execution graph
  • export_onnx_path (str) – file path of exported onnx model
classmethod format_prediction(predictions, scores, context, target_meta)[source]

Format the prediction and score from model output, by default just return them in a dict

classmethod from_config(task_config, metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]

Create the task from config, and optionally load metadata/model_state. This function will create components including DataHandler, Trainer, MetricReporter, Exporter, and wire them up.

Parameters:
  • task_config (Task.Config) – the config of the current task
  • metadata – saved global context of this task, e.g: vocabulary, will be generated by DataHandler if it’s None
  • model_state – saved model parameters, will be loaded into model when given
predict(examples)[source]

Generates predictions using the PyTorch model. The difference from test() is that this should be used when the examples do not have any true label/target.

Parameters:examples – json format examples, input names should match the names specified in this task’s features config
test(test_path)[source]

Wrapper method to compute test metrics on holdout blind test dataset.

Parameters:test_path (str) – test data file path
train(train_config, rank=0, world_size=1, training_state=None)[source]

Wrapper method to train the model using Trainer object.

Parameters:
  • train_config (PyTextConfig) – config for training
  • rank (int) – for distributed training only, rank of the gpu, default is 0
  • world_size (int) – for distributed training only, total gpu to use, default is 1
class pytext.task.task.Task_Deprecated(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]

Bases: pytext.task.task.TaskBase

pytext.task.task.create_task(task_config, metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]

Create a task by finding the task class in the registry and invoking the from_config function of the class. See from_config() for more details.
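The registry lookup followed by from_config can be sketched as follows. How PyText keys its registry is not shown on this page, so the string key, the dict-based config and every name below (TASK_REGISTRY, register_task, ToyTask, create_task_sketch) are hypothetical.

```python
TASK_REGISTRY = {}


def register_task(cls):
    """Class decorator that records a task class under its name."""
    TASK_REGISTRY[cls.__name__] = cls
    return cls


@register_task
class ToyTask:
    def __init__(self, name, rank, world_size):
        self.name = name
        self.rank = rank
        self.world_size = world_size

    @classmethod
    def from_config(cls, task_config, metadata=None, model_state=None,
                    tensorizers=None, rank=0, world_size=1):
        # A real task would build and wire up its components here.
        return cls(task_config["name"], rank, world_size)


def create_task_sketch(task_config, metadata=None, model_state=None,
                       tensorizers=None, rank=0, world_size=1):
    # Find the task class, then delegate construction to from_config.
    task_cls = TASK_REGISTRY[task_config["type"]]
    return task_cls.from_config(
        task_config, metadata, model_state, tensorizers, rank, world_size
    )
```

Keeping construction inside from_config means the factory never needs to know which components a particular task requires.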

pytext.task.tasks module
class pytext.task.tasks.BertPairRegressionTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.tasks.DocumentRegressionTask

class pytext.task.tasks.DocumentClassificationTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

classmethod format_prediction(predictions, scores, context, target_names)[source]

Format the prediction and score from model output, by default just return them in a dict

class pytext.task.tasks.DocumentRegressionTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.EnsembleTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

classmethod example_config()[source]
train_single_model(train_config, model_id, rank=0, world_size=1)[source]
class pytext.task.tasks.IntentSlotTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.LMTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.MaskedLMTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.NewBertClassificationTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.tasks.DocumentClassificationTask

class pytext.task.tasks.NewBertPairClassificationTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.tasks.DocumentClassificationTask

class pytext.task.tasks.PairwiseClassificationTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.QueryDocumentPairwiseRankingTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.RoBERTaNERTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

classmethod create_metric_reporter(config: pytext.task.tasks.RoBERTaNERTask.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
class pytext.task.tasks.SemanticParsingTask(data: pytext.data.data.Data, model: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParser, metric_reporter: pytext.metric_reporters.compositional_metric_reporter.CompositionalMetricReporter, trainer: pytext.trainers.hogwild_trainer.HogwildTrainer)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.SeqNNTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.SquadQATask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

class pytext.task.tasks.WordTaggingTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task.NewTask

classmethod create_metric_reporter(config: pytext.task.tasks.WordTaggingTask.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]
Module contents
class pytext.task.NewTask(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]

Bases: pytext.task.new_task._NewTask

class pytext.task.Task_Deprecated(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]

Bases: pytext.task.task.TaskBase

class pytext.task.TaskBase(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]

Bases: pytext.config.component.Component

Task is the central place to define and wire up components for data processing, model training, metric reporting, etc. Task class has a Config class containing the config of each component in a descriptive way.

export(model, export_path, metric_channels=None, export_onnx_path=None)[source]

Wrapper method to export PyTorch model to Caffe2 model using Exporter.

Parameters:
  • export_path (str) – file path of exported caffe2 model
  • metric_channels (List[Channel]) – outputs of model’s execution graph
  • export_onnx_path (str) – file path of exported onnx model
classmethod format_prediction(predictions, scores, context, target_meta)[source]

Format the prediction and score from model output, by default just return them in a dict

classmethod from_config(task_config, metadata=None, model_state=None, tensorizers=None, rank=1, world_size=0)[source]

Create the task from config, and optionally load metadata/model_state. This function will create components including DataHandler, Trainer, MetricReporter, Exporter, and wire them up.

Parameters:
  • task_config (Task.Config) – the config of the current task
  • metadata – saved global context of this task, e.g: vocabulary, will be generated by DataHandler if it’s None
  • model_state – saved model parameters, will be loaded into model when given
predict(examples)[source]

Generates predictions using PyTorch model. The difference with test() is that this should be used when the examples do not have any true label/target.

Parameters:examples – json format examples, input names should match the names specified in this task’s features config
test(test_path)[source]

Wrapper method to compute test metrics on holdout blind test dataset.

Parameters:test_path (str) – test data file path
train(train_config, rank=0, world_size=1, training_state=None)[source]

Wrapper method to train the model using Trainer object.

Parameters:
  • train_config (PyTextConfig) – config for training
  • rank (int) – for distributed training only, rank of the gpu, default is 0
  • world_size (int) – for distributed training only, total gpu to use, default is 1
pytext.task.save(config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None, identifier: Optional[str] = None) → str[source]

Save all stateful information of a training task to a specified file-like object. This will save the original config, model state, metadata, and, if training is not completed, the training state.

Parameters:
  • identifier (str) – used to identify a checkpoint within a training job, used as a suffix for save path
  • config (PyTextConfig) – contains all raw parameter/hyper-parameters for training task
  • model (Model) – actual model in training
  • training_state (TrainingState) – stateful information during training
Returns:

identifier (str): if identifier is not specified, will save to config.save_snapshot_path to be consistent to post-training snapshot; if specified, will be used to save checkpoint during training, identifier is used to identify checkpoints in the same training

pytext.task.load(load_path: str, overwrite_config=None)[source]

Load task, config and training state from a saved snapshot. By default, it will construct the task using the saved config, then load metadata and model state.

If overwrite_config is specified, it will construct the task using overwrite_config, then load metadata and model state.

pytext.task.create_task(task_config, metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]

Create a task by finding task class in registry and invoking the from_config function of the class, see from_config() for more details
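
The registry lookup described above can be illustrated with a minimal, self-contained sketch. It does not import PyText; the names TASK_REGISTRY, register_task, ToyTask and create_task_sketch are hypothetical, and keying the registry by the config class is an assumption made for illustration only.

```python
from typing import Dict


# Hypothetical registry: maps a Config class to the task class that owns it.
TASK_REGISTRY: Dict[type, type] = {}


def register_task(task_cls: type) -> type:
    """Class decorator that records task_cls under its nested Config class."""
    TASK_REGISTRY[task_cls.Config] = task_cls
    return task_cls


@register_task
class ToyTask:
    """Toy stand-in for a task; only carries a number of epochs."""

    class Config:
        def __init__(self, epochs: int = 1) -> None:
            self.epochs = epochs

    def __init__(self, epochs: int) -> None:
        self.epochs = epochs

    @classmethod
    def from_config(cls, config: "ToyTask.Config") -> "ToyTask":
        # Build the task (and, in a real system, its components) from config.
        return cls(epochs=config.epochs)


def create_task_sketch(task_config: object) -> object:
    """Find the task class for this config type and delegate to from_config."""
    task_cls = TASK_REGISTRY[type(task_config)]
    return task_cls.from_config(task_config)


task = create_task_sketch(ToyTask.Config(epochs=3))
```

The caller only needs a config object; which class gets instantiated is decided by the registry, which is what lets a config file select the task.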

pytext.task.get_latest_checkpoint_path(dir_path: Optional[str] = None) → str[source]

Get the latest checkpoint path.

Parameters:dir_path – the dir to scan for existing checkpoint files. Default: if None, the latest checkpoint path saved in memory will be returned
Returns:checkpoint_path

pytext.torchscript package

Subpackages
pytext.torchscript.tensorizer package
Submodules
pytext.torchscript.tensorizer.bert module
class pytext.torchscript.tensorizer.bert.ScriptBERTTensorizer(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase

class pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer

pytext.torchscript.tensorizer.normalizer module
class pytext.torchscript.tensorizer.normalizer.VectorNormalizer(dim: int, do_normalization: bool = True)[source]

Bases: torch.nn.modules.module.Module

Performs in-place normalization over all features of a dense feature vector by doing (x - mean)/stddev for each x in the feature vector.

This is a ScriptModule so that the normalize function can be called at training time in the tensorizer, as well as at inference time by using it in your torchscript forward function. To use this in your tensorizer update_meta_data must be called once per row in your initialize function, and then calculate_feature_stats must be called upon the last time it runs. See usage in FloatListTensorizer for an example.

Setting do_normalization=False will make the normalize function an identity function.
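
As a rough illustration of the call order described above (update per row, compute stats once, then normalize), here is a pure-Python sketch. It is not the PyText implementation: the class name RunningNormalizerSketch is hypothetical, it uses the population standard deviation, it falls back to a stddev of 1.0 for constant features, and it returns a new list instead of normalizing in place. All of these are assumptions.

```python
import math
from typing import List


class RunningNormalizerSketch:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self, dim: int, do_normalization: bool = True) -> None:
        self.dim = dim
        self.do_normalization = do_normalization
        self.num_rows = 0
        self.feature_sums = [0.0] * dim
        self.feature_squared_sums = [0.0] * dim
        self.feature_avgs = [0.0] * dim
        self.feature_stddevs = [1.0] * dim

    def update_meta_data(self, row: List[float]) -> None:
        # Called once per row: accumulate per-feature sums and squared sums.
        if not self.do_normalization:
            return
        assert len(row) == self.dim
        self.num_rows += 1
        for i, value in enumerate(row):
            self.feature_sums[i] += value
            self.feature_squared_sums[i] += value * value

    def calculate_feature_stats(self) -> None:
        # Called once after the last row: derive mean and stddev per feature.
        if not self.do_normalization or self.num_rows == 0:
            return
        for i in range(self.dim):
            mean = self.feature_sums[i] / self.num_rows
            variance = self.feature_squared_sums[i] / self.num_rows - mean * mean
            stddev = math.sqrt(max(variance, 0.0))
            self.feature_avgs[i] = mean
            # Assumption: avoid division by zero for constant features.
            self.feature_stddevs[i] = stddev if stddev > 0.0 else 1.0

    def normalize(self, vec: List[List[float]]) -> List[List[float]]:
        # Identity when normalization is disabled; otherwise (x - mean) / stddev.
        if not self.do_normalization:
            return vec
        return [
            [(x - self.feature_avgs[i]) / self.feature_stddevs[i]
             for i, x in enumerate(row)]
            for row in vec
        ]


rows = [[1.0, 10.0], [3.0, 10.0]]
normalizer = RunningNormalizerSketch(dim=2)
for row in rows:
    normalizer.update_meta_data(row)
normalizer.calculate_feature_stats()
normalized = normalizer.normalize(rows)
```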

calculate_feature_stats()[source]
forward()[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

normalize(vec: List[List[float]])[source]
update_meta_data(vec)[source]
pytext.torchscript.tensorizer.roberta module
class pytext.torchscript.tensorizer.roberta.ScriptRoBERTaTensorizer(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase

class pytext.torchscript.tensorizer.roberta.ScriptRoBERTaTensorizerWithIndices(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase

pytext.torchscript.tensorizer.tensorizer module
class pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer[source]

Bases: torch.jit.ScriptModule

class pytext.torchscript.tensorizer.tensorizer.VocabLookup(vocab: pytext.torchscript.vocab.ScriptVocabulary)[source]

Bases: torch.jit.ScriptModule

TorchScript implementation of lookup_tokens() in pytext/data/tensorizers.py

pytext.torchscript.tensorizer.xlm module
class pytext.torchscript.tensorizer.xlm.ScriptXLMTensorizer(tokenizer: torch.jit.ScriptModule, token_vocab: pytext.torchscript.vocab.ScriptVocabulary, language_vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int, default_language: str)[source]

Bases: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer

Module contents
class pytext.torchscript.tensorizer.ScriptBERTTensorizer(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase

class pytext.torchscript.tensorizer.ScriptRoBERTaTensorizer(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase

class pytext.torchscript.tensorizer.ScriptRoBERTaTensorizerWithIndices(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]

Bases: pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase

class pytext.torchscript.tensorizer.ScriptXLMTensorizer(tokenizer: torch.jit.ScriptModule, token_vocab: pytext.torchscript.vocab.ScriptVocabulary, language_vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int, default_language: str)[source]

Bases: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer

class pytext.torchscript.tensorizer.VectorNormalizer(dim: int, do_normalization: bool = True)[source]

Bases: torch.nn.modules.module.Module

Performs in-place normalization over all features of a dense feature vector by doing (x - mean)/stddev for each x in the feature vector.

This is a ScriptModule so that the normalize function can be called at training time in the tensorizer, as well as at inference time by using it in your torchscript forward function. To use this in your tensorizer update_meta_data must be called once per row in your initialize function, and then calculate_feature_stats must be called upon the last time it runs. See usage in FloatListTensorizer for an example.

Setting do_normalization=False will make the normalize function an identity function.

calculate_feature_stats()[source]
forward()[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

normalize(vec: List[List[float]])[source]
update_meta_data(vec)[source]
pytext.torchscript.tokenizer package
Submodules
pytext.torchscript.tokenizer.bpe module
class pytext.torchscript.tokenizer.bpe.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]

Bases: torch.jit.ScriptModule

Byte-pair encoding implementation in TorchScript.

vocab_file should be a file-like object separated by newlines, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore can’t contain space (according to Python regex \s). The vocab file should be sorted according to the importance of each token, and they will be merged in this priority; the actual score values are irrelevant.

eow_token should be a string that is appended to the last character and token, and that token is used at each step in the process and returned at the end. You should set this to be consistent with the EOW signature used however you generated your ScriptBPE vocab file.

>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th  17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
classmethod from_vocab_filename(vocab_filename: str) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
static load_vocab(file: io.IOBase) → Dict[str, int][source]
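
The merge-by-priority behaviour described for ScriptBPE can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the library's TorchScript code: the helper names load_vocab_sketch, bpe_token and tokenize_sketch are hypothetical, priority is taken to be line order in the vocab file (the count column is ignored), and ties are broken by the leftmost pair.

```python
import io
from typing import Dict, List


def load_vocab_sketch(file: io.IOBase) -> Dict[str, int]:
    """Map each token to its rank, i.e. its position in the file."""
    vocab: Dict[str, int] = {}
    for line in file:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] not in vocab:
            vocab[fields[0]] = len(vocab)
    return vocab


def bpe_token(token: str, vocab: Dict[str, int], eow: str = "_EOW") -> List[str]:
    """Split one word into BPE pieces by repeatedly merging the best pair."""
    if not token:
        return []
    if token + eow in vocab:
        return [token + eow]  # the whole word is already a vocab entry
    # Start from characters, with the end-of-word marker on the last one.
    parts = list(token[:-1]) + [token[-1] + eow]
    while len(parts) > 1:
        best_index, best_rank = -1, len(vocab)
        for i in range(len(parts) - 1):
            rank = vocab.get(parts[i] + parts[i + 1])
            if rank is not None and rank < best_rank:
                best_index, best_rank = i, rank
        if best_index < 0:
            break  # no adjacent pair is in the vocab
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return parts


def tokenize_sketch(tokens: List[str], vocab: Dict[str, int]) -> List[str]:
    pieces: List[str] = []
    for token in tokens:
        pieces.extend(bpe_token(token, vocab))
    return pieces


vocab = load_vocab_sketch(io.StringIO(
    "hello_EOW 20\nworld_EOW 18\nth  17\nis_EOW 16\nbpe_EOW 15\n"
    "! 14\nh 13\nt 6\ns_EOW 2\ni -1\nii -2\n"
))
sentence_pieces = tokenize_sketch(["hello", "world", "this", "is", "bpe"], vocab)
repeated_pieces = tokenize_sketch(["iiiis"], vocab)
```

With the vocab from the docstring, "iiiis" first merges "i" + "s_EOW" because "is_EOW" ranks above "ii", and only then merges the leftmost "ii".
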
pytext.torchscript.tokenizer.tokenizer module
class pytext.torchscript.tokenizer.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.tokenizer.ScriptDoNothingTokenizer(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.tokenizer.ScriptTextTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. Currently there are four types:

  1. text: batch with a single text in each row, List[str]
  2. tokens: batch with a list of tokens from single text in each row, List[List[str]]
  3. multi_text: batch with multiple texts in each row, List[List[str]]
  4. multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

class pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. Currently there are four types:

  1. text: batch with a single text in each row, List[str]
  2. tokens: batch with a list of tokens from single text in each row, List[List[str]]
  3. multi_text: batch with multiple texts in each row, List[List[str]]
  4. multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

class pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: torch.jit.ScriptModule

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. Currently there are four types:

  1. text: batch with a single text in each row, List[str]
  2. tokens: batch with a list of tokens from single text in each row, List[List[str]]
  3. multi_text: batch with multiple texts in each row, List[List[str]]
  4. multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
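
To make the four batch shapes concrete, the following sketch builds one literal batch per type. The variable names and the nesting_depth helper are hypothetical and only serve the illustration; nothing here calls PyText.

```python
from typing import Any, List

# 1) text: one raw string per row.
text_batch: List[str] = ["hello world", "good morning"]

# 2) tokens: one pre-tokenized text per row.
tokens_batch: List[List[str]] = [["hello", "world"], ["good", "morning"]]

# 3) multi_text: several raw strings per row (for example a sentence pair).
multi_text_batch: List[List[str]] = [
    ["hello world", "hi there"],
    ["good morning", "good night"],
]

# 4) multi_tokens: several pre-tokenized texts per row.
multi_tokens_batch: List[List[List[str]]] = [
    [["hello", "world"], ["hi", "there"]],
    [["good", "morning"], ["good", "night"]],
]


def nesting_depth(batch: Any) -> int:
    """Count how many list levels wrap the innermost strings."""
    depth = 0
    while isinstance(batch, list):
        depth += 1
        if not batch:
            break
        batch = batch[0]
    return depth


depths = [
    nesting_depth(text_batch),
    nesting_depth(tokens_batch),
    nesting_depth(multi_text_batch),
    nesting_depth(multi_tokens_batch),
]
```

Note that tokens and multi_text share the static type List[List[str]], so the nesting alone cannot tell them apart; this is presumably why the tokenizer reports its input type explicitly.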

Module contents
class pytext.torchscript.tokenizer.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]

Bases: torch.jit.ScriptModule

Byte-pair encoding implementation in TorchScript.

vocab_file should be a file-like object separated by newlines, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore can’t contain space (according to Python regex \s). The vocab file should be sorted according to the importance of each token, and they will be merged in this priority; the actual score values are irrelevant.

eow_token should be a string that is appended to the last character and token, and that token is used at each step in the process and returned at the end. You should set this to be consistent with the EOW signature used however you generated your ScriptBPE vocab file.

>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th  17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
classmethod from_vocab_filename(vocab_filename: str) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
static load_vocab(file: io.IOBase) → Dict[str, int][source]
class pytext.torchscript.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.ScriptDoNothingTokenizer(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.ScriptTextTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. Currently there are four types:

  1. text: batch with a single text in each row, List[str]
  2. tokens: batch with a list of tokens from single text in each row, List[List[str]]
  3. multi_text: batch with multiple texts in each row, List[List[str]]
  4. multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

class pytext.torchscript.tokenizer.ScriptTokenTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. Currently there are four types:

  1. text: batch with a single text in each row, List[str]
  2. tokens: batch with a list of tokens from single text in each row, List[List[str]]
  3. multi_text: batch with multiple texts in each row, List[List[str]]
  4. multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

Submodules
pytext.torchscript.module module
class pytext.torchscript.module.ScriptModule(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: torch.jit.ScriptModule

class pytext.torchscript.module.ScriptTextModule(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]

Bases: pytext.torchscript.module.ScriptModule

class pytext.torchscript.module.ScriptTokenLanguageModule(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]

Bases: pytext.torchscript.module.ScriptModule

class pytext.torchscript.module.ScriptTokenLanguageModuleWithDenseFeature(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]

Bases: pytext.torchscript.module.ScriptModule

class pytext.torchscript.module.ScriptTokenModule(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]

Bases: pytext.torchscript.module.ScriptModule

pytext.torchscript.module.get_script_module_cls(input_type: pytext.torchscript.utils.ScriptInputType) → torch.jit.ScriptModule[source]
pytext.torchscript.utils module
class pytext.torchscript.utils.ScriptInputType[source]

Bases: enum.Enum

An enumeration.

is_text()[source]
is_token()[source]
text = 1
token = 2
pytext.torchscript.vocab module
class pytext.torchscript.vocab.ScriptVocabulary(vocab_list, unk_idx: int = 0, pad_idx: int = -1, bos_idx: int = -1, eos_idx: int = -1)[source]

Bases: torch.jit.ScriptModule

Module contents

pytext.trainers package

Submodules
pytext.trainers.ensemble_trainer module
class pytext.trainers.ensemble_trainer.EnsembleTrainer(real_trainers)[source]

Bases: pytext.trainers.trainer.TrainerBase

Trainer for ensemble models

real_trainer

the actual trainer to run

Type:Trainer
classmethod from_config(config: pytext.trainers.ensemble_trainer.EnsembleTrainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
train(train_iter, eval_iter, model, *args, **kwargs)[source]
pytext.trainers.hogwild_trainer module
class pytext.trainers.hogwild_trainer.HogwildTrainer(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]

Bases: pytext.trainers.trainer.Trainer

classmethod from_config(config: pytext.trainers.hogwild_trainer.HogwildTrainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
run_epoch(state: pytext.trainers.training_state.TrainingState, data_iter: torchtext.data.iterator.Iterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)
set_up_training(state: pytext.trainers.training_state.TrainingState, training_data)
class pytext.trainers.hogwild_trainer.HogwildTrainer_Deprecated(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]

Bases: pytext.trainers.trainer.Trainer

classmethod from_config(config: pytext.trainers.hogwild_trainer.HogwildTrainer_Deprecated.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
run_epoch(state: pytext.trainers.training_state.TrainingState, data_iter: torchtext.data.iterator.Iterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]
set_up_training(state: pytext.trainers.training_state.TrainingState, training_data)[source]
pytext.trainers.trainer module
class pytext.trainers.trainer.TaskTrainer(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]

Bases: pytext.trainers.trainer.Trainer

run_step(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]

Our run_step is a bit different, because we’re wrapping the model forward call with model.train_batch, which arranges tensors and gets loss, etc.

Whenever “samples” contains more than one mini-batch (sample_size > 1), we want to accumulate gradients locally and only call all-reduce in the last backwards pass.
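
The accumulate-then-sync idea can be sketched without any deep-learning dependency. The toy below assumes a model wrapper that exposes a no_sync() context manager (as PyTorch's DistributedDataParallel does); ToyDistributedModel, maybe_skip_sync and run_step_sketch are hypothetical names, and the "all-reduce" is just a counter.

```python
import contextlib
from typing import Iterator, List


class ToyDistributedModel:
    """Hypothetical stand-in for a distributed model wrapper."""

    def __init__(self) -> None:
        self.sync_enabled = True
        self.grad = 0.0
        self.all_reduce_calls = 0

    @contextlib.contextmanager
    def no_sync(self) -> Iterator[None]:
        # Temporarily disable gradient synchronization across workers.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self, grad: float) -> None:
        # Gradients always accumulate locally; syncing happens only if enabled.
        self.grad += grad
        if self.sync_enabled:
            self.all_reduce_calls += 1


def maybe_skip_sync(exit_stack: contextlib.ExitStack, model: ToyDistributedModel,
                    index: int, sample_size: int) -> None:
    # index == sample_size - 1 is the last backward pass, which must sync.
    if hasattr(model, "no_sync") and index < sample_size - 1:
        exit_stack.enter_context(model.no_sync())


def run_step_sketch(model: ToyDistributedModel, mini_batch_grads: List[float]) -> float:
    sample_size = len(mini_batch_grads)
    for index, grad in enumerate(mini_batch_grads):
        with contextlib.ExitStack() as exit_stack:
            maybe_skip_sync(exit_stack, model, index, sample_size)
            model.backward(grad)
    return model.grad


model = ToyDistributedModel()
total_grad = run_step_sketch(model, [0.5, 0.25, 0.25])
```

Three mini-batches contribute to the local gradient, but only the last backward pass triggers synchronization, which saves communication when sample_size is greater than 1.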

class pytext.trainers.trainer.Trainer(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]

Bases: pytext.trainers.trainer.TrainerBase

Base Trainer class that provides ways to:

  1. Train model, compute metrics against eval set and use the metrics for model selection.
  2. Test trained model, compute and publish metrics against a blind test set.
epochs

Training epochs

Type:int
early_stop_after

Stop after how many epochs when the eval metric is not improving

Type:int
max_clip_norm

Clip gradient norm if set

Type:Optional[float]
report_train_metrics

Whether metrics on training data should be computed and reported.

Type:bool
target_time_limit_seconds

Target time limit for training in seconds. If the expected time to train another epoch exceeds this limit, stop training.

Type:float
backprop(state, loss)[source]
continue_training(state: pytext.trainers.training_state.TrainingState) → bool[source]
classmethod from_config(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
load_best_model(state: pytext.trainers.training_state.TrainingState)[source]
optimizer_step(state)[source]
run_epoch(state: pytext.trainers.training_state.TrainingState, data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]
run_step(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]
save_checkpoint(state: pytext.trainers.training_state.TrainingState, train_config: pytext.config.pytext_config.PyTextConfig) → str[source]
set_up_training(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator)[source]
sparsification_step(state)[source]
test(test_iter, model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]
train(training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig, rank: int = 0) → Tuple[torch.nn.modules.module.Module, Any][source]

Train and eval a model, the model states will be modified.

Parameters:
  • train_iter (BatchIterator) – batch iterator of training data
  • eval_iter (BatchIterator) – batch iterator of evaluation data
  • model (Model) – model to be trained
  • metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file.. etc
  • train_config (PyTextConfig) – training config
  • training_result (Optional) – only meaningful for Hogwild training. default is None
  • rank (int) – only used in distributed training, the rank of the current training thread, evaluation will only be done in rank 0

Returns:the trained model together with the best metric
Return type:model, best_metric
train_from_state(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig) → Tuple[torch.nn.modules.module.Module, Any][source]

Train and eval a model from a given training state; the state will be modified. This function iterates over the epochs specified in config, and for each epoch does the following:

  1. Train model using training data, aggregate and report training results
  2. Adjust learning rate if scheduler is specified
  3. Evaluate model using evaluation data
  4. Calculate metrics based on evaluation results and select best model
Parameters:
  • training_state (TrainingState) – contains stateful information to be able to restore a training job
  • train_iter (BatchIterator) – batch iterator of training data
  • eval_iter (BatchIterator) – batch iterator of evaluation data
  • model (Model) – model to be trained
  • metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file.. etc
  • train_config (PyTextConfig) – training config
Returns:

the trained model together with the best metric

Return type:

model, best_metric
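
The epoch loop described above (train, evaluate, keep the best model, stop early) can be sketched as follows. This is a toy under stated assumptions, not the Trainer's code: ToyState, train_from_state_sketch and the callbacks are hypothetical, the learning-rate scheduler step is omitted, a higher metric is assumed to be better, and the model is a plain dict.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ToyState:
    """Hypothetical subset of the fields a training state carries."""
    epoch: int = 0
    epochs_since_last_improvement: int = 0
    best_model_metric: Optional[float] = None
    best_model_state: Optional[Dict[str, int]] = None


def train_from_state_sketch(
    state: ToyState,
    model: Dict[str, int],
    train_one_epoch: Callable[[Dict[str, int]], None],
    evaluate: Callable[[Dict[str, int]], float],
    epochs: int,
    early_stop_after: int,
) -> Tuple[Optional[Dict[str, int]], Optional[float]]:
    while state.epoch < epochs:
        if 0 < early_stop_after <= state.epochs_since_last_improvement:
            break  # the eval metric has stopped improving
        state.epoch += 1
        train_one_epoch(model)    # step 1: train on the training data
        metric = evaluate(model)  # step 3: evaluate on the eval data
        # Step 4: keep a copy of the best model seen so far.
        if state.best_model_metric is None or metric > state.best_model_metric:
            state.best_model_metric = metric
            state.best_model_state = copy.deepcopy(model)
            state.epochs_since_last_improvement = 0
        else:
            state.epochs_since_last_improvement += 1
    return state.best_model_state, state.best_model_metric


# Toy callbacks: "training" bumps a counter, "evaluation" reads a fixed curve.
eval_curve = [0.5, 0.7, 0.6, 0.65, 0.9]


def toy_train(model: Dict[str, int]) -> None:
    model["w"] += 1


def toy_eval(model: Dict[str, int]) -> float:
    return eval_curve[model["w"] - 1]


state = ToyState()
best_model, best_metric = train_from_state_sketch(
    state, {"w": 0}, toy_train, toy_eval, epochs=5, early_stop_after=2
)
```

Because all progress lives in the state object, passing a previously saved state back in resumes the loop at the recorded epoch rather than restarting from zero.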

update_best_model(state: pytext.trainers.training_state.TrainingState, train_config: pytext.config.pytext_config.PyTextConfig, eval_metric)[source]
zero_grads(state)[source]
class pytext.trainers.trainer.TrainerBase(config=None, *args, **kwargs)[source]

Bases: pytext.config.component.Component

pytext.trainers.trainer.cycle(iterator: Iterable[Any]) → Iterable[Any][source]

Like itertools.cycle, but will call iter on the original iterable instead. This limits it to not be able to run on say raw generators, but also doesn’t store a copy of the iterable in memory for repetition.
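
The trade-off described above can be seen in a small sketch. The function restarting_cycle is a hypothetical illustration, not the library's code; the early return on an empty pass is an added safeguard so that an exhausted generator ends the loop instead of spinning forever.

```python
import itertools
from typing import Any, Iterable, Iterator


def restarting_cycle(iterable: Iterable[Any]) -> Iterator[Any]:
    """Repeat an iterable by re-iterating it, without caching its items."""
    while True:
        yielded_any = False
        for item in iterable:  # the for-loop calls iter(iterable) on each pass
            yielded_any = True
            yield item
        if not yielded_any:
            return  # safeguard (assumption): nothing left to repeat


# A list can be re-iterated, so it cycles indefinitely.
from_list = list(itertools.islice(restarting_cycle([1, 2]), 5))

# A raw generator is exhausted after one pass, so it does not repeat.
from_generator = list(itertools.islice(restarting_cycle(x for x in [1, 2]), 5))
```

By contrast, itertools.cycle would repeat the generator's items as well, because it stores a copy of every element it has seen.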

pytext.trainers.trainer.maybe_accumulate_gradients(exit_stack, model, index, sample_size)[source]
pytext.trainers.training_state module
class pytext.trainers.training_state.TrainingState(**kwargs)[source]

Bases: object

best_model_metric = None
best_model_state = None
epoch = 0
epochs_since_last_improvement = 0
rank = 0
stage = 'Training'
step_counter = 0
tensorizers = None
Module contents
class pytext.trainers.Trainer(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]

Bases: pytext.trainers.trainer.TrainerBase

Base Trainer class that provides ways to:

  1. Train model, compute metrics against eval set and use the metrics for model selection.
  2. Test trained model, compute and publish metrics against a blind test set.
epochs

Training epochs

Type:int
early_stop_after

Stop after how many epochs when the eval metric is not improving

Type:int
max_clip_norm

Clip gradient norm if set

Type:Optional[float]
report_train_metrics

Whether metrics on training data should be computed and reported.

Type:bool
target_time_limit_seconds

Target time limit for training in seconds. If the expected time to train another epoch exceeds this limit, stop training.

Type:float
backprop(state, loss)[source]
continue_training(state: pytext.trainers.training_state.TrainingState) → bool[source]
classmethod from_config(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
load_best_model(state: pytext.trainers.training_state.TrainingState)[source]
optimizer_step(state)[source]
run_epoch(state: pytext.trainers.training_state.TrainingState, data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]
run_step(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]
save_checkpoint(state: pytext.trainers.training_state.TrainingState, train_config: pytext.config.pytext_config.PyTextConfig) → str[source]
set_up_training(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator)[source]
sparsification_step(state)[source]
test(test_iter, model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]
train(training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig, rank: int = 0) → Tuple[torch.nn.modules.module.Module, Any][source]

Train and eval a model, the model states will be modified.

Parameters:
  • train_iter (BatchIterator) – batch iterator of training data
  • eval_iter (BatchIterator) – batch iterator of evaluation data
  • model (Model) – model to be trained
  • metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file.. etc
  • train_config (PyTextConfig) – training config
  • training_result (Optional) – only meaningful for Hogwild training. default is None
  • rank (int) – only used in distributed training, the rank of the current training thread, evaluation will only be done in rank 0

Returns:the trained model together with the best metric
Return type:model, best_metric
train_from_state(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig) → Tuple[torch.nn.modules.module.Module, Any][source]

Train and eval a model from a given training state; the state will be modified. This function iterates over the epochs specified in config, and for each epoch does the following:

  1. Train model using training data, aggregate and report training results
  2. Adjust learning rate if scheduler is specified
  3. Evaluate model using evaluation data
  4. Calculate metrics based on evaluation results and select best model
Parameters:
  • training_state (TrainingState) – contains stateful information to be able to restore a training job
  • train_iter (BatchIterator) – batch iterator of training data
  • eval_iter (BatchIterator) – batch iterator of evaluation data
  • model (Model) – model to be trained
  • metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file.. etc
  • train_config (PyTextConfig) – training config
Returns:

the trained model together with the best metric

Return type:

model, best_metric
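The four per-epoch steps listed above can be pictured with a small, framework-free sketch. This is not PyText's implementation: `run_epochs_sketch` and its callable arguments (`train_one_epoch`, `evaluate`, `adjust_lr`) are hypothetical names chosen for illustration, and the sketch assumes that a higher metric is better.

```python
from typing import Callable, Optional, Tuple


def run_epochs_sketch(
    num_epochs: int,
    train_one_epoch: Callable[[int], None],
    evaluate: Callable[[int], float],
    adjust_lr: Optional[Callable[[int], None]] = None,
) -> Tuple[Optional[int], Optional[float]]:
    """Hypothetical epoch loop: returns (best_epoch, best_metric)."""
    best_epoch, best_metric = None, None
    for epoch in range(1, num_epochs + 1):
        # 1. Train on the training data for this epoch.
        train_one_epoch(epoch)
        # 2. Adjust the learning rate only if a scheduler was supplied.
        if adjust_lr is not None:
            adjust_lr(epoch)
        # 3. Evaluate on the evaluation data.
        metric = evaluate(epoch)
        # 4. Keep track of the best result (assumption: higher is better).
        if best_metric is None or metric > best_metric:
            best_epoch, best_metric = epoch, metric
    return best_epoch, best_metric
```

In a real trainer the "best model" step would also snapshot the model weights; the sketch only records which epoch produced the best metric.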

update_best_model(state: pytext.trainers.training_state.TrainingState, train_config: pytext.config.pytext_config.PyTextConfig, eval_metric)[source]
zero_grads(state)[source]
class pytext.trainers.TrainingState(**kwargs)[source]

Bases: object

best_model_metric = None
best_model_state = None
epoch = 0
epochs_since_last_improvement = 0
rank = 0
stage = 'Training'
step_counter = 0
tensorizers = None
class pytext.trainers.EnsembleTrainer(real_trainers)[source]

Bases: pytext.trainers.trainer.TrainerBase

Trainer for ensemble models

real_trainer

the actual trainer to run

Type:Trainer
classmethod from_config(config: pytext.trainers.ensemble_trainer.EnsembleTrainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
train(train_iter, eval_iter, model, *args, **kwargs)[source]
class pytext.trainers.HogwildTrainer(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]

Bases: pytext.trainers.trainer.Trainer

classmethod from_config(config: pytext.trainers.hogwild_trainer.HogwildTrainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
run_epoch(state: pytext.trainers.training_state.TrainingState, data_iter: torchtext.data.iterator.Iterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)
set_up_training(state: pytext.trainers.training_state.TrainingState, training_data)
class pytext.trainers.HogwildTrainer_Deprecated(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]

Bases: pytext.trainers.trainer.Trainer

classmethod from_config(config: pytext.trainers.hogwild_trainer.HogwildTrainer_Deprecated.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]
run_epoch(state: pytext.trainers.training_state.TrainingState, data_iter: torchtext.data.iterator.Iterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]
set_up_training(state: pytext.trainers.training_state.TrainingState, training_data)[source]
class pytext.trainers.TaskTrainer(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]

Bases: pytext.trainers.trainer.Trainer

run_step(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]

Our run_step is a bit different, because we’re wrapping the model forward call with model.train_batch, which arranges tensors and gets loss, etc.

Whenever “samples” contains more than one mini-batch (sample_size > 1), we want to accumulate gradients locally and only call all-reduce in the last backwards pass.

pytext.utils package

Submodules
pytext.utils.ascii_table module
pytext.utils.ascii_table.ascii_table(data, human_column_names=None, footer=None, indentation='', alignments=())[source]
pytext.utils.ascii_table.ascii_table_from_dict(dict, key_name, value_name, indentation='')[source]
pytext.utils.ascii_table.ordered_unique(sequence)[source]
pytext.utils.cuda module
pytext.utils.cuda.FloatTensor(*args)[source]
pytext.utils.cuda.GetTensor(tensor)[source]
pytext.utils.cuda.LongTensor(*args)[source]
pytext.utils.cuda.Variable(data, *args, **kwargs)[source]
pytext.utils.cuda.device()[source]
pytext.utils.cuda.tensor(data, dtype)[source]
pytext.utils.cuda.var_to_numpy(v)[source]
pytext.utils.cuda.zerovar(*size)[source]
pytext.utils.data module
class pytext.utils.data.ResultRow(name, metrics_dict)[source]

Bases: object

class pytext.utils.data.ResultTable(metrics, class_names, labels, preds)[source]

Bases: object

class pytext.utils.data.Slot(label: str, start: int, end: int)[source]

Bases: object

B_LABEL_PREFIX = 'B-'
I_LABEL_PREFIX = 'I-'
NO_LABEL_SLOT = 'NoLabel'
b_label_name
i_label_name
token_label(use_bio_labels, token_start, token_end)[source]
token_overlap(token_start, token_end)[source]
pytext.utils.data.align_slot_labels(token_ranges: List[Tuple[int, int]], slots_field: str, use_bio_labels: bool = False)[source]
pytext.utils.data.byte_length(text: str) → int[source]

Return the string length in terms of byte offset

pytext.utils.data.char_offset_to_byte_offset(text: str, char_offset: int) → int[source]

Convert a char offset to byte offset

pytext.utils.data.get_substring_from_offsets(text: str, start: Optional[int], end: Optional[int], byte_offset: bool = True) → str[source]

Access substring of a text using byte offset, if the switch is turned on. Otherwise return substring as the usual text[start:end]
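The difference between character offsets and byte offsets only shows up for non-ASCII text. The sketch below illustrates the idea behind the three helpers above; it is not the library code. The `*_sketch` names are hypothetical, and UTF-8 is assumed as the encoding, which the source does not state.

```python
from typing import Optional


def byte_length_sketch(text: str) -> int:
    # Number of bytes in the (assumed) UTF-8 encoding of the text.
    return len(text.encode("utf-8"))


def char_to_byte_offset_sketch(text: str, char_offset: int) -> int:
    # Bytes occupied by the characters that precede char_offset.
    return len(text[:char_offset].encode("utf-8"))


def substring_sketch(
    text: str, start: Optional[int], end: Optional[int], byte_offset: bool = True
) -> str:
    if not byte_offset:
        # Plain character slicing.
        return text[start:end]
    # Slice the encoded bytes, then decode. Offsets are assumed to fall
    # on character boundaries; otherwise decoding raises UnicodeDecodeError.
    return text.encode("utf-8")[start:end].decode("utf-8")
```

For the text "café au lait", the character "é" takes two bytes, so the word "au" starts at character offset 5 but at byte offset 6.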

pytext.utils.data.is_number(string)[source]
pytext.utils.data.merge_token_labels_by_bio(token_ranges, labels)[source]
pytext.utils.data.merge_token_labels_by_label(token_ranges, labels)[source]
pytext.utils.data.merge_token_labels_to_slot(token_ranges, labels, use_bio_label=True)[source]
pytext.utils.data.no_tokenize(s: Any) → Any[source]
pytext.utils.data.parse_and_align_slot_labels_list(token_ranges: List[Tuple[int, int]], slots_field: str, use_bio_labels: bool = False)[source]
pytext.utils.data.parse_json_array(json_text: str) → List[str][source]
pytext.utils.data.parse_slot_string(slots_field: str) → List[pytext.utils.data.Slot][source]
pytext.utils.data.parse_token(utterance: str, token_range: List[int]) → List[Tuple[str, Tuple[int, int]]][source]
pytext.utils.data.simple_tokenize(s: str) → List[str][source]
pytext.utils.data.strip_bio_prefix(label)[source]
pytext.utils.data.unkify(token: str)[source]
pytext.utils.distributed module
pytext.utils.distributed.dist_init(distributed_rank: int, world_size: int, init_method: str, device_id: int, backend: str = 'nccl', gpu_streams: int = 1)[source]

  1. After spawning one process per GPU, we want all workers to call init_process_group around the same time; otherwise it times out.
  2. After dist_init, we want all workers to start calling all_reduce/barrier around the same time; otherwise NCCL times out.

pytext.utils.distributed.force_print(*args, **kwargs)[source]
pytext.utils.distributed.get_shard_range(dataset_size: int, rank: int, world_size: int)[source]

In case dataset_size is not evenly divided by world_size, we need to pad one extra example in each shard: shard_len = dataset_size // world_size + 1

Case 1, rank < remainder: each shard start position is rank * shard_len

Case 2, rank >= remainder: without padding, each shard start position is rank * (shard_len - 1) + remainder = rank * shard_len - (rank - remainder). But to make sure all shards have the same size, we need to pad one extra example when rank >= remainder, so start_position = start_position - 1

For example, with dataset_size = 21 and world_size = 8:

  • rank 0 to 4: [0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14]
  • rank 5 to 7: [14, 15, 16], [16, 17, 18], [18, 19, 20]
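The two cases can be checked with a short sketch that follows the arithmetic in the description. The name `shard_range_sketch` is hypothetical, and returning an inclusive (start, end) pair is an assumption made for illustration; the source does not document the return value.

```python
from typing import Tuple


def shard_range_sketch(dataset_size: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return inclusive (start, end) indices of the shard for one rank."""
    remainder = dataset_size % world_size
    shard_len = dataset_size // world_size
    if remainder == 0:
        # Evenly divisible: no padding is needed.
        start = rank * shard_len
    else:
        # Pad one extra example so that every shard has the same length.
        shard_len += 1
        if rank < remainder:
            # Case 1: shards are laid out back to back.
            start = rank * shard_len
        else:
            # Case 2: unpadded start, shifted back by one so the shard
            # overlaps its predecessor by a single example.
            start = rank * (shard_len - 1) + remainder - 1
    return start, start + shard_len - 1


# Reproduces the example: dataset_size = 21, world_size = 8.
for rank in range(8):
    print(rank, shard_range_sketch(21, rank, 8))
```

Ranks 5 to 7 each reuse the last example of the previous shard, which is how all eight shards end up with three examples even though 21 is not divisible by 8.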

pytext.utils.distributed.suppress_output()[source]
pytext.utils.documentation module
pytext.utils.documentation.find_config_class(class_name)[source]

Return the set of PyText classes matching that name. Handles fully-qualified class_name including module.

pytext.utils.documentation.get_class_members_recursive(obj)[source]

Find all the field names for a given class and their default value.

pytext.utils.documentation.get_config_fields(obj)[source]

Return a dict of config help for this object, where the key is the config name and the value is a tuple (default, type, options):

  • default: default value for this key if not specified
  • type: type for this config value, as a string
  • options: possible values for this config, only if type = Union

If the type is “Union”, the options give the lists of class names that are possible, and the default is one of those class names.

pytext.utils.documentation.get_subclasses(klass, stop_classes=(<class 'pytext.models.module.Module'>, <class 'pytext.config.component.Component'>, <class 'torch.nn.modules.module.Module'>))[source]
pytext.utils.documentation.pretty_print_config_class(obj)[source]

Pretty-print the fields of one object.

pytext.utils.documentation.replace_components(root, component, base_class)[source]

Recursively look at all fields in config to find where component would fit. This is used to change configs so that they don’t use default values. Return the chain of field names, from child to parent.

pytext.utils.embeddings module
class pytext.utils.embeddings.PretrainedEmbedding(embeddings_path: str = None, lowercase_tokens: bool = True, skip_header: bool = True, delimiter: str = ' ')[source]

Bases: object

Utility class for loading/caching/initializing word embeddings

cache_pretrained_embeddings(cache_path: str) → None[source]

Cache the processed embedding vectors and vocab to a file for faster loading

initialize_embeddings_weights(str_to_idx: Dict[str, int], unk: str, embed_dim: int, init_strategy: pytext.config.field_config.EmbedInitStrategy) → torch.Tensor[source]

Initialize embeddings weights of shape (len(str_to_idx), embed_dim) from the pretrained embeddings vectors. Words that are not in the pretrained embeddings list will be initialized according to init_strategy.

Parameters:
  • str_to_idx – a dict that maps words to indices that the model expects
  • unk – unknown token
  • embed_dim – the embeddings dimension
  • init_strategy – method of initializing new tokens
Returns:

a float tensor of dimension (vocab_size, embed_dim)

load_cached_embeddings(cache_path: str) → None[source]

Load cached embeddings from file

load_pretrained_embeddings(raw_embeddings_path: str, append: bool = False, dialect: str = None, lowercase_tokens: bool = True, skip_header: bool = True, delimiter: str = ' ') → None[source]

Loading raw embeddings vectors from file in the format:

    num_words dim
    word_i v0, v1, v2, …., v_dim
    word_2 v0, v1, v2, …., v_dim
    ….

Optionally appends _dialect to every token in the vocabulary (for XLU embeddings).
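As a rough illustration of the file layout described above, the following sketch parses a header line followed by one word and its vector per line. It is not the PyText loader: `read_embeddings_sketch` is a hypothetical helper, values are assumed to be separated by the delimiter alone (no commas), and dialect handling is left out.

```python
import io
from typing import Dict, List, TextIO


def read_embeddings_sketch(
    handle: TextIO,
    skip_header: bool = True,
    delimiter: str = " ",
    lowercase_tokens: bool = True,
) -> Dict[str, List[float]]:
    """Parse 'word v0 v1 ... v_dim' lines into a word -> vector dict."""
    vectors: Dict[str, List[float]] = {}
    for line_number, line in enumerate(handle):
        if skip_header and line_number == 0:
            # The first line holds "num_words dim", not an embedding.
            continue
        parts = line.strip().split(delimiter)
        if len(parts) < 2:
            # Skip blank or malformed lines.
            continue
        word = parts[0].lower() if lowercase_tokens else parts[0]
        # Ignore empty fields produced by repeated delimiters.
        vectors[word] = [float(value) for value in parts[1:] if value]
    return vectors


sample = io.StringIO("2 3\nHello 0.1 0.2 0.3\nworld 1 2 3\n")
print(read_embeddings_sketch(sample))
```

The parameter names mirror those in the signature above so the correspondence is easy to follow, but the behavior of the real method may differ in details such as error handling.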

pytext.utils.embeddings.append_dialect(word: str, dialect: str) → str[source]
pytext.utils.file_io module

TODO: @stevenliu Deprecate this file after borc available in PyPI

class pytext.utils.file_io.PathManager[source]

Bases: object

static copy(*args, **kwargs) → bool[source]
static exists(path: str) → bool[source]
static get_local_path(path: str) → str[source]
static isdir(path: str) → bool[source]
static isfile(path: str) → bool[source]
static ls(path: str) → List[str][source]
static mkdirs(*args, **kwargs)[source]
static open(*args, **kwargs)[source]
static rm(*args, **kwargs)[source]
pytext.utils.label module
pytext.utils.label.get_label_weights(vocab_dict: Dict[str, int], label_weights: Dict[str, float])[source]
pytext.utils.lazy module
class pytext.utils.lazy.Infer(resolve_fn)[source]

Bases: object

A value which can be inferred from a forward pass. Infer objects should be passed as arguments or keyword arguments to Lazy objects; see Lazy documentation for more details.

classmethod dimension(dim)[source]

A helper for creating Infer arguments looking at specific dimensions.

class pytext.utils.lazy.Lazy(module_class, *args, **kwargs)[source]

Bases: torch.nn.modules.module.Module

A module which is able to infer some of its parameters from the inputs to its first forward pass. Lazy wraps any other nn.Module, and arguments can be passed that will be used to construct that wrapped Module after the first forward pass. If any of these arguments are Infer objects, those arguments will be replaced by calling the callback of the Infer object on the forward pass input.

For instance,

>>> Lazy(nn.Linear, Infer(lambda input: input.size(-1)), 4)
Lazy()

takes its in_features dimension from the last dimension of the input to its forward pass. This can be simplified to

>>> Lazy(nn.Linear, Infer.dimension(-1), 4)

or a partial can be created, for instance

>>> LazyLinear = Lazy.partial(nn.Linear, Infer.dimension(-1))
>>> LazyLinear(4)
Lazy()

Finally, these Lazy objects explicitly forbid treating themselves normally; they must instead be replaced by calling init_lazy_modules on your model before training. For instance,

>>> ll = lazy.Linear(4)
>>> seq = nn.Sequential(ll)
>>> seq
Sequential(
    0: Lazy(),
)
>>> init_lazy_modules(seq, torch.rand(1, 2))
Sequential(
    0: Linear(in_features=2, out_features=4, bias=True)
)
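The deferred-construction idea behind Lazy and Infer can be sketched without PyTorch. The classes below (`InferSketch`, `LazySketch`, `ToyLinear`) are hypothetical stand-ins written for illustration, not the PyText classes. Unlike the real Lazy, which must be replaced through init_lazy_modules, this sketch simply caches the constructed object after the first call.

```python
class InferSketch:
    """Marks an argument whose value is computed from the first input."""

    def __init__(self, resolve_fn):
        self.resolve_fn = resolve_fn


class LazySketch:
    """Delays calling `factory` until the first input is seen."""

    def __init__(self, factory, *args, **kwargs):
        self._factory = factory
        self._args = args
        self._kwargs = kwargs
        self._built = None

    @staticmethod
    def _resolve_arg(arg, value):
        # Replace InferSketch placeholders by calling their callback.
        return arg.resolve_fn(value) if isinstance(arg, InferSketch) else arg

    def __call__(self, value):
        if self._built is None:
            args = [self._resolve_arg(a, value) for a in self._args]
            kwargs = {k: self._resolve_arg(v, value) for k, v in self._kwargs.items()}
            self._built = self._factory(*args, **kwargs)
        return self._built(value)

    def resolve(self):
        if self._built is None:
            raise RuntimeError("call the wrapper once before resolve()")
        return self._built


class ToyLinear:
    """Toy layer: checks the input size and repeats the sum of its inputs."""

    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features

    def __call__(self, xs):
        assert len(xs) == self.in_features
        return [sum(xs)] * self.out_features


layer = LazySketch(ToyLinear, InferSketch(len), 4)
layer([1.0, 2.0])                   # first call infers in_features = 2
print(layer.resolve().in_features)
```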
forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod partial(module_class, *args, **kwargs)[source]
resolve()[source]

Must make a call to forward before calling this function; returns the full nn.Module object constructed using inferred arguments/dimensions.

exception pytext.utils.lazy.UninitializedLazyModuleError[source]

Bases: Exception

A lazy module was used improperly.

pytext.utils.lazy.init_lazy_modules(module: torch.nn.modules.module.Module, dummy_input: Tuple[torch.Tensor, ...]) → torch.nn.modules.module.Module[source]

Finalize an nn.Module which has Lazy components. This will both mutate internal modules which have Lazy elements, and return a new non-lazy nn.Module (in case the top-level module itself is Lazy).

Parameters:
  • module – An nn.Module which may be lazy or contain Lazy subcomponents
  • dummy_input – module is called with this input to ensure that Lazy subcomponents have been able to infer any parameters they need
Returns:

The full nn.Module object constructed using inferred arguments/dimensions.

class pytext.utils.lazy.lazy_property(fget)[source]

Bases: object

More or less copy-pasta: http://stackoverflow.com/a/6849299 Meant to be used for lazy evaluation of an object attribute. property should represent non-mutable data, as it replaces itself.
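A property that "replaces itself" is usually written as a non-data descriptor that stores the computed value on the instance under the same name. The sketch below shows that general pattern; `lazy_property_sketch` and `Vocab` are hypothetical names, and the code is an illustration of the technique rather than a copy of the PyText class.

```python
class lazy_property_sketch:
    """Compute an attribute on first access, then cache it on the instance."""

    def __init__(self, fget):
        self.fget = fget
        self.name = fget.__name__

    def __get__(self, obj, owner=None):
        if obj is None:
            # Accessed on the class rather than on an instance.
            return self
        value = self.fget(obj)
        # The instance attribute now shadows this descriptor, so later
        # lookups never reach __get__ again.
        setattr(obj, self.name, value)
        return value


class Vocab:
    def __init__(self, tokens):
        self.tokens = tokens
        self.builds = 0

    @lazy_property_sketch
    def index(self):
        self.builds += 1
        return {token: i for i, token in enumerate(self.tokens)}


vocab = Vocab(["a", "b"])
vocab.index
vocab.index
print(vocab.builds)  # the index is built only once
```

Because the cached value is never recomputed, the pattern only suits data that does not change after first access, which matches the note above about non-mutable data.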

pytext.utils.lazy.replace_lazy_modules(module)[source]
pytext.utils.loss module
class pytext.utils.loss.LagrangeMultiplier[source]

Bases: torch.autograd.function.Function

static backward(ctx, grad_output)[source]

Defines a formula for differentiating the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many arguments as forward() returned outputs, and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, input)[source]

Performs the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

The context can be used to store tensors that can be then retrieved during the backward pass.

pytext.utils.loss.build_class_priors(labels, class_priors=None, weights=None, positive_pseudocount=1.0, negative_pseudocount=1.0)[source]

Build class priors, if necessary. For each class, the class priors are estimated as (P + sum_i w_i y_i) / (P + N + sum_i w_i), where y_i is the ith label, w_i is the ith weight, P is a pseudo-count of positive labels, and N is a pseudo-count of negative labels.

Parameters:
  • labels – A Tensor with shape [batch_size, num_classes]. Entries should be in [0, 1].
  • class_priors – None, or a floating point Tensor of shape [C] containing the prior probability of each class (i.e. the fraction of the training data consisting of positive examples). If None, the class priors are computed from targets with a moving average.
  • weights – Tensor of shape broadcastable to labels, [N, 1] or [N, C], where N = batch_size, C = num_classes
  • positive_pseudocount – Number of positive labels used to initialize the class priors.
  • negative_pseudocount – Number of negative labels used to initialize the class priors.
Returns:

A Tensor of shape [num_classes] consisting of the weighted class priors, after updating with moving average ops if created.

Return type:

class_priors
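The prior formula (P + sum_i w_i y_i) / (P + N + sum_i w_i) can be worked through with plain Python lists. This is a sketch of the arithmetic only: `class_priors_sketch` is a hypothetical helper, it assumes one weight per example (the [N, 1] case), and it leaves out tensors and moving averages.

```python
from typing import List, Optional, Sequence


def class_priors_sketch(
    labels: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
    positive_pseudocount: float = 1.0,
    negative_pseudocount: float = 1.0,
) -> List[float]:
    """Estimate one prior per class from [batch_size, num_classes] labels."""
    if weights is None:
        weights = [1.0] * len(labels)  # unweighted examples
    total_weight = sum(weights)
    num_classes = len(labels[0])
    priors = []
    for c in range(num_classes):
        # Weighted count of positive labels for class c.
        weighted_positives = sum(w * row[c] for w, row in zip(weights, labels))
        priors.append(
            (positive_pseudocount + weighted_positives)
            / (positive_pseudocount + negative_pseudocount + total_weight)
        )
    return priors


labels = [[1, 0], [0, 0], [1, 1], [1, 0]]
print(class_priors_sketch(labels))
```

With four unweighted examples and both pseudo-counts equal to 1, the first class has three positives, giving (1 + 3) / (1 + 1 + 4) = 2/3, and the second has one, giving (1 + 1) / 6 = 1/3.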

pytext.utils.loss.false_postives_upper_bound(labels, logits, weights)[source]

false_positives_upper_bound defined in paper: “Scalable Learning of Non-Decomposable Objectives”

Parameters:
  • labels – A Tensor of shape broadcastable to logits.
  • logits – A Tensor of shape [N, C] or [N, C, K]. If the third dimension is present, the upper bound is computed on each slice [:, :, k] independently.
  • weights – Per-example loss coefficients, with shape broadcast-compatible with that of labels. i.e. [N, 1] or [N, C]
Returns:

A Tensor of shape [C] or [C, K].

pytext.utils.loss.lagrange_multiplier(x)[source]
pytext.utils.loss.range_to_anchors_and_delta(precision_range, num_anchors)[source]

Calculates anchor points from precision range.

Parameters:
  • precision_range – an interval (a, b), where 0.0 <= a <= b <= 1.0
  • num_anchors – int, number of equally spaced anchor points.
Returns:

A Tensor of [num_anchors] equally spaced values in the interval precision_range.

delta: The spacing between the values in precision_values.

Return type:

precision_values

Raises:

ValueError – If precision_range is invalid.
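The relationship between the anchors and delta can be illustrated with plain floats. The source does not say whether the endpoints of the interval are included; this sketch assumes the anchors are interior points, so that the first anchor sits one delta above the lower bound. `anchors_and_delta_sketch` is a hypothetical name, and a list is returned instead of a Tensor.

```python
from typing import List, Tuple


def anchors_and_delta_sketch(
    precision_range: Tuple[float, float], num_anchors: int
) -> Tuple[List[float], float]:
    """Return (precision_values, delta) for equally spaced anchor points."""
    low, high = precision_range
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError(f"invalid precision_range: {precision_range}")
    # Assumption: anchors exclude both endpoints of the interval.
    delta = (high - low) / (num_anchors + 1)
    precision_values = [low + delta * (i + 1) for i in range(num_anchors)]
    return precision_values, delta


values, delta = anchors_and_delta_sketch((0.0, 1.0), 3)
print(values, delta)
```

Under that assumption, three anchors on the interval (0.0, 1.0) are 0.25, 0.5 and 0.75, with delta equal to 0.25.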

pytext.utils.loss.true_positives_lower_bound(labels, logits, weights)[source]

true_positives_lower_bound defined in paper: “Scalable Learning of Non-Decomposable Objectives”

Parameters:
  • labels – A Tensor of shape broadcastable to logits.
  • logits – A Tensor of shape [N, C] or [N, C, K]. If the third dimension is present, the lower bound is computed on each slice [:, :, k] independently.
  • weights – Per-example loss coefficients, with shape [N, 1] or [N, C]
Returns:

A Tensor of shape [C] or [C, K].

pytext.utils.loss.weighted_hinge_loss(labels, logits, positive_weights=1.0, negative_weights=1.0)[source]
Parameters:
  • labels – one-hot representation Tensor of shape broadcastable to logits
  • logits – A Tensor of shape [N, C] or [N, C, K]
  • positive_weights – Scalar or Tensor
  • negative_weights – same shape as positive_weights
Returns:

3D Tensor of shape [N, C, K], where K is length of positive weights or 2D Tensor of shape [N, C]

pytext.utils.meter module
class pytext.utils.meter.Meter[source]

Bases: object

avg
reset()[source]
update(val=1)[source]
class pytext.utils.meter.TimeMeter[source]

Bases: pytext.utils.meter.Meter

Computes the average occurrence of some event per second

avg
elapsed_time
reset()[source]
update(val=1)[source]
pytext.utils.mobile_onnx module
pytext.utils.mobile_onnx.add_feats_numericalize_ops(init_net, predict_net, vocab_map, input_names)[source]
pytext.utils.mobile_onnx.create_context(init_net)[source]
pytext.utils.mobile_onnx.create_vocab_index(vocab_list, net, net_workspace, index_name)[source]
pytext.utils.mobile_onnx.create_vocab_indices_map(init_net, vocab_map)[source]
pytext.utils.mobile_onnx.get_numericalize_net(init_net, predict_net, vocab_map, input_names)[source]
pytext.utils.mobile_onnx.pytorch_to_caffe2(model, export_input, external_input_names, output_names, export_path, export_onnx_path=None)[source]
pytext.utils.model module
pytext.utils.model.get_mismatched_param(models: Iterable[torch.nn.modules.module.Module], rel_epsilon: Optional[float] = None, abs_epsilon: Optional[float] = None) → str[source]

Return the name of the first mismatched parameter. Return an empty string if all the parameters of the modules are identical.

pytext.utils.model.to_onehot(feat: pytext.utils.cuda.Variable, size: int) → pytext.utils.cuda.Variable[source]

Transform features into one-hot vectors

pytext.utils.onnx module
pytext.utils.onnx.add_feats_numericalize_ops(c2_prepared, vocab_map, input_names)[source]
pytext.utils.onnx.convert_caffe2_blob_name(blob_name)[source]
pytext.utils.onnx.create_vocab_index(vocab_list, net, net_workspace, index_name)[source]
pytext.utils.onnx.create_vocab_indices_map(c2_prepared, init_net, vocab_map)[source]
pytext.utils.onnx.export_nets_to_predictor_file(c2_prepared, input_names, output_names, predictor_path, extra_params=None)[source]
pytext.utils.onnx.get_numericalize_net(c2_prepared, vocab_map, input_names)[source]
pytext.utils.onnx.pytorch_to_caffe2(model, export_input, external_input_names, output_names, export_path, export_onnx_path=None)[source]
pytext.utils.path module
pytext.utils.path.get_absolute_path(file_path: str) → str[source]
pytext.utils.path.get_pytext_home()[source]
pytext.utils.precision module
pytext.utils.precision.delay_unscale()[source]
pytext.utils.precision.maybe_float(tensor)[source]
pytext.utils.precision.maybe_half(tensor)[source]
pytext.utils.precision.pad_length(n)[source]
pytext.utils.precision.set_fp16(fp16_enabled: bool)[source]
pytext.utils.tensor module
pytext.utils.test module
pytext.utils.test.import_tests_module(packages_to_scan=None)[source]
pytext.utils.timing module
class pytext.utils.timing.HierarchicalTimer[source]

Bases: object

pop()[source]
push(label, caller_id)[source]
snapshot()[source]
time(label)[source]
class pytext.utils.timing.Snapshot[source]

Bases: object

report(report_pep=False)[source]
class pytext.utils.timing.SnapshotList[source]

Bases: list

lists are not weakref-able by default.

class pytext.utils.timing.Timings(sum: float = 0.0, count: int = 0, max: float = -inf, times: List[T] = None)[source]

Bases: object

add(time)[source]
average
p50
p90
p99
pytext.utils.timing.format_time(seconds)[source]
pytext.utils.timing.report_snapshot(fn)[source]
pytext.utils.torch module
class pytext.utils.torch.CPUOnlyParameter(*args, **kwargs)[source]

Bases: torch.nn.parameter.Parameter

cuda(device=None, non_blocking=False) → Tensor[source]

Returns a copy of this object in CUDA memory.

If this object is already in CUDA memory and on the correct device, then no copy is performed and the original object is returned.

Parameters:
  • device (torch.device) – The destination GPU device. Defaults to the current CUDA device.
  • non_blocking (bool) – If True and the source is in pinned memory, the copy will be asynchronous with respect to the host. Otherwise, the argument has no effect. Default: False.
Module contents
pytext.utils.cls_vars(cls)[source]
pytext.utils.set_random_seeds(seed, use_deterministic_cudnn)[source]

Submodules

pytext.builtin_task module

pytext.builtin_task.add_include(path)[source]

Import tasks (and associated components) from the folder name.

pytext.builtin_task.register_builtin_tasks()[source]

pytext.main module

class pytext.main.Attrs[source]

Bases: object

pytext.main.gen_config_impl(task_name, *args, **kwargs)[source]
pytext.main.run_single(rank: int, config_json: str, world_size: int, dist_init_method: Optional[str], metadata: Union[Dict[str, pytext.data.data_handler.CommonMetadata], pytext.data.data_handler.CommonMetadata, None], metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]])[source]
pytext.main.train_model_distributed(config, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]])[source]

pytext.workflow module

pytext.workflow.batch_predict(model_file: str, examples: List[Dict[str, Any]])[source]
pytext.workflow.dict_zip(*dicts, value_only=False)[source]
pytext.workflow.export_saved_model_to_caffe2(saved_model_path: str, export_caffe2_path: str, output_onnx_path: str = None) → None[source]
pytext.workflow.export_saved_model_to_torchscript(saved_model_path: str, path: str, quantize: bool = False) → None[source]
pytext.workflow.get_logits(snapshot_path: str, use_cuda_if_available: bool, output_path: Optional[str] = None, test_path: Optional[str] = None, field_names: Optional[List[str]] = None, dump_raw_input: bool = False)[source]
pytext.workflow.prepare_task(config: pytext.config.pytext_config.PyTextConfig, dist_init_url: str = None, device_id: int = 0, rank: int = 0, world_size: int = 1, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None, metadata: pytext.data.data_handler.CommonMetadata = None) → Tuple[pytext.task.task.Task_Deprecated, pytext.trainers.training_state.TrainingState][source]
pytext.workflow.prepare_task_metadata(config: pytext.config.pytext_config.PyTextConfig) → pytext.data.data_handler.CommonMetadata[source]

Loading the whole dataset into CPU memory in every single process could cause OOMs for data-parallel distributed training. To avoid this, we move the operations that require loading the whole dataset out of spawn, and pass the context to every single process.

pytext.workflow.save_and_export(config: pytext.config.pytext_config.PyTextConfig, task: pytext.task.task.Task_Deprecated, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None) → None[source]
pytext.workflow.test_model(test_config: pytext.config.pytext_config.TestConfig, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]], test_out_path: str) → Any[source]
pytext.workflow.test_model_from_snapshot_path(snapshot_path: str, use_cuda_if_available: bool, test_path: Optional[str] = None, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None, test_out_path: str = '', field_names: Optional[List[str]] = None)[source]
pytext.workflow.train_model(config: pytext.config.pytext_config.PyTextConfig, dist_init_url: str = None, device_id: int = 0, rank: int = 0, world_size: int = 1, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None, metadata: pytext.data.data_handler.CommonMetadata = None) → Tuple[source]

Module contents

pytext.batch_predict_caffe2_model(pytext_model_file: str, caffe2_model_file: str, db_type: str = 'minidb', data_source: Optional[pytext.data.sources.data_source.DataSource] = None, use_cuda=False, task: Optional[pytext.task.new_task.NewTask] = None, train_config: Optional[pytext.config.pytext_config.PyTextConfig] = None)[source]
pytext.create_predictor(config: pytext.config.pytext_config.PyTextConfig, model_file: Optional[str] = None, db_type: str = 'minidb', task: Optional[pytext.task.new_task.NewTask] = None) → Callable[[Mapping[str, str]], Mapping[str, numpy.array]][source]

Create a simple prediction API from a training config and an exported caffe2 model file. This model file should be created by calling export on a trained model snapshot.

pytext.load_config(filename: str) → pytext.config.pytext_config.PyTextConfig[source]

Load a PyText configuration file from a file path. See pytext.config.pytext_config for more info on configs.

Indices and tables