PyText Documentation¶
PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces and abstractions for model components, and by using PyTorch’s capabilities of exporting models for inference via the optimized Caffe2 execution engine. We use PyText at Facebook to iterate quickly on new modeling ideas and then seamlessly ship them at scale.
Core PyText Features:
- Production ready models for various NLP/NLU tasks:
- Text classifiers
- Sequence taggers
- Joint intent-slot model
- Contextual intent-slot models
- Extensible components that allow easy creation of new models and tasks
- Ensemble training support
- Distributed-training support (using the new C10d backend in PyTorch 1.0)
- Reference implementation and a pre-trained model for the paper: Gupta et al. (2018): Semantic Parsing for Task Oriented Dialog using Hierarchical Representations
How To Use¶
Please follow the tutorial series in Getting Started to get a sense of how to train a basic model and deploy to production.
After that, you can explore more options of built-in models and training methods in Training More Advanced Models.
If you want to use PyText as a library and build your own models, please check the tutorial in Extending PyText.
Note
All the demo configs and test data for the tutorials can be found in the source code. You can either install PyText from source or download the files manually from GitHub.
Installation¶
PyText requires Python 3.6+
PyText is available in the Python Package Index via
$ pip install pytext-nlp
The easiest way to get started on most systems is to create a virtualenv
$ python3 -m venv pytext_venv
$ source pytext_venv/bin/activate
(pytext_venv) $ pip install pytext-nlp
This will install a version of PyTorch depending on your system. See PyTorch for more information. If you are using macOS or Windows, this likely will not include GPU support by default; if you are using Linux, you should automatically get a version of PyTorch compatible with CUDA 9.0.
If you need a different version of PyTorch, follow the instructions on the PyTorch website to install the appropriate version of PyTorch before installing PyText.
OS Dependencies¶
If you’re having issues getting things to run, these guides might help.
On Windows¶
Coming Soon!
On Linux¶
For Ubuntu/Debian distros, you might need to run the following command:
$ sudo apt-get install protobuf-compiler libprotoc-dev
For rpm-based distros, you might need to run the following command:
$ sudo yum install protobuf-devel
Install From Source¶
$ git clone git@github.com:facebookresearch/pytext.git
$ cd pytext
$ source activation_venv
(pytext_venv) $ pip install torch # go to https://pytorch.org for platform specific installs
(pytext_venv) $ ./install_deps
Once that is installed, you can run the unit tests. We recommend using pytest as a runner.
(pytext_venv) $ pip install -U pytest
(pytext_venv) $ pytest
# If you want to measure test coverage, we recommend `pytest-cov`
(pytext_venv) $ pip install -U pytest-cov
(pytext_venv) $ pytest --cov=pytext
To resume development in an already checked-out repo:
$ cd pytext
$ source activation_venv
To exit the virtual environment:
(pytext_venv) $ deactivate
Cloud VM Setup¶
This guide will cover all the setup work you have to do in order to be able to easily install PyText on a cloud VM. Note that while these instructions worked when they were written, they may become incorrect or out of date. If they do, please send us a Pull Request!
After following these instructions, you should be good to either follow the Installation instructions or the Install From Source instructions
Amazon Web Services¶
Coming Soon
Google Cloud Engine¶
If you have problems launching your VM, make sure you have a non-zero GPU quota; see the GCE documentation on quotas to learn more.
This guide uses Google’s Deep Learning VM as a base.
Setting Up Your VM
- Click “Launch on Compute Engine”
- Configure the VM:
- The default 2-CPU K80 setup is fine for most tutorials; if you need more, change it here.
- For Framework, select one of the Base images, rather than one with a framework pre-installed. Note which version of CUDA you choose for later.
- When you’re ready, click “Deploy”
- When your VM is done loading, you can SSH into it from the GCE Console
- Install Python 3.6 (based on this RoseHosting blog post):
$ sudo nano /etc/apt/sources.list
- Add the line deb http://ftp.de.debian.org/debian testing main to the list, then run:
$ echo 'APT::Default-Release "stable";' | sudo tee -a /etc/apt/apt.conf.d/00local
$ sudo apt-get update
$ sudo apt-get -t testing install python3.6
$ sudo apt-get install python3.6-venv protobuf-compiler libprotoc-dev
Microsoft Azure¶
This guide uses the Azure Ubuntu Server 18.04 LTS image as a base
Setting Up Your VM
- From the Azure Dashboard, select “Virtual Machines” and then click “add”
- Give your VM a name and select the region you want it in, keeping in mind that GPU servers are not present in all regions
- For this tutorial, you should select “Ubuntu Server 18.04 LTS” as your image
- Click “Change size” in order to select a GPU server.
- Note that the default filters won’t show GPU servers; we recommend clearing all filters except “family” and setting “family” to GPU
- For this tutorial, we will use the NC6 VM Size, but this should work on the larger and faster VMs as well
- Make sure you set up SSH access; we recommend using a public key rather than a password.
- Don’t forget to “allow selected ports” and select SSH
- Install the Nvidia driver and CUDA (based on https://askubuntu.com/a/1036265):
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
- Reboot:
sudo shutdown -r now
- Install the CUDA toolkit:
sudo apt install nvidia-cuda-toolkit gcc-6
- Install OS dependencies:
sudo apt-get install python3-venv protobuf-compiler libprotoc-dev
Train your first model¶
Once you’ve installed PyText you can start training your first model!
This tutorial series is an overview of using PyText, and will cover the main concepts PyText uses to interact with the world. It won’t deal with modifying the code (e.g. hacking on new model architectures). By the end, you should have a high-quality text classification model that can be used in production.
You can use PyText as a library either in your own scripts or in a Jupyter notebook, but the fastest way to start training is through the PyText command line tool. This tool will automatically be in your path when you install PyText!
(pytext) $ pytext
Usage: pytext [OPTIONS] COMMAND [ARGS]...
Configs can be passed by file or directly from json. If neither --config-
file or --config-json is passed, attempts to read the file from stdin.
Example:
pytext train < demo/configs/docnn.json
Options:
--config-file TEXT
--config-json TEXT
--help Show this message and exit.
Commands:
export Convert a pytext model snapshot to a caffe2 model.
predict Start a repl executing examples against a caffe2 model.
test Test a trained model snapshot.
train Train a model and save the best snapshot.
Background¶
Fundamentally, “machine learning” means learning a function automatically. Your training, evaluation, and test datasets are examples of inputs and their corresponding outputs which show how that function behaves. A model is an implementation of that function. To train a model means to make a statistical implementation of that function that uses the training data as a rubric. To predict using a model means to take a trained implementation and apply it to new inputs, thus predicting what the result of the idealized function would be on those inputs.
More examples to train on usually corresponds to more accurate and better-generalizing models. This can mean thousands to millions or billions of examples depending on the task (function) you’re trying to learn.
PyText Configs¶
Training a state-of-the-art PyText model on a dataset is primarily about configuration. Picking your training dataset, your model parameters, your training parameters, and so on, is a central part of building high-quality text models.
Configuration is a central part of every component within PyText, and the config system that we provide allows for all of these configurations to be easily expressible in JSON format. PyText comes with a number of built-in example configurations that can train built-in models, and we have a system for automatically documenting the default configurations and possible configuration values.
PyText Modes¶
- train - Using a configuration, initialize a model and train it. Save the best model found as a model snapshot. This snapshot is something that can be loaded back in to PyText and trained further, tested, or exported.
- test - Load a trained model snapshot and evaluate its performance against a test set.
- export - Save the model as a serialized Caffe2 model, which is a stable model representation that can be loaded in production. (PyTorch model snapshots aren’t very durable; if you update parts of your runtime environment, they may be invalidated).
- predict - Provide a simple REPL which lets you run inputs through your exported Caffe2 model and get a tangible sense for how your model will behave.
Train your first model¶
To get our feet wet, let’s run one of the demo configurations included with PyText.
(pytext) $ cat demo/configs/docnn.json
{
  "version": 8,
  "task": {
    "DocumentClassificationTask": {
      "data": {
        "source": {
          "TSVDataSource": {
            "field_names": ["label", "slots", "text"],
            "train_filename": "tests/data/train_data_tiny.tsv",
            "test_filename": "tests/data/test_data_tiny.tsv",
            "eval_filename": "tests/data/test_data_tiny.tsv"
          }
        }
      },
      "model": {
        "DocModel": {
          "representation": {
            "DocNNRepresentation": {}
          }
        }
      }
    }
  }
}
This config will train a document classification model (DocNN) to detect the “class” of a series of commands given to a smart assistant. Let’s take a quick look at the dataset:
(pytext) $ head -2 tests/data/train_data_tiny.tsv
alarm/modify_alarm 16:24:datetime,39:57:datetime change my alarm tomorrow to wake me up 30 minutes earlier
alarm/set_alarm Turn on all my alarms
(pytext) $ wc -l tests/data/train_data_tiny.tsv
10 tests/data/train_data_tiny.tsv
As you can see, the dataset is quite small, so don’t get your hopes up on accuracy! We included this dataset for running unit tests against our models. PyText uses data in a tab-separated format, as specified in the config by TSVDataSource. The order of the columns can be configured, but here we use the default. The first column is the “class”, the output label that we’re trying to predict. The second column is word-level tags, which we’re not trying to predict yet, so ignore them for now. The last column is the input text, which is the command whose class (the first column) the model tries to predict.
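To make the column layout concrete, here is a minimal Python sketch (not PyText's actual TSVDataSource) that reads the demo file and maps each row onto the configured field_names:

import csv

# A minimal illustration: map each row of the demo TSV file onto the
# field_names configured above: ["label", "slots", "text"].
field_names = ["label", "slots", "text"]

with open("tests/data/train_data_tiny.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        example = dict(zip(field_names, row))
        print(example.get("label"), "->", example.get("text"))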
Let’s train the model!
(pytext) $ pytext train < demo/configs/docnn.json
... [snip]
Stage.TEST
Epoch:1
loss: 1.646484
Accuracy: 50.00
Soft Metrics:
+--------------------------+-------------------+---------+
| Label | Average precision | ROC AUC |
+--------------------------+-------------------+---------+
| alarm/modify_alarm | nan | 0.000 |
| alarm/set_alarm | 1.000 | 1.000 |
| alarm/snooze_alarm | nan | 0.000 |
| alarm/time_left_on_alarm | 0.333 | 0.333 |
| reminder/set_reminder | 1.000 | 1.000 |
| reminder/show_reminders | nan | 0.000 |
| weather/find | nan | 0.000 |
+--------------------------+-------------------+---------+
Recall at Precision
+--------------------------+---------+---------+---------+---------+---------+
| Label | R@P 0.2 | R@P 0.4 | R@P 0.6 | R@P 0.8 | R@P 0.9 |
+--------------------------+---------+---------+---------+---------+---------+
| alarm/modify_alarm | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| alarm/set_alarm | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| alarm/snooze_alarm | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| alarm/time_left_on_alarm | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| reminder/set_reminder | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| reminder/show_reminders | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| weather/find | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
+--------------------------+---------+---------+---------+---------+---------+
saving result to file /tmp/test_out.txt
The model ran over the training set 10 times. This output is the result of evaluating the model on the test set, and tracking how well it did. If you’re not familiar with these accuracy measurements (a small worked example follows this list):
- Precision - Of the times the model predicted this label, the fraction that were correct.
- Recall - The fraction of this label’s occurrences in the test set that the model correctly identified. If this number is low for a label, the model should be predicting this label more.
- F1 - A harmonic mean of recall and precision.
- Support - The number of times this label shows up in the test set.
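For a quick worked example with hypothetical counts for a single label:

# Hypothetical counts for one label, just to show how the metrics relate.
true_positives = 8    # model predicted the label and was right
false_positives = 2   # model predicted the label but was wrong
false_negatives = 4   # label appeared in the test set but the model missed it

precision = true_positives / (true_positives + false_positives)   # 0.80
recall = true_positives / (true_positives + false_negatives)      # ~0.67
f1 = 2 * precision * recall / (precision + recall)                # ~0.73
support = true_positives + false_negatives                        # 12

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} support={support}")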
As you can see, the training results were pretty bad. We ran over the data 10 times, and in that time managed to learn how to predict only one of the labels in the test set successfully. In fact, many of the labels were never predicted at all! With 10 examples, that’s not too surprising. See the next tutorial to run on a real dataset and get more usable results.
Execute your first model¶
In Train your first model, we learned how to train a small, simple model. We can continue this tutorial with that model here. This procedure can be used for any PyText model by supplying the matching config. For example, the much more powerful model from Train Intent-Slot model on ATIS Dataset can be executed using this same procedure.
Evaluate the model¶
We want to run the model on our test dataset and see how well it performs. Some results have been abbreviated for clarity.
(pytext) $ pytext test < demo/configs/docnn.json
Stage.TEST
loss: 2.059336
Accuracy: 20.00
Macro P/R/F1 Scores:
Label Precision Recall F1 Support
reminder/set_reminder 25.00 100.00 40.00 1
alarm/time_left_on_alarm 0.00 0.00 0.00 1
alarm/show_alarms 0.00 0.00 0.00 1
alarm/set_alarm 0.00 0.00 0.00 2
Overall macro scores 6.25 25.00 10.00
Soft Metrics:
Label Average precision
alarm/set_alarm 50.00
alarm/time_left_on_alarm 20.00
reminder/set_reminder 25.00
alarm/show_alarms 20.00
weather/find nan
alarm/modify_alarm nan
alarm/snooze_alarm nan
reminder/show_reminders nan
Label Recall at precision 0.2
alarm/set_alarm 100.00
Label Recall at precision 0.4
alarm/set_alarm 100.00
Label Recall at precision 0.6
alarm/set_alarm 0.00
Label Recall at precision 0.8
alarm/set_alarm 0.00
Label Recall at precision 0.9
alarm/set_alarm 0.00
Label Recall at precision 0.2
alarm/time_left_on_alarm 100.00
Label Recall at precision 0.4
alarm/time_left_on_alarm 0.00
Label Recall at precision 0.6
alarm/time_left_on_alarm 0.00
... [snip]
reminder/show_reminders 0.00
Label Recall at precision 0.6
reminder/show_reminders 0.00
Label Recall at precision 0.8
reminder/show_reminders 0.00
Label Recall at precision 0.9
reminder/show_reminders 0.00
Export the model¶
When you save a PyTorch model, the snapshot uses pickle for serialization. This means that simple code changes (e.g. a word embedding update) can cause backward incompatibilities with your deployed model. To combat this, you can export your model into the Caffe2 format using the built-in ONNX integration. The exported Caffe2 model will have the same behavior regardless of changes in PyText or in your development code.
Exporting a model is pretty simple:
(pytext) $ pytext export --help
Usage: pytext export [OPTIONS]
Convert a pytext model snapshot to a caffe2 model.
Options:
--model TEXT the pytext snapshot model file to load
--output-path TEXT where to save the exported model
--help Show this message and exit.
You can also pass in a configuration to infer some of these options. In this case let’s do that because depending on how you’re following along your snapshot might be in different places!
(pytext) $ pytext export --output-path exported_model.c2 < demo/configs/docnn.json
...[snip]
Saving caffe2 model to: exported_model.c2
This file now contains all of the information needed to run your model.
There’s an important distinction between what a model does and what happens before/after the model is called, i.e. the preprocessing and postprocessing steps. PyText strives to do as little preprocessing as possible, but one step that is very often needed is tokenization of the input text. This will happen automatically with our prediction interface, and if this behavior ever changes, we’ll make sure that old models are still supported. The model file you export will always work, and you don’t necessarily need PyText to use it! Depending on your use case you can implement preprocessing yourself and call the model directly, but that’s outside the scope of this tutorial.
Make a simple app¶
Let’s put this all into practice! How might we make a simple web app that loads an exported model and does something meaningful with it?
To run the following code, you should
(pytext) $ pip install flask
Then we implement a minimal Flask web server.
import sys

import flask
import pytext

config_file = sys.argv[1]
model_file = sys.argv[2]

config = pytext.load_config(config_file)
predictor = pytext.create_predictor(config, model_file)

app = flask.Flask(__name__)

@app.route('/get_flight_info', methods=['GET', 'POST'])
def get_flight_info():
    text = flask.request.data.decode()

    # Pass the inputs to PyText's prediction API
    result = predictor({"text": text})

    # Results is a list of output blob names and their scores.
    # The blob names are different for joint models vs doc models
    # Since this tutorial is for both, let's check which one we should look at.
    doc_label_scores_prefix = (
        'scores:' if any(r.startswith('scores:') for r in result)
        else 'doc_scores:'
    )

    # For now let's just output the top document label!
    best_doc_label = max(
        (label for label in result if label.startswith(doc_label_scores_prefix)),
        key=lambda label: result[label][0],
        # Strip the doc label prefix here
    )[len(doc_label_scores_prefix):]
    return flask.jsonify({"question": f"Are you asking about {best_doc_label}?"})

app.run(host='0.0.0.0', port='8080', debug=True)
Execute the app
(pytext) $ python flask_app.py demo/configs/docnn.json exported_model.c2
* Serving Flask app "flask_app" (lazy loading)
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: on
Then in a separate terminal window
$ function ask_about() { curl http://localhost:8080/get_flight_info -H "Content-Type: text/plain" -d "$1"; }
$ ask_about 'I am looking for flights from San Francisco to Minneapolis'
{
"question": "Are you asking about flight?"
}
$ ask_about 'How much does a trip to NY cost?'
{
"question": "Are you asking about airfare?"
}
$ ask_about "Which airport should I go to?"
{
"question": "Are you asking about airport?"
}
Visualize Model Training with TensorBoard¶
Visualizations can be helpful in allowing you to better understand, debug and optimize your models during training. By default, all models trained using PyText can be visualized using TensorBoard (https://www.tensorflow.org/guide/summaries_and_tensorboard).
Here, we will explore how to visualize the model from the tutorial Train Intent-Slot model on ATIS Dataset.
1. Install TensorBoard visualization server¶
The TensorBoard web server is required to host your visualizations. To install it, run
$ pip install tensorboard
2. Verify TensorBoard events in current working directory¶
Complete the tutorial from Train Intent-Slot model on ATIS Dataset if you have not done so.
Once that is done, you should be able to see a TensorBoard events file in the working directory where you trained your model. The file path will be something like <WORKING_DIR>/runs/<DATETIME>_<MACHINE_NAME>/events.out.tfevents.<TIMESTAMP>.<MACHINE_NAME>.
3. Launch the visualization server¶
To launch the visualization server, run:
$ tensorboard --logdir=$EVENTS_FOLDER
$EVENTS_FOLDER is the folder containing the events file from step 2, which is something like <WORKING_DIR>/runs/<DATETIME>_<MACHINE_NAME>.
Note: The TensorBoard web server might fail to run if TensorFlow is not installed. This dependency is not ideal, but if you see ModuleNotFoundError: No module named 'tensorflow' when running the above command, you can install TensorFlow using:
$ pip install tensorflow
4. View your visualizations¶
After launching the visualization server, you can view your visualizations in a web browser at http://localhost:6006.
PyText visualizes the training metrics as scalars, the test metrics as text, and the shape of the neural network architecture graph. In TensorBoard you will see views for:
- Training Metrics
- Test Metrics
- Model Graph
Use PyText models in your app¶
Once you have a PyText model exported to Caffe2, you can host it on a simple web server in the cloud. Then your applications (web/mobile) can make requests to this server and use the returned predictions from the model.
In this tutorial, we’ll take the intent-slot model trained in Train Intent-Slot model on ATIS Dataset, and host it on a Flask server running on an Amazon EC2 instance. Then we’ll write an iOS app which can identify city names in users’ messages by querying the server.
1. Setup an EC2 instance¶
Amazon EC2 is a service which lets you host servers in the cloud for any arbitrary purpose. Use the official documentation to sign up, create an IAM profile and a key pair. Sign in to the EC2 Management Console and launch a new instance with the default Amazon Linux 2 AMI. In the Configure Security Group step, add a rule with type HTTP and port 80.
Connect to your instance using the steps here. Once you’re logged in, install the required dependencies:
$ cd ~
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
$ chmod +x miniconda.sh
$ ./miniconda.sh -b -p ~/miniconda
$ rm -f miniconda.sh
$ source ~/miniconda/bin/activate
$ conda install -y protobuf
$ conda install -y boto3 flask future numpy pip
$ conda install -y pytorch -c pytorch
$ sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to 8080
We’ll make the server listen on port 8080 (chosen arbitrarily) and redirect requests coming to port 80 (HTTP), since running a server on the latter requires administrative privileges.
2. Implement and test the server¶
Upload your trained model (models/atis_joint_model.c2) and the server files (demo/flask_server/*) to the instance using scp.
The server handles a GET request with a text field by running it through the model and dumping the output back to a JSON.
@app.route('/')
def predict():
    return json.dumps(atis.predict(request.args.get('text', '')))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
The code in demo/flask_server/atis.py does the pre-processing (tokenization) and post-processing (extract spans of city names) specific to the ATIS model.
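As a rough illustration of that post-processing step, the sketch below (a simplification, not the actual demo/flask_server/atis.py; the token scores and the NoLabel name are made up) picks the highest-scoring slot label per token and keeps the tokens tagged as city names:

def extract_city_tokens(tokens, word_scores):
    """word_scores maps each slot label to a list of per-token log scores."""
    labels = list(word_scores)
    cities = []
    for i, token in enumerate(tokens):
        best = max(labels, key=lambda label: word_scores[label][i])
        if best.endswith("loc.city_name"):
            cities.append((token, best))
    return cities

# Hypothetical scores for "flights from seattle to boston"
tokens = ["flights", "from", "seattle", "to", "boston"]
scores = {
    "B-fromloc.city_name": [-12.0, -11.0, -0.01, -9.0, -8.0],
    "B-toloc.city_name":   [-13.0, -12.0, -7.00, -9.0, -0.02],
    "NoLabel":             [-0.05, -0.02, -6.00, -0.1, -5.0],
}
print(extract_city_tokens(tokens, scores))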
Run the server using
$ python server.py
Test it out by finding your IPv4 Public IP on the EC2 Management Console page and pointing your browser to it. The server will respond with the character spans of the city names.
3. Implement the iOS app¶
Install Xcode and CocoaPods if you haven’t already.
We use the open-source MessageKit to bootstrap our iOS app. Clone the app from our sister repository, and run:
$ pod install
$ open PyTextATIS.workspace
The comments in ViewController.swift explain the modifications over the base code. Change the IP address in that file to your instance’s and run the app!
Serve Models in Production¶
We have seen how to use PyText models in an app using Flask in the previous tutorial, but the server implementation still requires a Python runtime. Caffe2 models are designed to perform well even in production scenarios with high requirements for performance and scalability.
In this tutorial, we will implement a Thrift server in C++, in order to extract the maximum performance from our exported Caffe2 intent-slot model trained on the ATIS dataset. We will also prepare a Docker image which can be deployed to your cloud provider of choice.
The full source code for the implemented server in this tutorial can be found in the demos directory.
To complete this tutorial, you will need to have Docker installed.
1. Create a Dockerfile and install dependencies¶
The first step is to prepare our Docker image with the necessary dependencies. In an empty folder, create a Dockerfile with the following contents:
Dockerfile
FROM ubuntu:16.04
# Install Caffe2 + dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
git \
libgoogle-glog-dev \
libgtest-dev \
libiomp-dev \
libleveldb-dev \
liblmdb-dev \
libopencv-dev \
libopenmpi-dev \
libsnappy-dev \
openmpi-bin \
openmpi-doc \
python-dev \
python-pip
RUN pip install --upgrade pip
RUN pip install setuptools wheel
RUN pip install future numpy protobuf typing hypothesis pyyaml
RUN apt-get install -y --no-install-recommends \
libgflags-dev \
cmake
RUN git clone https://github.com/pytorch/pytorch.git
WORKDIR pytorch
RUN git submodule update --init --recursive
RUN python setup.py install
# Install Thrift + dependencies
WORKDIR /
RUN apt-get update && apt-get install -y \
libboost-dev \
libboost-test-dev \
libboost-program-options-dev \
libboost-filesystem-dev \
libboost-thread-dev \
libevent-dev \
automake \
libtool \
curl \
flex \
bison \
pkg-config \
libssl-dev
RUN curl https://www-us.apache.org/dist/thrift/0.11.0/thrift-0.11.0.tar.gz --output thrift-0.11.0.tar.gz
RUN tar -xvf thrift-0.11.0.tar.gz
WORKDIR thrift-0.11.0
RUN ./bootstrap.sh
RUN ./configure
RUN make
RUN make install
2. Add Thrift API¶
Thrift is a software library for developing scalable cross-language services. It comes with a client code generation engine enabling services to be interfaced across the network on multiple languages or devices. We will use Thrift to create a service which serves our model.
Our C++ server will expose a very simple API that receives a sentence/utterance as a string and returns a map of label names (string) -> scores (list<double>). For document scores, the list will only contain one score, and for word scores, the list will contain one score per word. The corresponding Thrift spec for the API is below:
predictor.thrift
namespace cpp predictor_service
service Predictor {
// Returns list of scores for each label
map<string,list<double>> predict(1:string doc),
}
3. Implement server code¶
Now, we will write our server’s code. The first thing our server needs to be able to do is load the model from a file path into the Caffe2 workspace and initialize it. We do that in the constructor of our PredictorHandler Thrift server class:
server.cpp
class PredictorHandler : virtual public PredictorIf {
 private:
  NetDef mPredictNet;
  Workspace mWorkspace;

  NetDef loadAndInitModel(Workspace& workspace, string& modelFile) {
    auto db = unique_ptr<DBReader>(new DBReader("minidb", modelFile));
    auto metaNetDef = runGlobalInitialization(move(db), &workspace);
    const auto predictInitNet = getNet(
      *metaNetDef.get(),
      PredictorConsts::default_instance().predict_init_net_type()
    );
    CAFFE_ENFORCE(workspace.RunNetOnce(predictInitNet));

    auto predictNet = NetDef(getNet(
      *metaNetDef.get(),
      PredictorConsts::default_instance().predict_net_type()
    ));
    CAFFE_ENFORCE(workspace.CreateNet(predictNet));
    return predictNet;
  }
  ...

 public:
  PredictorHandler(string &modelFile): mWorkspace("workspace") {
    mPredictNet = loadAndInitModel(mWorkspace, modelFile);
  }
  ...
}
Now that our model is loaded, we need to implement the predict API method which is our main interface to clients. The implementation needs to do the following:
- Pre-process the input sentence into tokens
- Feed the input as tensors to the model
- Run the model
- Extract and populate the results into the response
server.cpp
class PredictorHandler : virtual public PredictorIf {
  ...
 public:
  void predict(map<string, vector<double>>& _return, const string& doc) {
    // Pre-process: tokenize input doc
    vector<string> tokens;
    string docCopy = doc;
    tokenize(tokens, docCopy);

    // Feed input to model as tensors
    Tensor valTensor = TensorCPUFromValues<string>(
      {static_cast<int64_t>(1), static_cast<int64_t>(tokens.size())}, {tokens}
    );
    BlobGetMutableTensor(mWorkspace.CreateBlob("tokens_vals_str:value"), CPU)
      ->CopyFrom(valTensor);
    Tensor lensTensor = TensorCPUFromValues<int>(
      {static_cast<int64_t>(1)}, {static_cast<int>(tokens.size())}
    );
    BlobGetMutableTensor(mWorkspace.CreateBlob("tokens_lens"), CPU)
      ->CopyFrom(lensTensor);

    // Run the model
    CAFFE_ENFORCE(mWorkspace.RunNet(mPredictNet.name()));

    // Extract and populate results into the response
    for (int i = 0; i < mPredictNet.external_output().size(); i++) {
      string label = mPredictNet.external_output()[i];
      _return[label] = vector<double>();
      Tensor scoresTensor = mWorkspace.GetBlob(label)->Get<Tensor>();
      for (int j = 0; j < scoresTensor.numel(); j++) {
        float score = scoresTensor.data<float>()[j];
        _return[label].push_back(score);
      }
    }
  }
  ...
}
The full source code for server.cpp can be found here.
Note: The source code in the demo also implements a REST proxy for the Thrift server to make it easy to test and make calls over simple HTTP; however, it is not covered in this tutorial since the Thrift protocol is what we’ll use in production.
4. Build and compile scripts¶
To build our server, we need to provide the necessary headers at compile time and the required dependent libraries at link time: libthrift.so, libcaffe2.so, libprotobuf.so and libc10.so. The Makefile below does this:
Makefile
CPPFLAGS += -g -std=c++11 -std=c++14 \
-I./gen-cpp \
-I/pytorch -I/pytorch/build \
-I/pytorch/aten/src/ \
-I/pytorch/third_party/protobuf/src/
CLIENT_LDFLAGS += -lthrift
SERVER_LDFLAGS += -L/pytorch/build/lib -lthrift -lcaffe2 -lprotobuf -lc10
# ...
server: server.o gen-cpp/Predictor.o
	g++ $^ $(SERVER_LDFLAGS) -o $@

clean:
	rm -f *.o server
In our Dockerfile, we also add some steps to copy our local files into the docker image, compile the app, and add the necessary library search paths.
Dockerfile
# Copy local files to /app
COPY . /app
WORKDIR /app
# Compile app
RUN thrift -r --gen cpp predictor.thrift
RUN make
# Add library search paths
RUN echo '/pytorch/build/lib/' >> /etc/ld.so.conf.d/local.conf
RUN echo '/usr/local/lib/' >> /etc/ld.so.conf.d/local.conf
RUN ldconfig
5. Test/Run the server¶
This section assumes that your local files match those found here.
Now that you have implemented your server, we will run the following commands to take it for a test run. In your server folder:
- Build the image:
$ docker build -t predictor_service .
If successful, you should see the message “Successfully tagged predictor_service:latest”.
- Run the server. We use models/atis_joint_model.c2 as the local path to our model file (add your trained model there):
$ docker run -it -p 8080:8080 predictor_service:latest ./server models/atis_joint_model.c2
If successful, you should see the message “Server running. Thrift port: 9090, REST port: 8080”
- Test our server by sending a test utterance “Flight from Seattle to San Francisco”:
$ curl -G "http://localhost:8080" --data-urlencode "doc=Flights from Seattle to San Francisco"
If successful, you should see the scores printed out on the console. On further inspection, the doc score for “flight”, the 3rd word score for “B-fromloc.city_name” corresponding to “Seattle”, the 5th word score for “B-toloc.city_name” corresponding to “San”, and the 6th word score for “I-toloc.city_name” corresponding to “Francisco” should be close to 0.
doc_scores:flight:-2.07426e-05
word_scores:B-fromloc.city_name:-14.5363 -12.8977 -0.000172928 -12.9868 -9.94603 -16.0366
word_scores:B-toloc.city_name:-15.2309 -15.9051 -9.89932 -12.077 -0.000134 -8.52712
word_scores:I-toloc.city_name:-13.1989 -16.8094 -15.9375 -12.5332 -10.7318 -0.000501401
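For completeness, you could also call the Thrift API directly from Python instead of going through the REST proxy. The sketch below is hedged: it assumes you also publish the Thrift port (e.g. docker run -p 9090:9090 ...), that you generated Python bindings with thrift --gen py predictor.thrift and installed the thrift pip package, and that the generated package is named predictor (the name depends on your Thrift namespace settings):

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Generated by `thrift --gen py predictor.thrift`; package name assumed,
# and gen-py/ must be on PYTHONPATH.
from predictor import Predictor

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Predictor.Client(protocol)

transport.open()
# Returns {label_name: [score, ...]} as declared in predictor.thrift
scores = client.predict("Flights from Seattle to San Francisco")
print(scores["doc_scores:flight"])
transport.close()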
Congratulations! You have now built your own server that can serve your PyText models in production!
We also provide a Docker image on Docker Hub with this example, which you can freely use and adapt to your needs.
Config Files Explained¶
PyText Models and training Tasks contain many components, and each component expects many parameters to define its behavior. PyText uses a config to specify those parameters. The config can be loaded from a JSON file, which is what we describe here.
Structure of a Config File¶
A typical config file only contains the parameters specific to your project. Here’s a fully working JSON file, and it does not need to be more complicated than this:
{
  "task": {
    "DocumentClassificationTask": {
      "data": {
        "source": {
          "TSVDataSource": {
            "field_names": ["label", "text"],
            "train_filename": "my/data/train.tsv",
            "eval_filename": "my/data/eval.tsv",
            "test_filename": "my/data/test.tsv"
          }
        }
      },
      "model": {
        "embedding": {
          "embed_dim": 200
        }
      }
    }
  },
  "version": 15
}
At the top level, the most important settings are the “task” and the “version”. “task” defines the Task component to be used, which specifies where to get the “data”, which “model” to train, which “trainer” to use, and which “metric_reporter” will present the results.
Each of those parameters can be a Component that is specified by its class name, or omitted to use the default class with its default parameters. In the example above, we specify TSVDataSource to use this class, but we skip the model class name because we want to use the default DocModel.
The “version” number helps PyText maintain backwards compatibility. PyText will use config adapters to internally try and update the configs to match the latest component parameters so you don’t have to keep changing your configs at each PyText update. To manually update your config to the latest version, you can use the update-config command.
Parameters in Config File¶
Parameters are either a component or a value. In the config above, we see that “field_names” expects a list of strings, “train_filename” expects a string, and “embed_dim” expects an integer.
“source” and “model”, however, expect a component, and as we’ve seen in the previous section, we can optionally specify the class name of a component if we decide to use one that is not the default. We can tell whether it’s a class name or a parameter name by looking at the first letter: class names start with an upper-case letter. For “source” we decided to specify TSVDataSource, but for “model” we did not, and let DocumentClassificationTask use its default DocModel. We could have specified the class name like this, and that would be equivalent:
"model": {
"DocModel": {
"embedding": {
"embed_dim": 100
}
}
}
In the next example, the default representation for DocModel is BiLSTMDocAttention. We did not specify “representation” before because we were happy with this default. But if we decide to use DocNNRepresentation instead, we would modify the config like this:
"model": {
"embedding": {
"embed_dim": 100
},
"representation": {
"DocNNRepresentation": {
}
}
}
In this example we just want to change the class of “representation” and use its default parameters, so we don’t need to specify any of them and we can leave its parameter set empty: {}.
To explore more component parameters and their possible values, you can use the help-config command or browse the class documentation.
Changing a Config File¶
Users typically start with an existing config file, or create one using the gen-default-config command, and then edit it to tweak the parameters.
The file generated by gen-default-config is very large, because it contains the default value of every parameter for every component. Any of those parameters can be omitted from the config file, because PyText can recover their default values.
In general, you should remove from your config file all the parameters you don’t want to override, and keep only those you want to override now or might want to tweak later.
For example, TSVDataSource can use a different “delimiter”, but in most cases we want to use the default “\t” for tab-separated-values (TSV) files, so the config above does not specify “delimiter”: “\t”. If we wanted to load a CSV file, we could override this default by adding our own “delimiter” to our config (and since CSV fields can be “quoted”, unlike TSV where this option’s default is false, we’d also override it with true.)
"TSVDataSource": {
"delimiter": ",",
"quoted": true,
"field_names": ["label", "text"],
"train_filename": "my/data/train.csv",
"eval_filename": "my/data/eval.csv",
"test_filename": "my/data/test.csv"
}
The config at the top of this page is a fully working example. It could be simplified even further by removing the “model” section if you don’t want to change any of the model parameters, but in this case the author wanted to tweak “embed_dim”.
JSON Format Primer¶
A few notes about the JSON syntax and the differences with python:
- field names and string values should all be quoted with “double-quotes”
- booleans are lower case: true, false
- no trailing comma (after the last value of a block)
- empty value is: null
- indentation is optional but recommended for readability
- the first character must be { and the last one must be }
- obviously all brackets must be balanced: {}, []
Config Commands¶
This page explains the usage of the commands help-config to explore PyText components, and gen-default-config to create a config file with custom components and parameters.
Exploring Config Options¶
You can explore PyText components with the command help-config. This will print the documentation of the component, its full module name, its base class, as well as the list of its config parameters, their type and their default value.
$ pytext help-config LMTask
=== pytext.task.tasks.LMTask (NewTask) ===
data = Data
exporter = null
features = FeatureConfig
featurizer = SimpleFeaturizer
metric_reporter: LanguageModelMetricReporter = LanguageModelMetricReporter
model: LMLSTM = LMLSTM
trainer = TaskTrainer
You can drill down to the component you’re interested in. For example, if you want to know more about the model LMLSTM, you can use the same command. Notice how PyText lists the possible values for Union types (for example, with representation below).
$ pytext help-config LMLSTM
=== pytext.models.language_models.lmlstm.LMLSTM (BaseModel) ===
"""
`LMLSTM` implements a word-level language model that uses LSTMs to
represent the document.
"""
ModelInput = LMLSTM.Config.ModelInput
caffe2_format: (ExporterType)
PREDICTOR (default)
INIT_PREDICT
decoder: (one of)
None
MLPDecoder (default)
embedding: WordFeatConfig = WordEmbedding
inputs: LMLSTM.Config.ModelInput = ModelInput
output_layer: LMOutputLayer = LMOutputLayer
representation: (one of)
DeepCNNRepresentation
BiLSTM (default)
stateful: bool
tied_weights: bool
PyText internally registers all the component classes, so we can look up and find any component using its class name or its aliases. For example, somewhere in PyText we have import DeepCNNRepresentation as CNN, so we would normally look up DeepCNNRepresentation, but since we know this class has an alias, we can look up CNN instead and print the information about this class:
$ pytext help-config CNN
=== pytext.models.representations.deepcnn.DeepCNNRepresentation (RepresentationBase) ===
"""
`DeepCNNRepresentation` implements CNN representation layer
preceded by a dropout layer. CNN representation layer is based on the encoder
in the architecture proposed by Gehring et. al. in Convolutional Sequence to
Sequence Learning.
Args:
config (Config): Configuration object of type DeepCNNRepresentation.Config.
embed_dim (int): The number of expected features in the input.
"""
cnn: CNNParams = CNNParams
dropout: float = 0.3
Creating a Config File¶
The command gen-default-config creates a JSON config file for a given Task using the default value for all the parameters. You must specify the class name of the Task. The JSON config will be printed in the terminal, so you need to redirect it to a file of your choice (for example my_config.json) to be able to edit it and use it.
$ pytext gen-default-config LMTask > my_config.json
INFO - Applying task option: LMTask
...
In the help-config LMLSTM output above, we see that representation is by default BiLSTM, but could also be DeepCNNRepresentation. (This can be because the type is declared as a Union of valid alternatives, or because the type is a base class.) Those two classes have different parameters, so we can’t just edit my_config.json and replace the class name.
We can specify which components to use by adding any number of class names to the command. Let’s create this config, adding DeepCNNRepresentation to our command. gen-default-config will look up this class name and find that it is a suitable representation component for the LMLSTM model in our LMTask.
$ pytext gen-default-config LMTask DeepCNNRepresentation > my_config.json
INFO - Applying task option: LMTask
INFO - Applying class option: task->model->representation = CNN
...
This also works with parameters which are not component class names. You can specify the parameter name and its value, and gen-default-config will automatically apply this parameter to the right component.
$ pytext gen-default-config LMTask epochs=200
INFO - Applying task option: LMTask
INFO - Applying parameter option to task.trainer.epochs : epochs=200
...
Sometimes the same parameter name is used by multiple components. In this case PyText prints the list of those parameters with their full config path. You can then simply use the last part of the path that is enough to differentiate them and pick the one you want. In the next example, we omit the prefix task.model. because we don’t need it to find where to apply our parameter representation.dropout.
$ pytext gen-default-config LMTask dropout=0.7 > my_config.json
INFO - Applying task option: LMTask
...
Exception: Multiple possibilities for dropout=0.7: task.model.representation.dropout, task.model.decoder.dropout
$ pytext gen-default-config LMTask representation.dropout=0.7 > my_config.json
INFO - Applying task option: LMTask
INFO - Applying parameter option to task.model.representation.dropout : representation.dropout=0.7
...
You can add any number and combination of those parameters. Please note that they will be applied in order, so if you want to change a component class and some of its parameters, you must specify them in this order (component first, then parameters). If you don’t do that, your parameter changes will be ignored: for example, changing representation.dropout first and then overriding the representation component will replace the default representation with a new CNN component with all parameters at their default values.
Look at this bad example: you can verify that the representation dropout is 0.3 (the default value for CNN) and not 0.7 as we specified, because CNN was applied after, replacing the component whose dropout had been modified first.
$ pytext gen-default-config LMTask representation.dropout=0.7 CNN > my_config.json
INFO - Applying task option: LMTask
INFO - Applying parameter option to task.model.representation.dropout : representation.dropout=0.7
INFO - Applying class option: task->model->representation = CNN
...
Now let’s combine everything:
$ pytext gen-default-config LMTask BlockShardedTSVDataSource CNN dilated=True epochs=200 representation.dropout=0.7 > my_config.json
INFO - Applying task option: LMTask
INFO - Applying class option: task->data->source = BlockShardedTSVDataSource
INFO - Applying class option: task->model->representation = CNN
INFO - Applying parameter option to task.model.representation.cnn.dilated : dilated=True
INFO - Applying parameter option to task.trainer.epochs : epochs=200
INFO - Applying parameter option to task.model.representation.dropout : representation.dropout=0.7
...
Updating a Config File¶
When there’s a new release of PyText, some component parameters might change because of bug fixes or new features. While PyText has config_adapters that can internally transform old configs to map them to the latest components, it is sometimes useful to update your config file to the current version. This can be done with the command update-config:
$ pytext update-config < my_config_old.json > my_config_new.json
Train Intent-Slot model on ATIS Dataset¶
OBSOLETE: This documentation uses the old API and needs to be updated with the new classes and configs.
Intent detection and Slot filling are two common tasks in Natural Language Understanding for personal assistants. Given a user’s “utterance” (e.g. Set an alarm for 10 pm), we detect its intent (set_alarm) and tag the slots required to fulfill the intent (10 pm).
The two tasks can be modeled as text classification and sequence labeling, respectively. We can train two separate models, but training a joint model has been shown to perform better.
In this tutorial, we will train a joint intent-slot model in PyText on the ATIS (Airline Travel Information System) dataset. Note that to download the dataset, you will need a Kaggle account for which you can sign up for free.
1. Prepare the data¶
The in-built PyText data-handler expects the data to be stored in a tab-separated file that contains the intent label, slot label and the raw utterance.
Download the data locally and use the script below to preprocess it into the format PyText expects:
$ unzip <download_dir>/atis.zip -d <download_dir>/atis
$ python3 demo/atis_joint_model/data_processor.py \
  --download-folder <download_dir>/atis --output-directory demo/atis_joint_model/
The script will also randomly split the training data into training and validation sets. All the pre-processed data will be written to the directory specified by the --output-directory argument in the command.
An alternative approach here would be to write a custom data-handler for your custom data format, but that is beyond the scope of this tutorial.
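For reference, the train/validation split the script performs is conceptually similar to this sketch (the real data_processor.py also converts the raw ATIS format, and the 90/10 ratio here is only an assumption):

import random

def split_train_val(rows, val_fraction=0.1, seed=0):
    """Randomly split pre-processed rows into training and validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]

# train_rows, val_rows = split_train_val(all_rows)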
2. Download Pre-trained word embeddings¶
Word embeddings are the vector representations of the different words understood by your model. Pre-trained word embeddings can significantly improve the accuracy of your model, since they have been trained on vast amounts of data. In this tutorial, we’ll use GloVe embeddings, which can be downloaded by:
$ curl https://nlp.stanford.edu/data/wordvecs/glove.6B.zip > demo/atis_joint_model/glove.6B.zip
$ unzip demo/atis_joint_model/glove.6B.zip -d demo/atis_joint_model
The downloaded file size is ~800 MB.
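GloVe files store one token per line followed by its vector components, so you can quickly sanity-check that the file's dimension matches the embed_dim you will configure below (100 for glove.6B.100d.txt):

# GloVe format: "<token> <v1> <v2> ... <vN>", one token per line.
with open("demo/atis_joint_model/glove.6B.100d.txt", encoding="utf-8") as f:
    token, *vector = f.readline().split()

print(token, len(vector))  # the vector length should match embed_dim (100)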
3. Train the model¶
To train a PyText model, you need to pick the right task and model architecture, among other parameters. Default values are available for many parameters and can give reasonable results in most cases. The following is a sample config which can train a joint intent-slot model:
{
  "config": {
    "task": {
      "IntentSlotTask": {
        "data": {
          "Data": {
            "source": {
              "TSVDataSource": {
                "field_names": [
                  "label",
                  "slots",
                  "text",
                  "doc_weight",
                  "word_weight"
                ],
                "train_filename": "demo/atis_joint_model/atis.processed.train.csv",
                "eval_filename": "demo/atis_joint_model/atis.processed.val.csv",
                "test_filename": "demo/atis_joint_model/atis.processed.test.csv"
              }
            },
            "batcher": {
              "PoolingBatcher": {
                "train_batch_size": 128,
                "eval_batch_size": 128,
                "test_batch_size": 128,
                "pool_num_batches": 10000
              }
            },
            "sort_key": "tokens",
            "in_memory": true
          }
        },
        "model": {
          "representation": {
            "BiLSTMDocSlotAttention": {
              "pooling": {
                "SelfAttention": {}
              }
            }
          },
          "output_layer": {
            "doc_output": {
              "loss": {
                "CrossEntropyLoss": {}
              }
            },
            "word_output": {
              "CRFOutputLayer": {}
            }
          },
          "word_embedding": {
            "embed_dim": 100,
            "pretrained_embeddings_path": "demo/atis_joint_model/glove.6B.100d.txt"
          }
        },
        "trainer": {
          "epochs": 20,
          "optimizer": {
            "Adam": {
              "lr": 0.001
            }
          }
        }
      }
    }
  }
}
We explain some of the parameters involved:
- IntentSlotTask trains a joint model for document classification and word tagging.
- The Model has multiple layers:
  - We use a BiLSTM model with attention as the representation layer. The pooling attribute decides the attention technique used.
  - We use different loss functions for document classification (Cross Entropy Loss) and slot filling (CRF layer).
  - Pre-trained word embeddings are provided within the word_embedding attribute.
To train the PyText model,
(pytext) $ pytext train < sample_config.json
4. Tune the model and get final results¶
Tuning the model’s hyper-parameters is key to obtaining the best model accuracy. Using hyper-parameter sweeps on the learning rate, the number of BiLSTM layers, their dimension, dropout, etc., we can achieve an F1 score of ~95% on slot labels, which is close to the state of the art. The fine-tuned model config is available at demo/atis_joint_model/atis_joint_config.json.
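A very simple way to run such a sweep is to generate config variants programmatically and train each one. The following is a hedged sketch over the sample config above (saved as sample_config.json, as used earlier); note that each run will overwrite the previous snapshot unless you also point it at a different output path:

import copy
import json
import subprocess

with open("sample_config.json") as f:
    base = json.load(f)

# Hypothetical sweep over the Adam learning rate.
for lr in (1e-3, 5e-4, 1e-4):
    cfg = copy.deepcopy(base)
    cfg["config"]["task"]["IntentSlotTask"]["trainer"]["optimizer"]["Adam"]["lr"] = lr
    path = f"sweep_lr_{lr}.json"
    with open(path, "w") as out:
        json.dump(cfg, out, indent=2)
    # Compare the reported metrics of each run to pick the best learning rate.
    subprocess.run(f"pytext train < {path}", shell=True, check=True)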
To train the model using the fine-tuned config,
(pytext) $ pytext train < demo/atis_joint_model/atis_joint_config.json
5. Generate predictions¶
Let’s run the model on some sample utterances! You can input one by running:
(pytext) $ pytext --config-file demo/atis_joint_model/atis_joint_config.json \
predict --exported-model /tmp/atis_joint_model.c2 <<< '{"text": "flights from colorado"}'
The response from the model is the log probabilities for the different intents and slots, with the correct intent and slots hopefully having the highest scores.
In the following snippet of the model’s response, we see that the intent doc_scores:flight and the slot word_scores:fromloc.city_name for the third word “colorado” have the highest predictions.
{
....
'doc_scores:flight': array([-0.00016726], dtype=float32),
'doc_scores:ground_service+ground_fare': array([-25.865768], dtype=float32),
'doc_scores:meal': array([-17.864975], dtype=float32),
..,
'word_scores:airline_name': array([[-12.158762],
[-15.142928],
[ -8.991585]], dtype=float32),
'word_scores:fromloc.city_name': array([[-1.5084317e+01],
[-1.3880151e+01],
[-1.4416825e-02]], dtype=float32),
'word_scores:fromloc.state_code': array([[-17.824356],
[-17.89767 ],
[ -9.848984]], dtype=float32),
'word_scores:meal': array([[-15.079164],
[-17.229427],
[-17.529446]], dtype=float32),
'word_scores:transport_type': array([[-14.722928],
[-16.700478],
[-13.4414 ]], dtype=float32),
...
}
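To turn that raw response into something readable, a small hedged post-processing sketch (where result stands for the dictionary shown above) could pick the top intent and the best slot for each word:

def summarize(result):
    """Pick the highest-scoring intent and the best slot label per word position."""
    doc_scores = {k: v[0] for k, v in result.items() if k.startswith("doc_scores:")}
    best_intent = max(doc_scores, key=doc_scores.get)

    word_labels = [k for k in result if k.startswith("word_scores:")]
    num_words = len(result[word_labels[0]])
    best_slots = [
        max(word_labels, key=lambda label: result[label][i][0])
        for i in range(num_words)
    ]
    return best_intent, best_slots

# best_intent, best_slots = summarize(result)
# print(best_intent)   # e.g. doc_scores:flight
# print(best_slots)    # e.g. [..., 'word_scores:fromloc.city_name']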
Hierarchical intent and slot filling¶
In this tutorial, we will train a semantic parser for task oriented dialog by modeling hierarchical intents and slots (Gupta et al., Semantic Parsing for Task Oriented Dialog using Hierarchical Representations, EMNLP 2018). The underlying model used in the paper is the Recurrent Neural Network Grammar (Dyer et al., Recurrent Neural Network Grammars, NAACL 2016). RNNG is a neural constituency parser that explicitly models the compositional tree structure of the words and phrases in an utterance.
1. Fetch the dataset¶
Download the dataset to a local directory. We will refer to this as base_dir in the next section.
$ curl -o top-dataset-semantic-parsing.zip -L https://fb.me/semanticparsingdialog
$ unzip top-dataset-semantic-parsing.zip
2. Prepare configuration file¶
Prepare the configuration file for training. A sample config file can be found in your PyText repository at demo/configs/rnng.json. If you haven’t set up PyText, please follow Installation, then make the following changes in the config:
- Set train_path to base_dir/top-dataset-semantic-parsing/train.tsv.
- Set eval_path to base_dir/top-dataset-semantic-parsing/eval.tsv.
- Set test_path to base_dir/top-dataset-semantic-parsing/test.tsv.
3. Train a model with the downloaded dataset¶
Train the model using the command below
(pytext) $ pytext train < demo/configs/rnng.json
The output will look like:
Merged Intent and Slot Metrics
P = 24.03 R = 31.90, F1 = 27.41.
This will take about an hour. If you want to train with a smaller dataset to make it quick, generate a subset of the dataset using the commands below and update the paths in demo/configs/rnng.json:
$ head -n 1000 base_dir/top-dataset-semantic-parsing/train.tsv > base_dir/top-dataset-semantic-parsing/train_small.tsv
$ head -n 100 base_dir/top-dataset-semantic-parsing/eval.tsv > base_dir/top-dataset-semantic-parsing/eval_small.tsv
$ head -n 100 base_dir/top-dataset-semantic-parsing/test.tsv > base_dir/top-dataset-semantic-parsing/test_small.tsv
If you now train the model with smaller datasets, the output will look like:
Merged Intent and Slot Metrics
P = 24.03 R = 31.90, F1 = 27.41.
4. Test the model interactively against input utterances¶
Load the model using the command below
(pytext) $ pytext predict-py --model-file=/tmp/model.pt
please input a json example, the names should be the same with column_to_read in model training config:
This will give you a REPL prompt. You can repeatedly enter utterances to get back the model’s predictions. Input should be in the JSON format shown below. When you’re done, press Ctrl+D.
{"text": "order coffee from starbucks"}
You should see an output like:
[{'prediction': [7, 0, 5, 0, 1, 0, 3, 0, 1, 1],
'score': [
0.44425372408062447,
0.8018286800064633,
0.6880680051949267,
0.9891564979506277,
0.9999506231665385,
0.9992705616574005,
0.34512090135492923,
0.9999979545618913,
0.9999998668826438,
0.9999998686418744]}]
We have also provided a pre-trained model, which you may download here.
Multitask training with disjoint datasets¶
In this tutorial, we will jointly train a classification task with a language modeling task in a multitask setting. The models will share the embedding and representation layers.
We will use the following datasets:
- Binarized Stanford Sentiment Treebank (SST-2), which is part of the GLUE benchmark. This dataset contains segments from movie reviews labeled with their binary sentiment.
- WikiText-2, a medium-size language modeling dataset with text extracted from Wikipedia.
1. Fetch and prepare the dataset¶
Download the datasets to a local directory. We will refer to it as base_dir in the next section.
$ curl "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip" -o wikitext-2-v1.zip
$ unzip wikitext-2-v1.zip
$ curl "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8" -o SST-2.zip
$ unzip SST-2.zip
Remove headers from SST-2 data:
$ cd base_dir/SST-2
$ sed -i '1d' train.tsv
$ sed -i '1d' dev.tsv
Remove empty lines from WikiText:
$ cd base_dir/wikitext-2
$ sed -i '/^\s*$/d' train.tsv
$ sed -i '/^\s*$/d' valid.tsv
$ sed -i '/^\s*$/d' test.tsv
2. Train a base model¶
Prepare the configuration file for training. A sample config file for the base document classification model can be found in your PyText repository at demo/configs/sst2.json. If you haven’t set up PyText, please follow Installation, then make the following changes in the config:
- Set train_path to base_dir/SST-2/train.tsv.
- Set eval_path to base_dir/SST-2/eval.tsv.
- Set test_path to base_dir/SST-2/test.tsv.
The test set labels for this task are not openly available, therefore we will use the dev set. Train the model using the command below.
(pytext) $ pytext train < demo/configs/sst2.json
The output will look like:
Stage.EVAL
loss: 0.472868
Accuracy: 85.67
3. Configure for multitasking¶
The example configuration for this tutorial is at demo/configs/multitask_sst_lm.json.
The main configuration is under tasks, which is a dictionary of task name to task config:
"task_weights": {
"SST2": 1,
"LM": 1
},
"tasks": {
"SST2": {
"DocClassificationTask": { ... }
},
"LM": {
"LMTask": { ... }
}
}
You can also modify task_weights to weight the loss for each task. The sub-tasks can be configured as you would in a single task setting, with the exception of changes described in the next sections.
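Conceptually, these weights scale each task's loss before the losses are combined; a minimal sketch of that idea (an illustration, not PyText's internal implementation) is:

task_weights = {"SST2": 1.0, "LM": 1.0}

def combined_loss(losses, weights):
    """losses: {task_name: loss_value}; weights: {task_name: weight}."""
    return sum(weights[name] * loss for name, loss in losses.items())

# e.g. combined_loss({"SST2": 0.47, "LM": 5.2}, task_weights)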
4. Train the model¶
You can train the model with
(pytext) $ pytext train < demo/configs/multitask_sst_lm.json
The output will look like
Stage.EVAL
loss: 0.455871
Accuracy: 86.12
Not a great improvement, but we used a very primitive language modeling task (bi-directional with no masking) for the purposes of this tutorial. Happy multitasking!
Data Parallel Distributed Training¶
Distributed training enables one to easily parallelize computations across processes and clusters of machines. To do so, it leverages message passing semantics, allowing each process to communicate data to any of the other processes.
PyText uses DistributedDataParallel for synchronizing gradients and torch.multiprocessing to spawn multiple processes, each of which sets up the distributed environment with NCCL as the default backend, initializes the process group, and finally executes the given run function. The module is replicated on each machine and each device (i.e. every single process), and each such replica handles a portion of the input, partitioned by PyText’s DataHandler.
For more on distributed training in PyTorch, refer to Writing distributed applications with PyTorch.
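To make those mechanics concrete, here is a minimal, hedged PyTorch sketch of the same pattern (file-based initialization on a single node, one process per GPU, gradients synchronized by DistributedDataParallel); it is an illustration, not PyText's actual training loop:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size, init_file):
    # Each spawned process joins the same group via a shared file (single node).
    dist.init_process_group(
        backend="nccl", init_method=f"file://{init_file}",
        rank=rank, world_size=world_size,
    )
    torch.cuda.set_device(rank)

    # The module is replicated on every device; DDP all-reduces the gradients.
    model = DDP(torch.nn.Linear(10, 2).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Each replica sees its own shard of the input, like PyText's DataHandler.
    inputs = torch.randn(16, 10).cuda(rank)
    targets = torch.randint(0, 2, (16,)).cuda(rank)
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()   # gradients are synchronized across processes here
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size, "/tmp/pytext_dist_init"), nprocs=world_size)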
In this tutorial, we will train a DocNN model on a single node with 8 GPUs using the SST dataset.
1. Requirement¶
Distributed training is only available for GPUs, so you’ll need a GPU-equipped server or virtual machine to run this tutorial.
- Notes:
- This demo uses a local temporary file for initializing the distributed process group, which means it only works on a single node. Please make sure to set distributed_world_size less than or equal to the maximum number of GPUs available on the server.
- For distributed training on clusters of machines, you can use a shared file accessible to all the hosts (e.g. file:///mnt/nfs/sharedfile) or the TCP init method. More info on distributed initialization.
- In demo/configs/distributed_docnn.json, set distributed_world_size to 1 to disable distributed training, and set use_cuda_if_available to false to disable training on GPU.
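For reference, the initialization that PyText performs for you resembles the following plain-PyTorch sketch (this is an illustrative stand-in, not PyText’s actual code; the file path and the body of run are placeholders):
import tempfile
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size, init_file):
    # Single-node setup: every process joins the same group via a shared file.
    dist.init_process_group(
        backend="nccl",                      # default backend used by PyText
        init_method="file://" + init_file,   # or "tcp://host:port" for clusters
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    init_file = tempfile.NamedTemporaryFile(delete=False).name
    mp.spawn(run, args=(world_size, init_file), nprocs=world_size)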
2. Fetch the dataset¶
Download the SST dataset (The Stanford Sentiment Treebank) to a local directory. We will refer to this as base_dir in the next section.
$ unzip SST-2.zip && cd SST-2
$ sed 1d train.tsv | head -1000 > train_tiny.tsv
$ sed 1d dev.tsv | head -100 > eval_tiny.tsv
3. Prepare configuration file¶
Prepare the configuration file for training. A sample config file can be found in your PyText repository at demo/configs/distributed_docnn.json. If you haven’t set up PyText, please follow Installation.
The two parameters that are used for distributed training are:
- distributed_world_size: total number of GPUs used for distributed training, e.g. if set to 40 with every server having 8 GPUs, 5 servers will be fully used.
- use_cuda_if_available: set to true for training on GPUs.
For this tutorial, please change the following in the config file.
- Set train_path to base_dir/train_tiny.tsv.
- Set eval_path to base_dir/eval_tiny.tsv.
- Set test_path to base_dir/eval_tiny.tsv.
4. Train model with the downloaded dataset¶
Train the model using the command below
(pytext) $ pytext train < demo/configs/distributed_docnn.json
XLM-RoBERTa¶
Introduction¶
XLM-R (XLM-RoBERTa, Unsupervised Cross-lingual Representation Learning at Scale) is a scaled cross-lingual sentence encoder. It is trained on 2.5TB of data across 100 languages, filtered from Common Crawl. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.
Pre-trained models¶
Model | Description | #params | vocab size | Download
---|---|---|---|---
xlmr.base.v0 | XLM-R using the BERT-base architecture | 250M | 250k | xlm.base.v0.tar.gz
xlmr.large.v0 | XLM-R using the BERT-large architecture | 560M | 250k | xlm.large.v0.tar.gz
(Note: the above models are still being trained; we will update the weights once they are fully trained. The results below are based on the above checkpoints.)
Results¶
XNLI (accuracy):
Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
roberta.large.mnli (TRANSLATE-TEST) | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8
xlmr.large.v0 (TRANSLATE-TRAIN-ALL) | 82.4 | 88.7 | 85.2 | 85.6 | 84.6 | 83.6 | 85.5 | 82.4 | 81.6 | 80.9 | 83.4 | 80.9 | 83.3 | 79.8 | 75.9 | 74.3

MLQA (F1 / EM):
Model | average | en | es | de | ar | hi | vi | zh
---|---|---|---|---|---|---|---|---
BERT-large | | 80.2 / 67.4 | | | | | |
mBERT | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3
xlmr.large.v0 | 70.0 / 52.2 | 80.1 / 67.7 | 73.2 / 55.1 | 68.3 / 53.7 | 62.8 / 43.7 | 68.3 / 51.0 | 70.5 / 50.1 | 67.1 / 44.4
Citation¶
@article{
    title = {Unsupervised Cross-lingual Representation Learning at Scale},
    author = {Alexis Conneau and Kartikay Khandelwal and Naman Goyal and Vishrav Chaudhary
              and Guillaume Wenzek and Francisco Guzm\'an and Edouard Grave and Myle Ott
              and Luke Zettlemoyer and Veselin Stoyanov},
    journal = {},
    year = {2019},
}
Architecture Overview¶
PyText is designed to help users build end-to-end pipelines for training and inference. A number of default pipelines are implemented for popular tasks and can be used as-is. Users are free to extend or replace one or more of the pipeline’s components.
The following figure describes the relationship between the major components of PyText:

Note: some models might implement a single “encoder_decoder” component while others implement two components: a representation and a decoder.
Model¶
The Model class is the central concept in PyText. It defines the neural network architecture. PyText provides models for common NLP jobs. Users can implement their custom model in two ways:
- subclassing Model will give you most of the functions for the common architecture embedding -> representation -> decoder -> output_layer.
- if you need more flexibility, you can subclass the more basic BaseModel, which makes no assumptions about architecture, allowing you to implement any model.
Most PyText models implement Model and use the following architecture:
- model
  - model_input
    - tensorizers
  - embeddings
  - encoder + decoder
  - output_layer
    - loss
    - prediction
- model_input: defines how the input strings will be transformed into tensors. This is done by input-specific “Tensorizers”. For example, the TokenTensorizer takes a sentence, tokenizes it and looks it up in its vocabulary to create the corresponding tensor. (The vocabulary is created during initialization by doing a first pass on the inputs.) In addition to the inputs, we also define here how to handle other data that can be found in the input files, such as the “labels” (arguably an output, but the true labels are used as input during training).
- embeddings: this step transforms the tensors created by model_input into embeddings. Each model_input (tensorizer) will be associated with a compatible embedding class (for example: WordEmbedding or CharacterEmbedding). (see pytext/models/embeddings/)
- representation: also called “encoder”, this can be one of the provided classes, such as those using a CNN (for example DocNNRepresentation), those using an LSTM (for example BiLSTMDocAttention), or any other type of representation. The parameters will depend on the representation selected. (see pytext/models/representations/)
- decoder: this is typically an MLP (Multi-Layer Perceptron). If you use the default MLPDecoder, hidden_dims is the most useful parameter: an array containing the number of nodes in each hidden layer. (see pytext/models/decoders/)
- output_layer: this is where the human-understandable output of the model is defined. For example, a document classification model can automatically use the “labels” vocabulary defined in model_input as outputs. output_layer also defines the loss function to use during training. (see pytext/models/output_layers/)
Task: training definition¶
To train the model, we define a Task, which tells PyText how to load the data, which model to use, how to train it, and how to measure metrics.
The Task is defined with the following information:
- data: defines where to find and how to handle the data: see data_source and batcher.
- data -> data_source: The format of the input data (training, eval and testing) can differ a lot depending on the source. PyText provides TSVDataSource to read from common tab-separated files. Users can easily write their own custom implementation if their files have a different format.
- data -> batcher: The batcher is responsible for grouping the input data into batches that will be processed one at a time. train_batch_size, eval_batch_size and test_batch_size can be changed to reduce the running time (while increasing the memory requirements). The default Batcher takes the input sequentially, which is adequate in most cases. Alternatively, PoolingBatcher shuffles the inputs so that the data is not processed in its original order, which could otherwise introduce a bias in the results.
- trainer: This defines a number of useful options for the training runs, like the number of epochs, whether to report_train_metrics only during eval, and the random_seed to use.
- metric_reporter: different models will need to report different metrics. (For example, common metrics for document classification are precision, recall, f1 score.) Each PyText task can use a corresponding default metric reporter class, but users might want to use alternatives or implement their own.
- exporter: defines how to export the model so it can be used in production. PyText currently exports to Caffe2 (via ONNX) or to TorchScript.
- model: (see above)
How Data is Consumed¶
- data_source: Defines where the data can be found (for example: one training file, one eval file, and one test file) and the schema (field names). The data_source class will read each entry one by one (for example: each line in a TSV file) and convert each one into a row, which is a python dict of field name to entry value. Values are converted automatically if their type is specified.
- tensorizer: Defines how rows are transformed into tensors. Tensorizers listed in the model will use one or more fields in the row to create a tensor or a tuple of tensors. To do that, some tensorizers will split the field values using a tokenizer that can be overridden in the config. Tensorizers typically have a vocabulary that allows them to map words or labels to numbers, and it’s built during the initialization phase by scanning the data once. (Alternatively, it can be loaded from file.)
- model -> arrange_model_inputs(): At this point, we have a python dict of tensorizer name to tensor or tuple of tensors. The Model has the method arrange_model_inputs(), which flattens this python dict into a list of tensors or tuples of tensors in the right order for the Model’s forward method.
- model -> forward(): This is where the magic happens. Input tensors are passed to the embeddings’ forward methods, then the results are passed to the encoder/decoder forward methods, and finally the output layer produces a prediction.
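To make this flow concrete, here is a toy, self-contained sketch of the steps above (the class and vocabulary here are illustrative stand-ins, not the real PyText components):
from typing import Dict, List

class ToyTensorizer:
    def __init__(self, column: str, vocab: Dict[str, int]):
        self.column, self.vocab = column, vocab

    def numberize(self, row: Dict[str, str]) -> List[int]:
        # strings -> numbers, one row at a time
        return [self.vocab.get(tok, 0) for tok in row[self.column].split()]

    def tensorize(self, batch: List[List[int]]) -> List[List[int]]:
        # batch of numberized rows -> padded, rectangular "tensor"
        width = max(len(x) for x in batch)
        return [x + [1] * (width - len(x)) for x in batch]

# 1. the data_source yields rows as python dicts
rows = [{"text": "please list the flights"}, {"text": "thank you"}]
# 2. each tensorizer numberizes the fields it cares about
t = ToyTensorizer("text", {"please": 2, "list": 3, "the": 4, "flights": 5, "thank": 6, "you": 7})
numberized = [t.numberize(r) for r in rows]
# 3. tensorize pads the batch; arrange_model_inputs() would then order these
#    tensors for the model's forward() method
print(t.tensorize(numberized))  # [[2, 3, 4, 5], [6, 7, 1, 1]]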
Config Example¶
We only specify the options we want to override. Everything else will use the default values. A typical config might look like this:
{
    "task": {
        "MyTask": {
            "data": {
                "source": {
                    "TSVDataSource": {
                        "field_names": ["label", "slots", "text"],
                        "train_filename": "data/my_train_data.tsv",
                        "test_filename": "data/my_test_data.tsv",
                        "eval_filename": "data/my_eval_data.tsv"
                    }
                }
            }
        }
    }
}
Code Example¶
class MyTask(NewTask):
    class Config(NewTask.Config):
        model: MyModel.Config = MyModel.Config()


class MyModel(Model):
    class Config(Model.Config):
        class ModelInput(Model.Config.ModelInput):
            tokens: TokenTensorizer.Config = TokenTensorizer.Config()
            labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config()

        inputs: ModelInput = ModelInput()
        embedding: WordEmbedding.Config = WordEmbedding.Config()
        representation: Union[
            BiLSTMSlotAttention.Config,
            BSeqCNNRepresentation.Config,
            PassThroughRepresentation.Config,
        ] = BiLSTMSlotAttention.Config()
        output_layer: Union[
            WordTaggingOutputLayer.Config, CRFOutputLayer.Config
        ] = WordTaggingOutputLayer.Config()
        decoder: MLPDecoder.Config = MLPDecoder.Config()

    @classmethod
    def from_config(cls, config, tensorizers):
        vocab = tensorizers["tokens"].vocab
        embedding = create_module(config.embedding, vocab=vocab)
        labels = tensorizers["labels"].vocab
        representation = create_module(
            config.representation, embed_dim=embedding.embedding_dim
        )
        decoder = create_module(
            config.decoder,
            in_dim=representation.representation_dim,
            out_dim=len(labels),
        )
        output_layer = create_module(config.output_layer, labels=labels)
        return cls(embedding, representation, decoder, output_layer)

    def arrange_model_inputs(self, tensor_dict):
        tokens, seq_lens, _ = tensor_dict["tokens"]
        return (tokens, seq_lens)

    def arrange_targets(self, tensor_dict):
        return tensor_dict["labels"]

    def forward(
        self,
        tokens: torch.Tensor,
        seq_lens: torch.Tensor,  # returned by arrange_model_inputs along with the tokens
    ) -> List[torch.Tensor]:
        embeddings = [self.embedding(tokens)]
        final_embedding = torch.cat(embeddings, -1)
        representation = self.representation(final_embedding)
        return self.decoder(representation)
Custom Data Format¶
PyText’s default data reader is TSVDataSource, which reads datasets in tsv (tab-separated values) format. In many cases, your data is going to be in a different format. You could write a pre-processing script to convert your data into tsv format, but it’s easier and more convenient to implement your own DataSource component so that PyText can read your data directly, without any preprocessing.
This tutorial explains how to implement a simple DataSource that can read the ATIS data, and how to perform a classification task using the “intent” labels.
1. Download the data¶
Download the ATIS (Airline Travel Information System) dataset and unzip it in a directory. Note that to download the dataset, you will need a Kaggle account for which you can sign up for free. The zip file is about 240KB.
$ unzip <download_dir>/atis.zip -d <download_dir>/atis
2. The data format¶
The ATIS dataset has a few defining characteristics:
- it has a train set and a test set, but no eval set
- the data is split into a “dict” file, which is a vocab file containing the words or labels, and the train and test sets, which only contain integers representing the word indexes.
- sentences always start with the token 178 = BOS (Beginning Of Sentence) and end with the token 179 = EOS (End Of Sentence).
$ tail atis/atis.dict.vocab.csv
y
year
yes
yn
york
you
your
yx
yyz
zone
$ tail atis/atis.test.query.csv
178 479 0 545 851 264 882 429 851 915 330 179
178 479 902 851 264 180 428 444 736 521 301 851 915 330 179
178 818 581 207 827 204 616 915 330 179
178 479 0 545 851 264 180 428 444 299 851 619 937 301 654 887 200 435 621 740 179
178 818 581 207 827 204 482 827 619 937 301 229 179
178 688 423 207 827 429 444 299 851 218 203 482 827 619 937 301 229 826 236 621 740 253 130 689 179
178 423 581 180 428 444 299 851 218 203 482 827 619 937 301 229 179
178 479 0 545 851 431 444 589 851 297 654 212 200 179
178 479 932 545 851 264 180 730 870 428 444 511 301 851 297 179
178 423 581 180 428 826 427 444 587 851 810 179
Our DataSource must then resolve the words from the vocab files to rebuild the sentences and labels as strings. It must also take a subset of either the train or the test dataset to create the eval dataset. Since the test set is pretty small, we’ll use the train set for that purpose and randomly take a small fraction (say 25%) to create the eval set. Finally, we can safely remove the first and last tokens of every query (BOS and EOS), as they don’t add any value for classification.
The ATIS dataset also has information for slots tagging that we’ll ignore because we only care about classification in this tutorial.
3. DataSource¶
PyText defines a DataSource to read the data. It expects each row of data to be represented as a python dict where the keys are the column names and the values are the properly typed column values.
Most of the time, the dataset will come as strings, and the casting to the proper types can be inferred automatically from the other components in the config. To make the implementation of a new DataSource easier, PyText provides the class RootDataSource, which does this type lookup for you. Most users should use RootDataSource as a base class.
4. Implementing AtisIntentDataSource¶
We will write all the code for our AtisIntentDataSource in the file my_classifier/source.py.
First, let’s write the utilities that will help us read the data: a function to load the vocab files, and the generator that uses them to rebuild the sentences and labels. We return pytext.data.utils.UNK for unknown words. We store the indexes as strings to avoid casting from and to ints when reading the inputs:
def load_vocab(file_path):
    """
    Given a file, prepare the vocab dictionary where each line is the value and
    (line_no - 1) is the key
    """
    vocab = {}
    with open(file_path, "r") as file_contents:
        for idx, word in enumerate(file_contents):
            vocab[str(idx)] = word.strip()
    return vocab


def reader(file_path, vocab):
    with open(file_path, "r") as reader:
        for line in reader:
            yield " ".join(
                vocab.get(s.strip(), UNK)
                # the BOS/EOS markers are stripped later, per query, in _iter_rows
                for s in line.split()
            )
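As a quick sanity check (assuming the ATIS files were unzipped under atis/ as shown above), we can load the vocab and peek at the first test query:
words = load_vocab("atis/atis.dict.vocab.csv")
print(len(words), "words loaded")
print(words["178"], words["179"])   # per the dataset description: the BOS and EOS markers

for sentence in reader("atis/atis.test.query.csv", words):
    print(sentence)   # BOS/EOS markers are still present at this stage
    break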
Then we declare the DataSource class itself: AtisIntentDataSource. It inherits from RootDataSource, which gives us the automatic lookup of data types. We declare all the config parameters that will be useful, and give sensible default values so that the general case where users provide only path and field_names will likely work. We load the vocab files for queries and intents only once in the constructor and keep them in memory for the entire run:
class AtisIntentDataSource(RootDataSource):
    def __init__(
        self,
        path="my_directory",
        field_names=None,
        validation_split=0.25,
        random_seed=12345,
        # Filenames can be overridden if necessary
        intent_filename="atis.dict.intent.csv",
        vocab_filename="atis.dict.vocab.csv",
        test_queries_filename="atis.test.query.csv",
        test_intent_filename="atis.test.intent.csv",
        train_queries_filename="atis.train.query.csv",
        train_intent_filename="atis.train.intent.csv",
        **kwargs,
    ):
        super().__init__(**kwargs)
        field_names = field_names or ["text", "label"]
        assert len(field_names or []) == 2, \
            "AtisIntentDataSource only handles 2 field_names: {}".format(field_names)
        self.random_seed = random_seed
        self.eval_split = validation_split
        # Load the vocab dicts in memory for the readers
        self.words = load_vocab(os.path.join(path, vocab_filename))
        self.intents = load_vocab(os.path.join(path, intent_filename))
        self.query_field = field_names[0]
        self.intent_field = field_names[1]
        self.test_queries_filepath = os.path.join(path, test_queries_filename)
        self.test_intent_filepath = os.path.join(path, test_intent_filename)
        self.train_queries_filepath = os.path.join(path, train_queries_filename)
        self.train_intent_filepath = os.path.join(path, train_intent_filename)
To generate the eval data set, we need to randomly select some of the rows in training, but in a consistent and repeatable way. This is not strictly needed, and training would work even if the selection were completely random, but having a consistent sequence helps with debugging and gives comparable results from training to training. In order to do that, we use the same seed for a new random number generator each time we start reading the train data set. The method below can be used for either training or eval and ensures that those two sets are exact complements of each other, with the ratio determined by eval_split. It returns a function that returns True or False depending on whether the row should be included or not:
def _selector(self, select_eval):
    """
    This selector ensures that the same pseudo-random sequence is
    always used from the beginning. The `select_eval` parameter
    guarantees that the training set and eval set are exact complements.
    """
    rng = Random(self.random_seed)

    def fn():
        return select_eval ^ (rng.random() >= self.eval_split)

    return fn
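Here is a small standalone illustration of that complement property (the seed and split values are just the defaults assumed above):
from random import Random

def make_selector(select_eval, seed=12345, eval_split=0.25):
    rng = Random(seed)
    return lambda: select_eval ^ (rng.random() >= eval_split)

train_fn, eval_fn = make_selector(False), make_selector(True)
picks = [(train_fn(), eval_fn()) for _ in range(10)]
assert all(t != e for t, e in picks)   # each row lands in exactly one split
print(sum(t for t, _ in picks), "of 10 rows selected for train")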
Next, we write the function that iterates through both the reader for the queries (sentences) and the reader for the intents (labels) simultaneously. It yields each row in the form of a python dictionary, where the keys are the field_names. We can pass an optional function to select a subset of the rows (i.e. the _selector defined above); the default is to select all the rows:
def _iter_rows(self, query_reader, intent_reader, select_fn=lambda: True):
    for query_str, intent_str in zip(query_reader, intent_reader):
        if select_fn():
            yield {
                # in ATIS every query starts/ends with BOS/EOS: remove them
                self.query_field: query_str[4:-4],
                self.intent_field: intent_str,
            }
Finally, we tie everything together by implementing the 3 API methods of RootDataSource. Each of those methods should return a generator that can iterate through the specific dataset entirely. For the test dataset, we simply return all the rows present in the data in test_queries_filepath and test_intent_filepath, using the corresponding vocab:
def raw_test_data_generator(self):
    return iter(self._iter_rows(
        query_reader=reader(
            self.test_queries_filepath,
            self.words,
        ),
        intent_reader=reader(
            self.test_intent_filepath,
            self.intents,
        ),
    ))
For the eval and train datasets, we read the same files train_queries_filepath and train_intent_filepath, but we select some of the rows for eval and the rest for train:
def raw_train_data_generator(self):
    return iter(self._iter_rows(
        query_reader=reader(
            self.train_queries_filepath,
            self.words,
        ),
        intent_reader=reader(
            self.train_intent_filepath,
            self.intents,
        ),
        select_fn=self._selector(select_eval=False),
    ))

def raw_eval_data_generator(self):
    return iter(self._iter_rows(
        query_reader=reader(
            self.train_queries_filepath,
            self.words,
        ),
        intent_reader=reader(
            self.train_intent_filepath,
            self.intents,
        ),
        select_fn=self._selector(select_eval=True),
    ))
RootDataSource needs to know how to transform the values in the dictionaries created by the raw generators into the types matching the tensorizers used in the model. Fortunately, RootDataSource already provides a number of type conversion functions like the one below, so we don’t need to do it for strings. If we did need to, we would declare one like this for AtisIntentDataSource:
@AtisIntentDataSource.register_type(str)
def load_string(s):
    return s
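For illustration only (not needed for ATIS), a hypothetical converter for a column containing a JSON-encoded list of floats could be registered the same way, appended to my_classifier/source.py:
import json
from typing import List

@AtisIntentDataSource.register_type(List[float])
def load_float_list(s):
    # parse the JSON string from the raw row into a properly typed value
    return [float(f) for f in json.loads(s)]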
The full source code for this tutorial can be found in demo/datasource/source.py, which includes the needed imports.
5. Testing AtisIntentDataSource¶
For rapid dev-test cycles, we add a simple main block that prints the generated data to the terminal:
if __name__ == "__main__":
    import sys

    src = AtisIntentDataSource(
        path=sys.argv[1],
        field_names=["query", "intent"],
        schema={},
    )
    for row in src.raw_train_data_generator():
        print("TRAIN", row)
    for row in src.raw_eval_data_generator():
        print("EVAL", row)
    for row in src.raw_test_data_generator():
        print("TEST", row)
We test our class to make sure we’re getting the right data.
$ python3 my_classifier/source.py atis | head -n 3
TRAIN {'query': 'what flights are available from pittsburgh to baltimore on thursday morning', 'intent': 'flight'}
TRAIN {'query': 'cheapest airfare from tacoma to orlando', 'intent': 'airfare'}
TRAIN {'query': 'round trip fares from pittsburgh to philadelphia under 1000 dollars', 'intent': 'airfare'}
$ python3 my_classifier/source.py atis | cut -d " " -f 1 | uniq -c
3732 TRAIN
1261 EVAL
893 TEST
6. Training the Model¶
First, let’s generate a config using our new AtisIntentDataSource:
$ pytext --include my_classifier gen-default-config DocumentClassificationTask AtisIntentDataSource > my_classifier/config.json
Including: my_classifier
... importing module: my_classifier.source
... importing: <class 'my_classifier.source.AtisIntentDataSource'>
INFO - Applying option: task->data->source = AtisIntentDataSource
This default config contains all the parameters with their default values, so we edit the config to remove the parameters we don’t care about and tweak the ones we do. We only want to run 3 epochs for now. It looks like this:
$ cat my_classifier/config.json
{
    "debug_path": "my_classifier.debug",
    "export_caffe2_path": "my_classifier.caffe2.predictor",
    "export_onnx_path": "my_classifier.onnx",
    "save_snapshot_path": "my_classifier.pt",
    "task": {
        "DocumentClassificationTask": {
            "data": {
                "Data": {
                    "source": {
                        "AtisIntentDataSource": {
                            "field_names": ["text", "label"],
                            "path": "atis",
                            "random_seed": 12345,
                            "validation_split": 0.25
                        }
                    }
                }
            },
            "metric_reporter": {
                "output_path": "my_classifier.out"
            },
            "trainer": {
                "epochs": 3
            }
        }
    },
    "test_out_path": "my_classifier_test.out",
    "version": 12
}
And, at last, we can train the model
$ pytext --include my_classifier train < my_classifier/config.json
Notes¶
In the current version of PyText, we need to explicitly declare a few more things, like the Config class (which mirrors the __init__ parameters) and the from_config method:
class Config(RootDataSource.Config):
    path: str = "."
    field_names: List[str] = ["text", "label"]
    validation_split: float = 0.25
    random_seed: int = 12345
    # Filenames can be overridden if necessary
    intent_filename: str = "atis.dict.intent.csv"
    vocab_filename: str = "atis.dict.vocab.csv"
    test_queries_filename: str = "atis.test.query.csv"
    test_intent_filename: str = "atis.test.intent.csv"
    train_queries_filename: str = "atis.train.query.csv"
    train_intent_filename: str = "atis.train.intent.csv"

# Config mimics the constructor.
# This will be the default in future pytext.
@classmethod
def from_config(cls, config: Config, schema: Dict[str, Type]):
    return cls(schema=schema, **config._asdict())
Custom Tensorizer¶
Tensorizer is the class that prepares the data coming out of the data source and transforms it into tensors suitable for processing. Each tensorizer knows how to prepare the input data from specific columns. In order to do that, the tensorizer (after initialization, such as creating or loading the vocabulary for look-ups) executes the following steps:
- Its Config defines which column name(s) the tensorizer will look at
- numberize() takes one row and transforms the strings into numbers
- tensorize() takes a batch of rows and creates the tensors
PyText provides a number of tensorizers for the most common cases. However, if you have your own custom features that don’t have a suitable Tensorizer, you will need to write your own class. Fortunately it’s quite easy: you simply need to create a class that inherits from Tensorizer (or one of its subclasses), and implement a few functions.
First, write a Config inner class, a from_config class method, and the constructor __init__. These just declare member variables.
The tensorizer should declare its schema by defining a column_schema property, which returns a list of tuples, one for each field/column read from the data source. Each tuple specifies the name of the column and the type of the data. By specifying the type of your data, the data source will automatically parse the inputs and pass objects of those types to the tensorizers; you don’t need to parse your own inputs.
For example, SeqTokenTensorizer reads one column from the input data. The data is formatted like a json list of strings: ["where do you wanna meet?", "MPK"]. The schema declaration looks like this:
@property
def column_schema(self):
    return [(self.column, List[str])]
Another example with GazetteerTensorizer: it needs 2 columns, one string for the text itself, and one for the gazetteer features formatted like a complex json object. (The Gazetteer type is registered in the data source to automatically convert the raw strings from the input to this type.) The schema declaration looks like this:
Gazetteer = List[Dict[str, Dict[str, float]]]

@property
def column_schema(self):
    return [(self.text_column, str), (self.dict_column, Gazetteer)]
Example Implementation¶
Let’s implement a simple word tensorizer that creates a tensor with the word indexes from a vocabulary.
class MyWordTensorizer(Tensorizer):
    class Config(Tensorizer.Config):
        #: The name of the text column to read from the data source.
        column: str = "text"

    @classmethod
    def from_config(cls, config: Config):
        return cls(column=config.column)

    def __init__(self, column):
        self.column = column
        self.vocab = None  # built later, in initialize()

    @property
    def column_schema(self):
        return [(self.column, str)]
Next we need to build the vocabulary by reading the training data and counting the words. Since multiple tensorizers might need to read the data, we parallelize the reading part, and the tensorizers use the pattern row = yield to read their inputs. In this simple example, our “tokenize” function just splits on spaces.
def _tokenize(self, row):
    raw_text = row[self.column]
    return raw_text.split()

def initialize(self):
    """Build vocabulary based on training corpus."""
    vocab_builder = VocabBuilder()
    try:
        while True:
            row = yield
            words = self._tokenize(row)
            vocab_builder.add_all(words)
    except GeneratorExit:
        self.vocab = vocab_builder.make_vocab()
The most important method is numberize, which takes a row and transforms it into a list of numbers. The exact meaning of those numbers is arbitrary and depends on the design of the model. In our case, we look up the word indexes in the vocabulary.
def numberize(self, row):
    """Look up tokens in vocabulary to get their corresponding index"""
    words = self._tokenize(row)
    idx = self.vocab.lookup_all(words)
    # LSTM representations need the length of the sequence
    return idx, len(idx)
Because LSTM-based representations need the length of the sequence to only consider the useful values and ignore the padding, we also return the length of each sequence.
Finally, the last function will create properly padded torch.Tensors from the batches produced by numberize. Numberized results can be cached for performance. We have a separate function to tensorize them because they are shuffled and batched differently (at each epoch), and then they will need different padding (because padding dimensions depend on the batch).
def tensorize(self, batch):
    tokens, seq_lens = zip(*batch)
    return (
        pad_and_tensorize(tokens, self.vocab.get_pad_index()),
        pad_and_tensorize(seq_lens),
    )
LSTM-based representations implemented in Torch also need the batches to be sorted by descending sequence length, so we add a sort function.
def sort_key(self, row):
    # LSTM representations need the batches to be sorted by descending seq_len
    return row[1]
The full code is in demo/examples/tensorizer.py
Testing¶
We can test our tensorizer with the following code that initializes the vocab, then tries the numberize function:
rows = [
    {"text": "I want some coffee"},
    {"text": "Turn it up"},
]

tensorizer = MyWordTensorizer(column="text")

# Vocabulary starts with 0 and 1 for Unknown and Padding.
# The rest of the vocabulary is built from the rows in order.
init = tensorizer.initialize()
init.send(None)  # start the loop
for row in rows:
    init.send(row)
init.close()

# Verify numberize.
numberized_rows = (tensorizer.numberize(r) for r in rows)
words, seq_len = next(numberized_rows)
assert words == [2, 3, 4, 5]
assert seq_len == 4  # "I want some coffee" has 4 words
words, seq_len = next(numberized_rows)
assert words == [6, 7, 8]
assert seq_len == 3  # "Turn it up" has 3 words

# Test again, this time also making the tensors.
numberized_rows = (tensorizer.numberize(r) for r in rows)
words_tensors, seq_len_tensors = tensorizer.tensorize(numberized_rows)
# Notice the padding (1) of the 2nd tensor to match the dimensions.
assert words_tensors.equal(torch.tensor([[2, 3, 4, 5], [6, 7, 8, 1]]))
assert seq_len_tensors.equal(torch.tensor([4, 3]))
Using External Dense Features¶
Sometimes you want to add external features to augment the inputs to your model. For example, if you want to classify a text that has an image associated with it, you might want to process the image separately and use features of this image along with the text to help the classifier. Those features are added to the input data as one extra field (column) and should look like a JSON list of floats.
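For instance, an input row with a hypothetical extra "dense" column might look like this (the field names and values are only illustrative):
import json

row = {
    "text": "a photo of my new puppy",
    "dense": "[0.12, 0.0, 3.4, 0.07]",   # e.g. precomputed image features
    "label": "pets",
}
dense_features = json.loads(row["dense"])   # -> [0.12, 0.0, 3.4, 0.07]
print(len(dense_features), "dense features for this example")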
Let’s look at a simple example: first without dense features, then with dense features added.
Example: Simple Model¶
First, here’s an example of a simple classifier that uses just the text and no dense features. (This is only showing the relevant parts of the model code for simplicity.)
class MyModel(Model):
    class Config(Model.Config):
        class ModelInput(Model.Config.InputConfig):
            tokens: TokenTensorizer.Config = TokenTensorizer.Config()
            labels: LabelTensorizer.Config = LabelTensorizer.Config()

        inputs: ModelInput = ModelInput()
        token_embedding: WordEmbedding.Config = WordEmbedding.Config()
        representation: RepresentationBase.Config = DocNNRepresentation.Config()
        decoder: DecoderBase.Config = MLPDecoder.Config()
        output_layer: OutputLayerBase.Config = ClassificationOutputLayer.Config()

    @classmethod
    def from_config(cls, config, tensorizers):
        token_embedding = create_module(config.token_embedding, tensorizer=tensorizers["tokens"])
        representation = create_module(config.representation, embed_dim=token_embedding.embedding_dim)
        labels = tensorizers["labels"].vocab
        decoder = create_module(
            config.decoder,
            in_dim=representation.representation_dim,
            out_dim=len(labels),
        )
        output_layer = create_module(config.output_layer, labels=labels)
        return cls(token_embedding, representation, decoder, output_layer)

    def arrange_model_inputs(self, tensor_dict):
        return (tensor_dict["tokens"],)

    def forward(
        self,
        tokens_in: Tuple[torch.Tensor, torch.Tensor],
    ) -> List[torch.Tensor]:
        word_tokens, seq_lens = tokens_in
        embedding_out = self.embedding(word_tokens)
        representation_out = self.representation(embedding_out, seq_lens)
        return self.decoder(representation_out)
Example: Simple Model With Dense Features¶
To use the dense features, you will typically write your model to use them directly in the decoder, bypassing the embedding and representation stages that process the text part of your inputs. Here’s the same example again, this time with the dense features added (see the lines marked with <--).
class MyModel(Model):
    class Config(Model.Config):
        class ModelInput(Model.Config.InputConfig):
            tokens: TokenTensorizer.Config = TokenTensorizer.Config()
            dense: FloatListTensorizer.Config = FloatListTensorizer.Config()  # <--
            labels: LabelTensorizer.Config = LabelTensorizer.Config()

        inputs: ModelInput = ModelInput()
        token_embedding: WordEmbedding.Config = WordEmbedding.Config()
        representation: RepresentationBase.Config = DocNNRepresentation.Config()
        decoder: DecoderBase.Config = MLPDecoder.Config()
        output_layer: OutputLayerBase.Config = ClassificationOutputLayer.Config()

    @classmethod
    def from_config(cls, config, tensorizers):
        token_embedding = create_module(config.token_embedding, tensorizer=tensorizers["tokens"])
        representation = create_module(config.representation, embed_dim=token_embedding.embedding_dim)
        dense_dim = tensorizers["dense"].out_dim  # <--
        labels = tensorizers["labels"].vocab
        decoder = create_module(
            config.decoder,
            in_dim=representation.representation_dim + dense_dim,  # <--
            out_dim=len(labels),
        )
        output_layer = create_module(config.output_layer, labels=labels)
        return cls(token_embedding, representation, decoder, output_layer)

    def arrange_model_inputs(self, tensor_dict):
        return (tensor_dict["tokens"], tensor_dict["dense"])  # <--

    def forward(
        self,
        tokens_in: Tuple[torch.Tensor, torch.Tensor],
        dense_in: torch.Tensor,  # <--
    ) -> List[torch.Tensor]:
        word_tokens, seq_lens = tokens_in
        embedding_out = self.embedding(word_tokens)
        representation_out = self.representation(embedding_out, seq_lens)
        representation_out = torch.cat((representation_out, dense_in), 1)  # <--
        return self.decoder(representation_out)
Creating A New Model¶
PyText uses a Model class as a central place to define components for data processing, model training, etc., and to wire up those components.
In this tutorial, we will create a word tagging model for the ATIS dataset. The format of the ATIS dataset is explained in the Custom Data Format tutorial, so we will not repeat it here. We are going to create a similar data source that uses the slot tagging information rather than the intent information. We won’t describe in detail how this data source is created, but you can look at the Custom Data Format tutorial and the full source code for this tutorial in demo/my_tagging for more information.
This model will predict a “slot”, also called “tag” or “label”, for each word in the utterance, using the IOB2 format, where the O tag is used for Outside (no match), B- for Beginning and I- for Inside (continuation). Here’s an example:
{
    "text": "please list the flights from newark to los angeles",
    "slots": "O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name"
}
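If you are new to IOB2, a small standalone helper like the one below (illustrative only, not part of the tutorial code) shows how such a tag sequence maps back to labeled spans:
def iob2_spans(tags):
    """Turn an IOB2 tag sequence into (label, start, end) spans."""
    spans, label, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel to flush the last span
        if tag == "O" or tag.startswith("B-"):
            if label is not None:
                spans.append((label, start, i))
                label, start = None, None
            if tag.startswith("B-"):
                label, start = tag[2:], i
        # "I-" tags continue the current span
    return spans

tags = "O O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name".split()
print(iob2_spans(tags))   # [('fromloc.city_name', 5, 6), ('toloc.city_name', 7, 9)]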
1. The Components¶
The first step is to specify the components used in this model by listing them in the Config class, the corresponding from_config function, and the constructor __init__.
Thanks to the modular nature of PyText, we can simply use many of the included common components, such as TokenTensorizer, WordEmbedding, BiLSTMSlotAttention and MLPDecoder. Since we’re also using the common pattern of embedding -> representation -> decoder -> output_layer, we use Model as a base class, so we don’t need to write __init__.
ModelInput defines how the data that is read will be transformed into tensors. This is done using a Tensorizer. These components take one or several columns (often strings) from each input row and create the corresponding numeric features in a properly padded tensor. The tensorizers need to be initialized first, and in this step they will often parse the training data to create their Vocabulary.
In our case, the utterance is in the column “text” (which is the default column name for this tensorizer), and is composed of tokens (words), so we can use the TokenTensorizer. The Vocabulary will be created from all the utterances.
The slots are also composed of tokens: the IOB2 tags. We can also use TokenTensorizer for the column “slots”. This Vocabulary will be the list of IOB2 tags found in the “slots” column of the training data. Since this is a different column name, we specify it.
class MyTaggingModel(Model):
    class Config(ConfigBase):
        class ModelInput(Model.Config.ModelInput):
            tokens: TokenTensorizer.Config = TokenTensorizer.Config()
            slots: TokenTensorizer.Config = TokenTensorizer.Config(column="slots")

        inputs: ModelInput = ModelInput()
        embedding: WordEmbedding.Config = WordEmbedding.Config()
        representation: BiLSTMSlotAttention.Config = BiLSTMSlotAttention.Config()
        decoder: MLPDecoder.Config = MLPDecoder.Config()
        output_layer: MyTaggingOutputLayer.Config = MyTaggingOutputLayer.Config()
2. from_config method¶
from_config is where the components are created with the proper parameters. Some come from the Config (passed by the user in json format), some use the default values, and others are dictated by the model’s architecture so that the different components fit with each other. For example, the representation layer needs to know the dimension of the embeddings it will receive, and the decoder needs to know the dimension of the representation layer before it and the size of the slots vocab to output.
In this model, we only need one embedding: the one for the tokens. The slots don’t have embeddings because, while they are listed as inputs (in ModelInput), they are actually outputs and will be used in the output layer. (During training, the true slot values are used as inputs.)
@classmethod
def from_config(cls, config, tensorizers):
    embedding = create_module(config.embedding, tensorizer=tensorizers["tokens"])
    representation = create_module(
        config.representation, embed_dim=embedding.embedding_dim
    )
    slots = tensorizers["slots"].vocab
    decoder = create_module(
        config.decoder,
        in_dim=representation.representation_dim,
        out_dim=len(slots),
    )
    output_layer = MyTaggingOutputLayer(slots, CrossEntropyLoss(None))
    # call __init__ constructor from super class Model
    return cls(embedding, representation, decoder, output_layer)
3. Forward method¶
The forward method contains the execution logic, calling each of those components and passing the results of one to the next. It will be called for every row transformed into tensors.
TokenTensorizer returns the tensor for the tokens themselves and also the sequence length, which is the number of tokens in each utterance. This is needed because we pad the tensors in a batch to give them all the same dimensions, and LSTM-based representations need to differentiate the padding from the actual tokens.
def forward(
    self,
    word_tokens: torch.Tensor,
    seq_lens: torch.Tensor,
) -> List[torch.Tensor]:
    # fetch embeddings for the tokens in the utterance
    embedding = self.embedding(word_tokens)
    # pass the embeddings to the BiLSTMSlotAttention layer.
    # LSTM-based representations also need seq_lens.
    representation = self.representation(embedding, seq_lens)
    # some LSTM representations return extra tensors; we don't use those.
    if isinstance(representation, tuple):
        representation = representation[0]
    # finally run the results through the decoder
    return self.decoder(representation)
4. Complete MyTaggingModel¶
To finish this class, we need to define a few more functions.
All the inputs are placed in a python dict where the key is the name of the tensorizer as defined in ModelInput, and the value is the tensor for this input row.
First, we define how the inputs will be passed to the forward function in arrange_model_inputs. In our case, the only input passed to the forward function is the tensors from the “tokens” input. As explained above, TokenTensorizer returns 2 tensors: the tokens and the sequence length. (Actually it returns 3 tensors; we’ll ignore the 3rd one, the token ranges, in this tutorial.)
Then we define arrange_targets, which does something similar for the targets that are passed to the loss function during training. In our case, it’s the “slots” tensorizer providing them. The padding value can be passed to the loss function (unlike with LSTM representations), so we only need the first tensor.
def arrange_model_inputs(self, tensor_dict):
    tokens, seq_lens, _ = tensor_dict["tokens"]
    return (tokens, seq_lens)

def arrange_targets(self, tensor_dict):
    slots, _, _ = tensor_dict["slots"]
    return slots
5. Output Layer¶
So far, our model is using the same components as any other model, including a common classification model, except for two things: the BiLSTMSlotAttention and the output layer.
BiLSTMSlotAttention is a multi-layer bidirectional LSTM based representation with attention over slots. The implementation of this representation is outside the scope of this tutorial, and this component is already included in PyText, so we’ll just use it.
The output layer is simple, but it demonstrates a few important notions in PyText, like how the loss function is tied to the output layer. We implement it like this:
class MyTaggingOutputLayer(OutputLayerBase):
    class Config(OutputLayerBase.Config):
        loss: CrossEntropyLoss.Config = CrossEntropyLoss.Config()

    @classmethod
    def from_config(cls, config, vocab, pad_token):
        return cls(
            vocab,
            create_loss(config.loss, ignore_index=pad_token),
        )

    def get_loss(self, logit, target, context, reduce=True):
        # flatten the logit from [batch_size, seq_lens, dim] to
        # [batch_size * seq_lens, dim]
        return self.loss_fn(logit.view(-1, logit.size()[-1]), target.view(-1), reduce)

    def get_pred(self, logit, *args, **kwargs):
        preds = torch.max(logit, 2)[1]
        scores = F.log_softmax(logit, 2)
        return preds, scores
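To make the reshaping in get_loss and get_pred concrete, here is a small standalone shape check (the batch, sequence and label sizes are hypothetical):
import torch
import torch.nn.functional as F

logit = torch.randn(8, 12, 30)            # [batch_size, seq_lens, num_labels]
target = torch.randint(0, 30, (8, 12))    # [batch_size, seq_lens]

# get_loss(): flatten so the loss sees one prediction per token
flat_logit = logit.view(-1, logit.size()[-1])   # [96, 30]
flat_target = target.view(-1)                   # [96]

# get_pred(): argmax and log-softmax over the label dimension
preds = torch.max(logit, 2)[1]                  # [8, 12]
scores = F.log_softmax(logit, 2)                # [8, 12, 30]
print(flat_logit.shape, flat_target.shape, preds.shape, scores.shape)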
6. Metric Reporter¶
Next we need to write a MetricReporter to calculate metrics and report model training/test results.
The MetricReporter base class aggregates all the output from the Trainer, including predictions, scores and targets. The default aggregation behavior is to concatenate the tensors from each batch and convert the result to a list. If you want different aggregation behavior, you can override it with your own implementation. Here we use the compute_classification_metrics method provided in pytext.metrics to get the precision/recall/F1 scores. PyText ships with a few common metric calculation methods, but you can easily incorporate other libraries, such as sklearn.
In the __init__ method, we can pass a list of Channels to report the results to any output stream. We use a simple ConsoleChannel that prints everything to stdout and a TensorBoardChannel that outputs metrics to TensorBoard:
class MyTaggingMetricReporter(MetricReporter):
    @classmethod
    def from_config(cls, config, vocab):
        return MyTaggingMetricReporter(
            channels=[ConsoleChannel(), TensorBoardChannel()],
            label_names=vocab,
        )

    def __init__(self, label_names, channels):
        super().__init__(channels)
        self.label_names = label_names

    def calculate_metric(self):
        return compute_classification_metrics(
            list(
                itertools.chain.from_iterable(
                    (
                        LabelPrediction(s, p, e)
                        for s, p, e in zip(scores, pred, expect)
                    )
                    for scores, pred, expect in zip(
                        self.all_scores, self.all_preds, self.all_targets
                    )
                )
            ),
            self.label_names,
            self.calculate_loss(),
        )
7. Task¶
Finally, we declare a task by inheriting from NewTask. This base class specifies the training parameters of the model: the data source and batcher, the trainer class (most models will use the default one), and the metric reporter.
Since our metric reporter needs to be initialized with a specific vocab, we need to define the classmethod create_metric_reporter so that PyText can construct it properly.
class MyTaggingTask(NewTask):
    class Config(NewTask.Config):
        model: MyTaggingModel.Config = MyTaggingModel.Config()
        metric_reporter: MyTaggingMetricReporter.Config = MyTaggingMetricReporter.Config()

    @classmethod
    def create_metric_reporter(cls, config, tensorizers):
        return MyTaggingMetricReporter(
            channels=[ConsoleChannel(), TensorBoardChannel()],
            label_names=list(tensorizers["slots"].vocab),
        )
8. Generate sample config and train the model¶
Save all your files in the same directory. For example, I saved all my files in my_tagging/. Now you can tell PyText to include your classes with the parameter --include my_tagging.
Now that we have a fully functional Task, we can generate a default JSON config for it using the pytext cli tool.
(pytext) $ pytext --include my_tagging gen-default-config MyTaggingTask > my_config.json
Tweak the config as you like, for instance change the number of epochs. Most importantly, specify the path to your ATIS dataset. Then train the model with:
(pytext) $ pytext --include my_tagging train < my_config.json
Hacking PyText¶
Using your own classes in PyText¶
Most people just want to create their own components and use them to load their data, train models, etc. In this case, you just need to put all your .py files in a directory and include it with the option --include <my_directory>. PyText will be able to find your code and import your classes. This works with PyText installed from pip or from the github sources.
Changing PyText¶
Why would you want to change PyText? Maybe you want to fix one of the github issues, or you want to experiment with changes that can’t simply be included with --include and that you would like to see in PyText’s future releases. In this case you need to download the sources and submit your changes back to github. Since getting your changes ready and integrated can take some time, you might need to keep your sources up to date. Here’s how to do it.
Installation¶
First, make a copy of the PyText repo in your github account. To do that (you need a github account), go to the PyText repo and click the Fork button at the top-right of the page.
Once the fork is complete, clone your fork onto your computer by clicking the “Clone or download” button and copying the URL. Then, in a terminal, use the git clone command to clone the repo in the directory of your choice.
$ git clone https://github.com/<your_account>/pytext.git
To be able to update your github fork with the latest changes from Facebook’s PyText sources, you need to add it as a “remote” with this command. (This can be done later.) The name “upstream” is what’s typically used, but you can use any name you want for your remotes.
$ git remote add upstream https://github.com/facebookresearch/pytext.git
Now you should have 2 remotes: origin is your own github fork, and upstream is facebook’s github repo.
Now you can install the PyText dependencies in a virtual environment. (This means the dependencies will be installed inside the directory pytext_venv under pytext/, not in your machine’s system directory.) Notice the (pytext_venv) in the terminal prompt when it’s active.
$ cd pytext
$ source activation_venv
(pytext_venv) $ ./install_deps
To exit the virtual environment:
(pytext_venv) $ deactivate
Writing Code¶
After you’ve made some code changes, you need to create a branch to commit your code. Do not commit your code in your master branch! Give your branch a name that represents what your experiment is about. Then add your changes and commit them.
$ git checkout -b <my_experiment>
$ git status -sb
... # list of files you changed
$ git add <file1> <file2>
$ git diff --cached # see the code changes you added
# ... maybe keep changing and run git add again
$ git commit # save your changes
$ git show # optional, look at the code changes
$ git push --set-upstream origin <my_experiment> # send your branch to your github fork
At this point you should be able to see your branch in your github repo and create a Pull Request to Facebook’s github if you want to submit it for review and later be integrated.
Keeping Up-to-Date¶
To resume development in an already cloned repo, you might need to re-activate the virtual environment:
$ cd pytext
$ source activation_venv
If you need to update your github repo with the latest changes in the Facebook upstream repo, fetch the changes with the commands below, merge your master with those changes, and push the result to your github fork. In order to do that, you can’t have any pending changes, so make sure you commit your current work to a branch.
$ git fetch upstream
$ git checkout master
$ git merge upstream/master
$ git push
Important: never commit changes in your master. Doing this would prevent further updates. Instead, always commit changes to a branch. (See below for more on this.)
Finally, you might need to rebase your branches to the latest master. Check out the branch, rebase it, and (optionally) push it again to your github fork.
$ git checkout <my_experiment>
$ git rebase master
$ git push # optional
Modifying your Pull Request¶
Many times you will need to modify your code and submit your pull request again. Maybe you found a bug that you need to fix, or you want to integrate some feedback you got in the pull request, or after you rebased your branch you had to solve a conflict.
If you’re going to change your pull request, it’s always a good idea to start by rebasing your branch on the latest upstream/master (see above).
After making your changes, amend your existing commit rather than creating a new commit on top of it. This ensures your changes stay in a single clean commit that does not contain your failed experiments. At this point, you will have a local branch <my_experiment> and the branch you pushed to your github fork, origin/<my_experiment>. You will then need to force the push to replace the github branch with your changes. The pull request will be updated automatically.
$ git commit --amend
$ git push --force
Addendum¶
One commit or multiple commits?¶
For most contributions, you will want to keep your pull request as a single, clean commit. It’s better to amend the same commit rather than keeping the entire history of intermediate edits.
If your change is more involved, it might be better to create multiple commits, as long as each commit does one thing and is self contained.
Code Quality¶
In order to get your pull request integrated with PyText, it needs to pass the tests and be reviewed. Pull requests automatically run the circleci tests, and they must all be green for your pull request to be accepted. These tests include building the documentation, running the unit tests under Python 3.6 and 3.7, and running the linter black to verify code formatting. You can run the linter yourself after installing it with pip install black.
If all the tests are green, people will start reviewing your changes. (You too can review other pull requests and make comments and suggestions.) If reviewers ask questions or make suggestions, try your best to answer them with comments or code changes.
A very common reason to reject a pull request is lack of unit testing. Make sure your code is covered by unit tests (add your own tests) so that it works now and keeps working in the future when other people make changes to your code!
Creating Documentation¶
Whether you want to add documentation for your feature in code, or just change the existing documentation, you will need to test it locally. First install the extra dependencies needed to build the documentation:
$ pip install --upgrade -r docs_requirements.txt
$ pip install --upgrade -r pytext/docs/requirements.txt
Then you can build the documentation
$ cd pytext/docs
$ make html
Finally, you can look at the documentation produced at a URL like this: file:///<path_to_pytext_sources>/pytext/docs/build/html/hacking_pytext.html
Useful git alias¶
One of the most useful git commands is one that prints the commits and branches as a tree. This is a complex command that is most useful when stored as an alias, so we’re giving it here.
$ git config --global alias.lg "log --pretty=tformat:'%C(yellow)%h %Cgreen(%ad)%Cred%d %Creset%s %C(bold blue)<%cn>%Creset' --decorate --date=short --date=local --graph --all"
$ # try it
$ git lg
pytext¶
config¶
field_config¶
FeatureConfig¶
Component: Module
class pytext.config.field_config.FeatureConfig
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- word_feat: WordEmbedding.Config = WordEmbedding.Config()
- seq_word_feat: Optional[WordEmbedding.Config] = None
- dict_feat: Optional[DictEmbedding.Config] = None
- char_feat: Optional[CharacterEmbedding.Config] = None
- dense_feat: Optional[FloatVectorConfig] = None
- contextual_token_embedding: Optional[ContextualTokenEmbedding.Config] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"word_feat": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"seq_word_feat": null,
"dict_feat": null,
"char_feat": null,
"dense_feat": null,
"contextual_token_embedding": null
}
FloatVectorConfig¶
class pytext.config.field_config.FloatVectorConfig
Bases: ConfigBase
All Attributes (including base classes)
- dim: int = 0
- export_input_names: list[str] = ['float_vec_vals']
- dim_error_check: bool = False
Default JSON
{
"dim": 0,
"export_input_names": [
"float_vec_vals"
],
"dim_error_check": false
}
module_config¶
CNNParams¶
class pytext.config.module_config.CNNParams
Bases: ConfigBase
All Attributes (including base classes)
- kernel_num: int = 100
- kernel_sizes: list[int] = [3, 4]
- weight_norm: bool = False
- dilated: bool = False
- causal: bool = False
Default JSON
{
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
pytext_config¶
PyTextConfig¶
class pytext.config.pytext_config.PyTextConfig
Bases: ConfigBase
All Attributes (including base classes)
- task: Union[TaskBase.Config, Task_Deprecated.Config, _NewTask.Config, NewTask.Config, DisjointMultitask.Config, NewDisjointMultitask.Config, QueryDocumentPairwiseRankingTask.Config, EnsembleTask.Config, DocumentClassificationTask.Config, DocumentRegressionTask.Config, NewBertClassificationTask.Config, NewBertPairClassificationTask.Config, BertPairRegressionTask.Config, WordTaggingTask.Config, IntentSlotTask.Config, LMTask.Config, MaskedLMTask.Config, PairwiseClassificationTask.Config, RoBERTaNERTask.Config, SeqNNTask.Config, SquadQATask.Config, SemanticParsingTask.Config]
- use_cuda_if_available: bool = True
- use_fp16: bool = False
- distributed_world_size: int = 1
- gpu_streams_for_distributed_training: int = 1
- load_snapshot_path: str = ''
- save_snapshot_path: str = '/tmp/model.pt'
- use_config_from_snapshot: bool = True
- auto_resume_from_snapshot: bool = False
- export_caffe2_path: Optional[str] = None
- export_onnx_path: str = '/tmp/model.onnx'
- export_torchscript_path: Optional[str] = None
- torchscript_quantize: Optional[bool] = False
- modules_save_dir: str = ''
- save_module_checkpoints: bool = False
- save_all_checkpoints: bool = False
- use_tensorboard: bool = True
- random_seed: Optional[int] = None
  Seed value to seed torch, python, and numpy random generators.
- use_deterministic_cudnn: bool = False
  Whether to allow CuDNN to behave deterministically.
- report_eval_results: bool = False
- include_dirs: Optional[list[str]] = None
- version: int
- use_cuda_for_testing: bool = True
- test_out_path: str = '/tmp/test_out.txt'
- debug_path: str = '/tmp/model.debug'
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
data¶
batch_sampler¶
AlternatingRandomizedBatchSampler.Config¶
Component: AlternatingRandomizedBatchSampler
class AlternatingRandomizedBatchSampler.Config
Bases: Component.Config
All Attributes (including base classes)
- unnormalized_iterator_probs: dict[str, float]
- second_unnormalized_iterator_probs: dict[str, float]
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
BaseBatchSampler.Config¶
Component: BaseBatchSampler
class BaseBatchSampler.Config
Bases: Component.Config
All Attributes (including base classes)
Subclasses:
- EvalBatchSampler.Config
Default JSON
{}
EvalBatchSampler.Config¶
Component: EvalBatchSampler
-
class
EvalBatchSampler.
Config
Bases:
BaseBatchSampler.Config
All Attributes (including base classes)
Default JSON
{}
RandomizedBatchSampler.Config¶
Component: RandomizedBatchSampler
-
class
RandomizedBatchSampler.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- unnormalized_iterator_probs: dict[str, float]
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
RoundRobinBatchSampler.Config¶
Component: RoundRobinBatchSampler
-
class
RoundRobinBatchSampler.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- iter_to_set_epoch: str =
''
Default JSON
{
"iter_to_set_epoch": ""
}
bert_tensorizer¶
BERTTensorizer.Config¶
Component: BERTTensorizer
-
class
BERTTensorizer.
Config
[source] Bases:
BERTTensorizerBase.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['text']
- tokenizer: Tokenizer.Config = WordPieceTokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
'/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
- max_seq_len: int =
256
- Subclasses
SquadForBERTTensorizer.Config
SquadForBERTTensorizerForKD.Config
Default JSON
{
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 256
}
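As a sketch of how these defaults are typically overridden, the fragment below supplies a local vocab file and a shorter sequence length for a BERTTensorizer used as a model input; the path is a placeholder, and any field left out falls back to the defaults shown above.
Example JSON
{
  "BERTTensorizer": {
    "columns": ["text"],
    "vocab_file": "/path/to/bert/vocab.txt",
    "max_seq_len": 128
  }
}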
BERTTensorizerBase.Config¶
Component: BERTTensorizerBase
-
class
BERTTensorizerBase.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['text']
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
''
- max_seq_len: int =
256
- Subclasses
BERTTensorizer.Config
RoBERTaTensorizer.Config
RoBERTaTokenLevelTensorizer.Config
SquadForBERTTensorizer.Config
SquadForBERTTensorizerForKD.Config
SquadForRoBERTaTensorizer.Config
Default JSON
{
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"base_tokenizer": null,
"vocab_file": "",
"max_seq_len": 256
}
data¶
Batcher.Config¶
Component: Batcher
-
class
Batcher.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- train_batch_size: int =
16
- Make batches of this size when possible. If there’s not enough data, might generate some smaller batches.
- eval_batch_size: int =
16
- test_batch_size: int =
16
- Subclasses
PoolingBatcher.Config
DynamicPoolingBatcher.Config
ExponentialDynamicPoolingBatcher.Config
LinearDynamicPoolingBatcher.Config
Default JSON
{
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16
}
Data.Config¶
Component: Data
-
class
Data.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- source: DataSource.Config = TSVDataSource.Config()
- Specify where training/test/eval data come from. The default value will not provide any data.
- batcher: Batcher.Config = PoolingBatcher.Config()
- How training examples are split into batches for the optimizer.
- sort_key: Optional[str] =
None
- in_memory: Optional[bool] =
True
- Cache numberized results in memory; turn off when CPU-memory bound.
- Subclasses
PackedLMData.Config
Default JSON
{
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
PoolingBatcher.Config¶
Component: PoolingBatcher
-
class
PoolingBatcher.
Config
[source] Bases:
Batcher.Config
All Attributes (including base classes)
- train_batch_size: int =
16
- eval_batch_size: int =
16
- test_batch_size: int =
16
- pool_num_batches: int =
10000
- Size of a pool expressed in number of batches
- num_shuffled_pools: int =
1
- How many pool-sized chunks to load at a time for shuffling
- Subclasses
DynamicPoolingBatcher.Config
ExponentialDynamicPoolingBatcher.Config
LinearDynamicPoolingBatcher.Config
Default JSON
{
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
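The batcher field of Data.Config takes a component-wrapped entry like the one in the Data.Config default above. A sketch with illustrative values that enlarges the batches and shuffles across two pools might look like this:
Example JSON
{
  "batcher": {
    "PoolingBatcher": {
      "train_batch_size": 32,
      "eval_batch_size": 64,
      "test_batch_size": 64,
      "pool_num_batches": 1000,
      "num_shuffled_pools": 2
    }
  }
}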
data_handler¶
DataHandler.Config¶
Component: DataHandler
-
class
DataHandler.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- columns_to_read: list[str] =
[]
- shuffle: bool =
True
- sort_within_batch: bool =
True
- train_path: str =
'train.tsv'
- eval_path: str =
'eval.tsv'
- test_path: str =
'test.tsv'
- train_batch_size: int =
128
- eval_batch_size: int =
128
- test_batch_size: int =
128
- column_mapping: dict[str, str] =
{}
- Subclasses
DisjointMultitaskDataHandler.Config
Default JSON
{
"columns_to_read": [],
"shuffle": true,
"sort_within_batch": true,
"train_path": "train.tsv",
"eval_path": "eval.tsv",
"test_path": "test.tsv",
"train_batch_size": 128,
"eval_batch_size": 128,
"test_batch_size": 128,
"column_mapping": {}
}
disjoint_multitask_data¶
DisjointMultitaskData.Config¶
Component: DisjointMultitaskData
-
class
DisjointMultitaskData.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- sampler: BaseBatchSampler.Config = RoundRobinBatchSampler.Config()
- test_key: Optional[str] =
None
Default JSON
{
"sampler": {
"RoundRobinBatchSampler": {
"iter_to_set_epoch": ""
}
},
"test_key": null
}
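Because RandomizedBatchSampler has no default for unnormalized_iterator_probs, a sampler entry has to be written by hand. A sketch is shown below; the keys "intent" and "slot" are hypothetical sub-task names and must match the task names used in your DisjointMultitask config.
Example JSON
{
  "sampler": {
    "RandomizedBatchSampler": {
      "unnormalized_iterator_probs": {
        "intent": 0.7,
        "slot": 0.3
      }
    }
  },
  "test_key": "intent"
}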
disjoint_multitask_data_handler¶
DisjointMultitaskDataHandler.Config¶
Component: DisjointMultitaskDataHandler
-
class
DisjointMultitaskDataHandler.
Config
[source] Bases:
DataHandler.Config
Configuration class for DisjointMultitaskDataHandler.
-
upsample
¶ If upsample is True, keep cycling over each iterator in round-robin fashion; iterators with fewer batches get more passes. If False, do a single pass over each iterator, and iterators that run out sit idle. This is used for evaluation. Default: True.
Type: bool
-
All Attributes (including base classes)
- columns_to_read: list[str] =
[]
- shuffle: bool =
True
- sort_within_batch: bool =
True
- train_path: str =
'train.tsv'
- eval_path: str =
'eval.tsv'
- test_path: str =
'test.tsv'
- train_batch_size: int =
128
- eval_batch_size: int =
128
- test_batch_size: int =
128
- column_mapping: dict[str, str] =
{}
- upsample: bool =
True
Default JSON
{
"columns_to_read": [],
"shuffle": true,
"sort_within_batch": true,
"train_path": "train.tsv",
"eval_path": "eval.tsv",
"test_path": "test.tsv",
"train_batch_size": 128,
"eval_batch_size": 128,
"test_batch_size": 128,
"column_mapping": {},
"upsample": true
}
dynamic_pooling_batcher¶
BatcherSchedulerConfig¶
Component: Module
-
class
pytext.data.dynamic_pooling_batcher.
BatcherSchedulerConfig
[source] Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- start_batch_size: int =
32
- end_batch_size: int =
256
- epoch_period: int =
10
- step_size: int =
1
- Subclasses
ExponentialBatcherSchedulerConfig
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"start_batch_size": 32,
"end_batch_size": 256,
"epoch_period": 10,
"step_size": 1
}
DynamicPoolingBatcher.Config¶
Component: DynamicPoolingBatcher
-
class
DynamicPoolingBatcher.
Config
[source] Bases:
PoolingBatcher.Config
All Attributes (including base classes)
- train_batch_size: int =
16
- eval_batch_size: int =
16
- test_batch_size: int =
16
- pool_num_batches: int =
10000
- num_shuffled_pools: int =
1
- scheduler_config: BatcherSchedulerConfig = BatcherSchedulerConfig()
- Subclasses
ExponentialDynamicPoolingBatcher.Config
LinearDynamicPoolingBatcher.Config
Default JSON
{
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1,
"scheduler_config": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"start_batch_size": 32,
"end_batch_size": 256,
"epoch_period": 10,
"step_size": 1
}
}
ExponentialBatcherSchedulerConfig¶
Component: Module
-
class
pytext.data.dynamic_pooling_batcher.
ExponentialBatcherSchedulerConfig
[source] Bases:
BatcherSchedulerConfig
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- start_batch_size: int =
32
- end_batch_size: int =
256
- epoch_period: int =
10
- step_size: int =
1
- gamma: float =
5
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"start_batch_size": 32,
"end_batch_size": 256,
"epoch_period": 10,
"step_size": 1,
"gamma": 5
}
ExponentialDynamicPoolingBatcher.Config¶
Component: ExponentialDynamicPoolingBatcher
-
class
ExponentialDynamicPoolingBatcher.
Config
[source] Bases:
DynamicPoolingBatcher.Config
All Attributes (including base classes)
- train_batch_size: int =
16
- eval_batch_size: int =
16
- test_batch_size: int =
16
- pool_num_batches: int =
10000
- num_shuffled_pools: int =
1
- scheduler_config: ExponentialBatcherSchedulerConfig = BatcherSchedulerConfig()
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
LinearDynamicPoolingBatcher.Config¶
Component: LinearDynamicPoolingBatcher
-
class
LinearDynamicPoolingBatcher.
Config
Bases:
DynamicPoolingBatcher.Config
All Attributes (including base classes)
- train_batch_size: int =
16
- eval_batch_size: int =
16
- test_batch_size: int =
16
- pool_num_batches: int =
10000
- num_shuffled_pools: int =
1
- scheduler_config: BatcherSchedulerConfig = BatcherSchedulerConfig()
Default JSON
{
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1,
"scheduler_config": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"start_batch_size": 32,
"end_batch_size": 256,
"epoch_period": 10,
"step_size": 1
}
}
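Putting the pieces above together, a Data.Config batcher entry using the exponential variant might look like the sketch below; the scheduler values are illustrative. Because ExponentialDynamicPoolingBatcher declares scheduler_config as ExponentialBatcherSchedulerConfig but defaults it to a plain BatcherSchedulerConfig (see the Warning above), gamma has to be supplied explicitly.
Example JSON
{
  "batcher": {
    "ExponentialDynamicPoolingBatcher": {
      "train_batch_size": 16,
      "eval_batch_size": 16,
      "test_batch_size": 16,
      "scheduler_config": {
        "start_batch_size": 32,
        "end_batch_size": 512,
        "epoch_period": 5,
        "step_size": 1,
        "gamma": 2
      }
    }
  }
}
As the field names suggest, the batch size ramps from start_batch_size toward end_batch_size over epoch_period epochs, stepping every step_size epochs.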
featurizer¶
featurizer¶
Featurizer.Config¶
Component: Featurizer
-
class
Featurizer.
Config
Bases:
Component.Config
All Attributes (including base classes)
Default JSON
{}
simple_featurizer¶
SimpleFeaturizer.Config¶
Component: SimpleFeaturizer
-
class
SimpleFeaturizer.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- sentence_markers: Optional[tuple[str, str]] =
None
- lowercase_tokens: bool =
True
- split_regex: str =
'\\s+'
- convert_to_bytes: bool =
False
Default JSON
{
"sentence_markers": null,
"lowercase_tokens": true,
"split_regex": "\\s+",
"convert_to_bytes": false
}
packed_lm_data¶
PackedLMData.Config¶
Component: PackedLMData
-
class
PackedLMData.
Config
[source] Bases:
Data.Config
All Attributes (including base classes)
- source: DataSource.Config = TSVDataSource.Config()
- batcher: Batcher.Config = PoolingBatcher.Config()
- sort_key: Optional[str] =
None
- in_memory: Optional[bool] =
True
- max_seq_len: int =
128
Default JSON
{
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true,
"max_seq_len": 128
}
roberta_tensorizer¶
RoBERTaTensorizer.Config¶
Component: RoBERTaTensorizer
-
class
RoBERTaTensorizer.
Config
[source] Bases:
BERTTensorizerBase.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['text']
- tokenizer: Tokenizer.Config = GPT2BPETokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt'
- max_seq_len: int =
256
- Subclasses
RoBERTaTokenLevelTensorizer.Config
SquadForRoBERTaTensorizer.Config
Default JSON
{
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256
}
RoBERTaTokenLevelTensorizer.Config¶
Component: RoBERTaTokenLevelTensorizer
-
class
RoBERTaTokenLevelTensorizer.
Config
[source] Bases:
RoBERTaTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['text']
- tokenizer: Tokenizer.Config = GPT2BPETokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt'
- max_seq_len: int =
256
- labels_columns: list[str] =
['label']
- labels: list[str] =
[]
Default JSON
{
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256,
"labels_columns": [
"label"
],
"labels": []
}
sources¶
conllu¶
CoNLLUNERDataSource.Config¶
Component: CoNLLUNERDataSource
-
class
CoNLLUNERDataSource.
Config
Bases:
CoNLLUPOSDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- language: Optional[str] =
None
- train_filename: Optional[str] =
None
- test_filename: Optional[str] =
None
- eval_filename: Optional[str] =
None
- field_names: Optional[list[str]] =
None
- delimiter: str =
'\t'
Default JSON
{
"column_mapping": {},
"language": null,
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t"
}
CoNLLUPOSDataSource.Config¶
Component: CoNLLUPOSDataSource
-
class
CoNLLUPOSDataSource.
Config
[source] Bases:
RootDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- language: Optional[str] =
None
- Name of the language. If not set, languages will be empty.
- train_filename: Optional[str] =
None
- Filename of training set. If not set, iteration will be empty.
- test_filename: Optional[str] =
None
- Filename of testing set. If not set, iteration will be empty.
- eval_filename: Optional[str] =
None
- Filename of eval set. If not set, iteration will be empty.
- field_names: Optional[list[str]] =
None
- Field names for the TSV. If this is not set, the first line of each file will be assumed to be a header containing the field names.
- delimiter: str =
'\t'
- The column delimiter. The CoNLL-U default is a tab ("\t").
- Subclasses
CoNLLUNERDataSource.Config
Default JSON
{
"column_mapping": {},
"language": null,
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t"
}
data_source¶
DataSource.Config¶
Component: DataSource
-
class
DataSource.
Config
Bases:
Component.Config
All Attributes (including base classes)
- Subclasses
RowShardedDataSource.Config
ShardedDataSource.Config
SquadDataSource.Config
SquadDataSourceForKD.Config
Default JSON
{}
RootDataSource.Config¶
Component: RootDataSource
-
class
RootDataSource.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- An optional column mapping that renames columns in the raw data source to the column names expected by the schema, so the raw columns do not have to match the schema directly.
- Subclasses
CoNLLUNERDataSource.Config
CoNLLUPOSDataSource.Config
PandasDataSource.Config
SessionPandasDataSource.Config
SessionDataSource.Config
BlockShardedTSVDataSource.Config
MultilingualTSVDataSource.Config
SessionTSVDataSource.Config
TSVDataSource.Config
Default JSON
{
"column_mapping": {}
}
RowShardedDataSource.Config¶
Component: RowShardedDataSource
-
class
RowShardedDataSource.
Config
Bases:
ShardedDataSource.Config
All Attributes (including base classes)
Default JSON
{}
ShardedDataSource.Config¶
Component: ShardedDataSource
-
class
ShardedDataSource.
Config
Bases:
DataSource.Config
All Attributes (including base classes)
- Subclasses
RowShardedDataSource.Config
Default JSON
{}
pandas¶
PandasDataSource.Config¶
Component: PandasDataSource
-
class
PandasDataSource.
Config
Bases:
RootDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- Subclasses
SessionPandasDataSource.Config
Default JSON
{
"column_mapping": {}
}
SessionPandasDataSource.Config¶
Component: SessionPandasDataSource
-
class
SessionPandasDataSource.
Config
Bases:
PandasDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
Default JSON
{
"column_mapping": {}
}
session¶
SessionDataSource.Config¶
Component: SessionDataSource
-
class
SessionDataSource.
Config
Bases:
RootDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
Default JSON
{
"column_mapping": {}
}
squad¶
SquadDataSource.Config¶
Component: SquadDataSource
-
class
SquadDataSource.
Config
[source] Bases:
DataSource.Config
All Attributes (including base classes)
- train_filename: Optional[str] =
'train-v2.0.json'
- test_filename: Optional[str] =
'dev-v2.0.json'
- eval_filename: Optional[str] =
'dev-v2.0.json'
- ignore_impossible: bool =
True
- max_character_length: int =
1048576
- min_overlap: float =
0.1
- delimiter: str =
'\t'
- quoted: bool =
False
- Subclasses
SquadDataSourceForKD.Config
Default JSON
{
"train_filename": "train-v2.0.json",
"test_filename": "dev-v2.0.json",
"eval_filename": "dev-v2.0.json",
"ignore_impossible": true,
"max_character_length": 1048576,
"min_overlap": 0.1,
"delimiter": "\t",
"quoted": false
}
SquadDataSourceForKD.Config¶
Component: SquadDataSourceForKD
-
class
SquadDataSourceForKD.
Config
Bases:
SquadDataSource.Config
All Attributes (including base classes)
- train_filename: Optional[str] =
'train-v2.0.json'
- test_filename: Optional[str] =
'dev-v2.0.json'
- eval_filename: Optional[str] =
'dev-v2.0.json'
- ignore_impossible: bool =
True
- max_character_length: int =
1048576
- min_overlap: float =
0.1
- delimiter: str =
'\t'
- quoted: bool =
False
Default JSON
{
"train_filename": "train-v2.0.json",
"test_filename": "dev-v2.0.json",
"eval_filename": "dev-v2.0.json",
"ignore_impossible": true,
"max_character_length": 1048576,
"min_overlap": 0.1,
"delimiter": "\t",
"quoted": false
}
tsv¶
BlockShardedTSVDataSource.Config¶
Component: BlockShardedTSVDataSource
-
class
BlockShardedTSVDataSource.
Config
Bases:
TSVDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- train_filename: Optional[str] =
None
- test_filename: Optional[str] =
None
- eval_filename: Optional[str] =
None
- field_names: Optional[list[str]] =
None
- delimiter: str =
'\t'
- quoted: bool =
False
- drop_incomplete_rows: bool =
False
Default JSON
{
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
MultilingualTSVDataSource.Config¶
Component: MultilingualTSVDataSource
-
class
MultilingualTSVDataSource.
Config
[source] Bases:
TSVDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- train_filename: Optional[str] =
None
- test_filename: Optional[str] =
None
- eval_filename: Optional[str] =
None
- field_names: Optional[list[str]] =
None
- delimiter: str =
'\t'
- quoted: bool =
False
- drop_incomplete_rows: bool =
False
- data_source_languages: dict[str, list[str]] =
{'train': ['en'], 'eval': ['en'], 'test': ['en']}
- language_columns: list[str] =
['language']
Default JSON
{
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false,
"data_source_languages": {
"train": [
"en"
],
"eval": [
"en"
],
"test": [
"en"
]
},
"language_columns": [
"language"
]
}
SessionTSVDataSource.Config¶
Component: SessionTSVDataSource
-
class
SessionTSVDataSource.
Config
Bases:
TSVDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- train_filename: Optional[str] =
None
- test_filename: Optional[str] =
None
- eval_filename: Optional[str] =
None
- field_names: Optional[list[str]] =
None
- delimiter: str =
'\t'
- quoted: bool =
False
- drop_incomplete_rows: bool =
False
Default JSON
{
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
TSVDataSource.Config¶
Component: TSVDataSource
-
class
TSVDataSource.
Config
[source] Bases:
RootDataSource.Config
All Attributes (including base classes)
- column_mapping: dict[str, str] =
{}
- train_filename: Optional[str] =
None
- Filename of training set. If not set, iteration will be empty.
- test_filename: Optional[str] =
None
- Filename of testing set. If not set, iteration will be empty.
- eval_filename: Optional[str] =
None
- Filename of eval set. If not set, iteration will be empty.
- field_names: Optional[list[str]] =
None
- Field names for the TSV. If this is not set, the first line of each file will be assumed to be a header containing the field names.
- delimiter: str =
'\t'
- The column delimiter passed to Python’s csv library. Change to “,” for csv.
- quoted: bool =
False
- Whether the columns can use quotes to include delimiters. Rows with unclosed quotes will be merged, with a newline ("\n") inside. Change to True for quoted csv.
- drop_incomplete_rows: bool =
False
- Subclasses
BlockShardedTSVDataSource.Config
MultilingualTSVDataSource.Config
SessionTSVDataSource.Config
Default JSON
{
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
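A typical hand-written TSVDataSource entry points at local files and spells out the column order; the filenames and field names below are placeholders and must match your data.
Example JSON
{
  "TSVDataSource": {
    "train_filename": "train.tsv",
    "eval_filename": "eval.tsv",
    "test_filename": "test.tsv",
    "field_names": ["label", "slots", "text"],
    "delimiter": "\t"
  }
}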
squad_for_bert_tensorizer¶
SquadForBERTTensorizer.Config¶
Component: SquadForBERTTensorizer
-
class
SquadForBERTTensorizer.
Config
[source] Bases:
BERTTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['question', 'doc']
- tokenizer: Tokenizer.Config = WordPieceTokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
'/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
- max_seq_len: int =
256
- answers_column: str =
'answers'
- answer_starts_column: str =
'answer_starts'
- Subclasses
SquadForBERTTensorizerForKD.Config
Default JSON
{
"is_input": true,
"columns": [
"question",
"doc"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 256,
"answers_column": "answers",
"answer_starts_column": "answer_starts"
}
SquadForBERTTensorizerForKD.Config¶
Component: SquadForBERTTensorizerForKD
-
class
SquadForBERTTensorizerForKD.
Config
[source] Bases:
SquadForBERTTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['question', 'doc']
- tokenizer: Tokenizer.Config = WordPieceTokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
'/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
- max_seq_len: int =
256
- answers_column: str =
'answers'
- answer_starts_column: str =
'answer_starts'
- start_logits_column: str =
'start_logits'
- end_logits_column: str =
'end_logits'
- has_answer_logits_column: str =
'has_answer_logits'
- pad_mask_column: str =
'pad_mask'
- segment_labels_column: str =
'segment_labels'
Default JSON
{
"is_input": true,
"columns": [
"question",
"doc"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 256,
"answers_column": "answers",
"answer_starts_column": "answer_starts",
"start_logits_column": "start_logits",
"end_logits_column": "end_logits",
"has_answer_logits_column": "has_answer_logits",
"pad_mask_column": "pad_mask",
"segment_labels_column": "segment_labels"
}
SquadForRoBERTaTensorizer.Config¶
Component: SquadForRoBERTaTensorizer
-
class
SquadForRoBERTaTensorizer.
Config
[source] Bases:
RoBERTaTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- columns: list[str] =
['question', 'doc']
- tokenizer: Tokenizer.Config = GPT2BPETokenizer.Config()
- base_tokenizer: Optional[Tokenizer.Config] =
None
- vocab_file: str =
'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt'
- max_seq_len: int =
256
- answers_column: str =
'answers'
- answer_starts_column: str =
'answer_starts'
Default JSON
{
"is_input": true,
"columns": [
"question",
"doc"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256,
"answers_column": "answers",
"answer_starts_column": "answer_starts"
}
squad_tensorizer¶
SquadTensorizer.Config¶
Component: SquadTensorizer
-
class
SquadTensorizer.
Config
[source] Bases:
TokenTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text'
- tokenizer: Tokenizer.Config = Tokenizer.Config(split_regex='\\W+')
- add_bos_token: bool =
False
- add_eos_token: bool =
False
- use_eos_token_for_bos: bool =
False
- max_seq_len: Optional[int] =
None
- vocab: VocabConfig = VocabConfig()
- vocab_file_delimiter: str =
' '
- doc_column: str =
'doc'
- ques_column: str =
'question'
- answers_column: str =
'answers'
- answer_starts_column: str =
'answer_starts'
- max_ques_seq_len: int =
64
- max_doc_seq_len: int =
256
- Subclasses
SquadTensorizerForKD.Config
Default JSON
{
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\W+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " ",
"doc_column": "doc",
"ques_column": "question",
"answers_column": "answers",
"answer_starts_column": "answer_starts",
"max_ques_seq_len": 64,
"max_doc_seq_len": 256
}
SquadTensorizerForKD.Config¶
Component: SquadTensorizerForKD
-
class
SquadTensorizerForKD.
Config
[source] Bases:
SquadTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text'
- tokenizer: Tokenizer.Config = Tokenizer.Config(split_regex='\\W+')
- add_bos_token: bool =
False
- add_eos_token: bool =
False
- use_eos_token_for_bos: bool =
False
- max_seq_len: Optional[int] =
None
- vocab: VocabConfig = VocabConfig()
- vocab_file_delimiter: str =
' '
- doc_column: str =
'doc'
- ques_column: str =
'question'
- answers_column: str =
'answers'
- answer_starts_column: str =
'answer_starts'
- max_ques_seq_len: int =
64
- max_doc_seq_len: int =
256
- start_logits_column: str =
'start_logits'
- end_logits_column: str =
'end_logits'
- has_answer_logits_column: str =
'has_answer_logits'
- pad_mask_column: str =
'pad_mask'
- segment_labels_column: str =
'segment_labels'
Default JSON
{
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\W+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " ",
"doc_column": "doc",
"ques_column": "question",
"answers_column": "answers",
"answer_starts_column": "answer_starts",
"max_ques_seq_len": 64,
"max_doc_seq_len": 256,
"start_logits_column": "start_logits",
"end_logits_column": "end_logits",
"has_answer_logits_column": "has_answer_logits",
"pad_mask_column": "pad_mask",
"segment_labels_column": "segment_labels"
}
tensorizers¶
AnnotationNumberizer.Config¶
Component: AnnotationNumberizer
-
class
AnnotationNumberizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'seqlogical'
Default JSON
{
"is_input": true,
"column": "seqlogical"
}
ByteTensorizer.Config¶
Component: ByteTensorizer
-
class
ByteTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text'
- The name of the text column to parse from the data source.
- lower: bool =
True
- max_seq_len: Optional[int] =
None
- add_bos_token: Optional[bool] =
False
- add_eos_token: Optional[bool] =
False
- use_eos_token_for_bos: Optional[bool] =
False
Default JSON
{
"is_input": true,
"column": "text",
"lower": true,
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false
}
ByteTokenTensorizer.Config¶
Component: ByteTokenTensorizer
-
class
ByteTokenTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text'
- The name of the text column to parse from the data source.
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- The tokenizer to use to split input text into tokens.
- max_seq_len: Optional[int] =
None
- The max token length for input text.
- max_byte_len: int =
15
- The max byte length for a token.
- offset_for_non_padding: int =
0
- Offset to add to all non-padding bytes
- add_bos_token: bool =
False
- add_eos_token: bool =
False
- use_eos_token_for_bos: bool =
False
Default JSON
{
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"max_seq_len": null,
"max_byte_len": 15,
"offset_for_non_padding": 0,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false
}
CharacterTokenTensorizer.Config¶
Component: CharacterTokenTensorizer
-
class
CharacterTokenTensorizer.
Config
[source] Bases:
TokenTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text'
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- add_bos_token: bool =
False
- add_eos_token: bool =
False
- use_eos_token_for_bos: bool =
False
- max_seq_len: Optional[int] =
None
- vocab: VocabConfig = VocabConfig()
- vocab_file_delimiter: str =
' '
- max_char_length: int =
20
- The max character length for a token.
Default JSON
{
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " ",
"max_char_length": 20
}
FloatListTensorizer.Config¶
Component: FloatListTensorizer
-
class
FloatListTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str
- The name of the label column to parse from the data source.
- error_check: bool =
False
- dim: Optional[int] =
None
- normalize: bool =
False
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
FloatTensorizer.Config¶
Component: FloatTensorizer
-
class
FloatTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str
- The name of the column to parse from the data source.
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
GazetteerTensorizer.Config¶
Component: GazetteerTensorizer
-
class
GazetteerTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- text_column: str =
'text'
- dict_column: str =
'dict'
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- The tokenizer used to split text and create dict tensors of the same size.
Default JSON
{
"is_input": true,
"text_column": "text",
"dict_column": "dict",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
}
LabelListTensorizer.Config¶
Component: LabelListTensorizer
-
class
LabelListTensorizer.
Config
Bases:
LabelTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- column: str =
'label'
- allow_unknown: bool =
False
- pad_in_vocab: bool =
False
- label_vocab: Optional[list[str]] =
None
Default JSON
{
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
LabelTensorizer.Config¶
Component: LabelTensorizer
-
class
LabelTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- column: str =
'label'
- The name of the label column to parse from the data source.
- allow_unknown: bool =
False
- Whether to allow for unknown labels at test/prediction time
- pad_in_vocab: bool =
False
- Whether the vocab should include a padding token; usually False when the label is used as a target.
- label_vocab: Optional[list[str]] =
None
- The label values, if known. Will skip initialization step if provided.
- Subclasses
LabelListTensorizer.Config
SoftLabelTensorizer.Config
Default JSON
{
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
MetricTensorizer.Config¶
Component: MetricTensorizer
-
class
MetricTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- names: list[str]
- indexes: list[int]
- Subclasses
NtokensTensorizer.Config
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
NtokensTensorizer.Config¶
Component: NtokensTensorizer
-
class
NtokensTensorizer.
Config
Bases:
MetricTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- names: list[str]
- indexes: list[int]
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
NumericLabelTensorizer.Config¶
Component: NumericLabelTensorizer
-
class
NumericLabelTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- column: str =
'label'
- The name of the label column to parse from the data source.
- rescale_range: Optional[list[float]] =
None
- If provided, the range of values the raw label can take. Label values will be rescaled to lie within [0, 1].
Default JSON
{
"is_input": false,
"column": "label",
"rescale_range": null
}
SeqTokenTensorizer.Config¶
Component: SeqTokenTensorizer
-
class
SeqTokenTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text_seq'
- max_seq_len: Optional[int] =
None
- add_bos_token: bool =
False
- sentence markers
- add_eos_token: bool =
False
- use_eos_token_for_bos: bool =
False
- add_bol_token: bool =
False
- list markers
- add_eol_token: bool =
False
- use_eol_token_for_bol: bool =
False
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- The tokenizer to use to split input text into tokens.
Default JSON
{
"is_input": true,
"column": "text_seq",
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"add_bol_token": false,
"add_eol_token": false,
"use_eol_token_for_bol": false,
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
}
SlotLabelTensorizer.Config¶
Component: SlotLabelTensorizer
-
class
SlotLabelTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- slot_column: str =
'slots'
- The name of the slot label column to parse from the data source.
- text_column: str =
'text'
- The name of the text column to parse from the data source. We need this to be able to generate tensors which correspond to input text.
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- The tokenizer to use to split input text into tokens. This should be configured in a way which yields tokens consistent with the tokens input to or output by a model, so that the labels generated by this tensorizer will match the indices of the model’s tokens.
- allow_unknown: bool =
False
- Whether to allow for unknown labels at test/prediction time
- Subclasses
SlotLabelTensorizerExpansible.Config
Default JSON
{
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
SlotLabelTensorizerExpansible.Config¶
Component: SlotLabelTensorizerExpansible
-
class
SlotLabelTensorizerExpansible.
Config
Bases:
SlotLabelTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- slot_column: str =
'slots'
- text_column: str =
'text'
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- allow_unknown: bool =
False
Default JSON
{
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
SoftLabelTensorizer.Config¶
Component: SoftLabelTensorizer
-
class
SoftLabelTensorizer.
Config
[source] Bases:
LabelTensorizer.Config
All Attributes (including base classes)
- is_input: bool =
False
- column: str =
'label'
- allow_unknown: bool =
False
- pad_in_vocab: bool =
False
- label_vocab: Optional[list[str]] =
None
- probs_column: str =
'target_probs'
- logits_column: str =
'target_logits'
- labels_column: str =
'target_labels'
Default JSON
{
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null,
"probs_column": "target_probs",
"logits_column": "target_logits",
"labels_column": "target_labels"
}
Tensorizer.Config¶
Component: Tensorizer
-
class
Tensorizer.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- is_input: bool =
True
- Subclasses
BERTTensorizer.Config
BERTTensorizerBase.Config
RoBERTaTensorizer.Config
RoBERTaTokenLevelTensorizer.Config
SquadForBERTTensorizer.Config
SquadForBERTTensorizerForKD.Config
SquadForRoBERTaTensorizer.Config
SquadTensorizer.Config
SquadTensorizerForKD.Config
AnnotationNumberizer.Config
ByteTensorizer.Config
ByteTokenTensorizer.Config
CharacterTokenTensorizer.Config
FloatListTensorizer.Config
FloatTensorizer.Config
GazetteerTensorizer.Config
LabelListTensorizer.Config
LabelTensorizer.Config
MetricTensorizer.Config
NtokensTensorizer.Config
NumericLabelTensorizer.Config
SeqTokenTensorizer.Config
SlotLabelTensorizer.Config
SlotLabelTensorizerExpansible.Config
SoftLabelTensorizer.Config
TokenTensorizer.Config
UidTensorizer.Config
Default JSON
{
"is_input": true
}
TokenTensorizer.Config¶
Component: TokenTensorizer
-
class
TokenTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'text'
- The name of the text column to parse from the data source.
- tokenizer: Tokenizer.Config = Tokenizer.Config()
- The tokenizer to use to split input text into tokens.
- add_bos_token: bool =
False
- add_eos_token: bool =
False
- use_eos_token_for_bos: bool =
False
- max_seq_len: Optional[int] =
None
- vocab: VocabConfig = VocabConfig()
- vocab_file_delimiter: str =
' '
- Subclasses
SquadTensorizer.Config
SquadTensorizerForKD.Config
CharacterTokenTensorizer.Config
Default JSON
{
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
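As an illustrative sketch, a TokenTensorizer that caps sequences at 64 tokens and adds sentence markers can be written as below; fields that are omitted keep the defaults shown above.
Example JSON
{
  "column": "text",
  "add_bos_token": true,
  "add_eos_token": true,
  "max_seq_len": 64
}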
UidTensorizer.Config¶
Component: UidTensorizer
-
class
UidTensorizer.
Config
[source] Bases:
Tensorizer.Config
All Attributes (including base classes)
- is_input: bool =
True
- column: str =
'uid'
- allow_unknown: bool =
True
Default JSON
{
"is_input": true,
"column": "uid",
"allow_unknown": true
}
VocabConfig¶
Component: Component
-
class
pytext.data.tensorizers.
VocabConfig
[source] Bases:
Component.Config
All Attributes (including base classes)
- build_from_data: bool =
True
- Whether to add tokens from training data to vocab.
- size_from_data: int =
0
- Add size_from_data most frequent tokens in training data to vocab (if this is 0, add all tokens from training data).
- vocab_files: list[VocabFileConfig] =
[]
Default JSON
{
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
}
VocabFileConfig¶
Component: Component
-
class
pytext.data.tensorizers.
VocabFileConfig
[source] Bases:
Component.Config
All Attributes (including base classes)
- filepath: str =
''
- File containing tokens to add to vocab (first whitespace-separated entry per line)
- skip_header_line: bool =
False
- Whether to skip the first line of the file (e.g. if it is a header line)
- lowercase_tokens: bool =
False
- Whether to lowercase each of the tokens in the file
- size_limit: int =
0
- The max number of tokens to add to vocab
Default JSON
{
"filepath": "",
"skip_header_line": false,
"lowercase_tokens": false,
"size_limit": 0
}
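A sketch combining a vocab built from training data with an external vocab file is shown below; the file path, sizes, and limits are placeholders.
Example JSON
{
  "build_from_data": true,
  "size_from_data": 50000,
  "vocab_files": [
    {
      "filepath": "/path/to/extra_vocab.txt",
      "skip_header_line": true,
      "lowercase_tokens": true,
      "size_limit": 10000
    }
  ]
}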
tokenizers¶
tokenizer¶
BERTInitialTokenizer.Config¶
Component: BERTInitialTokenizer
-
class
BERTInitialTokenizer.
Config
[source] Bases:
Tokenizer.Config
Config for this class.
All Attributes (including base classes)
- split_regex: str =
'\\s+'
- lowercase: bool =
True
Default JSON
{
"split_regex": "\\s+",
"lowercase": true
}
DoNothingTokenizer.Config¶
Component: DoNothingTokenizer
-
class
DoNothingTokenizer.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- do_nothing: str =
''
Default JSON
{
"do_nothing": ""
}
GPT2BPETokenizer.Config¶
Component: GPT2BPETokenizer
-
class
GPT2BPETokenizer.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- bpe_encoder_path: str =
'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json'
- bpe_vocab_path: str =
'manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe'
Default JSON
{
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
SentencePieceTokenizer.Config¶
Component: SentencePieceTokenizer
-
class
SentencePieceTokenizer.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- sp_model_path: str =
''
Default JSON
{
"sp_model_path": ""
}
Tokenizer.Config¶
Component: Tokenizer
-
class
Tokenizer.
Config
[source] Bases:
Component.Config
All Attributes (including base classes)
- split_regex: str =
'\\s+'
- A regular expression for the tokenizer to split on. Tokens are the segments between the regular expression matches. The start index is inclusive of the unmatched region, and the end index is exclusive (matching the first character of the matched split region).
- lowercase: bool =
True
- Whether token values should be lowercased or not.
- Subclasses
BERTInitialTokenizer.Config
Default JSON
{
"split_regex": "\\s+",
"lowercase": true
}
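For example, to split on runs of non-word characters instead of whitespace and keep the original casing, the tokenizer config can be overridden as below; this is the same split_regex used by SquadTensorizer above.
Example JSON
{
  "split_regex": "\\W+",
  "lowercase": false
}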
WordPieceTokenizer.Config¶
Component: WordPieceTokenizer
-
class
WordPieceTokenizer.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- basic_tokenizer: BERTInitialTokenizer.Config = BERTInitialTokenizer.Config()
- wordpiece_vocab_path: str =
'/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt'
Default JSON
{
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
exporters¶
custom_exporters¶
DenseFeatureExporter.Config¶
Component: DenseFeatureExporter
-
class
DenseFeatureExporter.
Config
Bases:
ModelExporter.Config
All Attributes (including base classes)
- export_logits: bool =
False
- export_raw_to_metrics: bool =
False
Default JSON
{
"export_logits": false,
"export_raw_to_metrics": false
}
InitPredictNetExporter.Config¶
Component: InitPredictNetExporter
-
class
InitPredictNetExporter.
Config
Bases:
ModelExporter.Config
All Attributes (including base classes)
- export_logits: bool =
False
- export_raw_to_metrics: bool =
False
Default JSON
{
"export_logits": false,
"export_raw_to_metrics": false
}
exporter¶
ModelExporter.Config¶
Component: ModelExporter
-
class
ModelExporter.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- export_logits: bool =
False
- export_raw_to_metrics: bool =
False
- Subclasses
DenseFeatureExporter.Config
InitPredictNetExporter.Config
Default JSON
{
"export_logits": false,
"export_raw_to_metrics": false
}
loss¶
loss¶
AUCPRHingeLoss.Config¶
Component: AUCPRHingeLoss
-
class
AUCPRHingeLoss.
Config
[source] Bases:
ConfigBase
-
precision_range_lower
¶ The lower range of precision values over which to compute AUC. Must be nonnegative, ≤ precision_range_upper, and ≤ 1.0.
Type: float
-
precision_range_upper
¶ The upper range of precision values over which to compute AUC. Must be nonnegative, ≥ precision_range_lower, and ≤ 1.0.
Type: float
-
num_classes
¶ Number of classes (aka labels).
Type: int
-
num_anchors
¶ The number of grid points used to approximate the Riemann sum.
Type: int
-
All Attributes (including base classes)
- precision_range_lower: float =
0.0
- precision_range_upper: float =
1.0
- num_classes: int =
1
- num_anchors: int =
20
Default JSON
{
"precision_range_lower": 0.0,
"precision_range_upper": 1.0,
"num_classes": 1,
"num_anchors": 20
}
BinaryCrossEntropyLoss.Config¶
Component: BinaryCrossEntropyLoss
-
class
BinaryCrossEntropyLoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- reweight_negative: bool =
True
- reduce: bool =
True
Default JSON
{
"reweight_negative": true,
"reduce": true
}
CosineEmbeddingLoss.Config¶
Component: CosineEmbeddingLoss
-
class
CosineEmbeddingLoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- margin: float =
0.0
Default JSON
{
"margin": 0.0
}
CrossEntropyLoss.Config¶
Component: CrossEntropyLoss
-
class
CrossEntropyLoss.
Config
Bases:
Loss.Config
All Attributes (including base classes)
Default JSON
{}
KLDivergenceBCELoss.Config¶
Component: KLDivergenceBCELoss
-
class
KLDivergenceBCELoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- temperature: float =
1.0
- hard_weight: float =
0.0
Default JSON
{
"temperature": 1.0,
"hard_weight": 0.0
}
KLDivergenceCELoss.Config¶
Component: KLDivergenceCELoss
-
class
KLDivergenceCELoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- temperature: float =
1.0
- hard_weight: float =
0.0
Default JSON
{
"temperature": 1.0,
"hard_weight": 0.0
}
LabelSmoothedCrossEntropyLoss.Config¶
Component: LabelSmoothedCrossEntropyLoss
-
class
LabelSmoothedCrossEntropyLoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- beta: float =
0.1
- from_logits: bool =
True
- use_entropy: bool =
False
Default JSON
{
"beta": 0.1,
"from_logits": true,
"use_entropy": false
}
Loss.Config¶
Component: Loss
-
class
Loss.
Config
Bases:
Component.Config
All Attributes (including base classes)
- Subclasses
CrossEntropyLoss.Config
MultiLabelSoftMarginLoss.Config
NLLLoss.Config
Default JSON
{}
MAELoss.Config¶
Component: MAELoss
-
class
MAELoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
Default JSON
{}
MSELoss.Config¶
Component: MSELoss
-
class
MSELoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
Default JSON
{}
MultiLabelSoftMarginLoss.Config¶
Component: MultiLabelSoftMarginLoss
-
class
MultiLabelSoftMarginLoss.
Config
Bases:
Loss.Config
All Attributes (including base classes)
Default JSON
{}
NLLLoss.Config¶
Component: NLLLoss
-
class
NLLLoss.
Config
Bases:
Loss.Config
All Attributes (including base classes)
Default JSON
{}
PairwiseRankingLoss.Config¶
Component: PairwiseRankingLoss
-
class
PairwiseRankingLoss.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- margin: float =
1.0
Default JSON
{
"margin": 1.0
}
metric_reporters¶
classification_metric_reporter¶
ClassificationMetricReporter.Config¶
Component: ClassificationMetricReporter
-
class
ClassificationMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- model_select_metric: ComparableClassificationMetric =
<ComparableClassificationMetric.ACCURACY: 'accuracy'>
- target_label: Optional[str] =
None
- text_column_names: list[str] =
['text']
- These column names correspond to raw input data columns. Text in these columns (usually just 1 column) will be concatenated and output in the IntentModelChannel as an evaluation tsv.
- additional_column_names: list[str] =
[]
- These column names correspond to raw input data columns that will be read by the data source into context and included in the run_model output file along with the other results.
- recall_at_precision_thresholds: list[float] =
[0.2, 0.4, 0.6, 0.8, 0.9]
- Subclasses
MultiLabelClassificationMetricReporter.Config
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
MultiLabelClassificationMetricReporter.Config¶
Component: MultiLabelClassificationMetricReporter
-
class
MultiLabelClassificationMetricReporter.
Config
Bases:
ClassificationMetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- model_select_metric: ComparableClassificationMetric =
<ComparableClassificationMetric.ACCURACY: 'accuracy'>
- target_label: Optional[str] =
None
- text_column_names: list[str] =
['text']
- additional_column_names: list[str] =
[]
- recall_at_precision_thresholds: list[float] =
[0.2, 0.4, 0.6, 0.8, 0.9]
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
compositional_metric_reporter¶
CompositionalMetricReporter.Config¶
Component: CompositionalMetricReporter
-
class
CompositionalMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- text_column_name: str =
'tokenized_text'
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"text_column_name": "tokenized_text"
}
disjoint_multitask_metric_reporter¶
DisjointMultitaskMetricReporter.Config¶
Component: DisjointMultitaskMetricReporter
-
class
DisjointMultitaskMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- use_subtask_select_metric: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"use_subtask_select_metric": false
}
intent_slot_detection_metric_reporter¶
IntentSlotMetricReporter.Config¶
Component: IntentSlotMetricReporter
-
class
IntentSlotMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
language_model_metric_reporter¶
LanguageModelMetricReporter.Config¶
Component: LanguageModelMetricReporter
-
class
LanguageModelMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- aggregate_metrics: bool =
True
- perplexity_type: PerplexityType =
<PerplexityType.MEDIAN: 'median'>
- Subclasses
MaskedLMMetricReporter.Config
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"aggregate_metrics": true,
"perplexity_type": "median"
}
MaskedLMMetricReporter.Config¶
Component: MaskedLMMetricReporter
-
class
MaskedLMMetricReporter.
Config
Bases:
LanguageModelMetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- aggregate_metrics: bool =
True
- perplexity_type: PerplexityType =
<PerplexityType.MEDIAN: 'median'>
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"aggregate_metrics": true,
"perplexity_type": "median"
}
metric_reporter¶
MetricReporter.Config¶
Component: MetricReporter
-
class
MetricReporter.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- Subclasses
ClassificationMetricReporter.Config
MultiLabelClassificationMetricReporter.Config
CompositionalMetricReporter.Config
DisjointMultitaskMetricReporter.Config
IntentSlotMetricReporter.Config
LanguageModelMetricReporter.Config
MaskedLMMetricReporter.Config
PureLossMetricReporter.Config
PairwiseRankingMetricReporter.Config
RegressionMetricReporter.Config
SquadMetricReporter.Config
NERMetricReporter.Config
SequenceTaggingMetricReporter.Config
WordTaggingMetricReporter.Config
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
PureLossMetricReporter.Config¶
Component: PureLossMetricReporter
-
class
PureLossMetricReporter.
Config
Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
pairwise_ranking_metric_reporter¶
PairwiseRankingMetricReporter.Config¶
Component: PairwiseRankingMetricReporter
-
class
PairwiseRankingMetricReporter.
Config
Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
regression_metric_reporter¶
RegressionMetricReporter.Config¶
Component: RegressionMetricReporter
-
class
RegressionMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
squad_metric_reporter¶
SquadMetricReporter.Config¶
Component: SquadMetricReporter
-
class
SquadMetricReporter.
Config
[source] Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
- n_best_size: int =
5
- max_answer_length: int =
16
- ignore_impossible: bool =
True
- false_label: str =
'False'
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"n_best_size": 5,
"max_answer_length": 16,
"ignore_impossible": true,
"false_label": "False"
}
word_tagging_metric_reporter¶
NERMetricReporter.Config¶
Component: NERMetricReporter
-
class
NERMetricReporter.
Config
Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
SequenceTaggingMetricReporter.Config¶
Component: SequenceTaggingMetricReporter
-
class
SequenceTaggingMetricReporter.
Config
Bases:
MetricReporter.Config
All Attributes (including base classes)
- output_path: str =
'/tmp/test_out.txt'
- pep_format: bool =
False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
WordTaggingMetricReporter.Config¶
Component: WordTaggingMetricReporter
class WordTaggingMetricReporter.Config
Bases: MetricReporter.Config
All Attributes (including base classes)
- output_path: str = '/tmp/test_out.txt'
- pep_format: bool = False
Default JSON
{
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
models¶
bert_classification_models¶
BertModelInput¶
class pytext.models.bert_classification_models.BertModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: BERTTensorizer.Config = BERTTensorizer.Config(max_seq_len=128)
- dense: Optional[FloatListTensorizer.Config] = None
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
- num_tokens: NtokensTensorizer.Config = NtokensTensorizer.Config(names=['tokens'], indexes=[2])
Default JSON
{
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"num_tokens": {
"is_input": false,
"names": [
"tokens"
],
"indexes": [
2
]
}
}
BertPairwiseModel.Config¶
Component: BertPairwiseModel
class BertPairwiseModel.Config
Bases: BasePairwiseModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- decoder: Optional[MLPDecoder.Config] = MLPDecoder.Config()
- output_layer: Union[ClassificationOutputLayer.Config, PairwiseCosineDistanceOutputLayer.Config] = ClassificationOutputLayer.Config()
- encode_relations: bool = True
- encoder: TransformerSentenceEncoderBase.Config = HuggingFaceBertSentenceEncoder.Config()
- shared_encoder: bool = True
Default JSON
{
"inputs": {
"tokens1": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text1"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"tokens2": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text2"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"num_tokens": {
"is_input": false,
"names": [
"tokens1",
"tokens2"
],
"indexes": [
2,
2
]
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"ClassificationOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"encode_relations": true,
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"shared_encoder": true
}
ModelInput¶
class pytext.models.bert_classification_models.ModelInput
Bases: ModelInputBase
All Attributes (including base classes)
- tokens1: BERTTensorizer.Config = BERTTensorizer.Config(columns=['text1'], max_seq_len=128)
- tokens2: BERTTensorizer.Config = BERTTensorizer.Config(columns=['text2'], max_seq_len=128)
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
- num_tokens: NtokensTensorizer.Config = NtokensTensorizer.Config(names=['tokens1', 'tokens2'], indexes=[2, 2])
Default JSON
{
"tokens1": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text1"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"tokens2": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text2"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"num_tokens": {
"is_input": false,
"names": [
"tokens1",
"tokens2"
],
"indexes": [
2,
2
]
}
}
NewBertModel.Config¶
Component: NewBertModel
class NewBertModel.Config
Bases: BaseModel.Config
All Attributes (including base classes)
- inputs: BertModelInput = BertModelInput()
- encoder: TransformerSentenceEncoderBase.Config = HuggingFaceBertSentenceEncoder.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
- Subclasses
NewBertRegressionModel.Config
BertSquadQAModel.Config
RoBERTa.Config
Default JSON
{
"inputs": {
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"num_tokens": {
"is_input": false,
"names": [
"tokens"
],
"indexes": [
2
]
}
},
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
}
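As a minimal, hypothetical override (the paths below are placeholders, not defaults shipped with PyText), one might point the tensorizer and encoder at a local BERT checkpoint and shorten the sequence length, leaving all other fields at the defaults printed above; note how union-typed fields are selected by wrapping the options in a key named after the chosen component:
{
  "inputs": {
    "tokens": {
      "BERTTensorizer": {
        "vocab_file": "/path/to/vocab.txt",
        "max_seq_len": 64
      }
    }
  },
  "encoder": {
    "HuggingFaceBertSentenceEncoder": {
      "bert_cpt_dir": "/path/to/bert_checkpoint/",
      "output_dropout": 0.2
    }
  }
}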
bert_regression_model¶
InputConfig¶
class pytext.models.bert_regression_model.InputConfig
Bases: ConfigBase
All Attributes (including base classes)
- tokens: BERTTensorizer.Config = BERTTensorizer.Config(columns=['text1', 'text2'], max_seq_len=128)
- labels: NumericLabelTensorizer.Config = NumericLabelTensorizer.Config()
Default JSON
{
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text1",
"text2"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"labels": {
"is_input": false,
"column": "label",
"rescale_range": null
}
}
NewBertRegressionModel.Config¶
Component: NewBertRegressionModel
class NewBertRegressionModel.Config
Bases: NewBertModel.Config
All Attributes (including base classes)
- inputs: InputConfig = InputConfig()
- encoder: TransformerSentenceEncoderBase.Config = HuggingFaceBertSentenceEncoder.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: RegressionOutputLayer.Config = RegressionOutputLayer.Config()
Default JSON
{
"inputs": {
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text1",
"text2"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"labels": {
"is_input": false,
"column": "label",
"rescale_range": null
}
},
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {},
"squash_to_unit_range": false
}
}
decoders¶
decoder_base¶
DecoderBase.Config¶
Component: DecoderBase
class DecoderBase.Config
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- Subclasses
IntentSlotModelDecoder.Config
MLPDecoder.Config
MLPDecoderQueryResponse.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
intent_slot_model_decoder¶
IntentSlotModelDecoder.Config¶
Component: IntentSlotModelDecoder
class IntentSlotModelDecoder.Config
Bases: DecoderBase.Config
Configuration class for IntentSlotModelDecoder.
- use_doc_probs_in_word: Whether to use intent probabilities for predicting slots. Type: bool
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- use_doc_probs_in_word: bool = False
- doc_decoder: MLPDecoder.Config = MLPDecoder.Config()
- word_decoder: MLPDecoder.Config = MLPDecoder.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"use_doc_probs_in_word": false,
"doc_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"word_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
}
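A hedged example of overriding this decoder in a joint intent-slot config (the hidden sizes and dropout are illustrative, not recommended values): enable use_doc_probs_in_word and give each MLP one hidden layer:
{
  "use_doc_probs_in_word": true,
  "doc_decoder": {
    "hidden_dims": [128],
    "dropout": 0.1
  },
  "word_decoder": {
    "hidden_dims": [128],
    "dropout": 0.1
  }
}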
mlp_decoder¶
MLPDecoder.Config¶
Component: MLPDecoder
class MLPDecoder.Config
Bases: DecoderBase.Config
Configuration class for MLPDecoder.
- hidden_dims: Dimensions of the outputs of hidden layers. Type: List[int]
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- hidden_dims: list[int] = []
- out_dim: Optional[int] = None
- layer_norm: bool = False
- dropout: float = 0.0
- activation: Activation = <Activation.RELU: 'relu'>
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
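For example (layer sizes chosen arbitrarily for illustration), a two-hidden-layer decoder with layer norm and dropout would be configured as:
{
  "hidden_dims": [256, 128],
  "layer_norm": true,
  "dropout": 0.2,
  "activation": "relu"
}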
mlp_decoder_query_response¶
MLPDecoderQueryResponse.Config¶
Component: MLPDecoderQueryResponse
class MLPDecoderQueryResponse.Config
Bases: DecoderBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- hidden_dims: list[int] = []
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": []
}
disjoint_multitask_model¶
DisjointMultitaskModel.Config¶
Component: DisjointMultitaskModel
class DisjointMultitaskModel.Config
Bases: Model.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- Subclasses
NewDisjointMultitaskModel.Config
Default JSON
{
"inputs": {}
}
NewDisjointMultitaskModel.Config¶
Component: NewDisjointMultitaskModel
class NewDisjointMultitaskModel.Config
Bases: DisjointMultitaskModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
Default JSON
{
"inputs": {}
}
doc_model¶
ByteModelInput¶
class pytext.models.doc_model.ByteModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- dense: Optional[FloatListTensorizer.Config] = None
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
- token_bytes: ByteTokenTensorizer.Config = ByteTokenTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"token_bytes": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"max_seq_len": null,
"max_byte_len": 15,
"offset_for_non_padding": 0,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false
}
}
ByteTokensDocumentModel.Config¶
Component: ByteTokensDocumentModel
class ByteTokensDocumentModel.Config
Bases: DocModel.Config
All Attributes (including base classes)
- inputs: ByteModelInput = ByteModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[PureDocAttention.Config, BiLSTMDocAttention.Config, DocNNRepresentation.Config, DeepCNNRepresentation.Config] = BiLSTMDocAttention.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
- byte_embedding: CharacterEmbedding.Config = CharacterEmbedding.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"token_bytes": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"max_seq_len": null,
"max_byte_len": 15,
"offset_for_non_padding": 0,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"byte_embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"sparse": false,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
},
"highway_layers": 0,
"projection_dim": null,
"export_input_names": [
"char_vals"
],
"vocab_from_train_data": true,
"max_word_length": 20,
"min_freq": 1
}
}
DocModel.Config¶
Component: DocModel
class DocModel.Config
Bases: Model.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[PureDocAttention.Config, BiLSTMDocAttention.Config, DocNNRepresentation.Config, DeepCNNRepresentation.Config] = BiLSTMDocAttention.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
- Subclasses
ByteTokensDocumentModel.Config
DocRegressionModel.Config
PersonalizedDocModel.Config
SeqNNModel.Config
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
}
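The representation field is a union type; the Default JSON above selects BiLSTMDocAttention by wrapping its options in a key of that name. A hedged sketch of switching to the CNN-based DocNNRepresentation instead, keeping that representation's own fields at their defaults, would be:
{
  "representation": {
    "DocNNRepresentation": {}
  }
}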
DocRegressionModel.Config¶
Component: DocRegressionModel
class DocRegressionModel.Config
Bases: DocModel.Config
All Attributes (including base classes)
- inputs: RegressionModelInput = RegressionModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[PureDocAttention.Config, BiLSTMDocAttention.Config, DocNNRepresentation.Config, DeepCNNRepresentation.Config] = BiLSTMDocAttention.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: RegressionOutputLayer.Config = RegressionOutputLayer.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"is_input": false,
"column": "label",
"rescale_range": null
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {},
"squash_to_unit_range": false
}
}
ModelInput¶
class pytext.models.doc_model.ModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- dense: Optional[FloatListTensorizer.Config] = None
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
}
PersonalizedDocModel.Config¶
Component: PersonalizedDocModel
class PersonalizedDocModel.Config
Bases: DocModel.Config
All Attributes (including base classes)
- inputs: PersonalizedModelInput = PersonalizedModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[PureDocAttention.Config, BiLSTMDocAttention.Config, DocNNRepresentation.Config, DeepCNNRepresentation.Config] = BiLSTMDocAttention.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
- user_embedding: WordEmbedding.Config = WordEmbedding.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"uid": {
"is_input": true,
"column": "uid",
"allow_unknown": true
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"user_embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
}
}
PersonalizedModelInput¶
class pytext.models.doc_model.PersonalizedModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- dense: Optional[FloatListTensorizer.Config] = None
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
- uid: Optional[UidTensorizer.Config] = UidTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"uid": {
"is_input": true,
"column": "uid",
"allow_unknown": true
}
}
RegressionModelInput¶
class pytext.models.doc_model.RegressionModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- dense: Optional[FloatListTensorizer.Config] = None
- labels: NumericLabelTensorizer.Config = NumericLabelTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"is_input": false,
"column": "label",
"rescale_range": null
}
}
embeddings¶
char_embedding¶
CharacterEmbedding.Config¶
Component: CharacterEmbedding
class CharacterEmbedding.Config
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- embed_dim: int = 100
- sparse: bool = False
- cnn: CNNParams = CNNParams()
- highway_layers: int = 0
- projection_dim: Optional[int] = None
- export_input_names: list[str] = ['char_vals']
- vocab_from_train_data: bool = True
- max_word_length: int = 20
- min_freq: int = 1
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"sparse": false,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
},
"highway_layers": 0,
"projection_dim": null,
"export_input_names": [
"char_vals"
],
"vocab_from_train_data": true,
"max_word_length": 20,
"min_freq": 1
}
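As an illustrative override (all values hypothetical), a smaller character CNN with an extra kernel size, one highway layer and a projection could be requested with:
{
  "embed_dim": 50,
  "cnn": {
    "kernel_num": 50,
    "kernel_sizes": [3, 4, 5]
  },
  "highway_layers": 1,
  "projection_dim": 64,
  "max_word_length": 30
}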
contextual_token_embedding¶
ContextualTokenEmbedding.Config¶
Component: ContextualTokenEmbedding
class ContextualTokenEmbedding.Config
Bases: ConfigBase
All Attributes (including base classes)
- embed_dim: int = 0
- model_paths: Optional[dict[str, str]] = None
- export_input_names: list[str] = ['contextual_token_embedding']
Default JSON
{
"embed_dim": 0,
"model_paths": null,
"export_input_names": [
"contextual_token_embedding"
]
}
dict_embedding¶
DictEmbedding.Config¶
Component: DictEmbedding
class DictEmbedding.Config
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- embed_dim: int = 100
- sparse: bool = False
- pooling: PoolingType = <PoolingType.MEAN: 'mean'>
- export_input_names: list[str] = ['dict_vals', 'dict_weights', 'dict_lens']
- vocab_from_train_data: bool = True
- mobile: bool = False
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"sparse": false,
"pooling": "mean",
"export_input_names": [
"dict_vals",
"dict_weights",
"dict_lens"
],
"vocab_from_train_data": true,
"mobile": false
}
embedding_base¶
EmbeddingBase.Config¶
Component: EmbeddingBase
class EmbeddingBase.Config
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- Subclasses
EmbeddingList.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
embedding_list¶
EmbeddingList.Config¶
Component: EmbeddingList
class EmbeddingList.Config
Bases: EmbeddingBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
word_embedding¶
WordEmbedding.Config¶
Component: WordEmbedding
class WordEmbedding.Config
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- embed_dim: int = 100
- embedding_init_strategy: EmbedInitStrategy = <EmbedInitStrategy.RANDOM: 'random'>
- embedding_init_range: Optional[list[float]] = None
- export_input_names: list[str] = ['tokens_vals']
- pretrained_embeddings_path: str = ''
- vocab_file: str = ''
- vocab_size: int = 0
- vocab_from_train_data: bool = True
- vocab_from_all_data: bool = False
- vocab_from_pretrained_embeddings: bool = False
- lowercase_tokens: bool = True
- min_freq: int = 1
- mlp_layer_dims: Optional[list[int]] = []
- padding_idx: Optional[int] = None
- cpu_only: bool = False
- skip_header: bool = True
- delimiter: str = ' '
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
}
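For example (the embedding path is a placeholder), a word embedding can be initialized from pretrained vectors and have its vocabulary built from those vectors rather than from the training data, a hedged sketch being:
{
  "embed_dim": 300,
  "pretrained_embeddings_path": "/path/to/pretrained_vectors.txt",
  "vocab_from_train_data": false,
  "vocab_from_pretrained_embeddings": true,
  "lowercase_tokens": true
}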
ensembles¶
bagging_doc_ensemble¶
BaggingDocEnsembleModel.Config¶
Component: BaggingDocEnsembleModel
class BaggingDocEnsembleModel.Config
Bases: EnsembleModel.Config
Configuration class for BaggingDocEnsembleModel. These attributes are used by Ensemble.from_config() to construct an instance of BaggingDocEnsembleModel.
- models: List of document classification model configurations. Type: List[DocModel.Config]
All Attributes (including base classes)
- models: list[DocModel.Config]
- sample_rate: float = 1.0
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
bagging_intent_slot_ensemble¶
BaggingIntentSlotEnsembleModel.Config¶
Component: BaggingIntentSlotEnsembleModel
class BaggingIntentSlotEnsembleModel.Config
Bases: EnsembleModel.Config
Configuration class for BaggingIntentSlotEnsembleModel. These attributes are used by Ensemble.from_config() to construct an instance of BaggingIntentSlotEnsembleModel.
- models: List of intent-slot model configurations. Type: List[IntentSlotModel.Config]
- output_layer: Output layer of the intent-slot model, responsible for computing loss and predictions. Type: IntentSlotOutputLayer
All Attributes (including base classes)
- models: list[IntentSlotModel.Config]
- sample_rate: float = 1.0
- use_crf: bool = False
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
ensemble¶
EnsembleModel.Config¶
Component: EnsembleModel
class EnsembleModel.Config
Bases: ConfigBase
All Attributes (including base classes)
- models: list[Any]
- sample_rate: float = 1.0
- Subclasses
BaggingDocEnsembleModel.Config
BaggingIntentSlotEnsembleModel.Config
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
joint_model¶
IntentSlotModel.Config¶
Component: IntentSlotModel
class IntentSlotModel.Config
Bases: Model.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- word_embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[BiLSTMDocSlotAttention.Config, JointCNNRepresentation.Config, SharedCNNRepresentation.Config, PassThroughRepresentation.Config] = BiLSTMDocSlotAttention.Config()
- output_layer: IntentSlotOutputLayer.Config = IntentSlotOutputLayer.Config()
- decoder: IntentSlotModelDecoder.Config = IntentSlotModelDecoder.Config()
- default_doc_loss_weight: float = 0.2
- default_word_loss_weight: float = 0.5
- Subclasses
ContextualIntentSlotModel.Config
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"word_labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": true
},
"doc_labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": true,
"pad_in_vocab": false,
"label_vocab": null
}
},
"doc_weight": null,
"word_weight": null
},
"word_embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocSlotAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
}
},
"pooling": null,
"slot_attention": null,
"doc_mlp_layers": 0,
"word_mlp_layers": 0
}
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_output": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"word_output": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"use_doc_probs_in_word": false,
"doc_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"word_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
},
"default_doc_loss_weight": 0.2,
"default_word_loss_weight": 0.5
}
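As a hedged example (the weights are illustrative only), the document and word loss weights shown above can be rebalanced, and another representation from the union selected, directly in the model section of a task config:
{
  "representation": {
    "PassThroughRepresentation": {}
  },
  "default_doc_loss_weight": 0.5,
  "default_word_loss_weight": 0.5
}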
ModelInput¶
class pytext.models.joint_model.ModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- word_labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config(allow_unknown=True)
- doc_labels: LabelTensorizer.Config = LabelTensorizer.Config(allow_unknown=True)
- doc_weight: Optional[FloatTensorizer.Config] = None
- word_weight: Optional[FloatTensorizer.Config] = None
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"word_labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": true
},
"doc_labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": true,
"pad_in_vocab": false,
"label_vocab": null
}
},
"doc_weight": null,
"word_weight": null
}
language_models¶
lmlstm¶
LMLSTM.Config¶
Component: LMLSTM
class LMLSTM.Config
Bases: BaseModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[BiLSTM.Config, DeepCNNRepresentation.Config] = BiLSTM.Config(bidirectional=False)
- decoder: Optional[MLPDecoder.Config] = MLPDecoder.Config()
- output_layer: LMOutputLayer.Config = LMOutputLayer.Config()
- tied_weights: bool = False
- stateful: bool = False
- caffe2_format: ExporterType = <ExporterType.PREDICTOR: 'predictor'>
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": true,
"add_eos_token": true,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": false,
"pack_sequence": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {}
},
"tied_weights": false,
"stateful": false,
"caffe2_format": "predictor"
}
ModelInput¶
class pytext.models.language_models.lmlstm.ModelInput
Bases: ModelInput
All Attributes (including base classes)
- tokens: Optional[TokenTensorizer.Config] = TokenTensorizer.Config(add_bos_token=True, add_eos_token=True)
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": true,
"add_eos_token": true,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
}
masked_lm¶
InputConfig¶
class pytext.models.masked_lm.InputConfig
Bases: ConfigBase
All Attributes (including base classes)
- tokens: BERTTensorizerBase.Config = BERTTensorizerBase.Config(max_seq_len=128)
Default JSON
{
"tokens": {
"BERTTensorizerBase": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"base_tokenizer": null,
"vocab_file": "",
"max_seq_len": 128
}
}
}
MaskedLanguageModel.Config¶
Component: MaskedLanguageModel
class MaskedLanguageModel.Config
Bases: BaseModel.Config
All Attributes (including base classes)
- inputs: InputConfig = InputConfig()
- encoder: TransformerSentenceEncoderBase.Config = TransformerSentenceEncoder.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: LMOutputLayer.Config = LMOutputLayer.Config()
- mask_prob: float = 0.15
- mask_bos: bool = False
- masking_strategy: MaskingStrategy = <MaskingStrategy.RANDOM: 'random'>
- tie_weights: bool = True
Default JSON
{
"inputs": {
"tokens": {
"BERTTensorizerBase": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"base_tokenizer": null,
"vocab_file": "",
"max_seq_len": 128
}
}
},
"encoder": {
"TransformerSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"dropout": 0.1,
"attention_dropout": 0.1,
"activation_dropout": 0.1,
"ffn_embedding_dim": 3072,
"num_encoder_layers": 6,
"num_attention_heads": 8,
"num_segments": 2,
"use_position_embeddings": true,
"offset_positions_by_padding": true,
"apply_bert_init": true,
"encoder_normalize_before": true,
"activation_fn": "relu",
"projection_dim": 0,
"max_seq_len": 128,
"multilingual": false,
"freeze_embeddings": false,
"n_trans_layers_to_freeze": 0,
"use_torchscript": false
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {}
},
"mask_prob": 0.15,
"mask_bos": false,
"masking_strategy": "random",
"tie_weights": true
}
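For illustration (these values are not defaults), the masking behaviour alone can be adjusted while the encoder, decoder and output layer sections keep the defaults printed above:
{
  "mask_prob": 0.2,
  "mask_bos": true,
  "masking_strategy": "random",
  "tie_weights": true
}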
model¶
BaseModel.Config¶
Component: BaseModel
class BaseModel.Config
Bases: Component.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- Subclasses
BertPairwiseModel.Config
NewBertModel.Config
NewBertRegressionModel.Config
DisjointMultitaskModel.Config
NewDisjointMultitaskModel.Config
ByteTokensDocumentModel.Config
DocModel.Config
DocRegressionModel.Config
PersonalizedDocModel.Config
IntentSlotModel.Config
LMLSTM.Config
MaskedLanguageModel.Config
Model.Config
BasePairwiseModel.Config
PairwiseModel.Config
BertSquadQAModel.Config
DrQAModel.Config
QueryDocPairwiseRankingModel.Config
RoBERTa.Config
RoBERTaWordTaggingModel.Config
ContextualIntentSlotModel.Config
SeqNNModel.Config
WordTaggingLiteModel.Config
WordTaggingModel.Config
Default JSON
{
"inputs": {}
}
Model.Config¶
Component: Model
class Model.Config
Bases: BaseModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- Subclasses
DisjointMultitaskModel.Config
NewDisjointMultitaskModel.Config
ByteTokensDocumentModel.Config
DocModel.Config
DocRegressionModel.Config
PersonalizedDocModel.Config
IntentSlotModel.Config
ContextualIntentSlotModel.Config
SeqNNModel.Config
WordTaggingLiteModel.Config
WordTaggingModel.Config
Default JSON
{
"inputs": {}
}
ModelInput¶
class pytext.models.model.ModelInput
Bases: ModelInputBase
All Attributes (including base classes)
Default JSON
{}
module¶
Module.Config¶
Component: Module
class Module.Config
Bases: ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- Subclasses
FeatureConfig
BatcherSchedulerConfig
ExponentialBatcherSchedulerConfig
DecoderBase.Config
IntentSlotModelDecoder.Config
MLPDecoder.Config
MLPDecoderQueryResponse.Config
CharacterEmbedding.Config
DictEmbedding.Config
EmbeddingBase.Config
EmbeddingList.Config
WordEmbedding.Config
PairwiseCosineDistanceOutputLayer.Config
BinaryClassificationOutputLayer.Config
ClassificationOutputLayer.Config
MultiLabelOutputLayer.Config
MulticlassOutputLayer.Config
RegressionOutputLayer.Config
IntentSlotOutputLayer.Config
LMOutputLayer.Config
OutputLayerBase.Config
PairwiseRankingOutputLayer.Config
SquadOutputLayer.Config
CRFOutputLayer.Config
WordTaggingOutputLayer.Config
DotProductSelfAttention.Config
MultiplicativeAttention.Config
SequenceAlignedAttention.Config
AugmentedLSTM.Config
BiLSTM.Config
BiLSTMDocAttention.Config
BiLSTMDocSlotAttention.Config
BiLSTMSlotAttention.Config
BSeqCNNRepresentation.Config
ContextualIntentSlotRepresentation.Config
DeepCNNRepresentation.Config
DocNNRepresentation.Config
HuggingFaceBertSentenceEncoder.Config
JointCNNRepresentation.Config
SharedCNNRepresentation.Config
OrderedNeuronLSTM.Config
OrderedNeuronLSTMLayer.Config
PassThroughRepresentation.Config
LastTimestepPool.Config
MaxPool.Config
MeanPool.Config
NoPool.Config
PureDocAttention.Config
RepresentationBase.Config
SeqRepresentation.Config
SparseTransformerSentenceEncoder.Config
StackedBidirectionalRNN.Config
TransformerSentenceEncoder.Config
TransformerSentenceEncoderBase.Config
RoBERTaEncoder.Config
RoBERTaEncoderBase.Config
RoBERTaEncoderJit.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
output_layers¶
distance_output_layer¶
PairwiseCosineDistanceOutputLayer.Config¶
Component: PairwiseCosineDistanceOutputLayer
class PairwiseCosineDistanceOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[BinaryCrossEntropyLoss.Config, CosineEmbeddingLoss.Config, MAELoss.Config, MSELoss.Config, NLLLoss.Config] = CosineEmbeddingLoss.Config()
- score_threshold: float = 0.9
- score_type: OutputScore = <OutputScore.norm_cosine: 2>
- label_weights: Optional[dict[str, float]] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CosineEmbeddingLoss": {
"margin": 0.0
}
},
"score_threshold": 0.9,
"score_type": 2,
"label_weights": null
}
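A hypothetical override (the threshold is chosen only for illustration) that lowers the decision threshold and selects MAELoss, one of the union options listed above, with its defaults:
{
  "loss": {
    "MAELoss": {}
  },
  "score_threshold": 0.7
}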
doc_classification_output_layer¶
BinaryClassificationOutputLayer.Config¶
Component: BinaryClassificationOutputLayer
class BinaryClassificationOutputLayer.Config
Bases: ClassificationOutputLayer.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
- label_weights: Optional[dict[str, float]] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
ClassificationOutputLayer.Config¶
Component: ClassificationOutputLayer
class ClassificationOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
- label_weights: Optional[dict[str, float]] = None
- Subclasses
BinaryClassificationOutputLayer.Config
MultiLabelOutputLayer.Config
MulticlassOutputLayer.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
MultiLabelOutputLayer.Config¶
Component: MultiLabelOutputLayer
class MultiLabelOutputLayer.Config
Bases: ClassificationOutputLayer.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
- label_weights: Optional[dict[str, float]] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
MulticlassOutputLayer.Config¶
Component: MulticlassOutputLayer
class MulticlassOutputLayer.Config
Bases: ClassificationOutputLayer.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, MultiLabelSoftMarginLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
- label_weights: Optional[dict[str, float]] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
doc_regression_output_layer¶
RegressionOutputLayer.Config¶
Component: RegressionOutputLayer
class RegressionOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: MSELoss.Config = MSELoss.Config()
- squash_to_unit_range: bool = False
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {},
"squash_to_unit_range": false
}
intent_slot_output_layer¶
IntentSlotOutputLayer.Config¶
Component: IntentSlotOutputLayer
class IntentSlotOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- doc_output: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
- word_output: Union[WordTaggingOutputLayer.Config, CRFOutputLayer.Config] = WordTaggingOutputLayer.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_output": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"word_output": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
}
}
lm_output_layer¶
LMOutputLayer.Config¶
Component: LMOutputLayer
class LMOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: CrossEntropyLoss.Config = CrossEntropyLoss.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {}
}
output_layer_base¶
OutputLayerBase.Config¶
Component: OutputLayerBase
class OutputLayerBase.Config
Bases: Module.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- Subclasses
PairwiseCosineDistanceOutputLayer.Config
BinaryClassificationOutputLayer.Config
ClassificationOutputLayer.Config
MultiLabelOutputLayer.Config
MulticlassOutputLayer.Config
RegressionOutputLayer.Config
IntentSlotOutputLayer.Config
LMOutputLayer.Config
PairwiseRankingOutputLayer.Config
SquadOutputLayer.Config
CRFOutputLayer.Config
WordTaggingOutputLayer.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
pairwise_ranking_output_layer¶
PairwiseRankingOutputLayer.Config¶
Component: PairwiseRankingOutputLayer
class PairwiseRankingOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: PairwiseRankingLoss.Config = PairwiseRankingLoss.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"margin": 1.0
}
}
squad_output_layer¶
SquadOutputLayer.Config¶
Component: SquadOutputLayer
class SquadOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[CrossEntropyLoss.Config, KLDivergenceCELoss.Config] = CrossEntropyLoss.Config()
- ignore_impossible: bool = True
- pos_loss_weight: float = 0.5
- has_answer_loss_weight: float = 0.5
- false_label: str = 'False'
- max_answer_len: int = 30
- hard_weight: float = 0.0
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"ignore_impossible": true,
"pos_loss_weight": 0.5,
"has_answer_loss_weight": 0.5,
"false_label": "False",
"max_answer_len": 30,
"hard_weight": 0.0
}
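As a sketch (the weights are illustrative), one might let unanswerable questions contribute to the loss and lengthen the permitted answer span:
{
  "ignore_impossible": false,
  "pos_loss_weight": 0.7,
  "has_answer_loss_weight": 0.3,
  "max_answer_len": 50
}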
word_tagging_output_layer¶
CRFOutputLayer.Config¶
Component: CRFOutputLayer
class CRFOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
WordTaggingOutputLayer.Config¶
Component: WordTaggingOutputLayer
class WordTaggingOutputLayer.Config
Bases: OutputLayerBase.Config
All Attributes (including base classes)
- load_path: Optional[str] = None
- save_path: Optional[str] = None
- freeze: bool = False
- shared_module_key: Optional[str] = None
- loss: Union[CrossEntropyLoss.Config, BinaryCrossEntropyLoss.Config, AUCPRHingeLoss.Config, KLDivergenceBCELoss.Config, KLDivergenceCELoss.Config, LabelSmoothedCrossEntropyLoss.Config] = CrossEntropyLoss.Config()
- label_weights: dict[str, float] = {}
- ignore_pad_in_loss: Optional[bool] = True
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
pair_classification_model¶
BasePairwiseModel.Config¶
Component: BasePairwiseModel
class BasePairwiseModel.Config
Bases: BaseModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: Union[ClassificationOutputLayer.Config, PairwiseCosineDistanceOutputLayer.Config] = ClassificationOutputLayer.Config()
- encode_relations: bool = True
- Subclasses
BertPairwiseModel.Config
PairwiseModel.Config
QueryDocPairwiseRankingModel.Config
Default JSON
{
"inputs": {},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"ClassificationOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"encode_relations": true
}
ModelInput¶
-
class
pytext.models.pair_classification_model.
ModelInput
Bases:
ModelInput
All Attributes (including base classes)
- tokens1: TokenTensorizer.Config = TokenTensorizer.Config(column='text1')
- tokens2: TokenTensorizer.Config = TokenTensorizer.Config(column='text2')
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
Default JSON
{
"tokens1": {
"is_input": true,
"column": "text1",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"tokens2": {
"is_input": true,
"column": "text2",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
}
PairwiseModel.Config¶
Component: PairwiseModel
-
class
PairwiseModel.
Config
[source] Bases:
BasePairwiseModel.Config
-
encode_relations
¶ if false, return the concatenation of the two representations; if true, also concatenate their pairwise absolute difference and pairwise elementwise product (à la arXiv:1705.02364). Default: true.
Type: bool
-
tied_representation
¶ whether to use the same representation, with tied weights, for all the input subrepresentations. Default: true.
-
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: Union[ClassificationOutputLayer.Config, PairwiseCosineDistanceOutputLayer.Config] = ClassificationOutputLayer.Config()
- encode_relations: bool =
True
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[BiLSTMDocAttention.Config, DocNNRepresentation.Config] = BiLSTMDocAttention.Config()
- shared_representations: bool =
True
- Subclasses
QueryDocPairwiseRankingModel.Config
Default JSON
{
"inputs": {
"tokens1": {
"is_input": true,
"column": "text1",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"tokens2": {
"is_input": true,
"column": "text2",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"ClassificationOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"encode_relations": true,
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"shared_representations": true
}
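As a quick illustration of how these fields compose, the sketch below builds a PairwiseModel.Config with relation encoding disabled and a larger word embedding. PairwiseModel is assumed to live in pytext.models.pair_classification_model (the module documented for ModelInput above), and the WordEmbedding import path is likewise an assumption.
# Sketch only: import paths are assumptions based on the module names above.
from pytext.models.embeddings.word_embedding import WordEmbedding
from pytext.models.pair_classification_model import PairwiseModel

pairwise_config = PairwiseModel.Config(
    encode_relations=False,                         # documented default: True
    embedding=WordEmbedding.Config(embed_dim=300),  # documented default embed_dim: 100
    shared_representations=True,                    # documented default
)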
qna¶
bert_squad_qa¶
BertSquadQAModel.Config¶
Component: BertSquadQAModel
-
class
BertSquadQAModel.
Config
[source] Bases:
NewBertModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- encoder: TransformerSentenceEncoderBase.Config = HuggingFaceBertSentenceEncoder.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: SquadOutputLayer.Config = SquadOutputLayer.Config()
- pos_decoder: MLPDecoder.Config = MLPDecoder.Config(out_dim=2)
- has_ans_decoder: MLPDecoder.Config = MLPDecoder.Config(out_dim=2)
- is_kd: bool = False
Default JSON
{
"inputs": {
"squad_input": {
"SquadForBERTTensorizer": {
"is_input": true,
"columns": [
"question",
"doc"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 256,
"answers_column": "answers",
"answer_starts_column": "answer_starts"
}
},
"has_answer": {
"LabelTensorizer": {
"is_input": false,
"column": "has_answer",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"ignore_impossible": true,
"pos_loss_weight": 0.5,
"has_answer_loss_weight": 0.5,
"false_label": "False",
"max_answer_len": 30,
"hard_weight": 0.0
},
"pos_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": 2,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"has_ans_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": 2,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"is_kd": false
}
ModelInput¶
-
class
pytext.models.qna.bert_squad_qa.
ModelInput
Bases:
ModelInput
All Attributes (including base classes)
- squad_input: Union[SquadForBERTTensorizer.Config, SquadForRoBERTaTensorizer.Config] = SquadForBERTTensorizer.Config()
- has_answer: LabelTensorizer.Config = LabelTensorizer.Config(column='has_answer')
Default JSON
{
"squad_input": {
"SquadForBERTTensorizer": {
"is_input": true,
"columns": [
"question",
"doc"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 256,
"answers_column": "answers",
"answer_starts_column": "answer_starts"
}
},
"has_answer": {
"LabelTensorizer": {
"is_input": false,
"column": "has_answer",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
}
dr_qa¶
DrQAModel.Config¶
Component: DrQAModel
-
class
DrQAModel.
Config
[source] Bases:
BaseModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- dropout: float =
0.4
- embedding: WordEmbedding.Config = WordEmbedding.Config(embed_dim=300, pretrained_embeddings_path='/mnt/vol/pytext/users/kushall/pretrained/glove.840B.300d.txt', vocab_from_pretrained_embeddings=True)
- ques_rnn: StackedBidirectionalRNN.Config = StackedBidirectionalRNN.Config(dropout=0.4)
- doc_rnn: StackedBidirectionalRNN.Config = StackedBidirectionalRNN.Config(dropout=0.4)
- output_layer: SquadOutputLayer.Config = SquadOutputLayer.Config()
- is_kd: bool = False
Default JSON
{
"inputs": {
"squad_input": {
"SquadTensorizer": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\W+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " ",
"doc_column": "doc",
"ques_column": "question",
"answers_column": "answers",
"answer_starts_column": "answer_starts",
"max_ques_seq_len": 64,
"max_doc_seq_len": 256
}
},
"has_answer": {
"LabelTensorizer": {
"is_input": false,
"column": "has_answer",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"dropout": 0.4,
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 300,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "/mnt/vol/pytext/users/kushall/pretrained/glove.840B.300d.txt",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": true,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"ques_rnn": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_size": 32,
"num_layers": 1,
"dropout": 0.4,
"bidirectional": true,
"rnn_type": "lstm",
"concat_layers": true
},
"doc_rnn": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_size": 32,
"num_layers": 1,
"dropout": 0.4,
"bidirectional": true,
"rnn_type": "lstm",
"concat_layers": true
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"ignore_impossible": true,
"pos_loss_weight": 0.5,
"has_answer_loss_weight": 0.5,
"false_label": "False",
"max_answer_len": 30,
"hard_weight": 0.0
},
"is_kd": false
}
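The sketch below shows how the nested RNN configs documented above can be overridden when building a DrQAModel.Config; the import paths are assumed from the section layout (pytext.models.qna.dr_qa and pytext.models.representations.stacked_bidirectional_rnn).
# Sketch only: import paths are assumptions based on the section layout.
from pytext.models.qna.dr_qa import DrQAModel
from pytext.models.representations.stacked_bidirectional_rnn import (
    StackedBidirectionalRNN,
)

drqa_config = DrQAModel.Config(
    dropout=0.3,  # documented default: 0.4
    # Widen both RNNs; hidden_size defaults to 32 (see stacked_bidirectional_rnn below).
    ques_rnn=StackedBidirectionalRNN.Config(hidden_size=128, dropout=0.3),
    doc_rnn=StackedBidirectionalRNN.Config(hidden_size=128, dropout=0.3),
)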
ModelInput¶
-
class
pytext.models.qna.dr_qa.
ModelInput
Bases:
ModelInput
All Attributes (including base classes)
- squad_input: SquadTensorizer.Config = SquadTensorizer.Config()
- has_answer: LabelTensorizer.Config = LabelTensorizer.Config(column='has_answer')
Default JSON
{
"squad_input": {
"SquadTensorizer": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\W+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " ",
"doc_column": "doc",
"ques_column": "question",
"answers_column": "answers",
"answer_starts_column": "answer_starts",
"max_ques_seq_len": 64,
"max_doc_seq_len": 256
}
},
"has_answer": {
"LabelTensorizer": {
"is_input": false,
"column": "has_answer",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
}
query_document_pairwise_ranking_model¶
ModelInput¶
-
class
pytext.models.query_document_pairwise_ranking_model.
ModelInput
Bases:
ModelInput
All Attributes (including base classes)
- pos_response: TokenTensorizer.Config = TokenTensorizer.Config(column='pos_response')
- neg_response: TokenTensorizer.Config = TokenTensorizer.Config(column='neg_response')
- query: TokenTensorizer.Config = TokenTensorizer.Config(column='query')
Default JSON
{
"pos_response": {
"is_input": true,
"column": "pos_response",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"neg_response": {
"is_input": true,
"column": "neg_response",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"query": {
"is_input": true,
"column": "query",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
}
QueryDocPairwiseRankingModel.Config¶
Component: QueryDocPairwiseRankingModel
-
class
QueryDocPairwiseRankingModel.
Config
[source] Bases:
PairwiseModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- decoder: MLPDecoderQueryResponse.Config = MLPDecoderQueryResponse.Config()
- output_layer: PairwiseRankingOutputLayer.Config = PairwiseRankingOutputLayer.Config()
- encode_relations: bool =
True
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[BiLSTMDocAttention.Config, DocNNRepresentation.Config] = BiLSTMDocAttention.Config()
- shared_representations: bool =
True
- decoder_output_dim: int =
64
Default JSON
{
"inputs": {
"pos_response": {
"is_input": true,
"column": "pos_response",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"neg_response": {
"is_input": true,
"column": "neg_response",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"query": {
"is_input": true,
"column": "query",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": []
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"margin": 1.0
}
},
"encode_relations": true,
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"shared_representations": true,
"decoder_output_dim": 64
}
representations¶
attention¶
DotProductSelfAttention.Config¶
Component: DotProductSelfAttention
-
class
DotProductSelfAttention.
Config
[source] Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- input_dim: int =
32
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"input_dim": 32
}
MultiplicativeAttention.Config¶
Component: MultiplicativeAttention
-
class
MultiplicativeAttention.
Config
[source] Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- p_hidden_dim: int =
32
- q_hidden_dim: int =
32
- normalize: bool =
False
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"p_hidden_dim": 32,
"q_hidden_dim": 32,
"normalize": false
}
SequenceAlignedAttention.Config¶
Component: SequenceAlignedAttention
-
class
SequenceAlignedAttention.
Config
[source] Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- proj_dim: int =
32
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"proj_dim": 32
}
augmented_lstm¶
AugmentedLSTM.Config¶
Component: AugmentedLSTM
-
class
AugmentedLSTM.
Config
[source] Bases:
RepresentationBase.Config
,ConfigBase
Configuration class for AugmentedLSTM.
-
dropout
¶ Variational dropout probability to use. Defaults to 0.0.
Type: float
-
lstm_dim
¶ Number of features in the hidden state of the LSTM. Defaults to 32.
Type: int
-
num_layers
¶ Number of recurrent layers. E.g., setting num_layers=2 stacks two LSTMs to form a stacked LSTM, with the second LSTM taking in the outputs of the first LSTM and computing the final result. Defaults to 1.
Type: int
-
bidirectional
¶ If True, becomes a bidirectional LSTM. Defaults to True.
Type: bool
-
use_highway
¶ If True, a highway network is appended to the outputs of the LSTM.
Type: bool
-
use_bias
¶ If True, a bias term is used in the LSTM computations; otherwise it is omitted.
Type: bool
-
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.0
- lstm_dim: int =
32
- use_highway: bool =
True
- bidirectional: bool =
False
- num_layers: int =
1
- use_bias: bool =
False
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.0,
"lstm_dim": 32,
"use_highway": true,
"bidirectional": false,
"num_layers": 1,
"use_bias": false
}
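Note that, unlike BiLSTM.Config below, AugmentedLSTM.Config defaults to bidirectional=False and use_bias=False. A minimal sketch of turning those on (import path assumed from the section layout):
# Sketch only: import path assumed from the section layout.
from pytext.models.representations.augmented_lstm import AugmentedLSTM

augmented_lstm_config = AugmentedLSTM.Config(
    lstm_dim=64,         # documented default: 32
    bidirectional=True,  # documented default: False
    use_bias=True,       # documented default: False
    dropout=0.1,         # variational dropout; documented default: 0.0
)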
bilstm¶
BiLSTM.Config¶
Component: BiLSTM
-
class
BiLSTM.
Config
[source] Bases:
RepresentationBase.Config
,ConfigBase
Configuration class for BiLSTM.
-
dropout
¶ Dropout probability to use. Defaults to 0.4.
Type: float
-
lstm_dim
¶ Number of features in the hidden state of the LSTM. Defaults to 32.
Type: int
-
num_layers
¶ Number of recurrent layers. E.g., setting num_layers=2 stacks two LSTMs to form a stacked LSTM, with the second LSTM taking in the outputs of the first LSTM and computing the final result. Defaults to 1.
Type: int
-
bidirectional
¶ If True, becomes a bidirectional LSTM. Defaults to True.
Type: bool
-
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.4
- lstm_dim: int =
32
- num_layers: int =
1
- bidirectional: bool =
True
- pack_sequence: bool =
True
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
}
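A minimal sketch of overriding the BiLSTM defaults in Python, assuming the import path implied by the representations/bilstm section layout:
# Sketch only: import path assumed from the section layout.
from pytext.models.representations.bilstm import BiLSTM

bilstm_config = BiLSTM.Config(
    lstm_dim=64,   # documented default: 32
    num_layers=2,  # documented default: 1
    dropout=0.2,   # documented default: 0.4
)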
bilstm_doc_attention¶
BiLSTMDocAttention.Config¶
Component: BiLSTMDocAttention
-
class
BiLSTMDocAttention.
Config
[source] Bases:
RepresentationBase.Config
Configuration class for BiLSTMDocAttention.
-
dropout
¶ Dropout probability to use. Defaults to 0.4.
Type: float
-
lstm
¶ Config for the BiLSTM.
Type: BiLSTM.Config
-
pooling
¶ Config for the underlying pooling module.
Type: ConfigBase
-
mlp_decoder
¶ Config for the non-linear projection module.
Type: MLPDecoder.Config
-
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.4
- lstm: BiLSTM.Config = BiLSTM.Config()
- pooling: Union[SelfAttention.Config, MaxPool.Config, MeanPool.Config, NoPool.Config, LastTimestepPool.Config] = SelfAttention.Config()
- mlp_decoder: Optional[MLPDecoder.Config] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
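Since pooling is a Union field, any of the documented pooling configs can be substituted for the SelfAttention default. A minimal sketch, with import paths assumed from the section layout:
# Sketch only: import paths assumed from the section layout.
from pytext.models.representations.bilstm import BiLSTM
from pytext.models.representations.bilstm_doc_attention import BiLSTMDocAttention
from pytext.models.representations.pooling import MaxPool

doc_attention_config = BiLSTMDocAttention.Config(
    lstm=BiLSTM.Config(lstm_dim=64),  # documented default lstm_dim: 32
    pooling=MaxPool.Config(),         # replaces the SelfAttention.Config() default
)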
bilstm_doc_slot_attention¶
BiLSTMDocSlotAttention.Config¶
Component: BiLSTMDocSlotAttention
-
class
BiLSTMDocSlotAttention.
Config
[source] Bases:
RepresentationBase.Config
,ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.4
- lstm: Union[BiLSTM.Config, OrderedNeuronLSTM.Config, AugmentedLSTM.Config] = BiLSTM.Config()
- pooling: Union[SelfAttention.Config, MaxPool.Config, MeanPool.Config, NoneType] =
None
- slot_attention: Optional[SlotAttention.Config] =
None
- doc_mlp_layers: int =
0
- word_mlp_layers: int =
0
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
}
},
"pooling": null,
"slot_attention": null,
"doc_mlp_layers": 0,
"word_mlp_layers": 0
}
bilstm_slot_attn¶
BiLSTMSlotAttention.Config¶
Component: BiLSTMSlotAttention
-
class
BiLSTMSlotAttention.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.4
- lstm: BiLSTM.Config = BiLSTM.Config()
- slot_attention: SlotAttention.Config = SlotAttention.Config()
- mlp_decoder: Optional[MLPDecoder.Config] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"slot_attention": {
"attn_dimension": 64,
"attention_type": "no_attention"
},
"mlp_decoder": null
}
biseqcnn¶
BSeqCNNRepresentation.Config¶
Component: BSeqCNNRepresentation
-
class
BSeqCNNRepresentation.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
},
"fwd_bwd_context_len": 5,
"surrounding_context_len": 2
}
contextual_intent_slot_rep¶
ContextualIntentSlotRepresentation.Config¶
Component: ContextualIntentSlotRepresentation
-
class
ContextualIntentSlotRepresentation.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- sen_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
- seq_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
- joint_representation: Union[BiLSTMDocSlotAttention.Config, JointCNNRepresentation.Config] = BiLSTMDocSlotAttention.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"sen_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"seq_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"joint_representation": {
"BiLSTMDocSlotAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
}
},
"pooling": null,
"slot_attention": null,
"doc_mlp_layers": 0,
"word_mlp_layers": 0
}
}
}
deepcnn¶
DeepCNNRepresentation.Config¶
Component: DeepCNNRepresentation
-
class
DeepCNNRepresentation.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- cnn: CNNParams = CNNParams()
- dropout: float =
0.3
- activation: Activation =
<Activation.GLU: 'glu'>
- separable: bool =
False
- bottleneck: int =
0
- pooling_type: PoolingType =
<PoolingType.NONE: 'none'>
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
},
"dropout": 0.3,
"activation": "glu",
"separable": false,
"bottleneck": 0,
"pooling_type": "none"
}
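A minimal sketch of adjusting the scalar fields above (the cnn: CNNParams block can be overridden in the same way); the import path is assumed from the section layout:
# Sketch only: import path assumed from the section layout.
from pytext.models.representations.deepcnn import DeepCNNRepresentation

deep_cnn_config = DeepCNNRepresentation.Config(
    dropout=0.2,     # documented default: 0.3
    separable=True,  # documented default: False
    bottleneck=64,   # documented default: 0
)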
docnn¶
DocNNRepresentation.Config¶
Component: DocNNRepresentation
-
class
DocNNRepresentation.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
}
huggingface_bert_sentence_encoder¶
HuggingFaceBertSentenceEncoder.Config¶
Component: HuggingFaceBertSentenceEncoder
-
class
HuggingFaceBertSentenceEncoder.
Config
[source] Bases:
TransformerSentenceEncoderBase.Config
,ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- bert_cpt_dir: str =
'/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/'
- load_weights: bool =
True
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
jointcnn_rep¶
JointCNNRepresentation.Config¶
Component: JointCNNRepresentation
-
class
JointCNNRepresentation.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- doc_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
- word_representation: Union[BSeqCNNRepresentation.Config, DeepCNNRepresentation.Config] = BSeqCNNRepresentation.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"word_representation": {
"BSeqCNNRepresentation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
},
"fwd_bwd_context_len": 5,
"surrounding_context_len": 2
}
}
}
ordered_neuron_lstm¶
OrderedNeuronLSTM.Config¶
Component: OrderedNeuronLSTM
-
class
OrderedNeuronLSTM.
Config
[source] Bases:
RepresentationBase.Config
,ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.4
- lstm_dim: int =
32
- num_layers: int =
1
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1
}
OrderedNeuronLSTMLayer.Config¶
Component: OrderedNeuronLSTMLayer
-
class
OrderedNeuronLSTMLayer.
Config
Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
pass_through¶
PassThroughRepresentation.Config¶
Component: PassThroughRepresentation
-
class
PassThroughRepresentation.
Config
Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
pooling¶
BoundaryPool.Config¶
Component: BoundaryPool
-
class
BoundaryPool.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- boundary_type: str =
'first'
Default JSON
{
"boundary_type": "first"
}
LastTimestepPool.Config¶
Component: LastTimestepPool
-
class
LastTimestepPool.
Config
Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
MaxPool.Config¶
Component: MaxPool
-
class
MaxPool.
Config
Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
MeanPool.Config¶
Component: MeanPool
-
class
MeanPool.
Config
Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
NoPool.Config¶
Component: NoPool
-
class
NoPool.
Config
Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
SelfAttention.Config¶
Component: SelfAttention
-
class
SelfAttention.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- attn_dimension: int =
64
- dropout: float =
0.4
Default JSON
{
"attn_dimension": 64,
"dropout": 0.4
}
pure_doc_attention¶
PureDocAttention.Config¶
Component: PureDocAttention
-
class
PureDocAttention.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- dropout: float =
0.4
- pooling: Union[SelfAttention.Config, MaxPool.Config, MeanPool.Config, NoPool.Config, BoundaryPool.Config] = SelfAttention.Config()
- mlp_decoder: Optional[MLPDecoder.Config] =
None
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
representation_base¶
RepresentationBase.Config¶
Component: RepresentationBase
-
class
RepresentationBase.
Config
Bases:
Module.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- Subclasses
AugmentedLSTM.Config
BiLSTM.Config
BiLSTMDocAttention.Config
BiLSTMDocSlotAttention.Config
BiLSTMSlotAttention.Config
BSeqCNNRepresentation.Config
ContextualIntentSlotRepresentation.Config
DeepCNNRepresentation.Config
DocNNRepresentation.Config
HuggingFaceBertSentenceEncoder.Config
JointCNNRepresentation.Config
SharedCNNRepresentation.Config
OrderedNeuronLSTM.Config
PassThroughRepresentation.Config
PureDocAttention.Config
SeqRepresentation.Config
SparseTransformerSentenceEncoder.Config
TransformerSentenceEncoder.Config
TransformerSentenceEncoderBase.Config
RoBERTaEncoder.Config
RoBERTaEncoderBase.Config
RoBERTaEncoderJit.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
seq_rep¶
SeqRepresentation.Config¶
Component: SeqRepresentation
-
class
SeqRepresentation.
Config
[source] Bases:
RepresentationBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- doc_representation: DocNNRepresentation.Config = DocNNRepresentation.Config()
- seq_representation: Union[BiLSTMDocAttention.Config, DocNNRepresentation.Config] = BiLSTMDocAttention.Config()
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"seq_representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
}
}
slot_attention¶
SlotAttention.Config¶
Component: SlotAttention
-
class
SlotAttention.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- attn_dimension: int =
64
- attention_type: SlotAttentionType =
<SlotAttentionType.NO_ATTENTION: 'no_attention'>
Default JSON
{
"attn_dimension": 64,
"attention_type": "no_attention"
}
sparse_transformer_sentence_encoder¶
SparseTransformerSentenceEncoder.Config¶
Component: SparseTransformerSentenceEncoder
-
class
SparseTransformerSentenceEncoder.
Config
[source] Bases:
TransformerSentenceEncoder.Config
,ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- dropout: float =
0.1
- attention_dropout: float =
0.1
- activation_dropout: float =
0.1
- ffn_embedding_dim: int =
3072
- num_encoder_layers: int =
6
- num_attention_heads: int =
8
- num_segments: int =
2
- use_position_embeddings: bool =
True
- offset_positions_by_padding: bool =
True
- apply_bert_init: bool =
True
- encoder_normalize_before: bool =
True
- activation_fn: str =
'relu'
- projection_dim: int =
0
- max_seq_len: int =
128
- multilingual: bool =
False
- freeze_embeddings: bool =
False
- n_trans_layers_to_freeze: int =
0
- use_torchscript: bool =
False
- project_representation: bool =
False
- is_bidirectional: bool =
True
- stride: int =
32
- expressivity: int =
8
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"dropout": 0.1,
"attention_dropout": 0.1,
"activation_dropout": 0.1,
"ffn_embedding_dim": 3072,
"num_encoder_layers": 6,
"num_attention_heads": 8,
"num_segments": 2,
"use_position_embeddings": true,
"offset_positions_by_padding": true,
"apply_bert_init": true,
"encoder_normalize_before": true,
"activation_fn": "relu",
"projection_dim": 0,
"max_seq_len": 128,
"multilingual": false,
"freeze_embeddings": false,
"n_trans_layers_to_freeze": 0,
"use_torchscript": false,
"project_representation": false,
"is_bidirectional": true,
"stride": 32,
"expressivity": 8
}
stacked_bidirectional_rnn¶
StackedBidirectionalRNN.Config¶
Component: StackedBidirectionalRNN
-
class
StackedBidirectionalRNN.
Config
[source] Bases:
Module.Config
Configuration class for StackedBidirectionalRNN.
-
hidden_size
¶ Number of features in the hidden state of the RNN. Defaults to 32.
Type: int
-
num_layers
¶ Number of recurrent layers. E.g., setting num_layers=2 stacks two RNNs to form a stacked RNN, with the second RNN taking in the outputs of the first RNN and computing the final result. Defaults to 1.
Type: int
-
dropout
¶ Dropout probability to use. Defaults to 0.4.
Type: float
-
bidirectional
¶ If True, becomes a bidirectional RNN. Defaults to True.
Type: bool
-
rnn_type
¶ Which RNN type to use. Options: “rnn”, “lstm”, “gru”.
Type: str
-
concat_layers
¶ Whether to concatenate the outputs of each layer of the stacked RNN.
Type: bool
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- hidden_size: int =
32
- num_layers: int =
1
- dropout: float =
0.0
- bidirectional: bool =
True
- rnn_type: RnnType =
<RnnType.LSTM: 'lstm'>
- concat_layers: bool =
True
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_size": 32,
"num_layers": 1,
"dropout": 0.0,
"bidirectional": true,
"rnn_type": "lstm",
"concat_layers": true
}
transformer_sentence_encoder¶
TransformerSentenceEncoder.Config¶
Component: TransformerSentenceEncoder
-
class
TransformerSentenceEncoder.
Config
[source] Bases:
TransformerSentenceEncoderBase.Config
,ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- dropout: float =
0.1
- attention_dropout: float =
0.1
- activation_dropout: float =
0.1
- ffn_embedding_dim: int =
3072
- num_encoder_layers: int =
6
- num_attention_heads: int =
8
- num_segments: int =
2
- use_position_embeddings: bool =
True
- offset_positions_by_padding: bool =
True
- apply_bert_init: bool =
True
- encoder_normalize_before: bool =
True
- activation_fn: str =
'relu'
- projection_dim: int =
0
- max_seq_len: int =
128
- multilingual: bool =
False
- freeze_embeddings: bool =
False
- n_trans_layers_to_freeze: int =
0
- use_torchscript: bool =
False
- Subclasses
SparseTransformerSentenceEncoder.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"dropout": 0.1,
"attention_dropout": 0.1,
"activation_dropout": 0.1,
"ffn_embedding_dim": 3072,
"num_encoder_layers": 6,
"num_attention_heads": 8,
"num_segments": 2,
"use_position_embeddings": true,
"offset_positions_by_padding": true,
"apply_bert_init": true,
"encoder_normalize_before": true,
"activation_fn": "relu",
"projection_dim": 0,
"max_seq_len": 128,
"multilingual": false,
"freeze_embeddings": false,
"n_trans_layers_to_freeze": 0,
"use_torchscript": false
}
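For quick experiments it is common to shrink the transformer; the sketch below overrides the capacity-related fields documented above (import path assumed from the section layout).
# Sketch only: import path assumed from the section layout.
from pytext.models.representations.transformer_sentence_encoder import (
    TransformerSentenceEncoder,
)

small_encoder_config = TransformerSentenceEncoder.Config(
    num_encoder_layers=3,    # documented default: 6
    num_attention_heads=4,   # documented default: 8
    embedding_dim=256,       # documented default: 768
    ffn_embedding_dim=1024,  # documented default: 3072
)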
transformer_sentence_encoder_base¶
TransformerSentenceEncoderBase.Config¶
Component: TransformerSentenceEncoderBase
-
class
TransformerSentenceEncoderBase.
Config
[source] Bases:
RepresentationBase.Config
,ConfigBase
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- Subclasses
HuggingFaceBertSentenceEncoder.Config
SparseTransformerSentenceEncoder.Config
TransformerSentenceEncoder.Config
RoBERTaEncoder.Config
RoBERTaEncoderBase.Config
RoBERTaEncoderJit.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false
}
roberta¶
InputConfig¶
-
class
pytext.models.roberta.
InputConfig
Bases:
ConfigBase
All Attributes (including base classes)
- tokens: RoBERTaTensorizer.Config = RoBERTaTensorizer.Config()
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
}
RoBERTa.Config¶
Component: RoBERTa
-
class
RoBERTa.
Config
[source] Bases:
NewBertModel.Config
All Attributes (including base classes)
- inputs: InputConfig = InputConfig()
- encoder: RoBERTaEncoderBase.Config = RoBERTaEncoderJit.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"encoder": {
"RoBERTaEncoderJit": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"pretrained_encoder": {
"load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
"save_path": null,
"freeze": false,
"shared_module_key": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
}
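The default encoder is the TorchScript RoBERTaEncoderJit pointing at an internal manifold:// checkpoint; the sketch below swaps in RoBERTaEncoder with a local checkpoint path instead. RoBERTa and RoBERTaEncoder are assumed to live in pytext.models.roberta (the module documented for InputConfig above), the MLPDecoder path is likewise an assumption, and the checkpoint path is purely hypothetical.
# Sketch only: import paths assumed; the checkpoint path is hypothetical.
from pytext.models.decoders.mlp_decoder import MLPDecoder
from pytext.models.roberta import RoBERTa, RoBERTaEncoder

roberta_config = RoBERTa.Config(
    encoder=RoBERTaEncoder.Config(model_path="/path/to/roberta_base_torch.pt"),
    decoder=MLPDecoder.Config(hidden_dims=[256]),  # documented default: []
)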
RoBERTaEncoder.Config¶
Component: RoBERTaEncoder
-
class
RoBERTaEncoder.
Config
[source] Bases:
RoBERTaEncoderBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- vocab_size: int =
50265
- num_encoder_layers: int =
12
- num_attention_heads: int =
12
- model_path: str =
'manifold://pytext_training/tree/static/models/roberta_base_torch.pt'
- is_finetuned: bool =
False
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"vocab_size": 50265,
"num_encoder_layers": 12,
"num_attention_heads": 12,
"model_path": "manifold://pytext_training/tree/static/models/roberta_base_torch.pt",
"is_finetuned": false
}
RoBERTaEncoderBase.Config¶
Component: RoBERTaEncoderBase
-
class
RoBERTaEncoderBase.
Config
[source] Bases:
TransformerSentenceEncoderBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- Subclasses
RoBERTaEncoder.Config
RoBERTaEncoderJit.Config
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false
}
RoBERTaEncoderJit.Config¶
Component: RoBERTaEncoderJit
-
class
RoBERTaEncoderJit.
Config
[source] Bases:
RoBERTaEncoderBase.Config
All Attributes (including base classes)
- load_path: Optional[str] =
None
- save_path: Optional[str] =
None
- freeze: bool =
False
- shared_module_key: Optional[str] =
None
- output_dropout: float =
0.4
- embedding_dim: int =
768
- pooling: PoolingMethod =
<PoolingMethod.CLS_TOKEN: 'cls_token'>
- export: bool =
False
- pretrained_encoder: Module.Config = Module.Config(load_path='manifold://pytext_training/tree/static/models/roberta_public.pt1')
Default JSON
{
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"pretrained_encoder": {
"load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
"save_path": null,
"freeze": false,
"shared_module_key": null
}
}
RoBERTaWordTaggingModel.Config¶
Component: RoBERTaWordTaggingModel
-
class
RoBERTaWordTaggingModel.
Config
[source] Bases:
BaseModel.Config
All Attributes (including base classes)
- inputs: WordTaggingInputConfig = WordTaggingInputConfig()
- encoder: RoBERTaEncoderBase.Config = RoBERTaEncoderJit.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: WordTaggingOutputLayer.Config = WordTaggingOutputLayer.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256,
"labels_columns": [
"label"
],
"labels": []
}
},
"encoder": {
"RoBERTaEncoderJit": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"pretrained_encoder": {
"load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
"save_path": null,
"freeze": false,
"shared_module_key": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
}
WordTaggingInputConfig¶
-
class
pytext.models.roberta.
WordTaggingInputConfig
Bases:
ConfigBase
All Attributes (including base classes)
Default JSON
{
"tokens": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256,
"labels_columns": [
"label"
],
"labels": []
}
}
semantic_parsers¶
rnng¶
rnng_parser¶
-
class
pytext.models.semantic_parsers.rnng.rnng_parser.
AblationParams
Bases:
ConfigBase
Ablation parameters.
-
use_buffer
¶ whether to use the buffer LSTM
Type: bool
-
use_stack
¶ whether to use the stack LSTM
Type: bool
-
use_action
¶ whether to use the action LSTM
Type: bool
-
use_last_open_NT_feature
¶ whether to use the last open non-terminal as a one-hot feature when computing the representation for the action classifier
Type: bool
-
All Attributes (including base classes)
- use_buffer: bool =
True
- use_stack: bool =
True
- use_action: bool =
True
- use_last_open_NT_feature: bool =
False
Default JSON
{
"use_buffer": true,
"use_stack": true,
"use_action": true,
"use_last_open_NT_feature": false
}
-
class
pytext.models.semantic_parsers.rnng.rnng_parser.
ModelInput
Bases:
ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config(column='tokenized_text')
- actions: AnnotationNumberizer.Config = AnnotationNumberizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "tokenized_text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"actions": {
"is_input": true,
"column": "seqlogical"
}
}
-
class
pytext.models.semantic_parsers.rnng.rnng_parser.
RNNGConstraints
Bases:
ConfigBase
Constraints when computing valid actions.
-
intent_slot_nesting
¶ for intent-slot models, the top-level non-terminal has to be an intent; an intent can only have slot non-terminals as children, and vice versa.
Type: bool
-
ignore_loss_for_unsupported
¶ if the data has an “unsupported” label (i.e. the label contains the substring “unsupported”), do not compute loss for it
Type: bool
-
no_slots_inside_unsupported
¶ if the data has an “unsupported” label (i.e. the label contains the substring “unsupported”), do not predict slots inside this label.
Type: bool
-
All Attributes (including base classes)
- intent_slot_nesting: bool =
True
- ignore_loss_for_unsupported: bool =
False
- no_slots_inside_unsupported: bool =
True
Default JSON
{
"intent_slot_nesting": true,
"ignore_loss_for_unsupported": false,
"no_slots_inside_unsupported": true
}
Component: RNNGParser
-
class
RNNGParser.
Config
[source] Bases:
RNNGParserBase.Config
All Attributes (including base classes)
- version: int =
2
- lstm: BiLSTM.Config = BiLSTM.Config()
- ablation: AblationParams = AblationParams()
- constraints: RNNGConstraints = RNNGConstraints()
- max_open_NT: int =
10
- dropout: float =
0.1
- beam_size: int =
1
- top_k: int =
1
- compositional_type: CompositionalType =
<CompositionalType.BLSTM: 'blstm'>
- inputs: ModelInput = ModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
Default JSON
{
"version": 2,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"ablation": {
"use_buffer": true,
"use_stack": true,
"use_action": true,
"use_last_open_NT_feature": false
},
"constraints": {
"intent_slot_nesting": true,
"ignore_loss_for_unsupported": false,
"no_slots_inside_unsupported": true
},
"max_open_NT": 10,
"dropout": 0.1,
"beam_size": 1,
"top_k": 1,
"compositional_type": "blstm",
"inputs": {
"tokens": {
"is_input": true,
"column": "tokenized_text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"actions": {
"is_input": true,
"column": "seqlogical"
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
}
}
Component: RNNGParserBase
-
class
RNNGParserBase.
Config
[source] Bases:
ConfigBase
All Attributes (including base classes)
- version: int =
2
- lstm: BiLSTM.Config = BiLSTM.Config()
- ablation: AblationParams = AblationParams()
- constraints: RNNGConstraints = RNNGConstraints()
- max_open_NT: int =
10
- dropout: float =
0.1
- beam_size: int =
1
- top_k: int =
1
- compositional_type: CompositionalType =
<CompositionalType.BLSTM: 'blstm'>
- Subclasses
RNNGParser.Config
Default JSON
{
"version": 2,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"ablation": {
"use_buffer": true,
"use_stack": true,
"use_action": true,
"use_last_open_NT_feature": false
},
"constraints": {
"intent_slot_nesting": true,
"ignore_loss_for_unsupported": false,
"no_slots_inside_unsupported": true
},
"max_open_NT": 10,
"dropout": 0.1,
"beam_size": 1,
"top_k": 1,
"compositional_type": "blstm"
}
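A minimal sketch of building an RNNGParser.Config with the ablation and constraint blocks documented above; RNNGParser is assumed to live in the same pytext.models.semantic_parsers.rnng.rnng_parser module as AblationParams, RNNGConstraints and ModelInput.
# Sketch only: RNNGParser's location is assumed from the module documented above.
from pytext.models.semantic_parsers.rnng.rnng_parser import (
    AblationParams,
    RNNGConstraints,
    RNNGParser,
)

rnng_config = RNNGParser.Config(
    beam_size=5,  # documented default: 1
    top_k=3,      # documented default: 1
    dropout=0.3,  # documented default: 0.1
    ablation=AblationParams(use_last_open_NT_feature=True),
    constraints=RNNGConstraints(ignore_loss_for_unsupported=True),
)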
seq_models¶
contextual_intent_slot¶
ContextualIntentSlotModel.Config¶
Component: ContextualIntentSlotModel
-
class
ContextualIntentSlotModel.
Config
[source] Bases:
IntentSlotModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- word_embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: ContextualIntentSlotRepresentation.Config = ContextualIntentSlotRepresentation.Config()
- output_layer: IntentSlotOutputLayer.Config = IntentSlotOutputLayer.Config()
- decoder: IntentSlotModelDecoder.Config = IntentSlotModelDecoder.Config()
- default_doc_loss_weight: float =
0.2
- default_word_loss_weight: float =
0.5
- seq_embedding: Optional[WordEmbedding.Config] = WordEmbedding.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"word_labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": true
},
"doc_labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": true,
"pad_in_vocab": false,
"label_vocab": null
}
},
"doc_weight": null,
"word_weight": null,
"seq_tokens": {
"is_input": true,
"column": "text_seq",
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"add_bol_token": false,
"add_eol_token": false,
"use_eol_token_for_bol": false,
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
}
},
"word_embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"sen_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"seq_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"joint_representation": {
"BiLSTMDocSlotAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
}
},
"pooling": null,
"slot_attention": null,
"doc_mlp_layers": 0,
"word_mlp_layers": 0
}
}
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_output": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"word_output": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"use_doc_probs_in_word": false,
"doc_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"word_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
},
"default_doc_loss_weight": 0.2,
"default_word_loss_weight": 0.5,
"seq_embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
}
}
ModelInput¶
- class pytext.models.seq_models.contextual_intent_slot.ModelInput
  Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- word_labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config(allow_unknown=True)
- doc_labels: LabelTensorizer.Config = LabelTensorizer.Config(allow_unknown=True)
- doc_weight: Optional[FloatTensorizer.Config] = None
- word_weight: Optional[FloatTensorizer.Config] = None
- seq_tokens: Optional[SeqTokenTensorizer.Config] = SeqTokenTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"word_labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": true
},
"doc_labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": true,
"pad_in_vocab": false,
"label_vocab": null
}
},
"doc_weight": null,
"word_weight": null,
"seq_tokens": {
"is_input": true,
"column": "text_seq",
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"add_bol_token": false,
"add_eol_token": false,
"use_eol_token_for_bol": false,
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
}
}
seqnn¶
ModelInput¶
- class pytext.models.seq_models.seqnn.ModelInput
  Bases: ModelInput
All Attributes (including base classes)
- tokens: SeqTokenTensorizer.Config = SeqTokenTensorizer.Config()
- dense: Optional[FloatListTensorizer.Config] = None
- labels: LabelTensorizer.Config = LabelTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text_seq",
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"add_bol_token": false,
"add_eol_token": false,
"use_eol_token_for_bol": false,
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
}
SeqNNModel.Config¶
Component: SeqNNModel
- class SeqNNModel.Config [source]
  Bases: DocModel.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: SeqRepresentation.Config = SeqRepresentation.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text_seq",
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"add_bol_token": false,
"add_eol_token": false,
"use_eol_token_for_bol": false,
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"seq_representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
}
SeqNNModel_Deprecated.Config¶
Component: SeqNNModel_Deprecated
- class SeqNNModel_Deprecated.Config [source]
  Bases: ConfigBase
All Attributes (including base classes)
- representation: SeqRepresentation.Config = SeqRepresentation.Config()
- output_layer: ClassificationOutputLayer.Config = ClassificationOutputLayer.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
Default JSON
{
"representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"seq_representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
}
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
}
word_model¶
ByteModelInput¶
- class pytext.models.word_model.ByteModelInput
  Bases: ModelInput
All Attributes (including base classes)
- token_bytes: ByteTokenTensorizer.Config = ByteTokenTensorizer.Config()
- labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config()
Default JSON
{
"token_bytes": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"max_seq_len": null,
"max_byte_len": 15,
"offset_for_non_padding": 0,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false
},
"labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
}
ModelInput¶
- class pytext.models.word_model.ModelInput
  Bases: ModelInput
All Attributes (including base classes)
- tokens: TokenTensorizer.Config = TokenTensorizer.Config()
- labels: SlotLabelTensorizer.Config = SlotLabelTensorizer.Config()
Default JSON
{
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
}
WordTaggingLiteModel.Config¶
Component: WordTaggingLiteModel
- class WordTaggingLiteModel.Config [source]
  Bases: WordTaggingModel.Config
All Attributes (including base classes)
- inputs: ByteModelInput = ByteModelInput()
- embedding: CharacterEmbedding.Config = CharacterEmbedding.Config()
- representation: Union[BiLSTMSlotAttention.Config, BSeqCNNRepresentation.Config, PassThroughRepresentation.Config, DeepCNNRepresentation.Config] = PassThroughRepresentation.Config()
- output_layer: Union[WordTaggingOutputLayer.Config, CRFOutputLayer.Config] = WordTaggingOutputLayer.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
Default JSON
{
"inputs": {
"token_bytes": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"max_seq_len": null,
"max_byte_len": 15,
"offset_for_non_padding": 0,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false
},
"labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"sparse": false,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
},
"highway_layers": 0,
"projection_dim": null,
"export_input_names": [
"char_vals"
],
"vocab_from_train_data": true,
"max_word_length": 20,
"min_freq": 1
},
"representation": {
"PassThroughRepresentation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
},
"output_layer": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
}
WordTaggingModel.Config¶
Component: WordTaggingModel
- class WordTaggingModel.Config [source]
  Bases: Model.Config
All Attributes (including base classes)
- inputs: ModelInput = ModelInput()
- embedding: WordEmbedding.Config = WordEmbedding.Config()
- representation: Union[BiLSTMSlotAttention.Config, BSeqCNNRepresentation.Config, PassThroughRepresentation.Config, DeepCNNRepresentation.Config] = PassThroughRepresentation.Config()
- output_layer: Union[WordTaggingOutputLayer.Config, CRFOutputLayer.Config] = WordTaggingOutputLayer.Config()
- decoder: MLPDecoder.Config = MLPDecoder.Config()
- Subclasses
WordTaggingLiteModel.Config
Default JSON
{
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"PassThroughRepresentation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
},
"output_layer": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
}
optimizer¶
fp16_optimizer¶
FP16Optimizer.Config¶
Component: FP16Optimizer
- class FP16Optimizer.Config
  Bases: Optimizer.Config
All Attributes (including base classes)
- Subclasses
FP16OptimizerApex.Config
FP16OptimizerFairseq.Config
MemoryEfficientFP16OptimizerFairseq.Config
Default JSON
{}
FP16OptimizerApex.Config¶
Component: FP16OptimizerApex
- class FP16OptimizerApex.Config [source]
  Bases: FP16Optimizer.Config
All Attributes (including base classes)
- opt_level: str = 'O2'
- init_loss_scale: Optional[int] = None
- min_loss_scale: Optional[float] = None
Default JSON
{
"opt_level": "O2",
"init_loss_scale": null,
"min_loss_scale": null
}
FP16OptimizerFairseq.Config¶
Component: FP16OptimizerFairseq
- class FP16OptimizerFairseq.Config [source]
  Bases: FP16Optimizer.Config
All Attributes (including base classes)
- init_loss_scale: int = 128
- scale_window: Optional[int] = None
- scale_tolerance: float = 0.0
- threshold_loss_scale: Optional[float] = None
- min_loss_scale: float = 0.0001
Default JSON
{
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
MemoryEfficientFP16OptimizerFairseq.Config¶
Component: MemoryEfficientFP16OptimizerFairseq
- class MemoryEfficientFP16OptimizerFairseq.Config [source]
  Bases: FP16Optimizer.Config
All Attributes (including base classes)
- init_loss_scale: int = 128
- scale_window: Optional[int] = None
- scale_tolerance: float = 0.0
- threshold_loss_scale: Optional[float] = None
- min_loss_scale: float = 0.0001
Default JSON
{
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
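The trainer configs dumped later in this document select an FP16 optimizer through the trainer's fp16_args field, wrapping the chosen component's name around its options (the dumps below use FP16OptimizerFairseq). As a minimal, hypothetical sketch of switching to the Apex-backed variant, using only the attributes documented above:
{
    "trainer": {
        "TaskTrainer": {
            "fp16_args": {
                "FP16OptimizerApex": {
                    "opt_level": "O2",
                    "init_loss_scale": null,
                    "min_loss_scale": null
                }
            }
        }
    }
}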
lamb¶
Lamb.Config¶
Component: Lamb
- class Lamb.Config [source]
  Bases: Optimizer.Config
All Attributes (including base classes)
- lr: float = 0.001
- weight_decay: float = 1e-05
- eps: float = 1e-08
- min_trust: Optional[float] = None
Default JSON
{
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08,
"min_trust": null
}
optimizers¶
Adagrad.Config¶
Component: Adagrad
- class Adagrad.Config [source]
  Bases: Optimizer.Config
All Attributes (including base classes)
- lr: float = 0.01
- weight_decay: float = 1e-05
Default JSON
{
"lr": 0.01,
"weight_decay": 1e-05
}
Adam.Config¶
Component: Adam
- class Adam.Config [source]
  Bases: Optimizer.Config
All Attributes (including base classes)
- lr: float = 0.001
- weight_decay: float = 1e-05
- eps: float = 1e-08
Default JSON
{
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
AdamW.Config¶
Component: AdamW
- class AdamW.Config [source]
  Bases: Optimizer.Config
All Attributes (including base classes)
- lr: float = 0.001
- weight_decay: float = 0.01
- eps: float = 1e-08
Default JSON
{
"lr": 0.001,
"weight_decay": 0.01,
"eps": 1e-08
}
Optimizer.Config¶
Component: Optimizer
- class Optimizer.Config [source]
  Bases: ConfigBase
All Attributes (including base classes)
- Subclasses
FP16Optimizer.Config
FP16OptimizerApex.Config
FP16OptimizerFairseq.Config
MemoryEfficientFP16OptimizerFairseq.Config
Lamb.Config
Adagrad.Config
Adam.Config
AdamW.Config
SGD.Config
RAdam.Config
StochasticWeightAveraging.Config
Default JSON
{}
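As in the JSON dumps elsewhere in this document, a union-typed field such as the trainer's optimizer is configured by wrapping the chosen component name around its attributes. A minimal sketch (assuming the default TaskTrainer) that swaps the default Adam for AdamW with the defaults listed above:
{
    "trainer": {
        "TaskTrainer": {
            "optimizer": {
                "AdamW": {
                    "lr": 0.001,
                    "weight_decay": 0.01,
                    "eps": 1e-08
                }
            }
        }
    }
}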
radam¶
scheduler¶
BatchScheduler.Config¶
Component: BatchScheduler
- class BatchScheduler.Config
  Bases: Scheduler.Config
All Attributes (including base classes)
- Subclasses
PolynomialDecayScheduler.Config
SchedulerWithWarmup.Config
WarmupScheduler.Config
Default JSON
{}
CosineAnnealingLR.Config¶
Component: CosineAnnealingLR
- class CosineAnnealingLR.Config [source]
  Bases: Scheduler.Config
All Attributes (including base classes)
- t_max: int = 1000
  Maximum number of iterations.
- eta_min: float = 0
  Minimum learning rate
Default JSON
{
"t_max": 1000,
"eta_min": 0
}
CyclicLR.Config¶
Component: CyclicLR
- class CyclicLR.Config [source]
  Bases: Scheduler.Config
All Attributes (including base classes)
- base_lr: float = 0.001
- max_lr: float = 0.002
- step_size_up: int = 2000
- step_size_down: Optional[int] = None
- mode: str = 'triangular'
- gamma: float = 1.0
- scale_mode: str = 'cycle'
- cycle_momentum: bool = True
- base_momentum: float = 0.8
- max_momentum: float = 0.9
- last_epoch: int = -1
Default JSON
{
"base_lr": 0.001,
"max_lr": 0.002,
"step_size_up": 2000,
"step_size_down": null,
"mode": "triangular",
"gamma": 1.0,
"scale_mode": "cycle",
"cycle_momentum": true,
"base_momentum": 0.8,
"max_momentum": 0.9,
"last_epoch": -1
}
ExponentialLR.Config¶
Component: ExponentialLR
- class ExponentialLR.Config [source]
  Bases: Scheduler.Config
All Attributes (including base classes)
- gamma: float = 0.1
  Multiplicative factor of learning rate decay.
Default JSON
{
"gamma": 0.1
}
LmFineTuning.Config¶
Component: LmFineTuning
- class LmFineTuning.Config [source]
  Bases: Scheduler.Config
All Attributes (including base classes)
- cut_frac: float = 0.1
  The fraction of iterations over which we increase the learning rate. Default 0.1
- ratio: int = 32
  How much smaller the lowest LR is than the maximum LR eta_max.
- non_pretrained_param_groups: int = 2
  Number of param_groups, starting from the end, that were not pretrained. The default value is 2, since the base Model class typically supplies the optimizer with one param_group from the embedding and one param_group from its other components.
- lm_lr_multiplier: float = 1.0
  Factor by which to multiply the lr of all pretrained layers.
- lm_use_per_layer_lr: bool = False
  Whether to make each pretrained layer’s lr one-half as large as the next (higher) layer.
- lm_gradual_unfreezing: bool = True
  Whether to unfreeze layers one by one (per epoch).
- last_epoch: int = -1
  Though the name is last_epoch, it means the last batch update: last_batch_update = current_epoch_number * num_batches_per_epoch + batch_id. It is incremented by 1 after each batch update.
Default JSON
{
"cut_frac": 0.1,
"ratio": 32,
"non_pretrained_param_groups": 2,
"lm_lr_multiplier": 1.0,
"lm_use_per_layer_lr": false,
"lm_gradual_unfreezing": true,
"last_epoch": -1
}
PolynomialDecayScheduler.Config¶
Component: PolynomialDecayScheduler
- class PolynomialDecayScheduler.Config [source]
  Bases: BatchScheduler.Config
All Attributes (including base classes)
- warmup_steps: int = 0
  number of training steps over which to increase learning rate
- total_steps: int
  number of training steps for learning rate decay
- end_learning_rate: float
  end learning rate after total_steps of training
- power: float = 1.0
  power used for polynomial decay calculation
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
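Because total_steps and end_learning_rate have no defaults, they must be supplied explicitly. A hypothetical sketch of a complete trainer scheduler entry, where 10000 and 0.0 are illustrative placeholders rather than recommended values:
{
    "trainer": {
        "TaskTrainer": {
            "scheduler": {
                "PolynomialDecayScheduler": {
                    "warmup_steps": 0,
                    "total_steps": 10000,
                    "end_learning_rate": 0.0,
                    "power": 1.0
                }
            }
        }
    }
}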
ReduceLROnPlateau.Config¶
Component: ReduceLROnPlateau
- class ReduceLROnPlateau.Config [source]
  Bases: Scheduler.Config
All Attributes (including base classes)
- lower_is_better: bool = True
  This indicates the desirable direction in which we would like the training to proceed. If set to true, the learning rate will be reduced when the quantity being monitored stops going down.
- factor: float = 0.1
  Factor by which the learning rate will be reduced. new_lr = lr * factor
- patience: int = 5
  Number of epochs with no improvement after which the learning rate will be reduced
- min_lr: float = 0
  Lower bound on the learning rate of all param groups
- threshold: float = 0.0001
  Threshold for measuring the new optimum, to only focus on significant changes.
- threshold_is_absolute: bool = True
  One of rel, abs. In rel mode, dynamic_threshold = best * (1 + threshold) in ‘max’ mode or best * (1 - threshold) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode.
- cooldown: int = 0
  Number of epochs to wait before resuming normal operation after the lr has been reduced.
Default JSON
{
"lower_is_better": true,
"factor": 0.1,
"patience": 5,
"min_lr": 0,
"threshold": 0.0001,
"threshold_is_absolute": true,
"cooldown": 0
}
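A minimal sketch of plugging this scheduler into the trainer, assuming the monitored metric is one where higher is better (e.g. accuracy), so lower_is_better is flipped; attributes not shown would presumably keep the defaults listed above:
{
    "trainer": {
        "TaskTrainer": {
            "scheduler": {
                "ReduceLROnPlateau": {
                    "lower_is_better": false,
                    "factor": 0.1,
                    "patience": 5
                }
            }
        }
    }
}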
Scheduler.Config¶
Component: Scheduler
- class Scheduler.Config [source]
  Bases: ConfigBase
All Attributes (including base classes)
- Subclasses
BatchScheduler.Config
CosineAnnealingLR.Config
CyclicLR.Config
ExponentialLR.Config
LmFineTuning.Config
PolynomialDecayScheduler.Config
ReduceLROnPlateau.Config
SchedulerWithWarmup.Config
StepLR.Config
WarmupScheduler.Config
Default JSON
{}
SchedulerWithWarmup.Config¶
Component: SchedulerWithWarmup
- class SchedulerWithWarmup.Config [source]
  Bases: BatchScheduler.Config
All Attributes (including base classes)
- warmup_scheduler: WarmupScheduler.Config = WarmupScheduler.Config()
- scheduler: Union[ExponentialLR.Config, CosineAnnealingLR.Config, ReduceLROnPlateau.Config, LmFineTuning.Config, CyclicLR.Config]
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
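Since the inner scheduler has no default, a functional config must name one of the union members explicitly. A hypothetical sketch combining the WarmupScheduler defaults with ExponentialLR; the nesting below assumes the convention used throughout these dumps, where union-typed fields are wrapped in the component name and fixed-type fields are not:
{
    "trainer": {
        "TaskTrainer": {
            "scheduler": {
                "SchedulerWithWarmup": {
                    "warmup_scheduler": {
                        "warmup_steps": 10000,
                        "inverse_sqrt_decay": false
                    },
                    "scheduler": {
                        "ExponentialLR": {
                            "gamma": 0.1
                        }
                    }
                }
            }
        }
    }
}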
StepLR.Config¶
Component: StepLR
- class StepLR.Config [source]
  Bases: Scheduler.Config
All Attributes (including base classes)
- step_size: int = 30
  Period of learning rate decay.
- gamma: float = 0.1
  Multiplicative factor of learning rate decay.
Default JSON
{
"step_size": 30,
"gamma": 0.1
}
WarmupScheduler.Config¶
Component: WarmupScheduler
- class WarmupScheduler.Config [source]
  Bases: BatchScheduler.Config
All Attributes (including base classes)
- warmup_steps: int = 10000
  number of training steps over which to increase learning rate
- inverse_sqrt_decay: bool = False
  whether to perform inverse sqrt decay after the warmup phase
Default JSON
{
"warmup_steps": 10000,
"inverse_sqrt_decay": false
}
sparsifiers¶
blockwise_sparsifier¶
BlockwiseMagnitudeSparsifier.Config¶
Component: BlockwiseMagnitudeSparsifier
- class BlockwiseMagnitudeSparsifier.Config [source]
  Bases: L0_projection_sparsifier.Config
All Attributes (including base classes)
- sparsity: float = 0.9
- starting_epoch: int = 2
- frequency: int = 1
- layerwise_pruning: bool = True
- accumulate_mask: bool = False
- block_size: int = 16
- columnwise_blocking: bool = False
Default JSON
{
"sparsity": 0.9,
"starting_epoch": 2,
"frequency": 1,
"layerwise_pruning": true,
"accumulate_mask": false,
"block_size": 16,
"columnwise_blocking": false
}
sparsifier¶
CRF_L1_SoftThresholding.Config¶
Component: CRF_L1_SoftThresholding
- class CRF_L1_SoftThresholding.Config [source]
  Bases: CRF_SparsifierBase.Config
All Attributes (including base classes)
- starting_epoch: int = 1
- frequency: int = 1
- lambda_l1: float = 0.001
Default JSON
{
"starting_epoch": 1,
"frequency": 1,
"lambda_l1": 0.001
}
CRF_MagnitudeThresholding.Config¶
Component: CRF_MagnitudeThresholding
- class CRF_MagnitudeThresholding.Config [source]
  Bases: CRF_SparsifierBase.Config
All Attributes (including base classes)
- starting_epoch: int = 1
- frequency: int = 1
- sparsity: float = 0.9
- grouping: str = 'row'
Default JSON
{
"starting_epoch": 1,
"frequency": 1,
"sparsity": 0.9,
"grouping": "row"
}
CRF_SparsifierBase.Config¶
Component: CRF_SparsifierBase
- class CRF_SparsifierBase.Config [source]
  Bases: Sparsifier.Config
All Attributes (including base classes)
- starting_epoch: int = 1
- frequency: int = 1
- Subclasses
CRF_L1_SoftThresholding.Config
CRF_MagnitudeThresholding.Config
Default JSON
{
"starting_epoch": 1,
"frequency": 1
}
L0_projection_sparsifier.Config¶
Component: L0_projection_sparsifier
- class L0_projection_sparsifier.Config [source]
  Bases: Sparsifier.Config
All Attributes (including base classes)
- sparsity: float = 0.9
- starting_epoch: int = 2
- frequency: int = 1
- layerwise_pruning: bool = True
- accumulate_mask: bool = False
- Subclasses
BlockwiseMagnitudeSparsifier.Config
Default JSON
{
"sparsity": 0.9,
"starting_epoch": 2,
"frequency": 1,
"layerwise_pruning": true,
"accumulate_mask": false
}
Sparsifier.Config¶
Component: Sparsifier
- class Sparsifier.Config [source]
  Bases: ConfigBase
All Attributes (including base classes)
- Subclasses
BlockwiseMagnitudeSparsifier.Config
CRF_L1_SoftThresholding.Config
CRF_MagnitudeThresholding.Config
CRF_SparsifierBase.Config
L0_projection_sparsifier.Config
Default JSON
{}
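Sparsifiers are selected through the trainer's sparsifier field (null by default in the trainer dumps below). A minimal, hypothetical sketch enabling L0_projection_sparsifier with the defaults documented above:
{
    "trainer": {
        "TaskTrainer": {
            "sparsifier": {
                "L0_projection_sparsifier": {
                    "sparsity": 0.9,
                    "starting_epoch": 2,
                    "frequency": 1,
                    "layerwise_pruning": true,
                    "accumulate_mask": false
                }
            }
        }
    }
}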
swa¶
StochasticWeightAveraging.Config¶
Component: StochasticWeightAveraging
- class StochasticWeightAveraging.Config [source]
  Bases: Optimizer.Config
All Attributes (including base classes)
- optimizer: Union[SGD.Config, Adam.Config, AdamW.Config, Adagrad.Config, RAdam.Config, Lamb.Config] = SGD.Config()
- start: int = 10
- frequency: int = 5
- swa_learning_rate: Optional[float] = 0.05
Default JSON
{
"optimizer": {
"SGD": {
"lr": 0.001,
"momentum": 0.0
}
},
"start": 10,
"frequency": 5,
"swa_learning_rate": 0.05
}
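StochasticWeightAveraging is itself an Optimizer.Config that wraps an inner optimizer. A sketch of selecting it as the trainer optimizer, mirroring the Default JSON above (the inner SGD could be swapped for any member of the documented union):
{
    "trainer": {
        "TaskTrainer": {
            "optimizer": {
                "StochasticWeightAveraging": {
                    "optimizer": {
                        "SGD": {
                            "lr": 0.001,
                            "momentum": 0.0
                        }
                    },
                    "start": 10,
                    "frequency": 5,
                    "swa_learning_rate": 0.05
                }
            }
        }
    }
}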
task¶
disjoint_multitask¶
DisjointMultitask.Config¶
Component: DisjointMultitask
- class DisjointMultitask.Config [source]
  Bases: TaskBase.Config
All Attributes (including base classes)
- features: FeatureConfig = FeatureConfig()
- featurizer: Featurizer.Config = SimpleFeaturizer.Config()
- data_handler: DisjointMultitaskDataHandler.Config = DisjointMultitaskDataHandler.Config()
- trainer: Trainer.Config = Trainer.Config()
- exporter: Optional[ModelExporter.Config] = None
- tasks: dict[str, Task_Deprecated.Config]
- task_weights: dict[str, float] = {}
- target_task_name: Optional[str] = None
- metric_reporter: DisjointMultitaskMetricReporter.Config = DisjointMultitaskMetricReporter.Config()
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
NewDisjointMultitask.Config¶
Component: NewDisjointMultitask
- class NewDisjointMultitask.Config [source]
  Bases: _NewTask.Config
All Attributes (including base classes)
- data: DisjointMultitaskData.Config = DisjointMultitaskData.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- tasks: dict[str, NewTask.Config] = {}
- task_weights: dict[str, float] = {}
- target_task_name: Optional[str] = None
- metric_reporter: DisjointMultitaskMetricReporter.Config = DisjointMultitaskMetricReporter.Config()
Default JSON
{
"data": {
"sampler": {
"RoundRobinBatchSampler": {
"iter_to_set_epoch": ""
}
},
"test_key": null
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"tasks": {},
"task_weights": {},
"target_task_name": null,
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"use_subtask_select_metric": false
}
}
new_task¶
NewTask.Config¶
Component: NewTask
- class NewTask.Config [source]
  Bases: _NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: BaseModel.Config
- Subclasses
BertPairRegressionTask.Config
DocumentClassificationTask.Config
DocumentRegressionTask.Config
EnsembleTask.Config
IntentSlotTask.Config
LMTask.Config
MaskedLMTask.Config
NewBertClassificationTask.Config
NewBertPairClassificationTask.Config
PairwiseClassificationTask.Config
QueryDocumentPairwiseRankingTask.Config
RoBERTaNERTask.Config
SemanticParsingTask.Config
SeqNNTask.Config
SquadQATask.Config
WordTaggingTask.Config
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
_NewTask.Config¶
Component: _NewTask
- class _NewTask.Config [source]
  Bases: ConfigBase
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- Subclasses
NewDisjointMultitask.Config
NewTask.Config
BertPairRegressionTask.Config
DocumentClassificationTask.Config
DocumentRegressionTask.Config
EnsembleTask.Config
IntentSlotTask.Config
LMTask.Config
MaskedLMTask.Config
NewBertClassificationTask.Config
NewBertPairClassificationTask.Config
PairwiseClassificationTask.Config
QueryDocumentPairwiseRankingTask.Config
RoBERTaNERTask.Config
SemanticParsingTask.Config
SeqNNTask.Config
SquadQATask.Config
WordTaggingTask.Config
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
}
}
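In practice, the data block is where a task is pointed at its TSV files. A hypothetical sketch overriding only the data source and batch sizes; the file paths and the field_names ordering are placeholders that must match your TSV columns, and omitted fields would presumably keep the defaults shown above:
{
    "data": {
        "Data": {
            "source": {
                "TSVDataSource": {
                    "field_names": ["label", "slots", "text"],
                    "train_filename": "data/train.tsv",
                    "eval_filename": "data/eval.tsv",
                    "test_filename": "data/test.tsv",
                    "delimiter": "\t"
                }
            },
            "batcher": {
                "PoolingBatcher": {
                    "train_batch_size": 32,
                    "eval_batch_size": 32,
                    "test_batch_size": 32
                }
            }
        }
    }
}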
task¶
TaskBase.Config¶
Component: TaskBase
- class TaskBase.Config [source]
  Bases: ConfigBase
All Attributes (including base classes)
- features: FeatureConfig = FeatureConfig()
- featurizer: Featurizer.Config = SimpleFeaturizer.Config()
- data_handler: DataHandler.Config
- trainer: Trainer.Config = Trainer.Config()
- exporter: Optional[ModelExporter.Config] = None
- Subclasses
DisjointMultitask.Config
Task_Deprecated.Config
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
Task_Deprecated.Config¶
Component: Task_Deprecated
- class Task_Deprecated.Config
  Bases: TaskBase.Config
All Attributes (including base classes)
- features: FeatureConfig = FeatureConfig()
- featurizer: Featurizer.Config = SimpleFeaturizer.Config()
- data_handler: DataHandler.Config
- trainer: Trainer.Config = Trainer.Config()
- exporter: Optional[ModelExporter.Config] = None
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
tasks¶
BertPairRegressionTask.Config¶
Component: BertPairRegressionTask
- class BertPairRegressionTask.Config [source]
  Bases: DocumentRegressionTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: NewBertRegressionModel.Config = NewBertRegressionModel.Config()
- metric_reporter: RegressionMetricReporter.Config = RegressionMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text1",
"text2"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"labels": {
"is_input": false,
"column": "label",
"rescale_range": null
}
},
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {},
"squash_to_unit_range": false
}
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
}
DocumentClassificationTask.Config¶
Component: DocumentClassificationTask
- class DocumentClassificationTask.Config [source]
  Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: BaseModel.Config = DocModel.Config()
- metric_reporter: Union[ClassificationMetricReporter.Config, PureLossMetricReporter.Config] = ClassificationMetricReporter.Config()
- Subclasses
NewBertClassificationTask.Config
NewBertPairClassificationTask.Config
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"DocModel": {
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
}
},
"metric_reporter": {
"ClassificationMetricReporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
}
}
DocumentRegressionTask.Config¶
Component: DocumentRegressionTask
- class DocumentRegressionTask.Config [source]
  Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: DocRegressionModel.Config = DocRegressionModel.Config()
- metric_reporter: RegressionMetricReporter.Config = RegressionMetricReporter.Config()
- Subclasses
BertPairRegressionTask.Config
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"dense": null,
"labels": {
"is_input": false,
"column": "label",
"rescale_range": null
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {},
"squash_to_unit_range": false
}
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
}
EnsembleTask.Config¶
Component: EnsembleTask
- class EnsembleTask.Config [source]
  Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: EnsembleTrainer.Config = EnsembleTrainer.Config()
- model: EnsembleModel.Config
- metric_reporter: Union[ClassificationMetricReporter.Config, IntentSlotMetricReporter.Config] = ClassificationMetricReporter.Config()
Warning
This config has parameters with no default values. We aren’t yet able to generate functional JSON for it.
IntentSlotTask.Config¶
Component: IntentSlotTask
- class IntentSlotTask.Config [source]
  Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: IntentSlotModel.Config = IntentSlotModel.Config()
- metric_reporter: IntentSlotMetricReporter.Config = IntentSlotMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"IntentSlotModel": {
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"word_labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": true
},
"doc_labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": true,
"pad_in_vocab": false,
"label_vocab": null
}
},
"doc_weight": null,
"word_weight": null
},
"word_embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocSlotAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
}
},
"pooling": null,
"slot_attention": null,
"doc_mlp_layers": 0,
"word_mlp_layers": 0
}
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_output": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
},
"word_output": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"use_doc_probs_in_word": false,
"doc_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"word_decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
},
"default_doc_loss_weight": 0.2,
"default_word_loss_weight": 0.5
}
},
"metric_reporter": {
"IntentSlotMetricReporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
}
}
LMTask.Config¶
Component: LMTask
- class LMTask.Config [source]
  Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: LMLSTM.Config = LMLSTM.Config()
- metric_reporter: LanguageModelMetricReporter.Config = LanguageModelMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": true,
"add_eos_token": true,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTM": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": false,
"pack_sequence": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {}
},
"tied_weights": false,
"stateful": false,
"caffe2_format": "predictor"
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"aggregate_metrics": true,
"perplexity_type": "median"
}
}
MaskedLMTask.Config¶
Component: MaskedLMTask
- class MaskedLMTask.Config [source]
  Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = PackedLMData.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: MaskedLanguageModel.Config = MaskedLanguageModel.Config()
- metric_reporter: MaskedLMMetricReporter.Config = MaskedLMMetricReporter.Config()
Default JSON
{
"data": {
"PackedLMData": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true,
"max_seq_len": 128
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"BERTTensorizerBase": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"base_tokenizer": null,
"vocab_file": "",
"max_seq_len": 128
}
}
},
"encoder": {
"TransformerSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"dropout": 0.1,
"attention_dropout": 0.1,
"activation_dropout": 0.1,
"ffn_embedding_dim": 3072,
"num_encoder_layers": 6,
"num_attention_heads": 8,
"num_segments": 2,
"use_position_embeddings": true,
"offset_positions_by_padding": true,
"apply_bert_init": true,
"encoder_normalize_before": true,
"activation_fn": "relu",
"projection_dim": 0,
"max_seq_len": 128,
"multilingual": false,
"freeze_embeddings": false,
"n_trans_layers_to_freeze": 0,
"use_torchscript": false
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {}
},
"mask_prob": 0.15,
"mask_bos": false,
"masking_strategy": "random",
"tie_weights": true
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"aggregate_metrics": true,
"perplexity_type": "median"
}
}
NewBertClassificationTask.Config¶
Component: NewBertClassificationTask
- class NewBertClassificationTask.Config [source]
  Bases: DocumentClassificationTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: NewBertModel.Config = NewBertModel.Config()
- metric_reporter: Union[ClassificationMetricReporter.Config, PureLossMetricReporter.Config] = ClassificationMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"num_tokens": {
"is_input": false,
"names": [
"tokens"
],
"indexes": [
2
]
}
},
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"metric_reporter": {
"ClassificationMetricReporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
}
}
NewBertPairClassificationTask.Config¶
Component: NewBertPairClassificationTask
- class NewBertPairClassificationTask.Config [source]
Bases: DocumentClassificationTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: NewBertModel.Config = NewBertModel.Config(inputs=BertModelInput(tokens=BERTTensorizer.Config(columns=['text1', 'text2'], max_seq_len=128)))
- metric_reporter: ClassificationMetricReporter.Config = ClassificationMetricReporter.Config(text_column_names=['text1', 'text2'])
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"BERTTensorizer": {
"is_input": true,
"columns": [
"text1",
"text2"
],
"tokenizer": {
"WordPieceTokenizer": {
"basic_tokenizer": {
"split_regex": "\\s+",
"lowercase": true
},
"wordpiece_vocab_path": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt"
}
},
"base_tokenizer": null,
"vocab_file": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/vocab.txt",
"max_seq_len": 128
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
},
"num_tokens": {
"is_input": false,
"names": [
"tokens"
],
"indexes": [
2
]
}
},
"encoder": {
"HuggingFaceBertSentenceEncoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"bert_cpt_dir": "/mnt/vol/nlp_technologies/bert/uncased_L-12_H-768_A-12/",
"load_weights": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"metric_reporter": {
"ClassificationMetricReporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text1",
"text2"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
}
}
PairwiseClassificationTask.Config¶
Component: PairwiseClassificationTask
- class PairwiseClassificationTask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: BasePairwiseModel.Config = PairwiseModel.Config()
- metric_reporter: ClassificationMetricReporter.Config = ClassificationMetricReporter.Config(text_column_names=['text1', 'text2'])
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"PairwiseModel": {
"inputs": {
"tokens1": {
"is_input": true,
"column": "text1",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"tokens2": {
"is_input": true,
"column": "text2",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"ClassificationOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"encode_relations": true,
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"shared_representations": true
}
},
"metric_reporter": {
"ClassificationMetricReporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text1",
"text2"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
}
}
QueryDocumentPairwiseRankingTask.Config¶
Component: QueryDocumentPairwiseRankingTask
- class QueryDocumentPairwiseRankingTask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: QueryDocPairwiseRankingModel.Config = QueryDocPairwiseRankingModel.Config()
- metric_reporter: PairwiseRankingMetricReporter.Config = PairwiseRankingMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"pos_response": {
"is_input": true,
"column": "pos_response",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"neg_response": {
"is_input": true,
"column": "neg_response",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"query": {
"is_input": true,
"column": "query",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": []
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"margin": 1.0
}
},
"encode_relations": true,
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
},
"shared_representations": true,
"decoder_output_dim": 64
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
}
RoBERTaNERTask.Config¶
Component: RoBERTaNERTask
- class RoBERTaNERTask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: RoBERTaWordTaggingModel.Config = RoBERTaWordTaggingModel.Config()
- metric_reporter: NERMetricReporter.Config = NERMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"is_input": true,
"columns": [
"text"
],
"tokenizer": {
"GPT2BPETokenizer": {
"bpe_encoder_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/encoder.json",
"bpe_vocab_path": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/vocab.bpe"
}
},
"base_tokenizer": null,
"vocab_file": "manifold://pytext_training/tree/static/vocabs/bpe/gpt2/dict.txt",
"max_seq_len": 256,
"labels_columns": [
"label"
],
"labels": []
}
},
"encoder": {
"RoBERTaEncoderJit": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"output_dropout": 0.4,
"embedding_dim": 768,
"pooling": "cls_token",
"export": false,
"pretrained_encoder": {
"load_path": "manifold://pytext_training/tree/static/models/roberta_public.pt1",
"save_path": null,
"freeze": false,
"shared_module_key": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
}
SemanticParsingTask.Config¶
Component: SemanticParsingTask
- class SemanticParsingTask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: HogwildTrainer.Config = HogwildTrainer.Config()
- model: RNNGParser.Config = RNNGParser.Config()
- metric_reporter: CompositionalMetricReporter.Config = CompositionalMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"real_trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"num_workers": 1
},
"model": {
"version": 2,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"ablation": {
"use_buffer": true,
"use_stack": true,
"use_action": true,
"use_last_open_NT_feature": false
},
"constraints": {
"intent_slot_nesting": true,
"ignore_loss_for_unsupported": false,
"no_slots_inside_unsupported": true
},
"max_open_NT": 10,
"dropout": 0.1,
"beam_size": 1,
"top_k": 1,
"compositional_type": "blstm",
"inputs": {
"tokens": {
"is_input": true,
"column": "tokenized_text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"actions": {
"is_input": true,
"column": "seqlogical"
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
}
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"text_column_name": "tokenized_text"
}
}
SeqNNTask.Config¶
Component: SeqNNTask
- class SeqNNTask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: SeqNNModel.Config = SeqNNModel.Config()
- metric_reporter: ClassificationMetricReporter.Config = ClassificationMetricReporter.Config(text_column_names=['text_seq'])
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"inputs": {
"tokens": {
"is_input": true,
"column": "text_seq",
"max_seq_len": null,
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"add_bol_token": false,
"add_eol_token": false,
"use_eol_token_for_bol": false,
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
}
},
"dense": null,
"labels": {
"LabelTensorizer": {
"is_input": false,
"column": "label",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"doc_representation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"cnn": {
"kernel_num": 100,
"kernel_sizes": [
3,
4
],
"weight_norm": false,
"dilated": false,
"causal": false
}
},
"seq_representation": {
"BiLSTMDocAttention": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"dropout": 0.4,
"lstm_dim": 32,
"num_layers": 1,
"bidirectional": true,
"pack_sequence": true
},
"pooling": {
"SelfAttention": {
"attn_dimension": 64,
"dropout": 0.4
}
},
"mlp_decoder": null
}
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": null
}
},
"metric_reporter": {
"ClassificationMetricReporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"model_select_metric": "accuracy",
"target_label": null,
"text_column_names": [
"text_seq"
],
"additional_column_names": [],
"recall_at_precision_thresholds": [
0.2,
0.4,
0.6,
0.8,
0.9
]
}
}
}
SquadQATask.Config¶
Component: SquadQATask
- class SquadQATask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: Union[BertSquadQAModel.Config, DrQAModel.Config] = DrQAModel.Config()
- metric_reporter: SquadMetricReporter.Config = SquadMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"DrQAModel": {
"inputs": {
"squad_input": {
"SquadTensorizer": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\W+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " ",
"doc_column": "doc",
"ques_column": "question",
"answers_column": "answers",
"answer_starts_column": "answer_starts",
"max_ques_seq_len": 64,
"max_doc_seq_len": 256
}
},
"has_answer": {
"LabelTensorizer": {
"is_input": false,
"column": "has_answer",
"allow_unknown": false,
"pad_in_vocab": false,
"label_vocab": null
}
}
},
"dropout": 0.4,
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 300,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "/mnt/vol/pytext/users/kushall/pretrained/glove.840B.300d.txt",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": true,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"ques_rnn": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_size": 32,
"num_layers": 1,
"dropout": 0.4,
"bidirectional": true,
"rnn_type": "lstm",
"concat_layers": true
},
"doc_rnn": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_size": 32,
"num_layers": 1,
"dropout": 0.4,
"bidirectional": true,
"rnn_type": "lstm",
"concat_layers": true
},
"output_layer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"ignore_impossible": true,
"pos_loss_weight": 0.5,
"has_answer_loss_weight": 0.5,
"false_label": "False",
"max_answer_len": 30,
"hard_weight": 0.0
},
"is_kd": false
}
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false,
"n_best_size": 5,
"max_answer_length": 16,
"ignore_impossible": true,
"false_label": "False"
}
}
WordTaggingTask.Config¶
Component: WordTaggingTask
- class WordTaggingTask.Config [source]
Bases: NewTask.Config
All Attributes (including base classes)
- data: Data.Config = Data.Config()
- trainer: TaskTrainer.Config = TaskTrainer.Config()
- model: WordTaggingModel.Config = WordTaggingModel.Config()
- metric_reporter: SequenceTaggingMetricReporter.Config = SequenceTaggingMetricReporter.Config()
Default JSON
{
"data": {
"Data": {
"source": {
"TSVDataSource": {
"column_mapping": {},
"train_filename": null,
"test_filename": null,
"eval_filename": null,
"field_names": null,
"delimiter": "\t",
"quoted": false,
"drop_incomplete_rows": false
}
},
"batcher": {
"PoolingBatcher": {
"train_batch_size": 16,
"eval_batch_size": 16,
"test_batch_size": 16,
"pool_num_batches": 10000,
"num_shuffled_pools": 1
}
},
"sort_key": null,
"in_memory": true
}
},
"trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"model": {
"WordTaggingModel": {
"inputs": {
"tokens": {
"is_input": true,
"column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"add_bos_token": false,
"add_eos_token": false,
"use_eos_token_for_bos": false,
"max_seq_len": null,
"vocab": {
"build_from_data": true,
"size_from_data": 0,
"vocab_files": []
},
"vocab_file_delimiter": " "
},
"labels": {
"is_input": false,
"slot_column": "slots",
"text_column": "text",
"tokenizer": {
"Tokenizer": {
"split_regex": "\\s+",
"lowercase": true
}
},
"allow_unknown": false
}
},
"embedding": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"embed_dim": 100,
"embedding_init_strategy": "random",
"embedding_init_range": null,
"export_input_names": [
"tokens_vals"
],
"pretrained_embeddings_path": "",
"vocab_file": "",
"vocab_size": 0,
"vocab_from_train_data": true,
"vocab_from_all_data": false,
"vocab_from_pretrained_embeddings": false,
"lowercase_tokens": true,
"min_freq": 1,
"mlp_layer_dims": [],
"padding_idx": null,
"cpu_only": false,
"skip_header": true,
"delimiter": " "
},
"representation": {
"PassThroughRepresentation": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null
}
},
"output_layer": {
"WordTaggingOutputLayer": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"loss": {
"CrossEntropyLoss": {}
},
"label_weights": {},
"ignore_pad_in_loss": true
}
},
"decoder": {
"load_path": null,
"save_path": null,
"freeze": false,
"shared_module_key": null,
"hidden_dims": [],
"out_dim": null,
"layer_norm": false,
"dropout": 0.0,
"activation": "relu"
}
}
},
"metric_reporter": {
"output_path": "/tmp/test_out.txt",
"pep_format": false
}
}
trainers¶
ensemble_trainer¶
EnsembleTrainer.Config¶
Component: EnsembleTrainer
- class EnsembleTrainer.Config [source]
Bases: ConfigBase
All Attributes (including base classes)
- real_trainer: TaskTrainer.Config = TaskTrainer.Config()
Default JSON
{
"real_trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
}
}
hogwild_trainer¶
HogwildTrainer.Config¶
Component: HogwildTrainer
- class HogwildTrainer.Config [source]
Bases: ConfigBase
All Attributes (including base classes)
- real_trainer: TaskTrainer.Config = TaskTrainer.Config()
- num_workers: int = 1
Default JSON
{
"real_trainer": {
"TaskTrainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
},
"num_workers": 1
}
HogwildTrainer_Deprecated.Config¶
Component: HogwildTrainer_Deprecated
- class HogwildTrainer_Deprecated.Config [source]
Bases: ConfigBase
All Attributes (including base classes)
- real_trainer: Trainer.Config = Trainer.Config()
- num_workers: int = 1
Default JSON
{
"real_trainer": {
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
},
"num_workers": 1
}
trainer¶
TaskTrainer.Config¶
Component: TaskTrainer
- class TaskTrainer.Config [source]
Bases: Trainer.Config
Make mypy happy
All Attributes (including base classes)
- epochs: int = 10
- early_stop_after: int = 0
- max_clip_norm: Optional[float] = None
- report_train_metrics: bool = True
- target_time_limit_seconds: Optional[int] = None
- do_eval: bool = True
- load_best_model_after_train: bool = True
- num_samples_to_log_progress: int = 1000
- num_accumulated_batches: int = 1
- num_batches_per_epoch: Optional[int] = None
- optimizer: Optimizer.Config = Adam.Config()
- scheduler: Optional[Scheduler.Config] = None
- sparsifier: Optional[Sparsifier.Config] = None
- fp16_args: FP16Optimizer.Config = FP16OptimizerFairseq.Config()
Default JSON
{
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
Trainer.Config¶
Component: Trainer
- class Trainer.Config [source]
Bases: ConfigBase
All Attributes (including base classes)
- epochs: int = 10
  Training epochs.
- early_stop_after: int = 0
  Stop training after this many epochs in which the eval metric has not improved.
- max_clip_norm: Optional[float] = None
  Clip the gradient norm if set.
- report_train_metrics: bool = True
  Whether metrics on training data should be computed and reported.
- target_time_limit_seconds: Optional[int] = None
  Target time limit for training; the default (None) means no time limit.
- do_eval: bool = True
  Whether to do evaluation and model selection based on it.
- load_best_model_after_train: bool = True
- num_samples_to_log_progress: int = 1000
  Number of samples for logging training progress.
- num_accumulated_batches: int = 1
  Number of forward and backward passes per batch before updating gradients; the effective batch size is batch_size x num_accumulated_batches.
- num_batches_per_epoch: Optional[int] = None
  Define an epoch as a fixed number of batches. Subsequent epochs will continue to iterate through the data, cycling through it when they reach the end. If not set, exactly one pass through the dataset counts as one epoch. This setting only affects train epochs; test and eval always run over their entire datasets.
- optimizer: Optimizer.Config = Adam.Config()
  Config for the optimizer used in parameter updates.
- scheduler: Optional[Scheduler.Config] = None
- sparsifier: Optional[Sparsifier.Config] = None
- fp16_args: FP16Optimizer.Config = FP16OptimizerFairseq.Config()
  Arguments for fp16 training. An fp16_optimizer will be created to wrap the original optimizer; it scales the loss during backward, while master weights are maintained on the original optimizer. https://arxiv.org/abs/1710.03740
- Subclasses: TaskTrainer.Config
A short Python sketch of overriding these fields in code follows the Default JSON below.
Default JSON
{
"epochs": 10,
"early_stop_after": 0,
"max_clip_norm": null,
"report_train_metrics": true,
"target_time_limit_seconds": null,
"do_eval": true,
"load_best_model_after_train": true,
"num_samples_to_log_progress": 1000,
"num_accumulated_batches": 1,
"num_batches_per_epoch": null,
"optimizer": {
"Adam": {
"lr": 0.001,
"weight_decay": 1e-05,
"eps": 1e-08
}
},
"scheduler": null,
"sparsifier": null,
"fp16_args": {
"FP16OptimizerFairseq": {
"init_loss_scale": 128,
"scale_window": null,
"scale_tolerance": 0.0,
"threshold_loss_scale": null,
"min_loss_scale": 0.0001
}
}
}
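As promised above, here is a minimal, hedged sketch (not part of the generated reference) of overriding a few of these trainer fields from Python rather than JSON. It uses TaskTrainer.Config, the subclass listed above that shares these fields, and assumes TaskTrainer and Adam are exposed by pytext.trainers and pytext.optimizer; exact import paths may differ between PyText versions.

# Sketch only: override a few trainer fields in Python instead of JSON.
from pytext.optimizer import Adam
from pytext.trainers import TaskTrainer

trainer_config = TaskTrainer.Config(
    epochs=20,
    # Gradients are accumulated over 4 batches, so the effective batch size
    # is 4 x batch_size (see num_accumulated_batches above).
    num_accumulated_batches=4,
    optimizer=Adam.Config(lr=5e-4),
)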
TrainerBase.Config¶
Component: TrainerBase
- class TrainerBase.Config
Bases: Component.Config
All Attributes (including base classes)
Default JSON
{}
pytext package¶
Subpackages¶
pytext.common package¶
Submodules¶
pytext.common.constants module¶
- class pytext.common.constants.BatchContext [source]
Bases: object
- IGNORE_LOSS = 'ignore_loss'
- INDEX = 'row_index'
- TASK_NAME = 'task_name'
- class pytext.common.constants.DFColumn [source]
Bases: object
- ALIGNMENT = 'alignment'
- CONTEXT_SEQUENCE = 'context_sequence'
- DENSE_FEAT = 'dense_feat'
- DICT_FEAT = 'dict_feat'
- DOC_LABEL = 'doc_label'
- DOC_WEIGHT = 'doc_weight'
- LANGUAGE_ID = 'lang'
- MODEL_FEATS = 'model_feats'
- RAW_FEATS = 'raw_feats'
- SEQLOGICAL = 'seqlogical'
- SOURCE_FEATS = 'source_feats'
- SOURCE_SEQUENCE = 'source_sequence'
- TARGET_LABELS = 'target_labels'
- TARGET_LOGITS = 'target_logits'
- TARGET_PROBS = 'target_probs'
- TARGET_SEQUENCE = 'target_sequence'
- TARGET_TOKENS = 'target_tokens'
- TOKEN_RANGE = 'token_range'
- UTTERANCE = 'text'
- WORD_LABEL = 'word_label'
- WORD_WEIGHT = 'word_weight'
- class pytext.common.constants.DatasetFieldName [source]
Bases: object
- CHAR_FIELD = 'char_feat'
- CONTEXTUAL_TOKEN_EMBEDDING = 'contextual_token_embedding'
- DENSE_FIELD = 'dense_feat'
- DICT_FIELD = 'dict_feat'
- DOC_LABEL_FIELD = 'doc_label'
- DOC_WEIGHT_FIELD = 'doc_weight'
- LANGUAGE_ID_FIELD = 'lang'
- NUM_TOKENS = 'num_tokens'
- RAW_DICT_FIELD = 'sparsefeat'
- RAW_SEQUENCE = 'raw_sequence'
- RAW_WORD_LABEL = 'raw_word_label'
- SEQ_FIELD = 'seq_word_feat'
- SEQ_LENS = 'seq_lens'
- SOURCE_SEQ_FIELD = 'source_sequence'
- TARGET_SEQ_FIELD = 'target_sequence'
- TARGET_SEQ_LENS = 'target_seq_lens'
- TEXT_FIELD = 'word_feat'
- TOKENS = 'tokens'
- TOKEN_INDICES = 'token_indices'
- TOKEN_RANGE = 'token_range'
- UTTERANCE_FIELD = 'utterance'
- WORD_LABEL_FIELD = 'word_label'
- WORD_WEIGHT_FIELD = 'word_weight'
- class pytext.common.constants.PackageFileName [source]
Bases: object
- RAW_EMBED = 'pretrained_embed_raw'
- SERIALIZED_EMBED = 'pretrained_embed_pt_serialized'
- class pytext.common.constants.Padding [source]
Bases: object
- DEFAULT_LABEL_PAD_IDX = -1
- WORD_LABEL_PAD = 'PAD_LABEL'
- WORD_LABEL_PAD_IDX = 0
Module contents¶
pytext.config package¶
Submodules¶
pytext.config.component module¶
- class pytext.config.component.ComponentType [source]
Bases: enum.Enum
An enumeration.
- BATCHER = 'batcher'
- BATCH_SAMPLER = 'batch_sampler'
- COLUMN = 'column'
- DATA_HANDLER = 'data_handler'
- DATA_SOURCE = 'data_source'
- DATA_TYPE = 'data_type'
- EXPORTER = 'exporter'
- FEATURIZER = 'featurizer'
- LOSS = 'loss'
- METRIC_REPORTER = 'metric_reporter'
- MODEL = 'model'
- MODEL2 = 'model2'
- MODULE = 'module'
- OPTIMIZER = 'optimizer'
- PREDICTOR = 'predictor'
- SCHEDULER = 'scheduler'
- SPARSIFIER = 'sparsifier'
- TASK = 'task'
- TENSORIZER = 'tensorizer'
- TOKENIZER = 'tokenizer'
- TRAINER = 'trainer'
- class pytext.config.component.Registry [source]
Bases: object
- classmethod add(component_type: pytext.config.component.ComponentType, cls_to_add: Type[CT_co], config_cls: Type[CT_co]) [source]
- classmethod configs(component_type: pytext.config.component.ComponentType) → Tuple[Type[CT_co], ...] [source]
- pytext.config.component.create_component(component_type: pytext.config.component.ComponentType, config: Any, *args, **kwargs) [source]
- pytext.config.component.create_optimizer(optimizer_config, model: torch.nn.modules.module.Module, *args, **kwargs) [source]
- pytext.config.component.create_trainer(trainer_config, model: torch.nn.modules.module.Module, *args, **kwargs) [source]
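As a brief, hedged illustration of how the registry and factory functions above fit together (not part of the generated reference): components register a Config class under a ComponentType, and create_component builds the matching component from a config instance. The Tokenizer import path below is an assumption based on other parts of these docs and may differ between PyText versions.

# Sketch only: build a tokenizer component from its config via the factory above.
from pytext.config.component import ComponentType, create_component
from pytext.data.tokenizers import Tokenizer  # import path assumed

tokenizer_config = Tokenizer.Config(split_regex=r"\s+", lowercase=True)
tokenizer = create_component(ComponentType.TOKENIZER, tokenizer_config)
print(tokenizer.tokenize("Play some jazz"))  # a list of Token tuples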
pytext.config.config_adapter module¶
- pytext.config.config_adapter.doc_model_deprecated(json_config) [source]
  Rename DocModel to DocModel_Deprecated.
- pytext.config.config_adapter.ensemble_task_deprecated(json_config) [source]
  Rename tasks consistently with the new API.
- pytext.config.config_adapter.is_type_specifier(json_dict) [source]
  If a config object is a class, it might have a level which is a type specifier, with one key corresponding to the name of whichever type it is. These types should not be explicitly named in the path.
- pytext.config.config_adapter.lm_model_deprecated(json_config) [source]
  Rename the LM model to _Deprecated (LMTask is already deprecated in v5).
- pytext.config.config_adapter.new_tasks_rename(json_config) [source]
  Rename tasks consistently with the new API.
- pytext.config.config_adapter.old_tasks_deprecated(json_config) [source]
  Rename tasks with a data_handler config to _Deprecated.
- pytext.config.config_adapter.rename_bitransformer_inputs(json_config) [source]
  In the "BiTransformer" model, rename the input "characters" to "bytes" and update its subfields.
- pytext.config.config_adapter.rename_parameter(config, old_path, new_path, transform=<function <lambda>>) [source]
  A powerful tool for writing config adapters, this allows you to specify a JSON-style path for an old and a new config parameter. For instance,
  rename_parameter(config, "task.data.epoch_size", "task.trainer.batches_per_epoch")
  will look through the config for task.data.epoch_size, including moving through explicitly specified types. If it is specified, the value is deleted and set at the new path instead, creating any missing intermediate dictionaries as empty dictionaries.
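A hedged sketch of a config adapter built on rename_parameter follows; the paths and transform are invented for illustration and are not taken from a real PyText version bump.

# Sketch only: a custom adapter that moves one parameter to a new path.
from pytext.config.config_adapter import rename_parameter

def upgrade_my_config(json_config):
    # Hypothetical old and new locations; halve the value while moving it.
    rename_parameter(
        json_config,
        "task.data.epoch_size",
        "task.trainer.num_batches_per_epoch",
        transform=lambda size: size // 2,
    )
    return json_config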
- pytext.config.config_adapter.upgrade_if_xlm(json_config) [source]
  Make the XLMModel Union changes for the encoder and tokens config. Since these are now unions, insert the old class into the config if no class name is mentioned.
- pytext.config.config_adapter.v12_to_v13(json_config) [source]
  remove_output_encoded_layers(json_config)
- pytext.config.config_adapter.v2_to_v3(json_config) [source]
  Optimizer and Scheduler configs used to be part of the task config; they now live in the trainer's config.
pytext.config.contextual_intent_slot module¶
- class pytext.config.contextual_intent_slot.ExtraField [source]
Bases: object
- DOC_WEIGHT = 'doc_weight'
- RAW_WORD_LABEL = 'raw_word_label'
- TOKEN_RANGE = 'token_range'
- UTTERANCE = 'utterance'
- WORD_WEIGHT = 'word_weight'
- class pytext.config.contextual_intent_slot.ModelInput [source]
Bases: object
- CHAR = 'char_feat'
- CONTEXTUAL_TOKEN_EMBEDDING = 'contextual_token_embedding'
- DENSE = 'dense_feat'
- DICT = 'dict_feat'
- SEQ = 'seq_word_feat'
- TEXT = 'word_feat'
- class pytext.config.contextual_intent_slot.ModelInputConfig(**kwargs) [source]
Bases: pytext.config.module_config.Module.Config
- char_feat = None
- contextual_token_embedding = None
- dense_feat = None
- dict_feat = None
- seq_word_feat = <pytext.config.field_config.WordFeatConfig object>
- word_feat = <pytext.config.field_config.WordFeatConfig object>
pytext.config.doc_classification module¶
pytext.config.field_config module¶
- pytext.config.field_config.ContextualTokenEmbeddingConfig [source]
  alias of pytext.config.field_config.ContextualTokenEmbeddingConfig
- class pytext.config.field_config.DocLabelConfig(**kwargs) [source]
Bases: pytext.config.pytext_config.ConfigBase
- export_output_names = ['doc_scores']
- label_weights = {}
- target_prob = False
- class pytext.config.field_config.EmbedInitStrategy [source]
Bases: enum.Enum
An enumeration.
- RANDOM = 'random'
- ZERO = 'zero'
- class pytext.config.field_config.FeatureConfig(**kwargs) [source]
Bases: pytext.config.module_config.Module.Config
- char_feat = None
- contextual_token_embedding = None
- dense_feat = None
- dict_feat = None
- seq_word_feat = None
- word_feat = <pytext.config.field_config.WordFeatConfig object>
- class pytext.config.field_config.FloatVectorConfig(**kwargs) [source]
Bases: pytext.config.pytext_config.ConfigBase
- dim = 0
- dim_error_check = False
- export_input_names = ['float_vec_vals']
- class pytext.config.field_config.Target [source]
Bases: object
- DOC_LABEL = 'doc_label'
- TARGET_LABEL_FIELD = 'target_label'
- TARGET_LOGITS_FIELD = 'target_logit'
- TARGET_PROB_FIELD = 'target_prob'
- class pytext.config.field_config.WordLabelConfig(**kwargs) [source]
Bases: pytext.config.pytext_config.ConfigBase
- export_output_names = ['word_scores']
- use_bio_labels = False
pytext.config.module_config module¶
- class pytext.config.module_config.Activation [source]
Bases: enum.Enum
An enumeration.
- GELU = 'gelu'
- GLU = 'glu'
- LEAKYRELU = 'leakyrelu'
- RELU = 'relu'
- TANH = 'tanh'
- class pytext.config.module_config.CNNParams(**kwargs) [source]
Bases: pytext.config.pytext_config.ConfigBase
- causal = False
- dilated = False
- kernel_num = 100
- kernel_sizes = [3, 4]
- weight_norm = False
- class pytext.config.module_config.ExporterType [source]
Bases: enum.Enum
An enumeration.
- INIT_PREDICT = 'init_predict'
- PREDICTOR = 'predictor'
- class pytext.config.module_config.PerplexityType [source]
Bases: enum.Enum
An enumeration.
- EOS = 'eos'
- MAX = 'max'
- MEAN = 'mean'
- MEDIAN = 'median'
- MIN = 'min'
pytext.config.pair_classification module¶
- class pytext.config.pair_classification.ExtraField [source]
Bases: object
- UTTERANCE_PAIR = 'utterance'
pytext.config.pytext_config module¶
- class pytext.config.pytext_config.LogitsConfig(**kwargs) [source]
Bases: pytext.config.pytext_config.TestConfig
- dump_raw_input = False
- class pytext.config.pytext_config.PyTextConfig(**kwargs) [source]
Bases: pytext.config.pytext_config.ConfigBase
- auto_resume_from_snapshot = False
- debug_path = '/tmp/model.debug'
- distributed_world_size = 1
- export_caffe2_path = None
- export_onnx_path = '/tmp/model.onnx'
- export_torchscript_path = None
- gpu_streams_for_distributed_training = 1
- include_dirs = None
- load_snapshot_path = ''
- modules_save_dir = ''
- random_seed = None
  Seed value to seed torch, python, and numpy random generators.
- report_eval_results = False
- save_all_checkpoints = False
- save_module_checkpoints = False
- save_snapshot_path = '/tmp/model.pt'
- test_out_path = '/tmp/test_out.txt'
- torchscript_quantize = False
- use_config_from_snapshot = True
- use_cuda_for_testing = True
- use_cuda_if_available = True
- use_deterministic_cudnn = False
  Whether to allow CuDNN to behave deterministically.
- use_fp16 = False
- use_tensorboard = True
- class pytext.config.pytext_config.TestConfig(**kwargs) [source]
Bases: pytext.config.pytext_config.ConfigBase
- field_names = None
  Field names for the TSV. If this is not set, the first line of each file will be assumed to be a header containing the field names.
- test_out_path = ''
- test_path = 'test.tsv'
- use_cuda_if_available = True
- use_tensorboard = True
pytext.config.query_document_pairwise_ranking module¶
- class pytext.config.query_document_pairwise_ranking.ModelInput [source]
Bases: object
- NEG_RESPONSE = 'neg_response'
- POS_RESPONSE = 'pos_response'
- QUERY = 'query'
- class pytext.config.query_document_pairwise_ranking.ModelInputConfig(**kwargs) [source]
Bases: pytext.config.module_config.Module.Config
- neg_response = <pytext.config.field_config.WordFeatConfig object>
- pos_response = <pytext.config.field_config.WordFeatConfig object>
- query = <pytext.config.field_config.WordFeatConfig object>
pytext.config.serialize module¶
pytext.config.utils module¶
Module contents¶
pytext.data package¶
Subpackages¶
pytext.data.data_structures package¶
- class pytext.data.data_structures.annotation.Annotation(annotation_string: str, utterance: str = '', brackets: str = '[]', combination_labels: bool = True, add_dict_feat: bool = False, accept_flat_intents_slots: bool = False) [source]
Bases: object
- class pytext.data.data_structures.annotation.Node_Info(node) [source]
Bases: object
This class extracts the essential information for a node, for use in rules.
- class pytext.data.data_structures.annotation.Token_Info(node) [source]
Bases: object
This class extracts the essential information for a token, for use in rules.
- class pytext.data.data_structures.annotation.Tree(root: pytext.data.data_structures.annotation.Root, combination_labels: bool, utterance: str = '', validate_tree: bool = True) [source]
Bases: object
- lotv_str() [source]
  LOTV (Limited Output Token Vocabulary): we map the terminal tokens in the input to a constant output token (SEQLOGICAL_LOTV_TOKEN) to make the parsing task easier for models where decoding is decoupled from the input (e.g. seq2seq). This way, the model can focus on learning to predict the parse tree rather than waste effort learning to replicate terminal tokens.
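A hedged usage sketch of the annotation classes above; the seqlogical string is a made-up TOP-style example, and exposing the parsed tree as Annotation.tree is an assumption about the API rather than something stated in this reference.

# Sketch only: parse a seqlogical annotation and print its LOTV form.
from pytext.data.data_structures.annotation import Annotation

anno = Annotation("[IN:PLAY_MUSIC play [SL:GENRE jazz ] ]")
print(anno.tree.lotv_str())  # terminal tokens replaced by a constant token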
- class pytext.data.data_structures.node.Node(label: str, span: pytext.data.data_structures.node.Span, children: Optional[AbstractSet[Node]] = None, text: str = None) [source]
Bases: object
Node in an intent-slot tree, representing either an intent or a slot.
- label: Label of the node. Type: str
- span: Span of the node. Type: Span
- children
- text
pytext.data.featurizer package¶
- class pytext.data.featurizer.featurizer.Featurizer(config, feature_config: pytext.config.field_config.FeatureConfig) [source]
Bases: pytext.config.component.Component
A Featurizer is tasked with performing data preprocessing that should be shared between training and inference, namely tokenization and gazetteer feature alignment.
This is an interface whose featurize() method must be implemented so that the implementation can be used with the appropriate data handler.
- featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord [source]
- class pytext.data.featurizer.featurizer.InputRecord [source]
Bases: tuple
Input data contract between Featurizer and DataHandler.
- locale: Alias for field number 2
- raw_gazetteer_feats: Alias for field number 1
- raw_text: Alias for field number 0
- class pytext.data.featurizer.featurizer.OutputRecord [source]
Bases: tuple
Output data contract between Featurizer and DataHandler.
- characters: Alias for field number 5
- contextual_token_embedding: Alias for field number 6
- dense_feats: Alias for field number 7
- gazetteer_feat_lengths: Alias for field number 3
- gazetteer_feat_weights: Alias for field number 4
- gazetteer_feats: Alias for field number 2
- token_ranges: Alias for field number 1
- tokens: Alias for field number 0
- class pytext.data.featurizer.simple_featurizer.SimpleFeaturizer(config, feature_config: pytext.config.field_config.FeatureConfig) [source]
Bases: pytext.data.featurizer.featurizer.Featurizer
Simple featurizer for basic tokenization and gazetteer feature alignment.
- featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord [source]
  Featurize one instance/example only.
- featurize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord] [source]
  Featurize a batch of instances/examples.
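A hedged usage sketch of SimpleFeaturizer based on the constructor and featurize() signatures above; the SimpleFeaturizer.Config defaults and the keyword form of InputRecord are assumptions.

# Sketch only: tokenize a single utterance with SimpleFeaturizer.
from pytext.config.field_config import FeatureConfig
from pytext.data.featurizer import InputRecord, SimpleFeaturizer

featurizer = SimpleFeaturizer(SimpleFeaturizer.Config(), FeatureConfig())
output = featurizer.featurize(InputRecord(raw_text="Play some jazz"))
print(output.tokens)  # e.g. ['play', 'some', 'jazz'] if lowercasing is enabled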
- class pytext.data.featurizer.Featurizer(config, feature_config: pytext.config.field_config.FeatureConfig) [source]
Bases: pytext.config.component.Component
A Featurizer is tasked with performing data preprocessing that should be shared between training and inference, namely tokenization and gazetteer feature alignment.
This is an interface whose featurize() method must be implemented so that the implementation can be used with the appropriate data handler.
- featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord [source]
- class pytext.data.featurizer.InputRecord [source]
Bases: tuple
Input data contract between Featurizer and DataHandler.
- locale: Alias for field number 2
- raw_gazetteer_feats: Alias for field number 1
- raw_text: Alias for field number 0
- class pytext.data.featurizer.OutputRecord [source]
Bases: tuple
Output data contract between Featurizer and DataHandler.
- characters: Alias for field number 5
- contextual_token_embedding: Alias for field number 6
- dense_feats: Alias for field number 7
- gazetteer_feat_lengths: Alias for field number 3
- gazetteer_feat_weights: Alias for field number 4
- gazetteer_feats: Alias for field number 2
- token_ranges: Alias for field number 1
- tokens: Alias for field number 0
- class pytext.data.featurizer.SimpleFeaturizer(config, feature_config: pytext.config.field_config.FeatureConfig) [source]
Bases: pytext.data.featurizer.featurizer.Featurizer
Simple featurizer for basic tokenization and gazetteer feature alignment.
- featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord [source]
  Featurize one instance/example only.
- featurize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord] [source]
  Featurize a batch of instances/examples.
pytext.data.sources package¶
- class pytext.data.sources.conllu.CoNLLUNERDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs) [source]
Bases: pytext.data.sources.conllu.CoNLLUPOSDataSource
Reads empty-line-separated data (word label). This data source supports datasets for NER tasks.
- class pytext.data.sources.conllu.CoNLLUPOSDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs) [source]
Bases: pytext.data.sources.data_source.RootDataSource
DataSource which loads data from a CoNLL-U file.
- classmethod from_config(config: pytext.data.sources.conllu.CoNLLUPOSDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs) [source]
- raw_eval_data_generator() [source]
  Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
- class pytext.data.sources.data_source.DataSource(schema: Dict[str, Type[CT_co]]) [source]
Bases: pytext.config.component.Component
Data sources are simple components that stream data from somewhere using Python's iteration interface. A data source should expose three iterators, "train", "test", and "eval". Each of these should be able to be iterated over any number of times, and iterating over it should yield dictionaries whose values are deserialized Python types.
Simply put, these data sources exist as an interface to read through datasets in a pythonic way, with pythonic types, abstracting away the form in which they are stored.
- class pytext.data.sources.data_source.GeneratorIterator(generator, *args, **kwargs) [source]
Bases: object
Create an object which can be iterated over multiple times from a generator call. Each iteration will call the generator and allow iterating over it. This is unsafe to use on generators which have side effects, such as file readers; it is up to the callers to safely manage these scenarios.
- class pytext.data.sources.data_source.GeneratorMethodProperty(generator) [source]
Bases: object
Identify a generator method as a property. This allows instances to iterate over the property multiple times without consuming the generator. It accomplishes this by wrapping the generator and creating multiple generator instances if iterated over multiple times.
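A hedged sketch of the behaviour described above, assuming generator_property (listed at the end of this module) is the decorator form of GeneratorMethodProperty:

# Sketch only: a generator method that can safely be iterated more than once.
from pytext.data.sources.data_source import generator_property

class Numbers:
    @generator_property
    def evens(self):
        for i in range(0, 10, 2):
            yield i

nums = Numbers()
assert list(nums.evens) == [0, 2, 4, 6, 8]
assert list(nums.evens) == [0, 2, 4, 6, 8]  # a fresh generator each time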
- class pytext.data.sources.data_source.RawExample [source]
Bases: dict
A wrapper class for a single example row with a dict interface. This is here for any logic we want row objects to have that dicts don't do.
- class pytext.data.sources.data_source.RootDataSource(schema: Dict[str, Type[CT_co]], column_mapping: Dict[str, str] = ()) [source]
Bases: pytext.data.sources.data_source.DataSource
A data source which actually loads data from a location. This data source needs to be responsible for converting types based on a schema, because it should be the only part of the system that actually needs to understand details about the underlying storage system.
RootDataSource presents a simpler abstraction than DataSource, where the rows are automatically converted to the right data types.
A RootDataSource should implement raw_train_data_generator, raw_test_data_generator, and raw_eval_data_generator. These functions should yield dictionaries of raw objects which the loading system can convert using the schema loading functions. A minimal sketch of such a subclass follows this entry.
- DATA_SOURCE_TYPES = {<class 'str'>: <function load_text>, typing.Any: <function load_text>, typing.List[pytext.utils.data.Slot]: <function load_slots>, typing.List[int]: <function load_json>, typing.List[str]: <function load_json>, typing.List[typing.Dict[str, typing.Dict[str, float]]]: <function load_json>, typing.List[float]: <function load_float_list>, ~JSONString: <function load_json_string>, <class 'float'>: <function load_float>, <class 'int'>: <function load_int>}
- raw_eval_data_generator() [source]
  Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
- raw_test_data_generator() [source]
  Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
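A minimal sketch of the RootDataSource contract described above; the class, rows, and schema are invented for illustration, and the train iterator name follows the DataSource description earlier in this listing.

# Sketch only: an in-memory RootDataSource that serves a fixed list of rows.
from pytext.data.sources.data_source import RootDataSource

class InMemoryDataSource(RootDataSource):
    """Yields raw {"text": ..., "label": ...} rows held in memory."""

    def __init__(self, rows, schema, **kwargs):
        super().__init__(schema=schema, **kwargs)
        self.rows = rows

    def raw_train_data_generator(self):
        # Yield raw values; RootDataSource converts them using the schema.
        yield from self.rows

    def raw_test_data_generator(self):
        yield from self.rows

    def raw_eval_data_generator(self):
        yield from self.rows

source = InMemoryDataSource(
    rows=[{"text": "play some jazz", "label": "music"}],
    schema={"text": str, "label": str},
)
for row in source.train:
    print(row["text"], row["label"])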
-
class
pytext.data.sources.data_source.
RowShardedDataSource
(data_source: pytext.data.sources.data_source.DataSource, rank=0, world_size=1)[source]¶ Bases:
pytext.data.sources.data_source.ShardedDataSource
Shards a given datasource by row.
-
class
pytext.data.sources.data_source.
SafeFileWrapper
(*args, **kwargs)[source]¶ Bases:
object
A simple wrapper class for files which allows file descriptors to be managed with normal Python ref counts. Without using this, if you create a file in a from_config you will see a warning along the lines of “ResourceWarning: self._file is acquired but not always released”; this is because we’re opening a file outside of a context manager (with statement). We want to do it this way because it lets us pass a file object to the DataSource, rather than a filename. This exposes a lot more flexibility and testability; passing filenames is one of the paths towards pain.
However, we don’t have a clear resource management system set up for configuration. from_config functions are the tool that we have to allow objects to specify how they should be created from a configuration, which generally should only happen from the command line, whereas in eg. a notebook you should build the objects with constructors directly. If building from constructors, you can just open a file and pass it, but from_config here needs to create a file object from a configured filename. Python files don’t close automatically, so you also need a system that will close them when the python interpreter shuts down. If you don’t, it will print a resource warning at runtime, as the interpreter manually closes the filehandles (although modern OSs are pretty okay with having open file handles, it’s hard for me to justify exactly why Python is so strict about this; I think one of the main reasons you might actually care is if you have a writeable file handle it might not have flushed properly when the C runtime exits, but Python doesn’t actually distinguish between writeable and non-writeable file handles).
This class is a wrapper that creates a system for (sort-of) safely closing the file handles before the runtime exits. It does this by closing the file when the object’s deleter is called. Although the python standard doesn’t actually make any guarantees about when deleters are called, CPython is reference counted and so as an implementation detail will call a deleter whenever the last reference to it is removed, which generally will happen to all objects created during program execution as long as there aren’t reference cycles (I don’t actually know off-hand whether the cycle collection is run before shutdown, and anyway the cycles would have to include objects that the runtime itself maintains pointers to, which seems like you’d have to work hard to do and wouldn’t do accidentally). This isn’t true for other python systems like PyPy or Jython which use generational garbage collection and so don’t actually always call destructors before the system shuts down, but again this is only really relevant for mutable files.
An alternative implementation would be to build a resource management system into PyText, something like a function that we use for opening system resources that registers the resources and then we make sure are all closed before system shutdown. That would probably technically be the right solution, but I didn’t really think of that first and also it’s a bit longer to implement.
If you are seeing resource warnings on your system, please file a github issue.
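A hedged usage sketch, assuming SafeFileWrapper forwards its arguments to open() (the file name and column names are hypothetical): it lets from_config-style code hand an open file object to a DataSource while still having the descriptor closed once the last reference goes away.
from pytext.data.sources.data_source import SafeFileWrapper
from pytext.data.sources.tsv import TSVDataSource

# Wrap the file so it gets closed when the last reference is dropped.
train_file = SafeFileWrapper("train.tsv")
source = TSVDataSource(
    train_file=train_file,
    field_names=["label", "text"],
    schema={"label": str, "text": str},
)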
-
class
pytext.data.sources.data_source.
ShardedDataSource
(schema: Dict[str, Type[CT_co]])[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Base class for sharded data sources.
-
pytext.data.sources.data_source.
generator_property
¶ alias of
pytext.data.sources.data_source.GeneratorMethodProperty
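A minimal sketch, assuming generator_property is used as a method decorator (as the GeneratorMethodProperty description above suggests); the class and data are hypothetical. The decorated method can be iterated repeatedly without consuming a single underlying generator.
from pytext.data.sources.data_source import generator_property


class TinySource:  # hypothetical example class
    def __init__(self, rows):
        self._rows = rows

    @generator_property
    def train(self):
        for row in self._rows:
            yield row


source = TinySource([{"text": "hi"}, {"text": "bye"}])
# Each iteration creates a fresh generator instance, so this is safe:
assert list(source.train) == list(source.train)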
-
class
pytext.data.sources.pandas.
PandasDataSource
(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from a pandas DataFrame.
- Inputs:
train_df: DataFrame for training
eval_df: DataFrame for evaluation
test_df: DataFrame for test
schema: same as the base DataSource; defines the list of output values with their types
column_mapping: maps the column names in the DataFrame to the names defined in the schema
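A hedged usage sketch of PandasDataSource, following the documented constructor; the DataFrame contents and column names are hypothetical.
import pandas as pd

from pytext.data.sources.pandas import PandasDataSource

train_df = pd.DataFrame(
    {"text": ["set an alarm", "play a song"],
     "label": ["alarm/set_alarm", "music/play"]}
)
source = PandasDataSource(
    train_df=train_df,
    schema={"text": str, "label": str},
)
for row in source.train:  # yields one dict per DataFrame row
    print(row["text"], row["label"])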
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
class
pytext.data.sources.pandas.
SessionPandasDataSource
(schema: Dict[str, Type[CT_co]], id_col: str, train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, column_mapping: Dict[str, str] = ())[source]¶ Bases:
pytext.data.sources.pandas.PandasDataSource
,pytext.data.sources.session.SessionDataSource
-
class
pytext.data.sources.session.
SessionDataSource
(id_col, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
Data source for session-based data. The input data is organized in sessions, and each session may have multiple rows. The first column is always the session id. Raw input rows are consolidated by session id and returned as one session per example.
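A conceptual sketch (column names hypothetical) of how session data is organized: raw rows that share the same session id in the first column are consolidated into a single example per session. The exact consolidated representation shown in the comments is an assumption, used only to illustrate the grouping.
raw_rows = [
    {"session_id": "s1", "text": "hi there", "label": "greeting"},
    {"session_id": "s1", "text": "bye", "label": "farewell"},
    {"session_id": "s2", "text": "help me", "label": "request"},
]
# After consolidation by session id, roughly:
#   {"session_id": "s1", "text": ["hi there", "bye"], "label": ["greeting", "farewell"]}
#   {"session_id": "s2", "text": ["help me"], "label": ["request"]}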
-
class
pytext.data.sources.squad.
SquadDataSource
(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Download data from https://rajpurkar.github.io/SQuAD-explorer/. Will return tuples of (doc, question, answer, answer_start, has_answer).
-
DEFAULT_SCHEMA
= {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}¶
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
classmethod
from_config
(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.squad.
SquadDataSourceForKD
(**kwargs)[source]¶ Bases:
pytext.data.sources.squad.SquadDataSource
Squad-like data along with soft labels (logits). Will return tuples of ( doc, question, answer, answer_start, has_answer, start_logits, end_logits, has_answer_logits, pad_mask, segment_labels )
-
pytext.data.sources.squad.
process_squad
(fname, ignore_impossible, max_character_length, min_overlap=0.1, delimiter='\t', quoted=False, is_kd=False)[source]¶
-
pytext.data.sources.squad.
process_squad_json
(fname, ignore_impossible, max_character_length, min_overlap)[source]¶
-
class
pytext.data.sources.tsv.
BlockShardedTSV
(file, field_names=None, delimiter='\t', quoted=False, block_id=0, num_blocks=1, drop_incomplete_rows=False)[source]¶ Bases:
object
Take a TSV file, split into N pieces (by byte location) and return an iterator on one of the pieces. The pieces are equal by byte size, not by number of rows. Thus, care needs to be taken when using this for distributed training, otherwise number of batches for different workers might be different.
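A hedged sketch, following the documented BlockShardedTSV constructor: split one TSV file into four byte-ranged blocks and iterate over the second block. The file path and field names are hypothetical.
from pytext.data.sources.tsv import BlockShardedTSV

with open("train.tsv") as f:
    block = BlockShardedTSV(f, field_names=["label", "text"], block_id=1, num_blocks=4)
    for row in block:
        ...  # presumably a dict keyed by field_names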
-
class
pytext.data.sources.tsv.
BlockShardedTSVDataSource
(rank=0, world_size=1, **kwargs)[source]¶ Bases:
pytext.data.sources.tsv.TSVDataSource
,pytext.data.sources.data_source.ShardedDataSource
-
train_unsharded
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.tsv.
MultilingualTSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', data_source_languages={'eval': ['en'], 'test': ['en'], 'train': ['en']}, language_columns=['language'], **kwargs)[source]¶ Bases:
pytext.data.sources.tsv.TSVDataSource
Data Source for multi-lingual data. The input data can have multiple text fields and each field can either have the same language or different languages. The data_source_languages dict contains the language information for each text field and this should match the number of language identifiers specified in language_columns.
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.tsv.
SessionTSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, **kwargs)[source]¶ Bases:
pytext.data.sources.tsv.TSVDataSource
,pytext.data.sources.session.SessionDataSource
-
class
pytext.data.sources.tsv.
TSV
(file, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False)[source]¶ Bases:
object
-
class
pytext.data.sources.tsv.
TSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from TSV sources. Uses python’s csv library.
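A hedged usage sketch of TSVDataSource (file and column names are hypothetical; in a config-driven run the files would normally come from from_config instead).
from pytext.data.sources.tsv import TSVDataSource

source = TSVDataSource(
    train_file=open("train.tsv"),
    eval_file=open("eval.tsv"),
    test_file=open("test.tsv"),
    field_names=["label", "text"],
    schema={"label": str, "text": str},
)
# "train", "eval" and "test" can each be iterated any number of times.
for row in source.train:
    print(row["label"], row["text"])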
-
classmethod
from_config
(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
classmethod
-
class
pytext.data.sources.
DataSource
(schema: Dict[str, Type[CT_co]])[source]¶ Bases:
pytext.config.component.Component
Data sources are simple components that stream data from somewhere using Python’s iteration interface. A data source should expose three iterators, “train”, “test”, and “eval”. Each of these should be iterable any number of times, and iterating over it should yield dictionaries whose values are deserialized python types.
Put simply, these data sources exist as an interface to read through datasets in a pythonic way, with pythonic types, and to abstract away the form in which they are stored.
-
class
pytext.data.sources.
RawExample
[source]¶ Bases:
dict
A wrapper class for a single example row with a dict interface. This is here for any logic we want row objects to have that dicts don’t do.
-
class
pytext.data.sources.
SquadDataSource
(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Download data from https://rajpurkar.github.io/SQuAD-explorer/. Will return tuples of (doc, question, answer, answer_start, has_answer).
-
DEFAULT_SCHEMA
= {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}¶
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
classmethod
from_config
(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.
TSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from TSV sources. Uses python’s csv library.
-
classmethod
from_config
(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
classmethod
-
class
pytext.data.sources.
PandasDataSource
(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from a pandas DataFrame.
- Inputs:
train_df: DataFrame for training
eval_df: DataFrame for evaluation
test_df: DataFrame for test
schema: same as the base DataSource; defines the list of output values with their types
column_mapping: maps the column names in the DataFrame to the names defined in the schema
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
class
pytext.data.sources.
CoNLLUNERDataSource
(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]¶ Bases:
pytext.data.sources.conllu.CoNLLUPOSDataSource
Reads empty-line-separated data (word label). This data source supports datasets for NER tasks.
pytext.data.test package¶
-
class
pytext.data.test.data_test.
BatcherTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
BERTTensorizerTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
ListTensorizersTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
LookupTokensTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
RobertaTensorizerTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
SquadForBERTTensorizerTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
SquadForRobertaTensorizerTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tensorizers_test.
SquadTensorizerTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tokenizers_test.
GPT2BPETest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tsv_data_source_test.
BlockShardedTSVDataSourceTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tsv_data_source_test.
SessionTSVDataSourceTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
class
pytext.data.test.tsv_data_source_test.
TSVDataSourceTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
test_bad_quoting
()[source]¶ The text column of the first row of this file opens a quote but does not close it.
-
-
class
pytext.data.test.utils_test.
PaddingTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
pytext.data.tokenizers package¶
-
class
pytext.data.tokenizers.tokenizer.
BERTInitialTokenizer
(basic_tokenizer)[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Basic initial tokenization for BERT. This is run prior to word piece; it does whitespace tokenization, in addition to lower-casing and accent removal if specified.
-
class
pytext.data.tokenizers.tokenizer.
CppProcessorMixin
[source]¶ Bases:
object
Cpp processors like SentencePiece don’t pickle well; reload them.
-
class
pytext.data.tokenizers.tokenizer.
DoNothingTokenizer
[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where the tokenizer has been run beforehand.
-
class
pytext.data.tokenizers.tokenizer.
GPT2BPETokenizer
(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder)[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer for gpt-2 and RoBERTa.
-
class
pytext.data.tokenizers.tokenizer.
PickleableGPT2BPEEncoder
(encoder, bpe_merges, errors='replace')[source]¶ Bases:
fairseq.data.encoders.gpt2_bpe_utils.Encoder
Fairseq’s encoder stores the regex module as a local reference on its encoders, which means they can’t be saved via pickle.dumps or torch.save. This subclass modifies the save/load logic so that it doesn’t store the module, and restores the reference after re-inflating.
-
class
pytext.data.tokenizers.tokenizer.
SentencePieceTokenizer
(sp_model_path: str = '')[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
,pytext.data.tokenizers.tokenizer.CppProcessorMixin
Sentence piece tokenizer.
-
class
pytext.data.tokenizers.tokenizer.
Token
(value, start, end)[source]¶ Bases:
tuple
-
end
¶ Alias for field number 2
-
start
¶ Alias for field number 1
-
value
¶ Alias for field number 0
-
-
class
pytext.data.tokenizers.tokenizer.
Tokenizer
(split_regex='\s+', lowercase=True)[source]¶ Bases:
pytext.config.component.Component
A simple regex-splitting tokenizer.
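A hedged usage sketch of the regex-splitting Tokenizer; a tokenize() method returning the Token(value, start, end) tuples documented above is assumed.
from pytext.data.tokenizers import Tokenizer

tokenizer = Tokenizer(split_regex=r"\s+", lowercase=True)
tokens = tokenizer.tokenize("Order coffee from Starbucks")
# Each token carries its string value plus start/end character offsets.
print([(t.value, t.start, t.end) for t in tokens])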
-
class
pytext.data.tokenizers.tokenizer.
WordPieceTokenizer
(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Word piece tokenizer for BERT models.
-
class
pytext.data.tokenizers.
GPT2BPETokenizer
(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder)[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer for gpt-2 and RoBERTa.
-
class
pytext.data.tokenizers.
Token
(value, start, end)[source]¶ Bases:
tuple
-
end
¶ Alias for field number 2
-
start
¶ Alias for field number 1
-
value
¶ Alias for field number 0
-
-
class
pytext.data.tokenizers.
Tokenizer
(split_regex='\s+', lowercase=True)[source]¶ Bases:
pytext.config.component.Component
A simple regex-splitting tokenizer.
-
class
pytext.data.tokenizers.
DoNothingTokenizer
[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where the tokenizer has been run beforehand.
-
class
pytext.data.tokenizers.
WordPieceTokenizer
(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
Word piece tokenizer for BERT models.
-
class
pytext.data.tokenizers.
CppProcessorMixin
[source]¶ Bases:
object
Cpp processors like SentencePiece don’t pickle well; reload them.
-
class
pytext.data.tokenizers.
SentencePieceTokenizer
(sp_model_path: str = '')[source]¶ Bases:
pytext.data.tokenizers.tokenizer.Tokenizer
,pytext.data.tokenizers.tokenizer.CppProcessorMixin
Sentence piece tokenizer.
Submodules¶
pytext.data.batch_sampler module¶
-
class
pytext.data.batch_sampler.
AlternatingRandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.RandomizedBatchSampler
This sampler takes in a dictionary of iterators and returns batches, alternating between keys with the probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.
-
class
pytext.data.batch_sampler.
EvalBatchSampler
[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
Output: [A, B, C, D, a, b]
-
class
pytext.data.batch_sampler.
RandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of iterators and returns batches according to the probabilities specified by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.
Example
Iterator A: [A, B, C, D], Iterator B: [a, b]
batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1} Epoch 1 = [a, b, a] Epoch 2 = [b, a, b]
Parameters: unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
-
class
pytext.data.batch_sampler.
RoundRobinBatchSampler
(iter_to_set_epoch: Optional[str] = None)[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes a dictionary of Iterators and returns batches in a round-robin fashion until the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.
If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.
If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]
iter_to_set_epoch = None Output: [A, a, B, b]
Parameters: iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
-
pytext.data.batch_sampler.
extract_iterator_properties
(input_iterator_probs: Dict[str, float])[source]¶ Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to generate iterator properties: iterator_names and iterator_probs.
-
pytext.data.batch_sampler.
select_key_and_batch
(iterator_names: Dict[str, str], iterator_probs: Dict[str, float], iter_dict: Dict[str, collections.abc.Iterator], iterators: Dict[str, collections.abc.Iterator])[source]¶ Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to select a key from iterator_names using iterator_probs and return a batch for the selected key using iter_dict and iterators.
pytext.data.bert_tensorizer module¶
-
class
pytext.data.bert_tensorizer.
BERTTensorizer
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBase
Tensorizer for BERT tasks. Works for single sentence, sentence pair, triples etc.
-
classmethod
from_config
(config: pytext.data.bert_tensorizer.BERTTensorizer.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).
-
classmethod
-
class
pytext.data.bert_tensorizer.
BERTTensorizerBase
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, base_tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Base Tensorizer class for all BERT style models including XLM, RoBERTa and XLM-R.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(vocab_builder=None, from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...
See WordTokenizer.initialize for a more concrete example.
-
numberize
(row: Dict[KT, VT]) → Tuple[Any, ...][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
-
tensorize
(batch) → Tuple[torch.Tensor, ...][source]¶ Convert instance level vectors into batch level tensors.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.bert_tensorizer.
BERTTensorizerBaseScriptImpl
(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]¶ Bases:
pytext.data.tensorizers.TensorizerScriptImpl
-
forward
(texts: Optional[List[List[str]]] = None, pre_tokenized: Optional[List[List[List[str]]]] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Wire up the tokenize(), numberize() and tensorize() functions for data processing. When exported to TorchScript, the wrapper module should choose whether to use texts or pre_tokenized based on the TorchScript tokenizer implementation (e.g. whether an external tokenizer such as Yoda is used or not).
-
numberize
(per_sentence_tokens: List[List[Tuple[str, int, int]]]) → Tuple[List[int], List[int], int, List[int]][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
Parameters: per_sentence_tokens – a list of tokens per sentence in one row; each token is represented by the token string and its start and end indices.
Returns: tokens: List[int], the token ids of all sentences concatenated; segment_labels: List[int], denotes which sentence each token belongs to; seq_len: int, the number of tokens; positions: List[int], token positions.
-
tensorize
(tokens_2d: List[List[int]], segment_labels_2d: List[List[int]], seq_lens_1d: List[int], positions_2d: List[List[int]]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Convert instance level vectors into batch level tensors.
-
tokenize
(row_text: Optional[List[str]], row_pre_tokenized: Optional[List[List[str]]]) → List[List[Tuple[str, int, int]]][source]¶ This function converts raw inputs into tokens; each token is represented by the token (str) and its start and end indices in the raw inputs. There are two possible inputs to this function, depending on whether the tokenizer is implemented in TorchScript or not.
Case 1: the tokenizer has a full TorchScript implementation; the input will be a list of sentences (in most cases a single sentence or a pair).
Case 2: the tokenizer has a partial or no TorchScript implementation; in most cases the tokenizer will be hosted in Yoda, and the input will be a list of pre-processed tokens.
Returns: tokens per sentence, each token represented by the token (str) and its start and end indices. Return type: per_sentence_tokens
-
-
class
pytext.data.bert_tensorizer.
BERTTensorizerScriptImpl
(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl
-
pytext.data.bert_tensorizer.
build_fairseq_vocab
(vocab_file: str, dictionary_class: fairseq.data.dictionary.Dictionary = <class 'fairseq.data.dictionary.Dictionary'>, special_token_replacements: Dict[str, pytext.data.utils.SpecialToken] = None, max_vocab: int = -1, min_count: int = -1) → pytext.data.utils.Vocabulary[source]¶ Function builds a PyText vocabulary for models pre-trained using Fairseq modules. The dictionary class can take any Fairseq Dictionary class and is used to load the vocab file.
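A hedged usage sketch of build_fairseq_vocab, following the documented signature; the vocab file path is hypothetical.
from pytext.data.bert_tensorizer import build_fairseq_vocab

# Build a PyText Vocabulary from a Fairseq-format dictionary file,
# keeping at most 50,000 entries with count >= 1.
vocab = build_fairseq_vocab("fairseq-dict.txt", max_vocab=50000, min_count=1)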
pytext.data.data module¶
-
class
pytext.data.data.
BatchData
(raw_data, numberized)[source]¶ Bases:
tuple
-
numberized
¶ Alias for field number 1
-
raw_data
¶ Alias for field number 0
-
-
class
pytext.data.data.
Batcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]¶ Bases:
pytext.config.component.Component
Batcher designed to batch rows of data, before padding.
-
class
pytext.data.data.
Data
(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]¶ Bases:
pytext.config.component.Component
Data is an abstraction that handles all of the following:
- Initialize model metadata parameters
- Create batches of tensors for model training or prediction
It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.
The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.
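A hedged sketch, assuming the documented constructors (file and column names are hypothetical), of wiring a DataSource, a tensorizers dict and a Batcher together through Data.
from pytext.common.constants import Stage
from pytext.data.data import Batcher, Data
from pytext.data.sources.tsv import TSVDataSource
from pytext.data.tensorizers import LabelTensorizer

source = TSVDataSource(
    train_file=open("train.tsv"),
    field_names=["label", "text"],
    schema={"label": str, "text": str},
)
tensorizers = {"labels": LabelTensorizer(label_column="label")}
# Data initializes the tensorizers from the training data by default
# (init_tensorizers=True); batches() then yields dicts keyed like the tensorizers dict.
data = Data(source, tensorizers, batcher=Batcher(train_batch_size=16))
for batch in data.batches(Stage.TRAIN):
    ...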
-
batches
(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]¶ Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, i.e. the keys will be the same, and the tensors will be of the shape expected from the respective tensorizers.
stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data_source; this is to allow setting a different data_source for testing a model.
Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.
-
class
pytext.data.data.
PoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1)[source]¶ Bases:
pytext.data.data.Batcher
Batcher that loads a pool of data, sorts it, and batches it.
Shuffling is performed before pooling, by loading num_shuffled_pools worth of data, shuffling, and then splitting that up into pools.
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
-
class
pytext.data.data.
RowData
(raw_data, numberized)[source]¶ Bases:
tuple
-
numberized
¶ Alias for field number 1
-
raw_data
¶ Alias for field number 0
-
-
pytext.data.data.
generator_iterator
(fn)[source]¶ Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times, by recording the call arguments and calling the generator with them anew each time __iter__ is called on the returned object.
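A minimal sketch, assuming generator_iterator is used as a decorator (consistent with its description): the returned object re-invokes the generator with the recorded arguments each time __iter__ is called, so it can be iterated more than once.
from pytext.data.data import generator_iterator


@generator_iterator
def numbers(n):
    for i in range(n):
        yield i


it = numbers(3)
assert list(it) == [0, 1, 2]
assert list(it) == [0, 1, 2]  # iterating again re-calls the generator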
pytext.data.data_handler module¶
-
class
pytext.data.data_handler.
BatchIterator
(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]¶ Bases:
object
BatchIterator is a wrapper of TorchText.Iterator that provides the flexibility to map batched data to a tuple of (input, target, context) and handles additional steps such as dealing with distributed training.
Parameters: - batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and return a batch of data in __next__
- processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
- include_input (bool) – if input data should be returned, default is true
- include_target (bool) – if target data should be returned, default is true
- include_context (bool) – if context data should be returned, default is true
- is_train (bool) – if the batch data is for training
- num_batches (int) – total batches to generate; this param is for distributed training: due to a limitation in PyTorch’s distributed training backend that requires all the parallel workers to have the same number of batches, we work around it by adding dummy batches at the end
-
class
pytext.data.data_handler.
DataHandler
(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]¶ Bases:
pytext.config.component.Component
DataHandler is the central place to prepare data for model training/testing. The class is responsible for:
- Define the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which the input can be fed directly into the model.
- Initialize global context, such as building the vocab and loading pretrained embeddings. Store the context as metadata, and provide functions to serialize/deserialize the metadata.
The data processing pipeline contains the following steps:
- Read data from file into a list of raw data examples
- Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
- Invoke featurizer, which contains data processing steps to apply for both training and inference time, e.g: tokenization
- Use the raw data and results from featurizer to do any preprocessing
- Generate a TorchText.Dataset that contains the list of Example, the Dataset also has a list of TorchText.Field, which defines how to do padding and numericalization while batching data.
- Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch, _context_from_batch functions.
-
raw_columns
¶ columns to read from data source. The order should match the data stored in that file.
Type: List[str]
-
featurizer
¶ perform data preprocessing that should be shared between training and inference
Type: Featurizer
-
features
¶ a dict of name -> field used to process data as model input
Type: Dict[str, Field]
-
labels
¶ a dict of name -> field used to process data as training target
Type: Dict[str, Field]
-
extra_fields
¶ fields that process any extra data used neither as model input nor target. This is None by default
Type: Dict[str, Field]
-
text_feature_name
¶ name of the text field, used to define the default sort key of data
Type: str
-
shuffle
¶ if the dataset should be shuffled, true by default
Type: bool
-
sort_within_batch
¶ if data within same batch should be sorted, true by default
Type: bool
-
train_path
¶ path of training data file
Type: str
-
eval_path
¶ path of evaluation data file
Type: str
-
test_path
¶ path of test data file
Type: str
-
train_batch_size
¶ training batch size, 128 by default
Type: int
-
eval_batch_size
¶ evaluation batch size, 128 by default
Type: int
-
test_batch_size
¶ test batch size, 128 by default
Type: int
-
max_seq_len
¶ maximum length of tokens to keep in sequence
Type: int
-
pass_index
¶ if the original index of data in the batch should be passed along to downstream steps, default is true
Type: bool
-
gen_dataset
(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.data.dataset.Dataset[source]¶ Generate a torchtext Dataset from raw in-memory data. Returns: dataset (TorchText.Dataset)
-
gen_dataset_from_path
(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.data.dataset.Dataset[source]¶ Generate a dataset from a file. Returns: dataset (TorchText.Dataset)
-
get_test_iter_from_path
(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_test_iter_from_raw_data
(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_train_iter_from_path
(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶ Generate data batch iterator for training data. See _get_train_iter() for details
Parameters: - train_path (str) – file path of training data
- batch_size (int) – batch size
- rank (int) – used for distributed training; the rank of the current GPU, don’t set it to anything but 0 for non-distributed training
- world_size (int) – used for distributed training; the total number of GPUs
-
get_train_iter_from_raw_data
(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶
-
init_feature_metadata
(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]¶
-
init_metadata_from_path
(train_path, eval_path, test_path)[source]¶ Initialize metadata using data from file
-
init_target_metadata
(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]¶
-
load_metadata
(metadata: pytext.data.data_handler.CommonMetadata)[source]¶ Load previously saved metadata
-
load_vocab
(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]¶ Loads items into a set from a file containing one item per line. Items are added to the set from top of the file to bottom. So, the items in the file should be ordered by a preference (if any), e.g., it makes sense to order tokens in descending order of frequency in corpus.
Parameters: - vocab_file (str) – vocab file to load
- vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
- lowercase_tokens (bool) – if the tokens should be lowercased
-
preprocess
(data: Iterable[Dict[str, Any]])[source]¶ Preprocess the raw data to create TorchText.Example; this is the second step in the whole processing pipeline. Returns: data (Generator[Dict[str, Any]])
-
preprocess_row
(row_data: Dict[str, Any]) → Dict[str, Any][source]¶ Preprocessing steps for a single input row; subclasses should override it.
-
read_from_file
(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]¶ Read data from a csv file. The input file format is required to have tab-separated columns.
Parameters: - file_name (str) – csv file name
- columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
pytext.data.disjoint_multitask_data module¶
-
class
pytext.data.disjoint_multitask_data.
DisjointMultitaskData
(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]¶ Bases:
pytext.data.data.Data
Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskData.Config.
- data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_dict
¶ Data handlers to do roundrobin over.
Type: type
pytext.data.disjoint_multitask_data_handler module¶
-
class
pytext.data.disjoint_multitask_data_handler.
DisjointMultitaskDataHandler
(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]¶ Bases:
pytext.data.data_handler.DataHandler
Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
- data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
- target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_handlers
¶ Data handlers to do roundrobin over.
Type: type
-
target_task_name
¶ Used to select best epoch, and set batch_per_epoch.
Type: type
-
upsample
¶ If upsample is True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, and the ones which run out will sit idle. This is used for evaluation. Default True.
Type: bool
-
class
pytext.data.disjoint_multitask_data_handler.
RoundRobinBatchIterator
(iterators: Dict[str, pytext.data.data_handler.BatchIterator], upsample: bool = True, iter_to_set_epoch: Optional[str] = None)[source]¶ Bases:
pytext.data.data_handler.BatchIterator
We take a dictionary of BatchIterators and do round robin over them in a cycle. The following describes the behavior for one epoch, using the example:
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
- If upsample is True:
If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.
iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]
If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.
Output: [A, a, B, b]
- If upsample is False:
Iterate over batches from one epoch of each iterator, with the order among iterators uniformly shuffled.
Possible output: [a, A, B, C, b, D]
Parameters: - iterators (Dict[str, BatchIterator]) – Iterators to do roundrobin over.
- upsample (bool) – If upsample is True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, in random order. Evaluation will use upsample=False. Default True.
- iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If upsample is True and this is not set, epoch size defaults to the length of the shortest iterator. If upsample is False, this argument is not used.
-
iterators
¶ Iterators to do roundrobin over.
Type: Dict[str, BatchIterator]
-
upsample
¶ Whether to upsample iterators with fewer batches.
Type: bool
-
iter_to_set_epoch
¶ Name of iterator to define epoch size.
Type: str
pytext.data.dynamic_pooling_batcher module¶
-
class
pytext.data.dynamic_pooling_batcher.
BatcherSchedulerConfig
(**kwargs)[source]¶ Bases:
pytext.config.module_config.Module.Config
-
end_batch_size
= 256¶
-
epoch_period
= 10¶
-
start_batch_size
= 32¶
-
step_size
= 1¶
-
-
class
pytext.data.dynamic_pooling_batcher.
DynamicPoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]¶ Bases:
pytext.data.data.PoolingBatcher
Allows dynamic batch training; extends PoolingBatcher with a scheduler config, which specifies how the batch size should increase.
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
compute_dynamic_batch_size
(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]¶
-
-
class
pytext.data.dynamic_pooling_batcher.
ExponentialBatcherSchedulerConfig
(**kwargs)[source]¶ Bases:
pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig
-
gamma
= 5¶
-
-
class
pytext.data.dynamic_pooling_batcher.
ExponentialDynamicPoolingBatcher
(*args, **kwargs)[source]¶ Bases:
pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher
Exponential Dynamic Batch Scheduler: scales up batch size by a factor of gamma
-
class
pytext.data.dynamic_pooling_batcher.
LinearDynamicPoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]¶ Bases:
pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher
Linear Dynamic Batch Scheduler: scales up batch size linearly
pytext.data.packed_lm_data module¶
-
class
pytext.data.packed_lm_data.
PackedLMData
(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, max_seq_len: int = 128, sort_key: Optional[str] = None, language: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True)[source]¶ Bases:
pytext.data.data.Data
Special purpose Data object which assumes a single text tensorizer. Packs tokens into a square batch with no padding. Used for LM training. The object also takes in an optional language argument which is used for cross-lingual LM training.
pytext.data.roberta_tensorizer module¶
-
class
pytext.data.roberta_tensorizer.
RoBERTaTensorizer
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, base_tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]¶
-
class
pytext.data.roberta_tensorizer.
RoBERTaTokenLevelTensorizer
(columns, tokenizer=None, vocab=None, max_seq_len=256, labels_columns=['label'], labels=[])[source]¶ Bases:
pytext.data.roberta_tensorizer.RoBERTaTensorizer
Tensorizer for token level classification tasks such as NER, POS etc using RoBERTa. Here each token has an associated label and the tensorizer should output a label tensor as well. The input for this tensorizer comes from the CoNLLUNERDataSource data source.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
classmethod
from_config
(config: pytext.data.roberta_tensorizer.RoBERTaTokenLevelTensorizer.Config)[source]¶
-
numberize
(row: Dict[KT, VT]) → Tuple[Any, ...][source]¶ Numberize both the tokens and labels. Since we break up tokens, the label for anything other than the first sub-word is assigned the padding idx.
-
pytext.data.squad_for_bert_tensorizer module¶
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForBERTTensorizer
(answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizer
Produces BERT inputs and answer spans for Squad.
-
SPAN_PAD_IDX
= -100¶
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).
-
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForBERTTensorizerForKD
(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]¶ Bases:
pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizerForKD.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).
-
classmethod
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForRoBERTaTensorizer
(columns: List[str] = ['question', 'doc'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, answers_column: str = 'answers', answer_starts_column: str = 'answer_starts')[source]¶ Bases:
pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer
,pytext.data.roberta_tensorizer.RoBERTaTensorizer
Produces RoBERTa inputs and answer spans for Squad.
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer.Config)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).
-
classmethod
pytext.data.squad_tensorizer module¶
-
class
pytext.data.squad_tensorizer.
SquadTensorizer
(doc_tensorizer: pytext.data.tensorizers.TokenTensorizer, ques_tensorizer: pytext.data.tensorizers.TokenTensorizer, doc_column: str = 'doc', ques_column: str = 'question', answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]¶ Bases:
pytext.data.tensorizers.TokenTensorizer
Produces inputs and answer spans for Squad.
-
SPAN_PAD_IDX
= -100¶
-
classmethod
from_config
(config: pytext.data.squad_tensorizer.SquadTensorizer.Config, **kwargs)[source]¶
-
-
class
pytext.data.squad_tensorizer.
SquadTensorizerForKD
(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]¶
pytext.data.tensorizers module¶
-
class
pytext.data.tensorizers.
AnnotationNumberizer
(column: str = 'seqlogical', vocab=None, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Not really a Tensorizer (since it does not create tensors) but technically serves the same function. This class parses Annotations in the format below and extracts the actions (type List[List[int]])
[IN:GET_ESTIMATED_DURATION How long will it take to [SL:METHOD_TRAVEL drive ] from [SL:SOURCE Chicago ] to [SL:DESTINATION Mississippi ] ]
Extraction algorithm is handled by Annotation class. We only care about the list of actions, which before vocab index lookups would look like:
[ IN:GET_ESTIMATED_DURATION, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT, SL:METHOD_TRAVEL, SHIFT, REDUCE, SHIFT, SL:SOURCE, SHIFT, REDUCE, SHIFT, SL:DESTINATION, SHIFT, REDUCE, ]
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
ByteTensorizer
(text_column, lower=True, max_seq_len=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Turn characters into a sequence of int8 bytes. One character will have one or more bytes depending on its encoding.
-
NUM
= 256¶
-
PAD_BYTE
= 0¶
-
UNK_BYTE
= 0¶
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
ByteTokenTensorizer
(text_column, tokenizer=None, max_seq_len=None, max_byte_len=15, offset_for_non_padding=0, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Turn words into 2-dimensional tensors of int8 bytes. Words are padded to max_byte_len. Also computes sequence lengths (1-D tensor) and token lengths (2-D tensor). 0 is the pad byte.
-
NUM_BYTES
= 256¶
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
CharacterTokenTensorizer
(max_char_length: int = 20, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.TokenTensorizer
Turn words into 2-dimensional tensors of ints based on their ascii values. Words are padded to the maximum word length (also capped at max_char_length). Sequence lengths are the length of each token, 0 for pad token.
-
initialize
(from_scratch=True)¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...
See WordTokenizer.initialize for a more concrete example.
-
-
class
pytext.data.tensorizers.
FloatListTensorizer
(column: str, error_check: bool, dim: Optional[int], normalize: bool, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize numeric labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
()[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...
See WordTokenizer.initialize for a more concrete example.
-
-
class
pytext.data.tensorizers.
FloatTensorizer
(column: str, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
A tensorizer for reading in scalars from the data.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
GazetteerTensorizer
(text_column: str = 'text', dict_column: str = 'dict', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Create 3 tensors for dict features.
- idx: index of feature in token order.
- weights: weight of feature in token order.
- lens: number of features per token.
For each input token, there will be the same number of idx and weights entries (equal to the max number of features any token has in this row). The values in lens tell how many of these features are actually used per token.
The input format for the dict column is JSON and should be a list of dictionaries containing the “features” and their weight for each relevant “tokenIdx”. Example:
text: "Order coffee from Starbucks please" dict: [ {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}}, {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}} ]
if we assume this vocab
vocab = { UNK: 0, PAD: 1, "drink/beverage": 2, "music/song": 3, "store/coffee_shop": 4 }
this example will result in those tensors:
idx = [1, 1, 2, 3, 1, 1, 4, 1, 1, 1] weights = [0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] lens = [1, 2, 1, 1, 1]
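As a PyText-independent sketch, the idx / weights / lens values above can be derived from the dict-feature JSON and the vocab roughly as follows. The function name and pad convention here are illustrative only; the real tensorizer works on its configured columns and vocab object.

vocab = {"__UNKNOWN__": 0, "__PAD__": 1, "drink/beverage": 2,
         "music/song": 3, "store/coffee_shop": 4}
dict_feats = [
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}},
]

def build_dict_tensors(num_tokens, dict_feats, vocab, pad_idx=1):
    # Collect (feature id, weight) pairs per token, then pad every token out
    # to the maximum number of features any token has in this row.
    per_token = [[] for _ in range(num_tokens)]
    for entry in dict_feats:
        for feat, weight in entry["features"].items():
            per_token[entry["tokenIdx"]].append((vocab[feat], weight))
    max_feats = max(max(len(f) for f in per_token), 1)
    idx, weights, lens = [], [], []
    for feats in per_token:
        feats = feats or [(pad_idx, 0.0)]   # tokens without features get one pad slot
        lens.append(len(feats))
        padded = feats + [(pad_idx, 0.0)] * (max_feats - len(feats))
        idx.extend(i for i, _ in padded)
        weights.extend(w for _, w in padded)
    return idx, weights, lens

print(build_dict_tensors(5, dict_feats, vocab))
# ([1, 1, 2, 3, 1, 1, 4, 1, 1, 1], [0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [1, 2, 1, 1, 1])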
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ Look through the dataset for all dict features to create vocab.
-
class
pytext.data.tensorizers.
LabelListTensorizer
(label_column: str = 'label', *args, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.LabelTensorizer
LabelListTensorizer takes a list of labels as input and generates a tuple of tensors (label_idx, list_length).
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
LabelTensorizer
(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize labels. Labels can be used as either inputs or targets.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
MetricTensorizer
(names: List[str], indexes: List[int], is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
A tensorizer which uses other tensorizers’ numberized data. Used mostly for metric reporting.
-
class
pytext.data.tensorizers.
NtokensTensorizer
(names: List[str], indexes: List[int], is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.MetricTensorizer
A tensorizer which references another tensorizer’s numberized data to calculate the number of tokens. Used for calculating tokens per second.
-
class
pytext.data.tensorizers.
NumericLabelTensorizer
(label_column: str = 'label', rescale_range: Optional[List[float]] = None, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize numeric labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
SeqTokenTensorizer
(column: str = 'text_seq', tokenizer=None, add_bos_token: bool = False, add_eos_token: bool = False, use_eos_token_for_bos: bool = False, add_bol_token: bool = False, add_eol_token: bool = False, use_eol_token_for_bol: bool = False, max_seq_len=None, vocab=None, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Tensorize a sequence of sentences. The input is a list of strings, like this one:
["where do you wanna meet?", "MPK"]
If we assume this vocab:
vocab = {
    UNK: 0, PAD: 1,
    'where': 2, 'do': 3, 'you': 4, 'wanna': 5, 'meet?': 6, 'mpk': 7
}
this example will result in these tensors:
idx = [[2, 3, 4, 5, 6], [7, 1, 1, 1, 1]]
seq_len = [2]
If you’re using BOS, EOS, BOL and EOL, the vocab will look like this:
vocab = {
    UNK: 0, PAD: 1, BOS: 2, EOS: 3, BOL: 4, EOL: 5,
    'where': 6, 'do': 7, 'you': 8, 'wanna': 9, 'meet?': 10, 'mpk': 11
}
and this example will result in these tensors:
idx = [
    [2, 4, 3, 1, 1, 1, 1],
    [2, 6, 7, 8, 9, 10, 3],
    [2, 11, 3, 1, 1, 1, 1],
    [2, 5, 3, 1, 1, 1, 1]
]
seq_len = [4]
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
SlotLabelTensorizer
(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize word/slot labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ Look through the dataset for all labels and create a vocab map for them.
-
-
class
pytext.data.tensorizers.
SlotLabelTensorizerExpansible
(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.SlotLabelTensorizer
Create a base SlotLabelTensorizer to support selecting different types in ModelInput.
-
class
pytext.data.tensorizers.
SoftLabelTensorizer
(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, probs_column: str = 'target_probs', logits_column: str = 'target_logits', labels_column: str = 'target_labels', is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.LabelTensorizer
Handles numberizing labels for knowledge distillation. This still requires the same label column as LabelTensorizer for the “true” label, but also processes soft “probabilistic” labels generated from a teacher model, via three new columns.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
Tensorizer
(is_input: bool = True)[source]¶ Bases:
pytext.config.component.Component
Tensorizers are components that convert batches of pytext.data.type.DataType instances into tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its own input format.
Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...
See WordTokenizer.initialize for a more concrete example.
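Below is a minimal, PyText-independent sketch of this coroutine pattern: a toy tensorizer whose initialize() builds a vocabulary from rows that are sent to it one at a time and finalizes its state on GeneratorExit. The class and column names are illustrative, not part of the PyText API.

from collections import Counter

class ToySketchTensorizer:
    def __init__(self, column="text"):
        self.column = column
        self.vocab = None

    def initialize(self):
        counter = Counter()
        try:
            while True:
                row = yield              # row is a Dict[str, DataType]
                counter.update(row[self.column].split())
        except GeneratorExit:
            # finalize: freeze the vocabulary as an instance variable
            self.vocab = {tok: i for i, (tok, _) in enumerate(counter.most_common())}

rows = [{"text": "hello world"}, {"text": "hello pytext"}]
t = ToySketchTensorizer()
init = t.initialize()
next(init)                  # prime the coroutine
for row in rows:
    init.send(row)
init.close()                # triggers GeneratorExit and finalization
print(t.vocab)              # e.g. {'hello': 0, 'world': 1, 'pytext': 2}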
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
TensorizerScriptImpl
[source]¶ Bases:
torch.nn.modules.module.Module
-
batch_size
(texts: Optional[List[List[str]]], tokens: Optional[List[List[List[str]]]]) → int[source]¶
-
get_tokens_by_index
(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[List[str]]][source]¶
-
numberize
(*args, **kwargs)[source]¶ This function will receive the outputs of tokenize(), or will be called directly from the PyText Tensorizer’s numberize().
Override this function to be TorchScriptable, e.g. you need to declare concrete input arguments with type hints.
-
tensorize
(*args, **kwargs)[source]¶ This function will receive a list (e.g. a batch) of outputs from numberize(), pad them, and convert them into output tensors.
Override this function to be TorchScriptable, e.g. you need to declare concrete input arguments with type hints.
-
tensorize_wrapper
(*args, **kwargs)[source]¶ This function will receive a list (e.g. a batch) of outputs from numberize(), pad them, and convert them into output tensors.
It is called by the PyText Tensorizer at training time; this function is not TorchScriptable because it depends on cuda.device().
-
tokenize
(*args, **kwargs)[source]¶ This function will receive the inputs from clients. Usually there are two possible inputs: 1) a row of texts: List[str], or 2) a row of pre-processed tokens: List[List[str]].
Override this function to be TorchScriptable, e.g. you need to declare concrete input arguments with type hints.
-
-
class
pytext.data.tensorizers.
TokenTensorizer
(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Convert text to a list of tokens. Do this based on a tokenizer configuration, and build a vocabulary for numberization. Finally, pad the batch to create a square tensor of the correct size.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
UidTensorizer
(uid_column: str = 'uid', allow_unknown: bool = True, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize user IDs which can be either strings or tensors.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
VocabConfig
(**kwargs)[source]¶ Bases:
pytext.config.component.Component.Config
-
build_from_data
= True¶ Whether to add tokens from training data to vocab.
-
size_from_data
= 0¶ Add size_from_data most frequent tokens in training data to vocab (if this is 0, add all tokens from training data).
-
vocab_files
= []¶
-
-
class
pytext.data.tensorizers.
VocabFileConfig
(**kwargs)[source]¶ Bases:
pytext.config.component.Component.Config
-
filepath
= ''¶ File containing tokens to add to vocab (first whitespace-separated entry per line)
-
lowercase_tokens
= False¶ Whether to lowercase each of the tokens in the file
-
size_limit
= 0¶ The max number of tokens to add to vocab
-
skip_header_line
= False¶ Whether to skip the first line of the file (e.g. if it is a header line)
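As a hedged illustration of how these options might be combined: the field names below come from the attributes documented above, the file path is hypothetical, and the exact way a vocab config is nested inside a tensorizer config can vary between PyText versions.

from pytext.data.tensorizers import VocabConfig, VocabFileConfig

vocab_config = VocabConfig(
    build_from_data=True,      # also add tokens seen in training data
    size_from_data=50000,      # keep only the 50k most frequent of those
    vocab_files=[
        VocabFileConfig(
            filepath="/path/to/pretrained.vocab",  # hypothetical path
            lowercase_tokens=True,
            size_limit=100000,
            skip_header_line=True,
        )
    ],
)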
-
-
pytext.data.tensorizers.
initialize_tensorizers
(tensorizers, data_source, from_scratch=True)[source]¶ A utility function to stream a data source to the initialize functions of a dict of tensorizers.
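A simplified, PyText-independent sketch of what this utility does: prime each tensorizer’s initialize() coroutine, send every row from the data source to all of them, then close them so they can finalize their state (the real function presumably also forwards the from_scratch flag; the helper name here is illustrative).

def stream_rows_to_initializers(tensorizers, rows):
    # tensorizers: Dict[str, object] whose .initialize() returns a coroutine
    # rows: any iterable of Dict[str, Any] rows from a data source
    inits = {name: t.initialize() for name, t in tensorizers.items()}
    for init in inits.values():
        next(init)              # prime each coroutine
    for row in rows:
        for init in inits.values():
            init.send(row)
    for init in inits.values():
        init.close()            # triggers GeneratorExit so tensorizers finalize

Driving the ToySketchTensorizer from the earlier sketch through this function leaves its vocab attribute populated.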
-
pytext.data.tensorizers.
lookup_tokens
(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, vocab: pytext.data.utils.Vocabulary = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]¶
-
pytext.data.tensorizers.
tokenize
(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]¶
pytext.data.utils module¶
-
class
pytext.data.utils.
VocabBuilder
(delimiter=' ')[source]¶ Bases:
object
Helper class for aggregating and building Vocabulary objects.
-
class
pytext.data.utils.
Vocabulary
(vocab_list: List[str], counts: List[T] = None, replacements: Optional[Dict[str, str]] = None, unk_token: str = '__UNKNOWN__', pad_token: str = '__PAD__', bos_token: str = '__BEGIN_OF_SENTENCE__', eos_token: str = '__END_OF_SENTENCE__')[source]¶ Bases:
object
A mapping from indices to vocab elements.
-
pytext.data.utils.
align_target_label
(targets: List[float], labels: List[str], label_vocab: Dict[str, int]) → List[float][source]¶ Given targets that are ordered according to labels, align the targets to match the order of label_vocab.
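A minimal sketch of the reordering this describes, assuming targets for labels missing from the row default to 0.0 (the function name and that default are illustrative; only the signature above is from PyText):

def align_one(targets, labels, label_vocab):
    # targets[i] is the score for labels[i]; re-emit them in label_vocab order.
    by_label = dict(zip(labels, targets))
    aligned = [0.0] * len(label_vocab)
    for label, idx in label_vocab.items():
        aligned[idx] = by_label.get(label, 0.0)
    return aligned

print(align_one([0.7, 0.3], ["positive", "negative"], {"negative": 0, "positive": 1}))
# [0.3, 0.7]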
-
pytext.data.utils.
align_target_labels
(targets_list: List[List[float]], labels_list: List[List[str]], label_vocab: Dict[str, int]) → List[List[float]][source]¶ Given targets_list that are ordered according to labels_list, align the targets to match the order of label_vocab.
-
pytext.data.utils.
pad
(nested_lists, pad_token, pad_shape=None)[source]¶ Pad the input lists with the pad token. If pad_shape is provided, pad to that shape, otherwise infer the input shape and pad out to a square tensor shape.
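A small sketch of the single-level case described above (the real pytext.data.utils.pad also handles deeper nesting and an explicit pad_shape):

def pad_rows(nested_lists, pad_token):
    # Pad each row out to the length of the longest row.
    max_len = max(len(row) for row in nested_lists)
    return [row + [pad_token] * (max_len - len(row)) for row in nested_lists]

print(pad_rows([[2, 3, 4, 5, 6], [7]], pad_token=1))
# [[2, 3, 4, 5, 6], [7, 1, 1, 1, 1]]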
pytext.data.xlm_constants module¶
pytext.data.xlm_dictionary module¶
pytext.data.xlm_tensorizer module¶
-
class
pytext.data.xlm_tensorizer.
XLMTensorizer
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, language_column: str = 'language', lang2id: Dict[str, int] = {'ar': 0, 'bg': 1, 'de': 2, 'el': 3, 'en': 4, 'es': 5, 'fr': 6, 'hi': 7, 'ru': 8, 'sw': 9, 'th': 10, 'tr': 11, 'ur': 12, 'vi': 13, 'zh': 14}, use_language_embeddings: bool = True, has_language_in_data: bool = False)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBase
Tensorizer for Cross-lingual LM tasks. Works for single sentence as well as sentence pair.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
numberize
(row: Dict[KT, VT]) → Tuple[Any, ...][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.xlm_tensorizer.
XLMTensorizerScriptImpl
(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int, language_vocab: List[str], default_language: str)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl
-
forward
(texts: Optional[List[List[str]]] = None, pre_tokenized: Optional[List[List[List[str]]]] = None, languages: Optional[List[List[str]]] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Wire up tokenize(), numberize() and tensorize() functions for data processing.
-
numberize
(per_sentence_tokens: List[List[Tuple[str, int, int]]], per_sentence_languages: List[int]) → Tuple[List[int], List[int], int, List[int]][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
Parameters: per_sentence_tokens – list of tokens per sentence level in one row; each token is represented by its token string and its start and end indices.
Returns: tokens: List[int], a list of token ids concatenating all sentences’ token ids. segment_labels: List[int], denotes which sentence each token belongs to. seq_len: int, token length. positions: List[int], token positions.
Return type: Tuple[List[int], List[int], int, List[int]]
-
Module contents¶
-
class
pytext.data.
AlternatingRandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.RandomizedBatchSampler
This sampler takes in a dictionary of iterators and returns batches alternating between the keys and probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.
-
class
pytext.data.
Batcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]¶ Bases:
pytext.config.component.Component
Batcher designed to batch rows of data, before padding.
-
class
pytext.data.
BatchIterator
(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]¶ Bases:
object
BatchIterator is a wrapper around a TorchText Iterator that provides the flexibility to map batched data to a tuple of (input, target, context) and to perform additional steps such as dealing with distributed training.
Parameters: - batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and return a batch of data in __next__
- processor – function to run after getting batched data from the TorchText Iterator; the function should define a way to map the data into (input, target, context)
- include_input (bool) – if input data should be returned, default is true
- include_target (bool) – if target data should be returned, default is true
- include_context (bool) – if context data should be returned, default is true
- is_train (bool) – if the batch data is for training
- num_batches (int) – total number of batches to generate; this param is for distributed training. Due to a limitation in PyTorch’s distributed training backend that forces all parallel workers to have the same number of batches, we work around it by adding dummy batches at the end
-
class
pytext.data.
Data
(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]¶ Bases:
pytext.config.component.Component
Data is an abstraction that handles all of the following:
- Initialize model metadata parameters
- Create batches of tensors for model training or prediction
It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.
The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.
-
batches
(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]¶ Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, i.e. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.
stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data_source; this allows setting a different data_source for testing a model.
Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.
-
class
pytext.data.
DataHandler
(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]¶ Bases:
pytext.config.component.Component
DataHandler is the central place to prepare data for model training/testing. The class is responsible for:
- Define the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
- Initialize global context, such as building the vocab and loading pretrained embeddings. Store the context as metadata, and provide functions to serialize/deserialize the metadata.
The data processing pipeline contains the following steps:
- Read data from file into a list of raw data examples
- Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
- Invoke featurizer, which contains data processing steps to apply for both training and inference time, e.g: tokenization
- Use the raw data and results from featurizer to do any preprocessing
- Generate a TorchText.Dataset that contains the list of Examples; the Dataset also has a list of TorchText.Field objects, which define how to do padding and numericalization while batching data.
- Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch, and _context_from_batch functions.
-
raw_columns
¶ columns to read from data source. The order should match the data stored in that file.
Type: List[str]
-
featurizer
¶ perform data preprocessing that should be shared between training and inference
Type: Featurizer
-
features
¶ a dict of name -> field used to process data as model input
Type: Dict[str, Field]
-
labels
¶ a dict of name -> field used to process data as training target
Type: Dict[str, Field]
-
extra_fields
¶ fields that process any extra data used neither as model input nor target. This is None by default
Type: Dict[str, Field]
-
text_feature_name
¶ name of the text field, used to define the default sort key of data
Type: str
-
shuffle
¶ if the dataset should be shuffled, true by default
Type: bool
-
sort_within_batch
¶ if data within same batch should be sorted, true by default
Type: bool
-
train_path
¶ path of training data file
Type: str
-
eval_path
¶ path of evaluation data file
Type: str
-
test_path
¶ path of test data file
Type: str
-
train_batch_size
¶ training batch size, 128 by default
Type: int
-
eval_batch_size
¶ evaluation batch size, 128 by default
Type: int
-
test_batch_size
¶ test batch size, 128 by default
Type: int
-
max_seq_len
¶ maximum length of tokens to keep in sequence
Type: int
-
pass_index
¶ if the original index of data in the batch should be passed along to downstream steps, default is true
Type: bool
-
gen_dataset
(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.data.dataset.Dataset[source]¶ Generate torchtext Dataset from raw in memory data. :returns: dataset (TorchText.Dataset)
-
gen_dataset_from_path
(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.data.dataset.Dataset[source]¶ Generate a dataset from file :returns: dataset (TorchText.Dataset)
-
get_test_iter_from_path
(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_test_iter_from_raw_data
(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_train_iter_from_path
(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶ Generate data batch iterator for training data. See _get_train_iter() for details
Parameters: - train_path (str) – file path of training data
- batch_size (int) – batch size
- rank (int) – used for distributed training; the rank of the current GPU. Don’t set it to anything but 0 for non-distributed training
- world_size (int) – used for distributed training; the total number of GPUs
-
get_train_iter_from_raw_data
(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶
-
init_feature_metadata
(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]¶
-
init_metadata_from_path
(train_path, eval_path, test_path)[source]¶ Initialize metadata using data from file
-
init_target_metadata
(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]¶
-
load_metadata
(metadata: pytext.data.data_handler.CommonMetadata)[source]¶ Load previously saved metadata
-
load_vocab
(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]¶ Loads items into a set from a file containing one item per line. Items are added to the set from the top of the file to the bottom, so the items in the file should be ordered by preference (if any); e.g., it makes sense to order tokens in descending order of frequency in the corpus.
Parameters: - vocab_file (str) – vocab file to load
- vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
- lowercase_tokens (bool) – if the tokens should be lowercased
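The loading behaviour described above amounts to something like the following sketch (assuming the first whitespace-separated entry on each line is the token; the helper name is illustrative, not the PyText implementation):

def load_vocab_sketch(vocab_file, vocab_size, lowercase_tokens=False):
    vocab = set()
    with open(vocab_file, encoding="utf-8") as f:
        for line in f:
            if len(vocab) >= vocab_size:
                break                      # only keep the first vocab_size items
            parts = line.split()
            if not parts:
                continue                   # skip blank lines
            token = parts[0]
            vocab.add(token.lower() if lowercase_tokens else token)
    return vocab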
-
preprocess
(data: Iterable[Dict[str, Any]])[source]¶ Preprocess the raw data to create TorchText Examples; this is the second step in the whole processing pipeline. :returns: data (Generator[Dict[str, Any]])
-
preprocess_row
(row_data: Dict[str, Any]) → Dict[str, Any][source]¶ Preprocessing steps for a single input row; subclasses should override it
-
read_from_file
(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]¶ Read data from a CSV file. The input file format is required to have tab-separated columns
Parameters: - file_name (str) – csv file name
- columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
-
class
pytext.data.
DisjointMultitaskData
(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]¶ Bases:
pytext.data.data.Data
Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskData.Config.
- data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_dict
¶ Data handlers to do roundrobin over.
Type: type
-
class
pytext.data.
DisjointMultitaskDataHandler
(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]¶ Bases:
pytext.data.data_handler.DataHandler
Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
- data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
- target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_handlers
¶ Data handlers to do roundrobin over.
Type: type
-
target_task_name
¶ Used to select best epoch, and set batch_per_epoch.
Type: type
-
upsample
¶ If upsample is True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, and the ones which run out will sit idle. This is used for evaluation. Default True.
Type: bool
-
class
pytext.data.
DynamicPoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]¶ Bases:
pytext.data.data.PoolingBatcher
Allows dynamic-batch training; extends the pooling batcher with a scheduler config, which specifies how the batch size should increase
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
compute_dynamic_batch_size
(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]¶
-
-
class
pytext.data.
EvalBatchSampler
[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
Output: [A, B, C, D, a, b]
-
pytext.data.
generator_iterator
(fn)[source]¶ Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments and calling the generator with them anew each time __iter__ is called on the returned object.
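The idea can be sketched in plain Python as a wrapper that records the call arguments and rebuilds the generator on every __iter__ (the class and decorator names here are illustrative, not the PyText implementation):

import functools

class ReplayableGenerator:
    def __init__(self, fn, args, kwargs):
        self.fn, self.args, self.kwargs = fn, args, kwargs
    def __iter__(self):
        # Build a fresh generator from the recorded call arguments each time.
        return self.fn(*self.args, **self.kwargs)

def replayable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return ReplayableGenerator(fn, args, kwargs)
    return wrapper

@replayable
def squares(n):
    for i in range(n):
        yield i * i

s = squares(3)
print(list(s), list(s))   # [0, 1, 4] [0, 1, 4] -- iterable more than once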
-
class
pytext.data.
PoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=10000, num_shuffled_pools=1)[source]¶ Bases:
pytext.data.data.Batcher
Batcher that loads a pool of data, sorts it, and batches it.
Shuffling is performed before pooling, by loading num_shuffled_pools worth of data, shuffling, and then splitting that up into pools.
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
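A rough, PyText-independent sketch of the pooling strategy listed above. Parameter names mirror the documented config options; the real PoolingBatcher differs in details such as per-stage batch sizes and how partial pools are handled.

import random

def pooled_batches(rows, batch_size, pool_num_batches, num_shuffled_pools, sort_key=None):
    pool_size = batch_size * pool_num_batches
    shuffle_size = pool_size * num_shuffled_pools
    for start in range(0, len(rows), shuffle_size):
        chunk = rows[start:start + shuffle_size]   # num_shuffled_pools worth of data
        random.shuffle(chunk)                      # shuffle before pooling
        for p in range(0, len(chunk), pool_size):
            pool = chunk[p:p + pool_size]          # one pool of batch_size * pool_num_batches rows
            if sort_key is not None:
                pool.sort(key=sort_key)            # sort rows, if necessary
            batches = [pool[b:b + batch_size] for b in range(0, len(pool), batch_size)]
            random.shuffle(batches)                # shuffle the order batches are returned in
            yield from batches

rows = [{"text": "x" * n} for n in range(1, 21)]
for batch in pooled_batches(rows, batch_size=4, pool_num_batches=2, num_shuffled_pools=1,
                            sort_key=lambda row: len(row["text"])):
    print([len(row["text"]) for row in batch])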
-
-
class
pytext.data.
RandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of iterators and returns batches according to the probabilities specified by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.
Example
Iterator A: [A, B, C, D], Iterator B: [a, b]
batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1}
Epoch 1 = [a, b, a]
Epoch 2 = [b, a, b]
Parameters: unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
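A toy sketch of the sampling idea described above: normalize the unnormalized probabilities, then repeatedly pick an iterator key and take its next batch, restarting iterators that run out. Unlike the real sampler, this toy version does not keep iterator state across epochs, and the names are illustrative.

import random
from itertools import cycle

def randomized_batches(iterators, unnormalized_probs, batches_per_epoch, seed=0):
    rng = random.Random(seed)
    keys = list(unnormalized_probs)
    total = sum(unnormalized_probs.values())
    weights = [unnormalized_probs[k] / total for k in keys]   # normalize to sum to 1
    streams = {k: cycle(iterators[k]) for k in keys}          # restart iterators that run out
    for _ in range(batches_per_epoch):
        key = rng.choices(keys, weights=weights)[0]
        yield next(streams[key])

print(list(randomized_batches({"A": "ABCD", "B": "ab"}, {"A": 0, "B": 1}, batches_per_epoch=3)))
# ['a', 'b', 'a']  (all probability on "B", so only its batches appear)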
-
class
pytext.data.
RoundRobinBatchSampler
(iter_to_set_epoch: Optional[str] = None)[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes a dictionary of Iterators and returns batches in a round-robin fashion until the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.
If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.
If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
iter_to_set_epoch = “Iterator 1”
Output: [A, a, B, b, C, a, D, b]
iter_to_set_epoch = None
Output: [A, a, B, b]
Parameters: iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
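A PyText-independent sketch of the round-robin behaviour described above, reproducing both example outputs; the real sampler operates on batch iterators and its configuration, so this is only an illustration.

from itertools import cycle

def round_robin(iterators, iter_to_set_epoch=None):
    # Cycle every iterator except the epoch-defining one; with no target,
    # keep all iterators finite so the shortest one ends the epoch.
    streams = [
        iter(it) if (iter_to_set_epoch is None or name == iter_to_set_epoch) else cycle(it)
        for name, it in iterators.items()
    ]
    # zip stops as soon as the finite target iterator (or the shortest one) is exhausted.
    for group in zip(*streams):
        yield from group

print(list(round_robin({"Iterator 1": "ABCD", "Iterator 2": "ab"}, iter_to_set_epoch="Iterator 1")))
# ['A', 'a', 'B', 'b', 'C', 'a', 'D', 'b']
print(list(round_robin({"Iterator 1": "ABCD", "Iterator 2": "ab"})))
# ['A', 'a', 'B', 'b']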
-
class
pytext.data.
Tensorizer
(is_input: bool = True)[source]¶ Bases:
pytext.config.component.Component
Tensorizers are components that convert batches of pytext.data.type.DataType instances into tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its own input format.
Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...
See WordTokenizer.initialize for a more concrete example.
-
tensorizer_script_impl
= None¶
-
pytext.exporters package¶
Submodules¶
pytext.exporters.custom_exporters module¶
-
class
pytext.exporters.custom_exporters.
DenseFeatureExporter
(config, input_names, dummy_model_input, vocab_map, output_names)[source]¶ Bases:
pytext.exporters.exporter.ModelExporter
Exporter for models that have DenseFeatures as input to the decoder
-
class
pytext.exporters.custom_exporters.
InitPredictNetExporter
(config, input_names, dummy_model_input, vocab_map, output_names)[source]¶ Bases:
pytext.exporters.exporter.ModelExporter
Exporter for converting models to their caffe2 init and predict nets. Does not rely on c2_prepared, but rather splits the ONNX model into the init and predict nets directly.
-
export_to_caffe2
(model, export_path: str, export_onnx_path: str = None) → List[str][source]¶ Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model
Parameters: - model (Model) – pytorch model to export
- export_path (str) – path to save the exported caffe2 model
- export_onnx_path (str) – path to save the exported onnx model
Returns: list of caffe2 model output names
Return type: final_output_names
-
postprocess_output
(init_net, predict_net, workspace, output_names: List[str], model)[source]¶ Postprocess the model output and generate additional blobs for human-readable prediction. By default it uses the export function of the output layer from the PyTorch model to append additional operators to the Caffe2 net
Parameters: - init_net (caffe2.python.Net) – caffe2 init net created by the current graph
- predict_net (caffe2.python.Net) – caffe2 net created by the current graph
- workspace (caffe2.python.workspace) – caffe2 current workspace
- output_names (List[str]) – current output names of the caffe2 net
- py_model (Model) – original pytorch model object
Returns: list of blobs that will be added to the caffe2 model final_output_names: list of output names of the blobs to add
Return type: result
-
prepend_operators
(init_net, predict_net, input_names: List[str])[source]¶ Prepend operators to the converted caffe2 net; does nothing by default
Parameters: - c2_prepared (Caffe2Rep) – caffe2 net rep
- input_names (List[str]) – current input names to the caffe2 net
Returns: caffe2 net with prepended operators input_names (List[str]): list of input names for the new net
Return type: c2_prepared (Caffe2Rep)
-
pytext.exporters.exporter module¶
-
class
pytext.exporters.exporter.
ModelExporter
(config, input_names, dummy_model_input, vocab_map, output_names)[source]¶ Bases:
pytext.config.component.Component
Model exporter exports a PyTorch model to Caffe2 model using ONNX
-
input_names
¶ names of the input variables to the model forward function, in a flattened way, e.g. for forward(tokens, dict) where tokens is List[Tensor] and dict is a tuple of value and length (List[Tensor], List[Tensor]), the input names should look like [‘token’, ‘dict_value’, ‘dict_length’]
Type: List[Str]
-
dummy_model_input
¶ dummy values to define the shape of input tensors, should exactly match the shape of the model forward function
Type: Tuple[torch.Tensor]
-
vocab_map
¶ dict of input feature names to corresponding index_to_string array, e.g:
{ "text": ["<UNK>", "W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8"], "dict": ["<UNK>", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"] }
Type: Dict[str, List[str]]
-
output_names
¶ names of output variables
Type: List[Str]
-
export_to_caffe2
(model, export_path: str, export_onnx_path: str = None) → List[str][source]¶ Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model
Parameters: - model (Model) – pytorch model to export
- export_path (str) – path to save the exported caffe2 model
- export_onnx_path (str) – path to save the exported onnx model
Returns: list of caffe2 model output names
Return type: final_output_names
-
export_to_metrics
(model, metric_channels)[source]¶ Exports the pytorch model to tensorboard as a graph.
Parameters: - model (Model) – pytorch model to export
- metric_channels (List[Channel]) – outputs of model’s execution graph
-
classmethod
from_config
(config, feature_config: pytext.config.field_config.FeatureConfig, target_config: Union[pytext.config.pytext_config.ConfigBase, List[pytext.config.pytext_config.ConfigBase]], meta: pytext.data.data_handler.CommonMetadata, *args, **kwargs)[source]¶ Gather all the necessary metadata from configs and global metadata to be used in exporter
-
get_extra_params
() → List[str][source]¶ Returns: list of blobs to be added as extra params to the caffe2 model
-
classmethod
get_feature_metadata
(feature_config: pytext.config.field_config.FeatureConfig, feature_meta: Dict[str, pytext.fields.field.FieldMeta])[source]¶
-
postprocess_output
(init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, workspace: caffe2.python.workspace, output_names: List[str], py_model)[source]¶ Postprocess the model output and generate additional blobs for human-readable prediction. By default it uses the export function of the output layer from the PyTorch model to append additional operators to the Caffe2 net
Parameters: - init_net (caffe2.python.Net) – caffe2 init net created by the current graph
- predict_net (caffe2.python.Net) – caffe2 net created by the current graph
- workspace (caffe2.python.workspace) – caffe2 current workspace
- output_names (List[str]) – current output names of the caffe2 net
- py_model (Model) – original pytorch model object
Returns: list of blobs that will be added to the caffe2 model final_output_names: list of output names of the blobs to add
Return type: result
-
prepend_operators
(c2_prepared: caffe2.python.onnx.backend_rep.Caffe2Rep, input_names: List[str]) → Tuple[caffe2.python.onnx.backend_rep.Caffe2Rep, List[str]][source]¶ Prepend operators to the converted caffe2 net; does nothing by default
Parameters: - c2_prepared (Caffe2Rep) – caffe2 net rep
- input_names (List[str]) – current input names to the caffe2 net
Returns: caffe2 net with prepended operators input_names (List[str]): list of input names for the new net
Return type: c2_prepared (Caffe2Rep)
-
Module contents¶
-
class
pytext.exporters.
ModelExporter
(config, input_names, dummy_model_input, vocab_map, output_names)[source]¶ Bases:
pytext.config.component.Component
Model exporter exports a PyTorch model to Caffe2 model using ONNX
-
input_names
¶ names of the input variables to the model forward function, in a flattened way, e.g. for forward(tokens, dict) where tokens is List[Tensor] and dict is a tuple of value and length (List[Tensor], List[Tensor]), the input names should look like [‘token’, ‘dict_value’, ‘dict_length’]
Type: List[Str]
-
dummy_model_input
¶ dummy values to define the shape of input tensors, should exactly match the shape of the model forward function
Type: Tuple[torch.Tensor]
-
vocab_map
¶ dict of input feature names to corresponding index_to_string array, e.g:
{ "text": ["<UNK>", "W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8"], "dict": ["<UNK>", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"] }
Type: Dict[str, List[str]]
-
output_names
¶ names of output variables
Type: List[Str]
-
export_to_caffe2
(model, export_path: str, export_onnx_path: str = None) → List[str][source]¶ Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model
Parameters: - model (Model) – pytorch model to export
- export_path (str) – path to save the exported caffe2 model
- export_onnx_path (str) – path to save the exported onnx model
Returns: list of caffe2 model output names
Return type: final_output_names
-
export_to_metrics
(model, metric_channels)[source]¶ Exports the pytorch model to tensorboard as a graph.
Parameters: - model (Model) – pytorch model to export
- metric_channels (List[Channel]) – outputs of model’s execution graph
-
classmethod
from_config
(config, feature_config: pytext.config.field_config.FeatureConfig, target_config: Union[pytext.config.pytext_config.ConfigBase, List[pytext.config.pytext_config.ConfigBase]], meta: pytext.data.data_handler.CommonMetadata, *args, **kwargs)[source]¶ Gather all the necessary metadata from configs and global metadata to be used in exporter
-
get_extra_params
() → List[str][source]¶ Returns: list of blobs to be added as extra params to the caffe2 model
-
classmethod
get_feature_metadata
(feature_config: pytext.config.field_config.FeatureConfig, feature_meta: Dict[str, pytext.fields.field.FieldMeta])[source]¶
-
postprocess_output
(init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, workspace: caffe2.python.workspace, output_names: List[str], py_model)[source]¶ Postprocess the model output and generate additional blobs for human-readable prediction. By default it uses the export function of the output layer from the PyTorch model to append additional operators to the Caffe2 net
Parameters: - init_net (caffe2.python.Net) – caffe2 init net created by the current graph
- predict_net (caffe2.python.Net) – caffe2 net created by the current graph
- workspace (caffe2.python.workspace) – caffe2 current workspace
- output_names (List[str]) – current output names of the caffe2 net
- py_model (Model) – original pytorch model object
Returns: list of blobs that will be added to the caffe2 model final_output_names: list of output names of the blobs to add
Return type: result
-
prepend_operators
(c2_prepared: caffe2.python.onnx.backend_rep.Caffe2Rep, input_names: List[str]) → Tuple[caffe2.python.onnx.backend_rep.Caffe2Rep, List[str]][source]¶ Prepend operators to the converted caffe2 net; does nothing by default
Parameters: - c2_prepared (Caffe2Rep) – caffe2 net rep
- input_names (List[str]) – current input names to the caffe2 net
Returns: caffe2 net with prepended operators input_names (List[str]): list of input names for the new net
Return type: c2_prepared (Caffe2Rep)
-
-
class
pytext.exporters.
DenseFeatureExporter
(config, input_names, dummy_model_input, vocab_map, output_names)[source]¶ Bases:
pytext.exporters.exporter.ModelExporter
Exporter for models that have DenseFeatures as input to the decoder
-
class
pytext.exporters.
InitPredictNetExporter
(config, input_names, dummy_model_input, vocab_map, output_names)[source]¶ Bases:
pytext.exporters.exporter.ModelExporter
Exporter for converting models to their caffe2 init and predict nets. Does not rely on c2_prepared, but rather splits the ONNX model into the init and predict nets directly.
-
export_to_caffe2
(model, export_path: str, export_onnx_path: str = None) → List[str][source]¶ Export a PyTorch model to Caffe2 by first using ONNX to convert the logic in the forward function to a Caffe2 net, and then prepending/appending additional operators to the Caffe2 net according to the model
Parameters: - model (Model) – pytorch model to export
- export_path (str) – path to save the exported caffe2 model
- export_onnx_path (str) – path to save the exported onnx model
Returns: list of caffe2 model output names
Return type: final_output_names
-
postprocess_output
(init_net, predict_net, workspace, output_names: List[str], model)[source]¶ Postprocess the model output and generate additional blobs for human-readable prediction. By default it uses the export function of the output layer from the PyTorch model to append additional operators to the Caffe2 net
Parameters: - init_net (caffe2.python.Net) – caffe2 init net created by the current graph
- predict_net (caffe2.python.Net) – caffe2 net created by the current graph
- workspace (caffe2.python.workspace) – caffe2 current workspace
- output_names (List[str]) – current output names of the caffe2 net
- py_model (Model) – original pytorch model object
Returns: list of blobs that will be added to the caffe2 model final_output_names: list of output names of the blobs to add
Return type: result
-
prepend_operators
(init_net, predict_net, input_names: List[str])[source]¶ Prepend operators to the converted caffe2 net; does nothing by default
Parameters: - c2_prepared (Caffe2Rep) – caffe2 net rep
- input_names (List[str]) – current input names to the caffe2 net
Returns: caffe2 net with prepended operators input_names (List[str]): list of input names for the new net
Return type: c2_prepared (Caffe2Rep)
-
pytext.fields package¶
Submodules¶
pytext.fields.char_field module¶
-
class
pytext.fields.char_field.
CharFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- Remaining keyword arguments – Passed to the constructor of Vocab.
-
dummy_model_input
= tensor([[[1, 1, 1]], [[1, 1, 1]]])¶
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pytext.fields.contextual_token_embedding_field module¶
-
class
pytext.fields.contextual_token_embedding_field.
ContextualTokenEmbeddingField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]¶ Example of padded minibatch:
[
    [[0.1, 0.2, 0.3, 0.4, 0.5],
     [1.1, 1.2, 1.3, 1.4, 1.5],
     [2.1, 2.2, 2.3, 2.4, 2.5],
     [3.1, 3.2, 3.3, 3.4, 3.5]],
    [[0.1, 0.2, 0.3, 0.4, 0.5],
     [1.1, 1.2, 1.3, 1.4, 1.5],
     [2.1, 2.2, 2.3, 2.4, 2.5],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
    [[0.1, 0.2, 0.3, 0.4, 0.5],
     [1.1, 1.2, 1.3, 1.4, 1.5],
     [0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
]
-
pytext.fields.dict_field module¶
-
class
pytext.fields.dict_field.
DictFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- Remaining keyword arguments – Passed to the constructor of Vocab.
-
dummy_model_input
= (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))¶
-
numericalize
(arr, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]¶ Pad a batch of examples using this field.
Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
-
pytext.fields.field module¶
-
class
pytext.fields.field.
DocLabelField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.field.
FloatField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.field.
FloatVectorField
(dim=0, dim_error_check=False, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.field.
NestedField
(*args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
,torchtext.data.field.NestedField
-
class
pytext.fields.field.
RawField
(*args, is_target=False, **kwargs)[source]¶ Bases:
torchtext.data.field.RawField
-
class
pytext.fields.field.
SeqFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingNestedField
-
dummy_model_input
= tensor([[[1]], [[1]]])¶
-
-
class
pytext.fields.field.
TextFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
dummy_model_input
= tensor([[1], [1]])¶
-
-
class
pytext.fields.field.
VocabUsingField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
Base class for all fields that need to build a vocabulary.
-
class
pytext.fields.field.
VocabUsingNestedField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
,pytext.fields.field.NestedField
Base class for all nested fields that need to build a vocabulary.
-
class
pytext.fields.field.
WordLabelField
(use_bio_labels, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
pytext.fields.text_field_with_special_unk module¶
-
class
pytext.fields.text_field_with_special_unk.
TextFeatureFieldWithSpecialUnk
(*args, unkify_func=<function unkify>, **kwargs)[source]¶ Bases:
pytext.fields.field.TextFeatureField
-
build_vocab
(*args, min_freq=1, **kwargs)[source]¶ The code is exactly the same as torchtext.data.Field.build_vocab() before the UNKification logic. The reason super().build_vocab() cannot be called is that the Counter object computed in torchtext.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.
-
numericalize
(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]¶ The code is exactly the same as torchtext.data.Field.numericalize() except for the call to self._get_idx(x) instead of self.vocab.stoi[x] for getting the index of an item from the vocab. This is needed because torchtext doesn’t allow custom UNKification, so the TextFeatureFieldWithSpecialUnk field’s constructor accepts a function unkify_func() that can be used for UNKification instead of assigning all UNKs a default value.
-
Module contents¶
-
class
pytext.fields.
CharFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- Remaining keyword arguments – Passed to the constructor of Vocab.
-
dummy_model_input
= tensor([[[1, 1, 1]], [[1, 1, 1]]])¶
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
-
class
pytext.fields.
ContextualTokenEmbeddingField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]¶ Example of padded minibatch:
[
  [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5], [3.1, 3.2, 3.3, 3.4, 3.5]],
  [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5], [0.0, 0.0, 0.0, 0.0, 0.0]],
  [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]],
]
-
-
class
pytext.fields.
DictFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- keyword arguments (Remaining) – Passed to the constructor of Vocab.
-
dummy_model_input
= (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))¶
-
numericalize
(arr, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]¶ Pad a batch of examples using this field.
Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
-
-
class
pytext.fields.
DocLabelField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
FloatField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
FloatVectorField
(dim=0, dim_error_check=False, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
RawField
(*args, is_target=False, **kwargs)[source]¶ Bases:
torchtext.data.field.RawField
-
class
pytext.fields.
TextFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
dummy_model_input
= tensor([[1], [1]])¶
-
-
class
pytext.fields.
VocabUsingField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
Base class for all fields that need to build a vocabulary.
-
class
pytext.fields.
WordLabelField
(use_bio_labels, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
NestedField
(*args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
,torchtext.data.field.NestedField
-
class
pytext.fields.
VocabUsingNestedField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
,pytext.fields.field.NestedField
Base class for all nested fields that need to build a vocabulary.
-
class
pytext.fields.
SeqFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingNestedField
-
dummy_model_input
= tensor([[[1]], [[1]]])¶
-
-
class
pytext.fields.
TextFeatureFieldWithSpecialUnk
(*args, unkify_func=<function unkify>, **kwargs)[source]¶ Bases:
pytext.fields.field.TextFeatureField
-
build_vocab
(*args, min_freq=1, **kwargs)[source]¶ Code is exactly the same as torchtext.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed in torchtext.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.
-
numericalize
(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]¶ Code is exactly the same as torchtext.data.Field.numericalize() except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to get the index of an item from the vocab. This is needed because torchtext doesn’t allow custom UNKification, so the TextFeatureFieldWithSpecialUnk constructor accepts a function unkify_func() that can be used for UNKification instead of assigning all UNKs a default value.
-
pytext.loss package¶
Submodules¶
pytext.loss.loss module¶
-
class
pytext.loss.loss.
AUCPRHingeLoss
(config, weights=None, *args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,pytext.loss.loss.Loss
Area under the precision-recall curve loss. Reference: “Scalable Learning of Non-Decomposable Objectives”, Section 5. TensorFlow implementation: https://github.com/tensorflow/models/tree/master/research/global_objectives
-
forward
(logits, targets, reduce=True, size_average=True, weights=None)[source]¶ Parameters: - logits – Variable \((N, C)\) where C = number of classes
- targets – Variable \((N)\) where each value is 0 <= targets[i] <= C-1
- weights – Coefficients for the loss. Must be a Tensor of shape [N] or [N, C], where N = batch_size, C = number of classes.
- size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch. Default: True
- reduce (bool, optional) – By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per input/target element instead and ignores size_average. Default: True
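A minimal sketch of computing this loss, assuming the nested AUCPRHingeLoss.Config that PyText Components carry can be instantiated with its defaults (its field names are not shown here):

import torch
from pytext.loss import AUCPRHingeLoss

loss_fn = AUCPRHingeLoss(AUCPRHingeLoss.Config())  # assumed default config
logits = torch.randn(8, 3)             # (N, C): batch of 8 examples, 3 classes
targets = torch.randint(0, 3, (8,))    # (N): class indices in [0, C-1]
loss = loss_fn(logits, targets)        # averaged over the minibatch by default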
-
-
class
pytext.loss.loss.
BinaryCrossEntropyLoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
CosineEmbeddingLoss
(config, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
CrossEntropyLoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
KLDivergenceBCELoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
KLDivergenceCELoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
LabelSmoothedCrossEntropyLoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
Loss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.config.component.Component
Base class for loss functions
-
class
pytext.loss.loss.
MAELoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
Mean absolute error or L1 loss, for regression tasks.
-
class
pytext.loss.loss.
MSELoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
Mean squared error or L2 loss, for regression tasks.
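For reference, with predictions \(\hat{y}_i\) and targets \(y_i\) over a batch of size \(N\), these two regression losses (MAELoss above and MSELoss) follow the standard definitions \(\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|\) and \(\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2\).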
-
class
pytext.loss.loss.
MultiLabelSoftMarginLoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
NLLLoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.loss.
PairwiseRankingLoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
Given embeddings for a query, a positive response, and a negative response, computes the pairwise ranking hinge loss
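For reference, a standard pairwise ranking hinge loss of this form is \(\mathcal{L} = \max\big(0,\ m - \mathrm{sim}(q, r^{+}) + \mathrm{sim}(q, r^{-})\big)\), where \(q\), \(r^{+}\) and \(r^{-}\) are the query, positive-response and negative-response embeddings and \(m\) is the margin; the exact margin and similarity function used by this class are configuration details not spelled out here.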
Module contents¶
-
class
pytext.loss.
AUCPRHingeLoss
(config, weights=None, *args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,pytext.loss.loss.Loss
Area under the precision-recall curve loss. Reference: “Scalable Learning of Non-Decomposable Objectives”, Section 5. TensorFlow implementation: https://github.com/tensorflow/models/tree/master/research/global_objectives
-
forward
(logits, targets, reduce=True, size_average=True, weights=None)[source]¶ Parameters: - logits – Variable \((N, C)\) where C = number of classes
- targets – Variable \((N)\) where each value is 0 <= targets[i] <= C-1
- weights – Coefficients for the loss. Must be a Tensor of shape [N] or [N, C], where N = batch_size, C = number of classes.
- size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch. Default: True
- reduce (bool, optional) – By default, the losses are averaged or summed over observations for each minibatch depending on size_average. When reduce is False, returns a loss per input/target element instead and ignores size_average. Default: True
-
-
class
pytext.loss.
Loss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.config.component.Component
Base class for loss functions
-
class
pytext.loss.
CrossEntropyLoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
CosineEmbeddingLoss
(config, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
BinaryCrossEntropyLoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
MultiLabelSoftMarginLoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
KLDivergenceBCELoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
KLDivergenceCELoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
MAELoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
Mean absolute error or L1 loss, for regression tasks.
-
class
pytext.loss.
MSELoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
Mean squared error or L2 loss, for regression tasks.
-
class
pytext.loss.
NLLLoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
-
class
pytext.loss.
PairwiseRankingLoss
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
Given embeddings for a query, a positive response, and a negative response, computes the pairwise ranking hinge loss
-
class
pytext.loss.
LabelSmoothedCrossEntropyLoss
(config, ignore_index=-100, weight=None, *args, **kwargs)[source]¶ Bases:
pytext.loss.loss.Loss
pytext.metric_reporters package¶
Submodules¶
pytext.metric_reporters.channel module¶
-
class
pytext.metric_reporters.channel.
Channel
(stages: Tuple[pytext.common.constants.Stage, ...] = (<Stage.TRAIN: 'Training'>, <Stage.EVAL: 'Evaluation'>, <Stage.TEST: 'Test'>))[source]¶ Bases:
object
Channel defines how to format and report the result of a PyText job to an output stream.
-
stages
¶ The stages in which the report will be triggered; the default is all stages (train, eval, test)
-
report
(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]¶ Defines how to format and report data to the output channel.
Parameters: - stage (Stage) – train, eval or test
- epoch (int) – current epoch
- metrics (Any) – all metrics
- model_select_metric (double) – a single numeric metric to pick best model
- loss (double) – average loss
- preds (List[Any]) – list of predictions
- targets (List[Any]) – list of targets
- scores (List[Any]) – list of scores
- context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
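As a concrete illustration of this interface, a custom channel only needs to subclass Channel and implement report() with the signature above. The JSON-lines output below is a hypothetical choice, not part of PyText:

import json
from pytext.metric_reporters.channel import Channel

class JsonLinesChannel(Channel):
    """Hypothetical channel that appends one JSON record per report() call."""

    def __init__(self, stages, file_path="metrics.jsonl"):
        super().__init__(stages)
        self.file_path = file_path

    def report(self, stage, epoch, metrics, model_select_metric, loss,
               preds, targets, scores, context, *args):
        record = {"stage": str(stage), "epoch": epoch,
                  "model_select_metric": model_select_metric, "loss": loss}
        with open(self.file_path, "a") as f:
            f.write(json.dumps(record) + "\n")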
-
-
class
pytext.metric_reporters.channel.
ConsoleChannel
(stages: Tuple[pytext.common.constants.Stage, ...] = (<Stage.TRAIN: 'Training'>, <Stage.EVAL: 'Evaluation'>, <Stage.TEST: 'Test'>))[source]¶ Bases:
pytext.metric_reporters.channel.Channel
Simple Channel that prints results to console.
-
report
(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]¶ Defines how to format and report data to the output channel.
Parameters: - stage (Stage) – train, eval or test
- epoch (int) – current epoch
- metrics (Any) – all metrics
- model_select_metric (double) – a single numeric metric to pick best model
- loss (double) – average loss
- preds (List[Any]) – list of predictions
- targets (List[Any]) – list of targets
- scores (List[Any]) – list of scores
- context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
-
-
class
pytext.metric_reporters.channel.
FileChannel
(stages, file_path)[source]¶ Bases:
pytext.metric_reporters.channel.Channel
Simple Channel that writes results to a TSV file.
-
report
(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]¶ Defines how to format and report data to the output channel.
Parameters: - stage (Stage) – train, eval or test
- epoch (int) – current epoch
- metrics (Any) – all metrics
- model_select_metric (double) – a single numeric metric to pick best model
- loss (double) – average loss
- preds (List[Any]) – list of predictions
- targets (List[Any]) – list of targets
- scores (List[Any]) – list of scores
- context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
-
-
class
pytext.metric_reporters.channel.
TensorBoardChannel
(summary_writer=None, metric_name='accuracy')[source]¶ Bases:
pytext.metric_reporters.channel.Channel
TensorBoardChannel defines how to format and report the result of a PyText job to TensorBoard.
-
summary_writer
¶ An instance of the TensorBoard SummaryWriter class, or an object that implements the same interface. https://pytorch.org/docs/stable/tensorboard.html
-
metric_name
¶ The name of the default metric to display on the TensorBoard dashboard, defaults to “accuracy”
-
train_step
¶ The training step count
-
add_scalars
(prefix, metrics, epoch)[source]¶ Recursively flattens the metrics object and adds each field name and value as a scalar for the corresponding epoch using the summary writer.
Parameters: - prefix (str) – The tag prefix for the metric. Each field name in the metrics object will be prepended with the prefix.
- metrics (Any) – The metrics object.
-
add_texts
(tag, metrics)[source]¶ Recursively flattens the metrics object and adds each field name and value as a text using the summary writer. For example, if tag = “test”, and metrics = { accuracy: 0.7, scores: { precision: 0.8, recall: 0.6 } }, then under “tag=test” we will display “accuracy=0.7”, and under “tag=test/scores” we will display “precision=0.8” and “recall=0.6” in TensorBoard.
Parameters: - tag (str) – The tag name for the metric. If a field needs to be flattened further, it will be prepended as a prefix to the field name.
- metrics (Any) – The metrics object.
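The recursive flattening described above can be pictured with a small stand-alone sketch; this illustrates the behaviour only and is not PyText’s implementation (real metrics objects are typically NamedTuple-like):

def flatten_metrics(prefix, metrics):
    # Yield (tag, value) pairs, nesting field names under the given prefix.
    if isinstance(metrics, dict):
        items = metrics.items()
    elif hasattr(metrics, "_asdict"):      # NamedTuple-style metrics object
        items = metrics._asdict().items()
    else:
        yield prefix, metrics
        return
    for name, value in items:
        yield from flatten_metrics(f"{prefix}/{name}", value)

# flatten_metrics("test", {"accuracy": 0.7, "scores": {"precision": 0.8, "recall": 0.6}})
# yields ("test/accuracy", 0.7), ("test/scores/precision", 0.8), ("test/scores/recall", 0.6)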
-
export
(model, input_to_model=None, **kwargs)[source]¶ Draws the neural network representation graph in TensorBoard.
Parameters: - model (Any) – the model object.
- input_to_model (Any) – the input to the model (required for PyTorch models, since its execution graph is defined by run).
-
report
(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, meta, model, optimizer, *args)[source]¶ Defines how to format and report data to TensorBoard using the summary writer. In the current implementation, during the train/eval phase we recursively report each metric field as scalars, and during the test phase we report the final metrics to be displayed as texts.
Also visualizes the internal model states (weights, biases) as histograms in TensorBoard.
Parameters: - stage (Stage) – train, eval or test
- epoch (int) – current epoch
- metrics (Any) – all metrics
- model_select_metric (double) – a single numeric metric to pick best model
- loss (double) – average loss
- preds (List[Any]) – list of predictions
- targets (List[Any]) – list of targets
- scores (List[Any]) – list of scores
- context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
- meta (Dict[str, Any]) – global metadata, such as target names
- model (nn.Module) – the PyTorch neural network model
-
pytext.metric_reporters.classification_metric_reporter module¶
-
class
pytext.metric_reporters.classification_metric_reporter.
ClassificationMetricReporter
(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
classmethod
from_config
(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]¶
-
get_meta
()[source]¶ Get global metadata that is not specific to any batch; the data will be passed along to channels
-
-
class
pytext.metric_reporters.classification_metric_reporter.
ComparableClassificationMetric
[source]¶ Bases:
enum.Enum
An enumeration.
-
ACCURACY
= 'accuracy'¶
-
LABEL_AVG_PRECISION
= 'label_avg_precision'¶
-
LABEL_F1
= 'label_f1'¶
-
LABEL_ROC_AUC
= 'label_roc_auc'¶
-
MACRO_F1
= 'macro_f1'¶
-
MCC
= 'mcc'¶
-
NEGATIVE_LOSS
= 'negative_loss'¶
-
ROC_AUC
= 'roc_auc'¶
-
-
class
pytext.metric_reporters.classification_metric_reporter.
MultiLabelClassificationMetricReporter
(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]¶ Bases:
pytext.metric_reporters.classification_metric_reporter.ClassificationMetricReporter
pytext.metric_reporters.compositional_metric_reporter module¶
-
class
pytext.metric_reporters.compositional_metric_reporter.
CompositionalMetricReporter
(actions_vocab, channels: List[pytext.metric_reporters.channel.Channel], text_column_name: str = 'tokenized_text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
classmethod
from_config
(config, metadata: pytext.data.data_handler.CommonMetadata = None, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer] = None)[source]¶
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
static
node_to_metrics_node
(node: Union[pytext.data.data_structures.annotation.Intent, pytext.data.data_structures.annotation.Slot], start: int = 0) → pytext.metrics.intent_slot_metrics.Node[source]¶ The input start is the absolute start position in utterance
-
static
tree_from_tokens_and_indx_actions
(token_str_list: List[str], actions_vocab: List[str], actions_indices: List[int], validate_tree: bool = True)[source]¶
-
static
tree_to_metric_node
(tree: pytext.data.data_structures.annotation.Tree) → pytext.metrics.intent_slot_metrics.Node[source]¶ Creates a Node from a tree, assuming the utterance is a concatenation of the tokens separated by whitespace. The function does not necessarily reproduce the original utterance, as extra whitespace can be introduced.
-
pytext.metric_reporters.disjoint_multitask_metric_reporter module¶
-
class
pytext.metric_reporters.disjoint_multitask_metric_reporter.
DisjointMultitaskMetricReporter
(reporters: Dict[str, pytext.metric_reporters.metric_reporter.MetricReporter], loss_weights: Dict[str, float], target_task_name: Optional[str], use_subtask_select_metric: bool)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= False¶
-
report_metric
(model, stage, epoch, reset=True, print_to_channels=True, optimizer=None)[source]¶ Calculate metrics and average loss, and report all statistics to channels
Parameters: - model (nn.Module) – the PyTorch neural network model.
- stage (Stage) – training, evaluation or test
- epoch (int) – current epoch
- reset (bool) – if all data should be reset after report, default is True
- print_to_channels (bool) – if report data to channels, default is True
-
pytext.metric_reporters.intent_slot_detection_metric_reporter module¶
-
class
pytext.metric_reporters.intent_slot_detection_metric_reporter.
IntentSlotMetricReporter
(doc_label_names: List[str], word_label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel], slot_column_name: str = 'slots', text_column_name: str = 'text', token_tensorizer_name: str = 'tokens')[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
pytext.metric_reporters.language_model_metric_reporter module¶
-
class
pytext.metric_reporters.language_model_metric_reporter.
LanguageModelChannel
(stages, file_path)[source]¶
-
class
pytext.metric_reporters.language_model_metric_reporter.
LanguageModelMetricReporter
(channels, metadata, tensorizers, aggregate_metrics, perplexity_type, pep_format)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
LABELS_COLUMN
= 'labels'¶
-
RAW_TEXT_COLUMN
= 'text'¶
-
TOKENS_COLUMN
= 'tokens'¶
-
UTTERANCE_COLUMN
= 'utterance'¶
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
calculate_metric
() → pytext.metrics.language_model_metrics.LanguageModelMetric[source]¶ Calculate metrics; each subclass should implement it
-
classmethod
from_config
(config: pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter.Config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]¶
-
get_model_select_metric
(metrics) → float[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= True¶
-
-
class
pytext.metric_reporters.language_model_metric_reporter.
MaskedLMMetricReporter
(channels, metadata, tensorizers, aggregate_metrics, perplexity_type, pep_format)[source]¶ Bases:
pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
pytext.metric_reporters.metric_reporter module¶
-
class
pytext.metric_reporters.metric_reporter.
MetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.config.component.Component
MetricReporter is responsible for three things:
- Aggregate output from the trainer, which includes model inputs, predictions, targets, scores, and loss.
- Calculate metrics using the aggregated output, and define how the metric is used to find the best model
- Optionally report the metrics and aggregated output to various channels
-
lower_is_better
¶ Whether a lower metric indicates better performance. Set to True for e.g. perplexity, and False for e.g. accuracy. Default is False
Type: bool
-
channels
¶ A list of Channels that will receive metrics and the aggregated trainer output, then format and report them in any customized way.
Type: List[Channel]
MetricReporter is tightly coupled with metric aggregation and computation, which makes it hard for subclasses to reuse the parent’s functionality and attributes through inheritance. The next step is to decouple metric aggregation and computation from metric reporting.
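A minimal sketch of a concrete reporter, assuming the aggregated attributes filled by add_batch_stats (all_preds, all_targets) and the calculate_metric() hook that subclasses implement (see LanguageModelMetricReporter below); the accuracy computation itself is an illustrative assumption:

from pytext.metric_reporters.metric_reporter import MetricReporter

class SimpleAccuracyReporter(MetricReporter):
    lower_is_better = False  # higher accuracy selects the better model

    def calculate_metric(self):
        # all_preds / all_targets are assumed to hold the aggregated batch output
        correct = sum(int(p == t) for p, t in zip(self.all_preds, self.all_targets))
        return correct / max(len(self.all_targets), 1)

    def get_model_select_metric(self, metrics):
        return metrics  # the metric is already a single float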
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
classmethod
aggregate_data
(all_data, new_batch)[source]¶ Aggregate a batch of data; basically just convert tensors to lists of native Python data
-
compare_metric
(new_metric, old_metric)[source]¶ Check if the new metric indicates better model performance
Returns: bool, True if the model with new_metric performs better
-
get_meta
()[source]¶ Get global metadata that is not specific to any batch; the data will be passed along to channels
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= False
-
report_metric
(model, stage, epoch, reset=True, print_to_channels=True, optimizer=None)[source]¶ Calculate metrics and average loss, and report all statistics to channels
Parameters: - model (nn.Module) – the PyTorch neural network model.
- stage (Stage) – training, evaluation or test
- epoch (int) – current epoch
- reset (bool) – if all data should be reset after report, default is True
- print_to_channels (bool) – if report data to channels, default is True
-
class
pytext.metric_reporters.metric_reporter.
PureLossMetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
lower_is_better
= True¶
-
pytext.metric_reporters.pairwise_ranking_metric_reporter module¶
-
class
pytext.metric_reporters.pairwise_ranking_metric_reporter.
PairwiseRankingMetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
pytext.metric_reporters.regression_metric_reporter module¶
-
class
pytext.metric_reporters.regression_metric_reporter.
RegressionMetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= False¶
-
pytext.metric_reporters.squad_metric_reporter module¶
-
class
pytext.metric_reporters.squad_metric_reporter.
SquadMetricReporter
(channels: List[pytext.metric_reporters.channel.Channel], n_best_size: int, max_answer_length: int, ignore_impossible: bool, has_answer_labels: List[str], tensorizer=None, false_label='False')[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
ANSWERS_COLUMN
= 'answers'¶
-
DOC_COLUMN
= 'doc'¶
-
QUES_COLUMN
= 'question'¶
-
ROW_INDEX
= 'id'¶
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **contexts)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
pytext.metric_reporters.word_tagging_metric_reporter module¶
-
class
pytext.metric_reporters.word_tagging_metric_reporter.
NERMetricReporter
(label_names: List[str], pad_idx: int, channels: List[pytext.metric_reporters.channel.Channel], use_bio_labels: bool = True)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
class
pytext.metric_reporters.word_tagging_metric_reporter.
SequenceTaggingMetricReporter
(label_names, pad_idx, channels)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
class
pytext.metric_reporters.word_tagging_metric_reporter.
Span
(label, start, end)[source]¶ Bases:
tuple
-
end
¶ Alias for field number 2
-
label
¶ Alias for field number 0
-
start
¶ Alias for field number 1
-
-
class
pytext.metric_reporters.word_tagging_metric_reporter.
WordTaggingMetricReporter
(label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel])[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
Module contents¶
-
class
pytext.metric_reporters.
Channel
(stages: Tuple[pytext.common.constants.Stage, ...] = (<Stage.TRAIN: 'Training'>, <Stage.EVAL: 'Evaluation'>, <Stage.TEST: 'Test'>))[source]¶ Bases:
object
Channel defines how to format and report the result of a PyText job to an output stream.
-
stages
¶ The stages in which the report will be triggered; the default is all stages (train, eval, test)
-
report
(stage, epoch, metrics, model_select_metric, loss, preds, targets, scores, context, *args)[source]¶ Defines how to format and report data to the output channel.
Parameters: - stage (Stage) – train, eval or test
- epoch (int) – current epoch
- metrics (Any) – all metrics
- model_select_metric (double) – a single numeric metric to pick best model
- loss (double) – average loss
- preds (List[Any]) – list of predictions
- targets (List[Any]) – list of targets
- scores (List[Any]) – list of scores
- context (Dict[str, List[Any]]) – dict of any additional context data, each context is a list of data that maps to each example
-
-
class
pytext.metric_reporters.
MetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.config.component.Component
MetricReporter is responsible for three things:
- Aggregate output from the trainer, which includes model inputs, predictions, targets, scores, and loss.
- Calculate metrics using the aggregated output, and define how the metric is used to find the best model
- Optionally report the metrics and aggregated output to various channels
-
lower_is_better
¶ Whether a lower metric indicates better performance. Set to True for e.g. perplexity, and False for e.g. accuracy. Default is False
Type: bool
-
channels
¶ A list of Channels that will receive metrics and the aggregated trainer output, then format and report them in any customized way.
Type: List[Channel]
MetricReporter is tightly coupled with metric aggregation and computation, which makes it hard for subclasses to reuse the parent’s functionality and attributes through inheritance. The next step is to decouple metric aggregation and computation from metric reporting.
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
classmethod
aggregate_data
(all_data, new_batch)[source]¶ Aggregate a batch of data; basically just convert tensors to lists of native Python data
-
compare_metric
(new_metric, old_metric)[source]¶ Check if the new metric indicates better model performance
Returns: bool, True if the model with new_metric performs better
-
get_meta
()[source]¶ Get global metadata that is not specific to any batch; the data will be passed along to channels
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= False
-
report_metric
(model, stage, epoch, reset=True, print_to_channels=True, optimizer=None)[source]¶ Calculate metrics and average loss, and report all statistics to channels
Parameters: - model (nn.Module) – the PyTorch neural network model.
- stage (Stage) – training, evaluation or test
- epoch (int) – current epoch
- reset (bool) – if all data should be reset after report, default is True
- print_to_channels (bool) – if report data to channels, default is True
-
class
pytext.metric_reporters.
ClassificationMetricReporter
(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
classmethod
from_config
(config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]¶
-
get_meta
()[source]¶ Get global metadata that is not specific to any batch; the data will be passed along to channels
-
-
class
pytext.metric_reporters.
MultiLabelClassificationMetricReporter
(label_names: List[str], channels: List[pytext.metric_reporters.channel.Channel], model_select_metric: pytext.metric_reporters.classification_metric_reporter.ComparableClassificationMetric = <ComparableClassificationMetric.ACCURACY: 'accuracy'>, target_label: Optional[str] = None, text_column_names: List[str] = ['text'], additional_column_names: List[str] = [], recall_at_precision_thresholds: List[float] = [0.2, 0.4, 0.6, 0.8, 0.9])[source]¶ Bases:
pytext.metric_reporters.classification_metric_reporter.ClassificationMetricReporter
-
class
pytext.metric_reporters.
RegressionMetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= False¶
-
-
class
pytext.metric_reporters.
IntentSlotMetricReporter
(doc_label_names: List[str], word_label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel], slot_column_name: str = 'slots', text_column_name: str = 'text', token_tensorizer_name: str = 'tokens')[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
class
pytext.metric_reporters.
LanguageModelMetricReporter
(channels, metadata, tensorizers, aggregate_metrics, perplexity_type, pep_format)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
LABELS_COLUMN
= 'labels'¶
-
RAW_TEXT_COLUMN
= 'text'¶
-
TOKENS_COLUMN
= 'tokens'¶
-
UTTERANCE_COLUMN
= 'utterance'¶
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
calculate_metric
() → pytext.metrics.language_model_metrics.LanguageModelMetric[source]¶ Calculate metrics; each subclass should implement it
-
classmethod
from_config
(config: pytext.metric_reporters.language_model_metric_reporter.LanguageModelMetricReporter.Config, meta: pytext.data.data_handler.CommonMetadata = None, tensorizers=None)[source]¶
-
get_model_select_metric
(metrics) → float[source]¶ Return a single numeric metric value that is used for model selection. By default this returns the metric itself, but metrics are usually more complicated data structures.
-
lower_is_better
= True¶
-
-
class
pytext.metric_reporters.
SquadMetricReporter
(channels: List[pytext.metric_reporters.channel.Channel], n_best_size: int, max_answer_length: int, ignore_impossible: bool, has_answer_labels: List[str], tensorizer=None, false_label='False')[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
ANSWERS_COLUMN
= 'answers'¶
-
DOC_COLUMN
= 'doc'¶
-
QUES_COLUMN
= 'question'¶
-
ROW_INDEX
= 'id'¶
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **contexts)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
-
class
pytext.metric_reporters.
WordTaggingMetricReporter
(label_names: List[str], use_bio_labels: bool, channels: List[pytext.metric_reporters.channel.Channel])[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
class
pytext.metric_reporters.
CompositionalMetricReporter
(actions_vocab, channels: List[pytext.metric_reporters.channel.Channel], text_column_name: str = 'tokenized_text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
classmethod
from_config
(config, metadata: pytext.data.data_handler.CommonMetadata = None, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer] = None)[source]¶
-
get_model_select_metric
(metrics)[source]¶ Return a single numeric metric value that is used for model selection, returns the metric itself by default, but usually metrics will be more complicated data structures
-
static
node_to_metrics_node
(node: Union[pytext.data.data_structures.annotation.Intent, pytext.data.data_structures.annotation.Slot], start: int = 0) → pytext.metrics.intent_slot_metrics.Node[source]¶ The input start is the absolute start position in utterance
-
static
tree_from_tokens_and_indx_actions
(token_str_list: List[str], actions_vocab: List[str], actions_indices: List[int], validate_tree: bool = True)[source]¶
-
static
tree_to_metric_node
(tree: pytext.data.data_structures.annotation.Tree) → pytext.metrics.intent_slot_metrics.Node[source]¶ Creates a Node from a tree, assuming the utterance is a concatenation of the tokens separated by whitespace. The function does not necessarily reproduce the original utterance, as extra whitespace can be introduced.
-
-
class
pytext.metric_reporters.
PairwiseRankingMetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
add_batch_stats
(n_batches, preds, targets, scores, loss, m_input, **context)[source]¶ Aggregates a batch of output data (predictions, scores, targets/true labels and loss).
Parameters: - n_batches (int) – number of current batch
- preds (torch.Tensor) – predictions of current batch
- targets (torch.Tensor) – targets of current batch
- scores (torch.Tensor) – scores of current batch
- loss (double) – average loss of current batch
- m_input (Tuple[torch.Tensor, ..]) – model inputs of current batch
- context (Dict[str, Any]) – any additional context data, it could be either a list of data which maps to each example, or a single value for the batch
-
-
class
pytext.metric_reporters.
SequenceTaggingMetricReporter
(label_names, pad_idx, channels)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
class
pytext.metric_reporters.
PureLossMetricReporter
(channels, pep_format=False)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
-
lower_is_better
= True¶
-
-
class
pytext.metric_reporters.
NERMetricReporter
(label_names: List[str], pad_idx: int, channels: List[pytext.metric_reporters.channel.Channel], use_bio_labels: bool = True)[source]¶ Bases:
pytext.metric_reporters.metric_reporter.MetricReporter
pytext.metrics package¶
Submodules¶
pytext.metrics.intent_slot_metrics module¶
-
class
pytext.metrics.intent_slot_metrics.
AllMetrics
[source]¶ Bases:
tuple
Aggregated class for intent-slot related metrics.
-
top_intent_accuracy
¶ Accuracy of the top-level intent.
-
frame_accuracy
¶ Frame accuracy.
-
frame_accuracies_by_depth
¶ Frame accuracies bucketized by depth of the gold tree.
-
bracket_metrics
¶ Bracket metrics for intents and slots. For details, see the function compute_intent_slot_metrics().
-
tree_metrics
¶ Tree metrics for intents and slots. For details, see the function compute_intent_slot_metrics().
-
loss
¶ Cross entropy loss.
-
bracket_metrics
Alias for field number 4
-
frame_accuracies_by_depth
Alias for field number 3
-
frame_accuracy
Alias for field number 1
-
frame_accuracy_top_k
¶ Alias for field number 2
-
loss
Alias for field number 6
-
top_intent_accuracy
Alias for field number 0
-
tree_metrics
Alias for field number 5
-
-
pytext.metrics.intent_slot_metrics.
FrameAccuraciesByDepth
= typing.Dict[int, pytext.metrics.intent_slot_metrics.FrameAccuracy]¶ Frame accuracies bucketized by depth of the gold tree.
-
class
pytext.metrics.intent_slot_metrics.
FrameAccuracy
[source]¶ Bases:
tuple
Frame accuracy for a collection of intent frame predictions.
Frame accuracy means the entire tree structure of the predicted frame matches that of the gold frame.
-
frame_accuracy
¶ Alias for field number 1
-
num_samples
¶ Alias for field number 0
-
-
class
pytext.metrics.intent_slot_metrics.
FramePredictionPair
[source]¶ Bases:
tuple
Pair of predicted and gold intent frames.
-
expected_frame
¶ Alias for field number 1
-
predicted_frame
¶ Alias for field number 0
-
-
class
pytext.metrics.intent_slot_metrics.
IntentSlotConfusions
[source]¶ Bases:
tuple
Aggregated class for intent and slot confusions.
-
intent_confusions
¶ Confusion counts for intents.
-
slot_confusions
¶ Confusion counts for slots.
-
intent_confusions
Alias for field number 0
-
slot_confusions
Alias for field number 1
-
-
class
pytext.metrics.intent_slot_metrics.
IntentSlotMetrics
[source]¶ Bases:
tuple
Precision/recall/F1 metrics for intents and slots.
-
intent_metrics
¶ Precision/recall/F1 metrics for intents.
-
slot_metrics
¶ Precision/recall/F1 metrics for slots.
-
overall_metrics
¶ Combined precision/recall/F1 metrics for all nodes (merging intents and slots).
-
intent_metrics
Alias for field number 0
-
overall_metrics
Alias for field number 2
-
slot_metrics
Alias for field number 1
-
-
class
pytext.metrics.intent_slot_metrics.
IntentsAndSlots
[source]¶ Bases:
tuple
Collection of intents and slots in an intent frame.
-
intents
¶ Alias for field number 0
-
slots
¶ Alias for field number 1
-
-
class
pytext.metrics.intent_slot_metrics.
Node
(label: str, span: pytext.data.data_structures.node.Span, children: Optional[AbstractSet[Node]] = None, text: str = None)[source]¶ Bases:
pytext.data.data_structures.node.Node
Subclass of the base Node class, used for metric purposes. It is immutable so that hashing can be done on the class.
-
label
¶ Label of the node.
Type: str
-
span
¶ Span of the node.
Type: Span
-
children
¶ frozenset of the node’s children, left empty when computing bracketing metrics.
Type: frozenset
ofNode
-
text
¶ Text the node covers (=utterance[span.start:span.end])
Type: str
-
-
class
pytext.metrics.intent_slot_metrics.
NodesPredictionPair
[source]¶ Bases:
tuple
Pair of predicted and expected sets of nodes.
-
expected_nodes
¶ Alias for field number 1
-
predicted_nodes
¶ Alias for field number 0
-
-
pytext.metrics.intent_slot_metrics.
compare_frames
(predicted_frame: pytext.metrics.intent_slot_metrics.Node, expected_frame: pytext.metrics.intent_slot_metrics.Node, tree_based: bool, intent_per_label_confusions: Optional[pytext.metrics.PerLabelConfusions] = None, slot_per_label_confusions: Optional[pytext.metrics.PerLabelConfusions] = None) → pytext.metrics.intent_slot_metrics.IntentSlotConfusions[source]¶ Compares two intent frames and returns TP, FP, FN counts for intents and slots. Optionally collects the per label TP, FP, FN counts.
Parameters: - predicted_frame – Predicted intent frame.
- expected_frame – Gold intent frame.
- tree_based – Whether to get the tree-based confusions (if True) or bracket-based confusions (if False). For details, see the function compute_intent_slot_metrics().
- intent_per_label_confusions – If provided, update the per label confusions for intents as well. Defaults to None.
- slot_per_label_confusions – If provided, update the per label confusions for slots as well. Defaults to None.
Returns: IntentSlotConfusions, containing confusion counts for intents and slots.
-
pytext.metrics.intent_slot_metrics.
compute_all_metrics
(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair], top_intent_accuracy: bool = True, frame_accuracy: bool = True, frame_accuracies_by_depth: bool = True, bracket_metrics: bool = True, tree_metrics: bool = True, overall_metrics: bool = False, all_predicted_frames: List[List[pytext.metrics.intent_slot_metrics.Node]] = None, calculated_loss: float = None) → pytext.metrics.intent_slot_metrics.AllMetrics[source]¶ Given a list of predicted and gold intent frames, computes intent-slot related metrics.
Parameters: - frame_pairs – List of predicted and gold intent frames.
- top_intent_accuracy – Whether to compute top intent accuracy or not. Defaults to True.
- frame_accuracy – Whether to compute frame accuracy or not. Defaults to True.
- frame_accuracies_by_depth – Whether to compute frame accuracies by depth or not. Defaults to True.
- bracket_metrics – Whether to compute bracket metrics or not. Defaults to True.
- tree_metrics – Whether to compute tree metrics or not. Defaults to True.
- overall_metrics – If bracket_metrics or tree_metrics is true, decides whether to compute overall (merging intents and slots) metrics for them. Defaults to False.
Returns: AllMetrics which contains intent-slot related metrics.
-
pytext.metrics.intent_slot_metrics.
compute_frame_accuracies_by_depth
(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair]) → Dict[int, pytext.metrics.intent_slot_metrics.FrameAccuracy][source]¶ Given a list of predicted and gold intent frames, splits the predictions into buckets according to the depth of the gold trees, and computes frame accuracy for each bucket.
Parameters: frame_pairs – List of predicted and gold intent frames. Returns: FrameAccuraciesByDepth, a map from depths to their corresponding frame accuracies.
-
pytext.metrics.intent_slot_metrics.
compute_frame_accuracy
(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair]) → float[source]¶ Computes frame accuracy given a list of predicted and gold intent frames.
Parameters: frame_pairs – List of predicted and gold intent frames. Returns: Frame accuracy. For a prediction, frame accuracy is achieved if the entire tree structure of the predicted frame matches that of the gold frame.
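In other words, over \(N\) frame pairs, frame accuracy is simply \(\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{predicted tree}_i = \text{gold tree}_i]\).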
-
pytext.metrics.intent_slot_metrics.
compute_frame_accuracy_top_k
(frame_pairs: List[pytext.metrics.intent_slot_metrics.FramePredictionPair], all_frames: List[List[pytext.metrics.intent_slot_metrics.Node]]) → float[source]¶
-
pytext.metrics.intent_slot_metrics.
compute_intent_slot_metrics
(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair], tree_based: bool, overall_metrics: bool = True) → pytext.metrics.intent_slot_metrics.IntentSlotMetrics[source]¶ Given a list of predicted and gold intent frames, computes precision, recall and F1 metrics for intents and slots, either in tree-based or bracket-based manner.
The following assumptions are made about intent frames: (1) the root node is an intent; (2) children of intents are always slots, and children of slots are always intents.
For tree-based metrics, a node (an intent or slot) in the predicted frame is considered a true positive only if the subtree rooted at this node has an exact copy in the gold frame, otherwise it is considered a false positive. A false negative is a node in the gold frame that does not have an exact subtree match in the predicted frame.
For bracket-based metrics, a node in the predicted frame is considered a true positive if there is a node in the gold frame having the same label and span (but not necessarily the same children). The definitions of false positives and false negatives are similar to the above.
Parameters: - frame_pairs – List of predicted and gold intent frames.
- tree_based – Whether to compute tree-based metrics (if True) or bracket-based metrics (if False).
- overall_metrics – Whether to compute overall (merging intents and slots) metrics or not. Defaults to True.
Returns: IntentSlotMetrics, containing precision/recall/F1 metrics for intents and slots.
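A hedged sketch of the distinction, assuming Span(start, end) offsets from pytext.data.data_structures.node and the Node and FramePredictionPair constructors documented in this module; labels and spans are illustrative:

from pytext.data.data_structures.node import Span  # assumed Span(start, end)
from pytext.metrics.intent_slot_metrics import (
    FramePredictionPair, Node, compute_intent_slot_metrics)

utterance = "weather in boston"
gold = Node("IN:GET_WEATHER", Span(0, 17),
            children={Node("SL:LOCATION", Span(11, 17), text="boston")},
            text=utterance)
# Same root label and span as the gold frame, but the slot child is missing.
pred = Node("IN:GET_WEATHER", Span(0, 17), children=frozenset(), text=utterance)

pairs = [FramePredictionPair(pred, gold)]
bracket = compute_intent_slot_metrics(pairs, tree_based=False)  # root counts as a TP (label and span match)
tree = compute_intent_slot_metrics(pairs, tree_based=True)      # root counts as a FP (its subtree differs)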
-
pytext.metrics.intent_slot_metrics.
compute_metric_at_k
(references: List[pytext.metrics.intent_slot_metrics.Node], hypothesis: List[List[pytext.metrics.intent_slot_metrics.Node]], metric_fn: Callable[[pytext.metrics.intent_slot_metrics.Node, pytext.metrics.intent_slot_metrics.Node], bool] = <function <lambda>>) → List[float][source]¶ Computes a boolean metric at each position in the ranked list of hypotheses, and returns an average for each position over all examples. By default metric_fn checks whether frames are equal.
-
pytext.metrics.intent_slot_metrics.
compute_prf1_metrics
(nodes_pairs: Sequence[pytext.metrics.intent_slot_metrics.NodesPredictionPair]) → Tuple[pytext.metrics.AllConfusions, pytext.metrics.PRF1Metrics][source]¶ Computes precision/recall/F1 metrics given a list of predicted and expected sets of nodes.
Parameters: nodes_pairs – List of predicted and expected node sets. Returns: A tuple, of which the first member contains the confusion information, and the second member contains the computed precision/recall/F1 metrics.
-
pytext.metrics.intent_slot_metrics.
compute_top_intent_accuracy
(frame_pairs: Sequence[pytext.metrics.intent_slot_metrics.FramePredictionPair]) → float[source]¶ Computes accuracy of the top-level intent.
Parameters: frame_pairs – List of predicted and gold intent frames. Returns: Prediction accuracy of the top-level intent.
pytext.metrics.language_model_metrics module¶
pytext.metrics.squad_metrics module¶
Module contents¶
-
class
pytext.metrics.
AllConfusions
[source]¶ Bases:
object
Aggregated class for per label confusions.
-
per_label_confusions
¶ Per label confusion information.
-
confusions
¶ Overall TP, FP and FN counts across the labels in per_label_confusions.
-
confusions
-
per_label_confusions
-
-
class
pytext.metrics.
ClassificationMetrics
[source]¶ Bases:
tuple
Metric class for various classification metrics.
-
accuracy
¶ Overall accuracy of predictions.
-
macro_prf1_metrics
¶ Macro precision/recall/F1 scores.
-
per_label_soft_scores
¶ Per label soft metrics.
-
mcc
¶ Matthews correlation coefficient.
-
roc_auc
¶ Area under the Receiver Operating Characteristic curve.
-
loss
¶ Training loss (only used for selecting best model, no need to print).
-
accuracy
Alias for field number 0
-
loss
Alias for field number 5
-
macro_prf1_metrics
Alias for field number 1
-
mcc
Alias for field number 3
-
per_label_soft_scores
Alias for field number 2
-
roc_auc
Alias for field number 4
-
-
class
pytext.metrics.
Confusions
(TP: int = 0, FP: int = 0, FN: int = 0)[source]¶ Bases:
object
Confusion information for a collection of predictions.
-
TP
¶ Number of true positives.
-
FP
¶ Number of false positives.
-
FN
¶ Number of false negatives.
-
FN
-
FP
-
TP
-
-
class
pytext.metrics.
LabelListPrediction
[source]¶ Bases:
tuple
Label list predictions of an example.
-
label_scores
¶ Confidence scores that each label receives.
-
predicted_label
¶ List of indices of the predicted label.
-
expected_label
¶ List of indices of the true label.
-
expected_label
Alias for field number 2
-
label_scores
Alias for field number 0
-
predicted_label
Alias for field number 1
-
-
class
pytext.metrics.
LabelPrediction
[source]¶ Bases:
tuple
Label predictions of an example.
-
label_scores
¶ Confidence scores that each label receives.
-
predicted_label
¶ Index of the predicted label. This is usually the label with the highest confidence score in label_scores.
-
expected_label
¶ Index of the true label.
-
expected_label
Alias for field number 2
-
label_scores
Alias for field number 0
-
predicted_label
Alias for field number 1
-
-
class
pytext.metrics.
MacroPRF1Metrics
[source]¶ Bases:
tuple
Aggregated metric class for macro precision/recall/F1 scores.
-
per_label_scores
¶ Mapping from label string to the corresponding precision/recall/F1 scores.
-
macro_scores
¶ Macro precision/recall/F1 scores across the labels in per_label_scores.
-
macro_scores
Alias for field number 1
-
per_label_scores
Alias for field number 0
-
-
class
pytext.metrics.
MacroPRF1Scores
[source]¶ Bases:
tuple
Macro precision/recall/F1 scores (averages across each label).
-
num_label
¶ Number of distinct labels.
-
precision
¶ Equally weighted average of precisions for each label.
-
recall
¶ Equally weighted average of recalls for each label.
-
f1
¶ Equally weighted average of F1 scores for each label.
-
f1
Alias for field number 3
-
num_labels
¶ Alias for field number 0
-
precision
Alias for field number 1
-
recall
Alias for field number 2
-
-
pytext.metrics.
PRECISION_AT_RECALL_THRESHOLDS
= [0.2, 0.4, 0.6, 0.8, 0.9]¶ Basic metric classes and functions for single-label prediction problems, with extensions toward multi-label support.
-
class
pytext.metrics.
PRF1Metrics
[source]¶ Bases:
tuple
Metric class for all types of precision/recall/F1 scores.
-
per_label_scores
¶ Map from label string to the corresponding precision/recall/F1 scores.
-
macro_scores
¶ Macro precision/recall/F1 scores across the labels in per_label_scores.
-
micro_scores
¶ Micro (regular) precision/recall/F1 scores for the same collection of predictions.
-
macro_scores
Alias for field number 1
-
micro_scores
Alias for field number 2
-
per_label_scores
Alias for field number 0
-
-
class
pytext.metrics.
PRF1Scores
[source]¶ Bases:
tuple
Precision/recall/F1 scores for a collection of predictions.
-
true_positives
¶ Number of true positives.
-
false_positives
¶ Number of false positives.
-
false_negatives
¶ Number of false negatives.
-
precision
¶ TP / (TP + FP).
-
recall
¶ TP / (TP + FN).
-
f1
¶ 2 * TP / (2 * TP + FP + FN).
-
f1
Alias for field number 5
-
false_negatives
Alias for field number 2
-
false_positives
Alias for field number 1
-
precision
Alias for field number 3
-
recall
Alias for field number 4
-
true_positives
Alias for field number 0
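As a quick sanity check of the formulas above, the three scores can be computed directly from raw confusion counts (a standalone sketch, not the PyText implementation):

# Standalone sketch: precision/recall/F1 from raw confusion counts,
# matching the formulas documented for PRF1Scores.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, f1


print(prf1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)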
-
-
class
pytext.metrics.
PairwiseRankingMetrics
[source]¶ Bases:
tuple
Metric class for pairwise ranking
-
num_examples
¶ number of samples
Type: int
-
accuracy
¶ how many times did we rank in the correct order
Type: float
-
average_score_difference
¶ average score(higherRank) - score(lowerRank)
Type: float
-
accuracy
Alias for field number 1
-
average_score_difference
Alias for field number 2
-
num_examples
Alias for field number 0
-
-
class
pytext.metrics.
PerLabelConfusions
[source]¶ Bases:
object
Per label confusion information.
-
label_confusions_map
¶ Map from label string to the corresponding confusion counts.
-
label_confusions_map
-
-
class
pytext.metrics.
RealtimeMetrics
[source]¶ Bases:
tuple
Realtime Metrics for tracking training progress and performance.
-
samples
¶ number of samples
Type: int
-
tps
¶ tokens per second
Type: float
-
ups
¶ updates per second
Type: float
-
samples
Alias for field number 0
-
tps
Alias for field number 1
-
ups
Alias for field number 2
-
-
class
pytext.metrics.
RegressionMetrics
[source]¶ Bases:
tuple
Metrics for regression tasks.
-
num_examples
¶ number of examples
Type: int
-
pearson_correlation
¶ correlation between predictions and labels
Type: float
-
mse
¶ mean-squared error between predictions and labels
Type: float
-
mse
Alias for field number 2
-
num_examples
Alias for field number 0
-
pearson_correlation
Alias for field number 1
-
-
class
pytext.metrics.
SoftClassificationMetrics
[source]¶ Bases:
tuple
Classification scores that are independent of thresholds.
-
average_precision
¶ Alias for field number 0
-
decision_thresh_at_precision
¶ Alias for field number 2
-
decision_thresh_at_recall
¶ Alias for field number 4
-
precision_at_recall
¶ Alias for field number 3
-
recall_at_precision
¶ Alias for field number 1
-
roc_auc
¶ Alias for field number 5
-
-
pytext.metrics.
average_precision_score
(y_true_sorted: numpy.ndarray, y_score_sorted: numpy.ndarray) → float[source]¶ Computes average precision, which summarizes the precision-recall curve as the precisions achieved at each threshold weighted by the increase in recall since the previous threshold.
Parameters: - y_true_sorted – Numpy array sorted according to decreasing confidence scores indicating whether each prediction is correct.
- y_score_sorted – Numpy array of confidence scores for the predictions in decreasing order.
Returns: Average precision score.
TODO: This is too slow, improve the performance
-
pytext.metrics.
compute_classification_metrics
(predictions: Sequence[pytext.metrics.LabelPrediction], label_names: Sequence[str], loss: float, average_precisions: bool = True, recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → pytext.metrics.ClassificationMetrics[source]¶ A general function that computes classification metrics given a list of label predictions.
Parameters: - predictions – Label predictions, including the confidence score for each label.
- label_names – Indexed label names.
- average_precisions – Whether to compute average precisions for labels or not. Defaults to True.
- recall_at_precision_thresholds – precision thresholds at which to calculate recall
- precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns: ClassificationMetrics which contains various classification metrics.
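A minimal, illustrative call, assuming LabelPrediction is constructed positionally in the field order documented above (label_scores, predicted_label, expected_label); the confidence scores and loss value below are placeholders:

# Illustrative usage of compute_classification_metrics with made-up scores.
from pytext.metrics import LabelPrediction, compute_classification_metrics

label_names = ["negative", "positive"]
predictions = [
    LabelPrediction([0.1, 0.9], 1, 1),   # correct "positive"
    LabelPrediction([0.8, 0.2], 0, 0),   # correct "negative"
    LabelPrediction([0.6, 0.4], 0, 1),   # wrong: predicted "negative"
]

metrics = compute_classification_metrics(predictions, label_names, loss=0.42)
print(metrics.accuracy)                  # 2 of 3 correct
print(metrics.macro_prf1_metrics.macro_scores)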
-
pytext.metrics.
compute_matthews_correlation_coefficients
(TP: int, FP: int, FN: int, TN: int) → float[source]¶ Computes Matthews correlation coefficient, a way to summarize all four counts (TP, FP, FN, TN) in the confusion matrix of binary classification.
Parameters: - TP – Number of true positives.
- FP – Number of false positives.
- FN – Number of false negatives.
- TN – Number of true negatives.
Returns: Matthews correlation coefficient, computed as (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)).
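For reference, a standalone sketch of the coefficient from the four counts (not the PyText implementation):

# Standalone sketch of the Matthews correlation coefficient for a binary
# confusion matrix; mirrors the formula in the docstring above.
import math


def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(mcc(tp=45, fp=5, fn=10, tn=40))  # ~0.70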
-
pytext.metrics.
compute_multi_label_classification_metrics
(predictions: Sequence[pytext.metrics.LabelListPrediction], label_names: Sequence[str], loss: float, average_precisions: bool = True, recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → pytext.metrics.ClassificationMetrics[source]¶ A general function that computes classification metrics given a list of multi-label predictions.
Parameters: - predictions – multi-label predictions, including the confidence score for each label.
- label_names – Indexed label names.
- average_precisions – Whether to compute average precisions for labels or not. Defaults to True.
- recall_at_precision_thresholds – precision thresholds at which to calculate recall
- precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns: ClassificationMetrics which contains various classification metrics.
-
pytext.metrics.
compute_multi_label_soft_metrics
(predictions: Sequence[pytext.metrics.LabelListPrediction], label_names: Sequence[str], recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → Dict[str, pytext.metrics.SoftClassificationMetrics][source]¶ Computes multi-label soft classification metrics
Parameters: - predictions – multi-label predictions, including the confidence score for each label.
- label_names – Indexed label names.
- recall_at_precision_thresholds – precision thresholds at which to calculate recall
- precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns: Dict from label strings to their corresponding soft metrics.
-
pytext.metrics.
compute_pairwise_ranking_metrics
(predictions: Sequence[int], scores: Sequence[float]) → pytext.metrics.PairwiseRankingMetrics[source]¶ Computes metrics for pairwise ranking given sequences of predictions and scores
Parameters: - predictions – 1 if ranking was correct, 0 if ranking was incorrect
- scores – score(higher-ranked-sample) - score(lower-ranked-sample)
Returns: PairwiseRankingMetrics object
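A minimal, illustrative call with made-up correctness flags and score differences:

# Illustrative call: three ranked pairs, two ranked correctly.
from pytext.metrics import compute_pairwise_ranking_metrics

predictions = [1, 1, 0]      # 1 = correct order, 0 = incorrect order
scores = [0.6, 0.2, -0.3]    # score(higher-ranked sample) - score(lower-ranked sample)

metrics = compute_pairwise_ranking_metrics(predictions, scores)
print(metrics.num_examples, metrics.accuracy, metrics.average_score_difference)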
-
pytext.metrics.
compute_regression_metrics
(predictions: Sequence[float], targets: Sequence[float]) → pytext.metrics.RegressionMetrics[source]¶ Computes metrics for regression tasks.
Parameters: - predictions – 1-D sequence of float predictions
- targets – 1-D sequence of float labels
Returns: RegressionMetrics object
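A minimal, illustrative call with made-up predictions and targets:

# Illustrative usage of compute_regression_metrics.
from pytext.metrics import compute_regression_metrics

preds = [1.2, 0.4, 2.9, 1.8]
targets = [1.0, 0.5, 3.0, 2.0]

metrics = compute_regression_metrics(preds, targets)
print(metrics.num_examples, metrics.mse, metrics.pearson_correlation)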
-
pytext.metrics.
compute_roc_auc
(predictions: Sequence[pytext.metrics.LabelPrediction], target_class: int = 0) → Optional[float][source]¶ Computes area under the Receiver Operating Characteristic curve, for binary classification. Implementation based off of (and explained at) https://www.ibm.com/developerworks/community/blogs/jfp/entry/Fast_Computation_of_AUC_ROC_score?lang=en.
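The rank-based shortcut referenced above can be sketched standalone (no tie handling; the function below is hypothetical, not the PyText implementation):

# Standalone sketch of ROC AUC via the rank-sum (Mann-Whitney U) identity:
# AUC = (sum of ranks of positives - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg).
def roc_auc(y_true, y_score):
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = {idx: rank + 1 for rank, idx in enumerate(order)}  # 1-based, ties ignored
    pos = [i for i, y in enumerate(y_true) if y == 1]
    n_pos, n_neg = len(pos), len(y_true) - len(pos)
    rank_sum = sum(ranks[i] for i in pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


print(roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.65, 0.35]))  # 1.0: every positive outranks every negative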
-
pytext.metrics.
compute_soft_metrics
(predictions: Sequence[pytext.metrics.LabelPrediction], label_names: Sequence[str], recall_at_precision_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9], precision_at_recall_thresholds: Sequence[float] = [0.2, 0.4, 0.6, 0.8, 0.9]) → Dict[str, pytext.metrics.SoftClassificationMetrics][source]¶ Computes soft classification metrics given a list of label predictions.
Parameters: - predictions – Label predictions, including the confidence score for each label.
- label_names – Indexed label names.
- recall_at_precision_thresholds – precision thresholds at which to calculate recall
- precision_at_recall_thresholds – recall thresholds at which to calculate precision
Returns: Dict from label strings to their corresponding soft metrics.
-
pytext.metrics.
precision_at_recall
(y_true_sorted: numpy.ndarray, y_score_sorted: numpy.ndarray, thresholds: Sequence[float]) → Tuple[Dict[float, float], Dict[float, float]][source]¶ Computes precision at various recall levels
Parameters: - y_true_sorted – Numpy array sorted according to decreasing confidence scores indicating whether each prediction is correct.
- y_score_sorted – Numpy array of confidence scores for the predictions in decreasing order.
- thresholds – Sequence of floats indicating the requested recall thresholds
Returns: Dictionary of maximum precision at requested recall thresholds. Dictionary of decision thresholds resulting in max precision at requested recall thresholds.
-
pytext.metrics.
recall_at_precision
(y_true_sorted: numpy.ndarray, y_score_sorted: numpy.ndarray, thresholds: Sequence[float]) → Dict[float, float][source]¶ Computes recall at various precision levels
Parameters: - y_true_sorted – Numpy array sorted according to decreasing confidence scores indicating whether each prediction is correct.
- y_score_sorted – Numpy array of confidence scores for the predictions in decreasing order.
- thresholds – Sequence of floats indicating the requested precision thresholds
Returns: Dictionary of maximum recall at requested precision thresholds.
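The threshold sweep behind recall_at_precision (and, symmetrically, precision_at_recall) can be sketched standalone; y_true_sorted is assumed to already be ordered by decreasing confidence, as in the functions above, and the helper name is hypothetical:

# Standalone sketch: sweep decision thresholds over predictions sorted by
# decreasing confidence and keep the best recall whose running precision
# meets each requested precision threshold.
def best_recall_at_precision(y_true_sorted, thresholds):
    total_pos = sum(y_true_sorted)
    best = {t: 0.0 for t in thresholds}
    tp = 0
    for i, correct in enumerate(y_true_sorted, start=1):
        tp += correct
        precision, recall = tp / i, tp / total_pos
        for t in thresholds:
            if precision >= t:
                best[t] = max(best[t], recall)
    return best


print(best_recall_at_precision([1, 1, 0, 1, 0, 0], thresholds=[0.6, 0.8]))
# {0.6: 1.0, 0.8: 0.666...}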
pytext.models package¶
Subpackages¶
pytext.models.decoders package¶
-
class
pytext.models.decoders.decoder_base.
DecoderBase
(config: pytext.config.pytext_config.ConfigBase)[source]¶ Bases:
pytext.models.module.Module
Base class for all decoder modules.
Parameters: config (ConfigBase) – Configuration object. -
in_dim
¶ Dimension of input Tensor passed to the decoder.
Type: int
-
out_dim
¶ Dimension of output Tensor produced by the decoder.
Type: int
-
forward
(*input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.decoders.intent_slot_model_decoder.
IntentSlotModelDecoder
(config: pytext.models.decoders.intent_slot_model_decoder.IntentSlotModelDecoder.Config, in_dim_doc: int, in_dim_word: int, out_dim_doc: int, out_dim_word: int)[source]¶ Bases:
pytext.models.decoders.decoder_base.DecoderBase
IntentSlotModelDecoder implements the decoder layer for intent-slot models. Intent-slot models jointly predict intent and slots from an utterance. At the core these models learn to jointly perform document classification and word tagging tasks.
IntentSlotModelDecoder accepts arguments for decoding both document classification and word tagging tasks, namely, in_dim_doc and in_dim_word.
Parameters: - config (type) – Configuration object of type IntentSlotModelDecoder.Config.
- in_dim_doc (type) – Dimension of input Tensor for projecting document representation.
- in_dim_word (type) – Dimension of input Tensor for projecting word representation.
- out_dim_doc (type) – Dimension of projected output Tensor for document classification.
- out_dim_word (type) – Dimension of projected output Tensor for word tagging.
-
use_doc_probs_in_word
¶ Whether to use intent probabilities for predicting slots.
Type: bool
-
doc_decoder
¶ Document/intent decoder module.
Type: type
-
word_decoder
¶ Word/slot decoder module.
Type: type
-
forward
(x_d: torch.Tensor, x_w: torch.Tensor, dense: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
pytext.models.decoders.mlp_decoder.
MLPDecoder
(config: pytext.models.decoders.mlp_decoder.MLPDecoder.Config, in_dim: int, out_dim: int = 0)[source]¶ Bases:
pytext.models.decoders.decoder_base.DecoderBase
MLPDecoder implements a fully connected network and uses ReLU as the activation function. The module projects an input tensor to out_dim.
Parameters: - config (Config) – Configuration object of type MLPDecoder.Config.
- in_dim (int) – Dimension of input Tensor passed to MLP.
- out_dim (int) – Dimension of output Tensor produced by MLP. Defaults to 0.
-
mlp
¶ Module that implements the MLP.
Type: type
-
out_dim
¶ Dimension of the output of this module.
Type: type
Dimensions of the outputs of hidden layers.
Type: List[int]
-
forward
(*input) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
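A hedged usage sketch, assuming MLPDecoder.Config exposes a hidden_dims field for the hidden-layer sizes:

# Hedged sketch: a decoder with one hidden layer projecting a 256-dim
# representation to 4 output classes. hidden_dims is an assumed Config field.
import torch
from pytext.models.decoders.mlp_decoder import MLPDecoder

decoder = MLPDecoder(
    config=MLPDecoder.Config(hidden_dims=[128]),  # one hidden layer of size 128
    in_dim=256,
    out_dim=4,
)

representation = torch.rand(8, 256)   # batch of 8 pooled document representations
logits = decoder(representation)
print(logits.shape)                   # expected: [8, 4]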
-
class
pytext.models.decoders.mlp_decoder_query_response.
MLPDecoderQueryResponse
(config: pytext.models.decoders.mlp_decoder_query_response.MLPDecoderQueryResponse.Config, from_dim: int, to_dim: int)[source]¶ Bases:
pytext.models.decoders.decoder_base.DecoderBase
Implements a ‘two-tower’ MLP: one for query and one for response. Used in search pairwise ranking: both pos_response and neg_response use the response MLP.
-
forward
(*x) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.decoders.
DecoderBase
(config: pytext.config.pytext_config.ConfigBase)[source]¶ Bases:
pytext.models.module.Module
Base class for all decoder modules.
Parameters: config (ConfigBase) – Configuration object. -
in_dim
¶ Dimension of input Tensor passed to the decoder.
Type: int
-
out_dim
¶ Dimension of output Tensor produced by the decoder.
Type: int
-
forward
(*input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.decoders.
MLPDecoder
(config: pytext.models.decoders.mlp_decoder.MLPDecoder.Config, in_dim: int, out_dim: int = 0)[source]¶ Bases:
pytext.models.decoders.decoder_base.DecoderBase
MLPDecoder implements a fully connected network and uses ReLU as the activation function. The module projects an input tensor to out_dim.
Parameters: - config (Config) – Configuration object of type MLPDecoder.Config.
- in_dim (int) – Dimension of input Tensor passed to MLP.
- out_dim (int) – Dimension of output Tensor produced by MLP. Defaults to 0.
-
mlp
¶ Module that implements the MLP.
Type: type
-
out_dim
¶ Dimension of the output of this module.
Type: type
Dimensions of the outputs of hidden layers.
Type: List[int]
-
forward
(*input) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
pytext.models.decoders.
IntentSlotModelDecoder
(config: pytext.models.decoders.intent_slot_model_decoder.IntentSlotModelDecoder.Config, in_dim_doc: int, in_dim_word: int, out_dim_doc: int, out_dim_word: int)[source]¶ Bases:
pytext.models.decoders.decoder_base.DecoderBase
IntentSlotModelDecoder implements the decoder layer for intent-slot models. Intent-slot models jointly predict intent and slots from an utterance. At the core these models learn to jointly perform document classification and word tagging tasks.
IntentSlotModelDecoder accepts arguments for decoding both document classification and word tagging tasks, namely, in_dim_doc and in_dim_word.
Parameters: - config (type) – Configuration object of type IntentSlotModelDecoder.Config.
- in_dim_doc (type) – Dimension of input Tensor for projecting document representation.
- in_dim_word (type) – Dimension of input Tensor for projecting word representation.
- out_dim_doc (type) – Dimension of projected output Tensor for document classification.
- out_dim_word (type) – Dimension of projected output Tensor for word tagging.
-
use_doc_probs_in_word
¶ Whether to use intent probabilities for predicting slots.
Type: bool
-
doc_decoder
¶ Document/intent decoder module.
Type: type
-
word_decoder
¶ Word/slot decoder module.
Type: type
-
forward
(x_d: torch.Tensor, x_w: torch.Tensor, dense: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
pytext.models.embeddings package¶
-
class
pytext.models.embeddings.char_embedding.
CharacterEmbedding
(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], highway_layers: int, projection_dim: Optional[int], *args, **kwargs)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
Module for character aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.
Implementation is loosely based on https://arxiv.org/abs/1508.06615.
Parameters: - num_embeddings (int) – Total number of characters (vocabulary size).
- embed_dim (int) – Size of character embeddings to be passed to convolutions.
- out_channels (int) – Number of output channels.
- kernel_sizes (List[int]) – Sizes of the convolution kernels.
- highway_layers (int) – Number of highway layers applied to pooled output.
- projection_dim (int) – If specified, size of output embedding for token, via a linear projection from convolution output.
-
char_embed
¶ Character embedding table.
Type: nn.Embedding
-
convs
¶ Convolution layers that operate on character embeddings.
Type: nn.ModuleList
-
highway_layers
¶ Highway layers on top of convolution output.
Type: nn.Module
-
projection
¶ Final linear layer to token embedding.
Type: nn.Module
-
embedding_dim
¶ Dimension of the final token embedding produced.
Type: int
-
forward
(chars: torch.Tensor) → torch.Tensor[source]¶ Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.
Parameters: chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length.
Returns: Embedded batch of sentences. Dimension: batch size X maximum sentence length, token embedding size. Token embedding size = out_channels * len(self.convs).
Return type: torch.Tensor
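A hedged usage sketch following the constructor and forward() contracts documented above; the character ids are random placeholders:

# Hedged sketch: shapes follow the documented forward() contract.
import torch
from pytext.models.embeddings.char_embedding import CharacterEmbedding

char_emb = CharacterEmbedding(
    num_embeddings=262,     # character vocabulary size
    embed_dim=16,           # per-character embedding size fed to the convolutions
    out_channels=32,        # filters per kernel size
    kernel_sizes=[2, 3],
    highway_layers=1,
    projection_dim=None,
)

# batch size 4, max sentence length 10, max word length 6
chars = torch.randint(0, 262, (4, 10, 6))
tokens = char_emb(chars)
print(tokens.shape)         # expected: [4, 10, out_channels * len(kernel_sizes)] = [4, 10, 64]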
-
classmethod
from_config
(config: pytext.config.field_config.CharFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, vocab_size: Optional[int] = None)[source]¶ Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.
Parameters: - config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
- metadata (FieldMeta) – Object containing this field’s metadata.
Returns: An instance of CharacterEmbedding.
Return type: type
-
class
pytext.models.embeddings.char_embedding.
Highway
(input_dim: int, num_layers: int = 1)[source]¶ Bases:
torch.nn.modules.module.Module
A Highway layer <https://arxiv.org/abs/1505.00387>. Adopted from the AllenNLP implementation.
-
forward
(x: torch.Tensor)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.embeddings.contextual_token_embedding.
ContextualTokenEmbedding
(embedding_dim: int)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
Module for providing token embeddings from a pretrained model.
-
forward
(embedding: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.embeddings.dict_embedding.
DictEmbedding
(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, pad_index: int = 1, unk_index: int = 0, mobile: bool = False)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
,torch.nn.modules.sparse.Embedding
Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per token discrete features that the module learns embeddings for. Example: For the utterance Order coffee from Starbucks, the dictionary features could be
[ {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}}, {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}} ]
Thus, for a given token there can be more than one dictionary feature, each of which has a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation such that the module produces an embedding vector per token.
Parameters: - num_embeddings (int) – Total number of dictionary features (vocabulary size).
- embed_dim (int) – Size of embedding vector.
- pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
-
pooling_type
¶ Type of pooling for combining the dictionary feature embeddings.
Type: PoolingType
-
find_and_replace
(tensor: torch.Tensor, find_val: int, replace_val: int) → torch.Tensor[source]¶ torch.where is not supported for mobile ONNX, this hack allows a mobile exported version of torch.where which is computationally more expensive
-
forward
(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]¶ Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.
Parameters: - feats (torch.Tensor) – Batch of sentences with dictionary feature ids. shape: [bsz, seq_len * max_feat_per_token]
- weights (torch.Tensor) – Batch of sentences with dictionary feature weights for the dictionary features. shape: [bsz, seq_len * max_feat_per_token]
- lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token. shape: [bsz, seq_len]
Returns: Embedded batch of sentences. Dimension: batch size X maximum sentence length, token embedding size. Token embedding size = embed_dim passed to the constructor.
Return type: torch.Tensor
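A hedged sketch of the expected tensor shapes, with one dictionary feature per token (so max_feat_per_token is 1); PoolingType.MEAN is assumed to be one of the available pooling options:

# Hedged sketch of DictEmbedding input/output shapes with made-up feature ids.
import torch
from pytext.config.module_config import PoolingType
from pytext.models.embeddings.dict_embedding import DictEmbedding

dict_emb = DictEmbedding(num_embeddings=50, embed_dim=8, pooling_type=PoolingType.MEAN)

bsz, seq_len = 2, 4
feats = torch.randint(2, 50, (bsz, seq_len))           # dictionary feature ids per token
weights = torch.ones(bsz, seq_len)                     # confidence of each feature
lengths = torch.ones(bsz, seq_len, dtype=torch.long)   # features per token

token_vectors = dict_emb(feats, weights, lengths)
print(token_vectors.shape)   # expected: [2, 4, 8]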
-
classmethod
from_config
(config: pytext.config.field_config.DictFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None)[source]¶ Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.
Parameters: - config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
- metadata (FieldMeta) – Object containing this field’s metadata.
Returns: An instance of DictEmbedding.
Return type: type
-
class
pytext.models.embeddings.embedding_base.
EmbeddingBase
(embedding_dim: int)[source]¶ Bases:
pytext.models.module.Module
Base class for token level embedding modules.
Parameters: embedding_dim (int) – Size of embedding vector. -
num_emb_modules
¶ Number of ways to embed a token.
Type: int
-
embedding_dim
¶ Size of embedding vector.
Type: int
-
-
class
pytext.models.embeddings.embedding_list.
EmbeddingList
(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
,torch.nn.modules.container.ModuleList
There is more than one way to embed a token, and this module provides a way to generate a list of sub-embeddings, concatenate the embedding tensors into a single Tensor, or return a tuple of Tensors that can be used by downstream modules.
Parameters: - embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
- concat (bool) – Whether to concatenate the embedding vectors emitted from embeddings modules.
-
num_emb_modules
¶ Number of flattened embeddings in embeddings, e.g: ((e1, e2), e3) has 3 in total
Type: int
-
input_start_indices
¶ List of indices of the sub-embeddings in the embedding list.
Type: List[int]
-
concat
¶ Whether to concatenate the embedding vectors emitted from embeddings modules.
Type: bool
-
embedding_dim
¶ Total embedding size, can be a single int or tuple of int depending on concat setting
-
forward
(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]¶ Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.
Parameters: *emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is either a Tensor or a tuple of Tensors. Returns: If concat is True, a Tensor is returned by concatenating all embeddings; otherwise all embeddings are returned in a tuple.
Return type: Union[torch.Tensor, Tuple[torch.Tensor]]
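A hedged sketch combining a word embedding and a character embedding for the same batch, assuming each sub-embedding consumes one positional input in the order the modules are passed:

# Hedged sketch: concatenate word-level and character-level token embeddings.
import torch
from pytext.models.embeddings import CharacterEmbedding, EmbeddingList, WordEmbedding

word_emb = WordEmbedding(num_embeddings=100, embedding_dim=32)
char_emb = CharacterEmbedding(
    num_embeddings=262, embed_dim=8, out_channels=16,
    kernel_sizes=[3], highway_layers=0, projection_dim=None,
)
embeddings = EmbeddingList([word_emb, char_emb], concat=True)

bsz, seq_len, word_len = 2, 5, 6
tokens = torch.randint(0, 100, (bsz, seq_len))
chars = torch.randint(0, 262, (bsz, seq_len, word_len))

combined = embeddings(tokens, chars)   # one input per sub-embedding, in order
print(combined.shape)                  # expected: [2, 5, 32 + 16]
print(embeddings.embedding_dim)        # total embedding size when concat=True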
-
get_param_groups_for_optimizer
() → List[Dict[str, torch.nn.parameter.Parameter]][source]¶ Organize child embedding parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer. The param_groups from each child embedding are aligned at the first (lowest) param_group.
-
class
pytext.models.embeddings.word_embedding.
WordEmbedding
(num_embeddings: int, embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, unk_token_idx: int = 0, mlp_layer_dims: List[int] = (), padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.
Note: Embedding weights for UNK token are always initialized to zeros.
Parameters: - num_embeddings (int) – Total number of words/tokens (vocabulary size).
- embedding_dim (int) – Size of embedding vector.
- embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
- init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
- unk_token_idx (int) – Index of UNK token in the word vocabulary.
- mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
classmethod
from_config
(config: pytext.config.field_config.WordFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]¶ Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.
Parameters: - config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
- metadata (FieldMeta) – Object containing this field’s metadata.
Returns: An instance of WordEmbedding.
Return type: type
-
class
pytext.models.embeddings.
EmbeddingBase
(embedding_dim: int)[source]¶ Bases:
pytext.models.module.Module
Base class for token level embedding modules.
Parameters: embedding_dim (int) – Size of embedding vector. -
num_emb_modules
¶ Number of ways to embed a token.
Type: int
-
embedding_dim
¶ Size of embedding vector.
Type: int
-
-
class
pytext.models.embeddings.
EmbeddingList
(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
,torch.nn.modules.container.ModuleList
There is more than one way to embed a token, and this module provides a way to generate a list of sub-embeddings, concatenate the embedding tensors into a single Tensor, or return a tuple of Tensors that can be used by downstream modules.
Parameters: - embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
- concat (bool) – Whether to concatenate the embedding vectors emitted from embeddings modules.
-
num_emb_modules
¶ Number of flattened embeddings in embeddings, e.g: ((e1, e2), e3) has 3 in total
Type: int
-
input_start_indices
¶ List of indices of the sub-embeddings in the embedding list.
Type: List[int]
-
concat
¶ Whether to concatenate the embedding vectors emitted from embeddings modules.
Type: bool
-
embedding_dim
¶ Total embedding size, can be a single int or tuple of int depending on concat setting
-
forward
(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]¶ Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.
Parameters: *emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is either a Tensor or a tuple of Tensors. Returns: If concat is True, a Tensor is returned by concatenating all embeddings; otherwise all embeddings are returned in a tuple.
Return type: Union[torch.Tensor, Tuple[torch.Tensor]]
-
get_param_groups_for_optimizer
() → List[Dict[str, torch.nn.parameter.Parameter]][source]¶ Organize child embedding parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer. The param_groups from each child embedding are aligned at the first (lowest) param_group.
-
class
pytext.models.embeddings.
WordEmbedding
(num_embeddings: int, embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, unk_token_idx: int = 0, mlp_layer_dims: List[int] = (), padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.
Note: Embedding weights for UNK token are always initialized to zeros.
Parameters: - num_embeddings (int) – Total number of words/tokens (vocabulary size).
- embedding_dim (int) – Size of embedding vector.
- embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
- init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
- unk_token_idx (int) – Index of UNK token in the word vocabulary.
- mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
classmethod
from_config
(config: pytext.config.field_config.WordFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]¶ Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.
Parameters: - config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
- metadata (FieldMeta) – Object containing this field’s metadata.
Returns: An instance of WordEmbedding.
Return type: type
-
class
pytext.models.embeddings.
DictEmbedding
(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, pad_index: int = 1, unk_index: int = 0, mobile: bool = False)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
,torch.nn.modules.sparse.Embedding
Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per token discrete features that the module learns embeddings for. Example: For the utterance Order coffee from Starbucks, the dictionary features could be
[ {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}}, {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}} ]
Thus, for a given token there can be more than one dictionary feature, each of which has a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation such that the module produces an embedding vector per token.
Parameters: - num_embeddings (int) – Total number of dictionary features (vocabulary size).
- embed_dim (int) – Size of embedding vector.
- pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
-
pooling_type
¶ Type of pooling for combining the dictionary feature embeddings.
Type: PoolingType
-
find_and_replace
(tensor: torch.Tensor, find_val: int, replace_val: int) → torch.Tensor[source]¶ torch.where is not supported for mobile ONNX, this hack allows a mobile exported version of torch.where which is computationally more expensive
-
forward
(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]¶ Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.
Parameters: - feats (torch.Tensor) – Batch of sentences with dictionary feature ids. shape: [bsz, seq_len * max_feat_per_token]
- weights (torch.Tensor) – Batch of sentences with dictionary feature weights for the dictionary features. shape: [bsz, seq_len * max_feat_per_token]
- lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token. shape: [bsz, seq_len]
Returns: Embedded batch of sentences. Dimension: batch size X maximum sentence length, token embedding size. Token embedding size = embed_dim passed to the constructor.
Return type: torch.Tensor
-
classmethod
from_config
(config: pytext.config.field_config.DictFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None)[source]¶ Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.
Parameters: - config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
- metadata (FieldMeta) – Object containing this field’s metadata.
Returns: An instance of DictEmbedding.
Return type: type
-
class
pytext.models.embeddings.
CharacterEmbedding
(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], highway_layers: int, projection_dim: Optional[int], *args, **kwargs)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
Module for character aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.
Implementation is loosely based on https://arxiv.org/abs/1508.06615.
Parameters: - num_embeddings (int) – Total number of characters (vocabulary size).
- embed_dim (int) – Size of character embeddings to be passed to convolutions.
- out_channels (int) – Number of output channels.
- kernel_sizes (List[int]) – Sizes of the convolution kernels.
- highway_layers (int) – Number of highway layers applied to pooled output.
- projection_dim (int) – If specified, size of output embedding for token, via a linear projection from convolution output.
-
char_embed
¶ Character embedding table.
Type: nn.Embedding
-
convs
¶ Convolution layers that operate on character embeddings.
Type: nn.ModuleList
-
highway_layers
¶ Highway layers on top of convolution output.
Type: nn.Module
-
projection
¶ Final linear layer to token embedding.
Type: nn.Module
-
embedding_dim
¶ Dimension of the final token embedding produced.
Type: int
-
forward
(chars: torch.Tensor) → torch.Tensor[source]¶ Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.
Parameters: chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length.
Returns: Embedded batch of sentences. Dimension: batch size X maximum sentence length, token embedding size. Token embedding size = out_channels * len(self.convs).
Return type: torch.Tensor
-
classmethod
from_config
(config: pytext.config.field_config.CharFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, vocab_size: Optional[int] = None)[source]¶ Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.
Parameters: - config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
- metadata (FieldMeta) – Object containing this field’s metadata.
Returns: An instance of CharacterEmbedding.
Return type: type
-
class
pytext.models.embeddings.
ContextualTokenEmbedding
(embedding_dim: int)[source]¶ Bases:
pytext.models.embeddings.embedding_base.EmbeddingBase
Module for providing token embeddings from a pretrained model.
-
forward
(embedding: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.ensembles package¶
-
class
pytext.models.ensembles.bagging_doc_ensemble.
BaggingDocEnsembleModel
(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]¶ Bases:
pytext.models.ensembles.ensemble.EnsembleModel
Ensemble class that uses bagging for ensembling document classification models.
-
class
pytext.models.ensembles.bagging_intent_slot_ensemble.
BaggingIntentSlotEnsembleModel
(config: pytext.models.ensembles.bagging_intent_slot_ensemble.BaggingIntentSlotEnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]¶ Bases:
pytext.models.ensembles.ensemble.EnsembleModel
Ensemble class that uses bagging for ensembling intent-slot models.
Parameters: - config (Config) – Configuration object specifying all the parameters of BaggingIntentSlotEnsemble.
- models (List[Model]) – List of intent-slot model objects.
-
use_crf
¶ Whether to use CRF for word tagging task.
Type: bool
-
output_layer
¶ Output layer of intent-slot model responsible for computing loss and predictions.
Type: IntentSlotOutputLayer
-
forward
(*args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶ Call forward() method of each intent-slot sub-model by passing all arguments and named arguments to the sub-models, collect the logits from them and average their values.
Returns: Logits from the ensemble. Return type: torch.Tensor
-
class
pytext.models.ensembles.ensemble.
EnsembleModel
(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]¶ Bases:
pytext.models.model.Model
Base class for ensemble models.
Parameters: - config (Config) – Configuration object specifying all the parameters of Ensemble.
- models (List[Model]) – List of sub-model objects.
-
output_layer
¶ Responsible for computing loss and predictions.
Type: OutputLayerBase
-
models
¶ ModuleList container for sub-model objects.
Type: nn.ModuleList
-
forward
(*args, **kwargs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
classmethod
from_config
(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], *args, **kwargs)[source]¶ Factory method to construct an instance of Ensemble or one of its derived classes from the module’s config object and tensorizers. It creates sub-models in the ensemble using the sub-models’ configurations.
Parameters: - config (Config) – Configuration object specifying all the parameters of Ensemble.
- tensorizers (Dict[str, Tensorizer]) – Tensorizer specifying all the parameters of the input features to the model.
Returns: An instance of Ensemble.
Return type: type
-
class
pytext.models.ensembles.
BaggingDocEnsembleModel
(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]¶ Bases:
pytext.models.ensembles.ensemble.EnsembleModel
Ensemble class that uses bagging for ensembling document classification models.
-
class
pytext.models.ensembles.
BaggingIntentSlotEnsembleModel
(config: pytext.models.ensembles.bagging_intent_slot_ensemble.BaggingIntentSlotEnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]¶ Bases:
pytext.models.ensembles.ensemble.EnsembleModel
Ensemble class that uses bagging for ensembling intent-slot models.
Parameters: - config (Config) – Configuration object specifying all the parameters of BaggingIntentSlotEnsemble.
- models (List[Model]) – List of intent-slot model objects.
-
use_crf
¶ Whether to use CRF for word tagging task.
Type: bool
-
output_layer
¶ Output layer of intent-slot model responsible for computing loss and predictions.
Type: IntentSlotOutputLayer
-
forward
(*args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶ Call forward() method of each intent-slot sub-model by passing all arguments and named arguments to the sub-models, collect the logits from them and average their values.
Returns: Logits from the ensemble. Return type: torch.Tensor
-
class
pytext.models.ensembles.
EnsembleModel
(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, models: List[pytext.models.model.Model], *args, **kwargs)[source]¶ Bases:
pytext.models.model.Model
Base class for ensemble models.
Parameters: - config (Config) – Configuration object specifying all the parameters of Ensemble.
- models (List[Model]) – List of sub-model objects.
-
output_layer
¶ Responsible for computing loss and predictions.
Type: OutputLayerBase
-
models
¶ ModuleList container for sub-model objects.
Type: nn.ModuleList
-
forward
(*args, **kwargs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
classmethod
from_config
(config: pytext.models.ensembles.ensemble.EnsembleModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], *args, **kwargs)[source]¶ Factory method to construct an instance of Ensemble or one of its derived classes from the module’s config object and tensorizers. It creates sub-models in the ensemble using the sub-models’ configurations.
Parameters: - config (Config) – Configuration object specifying all the parameters of Ensemble.
- tensorizers (Dict[str, Tensorizer]) – Tensorizer specifying all the parameters of the input features to the model.
Returns: An instance of Ensemble.
Return type: type
pytext.models.language_models package¶
-
class
pytext.models.language_models.lmlstm.
LMLSTM
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase = <pytext.config.field_config.WordFeatConfig object>, representation: pytext.models.representations.representation_base.RepresentationBase = <pytext.models.representations.bilstm.BiLSTM.Config object>, decoder: pytext.models.decoders.decoder_base.DecoderBase = <pytext.models.decoders.mlp_decoder.MLPDecoder.Config object>, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase = <pytext.models.output_layers.lm_output_layer.LMOutputLayer.Config object>, stateful: bool = False, exporter: object = <class 'pytext.exporters.exporter.ModelExporter'>)[source]¶ Bases:
pytext.models.model.BaseModel
LMLSTM implements a word-level language model that uses LSTMs to represent the document.
-
classmethod
checkTokenConfig
(tokens: Optional[pytext.data.tensorizers.TokenTensorizer.Config])[source]¶
-
forward
(tokens: torch.Tensor, seq_len: torch.Tensor) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
classmethod
from_config
(config: pytext.models.language_models.lmlstm.LMLSTM.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]¶
Initialize the hidden states of the LSTM if the language model is stateful.
Parameters: bsz (int) – Batch size. Returns: Initialized hidden state and cell state of the LSTM. Return type: Tuple[torch.Tensor, torch.Tensor]
-
classmethod
Wraps hidden states in new Tensors, to detach them from their history.
Parameters: hidden (Union[torch.Tensor, Tuple[torch.Tensor, ..]]) – Tensor or a tuple of tensors to repackage. Returns: Repackaged output Return type: Union[torch.Tensor, Tuple[torch.Tensor, ..]]
pytext.models.output_layers package¶
-
class
pytext.models.output_layers.distance_output_layer.
OutputScore
[source]¶ Bases:
enum.IntEnum
An enumeration.
-
norm_cosine
= 2¶
-
raw_cosine
= 1¶
-
sigmoid_cosine
= 3¶
-
-
class
pytext.models.output_layers.distance_output_layer.
PairwiseCosineDistanceOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Union[pytext.loss.loss.BinaryCrossEntropyLoss, pytext.loss.loss.CosineEmbeddingLoss, pytext.loss.loss.MAELoss, pytext.loss.loss.MSELoss, pytext.loss.loss.NLLLoss] = None, score_threshold: bool = 0.9, score_type: pytext.models.output_layers.distance_output_layer.OutputScore = <OutputScore.norm_cosine: 2>)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
-
classmethod
from_config
(config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]¶
-
get_loss
(logits: torch.Tensor, targets: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]¶ Compute and return the loss given logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by Model.
- target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
- reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Model loss.
Return type: torch.Tensor
-
get_pred
(logits: torch.Tensor, targets: torch.Tensor, *args, **kwargs)[source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by Model.
- targets (Optional[torch.Tensor]) – True label/target. Only used by LMOutputLayer. Defaults to None.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata by the DataHandler. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.doc_classification_output_layer.
BinaryClassificationOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer
-
export_to_caffe2
(workspace: <module 'caffe2.python.workspace' from '/home/docs/checkouts/readthedocs.org/user_builds/pytext/envs/stable/lib/python3.7/site-packages/caffe2/python/workspace.py'>, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ See OutputLayerBase.export_to_caffe2().
-
-
class
pytext.models.output_layers.doc_classification_output_layer.
ClassificationOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for document classification models. It supports CrossEntropyLoss and BinaryCrossEntropyLoss per document.
Parameters: loss_fn (Union[CrossEntropyLoss, BinaryCrossEntropyLoss]) – The loss function to use for computing loss. Defaults to None. -
loss_fn
¶ The loss function to use for computing loss.
-
classmethod
from_config
(config: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer.Config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]¶
-
get_pred
(logit, *args, **kwargs)[source]¶ Compute and return prediction and scores from the model.
Prediction is computed using argmax over the document label/target space.
Scores are sigmoid or softmax scores over the model logits depending on the loss component being used.
Parameters: logit (torch.Tensor) – Logits returned by DocModel.
Returns: Model prediction and scores. Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.doc_classification_output_layer.
ClassificationScores
(classes, score_function)[source]¶ Bases:
torch.jit.ScriptModule
-
class
pytext.models.output_layers.doc_classification_output_layer.
MultiLabelOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer
-
export_to_caffe2
(workspace: <module 'caffe2.python.workspace' from '/home/docs/checkouts/readthedocs.org/user_builds/pytext/envs/stable/lib/python3.7/site-packages/caffe2/python/workspace.py'>, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ See OutputLayerBase.export_to_caffe2().
-
-
class
pytext.models.output_layers.doc_classification_output_layer.
MulticlassOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer
-
export_to_caffe2
(workspace: <module 'caffe2.python.workspace' from '/home/docs/checkouts/readthedocs.org/user_builds/pytext/envs/stable/lib/python3.7/site-packages/caffe2/python/workspace.py'>, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ See OutputLayerBase.export_to_caffe2().
-
-
class
pytext.models.output_layers.doc_regression_output_layer.
RegressionOutputLayer
(loss_fn: pytext.loss.loss.MSELoss, squash_to_unit_range: bool = False)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for doc regression models. Currently only supports Mean Squared Error loss.
Parameters: - loss (MSELoss) – config for MSE loss
- squash_to_unit_range (bool) – whether to clamp the output to the range [0, 1], via a sigmoid.
-
classmethod
from_config
(config: pytext.models.output_layers.doc_regression_output_layer.RegressionOutputLayer.Config)[source]¶
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]¶ Compute regression loss from logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None. - reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Model loss.
Return type: torch.Tensor
-
get_pred
(logit, *args, **kwargs)[source]¶ Compute predictions and scores from the model (unlike in classification, where prediction = “most likely class” and scores = “log probs”, here these are the same values). If squash_to_unit_range is True, fit prediction to [0, 1] via a sigmoid.
Parameters: logit (torch.Tensor) – Logits returned from the model. Returns: Model prediction and scores. Return type: Tuple[torch.Tensor, torch.Tensor]
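A plain-PyTorch sketch of the squashing behaviour described above (illustrative only; shapes are made up):
import torch

logits = torch.randn(8, 1)                      # one regression output per document
squash_to_unit_range = True
preds = torch.sigmoid(logits) if squash_to_unit_range else logits
scores = preds                                  # for regression, prediction and score are the same values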
-
class
pytext.models.output_layers.intent_slot_output_layer.
IntentSlotOutputLayer
(doc_output: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer, word_output: pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for joint intent classification and slot-filling models. Intent classification is a document classification problem and slot filling is a word tagging problem; thus these terms are used interchangeably in the documentation.
Parameters: - doc_output (ClassificationOutputLayer) – Output layer for intent
classification task. See
ClassificationOutputLayer
for details. - word_output (WordTaggingOutputLayer) – Output layer for slot filling task.
See
WordTaggingOutputLayer
for details.
-
doc_output
¶ Output layer for intent classification task.
Type: type
-
word_output
¶ Output layer for slot filling task.
Type: type
-
export_to_caffe2
(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: List[torch.Tensor], doc_out_name: str, word_out_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the intent slot output layer to Caffe2. See OutputLayerBase.export_to_caffe2() for details.
-
classmethod
from_config
(config: pytext.models.output_layers.intent_slot_output_layer.IntentSlotOutputLayer.Config, doc_labels: pytext.data.utils.Vocabulary, word_labels: pytext.data.utils.Vocabulary)[source]¶
-
get_loss
(logits: Tuple[torch.Tensor, torch.Tensor], targets: Tuple[torch.Tensor, torch.Tensor], context: Dict[str, Any] = None, *args, **kwargs) → torch.Tensor[source]¶ Compute and return the averaged intent and slot-filling loss.
Parameters: - logit (Tuple[torch.Tensor, torch.Tensor]) – Logits returned by
JointModel
. It is a tuple containing logits for intent classification and slot filling. - targets (Tuple[torch.Tensor, torch.Tensor]) – Tuple of target Tensors containing true document label/target and true word labels/targets.
- context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns: Averaged intent and slot loss.
Return type: torch.Tensor
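Conceptually, the averaging of the two losses looks like the following plain-PyTorch sketch (made-up shapes; not the PyText code, which delegates to the doc and word output layers):
import torch
import torch.nn.functional as F

intent_logits = torch.randn(2, 4)                       # 2 utterances, 4 intents
slot_logits = torch.randn(2, 5, 6)                      # 2 utterances, 5 tokens, 6 slot tags
intent_targets = torch.tensor([1, 3])
slot_targets = torch.randint(0, 6, (2, 5))

intent_loss = F.cross_entropy(intent_logits, intent_targets)
slot_loss = F.cross_entropy(slot_logits.view(-1, 6), slot_targets.view(-1))
joint_loss = (intent_loss + slot_loss) / 2              # averaged intent and slot loss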
-
get_pred
(logits: Tuple[torch.Tensor, torch.Tensor], targets: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (Tuple[torch.Tensor, torch.Tensor]) – Logits returned by
JointModel
. It is a tuple containing logits for intent classification and slot filling. - targets (Optional[torch.Tensor]) – Not applicable. Defaults to None.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
class
pytext.models.output_layers.intent_slot_output_layer.
IntentSlotScores
(doc_scores: torch.jit.ScriptModule, word_scores: torch.jit.ScriptModule)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(logits: Tuple[torch.Tensor, torch.Tensor], context: Dict[str, torch.Tensor]) → Tuple[List[Dict[str, float]], List[List[Dict[str, float]]]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.output_layers.lm_output_layer.
LMOutputLayer
(target_names: List[str], loss_fn: pytext.loss.loss.Loss = None, config=None, pad_token_idx=-100)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for language models. It supports CrossEntropyLoss per word.
Parameters: loss_fn (CrossEntropyLoss) – Cross-entropy loss component. Defaults to None. -
loss_fn
¶ Cross-entropy loss component for computing loss.
-
classmethod
from_config
(config: pytext.models.output_layers.lm_output_layer.LMOutputLayer.Config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]¶
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Dict[str, Any], reduce=True) → torch.Tensor[source]¶ Compute word prediction loss by comparing prediction of each word in the sentence with the true word.
Parameters: - logit (torch.Tensor) – Logit returned by
LMLSTM
. - targets (torch.Tensor) – Not applicable for language models.
- context (Dict[str, Any]) – Not applicable. Defaults to None.
- reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Word prediction loss.
Return type: torch.Tensor
-
get_pred
(logits: torch.Tensor, *args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute and return prediction and scores from the model. Prediction is computed using argmax over the word label/target space. Scores are softmax scores over the model logits.
Parameters: - logits (torch.Tensor) – Logits returned by
LMLSTM
. - targets (torch.Tensor) – True words.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.output_layer_base.
OutputLayerBase
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.module.Module
Base class for all output layers in PyText. The responsibilities of this layer are
- Implement how loss is computed from logits and targets.
- Implement how to get predictions from logits.
- Implement the Caffe2 operator for performing the above tasks. This is used when PyText exports a PyTorch model to Caffe2.
Parameters: loss_fn (type) – The loss function object to use for computing loss. Defaults to None. -
loss_fn
¶ The loss function object to use for computing loss.
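A minimal sketch of a custom output layer built on this interface (assuming only the get_loss/get_pred contract documented here; export_to_caffe2 is left at its no-op default, and the layer itself is a toy example):
import torch
from pytext.models.output_layers import OutputLayerBase


class SquaredErrorOutputLayer(OutputLayerBase):
    # Toy example: mean squared error loss, identity predictions/scores.
    def get_loss(self, logit, target, context=None, reduce=True):
        loss = (logit.squeeze(-1) - target.float()) ** 2
        return loss.mean() if reduce else loss

    def get_pred(self, logit, targets=None, context=None):
        return logit, logit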
-
export_to_caffe2
(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the output layer to Caffe2 by manually adding the necessary operators to the init_net and predict_net, and returns the list of external output blobs to be added to the model. By default this does nothing, so any sub-class must override this method (if necessary).
To learn about Caffe2 computation graphs and why we need two networks, init_net and predict_net/exec_net read https://caffe2.ai/docs/intro-tutorial#null__nets-and-operators.
Parameters: - workspace (core.workspace) – Caffe2 workspace to use for adding the operator. See https://caffe2.ai/docs/workspace.html to learn about Caffe2 workspace.
- init_net (core.Net) – Caffe2 init_net to add the operator to.
- predict_net (core.Net) – Caffe2 predict_net to add the operator to.
- model_out (torch.Tensor) – Output logit Tensor from the model.
- output_name (str) – Name of model_out to use in Caffe2 net.
- label_names (List[str]) – List of names of the targets/labels to expose from the Caffe2 net.
Returns: List of output blobs that the output_layer generates.
Return type: List[core.BlobReference]
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]¶ Compute and return the loss given logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None. - reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Model loss.
Return type: torch.Tensor
-
get_pred
(logit: torch.Tensor, targets: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - targets (Optional[torch.Tensor]) – True label/target. Only used by
LMOutputLayer
. Defaults to None. - context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
class
pytext.models.output_layers.pairwise_ranking_output_layer.
PairwiseRankingOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
-
get_pred
(logit, targets, context)[source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - targets (Optional[torch.Tensor]) – True label/target. Only used by
LMOutputLayer
. Defaults to None. - context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.squad_output_layer.
SquadOutputLayer
(loss_fn: pytext.loss.loss.Loss, ignore_impossible: bool = True, pos_loss_weight: float = 0.5, has_answer_loss_weight: float = 0.5, has_answer_labels: Iterable[str] = ('False', 'True'), false_label: str = 'False', max_answer_len: int = 30, hard_weight: float = 0.0, is_kd: bool = False)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
-
classmethod
from_config
(config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[Iterable[str]] = None, is_kd: bool = False)[source]¶
-
get_loss
(logits: Tuple[torch.Tensor, ...], targets: Tuple[torch.Tensor, ...], contexts: Optional[Dict[str, Any]] = None, *args, **kwargs) → torch.Tensor[source]¶ Compute and return the loss given logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model loss.
Return type: torch.Tensor
-
get_position_preds
(start_pos_logits: torch.Tensor, end_pos_logits: torch.Tensor, max_span_length: int)[source]¶
-
get_pred
(logits: torch.Tensor, targets: torch.Tensor, contexts: Dict[str, List[Any]]) → Tuple[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor]][source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - targets (Optional[torch.Tensor]) – True label/target. Only used by
LMOutputLayer
. Defaults to None. - context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.utils.
OutputLayerUtils
[source]¶ Bases:
object
-
static
gen_additional_blobs
(predict_net: caffe2.python.core.Net, probability_out, model_out: torch.Tensor, output_name: str, label_names: List[str]) → List[caffe2.python.core.BlobReference][source]¶ Utility method to generate additional blobs for human readable result for models that use explicit labels.
-
-
class
pytext.models.output_layers.word_tagging_output_layer.
CRFOutputLayer
(num_tags, labels: pytext.data.utils.Vocabulary, *args)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for word tagging models that use Conditional Random Field.
Parameters: num_tags (int) – Total number of possible word tags.
-
export_to_caffe2
(workspace: <module 'caffe2.python.workspace' from '/home/docs/checkouts/readthedocs.org/user_builds/pytext/envs/stable/lib/python3.7/site-packages/caffe2/python/workspace.py'>, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the CRF output layer to Caffe2. See OutputLayerBase.export_to_caffe2() for details.
-
classmethod
from_config
(config: pytext.config.component.ComponentMeta.__new__.<locals>.Config, labels: pytext.data.utils.Vocabulary)[source]¶
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Dict[str, Any], reduce=True)[source]¶ Compute word tagging loss by using CRF.
Parameters: - logit (torch.Tensor) – Logit returned by
WordTaggingModel
. - targets (torch.Tensor) – True word labels/targets.
- context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
- reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Word tagging loss.
Return type: torch.Tensor
-
get_pred
(logit: torch.Tensor, target: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None)[source]¶ Compute and return prediction and scores from the model.
Prediction is computed using CRF decoding.
Scores are softmax scores over the model logits, where the logits are computed by rearranging the word logits such that the decoded word tag has the highest-valued logit. This is done because, with a CRF, the decoded tag for a given word may not be the tag with the highest-valued logit. In order for argmax to work, we rearrange the logit values.
Parameters: - logit (torch.Tensor) – Logits returned by
WordTaggingModel
. - target (torch.Tensor) – Not applicable. Defaults to None.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
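The rearrangement can be sketched in plain PyTorch as a per-token swap between the highest-valued logit and the logit of the CRF-decoded tag (illustrative only; the decoded tags below are random stand-ins for real CRF decoding):
import torch
import torch.nn.functional as F

word_logits = torch.randn(2, 3, 4)               # batch x seq_len x num_tags
crf_pred = torch.randint(0, 4, (2, 3))           # stand-in for CRF-decoded tags

rearranged = word_logits.clone()
max_vals, max_idx = word_logits.max(dim=-1)
decoded_vals = word_logits.gather(-1, crf_pred.unsqueeze(-1)).squeeze(-1)
rearranged.scatter_(-1, crf_pred.unsqueeze(-1), max_vals.unsqueeze(-1))
rearranged.scatter_(-1, max_idx.unsqueeze(-1), decoded_vals.unsqueeze(-1))
scores = F.softmax(rearranged, dim=-1)           # argmax over rearranged now matches crf_pred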
-
class
pytext.models.output_layers.word_tagging_output_layer.
CRFWordTaggingScores
(classes: List[str], crf)[source]¶ Bases:
pytext.models.output_layers.word_tagging_output_layer.WordTaggingScores
-
forward
(logits: torch.Tensor, context: Dict[str, torch.Tensor]) → List[List[Dict[str, float]]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.output_layers.word_tagging_output_layer.
WordTaggingOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for word tagging models. It supports CrossEntropyLoss per word.
Parameters: loss_fn (CrossEntropyLoss) – Cross-entropy loss component. Defaults to None. -
loss_fn
¶ Cross-entropy loss component.
-
export_to_caffe2
(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the word tagging output layer to Caffe2.
-
classmethod
from_config
(config: pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer.Config, labels: pytext.data.utils.Vocabulary)[source]¶
-
get_loss
(logit: torch.Tensor, target: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor]], context: Dict[str, Any], reduce: bool = True) → torch.Tensor[source]¶ Compute word tagging loss by comparing prediction of each word in the sentence with its true label/target.
Parameters: - logit (torch.Tensor) – Logit returned by
WordTaggingModel
. - targets (torch.Tensor) – True word labels/targets.
- context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
- reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Word tagging loss for all words in the sentence.
Return type: torch.Tensor
-
get_pred
(logit: torch.Tensor, *args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute and return prediction and scores from the model. Prediction is computed using argmax over the word label/target space. Scores are softmax scores over the model logits.
Parameters: logit (torch.Tensor) – Logits returned by WordTaggingModel.
Returns: Model prediction and scores. Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.word_tagging_output_layer.
WordTaggingScores
(classes)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(logits: torch.Tensor, context: Optional[Dict[str, torch.Tensor]] = None) → List[List[Dict[str, float]]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.output_layers.
OutputLayerBase
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.module.Module
Base class for all output layers in PyText. The responsibilities of this layer are
- Implement how loss is computed from logits and targets.
- Implement how to get predictions from logits.
- Implement the Caffe2 operator for performing the above tasks. This is used when PyText exports a PyTorch model to Caffe2.
Parameters: loss_fn (type) – The loss function object to use for computing loss. Defaults to None. -
loss_fn
¶ The loss function object to use for computing loss.
-
export_to_caffe2
(workspace: caffe2.python.workspace, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the output layer to Caffe2 by manually adding the necessary operators to the init_net and predict_net, and returns the list of external output blobs to be added to the model. By default this does nothing, so any sub-class must override this method (if necessary).
To learn about Caffe2 computation graphs and why we need two networks, init_net and predict_net/exec_net read https://caffe2.ai/docs/intro-tutorial#null__nets-and-operators.
Parameters: - workspace (core.workspace) – Caffe2 workspace to use for adding the operator. See https://caffe2.ai/docs/workspace.html to learn about Caffe2 workspace.
- init_net (core.Net) – Caffe2 init_net to add the operator to.
- predict_net (core.Net) – Caffe2 predict_net to add the operator to.
- model_out (torch.Tensor) – Output logit Tensor from the model.
- output_name (str) – Name of model_out to use in Caffe2 net.
- label_names (List[str]) – List of names of the targets/labels to expose from the Caffe2 net.
Returns: List of output blobs that the output_layer generates.
Return type: List[core.BlobReference]
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]¶ Compute and return the loss given logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None. - reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Model loss.
Return type: torch.Tensor
-
get_pred
(logit: torch.Tensor, targets: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - targets (Optional[torch.Tensor]) – True label/target. Only used by
LMOutputLayer
. Defaults to None. - context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
class
pytext.models.output_layers.
CRFOutputLayer
(num_tags, labels: pytext.data.utils.Vocabulary, *args)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for word tagging models that use Conditional Random Field.
Parameters: num_tags (int) – Total number of possible word tags.
-
export_to_caffe2
(workspace: <module 'caffe2.python.workspace' from '/home/docs/checkouts/readthedocs.org/user_builds/pytext/envs/stable/lib/python3.7/site-packages/caffe2/python/workspace.py'>, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the CRF output layer to Caffe2. See OutputLayerBase.export_to_caffe2() for details.
-
classmethod
from_config
(config: pytext.config.component.ComponentMeta.__new__.<locals>.Config, labels: pytext.data.utils.Vocabulary)[source]¶
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Dict[str, Any], reduce=True)[source]¶ Compute word tagging loss by using CRF.
Parameters: - logit (torch.Tensor) – Logit returned by
WordTaggingModel
. - targets (torch.Tensor) – True word labels/targets.
- context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
- reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Word tagging loss.
Return type: torch.Tensor
-
get_pred
(logit: torch.Tensor, target: Optional[torch.Tensor] = None, context: Optional[Dict[str, Any]] = None)[source]¶ Compute and return prediction and scores from the model.
Prediction is computed using CRF decoding.
Scores are softmax scores over the model logits, where the logits are computed by rearranging the word logits such that the decoded word tag has the highest-valued logit. This is done because, with a CRF, the decoded tag for a given word may not be the tag with the highest-valued logit. In order for argmax to work, we rearrange the logit values.
Parameters: - logit (torch.Tensor) – Logits returned by
WordTaggingModel
. - target (torch.Tensor) – Not applicable. Defaults to None.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
class
pytext.models.output_layers.
ClassificationOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for document classification models. It supports CrossEntropyLoss and BinaryCrossEntropyLoss per document.
Parameters: loss_fn (Union[CrossEntropyLoss, BinaryCrossEntropyLoss]) – The loss function to use for computing loss. Defaults to None. -
loss_fn
¶ The loss function to use for computing loss.
-
classmethod
from_config
(config: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer.Config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]¶
-
get_pred
(logit, *args, **kwargs)[source]¶ Compute and return prediction and scores from the model.
Prediction is computed using argmax over the document label/target space.
Scores are sigmoid or softmax scores over the model logits depending on the loss component being used.
Parameters: logit (torch.Tensor) – Logits returned by DocModel.
Returns: Model prediction and scores. Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.
RegressionOutputLayer
(loss_fn: pytext.loss.loss.MSELoss, squash_to_unit_range: bool = False)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for doc regression models. Currently only supports Mean Squared Error loss.
Parameters: - loss (MSELoss) – config for MSE loss
- squash_to_unit_range (bool) – whether to clamp the output to the range [0, 1], via a sigmoid.
-
classmethod
from_config
(config: pytext.models.output_layers.doc_regression_output_layer.RegressionOutputLayer.Config)[source]¶
-
get_loss
(logit: torch.Tensor, target: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]¶ Compute regression loss from logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None. - reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Model loss.
Return type: torch.Tensor
-
get_pred
(logit, *args, **kwargs)[source]¶ Compute predictions and scores from the model (unlike in classification, where prediction = “most likely class” and scores = “log probs”, here these are the same values). If squash_to_unit_range is True, fit prediction to [0, 1] via a sigmoid.
Parameters: logit (torch.Tensor) – Logits returned from the model. Returns: Model prediction and scores. Return type: Tuple[torch.Tensor, torch.Tensor]
-
class
pytext.models.output_layers.
WordTaggingOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
Output layer for word tagging models. It supports CrossEntropyLoss per word.
Parameters: loss_fn (CrossEntropyLoss) – Cross-entropy loss component. Defaults to None. -
loss_fn
¶ Cross-entropy loss component.
-
export_to_caffe2
(workspace: <module 'caffe2.python.workspace' from '/home/docs/checkouts/readthedocs.org/user_builds/pytext/envs/stable/lib/python3.7/site-packages/caffe2/python/workspace.py'>, init_net: caffe2.python.core.Net, predict_net: caffe2.python.core.Net, model_out: torch.Tensor, output_name: str) → List[caffe2.python.core.BlobReference][source]¶ Exports the word tagging output layer to Caffe2.
-
classmethod
from_config
(config: pytext.models.output_layers.word_tagging_output_layer.WordTaggingOutputLayer.Config, labels: pytext.data.utils.Vocabulary)[source]¶
-
get_loss
(logit: torch.Tensor, target: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, torch.Tensor]], context: Dict[str, Any], reduce: bool = True) → torch.Tensor[source]¶ Compute word tagging loss by comparing prediction of each word in the sentence with its true label/target.
Parameters: - logit (torch.Tensor) – Logit returned by
WordTaggingModel
. - targets (torch.Tensor) – True word labels/targets.
- context (Dict[str, Any]) – Context is a dictionary of items that’s passed as additional metadata. Defaults to None.
- reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Word tagging loss for all words in the sentence.
Return type: torch.Tensor
-
get_pred
(logit: torch.Tensor, *args, **kwargs) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute and return prediction and scores from the model. Prediction is computed using argmax over the word label/target space. Scores are softmax scores over the model logits.
Parameters: logit (torch.Tensor) – Logits returned by WordTaggingModel.
Returns: Model prediction and scores. Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.
PairwiseRankingOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Optional[pytext.loss.loss.Loss] = None, *args, **kwargs)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
-
get_pred
(logit, targets, context)[source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - targets (Optional[torch.Tensor]) – True label/target. Only used by
LMOutputLayer
. Defaults to None. - context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.
PairwiseCosineDistanceOutputLayer
(target_names: Optional[List[str]] = None, loss_fn: Union[pytext.loss.loss.BinaryCrossEntropyLoss, pytext.loss.loss.CosineEmbeddingLoss, pytext.loss.loss.MAELoss, pytext.loss.loss.MSELoss, pytext.loss.loss.NLLLoss] = None, score_threshold: bool = 0.9, score_type: pytext.models.output_layers.distance_output_layer.OutputScore = <OutputScore.norm_cosine: 2>)[source]¶ Bases:
pytext.models.output_layers.output_layer_base.OutputLayerBase
-
classmethod
from_config
(config, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None)[source]¶
-
get_loss
(logits: torch.Tensor, targets: torch.Tensor, context: Optional[Dict[str, Any]] = None, reduce: bool = True) → torch.Tensor[source]¶ Compute and return the loss given logits and targets.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - target (torch.Tensor) – True label/target to compute loss against.
- context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None. - reduce (bool) – Whether to reduce loss over the batch. Defaults to True.
Returns: Model loss.
Return type: torch.Tensor
-
get_pred
(logits: torch.Tensor, targets: torch.Tensor, *args, **kwargs)[source]¶ Compute and return prediction and scores from the model.
Parameters: - logit (torch.Tensor) – Logits returned by
Model
. - targets (Optional[torch.Tensor]) – True label/target. Only used by
LMOutputLayer
. Defaults to None. - context (Optional[Dict[str, Any]]) – Context is a dictionary of items
that’s passed as additional metadata by the
DataHandler
. Defaults to None.
Returns: Model prediction and scores.
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.output_layers.
OutputLayerUtils
[source]¶ Bases:
object
-
static
gen_additional_blobs
(predict_net: caffe2.python.core.Net, probability_out, model_out: torch.Tensor, output_name: str, label_names: List[str]) → List[caffe2.python.core.BlobReference][source]¶ Utility method to generate additional blobs for human readable result for models that use explicit labels.
-
pytext.models.qna package¶
-
class
pytext.models.qna.bert_squad_qa.
BertSquadQAModel
(encoder: torch.nn.modules.module.Module, decoder: torch.nn.modules.module.Module, has_ans_decoder: torch.nn.modules.module.Module, output_layer: torch.nn.modules.module.Module, stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>, is_kd: bool = False)[source]¶ Bases:
pytext.models.bert_classification_models.NewBertModel
-
forward
(*inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.qna.dr_qa.
DrQAModel
(dropout: torch.nn.modules.module.Module, embedding: torch.nn.modules.module.Module, ques_rnn: torch.nn.modules.module.Module, doc_rnn: torch.nn.modules.module.Module, ques_self_attn: torch.nn.modules.module.Module, ques_aligned_doc_attn: torch.nn.modules.module.Module, start_attn: torch.nn.modules.module.Module, end_attn: torch.nn.modules.module.Module, doc_rep_pool: torch.nn.modules.module.Module, has_ans_decoder: torch.nn.modules.module.Module, output_layer: torch.nn.modules.module.Module, is_kd: bool = False)[source]¶ Bases:
pytext.models.model.BaseModel
-
classmethod
create_embedding
(model_config: pytext.models.qna.dr_qa.DrQAModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]¶
-
forward
(doc_tokens: torch.Tensor, doc_seq_len: torch.Tensor, doc_mask: torch.Tensor, ques_tokens: torch.Tensor, ques_seq_len: torch.Tensor, ques_mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.representations package¶
-
class
pytext.models.representations.transformer.multihead_attention.
MultiheadSelfAttention
(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)[source]¶ Bases:
torch.nn.modules.module.Module
This is a TorchScriptable implementation of MultiheadAttention from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.
The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.
-
forward
(query, key_padding_mask)[source]¶ Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape: batch x source_length, where padding elements are indicated by 1s.
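One common way to build such a key_padding_mask from per-sequence lengths (a sketch; whether the caller expects a bool or uint8 tensor depends on the surrounding code):
import torch

lengths = torch.tensor([5, 3, 4])                       # batch of 3 sequences
max_len = int(lengths.max())
positions = torch.arange(max_len).unsqueeze(0)          # 1 x src_len
key_padding_mask = positions >= lengths.unsqueeze(1)    # batch x src_len, True/1 at padding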
-
-
class
pytext.models.representations.transformer.positional_embedding.
PositionalEmbedding
(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)[source]¶ Bases:
torch.nn.modules.module.Module
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on pad_index or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.
This is a TorchScriptable implementation of PositionalEmbedding from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.
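The pad-index offsetting follows the usual fairseq convention, which can be sketched as (illustrative only; pad_index = 1 is used here just for the example):
import torch

pad_index = 1
tokens = torch.tensor([[5, 8, 9, pad_index, pad_index],
                       [7, 6, pad_index, pad_index, pad_index]])
mask = tokens.ne(pad_index).long()
positions = torch.cumsum(mask, dim=1) * mask + pad_index   # real tokens count up from pad_index + 1; padding stays at pad_index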
-
class
pytext.models.representations.transformer.residual_mlp.
GeLU
[source]¶ Bases:
torch.nn.modules.module.Module
Component class to wrap F.gelu.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer.residual_mlp.
ResidualMLP
(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)[source]¶ Bases:
torch.nn.modules.module.Module
A square MLP component which can learn a bias on an input vector. This MLP in particular defaults to using GeLU as its activation function (this can be changed by passing a different activation function), and retains a residual connection to its original input to help with gradient propagation.
Unlike pytext’s MLPDecoder it doesn’t currently allow adding a LayerNorm in between hidden layers.
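A minimal plain-PyTorch sketch of the idea (a GeLU MLP whose output is added back to its input; this is not the PyText class itself):
import torch
from torch import nn

class TinyResidualMLP(nn.Module):
    def __init__(self, dim: int, hidden: int, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.mlp(x) + x    # residual connection to the original input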
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer.sentence_encoder.
SentenceEncoder
(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]¶ Bases:
torch.nn.modules.module.Module
This is a TorchScriptable implementation of RoBERTa from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa model, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.
This SentenceEncoder can load in the public RoBERTa weights directly with load_roberta_state_dict, which will translate the keys as they exist in the publicly released RoBERTa to the correct structure for this implementation. The default constructor value will have the same size and shape as that model.
To use RoBERTa with this, download the RoBERTa public weights as roberta.weights
>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)
Within this you will still need to preprocess inputs using fairseq and the publicly released vocabs, and finally place this encoder in a model alongside say an MLP output layer to do classification.
-
forward
(tokens)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
pytext.models.representations.transformer.sentence_encoder.
merge_input_projection
(state)[source]¶ New fairseq checkpoints split the multihead attention in_projection into separate k, v, q projections. This function merges them back to make such checkpoints compatible.
-
pytext.models.representations.transformer.sentence_encoder.
remove_state_keys
(state, keys_regex)[source]¶ Remove keys from state that match a regex
-
pytext.models.representations.transformer.sentence_encoder.
rename_component_from_root
(state, old_name, new_name)[source]¶ Rename keys from state using full python paths
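The kind of state-dict manipulation these helpers perform can be sketched as follows (illustrative only; the key names below are hypothetical, and the real functions operate on full python paths):
import re
import torch

state = {"decoder.weight": torch.zeros(2), "encoder.layers.0.weight": torch.zeros(2)}

# remove_state_keys-style: drop keys matching a regex
state = {k: v for k, v in state.items() if not re.search(r"^decoder\.", k)}

# rename_component_from_root-style: move a component to a new root name
state = {k.replace("encoder.", "transformer.", 1): v for k, v in state.items()}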
-
class
pytext.models.representations.transformer.transformer.
Transformer
(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(tokens: torch.Tensor) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer.transformer.
TransformerLayer
(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(input, key_padding_mask)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
This directory contains modules for implementing a productionized RoBERTa model. These modules implement the same Transformer components that are implemented in the fairseq library, however they’re distilled down to just the elements which are used in the final RoBERTa model, and within that are restructured and rewritten to be able to be compiled by TorchScript for production use cases.
The SentenceEncoder specifically can be used to load model weights directly from the publicly released RoBERTa weights, and it will translate these weights to the corresponding values in this implementation.
-
class
pytext.models.representations.transformer.
MultiheadSelfAttention
(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)[source]¶ Bases:
torch.nn.modules.module.Module
This is a TorchScriptable implementation of MultiheadAttention from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.
The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.
-
forward
(query, key_padding_mask)[source]¶ Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape: batch x source_length, where padding elements are indicated by 1s.
-
-
class
pytext.models.representations.transformer.
PositionalEmbedding
(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)[source]¶ Bases:
torch.nn.modules.module.Module
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on pad_index or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.
This is a TorchScriptable implementation of PositionalEmbedding from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.
-
class
pytext.models.representations.transformer.
ResidualMLP
(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)[source]¶ Bases:
torch.nn.modules.module.Module
A square MLP component which can learn a bias on an input vector. This MLP in particular defaults to using GeLU as its activation function (this can be changed by passing a different activation function), and retains a residual connection to its original input to help with gradient propagation.
Unlike pytext’s MLPDecoder it doesn’t currently allow adding a LayerNorm in between hidden layers.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer.
SentenceEncoder
(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]¶ Bases:
torch.nn.modules.module.Module
This is a TorchScriptable implementation of RoBERTa from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa model, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.
This SentenceEncoder can load in the public RoBERTa weights directly with load_roberta_state_dict, which will translate the keys as they exist in the publicly released RoBERTa to the correct structure for this implementation. The default constructor value will have the same size and shape as that model.
To use RoBERTa with this, download the RoBERTa public weights as roberta.weights
>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)
Within this you will still need to preprocess inputs using fairseq and the publicly released vocabs, and finally place this encoder in a model alongside say an MLP output layer to do classification.
-
forward
(tokens)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer.
Transformer
(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(tokens: torch.Tensor) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer.
TransformerLayer
(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(input, key_padding_mask)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.attention.
DotProductSelfAttention
(input_dim)[source]¶ Bases:
pytext.models.module.Module
Given vector w and token vectors = {t1, t2, …, t_n}, compute self-attention weights to weigh the tokens: a_j = softmax(w . t_j)
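The weight computation can be sketched in plain PyTorch (illustrative shapes; w stands for the learned projection vector):
import torch
import torch.nn.functional as F

tokens = torch.randn(2, 6, 32)                   # batch x seq_len x dim: t_1 .. t_n
w = torch.randn(32)                              # the vector w
attn = F.softmax(tokens @ w, dim=-1)             # a_j = softmax(w . t_j), batch x seq_len
pooled = torch.bmm(attn.unsqueeze(1), tokens).squeeze(1)   # attention-weighted sum of the tokens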
-
class
pytext.models.representations.attention.
MultiplicativeAttention
(p_hidden_dim, q_hidden_dim, normalize)[source]¶ Bases:
pytext.models.module.Module
Given sequence P and vector q, computes attention weights for each element in P by matching q with each element in P using multiplicative attention: a_i = softmax(p_i . W . q)
-
class
pytext.models.representations.attention.
SequenceAlignedAttention
(proj_dim)[source]¶ Bases:
pytext.models.module.Module
Given sequences P and Q, computes attention weights for each element in P by matching Q with each element in P: a_i_j = softmax(p_i . q_j), where the softmax normalizes over the elements q_j.
-
class
pytext.models.representations.augmented_lstm.
AugmentedLSTM
(config: pytext.models.representations.augmented_lstm.AugmentedLSTM.Config, embed_dim: int, padding_value: float = 0.0)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
AugmentedLSTM implements a generic AugmentedLSTM representation layer. AugmentedLSTM is an LSTM which optionally appends a highway network to the output layer. Furthermore, the dropout setting controls the level of variational dropout applied.
Parameters: - config (Config) – Configuration object of type BiLSTM.Config.
- embed_dim (int) – The number of expected features in the input.
- padding_value (float) – Value for the padded elements. Defaults to 0.0.
-
padding_value
¶ Value for the padded elements.
Type: float
-
forward_layers
¶ A module list of unidirectional AugmentedLSTM layers moving forward in time.
Type: nn.ModuleList
-
backward_layers
¶ A module list of unidirectional AugmentedLSTM layers moving backward in time.
Type: nn.ModuleList
-
representation_dim
¶ The calculated dimension of the output features of AugmentedLSTM.
Type: int
-
forward
(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]¶ Given an input batch of sequential data such as word embeddings, produces a AugmentedLSTM representation of the sequential input and new state tensors.
Parameters: - embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
- seq_lengths (torch.Tensor) – List of sequences lengths of each batch element.
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors have a dimension of (bsize x num_layers x num_directions * nhid). Defaults to None.
Returns: AugmentedLSTM representation of input and the state of the LSTM at t = seq_len. Shape of representation is (bsize x seq_len x representation_dim). Shape of each state is (bsize x num_layers * num_directions x nhid).
Return type: Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-
class
pytext.models.representations.augmented_lstm.
AugmentedLSTMCell
(embed_dim: int, lstm_dim: int, use_highway: bool, use_bias: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
AugmentedLSTMCell implements an AugmentedLSTM cell.
Parameters: - embed_dim (int) – The number of expected features in the input.
- lstm_dim (int) – Number of features in the hidden state of the LSTM. Defaults to 32.
- use_highway (bool) – If True we append a highway network to the outputs of the LSTM.
- use_bias (bool) – If True we use a bias in our LSTM calculations, otherwise we don’t.
-
input_linearity
¶ Fused weight matrix which computes a linear function over the input.
Type: nn.Module
-
state_linearity
¶ Fused weight matrix which computes a linear function over the states.
Type: nn.Module
-
forward
(x: torch.Tensor, states=typing.Tuple[torch.Tensor, torch.Tensor], variational_dropout_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Warning: DO NOT USE THIS LAYER DIRECTLY, INSTEAD USE the AugmentedLSTM class
Parameters: - x (torch.Tensor) – Input tensor of shape (bsize x input_dim).
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the hidden state and the cell state of each element in the batch. Each of these tensors have a dimension of (bsize x nhid). Defaults to None.
Returns: Returned states. Shape of each state is (bsize x nhid).
Return type: Tuple[torch.Tensor, torch.Tensor]
-
-
class
pytext.models.representations.augmented_lstm.
AugmentedLSTMUnidirectional
(embed_dim: int, lstm_dim: int, go_forward: bool = True, recurrent_dropout_probability: float = 0.0, use_highway: bool = True, use_input_projection_bias: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
AugmentedLSTMUnidirectional implements a one-layer single directional AugmentedLSTM layer. AugmentedLSTM is an LSTM which optionally appends a highway network to the output layer. Furthermore, the dropout setting controls the level of variational dropout applied.
Parameters: - embed_dim (int) – The number of expected features in the input.
- lstm_dim (int) – Number of features in the hidden state of the LSTM. Defaults to 32.
- go_forward (bool) – Whether to compute features left to right (forward) or right to left (backward).
- recurrent_dropout_probability (float) – Variational dropout probability to use. Defaults to 0.0.
- use_highway (bool) – If True we append a highway network to the outputs of the LSTM.
- use_input_projection_bias (bool) – If True we use a bias in our LSTM calculations, otherwise we don’t.
-
cell
¶ AugmentedLSTMCell that is applied at every timestep.
Type: AugmentedLSTMCell
-
forward
(inputs: torch.nn.utils.rnn.PackedSequence, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.nn.utils.rnn.PackedSequence, Tuple[torch.Tensor, torch.Tensor]][source]¶ Warning: DO NOT USE THIS LAYER DIRECTLY, INSTEAD USE the AugmentedLSTM class
Given an input batch of sequential data such as word embeddings, produces a single layer unidirectional AugmentedLSTM representation of the sequential input and new state tensors.
Parameters: - inputs (PackedSequence) – Input tensor of shape (bsize x seq_len x input_dim).
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors have a dimension of (1 x bsize x num_directions * nhid). Defaults to None.
Returns: AugmentedLSTM representation of input and the state of the LSTM at t = seq_len. Shape of representation is (bsize x seq_len x representation_dim). Shape of each state is (1 x bsize x nhid).
Return type: Tuple[PackedSequence, Tuple[torch.Tensor, torch.Tensor]]
-
class
pytext.models.representations.bilstm.
BiLSTM
(config: pytext.models.representations.bilstm.BiLSTM.Config, embed_dim: int, padding_value: float = 0.0)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
BiLSTM implements a multi-layer bidirectional LSTM representation layer preceded by a dropout layer.
Parameters: - config (Config) – Configuration object of type BiLSTM.Config.
- embed_dim (int) – The number of expected features in the input.
- padding_value (float) – Value for the padded elements. Defaults to 0.0.
-
padding_value
¶ Value for the padded elements.
Type: float
-
dropout
¶ Dropout layer preceding the LSTM.
Type: nn.Dropout
-
lstm
¶ LSTM layer that operates on the inputs.
Type: nn.LSTM
-
representation_dim
¶ The calculated dimension of the output features of BiLSTM.
Type: int
-
forward
(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]¶ Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation of the sequential input and new state tensors.
Parameters: - embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
- seq_lengths (torch.Tensor) – Sequence lengths for each batch element.
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns: Bidirectional LSTM representation of the input and the state of the LSTM at t = seq_len. Shape of the representation is (bsize x seq_len x representation_dim). Shape of each state is (bsize x num_layers * num_directions x nhid).
Return type: Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
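A minimal usage sketch follows; the config fields and their defaults are assumptions here, so consult BiLSTM.Config for the exact options:
import torch
from pytext.models.representations.bilstm import BiLSTM

config = BiLSTM.Config()                               # assumed: default config is usable as-is
bilstm = BiLSTM(config, embed_dim=100)

embedded_tokens = torch.randn(8, 20, 100)              # bsize x seq_len x input_dim
seq_lengths = torch.full((8,), 20, dtype=torch.long)   # every sequence uses its full length
rep, (h_n, c_n) = bilstm(embedded_tokens, seq_lengths)
print(rep.shape)                                       # (8, 20, bilstm.representation_dim)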
-
class
pytext.models.representations.bilstm_doc_attention.
BiLSTMDocAttention
(config: pytext.models.representations.bilstm_doc_attention.BiLSTMDocAttention.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
BiLSTMDocAttention implements a multi-layer bidirectional LSTM based representation for documents with or without pooling. The pooling can be max pooling, mean pooling or self attention.
Parameters: - config (Config) – Configuration object of type BiLSTMDocAttention.Config.
- embed_dim (int) – The number of expected features in the input.
-
dropout
¶ Dropout layer preceding the LSTM.
Type: nn.Dropout
-
lstm
¶ Module that implements the LSTM.
Type: nn.Module
-
attention
¶ Module that implements the attention or pooling.
Type: nn.Module
-
dense
¶ Module that implements the non-linear projection over attended representation.
Type: nn.Module
-
representation_dim
¶ The calculated dimension of the output features of the BiLSTMDocAttention representation.
Type: int
-
forward
(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, *args, states: Tuple[torch.Tensor, torch.Tensor] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]¶ Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation with or without pooling of the sequential input and new state tensors.
Parameters: - embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
- seq_lengths (torch.Tensor) – Sequence lengths for each batch element.
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns: Bidirectional LSTM representation of the input and the state of the LSTM at t = seq_len.
Return type: Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-
class
pytext.models.representations.bilstm_doc_slot_attention.
BiLSTMDocSlotAttention
(config: pytext.models.representations.bilstm_doc_slot_attention.BiLSTMDocSlotAttention.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
BiLSTMDocSlotAttention implements a multi-layer bidirectional LSTM based representation with support for various attention mechanisms.
In the default mode, when an attention configuration is not provided, it behaves like a multi-layer LSTM encoder and returns the output features from the last layer of the LSTM, for each t. When a document_attention configuration is provided, it produces a fixed-sized document representation. When a slot_attention configuration is provided, it attends over the output of each cell of the LSTM module to produce a fixed-sized word representation.
Parameters: - config (Config) – Configuration object of type BiLSTMDocSlotAttention.Config.
- embed_dim (int) – The number of expected features in the input.
-
dropout
¶ Dropout layer preceding the LSTM.
Type: nn.Dropout
-
relu
¶ An instance of the ReLU layer.
Type: nn.ReLU
-
lstm
¶ Module that implements the LSTM.
Type: nn.Module
-
use_doc_attention
¶ If True, indicates using document attention.
Type: bool
-
doc_attention
¶ Module that implements document attention.
Type: nn.Module
-
projection_d
¶ A sequence of dense layers for projection over document representation.
Type: nn.Sequential
-
use_word_attention
¶ If True, indicates using word attention.
Type: bool
-
word_attention
¶ Module that implements word attention.
Type: nn.Module
-
projection_w
¶ A sequence of dense layers for projection over word representation.
Type: nn.Sequential
-
representation_dim
¶ The calculated dimension of the output features of the BiLSTMDocSlotAttention representation.
Type: int
-
forward
(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, *args, states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]¶ Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation with the appropriate attention.
Parameters: - embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
- seq_lengths (torch.Tensor) – Sequence lengths for each batch element.
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns: Tensors containing the document and the word representation of the input.
Return type: Tuple[torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-
class
pytext.models.representations.bilstm_slot_attn.
BiLSTMSlotAttention
(config: pytext.models.representations.bilstm_slot_attn.BiLSTMSlotAttention.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
BiLSTMSlotAttention implements a multi-layer bidirectional LSTM based representation with attention over slots.
Parameters: - config (Config) – Configuration object of type BiLSTMSlotAttention.Config.
- embed_dim (int) – The number of expected features in the input.
-
dropout
¶ Dropout layer preceding the LSTM.
Type: nn.Dropout
-
lstm
¶ Module that implements the LSTM.
Type: nn.Module
-
attention
¶ Module that implements the attention.
Type: nn.Module
-
dense
¶ Module that implements the non-linear projection over attended representation.
Type: nn.Module
-
representation_dim
¶ The calculated dimension of the output features of the SlotAttention representation.
Type: int
-
forward
(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor, *args, states: torch.Tensor = None, **kwargs) → torch.Tensor[source]¶ Given an input batch of sequential data such as word embeddings, produces a bidirectional LSTM representation with or without Slot attention.
Parameters: - embedded_tokens (torch.Tensor) – Input tensor of shape (bsize x seq_len x input_dim).
- seq_lengths (torch.Tensor) – Sequence lengths for each batch element.
- states (Tuple[torch.Tensor, torch.Tensor]) – Tuple of tensors containing the initial hidden state and the cell state of each element in the batch. Each of these tensors has a dimension of (bsize x num_layers * num_directions x nhid). Defaults to None.
Returns: Bidirectional LSTM representation of the input, with or without slot attention.
Return type: torch.Tensor
-
class
pytext.models.representations.biseqcnn.
BSeqCNNRepresentation
(config: pytext.models.representations.biseqcnn.BSeqCNNRepresentation.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
This class is an implementation of the paper https://arxiv.org/pdf/1606.07783. It is a bidirectional CNN model that captures context like RNNs do.
The module expects that input mini-batch is already padded.
TODO: Current implementation has a single layer conv-maxpool operation.
-
forward
(inputs: torch.Tensor, *args) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.biseqcnn.
ContextualWordConvolution
(in_channels: int, out_channels: int, kernel_sizes: List[int])[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(words: torch.Tensor)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.contextual_intent_slot_rep.
ContextualIntentSlotRepresentation
(config: pytext.models.representations.contextual_intent_slot_rep.ContextualIntentSlotRepresentation.Config, embed_dim: Tuple[int, ...])[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
Representation for a contextual intent slot model
The inputs are two embeddings: a word-level embedding containing dictionary features, and a sequence (context) level embedding. See the following diagram for the representation implementation that combines the two embeddings: seq_representation is concatenated with word_embeddings.
+-----------+
| word_embed|--------------------------->+
+-----------+                            |   +--------------------+
                                         |   | doc_representation |
+-----------+   +-------------------+    |-->+--------------------+
| seq_embed |-->| seq_representation|--->+   | word_representation|
+-----------+   +-------------------+        +--------------------+
                                               joint_representation
-
forward
(word_seq_embed: Tuple[torch.Tensor, torch.Tensor], word_lengths: torch.Tensor, seq_lengths: torch.Tensor, *args) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.deepcnn.
DeepCNNRepresentation
(config: pytext.models.representations.deepcnn.DeepCNNRepresentation.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
DeepCNNRepresentation implements a CNN representation layer preceded by a dropout layer. The CNN representation layer is based on the encoder in the architecture proposed by Gehring et al. in Convolutional Sequence to Sequence Learning.
Parameters: - config (Config) – Configuration object of type DeepCNNRepresentation.Config.
- embed_dim (int) – The number of expected features in the input.
-
forward
(inputs: torch.Tensor, *args) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
pytext.models.representations.deepcnn.
SeparableConv1d
(input_channels: int, output_channels: int, kernel_size: int, padding: int, dilation: int, bottleneck: int)[source]¶ Bases:
torch.nn.modules.module.Module
Implements a 1d depthwise separable convolutional layer. In regular convolutional layers, the input channels are mixed with each other to produce each output channel. Depthwise separable convolutions decompose this process into two smaller convolutions – a depthwise and pointwise convolution.
The depthwise convolution spatially convolves each input channel separately, then the pointwise convolution projects this result into a new channel space. This process reduces the number of FLOPS used to compute a convolution and also exhibits a regularization effect. The general behavior – including the input parameters – is equivalent to nn.Conv1d.
bottleneck controls the behavior of the pointwise convolution. Instead of upsampling directly, we split the pointwise convolution into two pieces: the first convolution downsamples into a (sufficiently small) low dimension and the second convolution upsamples into the target (higher) dimension. Creating this bottleneck significantly cuts the number of parameters with minimal loss in performance.
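The decomposition can be illustrated with plain PyTorch. This is a conceptual sketch of the depthwise + pointwise (bottlenecked) factorization, not the PyText module itself:
import torch
import torch.nn as nn

in_channels, out_channels, kernel_size, bottleneck = 64, 128, 3, 32

depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                      padding=kernel_size // 2, groups=in_channels)  # spatial conv per channel
pointwise_down = nn.Conv1d(in_channels, bottleneck, 1)               # project into the bottleneck
pointwise_up = nn.Conv1d(bottleneck, out_channels, 1)                # project up to the target dim

x = torch.randn(4, in_channels, 50)                                  # batch x channels x time
y = pointwise_up(pointwise_down(depthwise(x)))
print(y.shape)                                                       # torch.Size([4, 128, 50])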
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.deepcnn.
Trim1d
(trim)[source]¶ Bases:
torch.nn.modules.module.Module
Trims a 1d convolutional output. Used to implement history-padding by removing excess padding from the right.
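A short sketch of the trimming described above, assuming the behavior is simply dropping trim positions of right padding from a (batch x channels x time) tensor:
import torch

def trim1d_sketch(x: torch.Tensor, trim: int) -> torch.Tensor:
    # Remove the last `trim` time steps, i.e. the excess right padding.
    return x[:, :, :-trim].contiguous() if trim > 0 else x

y = trim1d_sketch(torch.randn(4, 16, 53), trim=3)   # -> shape (4, 16, 50)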
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
pytext.models.representations.deepcnn.
create_conv_package
(index: int, activation: pytext.config.module_config.Activation, in_channels: int, out_channels: int, kernel_size: int, causal: bool, dilated: bool, separable: bool, bottleneck: int, weight_norm: bool)[source]¶ Creates a convolutional layer with the specified arguments.
Parameters: - index (int) – Index of a convolutional layer in the stack.
- activation (Activation) – Activation function.
- in_channels (int) – Number of input channels.
- out_channels (int) – Number of output channels.
- kernel_size (int) – Size of 1d convolutional filter.
- causal (bool) – Whether the convolution is causal or not. If set, it accounts for the temporal ordering of the inputs.
- dilated (bool) – Whether the convolution is dilated or not. If set, the receptive field of the convolutional stack grows exponentially.
- separable (bool) – Whether to use depthwise separable convolutions or not – see SeparableConv1d.
- bottleneck (int) – Bottleneck channel dimension for depthwise separable convolutions. See SeparableConv1d for an in-depth explanation.
- weight_norm (bool) – Whether to add weight normalization to the regular convolutions or not.
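A hypothetical call using the documented signature; the argument values and the Activation member shown are assumptions, not values taken from a real PyText config:
from pytext.config.module_config import Activation
from pytext.models.representations.deepcnn import create_conv_package

conv = create_conv_package(
    index=0,                     # first layer in the stack
    activation=Activation.RELU,  # assumed member of the Activation enum
    in_channels=128,
    out_channels=128,
    kernel_size=3,
    causal=True,                 # pad on the left only, preserving temporal order
    dilated=True,                # receptive field grows with the layer index
    separable=False,
    bottleneck=0,
    weight_norm=False,
)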
-
class
pytext.models.representations.docnn.
DocNNRepresentation
(config: pytext.models.representations.docnn.DocNNRepresentation.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
CNN based representation of a document.
-
forward
(embedded_tokens: torch.Tensor, *args) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.huggingface_bert_sentence_encoder.
HuggingFaceBertSentenceEncoder
(config: pytext.models.representations.huggingface_bert_sentence_encoder.HuggingFaceBertSentenceEncoder.Config, output_encoded_layers: bool, *args, **kwargs)[source]¶ Bases:
pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase
Generate sentence representation using the open source HuggingFace BERT model. This class implements loading the model weights from a pre-trained model file.
-
class
pytext.models.representations.jointcnn_rep.
JointCNNRepresentation
(config: pytext.models.representations.jointcnn_rep.JointCNNRepresentation.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
-
forward
(embedded_tokens: torch.Tensor, *args) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
Bases:
pytext.models.representations.representation_base.RepresentationBase
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
pytext.models.representations.ordered_neuron_lstm.
OrderedNeuronLSTM
(config: pytext.models.representations.ordered_neuron_lstm.OrderedNeuronLSTM.Config, embed_dim: int, padding_value: Optional[float] = 0.0)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
-
forward
(rep: torch.Tensor, seq_lengths: torch.Tensor, states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.ordered_neuron_lstm.
OrderedNeuronLSTMLayer
(embed_dim: int, lstm_dim: int, padding_value: float, dropout: float)[source]¶ Bases:
pytext.models.module.Module
-
forward
(embedded_tokens: torch.Tensor, states: Tuple[torch.Tensor, torch.Tensor], seq_lengths: List[int]) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pair_rep.
PairRepresentation
(config: pytext.models.representations.pair_rep.PairRepresentation.Config, embed_dim: Tuple[int, ...])[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
Wrapper representation for a pair of inputs.
Takes a tuple of inputs: the left sentence, and the right sentence(s). Returns a representation of the pair of sentences, either as a concatenation of the two sentence embeddings or as a “siamese” representation which also includes their difference and elementwise product (arXiv:1705.02364). If more than two inputs are provided, the extra inputs are assumed to be extra “right” sentences, and the output will be the stacked pair representations of the left sentence together with all right sentences. This is more efficient than separately computing all these pair representations, because the left sentence will not need to be re-embedded multiple times.
-
forward
(embeddings: Tuple[torch.Tensor, ...], *lengths) → torch.Tensor[source]¶ Computes the pair representations.
Parameters: - embeddings – token embeddings of the left sentence, followed by the token embeddings of the right sentence(s).
- lengths – the corresponding sequence lengths.
Returns: A tensor of shape (num_right_inputs, batch_size, rep_size), with the first dimension squeezed out if there is only one right input.
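The "siamese" encoding mentioned above can be sketched in plain PyTorch. This is conceptual only; the real module operates on sentence embeddings produced by its sub-representations:
import torch

left = torch.randn(8, 256)    # embedding of the left sentence
right = torch.randn(8, 256)   # embedding of one right sentence

concat_rep = torch.cat([left, right], dim=-1)                                      # plain concatenation
siamese_rep = torch.cat([left, right, (left - right).abs(), left * right], dim=-1) # "siamese" features
print(siamese_rep.shape)      # torch.Size([8, 1024])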
-
-
class
pytext.models.representations.pass_through.
PassThroughRepresentation
(config: pytext.config.component.ComponentMeta.__new__.<locals>.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
-
forward
(embedded_tokens: torch.Tensor, *args) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pooling.
BoundaryPool
(config: pytext.models.representations.pooling.BoundaryPool.Config, n_input: int)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pooling.
LastTimestepPool
(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor, seq_lengths: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pooling.
MaxPool
(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pooling.
MeanPool
(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor, seq_lengths: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pooling.
NoPool
(config: pytext.config.module_config.ModuleConfig, n_input: int)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pooling.
SelfAttention
(config: pytext.models.representations.pooling.SelfAttention.Config, n_input: int)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor, seq_lengths: torch.Tensor = None) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.pure_doc_attention.
PureDocAttention
(config: pytext.models.representations.pure_doc_attention.PureDocAttention.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
Pooling (e.g. max pooling or self attention) followed by an optional MLP.
-
forward
(embedded_tokens: torch.Tensor, seq_lengths: torch.Tensor = None, *args) → Any[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.representation_base.
RepresentationBase
(config)[source]¶ Bases:
pytext.models.module.Module
-
forward
(*inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.seq_rep.
SeqRepresentation
(config: pytext.models.representations.seq_rep.SeqRepresentation.Config, embed_dim: int)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
Representation for a sequence of sentences. Each sentence is embedded with a DocNN model, then the resulting sentence embeddings are encoded with another DocNN/BiLSTM model.
-
forward
(embedded_seqs: torch.Tensor, seq_lengths: torch.Tensor, *args) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.slot_attention.
SlotAttention
(config: pytext.models.representations.slot_attention.SlotAttention.Config, n_input: int, batch_first: bool = True)[source]¶ Bases:
pytext.models.module.Module
-
forward
(inputs: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.sparse_transformer_sentence_encoder.
SparseTransformerSentenceEncoder
(config: pytext.models.representations.sparse_transformer_sentence_encoder.SparseTransformerSentenceEncoder.Config, output_encoded_layers: bool, padding_idx: int, vocab_size: int, *args, **kwarg)[source]¶ Bases:
pytext.models.representations.transformer_sentence_encoder.TransformerSentenceEncoder
Implementation of the Transformer Sentence Encoder. This directly makes use of the TransformerSentenceEncoder module in Fairseq.
- A few interesting config options:
- encoder_normalize_before determines whether the layer norm is applied before or after self_attention. This is similar to the original implementation from Google.
- activation_fn can be set to ‘gelu’ instead of the default of ‘relu’.
- project_representation adds a linear projection + tanh to the pooled output in the style of BERT.
-
class
pytext.models.representations.stacked_bidirectional_rnn.
RnnType
[source]¶ Bases:
enum.Enum
An enumeration.
-
GRU
= 'gru'¶
-
LSTM
= 'lstm'¶
-
RNN
= 'rnn'¶
-
-
class
pytext.models.representations.stacked_bidirectional_rnn.
StackedBidirectionalRNN
(config: pytext.models.representations.stacked_bidirectional_rnn.StackedBidirectionalRNN.Config, input_size: int, padding_value: float = 0.0)[source]¶ Bases:
pytext.models.module.Module
StackedBidirectionalRNN implements a multi-layer bidirectional RNN with an option to return outputs from all the layers of RNN.
Parameters: - config (Config) – Configuration object of type StackedBidirectionalRNN.Config.
- input_size (int) – The number of expected features in the input.
- padding_value (float) – Value for the padded elements. Defaults to 0.0.
-
padding_value
¶ Value for the padded elements.
Type: float
-
dropout
¶ Dropout layer preceding the LSTM.
Type: nn.Dropout
-
lstm
¶ LSTM layer that operates on the inputs.
Type: nn.LSTM
-
representation_dim
¶ The calculated dimension of the output features of BiLSTM.
Type: int
-
class
pytext.models.representations.traced_transformer_encoder.
TraceableTransformerWrapper
(eager_encoder: fairseq.modules.transformer_sentence_encoder.TransformerSentenceEncoder)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(tokens: torch.Tensor, segment_labels: torch.Tensor = None, positions: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.traced_transformer_encoder.
TracedTransformerEncoder
(eager_encoder: fairseq.modules.transformer_sentence_encoder.TransformerSentenceEncoder, tokens: torch.Tensor, segment_labels: torch.Tensor = None, positions: torch.Tensor = None)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(tokens: torch.Tensor, segment_labels: torch.Tensor = None, positions: torch.Tensor = None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.representations.transformer_sentence_encoder.
TransformerSentenceEncoder
(config: pytext.models.representations.transformer_sentence_encoder.TransformerSentenceEncoder.Config, output_encoded_layers: bool, padding_idx: int, vocab_size: int, *args, **kwarg)[source]¶ Bases:
pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase
Implementation of the Transformer Sentence Encoder. This directly makes use of the TransformerSentenceEncoder module in Fairseq.
- A few interesting config options:
- encoder_normalize_before determines whether the layer norm is applied before or after self_attention. This is similar to the original implementation from Google.
- activation_fn can be set to ‘gelu’ instead of the default of ‘relu’.
- projection_dim adds a linear projection (to projection_dim dimensions) followed by a tanh to the pooled output, in the style of BERT.
-
load_state_dict
(state_dict)[source]¶ Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module's state_dict() function.
Parameters: - state_dict (dict) – a dict containing parameters and persistent buffers.
- strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module's state_dict() function. Default: True
Returns: - missing_keys is a list of str containing the missing keys
- unexpected_keys is a list of str containing the unexpected keys
Return type: NamedTuple with missing_keys and unexpected_keys fields
-
class
pytext.models.representations.transformer_sentence_encoder_base.
PoolingMethod
[source]¶ Bases:
enum.Enum
Pooling methods are chosen from the "Feature-based Approaches" section in https://arxiv.org/pdf/1810.04805.pdf
-
AVG_CONCAT_LAST_4_LAYERS
= 'avg_concat_last_4_layers'¶
-
AVG_LAST_LAYER
= 'avg_last_layer'¶
-
AVG_SECOND_TO_LAST_LAYER
= 'avg_second_to_last_layer'¶
-
AVG_SUM_LAST_4_LAYERS
= 'avg_sum_last_4_layers'¶
-
CLS_TOKEN
= 'cls_token'¶
-
NO_POOL
= 'no_pool'¶
-
-
class
pytext.models.representations.transformer_sentence_encoder_base.
TransformerSentenceEncoderBase
(config: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase.Config, output_encoded_layers=False, *args, **kwargs)[source]¶ Bases:
pytext.models.representations.representation_base.RepresentationBase
Base class for all Bi-directional Transformer based Sentence Encoders. All children of this class should implement an _encoder function which takes as input: tokens, [optional] segment labels and a pad mask and outputs both the sentence representation (output of _pool_encoded_layers) and the output states of all the intermediate Transformer layers as a list of tensors.
Input tuple consists of the following elements:
1) tokens: torch tensor of size B x T which contains token ids
2) pad_mask: torch tensor of size B x T generated with the condition tokens != self.vocab.get_pad_index()
3) segment_labels: torch tensor of size B x T which contains the segment id of each token
Output tuple consists of the following elements:
1) encoded_layers: List of torch tensors where each tensor has shape B x T x C and there are num_transformer_layers + 1 of these. Each tensor represents the output of the intermediate transformer layers, with the 0th element being the input to the first transformer layer (token + segment + position embedding).
2) [Optional] pooled_output: Output of the pooling operation associated with config.pooling_method applied to the encoded_layers. Size B x C (or B x 4C if pooling = AVG_CONCAT_LAST_4_LAYERS)
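A sketch of assembling the input tuple described above; the token ids and the pad index 0 are illustrative assumptions (in practice the pad index comes from the vocabulary):
import torch

tokens = torch.tensor([[101, 2054, 2003, 102, 0, 0]])   # B x T token ids, 0 used as pad here
pad_mask = (tokens != 0).long()                         # B x T, per the condition above
segment_labels = torch.zeros_like(tokens)               # single-segment input
input_tuple = (tokens, pad_mask, segment_labels)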
-
forward
(input_tuple: Tuple[torch.Tensor, ...], *args) → Tuple[torch.Tensor, ...][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.semantic_parsers package¶
-
class
pytext.models.semantic_parsers.rnng.rnng_data_structures.
CompositionalNN
(lstm_dim: int)[source]¶ Bases:
torch.jit.ScriptModule
Combines a list / sequence of embeddings into one using a biLSTM
-
class
pytext.models.semantic_parsers.rnng.rnng_data_structures.
CompositionalSummationNN
(lstm_dim: int)[source]¶ Bases:
torch.jit.ScriptModule
Simpler version of CompositionalNN
-
class
pytext.models.semantic_parsers.rnng.rnng_data_structures.
Element
(node: Any)[source]¶ Bases:
object
Generic element representing a token / non-terminal / sub-tree on a stack. Used to compute valid actions in the RNNG parser.
-
class
pytext.models.semantic_parsers.rnng.rnng_data_structures.
ParserState
(parser=None)[source]¶ Bases:
object
Maintains state of the Parser. Useful for beam search
-
class
pytext.models.semantic_parsers.rnng.rnng_data_structures.
StackLSTM
(lstm: torch.nn.modules.rnn.LSTM)[source]¶ Bases:
collections.abc.Sized
,typing.Generic
The Stack LSTM from Dyer et al: https://arxiv.org/abs/1505.08075
-
element_from_top
(index: int) → pytext.models.semantic_parsers.rnng.rnng_data_structures.Element[source]¶
-
-
class
pytext.models.semantic_parsers.rnng.rnng_parser.
RNNGParser
(ablation: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.AblationParams, constraints: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.RNNGConstraints, lstm_num_layers: int, lstm_dim: int, max_open_NT: int, dropout: float, actions_vocab, shift_idx: int, reduce_idx: int, ignore_subNTs_roots: List[int], valid_NT_idxs: List[int], valid_IN_idxs: List[int], valid_SL_idxs: List[int], embedding: pytext.models.embeddings.embedding_list.EmbeddingList, p_compositional)[source]¶ Bases:
pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase
-
class
pytext.models.semantic_parsers.rnng.rnng_parser.
RNNGParserBase
(ablation: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.AblationParams, constraints: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParserBase.Config.RNNGConstraints, lstm_num_layers: int, lstm_dim: int, max_open_NT: int, dropout: float, actions_vocab, shift_idx: int, reduce_idx: int, ignore_subNTs_roots: List[int], valid_NT_idxs: List[int], valid_IN_idxs: List[int], valid_SL_idxs: List[int], embedding: pytext.models.embeddings.embedding_list.EmbeddingList, p_compositional)[source]¶ Bases:
pytext.models.model.BaseModel
The Recurrent Neural Network Grammar (RNNG) parser from Dyer et al.: https://arxiv.org/abs/1602.07776 and Gupta et al.: https://arxiv.org/abs/1810.07942. RNNG is a neural constituency parsing algorithm that explicitly models the compositional structure of a sentence. It is able to learn hierarchical relationships among the words and phrases in a given sentence, thereby learning the underlying tree structure. The paper proposes generative as well as discriminative approaches. In PyText we have implemented the discriminative approach for modeling intent slot models. It is a top-down shift-reduce parser that can output trees with non-terminals (intent and slot labels) and terminals (tokens).
-
contextualize
(context)[source]¶ Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by
DisjointMultitaskModel
for changing the task that should be trained with a given iterator.
-
forward
(tokens: torch.Tensor, seq_lens: torch.Tensor, dict_feat: Optional[Tuple[torch.Tensor, ...]] = None, actions: Optional[List[List[int]]] = None, contextual_token_embeddings: Optional[torch.Tensor] = None, beam_size=1, top_k=1) → List[Tuple[torch.Tensor, torch.Tensor]][source]¶ RNNG forward function.
Parameters: - tokens (torch.Tensor) – list of tokens
- seq_lens (torch.Tensor) – list of sequence lengths
- dict_feat (Optional[Tuple[torch.Tensor, ..]]) – dictionary or gazetteer features for each token
- actions (Optional[List[List[int]]]) – Used only during training. Oracle actions for the instances.
Returns: list of top-k tuples of a predicted actions tensor and the corresponding scores tensor. Tensor shapes: actions (batch_size, action_length); scores (batch_size, action_length, number_of_actions)
-
classmethod
from_config
(model_config, feature_config=None, metadata: pytext.data.data_handler.CommonMetadata = None, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer] = None)[source]¶
-
get_loss
(logits: List[Tuple[torch.Tensor, torch.Tensor]], target_actions: torch.Tensor, context: torch.Tensor)[source]¶ Shapes: logits[1] (action scores): (1, action_length, number_of_actions); target_actions: (1, action_length)
-
get_param_groups_for_optimizer
()[source]¶ This is called by code that looks for an instance of pytext.models.model.Model.
-
get_pred
(logits: List[Tuple[torch.Tensor, torch.Tensor]], context=None, *args)[source]¶ Return shapes: preds: batch (1) * topk * action_len; scores: batch (1) * topk * (action_len * number_of_actions)
-
push_action
(state: pytext.models.semantic_parsers.rnng.rnng_data_structures.ParserState, target_action_idx: int) → None[source]¶ Used for updating the state with a target next action
Parameters: - state (ParserState) – The state of the stack, buffer and action
- target_action_idx (int) – Index of the action to process
-
valid_actions
(state: pytext.models.semantic_parsers.rnng.rnng_data_structures.ParserState) → List[int][source]¶ Used for restricting the set of possible action predictions
Parameters: state (ParserState) – The state of the stack, buffer and action
Returns: indices of the valid actions
Return type: List[int]
-
pytext.models.seq_models package¶
-
class
pytext.models.seq_models.contextual_intent_slot.
ContextualIntentSlotModel
(default_doc_loss_weight, default_word_loss_weight, *args, **kwargs)[source]¶ Bases:
pytext.models.joint_model.IntentSlotModel
Joint Model for Intent classification and slot tagging with inputs of contextual information (sequence of utterances) and dictionary feature of the last utterance.
Training data should include:
- doc_label (string): intent classification label of either the sequence of utterances or just the last sentence
- word_label (string): slot tagging label of the last utterance in the format of start_idx:end_idx:slot_label; multiple slots are separated by a comma
- text (list of string): sequence of utterances for training
- dict_feat (dict): a dict of features that contains the feature of each word in the last utterance
Following is an example of raw columns from training data:
doc_label: reply-where
word_label: 10:20:restaurant_name
text: ["dinner at 6?", "wanna try Tomi Sushi?"]
dict_feat: {"tokenFeatList": [{"tokenIdx": 2, "features": {"poi:eatery": 0.66}}, {"tokenIdx": 3, "features": {"poi:eatery": 0.66}}]}
-
class
pytext.models.seq_models.seqnn.
SeqNNModel
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.doc_model.DocModel
Classification model with a sequence of utterances as input. It uses a docnn model (CNN or LSTM) to generate a vector representation for each sequence, and then uses an LSTM or BLSTM to capture the dynamics and produce labels for each sequence.
-
class
pytext.models.seq_models.seqnn.
SeqNNModel_Deprecated
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.model.Model
Classification model with a sequence of utterances as input. It uses a docnn model (CNN or LSTM) to generate a vector representation for each sequence, and then uses an LSTM or BLSTM to capture the dynamics and produce labels for each sequence.
DEPRECATED: Use SeqNNModel
Submodules¶
pytext.models.bert_classification_models module¶
-
class
pytext.models.bert_classification_models.
BertPairwiseModel
(encoder1, encoder2, decoder, output_layer, encode_relations)[source]¶ Bases:
pytext.models.pair_classification_model.BasePairwiseModel
Bert Pairwise classification model
The model takes two sets of tokens (left and right), calculates their representations separately using a shared BERT encoder, and passes them to the decoder along with their absolute difference and elementwise product, all concatenated. Used for e.g. natural language inference.
-
forward
(input_tuple1: Tuple[torch.Tensor, ...], input_tuple2: Tuple[torch.Tensor, ...]) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.bert_classification_models.
NewBertModel
(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]¶ Bases:
pytext.models.model.BaseModel
BERT single sentence classification.
-
SUPPORT_FP16_OPTIMIZER
= True¶
-
forward
(encoder_inputs: Tuple[torch.Tensor, ...], *args) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.bert_regression_model module¶
-
class
pytext.models.bert_regression_model.
NewBertRegressionModel
(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]¶ Bases:
pytext.models.bert_classification_models.NewBertModel
BERT single sentence (or concatenated sentences) regression.
pytext.models.crf module¶
-
class
pytext.models.crf.
CRF
(num_tags: int, ignore_index: int, default_label_pad_index: int)[source]¶ Bases:
torch.nn.modules.module.Module
Compute the log-likelihood of the input assuming a conditional random field model.
Parameters: num_tags – The number of tags -
decode
(emissions: torch.Tensor, seq_lens: torch.Tensor) → torch.Tensor[source]¶ Given a set of emission probabilities, return the predicted tags.
Parameters: - emissions – Emission probabilities with expected shape of batch_size * seq_len * num_labels
- seq_lens – Length of each input.
-
export_to_caffe2
(workspace, init_net, predict_net, logits_output_name)[source]¶ Exports the CRF layer to Caffe2 by manually adding the necessary operators to the init_net and predict_net.
Parameters: - init_net – caffe2 init net created by the current graph
- predict_net – caffe2 net created by the current graph
- workspace – caffe2 current workspace
- output_names – current output names of the caffe2 net
- py_model – original pytorch model object
Returns: The updated predictions blob name
Return type: string
-
forward
(emissions: torch.Tensor, tags: torch.Tensor, reduce: bool = True) → torch.Tensor[source]¶ Compute log-likelihood of input.
Parameters: - emissions – Emission values for different tags for each input. The expected shape is batch_size * seq_len * num_labels. Padding should be on the right side of the input.
- tags – Actual tags for each token in the input. Expected shape is batch_size * seq_len
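A hypothetical usage sketch of the CRF layer; the constructor arguments, shapes, and padding/ignore index values are illustrative assumptions:
import torch
from pytext.models.crf import CRF

crf = CRF(num_tags=5, ignore_index=-1, default_label_pad_index=-1)  # assumed pad/ignore indices
emissions = torch.randn(2, 7, 5)                # batch_size x seq_len x num_labels
tags = torch.randint(0, 5, (2, 7))              # batch_size x seq_len
log_likelihood = crf(emissions, tags)           # log-likelihood of the tag sequences
seq_lens = torch.tensor([7, 5])
predicted = crf.decode(emissions, seq_lens)     # predicted tags, batch_size x seq_len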
-
pytext.models.disjoint_multitask_model module¶
-
class
pytext.models.disjoint_multitask_model.
DisjointMultitaskModel
(models, loss_weights)[source]¶ Bases:
pytext.models.model.Model
Wrapper model to train multiple PyText models that share parameters. Designed to be used for multi-tasking when the tasks have disjoint datasets.
Modules which have the same shared_module_key and type share parameters. Only the first such module needs to be configured in full in each case.
Parameters: models (type) – Dictionary of models of sub-tasks. -
current_model
¶ Current model to route the input batch to.
Type: type
-
contextualize
(context)[source]¶ Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by
DisjointMultitaskModel
for changing the task that should be trained with a given iterator.
-
current_model
-
forward
(*inputs) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.disjoint_multitask_model.
NewDisjointMultitaskModel
(models, loss_weights)[source]¶ Bases:
pytext.models.disjoint_multitask_model.DisjointMultitaskModel
pytext.models.distributed_model module¶
-
class
pytext.models.distributed_model.
DistributedModel
(*args, **kwargs)[source]¶ Bases:
torch.nn.parallel.distributed.DistributedDataParallel
Wrapper model class to train models in a distributed data parallel manner. The way to use this class to train your module in a distributed manner is:
distributed_model = DistributedModel(
    module=model,
    device_ids=[device_id0, device_id1],
    output_device=device_id0,
    broadcast_buffers=False,
)
where model is the instance of the actual model class you want to train in a distributed manner.
-
load_state_dict
(*args, **kwargs)[source]¶ Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module's state_dict() function.
Parameters: - state_dict (dict) – a dict containing parameters and persistent buffers.
- strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module's state_dict() function. Default: True
Returns: - missing_keys is a list of str containing the missing keys
- unexpected_keys is a list of str containing the unexpected keys
Return type: NamedTuple with missing_keys and unexpected_keys fields
-
state_dict
(*args, **kwargs)[source]¶ Returns a dictionary containing the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are the corresponding parameter and buffer names.
Returns: a dictionary containing the whole state of the module
Return type: dict
Example:
>>> module.state_dict().keys()
['bias', 'weight']
-
pytext.models.doc_model module¶
-
class
pytext.models.doc_model.
ByteTokensDocumentModel
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.doc_model.DocModel
DocModel that receives both word IDs and byte IDs as inputs (concatenating word and byte-token embeddings to represent input tokens).
-
class
pytext.models.doc_model.
DocModel
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.model.Model
DocModel that’s compatible with the new Model abstraction, which is responsible for describing which inputs it expects and arranging its input tensors.
-
classmethod
create_decoder
(config: pytext.models.doc_model.DocModel.Config, representation_dim: int, num_labels: int)[source]¶
-
classmethod
create_embedding
(config: pytext.models.doc_model.DocModel.Config, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer])[source]¶
-
-
class
pytext.models.doc_model.
DocRegressionModel
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.doc_model.DocModel
Model that’s compatible with the new Model abstraction, and is configured for regression tasks (specifically for labels, predictions, and loss).
-
class
pytext.models.doc_model.
PersonalizedDocModel
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase, user_embedding: Optional[pytext.models.embeddings.embedding_base.EmbeddingBase] = None)[source]¶ Bases:
pytext.models.doc_model.DocModel
DocModel that includes a user embedding, which learns user features to produce personalized predictions. In this class, the user embedding is fed directly to the decoder (i.e., it does not go through the encoders).
-
forward
(*inputs) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.joint_model module¶
-
class
pytext.models.joint_model.
IntentSlotModel
(default_doc_loss_weight, default_word_loss_weight, *args, **kwargs)[source]¶ Bases:
pytext.models.model.Model
A joint intent-slot model. This is framed as a model that performs document classification and word tagging tasks, where the embedding and text representation layers are shared between both tasks.
The supported representation layers are based on bidirectional LSTM or CNN.
It can be instantiated just like any other Model.
This model is in the new data handling design involving tensorizers; that is the difference between this and JointModel.
pytext.models.masked_lm module¶
-
class
pytext.models.masked_lm.
MaskedLanguageModel
(encoder: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase, decoder: pytext.models.decoders.mlp_decoder.MLPDecoder, output_layer: pytext.models.output_layers.lm_output_layer.LMOutputLayer, token_tensorizer: pytext.data.bert_tensorizer.BERTTensorizerBase, vocab: pytext.data.utils.Vocabulary, mask_prob: float = 0.15, mask_bos: float = False, masking_strategy: pytext.models.masking_utils.MaskingStrategy = <MaskingStrategy.RANDOM: 'random'>, stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>)[source]¶ Bases:
pytext.models.model.BaseModel
Masked language model for BERT style pre-training.
-
SUPPORT_FP16_OPTIMIZER
= True¶
-
forward
(*inputs) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.masking_utils module¶
-
class
pytext.models.masking_utils.
MaskingStrategy
[source]¶ Bases:
enum.Enum
An enumeration.
-
FREQUENCY
= 'frequency_based'¶
-
RANDOM
= 'random'¶
-
-
pytext.models.masking_utils.
frequency_based_masking
(tokens: torch.Tensor, token_sampling_weights: numpy.ndarray, mask_prob: float) → torch.Tensor[source]¶ Function to mask tokens based on frequency.
- Inputs:
- tokens: Tensor with token ids of shape (batch_size x seq_len)
- token_sampling_weights: numpy array with shape (batch_size x seq_len), each element representing the sampling weight associated with the corresponding token in tokens
- mask_prob: Probability of masking a particular token
- Outputs:
- mask: Tensor with the same shape as the input tokens (batch_size x seq_len), with masked tokens represented by a 1 and everything else as 0.
-
pytext.models.masking_utils.
random_masking
(tokens: torch.Tensor, mask_prob: float) → torch.Tensor[source]¶ Function to mask tokens randomly.
- Inputs:
- tokens: Tensor with token ids of shape (batch_size x seq_len)
- mask_prob: Probability of masking a particular token
- Outputs:
- mask: Tensor with the same shape as the input tokens (batch_size x seq_len), with masked tokens represented by a 1 and everything else as 0.
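A minimal sketch of the random masking described above; this is illustrative, not the PyText implementation:
import torch

def random_masking_sketch(tokens: torch.Tensor, mask_prob: float) -> torch.Tensor:
    # Draw an independent Bernoulli(mask_prob) for every position:
    # 1 marks a token selected for masking, 0 leaves it untouched.
    probs = torch.full(tokens.shape, mask_prob)
    return torch.bernoulli(probs).long()

mask = random_masking_sketch(torch.randint(0, 100, (2, 8)), mask_prob=0.15)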
pytext.models.model module¶
-
class
pytext.models.model.
BaseModel
(stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>)[source]¶ Bases:
torch.nn.modules.module.Module
,pytext.config.component.Component
Base model class which inherits from nn.Module. It also has a stage flag to indicate whether it is in the train, eval, or test stage. This is because the built-in train/eval flag in PyTorch can't distinguish between eval and test, which is required to support some use cases.
-
SUPPORT_FP16_OPTIMIZER
= False¶
-
contextualize
(context)[source]¶ Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by
DisjointMultitaskModel
for changing the task that should be trained with a given iterator.
-
eval
(stage=<Stage.TEST: 'Test'>)[source]¶ Override to explicitly maintain the stage (train, eval, test).
-
get_param_groups_for_optimizer
() → List[Dict[str, List[torch.nn.parameter.Parameter]]][source]¶ Returns a list of parameter groups of the format {“params”: param_list}. The parameter groups loosely correspond to layers and are ordered from low to high. Currently, only the embedding layer can provide multiple param groups, and other layers are put into one param group. The output of this method is passed to the optimizer so that schedulers can change learning rates by layer.
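For example, the returned groups can be handed straight to a PyTorch optimizer; the model variable and the learning rate below are illustrative assumptions:
import torch

param_groups = model.get_param_groups_for_optimizer()   # assumes `model` is a BaseModel instance
optimizer = torch.optim.Adam(param_groups, lr=1e-3)     # schedulers can then adjust rates per group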
-
-
class
pytext.models.model.
Model
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.model.BaseModel
Generic single-task model class that expects four components:
- Embedding
- Representation
- Decoder
- Output Layer
Forward pass: embedding -> representation -> decoder -> output_layer
These four components have specific responsibilities as described below.
Embedding layer should implement the way to represent each token in the input text. It can be as simple as just token/word embedding or can be composed of multiple ways to represent a token, e.g., word embedding, character embedding, etc.
Representation layer should implement the way to encode the entire input text such that the output vector(s) can be used by the decoder to produce logits. There is no restriction on the number of inputs it should encode. There is also no restriction on the number of ways to encode input.
Decoder layer should implement the way to consume the output of the model's representation and produce logits that can be used by the output layer to compute loss or generate predictions (and prediction scores/confidence).
Output layer should implement the way loss computation is done as well as the logic to generate predictions from the logits.
Let us discuss the joint intent-slot model as a case to go over these layers. The model predicts intent of input utterance and the slots in the utterance. (Refer to Train Intent-Slot model on ATIS Dataset for details about intent-slot model.)
EmbeddingList
layer is tasked with representing tokens. To do so we can use a learnable word embedding table in conjunction with a learnable character embedding table that are distilled to a token-level representation using CNN and pooling. Note: This class is meant to be reused by all models. It acts as a container of all the different ways of representing a token/word.
BiLSTMDocSlotAttention
is tasked with encoding the embedded input string for intent classification and slot filling. In order to do that it has a shared bidirectional LSTM layer followed by separate attention layers for document-level attention and word-level attention. Finally it produces two vectors per utterance.
IntentSlotModelDecoder
accepts the two input vectors from BiLSTMDocSlotAttention and produces logits for intent classification and slot filling. Conditioned on a flag it can also use the probabilities from intent classification for slot filling.
IntentSlotOutputLayer
implements the logic behind computing loss and prediction, as well as how to export this layer to Caffe2. This is used by the model exporter as a post-processing Caffe2 operator.
Parameters: - embedding (EmbeddingBase) – Description of parameter embedding.
- representation (RepresentationBase) – Description of parameter representation.
- decoder (DecoderBase) – Description of parameter decoder.
- output_layer (OutputLayerBase) – Description of parameter output_layer.
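To make the four-stage pipeline concrete, here is a toy composition in plain PyTorch; the layer choices (embedding table, LSTM, linear decoder, cross-entropy output) are illustrative stand-ins rather than PyText's actual component classes:

# Toy illustration of the embedding -> representation -> decoder -> output_layer
# pipeline; classes and shapes here are stand-ins, not PyText's interfaces.
import torch
import torch.nn as nn

class ToySingleTaskModel(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128, num_labels=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)                    # token -> vector
        self.representation = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # encode the text
        self.decoder = nn.Linear(hidden_dim, num_labels)                      # encoding -> logits
        self.output_layer = nn.CrossEntropyLoss()                             # logits -> loss

    def forward(self, tokens):
        embedded = self.embedding(tokens)               # (batch, seq_len, emb_dim)
        _, (hidden, _) = self.representation(embedded)
        return self.decoder(hidden[-1])                 # logits, consumed by the output layer

model = ToySingleTaskModel()
logits = model(torch.randint(0, 1000, (4, 10)))         # batch of 4 utterances, 10 tokens each
loss = model.output_layer(logits, torch.randint(0, 5, (4,)))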
-
embedding
¶
-
representation
¶
-
decoder
¶
-
output_layer
¶
-
classmethod
compose_embedding
(sub_emb_module_dict: Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase], metadata) → pytext.models.embeddings.embedding_list.EmbeddingList[source]¶ Default implementation is to compose an instance of
EmbeddingList
with all the sub-embedding modules. You should override this class method if you want to implement a specific way to embed tokens/words.Parameters: sub_emb_module_dict (Dict[str, EmbeddingBase]) – Named dictionary of embedding modules each of which implement a way to embed/encode a token. Returns: An instance of EmbeddingList
.Return type: EmbeddingList
-
classmethod
create_embedding
(feat_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata)[source]¶
-
classmethod
create_sub_embs
(emb_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata) → Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase][source]¶ Creates the embedding modules defined in the emb_config.
Parameters: - emb_config (FeatureConfig) – Object containing all the sub-embedding configurations.
- metadata (CommonMetadata) – Object containing features and label metadata.
Returns: Named dictionary of embedding modules.
Return type: Dict[str, EmbeddingBase]
-
forward
(*inputs) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
pytext.models.model.
ModelInputBase
(**kwargs)[source]¶ Bases:
pytext.config.pytext_config.ConfigBase
Base class for model inputs.
pytext.models.module module¶
-
class
pytext.models.module.
Module
(config=None)[source]¶ Bases:
torch.nn.modules.module.Module
,pytext.config.component.Component
Generic module class that serves as base class for all PyText modules.
Parameters: config (type) – Module’s config object. Specific contents of this object depends on the module. Defaults to None.
-
pytext.models.module.
create_module
(module_config, *args, create_fn=<function _create_module_from_registry>, **kwargs)[source]¶ Create a module object given the module's config object. It depends on the global shared module registry, hence your module must be available in the registry. This entails that your module must be imported somewhere in the code path during module creation (ideally in your model class) for the module to be visible to the registry.
Parameters: - module_config (type) – Module config object.
- create_fn (type) – The function to use for creating the module. Use this parameter if your module creation requires custom code and pass your function here. Defaults to _create_module_from_registry().
Returns: Description of returned object.
Return type: type
pytext.models.pair_classification_model module¶
-
class
pytext.models.pair_classification_model.
BasePairwiseModel
(decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase, encode_relations: bool)[source]¶ Bases:
pytext.models.model.BaseModel
A base classification model that scores a pair of texts.
Subclasses need to implement the from_config, forward and save_modules methods.
-
forward
(input1: Tuple[torch.Tensor, ...], input2: Tuple[torch.Tensor, ...])[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
pytext.models.pair_classification_model.
PairwiseModel
(embeddings: torch.nn.modules.container.ModuleList, representations: torch.nn.modules.container.ModuleList, decoder: pytext.models.decoders.mlp_decoder.MLPDecoder, output_layer: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer, encode_relations: bool)[source]¶ Bases:
pytext.models.pair_classification_model.BasePairwiseModel
A classification model that scores a pair of texts, for example, a model for natural language inference.
The model shares embedding space (so it doesn’t support pairs of texts where left and right are in different languages). It uses bidirectional LSTM or CNN to represent the two documents, and concatenates them along with their absolute difference and elementwise product. This concatenated pair representation is passed to a multi-layer perceptron to decode to label/target space.
See https://arxiv.org/pdf/1705.02364.pdf for more details.
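The concatenated pair representation described above is easy to see in isolation; in this hypothetical sketch, r1 and r2 stand in for the encoded left and right texts:

# Sketch of the pair representation described above; r1 and r2 stand in for
# the encoded left and right texts from the shared representation layers.
import torch

r1 = torch.randn(8, 128)   # batch of 8 left-text encodings
r2 = torch.randn(8, 128)   # batch of 8 right-text encodings

# [r1; r2; |r1 - r2|; r1 * r2] -> (8, 512), which the MLP decoder maps to label space.
pair_repr = torch.cat([r1, r2, (r1 - r2).abs(), r1 * r2], dim=1)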
It can be instantiated just like any other
Model
.-
EMBEDDINGS
= ['embedding']¶
-
INPUTS_PAIR
= [['tokens1'], ['tokens2']]¶
-
forward
(input1: Tuple[torch.Tensor, ...], input2: Tuple[torch.Tensor, ...]) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.query_document_pairwise_ranking_model module¶
-
class
pytext.models.query_document_pairwise_ranking_model.
QueryDocPairwiseRankingModel
(embeddings: torch.nn.modules.container.ModuleList, representations: torch.nn.modules.container.ModuleList, decoder: pytext.models.decoders.mlp_decoder.MLPDecoder, output_layer: pytext.models.output_layers.doc_classification_output_layer.ClassificationOutputLayer, encode_relations: bool)[source]¶ Bases:
pytext.models.pair_classification_model.PairwiseModel
Pairwise ranking model. This model takes in a query and two responses (pos_response and neg_response), and passes representations of the query and the two responses to a decoder. pos_response should be ranked higher than neg_response; this is ensured by training with a ranking hinge loss function, as sketched below.
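A minimal sketch of such a ranking hinge loss, assuming per-example scores for the positive and negative responses and an illustrative margin of 1.0 (not necessarily PyText's default):

# Minimal sketch of a pairwise ranking hinge loss; the margin of 1.0 is an
# illustrative choice, not necessarily the value PyText uses.
import torch

def ranking_hinge_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    # Loss is zero once the positive response outscores the negative one by `margin`.
    return torch.clamp(margin - (pos_scores - neg_scores), min=0).mean()

loss = ranking_hinge_loss(torch.randn(16), torch.randn(16))

PyTorch's built-in nn.MarginRankingLoss expresses the same objective.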
-
forward
(pos_response: Tuple[torch.Tensor, torch.Tensor], neg_response: Tuple[torch.Tensor, torch.Tensor], query: Tuple[torch.Tensor, torch.Tensor]) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.roberta module¶
-
class
pytext.models.roberta.
RoBERTa
(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]¶ Bases:
pytext.models.bert_classification_models.NewBertModel
-
class
pytext.models.roberta.
RoBERTaEncoder
(config: pytext.models.roberta.RoBERTaEncoder.Config, output_encoded_layers: bool, **kwarg)[source]¶ Bases:
pytext.models.roberta.RoBERTaEncoderBase
A PyTorch RoBERTa implementation
-
class
pytext.models.roberta.
RoBERTaEncoderBase
(config: pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase.Config, output_encoded_layers=False, *args, **kwargs)[source]¶ Bases:
pytext.models.representations.transformer_sentence_encoder_base.TransformerSentenceEncoderBase
-
class
pytext.models.roberta.
RoBERTaEncoderJit
(config: pytext.models.roberta.RoBERTaEncoderJit.Config, output_encoded_layers: bool, **kwarg)[source]¶ Bases:
pytext.models.roberta.RoBERTaEncoderBase
A TorchScript RoBERTa implementation
-
class
pytext.models.roberta.
RoBERTaWordTaggingModel
(encoder, decoder, output_layer, stage=<Stage.TRAIN: 'Training'>)[source]¶ Bases:
pytext.models.model.BaseModel
Single Sentence Token-level Classification Model using XLM.
-
forward
(encoder_inputs: Tuple[torch.Tensor, ...], *args) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.word_model module¶
-
class
pytext.models.word_model.
WordTaggingLiteModel
(*args, **kwargs)[source]¶ Bases:
pytext.models.word_model.WordTaggingModel
Also a word tagging model, but one that uses bytes as inputs. Because it uses bytes instead of words, the model does not need to store a word embedding table mapping words in the vocab to their embedding vectors; instead it computes them on the fly using CharacterEmbedding. This produces an exported/serialized model that requires much less storage space as well as less memory at inference time.
-
class
pytext.models.word_model.
WordTaggingModel
(*args, **kwargs)[source]¶ Bases:
pytext.models.model.Model
Word tagging model. It can be used for any task that requires predicting the tag for a word/token. For example, the following tasks can be modeled as word tagging tasks. This is not an exhaustive list. 1. Part of speech tagging. 2. Named entity recognition. 3. Slot filling for task oriented dialog.
It can be instantiated just like any other
Model
.
Module contents¶
-
class
pytext.models.
Model
(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase, representation: pytext.models.representations.representation_base.RepresentationBase, decoder: pytext.models.decoders.decoder_base.DecoderBase, output_layer: pytext.models.output_layers.output_layer_base.OutputLayerBase)[source]¶ Bases:
pytext.models.model.BaseModel
Generic single-task model class that expects four components:
- Embedding
- Representation
- Decoder
- Output Layer
Forward pass: embedding -> representation -> decoder -> output_layer
These four components have specific responsibilities as described below.
Embedding layer should implement the way to represent each token in the input text. It can be as simple as just token/word embedding or can be composed of multiple ways to represent a token, e.g., word embedding, character embedding, etc.
Representation layer should implement the way to encode the entire input text such that the output vector(s) can be used by the decoder to produce logits. There is no restriction on the number of inputs it should encode. There is also no restriction on the number of ways to encode input.
Decoder layer should implement the way to consume the output of the model's representation and produce logits that can be used by the output layer to compute loss or generate predictions (and prediction scores/confidence).
Output layer should implement the way loss computation is done as well as the logic to generate predictions from the logits.
Let us discuss the joint intent-slot model as a case to go over these layers. The model predicts intent of input utterance and the slots in the utterance. (Refer to Train Intent-Slot model on ATIS Dataset for details about intent-slot model.)
EmbeddingList
layer is tasked with representing tokens. To do so we can use a learnable word embedding table in conjunction with a learnable character embedding table that are distilled to a token-level representation using CNN and pooling. Note: This class is meant to be reused by all models. It acts as a container of all the different ways of representing a token/word.
BiLSTMDocSlotAttention
is tasked with encoding the embedded input string for intent classification and slot filling. In order to do that it has a shared bidirectional LSTM layer followed by separate attention layers for document-level attention and word-level attention. Finally it produces two vectors per utterance.
IntentSlotModelDecoder
accepts the two input vectors from BiLSTMDocSlotAttention and produces logits for intent classification and slot filling. Conditioned on a flag it can also use the probabilities from intent classification for slot filling.
IntentSlotOutputLayer
implements the logic behind computing loss and prediction, as well as how to export this layer to Caffe2. This is used by the model exporter as a post-processing Caffe2 operator.
Parameters: - embedding (EmbeddingBase) – Description of parameter embedding.
- representation (RepresentationBase) – Description of parameter representation.
- decoder (DecoderBase) – Description of parameter decoder.
- output_layer (OutputLayerBase) – Description of parameter output_layer.
-
embedding
¶
-
representation
¶
-
decoder
¶
-
output_layer
¶
-
classmethod
compose_embedding
(sub_emb_module_dict: Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase], metadata) → pytext.models.embeddings.embedding_list.EmbeddingList[source]¶ Default implementation is to compose an instance of
EmbeddingList
with all the sub-embedding modules. You should override this class method if you want to implement a specific way to embed tokens/words.Parameters: sub_emb_module_dict (Dict[str, EmbeddingBase]) – Named dictionary of embedding modules each of which implement a way to embed/encode a token. Returns: An instance of EmbeddingList
.Return type: EmbeddingList
-
classmethod
create_embedding
(feat_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata)[source]¶
-
classmethod
create_sub_embs
(emb_config: pytext.config.field_config.FeatureConfig, metadata: pytext.data.data_handler.CommonMetadata) → Dict[str, pytext.models.embeddings.embedding_base.EmbeddingBase][source]¶ Creates the embedding modules defined in the emb_config.
Parameters: - emb_config (FeatureConfig) – Object containing all the sub-embedding configurations.
- metadata (CommonMetadata) – Object containing features and label metadata.
Returns: Named dictionary of embedding modules.
Return type: Dict[str, EmbeddingBase]
-
forward
(*inputs) → List[torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
pytext.models.
BaseModel
(stage: pytext.common.constants.Stage = <Stage.TRAIN: 'Training'>)[source]¶ Bases:
torch.nn.modules.module.Module
,pytext.config.component.Component
Base model class which inherits from nn.Module. Also has a stage flag to indicate it’s in train, eval, or test stage. This is because the built-in train/eval flag in PyTorch can’t distinguish eval and test, which is required to support some use cases.
-
SUPPORT_FP16_OPTIMIZER
= False¶
-
contextualize
(context)[source]¶ Add additional context into model. context can be anything that helps maintaining/updating state. For example, it is used by
DisjointMultitaskModel
for changing the task that should be trained with a given iterator.
-
eval
(stage=<Stage.TEST: 'Test'>)[source]¶ Override to explicitly maintain the stage (train, eval, test).
-
get_param_groups_for_optimizer
() → List[Dict[str, List[torch.nn.parameter.Parameter]]][source]¶ Returns a list of parameter groups of the format {“params”: param_list}. The parameter groups loosely correspond to layers and are ordered from low to high. Currently, only the embedding layer can provide multiple param groups, and other layers are put into one param group. The output of this method is passed to the optimizer so that schedulers can change learning rates by layer.
-
pytext.optimizer package¶
Subpackages¶
pytext.optimizer.sparsifiers package¶
-
class
pytext.optimizer.sparsifiers.blockwise_sparsifier.
BlockwiseMagnitudeSparsifier
(sparsity, starting_epoch, frequency, block_size, columnwise_blocking, accumulate_mask, layerwise_pruning)[source]¶ Bases:
pytext.optimizer.sparsifiers.sparsifier.L0_projection_sparsifier
Runs blockwise magnitude-based sparsification.
Parameters: - block_size – define the size of each block
- columnwise_blocking – define columnwise blocks if true
- starting_epoch – sparsification_condition returns true only after starting_epoch
- frequency – sparsification_condition returns true only if the number of steps divides frequency
- accumulate_mask – if true, the mask after each .sparsify() will be reused
- sparsity – percentage of zeros among the UNPRUNED parameters
Example of how the sparsifier works: take a 2D weight matrix with entries 0 through 24 (5 rows by 5 columns) and define 3 x 1 blocks over it. Compute the l1 norm of each block and sort the norms; retain the blocks with the largest absolute values until the sparsity threshold is met.
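A compact sketch of that procedure, under the simplifying assumption that blocks are groups of consecutive rows within each column; it is illustrative only, not the sparsifier's actual implementation:

# Illustrative blockwise magnitude pruning on a 2D weight matrix;
# not the actual BlockwiseMagnitudeSparsifier implementation.
import torch

def blockwise_mask(weight: torch.Tensor, block_rows: int, sparsity: float) -> torch.Tensor:
    rows, cols = weight.shape
    assert rows % block_rows == 0
    # Group consecutive rows into (block_rows x 1) blocks and score each block by its l1 norm.
    blocks = weight.abs().reshape(rows // block_rows, block_rows, cols).sum(dim=1)  # (n_blocks, cols)
    k = int(blocks.numel() * sparsity)            # number of blocks to prune
    threshold = blocks.flatten().kthvalue(k).values if k > 0 else blocks.min() - 1
    keep = (blocks > threshold).float()           # 1 = keep block, 0 = prune block
    # Expand the block-level decision back to individual weights.
    return keep.repeat_interleave(block_rows, dim=0)

mask = blockwise_mask(torch.arange(25.0).reshape(5, 5), block_rows=5, sparsity=0.4)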
-
classmethod
from_config
(config: pytext.optimizer.sparsifiers.blockwise_sparsifier.BlockwiseMagnitudeSparsifier.Config)[source]¶
-
get_masks
(model: torch.nn.modules.module.Module, pre_masks: List[torch.Tensor] = None) → List[torch.Tensor][source]¶ Note: this function returns the masks only; it does not sparsify or modify the weights.
Prunes x% of weights among the weights marked with "1" in pre_masks.
Parameters: - model – Model
- pre_masks – list of FloatTensors where "1" means the weight is retained and "0" means the weight is pruned
Returns: List[torch.Tensor], the intersection of the new masks and pre_masks, so that an entry is "1" only if the weight is retained by both the new mask and pre_masks
Return type: masks
-
class
pytext.optimizer.sparsifiers.sparsifier.
CRF_L1_SoftThresholding
(lambda_l1: float, starting_epoch: int, frequency: int)[source]¶ Bases:
pytext.optimizer.sparsifiers.sparsifier.CRF_SparsifierBase
- implements l1 regularization:
- min Loss(x, y, CRFparams) + lambda_l1 * ||CRFparams||_1
and solves the optimization problem via a (stochastic) proximal gradient-based method, i.e., soft-thresholding:
param_updated = sign(CRFparams) * max(abs(CRFparams) - lambda_l1, 0)
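The soft-thresholding update above maps directly to a short function; this is an illustrative sketch, not the sparsifier's code:

# Sketch of the soft-thresholding (proximal) update shown above.
import torch

def soft_threshold(params: torch.Tensor, lambda_l1: float) -> torch.Tensor:
    # sign(p) * max(|p| - lambda_l1, 0): shrink every entry toward zero and
    # clamp entries whose magnitude falls below lambda_l1 to exactly zero.
    return torch.sign(params) * torch.clamp(params.abs() - lambda_l1, min=0)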
-
class
pytext.optimizer.sparsifiers.sparsifier.
CRF_MagnitudeThresholding
(sparsity, starting_epoch, frequency, grouping)[source]¶ Bases:
pytext.optimizer.sparsifiers.sparsifier.CRF_SparsifierBase
magnitude-based (equivalent to projection onto the l0 constraint set) sparsification of the CRF transition matrix, preserving the top-k elements either rowwise or columnwise until the sparsity constraint is met.
-
class
pytext.optimizer.sparsifiers.sparsifier.
CRF_SparsifierBase
(config=None, *args, **kwargs)[source]¶
-
class
pytext.optimizer.sparsifiers.sparsifier.
L0_projection_sparsifier
(sparsity, starting_epoch, frequency, layerwise_pruning=True, accumulate_mask=False)[source]¶ Bases:
pytext.optimizer.sparsifiers.sparsifier.Sparsifier
L0 projection-based (unstructured) sparsification
Parameters: - weights (torch.Tensor) – input weight matrix
- sparsity (float32) – the desired sparsity [0-1]
-
apply_masks
(model: pytext.models.model.Model, masks: List[torch.Tensor])[source]¶ apply given masks to zero-out learnable weights in model
-
classmethod
from_config
(config: pytext.optimizer.sparsifiers.sparsifier.L0_projection_sparsifier.Config)[source]¶
-
get_masks
(model: pytext.models.model.Model, pre_masks: List[torch.Tensor] = None) → List[torch.Tensor][source]¶ Note: this function returns the masks only; it does not sparsify or modify the weights.
Prunes x% of weights among the weights marked with "1" in pre_masks.
Parameters: - model – Model
- pre_masks – list of FloatTensors where "1" means the weight is retained and "0" means the weight is pruned
Returns: List[torch.Tensor], the intersection of the new masks and pre_masks, so that an entry is "1" only if the weight is retained by both the new mask and pre_masks
Return type: masks
Submodules¶
pytext.optimizer.activations module¶
-
class
pytext.optimizer.activations.
GeLU
[source]¶ Bases:
torch.nn.modules.module.Module
Implements Gaussian Error Linear Units (GELUs).
Reference: Gaussian Error Linear Units (GELUs). Dan Hendrycks, Kevin Gimpel. Technical Report, 2017. https://arxiv.org/pdf/1606.08415.pdf
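This sketch uses the commonly seen tanh approximation from the cited paper; the module itself may instead use the exact erf form:

# GELU via the tanh approximation from the referenced paper; illustrative only.
import math
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))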
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.optimizer.fairseq_fp16_utils module¶
-
class
pytext.optimizer.fairseq_fp16_utils.
Fairseq_FP16OptimizerMixin
(*args, **kwargs)[source]¶ Bases:
object
-
backward
(loss)[source]¶ Computes the sum of gradients of the given tensor w.r.t. graph leaves.
Compared to
fairseq.optim.FairseqOptimizer.backward()
, this function additionally dynamically scales the loss to avoid gradient underflow.
-
load_state_dict
(state_dict, optimizer_overrides=None)[source]¶ Load an optimizer state dict.
In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
-
-
class
pytext.optimizer.fairseq_fp16_utils.
Fairseq_MemoryEfficientFP16OptimizerMixin
(*args, **kwargs)[source]¶ Bases:
object
-
backward
(loss)[source]¶ Computes the sum of gradients of the given tensor w.r.t. graph leaves.
Compared to
fairseq.optim.FairseqOptimizer.backward()
, this function additionally dynamically scales the loss to avoid gradient underflow.
-
load_state_dict
(state_dict, optimizer_overrides=None)[source]¶ Load an optimizer state dict.
In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
-
pytext.optimizer.fp16_optimizer module¶
-
class
pytext.optimizer.fp16_optimizer.
DynamicLossScaler
(init_scale, scale_factor, scale_window)[source]¶ Bases:
object
-
class
pytext.optimizer.fp16_optimizer.
FP16Optimizer
(fp32_optimizer)[source]¶ Bases:
pytext.optimizer.optimizers.Optimizer
-
param_groups
¶
-
-
class
pytext.optimizer.fp16_optimizer.
FP16OptimizerApex
(fp32_optimizer: pytext.optimizer.optimizers.Optimizer, model: torch.nn.modules.module.Module, opt_level: str, init_loss_scale: Optional[int], min_loss_scale: Optional[float])[source]¶
-
class
pytext.optimizer.fp16_optimizer.
FP16OptimizerDeprecated
(init_optimizer, init_scale, scale_factor, scale_window)[source]¶ Bases:
object
-
step
()[source]¶ Realize the weight update.
Copy the grads from the model to the master (FP32) weights. While iterating over the parameters, we check for overflow after converting the grads to float and copying them, then unscale them.
If no overflow happened, call the inner optimizer's step() and copy the updated weights from the inner optimizer back to the model.
Finally, update the loss scale according to the result of the overflow check.
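The overall loss-scaling loop that these FP16 optimizer wrappers implement can be summarized schematically; the constants and the overflow check below are illustrative, not PyText's exact logic:

# Schematic of dynamic loss scaling as described above; illustrative only.
import torch

class LossScalerSketch:
    def __init__(self, init_scale=65536.0, scale_factor=2.0, scale_window=2000):
        self.scale, self.scale_factor, self.scale_window = init_scale, scale_factor, scale_window
        self.good_steps = 0

    def step(self, loss, params, optimizer):
        (loss * self.scale).backward()              # scale the loss so FP16 grads don't underflow
        grads = [p.grad for p in params if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            self.scale /= self.scale_factor         # overflow: shrink the scale and skip the update
            self.good_steps = 0
        else:
            for g in grads:
                g.div_(self.scale)                  # unscale before the real optimizer step
            optimizer.step()
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.scale *= self.scale_factor     # grow the scale after a window of clean steps
        optimizer.zero_grad()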
-
-
class
pytext.optimizer.fp16_optimizer.
FP16OptimizerFairseq
(fp16_params, fp32_optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]¶ Bases:
fairseq.optim.fp16_optimizer._FP16OptimizerMixin
,pytext.optimizer.fp16_optimizer.FP16Optimizer
Wrap an optimizer to support FP16 (mixed precision) training.
-
clip_grad_norm
(max_norm, unused_model)[source]¶ Clips gradient norm and updates dynamic loss scaler.
-
-
class
pytext.optimizer.fp16_optimizer.
GeneratorFP16Optimizer
(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]¶ Bases:
pytext.optimizer.fp16_optimizer.PureFP16Optimizer
-
load_state_dict
(state_dict)[source]¶ Load an optimizer state dict.
We prefer the configuration of the existing optimizer instance. After we load the state dict into inner_optimizer, we create the copies of parameter references again, as in init().
-
step
()[source]¶ Updates weights.
- Effects:
Check for overflow. If there is none and the inner_optimizer supports a memory-efficient step, do an overall unscale and call the memory-efficient step.
If it doesn't support one, replace each parameter list in the inner_optimizer's param_groups with a generator of the tensors and call the normal step; data type conversion is then handled automatically in that function.
Whether or not an overflow occurred, the scale is updated at the end of the step.
-
-
class
pytext.optimizer.fp16_optimizer.
MemoryEfficientFP16OptimizerFairseq
(fp16_params, optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]¶ Bases:
fairseq.optim.fp16_optimizer._MemoryEfficientFP16OptimizerMixin
,pytext.optimizer.fp16_optimizer.FP16Optimizer
Wrap the mem efficient optimizer to support FP16 (mixed precision) training.
-
clip_grad_norm
(max_norm, unused_model)[source]¶ Clips gradient norm and updates dynamic loss scaler.
-
-
class
pytext.optimizer.fp16_optimizer.
PureFP16Optimizer
(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]¶ Bases:
pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated
-
load_state_dict
(state_dict)[source]¶ Load an optimizer state dict.
We prefer the configuration of the existing optimizer instance. Apply the same logic as in init(): point the param_groups of the outer optimizer to those of the inner_optimizer.
-
step
()[source]¶ Updates the weights in inner optimizer.
If the inner optimizer supports the memory-efficient path, check for overflow, unscale, and call its advanced step.
Otherwise, convert the weights and grads to float and check whether the grads overflow during the iteration; if there is no overflow, unscale the grads and call the inner optimizer's step. If overflow happens, do nothing and wait until the end to convert the weights and grads back to half precision (the grads will be eliminated in zero_grad).
-
-
pytext.optimizer.fp16_optimizer.
convert_generator
(params, scale)[source]¶ Create the generator for parameter tensors.
For each parameter, we convert it to float and unscale it. When the caller calls next(), we convert the previous parameter back to half precision and start processing the next parameter.
pytext.optimizer.lamb module¶
-
class
pytext.optimizer.lamb.
Lamb
(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, min_trust=None)[source]¶ Bases:
pytext.optimizer.optimizers.Optimizer
,torch.optim.optimizer.Optimizer
Implements Lamb algorithm. THIS WAS DIRECTLY COPIED OVER FROM pytorch/contrib: https://github.com/cybertronai/pytorch-lamb It has been proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. https://arxiv.org/abs/1904.00962
Has the option for minimum trust LAMB as described in “Single Headed Attention RNN: Stop Thinking With Your Head” section 6.3 https://arxiv.org/abs/1911.11423
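The distinguishing step of LAMB is a per-layer trust ratio that rescales the Adam-style update. A hedged sketch of that step, where adam_update is assumed to already include weight decay and min_trust is assumed to act as a lower bound on the ratio:

# Sketch of LAMB's per-layer trust-ratio scaling; `adam_update` is assumed to be
# the Adam step (including weight decay) for one parameter tensor.
import torch

def lamb_scaled_update(weight, adam_update, lr, min_trust=None):
    w_norm = weight.norm().item()
    u_norm = adam_update.norm().item()
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    if min_trust is not None:
        trust_ratio = max(trust_ratio, min_trust)   # assumed lower bound in the "minimum trust" variant
    return weight - lr * trust_ratio * adam_update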
pytext.optimizer.optimizers module¶
-
class
pytext.optimizer.optimizers.
Adagrad
(parameters, lr, weight_decay)[source]¶ Bases:
torch.optim.adagrad.Adagrad
,pytext.optimizer.optimizers.Optimizer
-
class
pytext.optimizer.optimizers.
Adam
(parameters, lr, weight_decay, eps)[source]¶ Bases:
torch.optim.adam.Adam
,pytext.optimizer.optimizers.Optimizer
-
class
pytext.optimizer.optimizers.
AdamW
(parameters, lr, weight_decay, eps)[source]¶ Bases:
torch.optim.adamw.AdamW
,pytext.optimizer.optimizers.Optimizer
Adds PyText support for Decoupled Weight Decay Regularization for Adam as done in the paper: https://arxiv.org/abs/1711.05101 for more information read the fast.ai blog on this optimization method here: https://www.fast.ai/2018/07/02/adam-weight-decay/
-
class
pytext.optimizer.optimizers.
Optimizer
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.config.component.Component
-
params
¶ Return an iterable of the parameters held by the optimizer.
-
-
class
pytext.optimizer.optimizers.
SGD
(parameters, lr, momentum)[source]¶ Bases:
torch.optim.sgd.SGD
,pytext.optimizer.optimizers.Optimizer
pytext.optimizer.radam module¶
-
class
pytext.optimizer.radam.
RAdam
(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]¶ Bases:
pytext.optimizer.optimizers.Optimizer
,torch.optim.optimizer.Optimizer
Implements rectified adam as derived in the following paper: “On the Variance of the Adaptive Learning Rate and Beyond” (https://arxiv.org/abs/1908.03265)
This code is mostly a direct copy-paste of the code provided by the authors here: https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
pytext.optimizer.scheduler module¶
-
class
pytext.optimizer.scheduler.
CosineAnnealingLR
(optimizer, T_max, eta_min=0, last_epoch=-1)[source]¶ Bases:
torch.optim.lr_scheduler.CosineAnnealingLR
,pytext.optimizer.scheduler.BatchScheduler
Wrapper around torch.optim.lr_scheduler.CosineAnnealingLR See the original documentation for more details.
-
class
pytext.optimizer.scheduler.
CyclicLR
(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)[source]¶ Bases:
torch.optim.lr_scheduler.CyclicLR
,pytext.optimizer.scheduler.BatchScheduler
Wrapper around torch.optim.lr_scheduler.CyclicLR See the original documentation for more details
-
class
pytext.optimizer.scheduler.
ExponentialLR
(optimizer, gamma, last_epoch=-1)[source]¶ Bases:
torch.optim.lr_scheduler.ExponentialLR
,pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.ExponentialLR See the original documentation for more details.
-
class
pytext.optimizer.scheduler.
LmFineTuning
(optimizer, cut_frac=0.1, ratio=32, non_pretrained_param_groups=2, lm_lr_multiplier=1.0, lm_use_per_layer_lr=False, lm_gradual_unfreezing=True, last_epoch=-1)[source]¶ Bases:
torch.optim.lr_scheduler._LRScheduler
,pytext.optimizer.scheduler.BatchScheduler
Fine-tuning methods from the paper “[arXiv:1801.06146]Universal Language Model Fine-tuning for Text Classification”.
Specifically, modifies training schedule using slanted triangular learning rates, discriminative fine-tuning (per-layer learning rates), and gradual unfreezing.
-
class
pytext.optimizer.scheduler.
PolynomialDecayScheduler
(optimizer, warmup_steps, total_steps, end_learning_rate, power)[source]¶ Bases:
torch.optim.lr_scheduler._LRScheduler
,pytext.optimizer.scheduler.BatchScheduler
Applies a polynomial decay with lr warmup to the learning rate.
It is commonly observed that a monotonically decreasing learning rate, whose degree of change is carefully chosen, results in a better performing model.
This scheduler linearly increases the learning rate from 0 to its final value over warmup_steps at the beginning of training. It then applies a polynomial decay function to each optimizer step, using the provided base_lrs, to reach end_learning_rate after total_steps.
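A hedged sketch of the resulting learning-rate curve (linear warmup to the base rate, then polynomial decay to end_learning_rate); the exact boundary handling in the real scheduler may differ:

# Illustrative step -> lr curve for linear warmup followed by polynomial decay.
def poly_decay_lr(step, base_lr, end_lr, warmup_steps, total_steps, power):
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)     # linear warmup from 0
    if step >= total_steps:
        return end_lr                                    # floor after total_steps
    remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr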
-
class
pytext.optimizer.scheduler.
ReduceLROnPlateau
(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)[source]¶ Bases:
torch.optim.lr_scheduler.ReduceLROnPlateau
,pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.ReduceLROnPlateau See the original documentation for more details.
-
class
pytext.optimizer.scheduler.
Scheduler
(config=None, *args, **kwargs)[source]¶ Bases:
pytext.config.component.Component
Schedulers help in adjusting the learning rate during training. Scheduler is a wrapper class over schedulers, which can come from the torch library or be custom implementations. Two kinds of lr scheduling are supported by this class: per-epoch scheduling and per-batch scheduling. In per-epoch scheduling, the learning rate is adjusted at the end of each epoch; in per-batch scheduling, the learning rate is adjusted after the forward and backward pass through one batch during training.
There are two main methods that need to be implemented by the Scheduler: step_epoch() is called at the end of each epoch, and step_batch() is called at the end of each batch in the training data.
prepare() method can be used by BatchSchedulers to initialize any attributes they may need.
-
class
pytext.optimizer.scheduler.
SchedulerWithWarmup
(optimizer, warmup_scheduler, scheduler, switch_steps)[source]¶ Bases:
torch.optim.lr_scheduler._LRScheduler
,pytext.optimizer.scheduler.BatchScheduler
Wraps another scheduler with a warmup phase. After warmup_steps defined in warmup_scheduler.warmup_steps, the scheduler will switch to use the specified scheduler in scheduler.
warmup_scheduler: is the configuration for the WarmupScheduler, that warms up learning rate over warmup_steps linearly.
scheduler: is the main scheduler that will be applied after the warmup phase (once warmup_steps have passed)
-
class
pytext.optimizer.scheduler.
StepLR
(optimizer, step_size, gamma=0.1, last_epoch=-1)[source]¶ Bases:
torch.optim.lr_scheduler.StepLR
,pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.StepLR See the original documentation for more details.
-
class
pytext.optimizer.scheduler.
WarmupScheduler
(optimizer, warmup_steps, inverse_sqrt_decay)[source]¶ Bases:
torch.optim.lr_scheduler._LRScheduler
,pytext.optimizer.scheduler.BatchScheduler
Scheduler to linearly increase the learning rate from 0 to its final value over a number of steps:
lr = base_lr * current_step / warmup_steps
After the warm-up phase, the scheduler has the option of decaying the learning rate as the inverse square root of the number of training steps taken:
lr = base_lr * sqrt(warmup_steps) / sqrt(current_step)
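The two formulas above translate directly into a step-to-learning-rate function; this is a direct, illustrative transcription:

# Direct transcription of the two formulas above; illustrative only.
import math

def warmup_lr(step, base_lr, warmup_steps, inverse_sqrt_decay=True):
    if step < warmup_steps:
        return base_lr * step / warmup_steps                      # linear warm-up
    if not inverse_sqrt_decay:
        return base_lr                                            # hold at base_lr after warm-up
    return base_lr * math.sqrt(warmup_steps) / math.sqrt(step)    # inverse square root decay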
pytext.optimizer.swa module¶
-
class
pytext.optimizer.swa.
StochasticWeightAveraging
(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]¶ Bases:
pytext.optimizer.optimizers.Optimizer
,torch.optim.optimizer.Optimizer
-
add_param_group
(param_group)[source]¶ Add a param group to the
Optimizer
's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the
Optimizer
as training progresses.Parameters: param_group (dict) – Specifies what Tensors should be optimized along with group-specific optimization options.
-
static
bn_update
(loader, model, device=None)[source]¶ Updates BatchNorm running_mean, running_var buffers in the model.
It performs one pass over data in loader to estimate the activation statistics for BatchNorm layers in the model.
Parameters: - loader (torch.utils.data.DataLoader) – dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
- model (torch.nn.Module) – model for which we seek to update BatchNorm statistics.
- device (torch.device, optional) – If set, data will be transferred to
device
before being passed intomodel
.
-
finalize
()[source]¶ Swaps the values of the optimized variables and swa buffers.
It’s meant to be called at the end of training to use the collected SWA running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.
-
classmethod
from_config
(config: pytext.optimizer.swa.StochasticWeightAveraging.Config, model: torch.nn.modules.module.Module)[source]¶
-
load_state_dict
(state_dict)[source]¶ Loads the optimizer state.
Parameters: state_dict (dict) – SWA optimizer state. Should be an object returned from a call to state_dict.
-
state_dict
()[source]¶ Returns the state of SWA as a
dict
. It contains three entries:
- opt_state – a dict holding the current optimization state of the base optimizer. Its content differs between optimizer classes.
- swa_state – a dict containing the current state of SWA. For each optimized variable it contains a swa_buffer keeping the running average of the variable.
- param_groups – a dict containing all parameter groups.
-
step
(closure=None)[source]¶ Performs a single optimization step.
In automatic mode also updates SWA running averages.
-
update_swa_group
(group)[source]¶ Updates the SWA running averages for the given parameter group.
Parameters: param_group (dict) – Specifies for what parameter group SWA running averages should be updated.
Examples
>>> # automatic mode
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>                             {'params': [y], 'lr': 1e-3}],
>>>                            lr=1e-2, momentum=0.9)
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()
-
Module contents¶
pytext.task package¶
Submodules¶
pytext.task.disjoint_multitask module¶
-
class
pytext.task.disjoint_multitask.
DisjointMultitask
(target_task_name, exporters, **kwargs)[source]¶ Bases:
pytext.task.task.TaskBase
Modules which have the same shared_module_key and type share parameters. Only the first instance of such module should be configured in tasks list.
-
export
(multitask_model, export_path, metric_channels, export_onnx_path=None)[source]¶ Wrapper method to export PyTorch model to Caffe2 model using
Exporter
.Parameters: - export_path (str) – file path of exported caffe2 model
- metric_channels (List[Channel]) – outputs of the model’s execution graph
- export_onnx_path (str) – file path of exported onnx model
-
classmethod
from_config
(task_config: pytext.task.disjoint_multitask.DisjointMultitask.Config, metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]¶ Create the task from config, and optionally load metadata/model_state This function will create components including
DataHandler
,Trainer
,MetricReporter
,Exporter
, and wire them up.Parameters: - task_config (Task.Config) – the config of the current task
- metadata – saved global context of this task, e.g: vocabulary, will be
generated by
DataHandler
if it’s None - model_state – saved model parameters, will be loaded into model when given
-
-
class
pytext.task.disjoint_multitask.
NewDisjointMultitask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task._NewTask
Multitask training based on underlying subtasks. To share parameters between modules from different tasks, specify the same shared_module_key. Only the first instance of each shared module should be configured in tasks list. Only the multitask trainer (not the per-task trainers) is used.
-
export
(model, export_path, metric_channels=None, export_onnx_path=None)[source]¶ Wrapper method to export PyTorch model to Caffe2 model using
Exporter
.Parameters: - export_path (str) – file path of exported caffe2 model
- metric_channels (List[Channel]) – outputs of model’s execution graph
- export_onnx_path (str) – file path of exported onnx model
-
classmethod
from_config
(task_config: pytext.task.disjoint_multitask.NewDisjointMultitask.Config, unused_metadata=None, model_state=None, tensorizers=None, rank=0, world_size=1)[source]¶ Create the task from config, and optionally load metadata/model_state This function will create components including
DataHandler
,Trainer
,MetricReporter
,Exporter
, and wire them up.Parameters: - task_config (Task.Config) – the config of the current task
- metadata – saved global context of this task, e.g: vocabulary, will be
generated by
DataHandler
if it’s None - model_state – saved model parameters, will be loaded into model when given
-
pytext.task.new_task module¶
-
class
pytext.task.new_task.
NewTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task._NewTask
pytext.task.serialize module¶
-
class
pytext.task.serialize.
CheckpointManager
[source]¶ Bases:
object
CheckpointManager is a class abstraction to manage a training job’s checkpoints with different IO and storage backends, using two functions: save() and load().
-
DELIMITER
= '-'¶
-
generate_checkpoint_path
(config: pytext.config.pytext_config.PyTextConfig, identifier: str)[source]¶
-
get_latest_checkpoint_path
() → str[source]¶ Return the most recently saved checkpoint path as a str. Returns: checkpoint_path (str)
-
list
() → List[str][source]¶ Return all existing checkpoint paths. Returns: checkpoint_path_list (List[str]); list elements are in the same order in which the checkpoints were saved
-
load
(load_path: str, overwrite_config=None)[source]¶ Loads a checkpoint from disk. Parameters: load_path (str) – the file path of the checkpoint to load
Returns: task (Task), config (PyTextConfig) and training_state (TrainingState)
-
save
(config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None, identifier: str = None) → str[source]¶ Save a checkpoint to the given path; config, model and training_state together represent the checkpoint. When identifier is None, this function is used to save a post-training snapshot.
-
-
pytext.task.serialize.
get_latest_checkpoint_path
(dir_path: Optional[str] = None) → str[source]¶ Get the latest checkpoint path. Parameters: dir_path – the dir to scan for existing checkpoint files. Default: None; if None, the latest checkpoint path saved in memory will be returned.
Returns: checkpoint_path
-
pytext.task.serialize.
load
(load_path: str, overwrite_config=None)[source]¶ Load task, config and training state from a saved snapshot. By default, it will construct the task using the saved config, then load metadata and model state.
If overwrite_config is specified, it will construct the task using overwrite_config, then load metadata and model state.
-
pytext.task.serialize.
save
(config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None, identifier: Optional[str] = None) → str[source]¶ Save all stateful information of a training task to a specified file-like object; this saves the original config, model state, metadata, and training state if training is not completed.
Parameters: - identifier (str) – used to identify a checkpoint within a training job, used as a suffix for the save path
- config (PyTextConfig) – contains all raw parameters/hyper-parameters for the training task
- model (Model) – the actual model in training
- training_state (TrainingState) – stateful information during training
Returns: identifier (str) – if identifier is not specified, will save to config.save_snapshot_path to be consistent with the post-training snapshot; if specified, will be used to save a checkpoint during training, where identifier is used to identify checkpoints in the same training
-
pytext.task.serialize.
save_checkpoint
(f: io.IOBase, config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None) → str[source]¶
pytext.task.task module¶
-
class
pytext.task.task.
TaskBase
(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]¶ Bases:
pytext.config.component.Component
Task is the central place to define and wire up components for data processing, model training, metric reporting, etc. Task class has a Config class containing the config of each component in a descriptive way.
-
export
(model, export_path, metric_channels=None, export_onnx_path=None)[source]¶ Wrapper method to export PyTorch model to Caffe2 model using
Exporter
.Parameters: - export_path (str) – file path of exported caffe2 model
- metric_channels (List[Channel]) – outputs of model’s execution graph
- export_onnx_path (str) – file path of exported onnx model
-
classmethod
format_prediction
(predictions, scores, context, target_meta)[source]¶ Format the prediction and score from model output, by default just return them in a dict
-
classmethod
from_config
(task_config, metadata=None, model_state=None, tensorizers=None, rank=1, world_size=0)[source]¶ Create the task from config, and optionally load metadata/model_state This function will create components including
DataHandler
,Trainer
,MetricReporter
,Exporter
, and wire them up.Parameters: - task_config (Task.Config) – the config of the current task
- metadata – saved global context of this task, e.g: vocabulary, will be
generated by
DataHandler
if it’s None - model_state – saved model parameters, will be loaded into model when given
-
predict
(examples)[source]¶ Generates predictions using the PyTorch model. The difference with test() is that this should be used when the examples do not have any true label/target.
Parameters: examples – json format examples, input names should match the names specified in this task’s features config
-
test
(test_path)[source]¶ Wrapper method to compute test metrics on holdout blind test dataset.
Parameters: test_path (str) – test data file path
-
train
(train_config, rank=0, world_size=1, training_state=None)[source]¶ Wrapper method to train the model using
Trainer
object.Parameters: - train_config (PyTextConfig) – config for training
- rank (int) – for distributed training only, rank of the gpu, default is 0
- world_size (int) – for distributed training only, total gpu to use, default is 1
-
-
class
pytext.task.task.
Task_Deprecated
(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]¶ Bases:
pytext.task.task.TaskBase
pytext.task.tasks module¶
-
class
pytext.task.tasks.
BertPairRegressionTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶
-
class
pytext.task.tasks.
DocumentClassificationTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
DocumentRegressionTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
EnsembleTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
IntentSlotTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
LMTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
MaskedLMTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
NewBertClassificationTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶
-
class
pytext.task.tasks.
NewBertPairClassificationTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶
-
class
pytext.task.tasks.
PairwiseClassificationTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
QueryDocumentPairwiseRankingTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
RoBERTaNERTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
SemanticParsingTask
(data: pytext.data.data.Data, model: pytext.models.semantic_parsers.rnng.rnng_parser.RNNGParser, metric_reporter: pytext.metric_reporters.compositional_metric_reporter.CompositionalMetricReporter, trainer: pytext.trainers.hogwild_trainer.HogwildTrainer)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
SeqNNTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
SquadQATask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
-
class
pytext.task.tasks.
WordTaggingTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task.NewTask
Module contents¶
-
class
pytext.task.
NewTask
(data: pytext.data.data.Data, model: pytext.models.model.BaseModel, metric_reporter: Optional[pytext.metric_reporters.metric_reporter.MetricReporter] = None, trainer: Optional[pytext.trainers.trainer.TaskTrainer] = None)[source]¶ Bases:
pytext.task.new_task._NewTask
-
class
pytext.task.
Task_Deprecated
(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]¶ Bases:
pytext.task.task.TaskBase
-
class
pytext.task.
TaskBase
(trainer: pytext.trainers.trainer.Trainer, data_handler: pytext.data.data_handler.DataHandler, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, exporter: Optional[pytext.exporters.exporter.ModelExporter])[source]¶ Bases:
pytext.config.component.Component
Task is the central place to define and wire up components for data processing, model training, metric reporting, etc. Task class has a Config class containing the config of each component in a descriptive way.
-
export
(model, export_path, metric_channels=None, export_onnx_path=None)[source]¶ Wrapper method to export PyTorch model to Caffe2 model using
Exporter
.Parameters: - export_path (str) – file path of exported caffe2 model
- metric_channels (List[Channel]) – outputs of model’s execution graph
- export_onnx_path (str) – file path of exported onnx model
-
classmethod
format_prediction
(predictions, scores, context, target_meta)[source]¶ Format the prediction and score from model output, by default just return them in a dict
-
classmethod
from_config
(task_config, metadata=None, model_state=None, tensorizers=None, rank=1, world_size=0)[source]¶ Create the task from config, and optionally load metadata/model_state This function will create components including
DataHandler
,Trainer
,MetricReporter
,Exporter
, and wire them up.Parameters: - task_config (Task.Config) – the config of the current task
- metadata – saved global context of this task, e.g: vocabulary, will be
generated by
DataHandler
if it’s None - model_state – saved model parameters, will be loaded into model when given
-
predict
(examples)[source]¶ Generates predictions using the PyTorch model. The difference with test() is that this should be used when the examples do not have any true label/target.
Parameters: examples – json format examples, input names should match the names specified in this task’s features config
-
test
(test_path)[source]¶ Wrapper method to compute test metrics on holdout blind test dataset.
Parameters: test_path (str) – test data file path
-
train
(train_config, rank=0, world_size=1, training_state=None)[source]¶ Wrapper method to train the model using
Trainer
object.Parameters: - train_config (PyTextConfig) – config for training
- rank (int) – for distributed training only, rank of the gpu, default is 0
- world_size (int) – for distributed training only, total number of GPUs to use, default is 1
-
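Taken together, the methods above form the typical lifecycle of a task: construct it from config, train, test, predict, and export. A minimal sketch follows; the task class, config objects, file paths and example fields are hypothetical, and only the method names and signatures are taken from the listing above.

# Hypothetical TaskBase lifecycle; `train_config` is assumed to be a PyTextConfig
# built elsewhere, with `train_config.task` as its Task.Config.
task = SomeTask.from_config(train_config.task)   # wires up DataHandler, Trainer, MetricReporter, Exporter
trained_model, best_metric = task.train(train_config)           # return shape assumed: (model, best metric)
task.test("/path/to/blind_test.tsv")                            # metrics on a holdout blind test set
predictions = task.predict([{"text": "set an alarm for 7am"}])  # JSON-style examples without labels
task.export(trained_model, "/tmp/model.c2")                     # export the PyTorch model to Caffe2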
-
pytext.task.
save
(config: pytext.config.pytext_config.PyTextConfig, model: pytext.models.model.Model, meta: Optional[pytext.data.data_handler.CommonMetadata], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], training_state: Optional[pytext.trainers.training_state.TrainingState] = None, identifier: Optional[str] = None) → str[source]¶ Save all stateful information of a training task to a specified file-like object; this saves the original config, model state, metadata, and the training state if training is not completed.
Parameters: - identifier (str) – used to identify a checkpoint within a training job; used as a suffix for the save path
- config (PyTextConfig) – contains all raw parameters/hyper-parameters for the training task
- model (Model) – actual model in training
- training_state (TrainingState) – stateful information during training
Returns: identifier (str) – if identifier is not specified, will save to config.save_snapshot_path to be consistent with the post-training snapshot; if specified, it will be used to save a checkpoint during training and to identify checkpoints within the same training job
-
pytext.task.
load
(load_path: str, overwrite_config=None)[source]¶ Load task, config and training state from a saved snapshot. By default, it will construct the task using the saved config and then load metadata and model state.
If overwrite_config is specified, it will construct the task using overwrite_config and then load metadata and model state.
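A checkpointing round trip with save and load might look as follows; config, model, meta, tensorizers and training_state are assumed to come from an in-progress training job, and the return value of load is an assumption based on the description above.

from pytext.task import save, load

# Save a mid-training checkpoint; `identifier` becomes a suffix of the save path
# and distinguishes checkpoints within the same training job.
save(config, model, meta, tensorizers,
     training_state=training_state, identifier="checkpoint-epoch-3")

# Restore from the post-training snapshot written to config.save_snapshot_path.
# The (task, config, training_state) return shape is assumed, not verified.
task, loaded_config, training_state = load(config.save_snapshot_path)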
pytext.torchscript package¶
Subpackages¶
pytext.torchscript.tensorizer package¶
-
class
pytext.torchscript.tensorizer.bert.
ScriptBERTTensorizer
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase
-
class
pytext.torchscript.tensorizer.bert.
ScriptBERTTensorizerBase
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer
-
class
pytext.torchscript.tensorizer.normalizer.
VectorNormalizer
(dim: int, do_normalization: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
Performs in-place normalization over all features of a dense feature vector by doing (x - mean)/stddev for each x in the feature vector.
This is a ScriptModule so that the normalize function can be called at training time in the tensorizer, as well as at inference time by using it in your TorchScript forward function. To use this in your tensorizer, update_meta_data must be called once per row in your initialize function, and calculate_feature_stats must be called after the last row has been processed. See usage in FloatListTensorizer for an example; a usage sketch also follows this class entry.
Setting do_normalization=False will make the normalize function an identity function.
-
forward
()[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
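A hypothetical usage sketch of the VectorNormalizer flow described above; the per-row argument to update_meta_data and the batch argument to normalize are assumptions based on the docstring, not verified signatures.

from pytext.torchscript.tensorizer.normalizer import VectorNormalizer

rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]   # dense feature vectors

normalizer = VectorNormalizer(dim=2)
for row in rows:                        # once per row, e.g. in a tensorizer's initialize()
    normalizer.update_meta_data(row)
normalizer.calculate_feature_stats()    # finalize mean/stddev after the last row

normalizer.normalize(rows)   # in-place (x - mean) / stddev per feature, per the description above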
-
class
pytext.torchscript.tensorizer.roberta.
ScriptRoBERTaTensorizer
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase
-
class
pytext.torchscript.tensorizer.roberta.
ScriptRoBERTaTensorizerWithIndices
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase
-
class
pytext.torchscript.tensorizer.xlm.
ScriptXLMTensorizer
(tokenizer: torch.jit.ScriptModule, token_vocab: pytext.torchscript.vocab.ScriptVocabulary, language_vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int, default_language: str)[source]¶ Bases:
pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer
-
class
pytext.torchscript.tensorizer.
ScriptBERTTensorizer
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase
-
class
pytext.torchscript.tensorizer.
ScriptRoBERTaTensorizer
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase
-
class
pytext.torchscript.tensorizer.
ScriptRoBERTaTensorizerWithIndices
(tokenizer: torch.jit.ScriptModule, vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int)[source]¶ Bases:
pytext.torchscript.tensorizer.bert.ScriptBERTTensorizerBase
-
class
pytext.torchscript.tensorizer.
ScriptXLMTensorizer
(tokenizer: torch.jit.ScriptModule, token_vocab: pytext.torchscript.vocab.ScriptVocabulary, language_vocab: pytext.torchscript.vocab.ScriptVocabulary, max_seq_len: int, default_language: str)[source]¶ Bases:
pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer
-
class
pytext.torchscript.tensorizer.
VectorNormalizer
(dim: int, do_normalization: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
Performs in-place normalization over all features of a dense feature vector by doing (x - mean)/stddev for each x in the feature vector.
This is a ScriptModule so that the normalize function can be called at training time in the tensorizer, as well as at inference time by using it in your TorchScript forward function. To use this in your tensorizer, update_meta_data must be called once per row in your initialize function, and calculate_feature_stats must be called after the last row has been processed. See usage in FloatListTensorizer for an example.
Setting do_normalization=False will make the normalize function an identity function.
-
forward
()[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.torchscript.tokenizer package¶
-
class
pytext.torchscript.tokenizer.bpe.
ScriptBPE
(vocab: Dict[str, int], eow: str = '_EOW')[source]¶ Bases:
torch.jit.ScriptModule
Byte-pair encoding implementation in TorchScript.
vocab_file should be a file-like object separated by newlines, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore can’t contain space (according to the python regex \s). The vocab file should be sorted according to the importance of each token, and they will be merged in this priority; the actual score values are irrelevant.
eow_token should be a string that is appended to the last character and token, and that token is used at each step in the process and returned at the end. You should set this to be consistent with the EOW signature used however you generated your ScriptBPE vocab file.
>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th 17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
-
classmethod
from_vocab_file
(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]¶
-
-
class
pytext.torchscript.tokenizer.tokenizer.
ScriptBPETokenizer
(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
-
class
pytext.torchscript.tokenizer.tokenizer.
ScriptDoNothingTokenizer
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
-
class
pytext.torchscript.tokenizer.tokenizer.
ScriptTextTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
-
input_type
() → pytext.torchscript.utils.ScriptInputType[source]¶ Determine the TorchScript module input type; currently there are four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
-
-
class
pytext.torchscript.tokenizer.tokenizer.
ScriptTokenTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
-
input_type
() → pytext.torchscript.utils.ScriptInputType[source]¶ Determine the TorchScript module input type; currently there are four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
-
-
class
pytext.torchscript.tokenizer.tokenizer.
ScriptTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
torch.jit.ScriptModule
-
input_type
() → pytext.torchscript.utils.ScriptInputType[source]¶ Determine the TorchScript module input type; currently there are four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
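Concretely, the four input shapes look like this; the batches themselves are made-up examples.

text_batch = ["set an alarm", "play some jazz"]                        # text: List[str]
tokens_batch = [["set", "an", "alarm"], ["play", "some", "jazz"]]      # tokens: List[List[str]]
multi_text_batch = [["turn it up", "a bit more"]]                      # multi_text: List[List[str]]
multi_tokens_batch = [[["turn", "it", "up"], ["a", "bit", "more"]]]    # multi_tokens: List[List[List[str]]]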
-
-
class
pytext.torchscript.tokenizer.
ScriptBPE
(vocab: Dict[str, int], eow: str = '_EOW')[source]¶ Bases:
torch.jit.ScriptModule
Byte-pair encoding implementation in TorchScript.
vocab_file should be a file-like object separated by newlines, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore can’t contain space (according to the python regex \s). The vocab file should be sorted according to the importance of each token, and they will be merged in this priority; the actual score values are irrelevant.
eow_token should be a string that is appended to the last character and token, and that token is used at each step in the process and returned at the end. You should set this to be consistent with the EOW signature used however you generated your ScriptBPE vocab file.
>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th 17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
-
classmethod
from_vocab_file
(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]¶
-
-
class
pytext.torchscript.tokenizer.
ScriptBPETokenizer
(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
-
class
pytext.torchscript.tokenizer.
ScriptDoNothingTokenizer
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
-
class
pytext.torchscript.tokenizer.
ScriptTextTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
-
input_type
() → pytext.torchscript.utils.ScriptInputType[source]¶ Determine the TorchScript module input type; currently there are four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
-
-
class
pytext.torchscript.tokenizer.
ScriptTokenTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
-
input_type
() → pytext.torchscript.utils.ScriptInputType[source]¶ Determine the TorchScript module input type; currently there are four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
-
Submodules¶
pytext.torchscript.module module¶
-
class
pytext.torchscript.module.
ScriptModule
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
torch.jit.ScriptModule
-
class
pytext.torchscript.module.
ScriptTextModule
(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]¶
-
class
pytext.torchscript.module.
ScriptTokenLanguageModule
(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]¶
-
class
pytext.torchscript.module.
ScriptTokenLanguageModuleWithDenseFeature
(model: torch.jit.ScriptModule, output_layer: torch.jit.ScriptModule, tensorizer: pytext.torchscript.tensorizer.tensorizer.ScriptTensorizer)[source]¶
pytext.torchscript.utils module¶
pytext.torchscript.vocab module¶
Module contents¶
pytext.trainers package¶
Submodules¶
pytext.trainers.ensemble_trainer module¶
-
class
pytext.trainers.ensemble_trainer.
EnsembleTrainer
(real_trainers)[source]¶ Bases:
pytext.trainers.trainer.TrainerBase
Trainer for ensemble models
-
real_trainer
¶ the actual trainer to run
Type: Trainer
-
pytext.trainers.hogwild_trainer module¶
-
class
pytext.trainers.hogwild_trainer.
HogwildTrainer
(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶ Bases:
pytext.trainers.trainer.Trainer
-
classmethod
from_config
(config: pytext.trainers.hogwild_trainer.HogwildTrainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶
-
run_epoch
(state: pytext.trainers.training_state.TrainingState, data_iter: torchtext.data.iterator.Iterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)¶
-
set_up_training
(state: pytext.trainers.training_state.TrainingState, training_data)¶
-
-
class
pytext.trainers.hogwild_trainer.
HogwildTrainer_Deprecated
(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶ Bases:
pytext.trainers.trainer.Trainer
-
classmethod
from_config
(config: pytext.trainers.hogwild_trainer.HogwildTrainer_Deprecated.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶
-
pytext.trainers.trainer module¶
-
class
pytext.trainers.trainer.
TaskTrainer
(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]¶ Bases:
pytext.trainers.trainer.Trainer
-
run_step
(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]¶ Our run_step is a bit different, because we’re wrapping the model forward call with model.train_batch, which arranges tensors and gets loss, etc.
Whenever “samples” contains more than one mini-batch (sample_size > 1), we want to accumulate gradients locally and only call all-reduce in the last backwards pass.
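A generic sketch of this accumulation pattern (not PyText’s actual run_step): skip the gradient all-reduce, via DistributedDataParallel’s no_sync(), on every mini-batch except the last one. compute_loss is a hypothetical helper, and model, optimizer and samples are assumed from context.

import contextlib

for i, batch in enumerate(samples):
    is_last = i == len(samples) - 1
    # no_sync() suppresses the gradient all-reduce for backward passes run inside it.
    sync_ctx = contextlib.nullcontext() if is_last or not hasattr(model, "no_sync") else model.no_sync()
    with sync_ctx:
        loss = compute_loss(model, batch)   # hypothetical: forward pass + loss
        loss.backward()                     # gradients accumulate locally across mini-batches
optimizer.step()
optimizer.zero_grad()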
-
-
class
pytext.trainers.trainer.
Trainer
(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]¶ Bases:
pytext.trainers.trainer.TrainerBase
- Base Trainer class that provides ways to:
- 1. Train a model, compute metrics against an eval set and use the metrics for model selection.
- 2. Test a trained model, compute and publish metrics against a blind test set.
-
epochs
¶ Training epochs
Type: int
-
early_stop_after
¶ Stop after how many epochs when the eval metric is not improving
Type: int
-
max_clip_norm
¶ Clip gradient norm if set
Type: Optional[float]
-
report_train_metrics
¶ Whether metrics on training data should be computed and reported.
Type: bool
-
target_time_limit_seconds
¶ Target time limit for training in seconds. If the expected time to train another epoch exceeds this limit, stop training.
Type: float
-
classmethod
from_config
(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶
-
run_epoch
(state: pytext.trainers.training_state.TrainingState, data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]¶
-
run_step
(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]¶
-
save_checkpoint
(state: pytext.trainers.training_state.TrainingState, train_config: pytext.config.pytext_config.PyTextConfig) → str[source]¶
-
set_up_training
(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator)[source]¶
-
test
(test_iter, model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]¶
-
train
(training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig, rank: int = 0) → Tuple[torch.nn.modules.module.Module, Any][source]¶ Train and eval a model; the model states will be modified.
Parameters: - training_data (BatchIterator) – batch iterator of training data
- eval_data (BatchIterator) – batch iterator of evaluation data
- model (Model) – model to be trained
- metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file, etc.
- train_config (PyTextConfig) – training config
- training_result (Optional) – only meaningful for Hogwild training; default is None
- rank (int) – only used in distributed training, the rank of the current training thread; evaluation will only be done in rank 0
Returns: the trained model together with the best metric
Return type: model, best_metric
-
train_from_state
(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig) → Tuple[torch.nn.modules.module.Module, Any][source]¶ Train and eval a model from a given training state; the training state will be modified. This function iterates over the epochs specified in the config, and for each epoch does the following (a schematic sketch of this loop follows this entry):
- Train model using training data, aggregate and report training results
- Adjust learning rate if scheduler is specified
- Evaluate model using evaluation data
- Calculate metrics based on evaluation results and select best model
Parameters: - training_state (TrainingState) – contains stateful information to be able to restore a training job
- training_data (BatchIterator) – batch iterator of training data
- eval_data (BatchIterator) – batch iterator of evaluation data
- model (Model) – model to be trained
- metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file, etc.
- train_config (PyTextConfig) – training config
Returns: the trained model together with the best metric
Return type: model, best_metric
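As a schematic illustration of the per-epoch loop above (pseudocode, not the actual Trainer implementation; run_one_epoch, evaluate, is_better, scheduler and the data iterators are hypothetical or assumed from context):

import copy

best_metric, best_model_state = None, None
for epoch in range(start_epoch, num_epochs):
    run_one_epoch(model, training_data, metric_reporter)             # 1. train and report training results
    if scheduler is not None:
        scheduler.step()                                             # 2. adjust learning rate
    eval_metric = evaluate(model, eval_data, metric_reporter)        # 3. evaluate on eval data
    if best_metric is None or is_better(eval_metric, best_metric):   # 4. model selection
        best_metric = eval_metric
        best_model_state = copy.deepcopy(model.state_dict())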
pytext.trainers.training_state module¶
Module contents¶
-
class
pytext.trainers.
Trainer
(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]¶ Bases:
pytext.trainers.trainer.TrainerBase
- Base Trainer class that provides ways to:
- 1. Train a model, compute metrics against an eval set and use the metrics for model selection.
- 2. Test a trained model, compute and publish metrics against a blind test set.
-
epochs
¶ Training epochs
Type: int
-
early_stop_after
¶ Stop after how many epochs when the eval metric is not improving
Type: int
-
max_clip_norm
¶ Clip gradient norm if set
Type: Optional[float]
-
report_train_metrics
¶ Whether metrics on training data should be computed and reported.
Type: bool
-
target_time_limit_seconds
¶ Target time limit for training in seconds. If the expected time to train another epoch exceeds this limit, stop training.
Type: float
-
classmethod
from_config
(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶
-
run_epoch
(state: pytext.trainers.training_state.TrainingState, data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]¶
-
run_step
(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]¶
-
save_checkpoint
(state: pytext.trainers.training_state.TrainingState, train_config: pytext.config.pytext_config.PyTextConfig) → str[source]¶
-
set_up_training
(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator)[source]¶
-
test
(test_iter, model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)[source]¶
-
train
(training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, model: pytext.models.model.Model, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig, rank: int = 0) → Tuple[torch.nn.modules.module.Module, Any][source]¶ Train and eval a model; the model states will be modified.
Parameters: - training_data (BatchIterator) – batch iterator of training data
- eval_data (BatchIterator) – batch iterator of evaluation data
- model (Model) – model to be trained
- metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file, etc.
- train_config (PyTextConfig) – training config
- training_result (Optional) – only meaningful for Hogwild training; default is None
- rank (int) – only used in distributed training, the rank of the current training thread; evaluation will only be done in rank 0
Returns: the trained model together with the best metric
Return type: model, best_metric
-
train_from_state
(state: pytext.trainers.training_state.TrainingState, training_data: pytext.data.data_handler.BatchIterator, eval_data: pytext.data.data_handler.BatchIterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, train_config: pytext.config.pytext_config.PyTextConfig) → Tuple[torch.nn.modules.module.Module, Any][source]¶ Train and eval a model from a given training state; the training state will be modified. This function iterates over the epochs specified in the config, and for each epoch does the following:
- Train model using training data, aggregate and report training results
- Adjust learning rate if scheduler is specified
- Evaluate model using evaluation data
- Calculate metrics based on evaluation results and select best model
Parameters: - training_state (TrainingState) – contains stateful information to be able to restore a training job
- training_data (BatchIterator) – batch iterator of training data
- eval_data (BatchIterator) – batch iterator of evaluation data
- model (Model) – model to be trained
- metric_reporter (MetricReporter) – compute metric based on training output and report results to console, file, etc.
- train_config (PyTextConfig) – training config
Returns: the trained model together with the best metric
Return type: model, best_metric
-
class
pytext.trainers.
TrainingState
(**kwargs)[source]¶ Bases:
object
-
best_model_metric
= None¶
-
best_model_state
= None¶
-
epoch
= 0¶
-
epochs_since_last_improvement
= 0¶
-
rank
= 0¶
-
stage
= 'Training'¶
-
step_counter
= 0¶
-
tensorizers
= None¶
-
-
class
pytext.trainers.
EnsembleTrainer
(real_trainers)[source]¶ Bases:
pytext.trainers.trainer.TrainerBase
Trainer for ensemble models
-
real_trainer
¶ the actual trainer to run
Type: Trainer
-
-
class
pytext.trainers.
HogwildTrainer
(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶ Bases:
pytext.trainers.trainer.Trainer
-
classmethod
from_config
(config: pytext.trainers.hogwild_trainer.HogwildTrainer.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶
-
run_epoch
(state: pytext.trainers.training_state.TrainingState, data_iter: torchtext.data.iterator.Iterator, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter)¶
-
set_up_training
(state: pytext.trainers.training_state.TrainingState, training_data)¶
-
-
class
pytext.trainers.
HogwildTrainer_Deprecated
(real_trainer_config, num_workers, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶ Bases:
pytext.trainers.trainer.Trainer
-
classmethod
from_config
(config: pytext.trainers.hogwild_trainer.HogwildTrainer_Deprecated.Config, model: torch.nn.modules.module.Module, *args, **kwargs)[source]¶
-
-
class
pytext.trainers.
TaskTrainer
(config: pytext.trainers.trainer.Trainer.Config, model: torch.nn.modules.module.Module)[source]¶ Bases:
pytext.trainers.trainer.Trainer
-
run_step
(samples: List[Any], state: pytext.trainers.training_state.TrainingState, metric_reporter: pytext.metric_reporters.metric_reporter.MetricReporter, report_metric: bool)[source]¶ Our run_step is a bit different, because we’re wrapping the model forward call with model.train_batch, which arranges tensors and gets loss, etc.
Whenever “samples” contains more than one mini-batch (sample_size > 1), we want to accumulate gradients locally and only call all-reduce in the last backwards pass.
-
pytext.utils package¶
Submodules¶
pytext.utils.ascii_table module¶
pytext.utils.cuda module¶
pytext.utils.data module¶
-
class
pytext.utils.data.
Slot
(label: str, start: int, end: int)[source]¶ Bases:
object
-
B_LABEL_PREFIX
= 'B-'¶
-
I_LABEL_PREFIX
= 'I-'¶
-
NO_LABEL_SLOT
= 'NoLabel'¶
-
b_label_name
¶
-
i_label_name
¶
-
-
pytext.utils.data.
align_slot_labels
(token_ranges: List[Tuple[int, int]], slots_field: str, use_bio_labels: bool = False)[source]¶
-
pytext.utils.data.
byte_length
(text: str) → int[source]¶ Return the string length in terms of byte offsets
-
pytext.utils.data.
char_offset_to_byte_offset
(text: str, char_offset: int) → int[source]¶ Convert a char offset to byte offset
-
pytext.utils.data.
get_substring_from_offsets
(text: str, start: Optional[int], end: Optional[int], byte_offset: bool = True) → str[source]¶ Access a substring of a text using byte offsets if byte_offset is True; otherwise return the substring as the usual text[start:end].
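For example, with a string containing a multi-byte character (UTF-8 assumed; the exact slicing behaviour of get_substring_from_offsets is inferred from the docstrings above):

from pytext.utils.data import byte_length, char_offset_to_byte_offset, get_substring_from_offsets

text = "café au lait"                  # 'é' occupies two bytes in UTF-8
byte_length(text)                      # 13 (12 characters, 13 bytes)
char_offset_to_byte_offset(text, 4)    # 5: the first 4 characters span 5 bytes
get_substring_from_offsets(text, 0, 5, byte_offset=True)   # "café"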
-
pytext.utils.data.
parse_and_align_slot_labels_list
(token_ranges: List[Tuple[int, int]], slots_field: str, use_bio_labels: bool = False)[source]¶
pytext.utils.distributed module¶
-
pytext.utils.distributed.
dist_init
(distributed_rank: int, world_size: int, init_method: str, device_id: int, backend: str = 'nccl', gpu_streams: int = 1)[source]¶ 1. After spawning a process per GPU, we want all workers to call init_process_group around the same time, or it times out. 2. After dist_init, we want all workers to start calling all_reduce/barrier around the same time, or NCCL times out.
-
pytext.utils.distributed.
get_shard_range
(dataset_size: int, rank: int, world_size: int)[source]¶ In case dataset_size is not evenly divided by world_size, we need to pad one extra example in each shard: shard_len = dataset_size // world_size + 1
Case 1, rank < remainder: each shard's start position is rank * shard_len
Case 2, rank >= remainder: without padding, each shard's start position is rank * (shard_len - 1) + remainder = rank * shard_len - (rank - remainder). But to make sure all shards have the same size, we need to pad one extra example when rank >= remainder, so start_position = start_position - 1
For example, with dataset_size = 21, world_size = 8:
rank 0 to 4: [0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14]
rank 5 to 7: [14, 15, 16], [16, 17, 18], [18, 19, 20]
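The rule above can be re-derived in a few lines; this is a sketch of the documented non-divisible case, not the library function itself, and with dataset_size=21, world_size=8 it reproduces the example shards.

def shard_indices(dataset_size, rank, world_size):
    shard_len = dataset_size // world_size + 1
    remainder = dataset_size % world_size
    start = rank * shard_len
    if rank >= remainder:
        # shift back for the missing examples, plus one extra example of
        # padding so every shard has the same length
        start -= (rank - remainder) + 1
    return list(range(start, start + shard_len))

[shard_indices(21, r, 8) for r in range(8)]
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14],
#  [14, 15, 16], [16, 17, 18], [18, 19, 20]]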
pytext.utils.documentation module¶
-
pytext.utils.documentation.
find_config_class
(class_name)[source]¶ Return the set of PyText classes matching that name. Handles fully-qualified class_name including module.
-
pytext.utils.documentation.
get_class_members_recursive
(obj)[source]¶ Find all the field names for a given class and their default value.
-
pytext.utils.documentation.
get_config_fields
(obj)[source]¶ Return a dict of config help for this object, where: - key: config name - value: (default, type, options)
- default: default value for this key if not specified
- type: type for this config value, as a string
- options: possible values for this config, only if type = Union
If the type is “Union”, the options give the lists of class names that are possible, and the default is one of those class names.
-
pytext.utils.documentation.
get_subclasses
(klass, stop_classes=(<class 'pytext.models.module.Module'>, <class 'pytext.config.component.Component'>, <class 'torch.nn.modules.module.Module'>))[source]¶
pytext.utils.embeddings module¶
-
class
pytext.utils.embeddings.
PretrainedEmbedding
(embeddings_path: str = None, lowercase_tokens: bool = True, skip_header: bool = True, delimiter: str = ' ')[source]¶ Bases:
object
Utility class for loading/caching/initializing word embeddings
-
cache_pretrained_embeddings
(cache_path: str) → None[source]¶ Cache the processed embedding vectors and vocab to a file for faster loading
-
initialize_embeddings_weights
(str_to_idx: Dict[str, int], unk: str, embed_dim: int, init_strategy: pytext.config.field_config.EmbedInitStrategy) → torch.Tensor[source]¶ Initialize embeddings weights of shape (len(str_to_idx), embed_dim) from the pretrained embedding vectors. Words that are not in the pretrained embeddings list will be initialized according to init_strategy.
Parameters: - str_to_idx – a dict that maps words to indices that the model expects
- unk – unknown token
- embed_dim – the embeddings dimension
- init_strategy – method of initializing new tokens
Returns: a float tensor of dimension (vocab_size, embed_dim)
-
load_pretrained_embeddings
(raw_embeddings_path: str, append: bool = False, dialect: str = None, lowercase_tokens: bool = True, skip_header: bool = True, delimiter: str = ' ') → None[source]¶ Load raw embedding vectors from a file in the format:
num_words dim
word_i v0, v1, v2, …., v_dim
word_2 v0, v1, v2, …., v_dim
….
Optionally appends _dialect to every token in the vocabulary (for XLU embeddings).
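A hypothetical end-to-end use of PretrainedEmbedding; the embeddings file path, the toy vocabulary and EmbedInitStrategy.RANDOM are assumptions for illustration.

from pytext.config.field_config import EmbedInitStrategy
from pytext.utils.embeddings import PretrainedEmbedding

emb = PretrainedEmbedding()
emb.load_pretrained_embeddings("/path/to/glove.6B.300d.txt")   # rows of "word v0 v1 ... v299"
emb.cache_pretrained_embeddings("/tmp/glove.cache")            # faster to reload next time

str_to_idx = {"<unk>": 0, "alarm": 1, "jazz": 2}               # the model's vocabulary
weights = emb.initialize_embeddings_weights(
    str_to_idx, unk="<unk>", embed_dim=300,
    init_strategy=EmbedInitStrategy.RANDOM,   # assumed member; unseen words get random vectors
)
# weights: float tensor of shape (len(str_to_idx), 300)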
-
pytext.utils.file_io module¶
TODO: @stevenliu Deprecate this file after borc available in PyPI
pytext.utils.label module¶
pytext.utils.lazy module¶
-
class
pytext.utils.lazy.
Infer
(resolve_fn)[source]¶ Bases:
object
A value which can be inferred from a forward pass. Infer objects should be passed as arguments or keyword arguments to Lazy objects; see Lazy documentation for more details.
-
class
pytext.utils.lazy.
Lazy
(module_class, *args, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
A module which is able to infer some of its parameters from the inputs to its first forward pass. Lazy wraps any other nn.Module, and arguments can be passed that will be used to construct that wrapped Module after the first forward pass. If any of these arguments are Infer objects, those arguments will be replaced by calling the callback of the Infer object on the forward pass input.
For instance,
>>> Lazy(nn.Linear, Infer(lambda input: input.size(-1)), 4)
Lazy()
takes its in_features dimension from the last dimension of the input to its forward pass. This can be simplified to
>>> Lazy(nn.Linear, Infer.dimension(-1), 4)
or a partial can be created, for instance
>>> LazyLinear = Lazy.partial(nn.Linear, Infer.dimension(-1))
>>> LazyLinear(4)
Lazy()
Finally, these Lazy objects explicitly forbid treating themselves normally; they must instead be replaced by calling init_lazy_modules on your model before training. For instance,
>>> ll = lazy.Linear(4)
>>> seq = nn.Sequential(ll)
>>> seq
Sequential(
  0: Lazy(),
)
>>> init_lazy_modules(seq, torch.rand(1, 2))
Sequential(
  0: Linear(in_features=2, out_features=4, bias=True)
)
-
forward
(*args, **kwargs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
exception
pytext.utils.lazy.
UninitializedLazyModuleError
[source]¶ Bases:
Exception
A lazy module was used improperly.
-
pytext.utils.lazy.
init_lazy_modules
(module: torch.nn.modules.module.Module, dummy_input: Tuple[torch.Tensor, ...]) → torch.nn.modules.module.Module[source]¶ Finalize an nn.Module which has Lazy components. This will both mutate internal modules which have Lazy elements, and return a new non-lazy nn.Module (in case the top-level module itself is Lazy).
Parameters: - module – An nn.Module which may be lazy or contain Lazy subcomponents
- dummy_input – module is called with this input to ensure that Lazy subcomponents have been able to infer any parameters they need
Returns: The full nn.Module object constructed using inferred arguments/dimensions.
-
class
pytext.utils.lazy.
lazy_property
(fget)[source]¶ Bases:
object
More or less copy-pasta: http://stackoverflow.com/a/6849299 Meant to be used for lazy evaluation of an object attribute. property should represent non-mutable data, as it replaces itself.
pytext.utils.loss module¶
-
class
pytext.utils.loss.
LagrangeMultiplier
[source]¶ Bases:
torch.autograd.function.Function
-
static
backward
(ctx, grad_output)[source]¶ Defines a formula for differentiating the operation.
This function is to be overridden by all subclasses.
It must accept a context
ctx
as the first argument, followed by as many outputs as forward()
returned, and it should return as many tensors as there were inputs to forward()
. Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. The context can be used to retrieve tensors saved during the forward pass. It also has an attribute
ctx.needs_input_grad
as a tuple of booleans representing whether each input needs a gradient. E.g., backward()
will have ctx.needs_input_grad[0] = True
if the first input to forward()
needs a gradient computed w.r.t. the output.
-
static
forward
(ctx, input)[source]¶ Performs the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
The context can be used to store tensors that can be then retrieved during the backward pass.
-
-
pytext.utils.loss.
build_class_priors
(labels, class_priors=None, weights=None, positive_pseudocount=1.0, negative_pseudocount=1.0)[source]¶ build class priors, if necessary. For each class, the class priors are estimated as (P + sum_i w_i y_i) / (P + N + sum_i w_i), where y_i is the ith label, w_i is the ith weight, P is a pseudo-count of positive labels, and N is a pseudo-count of negative labels.
Parameters: - labels – A Tensor with shape [batch_size, num_classes]. Entries should be in [0, 1].
- class_priors – None, or a floating point Tensor of shape [C] containing the prior probability of each class (i.e. the fraction of the training data consisting of positive examples). If None, the class priors are computed from targets with a moving average.
- weights – Tensor of shape broadcastable to labels, [N, 1] or [N, C], where N = batch_size, C = num_classes`
- positive_pseudocount – Number of positive labels used to initialize the class priors.
- negative_pseudocount – Number of negative labels used to initialize the class priors.
Returns: - A Tensor of shape [num_classes] consisting of the
weighted class priors, after updating with moving average ops if created.
Return type: class_priors
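The formula above can be checked numerically, independent of the library implementation (which also supports a moving-average update when class_priors is None):

import torch

labels = torch.tensor([[1., 0.], [1., 1.], [0., 1.]])   # [batch_size, num_classes]
weights = torch.ones(3, 1)                              # [N, 1], broadcastable to labels
P = N = 1.0                                             # positive / negative pseudo-counts

priors = (P + (weights * labels).sum(dim=0)) / (P + N + weights.sum(dim=0))
# tensor([0.6000, 0.6000])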
-
pytext.utils.loss.
false_postives_upper_bound
(labels, logits, weights)[source]¶ false_positives_upper_bound defined in paper: “Scalable Learning of Non-Decomposable Objectives”
Parameters: - labels – A Tensor of shape broadcastable to logits.
- logits – A Tensor of shape [N, C] or [N, C, K]. If the third dimension is present, the lower bound is computed on each slice [:, :, k] independently.
- weights – Per-example loss coefficients, with shape broadcast-compatible with that of labels. i.e. [N, 1] or [N, C]
Returns: A Tensor of shape [C] or [C, K].
-
pytext.utils.loss.
range_to_anchors_and_delta
(precision_range, num_anchors)[source]¶ Calculates anchor points from precision range.
Parameters: - precision_range – an interval (a, b), where 0.0 <= a <= b <= 1.0
- num_anchors – int, number of equally spaced anchor points.
Returns: - A Tensor of [num_anchors] equally spaced values
in the interval precision_range.
delta: The spacing between the values in precision_values.
Return type: precision_values
Raises: ValueError
– If precision_range is invalid.
-
pytext.utils.loss.
true_positives_lower_bound
(labels, logits, weights)[source]¶ true_positives_lower_bound defined in paper: “Scalable Learning of Non-Decomposable Objectives”
Parameters: - labels – A Tensor of shape broadcastable to logits.
- logits – A Tensor of shape [N, C] or [N, C, K]. If the third dimension is present, the lower bound is computed on each slice [:, :, k] independently.
- weights – Per-example loss coefficients, with shape [N, 1] or [N, C]
Returns: A Tensor of shape [C] or [C, K].
-
pytext.utils.loss.
weighted_hinge_loss
(labels, logits, positive_weights=1.0, negative_weights=1.0)[source]¶ Parameters: - labels – one-hot representation Tensor of shape broadcastable to logits
- logits – A Tensor of shape [N, C] or [N, C, K]
- positive_weights – Scalar or Tensor
- negative_weights – same shape as positive_weights
Returns: 3D Tensor of shape [N, C, K], where K is length of positive weights or 2D Tensor of shape [N, C]
pytext.utils.meter module¶
pytext.utils.mobile_onnx module¶
-
pytext.utils.mobile_onnx.
add_feats_numericalize_ops
(init_net, predict_net, vocab_map, input_names)[source]¶
pytext.utils.model module¶
-
pytext.utils.model.
get_mismatched_param
(models: Iterable[torch.nn.modules.module.Module], rel_epsilon: Optional[float] = None, abs_epsilon: Optional[float] = None) → str[source]¶ Return the name of the first mismatched parameter. Return an empty string if all the parameters of the modules are identical.
pytext.utils.onnx module¶
pytext.utils.path module¶
pytext.utils.precision module¶
pytext.utils.tensor module¶
pytext.utils.timing module¶
pytext.utils.torch module¶
-
class
pytext.utils.torch.
CPUOnlyParameter
(*args, **kwargs)[source]¶ Bases:
torch.nn.parameter.Parameter
-
cuda
(device=None, non_blocking=False) → Tensor[source]¶ Returns a copy of this object in CUDA memory.
If this object is already in CUDA memory and on the correct device, then no copy is performed and the original object is returned.
Parameters: - device (
torch.device
) – The destination GPU device. Defaults to the current CUDA device. - non_blocking (bool) – If
True
and the source is in pinned memory, the copy will be asynchronous with respect to the host. Otherwise, the argument has no effect. Default:False
.
- device (
-
Submodules¶
pytext.builtin_task module¶
pytext.main module¶
-
pytext.main.
run_single
(rank: int, config_json: str, world_size: int, dist_init_method: Optional[str], metadata: Union[Dict[str, pytext.data.data_handler.CommonMetadata], pytext.data.data_handler.CommonMetadata, None], metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]])[source]¶
pytext.workflow module¶
-
pytext.workflow.
export_saved_model_to_caffe2
(saved_model_path: str, export_caffe2_path: str, output_onnx_path: str = None) → None[source]¶
-
pytext.workflow.
export_saved_model_to_torchscript
(saved_model_path: str, path: str, quantize: bool = False) → None[source]¶
-
pytext.workflow.
get_logits
(snapshot_path: str, use_cuda_if_available: bool, output_path: Optional[str] = None, test_path: Optional[str] = None, field_names: Optional[List[str]] = None, dump_raw_input: bool = False)[source]¶
-
pytext.workflow.
prepare_task
(config: pytext.config.pytext_config.PyTextConfig, dist_init_url: str = None, device_id: int = 0, rank: int = 0, world_size: int = 1, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None, metadata: pytext.data.data_handler.CommonMetadata = None) → Tuple[pytext.task.task.Task_Deprecated, pytext.trainers.training_state.TrainingState][source]¶
-
pytext.workflow.
prepare_task_metadata
(config: pytext.config.pytext_config.PyTextConfig) → pytext.data.data_handler.CommonMetadata[source]¶ Loading the whole dataset into CPU memory in every single process could cause OOMs in data-parallel distributed training. To avoid this, we move the operations that require loading the whole dataset out of spawn, and pass the context to every single process.
-
pytext.workflow.
save_and_export
(config: pytext.config.pytext_config.PyTextConfig, task: pytext.task.task.Task_Deprecated, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None) → None[source]¶
-
pytext.workflow.
test_model
(test_config: pytext.config.pytext_config.TestConfig, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]], test_out_path: str) → Any[source]¶
-
pytext.workflow.
test_model_from_snapshot_path
(snapshot_path: str, use_cuda_if_available: bool, test_path: Optional[str] = None, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None, test_out_path: str = '', field_names: Optional[List[str]] = None)[source]¶
-
pytext.workflow.
train_model
(config: pytext.config.pytext_config.PyTextConfig, dist_init_url: str = None, device_id: int = 0, rank: int = 0, world_size: int = 1, metric_channels: Optional[List[pytext.metric_reporters.channel.Channel]] = None, metadata: pytext.data.data_handler.CommonMetadata = None) → Tuple[source]¶
Module contents¶
-
pytext.
batch_predict_caffe2_model
(pytext_model_file: str, caffe2_model_file: str, db_type: str = 'minidb', data_source: Optional[pytext.data.sources.data_source.DataSource] = None, use_cuda=False, task: Optional[pytext.task.new_task.NewTask] = None, train_config: Optional[pytext.config.pytext_config.PyTextConfig] = None)[source]¶
-
pytext.
create_predictor
(config: pytext.config.pytext_config.PyTextConfig, model_file: Optional[str] = None, db_type: str = 'minidb', task: Optional[pytext.task.new_task.NewTask] = None) → Callable[[Mapping[str, str]], Mapping[str, numpy.array]][source]¶ Create a simple prediction API from a training config and an exported caffe2 model file. This model file should be created by calling export on a trained model snapshot.
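A minimal prediction sketch using create_predictor; the config object, the exported model path and the "text" input field are assumptions, and the input field names must match the trained task’s features config.

from pytext import create_predictor

# `config` is assumed to be the PyTextConfig used for training, loaded elsewhere.
predictor = create_predictor(config, model_file="/tmp/model.c2")

result = predictor({"text": "set an alarm for 7am"})
# result: mapping from output names to numpy arrays of scores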