Webrecorder pywb documentation!

The Webrecorder (pywb) toolkit is a full-featured, advanced web archiving capture and replay framework for Python. It provides command-line tools and an extensible framework for high-fidelity web archive access and creation. A subset of features provides the basic functionality of a “Wayback Machine”.

Usage

New Features

The 2.0 release of pywb is a significant overhaul from the previous iteration, and introduces many new features, including:

  • Dynamic multi-collection configuration system with no-restart updates.
  • New Recording Mode capability to create new web archives from the live web or from other archives.
  • Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
  • Support for Memento API aggregation and fallback chains for querying multiple remote and local archival sources.
  • HTTP/S Proxy Mode with customizable certificate authority for proxy mode recording and replay.
  • Flexible rewriting system with pluggable rewriters for different content-types.
  • Significantly improved Client-Side Rewriting System (wombat.js) to handle most modern web sites.
  • Improved ‘calendar’ query UI with incremental loading, grouping results by year and month, and updated replay banner.
  • New in 2.4: Extensible Customization Guide for modifying all aspects of the UI.
  • New in 2.4: Robust Embargo and Access Control system for blocking or excluding URLs, by prefix or by exact match.

Getting Started

At its core, pywb includes a fully featured web archive replay system, sometimes known as ‘wayback machine’, to provide the ability to replay, or view, archived web content in the browser.

If you have existing web archive (WARC or legacy ARC) files, here’s how to make them accessible using pywb.

(If not, see Creating a Web Archive for instructions on how to easily create a WARC file right away)

By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.

pywb ships with several Command-Line Apps. The following two are useful to get started:

  • wb-manager is a command line tool for managing common collection operations.
  • wayback (pywb) starts a web server that provides access to web archives.

(For more details, run wb-manager -h and wayback -h)

For example, to install pywb, create a new collection “my-web-archive” in ./collections/my-web-archive, add a WARC to it and start the server:

pip install pywb
wb-manager init my-web-archive
wb-manager add my-web-archive <path/to/my_warc.warc.gz>
wayback

Point your browser to http://localhost:8080/my-web-archive/<url>/ where <url> is a url you recorded before into your WARC/ARC file.

If all worked well, you should see your archived version of <url>. Congrats, you are now running your own web archive!

Getting Started Using Docker

pywb also comes with an official production-ready Dockerfile, and several automatically built Docker images.

The following Docker image tags are updated automatically with pywb updates on github:

  • webrecorder/pywb – corresponds to the latest release of pywb and the master branch on github.
  • webrecorder/pywb:develop – corresponds to the develop branch of pywb on github and contains the latest development work.
  • webrecorder/pywb:<VERSION> – Starting with pywb 2.2, each incremental release will correspond to a Docker image with tag <VERSION>

Using a specific version, e.g. webrecorder/pywb:<VERSION>, is recommended for production. Versioned Docker images are available for pywb releases >= 2.2.

All releases of pywb are listed in the Python Package Index for pywb

All of the currently available Docker image tags are listed on Docker hub

For the below examples, the latest webrecorder/pywb image is used.

To add WARCs in Docker, the source directory should be added as a volume.

By default, pywb runs out of the /webarchive directory, which should generally be mounted as a volume to store the data on the host outside the container. pywb will not change permissions of the data mounted at /webarchive and will instead attempt to run as the same user that owns the directory.

For example, given a WARC at /path/to/my_warc.warc.gz and a pywb data directory of /pywb-data, the following will add the WARC to a new collection and start pywb:

docker pull webrecorder/pywb
docker run -e INIT_COLLECTION=my-web-archive -v /pywb-data:/webarchive \
   -v /path/to:/source webrecorder/pywb wb-manager add my-web-archive /source/my_warc.warc.gz
docker run -p 8080:8080 -v /pywb-data/:/webarchive webrecorder/pywb wayback

This example is equivalent to the non-Docker example above.

Setting INIT_COLLECTION=my-web-archive results in automatic collection initialization via wb-manager init my-web-archive.

The wayback command is launched on port 8080 and mapped to the same on the local host.

If the wayback command is not specified, the Docker container launches with the uwsgi server recommended for production deployment. See Deployment for more info.

Using Existing Web Archive Collections

Existing archives of WARC/ARC files can be used with pywb with a minimal amount of setup. By using wb-manager add, WARC/ARC files will automatically be placed in the collection archive directory and indexed.

By default, wb-manager places new collections in the collections/<coll name> subdirectory of the current working directory. To specify a different root directory, use wb-manager -d <dir>. Other options can be set in the config file.

If you have a large number of existing CDX index files, pywb will be able to read them as well after running through a simple conversion process.

It is recommended that any index files be converted to the latest CDXJ format, which can be done by running: wb-manager cdx-convert <path/to/cdx>

To set up a collection with existing ARC/WARCs and CDX index files, you can:

  1. Run wb-manager init <coll name>. This will initialize all the required collection directories.
  2. Copy any archive files (WARCs and ARCs) to collections/<coll name>/archive/
  3. Copy any existing cdx indexes to collections/<coll name>/indexes/
  4. Run wb-manager cdx-convert collections/<coll name>/indexes/. This is strongly recommended, as it will ensure that any legacy indexes are updated to the latest CDXJ format.

This will fully migrate your archive and indexes into the collection. Any new WARCs added with wb-manager add will be indexed and added to the existing collection.

Dynamic Collections and Automatic Indexing

Collections created via wb-manager init are fully dynamic, and new collections can be added without restarting pywb.

When adding WARCs with wb-manager add, the indexes are also updated automatically. No restart is required, and the content is instantly available for replay.

For more complex use cases, pywb also includes a background indexer that checks the archives directory and automatically updates the indexes if any files have changed or were added.

(Of course, indexing will take some time if adding a large amount of data all at once, but is quite useful for smaller archive updates).

To enable auto-indexing, run with wayback -a or wayback -a --auto-interval 30 to adjust the frequency of auto-indexing (default is 30 seconds).

Creating a Web Archive

Using Webrecorder

If you do not have a web archive to test, one easy way to create one is to use Webrecorder

After recording, you can click Stop and then click Download Collection to receive a WARC (.warc.gz) file.

You can then use this WARC with pywb.

Using pywb Recorder

The core recording functionality in Webrecorder is also part of pywb. If you want to create a WARC locally, this can be done by directly recording into your pywb collection:

  1. Create a collection: wb-manager init my-web-archive (if you haven’t already created a web archive collection)
  2. Run: wayback --record --live -a --auto-interval 10
  3. Point your browser to http://localhost:8080/my-web-archive/record/<url>

For example, to record http://example.com/, visit http://localhost:8080/my-web-archive/record/http://example.com/

In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded url will be accessible for replay, e.g. http://localhost:8080/my-web-archive/http://example.com/
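The two address patterns used above (one for recording, one for replay) can be sketched as follows. The helper names are hypothetical and exist only for this illustration; the host, port and collection name are the ones from the example, and nothing here calls into pywb itself.

```python
# Hypothetical helpers illustrating the URL patterns described above.
# BASE assumes wayback is running locally on its default port.
BASE = "http://localhost:8080"


def record_url(coll: str, url: str, base: str = BASE) -> str:
    """Address to visit in order to record `url` into collection `coll`."""
    return f"{base}/{coll}/record/{url}"


def replay_url(coll: str, url: str, base: str = BASE) -> str:
    """Address to visit in order to replay `url` from collection `coll`."""
    return f"{base}/{coll}/{url}"


print(record_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/"))
```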

HTTP/S Proxy Mode Access

It is also possible to access any pywb collection via HTTP/S proxy mode, providing possibly better replay without client-side url rewriting.

At this time, a single collection for proxy mode access can be specified with the --proxy flag.

For example, wayback --proxy my-web-archive will start pywb and enable proxy mode access.

You can then set the browser's proxy settings to host localhost and port 8080. Loading any url, e.g. http://example.com/, should then load the latest copy from the my-web-archive collection.

See HTTP/S Proxy Mode section for additional configuration details.

Deployment

For testing, development and small production loads, the default wayback command line may be sufficient. pywb uses the gevent coroutine library, and the default app will support many concurrent connections in a single process.

For larger scale production deployments, running with uwsgi server application is recommended. The uwsgi.ini script provided can be used to launch pywb with uwsgi. uwsgi can be scaled to multiple processes to support the necessary workload, and pywb must be run with the Gevent Loop Engine. Nginx or Apache can be used as an additional frontend for uwsgi.

It is recommended to install uwsgi and its dependencies in a Python virtual environment (virtualenv). Consult the uwsgi documentation for virtualenv support for details on how to specify the virtualenv to uwsgi.

Installation of uwsgi in a virtualenv will avoid known issues with installing uwsgi in some Debian-based OSes with Python 3.9+. As an example, in Ubuntu 22.04 with Python 3.10, it is recommended to install uwsgi like so:

sudo apt install -y python3-pip \
    python3-dev \
    build-essential \
    libssl-dev \
    libffi-dev \
    python3-setuptools \
    python3-venv
python3 -m venv pywbenv
source pywbenv/bin/activate
pip install wheel uwsgi pywb

Although uwsgi does not provide a way to pass pywb command-line options, all of these options can alternatively be configured via config.yaml. See Configuring the Web Archive for more info on available configuration options.

Docker Deployment

The default pywb Docker image uses the production ready uwsgi server by default.

The following will run pywb in Docker directly on port 80:

docker run -p 80:8080 -v /webarchive-data/:/webarchive webrecorder/pywb

To run pywb in Docker behind a local nginx (as shown below), port 8081 should also be mapped:

docker run -p 8081:8081 -v /webarchive-data/:/webarchive webrecorder/pywb

See Getting Started Using Docker for more info on using pywb with Docker.

Sample Nginx Configuration

The following nginx configuration snippet can be used to deploy pywb with uwsgi and nginx.

The configuration assumes pywb is running the uwsgi protocol on port 8081, as is the default when running uwsgi uwsgi.ini.

The location /static block allows nginx to serve static files, and is an optional optimization.

This configuration can be updated to use HTTPS and run on port 443. The UWSGI_SCHEME param ensures that pywb will use the correct scheme when rewriting.

See the Nginx Docs for a lot more details on how to configure Nginx.

server {
    listen 80;

    location /static {
        alias /path/to/pywb/static;
    }

    location / {
        uwsgi_pass localhost:8081;

        include uwsgi_params;
        uwsgi_param UWSGI_SCHEME $scheme;
    }
}

Sample Apache Configuration

The recommended Apache configuration is to use pywb with mod_proxy and mod_proxy_uwsgi.

To enable these, ensure that your httpd.conf includes:

LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_uwsgi_module modules/mod_proxy_uwsgi.so

Then, in your config, simply include:

<VirtualHost *:80>
  ProxyPass / uwsgi://pywb:8081/
</VirtualHost>

The configuration assumes uwsgi is started with uwsgi uwsgi.ini

Configuring Access Control Header

The Embargo and Access Control system allows users to be granted different access settings based on the value of an ACL header, X-pywb-ACL-user.

The header can be set via Nginx or Apache to grant custom access privileges based on IP address, password, or another combination of rules.

For example, to set the value of the header to staff if the IP of the request is from designated local IP ranges (127.0.0.1, 192.168.1.0/24), the following settings can be added to the configs:

For Nginx:

geo $acl_user {
  # ensure user is set to empty by default
  default           "";

  # optional: add IP ranges to allow privileged access
  127.0.0.1         "staff";
  192.168.1.0/24    "staff";
}

...
location /wayback/ {
  ...
  uwsgi_param HTTP_X_PYWB_ACL_USER $acl_user;
}

For Apache:

<If "-R '192.168.1.0/24' || -R '127.0.0.1'">
  RequestHeader set X-Pywb-ACL-User staff
</If>
# ensure header is cleared if no match
<Else>
  RequestHeader set X-Pywb-ACL-User ""
</Else>
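
The rule that both server configurations express can be written out as a short Python sketch, which may help when checking which client addresses would receive the staff value. This is only an illustration of the matching logic using the standard ipaddress module; the function name is hypothetical and pywb itself only sees the resulting header value.

```python
import ipaddress

# IP ranges from the example above; clients in these ranges get the "staff" value.
STAFF_NETWORKS = [
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def acl_user_for(remote_addr: str) -> str:
    """Return the value a frontend would place in the ACL header for this client."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in STAFF_NETWORKS):
        return "staff"
    # Everyone else gets an empty value, mirroring the "clear the header" default.
    return ""


print(repr(acl_user_for("192.168.1.42")))
print(repr(acl_user_for("203.0.113.7")))
```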


Running on Subdirectory Path

To run pywb on a subdirectory, rather than at the root of the web server, the recommended configuration is to adjust the uwsgi.ini to include the subdirectory. For example, to deploy pywb under the /wayback subdirectory, the uwsgi.ini can be configured as follows:

mount = /wayback=./pywb/apps/wayback.py
manage-script-name = true

Deployment Examples

The sample-deploy directory includes working Docker Compose examples for deploying pywb with Nginx and Apache on the /wayback subdirectory.

Configuring the Web Archive

pywb offers an extensible YAML based configuration format via a main config.yaml at the root of each web archive.

Framed vs Frameless Replay

pywb supports several modes for serving archived web content.

With framed replay, the archived content is loaded into an iframe, and a top frame UI provides info and metadata. In this mode, the top frame url is, for example, http://my-archive.example.com/<coll name>/http://example.com/, while the actual content is served at http://my-archive.example.com/<coll name>/mp_/http://example.com/

With frameless replay, the archived content is loaded directly. As of pywb 2.7, frameless replay is bannerless unless a custom banner is added via the custom_banner.html template.

Warning

pywb 2.7 introduces a breaking change around frameless replay and banners. Any custom banner intended to be used with frameless replay in pywb 2.7 and higher must be specified in the custom_banner.html template. This may require moving custom content from banner.html to the new custom_banner.html.

The default banner will no longer be served in frameless replay.

In this mode, the content is served directly at http://my-archive.example.com/<coll name>/http://example.com/
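
As a small illustration of the URL shapes described in this section, the following sketch builds the top frame and inner content addresses for framed replay. The function names are hypothetical; the only pywb-specific detail used is the mp_ path segment shown above. In frameless replay the content is served at the same address that the top frame uses in framed replay.

```python
def top_frame_url(base: str, coll: str, url: str) -> str:
    """Top frame address in framed replay (also the content address when frameless)."""
    return f"{base}/{coll}/{url}"


def content_frame_url(base: str, coll: str, url: str) -> str:
    """Address of the archived content loaded inside the iframe in framed replay."""
    return f"{base}/{coll}/mp_/{url}"


base = "http://my-archive.example.com"
print(top_frame_url(base, "my-coll", "http://example.com/"))
print(content_frame_url(base, "my-coll", "http://example.com/"))
```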

For security reasons, we recommend running pywb in framed mode, because a malicious site could tamper with the banner.

However, for certain situations, frameless replay may be appropriate.

To disable framed replay, add framed_replay: false to your config.yaml.

Note: pywb also supports HTTP/S proxy mode which requires additional setup. See HTTP/S Proxy Mode for more details.

Directory Structure

The pywb system is designed to automatically access and manage web archive collections that follow a defined directory structure. The directory structure can be fully customized and “special” collections can be defined outside the structure as well.

The default directory structure for a web archive is as follows:

+-- config.yaml (optional)
|
+-- templates (optional)
|
+-- static (optional)
|
+-- collections
    |
    +-- <coll name>
        |
        +-- archive
        |     |
        |     +-- (WARC or ARC files here)
        |
        +-- indexes
        |     |
        |     +-- (CDXJ index files here)
        |
        |
        +-- acl
        |     |
        |     +-- (.aclj access control files)
        |
        +-- templates
        |     |
        |     +-- (optional html templates here)
        |
        +-- static
              |
              +-- (optional custom static assets here)

If running with default settings, the config.yaml can be omitted.

It is possible to configure these directory paths in the config.yaml. The following are some of the implicit default settings which can be customized:

collections_root: collections
archive_paths: archive
index_paths: indexes

(For a complete list of defaults, see the pywb/default_config.yaml file for reference)
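
To make the relationship between these three settings and the directory tree concrete, here is a minimal sketch that derives the archive and index directories of a collection. It assumes each setting is a single relative path, as in the defaults above, and that archive_paths and index_paths are relative to the collection directory, as the tree suggests. The function is hypothetical and is not pywb's own path resolver.

```python
import posixpath

# Implicit defaults listed above.
DEFAULTS = {
    "collections_root": "collections",
    "archive_paths": "archive",
    "index_paths": "indexes",
}


def collection_dirs(coll, config=None):
    """Derive the archive and index directories for one collection."""
    settings = {**DEFAULTS, **(config or {})}
    coll_dir = posixpath.join(settings["collections_root"], coll)
    return {
        "archive": posixpath.join(coll_dir, settings["archive_paths"]),
        "indexes": posixpath.join(coll_dir, settings["index_paths"]),
    }


print(collection_dirs("my-web-archive"))
print(collection_dirs("my-web-archive", {"collections_root": "colls"}))
```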

Index Paths

The index_paths key defines the subdirectory for index files (usually CDXJ) and determines the contents of each archive collection.

The index files usually contain a pointer to a WARC file, but not the absolute path.

Archive Paths

The archive_paths key indicates how pywb will resolve WARC files listed in the index.

For example, it is possible to configure multiple archive paths:

archive_paths:
  - archive
  - http://remote-backup.example.com/collections/

When resolving an example.warc.gz, pywb will then check (in order):

  • First, collections/<coll name>/archive/example.warc.gz
  • Then, http://remote-backup.example.com/collections/<coll name>/example.warc.gz (if first lookup unsuccessful)

Access Controls

Starting with pywb 2.4, pywb includes an extensible Embargo and Access Control system. By default, the access control files are stored in the acl directory of each collection.

UI Customizations

The templates directory supports custom Jinja templates to allow customizing the UI. See Customization Guide for more details on available options.

Special and Custom Collections

While pywb can automatically detect collections following the above directory structure, it also provides the option to fully declare Custom User-Defined Collections explicitly.

In addition, several “special” collection definitions are possible.

All custom-defined collections are placed under the collections key in config.yaml.

Live Web Collection

The live web collection proxies all data to the live web, and can be defined as follows:

collections:
  live: $live

This configures the /live/ route to point to the live web.

(As a shortcut, wayback --live adds this collection via the command line without modifying the config.yaml)

This collection can be useful for testing or, even more powerfully, when combined with recording.

SOCKS Proxy for Live Web

pywb can be configured to use a SOCKS5 proxy when connecting to the live web. This allows pywb to be used with Tor and other services that require a SOCKS proxy.

If the SOCKS_HOST and optionally SOCKS_PORT environment variables are set, pywb will attempt to route all live web traffic through the SOCKS5 proxy. Note that, at this time, it is not possible to configure a SOCKS proxy per pywb collection – all live web traffic will use the SOCKS proxy if enabled.

Auto “All” Aggregate Collection

The aggregate all collection automatically aggregates data from all collections in the collections directory:

collections:
  all: $all

Accessing /all/<url> will cause an aggregate lookup within the collections directory.

Note: It is not (yet) possible to exclude collections from the auto-all collection, although “special” collections are not included.

Collection Provenance

When using the auto-all collection, it is possible to determine the original collection of each resource by looking at the Link header metadata if Memento API is enabled. The header will include the extra collection field, specifying the collection:

Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"

For example, if two collections coll-1 and coll-2 contain http://example.com/, loading the timemap for /all/timemap/link/http://example.com/ might look as follows:

<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
<http://example.com/>; rel="original",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
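
A client that wants to read the collection field out of such a response could parse it along the following lines. This is a minimal sketch using regular expressions, not a full link-format parser: it assumes every parameter value is double-quoted, as in the sample above, and the function names are hypothetical.

```python
import re

# One entry: <target> followed by zero or more ; name="value" parameters.
ENTRY_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

SAMPLE = '''<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
<http://example.com/>; rel="original",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
'''


def parse_links(text):
    """Split a link-format body into dicts holding the target url and its parameters."""
    entries = []
    for target, params in ENTRY_RE.findall(text):
        entry = dict(PARAM_RE.findall(params))
        entry["url"] = target
        entries.append(entry)
    return entries


def memento_collections(text):
    """Return (datetime, collection) pairs for every memento that names a collection."""
    return [
        (entry.get("datetime"), entry["collection"])
        for entry in parse_links(text)
        if entry.get("rel") == "memento" and "collection" in entry
    ]


for when, coll in memento_collections(SAMPLE):
    print(when, "->", coll)
```

Matching whole entries with a pattern, rather than splitting on commas, matters here because the datetime values themselves contain commas.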

Remote Memento Collection

It’s also possible to define remote archives as easily as local collections. For example, the following defines a collection /ia/ which accesses Internet Archive’s Wayback Machine as a single collection:

collections:
  ia: memento+https://web.archive.org/web/

Many additional options, including memento “aggregation” and fallback chains, are possible using the Warcserver configuration syntax. See Warcserver Index Configuration for more info.

Custom User-Defined Collections

The collection definition syntax allows for explicitly setting the index, archive paths and all other templates, per collection, for example:

collections:
  custom:
     index: ./path/to/indexes
     resource: ./some/other/path/to/archive/
     query_html: ./path/to/templates/query.html

If possible, it is recommended to use the default directory structure to avoid per-collection configuration. However, this configuration allows for using pywb with existing collections that have unique path requirements.

Root Collection

It is also possible to define a “root” collection, for example, accessible at http://my-archive.example.com/<url>. Such a collection must be defined explicitly using $root as the collection name:

collections:
  $root:
     index: ./path/to/indexes
     resource: ./path/to/archive/

Note: When a root collection is set, all other collections are ignored and are currently not accessible.

Recording Mode

Recording mode enables pywb to support recording into any automatically managed collection, using the /<coll>/record/<url> path. Accessing this path will result in pywb writing new WARCs directly into the collection <coll>.

To enable recording from the live web, simply run wayback --record.

To further customize recording mode, add the recorder block to the root of config.yaml.

The command-line option is equivalent to adding recorder: live.

The full set of configurable options (with their default settings) is as follows:

recorder:
   source_coll: live
   rollover_size: 100000000
   rollover_idle_secs: 600
   filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz
   source_filter: live
   enable_put_custom_record: false

The required source_coll setting specifies the source collection from which to load content that will be recorded. Most likely this will be the Live Web Collection, which should also be defined. However, it could be any other collection, allowing for “extraction” from other collections or remote web archives. Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.

The other options are optional and may be omitted. The rollover_size and rollover_idle_secs options specify the maximum size and maximum idle time, respectively, after which a new WARC file is created. For example, a new WARC will be created if more than 100MB are recorded, or after 600 seconds have elapsed between subsequent requests. This keeps the WARC size manageable and prevents files from being left open for long periods of time.
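
The rollover rule can be stated as a short predicate. This is only a sketch of the rule as described in the text, with the default values from the config above; the function name is hypothetical and the behavior exactly at the limits is an assumption.

```python
def needs_rollover(current_size, idle_secs,
                   rollover_size=100_000_000, rollover_idle_secs=600):
    """Return True when a new WARC file should be started.

    current_size: bytes written to the current WARC so far
    idle_secs:    seconds since the previous recorded request
    """
    # Strict comparisons are an assumption; the text does not define the boundary case.
    return current_size > rollover_size or idle_secs > rollover_idle_secs


print(needs_rollover(current_size=50_000_000, idle_secs=30))
print(needs_rollover(current_size=150_000_000, idle_secs=30))
print(needs_rollover(current_size=50_000_000, idle_secs=900))
```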

The filename_template option specifies the naming convention for WARC files, and allows a timestamp, current hostname, and random string to be inserted into the filename.

When using an aggregate collection or sequential fallback collection as the source, recording can be limited to pages fetched from a certain child collection by specifying source_filter as a regex matching the name of the sub-collection.

For example, if recording with the above config into a collection called my-coll, the user would access:

http://my-archive.example.com/my-coll/record/http://example.com/, which would load http://example.com/ from the live web and write the request and response to a WARC named something like:

./collections/my-coll/archive/my-warc-20170102030000000000-archive.example.com-QRTGER.warc.gz
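
The way the template placeholders expand into such a name can be reproduced with Python's str.format. This is a sketch rather than pywb's implementation: the 20-digit timestamp layout (date, time and microseconds) is inferred from the sample filename above, and the hostname and random string are simply passed in.

```python
from datetime import datetime

TEMPLATE = "my-warc-{timestamp}-{hostname}-{random}.warc.gz"


def warc_filename(template, when, hostname, rand):
    """Fill the three placeholders of a filename template."""
    return template.format(
        # Assumed layout: YYYYMMDDhhmmss followed by 6 microsecond digits.
        timestamp=when.strftime("%Y%m%d%H%M%S%f"),
        hostname=hostname,
        random=rand,
    )


print(warc_filename(TEMPLATE, datetime(2017, 1, 2, 3, 0, 0),
                    "archive.example.com", "QRTGER"))
```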

If running with auto indexing, the WARC will also get automatically indexed and available for replay after the index interval.

As a shortcut, recorder: live can also be used to specify only the source_coll option.

Dedup Options for Recording

By default, recording mode will record every URL.

Starting with pywb 2.5.0, it is possible to configure pywb to either write revisit records or skip duplicate URLs altogether using the dedup_policy key.

Using deduplication requires a Redis instance, which will keep track of the index for deduplication in a sorted-set key. The default Redis key used is redis://localhost:6379/0/pywb:{coll}:cdxj where {coll} is replaced with current collection id.

The Redis key can be customized using the dedup_index_url field in the recorder config. The URL must start with redis://, as that is the only supported dedup index at this time.

  • To skip duplicate URLs, set dedup_policy: skip. With this setting, only one instance of any URL will be recorded.
  • To write revisit records, set dedup_policy: revisit. With this setting, WARC revisit records will be written when a duplicate URL is detected and has the same digest as a previous response.
  • To keep all duplicates, use dedup_policy: keep. All WARC records are written to disk normally as with no policy; however, the Redis dedup index is still populated, which allows for instant replay (see below).
  • To disable the dedup system, set dedup_policy: none or omit the field. This is the default, and no Redis is required.
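
The four policies can be summarized as a small decision function. This is a sketch of the behavior as described in the list above, not pywb's recorder code; the function name, its arguments and the returned labels are hypothetical.

```python
def dedup_action(policy, url_seen, digest_matches):
    """Decide what to write for a newly fetched response.

    policy:         "skip", "revisit", "keep", "none" or None (field omitted)
    url_seen:       True if the dedup index already holds this URL
    digest_matches: True if the payload digest equals a previous response
    Returns "response" (full record), "revisit" (revisit record) or "skip".
    """
    if policy not in (None, "none", "keep", "skip", "revisit"):
        raise ValueError(f"unknown dedup_policy: {policy!r}")
    if policy in (None, "none", "keep") or not url_seen:
        # No dedup, keep-all, or first capture of this URL: write normally.
        return "response"
    if policy == "skip":
        return "skip"
    # policy == "revisit": only identical payloads become revisit records.
    return "revisit" if digest_matches else "response"


print(dedup_action("skip", url_seen=True, digest_matches=False))
print(dedup_action("revisit", url_seen=True, digest_matches=True))
print(dedup_action("keep", url_seen=True, digest_matches=True))
```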

As another option, pywb can add an aggressive Cache-Control header to force the browser to cache all responses on a page. This feature is still experimental, but can be enabled via the cache: always setting.

For example, the following will enable revisit records to be written using the given Redis URL, and also enable aggressive caching when recording:

recorder:
   ...
   cache: always
   dedup_policy: revisit
   dedup_index_url: 'redis://localhost:6379/0/pywb:{coll}:cdxj'   # default when omitted

Instant Replay (experimental)

Starting with pywb 2.5.0, when the dedup_policy is set, pywb can do ‘instant replay’ after recording, without having to regenerate the CDX or waiting for it to be updated with auto-indexing.

When any dedup_policy is set, pywb can also access the dedup Redis index, along with any on-disk CDX, when replaying the collection.

This feature is still experimental but should generally work. Additional options for working with the Redis Dedup index will be added in the future.

Adding Custom Resource Records

pywb now also supports adding custom data to a WARC resource record. This can be used to add custom resources, such as screenshots, logs, error messages, etc., that are not normally captured as part of recording but are still useful to store in WARCs.

To add a custom resource, simply call PUT /<coll>/record with the data to be added as the request body and the type of the data specified as the content-type. The url can be specified as a query param.

For example, adding a custom record file:///my-custom-resource containing Some Custom Data can be done using curl as follows:

curl -XPUT "localhost:8080/my-web-archive/record?url=file:///my-custom-resource" --data "Some Custom Data"

This feature is only available if enable_put_custom_record: true is set in the recorder config.
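
The same request can be prepared from Python with the standard library. The sketch below only builds the request object; the helper name is hypothetical, and the text/plain content type is an assumption chosen for this example (the record URL is percent-encoded in the query string).

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_put_request(base, coll, record_url, data, content_type="text/plain"):
    """Prepare a PUT /<coll>/record request carrying a custom resource."""
    query = urlencode({"url": record_url})
    return Request(
        f"{base}/{coll}/record?{query}",
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


req = build_put_request(
    "http://localhost:8080", "my-web-archive",
    "file:///my-custom-resource", b"Some Custom Data",
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it, which requires a running pywb instance with recording enabled.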

Auto-Fetch Responsive Recording

When recording (or browsing the ‘live’ collection), pywb has an option to inspect and automatically fetch additional resources, including:

  • Any urls found in <img srcset="..."> attributes.
  • Any urls within CSS @media rules.

This allows pywb to better capture responsive pages, where all the resources are not directly loaded by the browser, but may be needed for future replay.

The detected urls are loaded in the background using a web worker while the user is browsing the page.

To enable this functionality, add --enable-auto-fetch to the command-line or enable_auto_fetch: true to the root of the config.yaml

The auto-fetch system is provided as part of the Client-Side Rewriting System (wombat.js)

Auto-Indexing Mode

If auto-indexing is enabled, pywb will update the indexes stored in the indexes directory whenever files are added or modified in the archive directory. Auto-indexing can be enabled via the autoindex option set to the check interval in seconds:

autoindex: 30

This specifies that the archive directories should be checked every 30 seconds. Auto-indexing is useful when WARCs are being appended to or added to the archive by an external operation.
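
The general idea of interval-based change detection can be illustrated with a small polling helper that compares file modification times between scans. This is a generic sketch of the technique under that assumption, not pywb's indexer; the function name and the shape of the seen dictionary are hypothetical.

```python
import os
import tempfile


def changed_archives(archive_dir, seen):
    """Return names of files that are new or modified since the last scan.

    `seen` maps file name to the last observed mtime (ns) and is updated in place.
    """
    changed = []
    for name in sorted(os.listdir(archive_dir)):
        path = os.path.join(archive_dir, name)
        if not os.path.isfile(path):
            continue
        mtime = os.stat(path).st_mtime_ns
        if seen.get(name) != mtime:
            seen[name] = mtime
            changed.append(name)
    return changed


with tempfile.TemporaryDirectory() as archive_dir:
    seen = {}
    open(os.path.join(archive_dir, "a.warc.gz"), "wb").close()
    print(changed_archives(archive_dir, seen))  # first scan reports the new file
    print(changed_archives(archive_dir, seen))  # second scan finds nothing new
```

A scan like this would be repeated once per interval, and only the files it returns would need to be indexed again.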

If a user is manually adding a new WARC to the collection, wb-manager add <coll> <path/to/warc> is recommended, as this will add the WARC and perform a one-time reindex of the collection, without the need for auto-indexing.

Note: Auto-indexing does not support deletion or removal of WARCs from the archive directory.

This is not a common operation for web archives. To remove a WARC, delete it manually from the collections/<coll>/archive/ directory and then regenerate the collection index from the remaining WARCs by running wb-manager reindex <coll>

The auto-indexing mode can also be enabled via command-line by running wayback -a or wayback -a --auto-interval 30 to also set the interval.

(If running pywb with uWSGI in multi-process mode, the auto-indexing is only run in a single worker to avoid race conditions and duplicate indexing)

Client-Side Rewriting System (wombat.js)

In addition to server-side rewriting, pywb includes a JavaScript client-side rewriting system.

This system intercepts network traffic and emulates the correct JS environment expected by a replayed page.

The auto-fetch system is also implemented as part of wombat.

Wombat was integrated into pywb up to 2.2.x. Starting with 2.3, wombat has been spun off into its own standalone JS module.

For more information on wombat.js and client-side rewriting, see the wombat README

HTTP/S Proxy Mode

In addition to “url rewriting prefix mode” (the default), pywb can also act as a full-fledged HTTP and HTTPS proxy, allowing any browser or client supporting HTTP and HTTPS proxy to access web archives through the proxy.

Proxy mode can provide access to a single collection at a time, e.g. instead of accessing http://localhost:8080/my-coll/2017/http://example.com/, the user enters http://example.com/ and is served content from the my-coll collection. As a result, the collection and timestamp must be specified separately.

Configuring HTTP Proxy

At this time, pywb requires the collection to be configured at setup time (though collection switching will be added soon).

To enable proxy mode, the collection can be specified by running: wayback --proxy my-coll or by adding to the config:

proxy:
  coll: my-coll

For HTTP proxy access, this is all that is needed to use the proxy. If pywb is running on port 8080 on localhost, the following curl command should provide proxy access: curl -x "localhost:8080"  http://example.com/
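
The same access is possible from Python using the standard library's proxy support. The sketch below assumes pywb is running in proxy mode on localhost:8080, as in the curl example; the function names are hypothetical and nothing is fetched until fetch_via_proxy is called.

```python
import urllib.request

# Assumes pywb was started with a proxy collection and listens on this address.
PROXY = "http://localhost:8080"


def make_proxy_opener(proxy=PROXY):
    """Build an opener that sends plain HTTP requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy=PROXY):
    """Fetch `url` through the proxy and return the response body as bytes."""
    with make_proxy_opener(proxy).open(url, timeout=30) as resp:
        return resp.read()
```

With pywb running, fetch_via_proxy("http://example.com/") would return the archived copy served by the proxy collection.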

Default Timestamp

The timestamp can also be optionally specified by running: wayback --proxy my-coll --proxy-default-timestamp 20181226010203 or by specifying the config:

proxy:
  coll: my-coll
  default_timestamp: "20181226010203"

The ISO date format, eg. 2018-12-26T01:02:03 is also accepted.

If the timestamp is omitted, proxy mode replay defaults to the latest capture.
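
The two accepted notations describe the same moment. As an illustration, the following sketch converts the ISO form into the 14-digit form with the standard datetime module; the helper is hypothetical and is not part of pywb.

```python
from datetime import datetime


def to_timestamp(value: str) -> str:
    """Normalize an ISO date such as 2018-12-26T01:02:03 to a 14-digit timestamp."""
    if value.isdigit():
        # Already in the digits-only form; return it unchanged.
        return value
    return datetime.fromisoformat(value).strftime("%Y%m%d%H%M%S")


print(to_timestamp("2018-12-26T01:02:03"))
```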

The timestamp can also be dynamically overridden per-request using the Proxy Mode Memento API.

Proxy Mode Rewriting

By default, pywb performs minimal html rewriting to insert a default banner into the proxy mode replay to make it clear to users that they are viewing replayed content.

Custom rewriting code from the head_insert.html template may also be inserted into <head>.

Wrapping content in a {% if env.pywb_proxy_magic %} check allows custom content to be inserted for proxy mode only.

However, content rewriting in proxy mode is not necessary and can be disabled completely by customizing the proxy block in the config.

This may be essential when proxying content to older browsers for instance.

  • To disable all content rewriting/modifications from pywb via the head_insert.html template, add enable_content_rewrite: false

    If set to false, this setting overrides and disables all the other options.

  • To disable just the banner, add enable_banner: false

  • To add a light version of rewriting (for overriding Date, random number generators), add enable_wombat: true

If Auto-Fetch Responsive Recording is enabled in the global config, enable_wombat: true is implied, unless enable_content_rewrite: false is also set (as it will disable the auto-fetch system from being injected into the page).

If omitted, the defaults for these options are:

proxy:
  enable_banner: true
  enable_wombat: false
  enable_content_rewrite: true

For example, to enable wombat rewriting but disable the banner, use the config:

proxy:
  enable_banner: false
  enable_wombat: true

To disable all content rewriting:

proxy:
  enable_content_rewrite: false
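The interaction between these options can be summarized in a short sketch. This is one reading of the rules described in this section, not pywb's actual implementation; the function name effective_proxy_options and the auto_fetch_enabled argument are hypothetical.

```python
# Defaults as documented for the proxy block.
DEFAULTS = {
    "enable_banner": True,
    "enable_wombat": False,
    "enable_content_rewrite": True,
}


def effective_proxy_options(proxy_config, auto_fetch_enabled=False):
    """Apply the defaults, then the override rules, to a proxy config dict."""
    options = dict(DEFAULTS)
    options.update({k: v for k, v in proxy_config.items() if k in DEFAULTS})
    if not options["enable_content_rewrite"]:
        # Disabling content rewriting overrides all the other options.
        options["enable_banner"] = False
        options["enable_wombat"] = False
    elif auto_fetch_enabled:
        # Auto-fetch implies wombat, unless rewriting is disabled entirely.
        options["enable_wombat"] = True
    return options
```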

Proxy Recording

The proxy can additionally be set to recording mode, equivalent to accessing the /<my-coll>/record/ path, by adding recording: true, as follows:

proxy:
  coll: my-coll
  recording: true

By default, proxy recording will use the live collection if not otherwise configured.

See Recording Mode for the full set of configurable recording options.

HTTPS Proxy and pywb Certificate Authority

For HTTPS proxy access, pywb provides its own Certificate Authority and dynamically generates certificates for each host and signs the responses with these certificates. By design, this allows pywb to act as “man-in-the-middle” serving archived copies of a given site.

However, the pywb Certificate Authority (CA) certificate will need to be accepted by the browser. The CA cert can be downloaded from pywb directly using the special download paths. The recommended setup for using the proxy is as follows:

  1. Start pywb with proxy mode enabled (with --proxy option or with a proxy: option block present in the config).

    (The CA root certificate will be auto-created when first starting pywb with proxy mode if it doesn’t exist.)

  2. Configure the browser's proxy host and port settings, for example localhost and 8080 (if running locally).

  3. Download the CA:

    • For most browsers, use the PEM format: http://wsgiprox/download/pem
    • For Windows, use the PKCS12 format: http://wsgiprox/download/p12

  4. You may need to agree to “Trust this CA” to identify websites.

The auto-generated pywb CA, created at ./proxy-certs/pywb-ca.pem may also be added to a keystore directly.

The location of the CA file and the CA name displayed can be changed by setting the ca_file_cache and ca_name proxy options, respectively.

The following are all the available proxy options – only coll is required:

proxy:
  coll: my-coll
  ca_name: pywb HTTPS Proxy CA
  ca_file_cache: ./proxy-certs/pywb-ca.pem
  recording: false
  enable_banner: true
  enable_content_rewrite: true
  default_timestamp: ''

The HTTP/S functionality is provided by the separate wsgiprox utility which provides HTTP/S proxy routing to any WSGI application.

Using wsgiprox, pywb sets FrontEndApp.proxy_route_request() as the proxy resolver, and this function returns the full collection path that pywb uses to route each proxy request. The default implementation returns a path to the fixed collection coll and injects content into <head> if enable_content_rewrite is true. The default banner is inserted if enable_banner is set to true.

Extensions to pywb can override proxy_route_request() to provide custom handling, such as setting the collection dynamically or based on external data sources.
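As a rough illustration of dynamic collection routing, the sketch below picks a collection from the requested host and returns a collection path of the same general shape as the rewriting-mode URLs shown earlier (/my-coll/2017/http://example.com/). It is a standalone function, not pywb's proxy_route_request(): the name route_proxy_request, its arguments and the HOST_TO_COLLECTION mapping are all hypothetical, and the path returned by the real resolver may differ.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from archived host to collection name.
HOST_TO_COLLECTION = {"example.com": "my-coll"}


def route_proxy_request(url, default_coll="my-coll", timestamp=""):
    """Return a collection path for a proxied URL, chosen by host."""
    host = urlsplit(url).hostname or ""
    parts = [HOST_TO_COLLECTION.get(host, default_coll)]
    if timestamp:
        parts.append(timestamp)
    return "/" + "/".join(parts) + "/" + url
```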

See the wsgiprox README for additional details on setting a proxy resolver.

For more information on custom certificate authority (CA) installation, the mitmproxy certificate page provides a good overview for installing a custom CA on different platforms.

Compatibility: Redirects, Memento, Flash video overrides

Exact Timestamp Redirects

By default, pywb does not redirect urls to the ‘canonical’ representation of a url with the exact timestamp.

For example, when requesting /my-coll/2017js_/http://example.com/example.js but the actual timestamp of the resource is 20170102030004, there is no redirect to /my-coll/20170102030004js_/http://example.com/example.js.

Instead, this ‘canonical’ url is returned with the response in the Content-Location header. (This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.)

However, if the classic redirect behavior is desired, it can be enabled by adding:

redirect_to_exact: true

to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other “wayback machine” implementations.

Memento Protocol

Memento API support is enabled by default, and works with no-timestamp-redirect and classic redirect behaviors.

However, Memento API support can be disabled by adding:

enable_memento: false

Flash Video Override

A custom system to override Flash video with a custom download via youtube-dl and replay with a custom player was enabled in previous versions of pywb. However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based. The system is seldom used now that most video is HTML5 based.

For these reasons, this functionality, previously enabled by including the script /static/vidrw.js, is disabled by default.

To enable the previous behavior, add to config:

enable_flash_video_rewrite: true

The system may be revamped in the future and enabled by default, but for now, it is provided “as-is” for compatibility reasons.

Verify SSL-Certificates

By default, SSL-Certificates of websites are not verified. To enable verification, add the following to the config:

certificates:
  cert_reqs: 'CERT_REQUIRED'
  ca_cert_dir: '/etc/ssl/certs'

ca_cert_dir can optionally point to a directory containing the CA certificates that you trust. Most Linux distributions provide CA certificates via a package called ca-certificates. If omitted, the default system CA used by Python is used.

Embargo and Access Control

The embargo system allows for date-based rules to block access to captures based on their capture dates.

The access controls system provides additional URL-based rules to allow, block or exclude access to specific URL prefixes or exact URLs.

The embargo and access control rules are configured per collection.

Embargo Settings

The embargo system allows restricting access to all URLs within a collection based on the timestamp of each URL. Access to these resources is ‘embargoed’ until the date range is adjusted or the time interval passes.

The embargo can be used to disallow access to captures based on the following criteria:

  • Captures before an exact date
  • Captures after an exact date
  • Captures newer than a time interval
  • Captures older than a time interval

Embargo Before/After Exact Date

To block access to all captures before or after a specific date, use the before or after embargo blocks with a specific timestamp.

For example, the following blocks access to all URLs captured before 2020-12-26 in the collection embargo-before:

embargo-before:
    index_paths: ...
    archive_paths: ...
    embargo:
        before: '20201226'

The following blocks access to all URLs captured on or after 2020-12-26 in collection embargo-after:

embargo-after:
    index_paths: ...
    archive_paths: ...
    embargo:
        after: '20201226'

Embargo By Time Interval

The embargo can also be set for a relative time interval, consisting of years, months, weeks and/or days.

For example, the following blocks access to all URLs newer than 1 year:

embargo-newer:
    ...
    embargo:
        newer:
          years: 1

The following blocks access to all URLs older than 1 year, 2 months, 3 weeks and 4 days:

embargo-older:
    ...
    embargo:
        older:
          years: 1
          months: 2
          weeks: 3
          days: 4

Any combination of years, months, weeks and days can be used (as long as at least one is provided) for the newer or older embargo settings.
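The four embargo types can be sketched as a single check. This is an illustration of the rules as described above, not pywb's implementation: the helper names are hypothetical, partial timestamps are padded to 14 digits, and the exact boundary behavior (strict or inclusive comparison) is an assumption except for after, which is described above as "on or after".

```python
import calendar
from datetime import datetime, timedelta


def parse_ts(ts):
    """Pad a partial timestamp such as '20201226' to 14 digits and parse it."""
    padded = ts + "00000101000000"[len(ts):]
    return datetime.strptime(padded, "%Y%m%d%H%M%S")


def shift_back(moment, years=0, months=0, weeks=0, days=0):
    """Return the moment shifted back by the given interval."""
    total_months = moment.year * 12 + (moment.month - 1) - (years * 12 + months)
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    # Clamp the day so that e.g. 31 March minus one month stays valid.
    day = min(moment.day, calendar.monthrange(year, month)[1])
    shifted = moment.replace(year=year, month=month, day=day)
    return shifted - timedelta(weeks=weeks, days=days)


def is_embargoed(capture_ts, embargo, now):
    """Return True if a capture falls under the given embargo settings."""
    captured = parse_ts(capture_ts)
    if "before" in embargo and captured < parse_ts(embargo["before"]):
        return True
    if "after" in embargo and captured >= parse_ts(embargo["after"]):
        return True
    if "newer" in embargo and captured > shift_back(now, **embargo["newer"]):
        return True
    if "older" in embargo and captured < shift_back(now, **embargo["older"]):
        return True
    return False
```

For example, with now set to 2021-06-01, a capture from 2021-01-01 is embargoed under newer: {years: 1}, while a capture from 2019-01-01 is not.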

Access Control Settings

Access Control Files (.aclj)

URL-based access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order. To determine the best match, a binary search is used (similar to CDXJ lookup) and then the best match is found forward.

An .aclj file may look as follows:

org,httpbin)/anything/something - {"access": "allow", "url": "http://httpbin.org/anything/something"}
org,httpbin)/anything - {"access": "exclude", "url": "http://httpbin.org/anything"}
org,httpbin)/ - {"access": "block", "url": "httpbin.org/"}
com, - {"access": "allow", "url": "com,"}

Each JSON entry contains an access field and the original url field that was used to convert to the SURT (if any).

The JSON entry may also contain a user field, as explained below.

The prefix consists of a SURT key and a - (currently reserved for a timestamp/date range field to be added later).

Given these rules, a user would:

  • be allowed to visit http://httpbin.org/anything/something (allow)
  • but would receive an ‘access blocked’ error message when viewing http://httpbin.org/ (block)
  • would receive a 404 not found error when viewing http://httpbin.org/anything (exclude)
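The lookup for the example rules above can be sketched as follows. This is a simplified illustration, not pywb's implementation: it uses a linear scan instead of a binary search, and simple_surt is a hypothetical, much-reduced SURT conversion that only reverses the host and appends the path. Because the file is sorted in reverse alphabetical order, the first rule whose key is a prefix of the SURT is also the most specific one.

```python
import json
from urllib.parse import urlsplit

ACLJ_TEXT = """\
org,httpbin)/anything/something - {"access": "allow", "url": "http://httpbin.org/anything/something"}
org,httpbin)/anything - {"access": "exclude", "url": "http://httpbin.org/anything"}
org,httpbin)/ - {"access": "block", "url": "httpbin.org/"}
com, - {"access": "allow", "url": "com,"}
"""


def parse_aclj(text):
    """Parse .aclj lines into (surt_key, json_entry) pairs, keeping file order."""
    rules = []
    for line in text.splitlines():
        if line.strip():
            key, _reserved, payload = line.split(" ", 2)
            rules.append((key, json.loads(payload)))
    return rules


def simple_surt(url):
    """Simplified SURT: reversed, comma-joined host followed by ')' and path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")


def find_access(url, rules, default="allow"):
    """Return the access type of the first (most specific) matching prefix rule."""
    surt = simple_surt(url)
    for key, entry in rules:
        if surt.startswith(key):
            return entry["access"]
    return default
```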

Access Types: allow, block, exclude, allow_ignore_embargo

The available access types are as follows:

  • exclude - when matched, results are excluded from the index, as if they do not exist. User will receive a 404.
  • block - when matched, results are not excluded from the index, but access to the actual content is blocked. User will see a 451.
  • allow - full access to the index and the resource, but may be overridden by embargo.
  • allow_ignore_embargo - full access to the index and resource, overriding any embargo settings.

The difference between exclude and block is that when blocked, the user can be notified that access is blocked, while with exclude, no trace of the resource is presented to the user.

The use of allow is useful to provide access to more specific resources within a broader block/exclude rule, while allow_ignore_embargo can be used to override any embargo settings.

If both embargo settings and access control rules are present, the embargo restrictions are checked first and take precedence, unless the allow_ignore_embargo option is used to override the embargo.

User-Based Access Controls

The access control rules can further be customized by specifying different permissions for different ‘users’. Since pywb does not have a user system, a special header, X-Pywb-ACL-User, can be used to indicate a specific user.

This setting is designed to allow a more privileged user to access additional content or override an embargo.

For example, the following access control settings restrict access to https://example.com/restricted/ by default, but allow access for the staff user:

com,example)/restricted - {"access": "allow", "user": "staff"}
com,example)/restricted - {"access": "block"}

Combined with the embargo settings, this can also be used to override the embargo for internal organizational users, while keeping the embargo for general access:

com,example)/restricted - {"access": "allow_ignore_embargo", "user": "staff"}
com,example)/restricted - {"access": "allow"}

To make this work, pywb must be running behind an Apache or Nginx system that is configured to set X-Pywb-ACL-User: staff based on certain settings.

For example, this header may be set based on IP range, or based on password authentication.
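The way a user-specific rule takes priority over the general rule for the same SURT key can be sketched as follows. This is an illustration of the behavior described above, not pywb's implementation; the function names are hypothetical. The only fixed convention used is the WSGI one, in which the X-Pywb-ACL-User request header appears in the environ as HTTP_X_PYWB_ACL_USER.

```python
def user_from_environ(environ):
    """Read the ACL user from a WSGI environ, or None if the header is absent."""
    return environ.get("HTTP_X_PYWB_ACL_USER")


def pick_access(matching_entries, user=None, default="allow"):
    """Choose among JSON entries that share the best-matching SURT key."""
    fallback = None
    for entry in matching_entries:
        rule_user = entry.get("user")
        if rule_user is not None and rule_user == user:
            return entry["access"]  # a rule for this specific user wins
        if rule_user is None and fallback is None:
            fallback = entry["access"]  # general rule, used if no user rule matches
    return fallback if fallback is not None else default
```

With the first pair of rules above, a request carrying X-Pywb-ACL-User: staff resolves to allow, while any other request resolves to block.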

Further examples of how to set this header will be provided in the deployments section.

Note: Do not use the user-based rules without configuring proper authentication on an Apache or Nginx frontend to set or remove this header, otherwise the ‘X-Pywb-ACL-User’ can easily be faked.

See the Configuring Access Control Header section in Usage for examples on how to configure this header.

Access Error Messages

The special error code 451 is used to indicate that a resource has been blocked (access setting block).

The error.html template contains a special message for this access and can be customized further.

By design, resources that are exclude-ed simply appear as 404 not found and no special error is provided.

Managing Access Lists via Command-Line

The .aclj files need not ever be added or edited manually.

The pywb wb-manager utility has been extended to provide tools for adding, removing and checking access control rules.

For automatic collections, the access rules for a given collection <collection> are written to <collection>/acl/access-rules.aclj.

For example, to add an exclude rule for http://httpbin.org/anything/something to a collection's access rules, one could run:

wb-manager acl add <collection> http://httpbin.org/anything/something exclude

The URL supplied can be a URL or a SURT prefix. If a SURT is supplied, it is used as is:

wb-manager acl add <collection> com, allow

A specific user for user-based rules can also be specified, for example to add allow_ignore_embargo for user staff only, run:

wb-manager acl add <collection> http://httpbin.org/anything/something allow_ignore_embargo -u staff

By default, access control rules apply to a prefix of a given URL or SURT.

To have the rule apply only to the exact match, use:

wb-manager acl add <collection> http://httpbin.org/anything/something allow --exact-match

Rules added with and without the --exact-match flag are considered distinct rules, and can be added and removed separately.

With the above rules, http://httpbin.org/anything/something would be allowed, but http://httpbin.org/anything/something/subpath would be excluded for any subpath.

To remove a rule, one can run:

wb-manager acl remove <collection> http://httpbin.org/anything/something

To import rules in bulk, such as from an OpenWayback-style excludes.txt and mark them as exclude:

wb-manager acl importtxt <collection> ./excludes.txt exclude

See wb-manager acl -h for a list of additional commands such as for validating rules files and running a match against an existing rule set.

Access Controls for Custom Collections

For manually configured collections, there are additional options for configuring access controls. The access control files can be specified explicitly using the acl_paths key, which allows multiple ACL files to be specified and allows access control files to be shared between collections.

Single ACLJ:

collections:
     test:
          acl_paths: ./path/to/file.aclj
          default_access: block

Multiple ACLJ:

collections:
     test:
          acl_paths:
               - ./path/to/allows.aclj
               - ./path/to/blocks.aclj
               - ./path/to/other.aclj
               - ./path/to/directory

          default_access: block

The acl_paths can be a single entry or a list, and can also include directories. If a directory is specified, all .aclj files in the directory are checked.

When finding the best rule from multiple .aclj files, each file is binary searched and the result set merge-sorted to find the best match (very similar to the CDXJ index lookup).

Note: It might make sense to separate allows.aclj and blocks.aclj into individual files for organizational reasons, but there is no specific need to keep more than one access control file.

Finally, ACLJ and embargo settings combined for the same collection might look as follows:

collections:
     test:
          ...
          embargo:
              newer:
                  days: 366

          acl_paths:
               - ./path/to/allows.aclj
               - ./path/to/blocks.aclj

Default Access

An additional default_access setting can be added to specify the default rule if no other rules match for custom collections. If omitted, this setting is default_access: allow, which is usually the desired default.

Setting default_access: block and providing a list of allow rules provides a flexible way to allow access to only a limited set of resources, and block access to anything out of scope by default.
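The combination of several rule files with a default can be sketched as follows. This is a simplified illustration, not pywb's implementation: the rule lists are merged in reverse order with heapq.merge and then scanned linearly rather than binary searched, and the function name best_access is hypothetical. Each rule is reduced to a (SURT key, access) pair.

```python
import heapq


def best_access(surt, rule_lists, default_access="allow"):
    """Merge reverse-sorted rule lists and return the most specific match."""
    merged = heapq.merge(*rule_lists, key=lambda rule: rule[0], reverse=True)
    for key, access in merged:
        if surt.startswith(key):
            return access
    # No rule matched: fall back to the collection's default_access.
    return default_access
```

With default_access set to block, only URLs covered by an allow rule are accessible, matching the allow-list setup described above.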

UI Customization

Customization Guide

Most aspects of the pywb user-interface can be customized by changing the default styles, or overriding the HTML templates.

This guide covers a few different options for customizing the UI.

New Vue-based UI

With pywb 2.7.0, pywb includes a brand new UI which includes a visual calendar mode and a histogram-based banner.

See Vue-based UI for more information on how to enable this UI.

Customizing UI Templates

pywb renders HTML using the Jinja2 templating engine, loading default templates from the pywb/templates directory.

If running from a custom directory, templates can be placed in the templates directory and will override the defaults.

See Template Guide for more details on customizing the templates.

Static Files

pywb will automatically support static files placed under the following directories:

  • Files under the root static directory: static/my-file.js can be accessed via http://localhost:8080/static/my-file.js
  • Files under the per-collection directory: ./collections/my-coll/static/my-file.js can be accessed via http://localhost:8080/static/_/my-coll/my-file.js

It is possible to change these settings via config.yaml:

  • static_prefix - sets the URL path used in pywb to serve static content (default static)
  • static_dir - sets the directory name used to read static files on disk (default static)

While pywb can serve static files, it is recommended to use an existing web server to serve static files, especially if already using it in production.

For example, this can be done via nginx with:

location /wayback/static {
    alias /pywb/pywb/static;
}

Loading Custom Metadata

pywb includes a default mechanism for loading externally defined metadata, loaded from a per-collection metadata.yaml YAML file at runtime.

See Custom Metadata for more details.

Additionally, the banner template has access to the contents of the config.yaml via the {{ config }} template variable, allowing for passing in arbitrary config information.

For more dynamic loading of data, the banner and all of the templates can load additional data via JS fetch() calls.

Embedding pywb in frames

It should be possible to embed pywb replay itself as an iframe as needed.

For customizing the top-level page and banner, see Customizing the Top Frame Template.

However, there may be other reasons to embed pywb in an iframe.

This can be done simply by including something like:

<html>
  <head></head>
  <body>
    <div>Embedding pywb replay</div>
    <iframe style="width: 100%; height: 100%" src="http://localhost:8080/pywb/20130729195151/http://test@example.com/"></iframe>
  </body>
</html>

Vue-based UI

With 2.7.0, pywb introduces a new Vue UI based system, which provides a more feature-rich representation of a web archive.

Overview

Calendar UI

The new calendar UI provides a histogram and a clickable calendar representation of a web archive.

The calendar is rendered in place of the URL query page from versions before 2.7.0.

Calendar UI Screenshot

Logo URL

It is possible to configure the logo to link to any URL by setting ui.logo_home_url in config.yaml to the URL of your choice.

If omitted, the logo will not link to any page.

For example, to have the logo redirect to https://example.com/web-archive-landing-page, set:

ui:
  logo_home_url: https://example.com/web-archive-landing-page

Printing

As of pywb 2.8, the replay header includes a print button that prints the contents of the replay iframe.

This button can be disabled by setting ui.disable_printing in config.yaml to any value.

For example:

ui:
  disable_printing: true

Updating the Vue UI

The UI is contained within the pywb/vueui directory.

The Vue component sources can be found in pywb/vueui/src.

Updating the UI requires node and yarn.

To install and build, run:

cd pywb/vueui
yarn install
yarn build

This will generate the output to pywb/static/vue/vueui.js which is loaded from the default templates when the Vue UI rendering is enabled.

Additional styles for the banner are loaded from pywb/static/vue_banner.css.

Template Guide

Introduction

This guide provides a reference of all of the templates available in pywb and how they could be modified.

These templates are found in the pywb/templates directory and can be overridden as needed, one HTML page at a time.

Template variables are listed as {{ variable }} to indicate the syntax used for rendering the value of the variable in Jinja2.

Copying a Template For Modification

To modify a template, it is often useful to start with the default template. To do so, simply copy a default template to a local templates directory.

For convenience, you can also run: wb-manager template --add <template-name> to add the template automatically.

For a list of available templates that can be overridden in this way, run wb-manager template --list.

Per-Collection Templates

Certain templates can be customized per-collection, instead of for all of pywb.

To override a template for a specific collection only, run wb-manager template --add <template-name> <coll-name>

For example:

wb-manager init my-coll
wb-manager template --add search_html my-coll

This will create the file collections/my-coll/templates/search.html, a copy of the default search.html, but configured to be used only for the collection my-coll.

Base Templates (and supporting templates)

File: base.html

This template includes the HTML added to all pages other than framed replay. Shared JS and CSS includes meant for pages other than framed replay can be added here.

To customize the default pywb UI across multiple pages, the following additional templates can also be overridden:

  • head.html – Template containing content to be added to the <head> of the base template
  • header.html – Template to be added as the first content of the <body> tag of the base template
  • footer.html – Template for adding content as the “footer” of the <body> tag of the base template

Note: The default pywb head.html and footer.html are currently blank. They can be populated to customize the rendering, add analytics, etc… as needed. Content such as styles or JS code (for example for analytics) must be added to the frame_insert.html template as well (details on that template below) to also be included in framed replay.

The base.html template also provides five blocks that can be supplied by templates that extend it.

  • title – Block for supplying the title for the page
  • head – Block for adding content to the <head>, includes head.html template
  • header – Block for adding content to the <body> before the body block, includes the header.html template
  • body – Block for adding the primary content to template
  • footer – Block for adding content to the <body> after the body block, includes the footer.html template

Home, Collection and Search Templates

Home Page Template

File: index.html

This template renders the home page for pywb, and by default renders a list of available collections.

Template variables:

  • {{ routes }} - a list of available collection routes.
  • {{ all_metadata }} - a dictionary of all metadata for all collections, keyed by collection id. See Custom Metadata for more info on the custom metadata.

Additionally, the Shared Template Variables are also available to the home page template, as well as all other templates.

Collection Page Template

File: search.html

The ‘collection page’ template is the page rendered when no URL is specified, e.g. http://localhost:8080/my-collection/.

The default template renders a search page that can be used to start searching for URLs.

Template variables:

  • {{ coll }} - the collection name identifier.
  • {{ metadata }} - an optional dictionary of metadata. See Custom Metadata for more info.
  • {{ ui }} - an optional ui dictionary from config.yaml, if any

Custom Metadata

If custom collection metadata is provided, this page will automatically show this metadata as well.

It is also possible to add custom per-collection metadata that will be available to the collection.

For dynamic collections, any fields placed in <coll_name>/metadata.yaml files can be accessed via the {{ metadata }} variable.

For example, if the metadata file contains:

somedata: value

Accessing {{ metadata.somedata }} will resolve to value.

The metadata can also be added via commandline: wb-manager metadata myCollection --set somedata=value.

URL Query/Calendar Page Template

File: query.html

This template is rendered for any URL search response pages, either a single URL or more complex queries.

For example, the page http://localhost:8080/my-collection/*/https://example.com/ will be rendered using this template, with functionality provided by a Vue application.

Template variables:

  • {{ url }} - the URL being queried, e.g. https://example.com/
  • {{ prefix }} - the collection prefix that will be used for replay, e.g. http://localhost:8080/my-collection/
  • {{ ui }} - an optional ui dictionary from config.yaml, if any
  • {{ static_prefix }} - the prefix from which static files will be accessed, e.g. http://localhost:8080/static/.

Replay and Banner Templates

The following templates are used to configure the replay view itself.

Custom Banner Template

File: custom_banner.html

This template can be used to render a custom banner for frameless replay. It is blank by default.

In frameless replay, the content of this template is injected into the head_insert.html template to render the banner.

Head Insert Template

File: head_insert.html

This template represents the HTML injected into every replay page to add support for client-side rewriting via wombat.js.

This template is part of the core pywb replay, and modifying this template is not recommended.

For customizing the banner, modify the banner.html (framed replay) or custom_banner.html (frameless replay) template instead.

Top Frame Template

File: frame_insert.html

This template represents the top-level frame that is inserted to render the replay in framed mode.

By design, this template does not extend from the base template.

This template is responsible for creating the iframe that will render the content.

This template only renders the banner and is designed not to set the encoding to allow the browser to ‘detect’ the encoding for the containing iframe. For this reason, the template should only contain ASCII text, and %-encode any non-ASCII characters.

Content such as analytics code that is desired in the top frame of framed replay pages should be added to this template.

Template variables:

  • {{ url }} - the URL being replayed.
  • {{ timestamp }} - the timestamp being replayed, e.g. 20211226 in http://localhost:8080/pywb/20211226/mp_/https://example.com/
  • {{ wb_url }} - A complete WbUrl object, which contains the url, timestamp and mod properties, representing the replay url.
  • {{ wb_prefix }} - the collection prefix, e.g. http://localhost:8080/pywb/
  • {{ is_proxy }} - set to true if page is being loaded via an HTTP/S proxy (checks if WSGI env has wsgiprox.proxy_host set)
  • {{ ui }} - an optional ui dictionary from config.yaml, if any.

Customizing the Top Frame Template

The top-frame used for framed replay can be replaced or augmented by modifying the frame_insert.html.

To start with modifying the default outer page, you can add it to the current templates directory by running wb-manager template --add frame_insert_html

To initialize the replay, the outer page should include wb_frame.js, create an <iframe> element and pass the id (or element itself) to the ContentFrame constructor:

<script src='{{ host_prefix }}/{{ static_path }}/wb_frame.js'> </script>
<script>
var cframe = new ContentFrame({"url": "{{ url }}" + window.location.hash,
                               "prefix": "{{ wb_prefix }}",
                               "request_ts": "{{ wb_url.timestamp }}",
                               "iframe": "#replay_iframe"});
</script>

The outer frame can receive notifications of changes to the replay via postMessage.

For example, to detect when the content frame changed and log the new url and timestamp, use the following script in the outer frame html:

window.addEventListener("message", function(event) {
  if (event.data.wb_type == "load" || event.data.wb_type == "replace-url") {
    console.log("New Url: " + event.data.url);
    console.log("New Timestamp: " + event.data.ts);
  }
});

The load message is sent when a new page is first loaded, while replace-url is used for url changes caused by content frame History navigation.

Error Templates

The following templates are used to render errors.

Page Not Found Template

File: not_found.html - template for 404 error pages.

This template is used to render any 404/page not found errors that can occur when loading a URL that is not in the web archive.

Template variables:

  • {{ url }} - the URL of the page
  • {{ wbrequest }} - the full WbRequest object which can be used to get additional info about the request.

(The default template checks {{ wbrequest and wbrequest.env.pywb_proxy_magic }} to determine if the request is via an HTTP/S Proxy Mode connection or a regular request).

Generic Error Template

File: error.html - generic error template.

This template is used to render all other errors that are not ‘page not found’.

Template variables:

  • {{ err_msg }} - a shorter error message indicating what went wrong.
  • {{ err_details }} - additional details about the error.

Shared Template Variables

The following template variables are available to all templates.

  • {{ env }} - contains environment variables passed to pywb.
  • {{ env.pywb_proxy_magic }} - if set, indicates pywb is accessed via proxy. See HTTP/S Proxy Mode
  • {{ static_prefix }} - URL path to use for loading static files.

UI Configuration

Starting with pywb 2.7.0, the ui block in config.yaml can contain any custom ui-specific settings.

This block is provided to the search.html, query.html and banner.html templates.

Localization Globals

The Localization system (see: Localization / Multi-lingual Support) adds several additional template globals, to facilitate listing available locales and getting URLs to switch locales, including:

  • {{ _Q() }} - a function used to mark certain text for localization, e.g. {{ _Q('localize this text') }}
  • {{ env.pywb_lang }} - indicates current locale language code used for localization.
  • {{ locales }} - a list of all available locale language codes, used for iterating over all locales.
  • {{ get_locale_prefixes() }} - a function which returns the prefixes to use to switch locales.
  • {{ switch_locale() }} - a function used to render a URL to switch locale for the current page. Ex: <a href="{{ switch_locale(locale) }}">{{ locale }}</a> renders a link to switch to a specific locale.

Localization / Multi-lingual Support

pywb supports configuring different language locales and loading different language translations, and dynamically switching languages.

pywb can extract all text from templates and generate CSV files for translation and convert them back into a binary format used for localization/internationalization.

(pywb uses the Babel library which extends the standard Python i18n system)

To ensure all localization related dependencies are installed, first run:

pip install pywb[i18n]

Locales to use are configured in the config.yaml.

The command-line wb-manager utility provides a way to manage locales for translation, including generating extracted text, and to update translated text.

Adding a Locale and Extracting Text

To add a new locale for translation and automatically extract all text that needs to be translated, run:

wb-manager i18n extract <loc>

The <loc> can be one or more supported two-letter locales or CLDR language codes. To list available codes, you can run pybabel --list-locales.

Localization data is placed in the i18n directory, and translatable strings can be found in i18n/translations/<locale>/LC_MESSAGES/messages.csv

Each CSV file looks as follows, listing each source string and an empty string for the translated version:

"location","source","target"
"pywb/templates/banner.html:6","Live on",""
"pywb/templates/banner.html:8","Calendar icon",""
"pywb/templates/banner.html:9 pywb/templates/query.html:45","View All Captures",""
"pywb/templates/banner.html:10 pywb/templates/header.html:4","Language:",""
"pywb/templates/banner.html:11","Loading...",""
...

This CSV can then be passed to translators to translate the text.

(The extraction parameters are configured to load data from pywb/templates/*.html in babel.ini)

For example, the following will generate translation strings for es and pt locales:

wb-manager i18n extract es pt

The translatable text can then be found in i18n/translations/es/LC_MESSAGES/messages.csv and i18n/translations/pt/LC_MESSAGES/messages.csv.

The CSV files should be updated with a translation for each string in the target column.
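
As a rough aid while translating, the CSV layout shown above can be checked for rows that still have an empty target column. The following is a minimal sketch using only the Python standard library; the helper name untranslated_rows and the sample translation "En vivo" are illustrative assumptions, not part of pywb:

```python
import csv
import io


def untranslated_rows(csv_text):
    """Return (location, source) pairs whose target column is still empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["location"], row["source"])
            for row in reader
            if not row["target"].strip()]


# Sample data in the same three-column layout as messages.csv
# (the Spanish translation below is made up for illustration)
sample = '''"location","source","target"
"pywb/templates/banner.html:6","Live on","En vivo"
"pywb/templates/banner.html:11","Loading...",""
'''

for location, source in untranslated_rows(sample):
    print(f"{location}: {source!r} still needs a translation")
```

In practice the text would be read from i18n/translations/<locale>/LC_MESSAGES/messages.csv instead of an inline string.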

The extract command adds any new strings without overwriting existing translations, so after running the update command to compile translated strings (described below), it is safe to run the extract command again.

Updating Locale Catalog

Once the text has been translated, and the CSV files updated, simply run:

wb-manager i18n update <loc>

This will parse the CSVs and compile the translated string tables for use with pywb.

Specifying locales in pywb

To enable the locales in pywb, one or more locales can be added to the locales key in config.yaml, ex:

locales:
   - en
   - es

Single Language Default Locale

pywb can be configured with a default, single-language locale, by setting the default_locale property in config.yaml:

default_locale: es
locales:
   - es

With this configuration, pywb will automatically use the es locale for all text strings in pywb pages.

pywb will also set the <html lang="es"> so that the browser will recognize the correct locale.

Multi-language Translations

If more than one locale is specified, pywb will automatically show a language switching UI at the top of collection and search pages, with an option for each locale listed. To include English as an option, it should also be added as a locale (and no strings translated). For example:

locales:
   - en
   - es
   - pt

will configure pywb to show a language switch option on all pages.

Localized Collection Paths

When localization is enabled, each collection can also be accessed under a locale prefix, which loads the UI in that language. For example, if pywb has a collection my-web-archive, then:

  • /my-web-archive/ - loads UI with default language (set via default_locale)
  • /en/my-web-archive/ - loads UI with en locale
  • /es/my-web-archive/ - loads UI with es locale
  • /pt/my-web-archive/ - loads UI with pt locale

The language switch options work by changing the locale prefix for the same page.
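
The prefix swap can be pictured with a small sketch. This is not pywb's implementation of switch_locale(); the function name switch_locale_path and its arguments are hypothetical, and it assumes no collection shares a name with a locale code:

```python
def switch_locale_path(path, new_locale, locales):
    """Replace (or add) the leading locale prefix of a pywb-style path."""
    parts = path.split("/")
    # parts[0] is '' because the path starts with '/'
    if len(parts) > 1 and parts[1] in locales:
        parts[1] = new_locale          # swap the existing locale prefix
    else:
        parts.insert(1, new_locale)    # no prefix yet: add one
    return "/".join(parts)


locales = ["en", "es", "pt"]
print(switch_locale_path("/es/my-web-archive/", "pt", locales))
print(switch_locale_path("/my-web-archive/", "en", locales))
```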

Listing and Removing Locales

To list the locales that have previously been added, run wb-manager i18n list.

To disable a locale from being used in pywb, simply remove it from the locales key in config.yaml.

To remove data for a locale permanently, you can run: wb-manager i18n remove <loc>. This will remove the locale directory on disk.

To remove all localization data, you can manually delete the i18n directory.

UI Templates: Adding Localizable Text

Text that can be translated, localizable text, can be marked as such directly in the UI templates:

  1. By wrapping the text in {% trans %}/{% endtrans %} tags. For example:

    {% trans %}Collection {{ coll }} Search Page{% endtrans %}
    
  2. Short-hand by calling a special _() function, which can be used in attributes or more dynamically. For example:

    ... title="{{ _('Enter a URL to search for') }}">
    

These methods can be used in all UI templates and are supported by the Jinja2 templating system.

See Customization Guide for a list of all available UI templates.

Architecture

The pywb system consists of 3 distinct components: Warcserver, Recorder and Rewriter, which can be run and scaled separately. The default pywb wayback application uses Warcserver and Rewriter. If recording is enabled, the Recorder is also used.

Additionally, the indexing system is used through all components, and a few command line tools encompass the pywb toolkit.

Warcserver

The Warcserver component is the base component of the pywb stack and can function as a standalone HTTP server.

The Warcserver receives as input an HTTP request, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.

This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.

After installing pywb, the warcserver can be started directly by running warcserver (default port is 8070).

Note: when running wayback, an instance of warcserver is also started automatically.

Warcserver API

The Warcserver API encompasses the CDXJ Server API and provides a per collection endpoint, using a list of collections defined in a YAML config file (default config.yaml). It’s also possible to use Warcserver without the YAML config (see: Custom Warcserver Deployments). The endpoints are as follows:

  • / - Home Page, JSON list of available endpoints.

For each collection <coll>:

  • /<coll>/index – Direct Index (compatible with CDXJ Server API)
  • /<coll>/resource – Direct Resource
  • /<coll>/postreq/index – POST request Index
  • /<coll>/postreq/resource – POST request Resource (most flexible for integration with downstream tools)

All endpoints accept the CDXJ Server API query arguments, although the “direct index” route is usually most useful for index lookup, while the “post request resource” route is most useful for integration with other downstream client tools.

POSTing vs Direct Input

The Warcserver is designed to map input requests to output responses, and it is possible to send input requests “directly”, eg:

GET /coll/resource?url=http://example.com/
Connection: close

or by “wrapping” the entire request in a POST request:

POST /coll/postreq/resource?url=http://example.com/
Content-Length: ...
...

GET /
Host: example.com
Connection: close

The “post request” (/postreq endpoint) approach allows more accurately transmitting any HTTP request and headers in the body of another POST request, without worrying about how the headers might be interpreted by the Warcserver connection. The “wrapped HTTP request” is thus unwrapped and processed, allowing hop-by-hop headers like Connection: close to be processed unaltered.
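
To make the wrapping concrete, the sketch below only builds the raw bytes of such a wrapped request and does not send anything. The helper name wrap_request and the default host value are assumptions for illustration; the url query argument is percent-encoded here, which the example above leaves unencoded:

```python
from urllib.parse import quote


def wrap_request(coll, url, inner_request, host="localhost:8070"):
    """Build a POST to /<coll>/postreq/resource whose body is another HTTP request."""
    body = inner_request.encode("utf-8")
    head = (
        f"POST /{coll}/postreq/resource?url={quote(url, safe='')} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


# The inner request is carried verbatim, including hop-by-hop headers
inner = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
wrapped = wrap_request("coll", "http://example.com/", inner)
print(wrapped.decode("ascii"))
```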

Index vs Resource Output

For any query, the Warcserver can return a matching index result, or the first available WARC record.

Within each collection and input type, the following endpoints are available:

  • /index - perform index lookup
  • /resource - return a single WARC record for the first match of the index list.

For example, an index query might return the CDXJ index:

=> curl "http://localhost:8070/pywb/index?url=iana.org"
org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}

While switching to resource, the result might be:

=> curl "http://localhost:8070/pywb/resource?url=iana.org"

WARC/1.0
WARC-Type: response
...

The resource lookup attempts to load the first available record (eg. by loading from the specified WARC). If the record indicated by the first CDXJ line is not available, the next CDXJ line is tried, and so on, until one succeeds.

If no record can be loaded from any of the CDXJ index results (or if there are no index results), a 404 Not Found error is returned.
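
The fallback behaviour described above can be summarized as a short loop. This is a simplified sketch, not Warcserver's actual code; load_first_available, fake_loader and the abbreviated index lines are hypothetical stand-ins:

```python
def load_first_available(cdxj_lines, load_record):
    """Try each index line in order; return (status, record) for the first that loads."""
    for line in cdxj_lines:
        try:
            return 200, load_record(line)
        except (IOError, KeyError):
            # This record could not be loaded, so fall through to the next line
            continue
    # No index results, or none of them could be loaded
    return 404, None


# Hypothetical loader: only b.warc.gz is actually available
available = {"b.warc.gz": b"WARC/1.0 ..."}


def fake_loader(line):
    filename = line.rsplit(" ", 1)[-1]
    return available[filename]  # raises KeyError for missing files


lines = ["org,iana)/ 20140126200624 a.warc.gz",
         "org,iana)/ 20140126200625 b.warc.gz"]
print(load_first_available(lines, fake_loader))
print(load_first_available([], fake_loader))
```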

WARC Record HTTP Response

When using Warcserver, the entire WARC record is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.

The following example illustrates what is transmitted when curl-ing http://localhost:8070/pywb/resource?url=iana.org:

> GET /pywb/resource?url=iana.org HTTP/1.1
> Host: localhost:8070
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Warcserver-Cdx: org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
< Link: <http://www.iana.org/>; rel="original"
< WARC-Target-URI: http://www.iana.org/
< Warcserver-Source-Coll: pywb:iana.cdx
< Content-Type: application/warc-record
< Memento-Datetime: Sun, 26 Jan 2014 20:06:24 GMT
< Content-Length: 6357
< Warcserver-Type: warc
< Date: Tue, 17 Oct 2017 00:32:12 GMT

< WARC/1.0
< WARC-Type: response
< WARC-Date: 2014-01-26T20:06:24Z
< WARC-Target-URI: http://www.iana.org/
< WARC-Record-ID: <urn:uuid:4eec4942-a541-410a-99f4-50de39b62118>
...

The HTTP payload is the WARC record itself, but the HTTP headers returned “surface” additional information about the WARC record to make it easier for clients to use the data.

  • Memento Headers Memento-Datetime and Link – The datetime is read from the WARC record, and the WARC record is itself a valid “memento”, although full Memento compliance is not yet included.
  • Warcserver-Cdx header includes the full CDXJ index line that was used to load this record (usually, but not always, the first line in the index query)
  • Warcserver-Source-Coll header includes the source from which this record was loaded, corresponding to source field in the CDXJ
  • Warcserver-Type: warc indicates that this is a Warcserver WARC record (may be removed in the future)

In particular, the CDXJ and source data can be used to further identify and process the WARC record, without having to parse it. The Recorder component uses the source to determine if recording is necessary or should be skipped.
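
For instance, a client could split the Warcserver-Cdx header value into its url key, timestamp and JSON fields without touching the WARC payload. A minimal sketch, assuming the header value has already been read from the response (the helper name parse_cdxj is illustrative):

```python
import json


def parse_cdxj(line):
    """Split a CDXJ line into (urlkey, timestamp, fields dict)."""
    urlkey, timestamp, fields = line.split(" ", 2)
    return urlkey, timestamp, json.loads(fields)


# Abbreviated value of the Warcserver-Cdx header from the example above
header = ('org,iana)/ 20140126200624 {"url": "http://www.iana.org/", '
          '"filename": "iana.warc.gz", "source": "pywb:iana.cdx"}')

urlkey, timestamp, fields = parse_cdxj(header)
print(urlkey, timestamp, fields["source"])
```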

Warcserver Index Configuration

Warcserver supports several index source types, allowing users to mix local and remote sources into a single collection or across multiple collections.

The sources include:

  • Local File
  • Local ZipNum File
  • Live Web Proxy (implicit index)
  • Redis sorted-set key
  • Memento TimeGate Endpoint
  • CDX Server API Endpoint

The index types can be defined using either the shorthand sourcename+<url> notation or a long-form full property declaration.

The following is an example of defining different special collections:

collections:
    # Live Index
    live: $live

    # rhizome via memento (shorthand)
    rhiz: memento+http://webenact.rhizome.org/all/

    # rhizome via memento (equivalent full properties)
    rhiz_long:
        index:
            type: memento
            timegate_url: http://webenact.rhizome.org/all/{url}
            timemap_url: http://webenact.rhizome.org/all/timemap/link/{url}
            replay_url: http://webenact.rhizome.org/all/{timestamp}id_/{url}

Warcserver Index Aggregators

In addition to individual index types, Warcserver supports ‘index aggregators’, which represent not a single source but multiple index sources, explicit or implicit.

Some explicit aggregators are:

  • Local Directory
  • Redis Key Template (scan/lookup of multiple redis keys)
  • A generic group of index sources looked up in parallel (best match)

The aggregators allow for complex lookup chains that look up resources in dynamic directory structures, via Redis keys, and in external web archives.

Note: Warcserver automatically includes a Local Directory aggregator pointing to the collections directory, as explained in Configuring the Web Archive.

Sample “Memento” Aggregator

For example, the following config defines the collection endpoint many_archives to look up three remote archives, two using memento, and one using the CDX Server API:

collections:
  # many archives
  many_archives:
    index_group:
      rhiz: memento+http://webenact.rhizome.org/all/
      ia:   cdx+http://web.archive.org/cdx;/web
      apt:  memento+http://arquivo.pt/wayback/

    timeout: 10

This allows Warcserver to serve as a “Memento Aggregator”, aggregating results from multiple existing archives (using the Memento API and other APIs).

An optional timeout property configures how many seconds to wait for each source before it is considered to have ‘timed out’. (If unspecified, the default value is 5 seconds).
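
The idea of querying several sources in parallel and dropping any that exceed the timeout can be sketched as follows. Note the swap in technique: this sketch uses standard-library threads rather than gevent, and query_group, fast and slow are hypothetical names, so it illustrates the behaviour rather than pywb's aggregator classes:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_group(sources, url, timeout=5.0):
    """Query all sources in parallel; return (results, names that timed out)."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, url): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on sources that are still running
    results = {futures[f]: f.result() for f in done if f.exception() is None}
    timed_out = sorted(futures[f] for f in not_done)
    return results, timed_out


def fast(url):
    return [f"{url} from fast"]


def slow(url):
    time.sleep(1.0)  # simulates an unresponsive remote archive
    return [f"{url} from slow"]


results, timed_out = query_group({"fast": fast, "slow": slow},
                                 "http://example.com/", timeout=0.2)
print(results, timed_out)
```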

Sequential Fallback Collections

It is also possible to define a “sequential” collection, where if one source/aggregator fails to produce a result, a “fallback” aggregator is tried, until there is a result:

collections:

  # Sequence
  web:
      sequence:
          -
            index: ./local/indexes
            resource: ./local/data
            name: local

          -
            index_group:
                rhiz: memento+http://webenact.rhizome.org/all/
                ia:   cdx+http://web.archive.org/cdx;/web
                apt:  memento+http://arquivo.pt/wayback/

          -
            index: $live
            name: live

In the above example, the local archive is tried first. If the resource cannot be successfully loaded, the group of 3 archives is tried. If they all fail to produce a successful response, the live web is tried. Note that a successful response requires both a successful index lookup and a successful resource fetch: if an index contains results, but they cannot be fetched, the next group in the sequence is tried.

The name of each item is included in the CDXJ index in the source field to allow the caller to identify which archive source was used.

Adding Custom Index Sources

It should be easy to add a custom index source, by extending pywb.warcserver.index.indexsource.BaseIndexSource

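# Import paths as found in pywb 2.x; verify against the installed version
from pywb.warcserver.index.cdxobject import CDXObject
from pywb.warcserver.index.indexsource import BaseIndexSource
from pywb.warcserver.warcserver import register_source


class MyIndexSource(BaseIndexSource):
    def load_index(self, params):
        # Look up index data as needed to fill the CDXObject
        cdx = CDXObject()
        cdx['url'] = ...
        ...
        yield cdx

    @classmethod
    def init_from_string(cls, value):
        # Shorthand form, e.g. "my-coll: my-index-src"
        if value == 'my-index-src':
            return cls()
        ...

    @classmethod
    def init_from_config(cls, config):
        # Long form: only handle configs declaring this source type
        if config['type'] != 'my-index-src':
            return
        return cls()


# Register Index with Warcserver
register_source(MyIndexSource)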

You can then use the index in a config.yaml:

collections:
  my-coll: my-index-src

For more information and definition of existing indexes, see pywb.warcserver.index.indexsource

Custom Warcserver Deployments

It is also possible to use Warcserver directly without the use of a config.yaml file, for more complex deployment scenarios. (Webrecorder uses a customized deployment).

For example, the following config.yaml config:

collections:
  live: $live

  memento:
    index_group:
      rhiz:  memento+http://webenact.rhizome.org/all/
      ia:    memento+http://web.archive.org/web/
      local: ./collections/

could be initialized explicitly, using the pywb.warcserver.basewarcserver.BaseWarcServer class, which does not use a YAML config:

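# Import paths as used in the Warcserver tests; verify against the installed version
from pywb.warcserver.basewarcserver import BaseWarcServer
from pywb.warcserver.handlers import DefaultResourceHandler
from pywb.warcserver.index.aggregator import (
    DirectoryIndexSource, GeventTimeoutAggregator, SimpleAggregator)
from pywb.warcserver.index.indexsource import LiveIndexSource, MementoIndexSource

app = BaseWarcServer()

# /live endpoint
live_agg = SimpleAggregator({'live': LiveIndexSource()})

app.add_route('/live', DefaultResourceHandler(live_agg))


# /memento endpoint
sources = {'rhiz': MementoIndexSource.from_timegate_url('http://webenact.rhizome.org/all/'),
           'ia': MementoIndexSource.from_timegate_url('http://web.archive.org/web/'),
           'local': DirectoryIndexSource('./collections')
          }

multi_agg = GeventTimeoutAggregator(sources)

app.add_route('/memento', DefaultResourceHandler(multi_agg))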

For more examples on custom Warcserver usage, consult the Warcserver tests, such as those in pywb.warcserver.test.test_handlers.py

Recorder

The recorder component acts as a proxy component, intercepting requests to and responses from the Warcserver and recording them to a WARC file on disk.

The recorder uses the pywb.recorder.multifilewarcwriter.MultiFileWARCWriter which extends the base warcio.warcwriter.WARCWriter from warcio and provides support for:

  • appending to multiple WARC files at once
  • WARC ‘rollover’ based on maximum size or idle time
  • indexing (CDXJ) on write

Many of the features of the Recorder were created for use with the Webrecorder project, although the core recorder is also used to provide basic recording via the /record/ endpoint. (See: Recording Mode)

Deduplication Filters

The core recorder class provides for optional deduplication using the pywb.recorder.redisindexer.WritableRedisIndexer class which requires Redis to store the index, and can be used to either:

  • write duplicate responses.
  • write revisit records.
  • ignore duplicates and don’t write to WARC.

Custom Filtering

The recorder also includes a filtering system to allow for not writing certain requests and responses. Filters include:

  • Skipping by regex applied to source (Warcserver-Source-Coll header from Warcserver)
  • Skipping if Recorder-Skip: 1 header is provided
  • Skipping if Range request header is provided
  • Filtering out certain HTTP headers, for example, http-only cookies
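
The skip conditions listed above amount to a simple predicate over the request headers and the source. The sketch below is only an illustration of that decision and does not use the recorder's real filter classes; should_skip and its parameters are hypothetical:

```python
import re


def should_skip(req_headers, source_coll, skip_source_rx=None):
    """Return True if a request/response pair should not be written to the WARC."""
    headers = {k.lower(): v for k, v in req_headers.items()}
    if headers.get("recorder-skip") == "1":
        return True   # caller explicitly asked to skip recording
    if "range" in headers:
        return True   # range requests are skipped
    if skip_source_rx and re.match(skip_source_rx, source_coll or ""):
        return True   # source (Warcserver-Source-Coll) matches the skip regex
    return False


print(should_skip({"Recorder-Skip": "1"}, "live"))
print(should_skip({"Range": "bytes=0-99"}, "live"))
print(should_skip({}, "pywb:iana.cdx", skip_source_rx=r"pywb:"))
print(should_skip({}, "live", skip_source_rx=r"pywb:"))
```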

The additional recorder functionality will be enhanced in a future version.

For more detailed examples, please consult the tests in pywb.recorder.test.test_recorder

Rewriter

pywb includes sophisticated server-side and client-side rewriting systems, including a rules-based configuration for domain and content-specific rewriting rules, fuzzy index matching for replay, and a thorough client-side JS rewriting system.

With pywb 2.3.0, the client-side rewriting system exists in a separate module at https://github.com/webrecorder/wombat

URL Rewriting

URL rewriting is a key aspect of correctly replaying archived pages. It is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser. pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client.

(No url rewriting is performed when running in HTTP/S Proxy Mode)

Most of the rewriting performed is url-rewriting, changing the original URLs to point to the pywb server instead of the live web. Typically, the rewriting converts:

<url> -> <pywb host>/<coll>/<timestamp><modifier>/<url>

For example, the url http://example.com/ might be rewritten as http://localhost:8080/my-coll/2017mp_/http://example.com/

The rewritten url ‘prefixes’ the pywb host, the collection, requested datetime (timestamp) and type modifier to the actual url. The result is an ‘archival url’ which contains the original url and additional information about the archive and timestamp.
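
The prefixing scheme can be expressed as a small string-building function. This is a sketch of the url layout only, not pywb's rewriter; archival_url and its parameter names are assumptions:

```python
def archival_url(prefix, coll, url, timestamp="", modifier=""):
    """Build <pywb host>/<coll>/<timestamp><modifier>/<url>."""
    ts_mod = f"{timestamp}{modifier}"
    # Omit the timestamp/modifier path segment entirely when both are empty
    middle = f"{ts_mod}/" if ts_mod else ""
    return f"{prefix.rstrip('/')}/{coll}/{middle}{url}"


print(archival_url("http://localhost:8080", "my-coll",
                   "http://example.com/", "2017", "mp_"))
print(archival_url("http://localhost:8080", "my-coll", "http://example.com/"))
```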

Url Rewrite Type Modifier

The type modifier included after the timestamp specifies the format of the resource to be loaded. Currently, pywb supports the following modifiers:

Identity Modifier (id_)

When this modifier is used, eg. /my-coll/id_/http://example.com/, no content rewriting is performed on the response, and the original, un-rewritten content is returned. This is useful for HTML or other text resources that are normally rewritten when using the default (mp_) modifier.

Note that certain HTTP headers (hop-by-hop or cookie related) may still be prefixed with X-Orig-Archive- as they may affect the transmission, so original headers are not guaranteed.

No Modifier

The ‘canonical’ replay url is one without the modifier and represents the url that a user will see and enter into the browser.

The behavior for the canonical/no modifier archival url is only different if framed replay is used (see Framed vs Frameless Replay)

  • If framed replay, this url serves the top level frame
  • If frameless replay, this url serves the content and is equivalent to the mp_ modifier.

Main Page Modifier (mp_)

This modifier is used to indicate ‘main page’ content replay, generally HTML pages. Since pywb also performs content type detection, this modifier can be used for any resource that is being loaded for replay, and pywb will generally render it correctly. Binary resources can be rendered with this modifier.

JS and CSS Hint Modifiers (js_ and cs_)

These modifiers are useful to ‘hint’ to pywb that a certain resource is being treated as a JS or CSS file. This only makes a difference where there is an ambiguity.

For example, if a resource has type text/html but is loaded in a <script> tag with the js_ modifier, it will be rewritten as JS instead of as HTML.

Other Modifiers

For compatibility and historical reasons, the pywb HTML parser also adds the following special hints:

  • im_ – hint that this resource is being used as an image.
  • oe_ – hint that this resource is being used as an object or embed
  • if_ – hint that this resource is being used as an iframe
  • fr_ – hint that this resource is being used as a frame

However, these modifiers are essentially treated the same as mp_, deferring to content-type analysis to determine if rewriting is needed.
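
To see how the collection, timestamp, modifier and original url fit together in an archival url path, the following sketch splits one apart with a regular expression. It is a simplified illustration, not pywb's own url parser; the pattern and the name parse_archival_path are assumptions:

```python
import re

# /<coll>/[<timestamp>][<modifier>]/<url>, where the modifier is two letters plus '_'
ARCHIVAL_RX = re.compile(
    r"^/(?P<coll>[^/]+)/(?:(?P<ts>\d{0,14})(?P<mod>[a-z]{2}_)?/)?(?P<url>.+)$")


def parse_archival_path(path):
    """Split an archival url path into its components ('' when a part is absent)."""
    m = ARCHIVAL_RX.match(path)
    if not m:
        return None
    return {"coll": m.group("coll"),
            "timestamp": m.group("ts") or "",
            "modifier": m.group("mod") or "",
            "url": m.group("url")}


print(parse_archival_path("/my-coll/2017mp_/http://example.com/"))
print(parse_archival_path("/my-coll/id_/http://example.com/"))
print(parse_archival_path("/my-coll/http://example.com/"))
```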

Configuring Rewriters

pywb provides customizable rewriting based on content-type. The available types are configured in pywb.rewrite.default_rewriter, which specifies the rewriter classes per known type and the mapping of content-types to rewriters.

HTML Rewriting

An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to url attributes to add the url rewriting prefix and Url Rewrite Type Modifier based on the HTML tag and attribute.

Inline CSS and JS in HTML is rewritten using CSS and JS specific rewriters.

CSS Rewriting

The CSS rewriter rewrites any urls found in <style> blocks in HTML, as well as any files determined to be css (based on text/css content type or cs_ modifier).

JS Rewriting

The JS rewriter is applied to inline <script> blocks, or inline attribute js, and any files determined to be javascript (based on content type and js_ modifier).

The default JS rewriter does not rewrite any links. Instead, the JS rewriter performs limited regular expression rewriting of the following:

  • postMessage calls
  • certain this property accessors
  • specific location = assignment

Then, the entire script block is wrapped in a special code block to be executed client side. The result is that client-side execution of location, window, top and other top-level objects goes through a client-side proxy object. The client-side rewriting is handled by wombat.js.

The server-side rewriting is to aid the client-side execution of wrapped code.

For more information, see pywb.rewrite.regex_rewriters.JSWombatProxyRewriterMixin

JSONP Rewriting

A special case of JS rewriting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure the JSONP callback matches the expected param.

For example, a requested url might be /my-coll/http://example.com?callback=jQuery123 but the returned content might be: jQuery456(...) due to fuzzy matching, which matched this inexact response to the requested url.

To ensure the JSONP callback works as expected, the content is rewritten from jQuery456(...) to jQuery123(...), matching the callback in the requested url.

For more information, see pywb.rewrite.jsonp_rewriter
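
The callback substitution can be illustrated with a few lines of standard-library Python. This is a simplified sketch and not the code in pywb.rewrite.jsonp_rewriter; the regular expression and the function name rewrite_jsonp are assumptions:

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches a leading JSONP callback name followed by '('
JSONP_RX = re.compile(r"^\s*([\w.$]+)\s*\(")


def rewrite_jsonp(requested_url, content):
    """Replace the callback name in the content with the one from the requested url."""
    callbacks = parse_qs(urlsplit(requested_url).query).get("callback")
    m = JSONP_RX.match(content)
    if not callbacks or not m:
        return content  # not JSONP, or no callback requested: leave unchanged
    return content[:m.start(1)] + callbacks[0] + content[m.end(1):]


print(rewrite_jsonp("http://example.com?callback=jQuery123",
                    'jQuery456({"a": 1})'))
```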

DASH and HLS Rewriting

To support recording and replaying adaptive streaming formats (DASH and HLS), pywb can perform special rewriting on the manifests for these formats to remove all but one possible resolution/format. As a result, the non-deterministic format selection is reduced to a single consistent format.

For more information, see pywb.rewrite.rewrite_hls and pywb.rewrite.rewrite_dash and the tests in pywb/rewrite/test/test_content_rewriter.py

Indexing

To provide access to the web archival data (local and remote), pywb uses indexes to represent each “capture” or “memento” in the archive. The WARC format itself does not provide a specific index, so an external index is needed.

Creating an Index

When adding a WARC using wb-manager, pywb automatically generates a CDXJ Format index.

The index can also be created explicitly using cdx-indexer command line tool:

cdx-indexer -j example2.warc.gz
com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}

Note: the cdx-indexer tool is deprecated and will be replaced by the standalone cdxj-indexer package.

Index Formats

Classic CDX

Traditionally, an index for a web archive (WARC or ARC) file has been called a CDX file, probably from Capture/Crawl inDeX (CDX).

The CDX format originates with the Internet Archive and represents a plain-text space-delimited format, each line representing the information about a single capture. The CDX format could contain many different fields, and unfortunately, no standardized format existed. The order of the fields typically includes a searchable url key and timestamp, to allow for binary sorting and search. The ‘url search key’ is typically reversed to allow for easier searching of subdomains, eg. example.com -> com,example,)/

A classic CDX file might look like this:

CDX N b a m s k r M S V g
com,example)/ 20160225042329 http://example.com/ text/html 200 37cf167c2672a4a64af901d9484e75eee0e2c98a - - 1286 363 example2.warc.gz

A header is used to index the fields in the file, though typically a standard variation is used.
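
The reversed url search key used in the first column can be approximated in a few lines. This sketch is deliberately simplified compared to the full canonicalization pywb performs: it lowercases the url, drops a leading www., reverses the host labels and appends the path and query. The function name url_search_key is illustrative:

```python
from urllib.parse import urlsplit


def url_search_key(url):
    """Return a simplified reversed-host search key, e.g. com,example)/"""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


print(url_search_key("http://example.com/"))
print(url_search_key("http://www.iana.org/"))
print(url_search_key("sub.example.com/path"))
```

Because the host is reversed, all captures of a domain and its subdomains sort next to each other in the index.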

CDXJ Format

The pywb system uses a more flexible version of the CDX, called CDXJ, which stores most of the fields in a JSON dictionary:

com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}

The CDXJ format allows for more flexibility by allowing the index to contain a varying number of fields, while still allowing the index to be sortable by a common key (url key + timestamp). This allows CDXJ indexes from different sources and with different numbers of fields to be merged and sorted.
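
As a small illustration of that property, two already-sorted CDXJ lists with different JSON fields can be merged on the (url key, timestamp) prefix alone. The index lines below are abbreviated samples and the names local, remote and sort_key are made up for this sketch:

```python
import heapq


def sort_key(line):
    """Sort on the url key and timestamp only, ignoring the JSON fields."""
    urlkey, timestamp = line.split(" ", 2)[:2]
    return urlkey, timestamp


local = [
    'com,example)/ 20160225042329 {"url": "http://example.com/", "filename": "example2.warc.gz"}',
]
remote = [
    'com,example)/ 20140101000000 {"url": "http://example.com/"}',
    'org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html"}',
]

# heapq.merge lazily merges inputs that are each already sorted
merged = list(heapq.merge(local, remote, key=sort_key))
for line in merged:
    print(line)
```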

Using CDXJ indexes is recommended and pywb provides the wb-manager migrate-cdx tool for converting classic CDX to CDXJ.

In general, most discussions of CDX also apply to CDXJ indexes.

ZipNum Sharded Index

A CDX(J) file is generally accessed by doing a simple binary search through the file. This scales well to very large (GB+) CDXJ files. However, for very large archives (TB+ or PB+), binary search across a single file has its limits.

A more scalable alternative to a single CDX(J) file is a gzip-compressed, chunked cluster of CDXJ, with a binary searchable index. In this format, sometimes called the ZipNum or Ziplines cluster (for some X number of cdx lines zipped together), all actual CDXJ lines are gzip-compressed and concatenated together. To allow for random access, the lines are gzipped in groups of X lines (often 3000, but can be anything). This allows for the full index to be spread over N number of gzipped files, but has the overhead of requiring a full block of X lines to be read for each lookup. Generally, this overhead is negligible when looking up large indexes, and non-existent when doing a range query across many CDX lines.

The index can be split into an arbitrary number of shards, each containing a certain range of the url space. This allows the index to be created in parallel using MapReduce with a reduce task per shard. For each shard, there is an index file and a secondary index file. At the end, the secondary index is concatenated to form the final, binary searchable index.
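
The block-plus-summary idea can be demonstrated in memory. The sketch below is a toy model, not the on-disk ZipNum format: it keeps each gzipped block in a Python list and uses the first line of each block as the secondary index, whereas a real cluster stores offsets and lengths into shard files. All names are hypothetical:

```python
import bisect
import gzip


def build_zipnum(sorted_lines, lines_per_block=3000):
    """Gzip sorted index lines in blocks; record each block's first line as a summary."""
    blocks, summary = [], []
    for i in range(0, len(sorted_lines), lines_per_block):
        chunk = sorted_lines[i:i + lines_per_block]
        blocks.append(gzip.compress("\n".join(chunk).encode("utf-8")))
        summary.append(chunk[0])
    return blocks, summary


def lookup(blocks, summary, key_prefix):
    """Binary search the summary, then decompress only the blocks that can match."""
    start = max(bisect.bisect_right(summary, key_prefix) - 1, 0)
    matches = []
    for block in blocks[start:]:
        for line in gzip.decompress(block).decode("utf-8").split("\n"):
            if line.startswith(key_prefix):
                matches.append(line)
            elif line > key_prefix:
                return matches  # sorted input: nothing further can match
    return matches


lines = [f"com,example)/page{i:03d} 20160225042329 {{}}" for i in range(10)]
blocks, summary = build_zipnum(lines, lines_per_block=3)
print(lookup(blocks, summary, "com,example)/page004"))
```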

The webarchive-indexing project provides tools for creating such an index, both locally and via MapReduce.

Single-Shard Index

A ZipNum index need not have multiple shards, and provides advantages even for smaller datasets. For example, in addition to using less disk space thanks to the compressed index, the ZipNum index makes the Pagination API available when using the cdx server for bulk querying.

Command-Line Apps

After installing pywb tool-suite, the following command-line apps are made available (in the Python binary directory or current environment):

All server tools have a different default port, which can be overridden via the -p <port> command-line option.

cdx-indexer

The CDX Indexer provides a way to create a CDX(J) file from a WARC/ARC. The tool supports both classic-CDX and new CDXJ formats.

The indexer also provides options for including all WARC records, and merging data from POST request (and other HTTP records).

See cdx-indexer -h for a list of options.

Note: In a future pywb release, this tool will be removed in favor of the standalone cdxj-indexer app, which will have additional indexing options.

wb-manager

The wb-manager command-line tool is used to configure the collections directory structure and its contents, which pywb uses to automatically read collections.

The tool can be used while wayback is running, and pywb will detect many changes automatically.

It can be used to:

  • Create a new collection – wb-manager init <coll>
  • Add WARCs or WACZs to collection – wb-manager add <coll> <warc/wacz>
  • Add override templates
  • Add and remove metadata in a collection's metadata.yaml
  • List all collections
  • Reindex a collection
  • Migrate old CDX to CDXJ style indexes.

For more details, run wb-manager -h.

warcserver

The Warcserver is a standalone server component that adheres to the Warcserver API.

The server runs on port 8070 by default serving both index and content.

The CDX Server is a subset of the Warcserver and queries using the CDXJ Server API are included:

http://localhost:8070/<coll>/index?url=http://example.com/

No rewriting or recording is performed by the Warcserver, but all collections from config.yaml are loaded.

wayback (pywb)

The main pywb application is installed as the wayback application. (pywb is another name for the same application, and may become the primary name in future versions.)

The app will start on port 8080 by default, and configuration is read from config.yaml

See Configuring the Web Archive for a detailed overview of configuration options and customizations.

live-rewrite-server

This cli is a shortcut for wayback, but configured to run with only the Live Web Collection.

The live rewrite server runs on port 8090 and rewrites content from live web, useful for testing.

This app is almost equivalent to wayback --live, except no other collections from config.yaml are used.

APIs

pywb supports the following APIs:

CDXJ Server API

The following is a reference of the api for querying and filtering archived resources.

The api can be used to get information about a range of archive captures/mementos, including filtering, sorting, and pagination for bulk query.

The actual archive files (WARC/ARC) are not loaded during this query, only the generated CDXJ index.

The Warcserver component uses this same api internally to perform all index and resource lookups in a consistent way.

For example, the following query might return the first 10 results from host http://example.com/* where the mime type is text/html:

http://localhost:8080/coll/cdx?url=http://example.com/*&page=1&filter=mime:text/html&limit=10

By default, the api endpoint is available at /<coll>/cdx for a collection named <coll>.

The setting can be changed by setting cdx_api_endpoint in config.yaml.

For example, set cdx_api_endpoint: -index to use /<coll>-index as the endpoint (the previous default in older versions of pywb).

To disable CDXJ access altogether, set cdx_api_endpoint: ''
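
Query urls for this api can be assembled with the standard library. The helper below, cdx_query_url, is a hypothetical convenience function; its endpoint argument mirrors the cdx_api_endpoint setting described above, and it only builds the url without sending a request:

```python
from urllib.parse import urlencode


def cdx_query_url(base, coll, url, endpoint="/cdx", **params):
    """Build a CDXJ Server API query url for a collection."""
    # Keep ':', '/' and '*' readable, as in the examples in this section
    query = urlencode({"url": url, **params}, safe=":/*")
    return f"{base.rstrip('/')}/{coll}{endpoint}?{query}"


print(cdx_query_url("http://localhost:8080", "coll", "http://example.com/*",
                    page=1, filter="mime:text/html", limit=10))
print(cdx_query_url("http://localhost:8080", "coll", "example.com",
                    endpoint="-index"))
```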

API Reference

url

The only required parameter to the cdx server api is the url, ex:

http://localhost:8080/coll/cdx?url=example.com

will return a list of captures for ‘example.com’ in the collection coll (see above regarding per-collection api endpoints).

from, to

Setting from=<ts> or to=<ts> will restrict the results to the given date/time range (inclusive).

Timestamps may be up to 14 digits long; shorter timestamps are padded to the lower bound (for from) or to the upper bound (for to).
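The padding rule can be sketched as follows. This only illustrates the documented behaviour and is not pywb's own implementation; the function name and the padding templates are assumptions, and the partial timestamp is assumed to end on a whole field (4, 6, 8, 10, 12 or 14 digits).

```python
import calendar

# Assumed padding templates: earliest and latest value for each missing field.
LOWER = "10000101000000"
UPPER = "29991231235959"


def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp to 14 digits (illustrative helper)."""
    template = UPPER if upper else LOWER
    padded = ts + template[len(ts):]
    if upper and len(ts) < 8:
        # The day was not given, so clamp it to the last day of the month.
        year, month = int(padded[:4]), int(padded[4:6])
        last_day = calendar.monthrange(year, month)[1]
        padded = padded[:6] + f"{last_day:02d}" + padded[8:]
    return padded


print(pad_timestamp("2014"), pad_timestamp("2014", upper=True))
```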

For example, ...?url=example.com&from=2014&to=2014 will return results of example.com that have a timestamp between 20140101000000 and 20141231235959.

matchType

The cdx server supports the following matchType values:

  • exact – default setting, will return captures that match the url exactly
  • prefix – return captures that begin with a specified path, eg: http://example.com/path/*
  • host – return captures which begin with a given host (the path segment is ignored if specified)
  • domain – return captures for the current host and all subdomains, eg. *.example.com

As a shortcut, instead of specifying a separate matchType parameter, wildcards may be used in the url:

  • ...?url=http://example.com/path/* is equivalent to ...?url=http://example.com/path/&matchType=prefix
  • ...?url=*.example.com is equivalent to ...?url=example.com&matchType=domain
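The wildcard shortcut can be expressed as a small mapping from a url value to an explicit (url, matchType) pair. This is a sketch of the equivalence listed above, with a hypothetical function name; it is not how pywb parses the parameter internally.

```python
def expand_wildcard(url):
    """Return (url, matchType) for a url that may contain a wildcard."""
    if url.startswith("*."):
        # *.example.com -> domain match on example.com
        return url[2:], "domain"
    if url.endswith("*"):
        # http://example.com/path/* -> prefix match on http://example.com/path/
        return url[:-1], "prefix"
    # No wildcard: the default exact match applies.
    return url, "exact"


print(expand_wildcard("http://example.com/path/*"))
print(expand_wildcard("*.example.com"))
```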

Note: if you are using legacy cdx index files which are not SURT-ordered, the domain option will not be available. If this is the case, you can use the wb-manager cdx-convert command to convert any cdx to the latest format.

limit

Setting limit= will limit the number of index lines returned. Limit must be set to a positive integer. If no limit is provided, all the matching lines are returned, which may be slow. (If using a ZipNum compressed cluster, the page size limit is enforced and no captures are read beyond the single page. See Pagination API for more info).

sort

The sort param can be set as follows:

  • reverse – will sort the matching captures in reverse order. It is only recommended for exact query, as reversing a large match may be very slow. (An optimized version is planned)
  • closest – setting this option also requires setting closest=<ts> where <ts> is a specific timestamp to sort by. This option will only work correctly for exact query and is useful for sorting captures based on time distance from a certain timestamp. (pywb uses this option internally for replay in order to fall back to the ‘next closest’ capture if one fails)

Both options may be combined with limit to return the top N closest, or the last N results.
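To make the closest ordering concrete, the sketch below sorts a few made-up 14-digit timestamps by their distance from a target timestamp and then applies a limit. It illustrates the semantics only; the function names and sample values are hypothetical and this is not pywb's implementation.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a full 14-digit timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sort_closest(timestamps, closest, limit=None):
    """Order timestamps by absolute time distance from closest."""
    target = parse_ts(closest)
    ordered = sorted(timestamps, key=lambda ts: abs(parse_ts(ts) - target))
    return ordered if limit is None else ordered[:limit]


captures = ["20140101000000", "20150101000000", "20160101000000"]
print(sort_closest(captures, "20150601000000", limit=2))
```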

output

This option will toggle the output format of the resulting CDXJ.

  • output=cdxj (default) native format used by pywb, it consists of a space-delimited url timestamp followed by a JSON dictionary (url timestamp {…})
  • output=json will return each line as a proper JSON dictionary, resulting in newline-delimited JSON (NDJSON).
  • output=link will return each line in application/link format suitable for use as a Memento TimeMap
  • output=text will return each line as fully space-delimited. As the number of fields may vary due to mix of different sources, this format is not recommended and only provided for backward compatibility.

Using output=json is recommended for extensive analysis and it may become the default option in a future release.
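Both the default CDXJ output and the output=json (NDJSON) output are straightforward to parse with the standard library. The sketch below assumes each CDXJ line has two space-delimited leading fields followed by a JSON dictionary, as described above; the sample line is made up for illustration.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, fields)."""
    # Split only twice: the JSON dictionary itself may contain spaces.
    key, timestamp, payload = line.split(" ", 2)
    return key, timestamp, json.loads(payload)


def parse_ndjson(text):
    """Parse output=json: one JSON dictionary per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample line, not taken from a real index.
sample_cdxj = (
    'com,example)/ 20140101000000 '
    '{"url": "http://example.com/", "mime": "text/html", "status": "200"}'
)
print(parse_cdxj_line(sample_cdxj))
```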

filter

The filter param can be specified multiple times to filter by specific fields in the cdx index. Field names correspond to the fields returned in the JSON output. Filters can be specified as follows:

  • ...?url=example.com/*&filter==mime:text/html&filter=!=status:200 Return captures from example.com/* where mime is text/html and http status is not 200.
  • ...?url=example.com&matchType=domain&filter=~url:.*\.php$ Return captures from the domain example.com whose URL ends in .php.

The ! modifier before =status indicates negation. The = and ~ modifiers are optional and specify exact and regular expression matches, respectively. By default (no modifier), the filter matches if the expression is contained in the field value. Negation and the exact/regex modifiers may be combined, eg. filter=!~text/.*
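The modifier rules can be sketched as a small matcher. This is an illustration of the documented semantics and not pywb's filter code: the function names are hypothetical, the spec argument is the value that follows filter= in the query, and record stands in for one parsed index line.

```python
import re


def parse_filter(spec):
    """Split a filter value into (negate, mode, field, expression)."""
    negate = spec.startswith("!")
    if negate:
        spec = spec[1:]
    if spec.startswith("="):
        mode, spec = "exact", spec[1:]
    elif spec.startswith("~"):
        mode, spec = "regex", spec[1:]
    else:
        mode = "contains"
    field, _, expression = spec.partition(":")
    return negate, mode, field, expression


def matches(spec, record):
    """Return True if record passes the filter described by spec."""
    negate, mode, field, expression = parse_filter(spec)
    value = record.get(field, "")
    if mode == "exact":
        hit = value == expression
    elif mode == "regex":
        # re.match anchors the expression at the beginning of the field.
        hit = re.match(expression, value) is not None
    else:
        hit = expression in value
    return hit != negate


record = {"mime": "text/html", "status": "200"}
print(matches("=mime:text/html", record), matches("!=status:200", record))
```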

The formal syntax is: filter=[!][=|~]<fieldname>:<expression> with the following modifiers:

modifier(s)     example                   description
(no modifier)   filter=mime:html          field “mime” contains string “html”
=               filter==mime:text/html    exact match: field “mime” is “text/html”
~               filter=~mime:.*/html$     regex match: expression matches beginning of field “mime” (cf. re.match)
!               filter=!mime:html         field “mime” does not contain string “html”
!=              filter=!=mime:text/html   field “mime” is not “text/html”
!~              filter=!~mime:.*/html     expression does not match beginning of field “mime”

fields

The fields param can be used to specify which fields to include in the output. The standard available fields are usually: urlkey, timestamp, url, mime, status, digest, length, offset, filename

If a minimal cdx index is used, the mime and status fields may not be available. Additional fields may be introduced in the future, especially in the CDX JSON format.

Fields can be comma delimited, for example fields=urlkey,timestamp,filename will only include the urlkey, timestamp and filename in the output.

Pagination API

The cdx server supports an optional pagination api, but it is currently only available when using a ZipNum Sharded Index instead of plain text cdx files. (Additional pagination support may be added for CDXJ files as well).

The pagination api supports the following params:

page

page is the current page number, and defaults to 0 if omitted. If the page exceeds the number of available pages from the page count query, a 400 error will be returned.

pageSize

pageSize is an optional parameter which can increase or decrease the amount of data returned in each page. The default setting can be configuration dependent.

showNumPages=true

This is a special query which, if successful, always returns a JSON response indicating the size of the full results. The query should be very quick regardless of the size of the result set.

{"blocks": 423, "pages": 85, "pageSize": 5}

In this result:

  • pages is the total number of pages available for this query. The page parameter may be between 0 and pages - 1
  • pageSize is the total number of ZipNum compressed blocks that are read for each page. The default value can be set in the pywb config.yaml via the max_blocks: 5 option.
  • blocks is the actual number of compressed blocks that match the query. This can be used to quickly estimate the total number of captures, within a margin of error. In general, blocks / pageSize + 1 = pages (since there is always at least 1 page even if blocks < pageSize)
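The relationship between blocks, pageSize and pages can be checked against the sample response. The sketch below applies the general formula exactly as stated above, using integer division; the function names are hypothetical, and the pages value returned by showNumPages remains the authoritative figure.

```python
def estimate_pages(blocks, page_size):
    """Apply the general formula: blocks / pageSize + 1 (integer division)."""
    return blocks // page_size + 1


def valid_page_numbers(pages):
    """The page parameter may be between 0 and pages - 1."""
    return range(pages)


result = {"blocks": 423, "pages": 85, "pageSize": 5}
print(estimate_pages(result["blocks"], result["pageSize"]))
```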

If changing pageSize, the same value should be used for both the showNumPages query and the regular paged query. ex:

  • Use ...pageSize=2&showNumPages=true and read pages to get total number of pages
  • Use ...pageSize=2&page=N to read the N-th page, for N from 0 to pages-1

showPagedIndex=true

When this param is set, the returned data is the secondary index instead of the actual CDX. Each line represents a compressed cdx block, and the number of lines returned should correspond to the blocks value in showNumPages query. This query is used internally before reading the actual compressed blocks and should be significantly faster. At this time, this option can not be combined with other query params listed in the api, except for output=json. Using output=json is recommended with this query as the default text format may change in the future.

Memento API

pywb supports the Memento Protocol as specified in RFC 7089 and provides API endpoints for Memento TimeMaps and TimeGates per collection.

Memento support is enabled by default and can be controlled via the enable_memento: true|false setting in the config.yaml

TimeMap API

The timemap API is available at /<coll>/timemap/<type>/<url> for any pywb collection <coll> and <url> in the collection.

The timemap (URI-T) can be provided in several output formats, as specified by the <type> param:

  • link – returns an application/link-format as required by the Memento spec
  • cdxj – returns a timemap in the native CDXJ format.
  • json – returns the timemap in newline-delimited JSON (NDJSON) format.

Although not required by the Memento spec, the Link output produced by timemap also includes the extra collection= field, specifying the collection of each url. This is especially useful when accessing the timemap for the special Auto “All” Aggregate Collection to view a timemap across multiple collections in a single response.

The Timemap API is implemented as a subset of the CDXJ Server API and should produce the same result as the equivalent CDX server query.

For example, the timemap query: http://localhost:8080/pywb/timemap/link/http://example.com/ is equivalent to the CDX server query: http://localhost:8080/pywb/cdx?url=http://example.com/&output=link
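The equivalence between the two URL forms can be written down as a pair of small helpers. This is a sketch with hypothetical function names; the base URL and the collection name pywb are taken from the example above.

```python
def timemap_url(base, coll, url, type_="link"):
    """Build a TimeMap URL of the form /<coll>/timemap/<type>/<url>."""
    return f"{base}/{coll}/timemap/{type_}/{url}"


def equivalent_cdx_url(base, coll, url, type_="link"):
    """Build the CDX server query that should produce the same result."""
    return f"{base}/{coll}/cdx?url={url}&output={type_}"


base = "http://localhost:8080"
print(timemap_url(base, "pywb", "http://example.com/"))
print(equivalent_cdx_url(base, "pywb", "http://example.com/"))
```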

TimeGate API

The TimeGate API for any pywb collection is /<coll>/<url>, eg. /my-coll/http://example.com/

The timegate can either be a non-redirecting timegate (URI-M, 200-style negotiation) and return a URI-M response, or a redirecting timegate (302-style negotiation) and redirect to a URI-M.

Non-Redirecting TimeGate (Memento Pattern 2.2)

This behavior is consistent with Memento Pattern 2.2 and is the default behavior.

To avoid an extra redirect, the TimeGate returns the requested memento directly (200-style negotiation) without redirecting to its canonical, timestamped url. The ‘canonical’ URI-M is included in the Content-Location header and should be used to reference the memento in the future.

(For HTML Mementos, the rewriting system also injects the url and timestamp into the page so that it can be displayed to the user). This behavior optimizes network traffic by avoiding unneeded redirects.

Redirecting TimeGate (Memento Pattern 2.3)

This behavior is consistent with Memento Pattern 2.3

To enable this behavior, add redirect_to_exact: true to the config.

In this mode, the TimeGate always issues a 302 to redirect a request to the “canonical” URI-M memento. The Location header is always present with the redirect.

As this approach always includes a redirect, use of this system is discouraged when the intent is to render mementos. However, this approach is useful when the goal is to determine the URI-M and to provide backwards compatibility.

Proxy Mode Memento API

When running in HTTP/S Proxy Mode, pywb behaves roughly in accordance with Memento Pattern 1.3

Every URI in proxy mode is also a TimeGate, and the Accept-Datetime header can be used to specify which timestamp to use in proxy mode. The Accept-Datetime header overrides any other timestamp setting in proxy mode.
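RFC 7089 defines the Accept-Datetime value as an HTTP date in GMT. As a minimal sketch, a 14-digit timestamp can be converted into such a header with the standard library. The function name is hypothetical, and the timestamp is assumed to be in UTC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(ts):
    """Convert a 14-digit timestamp (assumed UTC) to an HTTP date string."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # usegmt=True emits the "GMT" suffix instead of "+0000".
    return format_datetime(dt, usegmt=True)


headers = {"Accept-Datetime": accept_datetime("20070531203500")}
print(headers)
```

The resulting headers dictionary can be passed to any HTTP client that is configured to use pywb as its proxy.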

The main distinction from the standard is that the URI-R, the original resource, is not available in proxy mode. (It is simply the URL loaded without the proxy, which is not possible to specify via the URL alone).

URI-M Headers

When serving a URI-M (any archived url), the following additional headers are included in accordance with Memento spec:

(Note: the Content-Location may also be included in case of fuzzy-matching response, where the actual/canonical url is different than requested url due to an inexact match)

OpenWayback Transition Guide

This guide provides guidelines for transitioning from OpenWayback to pywb, with additional recommendations. The main recommendation is to run pywb along with OutbackCDX and nginx, and this configuration is covered below, along with additional options.

OpenWayback vs pywb Terms

pywb and OpenWayback use slightly different terms to describe the configuration options, as explained below.

Some differences are:
  • The wayback.xml config file in OpenWayback is replaced with config.yaml
  • The terms Access Point and Wayback Collection are replaced with Collection in pywb. The collection configuration represents a unique path (access point) and the data that is accessed at that path.
  • The Resource Store in OpenWayback is known in pywb as the archive paths, configured under archive_paths
  • The Resource Index in OpenWayback is known in pywb as the index paths, configurable under index_paths
  • The Exclusions in OpenWayback are replaced with general Embargo and Access Control

Pywb Collection Basics

A pywb collection must consist of a minimum of three parts: the collection name, the index_paths (where to read the index), and the archive_paths (where to read the WARC files).

The collection is accessed by name, so there is no distinct access point.

The collections are configured in the config.yaml under the collections key.

For example, a basic collection definition can be specified via:

collections:
    wayback:
        index_paths: /archive/cdx/
        archive_paths: /archive/storage/warcs/

Pywb also supports a convention-based directory structure. Collections created in this structure can be detected automatically and need not be specified in the config.yaml. This structure is designed for smaller collections that are all stored locally in a subdirectory.

See the Directory Structure for the default pywb directory structure.

However, for importing existing collections from OpenWayback, it is probably easier to specify the existing paths as shown above.

Using OutbackCDX with pywb

The recommended setup is to run OutbackCDX alongside pywb. OutbackCDX provides an index (CDX) server and can efficiently store and look up web archive data by URL.

Adding CDX to OutbackCDX

To set up OutbackCDX, please follow the instructions on the OutbackCDX README.

Since pywb also uses the default port 8080, be sure to use a different port for OutbackCDX, eg. java -jar outbackcdx*.jar -p 8084.

OutbackCDX can generally ingest existing CDX used in OpenWayback simply by POSTing to OutbackCDX at a new index endpoint.

For example, assuming OutbackCDX is running on port 8084, to add CDX for index1.cdx, index2.cdx, run:

curl -X POST --data-binary @index1.cdx http://localhost:8084/mycoll
curl -X POST --data-binary @index2.cdx http://localhost:8084/mycoll

The contents of each CDX file are added to the mycoll OutbackCDX index, which can correspond to the web archive collection mycoll. The index is created automatically if it does not exist.
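The same POST can be issued from Python. The sketch below mirrors the curl commands using only the standard library; the function names are hypothetical, and the port and index name are the ones assumed in the example above.

```python
import urllib.request


def build_ingest_request(cdx_bytes, index_url):
    """Prepare a POST request carrying raw CDX lines as the body."""
    return urllib.request.Request(index_url, data=cdx_bytes, method="POST")


def ingest_cdx(path, index_url):
    """Read a CDX file and POST it to an OutbackCDX index endpoint."""
    with open(path, "rb") as fh:
        request = build_ingest_request(fh.read(), index_url)
    with urllib.request.urlopen(request) as response:
        return response.status


# Example (requires a running OutbackCDX instance):
# ingest_cdx("index1.cdx", "http://localhost:8084/mycoll")
```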

See the OutbackCDX Docs for more info on ingesting CDX.

(Re)generating CDX from WARCs

There are some exceptions where it may be useful to re-generate the CDX with pywb for existing WARCs:

  • If your CDX is 9-field and does not include the compressed length, regenerating the CDX will result in more efficient HTTP range requests
  • If you want to replay pages with POST requests, pywb generated CDX will soon be supported in OutbackCDX (see: Issue #585, Issue #91 )

To generate the CDX, run the cdx-indexer command (with -p flag for POST request handling) for each WARC or set of WARCs you wish to index:

cdx-indexer /path/to/mywarcs/my.warc.gz > ./index1.cdx
cdx-indexer /path/to/all_warcs/*.warc.gz > ./index2.cdx

Then, run the POST command as shown above to ingest to OutbackCDX.

The above can be repeated for each WARC file, or for a set of WARCs using the *.warc.gz wildcard.

If a CDX index is too big, OutbackCDX may fail and ingesting an index per-WARC may be needed.
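Per-WARC indexing can be scripted. The following sketch only plans one cdx-indexer invocation per WARC and, optionally, runs them; the function names and the output naming scheme (<warc name>.cdx) are assumptions made for illustration, and POSIX-style paths are assumed.

```python
import subprocess
from pathlib import PurePosixPath


def per_warc_jobs(warc_paths, out_dir):
    """Plan one (command, output file) pair per WARC file."""
    jobs = []
    for warc in warc_paths:
        cdx_name = PurePosixPath(warc).name + ".cdx"
        out_path = str(PurePosixPath(out_dir) / cdx_name)
        jobs.append((["cdx-indexer", warc], out_path))
    return jobs


def run_jobs(jobs):
    """Run each planned command, redirecting stdout to its CDX file."""
    for command, out_path in jobs:
        with open(out_path, "wb") as out:
            subprocess.run(command, stdout=out, check=True)


print(per_warc_jobs(["/path/to/mywarcs/my.warc.gz"], "/tmp/cdx"))
```

Each resulting CDX file can then be POSTed to OutbackCDX separately.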

Configure pywb with OutbackCDX

The config.yaml should be configured to point to OutbackCDX.

Assuming a collection named mycoll, the config.yaml can be configured as follows to use OutbackCDX:

collections:
  mycoll:
    index_paths: cdx+http://localhost:8084/mycoll
    archive_paths: /path/to/mywarcs/

The archive_paths can be configured to point to a directory of WARCs or a path index.

Migrating CDX

If you are not using OutbackCDX, you may need to check on the format of the CDX files that you are using.

Over the years, there have been many variations on the CDX (capture index) format which is used by OpenWayback and pywb to look up captures in WARC/ARC files.

When migrating CDX from OpenWayback, there are a few options.

pywb currently supports:

  • 9 field CDX (surt-ordered)
  • 11 field CDX (surt-ordered)
  • CDXJ (surt-ordered)

pywb supports the same 11-field and 9-field CDX formats that are used in OpenWayback.

Non-SURT ordered CDXs are not currently supported, though they may be supported in the future (see this pending pull request).

CDXJ Conversion

The native format used by pywb is the CDXJ Format with SURT-ordering. It stores most of the index fields in a JSON dictionary, which makes it more flexible and allows optional fields to be added as needed.

If your CDX files are not SURT-ordered, are 11- or 9-field CDX, or are a mix of formats, pywb also offers a conversion utility which will convert all CDX to the pywb native CDXJ:

wb-manager cdx-convert <dir-of-cdx-files>

The converter will read the CDX files and create a corresponding .cdxj file for every cdx file. Since the conversion happens on the .cdx itself, it does not require reindexing the source WARC/ARC files and can happen fairly quickly. The converted CDXJ are guaranteed to be in the right format to work with pywb.

Converting OpenWayback Config to pywb Config

OpenWayback includes many different types of configurations.

For most use cases, using OutbackCDX with pywb is the recommended approach, as explained in Using OutbackCDX with pywb.

The following are a few specific examples of WaybackCollections gathered from active OpenWayback configurations and how they can be configured for use with pywb.

Remote Collection / Access Point

A collection configured with a remote index and WARC access can be converted to use OutbackCDX for the remote index, while pywb can load WARCs directly from an HTTP endpoint.

For example, a configuration similar to:

<bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
  <property name="accessPointPath" value="/wayback/"/>
  <property name="collection" ref="remotecollection" />
  ...
</bean>

<bean id="remotecollection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.SimpleResourceStore">
      <property name="prefix" value="http://myarchive.example.com/RemoteStore/" />
    </bean>
  </property>
  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
      <property name="searchUrlBase" value="http://myarchive.example.com/RemoteIndex" />
    </bean>
  </property>
</bean>

can be converted to the following config, with OutbackCDX assumed to be running at: http://myarchive.example.com/RemoteIndex

collections:
    wayback:
        index_paths: cdx+http://myarchive.example.com/RemoteIndex
        archive_paths: http://myarchive.example.com/RemoteStore/

Local Collection / Access Point

An OpenWayback configuration with a local collection and local CDX, for example:

<bean id="collection" class="org.archive.wayback.webapp.WaybackCollection">
   <property name="resourceIndex">
     <bean class="org.archive.wayback.resourceindex.cdxserver.EmbeddedCDXServerIndex">
       ...
       <property name="cdxServer">
         <bean class="org.archive.cdxserver.CDXServer">
           <property name="cdxSource">
             <bean class="org.archive.format.cdx.MultiCDXInputSource">
               <property name="cdxUris">
                 <list>
                   <value>/wayback/cdx/mycdx1.cdx</value>
                   <value>/wayback/cdx/mycdx2.cdx</value>
                 </list>
               </property>
             </bean>
           </property>
           <property name="cdxFormat" value="cdx11"/>
           <property name="surtMode" value="true"/>
         </bean>
       </property>
       ...
     </bean>
   </property>
 </bean>

can be configured in pywb using the index_paths key.

Note that the CDX files should all be in the same format. See Migrating CDX for more info on converting CDX to pywb native CDXJ format.

collections:
    wayback:
        index_paths: /wayback/cdx/
        archive_paths: ...

It’s also possible to combine directories, individual CDX files, and even a remote index from OutbackCDX in a single collection (as long as all CDX are in the same format).

pywb will query all the sources simultaneously to find the best match.

collections:
    wayback:
        index_group:
            cdx1: /wayback/cdx1/
            cdx2: /wayback/cdx2/mycdx.cdx
            remote: cdx+https://myarchive.example.com/outbackcdx

        archive_paths: ...

However, OutbackCDX is still recommended to avoid more complex CDX configurations.

WatchedCDXSource

OpenWayback includes a ‘Watched CDX Source’ option which watches a directory for new CDX indexes. This functionality is the default in pywb when specifying a directory for the index path.

For example, the config:

<property name="source">
  <bean class="org.archive.wayback.resourceindex.WatchedCDXSource">
    <property name="recursive" value="false" />
    <property name="filters">
      <list>
        <value>^.+\.cdx$</value>
      </list>
    </property>
    <property name="path" value="/wayback/cdx-index/" />
  </bean>
</property>

can be replaced with:

collections:
    wayback:
        index_paths: /wayback/cdx-index/
        archive_paths: ...

pywb will load all CDX from that directory.

ZipNum Cluster Index

pywb also supports using a compressed ZipNum Sharded Index instead of a plain text CDX. For example, the following OpenWayback configuration:

<bean id="collection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
      ...
      <property name="source">
        <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
          <property name="cluster">
            <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
              <property name="summaryFile" value="/webarchive/zipnum-cdx/all.summary"></property>
              <property name="locFile" value="/webarchive/zipnum-cdx/all.loc"></property>
            </bean>
          </property>
        ...
    </bean>
  </property>
</bean>

can simply be converted to the pywb config:

collections:
  wayback:
    index_paths: /webarchive/zipnum-cdx

    # if the index is not surt ordered
    surt_ordered: false

pywb will automatically determine the .summary and use the .loc files for the ZipNum Cluster if they are present in the directory.

Note that if the ZipNum index is not SURT ordered, the surt_ordered: false flag must be added to support this format.

Path Index Configuration

OpenWayback supports a ‘path index’ that can be used to look up a WARC by filename and map to an exact path. For compatibility, pywb supports the same path index lookup, as well as loading WARC files by path or URL prefix.

For example, an OpenWayback configuration that includes a path index:

<bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
  <property name="path" value="/archive/warc-paths.txt"/>
</bean>

<bean id="resourceStore" class="org.archive.wayback.resourcestore.LocationDBResourceStore">
  <property name="db" ref="resourcefilelocationdb" />
</bean>

can be configured in the archive_paths field of pywb collection configuration:

collections:
    wayback:
        index_paths: ...
        archive_paths: /archive/warc-paths.txt

The path index is a tab-delimited text file for mapping WARC filenames to full file paths or URLs, eg:

example.warc.gz<tab>/some/path/to/example.warc.gz
another.warc.gz<tab>/some-other/path/another.warc.gz
remote.warc.gz<tab>http://warcstore.example.com/serve/remote.warc.gz

However, if all WARC files are stored in the same directory, or in a few directories, a path index is not needed and pywb will try loading the WARC by prefix.

The archive_paths can accept a list of entries. For example, given the config:

collections:
    wayback:
        index_paths: ...
        archive_paths:
          - /archive/warcs1/
          - /archive/warcs2/
          - https://myarchive.example.com/warcs/
          - /archive/warc-paths.txt

And the WARC file: example.warc.gz, pywb will try to find the WARC in order from:

1. /archive/warcs1/example.warc.gz
2. /archive/warcs2/example.warc.gz
3. https://myarchive.example.com/warcs/example.warc.gz
4. Looking up example.warc.gz in /archive/warc-paths.txt
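The lookup order above can be sketched as follows. This is an illustration only and not pywb's resolver: the function names are hypothetical, prefix entries are assumed to end with a slash, and already-parsed path indexes are passed in explicitly because this section does not describe how pywb tells a path index apart from a directory.

```python
def parse_path_index(text):
    """Parse tab-delimited 'filename<tab>full path or URL' lines."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, location = line.partition("\t")
        index[name] = location
    return index


def candidate_locations(filename, archive_paths, path_indexes):
    """Yield locations to try for filename, in archive_paths order."""
    for entry in archive_paths:
        if entry in path_indexes:
            # Path index: look the filename up and use the mapped location.
            location = path_indexes[entry].get(filename)
            if location:
                yield location
        else:
            # Directory or URL prefix: append the filename.
            yield entry + filename


index = parse_path_index("example.warc.gz\t/some/path/to/example.warc.gz\n")
archive_paths = [
    "/archive/warcs1/",
    "/archive/warcs2/",
    "https://myarchive.example.com/warcs/",
    "/archive/warc-paths.txt",
]
for location in candidate_locations(
    "example.warc.gz", archive_paths, {"/archive/warc-paths.txt": index}
):
    print(location)
```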

Proxy Mode Access

An OpenWayback configuration may include many beans to support proxy mode, eg:

<bean id="proxyreplaydispatcher" class="org.archive.wayback.replay.SelectorReplayDispatcher">
  ...
     <property name="renderer">
          <bean class="org.archive.wayback.proxy.HttpsRedirectAndLinksRewriteProxyHTMLMarkupReplayRenderer">
            ...
              <property name="uriConverter">
                  <bean class="org.archive.wayback.proxy.ProxyHttpsResultURIConverter"/>
              </property>
          </bean>
      </property>
</bean>
<bean name="proxy" class="org.archive.wayback.webapp.AccessPoint">
  <property name="internalPort" value="${proxy.port}"/>
  <property name="accessPointPath" value="${proxy.port}" />
  <property name="collection" ref="localcdxcollection" />
   ...
</bean>

In pywb, the proxy mode can be enabled by adding to the main config.yaml the name of the collection that should be served in proxy mode:

proxy:
  source_coll: wayback

There are some differences between OpenWayback and pywb proxy mode support.

In OpenWayback, proxy mode is configured using separate access points for different collections on different ports. OpenWayback only supports HTTP proxy and attempts to rewrite HTTPS URLs to HTTP.

In pywb, proxy mode is enabled on the same port as regular access, and pywb supports HTTP and HTTPS proxy. pywb does not attempt to rewrite HTTPS to HTTP, as most browsers disallow HTTP access as insecure for many sites. pywb supports a default collection that is enabled for proxy mode, and a default timestamp accessed by the proxy mode. (Switching the collection and date accessed is possible but not currently supported without extensions to pywb).

To support HTTPS access, pywb provides a certificate authority that can be trusted by a browser to rewrite HTTPS content.

See HTTP/S Proxy Mode for all of the options of pywb proxy mode configuration.

Migrating Exclusion Rules

pywb includes a new Embargo and Access Control system, which allows granular allow/block/exclude access control rules on paths and subpaths.

The rules are configured in .aclj files, and a command-line utility exists to import OpenWayback exclusions into the pywb ACLJ format.

For example, given an OpenWayback exclusion list configuration for a static file:

<bean id="excluder-factory-static" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory">
  <property name="file" value="/archive/exclusions.txt"/>
  <property name="checkInterval" value="600000" />
</bean>

The exclusions file can be converted to an .aclj file by running:

wb-manager acl importtxt /archive/exclusions.aclj /archive/exclusions.txt exclude

Then, in the pywb config, specify:

collections:
    wayback:
        index_paths: ...
        archive_paths: ...
        acl_paths: /archive/exclusions.aclj

It is possible to specify multiple access control files, which will all be applied.

Using block instead of exclude will result in pywb returning a 451 error, indicating that URLs are in the index but blocked.

CLI Tool

After exclusions have been imported, it is recommended to use the wb-manager acl command-line tool for managing exclusions.

To add an exclusion, run:

wb-manager acl add /archive/exclusions.aclj http://httpbin.org/anything/something exclude

To remove an exclusion, run:

wb-manager acl remove /archive/exclusions.aclj http://httpbin.org/anything/something

For more options, see the full Embargo and Access Control documentation or run wb-manager acl --help.

Not Yet Supported

Some OpenWayback exclusion options are not yet supported in pywb. The following is not yet supported in the access control system:

  • Exclusions/Access Control By specific date range
  • Regex based exclusions
  • Date Range Embargo on All URLs
  • Robots.txt-based exclusions

Deploying pywb: Collection Paths and routing with Nginx/Apache

In pywb, the collection name is also the access point, and each of the collections in config.yaml can be accessed by their name as the subpath:

collections:
  wayback:
      ...

  another-collection:
      ...

If pywb is deployed on port 8080, each collection will be available under: http://<hostname>/wayback/*/https://example.com/ and http://<hostname>/another-collection/*/https://example.com/

To make a collection available under the root, simply set its name to: $root

collections:
  $root:
      ...

  another-collection:
      ...

Now, the first collection is available at: http://<hostname>/*/https://example.com/.
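The mapping from collection name to URL path can be summarized in a short sketch. The function names are hypothetical, and the hostname used in the example call is a placeholder.

```python
def collection_prefix(coll):
    """Return the URL path prefix for a collection name."""
    # The special name $root mounts the collection at the root path.
    return "" if coll == "$root" else "/" + coll


def capture_query_url(hostname, coll, url):
    """Build the /*/ query URL for url in the given collection."""
    return f"http://{hostname}{collection_prefix(coll)}/*/{url}"


print(capture_query_url("localhost:8080", "wayback", "https://example.com/"))
print(capture_query_url("localhost:8080", "$root", "https://example.com/"))
```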

To deploy pywb on a subdirectory, eg. http://<hostname>/pywb/another-collection/*/https://example.com/, and in general for production use, it is recommended to deploy pywb behind an Nginx or Apache reverse proxy.

Nginx and Apache Reverse Proxy

The recommended deployment for pywb is with uWSGI and behind an Nginx or Apache frontend.

This configuration allows for a more robust deployment and lets these servers handle static files.

See the Sample Nginx Configuration and Sample Apache Configuration sections for more info on deploying with Nginx and Apache.

Working Docker Compose Examples

The pywb Deployment Examples include working examples of deploying pywb with Nginx, Apache and OutbackCDX in Docker using Docker Compose, widely available container orchestration tools.

See Installing Docker and Installing Docker Compose for instructions on how to install these tools.

The examples are available in the sample-deploy directory of the pywb repo. The examples include:

  • docker-compose-outback.yaml – Docker Compose config to start OutbackCDX and pywb, and ingest sample data into OutbackCDX
  • docker-compose-nginx.yaml – Docker Compose config to launch pywb and latest Nginx, with pywb running on subdirectory /wayback and Nginx serving static files from pywb.
  • docker-compose-apache.yaml – Docker Compose config to launch pywb and latest Apache, with pywb running on subdirectory /wayback and Apache serving static files from pywb.

The examples are designed to be run one at a time, and assume port 8080 is available.

After installing Docker and Docker Compose, run one of:

  • docker-compose -f docker-compose-outback.yaml up
  • docker-compose -f docker-compose-nginx.yaml up
  • docker-compose -f docker-compose-apache.yaml up

This will download the standard Docker images and start all of the components in Docker.

If everything works correctly, you should be able to access: http://localhost:8080/pywb/https://example.com/ to view the sample pywb collection.

Press CTRL+C to interrupt and stop the example in the console.

pywb package

Subpackages

pywb.apps package

Submodules
pywb.apps.cli module
class pywb.apps.cli.BaseCli(args=None, default_port=8080, desc='')[source]

Bases: object

Base CLI class that provides the initial arg parser setup, calls load to receive the application to be started and starts the application.

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

run()[source]

Start the application

run_gevent()[source]

Creates the server that runs the application supplied by a subclass

class pywb.apps.cli.LiveCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.BaseCli

CLI class for starting pywb’s replay server in live mode

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

class pywb.apps.cli.ReplayCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.BaseCli

CLI class that adds the cli functionality specific to starting pywb’s Wayback Machine implementation

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

class pywb.apps.cli.WarcServerCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.BaseCli

CLI class for starting a WarcServer

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

class pywb.apps.cli.WaybackCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.ReplayCli

CLI class for starting pywb’s implementation of the Wayback Machine

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

pywb.apps.cli.get_version()[source]

Get the version of pywb

pywb.apps.cli.live_rewrite_server(args=None)[source]

Utility function for starting pywb’s Wayback Machine implementation in live mode

pywb.apps.cli.warcserver(args=None)[source]

Utility function for starting pywb’s WarcServer

pywb.apps.cli.wayback(args=None)[source]

Utility function for starting pywb’s Wayback Machine implementation

pywb.apps.frontendapp module
class pywb.apps.frontendapp.FrontEndApp(config_file=None, custom_config=None)[source]

Bases: object

Orchestrates pywb’s core Wayback Machine functionality and comprises two core sub-apps and three optional apps.

Sub-apps:
  • WarcServer: Serves the archive content (WARC/ARC and index) as well as from the live web in record/proxy mode
  • RewriterApp: Rewrites the content served by pywb (if it is to be rewritten)
  • WSGIProxMiddleware (Optional): If proxy mode is enabled, performs pywb’s HTTP(s) proxy functionality
  • AutoIndexer (Optional): If auto-indexing is enabled for the collections it is started here
  • RecorderApp (Optional): Recording functionality, available when recording mode is enabled

The rewriter app class is configurable via the class variable REWRITER_APP_CLS, which defaults to RewriterApp.

ALL_DIGITS = re.compile('^\\d+$')
CDX_API = 'http://localhost:%s/{coll}/index'
PROXY_CA_NAME = 'pywb HTTPS Proxy CA'
PROXY_CA_PATH = 'proxy-certs/pywb-ca.pem'
RECORD_API = 'http://localhost:%s/%s/resource/postreq?param.recorder.coll={coll}'
RECORD_ROUTE = '/record'
RECORD_SERVER = 'http://localhost:%s'
REPLAY_API = 'http://localhost:%s/{coll}/resource/postreq'
REWRITER_APP_CLS

alias of pywb.apps.rewriterapp.RewriterApp

classmethod create_app(port)[source]

Create a new instance of FrontEndApp that listens on port with a hostname of 0.0.0.0

Parameters:port (int) – The port FrontEndApp is to listen on
Returns:A new instance of FrontEndApp wrapped in GeventServer
Return type:GeventServer
get_coll_config(coll)[source]

Retrieve the collection config, including metadata, associated with a collection

Parameters:coll (str) – The name of the collection to receive config info for
Returns:The collection’s config
Return type:dict
get_upstream_paths(port)[source]

Retrieve a dictionary containing the full URLs of the upstream apps

Parameters:port (int) – The port used by the replay and cdx servers
Returns:A dictionary containing the upstream paths (replay, cdx-server, record [if enabled])
Return type:dict[str, str]
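
The URL templates listed among the class attributes mix two placeholder styles: %s is filled in once with the port, while {coll} is left in place to be filled in later per collection. A minimal sketch of that two-step formatting follows. The constants are copied from the listing above; sketch_upstream_paths, its source_coll parameter and the reading of the second %s in RECORD_API as a collection name are assumptions made for illustration, not pywb's code.

```python
CDX_API = 'http://localhost:%s/{coll}/index'
REPLAY_API = 'http://localhost:%s/{coll}/resource/postreq'
RECORD_API = 'http://localhost:%s/%s/resource/postreq?param.recorder.coll={coll}'


def sketch_upstream_paths(port, source_coll=None):
    """Fill in the port now; the {coll} placeholder is kept for later."""
    paths = {
        'replay': REPLAY_API % port,
        'cdx-server': CDX_API % port,
    }
    if source_coll is not None:
        # Assumption: the second %s names the collection recorded from.
        paths['record'] = RECORD_API % (port, source_coll)
    return paths


paths = sketch_upstream_paths(8080, source_coll='live')
# Second step: fill in the collection for one specific request.
cdx_url = paths['cdx-server'].format(coll='my-coll')
print(cdx_url)
```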
handle_request(environ, start_response)[source]

Retrieves the route handler and calls it, returning the handler’s response

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • start_response
Returns:

The WbResponse for the request

Return type:

WbResponse
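
The (environ, start_response) signature of handle_request is the standard WSGI calling convention. The sketch below shows the general shape of "look up a route handler, call it, return its response" using only the standard library. ROUTES, serve_home_sketch and the tuple returned by handlers are hypothetical simplifications; pywb's real routing and its WbResponse class are not reproduced here.

```python
from wsgiref.util import setup_testing_defaults


def serve_home_sketch(environ):
    # A handler returns (status line, headers, body iterable).
    return '200 OK', [('Content-Type', 'text/plain; charset=utf-8')], [b'home']


# Hypothetical route table mapping a path to its handler.
ROUTES = {'/': serve_home_sketch}


def handle_request_sketch(environ, start_response):
    handler = ROUTES.get(environ.get('PATH_INFO', '/'))
    if handler is None:
        status = '404 Not Found'
        headers = [('Content-Type', 'text/plain; charset=utf-8')]
        body = [b'not found']
    else:
        status, headers, body = handler(environ)
    start_response(status, headers)
    return body


def call(path):
    """Invoke the sketch app the way a WSGI server would."""
    environ = {'PATH_INFO': path}
    setup_testing_defaults(environ)
    seen = {}

    def start_response(status, headers):
        seen['status'] = status

    body = b''.join(handle_request_sketch(environ, start_response))
    return seen['status'], body


print(call('/'))
print(call('/missing'))
```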

init_autoindex(auto_interval)[source]

Initialize and start the auto-indexing of the collections. If auto_interval is None this is a no op.

Parameters:auto_interval (str|int) – The auto-indexing interval from the configuration file or CLI argument
init_proxy(config)[source]

Initialize and start proxy mode. If proxy configuration entry is not contained in the config this is a no op. Causes handler to become an instance of WSGIProxMiddleware.

Parameters:config (dict) – The configuration object used to configure this instance of FrontEndApp
init_recorder(recorder_config)[source]

Initialize the recording functionality of pywb. If recorder_config is None this function is a no op

Parameters:recorder_config (str|dict|None) – The configuration for the recorder app
Return type:None
is_proxy_enabled(environ)[source]

Returns T/F indicating if proxy mode is enabled

Parameters:environ (dict) – The WSGI environment dictionary for the request
Returns:T/F indicating if proxy mode is enabled
Return type:bool
is_valid_coll(coll)[source]

Determines if the collection name for a request is valid (exists)

Parameters:coll (str) – The name of the collection to check
Returns:True if the collection is valid, false otherwise
Return type:bool
proxy_fetch(env, url)[source]

Proxy-mode-only endpoint that handles OPTIONS requests and CORS fetches for the Preservation Worker.

Due to normal cross-origin browser restrictions in proxy mode, the auto fetch worker cannot access the CSS rules of cross-origin style sheets and must re-fetch them in a CORS-safe manner. This endpoint facilitates that by fetching the stylesheets for the auto fetch worker and then responding with their contents.

Parameters:
  • env (dict) – The WSGI environment dictionary
  • url (str) – The URL of the resource to be fetched
Returns:

WbResponse that is either the response to an OPTIONS request or the result of fetching url

Return type:

WbResponse

proxy_route_request(url, environ)[source]

Return the full url that this proxy request will be routed to. The ‘environ’ PATH_INFO and REQUEST_URI will be modified based on the returned url.

Default is to use the ‘proxy_prefix’ to point to the proxy collection

put_custom_record(environ, coll='$root')[source]

When recording, PUT a custom WARC record to the specified collection (Available only when recording)

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
raise_not_found(environ, err_type, url)[source]

Utility function for raising a werkzeug.exceptions.NotFound exception with the supplied WSGI environment and message.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • err_type (str) – The identifier for type of error that occurred
  • url (str) – The url of the archived page that was requested
serve_cdx(environ, coll='$root')[source]

Make the upstream CDX query for a collection and respond with the results of the query

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection this CDX query is for
Returns:

The WbResponse containing the results of the CDX query

Return type:

WbResponse

serve_coll_page(environ, coll='$root')[source]

Render and serve a collection’s search page (search.html).

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection to serve the search page for
Returns:

The WbResponse containing the collection’s search page

Return type:

WbResponse

serve_content(environ, coll='$root', url='', timemap_output='', record=False)[source]

Serve the contents of a URL/Record rewriting the contents of the response when applicable.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
  • url (str) – The URL for the corresponding record to be served if it exists
  • timemap_output (str) – The contents of the timemap included in the link header of the response
  • record (bool) – Should the content being served be recorded (saved to a WARC). Only valid in record mode
Returns:

WbResponse containing the contents of the record/URL

Return type:

WbResponse

serve_home(environ)[source]

Serves the home (/) view of pywb (not a collection)

Parameters:environ (dict) – The WSGI environment dictionary for the request
Returns:The WbResponse for serving the home (/) path
Return type:WbResponse
serve_listing(environ)[source]

Serves the response for WARCServer fixed and dynamic listing (paths)

Parameters:environ (dict) – The WSGI environment dictionary for the request
Returns:WbResponse containing the frontend app’s WARCServer URL paths
Return type:WbResponse
serve_record(environ, coll='$root', url='')[source]

Serve a URL’s content from a WARC/ARC record in replay mode or from the live web in live, proxy, and record mode.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
  • url (str) – The URL for the corresponding record to be served if it exists
Returns:

WbResponse containing the contents of the record/URL

Return type:

WbResponse

serve_static(environ, coll='', filepath='')[source]

Serve a static file associated with a specific collection or one of pywb’s own static assets

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The collection the static file is associated with
  • filepath (str) – The file path (relative to the collection) for the static asset
Returns:

The WbResponse for the static asset

Return type:

WbResponse

setup_paths(environ, coll, record=False)[source]

Populates the WSGI environment dictionary with the path information necessary to perform a response for content or record.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
  • record (bool) – Should the content being served be recorded (saved to a WARC). Only valid in record mode
class pywb.apps.frontendapp.MetadataCache(template_str)[source]

Bases: object

This class holds the collection metadata template string and caches the metadata for a collection once it has been rendered. Cached metadata is updated if its corresponding file has been updated since the last cache time (based on file mtime).

get_all(routes)[source]

Load the metadata for all routes (collections) and populate the cache

Parameters:routes (list[str]) – List of collection names
Returns:A dictionary containing each collection’s metadata
Return type:dict
load(coll)[source]

Load and receive the metadata associated with a collection.

If the metadata for the collection is not cached yet, its metadata file is read in and stored. If the cache has seen the collection before, the mtime of the metadata file is checked: if it is more recent than the cached time, the cache is updated and the new metadata is returned; otherwise the cached version is returned.

Parameters:coll (str) – Name of a collection
Returns:The cached metadata for a collection
Return type:dict
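
The caching rule described for load() (re-read a file only when its mtime is newer than the cached one) can be sketched with the standard library alone. MtimeCache is a hypothetical stand-in, and JSON is used only to keep the example dependency-free; the actual metadata file format and the template rendering done by MetadataCache are not shown.

```python
import json
import os
import tempfile


class MtimeCache(object):
    """Hypothetical mtime-based cache; not pywb's MetadataCache."""

    def __init__(self):
        self.cache = {}  # path -> (mtime at cache time, parsed data)

    def load(self, path):
        mtime = os.path.getmtime(path)
        cached = self.cache.get(path)
        # Serve the cached copy unless the file is newer than the cache.
        if cached is not None and cached[0] >= mtime:
            return cached[1]
        with open(path) as fh:
            data = json.load(fh)
        self.cache[path] = (mtime, data)
        return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'metadata.json')
    with open(path, 'w') as fh:
        json.dump({'title': 'First'}, fh)
    os.utime(path, (1000, 1000))  # pin the mtime so the demo is deterministic

    cache = MtimeCache()
    first = cache.load(path)

    with open(path, 'w') as fh:
        json.dump({'title': 'Second'}, fh)
    os.utime(path, (1000, 1000))  # same mtime: the cached copy is served
    stale = cache.load(path)

    os.utime(path, (2000, 2000))  # newer mtime: the file is re-read
    fresh = cache.load(path)

print(first, stale, fresh)
```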
store_new(coll, path, mtime)[source]

Load a collection’s metadata file and store it

Parameters:
  • coll (str) – The name of the collection the metadata is for
  • path (str) – The path to the collection’s metadata file
  • mtime (float) – The current mtime of the collection’s metadata file
Returns:

The collection’s metadata

Return type:

dict

pywb.apps.live module
pywb.apps.rewriterapp module
class pywb.apps.rewriterapp.RewriterApp(framed_replay=False, jinja_env=None, config=None, paths=None)[source]

Bases: object

Primary application for rewriting the content served by pywb (if it is to be rewritten).

This class is also responsible for rendering the archive’s templates

DEFAULT_CSP = "default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'"
VIDEO_INFO_CONTENT_TYPE = 'application/vnd.youtube-dl_formats+json'
add_csp_header(wb_url, status_headers)[source]

Adds Content-Security-Policy headers to the supplied StatusAndHeaders instance if the wb_url’s mod is equal to the replay mod

Parameters:
  • wb_url (WbUrl) – The WbUrl for the URL being operated on
  • status_headers (warcio.StatusAndHeaders) – The status and headers instance for the reply to the URL

do_query(wb_url, kwargs)[source]

Performs the timemap query request for the supplied WbUrl returning the response

Parameters:
  • wb_url (WbUrl) – The WbUrl to be queried
  • kwargs (dict) – Optional keyword arguments
Returns:

The query’s response

Return type:

requests.Response

format_response(response, wb_url, full_prefix, is_timegate, is_proxy, timegate_closest_ts=None)[source]
get_base_url(wb_url, kwargs)[source]
get_full_prefix(environ)[source]
get_host_prefix(environ)[source]
get_rel_prefix(environ)[source]
get_top_frame_params(wb_url, kwargs)[source]
get_top_url(full_prefix, wb_url, cdx, kwargs)[source]
get_upstream_url(wb_url, kwargs, params)[source]
handle_custom_response(environ, wb_url, full_prefix, host_prefix, kwargs)[source]
handle_error(environ, wbe)[source]
handle_query(environ, wb_url, kwargs, full_prefix)[source]
handle_timemap(wb_url, kwargs, full_prefix)[source]
is_ajax(environ)[source]
is_framed_replay(wb_url)[source]

Returns T/F indicating if the rewriter app is configured to be operating in framed replay mode and the supplied WbUrl is also operating in framed replay mode

Parameters:wb_url (WbUrl) – The WbUrl instance to check
Returns:T/F if in framed replay mode
Return type:bool
is_preflight(environ)[source]
make_timemap(wb_url, res, full_prefix, output)[source]
prepare_env(environ)[source]

Set up the environ path prefixes and scheme

render_content(wb_url, kwargs, environ)[source]
send_redirect(new_path, url_parts, urlrewriter)[source]
unrewrite_referrer(environ, full_prefix)[source]
pywb.apps.static_handler module
class pywb.apps.static_handler.StaticHandler(static_path)[source]

Bases: object

pywb.apps.warcserverapp module
pywb.apps.wayback module
pywb.apps.wbrequestresponse module
class pywb.apps.wbrequestresponse.WbResponse(status_headers, value=None, **kwargs)[source]

Bases: object

Represents a pywb WSGI response object.

Holds a status_headers object and a response iter, to be returned to wsgi container.

add_access_control_headers(env=None)[source]

Adds Access-Control* HTTP headers to this WbResponse’s HTTP headers.

Parameters:env (dict) – The WSGI environment dictionary
Returns:The same WbResponse but with the values for the Access-Control* HTTP header added
Return type:WbResponse
add_range(*args)[source]

Add HTTP range header values to this response

Parameters:args (int) – The values for the range HTTP header
Returns:The same WbResponse but with the values for the range HTTP header added
Return type:WbResponse
static bin_stream(stream, content_type, status='200 OK', headers=None)[source]

Utility method for constructing a binary response.

Parameters:
  • stream (Any) – The response body stream
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
  • headers (list[tuple[str, str]]) – Additional headers for this response
Returns:

WbResponse that is a binary stream

Return type:

WbResponse

static encode_stream(stream)[source]

Utility method to encode a stream using utf-8.

Parameters:stream (Any) – The stream to be encoded using utf-8
Returns:A generator that yields the contents of the stream encoded as utf-8
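
A generator that encodes a stream as utf-8 can be very small. The sketch below is an assumption about the general shape of such a helper, written as a free function named encode_stream_sketch; it is not copied from pywb.

```python
def encode_stream_sketch(stream):
    """Lazily encode each text chunk of an iterable as utf-8 bytes."""
    for chunk in stream:
        yield chunk.encode('utf-8')


body = list(encode_stream_sketch(['caf\u00e9', ' ok']))
print(body)
```

Because the function is a generator, nothing is encoded until the WSGI server starts iterating over the response body.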
static json_response(obj, status='200 OK', content_type='application/json; charset=utf-8')[source]

Utility method for constructing a JSON response.

Parameters:
  • obj (dict) – The dictionary to be serialized in JSON format
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
Returns:

WbResponse JSON response

Return type:

WbResponse

static options_response(env)[source]

Construct WbResponse for OPTIONS based on the WSGI env dictionary

Parameters:env (dict) – The WSGI environment dictionary
Returns:The WbResponse for the options request
Return type:WbResponse
static redir_response(location, status='302 Redirect', headers=None)[source]

Utility method for constructing redirection response.

Parameters:
  • location (str) – The location of the resource redirecting to
  • status (str) – The HTTP status line
  • headers (list[tuple[str, str]]) – Additional headers for this response
Returns:

WbResponse redirection response

Return type:

WbResponse

static text_response(text, status='200 OK', content_type='text/plain; charset=utf-8')[source]

Utility method for constructing a text response.

Parameters:
  • text (str) – The text response body
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
Returns:

WbResponse text response

Return type:

WbResponse

static text_stream(stream, content_type='text/plain; charset=utf-8', status='200 OK')[source]

Utility method for constructing a streaming text response.

Parameters:
  • stream (Any) – The response body stream
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
Returns:

WbResponse that is a text stream

Return type:

WbResponse

try_fix_errors()[source]

Utility method to try to remove faulty headers from the response.

Returns:
Return type:None
Module contents

pywb.indexer package

Submodules
pywb.indexer.archiveindexer module
class pywb.indexer.archiveindexer.ArchiveIndexEntry[source]

Bases: pywb.indexer.archiveindexer.ArchiveIndexEntryMixin, dict

class pywb.indexer.archiveindexer.ArchiveIndexEntryMixin[source]

Bases: object

MIME_RE = re.compile('[; ]')
extract_mime(mime, def_mime='unk')[source]

Utility function to extract mimetype only from a full content type, removing charset settings
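
Splitting on MIME_RE is enough to drop charset and other parameters from a content type. The sketch below reuses the MIME_RE pattern from the listing above; extract_mime_sketch is a hypothetical free function that returns the result, whereas how the real method stores it on the index entry is not shown.

```python
import re

MIME_RE = re.compile('[; ]')  # pattern copied from the listing above


def extract_mime_sketch(mime, def_mime='unk'):
    """Return only the mimetype part of a full content type."""
    if not mime:
        return def_mime
    # Split once at the first ';' or space and keep the left-hand side.
    return MIME_RE.split(mime, 1)[0]


print(extract_mime_sketch('text/html; charset=utf-8'))
print(extract_mime_sketch(''))
```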

extract_status(status_headers)[source]

Extract status code only from status line

merge_request_data(other, options)[source]
reset_entry()[source]
set_rec_info(offset, length)[source]
class pywb.indexer.archiveindexer.DefaultRecordParser(**options)[source]

Bases: object

begin_payload(compute_digest, entry)[source]
create_payload_buffer(entry)[source]
create_record_iter(raw_iter)[source]
end_payload(entry)[source]
handle_payload(buff)[source]
join_request_records(entry_iter)[source]
open(filename)[source]
parse_arc_record(record)[source]

Parse arc record

parse_warc_record(record)[source]

Parse warc record

class pywb.indexer.archiveindexer.OrderedArchiveIndexEntry[source]

Bases: pywb.indexer.archiveindexer.ArchiveIndexEntryMixin, collections.OrderedDict

pywb.indexer.cdxindexer module
class pywb.indexer.cdxindexer.BaseCDXWriter(out)[source]

Bases: object

METADATA_NO_INDEX_TYPES = ('text/anvl',)
write(entry, filename)[source]
class pywb.indexer.cdxindexer.CDX09[source]

Bases: object

write_cdx_line(out, entry, filename)[source]
class pywb.indexer.cdxindexer.CDX11[source]

Bases: object

write_cdx_line(out, entry, filename)[source]
class pywb.indexer.cdxindexer.CDXJ[source]

Bases: object

write_cdx_line(out, entry, filename)[source]
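
For orientation, a CDXJ index line generally consists of a SURT-form URL key, a timestamp and a JSON dictionary, separated by spaces. The sketch below builds one such line; cdxj_line_sketch and the particular fields in the dictionary are illustrative assumptions, not the output of the CDXJ writer class.

```python
import json


def cdxj_line_sketch(urlkey, timestamp, fields):
    """Join key, timestamp and JSON payload into one index line."""
    return '{0} {1} {2}'.format(urlkey, timestamp, json.dumps(fields))


line = cdxj_line_sketch(
    'com,example)/',
    '20200102030405',
    {'url': 'http://example.com/', 'mime': 'text/html',
     'status': '200', 'filename': 'example.warc.gz'},
)
print(line)

# Reading it back: split off the first two fields, then parse the JSON.
key, ts, payload = line.split(' ', 2)
entry = json.loads(payload)
```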
class pywb.indexer.cdxindexer.SortedCDXWriter(out)[source]

Bases: pywb.indexer.cdxindexer.BaseCDXWriter

write(entry, filename)[source]
pywb.indexer.cdxindexer.cdx_filename(filename)[source]
pywb.indexer.cdxindexer.get_cdx_writer_cls(options)[source]
pywb.indexer.cdxindexer.iter_file_or_dir(inputs, recursive=True, rel_root=None)[source]
pywb.indexer.cdxindexer.main(args=None)[source]
pywb.indexer.cdxindexer.remove_ext(filename)[source]
pywb.indexer.cdxindexer.write_cdx_index(outfile, infile, filename, **options)[source]
pywb.indexer.cdxindexer.write_multi_cdx_index(output, inputs, **options)[source]
Module contents

pywb.manager package

Submodules
pywb.manager.aclmanager module
class pywb.manager.aclmanager.ACLManager(r)[source]

Bases: pywb.manager.manager.CollectionsManager

DEFAULT_FILE = 'access-rules.aclj'
SURT_RX = re.compile('([^:.]+[,)])+')
VALID_ACCESS = ('allow', 'block', 'exclude', 'allow_ignore_embargo')
add_excludes(r)[source]

Import old-style excludes, in url-per-line format

Parameters:r (argparse.Namespace) – Parsed result from ArgumentParser
add_rule(r)[source]

Adds a rule to the ACL manager

Parameters:r (argparse.Namespace) – The argparse namespace representing the rule to be added
Return type:None
find_match(r)[source]

Finds a matching acl rule

Parameters:r (argparse.Namespace) – Parsed result from ArgumentParser
Return type:None
classmethod init_parser(parser)[source]

Initializes an argument parser for acl commands

Parameters:parser (argparse.ArgumentParser) – The parser to be initialized
Return type:None
is_valid_auto_coll(coll_name)[source]

Returns T/F indicating if the supplied collection name is a valid collection

Parameters:coll_name – The collection name to check
Returns:T/F indicating a valid collection
Return type:bool
list_rules(r)[source]

Print the acl rules to stdout

Parameters:r (argparse.Namespace|None) – Not used
Return type:None
load_acl(must_exist=True)[source]

Loads the access control list

Parameters:must_exist (bool) – Does the acl file have to exist
Returns:T/F indicating load success
Return type:bool
print_rule(rule)[source]

Prints the supplied rule to stdout

Parameters:rule (CDXObject) – The rule to be printed
Return type:None
process(r)[source]

Process acl command

Parameters:r (argparse.Namespace) – Parsed result from ArgumentParser
Return type:None
remove_rule(r)[source]

Removes a rule from the acl file

Parameters:r (argparse.Namespace) – Parsed result from ArgumentParser
Return type:None
save_acl(r=None)[source]

Save the contents of the rules as cdxj entries to the access control list file

Parameters:r (argparse.Namespace|None) – Not used
Return type:None
to_key(url_or_surt, exact_match=False)[source]

If ‘url_or_surt’ is already a SURT, use it as is. If exact match, add the exact match suffix.

Parameters:
  • url_or_surt (str) – The url or surt to be converted to an acl key
  • exact_match (bool) – Should the exact match suffix be added to key
Return type:

str
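
The two rules in to_key (keep an existing SURT, otherwise convert, then optionally append a suffix) can be sketched as follows. SURT_RX is copied from the class attributes above. simple_surt is a deliberately simplified conversion that expects an absolute URL with a scheme, and EXACT_SUFFIX is an assumed value used only for illustration; pywb's real canonicalization handles many more cases.

```python
import re
from urllib.parse import urlsplit

SURT_RX = re.compile('([^:.]+[,)])+')  # copied from the listing above
EXACT_SUFFIX = '###'  # assumed value, for illustration only


def simple_surt(url):
    """Very simplified SURT: reversed host labels, ')' and the path."""
    parts = urlsplit(url)
    host = ','.join(reversed(parts.hostname.lower().split('.')))
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return host + ')' + path


def to_key_sketch(url_or_surt, exact_match=False):
    if SURT_RX.search(url_or_surt):
        key = url_or_surt  # already a SURT: use as is
    else:
        key = simple_surt(url_or_surt)
    if exact_match:
        key += EXACT_SUFFIX
    return key


print(to_key_sketch('http://example.com/path'))
print(to_key_sketch('com,example)/', exact_match=True))
```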

validate(log=False, correct=False)[source]

Validates the acl rules, returning T/F to indicate whether the list should be saved

Parameters:
  • log (bool) – Should the results of validating be logged to stdout
  • correct (bool) – Should invalid results be corrected and saved
Return type:

None

validate_access(access)[source]

Returns true if the supplied access value is valid otherwise terminates the process

Parameters:access (str) – The access value to be validated
Returns:True if valid
Return type:bool
validate_save(r=None, log=False)[source]

Validates the acl rules and saves the file

Parameters:
  • r (argparse.Namespace|None) – Not used
  • log (bool) – Should a report be printed to stdout
Return type:

None

pywb.manager.autoindex module
class pywb.manager.autoindex.AutoIndexer(colls_dir=None, interval=30, keep_running=True)[source]

Bases: object

AUTO_INDEX_FILE = 'autoindex.cdxj'
EXT_RX = re.compile('.*\\.w?arc(\\.gz)?$')
check_path()[source]
do_index(files)[source]
is_newer_than(path1, path2, track=False)[source]
run()[source]
start()[source]
stop()[source]
pywb.manager.locmanager module
class pywb.manager.locmanager.LocManager[source]

Bases: object

compile_catalog()[source]
extract_loc(locale, no_csv)[source]
extract_text()[source]
init_catalog(loc)[source]
classmethod init_parser(parser)[source]

Initializes an argument parser for localization commands

Parameters:parser (argparse.ArgumentParser) – The parser to be initialized
Return type:None
list_loc()[source]
process(r)[source]
remove_loc(locale)[source]
update_catalog(loc)[source]
update_loc(locale, no_csv)[source]
pywb.manager.manager module
class pywb.manager.manager.CollectionsManager(coll_name, colls_dir=None, must_exist=True)[source]

Bases: object

This utility is designed to simplify the creation and management of web archive collections

It may be used via cmdline to setup and maintain the directory structure expected by pywb

COLLS_DIR = 'collections'
COLL_RX = re.compile('^[\\w][-\\w]*$')
DEF_INDEX_FILE = 'index.cdxj'
WACZ_RX = re.compile('.*\\.wacz$')
WARC_RX = re.compile('.*\\.w?arc(\\.gz)?$')
add_archives(archives, uncompress_wacz=False)[source]
add_collection()[source]
add_template(template_name, force=False, ignore=False)[source]
change_collection(coll_name)[source]
index_merge(filelist, index_file)[source]
list_colls()[source]
list_templates()[source]
migrate_cdxj(path, force=False)[source]
reindex()[source]
remove_template(template_name, force=False)[source]
set_metadata(namevalue_pairs)[source]
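
The regular expressions listed as CollectionsManager class attributes describe which collection names and archive filenames are accepted. The sketch below copies the patterns from the listing above and shows what they match; how the manager reacts to a non-matching name is not shown here.

```python
import re

# Patterns copied from the class attributes listed above.
COLL_RX = re.compile('^[\\w][-\\w]*$')
WARC_RX = re.compile('.*\\.w?arc(\\.gz)?$')
WACZ_RX = re.compile('.*\\.wacz$')

names = ['my-coll', 'coll_2', '-bad', 'has space']
valid_names = [n for n in names if COLL_RX.match(n)]

files = ['a.warc.gz', 'b.arc', 'c.warc', 'd.wacz', 'e.txt']
warc_files = [f for f in files if WARC_RX.match(f)]
wacz_files = [f for f in files if WACZ_RX.match(f)]

print(valid_names, warc_files, wacz_files)
```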
pywb.manager.manager.get_input(msg)[source]
pywb.manager.manager.get_version()[source]

Get the version of pywb

pywb.manager.manager.main(args=None)[source]
pywb.manager.manager.main_wrap_exc()[source]
pywb.manager.migrate module
class pywb.manager.migrate.MigrateCDX(dir_)[source]

Bases: object

convert_to_cdxj()[source]
count_cdx()[source]
iter_cdx_files()[source]
Module contents

pywb.recorder package

Submodules
pywb.recorder.filters module
class pywb.recorder.filters.CollectionFilter(accept_colls)[source]

Bases: pywb.recorder.filters.SkipDefaultFilter

skip_response(path, req_headers, resp_headers, params)[source]
class pywb.recorder.filters.ExcludeHttpOnlyCookieHeaders[source]

Bases: object

HTTPONLY_RX = re.compile(';\\s*HttpOnly\\s*(;|$)', re.IGNORECASE)
class pywb.recorder.filters.ExcludeSpecificHeaders(exclude_headers=None)[source]

Bases: object

class pywb.recorder.filters.SkipDefaultFilter[source]

Bases: object

skip_request(path, req_headers)[source]
skip_response(path, req_headers, resp_headers, params)[source]
class pywb.recorder.filters.SkipDupePolicy[source]

Bases: object

class pywb.recorder.filters.SkipRangeRequestFilter[source]

Bases: pywb.recorder.filters.SkipDefaultFilter

skip_request(path, req_headers)[source]
class pywb.recorder.filters.WriteDupePolicy[source]

Bases: object

class pywb.recorder.filters.WriteRevisitDupePolicy[source]

Bases: object

pywb.recorder.multifilewarcwriter module
class pywb.recorder.multifilewarcwriter.MultiFileWARCWriter(dir_template, filename_template=None, max_size=0, max_idle_secs=1800, *args, **kwargs)[source]

Bases: warcio.warcwriter.BaseWARCWriter

FILE_TEMPLATE = 'rec-{timestamp}-{hostname}.warc.gz'
allow_new_file(filename, params)[source]
close()[source]
close_file(match_filename)[source]
close_idle_files()[source]
close_key(dir_key)[source]
get_dir_key(params)[source]
get_new_filename(dir_, params)[source]
iter_open_files()[source]
write_record(record, params=None)[source]
write_stream_to_file(params, stream)[source]
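
FILE_TEMPLATE uses str.format-style placeholders for the timestamp and hostname. The sketch below fills it in; new_filename_sketch and the 14-digit timestamp format are assumptions for illustration, and the writer's own timestamp format may differ.

```python
import datetime

FILE_TEMPLATE = 'rec-{timestamp}-{hostname}.warc.gz'  # from the listing above


def new_filename_sketch(now, hostname):
    """Fill the template with a timestamp (format assumed) and a hostname."""
    return FILE_TEMPLATE.format(
        timestamp=now.strftime('%Y%m%d%H%M%S'),
        hostname=hostname,
    )


name = new_filename_sketch(datetime.datetime(2020, 1, 2, 3, 4, 5), 'archive01')
print(name)
```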
class pywb.recorder.multifilewarcwriter.PerRecordWARCWriter(*args, **kwargs)[source]

Bases: pywb.recorder.multifilewarcwriter.MultiFileWARCWriter

pywb.recorder.recorderapp module
class pywb.recorder.recorderapp.RecorderApp(upstream_host, writer, skip_filters=None, **kwargs)[source]

Bases: object

static create_default_filters(kwargs)[source]
static default_create_buffer(params, name)[source]
handle_call(environ, start_response)[source]
send_error(exc, start_response)[source]
send_message(msg, status, start_response)[source]
class pywb.recorder.recorderapp.ReqWrapper(stream, req_headers, params, create_func)[source]

Bases: pywb.recorder.recorderapp.Wrapper

close()[source]
class pywb.recorder.recorderapp.RespWrapper(stream, headers, req, params, queue, path, create_func)[source]

Bases: pywb.recorder.recorderapp.Wrapper

close()[source]
class pywb.recorder.recorderapp.Wrapper(stream, params, create_func)[source]

Bases: object

read(*args, **kwargs)[source]
pywb.recorder.redisindexer module
class pywb.recorder.redisindexer.RedisPendingCounterTempBuffer(max_size, redis_url, params, name, timeout=30)[source]

Bases: tempfile.SpooledTemporaryFile

close()[source]
write(buf)[source]
class pywb.recorder.redisindexer.WritableRedisIndexer(*args, **kwargs)[source]

Bases: pywb.warcserver.index.indexsource.RedisIndexSource

add_urls_to_index(stream, params, filename, length)[source]
add_warc_file(full_filename, params)[source]
lookup_revisit(lookup_params, digest, url, iso_dt)[source]
Module contents

pywb.rewrite package

Submodules
pywb.rewrite.content_rewriter module
class pywb.rewrite.content_rewriter.BaseContentRewriter(rules_file, replay_mod='')[source]

Bases: object

CHARSET_REGEX = re.compile(b'<meta[^>]*?[\\s;"\']charset\\s*=[\\s"\']*([^\\s"\'/>]*)')
TITLE = re.compile('<\\s*title\\s*>(.*)<\\s*\\/\\s*title\\s*>', re.IGNORECASE|re.MULTILINE|re.DOTALL)
add_prefer_mod(pref, mod)[source]
add_rewriter(rw)[source]
create_rewriter(text_type, rule, rwinfo, cdx, head_insert_func=None)[source]
extract_html_charset(buff)[source]
get_head_insert(rwinfo, rule, head_insert_func, cdx)[source]
get_rewrite_types()[source]
get_rewriter(rw_type, rwinfo=None)[source]
get_rule(cdx)[source]
get_rw_class(rule, text_type, rwinfo)[source]
has_custom_rules(rule, cdx)[source]
html_unescape()

Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding unicode characters. This function uses the rules defined by the HTML 5 standard for both valid and invalid character references, and the list of HTML 5 named character references defined in html.entities.html5.

init_js_regexs(regexs)[source]
load_rules(filename)[source]
mod_to_prefer(mod)[source]
parse_rewrite_rule(config)[source]
prefer_to_mod(pref)[source]
rewrite_headers(rwinfo)[source]
classmethod set_unescape(unescape)[source]
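
Two of the BaseContentRewriter members above can be tried out in isolation. CHARSET_REGEX is copied from the class attributes, and the html_unescape docstring matches that of Python's standard html.unescape, so the standard function is used here. extract_html_charset_sketch is a hypothetical helper; the return value of the real extract_html_charset method is not documented above.

```python
import html
import re

# Pattern copied from the class attributes listed above.
CHARSET_REGEX = re.compile(
    b'<meta[^>]*?[\\s;"\']charset\\s*=[\\s"\']*([^\\s"\'/>]*)')


def extract_html_charset_sketch(buff):
    """Return the charset declared in a <meta> tag, or None."""
    match = CHARSET_REGEX.search(buff)
    return match.group(1).decode('ascii') if match else None


short_form = extract_html_charset_sketch(b'<meta charset="utf-8">')
long_form = extract_html_charset_sketch(
    b'<meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1">')
unescaped = html.unescape('&gt; &#62; &#x3e;')

print(short_form, long_form, unescaped)
```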
class pywb.rewrite.content_rewriter.BufferedRewriter(url_rewriter=None)[source]

Bases: object

rewrite_stream(stream, rwinfo)[source]
class pywb.rewrite.content_rewriter.RewriteInfo(record, content_rewriter, url_rewriter, cookie_rewriter=None)[source]

Bases: object

JSONP_CONTAINS = ['callback=jQuery', 'callback=jsonp', '.json?']
JSON_REGEX = re.compile(b'^\\s*[{[][{"]')
TAG_REGEX = re.compile(b'^(\xef\xbb\xbf)?\\s*\\<')
TAG_REGEX2 = re.compile(b'^.*<[!]?\\w+[\\s>]')
content_stream
is_identity()[source]
is_url_rw()[source]
read_and_keep(size)[source]
should_rw_content()[source]
class pywb.rewrite.content_rewriter.StreamingRewriter(url_rewriter, align_to_line=True, first_buff='')[source]

Bases: object

final_read()[source]
rewrite(string)[source]
rewrite_complete(string, **kwargs)[source]
rewrite_text_stream_to_gen(stream, rwinfo)[source]

Convert stream to generator by applying the rewriting func to each portion of the stream. Align to line boundaries if needed.
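
Line alignment matters because a fixed-size read can cut a URL in half, and a rewriting function that only sees one half will miss it. The sketch below shows the idea on a text stream. stream_to_gen_sketch, add_prefix and the chunk sizes are hypothetical; the real method's buffering details are not reproduced.

```python
import io


def stream_to_gen_sketch(stream, rewrite_func, chunk_size=16,
                         align_to_line=True):
    """Yield rewritten chunks, optionally extended to a line boundary."""
    while True:
        buff = stream.read(chunk_size)
        if not buff:
            break
        # Extend the chunk to the next newline so no token is split.
        if align_to_line and not buff.endswith('\n'):
            buff += stream.readline()
        yield rewrite_func(buff)


def add_prefix(text):
    # Hypothetical rewriting func: prefix root-relative links.
    return text.replace('href="/', 'href="/coll/')


PAGE = '<a href="/one">1</a>\n<a href="/two">2</a>\n'

aligned = ''.join(stream_to_gen_sketch(io.StringIO(PAGE), add_prefix))
unaligned = ''.join(stream_to_gen_sketch(
    io.StringIO(PAGE), add_prefix, chunk_size=8, align_to_line=False))

print(aligned)
print(unaligned)
```

With alignment every link is rewritten; without it, the first link straddles two chunks and is left untouched.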

pywb.rewrite.cookie_rewriter module
class pywb.rewrite.cookie_rewriter.ExactPathCookieRewriter(url_rewriter)[source]

Bases: pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter

Rewrite cookies only using exact path, useful for live rewrite without a timestamp and to minimize cookie pollution

If a path or domain is present, it is simply removed

class pywb.rewrite.cookie_rewriter.HostScopeCookieRewriter(url_rewriter)[source]

Bases: pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter

Attempt to rewrite cookies to the current host url.

If path present, rewrite path to current host. Only makes sense in live proxy or no redirect mode, as otherwise timestamp may change.

If domain present, remove domain and set to path prefix

class pywb.rewrite.cookie_rewriter.MinimalScopeCookieRewriter(url_rewriter)[source]

Bases: pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter

Attempt to rewrite cookies to minimal scope possible

If path present, rewrite path to current rewritten url only. If domain present, remove domain and set to path prefix.

class pywb.rewrite.cookie_rewriter.RemoveAllCookiesRewriter(url_rewriter)[source]

Bases: pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter

rewrite(cookie_str, header='Set-Cookie')[source]
class pywb.rewrite.cookie_rewriter.RootScopeCookieRewriter(url_rewriter)[source]

Bases: pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter

Sometimes it is necessary to rewrite cookies to root scope in order to work across time boundaries and modifiers

This rewriter simply sets all cookies to be in the root

class pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter(url_rewriter)[source]

Bases: object

Base Cookie rewriter for wburl-based requests.

REMOVE_EXPIRES = re.compile('[;]\\s*?expires=.{4}[^,;]+', re.IGNORECASE)

If an HttpOnly cookie is set to a path ending in / and the current mod is mp_ or if_, then assume it’s meant to be a prefix and is likely needed for other content. Set the cookie with the same prefix but for all common modifiers: (mp_, js_, cs_, oe_, if_, sw_, wkrf_)

rewrite(cookie_str, header='Set-Cookie')[source]
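
As a rough illustration of what the REMOVE_EXPIRES pattern listed above matches, the following sketch applies it to a sample Set-Cookie value. The strip_expires helper and the sample cookie are assumptions made for this example; they are not part of pywb, and the rewriters may use the pattern differently.

```python
import re

# Pattern copied from WbUrlBaseCookieRewriter.REMOVE_EXPIRES
REMOVE_EXPIRES = re.compile('[;]\\s*?expires=.{4}[^,;]+', re.IGNORECASE)


def strip_expires(cookie_str):
    """Hypothetical helper: drop the expires attribute from a cookie string."""
    return REMOVE_EXPIRES.sub('', cookie_str)


# Sample value invented for this example
sample = 'sid=abc123; Expires=Wed, 09 Jun 2021 10:18:14 GMT; Path=/; HttpOnly'
print(strip_expires(sample))
```

The `.{4}` part consumes the weekday and its comma (for example `Wed,`), so the following `[^,;]+` can run up to the next `;` without stopping at that comma.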
pywb.rewrite.cookies module
class pywb.rewrite.cookies.CookieTracker(redis, expire_time=120)[source]

Bases: object

get_rewriter(url_rewriter, cookie_key)[source]
static get_subdomains(url)[source]
class pywb.rewrite.cookies.DomainCacheCookieRewriter(url_rewriter, cookie_tracker, cookie_key)[source]

Bases: pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter

get_expire_sec(morsel)[source]
class pywb.rewrite.cookies.HostScopeNoFilterCookieRewriter(url_rewriter)[source]

Bases: pywb.rewrite.cookie_rewriter.HostScopeCookieRewriter

pywb.rewrite.default_rewriter module
class pywb.rewrite.default_rewriter.DefaultRewriter(replay_mod='', config=None)[source]

Bases: pywb.rewrite.content_rewriter.BaseContentRewriter

DEFAULT_REWRITERS = {'amf': <class 'pywb.rewrite.rewrite_amf.RewriteAMF'>, 'cookie': <class 'pywb.rewrite.cookie_rewriter.HostScopeCookieRewriter'>, 'css': <class 'pywb.rewrite.regex_rewriters.CSSRewriter'>, 'dash': <class 'pywb.rewrite.rewrite_dash.RewriteDASH'>, 'header': <class 'pywb.rewrite.header_rewriter.DefaultHeaderRewriter'>, 'hls': <class 'pywb.rewrite.rewrite_hls.RewriteHLS'>, 'html': <class 'pywb.rewrite.html_rewriter.HTMLRewriter'>, 'html-banner-only': <class 'pywb.rewrite.html_insert_rewriter.HTMLInsertOnlyRewriter'>, 'js': <class 'pywb.rewrite.regex_rewriters.JSWombatProxyRewriter'>, 'js-proxy': <class 'pywb.rewrite.regex_rewriters.JSNoneRewriter'>, 'js-worker': <class 'pywb.rewrite.rewrite_js_workers.JSWorkerRewriter'>, 'json': <class 'pywb.rewrite.jsonp_rewriter.JSONPRewriter'>, 'xml': <class 'pywb.rewrite.regex_rewriters.XMLRewriter'>}
default_content_types = {'css': 'text/css', 'html': 'text/html', 'js': 'text/javascript'}
get_rewrite_types()[source]
init_js_regex(regexs)[source]
rewrite_types = {'': 'guess-text', 'application/dash+xml': 'dash', 'application/javascript': 'js', 'application/json': 'json', 'application/octet-stream': 'guess-bin', 'application/vnd.apple.mpegurl': 'hls', 'application/x-amf': 'amf', 'application/x-javascript': 'js', 'application/x-mpegURL': 'hls', 'application/xhtml': 'html', 'application/xhtml+xml': 'html', 'text/css': 'css', 'text/html': 'guess-html', 'text/javascript': 'js', 'text/plain': 'guess-text'}
class pywb.rewrite.default_rewriter.RewriterWithJSProxy(*args, **kwargs)[source]

Bases: pywb.rewrite.default_rewriter.DefaultRewriter

get_rewriter(rw_type, rwinfo=None)[source]
ua_no_obj_proxy(opts)[source]
pywb.rewrite.header_rewriter module
class pywb.rewrite.header_rewriter.DefaultHeaderRewriter(rwinfo, header_prefix='X-Archive-Orig-')[source]

Bases: object

header_rules = {'accept-patch': 'keep', 'accept-ranges': 'keep', 'access-control-allow-credentials': 'prefix-if-url-rewrite', 'access-control-allow-headers': 'prefix-if-url-rewrite', 'access-control-allow-methods': 'prefix-if-url-rewrite', 'access-control-allow-origin': 'prefix-if-url-rewrite', 'access-control-expose-headers': 'prefix-if-url-rewrite', 'access-control-max-age': 'prefix-if-url-rewrite', 'age': 'prefix', 'allow': 'keep', 'alt-svc': 'prefix', 'cache-control': 'prefix', 'connection': 'prefix', 'content-base': 'url-rewrite', 'content-disposition': 'keep', 'content-encoding': 'prefix-if-content-rewrite', 'content-language': 'keep', 'content-length': 'content-length', 'content-location': 'url-rewrite', 'content-md5': 'prefix', 'content-range': 'keep', 'content-security-policy': 'prefix', 'content-security-policy-report-only': 'prefix', 'content-type': 'keep', 'date': 'prefix', 'etag': 'prefix', 'expires': 'prefix', 'last-modified': 'prefix', 'link': 'keep', 'location': 'url-rewrite', 'p3p': 'prefix', 'pragma': 'prefix', 'proxy-authenticate': 'keep', 'public-key-pins': 'prefix', 'retry-after': 'prefix', 'server': 'prefix', 'set-cookie': 'cookie', 'status': 'prefix', 'strict-transport-security': 'prefix', 'tk': 'prefix', 'trailer': 'prefix', 'transfer-encoding': 'transfer-encoding', 'upgrade': 'prefix', 'upgrade-insecure-requests': 'prefix', 'vary': 'prefix', 'via': 'prefix', 'warning': 'prefix', 'www-authenticate': 'keep', 'x-frame-options': 'prefix', 'x-xss-protection': 'prefix'}
rewrite_header(name, value, rule)[source]
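
The header_rules table maps lower-cased header names to rule names. The sketch below shows how two of those rules might be applied, under the assumption that 'keep' passes a header through unchanged and 'prefix' renames it using header_prefix. The apply_simple_rule function is hypothetical, and the remaining rules (url-rewrite, cookie, the conditional prefix rules, content-length, transfer-encoding) depend on pywb internals and are not modelled here.

```python
# Small excerpt of DefaultHeaderRewriter.header_rules
HEADER_RULES = {
    'content-type': 'keep',
    'etag': 'prefix',
    'cache-control': 'prefix',
    'location': 'url-rewrite',
}

# Default value of the header_prefix constructor argument
HEADER_PREFIX = 'X-Archive-Orig-'


def apply_simple_rule(name, value):
    """Hypothetical helper covering only the 'keep' and 'prefix' rules."""
    rule = HEADER_RULES.get(name.lower())
    if rule == 'keep':
        # Assumed behaviour: header is passed through as-is
        return (name, value)
    if rule == 'prefix':
        # Assumed behaviour: header is renamed so the browser ignores it
        return (HEADER_PREFIX + name, value)
    # Any other rule is outside the scope of this sketch
    return None


print(apply_simple_rule('Content-Type', 'text/html'))
print(apply_simple_rule('ETag', '"abc"'))
```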
pywb.rewrite.html_insert_rewriter module
class pywb.rewrite.html_insert_rewriter.HTMLInsertOnlyRewriter(url_rewriter, **kwargs)[source]

Bases: pywb.rewrite.content_rewriter.StreamingRewriter

Insert a custom string into the head of the HTML, before any tag that is not <head> or <html>. No other rewriting is performed.

NOT_HEAD_REGEX = re.compile('(<\\s*\\b)(?!(html|head))', re.IGNORECASE)
XML_HEADER = re.compile('<\\?xml.*\\?>')
final_read()[source]
rewrite(string)[source]
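
The insertion point described above can be located with NOT_HEAD_REGEX alone. The following is a minimal sketch, assuming the whole document is available as one string; insert_before_first_tag is a hypothetical helper, whereas the real class works on a stream.

```python
import re

# Pattern copied from HTMLInsertOnlyRewriter.NOT_HEAD_REGEX
NOT_HEAD_REGEX = re.compile('(<\\s*\\b)(?!(html|head))', re.IGNORECASE)


def insert_before_first_tag(html, insert_str):
    """Hypothetical helper: place insert_str before the first opening tag
    whose name does not start with 'html' or 'head'."""
    m = NOT_HEAD_REGEX.search(html)
    if not m:
        # No suitable tag found, leave the input unchanged
        return html
    return html[:m.start()] + insert_str + html[m.start():]


page = '<html><head><title>T</title></head></html>'
print(insert_before_first_tag(page, '<!--banner-->'))
```

Closing tags such as `</head>` are skipped because `\b` requires a word character directly after `<` (or after any whitespace following it).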
pywb.rewrite.html_rewriter module
class pywb.rewrite.html_rewriter.HTMLRewriter(*args, **kwargs)[source]

Bases: pywb.rewrite.html_rewriter.HTMLRewriterMixin, html.parser.HTMLParser

PARSETAG = re.compile('[<]')
clear_cdata_mode()[source]
feed(string)[source]

Feed data to the parser.

Call this as often as you want, with as little or as much text as you want (may include '\n').

handle_comment(data)[source]
handle_data(data)[source]
handle_decl(data)[source]
handle_endtag(tag)[source]
handle_pi(data)[source]
handle_startendtag(tag, attrs)[source]
handle_starttag(tag, attrs)[source]
reset()[source]

Reset this instance. Loses all unprocessed data.

unescape(s)[source]
unknown_decl(data)[source]
class pywb.rewrite.html_rewriter.HTMLRewriterMixin(url_rewriter, head_insert=None, js_rewriter_class=None, js_rewriter=None, css_rewriter=None, css_rewriter_class=None, url='', defmod='', parse_comments=False, charset='utf-8')[source]

Bases: pywb.rewrite.content_rewriter.StreamingRewriter

HTML-Parsing Rewriter for custom rewriting, also delegates to rewriters for script and css

ADD_WINDOW = re.compile('(?<![.])(WB_wombat_)')
class AccumBuff[source]

Bases: object

getvalue()[source]
write(string)[source]
BEFORE_HEAD_TAGS = ['html', 'head']
DATA_RW_PROTOCOLS = ('http://', 'https://', '//')
META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$', re.IGNORECASE|re.MULTILINE)
PRELOAD_TYPES = {'audio': 'oe_', 'document': 'if_', 'embed': 'oe_', 'fetch': 'mp_', 'font': 'oe_', 'image': 'im_', 'object': 'oe_', 'script': 'js_', 'style': 'cs_', 'track': 'oe_', 'video': 'oe_', 'worker': 'js_'}
SRCSET_REGEX = re.compile('\\s*(\\S*\\s+[\\d\\.]+[wx]),|(?:\\s*,(?:\\s+|(?=https?:)))')
close()[source]
final_read()[source]
get_attr(tag_attrs, match_name)[source]
has_attr(tag_attrs, attr)[source]
parse_data(data)[source]
rewrite(string)[source]
try_unescape(value)[source]
pywb.rewrite.jsonp_rewriter module
class pywb.rewrite.jsonp_rewriter.JSONPRewriter(url_rewriter, align_to_line=True, first_buff='')[source]

Bases: pywb.rewrite.content_rewriter.StreamingRewriter

CALLBACK = re.compile('[?].*callback=([^&]+)')
JSONP = re.compile('(?:^[ \\t]*(?:(?:\\/\\*[^\\*]*\\*\\/)|(?:\\/\\/[^\\n]+[\\n])))*[ \\t]*(\\w+)\\(\\{', re.MULTILINE)
rewrite(string)[source]
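
The two patterns above suggest how a JSONP response can be matched to the callback name requested in the URL. The sketch below shows one plausible way to combine them; rename_jsonp_callback and the sample values are assumptions for illustration and may not match the actual rewrite() logic.

```python
import re

# Patterns copied from JSONPRewriter
CALLBACK = re.compile('[?].*callback=([^&]+)')
JSONP = re.compile('(?:^[ \\t]*(?:(?:\\/\\*[^\\*]*\\*\\/)|(?:\\/\\/[^\\n]+[\\n])))*[ \\t]*(\\w+)\\(\\{', re.MULTILINE)


def rename_jsonp_callback(content, request_url):
    """Hypothetical helper: swap the recorded JSONP function name for the
    callback name found in the request url."""
    m_json = JSONP.match(content)
    if not m_json:
        # Not JSONP (no leading 'name({'), leave unchanged
        return content

    m_callback = CALLBACK.search(request_url)
    if not m_callback:
        # No callback= query argument, leave unchanged
        return content

    # Keep everything after the recorded function name
    return m_callback.group(1) + content[m_json.end(1):]


print(rename_jsonp_callback('jQuery123({"a": 1})',
                            'http://example.com/api?callback=jQuery456'))
```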
pywb.rewrite.regex_rewriters module
class pywb.rewrite.regex_rewriters.CSSRewriter(rewriter, extra_rules=None, first_buff='')[source]

Bases: pywb.rewrite.regex_rewriters.RegexRewriter

rules_factory = <pywb.rewrite.regex_rewriters.CSSRules object>
class pywb.rewrite.regex_rewriters.CSSRules[source]

Bases: pywb.rewrite.regex_rewriters.RxRules

CSS_IMPORT_REGEX = '@import\\s+(?:url\\s*)?\\(?\\s*[\'"]?([\\w.:/\\\\-]+)'
CSS_URL_REGEX = 'url\\s*\\(\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*([^)\'"]+)\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*\\)'
class pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriter(rewriter, extra_rules=None, first_buff='')[source]

Bases: pywb.rewrite.regex_rewriters.RegexRewriter

rules_factory = <pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriterRules object>
class pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriterRules(prefix='WB_wombat_')[source]

Bases: pywb.rewrite.regex_rewriters.JSLocationRewriterRules

JS Rewriter rules which also rewrite absolute http://, https:// and // urls at the beginning of a string

JS_HTTPX = '(?:(?<=["\\\';])https?:|(?<=["\\\']))\\\\{0,4}/\\\\{0,4}/[A-Za-z0-9:_@%.\\\\-]+/'
get_rules(prefix)[source]
class pywb.rewrite.regex_rewriters.JSLocationOnlyRewriter(rewriter, extra_rules=None, first_buff='')[source]

Bases: pywb.rewrite.regex_rewriters.RegexRewriter

rules_factory = <pywb.rewrite.regex_rewriters.JSLocationRewriterRules object>
class pywb.rewrite.regex_rewriters.JSLocationRewriterRules(prefix='WB_wombat_')[source]

Bases: pywb.rewrite.regex_rewriters.RxRules

JS Rewriter mixin which rewrites location and domain to the specified prefix (default: WB_wombat_)

get_rules(prefix)[source]
class pywb.rewrite.regex_rewriters.JSNoneRewriter(rewriter, extra_rules=None, first_buff='')[source]

Bases: pywb.rewrite.regex_rewriters.RegexRewriter

class pywb.rewrite.regex_rewriters.JSReplaceFuzzy(*args, **kwargs)[source]

Bases: object

rewrite(string)[source]
rx_obj = None
pywb.rewrite.regex_rewriters.JSRewriter

alias of pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriter

class pywb.rewrite.regex_rewriters.JSWombatProxyRewriter(rewriter, extra_rules=None)[source]

Bases: pywb.rewrite.regex_rewriters.RegexRewriter

JS Rewriter mixin which wraps the contents of the script in an anonymous block scope and inserts Wombat js-proxy setup

final_read()[source]
rewrite_complete(string, **kwargs)[source]
rules_factory = <pywb.rewrite.regex_rewriters.JSWombatProxyRules object>
class pywb.rewrite.regex_rewriters.JSWombatProxyRules[source]

Bases: pywb.rewrite.regex_rewriters.RxRules

class pywb.rewrite.regex_rewriters.RegexRewriter(rewriter, extra_rules=None, first_buff='')[source]

Bases: pywb.rewrite.content_rewriter.StreamingRewriter

filter(m)[source]
static parse_rules_from_config(config)[source]
replace(m)[source]
rewrite(string)[source]
rules_factory = <pywb.rewrite.regex_rewriters.RxRules object>
class pywb.rewrite.regex_rewriters.RxRules(rules=None)[source]

Bases: object

HTTPX_MATCH_STR = 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+'
static add_prefix(prefix)[source]
static add_suffix(suffix)[source]
static archival_rewrite(mod=None)[source]
static compile_rules(rules)[source]
static fixed(string)[source]
static format(template)[source]
static remove_https(string, _)[source]
static replace_prefix_from(prefix, match)[source]
static replace_str(replacer, match='this')[source]
class pywb.rewrite.regex_rewriters.XMLRewriter(rewriter, extra_rules=None, first_buff='')[source]

Bases: pywb.rewrite.regex_rewriters.RegexRewriter

filter(m)[source]
rules_factory = <pywb.rewrite.regex_rewriters.XMLRules object>
class pywb.rewrite.regex_rewriters.XMLRules[source]

Bases: pywb.rewrite.regex_rewriters.RxRules

pywb.rewrite.rewrite_amf module
class pywb.rewrite.rewrite_amf.RewriteAMF(url_rewriter=None)[source]

Bases: pywb.rewrite.content_rewriter.BufferedRewriter

rewrite_stream(stream, rwinfo)[source]
pywb.rewrite.rewrite_dash module
class pywb.rewrite.rewrite_dash.RewriteDASH(url_rewriter=None)[source]

Bases: pywb.rewrite.content_rewriter.BufferedRewriter

rewrite_dash(stream, rwinfo)[source]
rewrite_stream(stream, rwinfo)[source]
pywb.rewrite.rewrite_dash.rewrite_fb_dash(string, *args)[source]
pywb.rewrite.rewrite_dash.rewrite_tw_dash(string, *args)[source]
pywb.rewrite.rewrite_hls module
class pywb.rewrite.rewrite_hls.RewriteHLS(url_rewriter=None)[source]

Bases: pywb.rewrite.content_rewriter.BufferedRewriter

EXT_INF = re.compile('#EXT-X-STREAM-INF:(?:.*[,])?BANDWIDTH=([\\d]+)')
EXT_RESOLUTION = re.compile('RESOLUTION=([\\d]+)x([\\d]+)')
rewrite_stream(stream, rwinfo)[source]
pywb.rewrite.rewrite_js_workers module
class pywb.rewrite.rewrite_js_workers.JSWorkerRewriter(url_rewriter, align_to_line=True, first_buff='')[source]

Bases: pywb.rewrite.content_rewriter.StreamingRewriter

A simple rewriter for rewriting web or service workers. The only rewriting that occurs is the injection of the init code for wombatWorkers.js. This allows all of them to operate as expected on the live web.

pywb.rewrite.rewriteinputreq module
class pywb.rewrite.rewriteinputreq.RewriteInputRequest(env, urlkey, url, rewriter)[source]

Bases: pywb.warcserver.inputrequest.DirectWSGIInputRequest

RANGE_ARG_RX = re.compile('.*.googlevideo.com/videoplayback.*([&?]range=(\\d+)-(\\d+))')
RANGE_HEADER = re.compile('bytes=(\\d+)-(\\d+)?')
extract_range()[source]
get_full_request_uri()[source]
get_req_headers()[source]
pywb.rewrite.templateview module
class pywb.rewrite.templateview.BaseInsertView(jenv, insert_file, banner_view=None)[source]

Bases: object

Base class of all template views used by Pywb

render_to_string(env, **kwargs)[source]

Render this template.

Parameters:
  • env (dict) – The WSGI environment associated with the request causing this template to be rendered
  • kwargs (any) – The keyword arguments to be supplied to the Jinja template render method
Returns:

The rendered template

Return type:

str

class pywb.rewrite.templateview.HeadInsertView(jenv, insert_file, banner_view=None)[source]

Bases: pywb.rewrite.templateview.BaseInsertView

The template view class associated with rendering the HTML inserted into the head of the pages replayed (WB Insert).

create_insert_func(wb_url, wb_prefix, host_prefix, top_url, env, is_framed, coll='', include_ts=True, **kwargs)[source]

Create the function used to render the header insert template for the current request.

Parameters:
  • wb_url (rewrite.wburl.WbUrl) – The WbUrl for the request this template is being rendered for
  • wb_prefix (str) – The URL prefix pywb is serving the content using (e.g. http://localhost:8080/live/)
  • host_prefix (str) – The host URL prefix pywb is running on (e.g. http://localhost:8080)
  • top_url (str) – The full URL for this request (e.g. http://localhost:8080/live/http://example.com)
  • env (dict) – The WSGI environment dictionary for this request
  • is_framed (bool) – Is pywb or a specific collection running in framed mode
  • coll (str) – The name of the collection this request is associated with
  • include_ts (bool) – Should a timestamp be included in the rendered template
  • kwargs – Additional keyword arguments to be supplied to the Jinja template render method
Returns:

A function to be used to render the header insert for the request this template is being rendered for

Return type:

callable

class pywb.rewrite.templateview.JinjaEnv(paths=None, packages=None, assets_path=None, globals=None, overlay=None, extensions=None, env_template_params_key='pywb.template_params', env_template_dir_key='pywb.templates_dir')[source]

Bases: object

Pywb JinjaEnv class that provides utility functions used by the templates, configured template loaders and template paths, and contains the actual Jinja env used by each template.

init_loc(locales_root_dir, locales, loc_map, default_locale)[source]
template_filter(param=None)[source]

Returns a decorator that adds the wrapped function to the dictionary of template filters.

The wrapped function is keyed by either the supplied param (if supplied) or by the wrapped function's name.

Parameters:param – Optional name to use instead of the name of the function to be wrapped
Returns:A decorator to wrap a template filter function
Return type:callable
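
The decorator-factory pattern described for template_filter can be sketched without Jinja. FilterRegistry below is a hypothetical stand-in for the part of JinjaEnv that collects filters; its attribute names are assumptions, and only the keying behaviour (param if supplied, otherwise the function name) is taken from the description above.

```python
class FilterRegistry:
    """Hypothetical stand-in for the filter-collecting part of JinjaEnv."""

    def __init__(self):
        self.filters = {}

    def template_filter(self, param=None):
        """Return a decorator that registers the wrapped function."""
        def deco(func):
            # Key by the supplied name, else by the function's own name
            name = param or func.__name__
            self.filters[name] = func
            return func
        return deco


registry = FilterRegistry()


@registry.template_filter()
def shout(value):
    return value.upper()


@registry.template_filter('quiet')
def whisper(value):
    return value.lower()


print(sorted(registry.filters))
```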
class pywb.rewrite.templateview.PkgResResolver[source]

Bases: webassets.env.Resolver

Class for resolving pywb package resources when installed via pypi or setup.py

get_pkg_path(item)[source]

Get the package path for the

Parameters:item (str) – A resource's full package path
Returns:The netloc and path from the item's package path
Return type:tuple[str, str]
resolve_source(ctx, item)[source]

Given item from a Bundle’s contents, this has to return the final value to use, usually an absolute filesystem path.

Note

It is also allowed to return urls and bundle instances (or generally anything else the calling Bundle instance may be able to handle). Indeed this is the reason why the name of this method does not imply a return type.

The incoming item is usually a relative path, but may also be an absolute path, or a url. These you will commonly want to return unmodified.

This method is also allowed to resolve item to multiple values, in which case a list should be returned. This is commonly used if item includes glob instructions (wildcards).

Note

Instead of this, subclasses should consider implementing search_for_source() instead.

class pywb.rewrite.templateview.RelEnvironment(block_start_string='{%', block_end_string='%}', variable_start_string='{{', variable_end_string='}}', comment_start_string='{#', comment_end_string='#}', line_statement_prefix=None, line_comment_prefix=None, trim_blocks=False, lstrip_blocks=False, newline_sequence='\n', keep_trailing_newline=False, extensions=(), optimized=True, undefined=<class 'jinja2.runtime.Undefined'>, finalize=None, autoescape=False, loader=None, cache_size=400, auto_reload=True, bytecode_cache=None, enable_async=False)[source]

Bases: jinja2.environment.Environment

Override join_path() to enable relative template paths.

join_path(template, parent)[source]

Join a template with the parent. By default all the lookups are relative to the loader root so this method returns the template parameter unchanged, but if the paths should be relative to the parent template, this function can be used to calculate the real template name.

Subclasses may override this method and implement template path joining here.
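
Resolving a template name relative to its parent template comes down to a path join. The function below is a minimal sketch of that calculation using only posixpath; join_relative is a hypothetical name, and the actual join_path() override in RelEnvironment may differ in detail.

```python
import posixpath


def join_relative(template, parent):
    """Hypothetical helper: resolve 'template' relative to the directory
    of the 'parent' template name."""
    base_dir = posixpath.dirname(parent)
    return posixpath.normpath(posixpath.join(base_dir, template))


print(join_relative('banner.html', 'ui/frame.html'))
print(join_relative('../base.html', 'ui/sub/page.html'))
```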

class pywb.rewrite.templateview.TopFrameView(jenv, insert_file, banner_view=None)[source]

Bases: pywb.rewrite.templateview.BaseInsertView

The template view class associated with rendering the replay iframe

get_top_frame(wb_url, wb_prefix, host_prefix, env, frame_mod, replay_mod, coll='', extra_params=None)[source]
Parameters:
  • wb_url (rewrite.wburl.WbUrl) – The WbUrl for the request this template is being rendered for
  • wb_prefix (str) – The URL prefix pywb is serving the content using (e.g. http://localhost:8080/live/)
  • host_prefix (str) – The host URL prefix pywb is running on (e.g. http://localhost:8080)
  • env (dict) – The WSGI environment dictionary for the request this template is being rendered for
  • frame_mod (str) – The modifier to be used for framing (e.g. if_)
  • replay_mod (str) – The modifier to be used in the URL of the page being replayed (e.g. mp_)
  • coll (str) – The name of the collection this template is being rendered for
  • extra_params (dict) – Additional parameters to be supplied to the Jinja template render method
Returns:

The frame insert string

Return type:

str

pywb.rewrite.url_rewriter module
class pywb.rewrite.url_rewriter.IdentityUrlRewriter(wburl, prefix='', full_prefix=None, rel_prefix=None, root_path=None, cookie_scope=None, rewrite_opts=None, pywb_static_prefix=None)[source]

Bases: pywb.rewrite.url_rewriter.UrlRewriter

No rewriting performed, return original url

deprefix_url()[source]
get_new_url(**kwargs)[source]
rebase_rewriter(new_url)[source]
rewrite(url, mod=None, force_abs=False)[source]
class pywb.rewrite.url_rewriter.SchemeOnlyUrlRewriter(*args, **kwargs)[source]

Bases: pywb.rewrite.url_rewriter.IdentityUrlRewriter

A url rewriter which ensures that any urls have the same scheme (http or https) as the base url. Other urls/input is unchanged.

rewrite(url, mod=None, force_abs=False)[source]
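
The scheme-only behaviour described for SchemeOnlyUrlRewriter can be sketched as a plain function. match_scheme is a hypothetical helper written for this example; it assumes the base url starts with either http:// or https://.

```python
def match_scheme(url, base_url):
    """Hypothetical helper: give 'url' the same http/https scheme as
    'base_url'; any other input is returned unchanged."""
    if base_url.startswith('https://'):
        scheme, other = 'https://', 'http://'
    else:
        scheme, other = 'http://', 'https://'

    if url.startswith(other):
        return scheme + url[len(other):]
    return url


print(match_scheme('http://example.com/a.js', 'https://example.com/'))
```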
class pywb.rewrite.url_rewriter.UrlRewriter(wburl, prefix='', full_prefix=None, rel_prefix=None, root_path=None, cookie_scope=None, rewrite_opts=None, pywb_static_prefix=None)[source]

Bases: object

Main pywb UrlRewriter which rewrites absolute and relative urls to be relative to the current page, as specified via a WbUrl instance and an optional full path prefix

NO_REWRITE_URI_PREFIX = ('#', 'javascript:', 'data:', 'mailto:', 'about:', 'file:', '{')
PARENT_PATH = '../'
PROTOCOLS = ('http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:')
REL_PATH = '/'
REL_SCHEME = ('//', '\\/\\/', '\\\\/\\\\/')
deprefix_url()[source]
get_new_url(**kwargs)[source]
pywb_static_prefix

Returns the static path URL

Return type:str

rebase_rewriter(base_url)[source]
rewrite(url, mod=None, force_abs=False)[source]
static urljoin(orig_url, url)[source]
pywb.rewrite.wburl module

WbUrl represents the standard wayback archival url format. A regular url is a subset of the WbUrl (latest replay).

The WbUrl expresses the common interface for interacting with the wayback machine.

The WbUrl may represent one of the following forms:

query form: [/modifier]/[timestamp][-end_timestamp]*/<url>

modifier, timestamp and end_timestamp are optional:

*/example.com
20101112030201*/http://example.com
2009-2015*/http://example.com
/cdx/*/http://example.com

url query form: used to indicate a query across urls. Same as query form but with a final *:

*/example.com*
20101112030201*/http://example.com*

replay form:

20101112030201/http://example.com
20101112030201im_/http://example.com

latest_replay: (no timestamp):

http://example.com

Additionally, the BaseWbUrl provides the base components (url, timestamp, end_timestamp, modifier, type) which can be used to provide a custom representation of the wayback url format.
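
To make the forms above concrete, the sketch below classifies a few of the listed examples using the QUERY_REGEX and REPLAY_REGEX patterns shown on the WbUrl class. The classify function, its return tuple, and the order in which the patterns are tried are assumptions for illustration; WbUrl itself performs additional normalization that is not modelled here.

```python
import re

# Patterns copied from WbUrl.QUERY_REGEX and WbUrl.REPLAY_REGEX
QUERY_REGEX = re.compile('^(?:([\\w\\-:]+)/)?(\\d*)[*-](\\d*)/?(.+)$')
REPLAY_REGEX = re.compile('^(\\d*)([a-z]+_|[$][a-z0-9:.-]+)?/{1,3}(.+)$')


def classify(wburl_str):
    """Hypothetical helper returning (type, mod, timestamp, end_timestamp, url)."""
    m = QUERY_REGEX.match(wburl_str)
    if m:
        mod, timestamp, end_timestamp, url = m.groups('')
        if url.endswith('*'):
            # A trailing * marks a query across urls
            return ('url_query', mod, timestamp, end_timestamp, url[:-1])
        return ('query', mod, timestamp, end_timestamp, url)

    m = REPLAY_REGEX.match(wburl_str)
    if m:
        timestamp, mod, url = m.groups('')
        type_ = 'replay' if timestamp else 'latest_replay'
        return (type_, mod, timestamp, '', url)

    # Neither pattern matched: treat the input as a plain url
    return ('latest_replay', '', '', '', wburl_str)


for example in ('20101112030201*/http://example.com',
                '*/example.com*',
                '20101112030201im_/http://example.com',
                'http://example.com'):
    print(classify(example))
```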

class pywb.rewrite.wburl.BaseWbUrl(url='', mod='', timestamp='', end_timestamp='', type=None)[source]

Bases: object

LATEST_REPLAY = 'latest_replay'
QUERY = 'query'
REPLAY = 'replay'
URL_QUERY = 'url_query'
is_latest_replay()[source]
is_query()[source]
static is_query_type(type_)[source]
is_replay()[source]
static is_replay_type(type_)[source]
is_url_query()[source]
class pywb.rewrite.wburl.WbUrl(orig_url)[source]

Bases: pywb.rewrite.wburl.BaseWbUrl

DEFAULT_SCHEME = 'http://'
FIRST_PATH = re.compile('(?<![:/])[/?](?![/])')
QUERY_REGEX = re.compile('^(?:([\\w\\-:]+)/)?(\\d*)[*-](\\d*)/?(.+)$')
REPLAY_REGEX = re.compile('^(\\d*)([a-z]+_|[$][a-z0-9:.-]+)?/{1,3}(.+)$')
SCHEME_RX = re.compile('[a-zA-Z0-9+-.]+(:/)')
deprefix_url(prefix)[source]
get_url(url=None)[source]
is_banner_only
is_embed
is_identity
is_url_rewrite_only
static percent_encode_host(url)[source]

Convert the host of a uri formatted with to_uri() to have a %-encoded host instead of a punycode host. The rest of the url should be unchanged.

set_replay_timestamp(timestamp)[source]
to_str(**overrides)[source]
static to_uri(url)[source]

Converts a url to an ascii %-encoded form where:

  • scheme is ascii,
  • host is punycode,
  • and remainder is %-encoded

Not using urlsplit to also decode partially encoded scheme urls

static to_wburl_str(url, type='latest_replay', mod='', timestamp='', end_timestamp='')[source]
Module contents

pywb.utils package

Submodules
pywb.utils.binsearch module

Utility functions for performing binary search over a sorted text file

pywb.utils.binsearch.binsearch(reader, key, compare_func=<function cmp>, block_size=8192)[source]

Perform a binary search for a specified key to within a ‘block_size’ (default 8192) granularity, and return first full line found.

pywb.utils.binsearch.binsearch_offset(reader, key, compare_func=<function cmp>, block_size=8192)[source]

Find the offset of the line which matches a given ‘key’ using binary search. If the key is not found, the offset is that of the line after the key.

The file is subdivided into block_size (default 8192) sized blocks. An optional compare_func may be specified.

pywb.utils.binsearch.cmp(a, b)[source]
pywb.utils.binsearch.iter_exact(reader, key, token=b' ')[source]

Create an iterator which iterates over lines where the first field matches the ‘key’, equivalent to token + sep prefix. Default field terminator/separator is ' '

pywb.utils.binsearch.iter_prefix(reader, key)[source]

Creates an iterator which iterates over lines that start with prefix ‘key’ in a sorted text file.

pywb.utils.binsearch.iter_range(reader, start, end, prev_size=0)[source]

Creates an iterator which iterates over lines where start <= line < end (end exclusive)
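
The prefix and range semantics above can be illustrated in memory with the standard bisect module. This is only an analogue: the pywb functions operate on a seekable reader in block_size chunks, while the hypothetical helpers and sample index lines below work on a sorted Python list.

```python
import bisect


def lines_in_range(sorted_lines, start, end):
    """Hypothetical helper: lines where start <= line < end (end exclusive)."""
    lo = bisect.bisect_left(sorted_lines, start)
    hi = bisect.bisect_left(sorted_lines, end)
    return sorted_lines[lo:hi]


def lines_with_prefix(sorted_lines, prefix):
    """Hypothetical helper: lines that start with 'prefix'."""
    lo = bisect.bisect_left(sorted_lines, prefix)
    result = []
    for line in sorted_lines[lo:]:
        if not line.startswith(prefix):
            break
        result.append(line)
    return result


# Invented sample of a sorted index, one record per line
index = [
    'com,example)/ 2010',
    'com,example)/about 2011',
    'com,example)/path/a 2012',
    'com,example)/path/b 2013',
    'org,example)/ 2014',
]

print(lines_in_range(index, 'com,example)/path/', 'com,example)/path0'))
print(len(lines_with_prefix(index, 'com,example)/')))
```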

pywb.utils.binsearch.linearsearch(iter_, key, prev_size=0, compare_func=<function cmp>)[source]

Perform a linear search over iterator until current_line >= key

Optionally also track up to N previous lines, which are returned before the first matched line.

If the end of the stream is reached before a match is found, nothing is returned (previous lines are discarded as well).
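
The behaviour described for linearsearch, including the tracking of previous lines, can be sketched with a bounded deque. linear_search_sketch is a hypothetical re-implementation that uses plain `>=` comparison in place of compare_func; it is meant to illustrate the description, not to reproduce the pywb source.

```python
from collections import deque
from itertools import chain


def linear_search_sketch(lines, key, prev_size=0):
    """Hypothetical helper: iterate from the first line >= key, preceded by
    up to prev_size earlier lines; empty iterator if nothing matches."""
    # Room for the matched line plus prev_size lines before it
    window = deque(maxlen=prev_size + 1)
    it = iter(lines)

    for line in it:
        window.append(line)
        if line >= key:
            # Buffered lines first, then the rest of the stream
            return chain(window, it)

    # End of stream reached without a match: previous lines are dropped too
    return iter(())


print(list(linear_search_sketch(['a', 'b', 'c', 'd', 'e'], 'c', prev_size=1)))
```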

pywb.utils.binsearch.search(reader, key, prev_size=0, compare_func=<function cmp>, block_size=8192)[source]

Perform a binary search for a specified key to within a ‘block_size’ (default 8192) sized block followed by linear search within the block to find first matching line.

When performing linear search, keep track of up to N previous lines before first matching line.

pywb.utils.canonicalize module

Standard url-canonicalization, surt and non-surt

exception pywb.utils.canonicalize.UrlCanonicalizeException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.BadRequestException

class pywb.utils.canonicalize.UrlCanonicalizer(surt_ordered=True)[source]

Bases: object

pywb.utils.canonicalize.calc_search_range(url, match_type, surt_ordered=True, url_canon=None)[source]

Canonicalize a url (either with custom canonicalizer or standard canonicalizer with or without surt)

Then, compute the start and end of the url search range for a given match type.

Supported match types:

  • exact
  • prefix
  • host
  • domain (only available for surt ordering)

Examples below:

# surt ranges
>>> calc_search_range('http://example.com/path/file.html', 'exact')
('com,example)/path/file.html', 'com,example)/path/file.html!')

>>> calc_search_range('http://example.com/path/file.html', 'prefix')
('com,example)/path/file.html', 'com,example)/path/file.htmm')

# slash and ?
>>> calc_search_range('http://example.com/path/', 'prefix')
('com,example)/path/', 'com,example)/path0')

>>> calc_search_range('http://example.com/path?', 'prefix')
('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/?', 'prefix')
('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/file.html', 'host')
('com,example)/', 'com,example*')
>>> calc_search_range('http://example.com/path/file.html', 'domain')
('com,example)/', 'com,example-')

# special case for tld domain range
>>> calc_search_range('com', 'domain')
('com,', 'com-')

# non-surt ranges
>>> calc_search_range('http://example.com/path/file.html', 'exact', False)
('example.com/path/file.html', 'example.com/path/file.html!')

>>> calc_search_range('http://example.com/path/file.html', 'prefix', False)
('example.com/path/file.html', 'example.com/path/file.htmm')
>>> calc_search_range('http://example.com/path/file.html', 'host', False)
('example.com/', 'example.com0')

# errors: domain range not supported
>>> calc_search_range('http://example.com/path/file.html', 'domain', False)   # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
UrlCanonicalizeException: matchType=domain unsupported for non-surt

>>> calc_search_range('http://example.com/path/file.html', 'blah', False)   # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
UrlCanonicalizeException: Invalid match_type: blah
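
The surt-ordered ranges in the doctests above follow a simple pattern: the end key is the start key with its last character advanced by one, or with '!' appended for an exact match. The sketch below reproduces a few of those results with hypothetical helpers (simple_surt, inc_last_char, simple_range). It is a deliberate simplification: it performs no real canonicalization (lower-casing, query handling and so on) and is not the pywb implementation.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Hypothetical helper: reverse the host labels and append the path."""
    parts = urlsplit(url)
    host = ','.join(reversed(parts.hostname.split('.')))
    return host + ')' + (parts.path or '/')


def inc_last_char(value):
    """Return 'value' with its final character advanced by one code point."""
    return value[:-1] + chr(ord(value[-1]) + 1)


def simple_range(url, match_type):
    """Hypothetical helper computing (start_key, end_key) in surt order."""
    key = simple_surt(url)
    host = key.split(')', 1)[0]

    if match_type == 'exact':
        return key, key + '!'
    if match_type == 'prefix':
        return key, inc_last_char(key)
    if match_type == 'host':
        # ')' + 1 is '*', so the range covers every path on this host
        return host + ')/', inc_last_char(host + ')')
    if match_type == 'domain':
        # ',' + 1 is '-', so the range also covers all subdomains
        return host + ')/', inc_last_char(host + ',')
    raise ValueError('Invalid match_type: ' + match_type)


print(simple_range('http://example.com/path/file.html', 'prefix'))
```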
pywb.utils.canonicalize.canonicalize(url, surt_ordered=True)[source]

Canonicalize url and convert to surt. If not in surt ordered mode, convert back to url form, as surt conversion is currently part of canonicalization.

>>> canonicalize('http://example.com/path/file.html', surt_ordered=True)
'com,example)/path/file.html'
>>> canonicalize('http://example.com/path/file.html', surt_ordered=False)
'example.com/path/file.html'
>>> canonicalize('urn:some:id')
'urn:some:id'
pywb.utils.canonicalize.unsurt(surt)[source]

# Simple surt
>>> unsurt('com,example)/')
'example.com/'

# Broken surt
>>> unsurt('com,example)')
'com,example)'

# Long surt
>>> unsurt('suffix,domain,sub,subsub,another,subdomain)/path/file/index.html?a=b?c=)/')
'subdomain.another.subsub.sub.domain.suffix/path/file/index.html?a=b?c=)/'
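
The three unsurt cases above can be reproduced with a few lines of string handling. simple_unsurt is a hypothetical re-implementation written to match those examples; the actual function may differ.

```python
def simple_unsurt(surt):
    """Hypothetical helper: turn 'com,example)/path' back into
    'example.com/path'; input without ')/' is returned unchanged."""
    index = surt.find(')/')
    if index < 0:
        # Broken surt: no host terminator found
        return surt

    labels = surt[:index].split(',')
    labels.reverse()
    # Skip the ')' but keep the '/' that starts the path
    return '.'.join(labels) + surt[index + 1:]


print(simple_unsurt('com,example)/'))
```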

pywb.utils.format module
class pywb.utils.format.ParamFormatter(params, name='', prefix='param.')[source]

Bases: string.Formatter

get_value(key, args, kwargs)[source]
pywb.utils.format.query_to_dict(query_str, multi=None)[source]
pywb.utils.format.res_template(template, params, **extra_params)[source]
pywb.utils.format.to_bool(val)[source]
pywb.utils.geventserver module
class pywb.utils.geventserver.GeventServer(app, port=0, hostname='localhost', handler_class=None, direct=False)[source]

Bases: object

Class for optionally running a WSGI application in a greenlet

join()[source]

Joins the greenlet spawned for running the server if it was started in non-direct mode

make_server(app, port, hostname, handler_class, direct=False)[source]

Creates and starts the server. If direct is true, the server is run in the current thread; otherwise it is run in a greenlet.

Parameters:
  • app – The WSGI application instance to be used
  • port (int) – The port the server is to listen on
  • hostname (str) – The hostname the server is to use
  • handler_class – The class to be used for handling WSGI requests
  • direct (bool) – T/F indicating if the server should be run in the current thread (true) or in a greenlet (false)

stop()[source]

Stops the running server if it was started

class pywb.utils.geventserver.RequestURIWSGIHandler(sock, address, server, rfile=None)[source]

Bases: gevent.pywsgi.WSGIHandler

A specific WSGIHandler subclass that adds REQUEST_URI to the environ dictionary for every request

get_environ()[source]

Returns the WSGI environ dictionary with the REQUEST_URI added to it

Returns:The WSGI environ dictionary for the request
Return type:dict
pywb.utils.io module
class pywb.utils.io.OffsetLimitReader(stream, offset, length)[source]

Bases: warcio.limitreader.LimitReader

read(length=None)[source]
readline(length=None)[source]
class pywb.utils.io.StreamClosingReader(stream)[source]

Bases: object

close()[source]
read(length=None)[source]
readline(length=None)[source]
pywb.utils.io.StreamIter(stream, header1=None, header2=None, size=16384, closer=<class 'contextlib.closing'>)[source]
pywb.utils.io.buffer_iter(status_headers, iterator, buff_size=65536)[source]
pywb.utils.io.call_release_conn(stream)[source]
pywb.utils.io.chunk_encode_iter(orig_iter)[source]
pywb.utils.io.compress_gzip_iter(orig_iter)[source]
pywb.utils.io.no_except_close(closable)[source]

Attempts to call the close method of the supplied object, catching all exceptions. Also tries to call release_conn() in case the object is a requests raw stream.

Parameters:closable – The object to be closed
Return type:None
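
The close-and-ignore-errors behaviour described above can be sketched as follows. close_quietly and the FailingStream test class are hypothetical names used for illustration, not the pywb source.

```python
def close_quietly(closable):
    """Hypothetical helper: close an object, swallowing any exception, and
    also call release_conn() when the object provides it."""
    if closable is None:
        return

    try:
        closable.close()
    except Exception:
        pass

    try:
        release = getattr(closable, 'release_conn', None)
        if release is not None:
            release()
    except Exception:
        pass


class FailingStream:
    """Invented test object whose close() always fails."""

    def __init__(self):
        self.released = False

    def close(self):
        raise IOError('close failed')

    def release_conn(self):
        self.released = True


stream = FailingStream()
close_quietly(stream)      # no exception propagates
print(stream.released)
```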
pywb.utils.loaders module
class pywb.utils.loaders.BaseLoader(**kwargs)[source]

Bases: object

load(url, offset=0, length=-1)[source]
class pywb.utils.loaders.BlockLoader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

A loader which can stream blocks of content given a uri, offset and optional length. Currently supports: http/https, file/local file system, pkg, WebHDFS, S3

static init_default_loaders()[source]
load(url, offset=0, length=-1)[source]
loaders = {'file': <class 'pywb.utils.loaders.LocalFileLoader'>, 'http': <class 'pywb.utils.loaders.HttpLoader'>, 'https': <class 'pywb.utils.loaders.HttpLoader'>, 'pkg': <class 'pywb.utils.loaders.PackageLoader'>, 's3': <class 'pywb.utils.loaders.S3Loader'>, 'webhdfs': <class 'pywb.utils.loaders.WebHDFSLoader'>}
profile_loader = None
static set_profile_loader(src)[source]
class pywb.utils.loaders.HMACCookieMaker(key, name, duration=10)[source]

Bases: object

Utility class to produce signed HMAC digest cookies to be used with each http request

make(extra_id='')[source]
class pywb.utils.loaders.HttpLoader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

load(url, offset, length)[source]

Load a file-like reader over http using range requests and an optional cookie created via a cookie_maker

class pywb.utils.loaders.LocalFileLoader(**kwargs)[source]

Bases: pywb.utils.loaders.PackageLoader

load(url, offset=0, length=-1)[source]

Load a file-like reader from the local file system

class pywb.utils.loaders.PackageLoader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

load(url, offset=0, length=-1)[source]
class pywb.utils.loaders.S3Loader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

load(url, offset, length)[source]
class pywb.utils.loaders.WebHDFSLoader(**kwargs)[source]

Bases: pywb.utils.loaders.HttpLoader

Loader class specifically for loading webhdfs content

HTTP_URL = 'http://{host}/webhdfs/v1{path}?'
load(url, offset, length)[source]

Loads the supplied web hdfs content

Parameters:
  • url (str) – The URL to the web hdfs content to be loaded
  • offset (int|float|double) – The offset of the content to be loaded
  • length (int|float|double) – The length of the content to be loaded
Returns:

The raw response content

pywb.utils.loaders.from_file_url(url)[source]

Convert from file:// url to file path

pywb.utils.loaders.init_yaml_env_vars()[source]

Initializes the yaml parser to be able to set the value of fields from environment variables

Return type:None
pywb.utils.loaders.is_http(filename)[source]
pywb.utils.loaders.load(filename)[source]
pywb.utils.loaders.load_overlay_config(main_env_var, main_default_file='', overlay_env_var='', overlay_file='')[source]
pywb.utils.loaders.load_py_name(string)[source]
pywb.utils.loaders.load_yaml_config(config_file)[source]
pywb.utils.loaders.read_last_line(fh, offset=256)[source]

Read the last line from a seekable file. Start reading offset bytes before the end of the file and double the backward seek distance until a line break is found. If the beginning of the file is reached (no line breaks), return the whole file.
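
The doubling backward seek can be sketched as follows. This is a standalone illustration written from the description, not a copy of the library code; it assumes the file is opened in binary mode, since seeking relative to the end is not supported for text-mode files.

```python
import io


def read_last_line(fh, offset=256):
    """Return the last line of a seekable binary file."""
    fh.seek(0, io.SEEK_END)
    size = fh.tell()
    while offset < size:
        # Look at the final `offset` bytes only.
        fh.seek(-offset, io.SEEK_END)
        lines = fh.readlines()
        if len(lines) > 1:
            # At least one line break was seen, so the last line is complete.
            return lines[-1]
        offset *= 2
    # The window now covers the whole file: read it from the start.
    fh.seek(0)
    lines = fh.readlines()
    return lines[-1] if lines else b""


data = io.BytesIO(b"first\nsecond\nthird\n")
last = read_last_line(data, offset=4)
```

Doubling the window keeps the number of reads logarithmic in the length of the last line while avoiding a full read of large files.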

pywb.utils.loaders.to_file_url(filename)[source]

Convert a filename to a file:// url
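
The round trip between paths and file:// URLs can be sketched with the standard library. The version below is a simplified illustration that assumes absolute POSIX-style paths; it is not the library's implementation and ignores Windows drive letters.

```python
from urllib.parse import quote, unquote, urlsplit


def to_file_url(path):
    """Convert an absolute POSIX path to a file:// URL (sketch)."""
    return "file://" + quote(path)


def from_file_url(url):
    """Convert a file:// URL back to a path; other values pass through."""
    if not url.startswith("file://"):
        return url
    return unquote(urlsplit(url).path)


url = to_file_url("/data/my archive.warc.gz")
path = from_file_url(url)
```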

pywb.utils.memento module
exception pywb.utils.memento.MementoException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.BadRequestException

class pywb.utils.memento.MementoUtils[source]

Bases: object

Creates a memento link string

Parameters:
  • url (str) – A URL
  • type (str) – The rel type
  • dt (str) – The datetime of the URL
  • coll (str|None) – Optional name of a collection
  • memento_format (str|None) – Optional string used to format the supplied URL
Returns:

A memento link string

Return type:

str

classmethod make_timemap(cdx_iter, params)[source]

Creates a memento link string for a timemap

Parameters:
  • cdx (dict) – The cdx object
  • datetime (str|None) – The datetime
  • rel (str) – The rel type
  • end (str) – Optional string appended to the end of the created link string
  • memento_format (str|None) – Optional string used to format the URL
Returns:

A memento link string

Return type:

str

classmethod wrap_timemap_header(url, timegate_url, timemap_url, timemap)[source]
pywb.utils.merge module
pywb.utils.wbexception module
exception pywb.utils.wbexception.AccessException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate an access control violation

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (451)
Return type:int
exception pywb.utils.wbexception.AppPageNotFound(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that a page was not found

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (400)
Return type:int
exception pywb.utils.wbexception.BadRequestException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that the request was bad

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (400)
Return type:int
exception pywb.utils.wbexception.LiveResourceException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that an error was encountered during the retrieval of a live web resource

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (400)
Return type:int
exception pywb.utils.wbexception.NotFoundException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that a resource was not found

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (404)
Return type:int
exception pywb.utils.wbexception.UpstreamException(status_code, url, details)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that an error was encountered from an upstream endpoint

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response
Return type:int
exception pywb.utils.wbexception.WbException(msg=None, url=None)[source]

Bases: Exception

Base class for exceptions raised by Pywb

status()[source]

Returns the HTTP status line for the error response

Returns:The HTTP status line for the error response
Return type:str
status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (500)
Return type:int
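
The pattern used throughout this module, a base exception with a status_code property that subclasses override, can be sketched in isolation. The class names ArchiveError and ResourceMissing are hypothetical stand-ins, and building the status line from http.HTTPStatus is an assumption about how such a line could be produced.

```python
from http import HTTPStatus


class ArchiveError(Exception):
    """Hypothetical base error carrying a message and the affected URL."""

    def __init__(self, msg=None, url=None):
        super().__init__(msg)
        self.msg = msg
        self.url = url

    @property
    def status_code(self):
        # Default for otherwise unclassified errors.
        return 500

    def status(self):
        # HTTP status line such as "500 Internal Server Error".
        code = self.status_code
        return "{0} {1}".format(code, HTTPStatus(code).phrase)


class ResourceMissing(ArchiveError):
    """Hypothetical subclass that only changes the status code."""

    @property
    def status_code(self):
        return 404


try:
    raise ResourceMissing("no capture found", url="http://example.com/")
except ArchiveError as err:
    status_line = err.status()
```

Catching the base class is enough for an error handler to produce the right response, because each subclass supplies its own code.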
Module contents

pywb.warcserver package

Subpackages
pywb.warcserver.index package
Submodules
pywb.warcserver.index.aggregator module
class pywb.warcserver.index.aggregator.BaseAggregator[source]

Bases: object

get_source_list(params)[source]
load_child_source(name, source, params)[source]
load_index(params)[source]
class pywb.warcserver.index.aggregator.BaseDirectoryIndexSource(base_prefix, base_dir='', name='', config=None)[source]

Bases: pywb.warcserver.index.aggregator.BaseAggregator

INDEX_SOURCES = [(('.cdx', '.cdxj'), <class 'pywb.warcserver.index.indexsource.FileIndexSource'>), (('.idx', '.summary'), <class 'pywb.warcserver.index.zipnum.ZipNumIndexSource'>)]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
class pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource(redis_url=None, redis=None, key_template=None, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.BaseAggregator, pywb.warcserver.index.indexsource.RedisIndexSource

class pywb.warcserver.index.aggregator.BaseSourceListAggregator(sources, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.BaseAggregator

get_all_sources(params)[source]
yield_invert_sources(sel_sources, params)[source]
yield_sources(sel_sources, params)[source]
class pywb.warcserver.index.aggregator.CacheDirectoryIndexSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.CacheDirectoryMixin, pywb.warcserver.index.aggregator.DirectoryIndexSource

class pywb.warcserver.index.aggregator.CacheDirectoryMixin(*args, **kwargs)[source]

Bases: object

class pywb.warcserver.index.aggregator.DirectoryIndexSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseDirectoryIndexSource

class pywb.warcserver.index.aggregator.GeventMixin(*args, **kwargs)[source]

Bases: object

DEFAULT_TIMEOUT = 5.0
class pywb.warcserver.index.aggregator.GeventTimeoutAggregator(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.TimeoutMixin, pywb.warcserver.index.aggregator.GeventMixin, pywb.warcserver.index.aggregator.BaseSourceListAggregator

class pywb.warcserver.index.aggregator.RedisMultiKeyIndexSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource

class pywb.warcserver.index.aggregator.SeqAggMixin(*args, **kwargs)[source]

Bases: object

class pywb.warcserver.index.aggregator.SimpleAggregator(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseSourceListAggregator

class pywb.warcserver.index.aggregator.TimeoutMixin(*args, **kwargs)[source]

Bases: object

is_timed_out(name)[source]
pywb.warcserver.index.cdxobject module
exception pywb.warcserver.index.cdxobject.CDXException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (500)
Return type:int
class pywb.warcserver.index.cdxobject.CDXObject(cdxline=b'')[source]

Bases: collections.OrderedDict

Dictionary object representing a parsed CDX line.

CDX_ALT_FIELDS = {'d': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length', 'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url', 's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url'}
CDX_FORMATS = [['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename']]
static conv_to_json(obj, fields=None)[source]

Return the cdx as a JSON dictionary string. If fields is None, the output includes all fields in the order stored; otherwise only the specified fields are included.

Parameters:fields – list of field names to output
is_revisit()[source]

return True if this record is a revisit record.

classmethod json_decode(string)[source]
to_cdxj(fields=None)[source]
to_json(fields=None)[source]
to_text(fields=None)[source]

return plaintext CDX record (includes newline). if fields is None, output will have all fields in the order they are stored.

Parameters:fields – list of field names to output.
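
To make the role of CDX_FORMATS concrete, the sketch below parses a space-delimited CDX line by picking the field list whose length matches the number of values. It is a standalone approximation: only two of the formats listed above are included, the helper names are hypothetical, and treating a record as a revisit when its mime is warc/revisit is an assumption.

```python
from collections import OrderedDict

# Two of the field layouts listed in CDX_FORMATS (7 and 11 fields).
FORMATS = [
    ["urlkey", "timestamp", "url", "mime", "status", "digest", "length"],
    ["urlkey", "timestamp", "url", "mime", "status", "digest",
     "redirect", "robotflags", "length", "offset", "filename"],
]


def parse_cdx_line(line):
    """Parse a plain-text CDX line into an ordered field mapping."""
    values = line.rstrip("\n").split(" ")
    for names in FORMATS:
        if len(names) == len(values):
            return OrderedDict(zip(names, values))
    raise ValueError("unknown CDX format with %d fields" % len(values))


def to_text(cdx, fields=None):
    """Serialize back to text, optionally restricted to some fields."""
    names = fields if fields is not None else list(cdx.keys())
    return " ".join(cdx[name] for name in names) + "\n"


def is_revisit(cdx):
    return cdx.get("mime") == "warc/revisit"


line = ("com,example)/ 20140101000000 http://example.com/ text/html 200 "
        "ABCDEF - - 1234 5678 example.warc.gz\n")
cdx = parse_cdx_line(line)
short = to_text(cdx, ["timestamp", "status"])
```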
class pywb.warcserver.index.cdxobject.IDXObject(idxline)[source]

Bases: collections.OrderedDict

FORMAT = ['urlkey', 'part', 'offset', 'length', 'lineno']
NUM_REQ_FIELDS = 4
to_json(fields=None)[source]
to_text(fields=None)[source]

return plaintext IDX record (including newline).

Parameters:fields – list of field names to output (currently ignored)
pywb.warcserver.index.cdxops module
class pywb.warcserver.index.cdxops.CDXFilter(string)[source]

Bases: object

contains(val)[source]
exact(val)[source]
rx_match(val)[source]
pywb.warcserver.index.cdxops.cdx_clamp(cdx_iter, from_ts, to_ts)[source]

Clamp by start and end ts
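
Clamping can be sketched as a generator over records with a 14-digit timestamp field. This is an illustration only: the function name is hypothetical, records are plain dicts, and shorter bounds are padded with "0" or "9" as a simplification.

```python
def clamp_by_timestamp(cdx_iter, from_ts=None, to_ts=None):
    """Yield only records whose timestamp lies within [from_ts, to_ts]."""
    # Pad partial timestamps (e.g. "2014") to 14 characters so that plain
    # string comparison works. Padding with 9s is a simplification.
    lower = from_ts.ljust(14, "0") if from_ts else None
    upper = to_ts.ljust(14, "9") if to_ts else None
    for cdx in cdx_iter:
        ts = cdx["timestamp"]
        if lower and ts < lower:
            continue
        if upper and ts > upper:
            continue
        yield cdx


records = [
    {"timestamp": "20131231235959"},
    {"timestamp": "20140615120000"},
    {"timestamp": "20150101000000"},
]
clamped = list(clamp_by_timestamp(records, "2014", "2014"))
```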

pywb.warcserver.index.cdxops.cdx_collapse_time_status(cdx_iter, timelen=10)[source]

collapse by timestamp and status code.
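
Collapsing can be pictured as dropping consecutive records that share the same timestamp prefix and status. The sketch below assumes records are dicts with timestamp and status keys; the function name is hypothetical.

```python
def collapse_time_status(cdx_iter, timelen=10):
    """Keep the first record of each run sharing (timestamp prefix, status)."""
    last_token = None
    for cdx in cdx_iter:
        token = (cdx["timestamp"][:timelen], cdx.get("status", ""))
        if token != last_token:
            last_token = token
            yield cdx


records = [
    {"timestamp": "20140101120000", "status": "200"},
    {"timestamp": "20140101120530", "status": "200"},  # same hour, same status
    {"timestamp": "20140101121000", "status": "404"},  # status changed
    {"timestamp": "20140101130000", "status": "404"},  # hour changed
]
collapsed = list(collapse_time_status(records, timelen=10))
```

With timelen=10 the prefix covers year through hour, so at most one record per hour and status survives within each run.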

pywb.warcserver.index.cdxops.cdx_filter(cdx_iter, filter_strings)[source]

Filter CDX records by regex. If a filter has the form field:regex, the filter is applied to cdx[field].
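
A minimal sketch of field:regex filtering is given below. It covers only the field:regex form, requires every filter to match, and anchors the regex at the start of the value with re.match; these choices, and the function name, are assumptions made for the example.

```python
import re


def filter_cdx(cdx_iter, filter_strings):
    """Yield records for which every field:regex filter matches."""
    compiled = []
    for filter_string in filter_strings:
        field, sep, pattern = filter_string.partition(":")
        if not sep:
            raise ValueError("expected field:regex, got %r" % filter_string)
        compiled.append((field, re.compile(pattern)))

    for cdx in cdx_iter:
        if all(rx.match(cdx.get(field, "")) for field, rx in compiled):
            yield cdx


records = [
    {"url": "http://example.com/", "status": "200", "mime": "text/html"},
    {"url": "http://example.com/x", "status": "404", "mime": "text/html"},
    {"url": "http://example.com/i.png", "status": "200", "mime": "image/png"},
]
kept = list(filter_cdx(records, [r"status:2\d\d", r"mime:text/.*"]))
```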

pywb.warcserver.index.cdxops.cdx_limit(cdx_iter, limit)[source]

limit cdx to at most limit.

pywb.warcserver.index.cdxops.cdx_load(sources, query, process=True)[source]

Merge text CDX lines from sources and return an iterator over the filtered, access-checked sequence of CDX objects.

Parameters:
  • sources – iterable for text CDX sources.
  • process – bool, perform processing sorting/filtering/grouping ops
pywb.warcserver.index.cdxops.cdx_resolve_revisits(cdx_iter)[source]

resolve revisits.

This filter adds three fields to each CDX record: orig.length, orig.offset, and orig.filename. For revisit records, these fields hold the corresponding values from the previous non-revisit (original) CDX record. They are all "-" for non-revisit records.
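
The revisit resolution described above can be sketched with plain dicts. Matching an original to a revisit by digest, and detecting revisits by the mime value warc/revisit, are assumptions of this illustration; the function name is hypothetical.

```python
ORIG_FIELDS = ("length", "offset", "filename")


def resolve_revisits(cdx_iter):
    """Add orig.length, orig.offset and orig.filename to every record."""
    originals = {}  # digest -> first non-revisit record seen
    for cdx in cdx_iter:
        cdx = dict(cdx)  # work on a copy, leave the input untouched
        is_revisit = cdx.get("mime") == "warc/revisit"
        digest = cdx.get("digest")
        original = originals.get(digest) if digest else None
        if digest and original is None and not is_revisit:
            originals[digest] = cdx
        for field in ORIG_FIELDS:
            if original is not None and is_revisit:
                cdx["orig." + field] = original.get(field, "-")
            else:
                cdx["orig." + field] = "-"
        yield cdx


records = [
    {"digest": "AAA", "mime": "text/html", "length": "100",
     "offset": "0", "filename": "a.warc.gz"},
    {"digest": "AAA", "mime": "warc/revisit", "length": "50",
     "offset": "900", "filename": "b.warc.gz"},
]
resolved = list(resolve_revisits(records))
```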

pywb.warcserver.index.cdxops.cdx_reverse(cdx_iter, limit)[source]

return cdx records in reverse order.

pywb.warcserver.index.cdxops.cdx_sort_closest(closest, cdx_iter, limit=10)[source]

sort CDXCaptureResult by closest to timestamp.
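
Sorting by closeness to a target timestamp can be sketched with heapq.nsmallest, which keeps only the limit best candidates. The sketch assumes full 14-digit timestamps and dict records; the function name is hypothetical.

```python
import heapq
from datetime import datetime


def _parse(ts):
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sort_closest(closest, cdx_iter, limit=10):
    """Return up to `limit` records, nearest to `closest` first."""
    target = _parse(closest)

    def distance(cdx):
        return abs((_parse(cdx["timestamp"]) - target).total_seconds())

    return heapq.nsmallest(limit, cdx_iter, key=distance)


records = [
    {"timestamp": "20140101000000"},
    {"timestamp": "20150101000000"},
    {"timestamp": "20160101000000"},
]
nearest = sort_closest("20150601000000", records, limit=2)
```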

pywb.warcserver.index.cdxops.cdx_to_json(cdx_iter, fields)[source]
pywb.warcserver.index.cdxops.cdx_to_text(cdx_iter, fields)[source]
pywb.warcserver.index.cdxops.create_merged_cdx_gen(sources, query)[source]

Create a generator which loads and merges cdx streams; ensures cdxs are lazily loaded.
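
Lazy merging of sorted streams can be sketched with heapq.merge and generators. In this illustration each source is a hypothetical callable that opens its stream only when first iterated; the opened list exists purely to make the laziness observable.

```python
import heapq

opened = []  # records which sources have actually been opened


def make_source(name, lines):
    def load():
        opened.append(name)
        return iter(lines)
    return load


def merged_cdx_lines(sources):
    """Merge sorted line streams without opening any source up front."""
    def lazy(source):
        # source() runs only when this generator is first advanced.
        for line in source():
            yield line

    return heapq.merge(*(lazy(source) for source in sources))


sources = [
    make_source("a", ["com,a)/ 2014", "com,c)/ 2014"]),
    make_source("b", ["com,b)/ 2014", "com,d)/ 2014"]),
]
merged = merged_cdx_lines(sources)
opened_before = list(opened)  # still empty: nothing has been read yet
result = list(merged)
```

Each input stream must already be sorted; heapq.merge only interleaves them and does not sort.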

pywb.warcserver.index.cdxops.make_obj_iter(text_iter, query)[source]

convert text cdx stream to CDXObject/IDXObject.

pywb.warcserver.index.cdxops.process_cdx(cdx_iter, query)[source]
pywb.warcserver.index.fuzzymatcher module
class pywb.warcserver.index.fuzzymatcher.FuzzyMatcher(filename=None)[source]

Bases: object

DEFAULT_FILTER = ['urlkey:{0}']
DEFAULT_MATCH_TYPE = 'prefix'
DEFAULT_REPLACE_AFTER = '?'
DEFAULT_RE_TYPE = 'search'
FUZZY_SKIP_PARAMS = ('alt_url', 'reverse', 'closest', 'end_key', 'url', 'matchType', 'filter')
get_ext(url)[source]
get_fuzzy_iter(cdx_iter, index_source, params)[source]
get_fuzzy_match(urlkey, url, params)[source]
make_query_match_regex(params_list)[source]
make_regex(config)[source]
match_general_fuzzy_query(url, urlkey, cdx, rx_cache)[source]
parse_fuzzy_rule(rule)[source]

Parse rules using all the different supported forms

class pywb.warcserver.index.fuzzymatcher.FuzzyRule(url_prefix, regex, replace_after, filter_str, match_type, re_type)

Bases: tuple

filter_str

Alias for field number 3

match_type

Alias for field number 4

re_type

Alias for field number 5

regex

Alias for field number 1

replace_after

Alias for field number 2

url_prefix

Alias for field number 0
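
The "Alias for field number" entries above are the standard wording for named tuple fields: each attribute is a named view of one tuple position. The sketch below rebuilds an equivalent tuple type with collections.namedtuple; the field values are made-up examples.

```python
from collections import namedtuple

# Same field order as documented above (positions 0 through 5).
FuzzyRule = namedtuple(
    "FuzzyRule",
    ["url_prefix", "regex", "replace_after", "filter_str", "match_type", "re_type"],
)

rule = FuzzyRule(
    url_prefix="com,example)/",   # illustrative value
    regex=r"[?&](id=\d+)",        # illustrative value
    replace_after="?",
    filter_str=["urlkey:{0}"],
    match_type="prefix",
    re_type="search",
)

# Attribute access and positional access refer to the same slot.
same_slot = rule[3] is rule.filter_str
```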

pywb.warcserver.index.indexsource module
class pywb.warcserver.index.indexsource.BaseIndexSource[source]

Bases: object

WAYBACK_ORIG_SUFFIX = '{timestamp}id_/{url}'
load_index(params)[source]
logger = <Logger warcserver (WARNING)>
class pywb.warcserver.index.indexsource.FileIndexSource(filename, config=None)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

CDX_EXT = ('.cdx', '.cdxj')
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.LiveIndexSource[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

get_load_url(params)[source]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.MementoIndexSource(timegate_url, timemap_url, replay_url)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

classmethod from_timegate_url(timegate_url, path='link')[source]
handle_timegate(params, timestamp)[source]
handle_timemap(params)[source]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.RedisIndexSource(redis_url=None, redis=None, key_template=None, **kwargs)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
load_key_index(key_template, params)[source]
static parse_redis_url(redis_url, redis_=None)[source]
scan_keys(match_templ, params, member_key=None)[source]
class pywb.warcserver.index.indexsource.RemoteIndexSource(api_url, replay_url, url_field='load_url', closest_limit=100)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

CDX_MATCH_RX = re.compile('^cdxj?\\+(?P<url>https?\\:.*)')
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.WBMementoIndexSource(timegate_url, timemap_url, replay_url)[source]

Bases: pywb.warcserver.index.indexsource.MementoIndexSource

WAYBACK_ORIG_SUFFIX = '{timestamp}im_/{url}'
WBURL_MATCH = re.compile('([0-9]{0,14})?(?:\\w+_)?/{0,3}(.*)')
handle_timegate(params, timestamp)[source]
class pywb.warcserver.index.indexsource.XmlQueryIndexSource(query_api_url)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

An index source class for XML files

EXACT_QUERY = 'type:urlquery url:'
PREFIX_QUERY = 'type:prefixquery url:'
convert_to_cdx(item)[source]

Converts the etree element to a CDX object

Parameters:item – The etree element to be converted
Returns:The CDXObject representing the supplied etree element object
Return type:CDXObject
gettext(item, name)[source]

Returns the value of the supplied name

Parameters:
  • item – The etree element to be converted
  • name – The name of the field to get its value for
Returns:

The value of the field

Return type:

str

classmethod init_from_config(config)[source]

Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied dictionary contains the type key equal to xmlquery

Parameters:config (dict[str, str]) – The configuration dictionary
Returns:The initialized XmlQueryIndexSource or None
Return type:XmlQueryIndexSource|None
classmethod init_from_string(value)[source]

Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied value starts with xmlquery+

Parameters:value (str) – The string by which to initialize the XmlQueryIndexSource
Returns:The initialized XmlQueryIndexSource or None
Return type:XmlQueryIndexSource|None
load_index(params)[source]

Loads the xml query index based on the supplied params

Parameters:params (dict[str, str]) – The query params
Returns:A list or generator of cdx objects
Raises:
  • NotFoundException – If the query url is not found or the results of the query return no cdx entries
  • BadRequestException – If the match type is not exact or prefix

prefix_query_iter(items)[source]

Returns an iterator yielding the results of performing a prefix query

Parameters:items – The xml entry elements representing a query
Returns:An iterator yielding the results of the query
pywb.warcserver.index.query module
class pywb.warcserver.index.query.CDXQuery(params)[source]

Bases: object

allow_fuzzy
closest
collapse_time
custom_ops
end_key
fields
filters
from_ts
is_exact
key
limit
match_type
output
page
page_count
page_size
resolve_revisits
reverse
secondary_index_only
set_key(key, end_key)[source]
to_ts
url
urlencode()[source]
pywb.warcserver.index.zipnum module
class pywb.warcserver.index.zipnum.AlwaysJsonResponse[source]

Bases: dict

to_cdxj(*args)[source]
to_json(*args)[source]
to_text(*args)[source]
class pywb.warcserver.index.zipnum.LocMapResolver(loc_summary, loc_filename)[source]

Bases: object

Lookup shards based on a file mapping shard name to one or more paths. The entries are tab delimited.

load_loc()[source]
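
A tab-delimited location map of the kind described above can be parsed in a few lines. The sketch is standalone and the shard names and paths are invented; skipping lines that lack a path is an assumption.

```python
def parse_loc_map(lines):
    """Map each shard name to the list of paths given on its line."""
    loc_map = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # no path given for this shard
        loc_map[parts[0]] = parts[1:]
    return loc_map


loc_lines = [
    "part-00000\t/data/cdx/part-00000.gz\n",
    "part-00001\t/data/cdx/part-00001.gz\thttp://mirror.example/part-00001.gz\n",
    "broken-line\n",
]
loc_map = parse_loc_map(loc_lines)
```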
class pywb.warcserver.index.zipnum.LocPrefixResolver(loc_summary, loc_config)[source]

Bases: object

Use a prefix lookup, where the prefix can either be a fixed string or can be a regex replacement of the index summary path

load_loc()[source]
class pywb.warcserver.index.zipnum.ZipBlocks(part, offset, length, count)[source]

Bases: object

class pywb.warcserver.index.zipnum.ZipNumIndexSource(summary, config=None)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

DEFAULT_MAX_BLOCKS = 10
DEFAULT_RELOAD_INTERVAL = 10
IDX_EXT = ('.idx', '.summary')
block_to_cdx_iter(blocks, ranges, query)[source]
compute_page_range(reader, query)[source]
idx_to_cdx(idx_iter, query)[source]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_blocks(location, blocks, ranges, query)[source]

Load one or more blocks of compressed cdx lines, return a line iterator which decompresses and returns one line at a time, bounded by query.key and query.end_key
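
The idea of reading independently compressed blocks by offset and length can be sketched as follows. This is a simplified illustration, not the library's code: each block is assumed to be a separate gzip member, the whole block is decompressed at once rather than streamed, and the helper names are hypothetical.

```python
import gzip
import zlib


def build_shard(blocks):
    """Concatenate gzip members and record (offset, length) for each."""
    data = b""
    index = []
    for lines in blocks:
        member = gzip.compress(b"".join(lines))
        index.append((len(data), len(member)))
        data += member
    return data, index


def iter_block_lines(data, offset, length, key, end_key):
    """Decompress one block and yield lines with key <= line < end_key."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header.
    raw = zlib.decompress(data[offset:offset + length], 16 + zlib.MAX_WBITS)
    for line in raw.splitlines():
        if line < key:
            continue
        if line >= end_key:
            break  # lines are sorted, nothing further can match
        yield line


shard, index = build_shard([[b"a 1\n", b"b 2\n"], [b"c 3\n", b"d 4\n"]])
matches = [
    line
    for offset, length in index
    for line in iter_block_lines(shard, offset, length, b"b", b"d")
]
```

Because every block is its own gzip member, a reader can jump straight to a block without decompressing anything before it.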

load_index(params)[source]
search_by_line_num(reader, line)[source]
Module contents
pywb.warcserver.resource package
Submodules
pywb.warcserver.resource.blockrecordloader module
class pywb.warcserver.resource.blockrecordloader.BlockArcWarcRecordLoader(loader=None, cookie_maker=None, block_size=16384, *args, **kwargs)[source]

Bases: warcio.recordloader.ArcWarcRecordLoader

load(url, offset, length, no_record_parse=False)[source]

Load a single record from given url at offset with length and parse as either warc or arc record

pywb.warcserver.resource.pathresolvers module
class pywb.warcserver.resource.pathresolvers.DefaultResolverMixin[source]

Bases: object

classmethod make_best_resolver(path)[source]
classmethod make_resolvers(paths)[source]
class pywb.warcserver.resource.pathresolvers.PathIndexResolver(pathindex_file)[source]

Bases: object

class pywb.warcserver.resource.pathresolvers.PrefixResolver(template)[source]

Bases: object

resolve_coll(path, source)[source]
class pywb.warcserver.resource.pathresolvers.RedisResolver(redis_url=None, redis=None, key_template=None, **kwargs)[source]

Bases: pywb.warcserver.index.indexsource.RedisIndexSource

pywb.warcserver.resource.resolvingloader module
class pywb.warcserver.resource.resolvingloader.ResolvingLoader(path_resolvers, record_loader=None, no_record_parse=False)[source]

Bases: object

EMPTY_DIGEST = '3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ'
MISSING_REVISIT_MSG = 'Original for revisit record could not be loaded'
load_cdx_for_dupe(url, timestamp, digest, cdx_loader)[source]

If a cdx_server is available, return response from server, otherwise empty list

load_headers_and_payload(cdx, failed_files, cdx_loader)[source]

Resolve headers and payload for a given capture. In the simple case, headers and payload are in the same record. In the case of revisit records, the payload and headers may be in different records.

If the original has already been found, look up the original using the orig.* fields in the cdx dict. Otherwise, call _load_different_url_payload() to get the cdx index from a different url to find the original record.

pywb.warcserver.resource.responseloader module
class pywb.warcserver.resource.responseloader.BaseLoader[source]

Bases: object

raise_on_self_redirect(params, cdx, status_code, location_url)[source]

Check if the response is a 3xx redirect to the same url. If so, reject this capture to avoid causing a redirect loop.
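
A self-redirect check can be sketched as a pure function. The normalization used here (lowercasing, ignoring the scheme and a trailing slash) and the exclusion of 304 responses are assumptions of this illustration; the function name is hypothetical, and where the library raises an exception this sketch simply returns a boolean.

```python
def is_self_redirect(request_url, status_code, location_url):
    """Return True if a 3xx response redirects back to the requested URL."""
    if not (300 <= status_code < 400) or status_code == 304:
        return False
    if not location_url:
        return False

    def normalize(url):
        # Compare without scheme and trailing slash, case-insensitively.
        return url.lower().split("://", 1)[-1].rstrip("/")

    return normalize(request_url) == normalize(location_url)


loops = is_self_redirect("http://example.com/", 301, "https://example.com")
```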

class pywb.warcserver.resource.responseloader.LiveWebLoader(forward_proxy_prefix=None, adapter=None)[source]

Bases: pywb.warcserver.resource.responseloader.BaseLoader

SKIP_HEADERS = ('link', 'memento-datetime', 'content-location', 'x-archive')
UNREWRITE_HEADERS = ('location', 'content-location')
VIDEO_MIMES = ('application/x-mpegURL', 'application/vnd.apple.mpegurl', 'application/dash+xml')
get_custom_metadata(content_type, dt)[source]
load_resource(cdx, params)[source]
unrewrite_header(cdx, value)[source]
class pywb.warcserver.resource.responseloader.VideoLoader[source]

Bases: pywb.warcserver.resource.responseloader.BaseLoader

CONTENT_TYPE = 'application/vnd.youtube-dl_formats+json'
load_resource(cdx, params)[source]
class pywb.warcserver.resource.responseloader.WARCPathLoader(paths, cdx_source)[source]

Bases: pywb.warcserver.resource.pathresolvers.DefaultResolverMixin, pywb.warcserver.resource.responseloader.BaseLoader

load_resource(cdx, params)[source]
Module contents
Submodules
pywb.warcserver.access_checker module
class pywb.warcserver.access_checker.AccessChecker(access_source, default_access='allow', embargo=None)[source]

Bases: object

An access checker class

EXACT_SUFFIX = '###'
EXACT_SUFFIX_B = b'###'
EXACT_SUFFIX_SEARCH_B = b'####'
check_embargo(url, ts)[source]
create_access_aggregator(source_files)[source]

Creates a new AccessRulesAggregator using the supplied list of access control file names

Parameters:source_files (list[str]) – The list of access control file names
Returns:The created AccessRulesAggregator
Return type:AccessRulesAggregator
create_access_source(filename)[source]

Creates a new access source for the supplied filename.

If the filename is for a directory, a CacheDirectoryAccessSource instance is returned; otherwise a FileAccessIndexSource instance is returned.

Parameters:filename (str) – The name of a file/directory
Returns:An instance of CacheDirectoryAccessSource or FileAccessIndexSource, depending on whether the supplied filename is for a directory or a file
Return type:CacheDirectoryAccessSource|FileAccessIndexSource
Raises:Exception – Indicates an invalid access source was supplied

find_access_rule(url, ts=None, urlkey=None, collection=None, acl_user=None)[source]

Attempts to find the access control rule for the supplied URL; otherwise returns the default rule

Parameters:
  • url (str) – The URL for the rule to be found
  • ts (str|None) – A timestamp (not used)
  • urlkey (str|None) – The access control url key
  • collection (str|None) – The collection, if any
  • acl_user (str|None) – The access control user, if any
Returns:

The access control rule for the supplied URL if one exists, otherwise the default rule

Return type:

CDXObject

parse_embargo(embargo)[source]
wrap_iter(cdx_iter, acl_user)[source]

Wraps the supplied cdx iter and yields cdx objects that contain the access control results for the cdx object being yielded

Parameters:
  • cdx_iter – The cdx object iterator to be wrapped
  • acl_user (str) – The user associated with this request (optional)
Returns:

The wrapped cdx object iterator

class pywb.warcserver.access_checker.AccessRulesAggregator(*args, **kwargs)[source]

Bases: pywb.warcserver.access_checker.ReverseMergeMixin, pywb.warcserver.index.aggregator.SimpleAggregator

An Aggregator specific to access control

class pywb.warcserver.access_checker.CacheDirectoryAccessSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.CacheDirectoryMixin, pywb.warcserver.access_checker.DirectoryAccessSource

A cache directory index source specific to access control

class pywb.warcserver.access_checker.DirectoryAccessSource(*args, **kwargs)[source]

Bases: pywb.warcserver.access_checker.ReverseMergeMixin, pywb.warcserver.index.aggregator.DirectoryIndexSource

A directory index source specific to access control

INDEX_SOURCES = [('.aclj', <class 'pywb.warcserver.access_checker.FileAccessIndexSource'>)]
class pywb.warcserver.access_checker.FileAccessIndexSource(filename, config=None)[source]

Bases: pywb.warcserver.index.indexsource.FileIndexSource

An Index Source class specific to access control lists

static rev_cmp(a, b)[source]

Performs a comparison between two items using the algorithm of the removed builtin cmp

Parameters:
  • a – A value to be compared
  • b – A value to be compared
Returns:

The result of the comparison

Return type:

int
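
Python 3 removed the builtin cmp, so comparison functions of this kind are usually rebuilt from the < and > operators and passed to sorting code via functools.cmp_to_key. The sketch below is an illustration; that rev_cmp inverts the usual order is an assumption based on its name and on the reverse merge mixin it accompanies.

```python
from functools import cmp_to_key


def cmp(a, b):
    """Replacement for the removed builtin: -1, 0 or 1."""
    return (a > b) - (a < b)


def rev_cmp(a, b):
    """Reversed comparison: larger items sort first."""
    return cmp(b, a)


ordered = sorted(["b", "a", "c"], key=cmp_to_key(rev_cmp))
```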

class pywb.warcserver.access_checker.ReverseMergeMixin[source]

Bases: object

A mixin that provides reverse merge functionality

pywb.warcserver.amf module
class pywb.warcserver.amf.Amf[source]

Bases: object

static get_representation(request_object, max_calls=500)[source]
pywb.warcserver.basewarcserver module
class pywb.warcserver.basewarcserver.BaseWarcServer(*args, **kwargs)[source]

Bases: object

add_route(path, handler, path_param_name='', default_value='')[source]
get_query_dict(environ)[source]
json_encode(res, out_headers)[source]
send_error(errs, start_response, message='No Resource Found', status=404)[source]
pywb.warcserver.handlers module
class pywb.warcserver.handlers.DefaultResourceHandler(index_source, warc_paths='', forward_proxy_prefix='', **kwargs)[source]

Bases: pywb.warcserver.handlers.ResourceHandler

class pywb.warcserver.handlers.HandlerSeq(handlers)[source]

Bases: object

get_supported_modes()[source]
class pywb.warcserver.handlers.IndexHandler(index_source, opts=None, *args, **kwargs)[source]

Bases: object

DEF_OUTPUT = 'cdxj'
OUTPUTS = {'cdxj': <function to_cdxj>, 'json': <function to_json>, 'link': <function to_link>, 'text': <function to_text>}
get_supported_modes()[source]
class pywb.warcserver.handlers.ResourceHandler(index_source, resource_loaders, **kwargs)[source]

Bases: pywb.warcserver.handlers.IndexHandler

get_supported_modes()[source]
pywb.warcserver.handlers.to_cdxj(cdx_iter, fields, params)[source]
pywb.warcserver.handlers.to_json(cdx_iter, fields, params)[source]
pywb.warcserver.handlers.to_text(cdx_iter, fields, params)[source]
pywb.warcserver.http module
class pywb.warcserver.http.DefaultAdapters[source]

Bases: object

live_adapter = <pywb.warcserver.http.PywbHttpAdapter object>
remote_adapter = <pywb.warcserver.http.PywbHttpAdapter object>
class pywb.warcserver.http.PywbHttpAdapter(cert_reqs='CERT_NONE', ca_cert_dir=None, **init_kwargs)[source]

Bases: requests.adapters.HTTPAdapter

This adapter exists to restore the default behavior of urllib3 < 1.25.x, which was to not verify SSL certs, until a better solution is found.

init_poolmanager(connections, maxsize, block=False, **pool_kwargs)[source]

Initializes a urllib3 PoolManager.

This method should not be called from user code, and is only exposed for use when subclassing the HTTPAdapter.

Parameters:
  • connections – The number of urllib3 connection pools to cache.
  • maxsize – The maximum number of connections to save in the pool.
  • block – Block when no free connections are available.
  • pool_kwargs – Extra keyword arguments used to initialize the Pool Manager.
proxy_manager_for(proxy, **proxy_kwargs)[source]

Return urllib3 ProxyManager for the given proxy.

This method should not be called from user code, and is only exposed for use when subclassing the HTTPAdapter.

Parameters:
  • proxy – The proxy to return a urllib3 ProxyManager for.
  • proxy_kwargs – Extra keyword arguments used to configure the Proxy Manager.
Returns:

ProxyManager

Return type:

urllib3.ProxyManager

pywb.warcserver.inputrequest module
class pywb.warcserver.inputrequest.DirectWSGIInputRequest(env)[source]

Bases: object

get_full_request_uri()[source]
get_referrer()[source]
get_req_body()[source]
get_req_headers()[source]
get_req_method()[source]
get_req_protocol()[source]
include_method_query(url)[source]
reconstruct_request(url=None)[source]
class pywb.warcserver.inputrequest.MethodQueryCanonicalizer(method, mime, length, stream, buffered_stream=None, environ=None)[source]

Bases: object

MAX_QUERY_LENGTH = 4096
amf_parse(string, warn_on_error)[source]
append_query(url)[source]
json_parse(string)[source]
class pywb.warcserver.inputrequest.POSTInputRequest(env)[source]

Bases: pywb.warcserver.inputrequest.DirectWSGIInputRequest

get_full_request_uri()[source]
get_req_headers()[source]
get_req_method()[source]
get_req_protocol()[source]
pywb.warcserver.upstreamindexsource module
class pywb.warcserver.upstreamindexsource.UpstreamAggIndexSource(base_url)[source]

Bases: pywb.warcserver.index.indexsource.RemoteIndexSource

class pywb.warcserver.upstreamindexsource.UpstreamMementoIndexSource(proxy_url='{url}')[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

load_index(params)[source]
static upstream_resource(base_url)[source]
pywb.warcserver.warcserver module
class pywb.warcserver.warcserver.WarcServer(config_file='./config.yaml', custom_config=None)[source]

Bases: pywb.warcserver.basewarcserver.BaseWarcServer

AUTO_COLL_TEMPL = '{coll}'
DEFAULT_DEDUP_URL = 'redis://localhost:6379/0/pywb:{coll}:cdxj'
get_coll_config(name)[source]
init_paths(name, abs_path=None)[source]
init_sequence(coll_name, seq_config)[source]
list_dynamic_routes()[source]
list_fixed_routes()[source]
load_auto_colls()[source]
load_coll(name, coll_config)[source]
load_colls()[source]
pywb.warcserver.warcserver.init_index_agg(source_configs, use_gevent=False, timeout=0, source_list=None)[source]
pywb.warcserver.warcserver.init_index_source(value, source_list=None)[source]
pywb.warcserver.warcserver.register_source(source_cls, end=False)[source]
Module contents

Submodules

pywb.version module

Module contents

pywb.get_test_dir()[source]

Indices and tables