Webrecorder pywb documentation!¶
The Webrecorder (pywb) toolkit is a full-featured, advanced web archiving capture and replay framework for Python.
It provides command-line tools and an extensible framework for high-fidelity web archive access and creation.
A subset of features provides the basic functionality of a “Wayback Machine”.
Usage¶
New Features¶
The 2.0 release of pywb is a significant overhaul from the previous iteration, and introduces many new features, including:
- Dynamic multi-collection configuration system with no-restart updates.
- New Recording Mode capability to create new web archives from the live web or from other archives.
- Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
- Support for Memento API aggregation and fallback chains for querying multiple remote and local archival sources.
- HTTP/S Proxy Mode with customizable certificate authority for proxy mode recording and replay.
- Flexible rewriting system with pluggable rewriters for different content-types.
- Significantly improved Client-Side Rewriting System (wombat.js) to handle most modern web sites.
- Improved ‘calendar’ query UI with incremental loading, grouping results by year and month, and updated replay banner.
- New in 2.4: Extensible Customization Guide for modifying all aspects of the UI.
- New in 2.4: Robust Embargo and Access Control system for blocking or excluding URLs, by prefix or by exact match.
Getting Started¶
At its core, pywb includes a fully featured web archive replay system, sometimes known as ‘wayback machine’, to provide the ability to replay, or view, archived web content in the browser.
If you have existing web archive (WARC or legacy ARC) files, here’s how to make them accessible using pywb.
(If not, see Creating a Web Archive for instructions on how to easily create a WARC file right away.)
By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.
pywb ships with several Command-Line Apps. The following two are useful to get started:
- wb-manager is a command line tool for managing common collection operations.
- wayback (pywb) starts a web server that provides access to web archives.
(For more details, run wb-manager -h and wayback -h.)
For example, to install pywb and create a new collection “my-web-archive” in ./collections/my-web-archive:
pip install pywb
wb-manager init my-web-archive
wb-manager add my-web-archive <path/to/my_warc.warc.gz>
wayback
Point your browser to http://localhost:8080/my-web-archive/<url>/ where <url> is a URL you previously recorded into your WARC/ARC file.
If all worked well, you should see your archived version of <url>. Congrats, you are now running your own web archive!
Getting Started Using Docker¶
pywb also comes with an official production-ready Dockerfile, and several automatically built Docker images.
The following Docker image tags are updated automatically with pywb updates on GitHub:
- webrecorder/pywb – corresponds to the latest release of pywb and the master branch on GitHub.
- webrecorder/pywb:develop – corresponds to the develop branch of pywb on GitHub and contains the latest development work.
- webrecorder/pywb:<VERSION> – starting with pywb 2.2, each incremental release corresponds to a Docker image with tag <VERSION>.
Using a specific version, e.g. webrecorder/pywb:<VERSION>, is recommended for production. Versioned Docker images are available for pywb releases >= 2.2.
All releases of pywb are listed in the Python Package Index for pywb.
All of the currently available Docker image tags are listed on Docker Hub.
For the examples below, the latest webrecorder/pywb image is used.
To add WARCs in Docker, the source directory should be added as a volume.
By default, pywb runs out of the /webarchive directory, which should generally be mounted as a volume to store the data on the host outside the container. pywb will not change permissions of the data mounted at /webarchive and will instead attempt to run as the same user that owns the directory.
For example, given a WARC at /path/to/my_warc.warc.gz and a pywb data directory of /pywb-data, the following will add the WARC to a new collection and start pywb:
docker pull webrecorder/pywb
docker run -e INIT_COLLECTION=my-web-archive -v /pywb-data:/webarchive \
    -v /path/to:/source webrecorder/pywb wb-manager add my-web-archive /source/my_warc.warc.gz
docker run -p 8080:8080 -v /pywb-data/:/webarchive webrecorder/pywb wayback
This example is equivalent to the non-Docker example above.
Setting INIT_COLLECTION=my-web-archive results in automatic collection initialization via wb-manager init my-web-archive.
The wayback command is launched on port 8080 and mapped to the same port on the local host.
If the wayback command is not specified, the Docker container launches with the uwsgi server recommended for production deployment.
See Deployment for more info.
Using Existing Web Archive Collections¶
Existing archives of WARC/ARC files can be used with pywb with a minimal amount of setup. By using wb-manager add, WARC/ARC files will automatically be placed in the collection archive directory and indexed.
By default, wb-manager places new collections in the collections/<coll name> subdirectory of the current working directory. To specify a different root directory, use wb-manager -d <dir>. Other options can be set in the config file.
If you have a large number of existing CDX index files, pywb will be able to read them as well after running through a simple conversion process.
It is recommended that any index files be converted to the latest CDXJ format, which can be done by running:
wb-manager cdx-convert <path/to/cdx>
To set up a collection with existing ARC/WARCs and CDX index files, you can:
- Run wb-manager init <coll name>. This will initialize all the required collection directories.
- Copy any archive files (WARCs and ARCs) to collections/<coll name>/archive/
- Copy any existing cdx indexes to collections/<coll name>/indexes/
- Run wb-manager cdx-convert collections/<coll name>/indexes/. This is strongly recommended, as it will ensure that any legacy indexes are updated to the latest CDXJ format.
This will fully migrate your archive and indexes into the collection.
Any new WARCs added with wb-manager add will be indexed and added to the existing collection.
Dynamic Collections and Automatic Indexing¶
Collections created via wb-manager init are fully dynamic, and new collections can be added without restarting pywb.
When adding WARCs with wb-manager add, the indexes are also updated automatically. No restart is required, and the content is instantly available for replay.
For more complex use cases, pywb also includes a background indexer that checks the archives directory and automatically updates the indexes if any files have changed or were added.
(Of course, indexing will take some time if adding a large amount of data all at once, but it is quite useful for smaller archive updates.)
To enable auto-indexing, run wayback -a, or wayback -a --auto-interval 30 to adjust the frequency of auto-indexing (default is 30 seconds).
Creating a Web Archive¶
Using Webrecorder¶
If you do not have a web archive to test, one easy way to create one is to use Webrecorder.
After recording, you can click Stop and then click Download Collection to receive a WARC (.warc.gz) file.
You can then use this WARC with pywb.
Using pywb Recorder¶
The core recording functionality in Webrecorder is also part of pywb. If you want to create a WARC locally, this can be done by directly recording into your pywb collection:
- Create a collection: wb-manager init my-web-archive (if you haven’t already created a web archive collection)
- Run: wayback --record --live -a --auto-interval 10
- Point your browser to http://localhost:8080/my-web-archive/record/<url>
For example, to record http://example.com/, visit http://localhost:8080/my-web-archive/record/http://example.com/
In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded URL will be accessible for replay, e.g.:
http://localhost:8080/my-web-archive/http://example.com/
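The record and replay URLs above follow a simple pattern, which the following minimal sketch spells out. The helper names record_url and replay_url are hypothetical and not part of pywb; the base address assumes pywb is running locally on port 8080 as in the example.

```python
def record_url(coll, url, base="http://localhost:8080"):
    """URL that asks pywb to load `url` and record it into collection `coll`."""
    return f"{base}/{coll}/record/{url}"


def replay_url(coll, url, base="http://localhost:8080"):
    """URL that replays an archived copy of `url` from collection `coll`."""
    return f"{base}/{coll}/{url}"


print(record_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/"))
```

Visiting the first printed URL records the page; the second becomes usable once the next indexing pass has run.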
HTTP/S Proxy Mode Access¶
It is also possible to access any pywb collection via HTTP/S proxy mode, providing possibly better replay without client-side url rewriting.
At this time, a single collection for proxy mode access can be specified with the --proxy flag.
For example, wayback --proxy my-web-archive will start pywb and enable proxy mode access.
You can then set the proxy host and port in your browser's proxy settings to localhost:8080. Loading any URL, e.g. http://example.com/, should then load the latest copy from the my-web-archive collection.
See HTTP/S Proxy Mode section for additional configuration details.
Deployment¶
For testing, development and small production loads, the default wayback command line may be sufficient.
pywb uses the gevent coroutine library, and the default app will support many concurrent connections in a single process.
For larger scale production deployments, running with the uwsgi server application is recommended. The uwsgi.ini script provided can be used to launch pywb with uwsgi. uwsgi can be scaled to multiple processes to support the necessary workload, and pywb must be run with the Gevent Loop Engine. Nginx or Apache can be used as an additional frontend for uwsgi.
Although uwsgi does not provide a way to pass pywb command-line options, all command-line options can alternatively be configured via config.yaml. See Configuring the Web Archive for more info on available configuration options.
Docker Deployment¶
The default pywb Docker image uses the production-ready uwsgi server.
The following will run pywb in Docker directly on port 80:
docker run -p 80:8080 -v /webarchive-data/:/webarchive webrecorder/pywb
To run pywb in Docker behind a local nginx (as shown below), port 8081 should also be mapped:
docker run -p 8081:8081 -v /webarchive-data/:/webarchive webrecorder/pywb
See Getting Started Using Docker for more info on using pywb with Docker.
Sample Nginx Configuration¶
The following nginx configuration snippet can be used to deploy pywb with uwsgi and nginx.
The configuration assumes pywb is running the uwsgi protocol on port 8081, as is the default when running uwsgi uwsgi.ini.
The location /static block allows nginx to serve static files, and is an optional optimization.
This configuration can be updated to use HTTPS and run on 443; the UWSGI_SCHEME param ensures that pywb will use the correct scheme when rewriting.
See the Nginx Docs for more details on how to configure Nginx.
server {
    listen 80;

    location /static {
        alias /path/to/pywb/static;
    }

    location / {
        uwsgi_pass localhost:8081;
        include uwsgi_params;
        uwsgi_param UWSGI_SCHEME $scheme;
    }
}
Sample Apache Configuration¶
The recommended Apache configuration is to use pywb with mod_proxy and mod_proxy_uwsgi.
To enable these, ensure that your httpd.conf includes:
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_uwsgi_module modules/mod_proxy_uwsgi.so
Then, in your config, simply include:
<VirtualHost *:80>
ProxyPass / uwsgi://pywb:8081/
</VirtualHost>
The configuration assumes uwsgi is started with uwsgi uwsgi.ini.
Configuring Access Control Header¶
The Embargo and Access Control system allows users to be granted different access settings based on the value of an ACL header, X-pywb-ACL-user.
The header can be set via Nginx or Apache to grant custom access privileges based on IP address, password, or other combination of rules.
For example, to set the value of the header to staff if the request comes from designated local IP ranges (127.0.0.1, 192.168.1.0/24), the following settings can be added to the configs:
For Nginx:
geo $acl_user {
    # ensure user is set to empty by default
    default "";
    # optional: add IP ranges to allow privileged access
    127.0.0.1 "staff";
    192.168.1.0/24 "staff";
}
...
location /wayback/ {
...
uwsgi_param HTTP_X_PYWB_ACL_USER $acl_user;
}
For Apache:
<If "-R '192.168.1.0/24' || -R '127.0.0.1'">
    RequestHeader set X-Pywb-ACL-User staff
</If>
# ensure header is cleared if no match
<Else>
    RequestHeader set X-Pywb-ACL-User ""
</Else>
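The rule both server configs express (send staff for requests from the designated ranges, and an empty value otherwise) can be sketched in Python for reference. This is only an illustration of the matching logic using the standard ipaddress module; the function name acl_user_for is hypothetical and is not part of pywb, Nginx or Apache.

```python
import ipaddress

# Ranges taken from the prose above; replace with your own networks.
STAFF_NETWORKS = [
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def acl_user_for(remote_addr):
    """Return the X-pywb-ACL-user value for a client IP ("" when nothing matches)."""
    addr = ipaddress.ip_address(remote_addr)
    return "staff" if any(addr in net for net in STAFF_NETWORKS) else ""


print(repr(acl_user_for("192.168.1.20")))
print(repr(acl_user_for("10.0.0.5")))
```

Defaulting to the empty string mirrors the configs above: the header is always overwritten, so a client cannot supply its own value.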
Running on Subdirectory Path¶
To run pywb on a subdirectory, rather than at the root of the web server, the recommended configuration is to adjust the uwsgi.ini to include the subdirectory.
For example, to deploy pywb under the /wayback subdirectory, the uwsgi.ini can be configured as follows:
mount = /wayback=./pywb/apps/wayback.py
manage-script-name = true
Deployment Examples¶
The sample-deploy directory includes working Docker Compose examples for deploying pywb with Nginx and Apache on the /wayback subdirectory.
See:
- Docker Compose Nginx for sample Nginx config.
- Docker Compose Apache for sample Apache config.
- uwsgi_subdir.ini for example subdirectory uwsgi config.
Configuring the Web Archive¶
pywb offers an extensible YAML-based configuration format via a main config.yaml at the root of each web archive.
Framed vs Frameless Replay¶
pywb supports several modes for serving archived web content.
With framed replay, the archived content is loaded into an iframe, and a top frame UI provides info and metadata.
In this mode, the top frame URL is, for example, http://my-archive.example.com/<coll name>/http://example.com/ while the actual content is served at http://my-archive.example.com/<coll name>/mp_/http://example.com/
With frameless replay, the archived content is loaded directly, and a banner UI is injected into the page.
In this mode, the content is served directly at http://my-archive.example.com/<coll name>/http://example.com/
For security reasons, we recommend running pywb in framed mode, because a malicious site could tamper with the banner.
However, for certain situations, frameless replay may be appropriate.
To disable framed replay add:
framed_replay: false
to your config.yaml.
Note: pywb also supports HTTP/S proxy mode which requires additional setup. See HTTP/S Proxy Mode for more details.
Directory Structure¶
The pywb system is designed to automatically access and manage web archive collections that follow a defined directory structure. The directory structure can be fully customized and “special” collections can be defined outside the structure as well.
The default directory structure for a web archive is as follows:
+-- config.yaml (optional)
|
+-- templates (optional)
|
+-- static (optional)
|
+-- collections
    |
    +-- <coll name>
        |
        +-- archive
        |     |
        |     +-- (WARC or ARC files here)
        |
        +-- indexes
        |     |
        |     +-- (CDXJ index files here)
        |
        +-- acl
        |     |
        |     +-- (.aclj access control files)
        |
        +-- templates
        |     |
        |     +-- (optional html templates here)
        |
        +-- static
              |
              +-- (optional custom static assets here)
If running with default settings, the config.yaml can be omitted.
It is possible to configure these directory paths in the config.yaml. The following are some of the implicit default settings which can be customized:
collections_root: collections
archive_paths: archive
index_paths: indexes
(For a complete list of defaults, see the pywb/default_config.yaml file for reference.)
Index Paths¶
The index_paths key defines the subdirectory for index files (usually CDXJ) and determines the contents of each archive collection.
The index files usually contain a pointer to a WARC file, but not the absolute path.
Archive Paths¶
The archive_paths key indicates how pywb will resolve WARC files listed in the index.
For example, it is possible to configure multiple archive paths:
archive_paths:
  - archive
  - http://remote-backup.example.com/collections/
When resolving an example.warc.gz, pywb will then check (in order):
- First, collections/<coll name>/example.warc.gz
- Then, http://remote-backup.example.com/collections/<coll name>/example.warc.gz (if first lookup unsuccessful)
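The ordered lookup described above amounts to a "first match wins" search over candidate locations. The sketch below illustrates only that idea; it is not pywb's resolver. The function resolve_warc, the exists callback and the my-coll collection name are all hypothetical, and the prefixes are copied from the two lookups listed above.

```python
def resolve_warc(filename, prefixes, exists):
    """Return the first location under `prefixes` for which `exists` is true, else None."""
    for prefix in prefixes:
        location = prefix.rstrip("/") + "/" + filename
        if exists(location):
            return location
    return None


# Candidate locations, in the order given in the text above.
PREFIXES = [
    "collections/my-coll",
    "http://remote-backup.example.com/collections/my-coll",
]

# Pretend only the remote copy is available, so the first lookup fails.
available = {"http://remote-backup.example.com/collections/my-coll/example.warc.gz"}
print(resolve_warc("example.warc.gz", PREFIXES, available.__contains__))
```

Passing the existence check in as a callback keeps the sketch independent of how local files and remote URLs are actually probed.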
Access Controls¶
With pywb 2.4, pywb includes an extensible Embargo and Access Control system.
By default, the access control files are stored in the acl directory of each collection.
UI Customizations¶
The templates directory supports custom Jinja templates to allow customizing the UI.
See Customization Guide for more details on available options.
Special and Custom Collections¶
While pywb can automatically detect collections following the above directory structure, it also provides the option to fully declare Custom User-Defined Collections explicitly.
In addition, several “special” collection definitions are possible.
All custom defined collections are placed under the collections key in config.yaml.
Live Web Collection¶
The live web collection proxies all data to the live web, and can be defined as follows:
collections:
  live: $live
This configures the /live/ route to point to the live web.
(As a shortcut, wayback --live adds this collection via the command line without modifying the config.yaml.)
This collection can be useful for testing, or even more powerful when combined with recording.
SOCKS Proxy for Live Web¶
pywb can be configured to use a SOCKS5 proxy when connecting to the live web. This allows pywb to be used with Tor and other services that require a SOCKS proxy.
If the SOCKS_HOST and optionally SOCKS_PORT environment variables are set, pywb will attempt to route all live web traffic through the SOCKS5 proxy.
Note that, at this time, it is not possible to configure a SOCKS proxy per pywb collection – all live web traffic will use the SOCKS proxy if enabled.
Auto “All” Aggregate Collection¶
The aggregate “all” collection automatically aggregates data from all collections in the collections directory:
collections:
  all: $all
Accessing /all/<url> will cause an aggregate lookup within the collections directory.
Note: It is not (yet) possible to exclude collections from the auto-all collection, although “special” collections are not included.
Collection Provenance¶
When using the auto-all collection, it is possible to determine the original collection of each resource by looking at the Link header metadata if the Memento API is enabled. The header will include the extra collection field, specifying the collection:
Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:53:27 GMT"; collection="coll-1"
For example, if two collections coll-1 and coll-2 contain http://example.com/, loading the timemap for /all/timemap/link/http://example.com/ might look as follows:
<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
<http://example.com/>; rel="original",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
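To show how a client might read the collection field out of such a timemap, here is a small parsing sketch using only the standard library. It assumes every parameter value is double-quoted, as in the sample above, and it is not a complete link-format parser. The names parse_links and memento_collections are hypothetical.

```python
import re

# One <url> followed by any number of ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_links(text):
    """Parse link-format text into a list of (url, params) tuples."""
    return [(url, dict(PARAM_RE.findall(raw))) for url, raw in LINK_RE.findall(text)]


def memento_collections(text):
    """Map each memento datetime to the collection it was served from."""
    return {
        params["datetime"]: params.get("collection")
        for _url, params in parse_links(text)
        if params.get("rel") == "memento"
    }


TIMEMAP = '''<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
<http://example.com/>; rel="original",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
'''

print(memento_collections(TIMEMAP))
```

Matching on quoted values rather than splitting on commas matters here, because the datetime values themselves contain commas.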
Remote Memento Collection¶
It’s also possible to define remote archives as easily as local collections.
For example, the following defines a collection /ia/ which accesses Internet Archive’s Wayback Machine as a single collection:
collections:
  ia: memento+https://web.archive.org/web/
Many additional options, including memento “aggregation” and fallback chains, are possible using the Warcserver configuration syntax. See Warcserver Index Configuration for more info.
Custom User-Defined Collections¶
The collection definition syntax allows for explicitly setting the index, archive paths and all other templates, per collection, for example:
collections:
  custom:
    index: ./path/to/indexes
    resource: ./some/other/path/to/archive/
    query_html: ./path/to/templates/query.html
If possible, it is recommended to use the default directory structure to avoid per-collection configuration. However, this configuration allows for using pywb with existing collections that have unique path requirements.
Root Collection¶
It is also possible to define a “root” collection, for example, accessible at http://my-archive.example.com/<url>
Such a collection must be defined explicitly using $root as the collection name:
collections:
  $root:
    index: ./path/to/indexes
    resource: ./path/to/archive/
Note: When a root collection is set, no other collections are currently accessible; they are ignored.
Recording Mode¶
Recording mode enables pywb to support recording into any automatically managed collection, using the /<coll>/record/<url> path. Accessing this path will result in pywb writing new WARCs directly into the collection <coll>.
To enable recording from the live web, simply run wayback --record.
To further customize recording mode, add the recorder block to the root of config.yaml.
The command-line option is equivalent to adding recorder: live.
The full set of configurable options (with their default settings) is as follows:
recorder:
  source_coll: live
  rollover_size: 100000000
  rollover_idle_secs: 600
  filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz
  source_filter: live
  enable_put_custom_record: false
The required source_coll setting specifies the source collection from which to load content that will be recorded.
Most likely this will be the Live Web Collection, which should also be defined.
However, it could be any other collection, allowing for “extraction” from other collections or remote web archives.
Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.
The other options are optional and may be omitted. The rollover_size and rollover_idle_secs options specify the maximum size and maximum idle time, respectively, after which a new WARC file is created.
For example, a new WARC will be created if more than 100MB are recorded, or after 600 seconds have elapsed between subsequent requests. This keeps the WARC size manageable and prevents files from being left open for long periods of time.
The filename_template option specifies the naming convention for WARC files, and allows a timestamp, current hostname, and random string to be inserted into the filename.
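As a rough illustration of how these three settings interact, the sketch below models the rollover decision and the filename expansion. It is not pywb's implementation: the function names are hypothetical, the exact boundary handling (strictly greater than) is an assumption, and filling the placeholders with str.format is an assumption about how the template is expanded.

```python
def should_rollover(current_size, idle_secs,
                    rollover_size=100000000, rollover_idle_secs=600):
    """True when the current WARC is too large or has been idle too long."""
    return current_size > rollover_size or idle_secs > rollover_idle_secs


def warc_filename(template, timestamp, hostname, random):
    """Fill the {timestamp}, {hostname} and {random} placeholders of the template."""
    return template.format(timestamp=timestamp, hostname=hostname, random=random)


TEMPLATE = "my-warc-{timestamp}-{hostname}-{random}.warc.gz"

print(should_rollover(current_size=120000000, idle_secs=5))
print(should_rollover(current_size=4096, idle_secs=5))
print(warc_filename(TEMPLATE, "20170102030000000000", "archive.example.com", "QRTGER"))
```

Either condition on its own triggers a new file, which is why a rarely used recorder still closes its WARCs after the idle timeout.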
When using an aggregate collection or sequential fallback collection as the source, recording can be limited to pages fetched from a certain child collection by specifying source_filter as a regex matching the name of the sub-collection.
For example, if recording with the above config into a collection called my-coll, the user would access http://my-archive.example.com/my-coll/record/http://example.com/, which would load http://example.com/ from the live web and write the request and response to a WARC named something like:
./collections/my-coll/archive/my-warc-20170102030000000000-archive.example.com-QRTGER.warc.gz
If running with auto indexing, the WARC will also get automatically indexed and become available for replay after the index interval.
As a shortcut, recorder: live can also be used to specify only the source_coll option.
Dedup Options for Recording¶
By default, recording mode will record every URL.
Starting with pywb 2.5.0, it is possible to configure pywb to either write revisit records or skip duplicate URLs altogether using the dedup_policy key.
Using deduplication requires a Redis instance, which will keep track of the index for deduplication in a sorted-set key.
The default Redis key used is redis://localhost:6379/0/pywb:{coll}:cdxj where {coll} is replaced with the current collection id.
The key can be customized using the dedup_index_url field in the recorder config. The URL must start with redis://, as that is the only supported dedup index at this time.
- To skip duplicate URLs, set dedup_policy: skip. With this setting, only one instance of any URL will be recorded.
- To write revisit records, set dedup_policy: revisit. With this setting, WARC revisit records will be written when a duplicate URL is detected and has the same digest as a previous response.
- To keep all duplicates, use dedup_policy: keep. All WARC records are written to disk normally as with no policy; however, the Redis dedup index is still populated, which allows for instant replay (see below).
- To disable the dedup system, set dedup_policy: none or omit the field. This is the default, and no Redis is required.
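The four policies can be summarized as a small decision function. This is a sketch of the behavior described in the list above, not pywb code; the function name dedup_action and the labels it returns are hypothetical, and the Redis lookup that would supply url_seen and same_digest is left out.

```python
def dedup_action(policy, url_seen, same_digest):
    """Return 'response', 'revisit' or 'skip' for an incoming capture."""
    if policy in (None, "none", "keep"):
        # Every capture is written in full ('keep' additionally fills the Redis index).
        return "response"
    if policy == "skip":
        return "skip" if url_seen else "response"
    if policy == "revisit":
        return "revisit" if (url_seen and same_digest) else "response"
    raise ValueError(f"unknown dedup_policy: {policy!r}")


# Default index key, with {coll} filled in for a collection named my-coll.
DEFAULT_DEDUP_INDEX = "redis://localhost:6379/0/pywb:{coll}:cdxj"
print(DEFAULT_DEDUP_INDEX.format(coll="my-coll"))

print(dedup_action("revisit", url_seen=True, same_digest=True))
print(dedup_action("revisit", url_seen=True, same_digest=False))
print(dedup_action("skip", url_seen=True, same_digest=False))
```

Note that under the revisit policy a changed response for a known URL is still written in full, since a revisit record only makes sense when the digest matches.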
As another option, pywb can add an aggressive Cache-Control header to force the browser to cache all responses on a page.
This feature is still experimental, but can be enabled via the cache: always setting.
For example, the following will enable revisit records to be written using the given Redis URL, and also enable aggressive caching when recording:
recorder:
  ...
  cache: always
  dedup_policy: revisit
  dedup_index_url: 'redis://localhost:6379/0/pywb:{coll}:cdxj' # default when omitted
Instant Replay (experimental)¶
Starting with pywb 2.5.0, when the dedup_policy is set, pywb can do ‘instant replay’ after recording, without having to regenerate the CDX or wait for it to be updated with auto-indexing.
When any dedup_policy is set, pywb can also access the dedup Redis index, along with any on-disk CDX, when replaying the collection.
This feature is still experimental but should generally work. Additional options for working with the Redis Dedup index will be added in the future.
Adding Custom Resource Records¶
pywb now also supports adding custom data to a WARC resource record. This can be used to add custom resources, such as screenshots, logs, error messages, etc., that are not normally captured as part of recording, but are still useful to store in WARCs.
To add a custom resource, simply call PUT /<coll>/record with the data to be added as the request body and the type of the data specified as the content-type. The URL can be specified as a query param.
For example, adding a custom record file:///my-custom-resource containing Some Custom Data can be done using curl as follows:
curl -XPUT "localhost:8080/my-web-archive/record?url=file:///my-custom-resource" --data "Some Custom Data"
This feature is only available if enable_put_custom_record: true is set in the recorder config.
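The same request can be built from Python with the standard library, which may be convenient when uploading screenshots or logs from a script. This is a sketch under a few assumptions: the helper name build_put_record_request is hypothetical, the text/plain content type is chosen for illustration, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def build_put_record_request(coll, record_url, body, content_type,
                             base="http://localhost:8080"):
    """Build a PUT /<coll>/record request carrying `body` as a custom resource."""
    # Keep ':' and '/' unescaped so the query matches the curl example above.
    query = urllib.parse.urlencode({"url": record_url}, safe=":/")
    return urllib.request.Request(
        f"{base}/{coll}/record?{query}",
        data=body,
        method="PUT",
        headers={"Content-Type": content_type},
    )


req = build_put_record_request(
    "my-web-archive", "file:///my-custom-resource", b"Some Custom Data", "text/plain"
)
print(req.get_method(), req.full_url)
# To send it to a running pywb instance: urllib.request.urlopen(req)
```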
Auto-Fetch Responsive Recording¶
When recording (or browsing the ‘live’ collection), pywb has an option to inspect and automatically fetch additional resources, including:
- Any urls found in <img srcset="..."> attributes.
- Any urls within CSS @media rules.
This allows pywb to better capture responsive pages, where all the resources are not directly loaded by the browser, but may be needed for future replay.
The detected urls are loaded in the background using a web worker while the user is browsing the page.
To enable this functionality, add --enable-auto-fetch to the command-line or enable_auto_fetch: true to the root of the config.yaml.
The auto-fetch system is provided as part of the Client-Side Rewriting System (wombat.js).
Auto-Indexing Mode¶
If auto-indexing is enabled, pywb will update the indexes stored in the indexes directory whenever files are added or modified in the archive directory. Auto-indexing can be enabled via the autoindex option set to the check interval in seconds:
autoindex: 30
This specifies that the archive directories should be checked every 30 seconds. Auto-indexing is useful when WARCs are being appended to or added to the archive by an external operation.
If a user is manually adding a new WARC to the collection, wb-manager add <coll> <path/to/warc> is recommended, as this will add the WARC and perform a one-time reindex of the collection, without the need for auto-indexing.
Note: Auto-indexing does not support deletion or removal of WARCs from the archive directory.
Since this is not a common operation for web archives, a WARC must be manually removed from the collections/<coll>/archive/ directory, and the collection index can then be regenerated from the remaining WARCs by running wb-manager reindex <coll>.
The auto-indexing mode can also be enabled via command-line by running wayback -a, or wayback -a --auto-interval 30 to also set the interval.
(If running pywb with uWSGI in multi-process mode, the auto-indexing is only run in a single worker to avoid race conditions and duplicate indexing.)
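The general idea of a periodic check for added or modified files can be sketched as follows. This is an illustration of the polling concept only and makes no claim about how pywb's indexer detects changes; the function changed_warcs and the use of file size and modification time as a change signature are assumptions.

```python
import os


def changed_warcs(archive_dir, seen):
    """Return archive files whose size or mtime differ from `seen`, updating `seen`."""
    changed = []
    for name in sorted(os.listdir(archive_dir)):
        if not name.endswith((".warc", ".warc.gz", ".arc", ".arc.gz")):
            continue
        path = os.path.join(archive_dir, name)
        stat = os.stat(path)
        signature = (stat.st_size, stat.st_mtime)
        if seen.get(path) != signature:
            seen[path] = signature
            changed.append(path)
    return changed


# A caller would invoke this once per interval and index whatever is returned:
#   seen = {}
#   to_index = changed_warcs("collections/my-coll/archive", seen)
```

Like the auto-indexer described above, this sketch only reacts to new or modified files; a file that disappears is never reported.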
Client-Side Rewriting System (wombat.js)¶
In addition to server-side rewriting, pywb includes a Javascript client-rewriting system.
This system intercepts network traffic and emulates the correct JS environment expected by a replayed page.
The auto-fetch system is also implemented as part of wombat.
Wombat was integrated into pywb up to 2.2.x. Starting with 2.3, wombat has been spun off into its own standalone JS module.
For more information on wombat.js and client-side rewriting, see the wombat README.
HTTP/S Proxy Mode¶
In addition to “url rewriting prefix mode” (the default), pywb can also act as a full-fledged HTTP and HTTPS proxy, allowing any browser or client supporting HTTP and HTTPS proxy to access web archives through the proxy.
Proxy mode can provide access to a single collection at a time, e.g. instead of accessing http://localhost:8080/my-coll/2017/http://example.com/, the user enters http://example.com/ and is served content from the my-coll collection.
As a result, the collection and timestamp must be specified separately.
Configuring HTTP Proxy¶
At this time, pywb requires the collection to be configured at setup time (though collection switching will be added soon).
To enable proxy mode, the collection can be specified by running wayback --proxy my-coll or by adding to the config:
proxy:
  coll: my-coll
For HTTP proxy access, this is all that is needed to use the proxy. If pywb is running on port 8080 on localhost, the following curl command should provide proxy access: curl -x "localhost:8080" http://example.com/
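For scripted access, the curl command above has a direct standard-library equivalent in Python. The sketch below only builds an opener that routes plain HTTP requests through the proxy; the helper name make_proxy_opener is hypothetical, and the address assumes pywb is running in proxy mode on localhost:8080.

```python
import urllib.request


def make_proxy_opener(proxy="http://localhost:8080"):
    """Return an opener that sends plain HTTP requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener()
# With pywb running in proxy mode, the archived page could then be fetched with:
#   html = opener.open("http://example.com/").read()
```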
Default Timestamp¶
The timestamp can also be optionally specified by running wayback --proxy my-coll --proxy-default-timestamp 20181226010203 or by specifying the config:
proxy:
  coll: my-coll
  default_timestamp: "20181226010203"
The ISO date format, e.g. 2018-12-26T01:02:03, is also accepted.
If the timestamp is omitted, proxy mode replay defaults to the latest capture.
The timestamp can also be dynamically overridden per-request using the Proxy Mode Memento API.
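The two accepted forms denote the same instant, as this short conversion sketch shows. It uses only the standard datetime module; the function name to_timestamp14 is hypothetical and the conversion is shown purely to relate the two notations, since pywb accepts either directly.

```python
from datetime import datetime


def to_timestamp14(iso_date):
    """Convert an ISO date such as 2018-12-26T01:02:03 to the 14-digit form."""
    return datetime.strptime(iso_date, "%Y-%m-%dT%H:%M:%S").strftime("%Y%m%d%H%M%S")


print(to_timestamp14("2018-12-26T01:02:03"))  # 20181226010203
```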
Proxy Mode Rewriting¶
By default, pywb performs minimal html rewriting to insert a default banner into the proxy mode replay to make it clear to users that they are viewing replayed content.
Custom rewriting code from the head_insert.html template may also be inserted into <head>.
Checking for {% if env.pywb_proxy_magic %} allows for inserting custom content for proxy mode only.
However, content rewriting in proxy mode is not necessary and can be disabled completely by customizing the proxy block in the config.
This may be essential when proxying content to older browsers, for instance.
To disable all content rewriting/modifications from pywb via the head_insert.html template, add enable_content_rewrite: false
If set to false, this setting overrides and disables all the other options.
To disable just the banner, add enable_banner: false
To add a light version of rewriting (for overriding Date, random number generators), add enable_wombat: true
If Auto-Fetch Responsive Recording is enabled in the global config, enable_wombat: true is implied, unless enable_content_rewrite: false is also set (as it will disable the auto-fetch system from being injected into the page).
If omitted, the defaults for these options are:
proxy:
  enable_banner: true
  enable_wombat: false
  enable_content_rewrite: true
For example, to enable wombat rewriting but disable the banner, use the config:
proxy:
  enable_banner: false
  enable_wombat: true
To disable all content rewriting:
proxy:
  enable_content_rewrite: false
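The precedence between these three options and the auto-fetch setting can be summarized in a few lines. This sketch models the rules as described in this section, not pywb's internal code; the function effective_proxy_options is hypothetical and takes the proxy block as a plain dict.

```python
DEFAULTS = {
    "enable_banner": True,
    "enable_wombat": False,
    "enable_content_rewrite": True,
}


def effective_proxy_options(proxy_config=None, enable_auto_fetch=False):
    """Apply defaults, then the precedence rules described above."""
    config = proxy_config or {}
    opts = {key: config.get(key, default) for key, default in DEFAULTS.items()}
    if not opts["enable_content_rewrite"]:
        # Disabling content rewriting overrides and disables everything else.
        return {key: False for key in DEFAULTS}
    if enable_auto_fetch:
        # Auto-fetch is delivered through wombat, so wombat is implied.
        opts["enable_wombat"] = True
    return opts


print(effective_proxy_options({"enable_banner": False, "enable_wombat": True}))
print(effective_proxy_options({"enable_content_rewrite": False}, enable_auto_fetch=True))
```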
Proxy Recording¶
The proxy can additionally be set to recording mode, equivalent to accessing the /<my-coll>/record/ path, by adding recording: true, as follows:
proxy:
  coll: my-coll
  recording: true
By default, proxy recording will use the live collection if not otherwise configured.
See Recording Mode for the full set of configurable recording options.
HTTPS Proxy and pywb Certificate Authority¶
For HTTPS proxy access, pywb provides its own Certificate Authority and dynamically generates certificates for each host and signs the responses with these certificates. By design, this allows pywb to act as “man-in-the-middle” serving archived copies of a given site.
However, the pywb Certificate Authority (CA) certificate will need to be accepted by the browser. The CA cert can be downloaded from pywb directly using the special download paths. Recommended set up for using the proxy is as follows:
1. Start pywb with proxy mode enabled (with the --proxy option or with a proxy: option block present in the config). (The CA root certificate will be auto-created when first starting pywb with proxy mode if it doesn’t exist.)
2. Configure the browser proxy settings host and port, for example localhost and 8080 (if running locally).
3. Download the CA:
   - For most browsers, use the PEM format: http://wsgiprox/download/pem
   - For Windows, use the PKCS12 format: http://wsgiprox/download/p12
4. You may need to agree to “Trust this CA” to identify websites.
The auto-generated pywb CA, created at ./proxy-certs/pywb-ca.pem
may also be added to a keystore directly.
The location of the CA file and the CA name displayed can be changed by setting the ca_file_cache
and ca_name
proxy options, respectively.
The following are all the available proxy options – only coll
is required:
proxy:
  coll: my-coll
  ca_name: pywb HTTPS Proxy CA
  ca_file_cache: ./proxy-certs/pywb-ca.pem
  recording: false
  enable_banner: true
  enable_content_rewrite: true
  default_timestamp: ''
The HTTP/S functionality is provided by the separate wsgiprox
utility which provides HTTP/S proxy routing
to any WSGI application.
Using wsgiprox, pywb sets FrontEndApp.proxy_route_request()
as the proxy resolver, and this function returns the full collection path that pywb uses to route each proxy request. The default implementation returns a path to the fixed collection coll
and injects content into <head>
if enable_content_rewrite
is true. The default banner is inserted if enable_banner
is set to true.
Extensions to pywb can override proxy_route_request()
to provide custom handling, such as setting the collection dynamically or based on external data sources.
See the wsgiprox README for additional details on setting a proxy resolver.
For more information on custom certificate authority (CA) installation, the mitmproxy certificate page provides a good overview for installing a custom CA on different platforms.
Compatibility: Redirects, Memento, Flash video overrides¶
Exact Timestamp Redirects¶
By default, pywb does not redirect urls to the ‘canonical’ representation of a url with the exact timestamp.
For example, when requesting /my-coll/2017js_/http://example.com/example.js
but the actual timestamp of the resource is 2017010203000400
,
there is not a redirect to /my-coll/2017010203000400js_/http://example.com/example.js
.
Instead, this ‘canonical’ url is returned with the response in the Content-Location
header.
(This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.)
However, if the classic redirect behavior is desired, it can be enabled by adding:
redirect_to_exact: true
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other “wayback machine” implementations.
Memento Protocol¶
Memento API support is enabled by default, and works with no-timestamp-redirect and classic redirect behaviors.
However, Memento API support can be disabled by adding:
enable_memento: false
Flash Video Override¶
A custom system to override Flash video with a custom download via youtube-dl
and replay with a custom player was enabled in previous versions of pywb.
However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based.
The system is seldom used now that most video is HTML5 based.
For these reasons, this functionality, previously enabled by including the script /static/vidrw.js
, is disabled by default.
To enable the previous behavior, add to config:
enable_flash_video_rewrite: true
The system may be revamped in the future and enabled by default, but for now, it is provided “as-is” for compatibility reasons.
Verify SSL-Certificates¶
By default, SSL-Certificates of websites are not verified. To enable verification, add the following to the config:
certificates:
  cert_reqs: 'CERT_REQUIRED'
  ca_cert_dir: '/etc/ssl/certs'
ca_cert_dir
can optionally point to a directory containing the CA certificates that you trust. Most Linux distributions provide CA certificates via a package called ca-certificates
.
If omitted, the default system CA used by Python is used.
Embargo and Access Control¶
The embargo system allows for date-based rules to block access to captures based on their capture dates.
The access controls system provides additional URL-based rules to allow, block or exclude access to specific URL prefixes or exact URLs.
The embargo and access control rules are configured per collection.
Embargo Settings¶
The embargo system allows restricting access to all URLs within a collection based on the timestamp of each URL. Access to these resources is ‘embargoed’ until the date range is adjusted or the time interval passes.
The embargo can be used to disallow access to captures based on the following criteria:
- Captures before an exact date
- Captures after an exact date
- Captures newer than a time interval
- Captures older than a time interval
Embargo Before/After Exact Date¶
To block access to all captures before or after a specific date, use the before
or after
embargo blocks
with a specific timestamp.
For example, the following blocks access to all URLs captured before 2020-12-26 in the collection embargo-before
:
embargo-before:
  index_paths: ...
  archive_paths: ...
  embargo:
    before: '20201226'
The following blocks access to all URLs captured on or after 2020-12-26 in collection embargo-after
:
embargo-after:
  index_paths: ...
  archive_paths: ...
  embargo:
    after: '20201226'
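The boundary behavior of the two settings (strictly before versus on or after) can be illustrated with a small Python sketch. It assumes 14-digit capture timestamps (YYYYMMDDhhmmss) and pads shorter configured values with zeros; the helper names are hypothetical and pywb's own comparison logic may differ in detail.

```python
def pad_timestamp(ts):
    """Pad a partial timestamp such as '20201226' to 14 digits."""
    return ts.ljust(14, "0")


def is_embargoed_by_date(capture_ts, before=None, after=None):
    """Return True if a capture falls under a before/after embargo.

    `before` blocks captures strictly earlier than the given date;
    `after` blocks captures on or after the given date.
    """
    ts = pad_timestamp(capture_ts)
    if before is not None and ts < pad_timestamp(before):
        return True
    if after is not None and ts >= pad_timestamp(after):
        return True
    return False
```

With `before='20201226'`, a capture at `20201225235959` is embargoed while one at `20201226000000` is not; with `after='20201226'`, the situation is reversed.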
Embargo By Time Interval¶
The embargo can also be set for a relative time interval, consisting of years, months, weeks and/or days.
For example, the following blocks access to all URLs newer than 1 year:
embargo-newer:
  ...
  embargo:
    newer:
      years: 1
The following blocks access to all URLs older than 1 year, 2 months, 3 weeks and 4 days:
embargo-older:
  ...
  embargo:
    older:
      years: 1
      months: 2
      weeks: 3
      days: 4
Any combination of years, months, weeks and days can be used (as long as at least one is provided) for the newer
or older
embargo settings.
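As a rough illustration of how a relative interval translates into a cutoff date, the following Python sketch subtracts the configured years, months, weeks and days from the current time and compares a capture against the result. The function names are hypothetical, and the month arithmetic (clamping to the last day of a shorter month) is an assumption of this sketch rather than a statement about pywb's internals.

```python
import calendar
from datetime import datetime, timedelta


def subtract_interval(now, years=0, months=0, weeks=0, days=0):
    """Return `now` shifted back by the given interval."""
    # Shift by whole months first, clamping the day to the target month's length.
    total_months = now.year * 12 + (now.month - 1) - (years * 12 + months)
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    day = min(now.day, calendar.monthrange(year, month)[1])
    shifted = now.replace(year=year, month=month, day=day)
    # Weeks and days are fixed-length, so timedelta handles them directly.
    return shifted - timedelta(weeks=weeks, days=days)


def is_embargoed_by_interval(capture_time, now, newer=None, older=None):
    """`newer` / `older` are dicts such as {"years": 1, "months": 2}."""
    if newer is not None and capture_time > subtract_interval(now, **newer):
        return True
    if older is not None and capture_time < subtract_interval(now, **older):
        return True
    return False
```

For example, with `now = datetime(2021, 6, 15)` and `older={"years": 1, "months": 2, "weeks": 3, "days": 4}`, the cutoff in this sketch is 2020-03-21, so a capture from 2020-03-20 would be embargoed and one from 2020-03-22 would not.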
Access Control Settings¶
Access Control Files (.aclj)¶
URL-based access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order. To determine the best match, a binary search is used (similar to CDXJ lookup) and then the best match is found forward.
An .aclj file may look as follows:
org,httpbin)/anything/something - {"access": "allow", "url": "http://httpbin.org/anything/something"}
org,httpbin)/anything - {"access": "exclude", "url": "http://httpbin.org/anything"}
org,httpbin)/ - {"access": "block", "url": "httpbin.org/"}
com, - {"access": "allow", "url": "com,"}
Each JSON entry contains an access
field and the original url
field that was used to convert to the SURT (if any).
The JSON entry may also contain a user
field, as explained below.
The prefix consists of a SURT key and a -
(currently reserved for a timestamp/date range field to be added later).
Given these rules, a user would:
- be allowed to visit http://httpbin.org/anything/something (allow)
- but would receive an ‘access blocked’ error message when viewing http://httpbin.org/ (block)
- would receive a 404 not found error when viewing http://httpbin.org/anything (exclude)
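The matching behavior for the example rules can be approximated with the Python sketch below. It is a simplification under stated assumptions: `simple_surt` is a hypothetical helper that only reverses the host and appends the path (real SURT canonicalization does more), and a linear scan over the reverse-sorted rules stands in for the binary search that pywb uses.

```python
from urllib.parse import urlsplit

# (SURT prefix, access) pairs taken from the example .aclj file above.
RULES = [
    ("org,httpbin)/anything/something", "allow"),
    ("org,httpbin)/anything", "exclude"),
    ("org,httpbin)/", "block"),
    ("com,", "allow"),
]


def simple_surt(url):
    """Very simplified SURT key: reversed host, then ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    return reversed_host + ")" + (parts.path or "/")


def match_access(url, rules=RULES, default="allow"):
    """Return the access type of the most specific matching prefix rule."""
    key = simple_surt(url)
    # In reverse alphabetical order, longer (more specific) prefixes come first.
    for prefix, access in sorted(rules, reverse=True):
        if key.startswith(prefix):
            return access
    return default
```

Under these assumptions, `match_access("http://httpbin.org/anything/something")` returns `allow`, `match_access("http://httpbin.org/")` returns `block`, and `match_access("http://httpbin.org/anything")` returns `exclude`.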
Access Types: allow, block, exclude, allow_ignore_embargo¶
The available access types are as follows:
- exclude - when matched, results are excluded from the index, as if they do not exist. User will receive a 404.
- block - when matched, results are not excluded from the index, but access to the actual content is blocked. User will see a 451.
- allow - full access to the index and the resource, but may be overridden by embargo.
- allow_ignore_embargo - full access to the index and resource, overriding any embargo settings.
The difference between exclude
and block
is that when blocked, the user can be notified that access is blocked, while
with exclude, no trace of the resource is presented to the user.
The use of allow
is useful to provide access to more specific resources within a broader block/exclude rule, while allow_ignore_embargo
can be used to override any embargo settings.
If both are present, the embargo restrictions are checked first and take precedence, unless the allow_ignore_embargo
option is used
to override the embargo.
User-Based Access Controls¶
The access control rules can further be customized by specifying different permissions for different ‘users’. Since pywb does not have a user system,
a special header, X-Pywb-ACL-User
can be used to indicate a specific user.
This setting is designed to allow a more privileged user to access additional content or override an embargo.
For example, the following access control settings restrict access to https://example.com/restricted/
by default, but allow access for the staff
user:
com,example)/restricted - {"access": "allow", "user": "staff"}
com,example)/restricted - {"access": "block"}
Combined with the embargo settings, this can also be used to override the embargo for internal organizational users, while keeping the embargo for general access:
com,example)/restricted - {"access": "allow_ignore_embargo", "user": "staff"}
com,example)/restricted - {"access": "allow"}
To make this work, pywb must be running behind an Apache or Nginx system that is configured to set X-Pywb-ACL-User: staff
based on certain settings.
For example, this header may be set based on IP range, or based on password authentication.
Further examples of how to set this header will be provided in the deployments section.
Note: Do not use the user-based rules without configuring proper authentication on an Apache or Nginx frontend to set or remove this header, otherwise the ‘X-Pywb-ACL-User’ can easily be faked.
See the Configuring Access Control Header section in Usage for examples on how to configure this header.
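To make the "set or remove" requirement concrete, here is a minimal Python sketch of the same idea expressed as WSGI middleware rather than an Apache or Nginx rule. This is a swapped-in illustration, not the documented deployment method: the class name `AclUserMiddleware` and the IP-range policy are hypothetical, and it relies only on the standard WSGI convention that the `X-Pywb-ACL-User` request header appears in the environ as `HTTP_X_PYWB_ACL_USER`.

```python
import ipaddress

# Standard WSGI environ key for the X-Pywb-ACL-User request header.
ACL_HEADER_KEY = "HTTP_X_PYWB_ACL_USER"


class AclUserMiddleware:
    """Strip any client-supplied ACL user and set it from a trusted IP range."""

    def __init__(self, app, staff_networks):
        self.app = app
        self.staff_networks = [ipaddress.ip_network(n) for n in staff_networks]

    def __call__(self, environ, start_response):
        # Always remove the incoming header so it cannot be faked by clients.
        environ.pop(ACL_HEADER_KEY, None)

        addr = ipaddress.ip_address(environ.get("REMOTE_ADDR", "0.0.0.0"))
        if any(addr in net for net in self.staff_networks):
            environ[ACL_HEADER_KEY] = "staff"
        return self.app(environ, start_response)
```

Note that behind a reverse proxy `REMOTE_ADDR` is usually the proxy's own address, so in that setup the decision belongs in the frontend server, as the note above recommends.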
Access Error Messages¶
The special error code 451 is used to indicate that a resource has been blocked (access setting block
).
The error.html template contains a special message for this access and can be customized further.
By design, resources that are exclude
-ed simply appear as 404 not found and no special error is provided.
Managing Access Lists via Command-Line¶
The .aclj files need not ever be added or edited manually.
The pywb wb-manager
utility has been extended to provide tools for adding, removing and checking access control rules.
The access rules are written to <collection>/acl/access-rules.aclj
for a given collection <collection>
for automatic collections.
For example, to add the first line to an ACL file access.aclj
, one could run:
wb-manager acl add <collection> http://httpbin.org/anything/something exclude
The URL supplied can be a URL or a SURT prefix. If a SURT is supplied, it is used as is:
wb-manager acl add <collection> com, allow
A specific user for user-based rules can also be specified, for example to add allow_ignore_embargo
for user staff
only, run:
wb-manager acl add <collection> http://httpbin.org/anything/something allow_ignore_embargo -u staff
By default, access control rules apply to a prefix of a given URL or SURT.
To have the rule apply only to the exact match, use:
wb-manager acl add <collection> http://httpbin.org/anything/something allow --exact-match
Rules added with and without the --exact-match
flag are considered distinct rules, and can be added
and removed separately.
With the above rules, http://httpbin.org/anything/something
would be allowed, but
http://httpbin.org/anything/something/subpath
would be excluded for any subpath
.
To remove a rule, one can run:
wb-manager acl remove <collection> http://httpbin.org/anything/something
To import rules in bulk, such as from an OpenWayback-style excludes.txt and mark them as exclude
:
wb-manager acl importtxt <collection> ./excludes.txt exclude
See wb-manager acl -h
for a list of additional commands such as for validating rules files and running a match against
an existing rule set.
Access Controls for Custom Collections¶
For manually configured collections, there are additional options for configuring access controls.
The access control files can be specified explicitly using the acl_paths
key and allow specifying multiple ACL files,
and allow sharing access control files between different collections.
Single ACLJ:
collections:
  test:
    acl_paths: ./path/to/file.aclj
    default_access: block
Multiple ACLJ:
collections:
  test:
    acl_paths:
      - ./path/to/allows.aclj
      - ./path/to/blocks.aclj
      - ./path/to/other.aclj
      - ./path/to/directory
    default_access: block
The acl_paths
can be a single entry or a list, and can also include directories. If a directory is specified, all .aclj
files
in the directory are checked.
When finding the best rule from multiple .aclj
files, each file is binary searched and the result
set merge-sorted to find the best match (very similar to the CDXJ index lookup).
Note: It might make sense to separate allows.aclj
and blocks.aclj
into individual files for organizational reasons,
but there is no specific need to keep more than one access control file.
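The merge step can be sketched in Python with `heapq.merge`, which combines several already-sorted sequences into one sorted stream. This is an illustration only: each file is modeled as a list of hypothetical `(SURT prefix, access)` tuples in reverse alphabetical order, the binary search is omitted, and the `default_access` fallback is applied when nothing matches.

```python
import heapq


def best_access(surt_key, rule_files, default_access="allow"):
    """Find the most specific matching rule across several rule lists.

    `rule_files` is a list of lists; each inner list holds (prefix, access)
    tuples already sorted in reverse alphabetical order, like an .aclj file.
    """
    # Merge the reverse-sorted inputs into a single reverse-sorted stream.
    merged = heapq.merge(*rule_files, reverse=True)
    for prefix, access in merged:
        if surt_key.startswith(prefix):
            return access
    return default_access
```

For example, with one list containing an `allow` rule for `org,httpbin)/anything/something` and another containing a `block` rule for `org,httpbin)/anything`, a lookup for `org,httpbin)/anything/something/x` returns `allow`, while `org,httpbin)/anything/other` returns `block`.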
Finally, ACLJ and embargo settings combined for the same collection might look as follows:
collections:
  test:
    ...
    embargo:
      newer:
        days: 366
    acl_paths:
      - ./path/to/allows.aclj
      - ./path/to/blocks.aclj
Default Access¶
An additional default_access
setting can be added to specify the default rule if no other rules match for custom collections.
If omitted, this setting is default_access: allow
, which is usually the desired default.
Setting default_access: block
and providing a list of allow
rules provides a flexible way to allow access
to only a limited set of resources, and block access to anything out of scope by default.
UI Customization¶
Customization Guide¶
Most aspects of the pywb user-interface can be customized by changing the default styles, or overriding the HTML templates.
This guide covers a few different options for customizing the UI.
Changing the Default Styles¶
When using the default UI, pywb styles can be configured in pywb/static/default_banner.css
The style definition for #_wb_frame_top_banner
affects the rendering of the default banner in framed mode.
Configuring a Logo¶
An optional logo can be configured at the top-left of the default banner.
To enable the logo set the ui.logo
property in config.yaml
to point to the URL of the logo.
The URL can be any image URL, including a URL served from the static directory.
For example, to add the default pywb logo to the banner, use the following in the config, which will
load the logo from ./static/pywb-logo-sm.png
ui:
  logo: pywb-logo-sm.png
New Vue-based UI (Alpha)¶
With pywb 2.7.0, pywb includes a brand new UI which includes a visual calendar mode and a histogram-based banner.
See New Vue-based UI (Alpha) for more information on how to enable this UI.
Customizing UI Templates¶
pywb renders HTML using the Jinja2 templating engine, loading default templates from the pywb/templates
directory.
If running from a custom directory, templates can be placed in the templates
directory and will override the defaults.
See Template Guide for more details on customizing the templates.
Static Files¶
pywb will automatically support static files placed under the following directories:
- Files under the root static directory: static/my-file.js can be accessed via http://localhost:8080/static/my-file.js
- Files under the per-collection directory: ./collections/my-coll/static/my-file.js can be accessed via http://localhost:8080/static/_/my-coll/my-file.js
It is possible to change these settings via config.yaml
:
- static_prefix - sets the URL path used in pywb to serve static content (default static)
- static_dir - sets the directory name used to read static files on disk (default static)
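The mapping from a static file to its URL can be written out as a small Python sketch. The function `static_url` is hypothetical, and the assumption that a custom `static_prefix` applies to the per-collection form in the same way as the root form is inferred from the examples above rather than confirmed.

```python
def static_url(filename, coll=None, static_prefix="static",
               host="http://localhost:8080"):
    """Build the URL under which pywb would serve a static file.

    `coll=None` means the root static directory; otherwise the file is
    assumed to live in ./collections/<coll>/static/.
    """
    if coll is None:
        return f"{host}/{static_prefix}/{filename}"
    # Per-collection files are served under the special '_' path segment.
    return f"{host}/{static_prefix}/_/{coll}/{filename}"
```

Calling `static_url("my-file.js")` gives `http://localhost:8080/static/my-file.js`, and `static_url("my-file.js", coll="my-coll")` gives `http://localhost:8080/static/_/my-coll/my-file.js`, matching the two cases listed above.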
While pywb can serve static files, it is recommended to use an existing web server to serve static files, especially if already using it in production.
For example, this can be done via nginx with:
location /wayback/static {
    alias /pywb/pywb/static;
}
Loading Custom Metadata¶
pywb includes a default mechanism for loading externally defined metadata, loaded from a per-collection metadata.yaml
YAML file at runtime.
See Custom Metadata for more details.
Additionally, the banner template has access to the contents of the config.yaml
via the {{ config }}
template variable,
allowing for passing in arbitrary config information.
For more dynamic loading of data, the banner and all of the templates can load additional data via JS fetch()
calls.
Embedding pywb in frames¶
It should be possible to embed pywb replay itself as an iframe as needed.
For customizing the top-level page and banner, see Customizing the Top Frame Template.
However, there may be other reasons to embed pywb in an iframe.
This can be done simply by including something like:
<html>
<head></head>
<body>
  <div>Embedding pywb replay</div>
  <iframe style="width: 100%; height: 100%" src="http://localhost:8080/pywb/20130729195151/http://test@example.com/"></iframe>
</body>
</html>
New Vue-based UI (Alpha)¶
With 2.7.0, pywb introduces a new Vue UI based system, which can be enabled to provide a more feature-rich representation of a web archive.
The UI consists of two parts, which can be enabled using the ui
block in config.yaml
ui:
  vue_calendar_ui: true
  vue_timeline_banner: true
Note: This UI is still in development and not all features are operational yet. In particular, localization switching is not yet available in the alpha version.
Overview¶
Calendar UI¶
The new calendar UI provides a histogram and a clickable calendar representation of a web archive.
The calendar is rendered in place of the standard URL query page.

To enable this UI for URL query pages, set the ui.vue_calendar_ui
property to true in the config.yaml
Banner Replay UI¶
The new banner histogram allows for zooming in on captures per year as well as per month.
Navigation preserves the different levels. The full calendar UI is also available as a dropdown by clicking the calendar icon.
The new banner should allow for faster navigation across multiple captures.

To enable this UI for replay pages, set the ui.vue_timeline_banner
property to true in the config.yaml
Custom Logo¶
When using the custom banner, it is possible to configure a logo by setting ui.logo
to a static file.
If omitted, the standard pywb logo will be used by default.
If set, the logo should point to a file in the static directory (default is static
but can be changed via the static_dir
config option).
For example, to use the file ./static/my-logo.png
as the logo, set:
ui:
  logo: my-logo.png
Updating the Vue UI¶
The UI is contained within the pywb/vueui
directory.
The Vue component sources can be found in pywb/vueui/src
.
Updating the UI requires node
and yarn
.
To install and build, run:
cd pywb/vueui
yarn install
yarn build
This will generate the output to pywb/static/vue/vueui.js
which is loaded from the default templates when the Vue UI rendering is enabled.
Additional styles for the banner are loaded from pywb/static/vue_banner.css
.
Template Guide¶
Introduction¶
This guide provides a reference of all of the templates available in pywb and how they could be modified.
These templates are found in the pywb/templates
directory and can be overridden as needed, one HTML page at a time.
Template variables are listed as {{ variable }}
to indicate the syntax used for rendering the value of the variable in Jinja2.
Copying a Template For Modification¶
To modify a template, it is often useful to start with the default template. To do so, simply copy a default template
to a local templates
directory.
For convenience, you can also run: wb-manager template --add <template-name>
to add the template automatically.
For a list of available templates that can be overridden in this way, run wb-manager template --list
.
Per-Collection Templates¶
Certain templates can be customized per-collection, instead of for all of pywb.
To override a template for a specific collection only, run wb-manager template --add <template-name> <coll-name>
For example:
wb-manager init my-coll
wb-manager template --add search_html my-coll
This will create the file collections/my-coll/templates/search.html
, a copy of the default search.html, but configured to be used only
for the collection my-coll
.
Base Templates (and supporting templates)¶
File: base.html
This template includes the HTML added to all other pages, replay and non-replay. Shared JS and CSS includes can be added here. For theming all pywb UI, it may be useful to modify this template.
To customize the default pywb UI across multiple pages, the following additional templates can also be overridden:
- head.html – Template containing content to be added to the <head> of the base template
- header.html – Template to be added as the first content of the <body> tag of the base template
- footer.html – Template for adding content as the “footer” of the <body> tag of the base template
Note: The default pywb head.html
and footer.html
are currently blank. They can be populated to customize the rendering, add analytics, etc… as needed.
The base.html
template also provides five blocks that can be supplied by templates that extend it.
- title – Block for supplying the title for the page
- head – Block for adding content to the <head>, includes the head.html template
- header – Block for adding content to the <body> before the body block, includes the header.html template
- body – Block for adding the primary content to the template
- footer – Block for adding content to the <body> after the body block, includes the footer.html template
Home, Collection and Search Templates¶
Home Page Template¶
File: index.html
This template renders the home page for pywb, and by default renders a list of available collections.
Template variables:
- {{ routes }} - a list of available collection routes.
- {{ all_metadata }} - a dictionary of all metadata for all collections, keyed by collection id. See Custom Metadata for more info on the custom metadata.
Additionally, the Shared Template Variables are also available to the home page template, as well as all other templates.
Collection Page Template¶
File: search.html
The ‘collection page’ template is the page rendered when no URL is specified, e.g. http://localhost:8080/my-collection/
.
The default template renders a search page that can be used to start searching for URLs.
Template variables:
- {{ coll }} - the collection name identifier.
- {{ metadata }} - an optional dictionary of metadata. See Custom Metadata for more info.
- {{ ui }} - an optional ui dictionary from config.yaml, if any
Custom Metadata¶
If custom collection metadata is provided, this page will automatically show this metadata as well.
It is possible to also add custom metadata per-collection that will be available to the collection.
For dynamic collections, any fields placed in <coll_name>/metadata.yaml
files can be accessed
via the {{ metadata }}
variable.
For example, if the metadata file contains:
somedata: value
Accessing {{ metadata.somedata }}
will resolve to value
.
The metadata can also be added via commandline: wb-manager metadata myCollection --set somedata=value
.
URL Query/Calendar Page Template¶
File: query.html
This template is rendered for any URL search response pages, either a single URL or more complex queries.
For example, the page http://localhost:8080/my-collection/*/https://example.com/
will be rendered using this template.
The default template supports the standard pywb table view, as well as a conditional new vue-based UI. (See New Vue-based UI (Alpha) for more info on the new UI)
Template variables:
- {{ url }} - the URL being queried, e.g. https://example.com/
- {{ prefix }} - the collection prefix that will be used for replay, e.g. http://localhost:8080/my-collection/
- {{ ui }} - an optional ui dictionary from config.yaml, if any
- {{ static_prefix }} - the prefix from which static files will be accessed, e.g. http://localhost:8080/static/.
Replay and Banner Templates¶
The following templates are used to configure the replay view itself.
Banner Template¶
File: banner.html
This template is used to render the banner and is used both in framed replay and frameless replay.
In framed replay, the template is only rendered in the top/outer frame, while in frameless replay, it is added to every page.
Template variables:
- {{ url }} - the URL being replayed.
- {{ timestamp }} - the timestamp being replayed, e.g. 20211226 in http://localhost:8080/pywb/20211226/mp_/https://example.com/
- {{ is_framed }} - true/false if currently in framed mode.
- {{ wb_prefix }} - the collection prefix, e.g. http://localhost:8080/pywb/
- {{ host_prefix }} - the pywb server origin, e.g. http://localhost:8080
- {{ config }} - provides the contents of the config.yaml as a dictionary.
- {{ ui }} - an optional ui dictionary from config.yaml, if any.
The default banner creates all UI dynamically via JS. However, a custom banner could also insert HTML to render the banner directly.
By default, the banner checks the {{ ui.vue_timeline_banner }}
and renders the new UI or the standard default UI.
The default UI is created via the default_banner.js
script.
See New Vue-based UI (Alpha) for more details on the new Vue UI.
Head Insert Template¶
File: head_insert.html
This template represents the HTML injected into every replay page to add support for client-side rewriting via wombat.js
.
This template is part of the core pywb replay, and modifying this template is not recommended.
For customizing the banner, modify the banner.html
template instead.
Top Frame Template¶
File: frame_insert.html
This template represents the top-level frame that is inserted to render the replay in framed mode.
By design, this template does not extend from the base template.
This template is responsible for creating the iframe that will render the content.
This template only renders the banner and is designed not to set the encoding to allow the browser to ‘detect’ the encoding for the containing iframe. For this reason, the template should only contain ASCII text, and %-encode any non-ASCII characters.
Template variables:
- {{ url }} - the URL being replayed.
- {{ wb_url }} - A complete WbUrl object, which contains the url, timestamp and mod properties, representing the replay url.
- {{ wb_prefix }} - the collection prefix, e.g. http://localhost:8080/pywb/
- {{ is_proxy }} - set to true if page is being loaded via an HTTP/S proxy (checks if WSGI env has wsgiprox.proxy_host set)
Customizing the Top Frame Template¶
The top-frame used for framed replay can be replaced or augmented
by modifying the frame_insert.html
.
To start with modifying the default outer page, you can add it to the current
templates directory by running wb-manager template --add frame_insert_html
To initialize the replay, the outer page should include wb_frame.js
,
create an <iframe>
element and pass the id (or element itself) to the ContentFrame
constructor:
<script src='{{ host_prefix }}/{{ static_path }}/wb_frame.js'> </script>
<script>
var cframe = new ContentFrame({"url": "{{ url }}" + window.location.hash,
"prefix": "{{ wb_prefix }}",
"request_ts": "{{ wb_url.timestamp }}",
"iframe": "#replay_iframe"});
</script>
The outer frame can receive notifications of changes to the replay via postMessage
For example, to detect when the content frame changed and log the new url and timestamp, use the following script in the outer frame html:
window.addEventListener("message", function(event) {
  if (event.data.wb_type == "load" || event.data.wb_type == "replace-url") {
    console.log("New Url: " + event.data.url);
    console.log("New Timestamp: " + event.data.ts);
  }
});
The load
message is sent when a new page is first loaded, while replace-url
is used
for url changes caused by content frame History navigation.
Error Templates¶
The following templates are used to render errors.
Page Not Found Template¶
File: not_found.html
- template for 404 error pages.
This template is used to render any 404/page not found errors that can occur when loading a URL that is not in the web archive.
Template variables:
- {{ url }} - the URL of the page
- {{ wbrequest }} - the full WbRequest object which can be used to get additional info about the request.
(The default template checks {{ wbrequest and wbrequest.env.pywb_proxy_magic }}
to determine if the request is via an HTTP/S Proxy Mode connection or a regular request).
Generic Error Template¶
File: error.html
- generic error template.
This template is used to render all other errors that are not ‘page not found’.
Template variables:
- {{ err_msg }} - a shorter error message indicating what went wrong.
- {{ err_details }} - additional details about the error.
Localization / Multi-lingual Support¶
pywb supports configuring different language locales and loading different language translations, and dynamically switching languages.
pywb can extract all text from templates and generate CSV files for translation and convert them back into a binary format used for localization/internationalization.
(pywb uses the Babel library which extends the standard Python i18n system)
To ensure all localization related dependencies are installed, first run:
pip install pywb[i18n]
Locales to use are configured in the config.yaml
.
The command-line wb-manager
utility provides a way to manage locales for translation, including generating extracted text, and to update translated text.
Adding a Locale and Extracting Text¶
To add a new locale for translation and automatically extract all text that needs to be translated, run:
wb-manager i18n extract <loc>
The <loc>
can be one or more supported two-letter locales or CLDR language codes. To list available codes, you can run pybabel --list-locales
.
Localization data is placed in the i18n
directory, and translatable strings can be found in i18n/translations/<locale>/LC_MESSAGES/messages.csv
Each CSV file looks as follows, listing each source string and an empty string for the translated version:
"location","source","target"
"pywb/templates/banner.html:6","Live on",""
"pywb/templates/banner.html:8","Calendar icon",""
"pywb/templates/banner.html:9 pywb/templates/query.html:45","View All Captures",""
"pywb/templates/banner.html:10 pywb/templates/header.html:4","Language:",""
"pywb/templates/banner.html:11","Loading...",""
...
This CSV can then be passed to translators to translate the text.
(The extraction parameters are configured to load data from pywb/templates/*.html
in babel.ini
)
For example, the following will generate translation strings for es
and pt
locales:
wb-manager i18n extract es pt
The translatable text can then be found in i18n/translations/es/LC_MESSAGES/messages.csv
and i18n/translations/pt/LC_MESSAGES/messages.csv
.
The CSV files should be updated with a translation for each string in the target
column.
The extract command adds any new strings without overwriting existing translations, so after running the update command to compile translated strings (described below), it is safe to run the extract command again.
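Before compiling, it can be handy to check which strings still lack a translation. The Python sketch below does this with the standard `csv` module, assuming the three-column layout (`location`, `source`, `target`) shown above; the function name `untranslated_strings` is hypothetical and not part of `wb-manager`.

```python
import csv
import io


def untranslated_strings(csv_text):
    """Return the source strings whose `target` column is still empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["source"]
        for row in reader
        if not (row["target"] or "").strip()
    ]
```

To use it on a real catalog, read the file contents first, for example from `i18n/translations/es/LC_MESSAGES/messages.csv`, and pass the text to the function.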
Updating Locale Catalog¶
Once the text has been translated, and the CSV files updated, simply run:
wb-manager i18n update <loc>
This will parse the CSVs and compile the translated string tables for use with pywb.
Specifying locales in pywb¶
To enable the locales in pywb, one or more locales can be added to the locales
key in config.yaml
, ex:
locales:
- en
- es
Single Language Default Locale¶
pywb can be configured with a default, single-language locale by setting the default_locale property in config.yaml:
default_locale: es
locales:
- es
With this configuration, pywb will automatically use the es locale for all text strings in pywb pages.
pywb will also set <html lang="es"> so that the browser will recognize the correct locale.
Multi-language Translations¶
If more than one locale is specified, pywb will automatically show a language switching UI at the top of collection and search pages, with an option for each locale listed. To include English as an option, it should also be added as a locale (and no strings translated). For example:
locales:
- en
- es
- pt
will configure pywb to show a language switch option on all pages.
Localized Collection Paths¶
When localization is enabled, pywb supports a locale prefix for accessing each collection in a localized language.
If pywb has a collection my-web-archive, then:
- /my-web-archive/ - loads UI with default language (set via default_locale)
- /en/my-web-archive/ - loads UI with en locale
- /es/my-web-archive/ - loads UI with es locale
- /pt/my-web-archive/ - loads UI with pt locale
The language switch options work by changing the locale prefix for the same page.
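The prefix swap can be sketched as a small path helper. This is an illustration of the idea only, not pywb's implementation. The function name switch_locale is hypothetical, and the sketch assumes no collection shares its name with a locale code.

```python
def switch_locale(path, new_locale, locales):
    """Return the same page path under a different locale prefix."""
    stripped = path.lstrip("/")
    first, _, rest = stripped.partition("/")
    if first in locales:
        # Drop the existing locale prefix and keep the rest of the path.
        stripped = rest
    return "/" + new_locale + "/" + stripped


LOCALES = ["en", "es", "pt"]
print(switch_locale("/es/my-web-archive/", "pt", LOCALES))
print(switch_locale("/my-web-archive/", "en", LOCALES))
```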
Listing and Removing Locales¶
To list the locales that have previously been added, you can run wb-manager i18n list.
To disable a locale from being used in pywb, simply remove it from the locales key in config.yaml.
To remove data for a locale permanently, you can run: wb-manager i18n remove <loc>. This will remove the locale directory on disk.
To remove all localization data, you can manually delete the i18n directory.
UI Templates: Adding Localizable Text¶
Text that can be translated (localizable text) can be marked as such directly in the UI templates:
- By wrapping the text in {% trans %}/{% endtrans %} tags. For example: {% trans %}Collection {{ coll }} Search Page{% endtrans %}
- Short-hand, by calling a special _() function, which can be used in attributes or more dynamically. For example: ... title="{{ _('Enter a URL to search for') }}">
These methods can be used in all UI templates and are supported by the Jinja2 templating system.
See Customization Guide for a list of all available UI templates.
Architecture¶
The pywb system consists of 3 distinct components: Warcserver, Recorder and Rewriter, which can be run and scaled separately. The default pywb wayback application uses Warcserver and Rewriter. If recording is enabled, the Recorder is also used.
Additionally, the indexing system is used through all components, and a few command line tools encompass the pywb toolkit.
Warcserver¶
The Warcserver component is the base component of the pywb stack and can function as a standalone HTTP server.
The Warcserver receives as input an HTTP request, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.
This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
The warcserver can be started directly after installing pywb, simply by running warcserver (default port is 8070).
Note: when running wayback, an instance of warcserver is also started automatically.
Warcserver API¶
The Warcserver API encompasses the CDXJ Server API and provides a per collection endpoint, using a list of collections defined in a YAML config file (default config.yaml). It’s also possible to use Warcserver without the YAML config (see: Custom Warcserver Deployments). The endpoints are as follows:
- / - Home Page, JSON list of available endpoints.
For each collection <coll>:
- /<coll>/index – Direct Index (compatible with CDXJ Server API)
- /<coll>/resource – Direct Resource
- /<coll>/postreq/index – POST request Index
- /<coll>/postreq/resource – POST request Resource (most flexible for integration with downstream tools)
All endpoints accept the CDXJ Server API query arguments, although the “direct index” route is usually most useful for index lookup, while the “post request resource” route is most useful for integration with other downstream client tools.
POSTing vs Direct Input¶
The Warcserver is designed to map input requests to output responses, and it is possible to send input requests “directly”, eg:
GET /coll/resource?url=http://example.com/
Connection: close
or by “wrapping” the entire request in a POST request:
POST /coll/postreq/resource?url=http://example.com/
Content-Length: ...
...
GET /
Host: example.com
Connection: close
The “post request” (/postreq endpoint) approach allows more accurately transmitting any HTTP request and headers in the body of another POST request, without worrying about how the headers might be interpreted by the Warcserver connection. The “wrapped HTTP request” is thus unwrapped and processed, allowing hop-by-hop headers like Connection: close to be processed unaltered.
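As a rough illustration of the wrapping, the sketch below only builds the pieces of such a POST request using the standard library and does not send anything. The helper build_postreq is hypothetical, and the exact request line expected in the wrapped body (here GET / HTTP/1.1) is an assumption.

```python
from urllib.parse import urlencode


def build_postreq(warcserver, coll, url, wrapped_request):
    """Build endpoint, headers and body for a wrapped /postreq/resource call."""
    endpoint = f"{warcserver}/{coll}/postreq/resource?{urlencode({'url': url})}"
    body = wrapped_request.encode("utf-8")
    headers = {"Content-Length": str(len(body))}
    return endpoint, headers, body


# The original request travels verbatim as the POST body, so its
# hop-by-hop headers are not interpreted by the outer connection.
wrapped = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
endpoint, headers, body = build_postreq(
    "http://localhost:8070", "coll", "http://example.com/", wrapped
)
print(endpoint)
```

The resulting endpoint, headers and body can then be passed to any HTTP client.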
Index vs Resource Output¶
For any query, the Warcserver can return a matching index result, or the first available WARC record.
Within each collection and input type, the following endpoints are available:
- /index - perform index lookup
- /resource - return a single WARC record for the first match of the index list.
For example, an index query might return the CDXJ index:
=> curl "http://localhost:8070/pywb/index?url=iana.org"
org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
When switching to resource, the result might be:
=> curl "http://localhost:8070/pywb/resource?url=iana.org"
WARC/1.0
WARC-Type: response
...
The resource lookup attempts to load the first available record (e.g. by loading from the specified WARC). If the record indicated by the first CDXJ line is not available, the next CDXJ line is tried, and so on, until one succeeds.
If no record can be loaded from any of the CDXJ index results (or if there are no index results), a 404 Not Found error is returned.
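The fallback behaviour described above can be sketched as a simple loop. This is not the Warcserver code: RecordUnavailable, first_available and demo_loader are hypothetical names, and the loader is a stand-in for whatever actually reads a record from a WARC file or remote source.

```python
class RecordUnavailable(Exception):
    """Hypothetical error raised when a WARC record cannot be loaded."""


def first_available(cdx_lines, load_record):
    """Try each index result in order and return the first record that loads."""
    for cdx in cdx_lines:
        try:
            return load_record(cdx)
        except RecordUnavailable:
            continue  # fall through to the next CDXJ line
    return None  # the caller would answer with 404 Not Found


def demo_loader(cdx):
    """Stand-in loader: pretend that one WARC file is missing."""
    if cdx["filename"] == "missing.warc.gz":
        raise RecordUnavailable(cdx["filename"])
    return "WARC record from " + cdx["filename"]


index = [{"filename": "missing.warc.gz"}, {"filename": "iana.warc.gz"}]
record = first_available(index, demo_loader)
print(record)
```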
WARC Record HTTP Response¶
When using Warcserver, the entire WARC record is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.
The following example illustrates what is transmitted when curl-ing http://localhost:8070/pywb/resource?url=iana.org:
> GET /pywb/resource?url=iana.org HTTP/1.1
> Host: localhost:8070
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Warcserver-Cdx: org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
< Link: <http://www.iana.org/>; rel="original"
< WARC-Target-URI: http://www.iana.org/
< Warcserver-Source-Coll: pywb:iana.cdx
< Content-Type: application/warc-record
< Memento-Datetime: Sun, 26 Jan 2014 20:06:24 GMT
< Content-Length: 6357
< Warcserver-Type: warc
< Date: Tue, 17 Oct 2017 00:32:12 GMT
< WARC/1.0
< WARC-Type: response
< WARC-Date: 2014-01-26T20:06:24Z
< WARC-Target-URI: http://www.iana.org/
< WARC-Record-ID: <urn:uuid:4eec4942-a541-410a-99f4-50de39b62118>
...
The HTTP payload is the WARC record itself, but the HTTP headers returned “surface” additional information about the WARC record to make it easier for clients to use the data.
- Memento headers Memento-Datetime and Link – The datetime is read from the WARC record, and the WARC record is itself a valid “memento”, although full Memento compliance is not yet included.
- The Warcserver-Cdx header includes the full CDXJ index line that was used to load this record (usually, but not always, the first line in the index query).
- The Warcserver-Source-Coll header includes the source from which this record was loaded, corresponding to the source field in the CDXJ.
- Warcserver-Type: warc indicates that this is a Warcserver WARC record (may be removed in the future).
In particular, the CDXJ and source data can be used to further identify and process the WARC record, without having to parse it. The Recorder component uses the source to determine if recording is necessary or should be skipped.
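For instance, a client could split the Warcserver-Cdx header value into its url key, timestamp and JSON fields with a few lines of standard-library code. The helper name parse_cdxj is illustrative, and the header value is copied from the example response above.

```python
import json


def parse_cdxj(line):
    """Split a CDXJ line into (urlkey, timestamp, fields dict)."""
    urlkey, timestamp, fields = line.split(" ", 2)
    return urlkey, timestamp, json.loads(fields)


header_value = (
    'org,iana)/ 20140126200624 {"url": "http://www.iana.org/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", '
    '"robotflags": "-", "length": "2258", "offset": "334", '
    '"filename": "iana.warc.gz", "source": "pywb:iana.cdx"}'
)

urlkey, timestamp, fields = parse_cdxj(header_value)
print(urlkey, timestamp, fields["source"])
```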
Warcserver Index Configuration¶
Warcserver supports several index source types, allowing users to mix local and remote sources into a single collection or across multiple collections.
The sources include:
- Local File
- Local ZipNum File
- Live Web Proxy (implicit index)
- Redis sorted-set key
- Memento TimeGate Endpoint
- CDX Server API Endpoint
The index types can be defined using either the shorthand sourcename+<url> notation or a long-form full property declaration.
The following is an example of defining different special collections:
collections:
    # Live Index
    live: $live

    # rhizome via memento (shorthand)
    rhiz: memento+http://webenact.rhizome.org/all/

    # rhizome via memento (equivalent full properties)
    rhiz_long:
        index:
            type: memento
            timegate_url: http://webenact.rhizome.org/all/{url}
            timemap_url: http://webenact.rhizome.org/all/timemap/link/{url}
            replay_url: http://webenact.rhizome.org/all/{timestamp}id_/{url}
Warcserver Index Aggregators¶
In addition to individual index types, Warcserver supports ‘index aggregators’, which represent not a single source but multiple index sources, explicit or implicit.
Some explicit aggregators are:
- Local Directory
- Redis Key Template (scan/lookup of multiple redis keys)
- A generic group of index sources looked up in parallel (best match)
The aggregators allow for complex lookup chains that look up resources in dynamic directory structures, in Redis keys, and in external web archives.
Note: Warcserver automatically includes a Local Directory aggregator pointing to the collections directory, as explained in Configuring the Web Archive.
Sample “Memento” Aggregator¶
For example, the following config defines the collection endpoint many_archives to look up three remote archives, two using memento, and one using the CDX Server API:
collections:
    # many archives
    many_archives:
        index_group:
            rhiz: memento+http://webenact.rhizome.org/all/
            ia: cdx+http://web.archive.org/cdx;/web
            apt: memento+http://arquivo.pt/wayback/

        timeout: 10
This allows Warcserver to serve as a “Memento Aggregator”, aggregating results from multiple existing archives (using the Memento API and other APIs).
An optional timeout property configures how many seconds to wait for each source before it is considered to have ‘timed out’. (If unspecified, the default value is 5 seconds).
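The effect of the timeout can be illustrated with a small parallel lookup. This sketch uses threads from the standard library rather than gevent, so it is a stand-in for the idea, not the aggregator pywb uses. The function query_sources and the two fake sources are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_sources(sources, url, timeout=5.0):
    """Query every source in parallel and drop those that miss the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(lookup, url): name for name, lookup in sources.items()}
    done, not_done = wait(futures, timeout=timeout)
    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    timed_out = sorted(futures[future] for future in not_done)
    pool.shutdown(wait=False)  # do not block on the slow sources
    return results, timed_out


def fast(url):
    """Fake source that answers immediately."""
    return ["com,example)/ 2017 " + url]


def slow(url):
    """Fake source that takes longer than the timeout used below."""
    time.sleep(0.5)
    return []


results, timed_out = query_sources(
    {"fast": fast, "slow": slow}, "http://example.com/", timeout=0.1
)
print(sorted(results), timed_out)
```

A source that times out simply contributes no results, and the remaining sources still answer the query.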
Sequential Fallback Collections¶
It is also possible to define a “sequential” collection, where if one source/aggregator fails to produce a result, a “fallback” aggregator is tried, until there is a result:
collections:
    # Sequence
    web:
        sequence:
            -
              index: ./local/indexes
              resource: ./local/data
              name: local

            -
              index_group:
                  rhiz: memento+http://webenact.rhizome.org/all/
                  ia: cdx+http://web.archive.org/cdx;/web
                  apt: memento+http://arquivo.pt/wayback/

            -
              index: $live
              name: live
In the above example, the local archive is tried first. If the resource could not be successfully loaded, the group of 3 archives is tried. If they all fail to produce a successful response, the live web is tried. Note that a successful response requires both a successful index lookup and a successful resource fetch: if an index contains results, but they can not be fetched, the next group in the sequence is tried.
The name of each item is included in the CDXJ index in the source field to allow the caller to identify which archive source was used.
Adding Custom Index Sources¶
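It should be easy to add a custom index source by extending pywb.warcserver.index.indexsource.BaseIndexSource:
# Import paths as found in pywb 2.x
from pywb.warcserver.index.indexsource import BaseIndexSource
from pywb.warcserver.index.cdxobject import CDXObject
from pywb.warcserver.warcserver import register_source

class MyIndexSource(BaseIndexSource):
    def load_index(self, params):
        # ... look up index data as needed to fill the CDXObject
        cdx = CDXObject()
        cdx['url'] = ...
        ...
        yield cdx

    @classmethod
    def init_from_string(cls, value):
        # Shorthand form, as used in config.yaml below
        if value == 'my-index-src':
            return cls()
        ...

    @classmethod
    def init_from_config(cls, config):
        # Long form: only handle configs that declare this source type
        if config['type'] != 'my-index-src':
            return
        return cls()

# Register Index with Warcserver
register_source(MyIndexSource)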
You can then use the index in a config.yaml:
collections:
    my-coll: my-index-src
For more information and definitions of existing indexes, see pywb.warcserver.index.indexsource
Custom Warcserver Deployments¶
It is also possible to use Warcserver directly without the use of a config.yaml file, for more complex deployment scenarios. (Webrecorder uses a customized deployment).
For example, the following config.yaml config:
collections:
    live: $live
    memento:
        index_group:
            rhiz: memento+http://webenact.rhizome.org/all/
            ia: memento+http://web.archive.org/web/
            local: ./collections/
could be initialized explicitly, using the pywb.warcserver.basewarcserver.BaseWarcServer class, which does not use a YAML config:
app = BaseWarcServer()

# /live endpoint
live_agg = SimpleAggregator({'live': LiveIndexSource()})
app.add_route('/live', DefaultResourceHandler(live_agg))

# /memento endpoint (same sources as the index_group in the config above)
sources = {
    'rhiz': MementoIndexSource.from_timegate_url('http://webenact.rhizome.org/all/'),
    'ia': MementoIndexSource.from_timegate_url('http://web.archive.org/web/'),
    'local': DirectoryIndexSource('./collections'),
}
multi_agg = GeventTimeoutAggregator(sources)
app.add_route('/memento', DefaultResourceHandler(multi_agg))
For more examples on custom Warcserver usage, consult the Warcserver tests, such as those in pywb.warcserver.test.test_handlers.py
Recorder¶
The recorder component acts as a proxy component, intercepting requests to and responses from the Warcserver and recording them to a WARC file on disk.
The recorder uses the pywb.recorder.multifilewarcwriter.MultiFileWARCWriter class, which extends the base warcio.warcwriter.WARCWriter from warcio and provides support for:
- appending to multiple WARC files at once
- WARC ‘rollover’ based on maximum size or idle time
- indexing (CDXJ) on write
Many of the features of the Recorder were created for use with the Webrecorder project, although the core recorder is also used to provide basic recording via the /record/ endpoint. (See: Recording Mode)
Deduplication Filters¶
The core recorder class provides for optional deduplication using the pywb.recorder.redisindexer.WritableRedisIndexer class, which requires Redis to store the index, and can be used to either:
- write duplicate responses.
- write revisit records.
- ignore duplicates and don’t write to WARC.
Custom Filtering¶
The recorder also includes a filtering system to allow for not writing certain requests and responses. Filters include:
- Skipping by regex applied to source (Warcserver-Source-Coll header from Warcserver)
- Skipping if Recorder-Skip: 1 header is provided
- Skipping if Range request header is provided
- Filtering out certain HTTP headers, for example, http-only cookies
The additional recorder functionality will be enhanced in a future version.
For more detailed examples, please consult the tests in pywb.recorder.test.test_recorder
Rewriter¶
pywb includes sophisticated server-side and client-side rewriting systems, including a rules-based configuration for domain and content-specific rewriting rules, fuzzy index matching for replay, and a thorough client-side JS rewriting system.
As of pywb 2.3.0, the client-side rewriting system exists in a separate module at https://github.com/webrecorder/wombat
URL Rewriting¶
URL rewriting is a key aspect of correctly replaying archived pages. It is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser. pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client.
(No url rewriting is performed when running in HTTP/S Proxy Mode)
Most of the rewriting performed is url-rewriting, changing the original URLs to point to the pywb server instead of the live web. Typically, the rewriting converts:
<url> -> <pywb host>/<coll>/<timestamp><modifier>/<url>
For example, http://example.com/ might be rewritten as http://localhost:8080/my-coll/2017mp_/http://example.com/
The rewritten url ‘prefixes’ the pywb host, the collection, requested datetime (timestamp) and type modifier to the actual url. The result is an ‘archival url’ which contains the original url and additional information about the archive and timestamp.
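The prefixing scheme can be expressed as a one-line template. The helper below is only an illustration of the url layout described above. The name archival_url is hypothetical and the function is not part of the pywb rewriter.

```python
def archival_url(host, coll, url, timestamp="", modifier=""):
    """Prefix the original url with host, collection, timestamp and modifier."""
    prefix = timestamp + modifier
    if prefix:
        return f"{host}/{coll}/{prefix}/{url}"
    # With no timestamp and no modifier, the url directly follows the collection.
    return f"{host}/{coll}/{url}"


print(archival_url("http://localhost:8080", "my-coll", "http://example.com/", "2017", "mp_"))
print(archival_url("http://localhost:8080", "my-coll", "http://example.com/"))
```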
Url Rewrite Type Modifier¶
The type modifier included after the timestamp specifies the format of the resource to be loaded. Currently, pywb supports the following modifiers:
Identity Modifier (id_)¶
When this modifier is used, e.g. /my-coll/id_/http://example.com/, no content rewriting is performed on the response, and the original, un-rewritten content is returned.
This is useful for HTML or other text resources that are normally rewritten when using the default (mp_) modifier.
Note that certain HTTP headers (hop-by-hop or cookie related) may still be prefixed with X-Orig-Archive- as they may affect the transmission, so original headers are not guaranteed.
No Modifier¶
The ‘canonical’ replay url is one without the modifier and represents the url that a user will see and enter into the browser.
The behavior for the canonical/no modifier archival url is only different if framed replay is used (see Framed vs Frameless Replay)
- If framed replay, this url serves the top level frame
- If frameless replay, this url serves the content and is equivalent to the mp_ modifier.
Main Page Modifier (mp_)¶
This modifier is used to indicate ‘main page’ content replay, generally HTML pages. Since pywb also performs content type detection, this modifier can be used for any resource that is being loaded for replay, and will generally render it correctly. Binary resources can also be rendered with this modifier.
JS and CSS Hint Modifiers (js_ and cs_)¶
These modifiers are useful to ‘hint’ to pywb that a certain resource is being treated as a JS or CSS file. This only makes a difference where there is an ambiguity.
For example, if a resource has type text/html but is loaded in a <script> tag with the js_ modifier, it will be rewritten as JS instead of as HTML.
Other Modifiers¶
For compatibility and historical reasons, the pywb HTML parser also adds the following special hints:
- im_ – hint that this resource is being used as an image.
- oe_ – hint that this resource is being used as an object or embed
- if_ – hint that this resource is being used as an iframe
- fr_ – hint that this resource is being used as a frame
However, these modifiers are essentially treated the same as mp_, deferring to content-type analysis to determine if rewriting is needed.
Configuring Rewriters¶
pywb provides customizable rewriting based on content-type. The available types are configured in pywb.rewrite.default_rewriter, which specifies the rewriter classes per known type and the mapping of content-types to rewriters.
HTML Rewriting¶
An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to url attributes to add the url rewriting prefix and Url Rewrite Type Modifier based on the HTML tag and attribute.
Inline CSS and JS in HTML is rewritten using CSS and JS specific rewriters.
CSS Rewriting¶
The CSS rewriter rewrites any urls found in <style> blocks in HTML, as well as any files determined to be css (based on text/css content type or cs_ modifier).
JS Rewriting¶
The JS rewriter is applied to inline <script> blocks, inline attribute js, and any files determined to be javascript (based on content type or the js_ modifier).
The default JS rewriter does not rewrite any links. Instead, the JS rewriter performs limited regular expression rewriting on the following:
- postMessage calls
- certain this property accessors
- specific location = assignment
Then, the entire script block is wrapped in a special code block to be executed client side. The result is that client-side access to location, window, top and other top-level objects goes through a client-side proxy object. The client-side rewriting is handled by wombat.js
The server-side rewriting is to aid the client-side execution of wrapped code.
For more information, see pywb.rewrite.regex_rewriters.JSWombatProxyRewriterMixin
JSONP Rewriting¶
A special case of JS rewriting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure the JSONP callback matches the expected param.
For example, a requested url might be /my-coll/http://example.com?callback=jQuery123 but the returned content might be:
jQuery456(...)
due to fuzzy matching, which matched this inexact response to the requested url.
To ensure the JSONP callback works as expected, the content is rewritten: jQuery456(...) -> jQuery123(...)
For more information, see pywb.rewrite.jsonp_rewriter
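The idea can be sketched with a regular expression. This is a simplified stand-in for illustration, not the code in pywb.rewrite.jsonp_rewriter: the function name rewrite_jsonp, the regex, and the assumption that the callback is passed in a query parameter named callback are all choices made for this example.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches a leading callback name followed by an opening parenthesis.
CALLBACK_RX = re.compile(r"^\s*([\w$.]+)\(")


def rewrite_jsonp(requested_url, content):
    """Replace the callback name in the content with the one that was requested."""
    query = parse_qs(urlsplit(requested_url).query)
    expected = query.get("callback", [None])[0]
    match = CALLBACK_RX.match(content)
    if not expected or not match:
        return content  # not JSONP, or nothing to compare against
    return content[:match.start(1)] + expected + content[match.end(1):]


fixed = rewrite_jsonp("http://example.com?callback=jQuery123", 'jQuery456({"a": 1})')
print(fixed)
```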
DASH and HLS Rewriting¶
To support recording and replaying adaptive streaming formats (DASH and HLS), pywb can perform special rewriting on the manifests for these formats to remove all but one possible resolution/format. As a result, the non-deterministic format selection is reduced to a single consistent format.
For more information, see pywb.rewrite.rewrite_hls and pywb.rewrite.rewrite_dash and the tests in pywb/rewrite/test/test_content_rewriter.py
Indexing¶
To provide access to the web archival data (local and remote), pywb uses indexes to represent each “capture” or “memento” in the archive. The WARC format itself does not provide a specific index, so an external index is needed.
Creating an Index¶
When adding a WARC using wb-manager, pywb automatically generates a CDXJ Format index.
The index can also be created explicitly using the cdx-indexer command line tool:
cdx-indexer -j example2.warc.gz
com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}
Note: the cdx-indexer tool is deprecated and will be replaced by the standalone cdxj-indexer package.
Index Formats¶
Classic CDX¶
Traditionally, an index for a web archive (WARC or ARC) file has been called a CDX file, probably from Capture/Crawl inDeX (CDX).
The CDX format originates with the Internet Archive and represents a plain-text space-delimited format, each line representing the information about a single capture. The CDX format could contain many different fields, and unfortunately, no standardized format existed.
The order of the fields typically includes a searchable url key and timestamp, to allow for binary sorting and search.
The ‘url search key’ is typically reversed to allow for easier searching of subdomains, e.g. example.com -> com,example)/
A classic CDX file might look like this:
CDX N b a m s k r M S V g
com,example)/ 20160225042329 http://example.com/ text/html 200 37cf167c2672a4a64af901d9484e75eee0e2c98a - - 1286 363 example2.warc.gz
A header is used to index the fields in the file, though typically a standard variation is used.
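The reversed url search key can be approximated in a few lines. This is a deliberately simplified sketch: real canonicalization applies more rules (for example to query arguments), and the function name simple_url_key, the lowercasing of the whole url and the stripping of a leading www. are assumptions made for this example.

```python
from urllib.parse import urlsplit


def simple_url_key(url):
    """Build a simplified reversed-host search key for a url."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host segments so that subdomains sort next to each other.
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


print(simple_url_key("http://example.com/"))
print(simple_url_key("http://www.iana.org/"))
```

Because the host is reversed, all captures under a domain share a common key prefix and end up adjacent in a sorted index.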
CDXJ Format¶
The pywb system uses a more flexible version of the CDX, called CDXJ, which stores most of the fields in a JSON dictionary:
com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}
The CDXJ format allows for more flexibility by allowing the index to contain a varying number of fields, while still allowing the index to be sortable by a common key (url key + timestamp). This allows CDXJ indexes from different sources and with different numbers of fields to be merged and sorted.
Using CDXJ indexes is recommended and pywb provides the wb-manager migrate-cdx tool for converting classic CDX to CDXJ.
In general, most discussions of CDX also apply to CDXJ indexes.
ZipNum Sharded Index¶
A CDX(J) file is generally accessed by doing a simple binary search through the file. This scales well to very large (GB+) CDXJ files. However, for very large archives (TB+ or PB+), binary search across a single file has its limits.
A more scalable alternative to a single CDX(J) file is a gzip-compressed, chunked cluster of CDXJ with a binary searchable index. In this format, sometimes called the ZipNum or Ziplines cluster (for some X number of cdx lines zipped together), all actual CDXJ lines are gzip compressed and concatenated together. To allow for random access, the lines are gzipped in groups of X lines (often 3000, but can be anything). This allows for the full index to be spread over N number of gzipped files, but has the overhead of requiring a full group of X lines to be read for each lookup. Generally, this overhead is negligible when looking up large indexes, and non-existent when doing a range query across many CDX lines.
The index can be split into an arbitrary number of shards, each containing a certain range of the url space. This allows the index to be created in parallel using MapReduce with a reduce task per shard. For each shard, there is an index file and a secondary index file. At the end, the secondary index is concatenated to form the final, binary searchable index.
The webarchive-indexing project provides tools for creating such an index, both locally and via MapReduce.
Single-Shard Index¶
A ZipNum index need not have multiple shards, and provides advantages even for smaller datasets. For example, in addition to less disk space from using compressed index, using the ZipNum index allows for the Pagination API to be available when using the cdx server for bulk querying.
Command-Line Apps¶
After installing the pywb tool-suite, the following command-line apps are made available (in the Python binary directory or current environment):
Each server tool has a different default port, which can be overridden via the -p <port> command-line option.
cdx-indexer¶
The CDX Indexer provides a way to create a CDX(J) file from a WARC/ARC. The tool supports both classic-CDX and new CDXJ formats.
The indexer also provides options for including all WARC records, and merging data from POST request (and other HTTP records).
See cdx-indexer -h for a list of options.
Note: In a future pywb release, this tool will be removed in favor of the standalone cdxj-indexer app, which will have additional indexing options.
wb-manager¶
The wb-manager command-line tool is used to configure the collections directory structure and its contents, which pywb uses to automatically read collections.
The tool can be used while wayback is running, and pywb will detect many changes automatically.
It can be used to:
- Create a new collection – wb-manager init <coll>
- Add WARCs to collection – wb-manager add <coll> <warc>
- Add override templates
- Add and remove metadata in a collection’s metadata.yaml
- List all collections
- Reindex a collection
- Migrate old CDX to CDXJ style indexes.
For more details, run wb-manager -h.
warcserver¶
The Warcserver is a standalone server component that adheres to the Warcserver API.
The server runs on port 8070 by default, serving both index and content.
The CDX Server is a subset of the Warcserver and queries using the CDXJ Server API are included:
http://localhost:8070/<coll>/index?url=http://example.com/
No rewriting or recording is performed by the Warcserver, but all collections from config.yaml are loaded.
wayback (pywb)¶
The main pywb application is installed as the wayback application. (The pywb name refers to the same application and may become the primary name in future versions).
The app will start on port 8080 by default, and configuration is read from config.yaml.
See Configuring the Web Archive for a detailed overview of configuration options and customizations.
live-rewrite-server¶
This command is a shortcut for wayback, but configured to run with only the Live Web Collection.
The live rewrite server runs on port 8090 and rewrites content from the live web, which is useful for testing.
This app is almost equivalent to wayback --live, except no other collections from config.yaml are used.
APIs¶
pywb supports the following APIs:
CDXJ Server API¶
The following is a reference of the api for querying and filtering archived resources.
The api can be used to get information about a range of archive captures/mementos, including filtering, sorting, and pagination for bulk query.
The actual archive files (WARC/ARC) files are not loaded during this query, only the generated CDXJ index.
The Warcserver component uses this same api internally to perform all index and resource lookups in a consistent way.
For example, the following query might return the first 10 results from host http://example.com/* where the mime type is text/html:
http://localhost:8080/coll/cdx?url=http://example.com/*&page=1&filter=mime:text/html&limit=10
By default, the api endpoint is available at /<coll>/cdx for a collection named <coll>.
The setting can be changed by setting cdx_api_endpoint in config.yaml.
For example, set cdx_api_endpoint: -index to use /<coll>-index as the endpoint (the default in older versions of pywb).
To disable CDXJ access altogether, set cdx_api_endpoint: ''
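Query urls like the one above can be assembled with the standard library. The helper below is a sketch for client code and not part of pywb. The name cdx_query_url is hypothetical, and its endpoint argument mirrors whatever value cdx_api_endpoint is set to.

```python
from urllib.parse import urlencode


def cdx_query_url(base, coll, endpoint="/cdx", **params):
    """Build a CDXJ Server API query url; list values repeat the parameter."""
    pairs = []
    for key, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        pairs.extend((key, item) for item in values)
    # Keep ':', '/' and '*' unescaped so the query stays readable.
    return f"{base}/{coll}{endpoint}?{urlencode(pairs, safe=':/*')}"


query = cdx_query_url(
    "http://localhost:8080", "coll",
    url="http://example.com/*", page=1, filter="mime:text/html", limit=10,
)
print(query)
```

Passing a list, for example filter=["mime:html", "status:200"], repeats the parameter once per value.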
API Reference¶
url¶
http://localhost:8080/coll/cdx?url=example.com will return a list of captures for ‘example.com’ in the collection coll (see above regarding per-collection api endpoints).
from, to¶
Setting from=<ts> or to=<ts> will restrict the results to the given date/time range (inclusive).
Timestamps may be up to 14 digits long; shorter timestamps are padded to the lower bound (for from) or the upper bound (for to).
For example, ...?url=example.com&from=2014&to=2014 will return results of example.com that have a timestamp between 20140101000000 and 20141231235959.
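The padding can be sketched as plain string completion. This illustrates the lower and upper bounds described above and is not pywb's own timestamp handling. The helper name pad_timestamp is hypothetical, and the upper bound is a lexicographic bound, so a partial month such as 201402 pads to day 31 rather than to a real calendar date.

```python
def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp to 14 digits as a lower or upper bound."""
    pad = "99991231235959" if upper else "00000101000000"
    return ts + pad[len(ts):]


def in_range(timestamp, from_ts, to_ts):
    """Inclusive check of a full 14-digit timestamp against from/to values."""
    return pad_timestamp(from_ts) <= timestamp <= pad_timestamp(to_ts, upper=True)


print(pad_timestamp("2014"), pad_timestamp("2014", upper=True))
print(in_range("20140126200624", "2014", "2014"))
```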
matchType¶
The cdx server supports the following matchType values:
- exact – default setting, will return captures that match the url exactly
- prefix – return captures that begin with a specified path, e.g. http://example.com/path/*
- host – return captures which belong to a given host (the path segment is ignored if specified)
- domain – return captures for the current host and all subdomains, e.g. *.example.com
As a shortcut, instead of specifying a separate matchType parameter, wildcards may be used in the url:
- ...?url=http://example.com/path/* is equivalent to ...?url=http://example.com/path/&matchType=prefix
- ...?url=*.example.com is equivalent to ...?url=example.com&matchType=domain
Note: if you are using legacy cdx index files which are not SURT-ordered, the domain option will not be available. If this is the case, you can use the wb-manager convert-cdx option to easily convert any cdx to the latest format.
limit¶
Setting limit= will limit the number of index lines returned. The limit must be set to a positive integer. If no limit is provided, all the matching lines are returned, which may be slow. (If using a ZipNum compressed cluster, the page size limit is enforced and no captures are read beyond the single page. See Pagination API for more info).
sort¶
The sort param can be set as follows:
- reverse – will sort the matching captures in reverse order. It is only recommended for exact queries, as reversing a large match may be very slow. (An optimized version is planned)
- closest – setting this option also requires setting closest=<ts> where <ts> is a specific timestamp to sort by. This option will only work correctly for exact queries and is useful for sorting captures based on time distance from a certain timestamp. (pywb uses this option internally for replay in order to fall back to the ‘next closest’ capture if one fails)
Both options may be combined with limit to return the top N closest, or the last N results.
output
¶
This option will toggle the output format of the resulting CDXJ.
output=cdxj
(default) native format used by pywb, it consists of a space-delimited url timestamp followed by a JSON dictionary (url timestamp {…})output=json
will return each line as a proper JSON dictionary, resulting in newline-delimited JSON (NDJSON).output=link
will return each line inapplication/link
format suitable for use as a Memento TimeMapoutput=text
will return each line as fully space-delimited. As the number of fields may vary due to mix of different sources, this format is not recommended and only provided for backward compatibility.
Using output=json
is recommended for extensive analysis and it may become the default option in a future release.
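As a rough illustration of the difference between the cdxj and json outputs, the following Python sketch parses one line of each. The sample lines and field values are invented for this example, and the helper names are hypothetical; real output may carry more fields.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into (url key, timestamp, dict of remaining fields)."""
    # The first two space-delimited tokens are plain text; the rest is JSON.
    urlkey, timestamp, json_part = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(json_part)


def parse_json_line(line):
    """With output=json every line is already a complete JSON dictionary."""
    return json.loads(line)


# Invented sample data, for illustration only.
cdxj_sample = 'com,example)/ 20140127171200 {"url": "http://example.com/", "mime": "text/html", "status": "200"}'
json_sample = '{"urlkey": "com,example)/", "timestamp": "20140127171200", "url": "http://example.com/"}'

urlkey, timestamp, fields = parse_cdxj_line(cdxj_sample)
record = parse_json_line(json_sample)
print(urlkey, timestamp, fields["mime"], record["url"])
```

The json output needs no custom splitting, which is one reason it is convenient for larger analyses.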
filter¶
The filter param can be specified multiple times to filter by specific fields in the cdx index. Field names correspond to the fields returned in the JSON output. Filters can be specified as follows:
...?url=example.com/*&filter==mime:text/html&filter=!=status:200
Return captures from example.com/* where mime is text/html and http status is not 200.
...?url=example.com&matchType=domain&filter=~url:.*\.php$
Return captures from the domain example.com whose URL ends in .php.
The ! modifier before =status indicates negation. The = and ~ modifiers are optional and specify exact and regular expression matches, respectively. The default (no modifier) is to test whether the expression is contained in the field value. Negation and the exact/regex modifiers may be combined, eg. filter=!~mime:text/.*
The formal syntax, matching the examples in this section, is: filter=[!][=|~]<fieldname>:<expression> with the following modifiers:
modifier(s) | example | description |
---|---|---|
(no modifier) | filter=mime:html | field “mime” contains string “html” |
= | filter==mime:text/html | exact match: field “mime” is “text/html” |
~ | filter=~mime:.*/html$ | regex match: expression matches beginning of field “mime” (cf. re.match) |
! | filter=!mime:html | field “mime” does not contain string “html” |
!= | filter=!=mime:text/html | field “mime” is not “text/html” |
!~ | filter=!~mime:.*/html | expression does not match beginning of field “mime” |
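The modifier semantics in the table above can be sketched in a few lines of Python. This is a simplified illustration written for this guide, not pywb's own filter code: the function names and the sample record are hypothetical, and the sketch assumes every filter names a field.

```python
import re


def parse_filter(value):
    """Split a filter value such as '!=status:200' into (negate, mode, field, expr)."""
    negate = value.startswith("!")
    if negate:
        value = value[1:]
    mode = "contains"  # default when neither = nor ~ is given
    if value.startswith("="):
        mode, value = "exact", value[1:]
    elif value.startswith("~"):
        mode, value = "regex", value[1:]
    # Assumption: a field name is always present before the first colon.
    field, _, expr = value.partition(":")
    return negate, mode, field, expr


def matches(value, record):
    """Return True if the record passes the given filter value."""
    negate, mode, field, expr = parse_filter(value)
    actual = record.get(field, "")
    if mode == "exact":
        hit = actual == expr
    elif mode == "regex":
        hit = re.match(expr, actual) is not None  # anchored at the start, like re.match
    else:
        hit = expr in actual
    return hit != negate


# Invented sample record, for illustration only.
sample = {"mime": "text/html", "status": "404", "url": "http://example.com/index.php"}
print(matches("=mime:text/html", sample), matches("!=status:200", sample))
```

Because each filter is evaluated independently, repeating the filter param simply means a capture must pass every filter.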
fields¶
The fields param can be used to specify which fields to include in the output. The standard available fields are usually: urlkey, timestamp, url, mime, status, digest, length, offset, filename
If a minimal cdx index is used, the mime and status fields may not be available. Additional fields may be introduced in the future, especially in the CDX JSON format.
Fields can be comma delimited, for example fields=urlkey,timestamp,filename will only include the urlkey, timestamp and filename in the output.
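Since filter can be repeated and fields is comma-delimited, assembling a query string by hand is error-prone. A small Python sketch using only the standard library is shown here; the helper name build_cdx_query is hypothetical, and the endpoint http://localhost:8080/pywb/cdx assumes a local collection named pywb.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_cdx_query(endpoint, url, filters=(), fields=(), **extra):
    """Build a CDX server query URL with repeated filter params (illustrative)."""
    params = [("url", url)]
    # Each filter becomes its own filter= parameter.
    params += [("filter", item) for item in filters]
    if fields:
        params.append(("fields", ",".join(fields)))
    params += sorted(extra.items())
    # urlencode percent-encodes reserved characters such as '=', '!' and ':'.
    return endpoint + "?" + urlencode(params)


query = build_cdx_query(
    "http://localhost:8080/pywb/cdx",
    "example.com/*",
    filters=["=mime:text/html", "!=status:200"],
    fields=["urlkey", "timestamp", "filename"],
    output="json",
)
decoded = parse_qs(urlsplit(query).query)
print(query)
print(decoded["filter"])
```

Percent-encoding the values avoids ambiguity between the = that separates a parameter from its value and the = exact-match modifier.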
Pagination API¶
The cdx server supports an optional pagination api, but it is currently only available when using a ZipNum Sharded Index instead of plain text cdx files. (Additional pagination support may be added for CDXJ files as well.)
The pagination api supports the following params:
page¶
page is the current page number, and defaults to 0 if omitted. If page exceeds the number of available pages reported by the page count query, a 400 error will be returned.
pageSize¶
pageSize is an optional parameter which can increase or decrease the amount of data returned in each page.
showNumPages=true¶
This is a special query which, if successful, always returns a JSON response indicating the size of the full results. The query should be very quick regardless of the size of the query.
{"blocks": 423, "pages": 85, "pageSize": 5}
In this result:
- pages is the total number of pages available for this query. The page parameter may be between 0 and pages - 1
- pageSize is the total number of ZipNum compressed blocks that are read for each page. The default value can be set in the pywb config.yaml via the max_blocks: 5 option.
- blocks is the actual number of compressed blocks that match the query. This can be used to quickly estimate the total number of captures, within a margin of error. In general, blocks / pageSize + 1 = pages (since there is always at least 1 page even if blocks < pageSize)
If changing pageSize, the same value should be used for both the showNumPages query and the regular paged query. ex:
- Use ...pageSize=2&showNumPages=true and read pages to get the total number of pages
- Use ...pageSize=2&page=N to read the N-th page, from 0 to pages-1
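The two-step procedure can be sketched in Python as follows. The sketch only builds query strings and applies the rule of thumb given above; it does not contact a server. The helper names and the base query are assumptions for illustration, and the pages value returned by showNumPages should be treated as authoritative.

```python
def estimate_pages(blocks, page_size):
    """Apply the rule of thumb blocks / pageSize + 1 = pages (integer division)."""
    return blocks // page_size + 1


def page_queries(base_query, pages, page_size):
    """Yield one paged query per page, from page 0 to pages - 1."""
    for page in range(pages):
        yield "{0}&pageSize={1}&page={2}".format(base_query, page_size, page)


# Sample showNumPages response taken from the example above.
info = {"blocks": 423, "pages": 85, "pageSize": 5}

# Hypothetical base query against a local collection named pywb.
base = "http://localhost:8080/pywb/cdx?url=example.com/*"
urls = list(page_queries(base, info["pages"], info["pageSize"]))
print(estimate_pages(info["blocks"], info["pageSize"]), urls[0], urls[-1])
```

Note that the same pageSize is passed to every paged query, as recommended above.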
showPagedIndex=true¶
When this param is set, the returned data is the secondary index instead of the actual CDX. Each line represents a compressed cdx block, and the number of lines returned should correspond to the blocks value in the showNumPages query. This query is used internally before reading the actual compressed blocks and should be significantly faster.
At this time, this option cannot be combined with other query params listed in the api, except for output=json. Using output=json is recommended with this query as the default text format may change in the future.
Memento API¶
pywb supports the Memento Protocol as specified in RFC 7089 and provides API endpoints for Memento TimeMaps and TimeGates per collection.
Memento support is enabled by default and can be controlled via the enable_memento: true|false setting in the config.yaml.
TimeMap API¶
The timemap API is available at /<coll>/timemap/<type>/<url> for any pywb collection <coll> and <url> in the collection.
The timemap (URI-T) can be provided in several output formats, as specified by the <type> param:
- link – returns an application/link-format timemap, as required by the Memento spec
- cdxj – returns a timemap in the native CDXJ format.
- json – returns the timemap in newline-delimited JSON (NDJSON) format.
Although not required by the Memento spec, the Link output produced by timemap also includes the extra collection=
field, specifying
the collection of each url. This is especially useful when accessing the timemap for the special Auto “All” Aggregate Collection to view a timemap across
multiple collections in a single response.
The Timemap API is implemented as a subset of the CDXJ Server API and should produce the same result as the equivalent CDX server query.
For example, the timemap query:
http://localhost:8080/pywb/timemap/link/http://example.com/
is equivalent to the CDX server query:
http://localhost:8080/pywb/cdx?url=http://example.com/&output=link
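The correspondence between the two URL forms can be written down as a pair of small Python helpers. These are illustrative string builders only, with hypothetical names; they reproduce the example above for a collection named pywb on localhost:8080.

```python
def timemap_url(host, coll, output, url):
    """Build a timemap URL of the form /<coll>/timemap/<type>/<url>."""
    return "{0}/{1}/timemap/{2}/{3}".format(host, coll, output, url)


def equivalent_cdx_url(host, coll, output, url):
    """Build the CDX server query described above as equivalent to the timemap."""
    return "{0}/{1}/cdx?url={2}&output={3}".format(host, coll, url, output)


host = "http://localhost:8080"
print(timemap_url(host, "pywb", "link", "http://example.com/"))
print(equivalent_cdx_url(host, "pywb", "link", "http://example.com/"))
```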
TimeGate API¶
The TimeGate API for any pywb collection is /<coll>/<url>
, eg. /my-coll/http://example.com/
The timegate can either be a non-redirecting timegate (URI-M, 200-style negotiation) and return a URI-M response, or a redirecting timegate (302-style negotiation) and redirect to a URI-M.
Non-Redirecting TimeGate (Memento Pattern 2.2)¶
This behavior is consistent with Memento Pattern 2.2 and is the default behavior.
To avoid an extra redirect, the TimeGate returns the requested memento directly (200-style negotiation) without redirecting to its canonical, timestamped url.
The ‘canonical’ URI-M is included in the Content-Location
header and should be used to reference the memento in the future.
(For HTML Mementos, the rewriting system also injects the url and timestamp into the page so that it can be displayed to the user). This behavior optimizes network traffic by avoiding unneeded redirects.
Redirecting TimeGate (Memento Pattern 2.3)¶
This behavior is consistent with Memento Pattern 2.3.
To enable this behavior, add redirect_to_exact: true to the config.
In this mode, the TimeGate always issues a 302 to redirect a request to the “canonical” URI-M memento. The Location header is always present with the redirect.
As this approach always includes a redirect, use of this system is discouraged when the intent is to render mementos. However, this approach is useful when the goal is to determine the URI-M and to provide backwards compatibility.
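From a client's point of view, the two TimeGate styles differ only in where the canonical URI-M is found. The following Python sketch captures that difference, assuming the caller has already obtained the status code and headers of a TimeGate response without following redirects. The function name and the sample URI-M are hypothetical.

```python
def memento_uri(status, headers):
    """Return the canonical URI-M from a TimeGate response, or None if absent."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 200:
        # Non-redirecting TimeGate (Pattern 2.2): URI-M is in Content-Location.
        return lowered.get("content-location")
    if status == 302:
        # Redirecting TimeGate (Pattern 2.3): URI-M is in Location.
        return lowered.get("location")
    return None


# Invented URI-M for illustration only.
uri_m = "http://localhost:8080/my-coll/20140127171200/http://example.com/"
print(memento_uri(200, {"Content-Location": uri_m}))
print(memento_uri(302, {"Location": uri_m}))
```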
Proxy Mode Memento API¶
When running in HTTP/S Proxy Mode, pywb behaves roughly in accordance with Memento Pattern 1.3.
Every URI in proxy mode is also a TimeGate, and the Accept-Datetime header can be used to specify which timestamp to use in proxy mode.
The Accept-Datetime header overrides any other timestamp setting in proxy mode.
The main distinction from the standard is that the URI-R, the original resource, is not available in proxy mode. (It is simply the URL loaded without the proxy, which is not possible to specify via the URL alone).
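As a sketch of how a client might request a specific datetime in proxy mode, the following Python code prepares a request with an Accept-Datetime header and a proxy-aware opener using only the standard library. It assumes pywb is running in proxy mode on localhost:8080; the helper names are hypothetical, and nothing is sent until the opener is actually used.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request


def accept_datetime(moment):
    """Format an aware UTC datetime as an HTTP date string."""
    return format_datetime(moment, usegmt=True)


def proxied_request(url, moment):
    """Build a request for the original URL carrying an Accept-Datetime header."""
    request = urllib.request.Request(url)
    request.add_header("Accept-Datetime", accept_datetime(moment))
    return request


def proxy_opener(proxy="http://localhost:8080"):
    """Create an opener that routes traffic through the assumed pywb proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


when = datetime(2014, 1, 27, 17, 12, 0, tzinfo=timezone.utc)
request = proxied_request("http://example.com/", when)
# urllib stores header names capitalized, hence "Accept-datetime" here.
print(request.get_header("Accept-datetime"))
# Sending requires a running proxy: proxy_opener().open(request)
```

For HTTPS URLs the client would additionally need to trust the pywb certificate authority.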
URI-M Headers¶
When serving a URI-M (any archived url), the following additional headers are included in accordance with the Memento spec:
- Link header with at least original, timegate and timemap relations
- Content-Location is included if using Non-Redirecting TimeGate (Memento Pattern 2.2) behavior
(Note: the Content-Location may also be included in case of a fuzzy-matching response, where the actual/canonical url is different from the requested url due to an inexact match)
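A client can pull the original, timegate and timemap relations out of the Link header with a small amount of parsing. The Python sketch below is deliberately simplified: it assumes rel is the first parameter after each URI, and the sample header value is invented for illustration rather than captured from pywb.

```python
import re

# Matches '<uri>; rel="..."' pairs; assumes rel directly follows the URI.
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def link_relations(link_header):
    """Map each rel value to the list of URIs carrying that relation."""
    relations = {}
    for uri, rels in LINK_RE.findall(link_header):
        # A single link may carry several space-separated rel values.
        for rel in rels.split():
            relations.setdefault(rel, []).append(uri)
    return relations


# Invented header value, for illustration only.
sample_link = (
    '<http://example.com/>; rel="original", '
    '<http://localhost:8080/pywb/timemap/link/http://example.com/>; '
    'rel="timemap"; type="application/link-format", '
    '<http://localhost:8080/pywb/http://example.com/>; rel="timegate"'
)
relations = link_relations(sample_link)
print(sorted(relations))
```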
OpenWayback Transition Guide¶
This guide provides guidelines for transitioning from OpenWayback to pywb, with additional recommendations. The main recommendation is to run pywb along with OutbackCDX and nginx, and this configuration is covered below, along with additional options.
OpenWayback vs pywb Terms¶
pywb and OpenWayback use slightly different terms to describe the configuration options, as explained below.
Some differences are:
- The wayback.xml config file in OpenWayback is replaced with the config.yaml YAML file in pywb.
- The terms Access Point and Wayback Collection are replaced with Collection in pywb. The collection configuration represents a unique path (access point) and the data that is accessed at that path.
- The Resource Store in OpenWayback is known in pywb as the archive paths, configured under archive_paths
- The Resource Index in OpenWayback is known in pywb as the index paths, configurable under index_paths
- The Exclusions in OpenWayback are replaced with general Embargo and Access Control
Pywb Collection Basics¶
A pywb collection must consist of a minimum of three parts: the collection name, the index_paths
(where to read the index), and the archive_paths
(where to read the WARC files).
The collection is accessed by name, so there is no distinct access point.
The collections are configured in the config.yaml
under the collections
key:
For example, a basic collection definition can be specified via:
collections:
  wayback:
    index_paths: /archive/cdx/
    archive_paths: /archive/storage/warcs/
Pywb also supports a convention-based directory structure. Collections created in this structure can be detected automatically
and need not be specified in the config.yaml
. This structure is designed for smaller collections that are all stored locally in a subdirectory.
See the Directory Structure for the default pywb directory structure.
However, for importing existing collections from OpenWayback, it is probably easier to specify the existing paths as shown above.
Using OutbackCDX with pywb¶
The recommended setup is to run OutbackCDX alongside pywb. OutbackCDX provides an index (CDX) server and can efficiently store and look up web archive data by URL.
Adding CDX to OutbackCDX¶
To set up OutbackCDX, please follow the instructions on the OutbackCDX README.
Since pywb also uses the default port 8080, be sure to use a different port for OutbackCDX, eg. java -jar outbackcdx*.jar -p 8084
.
OutbackCDX can generally ingest existing CDX used in OpenWayback simply by POSTing to OutbackCDX at a new index endpoint.
For example, assuming OutbackCDX is running on port 8084, to add CDX for index1.cdx
, index2.cdx
, run:
curl -X POST --data-binary @index1.cdx http://localhost:8084/mycoll
curl -X POST --data-binary @index2.cdx http://localhost:8084/mycoll
The contents of each CDX file are added to the mycoll
OutbackCDX index, which can correspond to the web archive collection mycoll
.
The index is created automatically if it does not exist.
See the OutbackCDX Docs for more info on ingesting CDX.
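For scripted ingestion, the curl commands above can be mirrored in Python with the standard library. This is a sketch under the same assumptions as the example (OutbackCDX on port 8084, index named mycoll); the helper names are hypothetical, and ingest_file only works against a running OutbackCDX instance.

```python
import urllib.request


def build_ingest_request(cdx_bytes, index_url="http://localhost:8084/mycoll"):
    """Build (but do not send) a POST request carrying raw CDX lines."""
    return urllib.request.Request(index_url, data=cdx_bytes, method="POST")


def ingest_file(path, index_url="http://localhost:8084/mycoll"):
    """Read a CDX file and POST its contents; returns the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_ingest_request(handle.read(), index_url)
    with urllib.request.urlopen(request) as response:
        return response.status


# Invented CDX line, for illustration only.
line = b"com,example)/ 20140127171200 http://example.com/ text/html 200 - - - 1046 334 example.warc.gz\n"
request = build_ingest_request(line)
print(request.get_method(), request.full_url)
```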
(Re)generating CDX from WARCs¶
There are some exceptions where it may be useful to re-generate the CDX with pywb for existing WARCs:
- If your CDX is 9-field and does not include the compressed length, regenerating the CDX will result in more efficient HTTP range requests
- If you want to replay pages with POST requests, pywb generated CDX will soon be supported in OutbackCDX (see: Issue #585, Issue #91 )
To generate the CDX, run the cdx-indexer
command (with -p
flag for POST request handling) for each WARC or set of WARCs you wish to index:
cdx-indexer /path/to/mywarcs/my.warc.gz > ./index1.cdx
cdx-indexer /path/to/all_warcs/*warc.gz > ./index2.cdx
Then, run the POST command as shown above to ingest to OutbackCDX.
The above can be repeated for each WARC file, or for a set of WARCs using the *.warc.gz
wildcard.
If a CDX index is too big, OutbackCDX may fail and ingesting an index per-WARC may be needed.
Configure pywb with OutbackCDX¶
The config.yaml
should be configured to point to OutbackCDX.
Assuming a collection named mycoll, the config.yaml can be configured as follows to use OutbackCDX:
collections:
  mycoll:
    index_paths: cdx+http://localhost:8084/mycoll
    archive_paths: /path/to/mywarcs/
The archive_paths
can be configured to point to a directory of WARCs or a path index.
Migrating CDX¶
If you are not using OutbackCDX, you may need to check on the format of the CDX files that you are using.
Over the years, there have been many variations on the CDX (capture index) format which is used by OpenWayback and pywb to look up captures in WARC/ARC files.
When migrating CDX from OpenWayback, there are a few options.
pywb currently supports:
- 9 field CDX (surt-ordered)
- 11 field CDX (surt-ordered)
- CDXJ (surt-ordered)
pywb supports the 11-field and 9-field CDX formats that are also used in OpenWayback.
Non-SURT ordered CDXs are not currently supported, though they may be supported in the future (see this pending pull request).
CDXJ Conversion¶
The native format used by pywb is the CDXJ Format with SURT-ordering. It stores most of the index fields in a JSON dictionary, which allows more flexibility and supports optional fields as needed.
If your CDX files are not SURT-ordered 11 or 9 field CDX, or if there is a mix of formats, pywb also offers a conversion utility which will convert all CDX to the pywb native CDXJ:
wb-manager cdx-convert <dir-of-cdx-files>
The converter will read the CDX files and create a corresponding .cdxj file for every cdx file. Since the conversion happens on the .cdx itself, it does not require reindexing the source WARC/ARC files and can happen fairly quickly. The converted CDXJ are guaranteed to be in the right format to work with pywb.
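To give a feel for what such a conversion involves, here is a simplified Python sketch that turns a single 11-field CDX line into a CDXJ line. It is not the wb-manager converter: it assumes the common 11-field order (urlkey, timestamp, url, mime, status, digest, redirect, meta flags, length, offset, filename), assumes the url key is already SURT-ordered, and uses an invented sample line.

```python
import json

# Assumption: the common 11-field CDX column order.
CDX11_FIELDS = ["urlkey", "timestamp", "url", "mime", "status", "digest",
                "redirect", "metaflags", "length", "offset", "filename"]
# Fields carried over into the JSON part of the CDXJ line in this sketch.
JSON_FIELDS = ["url", "mime", "status", "digest", "length", "offset", "filename"]


def cdx11_to_cdxj(line):
    """Convert one 11-field CDX line to a CDXJ line (illustrative only)."""
    parts = line.split()
    if len(parts) != len(CDX11_FIELDS):
        raise ValueError("expected 11 space-delimited fields")
    record = dict(zip(CDX11_FIELDS, parts))
    # '-' marks an empty value in CDX, so such fields are left out.
    payload = {name: record[name] for name in JSON_FIELDS if record[name] != "-"}
    return "{0} {1} {2}".format(record["urlkey"], record["timestamp"], json.dumps(payload))


# Invented sample line, for illustration only.
cdx_line = ("com,example)/ 20140127171200 http://example.com/ text/html 200 "
            "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 example.warc.gz")
cdxj_line = cdx11_to_cdxj(cdx_line)
print(cdxj_line)
```

Because optional values live in the JSON dictionary, missing fields can simply be omitted rather than padded with placeholders.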
Converting OpenWayback Config to pywb Config¶
OpenWayback includes many different types of configurations.
For most use cases, using OutbackCDX with pywb is the recommended approach, as explained in Using OutbackCDX with pywb.
The following are a few specific examples of WaybackCollections gathered from active OpenWayback configurations and how they can be configured for use with pywb.
Remote Collection / Access Point¶
A collection configured with a remote index and WARC access can be converted to use OutbackCDX for the remote index, while pywb can load WARCs directly from an HTTP endpoint.
For example, a configuration similar to:
<bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
<property name="accessPointPath" value="/wayback/"/>
<property name="collection" ref="remotecollection" />
...
</bean>
<bean id="remotecollection" class="org.archive.wayback.webapp.WaybackCollection">
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.SimpleResourceStore">
<property name="prefix" value="http://myarchive.example.com/RemoteStore/" />
</bean>
</property>
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
<property name="searchUrlBase" value="http://myarchive.example.com/RemoteIndex" />
</bean>
</property>
</bean>
can be converted to the following config, with OutbackCDX assumed to be running
at: http://myarchive.example.com/RemoteIndex
collections:
  wayback:
    index_paths: cdx+http://myarchive.example.com/RemoteIndex
    archive_paths: http://myarchive.example.com/RemoteStore/
Local Collection / Access Point¶
An OpenWayback configuration with a local collection and local CDX, for example:
<bean id="collection" class="org.archive.wayback.webapp.WaybackCollection">
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.cdxserver.EmbeddedCDXServerIndex">
...
<property name="cdxServer">
<bean class="org.archive.cdxserver.CDXServer">
<property name="cdxSource">
<bean class="org.archive.format.cdx.MultiCDXInputSource">
<property name="cdxUris">
<list>
<value>/wayback/cdx/mycdx1.cdx</value>
<value>/wayback/cdx/mycdx2.cdx</value>
</list>
</property>
</bean>
</property>
<property name="cdxFormat" value="cdx11"/>
<property name="surtMode" value="true"/>
</bean>
</property>
...
</bean>
</property>
</bean>
can be configured in pywb using the index_paths
key.
Note that the CDX files should all be in the same format. See Migrating CDX for more info on converting CDX to pywb native CDXJ format.
collections:
  wayback:
    index_paths: /wayback/cdx/
    archive_paths: ...
It’s also possible to combine directories, individual CDX files, and even a remote index from OutbackCDX in a single collection (as long as all CDX are in the same format).
pywb will query all the sources simultaneously to find the best match.
collections:
  wayback:
    index_group:
      cdx1: /wayback/cdx1/
      cdx2: /wayback/cdx2/mycdx.cdx
      remote: cdx+https://myarchive.example.com/outbackcdx
    archive_paths: ...
However, OutbackCDX is still recommended to avoid more complex CDX configurations.
WatchedCDXSource¶
OpenWayback includes a ‘Watched CDX Source’ option which watches a directory for new CDX indexes. This functionality is default in pywb when specifying a directory for the index path:
For example, the config:
<property name="source">
<bean class="org.archive.wayback.resourceindex.WatchedCDXSource">
<property name="recursive" value="false" />
<property name="filters">
<list>
<value>^.+\.cdx$</value>
</list>
</property>
<property name="path" value="/wayback/cdx-index/" />
</bean>
</property>
can be replaced with:
collections:
  wayback:
    index_paths: /wayback/cdx-index/
    archive_paths: ...
pywb will load all CDX from that directory.
ZipNum Cluster Index¶
pywb also supports using a compressed ZipNum Sharded Index instead of a plain text CDX. For example, the following OpenWayback configuration:
<bean id="collection" class="org.archive.wayback.webapp.WaybackCollection">
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
...
<property name="source">
<bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
<property name="cluster">
<bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
<property name="summaryFile" value="/webarchive/zipnum-cdx/all.summary"></property>
<property name="locFile" value="/webarchive/zipnum-cdx/all.loc"></property>
</bean>
</property>
...
</bean>
</property>
</bean>
</property>
</bean>
can simply be converted to the pywb config:
collections:
wayback:
index_paths: /webarchive/zipnum-cdx
# if the index is not surt ordered
surt_ordered: false
pywb will automatically determine the .summary
and use the .loc
files for the ZipNum Cluster if they are present in the directory.
Note that if the ZipNum index is not SURT ordered, the surt_ordered: false
flag must be added to support this format.
Path Index Configuration¶
OpenWayback supports a ‘path index’ that can be used to look up a WARC by filename and map to an exact path. For compatibility, pywb supports the same path index lookup, as well as loading WARC files by path or URL prefix.
For example, an OpenWayback configuration that includes a path index:
<bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
<property name="path" value="/archive/warc-paths.txt"/>
</bean>
<bean id="resourceStore" class="org.archive.wayback.resourcestore.LocationDBResourceStore">
<property name="db" ref="resourcefilelocationdb" />
</bean>
can be configured in the archive_paths
field of pywb collection configuration:
collections:
  wayback:
    index_paths: ...
    archive_paths: /archive/warc-paths.txt
The path index is a tab-delimited text file for mapping WARC filenames to full file paths or URLs, eg:
example.warc.gz<tab>/some/path/to/example.warc.gz
another.warc.gz<tab>/some-other/path/another.warc.gz
remote.warc.gz<tab>http://warcstore.example.com/serve/remote.warc.gz
However, if all WARC files are stored in the same directory, or in a few directories, a path index is not needed and pywb will try loading the WARC by prefix.
The archive_paths
can accept a list of entries. For example, given the config:
collections:
  wayback:
    index_paths: ...
    archive_paths:
      - /archive/warcs1/
      - /archive/warcs2/
      - https://myarchive.example.com/warcs/
      - /archive/warc-paths.txt
And the WARC file: example.warc.gz
, pywb will try to find the WARC in order from:
1. /archive/warcs1/example.warc.gz
2. /archive/warcs2/example.warc.gz
3. https://myarchive.example.com/warcs/example.warc.gz
4. Looking up example.warc.gz in /archive/warc-paths.txt
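The lookup order above can be sketched in Python as follows. This is an illustration of the described behavior, not pywb's loader: the function names are hypothetical, and the sketch assumes the caller has already read each path index file into memory and knows which archive_paths entries are path indexes.

```python
def parse_path_index(text):
    """Parse tab-delimited 'filename<TAB>location' lines into a dict."""
    index = {}
    for line in text.splitlines():
        if line.strip():
            name, location = line.split("\t", 1)
            index[name] = location.strip()
    return index


def candidate_locations(filename, archive_paths, path_indexes):
    """List the locations to try for a WARC, in configured order."""
    candidates = []
    for entry in archive_paths:
        if entry in path_indexes:
            # Entry is a path index: look the filename up instead of prefixing.
            location = path_indexes[entry].get(filename)
            if location:
                candidates.append(location)
        else:
            # Entry is a directory or URL prefix.
            candidates.append(entry + filename)
    return candidates


index_text = (
    "example.warc.gz\t/some/path/to/example.warc.gz\n"
    "another.warc.gz\t/some-other/path/another.warc.gz\n"
    "remote.warc.gz\thttp://warcstore.example.com/serve/remote.warc.gz\n"
)
archive_paths = [
    "/archive/warcs1/",
    "/archive/warcs2/",
    "https://myarchive.example.com/warcs/",
    "/archive/warc-paths.txt",
]
path_indexes = {"/archive/warc-paths.txt": parse_path_index(index_text)}
locations = candidate_locations("example.warc.gz", archive_paths, path_indexes)
print(locations)
```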
Proxy Mode Access¶
An OpenWayback configuration may include many beans to support proxy mode, eg:
<bean id="proxyreplaydispatcher" class="org.archive.wayback.replay.SelectorReplayDispatcher">
...
<property name="renderer">
<bean class="org.archive.wayback.proxy.HttpsRedirectAndLinksRewriteProxyHTMLMarkupReplayRenderer">
...
<property name="uriConverter">
<bean class="org.archive.wayback.proxy.ProxyHttpsResultURIConverter"/>
</property>
</bean>
</property>
</bean>
<bean name="proxy" class="org.archive.wayback.webapp.AccessPoint">
<property name="internalPort" value="${proxy.port}"/>
<property name="accessPointPath" value="${proxy.port}" />
<property name="collection" ref="localcdxcollection" />
...
</bean>
In pywb, the proxy mode can be enabled by adding to the main config.yaml
the name of the collection
that should be served in proxy mode:
proxy:
  source_coll: wayback
There are some differences between OpenWayback and pywb proxy mode support.
In OpenWayback, proxy mode is configured using separate access points for different collections on different ports. OpenWayback only supports HTTP proxy and attempts to rewrite HTTPS URLs to HTTP.
In pywb, proxy mode is enabled on the same port as regular access, and pywb supports HTTP and HTTPS proxy. pywb does not attempt to rewrite HTTPS to HTTP, as most browsers disallow HTTP access as insecure for many sites. pywb supports a default collection that is enabled for proxy mode, and a default timestamp accessed by the proxy mode. (Switching the collection and date accessed is possible but not currently supported without extensions to pywb).
To support HTTPS access, pywb provides a certificate authority that can be trusted by a browser to rewrite HTTPS content.
See HTTP/S Proxy Mode for all of the options of pywb proxy mode configuration.
Migrating Exclusion Rules¶
pywb includes a new Embargo and Access Control system, which allows granular allow/block/exclude access control rules on paths and subpaths.
The rules are configured in .aclj files, and a command-line utility exists to import OpenWayback exclusions into the pywb ACLJ format.
For example, given an OpenWayback exclusion list configuration for a static file:
<bean id="excluder-factory-static" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory">
<property name="file" value="/archive/exclusions.txt"/>
<property name="checkInterval" value="600000" />
</bean>
The exclusions file can be converted to an .aclj file by running:
wb-manager acl importtxt /archive/exclusions.aclj /archive/exclusions.txt exclude
Then, in the pywb config, specify:
collections:
  wayback:
    index_paths: ...
    archive_paths: ...
    acl_paths: /archive/exclusions.aclj
It is possible to specify multiple access control files, which will all be applied.
Using block
instead of exclude
will result in pywb returning a 451 error, indicating that URLs are in the index but blocked.
CLI Tool¶
After exclusions have been imported, it is recommended to use the wb-manager acl command-line tool for managing exclusions:
To add an exclusion, run:
wb-manager acl add /archive/exclusions.aclj http://httpbin.org/anything/something exclude
To remove an exclusion, run:
wb-manager acl remove /archive/exclusions.aclj http://httpbin.org/anything/something
For more options, see the full Embargo and Access Control documentation or run wb-manager acl --help
.
Not Yet Supported¶
Some OpenWayback exclusion options are not yet supported in pywb. The following are not yet supported in the access control system:
- Exclusions/Access Control By specific date range
- Regex based exclusions
- Date Range Embargo on All URLs
- Robots.txt-based exclusions
Deploying pywb: Collection Paths and routing with Nginx/Apache¶
In pywb, the collection name is also the access point, and each of the collections in config.yaml
can be accessed by their name as the subpath:
collections:
  wayback:
    ...
  another-collection:
    ...
If pywb is deployed on port 8080, each collection will be available under:
http://<hostname>/wayback/*/https://example.com/
and http://<hostname>/another-collection/*/https://example.com/
To make a collection available under the root, simply set its name to: $root
collections:
  $root:
    ...
  another-collection:
    ...
Now, the first collection is available at: http://<hostname>/*/https://example.com/
.
To deploy pywb on a subdirectory, eg. http://<hostname>/pywb/another-collection/*/https://example.com/
,
and in general, for production use, it is recommended to deploy pywb behind an Nginx or Apache reverse proxy.
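The mapping from collection names to access URLs can be summarized with a short Python sketch. The helper name and its prefix parameter are hypothetical and only model the URL shapes shown above, including the special $root name and an optional subdirectory added by a reverse proxy.

```python
def collection_url(host, coll, url, timestamp="*", prefix=""):
    """Build an access URL for a collection (illustrative only)."""
    parts = [host.rstrip("/")]
    if prefix:
        # Optional subdirectory, eg. when deployed behind a reverse proxy.
        parts.append(prefix.strip("/"))
    if coll != "$root":
        # The special $root collection is served from the top level.
        parts.append(coll)
    parts.append(timestamp)
    return "/".join(parts) + "/" + url


host = "http://localhost:8080"
print(collection_url(host, "wayback", "https://example.com/"))
print(collection_url(host, "$root", "https://example.com/"))
print(collection_url(host, "another-collection", "https://example.com/", prefix="pywb"))
```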
Nginx and Apache Reverse Proxy¶
The recommended deployment for pywb is with uWSGI and behind an Nginx or Apache frontend.
This configuration allows for a more robust deployment and lets these servers handle static files.
See the Sample Nginx Configuration and Sample Apache Configuration sections for more info on deploying with Nginx and Apache.
Working Docker Compose Examples¶
The pywb Deployment Examples include working examples of deploying pywb with Nginx, Apache and OutbackCDX using Docker and Docker Compose, widely available container tools.
See Installing Docker and Installing Docker Compose for instructions on how to install these tools.
The examples are available in the sample-deploy
directory of the pywb repo. The examples include:
- docker-compose-outback.yaml – Docker Compose config to start OutbackCDX and pywb, and ingest sample data into OutbackCDX
- docker-compose-nginx.yaml – Docker Compose config to launch pywb and latest Nginx, with pywb running on subdirectory /wayback and Nginx serving static files from pywb.
- docker-compose-apache.yaml – Docker Compose config to launch pywb and latest Apache, with pywb running on subdirectory /wayback and Apache serving static files from pywb.
The examples are designed to be run one at a time, and assume port 8080 is available.
After installing Docker and Docker Compose, run one of:
docker-compose -f docker-compose-outback.yaml up
docker-compose -f docker-compose-nginx.yaml up
docker-compose -f docker-compose-apache.yaml up
This will download the standard Docker images and start all of the components in Docker.
If everything works correctly, you should be able to access: http://localhost:8080/pywb/https://example.com/
to view the sample pywb collection.
Press CTRL+C to interrupt and stop the example in the console.
pywb package¶
Subpackages¶
pywb.apps package¶
Submodules¶
pywb.apps.cli module¶
- class pywb.apps.cli.BaseCli(args=None, default_port=8080, desc='')
Bases: object
Base CLI class that provides the initial arg parser setup, calls load to receive the application to be started and starts the application.
- class pywb.apps.cli.LiveCli(args=None, default_port=8080, desc='')
Bases: pywb.apps.cli.BaseCli
CLI class for starting pywb as a replay server in live mode
- class pywb.apps.cli.ReplayCli(args=None, default_port=8080, desc='')
Bases: pywb.apps.cli.BaseCli
CLI class that adds the cli functionality specific to starting pywb’s Wayback Machine implementation
- class pywb.apps.cli.WarcServerCli(args=None, default_port=8080, desc='')
Bases: pywb.apps.cli.BaseCli
CLI class for starting a WarcServer
- class pywb.apps.cli.WaybackCli(args=None, default_port=8080, desc='')
Bases: pywb.apps.cli.ReplayCli
CLI class for starting pywb’s implementation of the Wayback Machine
pywb.apps.frontendapp module¶
- class pywb.apps.frontendapp.FrontEndApp(config_file=None, custom_config=None)
Bases: object
Orchestrates pywb’s core Wayback Machine functionality and is comprised of 2 core sub-apps and 3 optional apps.
Sub-apps:
  - WarcServer: Serves the archive content (WARC/ARC and index) as well as from the live web in record/proxy mode
  - RewriterApp: Rewrites the content served by pywb (if it is to be rewritten)
  - WSGIProxMiddleware (Optional): If proxy mode is enabled, performs pywb’s HTTP(s) proxy functionality
  - AutoIndexer (Optional): If auto-indexing is enabled for the collections it is started here
  - RecorderApp (Optional): Recording functionality, available when recording mode is enabled
The RewriterApp is configurable and can be set via the class var REWRITER_APP_CLS, defaults to RewriterApp
- ALL_DIGITS = re.compile('^\\d+$')
- CDX_API = 'http://localhost:%s/{coll}/index'
- PROXY_CA_NAME = 'pywb HTTPS Proxy CA'
- PROXY_CA_PATH = 'proxy-certs/pywb-ca.pem'
- RECORD_API = 'http://localhost:%s/%s/resource/postreq?param.recorder.coll={coll}'
- RECORD_ROUTE = '/record'
- RECORD_SERVER = 'http://localhost:%s'
- REPLAY_API = 'http://localhost:%s/{coll}/resource/postreq'
- REWRITER_APP_CLS
alias of pywb.apps.rewriterapp.RewriterApp
- classmethod create_app(port)
Create a new instance of FrontEndApp that listens on port with a hostname of 0.0.0.0
Parameters: port (int) – The port FrontEndApp is to listen on
Returns: A new instance of FrontEndApp wrapped in GeventServer
Return type: GeventServer
- get_coll_config(coll)
Retrieve the collection config, including metadata, associated with a collection
Parameters: coll (str) – The name of the collection to receive config info for
Returns: The collection’s config
Return type: dict
- get_upstream_paths(port)
Retrieve a dictionary containing the full URLs of the upstream apps
Parameters: port (int) – The port used by the replay and cdx servers
Returns: A dictionary containing the upstream paths (replay, cdx-server, record [if enabled])
Return type: dict[str, str]
- handle_request(environ, start_response)
Retrieves the route handler and calls the handler, returning its response
Parameters:
  - environ (dict) – The WSGI environment dictionary for the request
  - start_response –
Returns: The WbResponse for the request
Return type: WbResponse
- init_autoindex(auto_interval)
Initialize and start the auto-indexing of the collections. If auto_interval is None this is a no-op.
Parameters: auto_interval (str|int) – The auto-indexing interval from the configuration file or CLI argument
- init_proxy(config)
Initialize and start proxy mode. If the proxy configuration entry is not contained in the config, this is a no-op. Causes handler to become an instance of WSGIProxMiddleware.
Parameters: config (dict) – The configuration object used to configure this instance of FrontEndApp
- init_recorder(recorder_config)
Initialize the recording functionality of pywb. If recorder_config is None this function is a no-op.
Parameters: recorder_config (str|dict|None) – The configuration for the recorder app
Return type: None
- is_proxy_enabled(environ)
Returns True or False indicating whether proxy mode is enabled
Parameters: environ (dict) – The WSGI environment dictionary for the request
Returns: True or False indicating whether proxy mode is enabled
Return type: bool
-
is_valid_coll
(coll)[source]¶ Determines if the collection name for a request is valid (exists)
Parameters: coll (str) – The name of the collection to check Returns: True if the collection is valid, false otherwise Return type: bool
-
proxy_fetch
(env, url)[source]¶ Proxy mode only endpoint that handles OPTIONS requests and COR fetches for Preservation Worker.
Due to normal cross-origin browser restrictions in proxy mode, auto fetch worker cannot access the CSS rules of cross-origin style sheets and must re-fetch them in a manner that is CORS safe. This endpoint facilitates that by fetching the stylesheets for the auto fetch worker and then responds with its contents
Parameters: Returns: WbResponse that is either response to an Options request or the results of fetching url
Return type:
-
proxy_route_request
(url, environ)[source]¶ Return the full url that this proxy request will be routed to. The ‘environ’ PATH_INFO and REQUEST_URI will be modified based on the returned url.
Default is to use the ‘proxy_prefix’ to point to the proxy collection
-
put_custom_record
(environ, coll='$root')[source]¶ When recording, PUT a custom WARC record to the specified collection (Available only when recording)
Parameters:
-
raise_not_found
(environ, err_type, url)[source]¶ Utility function for raising a werkzeug.exceptions.NotFound exception with the supplied WSGI environment and message.
Parameters:
-
serve_cdx
(environ, coll='$root')[source]¶ Make the upstream CDX query for a collection and respond with the results of the query
Parameters: Returns: The WbResponse containing the results of the CDX query
Return type:
-
serve_coll_page
(environ, coll='$root')[source]¶ Render and serve a collections search page (search.html).
Parameters: Returns: The WbResponse containing the collections search page
Return type:
-
serve_content
(environ, coll='$root', url='', timemap_output='', record=False)[source]¶ Serve the contents of a URL/Record rewriting the contents of the response when applicable.
Parameters: - environ (dict) – The WSGI environment dictionary for the request
- coll (str) – The name of the collection the record is to be served from
- url (str) – The URL for the corresponding record to be served if it exists
- timemap_output (str) – The contents of the timemap included in the link header of the response
- record (bool) – Whether the content being served should be recorded (saved to a WARC). Only valid in record mode
Returns: WbResponse containing the contents of the record/URL
Return type:
-
serve_home
(environ)[source]¶ Serves the home (/) view of pywb (not a collection)
Parameters: environ (dict) – The WSGI environment dictionary for the request Returns: The WbResponse for serving the home (/) path Return type: WbResponse
-
serve_listing
(environ)[source]¶ Serves the response for WARCServer fixed and dynamic listing (paths)
Parameters: environ (dict) – The WSGI environment dictionary for the request Returns: WbResponse containing the frontend app’s WARCServer URL paths Return type: WbResponse
-
serve_record
(environ, coll='$root', url='')[source]¶ Serve a URL’s content from a WARC/ARC record in replay mode or from the live web in live, proxy, and record mode.
Parameters: Returns: WbResponse containing the contents of the record/URL
Return type:
-
serve_static
(environ, coll='', filepath='')[source]¶ Serve a static file associated with a specific collection or one of pywb’s own static assets
Parameters: Returns: The WbResponse for the static asset
Return type:
-
class
pywb.apps.frontendapp.
MetadataCache
(template_str)[source]¶ Bases:
object
This class holds the collection metadata template string and caches the metadata for a collection once it has been rendered. Cached metadata is updated if its corresponding file has been modified since the last cache time (based on the file mtime).
-
get_all
(routes)[source]¶ Load the metadata for all routes (collections) and populate the cache
Parameters: routes (list[str]) – List of collection names Returns: A dictionary containing each collection’s metadata Return type: dict
-
load
(coll)[source]¶ Load and receive the metadata associated with a collection.
If the metadata for the collection is not cached yet, its metadata file is read in and stored. If the cache has seen the collection before, the mtime of the metadata file is checked: if it is more recent than the cached time, the cache is updated and the new metadata returned; otherwise the cached version is returned.
Parameters: coll (str) – Name of a collection Returns: The cached metadata for a collection Return type: dict
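The mtime-based refresh described for load can be sketched as follows. This is a minimal illustration under stated assumptions, not pywb's implementation: the class name MtimeCache, the use of JSON as the file format, and keying the cache by file path are all hypothetical choices made so the example runs with the standard library alone.

```python
import json
import os
import tempfile


class MtimeCache:
    """Hypothetical cache that re-reads a JSON file only when its mtime advances."""

    def __init__(self):
        # path -> (mtime observed when the file was loaded, parsed data)
        self._cache = {}

    def load(self, path):
        mtime = os.path.getmtime(path)
        cached = self._cache.get(path)
        if cached is not None and cached[0] >= mtime:
            # File has not changed since it was cached: reuse the parsed data.
            return cached[1]
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        self._cache[path] = (mtime, data)
        return data
```

Comparing mtimes keeps the common path cheap: an unchanged file costs one stat call rather than a read and a parse.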
-
pywb.apps.live module¶
pywb.apps.rewriterapp module¶
-
class
pywb.apps.rewriterapp.
RewriterApp
(framed_replay=False, jinja_env=None, config=None, paths=None)[source]¶ Bases:
object
Primary application for rewriting the content served by pywb (if it is to be rewritten).
This class is also responsible for rendering the archive’s templates
-
DEFAULT_CSP
= "default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'"¶
-
VIDEO_INFO_CONTENT_TYPE
= 'application/vnd.youtube-dl_formats+json'¶
-
add_csp_header
(wb_url, status_headers)[source]¶ Adds Content-Security-Policy headers to the supplied StatusAndHeaders instance if the wb_url’s mod is equal to the replay mod
Parameters: - wb_url (WbUrl) – The WbUrl for the URL being operated on
- status_headers (warcio.StatusAndHeaders) – The status and headers instance for the reply to the URL
-
do_query
(wb_url, kwargs)[source]¶ Performs the timemap query request for the supplied WbUrl returning the response
Parameters: Returns: The query’s response
Return type: requests.Response
-
format_response
(response, wb_url, full_prefix, is_timegate, is_proxy, timegate_closest_ts=None)[source]¶
-
is_framed_replay
(wb_url)[source]¶ Returns T/F indicating if the rewriter app is configured to be operating in framed replay mode and the supplied WbUrl is also operating in framed replay mode
Parameters: wb_url (WbUrl) – The WbUrl instance to check Returns: T/F if in framed replay mode Return type: bool
-
pywb.apps.static_handler module¶
pywb.apps.warcserverapp module¶
pywb.apps.wayback module¶
pywb.apps.wbrequestresponse module¶
-
class
pywb.apps.wbrequestresponse.
WbResponse
(status_headers, value=None, **kwargs)[source]¶ Bases:
object
Represents a pywb WSGI response object.
Holds a status_headers object and a response iter, to be returned to wsgi container.
-
add_access_control_headers
(env=None)[source]¶ Adds Access-Control* HTTP headers to this WbResponse’s HTTP headers.
Parameters: env (dict) – The WSGI environment dictionary Returns: The same WbResponse but with the values for the Access-Control* HTTP header added Return type: WbResponse
-
add_range
(*args)[source]¶ Add HTTP range header values to this response
Parameters: args (int) – The values for the range HTTP header Returns: The same WbResponse but with the values for the range HTTP header added Return type: WbResponse
-
static
bin_stream
(stream, content_type, status='200 OK', headers=None)[source]¶ Utility method for constructing a binary response.
Parameters: Returns: WbResponse that is a binary stream
Return type:
-
static
encode_stream
(stream)[source]¶ Utility method to encode a stream using utf-8.
Parameters: stream (Any) – The stream to be encoded using utf-8 Returns: A generator that yields the contents of the stream encoded as utf-8
-
static
json_response
(obj, status='200 OK', content_type='application/json; charset=utf-8')[source]¶ Utility method for constructing a JSON response.
Parameters: Returns: WbResponse JSON response
Return type:
-
static
options_response
(env)[source]¶ Construct WbResponse for OPTIONS based on the WSGI env dictionary
Parameters: env (dict) – The WSGI environment dictionary Returns: The WbResponse for the options request Return type: WbResponse
-
static
redir_response
(location, status='302 Redirect', headers=None)[source]¶ Utility method for constructing redirection response.
Parameters: Returns: WbResponse redirection response
Return type:
-
static
text_response
(text, status='200 OK', content_type='text/plain; charset=utf-8')[source]¶ Utility method for constructing a text response.
Parameters: Returns: WbResponse text response
Return type:
-
Module contents¶
pywb.indexer package¶
Submodules¶
pywb.indexer.archiveindexer module¶
-
class
pywb.indexer.archiveindexer.
ArchiveIndexEntry
[source]¶ Bases:
pywb.indexer.archiveindexer.ArchiveIndexEntryMixin
,dict
-
class
pywb.indexer.archiveindexer.
ArchiveIndexEntryMixin
[source]¶ Bases:
object
-
MIME_RE
= re.compile('[; ]')¶
-
-
class
pywb.indexer.archiveindexer.
OrderedArchiveIndexEntry
[source]¶ Bases:
pywb.indexer.archiveindexer.ArchiveIndexEntryMixin
,collections.OrderedDict
pywb.indexer.cdxindexer module¶
-
class
pywb.indexer.cdxindexer.
BaseCDXWriter
(out)[source]¶ Bases:
object
-
METADATA_NO_INDEX_TYPES
= ('text/anvl',)¶
-
Module contents¶
pywb.manager package¶
Submodules¶
pywb.manager.aclmanager module¶
-
class
pywb.manager.aclmanager.
ACLManager
(r)[source]¶ Bases:
pywb.manager.manager.CollectionsManager
-
DEFAULT_FILE
= 'access-rules.aclj'¶
-
SURT_RX
= re.compile('([^:.]+[,)])+')¶
-
VALID_ACCESS
= ('allow', 'block', 'exclude', 'allow_ignore_embargo')¶
-
add_excludes
(r)[source]¶ Import old-style excludes, in url-per-line format
Parameters: r (argparse.Namespace) – Parsed result from ArgumentParser
-
add_rule
(r)[source]¶ Adds a rule to the ACL manager
Parameters: r (argparse.Namespace) – The argparse namespace representing the rule to be added Return type: None
-
find_match
(r)[source]¶ Finds a matching acl rule
Parameters: r (argparse.Namespace) – Parsed result from ArgumentParser Return type: None
-
classmethod
init_parser
(parser)[source]¶ Initializes an argument parser for acl commands
Parameters: parser (argparse.ArgumentParser) – The parser to be initialized Return type: None
-
is_valid_auto_coll
(coll_name)[source]¶ Returns T/F indicating if the supplied collection name is a valid collection
Parameters: coll_name – The collection name to check Returns: T/F indicating a valid collection Return type: bool
-
list_rules
(r)[source]¶ Print the acl rules to the stdout
Parameters: r (argparse.Namespace|None) – Not used Return type: None
-
load_acl
(must_exist=True)[source]¶ Loads the access control list
Parameters: must_exist (bool) – Does the acl file have to exist Returns: T/F indicating load success Return type: bool
-
print_rule
(rule)[source]¶ Prints the supplied rule to the std out
Parameters: rule (CDXObject) – The rule to be printed Return type: None
-
process
(r)[source]¶ Process acl command
Parameters: r (argparse.Namespace) – Parsed result from ArgumentParser Return type: None
-
remove_rule
(r)[source]¶ Removes a rule from the acl file
Parameters: r (argparse.Namespace) – Parsed result from ArgumentParser Return type: None
-
save_acl
(r=None)[source]¶ Save the contents of the rules as cdxj entries to the access control list file
Parameters: r (argparse.Namespace|None) – Not used Return type: None
-
to_key
(url_or_surt, exact_match=False)[source]¶ If ‘url_or_surt’ is already a SURT, use it as is. If exact match, add the exact match suffix.
Parameters: Return type:
-
validate
(log=False, correct=False)[source]¶ Validates the acl rules returning T/F if the list should be saved
Parameters: Return type:
-
pywb.manager.autoindex module¶
pywb.manager.locmanager module¶
pywb.manager.manager module¶
-
class
pywb.manager.manager.
CollectionsManager
(coll_name, colls_dir=None, must_exist=True)[source]¶ Bases:
object
This utility is designed to simplify the creation and management of web archive collections
It may be used via cmdline to setup and maintain the directory structure expected by pywb
-
COLLS_DIR
= 'collections'¶
-
COLL_RX
= re.compile('^[\\w][-\\w]*$')¶
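As an illustration of what COLL_RX accepts, the pattern can be exercised directly. The helper name is_valid_coll_name is hypothetical; only the pattern itself comes from the listing above.

```python
import re

# Pattern copied from CollectionsManager.COLL_RX above.
COLL_RX = re.compile('^[\\w][-\\w]*$')


def is_valid_coll_name(name):
    """Hypothetical helper: True if the name starts with a word character
    and continues with word characters or hyphens only."""
    return COLL_RX.match(name) is not None
```

Under this pattern a name such as my-coll passes, while a name with a leading hyphen or an embedded space does not.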
-
DEF_INDEX_FILE
= 'index.cdxj'¶
-
pywb.manager.migrate module¶
Module contents¶
pywb.recorder package¶
Submodules¶
pywb.recorder.filters module¶
-
class
pywb.recorder.filters.
ExcludeHttpOnlyCookieHeaders
[source]¶ Bases:
object
-
HTTPONLY_RX
= re.compile(';\\s*HttpOnly\\s*(;|$)', re.IGNORECASE)¶
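A rough sketch of how HTTPONLY_RX could be used to drop HttpOnly cookies from a header list. The function name drop_http_only_cookies and the (name, value) tuple representation are assumptions made for illustration; the class's real filtering interface is not shown in this listing.

```python
import re

# Pattern copied from ExcludeHttpOnlyCookieHeaders.HTTPONLY_RX above.
HTTPONLY_RX = re.compile(';\\s*HttpOnly\\s*(;|$)', re.IGNORECASE)


def drop_http_only_cookies(headers):
    """Hypothetical filter: remove Set-Cookie headers carrying the HttpOnly flag.

    `headers` is assumed to be a list of (name, value) tuples.
    """
    return [
        (name, value)
        for name, value in headers
        if not (name.lower() == 'set-cookie' and HTTPONLY_RX.search(value))
    ]
```

The pattern requires a preceding semicolon, so a cookie whose value merely contains the text HttpOnly is not treated as flagged.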
-
pywb.recorder.multifilewarcwriter module¶
-
class
pywb.recorder.multifilewarcwriter.
MultiFileWARCWriter
(dir_template, filename_template=None, max_size=0, max_idle_secs=1800, *args, **kwargs)[source]¶ Bases:
warcio.warcwriter.BaseWARCWriter
-
FILE_TEMPLATE
= 'rec-{timestamp}-{hostname}.warc.gz'¶
-
-
class
pywb.recorder.multifilewarcwriter.
PerRecordWARCWriter
(*args, **kwargs)[source]¶ Bases:
pywb.recorder.multifilewarcwriter.MultiFileWARCWriter
pywb.recorder.recorderapp module¶
-
class
pywb.recorder.recorderapp.
RecorderApp
(upstream_host, writer, skip_filters=None, **kwargs)[source]¶ Bases:
object
pywb.recorder.redisindexer module¶
-
class
pywb.recorder.redisindexer.
RedisPendingCounterTempBuffer
(max_size, redis_url, params, name, timeout=30)[source]¶ Bases:
tempfile.SpooledTemporaryFile
Module contents¶
pywb.rewrite package¶
Submodules¶
pywb.rewrite.content_rewriter module¶
-
class
pywb.rewrite.content_rewriter.
BaseContentRewriter
(rules_file, replay_mod='')[source]¶ Bases:
object
-
CHARSET_REGEX
= re.compile(b'<meta[^>]*?[\\s;"\']charset\\s*=[\\s"\']*([^\\s"\'/>]*)')¶
-
TITLE
= re.compile('<\\s*title\\s*>(.*)<\\s*\\/\\s*title\\s*>', re.IGNORECASE|re.MULTILINE|re.DOTALL)¶
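The two patterns above can be tried on a small document to see what they capture. The helper names sniff_charset and extract_title are hypothetical, and how BaseContentRewriter itself applies the patterns is not shown here; this only demonstrates the regular expressions.

```python
import re

# Patterns copied from BaseContentRewriter above.
CHARSET_REGEX = re.compile(b'<meta[^>]*?[\\s;"\']charset\\s*=[\\s"\']*([^\\s"\'/>]*)')
TITLE = re.compile('<\\s*title\\s*>(.*)<\\s*\\/\\s*title\\s*>',
                   re.IGNORECASE | re.MULTILINE | re.DOTALL)


def sniff_charset(raw_html):
    """Hypothetical helper: return the charset named in a <meta> tag, or None."""
    m = CHARSET_REGEX.search(raw_html)
    return m.group(1).decode('ascii', 'replace') if m else None


def extract_title(html_text):
    """Hypothetical helper: return the text between <title> tags, or None."""
    m = TITLE.search(html_text)
    return m.group(1) if m else None
```

Note that CHARSET_REGEX operates on bytes (the charset is not yet known), while TITLE operates on decoded text.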
-
html_unescape
()¶ Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding unicode characters. This function uses the rules defined by the HTML 5 standard for both valid and invalid character references, and the list of HTML 5 named character references defined in html.entities.html5.
-
-
class
pywb.rewrite.content_rewriter.
RewriteInfo
(record, content_rewriter, url_rewriter, cookie_rewriter=None)[source]¶ Bases:
object
-
JSONP_CONTAINS
= ['callback=jQuery', 'callback=jsonp', '.json?']¶
-
JSON_REGEX
= re.compile(b'^\\s*[{[][{"]')¶
-
TAG_REGEX
= re.compile(b'^(\xef\xbb\xbf)?\\s*\\<')¶
-
TAG_REGEX2
= re.compile(b'^.*<\\w+[\\s>]')¶
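To show what JSON_REGEX and TAG_REGEX match, the sketch below applies them to the first bytes of a payload. The function sniff_text_type and the labels it returns are assumptions for illustration only; the decision logic RewriteInfo actually uses is not documented in this listing.

```python
import re

# Patterns copied from RewriteInfo above.
JSON_REGEX = re.compile(b'^\\s*[{[][{"]')
TAG_REGEX = re.compile(b'^(\xef\xbb\xbf)?\\s*\\<')


def sniff_text_type(first_bytes):
    """Hypothetical classifier based on how the payload begins."""
    if JSON_REGEX.match(first_bytes):
        return 'json'    # starts with {" or {{ or [{ or ["
    if TAG_REGEX.match(first_bytes):
        return 'markup'  # optional UTF-8 BOM and whitespace, then '<'
    return 'other'
```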
-
content_stream
¶
-
pywb.rewrite.cookie_rewriter module¶
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Rewrite cookies only using exact path, useful for live rewrite without a timestamp and to minimize cookie pollution
If path or domain present, simply remove
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Attempt to rewrite cookies to current host url.
If path present, rewrite path to current host. Only makes sense in live proxy or no redirect mode, as otherwise timestamp may change.
If domain present, remove domain and set to path prefix
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Attempt to rewrite cookies to minimal scope possible
If path present, rewrite path to current rewritten url only. If domain present, remove domain and set to path prefix
Bases:
pywb.rewrite.cookie_rewriter.WbUrlBaseCookieRewriter
Sometimes it is necessary to rewrite cookies to root scope in order to work across time boundaries and modifiers
This rewriter simply sets all cookies to be in the root
Bases:
object
Base Cookie rewriter for wburl-based requests.
If an HttpOnly cookie is set to a path ending in /, and the current mod is mp_ or if_, then assume it is meant to be a prefix, and likely needed for other content. Set the cookie with the same prefix but for all common modifiers: (mp_, js_, cs_, oe_, if_, sw_, wkrf_)
pywb.rewrite.default_rewriter module¶
-
class
pywb.rewrite.default_rewriter.
DefaultRewriter
(replay_mod='', config=None)[source]¶ Bases:
pywb.rewrite.content_rewriter.BaseContentRewriter
-
DEFAULT_REWRITERS
= {'amf': <class 'pywb.rewrite.rewrite_amf.RewriteAMF'>, 'cookie': <class 'pywb.rewrite.cookie_rewriter.HostScopeCookieRewriter'>, 'css': <class 'pywb.rewrite.regex_rewriters.CSSRewriter'>, 'dash': <class 'pywb.rewrite.rewrite_dash.RewriteDASH'>, 'header': <class 'pywb.rewrite.header_rewriter.DefaultHeaderRewriter'>, 'hls': <class 'pywb.rewrite.rewrite_hls.RewriteHLS'>, 'html': <class 'pywb.rewrite.html_rewriter.HTMLRewriter'>, 'html-banner-only': <class 'pywb.rewrite.html_insert_rewriter.HTMLInsertOnlyRewriter'>, 'js': <class 'pywb.rewrite.regex_rewriters.JSLocationOnlyRewriter'>, 'js-proxy': <class 'pywb.rewrite.regex_rewriters.JSNoneRewriter'>, 'js-worker': <class 'pywb.rewrite.rewrite_js_workers.JSWorkerRewriter'>, 'json': <class 'pywb.rewrite.jsonp_rewriter.JSONPRewriter'>, 'xml': <class 'pywb.rewrite.regex_rewriters.XMLRewriter'>}¶
-
default_content_types
= {'css': 'text/css', 'html': 'text/html', 'js': 'text/javascript'}¶
-
rewrite_types
= {'': 'guess-text', 'application/dash+xml': 'dash', 'application/javascript': 'js', 'application/json': 'json', 'application/octet-stream': 'guess-bin', 'application/vnd.apple.mpegurl': 'hls', 'application/x-amf': 'amf', 'application/x-javascript': 'js', 'application/x-mpegURL': 'hls', 'application/xhtml': 'html', 'application/xhtml+xml': 'html', 'text/css': 'css', 'text/html': 'guess-html', 'text/javascript': 'js', 'text/plain': 'guess-text'}¶
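The rewrite_types table maps a MIME type to a rewriter key. A minimal lookup sketch follows, using a subset of the table; the function name lookup_rewrite_type and the way parameters are stripped from the Content-Type value are assumptions, not DefaultRewriter's actual code.

```python
# Subset of DefaultRewriter.rewrite_types, copied from the table above.
REWRITE_TYPES = {
    'application/javascript': 'js',
    'application/json': 'json',
    'application/xhtml+xml': 'html',
    'text/css': 'css',
    'text/html': 'guess-html',
    'text/javascript': 'js',
    'text/plain': 'guess-text',
}


def lookup_rewrite_type(content_type):
    """Hypothetical lookup: drop parameters such as charset, then consult the table."""
    mime = content_type.split(';', 1)[0].strip()
    return REWRITE_TYPES.get(mime)
```

A type absent from the table yields None here, which a caller could treat as "do not rewrite".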
-
pywb.rewrite.header_rewriter module¶
-
class
pywb.rewrite.header_rewriter.
DefaultHeaderRewriter
(rwinfo, header_prefix='X-Archive-Orig-')[source]¶ Bases:
object
-
header_rules
= {'accept-patch': 'keep', 'accept-ranges': 'keep', 'access-control-allow-credentials': 'prefix-if-url-rewrite', 'access-control-allow-headers': 'prefix-if-url-rewrite', 'access-control-allow-methods': 'prefix-if-url-rewrite', 'access-control-allow-origin': 'prefix-if-url-rewrite', 'access-control-expose-headers': 'prefix-if-url-rewrite', 'access-control-max-age': 'prefix-if-url-rewrite', 'age': 'prefix', 'allow': 'keep', 'alt-svc': 'prefix', 'cache-control': 'prefix', 'connection': 'prefix', 'content-base': 'url-rewrite', 'content-disposition': 'keep', 'content-encoding': 'prefix-if-content-rewrite', 'content-language': 'keep', 'content-length': 'content-length', 'content-location': 'url-rewrite', 'content-md5': 'prefix', 'content-range': 'keep', 'content-security-policy': 'prefix', 'content-security-policy-report-only': 'prefix', 'content-type': 'keep', 'date': 'prefix', 'etag': 'prefix', 'expires': 'prefix', 'last-modified': 'prefix', 'link': 'keep', 'location': 'url-rewrite', 'p3p': 'prefix', 'pragma': 'prefix', 'proxy-authenticate': 'keep', 'public-key-pins': 'prefix', 'retry-after': 'prefix', 'server': 'prefix', 'set-cookie': 'cookie', 'status': 'prefix', 'strict-transport-security': 'prefix', 'tk': 'prefix', 'trailer': 'prefix', 'transfer-encoding': 'transfer-encoding', 'upgrade': 'prefix', 'upgrade-insecure-requests': 'prefix', 'vary': 'prefix', 'via': 'prefix', 'warning': 'prefix', 'www-authenticate': 'keep', 'x-frame-options': 'prefix', 'x-xss-protection': 'prefix'}¶
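To illustrate the two simplest rule values in header_rules, the sketch below applies 'keep' and 'prefix' to a list of headers, using the default header_prefix from the constructor signature. Everything else is an assumption: the function name apply_simple_rules is hypothetical, the other rule values (url-rewrite, cookie, content-length and so on) are deliberately not handled, and passing through headers that have no handled rule is a choice made for this example only.

```python
# Subset of DefaultHeaderRewriter.header_rules, copied from the table above.
HEADER_RULES = {
    'content-type': 'keep',
    'content-language': 'keep',
    'cache-control': 'prefix',
    'etag': 'prefix',
    'x-frame-options': 'prefix',
}


def apply_simple_rules(headers, header_prefix='X-Archive-Orig-'):
    """Hypothetical helper handling only the 'keep' and 'prefix' rules.

    `headers` is assumed to be a list of (name, value) tuples.
    """
    result = []
    for name, value in headers:
        rule = HEADER_RULES.get(name.lower())
        if rule == 'prefix':
            # Preserve the original header under a prefixed name.
            result.append((header_prefix + name, value))
        else:
            # 'keep', or no handled rule (assumption: pass through unchanged).
            result.append((name, value))
    return result
```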
-
pywb.rewrite.html_insert_rewriter module¶
-
class
pywb.rewrite.html_insert_rewriter.
HTMLInsertOnlyRewriter
(url_rewriter, **kwargs)[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
Insert a custom string into the head of the HTML, before any tag that is not <head> or <html>. No other rewriting is performed.
-
NOT_HEAD_REGEX
= re.compile('(<\\s*\\b)(?!(html|head))', re.IGNORECASE)¶
-
XML_HEADER
= re.compile('<\\?xml.*\\?>')¶
-
pywb.rewrite.html_rewriter module¶
-
class
pywb.rewrite.html_rewriter.
HTMLRewriter
(*args, **kwargs)[source]¶ Bases:
pywb.rewrite.html_rewriter.HTMLRewriterMixin
,html.parser.HTMLParser
-
PARSETAG
= re.compile('[<]')¶
-
-
class
pywb.rewrite.html_rewriter.
HTMLRewriterMixin
(url_rewriter, head_insert=None, js_rewriter_class=None, js_rewriter=None, css_rewriter=None, css_rewriter_class=None, url='', defmod='', parse_comments=False, charset='utf-8')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
HTML-Parsing Rewriter for custom rewriting, also delegates to rewriters for script and css
-
ADD_WINDOW
= re.compile('(?<![.])(WB_wombat_)')¶
-
BEFORE_HEAD_TAGS
= ['html', 'head']¶
-
DATA_RW_PROTOCOLS
= ('http://', 'https://', '//')¶
-
META_REFRESH_REGEX
= re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$', re.IGNORECASE|re.MULTILINE)¶
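META_REFRESH_REGEX captures the target URL of a meta refresh content value. The sketch below shows one way that capture could be used to swap in a rewritten URL; the function rewrite_meta_refresh and the rewrite_url callback are hypothetical stand-ins for whatever URL rewriter is in use.

```python
import re

# Pattern copied from HTMLRewriterMixin.META_REFRESH_REGEX above.
META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
                                re.IGNORECASE | re.MULTILINE)


def rewrite_meta_refresh(content, rewrite_url):
    """Hypothetical helper: replace the URL inside a meta refresh content value.

    `rewrite_url` is any callable mapping an original URL to a rewritten one.
    """
    m = META_REFRESH_REGEX.search(content)
    if not m:
        return content
    return content[:m.start(1)] + rewrite_url(m.group(1)) + content[m.end(1):]
```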
-
PRELOAD_TYPES
= {'audio': 'oe_', 'document': 'if_', 'embed': 'oe_', 'fetch': 'mp_', 'font': 'oe_', 'image': 'im_', 'object': 'oe_', 'script': 'js_', 'style': 'cs_', 'track': 'oe_', 'video': 'oe_', 'worker': 'js_'}¶
-
SRCSET_REGEX
= re.compile('\\s*(\\S*\\s+[\\d\\.]+[wx]),|(?:\\s*,(?:\\s+|(?=https?:)))')¶
-
pywb.rewrite.jsonp_rewriter module¶
-
class
pywb.rewrite.jsonp_rewriter.
JSONPRewriter
(url_rewriter, align_to_line=True, first_buff='')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
-
CALLBACK
= re.compile('[?].*callback=([^&]+)')¶
-
JSONP
= re.compile('(?:^[ \\t]*(?:(?:\\/\\*[^\\*]*\\*\\/)|(?:\\/\\/[^\\n]+[\\n])))*[ \\t]*(\\w+)\\(\\{', re.MULTILINE)¶
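A sketch of how the two patterns could work together: CALLBACK reads the callback name from the request URL, JSONP finds the function name that wraps the archived payload, and the latter is replaced with the former. The function rewrite_jsonp is hypothetical and is offered as one plausible use of the patterns, not as a description of JSONPRewriter's internals.

```python
import re

# Patterns copied from JSONPRewriter above.
CALLBACK = re.compile('[?].*callback=([^&]+)')
JSONP = re.compile(
    '(?:^[ \\t]*(?:(?:\\/\\*[^\\*]*\\*\\/)|(?:\\/\\/[^\\n]+[\\n])))*[ \\t]*(\\w+)\\(\\{',
    re.MULTILINE)


def rewrite_jsonp(request_url, body):
    """Hypothetical helper: rename the JSONP wrapper to the requested callback."""
    m_callback = CALLBACK.search(request_url)
    m_json = JSONP.match(body)
    if not m_callback or not m_json:
        return body  # not a JSONP request, or body is not JSONP-shaped
    return body[:m_json.start(1)] + m_callback.group(1) + body[m_json.end(1):]
```

Renaming matters on replay because pages often generate a fresh callback name per request, so the name stored in the archive rarely matches the one the page is waiting for.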
-
pywb.rewrite.regex_rewriters module¶
-
class
pywb.rewrite.regex_rewriters.
CSSRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.CSSRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
CSSRules
[source]¶ Bases:
pywb.rewrite.regex_rewriters.RxRules
-
CSS_IMPORT_REGEX
= '@import\\s+(?:url\\s*)?\\(?\\s*[\'"]?([\\w.:/\\\\-]+)'¶
-
CSS_URL_REGEX
= 'url\\s*\\(\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*([^)\'"]+)\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*\\)'¶
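The two CSS patterns can be tried on a small stylesheet to see which references they capture. The helper find_css_refs is hypothetical; it only lists matches and does not perform the rewriting that CSSRewriter does.

```python
import re

# Pattern strings copied from CSSRules above.
CSS_IMPORT_REGEX = '@import\\s+(?:url\\s*)?\\(?\\s*[\'"]?([\\w.:/\\\\-]+)'
CSS_URL_REGEX = ('url\\s*\\(\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*([^)\'"]+)'
                 '\\s*(?:[\\\\"\']|(?:&.{1,4};))*\\s*\\)')


def find_css_refs(css_text):
    """Hypothetical helper: return (imports, urls) referenced by a stylesheet."""
    imports = re.findall(CSS_IMPORT_REGEX, css_text)
    urls = re.findall(CSS_URL_REGEX, css_text)
    return imports, urls
```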
-
-
class
pywb.rewrite.regex_rewriters.
JSLinkAndLocationRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriterRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
JSLinkAndLocationRewriterRules
(prefix='WB_wombat_')[source]¶ Bases:
pywb.rewrite.regex_rewriters.JSLocationRewriterRules
JS Rewriter rules which also rewrite absolute http://, https:// and // urls at the beginning of a string
-
JS_HTTPX
= '(?:(?<=["\\\';])https?:|(?<=["\\\']))\\\\{0,4}/\\\\{0,4}/[A-Za-z0-9:_@%.\\\\-]+/'¶
-
-
class
pywb.rewrite.regex_rewriters.
JSLocationOnlyRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.JSLocationRewriterRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
JSLocationRewriterRules
(prefix='WB_wombat_')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RxRules
JS Rewriter mixin which rewrites location and domain to the specified prefix (default:
WB_wombat_
)
-
class
pywb.rewrite.regex_rewriters.
JSNoneRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶
-
class
pywb.rewrite.regex_rewriters.
JSReplaceFuzzy
(*args, **kwargs)[source]¶ Bases:
object
-
rx_obj
= None¶
-
-
pywb.rewrite.regex_rewriters.
JSRewriter
¶ alias of
pywb.rewrite.regex_rewriters.JSLinkAndLocationRewriter
-
class
pywb.rewrite.regex_rewriters.
JSWombatProxyRewriter
(rewriter, extra_rules=None)[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
JS Rewriter mixin which wraps the contents of the script in an anonymous block scope and inserts Wombat js-proxy setup
-
rules_factory
= <pywb.rewrite.regex_rewriters.JSWombatProxyRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
RegexRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.RxRules object>¶
-
-
class
pywb.rewrite.regex_rewriters.
RxRules
(rules=None)[source]¶ Bases:
object
-
HTTPX_MATCH_STR
= 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+'¶
-
-
class
pywb.rewrite.regex_rewriters.
XMLRewriter
(rewriter, extra_rules=None, first_buff='')[source]¶ Bases:
pywb.rewrite.regex_rewriters.RegexRewriter
-
rules_factory
= <pywb.rewrite.regex_rewriters.XMLRules object>¶
-
pywb.rewrite.rewrite_amf module¶
pywb.rewrite.rewrite_dash module¶
pywb.rewrite.rewrite_hls module¶
pywb.rewrite.rewrite_js_workers module¶
-
class
pywb.rewrite.rewrite_js_workers.
JSWorkerRewriter
(url_rewriter, align_to_line=True, first_buff='')[source]¶ Bases:
pywb.rewrite.content_rewriter.StreamingRewriter
A simple rewriter for rewriting web or service workers. The only rewriting that occurs is the injection of the init code for wombatWorkers.js. This allows all of them to operate as expected on the live web.
pywb.rewrite.rewriteinputreq module¶
-
class
pywb.rewrite.rewriteinputreq.
RewriteInputRequest
(env, urlkey, url, rewriter)[source]¶ Bases:
pywb.warcserver.inputrequest.DirectWSGIInputRequest
-
RANGE_ARG_RX
= re.compile('.*.googlevideo.com/videoplayback.*([&?]range=(\\d+)-(\\d+))')¶
-
RANGE_HEADER
= re.compile('bytes=(\\d+)-(\\d+)?')¶
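RANGE_HEADER captures the start and the optional end of a byte range. A minimal parsing sketch follows; the helper name parse_range_header is hypothetical.

```python
import re

# Pattern copied from RewriteInputRequest.RANGE_HEADER above.
RANGE_HEADER = re.compile('bytes=(\\d+)-(\\d+)?')


def parse_range_header(value):
    """Hypothetical helper: return (start, end) as ints; end is None for open ranges."""
    m = RANGE_HEADER.match(value)
    if not m:
        return None
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) is not None else None
    return start, end
```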
-
pywb.rewrite.templateview module¶
-
class
pywb.rewrite.templateview.
BaseInsertView
(jenv, insert_file, banner_view=None)[source]¶ Bases:
object
Base class of all template views used by Pywb
-
render_to_string
(env, **kwargs)[source]¶ Render this template.
Parameters: - env (dict) – The WSGI environment associated with the request causing this template to be rendered
- kwargs (any) – The keyword arguments to be supplied to the Jinja template render method
Returns: The rendered template
Return type:
-
-
class
pywb.rewrite.templateview.
HeadInsertView
(jenv, insert_file, banner_view=None)[source]¶ Bases:
pywb.rewrite.templateview.BaseInsertView
The template view class associated with rendering the HTML inserted into the head of the pages replayed (WB Insert).
-
create_insert_func
(wb_url, wb_prefix, host_prefix, top_url, env, is_framed, coll='', include_ts=True, **kwargs)[source]¶ Create the function used to render the header insert template for the current request.
Parameters: - wb_url (rewrite.wburl.WbUrl) – The WbUrl for the request this template is being rendered for
- wb_prefix (str) – The URL prefix pywb is serving the content using (e.g. http://localhost:8080/live/)
- host_prefix (str) – The host URL prefix pywb is running on (e.g. http://localhost:8080)
- top_url (str) – The full URL for this request (e.g. http://localhost:8080/live/http://example.com)
- env (dict) – The WSGI environment dictionary for this request
- is_framed (bool) – Is pywb or a specific collection running in framed mode
- coll (str) – The name of the collection this request is associated with
- include_ts (bool) – Should a timestamp be included in the rendered template
- kwargs – Additional keyword arguments to be supplied to the Jinja template render method
Returns: A function to be used to render the header insert for the request this template is being rendered for
Return type: callable
-
-
class
pywb.rewrite.templateview.
JinjaEnv
(paths=None, packages=None, assets_path=None, globals=None, overlay=None, extensions=None, env_template_params_key='pywb.template_params', env_template_dir_key='pywb.templates_dir')[source]¶ Bases:
object
Pywb JinjaEnv class that provides utility functions used by the templates, configured template loaders and template paths, and contains the actual Jinja env used by each template.
-
template_filter
(param=None)[source]¶ Returns a decorator that adds the wrapped function to dictionary of template filters.
The wrapped function is keyed by either the supplied param (if supplied) or by the wrapped function’s name.
Parameters: param – Optional name to use instead of the name of the function to be wrapped Returns: A decorator to wrap a template filter function Return type: callable
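The decorator-factory pattern described for template_filter can be sketched in isolation. The FilterRegistry class and the two example filters are hypothetical; only the keying behavior (explicit name if given, otherwise the function's own name) is taken from the description above.

```python
class FilterRegistry:
    """Hypothetical stand-in for an object that collects template filters."""

    def __init__(self):
        self.filters = {}

    def template_filter(self, param=None):
        """Return a decorator that registers the wrapped function as a filter."""
        def decorator(func):
            name = param or func.__name__
            self.filters[name] = func
            return func  # leave the function itself usable as before
        return decorator


registry = FilterRegistry()


@registry.template_filter()
def shout(value):
    return value.upper()


@registry.template_filter('short')
def truncate_ten(value):
    return value[:10]
```

After these definitions the registry holds the filters under the keys shout and short.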
-
-
class
pywb.rewrite.templateview.
PkgResResolver
[source]¶ Bases:
webassets.env.Resolver
Class for resolving pywb package resources when installed via PyPI or setup.py
-
get_pkg_path
(item)[source]¶ Get the package path for the supplied item
Parameters: item (str) – A resource’s full package path Returns: The netloc and path from the item’s package path Return type: tuple[str, str]
-
resolve_source
(ctx, item)[source]¶ Given item from a Bundle’s contents, this has to return the final value to use, usually an absolute filesystem path.
Note: It is also allowed to return urls and bundle instances (or generally anything else the calling Bundle instance may be able to handle). Indeed this is the reason why the name of this method does not imply a return type.
The incoming item is usually a relative path, but may also be an absolute path, or a url. These you will commonly want to return unmodified.
This method is also allowed to resolve item to multiple values, in which case a list should be returned. This is commonly used if item includes glob instructions (wildcards).
Note: Subclasses should consider implementing search_for_source() instead.
-
-
class
pywb.rewrite.templateview.
RelEnvironment
(block_start_string='{%', block_end_string='%}', variable_start_string='{{', variable_end_string='}}', comment_start_string='{#', comment_end_string='#}', line_statement_prefix=None, line_comment_prefix=None, trim_blocks=False, lstrip_blocks=False, newline_sequence='\n', keep_trailing_newline=False, extensions=(), optimized=True, undefined=<class 'jinja2.runtime.Undefined'>, finalize=None, autoescape=False, loader=None, cache_size=400, auto_reload=True, bytecode_cache=None, enable_async=False)[source]¶ Bases:
jinja2.environment.Environment
Override join_path() to enable relative template paths.
-
join_path
(template, parent)[source]¶ Join a template with the parent. By default all the lookups are relative to the loader root so this method returns the template parameter unchanged, but if the paths should be relative to the parent template, this function can be used to calculate the real template name.
Subclasses may override this method and implement template path joining here.
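The path calculation that a relative-path join_path override performs can be sketched as a plain function, so that it runs without Jinja installed. The name join_relative is hypothetical, and whether RelEnvironment normalizes paths in exactly this way is an assumption.

```python
import posixpath


def join_relative(template, parent):
    """Hypothetical helper: resolve `template` relative to the directory of `parent`.

    Template names use forward slashes regardless of platform, hence posixpath.
    """
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(parent), template))
```

In a jinja2.Environment subclass, the same expression would form the body of join_path(self, template, parent).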
-
-
class
pywb.rewrite.templateview.
TopFrameView
(jenv, insert_file, banner_view=None)[source]¶ Bases:
pywb.rewrite.templateview.BaseInsertView
The template view class associated with rendering the replay iframe
-
get_top_frame
(wb_url, wb_prefix, host_prefix, env, frame_mod, replay_mod, coll='', extra_params=None)[source]¶ Parameters: - wb_url (rewrite.wburl.WbUrl) – The WbUrl for the request this template is being rendered for
- wb_prefix (str) – The URL prefix pywb is serving the content using (e.g. http://localhost:8080/live/)
- host_prefix (str) – The host URL prefix pywb is running on (e.g. http://localhost:8080)
- env (dict) – The WSGI environment dictionary for the request this template is being rendered for
- frame_mod (str) – The modifier to be used for framing (e.g. if_)
- replay_mod (str) – The modifier to be used in the URL of the page being replayed (e.g. mp_)
- coll (str) – The name of the collection this template is being rendered for
- extra_params (dict) – Additional parameters to be supplied to the Jinja template render method
Returns: The frame insert string
Return type:
-
pywb.rewrite.url_rewriter module¶
-
class
pywb.rewrite.url_rewriter.
IdentityUrlRewriter
(wburl, prefix='', full_prefix=None, rel_prefix=None, root_path=None, cookie_scope=None, rewrite_opts=None, pywb_static_prefix=None)[source]¶ Bases:
pywb.rewrite.url_rewriter.UrlRewriter
No rewriting performed, return original url
-
class
pywb.rewrite.url_rewriter.
SchemeOnlyUrlRewriter
(*args, **kwargs)[source]¶ Bases:
pywb.rewrite.url_rewriter.IdentityUrlRewriter
A url rewriter which ensures that any urls have the same scheme (http or https) as the base url. Other urls/input is unchanged.
-
class
pywb.rewrite.url_rewriter.
UrlRewriter
(wburl, prefix='', full_prefix=None, rel_prefix=None, root_path=None, cookie_scope=None, rewrite_opts=None, pywb_static_prefix=None)[source]¶ Bases:
object
Main pywb UrlRewriter which rewrites absolute and relative urls to be relative to the current page, as specified via a WbUrl instance and an optional full path prefix
-
NO_REWRITE_URI_PREFIX
= ('#', 'javascript:', 'data:', 'mailto:', 'about:', 'file:', '{')¶
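As a small illustration of how a prefix tuple like NO_REWRITE_URI_PREFIX can be consulted, str.startswith accepts a tuple directly. The helper name needs_rewrite is hypothetical, and any additional checks UrlRewriter performs before rewriting are not covered.

```python
# Tuple copied from UrlRewriter.NO_REWRITE_URI_PREFIX above.
NO_REWRITE_URI_PREFIX = ('#', 'javascript:', 'data:', 'mailto:', 'about:', 'file:', '{')


def needs_rewrite(url):
    """Hypothetical helper: False for URLs beginning with a no-rewrite prefix."""
    return not url.startswith(NO_REWRITE_URI_PREFIX)
```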
-
PARENT_PATH
= '../'¶
-
PROTOCOLS
= ('http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:')¶
-
REL_PATH
= '/'¶
-
REL_SCHEME
= ('//', '\\/\\/', '\\\\/\\\\/')¶
-
pywb_static_prefix
¶ Returns the static path URL :rtype: str
-
pywb.rewrite.wburl module¶
WbUrl represents the standard wayback archival url format. A regular url is a subset of the WbUrl (latest replay).
The WbUrl expresses the common interface for interacting with the wayback machine.
The WbUrl may represent one of the following forms:
query form: [/modifier]/[timestamp][-end_timestamp]*/<url>
modifier, timestamp and end_timestamp are optional:
*/example.com
20101112030201*/http://example.com
2009-2015*/http://example.com
/cdx/*/http://example.com
url query form: used to indicate query across urls
same as query form but with a final *
:
*/example.com*
20101112030201*/http://example.com*
replay form:
20101112030201/http://example.com
20101112030201im_/http://example.com
latest_replay: (no timestamp):
http://example.com
Additionally, the BaseWbUrl provides the base components (url, timestamp, end_timestamp, modifier, type) which can be used to provide a custom representation of the wayback url format.
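The forms listed above can be told apart with the QUERY_REGEX and REPLAY_REGEX patterns documented on the WbUrl class. The sketch below is an illustration under stated assumptions, not WbUrl's parser: the function classify_wburl is hypothetical, inputs are given without a leading slash, and treating an unmatched string as a latest_replay URL is a simplification made for this example.

```python
import re

# Patterns copied from the WbUrl class attributes in this module.
QUERY_REGEX = re.compile('^(?:([\\w\\-:]+)/)?(\\d*)[*-](\\d*)/?(.+)$')
REPLAY_REGEX = re.compile('^(\\d*)([a-z]+_|[$][a-z0-9:.-]+)?/{1,3}(.+)$')


def classify_wburl(value):
    """Hypothetical classifier returning (type, modifier, timestamp, url)."""
    m = QUERY_REGEX.match(value)
    if m:
        mod, timestamp, _end_timestamp, url = m.groups()
        kind = 'url_query' if url.endswith('*') else 'query'
        return kind, mod or '', timestamp, url
    m = REPLAY_REGEX.match(value)
    if m:
        timestamp, mod, url = m.groups()
        kind = 'replay' if timestamp else 'latest_replay'
        return kind, mod or '', timestamp, url
    # Assumption: anything else is a bare URL, i.e. latest replay.
    return 'latest_replay', '', '', value
```

The query pattern is tried first because a query string such as 20101112030201*/http://example.com contains a slash after the timestamp and would otherwise be a candidate for the replay pattern.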
-
class
pywb.rewrite.wburl.
BaseWbUrl
(url='', mod='', timestamp='', end_timestamp='', type=None)[source]¶ Bases:
object
-
LATEST_REPLAY
= 'latest_replay'¶
-
QUERY
= 'query'¶
-
REPLAY
= 'replay'¶
-
URL_QUERY
= 'url_query'¶
-
-
class
pywb.rewrite.wburl.
WbUrl
(orig_url)[source]¶ Bases:
pywb.rewrite.wburl.BaseWbUrl
-
DEFAULT_SCHEME
= 'http://'¶
-
FIRST_PATH
= re.compile('(?<![:/])[/?](?![/])')¶
-
QUERY_REGEX
= re.compile('^(?:([\\w\\-:]+)/)?(\\d*)[*-](\\d*)/?(.+)$')¶
-
REPLAY_REGEX
= re.compile('^(\\d*)([a-z]+_|[$][a-z0-9:.-]+)?/{1,3}(.+)$')¶
-
SCHEME_RX
= re.compile('[a-zA-Z0-9+-.]+(:/)')¶
-
is_embed
¶
-
is_identity
¶
-
is_url_rewrite_only
¶
-
static
percent_encode_host
(url)[source]¶ Convert the host of a uri formatted with to_uri() to a %-encoded host instead of a punycode host. The rest of the url should be unchanged.
-
Module contents¶
pywb.utils package¶
Submodules¶
pywb.utils.binsearch module¶
Utility functions for performing binary search over a sorted text file
-
pywb.utils.binsearch.
binsearch
(reader, key, compare_func=<function cmp>, block_size=8192)[source]¶ Perform a binary search for a specified key to within a ‘block_size’ (default 8192) granularity, and return first full line found.
-
pywb.utils.binsearch.
binsearch_offset
(reader, key, compare_func=<function cmp>, block_size=8192)[source]¶ Find the offset of the line which matches a given ‘key’ using binary search. If the key is not found, the offset is that of the line after the key.
The file is subdivided into block_size (default 8192) sized blocks. An optional compare_func may be specified.
-
pywb.utils.binsearch.
iter_exact
(reader, key, token=b' ')[source]¶ Create an iterator which iterates over lines where the first field matches the ‘key’, equivalent to a prefix search for the key followed by the token. The default field terminator/separator is a single space.
-
pywb.utils.binsearch.
iter_prefix
(reader, key)[source]¶ Creates an iterator which iterates over lines that start with prefix ‘key’ in a sorted text file.
-
pywb.utils.binsearch.
iter_range
(reader, start, end, prev_size=0)[source]¶ Creates an iterator which iterates over lines where start <= line < end (end exclusive)
-
pywb.utils.binsearch.
linearsearch
(iter_, key, prev_size=0, compare_func=<function cmp>)[source]¶ Perform a linear search over iterator until current_line >= key
Optionally also tracks up to N previous lines, which are returned before the first matched line.
If the end of the stream is reached before a match is found, nothing is returned (the previous lines are discarded as well).
-
pywb.utils.binsearch.
search
(reader, key, prev_size=0, compare_func=<function cmp>, block_size=8192)[source]¶ Perform a binary search for a specified key to within a ‘block_size’ (default 8192) sized block followed by linear search within the block to find first matching line.
When performing the linear search, keep track of up to N previous lines before the first matching line.
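The prefix and range iterators in this module can be illustrated on an in-memory list. This is only a sketch: the real functions operate on a seekable file reader and search in block_size chunks, whereas the hypothetical helpers below use the standard `bisect` module on a sorted list of byte strings. The sample lines are made up.

```python
import bisect


def iter_range_sketch(lines, start, end):
    """Yield sorted lines where start <= line < end (end exclusive)."""
    i = bisect.bisect_left(lines, start)
    while i < len(lines) and lines[i] < end:
        yield lines[i]
        i += 1


def iter_prefix_sketch(lines, key):
    """Yield sorted lines that start with the prefix key."""
    i = bisect.bisect_left(lines, key)
    while i < len(lines) and lines[i].startswith(key):
        yield lines[i]
        i += 1


# Hypothetical sorted index lines (urlkey followed by a year).
lines = [
    b'com,example)/ 2010',
    b'com,example)/about 2011',
    b'com,example)/path 2012',
    b'org,iana)/ 2013',
]

print(list(iter_prefix_sketch(lines, b'com,example)/')))
print(list(iter_range_sketch(lines, b'com,example)/a', b'com,example)/b')))
```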
pywb.utils.canonicalize module¶
Standard url-canonicalization, surt and non-surt
-
pywb.utils.canonicalize.
calc_search_range
(url, match_type, surt_ordered=True, url_canon=None)[source]¶ Canonicalize a url (either with custom canonicalizer or standard canonicalizer with or without surt)
Then, compute the start and end of the url search range for a given match type.
Supported match types: exact, prefix, host, domain (only available with surt ordering)
Examples below:
# surt ranges
>>> calc_search_range('http://example.com/path/file.html', 'exact')
('com,example)/path/file.html', 'com,example)/path/file.html!')
>>> calc_search_range('http://example.com/path/file.html', 'prefix')
('com,example)/path/file.html', 'com,example)/path/file.htmm')
# slash and ?
>>> calc_search_range('http://example.com/path/', 'prefix')
('com,example)/path/', 'com,example)/path0')
>>> calc_search_range('http://example.com/path?', 'prefix')
('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/?', 'prefix')
('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/file.html', 'host')
('com,example)/', 'com,example*')
>>> calc_search_range('http://example.com/path/file.html', 'domain')
('com,example)/', 'com,example-')
special case for tld domain range
>>> calc_search_range('com', 'domain')
('com,', 'com-')
# non-surt ranges
>>> calc_search_range('http://example.com/path/file.html', 'exact', False)
('example.com/path/file.html', 'example.com/path/file.html!')
>>> calc_search_range('http://example.com/path/file.html', 'prefix', False)
('example.com/path/file.html', 'example.com/path/file.htmm')
>>> calc_search_range('http://example.com/path/file.html', 'host', False)
('example.com/', 'example.com0')
# errors: domain range not supported
>>> calc_search_range('http://example.com/path/file.html', 'domain', False) # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
UrlCanonicalizeException: matchType=domain unsupported for non-surt
>>> calc_search_range('http://example.com/path/file.html', 'blah', False) # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
UrlCanonicalizeException: Invalid match_type: blah
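The exact and prefix results in the doctests above follow a simple pattern: the exact range ends at the key plus '!', and the prefix range ends at the key with its last character incremented. The sketch below reproduces just that pattern for keys that are already canonicalized; `exact_range` and `prefix_range` are hypothetical helpers, not pywb functions, and the host and domain cases are not covered.

```python
def exact_range(key):
    """End key is the start key followed by '!'."""
    return key, key + '!'


def prefix_range(key):
    """End key is the start key with its last character incremented."""
    return key, key[:-1] + chr(ord(key[-1]) + 1)


print(exact_range('com,example)/path/file.html'))
print(prefix_range('com,example)/path/file.html'))
print(prefix_range('com,example)/path/'))
```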
-
pywb.utils.canonicalize.
canonicalize
(url, surt_ordered=True)[source]¶ Canonicalize url and convert to surt. If not in surt ordered mode, convert back to url form, as surt conversion is currently part of canonicalization.
>>> canonicalize('http://example.com/path/file.html', surt_ordered=True) 'com,example)/path/file.html'
>>> canonicalize('http://example.com/path/file.html', surt_ordered=False) 'example.com/path/file.html'
>>> canonicalize('urn:some:id') 'urn:some:id'
-
pywb.utils.canonicalize.
unsurt
(surt)[source]¶
# Simple surt
>>> unsurt('com,example)/')
'example.com/'
# Broken surt
>>> unsurt('com,example)')
'com,example)'
# Long surt
>>> unsurt('suffix,domain,sub,subsub,another,subdomain)/path/file/index.html?a=b?c=)/')
'subdomain.another.subsub.sub.domain.suffix/path/file/index.html?a=b?c=)/'
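One way to obtain the behaviour shown in the unsurt doctests is to split at the first ')/' and reverse the comma-separated host parts. The function below is a sketch written to match those three examples; it is named `unsurt_sketch` to make clear that it is not the library's own code.

```python
def unsurt_sketch(surt):
    """Convert a surt key back to host/path form; return input if malformed."""
    index = surt.find(')/')
    if index < 0:
        # No ')/' separator: treat as a broken surt and return unchanged.
        return surt
    parts = surt[:index].split(',')
    parts.reverse()
    # Keep everything from the '/' onward as the path.
    return '.'.join(parts) + surt[index + 1:]


print(unsurt_sketch('com,example)/'))
```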
pywb.utils.format module¶
-
class
pywb.utils.format.
ParamFormatter
(params, name='', prefix='param.')[source]¶ Bases:
string.Formatter
pywb.utils.geventserver module¶
-
class
pywb.utils.geventserver.
GeventServer
(app, port=0, hostname='localhost', handler_class=None, direct=False)[source]¶ Bases:
object
Class for optionally running a WSGI application in a greenlet
-
join
()[source]¶ Joins the greenlet spawned for running the server if it was started in non-direct mode
-
pywb.utils.io module¶
-
class
pywb.utils.io.
OffsetLimitReader
(stream, offset, length)[source]¶ Bases:
warcio.limitreader.LimitReader
pywb.utils.loaders module¶
-
class
pywb.utils.loaders.
BlockLoader
(**kwargs)[source]¶ Bases:
pywb.utils.loaders.BaseLoader
a loader which can stream blocks of content given a uri, offset and optional length. Currently supports: http/https and file/local file system
-
loaders
= {'file': <class 'pywb.utils.loaders.LocalFileLoader'>, 'http': <class 'pywb.utils.loaders.HttpLoader'>, 'https': <class 'pywb.utils.loaders.HttpLoader'>, 'pkg': <class 'pywb.utils.loaders.PackageLoader'>, 's3': <class 'pywb.utils.loaders.S3Loader'>, 'webhdfs': <class 'pywb.utils.loaders.WebHDFSLoader'>}¶
-
profile_loader
= None¶
-
-
class
pywb.utils.loaders.
HMACCookieMaker
(key, name, duration=10)[source]¶ Bases:
object
Utility class to produce signed HMAC digest cookies to be used with each http request
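As an illustration of the general idea of a signed, expiring cookie, the sketch below uses the standard `hmac` module. The cookie layout (name=expiry-digest), the SHA-256 digest and both helper names are assumptions made for this example; they are not taken from HMACCookieMaker itself.

```python
import hashlib
import hmac


def make_cookie(key, name, expire):
    """Hypothetical helper: sign an expiry time (epoch seconds) with key."""
    msg = str(expire)
    digest = hmac.new(key.encode('utf-8'), msg.encode('utf-8'),
                      hashlib.sha256).hexdigest()
    return '{0}={1}-{2}'.format(name, msg, digest)


def verify_cookie(key, cookie, now):
    """Hypothetical helper: check the signature and that expiry is not past."""
    name, _, value = cookie.partition('=')
    expire, _, _digest = value.partition('-')
    if not expire.isdigit() or int(expire) < now:
        return False
    # Recompute the cookie and compare in constant time.
    return hmac.compare_digest(make_cookie(key, name, expire), cookie)


cookie = make_cookie('secret', 'auth', 1000)
print(verify_cookie('secret', cookie, now=990))
```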
-
class
pywb.utils.loaders.
WebHDFSLoader
(**kwargs)[source]¶ Bases:
pywb.utils.loaders.HttpLoader
Loader class specifically for loading webhdfs content
-
HTTP_URL
= 'http://{host}/webhdfs/v1{path}?'¶
-
load
(url, offset, length)[source]¶ Loads the supplied web hdfs content
Parameters: - url (str) – The URL to the web hdfs content to be loaded
- offset (int|float|double) – The offset of the content to be loaded
- length (int|float|double) – The length of the content to be loaded
Returns: The raw response content
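The HTTP_URL template above can be combined with the WebHDFS OPEN operation, which accepts offset and length query parameters. The sketch below only builds such a request URL; `build_open_url` and the host name are hypothetical, and the real loader may add further parameters (for example credentials) that are not shown here.

```python
from urllib.parse import urlencode

# Template copied from WebHDFSLoader.HTTP_URL.
HTTP_URL = 'http://{host}/webhdfs/v1{path}?'


def build_open_url(host, path, offset, length):
    """Hypothetical helper: build a WebHDFS OPEN url for a byte range."""
    params = {'op': 'OPEN', 'offset': offset}
    if length > 0:
        # Only request a bounded read when a positive length is given.
        params['length'] = length
    return HTTP_URL.format(host=host, path=path) + urlencode(params)


print(build_open_url('namenode.example:50070', '/warcs/a.warc.gz', 1024, 2048))
```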
-
-
pywb.utils.loaders.
init_yaml_env_vars
()[source]¶ Initializes the yaml parser to be able to set the value of fields from environment variables
Return type: None
-
pywb.utils.loaders.
load_overlay_config
(main_env_var, main_default_file='', overlay_env_var='', overlay_file='')[source]¶
pywb.utils.memento module¶
-
class
pywb.utils.memento.
MementoUtils
[source]¶ Bases:
object
-
classmethod
make_memento_link
(url, type, dt, coll=None, memento_format=None)[source]¶ Creates a memento link string
Parameters: Returns: A memento link string
Return type:
-
classmethod
pywb.utils.merge module¶
pywb.utils.wbexception module¶
-
exception
pywb.utils.wbexception.
AccessException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate an access control violation
-
exception
pywb.utils.wbexception.
AppPageNotFound
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that a page was not found
-
exception
pywb.utils.wbexception.
BadRequestException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that a request was bad
-
exception
pywb.utils.wbexception.
LiveResourceException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that an error was encountered during the retrieval of a live web resource
-
exception
pywb.utils.wbexception.
NotFoundException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that a resource was not found
-
exception
pywb.utils.wbexception.
UpstreamException
(status_code, url, details)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that an error was encountered from an upstream endpoint
Module contents¶
pywb.warcserver package¶
Subpackages¶
pywb.warcserver.index package¶
-
class
pywb.warcserver.index.aggregator.
BaseDirectoryIndexSource
(base_prefix, base_dir='', name='', config=None)[source]¶ Bases:
pywb.warcserver.index.aggregator.BaseAggregator
-
INDEX_SOURCES
= [(('.cdx', '.cdxj'), <class 'pywb.warcserver.index.indexsource.FileIndexSource'>), (('.idx', '.summary'), <class 'pywb.warcserver.index.zipnum.ZipNumIndexSource'>)]¶
-
-
class
pywb.warcserver.index.aggregator.
BaseRedisMultiKeyIndexSource
(redis_url=None, redis=None, key_template=None, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.BaseAggregator
,pywb.warcserver.index.indexsource.RedisIndexSource
-
class
pywb.warcserver.index.aggregator.
CacheDirectoryIndexSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.CacheDirectoryMixin
,pywb.warcserver.index.aggregator.DirectoryIndexSource
-
class
pywb.warcserver.index.aggregator.
DirectoryIndexSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.SeqAggMixin
,pywb.warcserver.index.aggregator.BaseDirectoryIndexSource
-
class
pywb.warcserver.index.aggregator.
GeventMixin
(*args, **kwargs)[source]¶ Bases:
object
-
DEFAULT_TIMEOUT
= 5.0¶
-
-
class
pywb.warcserver.index.aggregator.
GeventTimeoutAggregator
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.TimeoutMixin
,pywb.warcserver.index.aggregator.GeventMixin
,pywb.warcserver.index.aggregator.BaseSourceListAggregator
-
class
pywb.warcserver.index.aggregator.
RedisMultiKeyIndexSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.SeqAggMixin
,pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource
-
class
pywb.warcserver.index.aggregator.
SimpleAggregator
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.SeqAggMixin
,pywb.warcserver.index.aggregator.BaseSourceListAggregator
-
class
pywb.warcserver.index.cdxobject.
CDXObject
(cdxline=b'')[source]¶ Bases:
collections.OrderedDict
dictionary object representing parsed CDX line.
-
CDX_ALT_FIELDS
= {'d': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length', 'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url', 's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url'}¶
-
CDX_FORMATS
= [['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename']]¶
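Because each list in CDX_FORMATS has a different length, a space-delimited CDX line can be mapped to field names by counting its fields. The sketch below shows that idea with three of the listed formats; `parse_cdx_line` is a hypothetical helper, the sample line is made up, and the real CDXObject does more than this.

```python
from collections import OrderedDict

# Three of the field layouts copied from CDXObject.CDX_FORMATS (7, 9 and 11 fields).
CDX_FORMATS = [
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest',
     'redirect', 'offset', 'filename'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest',
     'redirect', 'robotflags', 'length', 'offset', 'filename'],
]


def parse_cdx_line(line):
    """Hypothetical helper: pick the format whose field count matches."""
    fields = line.rstrip().split(' ')
    for names in CDX_FORMATS:
        if len(names) == len(fields):
            return OrderedDict(zip(names, fields))
    raise ValueError('unknown CDX format with %d fields' % len(fields))


sample = ('com,example)/ 20140101000000 http://example.com/ text/html 200 '
          '3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1043 333 example.warc.gz')
record = parse_cdx_line(sample)
print(record['timestamp'], record['filename'])
```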
-
-
class
pywb.warcserver.index.cdxobject.
IDXObject
(idxline)[source]¶ Bases:
collections.OrderedDict
-
FORMAT
= ['urlkey', 'part', 'offset', 'length', 'lineno']¶
-
NUM_REQ_FIELDS
= 4¶
-
-
pywb.warcserver.index.cdxops.
cdx_collapse_time_status
(cdx_iter, timelen=10)[source]¶ collapse by timestamp and status code.
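Collapsing by timestamp and status can be pictured as dropping consecutive records whose truncated timestamp and status code repeat. The generator below is a sketch under that reading, using plain dicts in place of CDX objects; the records are invented for the example.

```python
def collapse_time_status(cdx_iter, timelen=10):
    """Yield a record only when (timestamp[:timelen], status) changes."""
    last_token = None
    for cdx in cdx_iter:
        token = (cdx['timestamp'][:timelen], cdx.get('status', ''))
        if token != last_token:
            last_token = token
            yield cdx


records = [
    {'timestamp': '20140101000000', 'status': '200'},
    {'timestamp': '20140101000030', 'status': '200'},  # same hour, same status
    {'timestamp': '20140101000045', 'status': '404'},  # status changed
    {'timestamp': '20140101010000', 'status': '404'},  # hour changed
]
print([r['timestamp'] for r in collapse_time_status(records)])
```

With the default timelen of 10, timestamps are compared down to the hour (YYYYMMDDHH).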
-
pywb.warcserver.index.cdxops.
cdx_filter
(cdx_iter, filter_strings)[source]¶ filter CDX by regex. If each filter is in field:regex form, apply the filter to cdx[field].
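A field:regex filter can be sketched as follows, with plain dicts standing in for CDX objects. This hypothetical version requires every filter to match from the start of the field value; the real function may support additional filter syntax that is not shown here.

```python
import re


def cdx_filter_sketch(cdx_iter, filter_strings):
    """Yield records for which every field:regex filter matches."""
    filters = []
    for filter_str in filter_strings:
        # Split on the first ':' only, so the regex may contain colons.
        field, _, pattern = filter_str.partition(':')
        filters.append((field, re.compile(pattern)))

    for cdx in cdx_iter:
        if all(rx.match(cdx.get(field, '')) for field, rx in filters):
            yield cdx


records = [
    {'url': 'http://example.com/', 'status': '200', 'mime': 'text/html'},
    {'url': 'http://example.com/missing', 'status': '404', 'mime': 'text/html'},
    {'url': 'http://example.com/logo.png', 'status': '200', 'mime': 'image/png'},
]
print([r['url'] for r in cdx_filter_sketch(records, ['status:200', 'mime:text/'])])
```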
-
pywb.warcserver.index.cdxops.
cdx_load
(sources, query, process=True)[source]¶ merge text CDX lines from sources, return an iterator for filtered and access-checked sequence of CDX objects.
Parameters: - sources – iterable for text CDX sources.
- process – bool, perform processing sorting/filtering/grouping ops
-
pywb.warcserver.index.cdxops.
cdx_resolve_revisits
(cdx_iter)[source]¶ resolve revisits.
this filter adds three fields to CDX: orig.length, orig.offset, and orig.filename. For revisit records, these fields hold the corresponding values from the previous non-revisit (original) CDX record. They are all "-" for non-revisit records.
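The revisit resolution described above can be sketched with a lookup of originals. Two assumptions are made for this example and are not stated in the text: a revisit is recognized by a mime value of 'warc/revisit', and a revisit is paired with its original by an equal digest. Plain dicts stand in for CDX objects.

```python
def resolve_revisits_sketch(cdx_iter):
    """Add orig.length, orig.offset and orig.filename to each record."""
    originals = {}
    for cdx in cdx_iter:
        # Assumption: revisit records carry the mime type 'warc/revisit'.
        is_revisit = cdx.get('mime') == 'warc/revisit'
        digest = cdx.get('digest')
        original = originals.get(digest)

        if original is None and not is_revisit:
            # First non-revisit capture with this digest becomes the original.
            originals[digest] = cdx

        for field in ('length', 'offset', 'filename'):
            if original is not None and is_revisit:
                cdx['orig.' + field] = original.get(field, '-')
            else:
                cdx['orig.' + field] = '-'
        yield cdx


first = {'digest': 'AAA', 'mime': 'text/html',
         'length': '100', 'offset': '0', 'filename': 'a.warc.gz'}
revisit = {'digest': 'AAA', 'mime': 'warc/revisit',
           'length': '50', 'offset': '500', 'filename': 'b.warc.gz'}
resolved = list(resolve_revisits_sketch([first, revisit]))
print(resolved[1]['orig.filename'])
```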
-
pywb.warcserver.index.cdxops.
cdx_reverse
(cdx_iter, limit)[source]¶ return cdx records in reverse order.
-
pywb.warcserver.index.cdxops.
cdx_sort_closest
(closest, cdx_iter, limit=10)[source]¶ sort CDXCaptureResult by closest to timestamp.
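Sorting by closeness to a timestamp can be sketched with `heapq.nsmallest` and the absolute time difference as the key. This assumes full 14-digit timestamps and uses plain dicts in place of CDX objects; it is an illustration rather than the library's own algorithm.

```python
import heapq
from datetime import datetime


def to_datetime(timestamp):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(timestamp, '%Y%m%d%H%M%S')


def sort_closest_sketch(closest, cdx_iter, limit=10):
    """Return up to limit records, nearest to the closest timestamp first."""
    target = to_datetime(closest)
    return heapq.nsmallest(
        limit, cdx_iter,
        key=lambda cdx: abs(to_datetime(cdx['timestamp']) - target))


captures = [{'timestamp': ts} for ts in
            ('20130101000000', '20140101000000',
             '20140601000000', '20160101000000')]
nearest = sort_closest_sketch('20140301000000', captures, limit=2)
print([c['timestamp'] for c in nearest])
```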
-
pywb.warcserver.index.cdxops.
create_merged_cdx_gen
(sources, query)[source]¶ create a generator which loads and merges cdx streams; ensures cdxs are lazy loaded
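Lazy merging of sorted streams can be sketched with `heapq.merge`, wrapping each source in a generator so that nothing is opened until the merged result is first consumed. In this example a source is assumed to be a zero-argument callable returning sorted lines; that interface and the helper names are hypothetical.

```python
import heapq


def create_merged_gen_sketch(sources):
    """Merge sorted sources; each is opened only when first pulled from."""
    def load(source):
        # The body of a generator does not run until iteration begins.
        for line in source():
            yield line

    return heapq.merge(*(load(source) for source in sources))


opened = []


def make_source(name, lines):
    """Hypothetical source factory that records when it is opened."""
    def open_source():
        opened.append(name)
        return iter(lines)
    return open_source


merged = create_merged_gen_sketch([
    make_source('a', ['com,example)/ 2010', 'org,iana)/ 2012']),
    make_source('b', ['com,example)/ 2011', 'net,example)/ 2013']),
])
opened_before = list(opened)   # still empty: nothing consumed yet
result = list(merged)
print(result)
```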
-
class
pywb.warcserver.index.fuzzymatcher.
FuzzyMatcher
(filename=None)[source]¶ Bases:
object
-
DEFAULT_FILTER
= ['urlkey:{0}']¶
-
DEFAULT_MATCH_TYPE
= 'prefix'¶
-
DEFAULT_REPLACE_AFTER
= '?'¶
-
FUZZY_SKIP_PARAMS
= ('alt_url', 'reverse', 'closest', 'end_key', 'url', 'matchType', 'filter')¶
-
-
class
pywb.warcserver.index.fuzzymatcher.
FuzzyRule
(url_prefix, regex, replace_after, filter_str, match_type, find_all)¶ Bases:
tuple
-
filter_str
¶ Alias for field number 3
-
find_all
¶ Alias for field number 5
-
match_type
¶ Alias for field number 4
-
regex
¶ Alias for field number 1
-
replace_after
¶ Alias for field number 2
-
url_prefix
¶ Alias for field number 0
-
-
class
pywb.warcserver.index.indexsource.
BaseIndexSource
[source]¶ Bases:
object
-
WAYBACK_ORIG_SUFFIX
= '{timestamp}id_/{url}'¶
-
logger
= <Logger warcserver (WARNING)>¶
-
-
class
pywb.warcserver.index.indexsource.
FileIndexSource
(filename, config=None)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
-
CDX_EXT
= ('.cdx', '.cdxj')¶
-
-
class
pywb.warcserver.index.indexsource.
MementoIndexSource
(timegate_url, timemap_url, replay_url)[source]¶
-
class
pywb.warcserver.index.indexsource.
RedisIndexSource
(redis_url=None, redis=None, key_template=None, **kwargs)[source]¶
-
class
pywb.warcserver.index.indexsource.
RemoteIndexSource
(api_url, replay_url, url_field='load_url', closest_limit=100)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
-
CDX_MATCH_RX
= re.compile('^cdxj?\\+(?P<url>https?\\:.*)')¶
-
-
class
pywb.warcserver.index.indexsource.
WBMementoIndexSource
(timegate_url, timemap_url, replay_url)[source]¶ Bases:
pywb.warcserver.index.indexsource.MementoIndexSource
-
WAYBACK_ORIG_SUFFIX
= '{timestamp}im_/{url}'¶
-
WBURL_MATCH
= re.compile('([0-9]{0,14})?(?:\\w+_)?/{0,3}(.*)')¶
-
-
class
pywb.warcserver.index.indexsource.
XmlQueryIndexSource
(query_api_url)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
An index source class for XML files
-
EXACT_QUERY
= 'type:urlquery url:'¶
-
PREFIX_QUERY
= 'type:prefixquery url:'¶
-
convert_to_cdx
(item)[source]¶ Converts the etree element to a CDX object
Parameters: item – The etree element to be converted
Returns: The CDXObject representing the supplied etree element object
Return type: CDXObject
-
gettext
(item, name)[source]¶ Returns the value of the supplied name
Parameters: - item – The etree element to be converted
- name – The name of the field to get its value for
Returns: The value of the field
Return type:
-
classmethod
init_from_config
(config)[source]¶ Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied dictionary contains the type key equal to xmlquery
Parameters: config (dict[str, str])
Returns: The initialized XmlQueryIndexSource or None
Return type: XmlQueryIndexSource|None
-
classmethod
init_from_string
(value)[source]¶ Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied value starts with xmlquery+
Parameters: value (str) – The string by which to initialize the XmlQueryIndexSource
Returns: The initialized XmlQueryIndexSource or None
Return type: XmlQueryIndexSource|None
-
load_index
(params)[source]¶ Loads the xml query index based on the supplied params
Parameters: params (dict[str, str]) – The query params
Returns: A list or generator of cdx objects
Raises: - NotFoundException – If the query url is not found or the results of the query return no cdx entries
- BadRequestException – If the match type is not exact or prefix
-
-
class
pywb.warcserver.index.zipnum.
LocMapResolver
(loc_summary, loc_filename)[source]¶ Bases:
object
Lookup shards based on a file mapping shard name to one or more paths. The entries are tab delimited.
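A tab-delimited location map of the kind described above can be read into a dictionary that maps each shard name to its list of paths. The helper name and the sample entries are hypothetical; the sketch shows only the parsing step, not how the resolver uses the result.

```python
def load_loc_map(lines):
    """Parse 'name<TAB>path[<TAB>path...]' lines into {name: [paths]}."""
    loc_map = {}
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            # Skip blank or malformed lines that carry no path.
            continue
        loc_map[parts[0]] = parts[1:]
    return loc_map


sample_lines = [
    'part-00000\t/data/idx/part-00000.gz\n',
    'part-00001\t/data/idx/part-00001.gz\thttp://mirror.example/part-00001.gz\n',
    '\n',
]
loc_map = load_loc_map(sample_lines)
print(loc_map['part-00001'])
```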
-
class
pywb.warcserver.index.zipnum.
LocPrefixResolver
(loc_summary, loc_config)[source]¶ Bases:
object
Use a prefix lookup, where the prefix can either be a fixed string or can be a regex replacement of the index summary path
-
class
pywb.warcserver.index.zipnum.
ZipNumIndexSource
(summary, config=None)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
-
DEFAULT_MAX_BLOCKS
= 10¶
-
DEFAULT_RELOAD_INTERVAL
= 10¶
-
IDX_EXT
= ('.idx', '.summary')¶
-
pywb.warcserver.resource package¶
-
class
pywb.warcserver.resource.pathresolvers.
PathIndexResolver
(pathindex_file)[source]¶ Bases:
object
-
class
pywb.warcserver.resource.resolvingloader.
ResolvingLoader
(path_resolvers, record_loader=None, no_record_parse=False)[source]¶ Bases:
object
-
EMPTY_DIGEST
= '3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ'¶
-
MISSING_REVISIT_MSG
= 'Original for revisit record could not be loaded'¶
-
load_cdx_for_dupe
(url, timestamp, digest, cdx_loader)[source]¶ If a cdx_server is available, return response from server, otherwise empty list
-
load_headers_and_payload
(cdx, failed_files, cdx_loader)[source]¶ Resolve headers and payload for a given capture. In the simple case, headers and payload are in the same record. In the case of revisit records, the payload and headers may be in different records.
If the original has already been found, lookup original using orig. fields in cdx dict. Otherwise, call _load_different_url_payload() to get cdx index from a different url to find the original record.
-
-
class
pywb.warcserver.resource.responseloader.
LiveWebLoader
(forward_proxy_prefix=None, adapter=None)[source]¶ Bases:
pywb.warcserver.resource.responseloader.BaseLoader
-
SKIP_HEADERS
= ('link', 'memento-datetime', 'content-location', 'x-archive')¶
-
UNREWRITE_HEADERS
= ('location', 'content-location')¶
-
VIDEO_MIMES
= ('application/x-mpegURL', 'application/vnd.apple.mpegurl', 'application/dash+xml')¶
-
-
class
pywb.warcserver.resource.responseloader.
VideoLoader
[source]¶ Bases:
pywb.warcserver.resource.responseloader.BaseLoader
-
CONTENT_TYPE
= 'application/vnd.youtube-dl_formats+json'¶
-
-
class
pywb.warcserver.resource.responseloader.
WARCPathLoader
(paths, cdx_source)[source]¶ Bases:
pywb.warcserver.resource.pathresolvers.DefaultResolverMixin
,pywb.warcserver.resource.responseloader.BaseLoader
Submodules¶
pywb.warcserver.access_checker module¶
-
class
pywb.warcserver.access_checker.
AccessChecker
(access_source, default_access='allow', embargo=None)[source]¶ Bases:
object
An access checker class
-
EXACT_SUFFIX
= '###'¶
-
EXACT_SUFFIX_B
= b'###'¶
-
EXACT_SUFFIX_SEARCH_B
= b'####'¶
-
create_access_aggregator
(source_files)[source]¶ Creates a new AccessRulesAggregator using the supplied list of access control file names
Parameters: source_files (list[str]) – The list of access control file names
Returns: The created AccessRulesAggregator
Return type: AccessRulesAggregator
-
create_access_source
(filename)[source]¶ Creates a new access source for the supplied filename.
If the filename is for a directory, a CacheDirectoryAccessSource instance is returned, otherwise a FileAccessIndexSource instance
Parameters: filename (str) – The name of a file/directory
Returns: An instance of CacheDirectoryAccessSource or FileAccessIndexSource depending on whether the supplied filename is for a directory or a file
Return type: CacheDirectoryAccessSource|FileAccessIndexSource
Raises: Exception – Indicates an invalid access source was supplied
-
find_access_rule
(url, ts=None, urlkey=None, collection=None, acl_user=None)[source]¶ Attempts to find the access control rule for the supplied URL otherwise returns the default rule
Parameters: - url (str) – The URL for the rule to be found
- ts (str|None) – A timestamp (not used)
- urlkey (str|None) – The access control url key
- collection (str|None) – The collection, if any
- acl_user (str|None) – The access control user, if any
Returns: The access control rule for the supplied URL if one exists, otherwise the default rule
Return type: CDXObject
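The lookup with a default fallback can be sketched with an in-memory rule table. Several assumptions are made here that the text does not confirm: rules are keyed by url key, a key ending in the EXACT_SUFFIX '###' applies only to an exact match, and otherwise the longest matching prefix wins. The real checker searches access control index files rather than a dict, and the helper name and rule values are hypothetical.

```python
# Value copied from AccessChecker.EXACT_SUFFIX.
EXACT_SUFFIX = '###'


def find_access_rule_sketch(rules, urlkey, default_access='allow'):
    """Return the access value of the most specific matching rule."""
    # Assumption: an exact-match rule is stored under urlkey + '###'.
    exact = rules.get(urlkey + EXACT_SUFFIX)
    if exact is not None:
        return exact

    best = None
    for key in rules:
        if key.endswith(EXACT_SUFFIX):
            continue
        # Assumption: among prefix rules, the longest matching key wins.
        if urlkey.startswith(key) and (best is None or len(key) > len(best)):
            best = key
    return rules[best] if best is not None else default_access


rules = {
    'com,example)/': 'allow',
    'com,example)/private': 'block',
    'com,example)/private/ok.html###': 'allow',
}
print(find_access_rule_sketch(rules, 'com,example)/private/secret.html'))
```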
-
wrap_iter
(cdx_iter, acl_user)[source]¶ Wraps the supplied cdx iter and yields cdx objects that contain the access control results for the cdx object being yielded
Parameters: - cdx_iter – The cdx object iterator to be wrapped
- acl_user (str) – The user associated with this request (optional)
Returns: The wrapped cdx object iterator
-
-
class
pywb.warcserver.access_checker.
AccessRulesAggregator
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.access_checker.ReverseMergeMixin
,pywb.warcserver.index.aggregator.SimpleAggregator
An Aggregator specific to access control
-
class
pywb.warcserver.access_checker.
CacheDirectoryAccessSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.CacheDirectoryMixin
,pywb.warcserver.access_checker.DirectoryAccessSource
A cache directory index source specific to access control
-
class
pywb.warcserver.access_checker.
DirectoryAccessSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.access_checker.ReverseMergeMixin
,pywb.warcserver.index.aggregator.DirectoryIndexSource
A directory index source specific to access control
-
INDEX_SOURCES
= [('.aclj', <class 'pywb.warcserver.access_checker.FileAccessIndexSource'>)]¶
-
-
class
pywb.warcserver.access_checker.
FileAccessIndexSource
(filename, config=None)[source]¶ Bases:
pywb.warcserver.index.indexsource.FileIndexSource
An Index Source class specific to access control lists
pywb.warcserver.amf module¶
pywb.warcserver.basewarcserver module¶
pywb.warcserver.handlers module¶
-
class
pywb.warcserver.handlers.
DefaultResourceHandler
(index_source, warc_paths='', forward_proxy_prefix='', **kwargs)[source]¶
-
class
pywb.warcserver.handlers.
IndexHandler
(index_source, opts=None, *args, **kwargs)[source]¶ Bases:
object
-
DEF_OUTPUT
= 'cdxj'¶
-
OUTPUTS
= {'cdxj': <function to_cdxj>, 'json': <function to_json>, 'link': <function to_link>, 'text': <function to_text>}¶
-
pywb.warcserver.http module¶
-
class
pywb.warcserver.http.
DefaultAdapters
[source]¶ Bases:
object
-
live_adapter
= <pywb.warcserver.http.PywbHttpAdapter object>¶
-
remote_adapter
= <pywb.warcserver.http.PywbHttpAdapter object>¶
-
-
class
pywb.warcserver.http.
PywbHttpAdapter
(cert_reqs='CERT_NONE', ca_cert_dir=None, **init_kwargs)[source]¶ Bases:
requests.adapters.HTTPAdapter
This adapter exists to restore the default behavior of urllib3 < 1.25.x, which was to not verify ssl certs, until a better solution is found
-
init_poolmanager
(connections, maxsize, block=False, **pool_kwargs)[source]¶ Initializes a urllib3 PoolManager.
This method should not be called from user code, and is only exposed for use when subclassing the HTTPAdapter.
Parameters: - connections – The number of urllib3 connection pools to cache.
- maxsize – The maximum number of connections to save in the pool.
- block – Block when no free connections are available.
- pool_kwargs – Extra keyword arguments used to initialize the Pool Manager.
-
proxy_manager_for
(proxy, **proxy_kwargs)[source]¶ Return urllib3 ProxyManager for the given proxy.
This method should not be called from user code, and is only exposed for use when subclassing the HTTPAdapter.
Parameters: - proxy – The proxy to return a urllib3 ProxyManager for.
- proxy_kwargs – Extra keyword arguments used to configure the Proxy Manager.
Returns: ProxyManager
Return type: urllib3.ProxyManager
-
pywb.warcserver.inputrequest module¶
pywb.warcserver.upstreamindexsource module¶
pywb.warcserver.warcserver module¶
-
class
pywb.warcserver.warcserver.
WarcServer
(config_file='./config.yaml', custom_config=None)[source]¶ Bases:
pywb.warcserver.basewarcserver.BaseWarcServer
-
AUTO_COLL_TEMPL
= '{coll}'¶
-
DEFAULT_DEDUP_URL
= 'redis://localhost:6379/0/pywb:{coll}:cdxj'¶
-