Welcome to Scrapy Do’s documentation!

Scrapy Do is a daemon that provides a convenient way to run Scrapy spiders. It can run them once, immediately, or periodically at specified time intervals. It has been inspired by scrapyd but written from scratch. For the time being, it only comes with a REST API. Version 0.2.0 will come with a command line client, and version 0.3.0 will have an interactive web interface.

Quick Start

  • Install scrapy-do using pip:

    $ pip install scrapy-do
    
  • Start the daemon in the foreground:

    $ scrapy-do -n scrapy-do
    
  • Open another terminal window, download Scrapy’s quotesbot example, and create a deployable archive:

    $ git clone https://github.com/scrapy/quotesbot.git
    $ cd quotesbot
    $ git archive master -o quotesbot.zip --prefix=quotesbot/
    
  • Push the code to the server:

    $ curl -s http://localhost:7654/push-project.json \
           -F name=quotesbot \
           -F archive=@quotesbot.zip | jq -r
    {
      "status": "ok",
      "spiders": [
        "toscrape-css",
        "toscrape-xpath"
      ]
    }
    
  • Schedule some jobs:

    $ curl -s http://localhost:7654/schedule-job.json \
           -F project=quotesbot \
           -F spider=toscrape-css \
           -F "when=every 2 to 3 hours" | jq -r
    {
      "status": "ok",
      "identifier": "04a38a03-1ce4-4077-aee1-e8275d1c20b6"
    }
    
    $ curl -s http://localhost:7654/schedule-job.json \
           -F project=quotesbot \
           -F spider=toscrape-css \
           -F when=now | jq -r
    {
      "status": "ok",
      "identifier": "83d447b0-ba6e-42c5-a80f-6982b2e860cf"
    }
    
  • See what’s going on:

    $ curl -s "http://localhost:7654/list-jobs.json?status=ACTIVE" | jq -r
    {
      "status": "ok",
      "jobs": [
        {
          "identifier": "83d447b0-ba6e-42c5-a80f-6982b2e860cf",
          "status": "RUNNING",
          "actor": "USER",
          "schedule": "now",
          "project": "quotesbot",
          "spider": "toscrape-css",
          "timestamp": "2017-12-10 22:33:14.853565",
          "duration": null
        },
        {
          "identifier": "04a38a03-1ce4-4077-aee1-e8275d1c20b6",
          "status": "SCHEDULED",
          "actor": "USER",
          "schedule": "every 2 to 3 hours",
          "project": "quotesbot",
          "spider": "toscrape-css",
          "timestamp": "2017-12-10 22:31:12.320832",
          "duration": null
        }
      ]
    }
    

Basic Concepts

Projects

Scrapy Do handles zipped Scrapy projects. The only expectation it has about the structure of the archive is that it contains a directory whose name matches the name of the project. This directory, in turn, contains the Scrapy project itself. Doing things this way ends up being quite convenient if you use a version control system like git to manage the code of your spiders (which you probably should). Let’s consider the quotesbot project:

$ git clone https://github.com/scrapy/quotesbot.git
$ cd quotesbot

You can create a valid archive like this:

$ git archive master -o quotesbot.zip --prefix=quotesbot/

You can, of course, create the zip file any way you wish as long as it meets the criteria described above.
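
For example, the zip utility run from the parent directory creates an archive with the required top-level quotesbot/ prefix; the -x flag merely skips the git metadata that a plain zip would otherwise include:

$ zip -r quotesbot.zip quotesbot -x "quotesbot/.git/*"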

Jobs

When you submit a job, it will end up being classified as either SCHEDULED or PENDING depending on the scheduling spec you provide. Any PENDING job will be picked up for execution as soon as there is a free job slot and its status will be changed to RUNNING. SCHEDULED jobs spawn new PENDING jobs at the intervals specified in the scheduling spec. A RUNNING job may end up being SUCCESSFUL, FAILED, or CANCELED depending on the return code of the spider process or your actions.
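
For example, once some jobs have finished, you can review their final statuses by querying the list-jobs.json endpoint (described in the REST API section) with COMPLETED, which groups the terminal states listed above:

$ curl -s "http://localhost:7654/list-jobs.json?status=COMPLETED" | jq -r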

Scheduling Specs

Scrapy Do uses the excellent Schedule library to handle scheduled jobs. The user-supplied scheduling specs get translated to a series of calls to the schedule library. Therefore, whatever is valid for this library should be a valid scheduling spec. For example:

  • ‘every monday at 12:30’
  • ‘every 2 to 3 hours’
  • ‘every 6 minutes’
  • ‘every hour at 00:15’

are all valid. A scheduling spec must start with either ‘every’ or ‘now’. The former results in a SCHEDULED job, while the latter produces a PENDING job for immediate execution. The other valid keywords, with an example request after the list, are:

  • second
  • seconds
  • minute
  • minutes
  • hour
  • hours
  • day
  • days
  • week
  • weeks
  • monday
  • tuesday
  • wednesday
  • thursday
  • friday
  • saturday
  • sunday
  • at - expects an hour-like parameter immediately afterwards (e.g., 12:12)
  • to - expects an integer immediately afterwards
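
For example, a spec combining a weekday with the at keyword can be submitted like any other; the response carries the identifier of the newly created SCHEDULED job, just as in the Quick Start:

$ curl -s http://localhost:7654/schedule-job.json \
       -F project=quotesbot \
       -F spider=toscrape-css \
       -F "when=every monday at 12:30" | jq -r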

Installation

The easy way

The easiest way to install Scrapy Do is using pip. You can then create a directory where you want your project data stored and just start the daemon there.

$ pip install scrapy-do
$ mkdir /home/user/my-scrapy-do-data
$ cd /home/user/my-scrapy-do-data
$ scrapy-do scrapy-do

Yup, you need to type scrapy-do twice; that’s how the underlying Twisted twistd launcher works. After doing that, you will see some new content in this directory, including the log file and the pidfile of the Scrapy Do daemon.

A systemd service

Installing Scrapy Do as a systemd service is a far better idea than the easy way described above. It’s a bit of work that should really be done by a proper Debian/Ubuntu package, but we do not have one for the time being, so I will show you how to do it “by hand.”

  • Although not strictly necessary, it’s a good practice to run the daemon under a separate user account. I will create one called pydaemon because I run a couple of other Python daemons this way.

    $ sudo useradd -m -d /opt/pydaemon pydaemon
    
  • Make sure you have all of the following packages installed:

    $ sudo apt-get install python3 python3-dev python3-virtualenv
    $ sudo apt-get install build-essential
    
  • Switch your session to this new user account:

    $ sudo su - pydaemon
    
  • Create the virtual env and install Scrapy Do:

    $ mkdir virtualenv
    $ cd virtualenv/
    $ python3 /usr/lib/python3/dist-packages/virtualenv.py -p /usr/bin/python3 .
    $ . ./bin/activate
    $ pip install scrapy-do
    $ cd ..
    
  • Create a bin directory and a wrapper script that will set up the virtualenv on startup:

    $ mkdir bin
    $ cat > bin/scrapy-do << EOF
    > #!/bin/bash
    > . /opt/pydaemon/virtualenv/bin/activate
    > exec /opt/pydaemon/virtualenv/bin/scrapy-do "\${@}"
    > EOF
    $ chmod 755 bin/scrapy-do
    
  • Create a data directory and a configuration file:

    $ mkdir -p data/scrapy-do
    $ mkdir etc
    $ cat > etc/scrapy-do.conf << EOF
    > [scrapy-do]
    > project-store = /opt/pydaemon/data/scrapy-do
    > EOF
    
  • As root, create the systemd unit file with the following content:

    # cat > /etc/systemd/system/scrapy-do.service << EOF
    > [Unit]
    > Description=Scrapy Do Service
    >
    > [Service]
    > ExecStart=/opt/pydaemon/bin/scrapy-do --nodaemon --pidfile= \
    >           scrapy-do --config /opt/pydaemon/etc/scrapy-do.conf
    > User=pydaemon
    > Group=pydaemon
    > Restart=always
    >
    > [Install]
    > WantedBy=multi-user.target
    > EOF
    
  • You can then reload the systemd configuration and let it manage the Scrapy Do daemon:

    $ sudo systemctl daemon-reload
    $ sudo systemctl start scrapy-do
    $ sudo systemctl enable scrapy-do
    
  • Finally, you should be able to see that the daemon is running:

    $ sudo systemctl status scrapy-do
    ● scrapy-do.service - Scrapy Do Service
       Loaded: loaded (/etc/systemd/system/scrapy-do.service; enabled; vendor preset: enabled)
       Active: active (running) since Sun 2017-12-10 22:42:55 UTC; 4min 23s ago
     Main PID: 27543 (scrapy-do)
    ...
    

I know it’s awfully complicated. I will do some packaging work when I have a spare moment.

Server Configuration

You can pass a configuration file to the Scrapy Do daemon in the following way:

$ scrapy-do scrapy-do --config /path/to/config/file.conf

The rest of this section describes the meaning of the configurable parameters.

[scrapy-do] section

  • project-store: A directory where all of the Scrapy Do daemon’s state is stored. Defaults to projects, meaning a subdirectory of the current working directory.
  • job-slots: The number of jobs that can run in parallel. Defaults to 3.
  • completed-cap: The number of completed jobs to keep. Jobs exceeding the cap are purged together with their log files, oldest first. Defaults to 50.

[web] section

  • interface: The interface to listen on. Defaults to 127.0.0.1.
  • port: The port number to listen on. Defaults to 7654.
  • https: The HTTPS switch. Defaults to off.
  • key: Path to your certificate key. Defaults to scrapy-do.key.
  • cert: Path to your certificate. Defaults to scrapy-do.crt.
  • auth: The authentication switch. Scrapy Do uses the digest authentication method, so your password is never transmitted over the network and authentication can be used even without TLS. Defaults to off.
  • auth-db: Path to your authentication database file. The file contains username-password pairs, one per line, with the two parts separated by a colon (:), e.g. myusername:mypassword. Defaults to auth.db.
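
For example, you could create an authentication database with a single (placeholder) user and, with auth switched on and auth-db pointing at this file, authenticate using curl’s digest mode:

$ cat > auth.db << EOF
> myusername:mypassword
> EOF
$ curl -s --digest -u myusername:mypassword \
       "http://localhost:7654/status.json" | jq -r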

Example configuration

[scrapy-do]
project-store = /var/scrapy-do
job-slots = 5
completed-cap = 250

[web]
interface = 10.8.0.1
port = 9999

https = on
key = /etc/scrapy-do/scrapy-do.key
cert = /etc/scrapy-do/scrapy-do.crt

auth = on
auth-db = /etc/scrapy-do/auth.db

REST API

This section describes the REST API provided by Scrapy Do. The responses to all of the requests except for get-log are JSON dictionaries. Error responses look like this:

{
  "msg": "Error message",
  "status": "error"
}

Successful responses have the status part set to ok and a variety of query-dependent keys described below. The request examples use curl and jq.

status.json

Get information about the daemon and its environment.

  • Method: GET

Example request:

$ curl -s "http://localhost:7654/status.json" | jq -r
{
  "status": "ok",
  "memory-usage": 39.89453125,
  "cpu-usage": 0,
  "time": "2017-12-11 15:20:42.415793",
  "timezone": "CET; CEST",
  "hostname": "host",
  "uptime": "1d 12m 24s",
  "jobs-run": 24,
  "jobs-successful": 24,
  "jobs-failed": 0,
  "jobs-canceled": 0
}

push-project.json

Push a project archive to the server, replacing an existing project of the same name if one is already present.

  • Method: POST

  • Parameters:

    • name - name of the project
    • archive - a binary buffer containing the project archive

    Example request:

    $ curl -s http://localhost:7654/push-project.json \
           -F name=quotesbot \
           -F archive=@quotesbot.zip | jq -r
    
    {
      "status": "ok",
      "spiders": [
        "toscrape-css",
        "toscrape-xpath"
      ]
    }
    

list-projects.json

Get a list of the projects registered with the server.

  • Method: GET

    $ curl -s http://localhost:7654/list-projects.json | jq -r
    
    {
      "status": "ok",
      "projects": [
        "quotesbot"
      ]
    }
    

list-spiders.json

List spiders provided by the given project.

  • Method: GET

  • Parameters:

    • project - name of the project

    Example request:

    $ curl -s "http://localhost:7654/list-spiders.json?project=quotesbot" | jq -r
    
    {
      "status": "ok",
      "project": "quotesbot",
      "spiders": [
        "toscrape-css",
        "toscrape-xpath"
      ]
    }
    

schedule-job.json

Schedule a job.

  • Method: POST

  • Parameters:

    • project - name of the project
    • spider - name of the spider
    • when - a scheduling spec, see Scheduling Specs.

    Example request:

    $ curl -s http://localhost:7654/schedule-job.json \
           -F project=quotesbot \
           -F spider=toscrape-css \
           -F "when=every 10 minutes" | jq -r
    
    {
      "status": "ok",
      "identifier": "5b30c8a2-42e5-4ad5-b143-4cb0420955a5"
    }
    

list-jobs.json

Get information about a job or jobs.

  • Method: GET
  • Parameters (one required):
    • status - status of the jobs to list, see Jobs; additionally, ACTIVE and COMPLETED are accepted to get lists of jobs in the related statuses.
    • id - id of the job to list

Query by status:

$ curl -s "http://localhost:7654/list-jobs.json?status=ACTIVE" | jq -r
{
  "status": "ok",
  "jobs": [
    {
      "identifier": "5b30c8a2-42e5-4ad5-b143-4cb0420955a5",
      "status": "SCHEDULED",
      "actor": "USER",
      "schedule": "every 10 minutes",
      "project": "quotesbot",
      "spider": "toscrape-css",
      "timestamp": "2017-12-11 15:34:13.008996",
      "duration": null
    },
    {
      "identifier": "451e6083-54cd-4628-bc5d-b80e6da30e72",
      "status": "SCHEDULED",
      "actor": "USER",
      "schedule": "every minute",
      "project": "quotesbot",
      "spider": "toscrape-css",
      "timestamp": "2017-12-09 20:53:31.219428",
      "duration": null
    }
  ]
}

Query by id:

$ curl -s "http://localhost:7654/list-jobs.json?id=317d71ea-ddea-444b-bb3f-f39d82855e19" | jq -r
{
  "status": "ok",
  "jobs": [
    {
      "identifier": "317d71ea-ddea-444b-bb3f-f39d82855e19",
      "status": "SUCCESSFUL",
      "actor": "SCHEDULER",
      "schedule": "now",
      "project": "quotesbot",
      "spider": "toscrape-css",
      "timestamp": "2017-12-11 15:40:39.621948",
      "duration": 2
    }
  ]
}

cancel-job.json

Cancel a job.

  • Method: POST

  • Parameters:

    • id - id of the job to cancel

    Example request:

    $ curl -s http://localhost:7654/cancel-job.json \
           -F id=451e6083-54cd-4628-bc5d-b80e6da30e72 | jq -r
    
    {
      "status": "ok"
    }
    

get-log

Retrieve the log file of a job that has either completed or is still running.

  • Method: GET

Get the log of the standard output:

$ curl -s http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.out

Get the log of the standard error output:

$ curl -s http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.err
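
For example, you can combine list-jobs.json with jq to fetch the standard output log of one of the completed jobs; this sketch assumes the same /get-log/data/ prefix as in the examples above:

$ ID=$(curl -s "http://localhost:7654/list-jobs.json?status=COMPLETED" | jq -r '.jobs[0].identifier')
$ curl -s "http://localhost:7654/get-log/data/${ID}.out"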
