Scrapy ドキュメント¶
このドキュメントには、Scrapyについて学ぶべきすべての事柄が書かれています。
ヘルプ¶
以下のトピックがトラブル解決の手助けとなるでしょう。
- FAQ をご覧ください。一般的な質問に対する回答があります。
- 特定の情報をお探しですか? 索引 または モジュール索引 をご覧ください。
- scrapyタグを使用して StackOverflow で質問や検索をしてください。
- Scrapy subreddit で質問や検索をしてください。
- scrapy-users メーリングリスト のアーカイブから質問を検索してください。
- #scrapy IRC チャンネル で質問してください。
- Issue tracker でScrapyのバグを報告してください。
最初のステップ¶
Scrapyの概要¶
Scrapyは、Webサイトをクロールし、データマイニング、情報処理、アーカイブなどの幅広い有用なアプリケーションに使用できる構造化データを抽出するためのアプリケーションフレームワークです。
Scrapyはもともと Webスクレイピング 用に設計されていましたが、API(Amazon Associates Web Services など)を使用したデータの抽出や、汎用のWebクローラーとしても使用できます。
Spiderサンプルのウォークスルー¶
Scrapyが提供してくれるものを示すために、簡単なSpiderの実行例を紹介します。
ウェブサイト http://quotes.toscrape.com からページネーションを辿って引用を抽出するSpiderのコードは次のとおりです:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
このコードを quotes_spider.py
といった名前で保存し、runspider
コマンドでSpiderを実行してください:
scrapy runspider quotes_spider.py -o quotes.json
終了すると引用リストが quotes.json
というJSON形式のファイルに保存されます。これは次のような内容になります(読みやすくするために再フォーマットしています)。
[{
"author": "Jane Austen",
"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
"author": "Groucho Marx",
"text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
},
{
"author": "Steve Martin",
"text": "\u201cA day without sunshine is like, you know, night.\u201d"
},
...]
何が起きたのか?¶
scrapy runspider quotes_spider.py
コマンドを実行すると、Scrapyは該当のSpider定義を探して、クローラーエンジンを通して実行します。
クロールが開始すると、 start_urls
属性で定義されたURL(この場合は humor カテゴリーのURLのみ)にリクエストを行い、responseオブジェクトを引数としてデフォルトのコールバックメソッド parse
が呼び出されます。parse
コールバックでは、CSSセレクタを使用して引用テキストの要素をループし、抽出された引用と作者でPythonのdictを生成し、次のページへのリンクを探し、コールバックと同じ parse
メソッドを使用して次ページへのリクエストをスケジューリングします。
ここにScrapyの利点の1つがあります。リクエストは スケジューリングされ、非同期に処理されます 。つまり、リクエストが完了するのを待つ必要はなく、その間に別のリクエストを送信したり、他の処理を行うことができます。これは、リクエストが失敗した場合や、処理中にエラーが発生した場合でも、他のリクエストが続行できることを意味します。
これにより、非常に高速なクロールが可能になります(複数の同時リクエストをフォールトトレラントな方法で送信できます)。また、Scrapyを使用すると、 いくつかの設定 でクロールの礼儀正しさを制御できます。各リクエストごとのダウンロード遅延を設定したり、ドメインまたはIPごとに並行リクエストの量を制限したり、クロール速度を自動的に調整する拡張機能を使用することができます。
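参考までに、これらの「礼儀正しさ」に関する設定を settings.py で指定する場合の最小限のスケッチを示します(値は説明のための一例です)。
# settings.py の一例(値は説明用の仮のものです)
DOWNLOAD_DELAY = 1.0                    # 各リクエスト間のダウンロード遅延(秒)
CONCURRENT_REQUESTS_PER_DOMAIN = 8      # ドメインごとの並行リクエスト数の上限
CONCURRENT_REQUESTS_PER_IP = 0          # IPごとの上限(0は無効。0以外を設定するとドメインごとの上限より優先)
AUTOTHROTTLE_ENABLED = True             # クロール速度を自動的に調整する拡張機能(AutoThrottle)を有効化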
注釈
フィードのエクスポート を使用してJSONファイルを生成したり、エクスポート形式(XMLやCSVなど)やストレージバックエンド(FTPや Amazon S3 など)を簡単に変更することができます。Itemパイプライン を作成してItemをデータベースに格納することもできます。
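例えば、コマンドラインの -o オプションでは、指定したファイルの拡張子によって出力形式が切り替わります。ストレージバックエンドを設定で変更する場合のスケッチも併せて示します(バケット名やURIは仮のもので、設定名はこのドキュメントが対象とするバージョンを想定しています)。
scrapy runspider quotes_spider.py -o quotes.csv   # CSV形式で出力
scrapy runspider quotes_spider.py -o quotes.xml   # XML形式で出力
# settings.py でフィードの出力先をAmazon S3に変更する場合のスケッチ(URIは仮のものです)
FEED_FORMAT = 'json'
FEED_URI = 's3://my-example-bucket/quotes/%(time)s.json'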
それ以外には?¶
ここまで、Scrapyを使用してウェブサイトからアイテムを抽出して保存する方法を見てきましたが、これはほんの一部に過ぎません。Scrapyは、スクレイピングを簡単かつ効率的に行うための強力な機能を、以下のように多数提供しています。
- 拡張CSSセレクタとXPath式を使用し、正規表現を使用して抽出するヘルパーメソッドを併用して、HTML/XMLのソースからデータを 選択および抽出 する組み込みのサポート。
- CSSやXPath式を試してデータを抽出するための インタラクティブなシェル (IPython対応)。Spiderの作成やデバッグに非常に便利です。
- 複数の形式(JSON, CSV, XML)で フィードのエクスポート を生成し、それらを複数のバックエンド(FTP、S3、ローカルファイルシステム)に格納するための組み込みサポート。
- 外国語の、非標準な、あるいは壊れたエンコーディング宣言を扱うための、強力なエンコーディングサポートと自動検出。
- 強力な拡張性 により、 シグナル と明確に定義されたAPI(ミドルウェア、 拡張機能 、 パイプライン )を使用して、プラグインとして独自の機能を追加することができます。
- 以下を処理するための、幅広い組み込み拡張機能とミドルウェア:
- クッキーとセッション処理
- 圧縮、認証、キャッシュなどのHTTP機能
- ユーザーエージェントのなりすまし
- robots.txt
- クロール深度の制限
- その他
- クローラーの内部調査とデバッグのために、Scrapyプロセス内で動作するPythonコンソールにフックする Telnetコンソール
- さらに、サイトマップ やXML/CSVフィードからWebサイトをクロールするための再利用可能なSpider、抽出したアイテムに関連付けられた画像(または他のメディア)を 自動的にダウンロード するメディアパイプライン、DNSリゾルバのキャッシュなど、他にもたくさんの機能があります。
次のステップは?¶
次のステップは、 Scrapyをインストール して、チュートリアル に従って本格的なScrapyプロジェクトを作成し、そして コミュニティに参加する ことです。興味を持って頂いて感謝します!
インストールガイド¶
Scrapyのインストール¶
Scrapyは、CPython(デフォルトのPython実装)のPython 2.7とPython 3.4以上、またはPyPy(5.9以降)で動作します。
Anaconda または Miniconda を使用している場合は、Linux、Windows、およびOS X用の最新パッケージを含む conda-forge チャネルからパッケージをインストールできます。
conda
を使用してScrapyをインストールするには、次のコマンドを実行します。
conda install -c conda-forge scrapy
Pythonパッケージのインストールに慣れている場合は、PyPIからScrapyとその依存パッケージをインストールすることができます。
pip install Scrapy
ご使用のオペレーティングシステムによっては、Scrapyの依存パッケージのコンパイルに関する問題を解決する必要があるため、プラットフォーム別のインストール を必ず確認してください。
また、virtualenv でScrapyをインストールして、システムパッケージとの衝突を避けることを強くお勧めします。
For more detailed and platform-specific instructions, as well as troubleshooting information, read on.
お役立ち情報¶
Scrapyは純粋なPythonで書かれており、いくつかの重要なPythonパッケージに依存しています。
- lxml : 効率的なXMLとHTMLのパーサー
- parsel : lxmlの上に書かれたHTML / XMLデータ抽出ライブラリ
- w3lib : URLとWebページのエンコーディングを扱うための多目的ヘルパー
- twisted : 非同期通信フレームワーク
- cryptography と pyOpenSSL : さまざまなネットワークレベルのセキュリティ要因に対処
Scrapyがテストされている最小バージョンは次のとおりです。
- Twisted 14.0
- lxml 3.4
- pyOpenSSL 0.14
Scrapyはこれよりも古いバージョンでも動作するかもしれませんが、テストされていないため動作の保証はされません。
パッケージのいくつかは、プラットフォーム別に追加のインストール手順が必要な非Pythonパッケージに依存しています。プラットフォーム別のインストール を確認してください。
依存関係に関連する問題が発生した場合は、以下のインストール手順を参照してください。
仮想環境を使用する(推奨)¶
要約:どんなプラットフォームでも仮想環境内にScrapyをインストールすることをお勧めします。
Pythonパッケージは、グローバル(システム全体)またはユーザごとにインストールできますが、Scrapyをシステム全体にインストールすることはお勧めしません。
そのかわり、いわゆる「仮想環境」( virtualenv )内にScrapyをインストールすることをお勧めします。virtualenvを使うと、すでにインストールされているPythonパッケージと競合することなく、pip
で( sudo
などを使わずに)パッケージをインストールできます。
仮想環境を使い始めるには、virtualenvのインストール手順 を参照してください。virtualenvをグローバルにインストールする(こちらはグローバルにインストールしたほうが役に立ちます)には、以下を実行する必要があります。
$ [sudo] pip install virtualenv
virtualenvの作成方法については、 ユーザーガイド を参照してください。
注釈
LinuxまたはOS Xを使用する場合、 virtualenvwrapper はvirtualenvを作成するための便利なツールです。
virtualenvを作成したら、他のPythonパッケージと同様に、 pip
でscrapyをインストールすることができます(事前にインストールする必要のあるPython以外の依存パッケージについては、 プラットフォーム固有のガイド を参照してください)。
virtualenvはデフォルトでPython 2、またはPython 3を使用するように作成できます。
- Python 3の場合は、Python 3のvirtualenv内にScrapyをインストールしてください。
- また、Python 2の場合は、Python 2のvirtualenv内にScrapyをインストールしてください。
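例えば、Python 3のvirtualenvを作成して、その中にScrapyをインストールする場合の一例を示します(環境名 scrapy-env は仮のものです)。
$ virtualenv -p python3 scrapy-env    # Python 3を使うvirtualenvを作成
$ source scrapy-env/bin/activate      # 仮想環境を有効化
(scrapy-env) $ pip install Scrapy     # sudoなしでScrapyをインストール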
プラットフォーム別のインストール¶
Windows¶
pipを使用してWindowsにScrapyをインストールすることも可能ですが、Anaconda または Miniconda をインストールし、conda-forge チャネルからパッケージを使用することをお勧めします。これでほとんどのインストールの問題を回避できます。
Anaconda または Miniconda をインストールしたら、Scrapyを次のようにインストールします。
conda install -c conda-forge scrapy
Ubuntu 14.04以上¶
Scrapyは現在、十分に新しいバージョンのlxml、twisted、pyOpenSSLでテストされており、最近のUbuntuディストリビューションと互換性があります。Ubuntu 14.04のような古いバージョンのUbuntuもサポートしていますが、TLS接続に潜在的な問題があります。
Ubuntuで提供されている python-scrapy
パッケージを 使用しないでください。古いバージョンであり、最新のScrapyに追いつくのが遅いです。
Ubuntu(またはUbuntuベースの)システムにScrapyをインストールするには、以下の依存パッケージをインストールする必要があります。
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
- python-dev, zlib1g-dev, libxml2-dev, libxslt1-dev は lxml に必要です
- libssl-dev と libffi-dev は cryptography に必要です
Python 3にscrapyをインストールする場合は、開発用パッケージも必要です。
sudo apt-get install python3 python3-dev
その後、 virtualenv の中で pip
を使ってScrapyをインストールすることができます。
pip install scrapy
注釈
Debian Jessie (8.0) 以上でScrapyをインストールする場合も、同様のパッケージを使用することができます。
Mac OS X¶
Scrapyの依存パッケージの導入には、Cコンパイラと開発ヘッダーが必要です。OS Xでは、これは通常AppleのXcode開発ツールによって提供されます。Xcodeのコマンドラインツールをインストールするには、ターミナルウィンドウを開き、次のコマンドを実行します。
xcode-select --install
pip
がシステムパッケージを更新しないようにする 既知の問題 があります。この問題は、Scrapyとその依存関係を正常にインストールするために対処する必要がありますので、いくつかの解決策を示します。
(推奨) システムのPythonを 使用しないでください。システムの他の部分と競合しない新しいバージョンをインストールしてください。 homebrew パッケージマネージャーを使用する場合の手順を次に示します。
- https://brew.sh/ の内容に従って homebrew をインストールします。
- PATH 環境変数を更新して、システムパッケージの前にhomebrewパッケージを使用するようにしてください(デフォルトのシェルとして zsh を使用している場合は .bashrc を .zshrc に読み替えてください)。
echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc
- .bashrc をリロードして、変更が反映されたことを確認します。
source ~/.bashrc
- Pythonをインストールします。
brew install python
- Pythonの最新バージョンには pip がバンドルされているため、別途インストールする必要はありません。そうでない場合は、Pythonをアップグレードしてください。
brew update; brew upgrade python
(オプション) 独立したPython環境にScrapyをインストールします。
この方法は、上記のOS Xの問題の回避策ですが、依存関係を管理するための全体的な良い方法であり、最初の方法を補完することができます。
virtualenv は、Pythonで仮想環境を作成することができるツールです。使用を開始するには、http://docs.python-guide.org/en/latest/dev/virtualenvs/ のようなチュートリアルを読むことをおすすめします。
これらの回避策のいずれかを実行すると、Scrapyをインストールすることができます。
pip install Scrapy
PyPy¶
最新バージョンのPyPyを使用することをお勧めします。テストされたバージョンは5.9.0です。PyPy3では、Linuxのインストールだけがテストされました。
ほとんどのScrapyの依存パッケージは、CPython用にはバイナリのwheelが提供されていますが、PyPy用には提供されていません。OS Xでは、 cryptography の依存パッケージをビルドする際に問題が発生する可能性があります。この問題の解決策は、 ここ に書かれているように、 brew install openssl を実行してから、このコマンドが推奨するフラグをエクスポート(Scrapyをインストールするときのみ必要)することです。Linuxへのインストールには、依存パッケージをインストールする以外に特別な問題はありません。PyPyをWindowsにインストールすることはテストされていません。
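参考までに、OS XでPyPy向けに cryptography をビルドする場合の流れのスケッチを示します。エクスポートするフラグの実際の値は brew install openssl が表示する案内に従ってください(以下のパスは一例です)。
$ brew install openssl
# 以下のパスは一例です。brewの出力に表示される実際のパスに置き換えてください
$ export LDFLAGS="-L/usr/local/opt/openssl/lib"
$ export CPPFLAGS="-I/usr/local/opt/openssl/include"
$ pip install scrapy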
scrapy bench
を実行することで、Scrapyが正しくインストールされているかどうかを確認できます。このコマンドで TypeError: ... got 2 unexpected keyword arguments
などのエラーが発生した場合、setuptoolsがPyPy固有の依存関係を取得できなかったことを意味します。この問題を解決するには、 pip install 'PyPyDispatcher>=2.1.0'
を実行してください。
Troubleshooting¶
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'¶
After you install or upgrade Scrapy, Twisted or pyOpenSSL, you may get an exception with the following traceback:
[…]
File "[…]/site-packages/twisted/protocols/tls.py", line 63, in <module>
from twisted.internet._sslverify import _setAcceptableProtocols
File "[…]/site-packages/twisted/internet/_sslverify.py", line 38, in <module>
TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'
The reason you get this exception is that your system or virtual environment has a version of pyOpenSSL that your version of Twisted does not support.
To install a version of pyOpenSSL that your version of Twisted supports,
reinstall Twisted with the tls
extra option:
pip install twisted[tls]
For details, see Issue #2473.
チュートリアル¶
このチュートリアルでは、既にScrapyがシステムにインストールされていることを想定しています。インストールされていない場合は、 インストールガイド を参照してください。
有名な著者からの引用を掲載する quotes.toscrape.com からスクレイピングしてみましょう。
このチュートリアルでは、以下のタスクについて解説します。
- 新しいScrapyプロジェクトの作成
- サイトをクロールしてデータを抽出する Spider の作成
- コマンドラインを使用して抽出したデータをエクスポート
- 再帰的にリンクをたどるようにSpiderを変更
- 引数の使用
Scrapyは Python で書かれています。Pythonに慣れていない場合は、どのようなことができるかを理解してからのほうがScrapyを最大限に活用できるかもしれません。
If you're already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.
If you're new to programming and want to start with Python, the following books may be useful to you:
- Automate the Boring Stuff With Python
- How To Think Like a Computer Scientist
- Learn Python 3 The Hard Way
You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.
プロジェクトの作成¶
スクレイピングを開始する前に、新しいScrapyプロジェクトをセットアップする必要があります。コードを保存するディレクトリに入って以下を実行してください。
scrapy startproject tutorial
これにより、次の内容の tutorial
ディレクトリが作成されます。
tutorial/
scrapy.cfg # deploy configuration file
tutorial/ # project's Python module, you'll import your code from here
__init__.py
items.py # project items definition file
middlewares.py # project middlewares file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you'll later put your spiders
__init__.py
最初のSpider¶
Spiderは、ScrapyがWebサイト(またはWebサイトのグループ)から情報を抽出するために定義するクラスです。Spiderは scrapy.Spider
をサブクラス化したもので、最初のリクエスト、ページ間のリンクをたどる方法、ダウンロードされたページの内容を解析してデータを抽出する方法などを定義する必要があります。
これは最初のSpiderのコードです。 tutorial/spiders
ディレクトリに quotes_spider.py
という名前のファイルで保存します。
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
ご覧のように、 scrapy.Spider
のサブクラスでいくつかの変数とメソッドを定義しています。
- name : Spiderを識別します。プロジェクト内で一意でなければなりません。つまり、異なるSpiderに対して同じ名前を設定することはできません。
- start_requests() : Spiderがクロールを開始するリクエストのイテラブル(リクエストのリスト、またはジェネレータ関数)を返す必要があります。最初のリクエストから順番に生成されます。
- parse() : 各リクエストによってダウンロードされたレスポンスを処理するためのメソッドです。responseパラメータはページコンテンツを保持する TextResponse のインスタンスであり、それを処理するための役立つメソッドがあります。 parse() メソッドは通常、レスポンスを解析し、取り込まれたデータをdictとして抽出し、新しいURLを見つけ、それらから新しいリクエスト ( Request ) を作成します。
Spiderの実行方法¶
Spiderを動作させるには、プロジェクトの最上位ディレクトリに移動し、次のコマンドを実行します。
scrapy crawl quotes
このコマンドは、先ほど追加した quotes
という名前のSpiderを実行し、 quotes.toscrape.com
ドメインにいくつかのリクエストを送信します。実際に実行すると次のような出力が得られます。
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
カレントディレクトリのファイルをチェックしてみてください。 parse
メソッドにより quotes-1.html と quotes-2.html の2つの新しいファイルが作成されていることに気がつくでしょう。
注釈
HTMLを解析していないのを疑問に思うかも知れませんが、この後すぐにカバーします。
内部で何が起こったのか?¶
Scrapyは、Spiderの start_requests
メソッドによって返された scrapy.Request
オブジェクトをスケジュールします。それぞれのリクエストに対して応答を受け取ると、 Response
オブジェクトをインスタンス化し、それを引数としてリクエストに関連付けられたコールバックメソッド(ここでは parse
メソッド)を呼び出します。
start_requestsメソッドのショートカット¶
URLから scrapy.Request
オブジェクトを生成する start_requests()
メソッドを実装する代わりに、 start_urls
クラス変数でURLのリストを定義できます。このリストは、 start_requests()
のデフォルトの実装として使用され、Spiderの最初のリクエストが作成されます。
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
parse()
メソッドは、明示的に指示していなくても、これらのURLのリクエストを処理するために呼び出されます。これは parse()
がScrapyのデフォルトのコールバックメソッドであり、明示的に割り当てられたコールバックのないリクエストに対して呼び出されるためです。
データの抽出¶
Scrapyでのデータの抽出方法を学ぶには、 Scrapy shell を使ってセレクタを試してみるのが最良の方法です。次を実行してみてください。
scrapy shell 'http://quotes.toscrape.com/page/1/'
注釈
コマンドラインからScrapyシェルを実行するときは、常にURLをクォーテーションで囲むことを忘れないでください。そうしないとクエリを含むURL( &
文字)が動作しません。
Windowsではダブルクォーテーションを使用します。
scrapy shell "http://quotes.toscrape.com/page/1/"
次のようなものが表示されます。
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>>
シェルを使用して、responseオブジェクトで CSS を指定して要素を選択することができます。
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
response.css('title')
を実行すると、XML/HTML要素をラップする Selector
オブジェクトのリストを表す SelectorList
というリストに似たオブジェクトが返されて、さらに細かく選択や抽出を行うためのクエリを実行できます。
このタイトルからテキストを抽出するには、次のようにします。
>>> response.css('title::text').getall()
['Quotes to Scrape']
ここで2つ注意すべき点があります。1つは、 <title>
要素の中のテキストだけを選択するために、CSSのクエリに ::text
を追加したことです。 ::text
を指定しないとタグを含めた完全なtitle要素が得られます。
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
The other thing is that the result of calling .getall()
is a list: it is
possible that a selector returns more than one result, so we extract them all.
When you know you just want the first result, as in this case, you can do:
>>> response.css('title::text').get()
'Quotes to Scrape'
もしくは、次のように書くこともできます。
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
However, using .get()
directly on a SelectorList
instance avoids an IndexError
and returns None
when it doesn't
find any element matching the selection.
ここで注意することがあります。スクレイピングのコードでは、ページに要素が存在しないことが原因で発生するエラーに対しての柔軟性を高めるべきです。そのため、一部の抽出に失敗しても、少なくとも いくつかの データは取得できるようにします。
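例えば、次のスケッチでは、一部の要素がページに存在しなくても例外にはならず、欠けた値は None または既定値として返されます(default= の値は一例です)。
for quote in response.css('div.quote'):
    yield {
        # 一致する要素がなければ get() は None を返すため、処理全体は止まらない
        'text': quote.css('span.text::text').get(),
        # default= を指定すると None の代わりに既定値が返る
        'author': quote.css('small.author::text').get(default='(unknown)'),
    }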
Besides the getall()
and
get()
methods, you can also use
the re()
method to extract using regular
expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
In order to find the proper CSS selectors to use, you might find useful opening
the response page from the shell in your web browser using view(response)
.
You can use your browser developer tools to inspect the HTML and come up
with a selector (see section about Using your browser's Developer Tools for scraping).
Selector Gadget は、選択された要素のCSSセレクタを視覚的にすばやく見つけるツールです。多くのブラウザで動作します。
XPathの簡単な紹介¶
CSS の他に、Scrapyセレクタでは XPath 式をサポートしています。
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
XPath式はとても強力で、Scrapyセレクタの基盤となっています。実際のところCSSセレクタは、内部でXPathに変換されます。シェルのセレクタオブジェクトのテキスト表現をよく読んでみると分かります。
XPath式は、CSSセレクタほど普及していないかもしれませんが、構造を辿るだけでなく、コンテンツを見ることもできます。XPathを使用すると、例えば "Next Page" というテキストを含むリンクを選択できます。このように、XPathはスクレイピングの作業にとても適しています。ですから、CSSセレクタを構築する方法をすでに知っていても、XPathを学ぶことをお勧めします。
ここではXPathについて多くは扱いませんが、 ScrapyセレクタでXPathを使用する方法 で詳しく知ることができます。XPathの詳細については、 例を使ってXPathを学習するチュートリアル や、 「XPathの考え方」を学ぶチュートリアル をお勧めします。
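例えば、前述の「"Next Page" というテキストを含むリンクを選択する」例は、シェルで次のように書けます(対象ページにそのようなリンクがあることを仮定したスケッチです)。
>>> # 要素のテキストに "Next Page" を含む a 要素の href を選択する(リンクの存在は仮定)
>>> response.xpath('//a[contains(., "Next Page")]/@href').get()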
引用と著者の抽出¶
選択と抽出について少し知ることができたので、Webページから引用を抽出するコードを書いて、Spiderを完成させましょう。
http://quotes.toscrape.com の各引用は、次のようなHTML要素で表されます。
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
欲しいデータを抽出する方法を見つけるために、Scrapyシェルを開き、以下を試してみましょう。
$ scrapy shell 'http://quotes.toscrape.com'
引用のHTML要素のセレクタリストを以下のように取得します。
>>> response.css("div.quote")
このクエリによって返された各セレクタのサブ要素に対してさらにクエリを実行できます。最初のセレクタを変数に代入して、CSSセレクタを特定の引用で直接実行できるようにしてみましょう。
>>> quote = response.css("div.quote")[0]
作成した quote
オブジェクトを使って、 title
, author
, tags
を抽出してみましょう。
>>> title = quote.css("span.text::text").get()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
Given that the tags are a list of strings, we can use the .getall()
method
to get all of them:
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
各引用をどのように抽出するかが分かったので、今度はすべての引用の要素を繰り返して取得し、それらをまとめてPythonの辞書に入れてみましょう。
>>> for quote in response.css("div.quote"):
... text = quote.css("span.text::text").get()
... author = quote.css("small.author::text").get()
... tags = quote.css("div.tags a.tag::text").getall()
... print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>
Spiderでデータを抽出する¶
Spiderに戻りましょう。これまではデータを抽出することはなく、HTMLページ全体をローカルファイルに保存するだけでした。上記の抽出ロジックをSpiderに統合してみましょう。
ScrapyのSpiderは通常、ページから抽出されたデータを含む多くの辞書を生成します。これを行うために、コールバックでPythonの yield
キーワードを使用してみます。
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
このSpiderを実行すると、抽出されたデータがログに出力されます。
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
抽出されたデータの保存¶
抽出されたデータを保存する最も簡単な方法は、次のコマンドによって Feed exports を使用することです。
scrapy crawl quotes -o quotes.json
これで JSON でシリアライズされた、抽出されたすべてのアイテムを含む quotes.json
ファイルが生成されます。
歴史的な理由により、Scrapyはその内容を上書きするのではなく、指定されたファイルに追加します。ファイルを削除せずにこのコマンドを2回実行すると、JSONファイルが壊れてしまいます。
JSON Lines のような他のフォーマットを使うこともできます。
scrapy crawl quotes -o quotes.jl
JSON Lines 形式はストリームライクなので便利です。簡単に新しいレコードを追加できます。2回実行してもJSONのような問題はありません。また、各レコードが別々の行であるため、メモリにすべてを収める必要なく大きなファイルを処理できます。また JQ のような、役に立つコマンドラインツールがあります。
このチュートリアルのような小さなプロジェクトでは、これで十分です。しかし、抽出したアイテムでより複雑な作業を実行する場合は、 Itemパイプライン を作成することができます。Itemパイプライン用のプレースホルダは、プロジェクトの作成時に tutorial/pipelines.py
に作成されています。抽出したアイテムを保存するだけの場合は、Itemパイプラインを実装する必要はありません。
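参考として、抽出したアイテムをJSON Lines形式のファイルに書き出すだけのItemパイプラインの最小限のスケッチを示します(クラス名や出力ファイル名は仮のものです)。有効にするには settings.py の ITEM_PIPELINES に登録します。
# tutorial/pipelines.py のスケッチ(クラス名・ファイル名は仮のものです)
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # アイテムを1行のJSONとして書き出し、後続のパイプラインのためにitemを返す
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

# settings.py での登録例(数値は実行順を決める優先度)
# ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300}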
リンクをたどる¶
http://quotes.toscrape.com の最初の2ページからデータを抽出するのではなく、サイトのすべてのページから引用を抽出したいとします。
ページからデータを抽出する方法は理解したので、そこからリンクをたどる方法を見てみましょう。
最初にすることは、たどりたいページへのリンクを抽出することです。ページを調べると、次のマークアップを持つ次ページへのリンクがあることがわかります。
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
シェルでそれを抽出してみましょう。
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
これはa要素を取得しますが、欲しいのは href
属性です。そのために、Scrapyは属性の内容を選択できるCSS拡張をサポートしています。
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
There is also an attrib
property available
(see Selecting element attributes for more):
>>> response.css('li.next a').attrib['href']
'/page/2'
次のページへのリンクを再帰的にたどってデータを抽出するように修正したSpiderを以下に示します。
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
parse()
メソッドはデータを抽出した後、次ページへのリンクを探し、 urljoin()
メソッドを使用して絶対URLを作成し(リンクは相対URLであるため)、コールバックを登録した次のページへの新しいリクエストをyieldし、それを繰り返してすべてのページをクロールします。
ここにリンクをたどるScrapyのメカニズムを見ることができます。コールバックメソッドのリクエストをyieldすると、Scrapyはそのリクエストを送信するようスケジュールし、リクエストが終了したときに実行されるコールバックメソッドを登録します。
これを利用して、定義したルールに従ってリンクをたどる複雑なクローラーを構築し、訪れるページに応じて異なる種類のデータを抽出することができます。
この例では、次ページへのリンクが見つからなくなるまで、一種のループを作成します。これはページネーションのあるブログ、フォーラム、その他のサイトをクロールするのに便利です。
リクエストを作成するためのショートカット¶
Requestオブジェクトを作成するためのショートカットとして、 response.follow
を使用することができます。
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
scrapy.Requestとは異なり、 response.follow
は相対URLをサポートしているため、urljoinを呼び出す必要はありません。 response.follow
は単にRequestインスタンスを返すことに注意してください。Requestをyieldする必要があります。
response.follow
に文字列の代わりにセレクタを渡すこともできます。このセレクタは必要な属性を抽出する必要があります。
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
<a>
要素にはショートカットがあり、 response.follow
はそのhref属性を自動的に使います。これによりコードをさらに短縮することができます。
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
注釈
response.follow(response.css('li.next a'))
は有効ではありません。 response.css
は、セレクタのすべての結果を持つリストのようなオブジェクトを返すためです。上の例のように for
ループ、または response.follow(response.css('li.next a')[0])
ならば問題ありません。
より多くの例とパターン¶
コールバックとリンクをたどる説明のために、別のSpiderを示します。今回は、著者の情報を集めるためのものです。
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
このSpiderはメインページから始まり、 parse_author
コールバックによってすべての著者ページへのリンクと、前と同様に parse
コールバックによってページネーションリンクをたどります。
ここではコードをより短くするために、 response.follow
に位置引数としてコールバックを渡しています。この方法でも scrapy.Request
は動作します。
parse_author
コールバックは、CSSクエリからデータを抽出してクリーンアップするヘルパー関数を定義し、著者データをPythonのdictにしてyieldします。
このSpiderが示すもう1つの面白い点として、同じ著者の引用がたくさんあっても、同じ著者ページを複数回訪問する心配がありません。デフォルトでは、Scrapyはすでに訪問したURLへの重複したリクエストをフィルタリングします。これにより、プログラミングミスが原因でサーバーに過度の負荷がかかるという問題を回避できます。この動作は、 DUPEFILTER_CLASS の設定で変更できます。
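参考までに、この動作に関係する設定と、リクエスト単位で重複フィルタを回避する方法のスケッチを示します(値は一例です)。
# settings.py: 重複フィルタの実装クラスを差し替える場合(以下は標準のデフォルト値)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# 特定のリクエストだけ重複フィルタを無視したい場合は dont_filter=True を指定する
yield scrapy.Request(url, callback=self.parse_author, dont_filter=True)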
これで、あなたがScrapyでリンクをたどることとコールバックの仕組みをよく理解できるように願っています。
リンクをたどるメカニズムを活用したもう1つの例として、小さなルールエンジンを実装した汎用的なSpiderの CrawlSpider
クラスをチェックしてみてください。
また、複数のページのデータを持つアイテムを作成する一般的なパターンとして、 追加のデータをコールバックに渡すためのトリック を参照してください。
引数の使用¶
Spiderにコマンドライン引数を渡すには、 -a
オプションを使用します。
scrapy crawl quotes -o quotes-humor.json -a tag=humor
これらの引数はSpiderの __init__
メソッドに渡され、デフォルトでSpiderのインスタンス変数になります。
以下の例では、 tag
引数に指定された値が self.tag
を介して利用可能になります。これによりSpiderが引数に基づいてURLを構築し、特定のタグで引用を絞り込むことができます。
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
tag=humor
引数をこのSpiderに渡すと、 humor
タグのURL (http://quotes.toscrape.com/tag/humor
) のみを訪問することに気づくでしょう。
スパイダー引数の扱いについては、 こちら をご覧ください。
次のステップ¶
このチュートリアルでは、Scrapyの基本についてのみ説明しましたが、ここでは触れられていない多くの機能があります。重要なものの概要については、 Scrapyの概要 の それ以外には? セクションをチェックしてください。
Scrapyの基本コンセプト セクションからコマンドラインツール、Spider、セレクタ、および抽出されたデータのモデリングのような、チュートリアルでは扱っていない事柄についてさらに詳しく知ることができます。サンプルプロジェクトを試してみたい場合は、 実例 セクションをチェックしてみてください。
実例¶
物事を学ぶ上で最善の方法は実例を真似ることであり、これはScrapyも例外ではありません。そのため、quotesbot というScrapyのプロジェクトがあります。このプロジェクトでScrapyを試し、学ぶことができます。このプロジェクトには http://quotes.toscrape.com に対する2つのSpiderが含まれています。一方はCSSセレクタを使用し、もう一方はXPath式を使用しています。
quotesbot プロジェクトは https://github.com/scrapy/quotesbot から利用できます。より詳しい情報については quotesbot のREADMEにあります。
もしあなたがgitに慣れているなら、コードをチェックアウトすることができます。あるいは ここ をクリックしてzip形式でプロジェクトをダウンロードすることもできます。
Scrapyの基本コンセプト¶
コマンドラインツール¶
バージョン 0.10 で追加.
Scrapyは scrapy
コマンドラインツールによって制御できます。ここでは、「コマンド」または「Scrapyコマンド」と呼ばれるサブコマンドと区別するために、「Scrapyツール」と呼びます。
Scrapyツールは複数の目的のためにいくつかのコマンドを提供し、それぞれ異なる引数とオプションのセットを受け入れます。
( scrapy deploy
コマンドは、スタンドアロンの scrapyd-deploy
があるので、バージョン1.0で削除されました。 Deploying your project を参照してください。)
環境設定¶
Scrapyは、以下の標準的な場所にあるini形式の scrapy.cfg
ファイルを探します。
- /etc/scrapy.cfg または c:\scrapy\scrapy.cfg (システム全体)
- ユーザー全体の設定として ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) と ~/.scrapy.cfg
- Scrapyプロジェクトのルートの scrapy.cfg (次のセクションを参照)
これらのファイルの設定は、優先順位によりマージされます。ユーザー定義の値はシステム全体の設定よりも優先され、プロジェクトの設定が定義されている場合は、他のすべての設定を上書きします。
また、Scrapyの設定として使えるいくつかの環境変数があります。
- SCRAPY_SETTINGS_MODULE ( Designating the settings を参照)
- SCRAPY_PROJECT ( Sharing the root directory between projects を参照)
- SCRAPY_PYTHON_SHELL ( Scrapy shell を参照)
Scrapyプロジェクトのデフォルト構成¶
コマンドラインツールとそのサブコマンドを解説する前に、まずScrapyプロジェクトのディレクトリ構造について理解しておきましょう。
すべてのScrapyプロジェクトは、デフォルトで以下のようなファイル構造になっています。変更することもできます。
scrapy.cfg
myproject/
__init__.py
items.py
middlewares.py
pipelines.py
settings.py
spiders/
__init__.py
spider1.py
spider2.py
...
scrapy.cfg
ファイルが存在するディレクトリは、 プロジェクトのルートディレクトリ と呼ばれます。このファイルには、プロジェクト設定を定義するPythonモジュールの名前が含まれています。以下に例を示します。
[settings]
default = myproject.settings
Sharing the root directory between projects¶
A project root directory, the one that contains the scrapy.cfg
, may be
shared by multiple Scrapy projects, each with its own settings module.
In that case, you must define one or more aliases for those settings modules
under [settings]
in your scrapy.cfg
file:
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
By default, the scrapy
command-line tool will use the default
settings.
Use the SCRAPY_PROJECT
environment variable to specify a different project
for scrapy
to use:
$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
scrapy
ツールの使用¶
引数を指定せずにScrapyツールを実行すると、使用方法と利用可能なコマンドが表示されます。
Scrapy X.Y - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
crawl Run a spider
fetch Fetch a URL using the Scrapy downloader
[...]
Scrapyプロジェクトの中にいる場合、最初の行は現在アクティブなプロジェクトを表します。この例では、プロジェクトの外から実行されました。プロジェクトの中から実行すると、次のような内容が出力されます。
Scrapy X.Y - project: myproject
Usage:
scrapy <command> [options] [args]
[...]
プロジェクトの作成¶
あなたが scrapy
ツールで最初に行うことは、Scrapyプロジェクトを作成することです。
scrapy startproject myproject [project_dir]
これにより、 project_dir
ディレクトリの下にScrapyプロジェクトが作成されます。 project_dir
を指定しなければ、 project_dir
は myproject
と指定した場合と同じになります。
次に、作成したプロジェクトディレクトリの中に入ります。
cd project_dir
これで、 scrapy
コマンドを使用してプロジェクトの管理やコントロールをする準備が整いました。
プロジェクトのコントロール¶
プロジェクト内から scrapy
ツールを使用して、プロジェクトのコントロールや管理をします。
たとえば、新しいSpiderを作成するには次のようにします。
scrapy genspider mydomain mydomain.com
crawl
など、一部のScrapyコマンドはScrapyプロジェクト内から実行する必要があります。プロジェクト内から実行しなければならないコマンドについては、 コマンドリファレンス を参照してください。
また、いくつかのコマンドは、プロジェクト内から実行するときに、少し違った動作をすることがあります。たとえば、フェッチされたURLが特定のSpiderに関連付けられている場合、fetchコマンドはSpiderの属性でデフォルトの属性を上書きします(例えば、ユーザーエージェントを上書きする user_agent
属性など)。 fetch
コマンドは、Spiderがどのようにページをダウンロードしているかを調べるために使用されるため、これは意図的なものです。
使用可能なコマンド¶
このセクションでは、使用可能な組み込みコマンドのリストと、使用例を示します。次のコマンドを実行することによって、各コマンドについての詳細情報をいつでも得ることができます。
scrapy <command> -h
次のコマンドで、使用可能なすべてのコマンドを表示できます。
scrapy -h
コマンドは、Scrapyプロジェクト内でのみ動作するプロジェクト固有のコマンドと、Scrapyプロジェクトの外でも動作するグローバルなコマンドの2種類があります。グローバルなコマンドをプロジェクト内から実行すると、プロジェクトで上書きされた設定を使用するため、動作が若干異なる場合があります。
グローバルなコマンド:
- startproject
- genspider
- settings
- runspider
- shell
- fetch
- view
- version
プロジェクト固有のコマンド:
- crawl
- check
- list
- edit
- parse
startproject¶
- 構文:
scrapy startproject <project_name> [project_dir]
- プロジェクト: 不要
project_dir
ディレクトリの下に project_name
という名前の新しいScrapyプロジェクトを作成します。 project_dir
が指定されていない場合、 project_dir
は project_name
と同じになります。
使用例:
$ scrapy startproject myproject
genspider¶
- 構文:
scrapy genspider [-t template] <name> <domain>
- プロジェクト: 不要
新しいSpiderをカレントフォルダ、またはプロジェクト内から呼び出された場合は現在のプロジェクトの spiders
フォルダに作成します。 <name>
はSpiderの名前、 <domain>
はSpiderの変数 allowed_domains
および start_urls
を生成するために使用されます。
使用例:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider example example.com
Created spider 'example' using template 'basic'
$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
これはあらかじめ定義されたテンプレートに基づいてSpiderを作成する便利なショートカットコマンドですが、Spiderを作成する唯一の方法ではありません。このコマンドを使用する代わりに、Spiderのソースファイルを自分で作成することもできます。
crawl¶
- 構文:
scrapy crawl <spider>
- プロジェクト: 要
Spiderを使用してクロールを開始します。
使用例:
$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
check¶
- 構文:
scrapy check [-l] <spider>
- プロジェクト: 要
contractのチェックを実行します。
使用例:
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
list¶
- 構文:
scrapy list
- プロジェクト: 要
現在のプロジェクトで使用可能なすべてのSpiderを一覧表示します。出力は1行につき1つのSpiderです。
使用例:
$ scrapy list
spider1
spider2
edit¶
- 構文:
scrapy edit <spider>
- プロジェクト: 要
EDITOR
環境変数で定義されたエディタまたは、未設定の場合 EDITOR
設定を使用して、指定されたSpiderを編集します。
このコマンドは、最も一般的なケースの便利なショートカットとして提供されています。もちろん開発者は、Spiderの作成やデバッグをするツールやIDEを自由に選択することができます。
使用例:
$ scrapy edit spider1
fetch¶
- 構文:
scrapy fetch <url>
- プロジェクト: 不要
Scrapyダウンローダーを使用してURLをダウンロードし、標準出力にその内容を出力します。
このコマンドの興味深い点は、Spiderのダウンロードするやり方でページを取得することです。たとえば、Spiderがユーザーエージェントを上書きする USER_AGENT
属性を持っている場合は、それを使用します。
ですから、このコマンドはSpiderが特定のページをどのようにフェッチするかを「見る」ために使うことができます。
プロジェクト外で使用される場合は、Spiderごとの特定の動作は適用されず、デフォルトのScrapyダウンローダーの設定が使用されます。
サポートされるオプション:
- --spider=SPIDER : Spiderの自動検出をバイパスし、特定のSpiderを強制的に使用する
- --headers : レスポンスボディの代わりにレスポンスのHTTPヘッダーを表示する
- --no-redirect : HTTP 3xxリダイレクトに従わない(デフォルトでは従う)
使用例:
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263 '],
'Connection': ['close '],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}
view¶
- 構文:
scrapy view <url>
- プロジェクト: 不要
Spiderがそれを「見る」ように、ブラウザで指定されたURLを開きます。Spiderは通常のユーザーとは違うページを表示することがあるため、Spiderが何を見ているかを確認し、期待通りのものであることを確認することができます。
サポートされるオプション:
- --spider=SPIDER : Spiderの自動検出をバイパスし、特定のSpiderを強制的に使用する
- --no-redirect : HTTP 3xxリダイレクトに従わない(デフォルトでは従う)
使用例:
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
shell¶
- 構文:
scrapy shell [url]
- プロジェクト: 不要
URLが指定されている場合はそのURL、指定されていない場合は空のScrapyシェルを開始します。 ./
や ../
で始まる相対パス、または絶対ファイルパスでのUNIX形式のローカルファイルパスをサポートします。詳細は Scrapy shell を参照してください。
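ローカルファイルを指定する場合の一例を示します(パスは仮のものです)。
# 相対パス(./ や ../ で始める)と絶対パスの例
scrapy shell ./path/to/file.html
scrapy shell /absolute/path/to/file.html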
サポートされるオプション:
- --spider=SPIDER : Spiderの自動検出をバイパスし、特定のSpiderを強制的に使用する
- -c code : シェルでコードを評価し、結果を出力して終了する
- --no-redirect : HTTP 3xxリダイレクトに従わない(デフォルトでは従う)。これは、コマンドラインで引数として渡されたURLにのみ影響します。シェル内では、 fetch(url) はデフォルトでHTTPリダイレクトに従います。
使用例:
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
parse¶
- 構文:
scrapy parse <url> [options]
- プロジェクト: 要
指定されたURLを取得し、それを処理するSpiderでパースします。 --callback
オプションがあればそのメソッドを使用し、なければ parse
を使用します。
サポートされるオプション:
- --spider=SPIDER : Spiderの自動検出をバイパスし、特定のSpiderを強制的に使用する
- --a NAME=VALUE : Spider引数を設定する(繰り返し可能)
- --callback または -c : レスポンスをパースするためのコールバックとして使用するSpiderのメソッド
- --meta または -m : コールバックリクエストに渡される追加のリクエストメタ情報。これは有効なJSON文字列でなければなりません。例: --meta='{"foo" : "bar"}'
- --pipelines : パイプラインを通じてItemを処理する
- --rules または -r : CrawlSpider のルールを使用して、レスポンスのパースに使用するコールバック(Spiderのメソッド)を検出する
- --noitems : 抽出したItemを表示しない
- --nolinks : 抽出したリンクを表示しない
- --nocolour : 出力を色分けするためにpygmentsを使わない
- --depth または -d : リクエストを再帰的に追跡する深度レベル(デフォルト: 1)
- --verbose または -v : 深度レベルごとの情報を表示する
使用例:
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': 'Example item',
'category': 'Furniture',
'length': '12 cm'}]
# Requests -----------------------------------------------------------------
[]
settings¶
- 構文:
scrapy settings [options]
- プロジェクト: 不要
Scrapyの設定値を取得します。
プロジェクト内で実行した場合はプロジェクトの設定値が表示され、そうでない場合はScrapyのデフォルトの値が表示されます。
使用例:
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
runspider¶
- 構文:
scrapy runspider <spider_file.py>
- プロジェクト: 不要
プロジェクトを作成していなくても、Pythonファイルに記述したSpiderを実行できます。
使用例:
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
version¶
- 構文:
scrapy version [-v]
- プロジェクト: 不要
Scrapyのバージョンを出力します。 -v
と一緒に使用すると、バグレポートに有用なPython、Twisted、プラットフォームの情報も出力されます。
カスタムプロジェクトコマンド¶
また、 COMMANDS_MODULE
設定を使用してカスタムプロジェクトコマンドを追加することもできます。コマンドの実装例については、 scrapy/commands のScrapyコマンドを参照してください。
COMMANDS_MODULE¶
デフォルト: ''
(空文字列)
カスタムScrapyコマンドを検索するためのモジュールです。これは、Scrapyプロジェクトにカスタムコマンドを追加するために使用されます。
例:
COMMANDS_MODULE = 'mybot.commands'
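参考として、 COMMANDS_MODULE に置くカスタムコマンドの最小限のスケッチを示します(モジュール名や出力内容は仮のものです)。コマンド名は COMMANDS_MODULE 配下のモジュール名から決まります。
# mybot/commands/hello.py のスケッチ(モジュール名・出力内容は仮のものです)
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):

    requires_project = True

    def short_desc(self):
        return "Print a short greeting (sample custom command)"

    def run(self, args, opts):
        # scrapy hello として実行される
        print("hello from a custom command")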
setup.pyエントリポイントを介してコマンドを登録する¶
注釈
これは実験的な機能であり、注意して使用してください。
ライブラリの setup.py
ファイルのエントリポイントに scrapy.commands
セクションを追加することで、外部ライブラリからScrapyコマンドを追加することができます。
次の例では、 my_command
コマンドを追加します。
from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'my_command=my_scrapy_module.commands:MyCommand',
          ],
      },
     )
Spider¶
Spiderは、特定のWebサイト(あるいは複数のひとかたまりのWebサイト)をどのようにスクレイピングするかを定義したクラスです。これには、クロールの方法(どのようにリンクを辿るか)や、ページからどのように構造化されたデータを抽出するか(アイテムのスクレイピング)が含まれます。言い換えれば、Spiderは、特定のWebサイト(あるいは複数のひとかたまりのWebサイト)のページをクロールして解析するためのカスタマイズされた動作を定義する場所です。
Spiderの抽出サイクルは以下の様なものです:
- 最初のURLを対象としたRequestを生成します。それらのRequestによってダウンロードされたResponseとともに、指定したコールバック関数が呼び出されます。最初のRequestは start_requests() メソッド(デフォルト)によって取得されます。このメソッドは、 start_urls で指定されたURLと、コールバック関数として parse を指定した Request を生成します。
- コールバック関数では、レスポンス(Webページ)をパースし、抽出されたデータのdict、 Item オブジェクト、 Request オブジェクト、あるいはそれらのiterableなオブジェクトを戻します。それらのRequestにもまたコールバック関数(おそらく同じもの)が指定されており、Scrapyによってダウンロードされた後、そのコールバック関数でレスポンスが処理されます。
- コールバック関数では、ページのコンテンツをパースします。通常は セレクタ を使用しますが、BeautifulSoupやlxmlなどの他のパーサを利用することも可能です。そして、パースされたデータを含んだItemを生成します。
- 最後に、Spiderから戻されたItemは、一般的にはデータベースに永続化されるか(詳しくは Item Pipeline を参照)、 Feed exports を使用してファイルに書き出されます。
このサイクルは(多かれ少なかれ)どのようなSpiderにも適用されます。様々な目的のための様々な種類のSpiderがデフォルトでScrapyにはバンドルされています。それらのSpiderについて説明します。
scrapy.Spider¶
-
class
scrapy.spiders.
Spider
¶ これは最もシンプルなSpiderであり、他のすべてのSpiderが継承しなければなりません。(Scrapyにバンドルされているものだけではなく、貴方が自ら作成したSpiderにも該当します)これは特別な機能は提供しません。 提供されるのはSpiderの
start_urls
属性をもとにしたRequestを送るためのデフォルトのstart_requests()
メソッドの実装とResponseごとに呼び出されるSpiderのparse
メソッドの実装です。-
name
¶ Spiderの名前を定義します。このSpiderの名前はScrapyによってどのようにSpiderが配置されインスタンス化されているのかの識別に使われます。そのため一意である必要がありますが、同じSpiderから複数のインスタンスを生成することを妨げるものではありません。この属性はSpiderにおいて最も重要であり、必須です。
単一のドメインを扱うSpiderでは、TLD の有無にかかわらずSpiderの名前にドメイン名をつけるのが一般的な手法です。例えば、
mywebsite.com
をクロールするSpiderの名前は多くの場合mywebsite
とされます。注釈
Python 2では、ASCII文字のみでなければなりません。
-
allowed_domains
¶ Spiderにクロールを許可するドメイン名のオプショナルなリストです。 OffsiteMiddleware が有効化されている場合、このリストに含まれるドメイン(あるいはそのサブドメイン)に属さないURLへのリクエストは追跡されません。例えば、対象のURLが
https://www.example.com/1.html
である場合、'example.com'
をリストに追加します。
-
start_urls
¶ URLを明示的に指定しなかった時に、Spiderがクロールを始めるURLのリストです。その場合、最初にダウンロードされるページは、これにリストされたページになります。後続の
Request
はstart_urls
に含まれるデータから順番に生成されます。
-
custom_settings
¶ 設定値を含んだディクショナリであり、Spiderが実行される際にプロジェクト全体の設定を上書きします。インスタンス化される前に設定が更新されるため、クラス属性として定義する必要があります。
利用可能なビルトインの設定のリストは Built-in settings reference を参照してください。
-
crawler
¶ この属性は、クラスの初期化後に from_crawler() クラスメソッドによってセットされ、このSpiderインスタンスがリンクされている Crawler オブジェクトを保持します。
Crawlerは、単一のエントリポイントからアクセスできるように(extension、middleware、signal manager等)、プロジェクト内の多くのコンポーネントをカプセル化しています。詳細については Crawler API を参照してください。
-
logger
¶ Spiderの
name
でPythonのloggerが作成されます。ログメッセージの送信はこれを介して行うことができます。 詳しくは Logging from Spiders を参照してください。
-
from_crawler
(crawler, *args, **kwargs)¶ これはScrapyによってSpiderを作成するために用いられるクラスメソッドです。
このメソッドを直接オーバライドする必要はありません。デフォルトの実装は
__init__()
に必要な引数 args と名前付き引数 kwargs を渡すプロキシとして動作するためです。
crawler
とsettings
を新しいインスタンスに設定するため、Spiderのコード内でそれらの属性にアクセスすることができます。パラメータ: - crawler (
Crawler
instance) -- Spiderに紐付けられているCrawler - args (list) --
__init__()
に渡された引数 - kwargs (dict) --
__init__()
に渡された名前付き引数
- crawler (
-
start_requests
()¶ このメソッドはSpiderがクロールするための最初のリクエストを含むiterableなインスタンスを戻さなければいけません。これはScrapyによってSpiderがクロールするために起動する際に呼び出されます。Scrapyはこれを一度だけ呼び出すため、
start_requests()
をジェネレータとして安全に実装することができます。デフォルトの実装では
start_urls
内の各URLに対してRequest(url, dont_filter=True)
を生成します。もし、ドメインのクロール開始のリクエストを変更したい場合は、このメソッドをオーバライドします。例えば、最初にPOSTリクエストでログインする必要がある場合は、以下のようにします。
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
-
parse
(response)¶ これはScrapyによってダウンロードされたレスポンスを処理するために、コールバックを指定しなかったリクエストでのデフォルトのコールバックです。
parse
メソッドはレスポンスの処理と、抽出されたデータを戻します。また、さらに辿るべきURLを戻すこともあります。 他のリクエストのコールバックはSpider
と同じ要件を持ちます。このメソッドは、他のリクエストのコールバックと同じように、
Request
のiterable、dict、Item
オブジェクトのいずれかを戻さなければいけません。パラメータ: response ( Response
) -- 処理されるレスポンス
-
log
(message[, level, component])¶ Spiderの
logger
を通じてログメッセージを送るための下位互換を保つために設けられたラッパーです。詳しくは Logging from Spiders を確認してください。
-
closed
(reason)¶ スパイダーの終了時に呼ばれます。このメソッドは
spider_closed
シグナルを送るためのsignals.connect() のショートカットを提供します。
-
例:
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
1つのコールバックで複数のRequestとItemを戻す場合:
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
start_urls
の代わりに start_requests()
を直接使い、またより構造化されたデータを戻すため Item を使用する場合:
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
Spiderの引数¶
Spiderは挙動の変更をもたらすいくつかの引数を受け入れます。一般的な用途としては、クロールの開始URLやサイトの一部をクロールするためのセクション指定などです。しかし、それ以外の機能を構成するためにも使用できます。
Spiderの引数は crawl
コマンドの -a
オプションを通じて渡されます。例えば以下の様になります:
scrapy crawl myspider -a category=electronics
Spiderに渡された引数は __init__ メソッドでアクセスすることができます:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...
デフォルトの __init__ メソッドはすべてのSpiderの引数も受け取り、それらはSpiderの属性にコピーされます。上記の例は、次のように書くこともできます。
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
Spiderの引数には文字列のみが使用できることに注意してください。Spiderが引数を自動的にパースすることはありません。 start_urls をコマンドラインからセットしようとする場合、ユーザ自身で ast.literal_eval や json.loads 等を使用してリストにパースし、属性としてセットする必要があります。そうしない場合、 start_urls に渡した文字列自体がiterableとして処理され(Pythonでよくある落とし穴)、1文字ずつがURLとして扱われてしまいます。
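参考として、コマンドラインから渡されたJSON文字列を start_urls にパースする場合のスケッチを示します(引数名や渡し方は一例です)。
import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    # 例: scrapy crawl myspider -a start_urls='["http://example.com/a", "http://example.com/b"]'
    def __init__(self, start_urls=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if start_urls:
            # 文字列として渡された引数を自分でリストにパースして属性にセットする
            self.start_urls = json.loads(start_urls)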
有効なユースケースとしては、HttpAuthMiddleware
を使用してHTTP認証のクレデンシャルをセットするものや、 UserAgentMiddleware
を使用してユーザーエージェントをセットするものがあります:
scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot
Spiderの引数はScrapydの schedule.json
APIでも指定することができます。詳しくは Scrapyd documentation を参照してください。
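参考までに、Scrapydの schedule.json APIでSpider引数を渡す場合の一例を示します(ホスト名やプロジェクト名は仮のものです)。
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider -d category=electronics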
一般的なSpider¶
Scrapyはいくつかの有用なSpiderを提供しています。ユーザはこれらのSpiderのサブクラスとして自身のSpiderを作成することができます。これらのSpiderの目的は、いくつかのクロールケースに合わせて有効な機能を提供することです。それは例えば、特定のルールに基づいてサイト上のすべてのリンクをたどったり、 Sitemaps からクロールしたり、XML/CSVフィードをパースしたりすることです。
以下の例で使用されるSpiderは、myproject.items
モジュールで宣言された TestItem
を持ったプロジェクト配下で作成されているものとします。
import scrapy


class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
CrawlSpider¶
-
class
scrapy.spiders.
CrawlSpider
¶ 通常のウェブサイトをクロールする最も一般的なSpiderです。これは、いくつかのルールに従ってリンクをたどっていく便利なメカニズムを提供します。あなたの特定のウェブサイトやプロジェクトには最適ではないかもしれませんが、多くのケースではこのSpiderで十分ですので、より多くのカスタム機能のために必要に応じて上書きしたり、独自のスパイダーを実装することができます。
Spiderから継承した属性(指定する必要があります)に加えて、このクラスは新しい属性をサポートします:
-
rules
¶ Rule
オブジェクトの1つ(あるいは複数)のリストです。それぞれのRule
はサイトをクロールするための特定の動作を定義します。もし、あるリンクに複数のルールが該当する場合は、この属性で定義した順序に従って、最初の1つが使用されます。
このスパイダーは、オーバーライド可能なメソッドも公開しています:
-
クロールのルール¶
-
class
scrapy.spiders.
Rule
(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)¶ link_extractor
はクロールされたそれぞれのページからどのようにリンクを抽出するかを定義した Link Extractor オブジェクトです。callback
はlink_extractorによって抽出された各リンク毎に呼び出される、呼び出し可能オブジェクトか文字列(この場合, その名前を持つSpiderのメソッドが使用されます)です。このコールバックは第一引数にレスポンスを受け取り、Item
および/またはRequest
オブジェクト(あるいはそれらのサブクラス)のリストを戻す必要があります。警告
スパイダーのクロールのルールを作成する際は、
parse
メソッドを使用しないように注意します。CrawlSpider
はparse
メソッドを自身のロジックに組み込んでいるためです。そのため、もしparse
メソッドをオーバーライドしてしまうと、CrawlSpiderは動作しなくなります。cb_kwargs
はコールバック関数に渡される名前付き引数を含んだdictです。follow
は, このルールで抽出された各レスポンスからリンクをたどるかどうかを指定する bool 値です. もしcallback
がNoneの場合、follow
はデフォルトでTrue
になり、それ以外ではFalse
になります。process_links
はlink_extractorによって抽出された各リンク毎に呼び出される、呼び出し可能オブジェクトか文字列(この場合、その名前を持つSpiderのメソッドが使用されます)です。主な目的はフィルタリングです。process_request
は ruleによって抽出されたすべてのリクエストに対して呼び出される、呼び出し可能オブジェクトか文字列(この場合、その名前を持つSpiderのメソッドが使用されます)です。これはrequestかNone(フィルタされた事を示す)を戻す必要があります。
CrawlSpiderの例¶
ルールを使用したCrawlSpiderの例を見てみましょう:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        return item
このSpiderはexample.comのホームページをクロールし、カテゴリのリンクとアイテムのリンクを収集し、後者を parse_item
メソッドによって解析します。各アイテムのレスポンスでは、XPathを使ってHTMLからデータを抽出し、 Item
にデータを埋めます。
XMLFeedSpider¶
-
class
scrapy.spiders.
XMLFeedSpider
¶ XMLFeedSpiderはXMLフィードの特定のノード名について繰り返し解析するように設計されています。イテレータは
iternodes
,xml
, そしてhtml
から選ぶことができます。パフォーマンス的な観点から推奨はiternodes
を使用することです。何故なら、xml
やhtml
のイテレータは解析のため、一度に完全なDOM構造を生成するためです。しかし、html
は不完全なマークアップで作成されたXMLを解析する際には有用です。イテレータとタグ名を設定するには、次のクラス属性を定義する必要があります:
-
iterator
¶ 使用するイテレータを定義する文字列です。以下の値が指定できます。
- 'iternodes' - 正規表現に基づいた高速なイテレータ
- 'html' - Selector を使用したイテレータ。これは解析のためにすべてのDOMをメモリ上にロードする必要があり、大きなフィードではそれが問題につながることに注意してください。
- 'xml' - Selector を使用したイテレータ。これは解析のためにすべてのDOMをメモリ上にロードする必要があり、大きなフィードではそれが問題につながることに注意してください。
デフォルト:
'iternodes'
-
itertag
¶ 繰り返し対象であるノード名(あるいは要素)を示す文字列です。例:
itertag = 'product'
-
namespaces
¶ このスパイダーで処理される、文書で利用可能な名前空間を定義する
(prefix, uri)
のタプルのリストです。prefix
と uri
はregister_namespace()
メソッドを使用して名前空間を自動的に登録するために使用されます。itertag
属性に名前空間をもったノードを指定することができます。例:
class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...
これらの新しい属性とは別に、このスパイダーは次のオーバーライド可能なメソッドも持っています:
-
adapt_response
(response)¶ SpiderミドルウェアからのレスポンスがSpiderによって解析される前に受け取るメソッドです。解析の前にレスポンスボディを変更するために利用できます。このメソッドはレスポンスを受け取りレスポンスを戻します。(同じものかあるいは別のものです)
-
parse_node
(response, selector)¶ 指定されたタグ名(
itertag
)と一致するノードに対して呼び出されるメソッドです。レスポンスとSelector
を各ノードごとに受け取ります。このメソッドをオーバーライドすることは必須です。そうしない場合、Spiderは動作しません。このメソッドはItem
かRequest
あるいは、それらを含んだiterableを戻す必要があります。
-
process_results
(response, results)¶ Spiderによって戻された各結果(ItemかRequest)に対して呼び出されるメソッドであり、結果をフレームワークのコアに戻す際に必要な最後の処理を行うことを目的としています。それは例えば、アイテムのID設定等です。このメソッドは結果のリストとそれらの結果の元になったレスポンスを受け取ります。これは結果(ItemかRequest)のリストを戻す必要があります。
-
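参考として、 adapt_response をオーバーライドしてパース前にレスポンスボディを加工する場合の最小限のスケッチを示します(URLや置換内容は仮のものです)。
from scrapy.spiders import XMLFeedSpider

class CleanedFeedSpider(XMLFeedSpider):
    name = 'cleanedfeed'
    start_urls = ['http://www.example.com/feed.xml']  # 仮のURL
    itertag = 'item'

    def adapt_response(self, response):
        # 仮の例: パースの前に不要な断片を取り除いたレスポンスを返す
        body = response.body.replace(b'<!-- ad -->', b'')
        return response.replace(body=body)

    def parse_node(self, response, node):
        yield {'id': node.xpath('@id').get()}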
XMLFeedSpiderの例¶
これらのSpiderは簡単に利用できます。まずは例を見てみましょう:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.getall()))

        item = TestItem()
        item['id'] = node.xpath('@id').get()
        item['name'] = node.xpath('name').get()
        item['description'] = node.xpath('description').get()
        return item
上述の基本的な例は、 start_urls
によって指定されたフィードからダウンロードしそれぞれの item
タグを繰り返し処理し、出力し、ランダムなデータを Item
に格納するSpiderです。
CSVFeedSpider¶
-
class
scrapy.spiders.
CSVFeedSpider
¶ このSpiderは、ノードの代わりに行を繰り返し処理する点を除きXMLFeedSpiderとよく似ています。各繰り返しに対して呼び出されるメソッドは
parse_row()
です。-
delimiter
¶ CSVファイルの各フィールドの区切り文字を指定します。デフォルトは
','
(カンマ) です。
-
quotechar
¶ CSVファイル内の各フィールドの囲み文字を含む文字列です。デフォルトは
'"'
(ダブルクォテーション)です。
-
headers
¶ CSVファイルに含まれるカラム名のリストです。
-
parse_row
(response, row)¶ レスポンスと、CSVファイルを元に指定された(あるいは検知された)ヘッダーをキーとするdict(各行を表現する)を受け取ります。このSpiderも
adapt_response
とprocess_results
メソッドをオーバライドすることができ、前後に独自の処理を行うことができます。
-
CSVFeedSpiderの例¶
前述のものと似ていますが、 CSVFeedSpider
を用いた例を見てみましょう:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem


class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
SitemapSpider¶
-
class
scrapy.spiders.
SitemapSpider
¶ SitemapSpiderは Sitemaps を使ってURLを探索しながらサイトをクロールすることができます。
ネストされたSitemapや robots.txt からSitemapを取得して利用することもサポートされています。
-
sitemap_urls
¶ クロールしたいSitemapを指すURLのリストです。
robots.txt を指定することもできます。その場合、その解析結果に含まれるSitemapが利用されます。
-
sitemap_rules
¶ タプル
(regex, callback)
のリストです:regex
はSitemapから抽出されたURLにマッチさせる正規表現です。regex
は文字列でもコンパイル済のregexオブジェクトでも指定できます。- callbackは正規表現にマッチしたURLを処理するためのコールバック関数です。
callback
は文字列(この場合、その名前を持つSpiderのメソッドが使用されます)か呼び出し可能オブジェクトを指定します。
例:
sitemap_rules = [('/product/', 'parse_product')]
ルールは順番に適用され、最初に該当したものだけが使用されます。
もしこの属性を省略した場合、Sitemapに含まれる全てのURLが処理され、
parse
コールバックメソッドで処理されます。
-
sitemap_follow
¶ 従うべきSitemapの正規表現のリストです。これは ほかのSitemapファイルを指し示す Sitemap index files を使っているサイトでのみ利用されます。
デフォルトでは全てのSitemapに従います。
-
sitemap_alternate_links
¶ url
の代替リンクに従うかどうかを指定します。これらのリンクは、同じurl
ブロック内で渡された別の言語の同じWEBサイトへのリンクです。例:
<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>
sitemap_alternate_links
が設定されている場合、両方のURLが取得されます。sitemap_alternate_links
が無効化されている場合は、http://example.com/
のみが取得されます。デフォルトでは
sitemap_alternate_links
は無効化されています。
-
sitemap_filter
(entries)¶ This is a filter function that can be overridden to select sitemap entries based on their attributes.
例:
<url>
    <loc>http://example.com/</loc>
    <lastmod>2005-01-01</lastmod>
</url>
We can define a
sitemap_filter
function to filter entries by date:
from datetime import datetime
from scrapy.spiders import SitemapSpider

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered_sitemap_spider'
    allowed_domains = ['example.com']
    sitemap_urls = ['http://example.com/sitemap.xml']

    def sitemap_filter(self, entries):
        for entry in entries:
            date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
            if date_time.year >= 2005:
                yield entry
This would retrieve only
entries
modified on 2005 and the following years.Entries are dict objects extracted from the sitemap document. Usually, the key is the tag name and the value is the text inside it.
It's important to notice that:
- as the loc attribute is required, entries without this tag are discarded
- alternate links are stored in a list with the key
alternate
(seesitemap_alternate_links
) - namespaces are removed, so lxml tags named as
{namespace}tagname
become onlytagname
If you omit this method, all entries found in sitemaps will be processed, observing other attributes and their settings.
-
SitemapSpiderの例¶
もっともシンプルな例: Sitemapから発見された全てのURLを parse
コールバックを用いて処理する場合:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...
いくつかのURLを特定のコールバックで処理し、他のURLは異なるコールバックで処理する場合:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...
robots.txt ファイルに定義されたSitemapのうち /sitemap_shop
をURLに含んだものにのみ従う場合:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...
SitemapSpiderを他のソースから取得されたURLと組み合わせる場合:
import scrapy
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...
セレクタ¶
When you're scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this, such as:
- BeautifulSoup は、Pythonプログラマーの間で非常に人気のあるWebスクレイピングライブラリです。HTMLコードの構造に基づいてPythonオブジェクトを作成し、また、不適切なマークアップを適切に処理しますが、遅いという欠点があります。
- lxml は、 ElementTree をベースにしたPythonicなAPIを備えたXML解析ライブラリ(HTMLも解析できます)です。lxmlはPython標準ライブラリの一部ではありません。
Scrapyには、データを抽出するための独自のメカニズムが付属しています。HTMLドキュメントの特定の部分を XPath 式または CSS 式で「選択」するため、これらはセレクタと呼ばれます。
XPath はXML文書内のノードを選択するための言語ですが、HTMLでも使用できます。 CSS はHTMLドキュメントにスタイルを適用するための言語です。スタイルを特定のHTML要素に関連付けるためのセレクタを定義します。
注釈
Scrapy Selectors is a thin wrapper around parsel library; the purpose of this wrapper is to provide better integration with Scrapy Response objects.
parsel is a stand-alone web scraping library which can be used without Scrapy. It uses lxml library under the hood, and implements an easy API on top of lxml API. It means Scrapy selectors are very similar in speed and parsing accuracy to lxml.
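参考として、parsel をScrapyなしで単体で使う場合の最小限のスケッチを示します(HTML文字列は仮のものです)。
from parsel import Selector

sel = Selector(text='<html><body><span>good</span></body></html>')
print(sel.css('span::text').get())        # 'good'
print(sel.xpath('//span/text()').get())   # 'good'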
セレクタの使用¶
セレクタの構築¶
Response objects expose a Selector
instance
on .selector
attribute:
>>> response.selector.xpath('//span/text()').get()
'good'
Querying responses using XPath and CSS is so common that responses include two
more shortcuts: response.xpath()
and response.css()
:
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
Scrapy selectors are instances of Selector
class
constructed by passing either TextResponse
object or
markup as an unicode string (in text
argument).
Usually there is no need to construct Scrapy selectors manually:
response
object is available in Spider callbacks, so in most cases
it is more convenient to use response.css()
and response.xpath()
shortcuts. By using response.selector
or one of these shortcuts
you can also ensure the response body is parsed only once.
But if required, it is possible to use Selector
directly.
Constructing from text:
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
Constructing from response - HtmlResponse
is one of
TextResponse
subclasses:
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').get()
'good'
Selector
automatically chooses the best parsing rules
(XML vs HTML) based on input type.
セレクタの使用¶
セレクタの使用方法を説明するために、 Scrapyシェル (これは対話式のテストを提供します)とScrapy文書サーバーにあるサンプルページを使用します。
For the sake of completeness, here's its full HTML code:
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>
まず、シェルを開きましょう。
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
シェルがロードされた後に、レスポンスとして利用可能な response
シェル変数と、セレクタとして利用可能な response.selector
属性を使うことができます。
HTMLを扱うので、セレクタは自動的にHTMLパーサーを使用します。
そのページの HTMLコード を見て、titleタグ内のテキストを選択するためのXPathを作成しましょう。
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
To actually extract the textual data, you must call the selector .get()
or .getall()
methods, as follows:
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
.get()
always returns a single result; if there are several matches,
content of a first match is returned; if there are no matches, None
is returned. .getall()
returns a list with all results.
CSSセレクタの場合は、CSS3の疑似要素を使用してテキストまたは属性ノードを選択できることに注意してください。
>>> response.css('title::text').get()
'Example website'
ご覧のように、 .xpath()
および .css()
メソッドは、新しいセレクタのリストである SelectorList
インスタンスを返します。このAPIはネストしたデータを素早く選択するために使用できます。
>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
If you want to extract only the first matched element, you can call the
selector .get()
(or its alias .extract_first()
commonly used in
previous Scrapy versions):
>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '
It returns None
if no element was found:
>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True
デフォルトの戻り値を引数として渡し、 None
の代わりに使用することもできます。
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
Instead of using e.g. '@src'
XPath it is possible to query for attributes
using .attrib
property of a Selector
:
>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
As a shortcut, .attrib
is also available on SelectorList directly;
it returns attributes for the first matching element:
>>> response.css('img').attrib['src']
'image1_thumb.jpg'
This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a web page:
>>> response.css('base').attrib['href']
'http://example.com/'
ベースURLと画像リンクを取得します。
>>> response.xpath('//base/@href').get()
'http://example.com/'
>>> response.css('base::attr(href)').get()
'http://example.com/'
>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
Extensions to CSS Selectors¶
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
- to select text nodes, use
::text
- to select attribute values, use
::attr(name)
where name is the name of the attribute that you want the value of
警告
These pseudo-elements are Scrapy-/Parsel-specific. They will most probably not work with other libraries like lxml or PyQuery.
Examples:
- title::text selects children text nodes of a descendant <title> element:
  >>> response.css('title::text').get()
  'Example website'
- *::text selects all descendant text nodes of the current selector context:
  >>> response.css('#images *::text').getall()
  ['\n ', 'Name: My image 1 ', '\n ', 'Name: My image 2 ', '\n ', 'Name: My image 3 ', '\n ', 'Name: My image 4 ', '\n ', 'Name: My image 5 ', '\n ']
- foo::text returns no results if foo element exists, but contains no text (i.e. text is empty):
  >>> response.css('img::text').getall()
  []
  This means .css('foo::text').get() could return None even if an element exists. Use default='' if you always want a string:
  >>> response.css('img::text').get()
  >>> response.css('img::text').get(default='')
  ''
- a::attr(href) selects the href attribute value of descendant links:
  >>> response.css('a::attr(href)').getall()
  ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
注釈
See also: Selecting element attributes.
注釈
You cannot chain these pseudo-elements. But in practice it would not make much sense: text nodes do not have attributes, and attribute values are string values already and do not have children nodes.
ネストされたセレクタ¶
選択メソッド( .xpath()
または .css()
)は同じタイプのセレクタのリストを返すので、それらのセレクタでさらに選択メソッドを呼び出すことができます。次に例を示します。
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
... args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
... print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'
Selecting element attributes¶
There are several ways to get a value of an attribute. First, one can use XPath syntax:
>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
XPath syntax has a few advantages: it is a standard XPath feature, and
@attributes
can be used in other parts of an XPath expression - e.g.
it is possible to filter by attribute value.
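For instance, on the sample page used above, you can filter links by an exact attribute value (the query below is just an illustration):
>>> response.xpath('//a[@href="image1.html"]/text()').get()
'Name: My image 1 '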
Scrapy also provides an extension to CSS selectors (::attr(...)
)
which allows to get attribute values:
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In addition to that, there is a .attrib
property of Selector.
You can use it if you prefer to lookup attributes in Python
code, without using XPaths or CSS extensions:
>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
This property is also available on SelectorList; it returns a dictionary with attributes of a first matching element. It is convenient to use when a selector is expected to give a single result (e.g. when selecting by element ID, or when selecting an unique element on a page):
>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'
.attrib
property of an empty SelectorList is empty:
>>> response.css('foo').attrib
{}
セレクタで正規表現を使う¶
Selector
には、正規表現を使用してデータを抽出するための .re()
メソッドもあります。ただし、 .xpath()
や .css()
メソッドを使用するのとは異なり、 .re()
はUnicode文字列のリストを返します。そのため、 .re()
呼び出しを入れ子にすることはできません。
上記の HTMLコード からイメージ名を抽出するのに使用される例は、次のとおりです。
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
'My image 2',
'My image 3',
'My image 4',
'My image 5']
There's an additional helper reciprocating .get()
(and its
alias .extract_first()
) for .re()
, named .re_first()
.
Use it to extract just the first matching string:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'
extract() and extract_first()¶
If you're a long-time Scrapy user, you're probably familiar
with .extract()
and .extract_first()
selector methods. Many blog posts
and tutorials are using them as well. These methods are still supported
by Scrapy, there are no plans to deprecate them.
However, Scrapy usage docs are now written using .get()
and
.getall()
methods. We feel that these new methods result in a more concise
and readable code.
The following examples show how these methods map to each other.
- SelectorList.get() is the same as SelectorList.extract_first():
  >>> response.css('a::attr(href)').get()
  'image1.html'
  >>> response.css('a::attr(href)').extract_first()
  'image1.html'
- SelectorList.getall() is the same as SelectorList.extract():
  >>> response.css('a::attr(href)').getall()
  ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
  >>> response.css('a::attr(href)').extract()
  ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
- Selector.get() is the same as Selector.extract():
  >>> response.css('a::attr(href)')[0].get()
  'image1.html'
  >>> response.css('a::attr(href)')[0].extract()
  'image1.html'
- For consistency, there is also Selector.getall(), which returns a list:
  >>> response.css('a::attr(href)')[0].getall()
  ['image1.html']
So, the main difference is that output of .get()
and .getall()
methods
is more predictable: .get()
always returns a single result, .getall()
always returns a list of all extracted results. With .extract()
method
it was not always obvious if a result is a list or not; to get a single
result either .extract()
or .extract_first()
should be called.
Working with XPaths¶
Here are some tips which may help you use XPath with Scrapy selectors effectively. If you are not yet very familiar with XPath, you may want to take a look at this XPath tutorial first.
注釈
Some of the tips are based on this post from ScrapingHub's blog.
XPathを相対的に操作する¶
セレクタを入れ子にして /
で始まるXPathを使用する場合、そのXPathはドキュメントに対して絶対的なものであり、呼び出し元の Selector
に対して相対的なものではないことに留意してください。
たとえば、 <div>
要素内のすべての <p>
要素を抽出するとします。まず、すべての <div>
要素を取得します。
>>> divs = response.xpath('//div')
最初は以下のようなアプローチを取りがちですが、これは、実際には <div>
要素の内部だけでなく、すべての <p>
要素を文書から抽出するため、間違っています。
>>> for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
... print(p.get())
こちらは適切な方法です(XPath .//p
にドットプレフィックスがあります)。
>>> for p in divs.xpath('.//p'): # extracts all <p> inside
... print(p.get())
もうひとつの一般的なケースは、直接の子である <p>
を抽出することです。
>>> for p in divs.xpath('p'):
... print(p.get())
相対XPathの詳細については、XPath仕様の Location Paths セクションを参照してください。
クラスで検索するときに、CSSの使用を検討する¶
要素には複数のCSSクラスを含めることができるので、クラスで要素を選択するXPathの方法はかなり冗長です。
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
@class='someclass'
を使用すると、要素が他のクラスも持っていると取りこぼすことがあります。そのために contains(@class, 'someclass')
を使用しても、文字列 someclass
を含む異なるクラス名がある場合は、そちらも取得してしまいます。
結局のところ、Scrapyセレクタを使うとセレクタをチェーンさせることができるので、ほとんどの場合はCSSを使用してクラスで選択し、必要に応じてXPathに切り替えることができます。
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']
これは、上記の冗長なXPathのトリックを使用するよりもクリーンです。続くXPath式で .
を使用することを忘れないでください。
//node[1] と (//node)[1] の違いに注意¶
//node[1]
は、それぞれの親の下に最初に出現するすべてのノードを選択します。
(//node)[1]
はドキュメント内のすべてのノードを選択し、それらのうち最初のノードだけを取得します。
例:
>>> from scrapy import Selector
>>> sel = Selector(text="""
....: <ul class="list">
....: <li>1</li>
....: <li>2</li>
....: <li>3</li>
....: </ul>
....: <ul class="list">
....: <li>4</li>
....: <li>5</li>
....: <li>6</li>
....: </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()
こちらは、親であるものが何であれ、最初の <li>
要素をすべて取得します。
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
そしてこちらは、文書全体の最初の <li>
要素を取得します。
>>> xp("(//li)[1]")
['<li>1</li>']
こちらは、親が <ul>
である最初の <li>
要素をすべて取得します。
>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']
そしてこちらは、文書全体の最初の、親が <ul>
である <li>
要素を取得します。
>>> xp("(//ul/li)[1]")
['<li>1</li>']
条件内でのテキストノードの使用¶
XPath文字列関数 の引数としてテキストコンテンツを使う必要がある場合は、 .//text()
を使用せず、代わりに .
を使用してください。
これは、式 .//text()
がテキスト要素の集合、つまり ノードセット を生成するためです。そして、ノードセットが contains()
や starts-with()
のような文字列関数に引数として渡され文字列に変換されるとき、最初の要素のみのテキストを返します。
例:
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
ノードセットを文字列に変換します。
>>> sel.xpath('//a//text()').getall() # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall() # convert it to string
['Click here to go to the ']
ただし、文字列に変換された ノード は、それ自体のテキストとそのすべての子孫のテキストをまとめたものです。
>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']
したがって、 .//text()
ノードセットを使用しても、この場合は何も選択されません。
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
しかし、ノードを意味する .
を使用すると、うまくいきます。
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
XPath式の中の変数¶
XPathでは、 $somevariable
構文を使用して、XPath式内の変数を参照できます。 これは、クエリの引数の一部を ?
などのプレースホルダに置き換えたSQLの世界でのパラメータ化クエリやプリペアドステートメントとやや似ています。これらはクエリに渡された値に置き換えられます。
次の例は、ハードコーディングをせずに、その「id」属性値に基づいて要素を照合します(以前に示したものです)。
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
別の例として、5つの <a>
の子を含む <div>
タグの「id」属性を見つけるものを示します(ここでは値 5
を整数として渡します)。
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
すべての変数参照は、 .xpath()
を呼び出すときにバインド値を持つ必要があります(そうでなければ、 ValueError: XPath error:
例外が発生します)。これは、必要な数の名前付き引数を渡すことによって行われます。
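参考までに、複数の変数を同時にバインドする簡単なスケッチを示します(上記のサンプルページを前提とした説明用の例です)。
>>> response.xpath('//div[@id=$id][count(a)=$cnt]/@id', id='images', cnt=5).get()
'images'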
ネームスペースを削除する¶
スクレイピングプロジェクトを扱うとき、ネームスペースを完全に取り除いて要素名だけを扱えるようにすると、より単純で便利なXPathを書くことができます。これには Selector.remove_namespaces()
メソッドを使用できます。
Let's show an example that illustrates this with the Python Insider blog atom feed.
まず、抽出したいURLでシェルを開きます。
$ scrapy shell https://feeds.feedburner.com/PythonInsider
This is how the file starts:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:blogger="http://schemas.google.com/blogger/2008"
xmlns:georss="http://www.georss.org/georss"
xmlns:gd="http://schemas.google.com/g/2005"
xmlns:thr="http://purl.org/syndication/thread/1.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
...
You can see several namespace declarations including a default "http://www.w3.org/2005/Atom" and another one using the "gd:" prefix for "http://schemas.google.com/g/2005".
シェルに入ったら、すべての <link>
オブジェクトを選択しても機能しないことを確認します(Atom XMLネームスペースがこれらのノードを見づらくしているため)。
>>> response.xpath("//link")
[]
しかし、 Selector.remove_namespaces()
メソッドを呼び出すと、すべてのノードにそれらの名前で直接アクセスできます。
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
<Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
...
ネームスペースの削除がデフォルトで常に実行されず、手動で呼び出す必要がある理由は、以下の2つです。
- Removing namespaces requires to iterate and modify all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy
- 非常にまれなケースですが、いくつかの要素名がネームスペースの間で衝突する場合のために、ネームスペースの使用が実際に必要とされる場合があるかもしれません。
EXSLT拡張機能を使う¶
Being built atop lxml, Scrapy selectors support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:
| プレフィクス | ネームスペース | 使用法 |
| --- | --- | --- |
| re | http://exslt.org/regular-expressions | regular expressions |
| set | http://exslt.org/sets | set manipulation |
正規表現¶
たとえば、 test()
関数は、XPathの starts-with()
や contains()
だけでは不十分な場合に非常に便利です。
数字で終わる「class」属性を持つリスト項目内のリンクを選択する例:
>>> from scrapy import Selector
>>> doc = u"""
... <div>
... <ul>
... <li class="item-0"><a href="link1.html">first item</a></li>
... <li class="item-1"><a href="link2.html">second item</a></li>
... <li class="item-inactive"><a href="link3.html">third item</a></li>
... <li class="item-1"><a href="link4.html">fourth item</a></li>
... <li class="item-0"><a href="link5.html">fifth item</a></li>
... </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']
>>>
警告
Cライブラリ libxslt
はEXSLT正規表現をネイティブサポートしていないため、 lxml の実装はPythonの re
モジュールへのフックを使用します。そのため、XPath式で正規表現関数を使用すると、パフォーマンスが若干低下する可能性があります。
集合演算¶
これらは、例えばテキスト要素を抽出する前に文書ツリーの一部を除外するのに便利です。
itemscopeとそれに対応するitempropのグループを含むmicrodataの抽出例( http://schema.org/Product から取得したサンプルコンテンツ):
>>> doc = u"""
... <div itemscope itemtype="http://schema.org/Product">
... <span itemprop="name">Kenmore White 17" Microwave</span>
... <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
... <div itemprop="aggregateRating"
... itemscope itemtype="http://schema.org/AggregateRating">
... Rated <span itemprop="ratingValue">3.5</span>/5
... based on <span itemprop="reviewCount">11</span> customer reviews
... </div>
...
... <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
... <span itemprop="price">$55.00</span>
... <link itemprop="availability" href="http://schema.org/InStock" />In stock
... </div>
...
... Product description:
... <span itemprop="description">0.7 cubic feet countertop microwave.
... Has six preset cooking categories and convenience features like
... Add-A-Minute and Child Lock.</span>
...
... Customer reviews:
...
... <div itemprop="review" itemscope itemtype="http://schema.org/Review">
... <span itemprop="name">Not a happy camper</span> -
... by <span itemprop="author">Ellie</span>,
... <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
... <meta itemprop="worstRating" content = "1">
... <span itemprop="ratingValue">1</span>/
... <span itemprop="bestRating">5</span>stars
... </div>
... <span itemprop="description">The lamp burned out and now I have to replace
... it. </span>
... </div>
...
... <div itemprop="review" itemscope itemtype="http://schema.org/Review">
... <span itemprop="name">Value purchase</span> -
... by <span itemprop="author">Lucas</span>,
... <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
... <meta itemprop="worstRating" content = "1"/>
... <span itemprop="ratingValue">4</span>/
... <span itemprop="bestRating">5</span>stars
... </div>
... <span itemprop="description">Great microwave for the price. It is small and
... fits in my apartment.</span>
... </div>
... ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
... print("current scope:", scope.xpath('@itemtype').getall())
... props = scope.xpath('''
... set:difference(./descendant::*/@itemprop,
... .//*[@itemscope]/*/@itemprop)''')
... print(" properties: %s" % (props.getall()))
... print("")
current scope: ['http://schema.org/Product']
properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']
current scope: ['http://schema.org/AggregateRating']
properties: ['ratingValue', 'reviewCount']
current scope: ['http://schema.org/Offer']
properties: ['price', 'availability']
current scope: ['http://schema.org/Review']
properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
properties: ['worstRating', 'ratingValue', 'bestRating']
current scope: ['http://schema.org/Review']
properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
properties: ['worstRating', 'ratingValue', 'bestRating']
>>>
ここではまず itemscope
要素を反復し、それぞれに対して itemprops
要素をすべて探し、別の itemscope
内にあるものを除外します。
Other XPath extensions¶
Scrapy selectors also provide a sorely missed XPath extension function
has-class
that returns True
for nodes that have all of the specified
HTML classes.
For the following HTML:
<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>
You can use it like this:
>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]
So XPath //p[has-class("foo", "bar-baz")]
is roughly equivalent to CSS
p.foo.bar-baz
. Please note, that it is slower in most of the cases,
because it's a pure-Python function that's invoked for every node in question
whereas the CSS lookup is translated into XPath and thus runs more efficiently,
so performance-wise its uses are limited to situations that are not easily
described with CSS selectors.
Parsel also simplifies adding your own XPath extensions.
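As a rough sketch of what that can look like (based on parsel's set_xpathfunc helper; the function name has-prefix and its logic are made up for illustration, so check the parsel documentation before relying on this):
from parsel import xpathfuncs

def has_prefix(ctx, values, prefix):
    # values may arrive as a node-set (a list); normalize to a single string
    value = values[0] if isinstance(values, list) else values
    return str(value).startswith(prefix)

xpathfuncs.set_xpathfunc('has-prefix', has_prefix)
# once registered (globally), it can be used in XPath expressions, e.g.:
#   response.xpath('//a[has-prefix(@href, "image")]')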
Examples¶
HTMLレスポンスに関するセレクタの例¶
Here are some Selector
examples to illustrate several concepts.
In all cases, we assume there is already a Selector
instantiated with
a HtmlResponse
object like this:
sel = Selector(html_response)
HTMLのレスポンスボディからすべての
<h1>
要素を選択し、Selector
オブジェクトのリスト(つまりSelectorList
オブジェクト)を返します。
sel.xpath("//h1")
HTMLのレスポンスボディからすべての
<h1>
要素のテキストを抽出し、Unicode文字列のリストを返します。
sel.xpath("//h1").getall()         # this includes the h1 tag
sel.xpath("//h1/text()").getall()  # this excludes the h1 tag
すべての
<p>
タグを繰り返し処理し、それらのclass属性を出力します。
for node in sel.xpath("//p"):
    print(node.attrib['class'])
XMLレスポンスに関するセレクタの例¶
Here are some examples to illustrate concepts for Selector
objects
instantiated with an XmlResponse
object:
sel = Selector(xml_response)
XMLのレスポンスボディからすべての
<product>
要素を選択し、Selector
オブジェクトのリスト(つまりSelectorList
オブジェクト)を返します。
sel.xpath("//product")
ネームスペースを登録する必要がある Google Base XML feed フィードからすべての価格を抽出します。
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()
Item¶
スクレイピングの主な目的は、構造化されていないソース(通常はWebページ)から構造化されたデータを抽出することです。ScrapyのSpiderは、抽出されたデータをPythonのdictとして返すことができます。便利で手軽なのですが、Pythonのdictでは構造化が不十分です。特に、多くのSpiderを持つ大規模なプロジェクトでは、簡単にフィールド名をtypoしたり、矛盾したデータを返してしまいます。
共通の出力データ形式を定義するために、Scrapyは Item
クラスを提供します。 Item
オブジェクトは、抽出されたデータを収集するためのシンプルなコンテナです。利用可能なフィールドを宣言するための便利な構文を備えた dictライク なAPIを提供します。
さまざまなScrapyコンポーネントがItemが提供する情報を使用します。たとえば、エクスポーターは宣言されたフィールドを参照してエクスポートする列を調べ、シリアライズはItemフィールドのメタデータを利用してカスタマイズできます。 trackref
はメモリのリークを見つけるための項目インスタンスを追跡します( Debugging memory leaks with trackref 参照)。
Itemの宣言¶
Itemはシンプルなクラス定義と Field
オブジェクトを使用して宣言されます。次に例を示します。
import scrapy
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field()
stock = scrapy.Field()
last_updated = scrapy.Field(serializer=str)
注釈
Django に精通している人は、Itemが Django Models と同じように宣言されていることに気づくでしょう。ただScrapyのItemsは異なるフィールド型の概念がないので、それよりはるかに簡単です。
Itemフィールド¶
Field
オブジェクトは、各フィールドのメタデータを指定するために使用されます。たとえば、上記の例の last_updated
フィールドのserializer関数がそれにあたります。
各フィールドには任意の種類のメタデータを指定できます。 Field
オブジェクトが受け入れる値に制限はありません。同じ理由から、使用可能なすべてのメタデータキーの参照リストはありません。 Field
オブジェクトで定義された各キーは、別のコンポーネントによって使用され、それらのコンポーネントだけがそれを知ることができます。また、自分のニーズに合わせて、プロジェクト内の他の Field
キーを定義して使用することもできます。 Field
オブジェクトの主な目的は、すべてのフィールドのメタデータを一ヶ所で定義する方法を提供することです。通常、その動作が各フィールドに依存するコンポーネントは、特定のフィールドキーを使用してその動作を構成します。各コンポーネントでどのメタデータキーが使用されているかについては、ドキュメントを参照する必要があります。
項目を宣言するために使用される Field
オブジェクトは、クラスの属性として割り当てられたままではないことに注意してください。代わりに、 Item.fields
属性を使用してアクセスできます。
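たとえば、上で宣言した Product であれば次のようにフィールド定義へアクセスできます(出力の表記は環境によって多少異なる可能性がある、あくまで目安のスケッチです)。
>>> Product.fields['last_updated']   # Field は dict のサブクラスなので、メタデータがそのまま見えます
{'serializer': <class 'str'>}
>>> sorted(Product.fields.keys())
['last_updated', 'name', 'price', 'stock']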
Itemの操作¶
上記で宣言された Product
Itemを使用したタスクの例をいくつか示します。APIは dict API と非常によく似ています。
Itemの作成¶
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
フィールド値の取得¶
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product['last_updated']
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
not set
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
フィールド値の設定¶
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
すべての設定された値にアクセスする¶
すべての設定された値には dict API 的にアクセスできます。
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
その他の一般的なタスク¶
Itemのコピー
>>> product2 = Product(product)
>>> print(product2)
Product(name='Desktop PC', price=1000)
>>> product3 = product2.copy()
>>> print(product3)
Product(name='Desktop PC', price=1000)
Itemからdictを作成する
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
dictからItemを作成する
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Itemの拡張¶
元のItemのサブクラスを宣言することで、フィールドを追加したり、一部のフィールドのメタデータを変更したりなど、Itemの拡張ができます。
例:
class DiscountedProduct(Product):
discount_percent = scrapy.Field(serializer=str)
discount_expiration_date = scrapy.Field()
次のように、前のフィールドのメタデータを使用してさらに値を追加したり、既存の値を変更したりすることで、フィールドのメタデータを拡張することもできます。
class SpecificProduct(Product):
name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
これにより、 name
フィールドの serializer
メタデータキーが追加(または置換)され、以前に存在したすべてのメタデータ値が保持されます。
Itemオブジェクト¶
Itemローダー¶
Itemローダーは、抽出したデータを Item に投入するための便利なメカニズムを提供します。Itemは独自の辞書ライクなAPIを使用して作成できますが、Itemローダーは抽出前のデータを解析するなどの一般的なタスクを自動化することで、スクレイピングの過程でデータを取り込むための便利なAPIを提供します。
言い換えれば、 Item は抽出されたデータのコンテナを提供し、Itemローダーはそのコンテナに投入するためのメカニズムを提供します。
Itemローダーは、Spiderまたはソースフォーマット(HTML、XMLなど)のいずれかで、メンテナンス地獄にならないように、フィールドの解析ルールを拡張・上書きするための柔軟で効率的で簡単なメカニズムを提供するように設計されています。
Itemローダーを使用してItemを設定する¶
Itemローダーを使用するには、まずそれをインスタンス化する必要があります。dictライクなオブジェクト(Itemやdictなど)でインスタンス化することも、 ItemLoader.default_item_class
属性で指定されたItemクラスを使用してItemローダーのコンストラクタでItemを自動的にインスタンス化することもできます。
次に Selectors を使用してItemローダーに値を収集します。同じItemフィールドに複数の値を追加可能で、Itemローダーでは適切な関数を使用してそれらの値を「結合」する方法があります。
Itemの章 で宣言された Product Item を使用して、 Spider でItemローダーを使う一般的な方法を以下に示します。
from scrapy.loader import ItemLoader
from myproject.items import Product
def parse(self, response):
l = ItemLoader(item=Product(), response=response)
l.add_xpath('name', '//div[@class="product_name"]')
l.add_xpath('name', '//div[@class="product_title"]')
l.add_xpath('price', '//p[@id="price"]')
l.add_css('stock', 'p#stock')
l.add_value('last_updated', 'today') # you can also use literal values
return l.load_item()
コードをざっと眺めてみると、XPathでページ内の2箇所から name
フィールドが抽出されていることがわかります。
//div[@class="product_name"]
//div[@class="product_title"]
つまり、 add_xpath()
メソッドを使用して、XPathで2箇所からデータを抽出しています。これは name
フィールドに割り当てられるデータです。
その後、 price
フィールドと stock
フィールド( add_css()
メソッドでCSSセレクタを使用)に同様の呼び出しがされ、最後に last_updated
フィールドに add_value()
メソッドを使用してリテラル値( today
)が直接設定されます。
最後に、すべてのデータが収集されると、 ItemLoader.load_item()
メソッドが呼び出され、 add_xpath()
, add_css()
および add_value()
の呼び出しで抽出されたデータがItemに格納されます。
入力および出力プロセッサ¶
Itemローダーには、Itemの各フィールドに入力プロセッサと出力プロセッサが1つずつ含まれています。入力プロセッサは、 add_xpath()
, add_css()
または add_value()
メソッドを介して受け取ったデータをすぐに処理し、ItemLoader内部に保持します。すべてのデータを収集した後、 ItemLoader.load_item()
メソッドが呼び出され、データが Item
オブジェクトに格納されます。そのとき、以前に収集されたデータ、および入力プロセッサを使用して処理されたデータを使用して出力プロセッサが呼び出されます。出力プロセッサの結果は、Itemに割り当てられる最終値です。
フィールドに対して入出力プロセッサがどのように呼び出されるかを示す例を見てみましょう。
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
何が起こるでしょうか?
xpath1
からのデータが抽出され、name
フィールドの 入力プロセッサ に渡されます。入力プロセッサの結果は収集され、Itemローダーに保持されます(しかし、Itemにはまだ割り当てられません)。xpath2
からのデータが抽出され、(1)と同じ入力プロセッサを通過します。入力プロセッサの結果が存在する場合は、(1)で収集されたデータに追加されます。- このケースは前のものと似ていますが、データが
css
CSSセレクタから抽出され、(1)と(2)で使用されたものと同じ入力プロセッサを通過します。入力プロセッサの結果が存在する場合は、(1)および(2)で収集されたデータに追加されます - このケースも前のものと似ていますが、収集する値がXPath式やCSSセレクタから抽出される代わりに直接割り当てられる点が異なります。ただし、値は引き続き入力プロセッサを通過します。この場合の値はiterableではありませんが、入力プロセッサは常にiterableを受け取るので、値は入力プロセッサに渡す前に単一要素のiterableに変換されます。
- ステップ(1)(2)(3)(4)で収集されたデータは、
name
フィールドの 出力プロセッサ を通過します。出力プロセッサの結果は、Itemのname
フィールドに割り当てられる値です。
プロセッサは呼び出し可能なオブジェクトであり、パース対象のデータで呼び出され、パースされた値を返すだけであることに注意してください。したがって、入力または出力プロセッサとして任意の関数を使用できます。唯一の条件は、イテレータである1つの位置引数を受け入れる必要があることです。
注釈
入力プロセッサと出力プロセッサのどちらも、最初の引数としてイテレータを受け取る必要があります。これらの関数の出力は何でもかまいません。入力プロセッサの結果は、そのフィールドのために収集された値を含むローダー内の内部リストに追加されます。出力プロセッサの結果は、最終的にItemに割り当てられる値です。
プレーンな関数をプロセッサとして使用する場合は、最初の引数として self
を受け取るようにしてください。
def lowercase_processor(self, values):
for v in values:
yield v.lower()
class MyItemLoader(ItemLoader):
name_in = lowercase_processor
これは、関数をクラスの属性として割り当てるとメソッドになり、呼び出されるときに最初の引数としてインスタンスが渡されるためです。詳細については、 stackoverflowのこの回答 を参照してください。
もうひとつ留意すべきことは、入力プロセッサから返された値が内部的にリストに集められ、フィールドにデータを投入するために出力プロセッサに渡されることです。
最後に、Scrapyには、利便性のため 一般的に使用されるプロセッサ がいくつかビルトインされています。
Itemローダーの宣言¶
Itemローダーは、クラス定義構文を使用してItemのように宣言されます。次に例を示します。
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
class ProductLoader(ItemLoader):
default_output_processor = TakeFirst()
name_in = MapCompose(unicode.title)
name_out = Join()
price_in = MapCompose(unicode.strip)
# ...
ご覧のように、入力プロセッサは _in
接尾辞を使用して宣言され、出力プロセッサは _out
接尾辞を使用して宣言されます。また、 ItemLoader.default_input_processor
および ItemLoader.default_output_processor
属性を使用して、デフォルトの入出力プロセッサを宣言することもできます。
入力プロセッサおよび出力プロセッサの宣言¶
前のセクションで見たように、入出力プロセッサはItemローダーの定義で宣言することができ、このように入力プロセッサを宣言することは非常に一般的です。しかしもう1つ、 Itemフィールド メタデータに使用する入力プロセッサおよび出力プロセッサを指定することもできます。次に例を示します。
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
def filter_price(value):
if value.isdigit():
return value
class Product(scrapy.Item):
name = scrapy.Field(
input_processor=MapCompose(remove_tags),
output_processor=Join(),
)
price = scrapy.Field(
input_processor=MapCompose(remove_tags, filter_price),
output_processor=TakeFirst(),
)
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
>>> il.add_value('price', [u'€', u'<span>1000</span>'])
>>> il.load_item()
{'name': u'Welcome to my website', 'price': u'1000'}
入力プロセッサと出力プロセッサの優先順位は次のとおりです。
- Item ローダーのフィールド固有の属性:
field_in
およびfield_out
(最も優先度が高い) - フィールドメタデータ(
input_processor
およびoutput_processor
キー) - Itemローダーのデフォルト:
ItemLoader.default_input_processor()
およびItemLoader.default_output_processor()
(最も優先度が低い)
参照: Itemローダーの再利用と拡張
Itemローダーコンテキスト¶
Itemローダーコンテキストは、Itemローダー内のすべての入出力プロセッサで共有される任意のキー/値のdictです。これは、Itemローダーの宣言、インスタンス化、または使用時に渡すことが可能で、入出力プロセッサの動作を変更するために使用されます。
たとえば、テキストを受け取りそこから長さを抽出する関数 parse_length
があるとします。
def parse_length(text, loader_context):
unit = loader_context.get('unit', 'm')
# ... length parsing code goes here ...
return parsed_length
loader_context
引数を指定することによって、関数がItemローダーコンテキストを受け取ることを明示的にItemローダーに伝えます。そして、Itemローダーは呼び出し時に現在アクティブなコンテキストを渡し、プロセッサ関数(この場合は parse_length
)はそれを使うことができます。
Itemローダーコンテキストの値を変更する方法はいくつかあります。
現在アクティブなItemローダーコンテキスト(
context
属性)の変更時。
loader = ItemLoader(product)
loader.context['unit'] = 'cm'
Itemローダーのインスタンス化時(コンストラクタのキーワード引数がItemローダーコンテキストに格納されます)。
loader = ItemLoader(product, unit='cm')
Itemローダーの宣言時。Itemローダーコンテキスト付きでインスタンス化をサポートする入出力プロセッサ用に宣言されます。
MapCompose
はその1つです。
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')
Itemローダーオブジェクト¶
-
class
scrapy.loader.
ItemLoader
([item, selector, response, ]**kwargs)¶ 指定されたItemを取り込むための新しいItemローダーを返します。Itemが指定されていない場合は、
default_item_class
のクラスを使用して自動的にインスタンス化されます。selector または response パラメータを使用してインスタンス化すると、
ItemLoader
クラスは セレクタ を使用してWebページからデータを抽出する便利なメカニズムを提供します。パラメータ: - item (
Item
object) --add_xpath()
,add_css()
,add_value()
を使用して設定するItemインスタンス。 - selector (
Selector
object) --add_xpath()
,add_css()
メソッドまたはreplace_xpath()
,replace_css()
メソッドを使用してデータを抽出するときのセレクタ。 - response (
Response
object) --default_selector_class
を使用してセレクタを構成するために使用されたレスポンス。selector引数が指定されていない場合は、この引数は無視されます。
item, selector, responseおよび残りのキーワード引数は、(
context
属性を通してアクセス可能な)ローダーコンテキストに割り当てられます。ItemLoader
インスタンスには、次のメソッドがあります。-
get_value
(value, *processors, **kwargs)¶ processors
とキーワードの引数でvalue
を処理します。使用可能なキーワード引数:
パラメータ: re (str or compiled regex) -- extract_regex()
メソッドを使用して与えられた値からデータを抽出するための正規表現。processorsの前に適用されます。例:
>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'name: foo', TakeFirst(), unicode.upper, re='name: (.+)')
'FOO'
-
add_value
(field_name, value, *processors, **kwargs)¶ 指定されたフィールド用に
value
を処理して追加します。このvalueは、最初に
processors
とkwargs
を与えることによってget_value()
に渡され、次に フィールドの入力プロセッサ を通過し、結果がそのフィールド用に収集されたデータに追加されます。フィールドにすでに収集されたデータが含まれている場合は、新しいデータが追加されます。field_name
はNone
にすることができます。その場合は複数のフィールドの値を追加できますが、valueは、field_nameと値がマップされたdictでなければなりません。例:
loader.add_value('name', u'Color TV')
loader.add_value('colours', [u'white', u'blue'])
loader.add_value('length', u'100')
loader.add_value('name', u'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': u'foo', 'sex': u'male'})
-
replace_value
(field_name, value, *processors, **kwargs)¶ add_value()
と似ていますが、収集されたデータを追加するのではなく新しい値に置き換えます。
-
get_xpath
(xpath, *processors, **kwargs)¶ ItemLoader.get_value()
と同様ですが、値の代わりにXPathを受け取ります。これは、このItemLoader
に関連付けられたセレクタからUnicode文字列のリストを抽出するために使用されます。パラメータ: - xpath (str) -- データを抽出するXPath
- re (str or compiled regex) -- XPathで選択された領域からデータを抽出するために使用される正規表現
例:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')
-
add_xpath
(field_name, xpath, *processors, **kwargs)¶ ItemLoader.add_value()
と同様ですが、値の代わりにXPathを受け取ります。これは、このItemLoader
に関連付けられたセレクタからUnicode文字列のリストを抽出するために使用されます。kwargs
についてはget_xpath()
を参照してください。パラメータ: xpath (str) -- データを抽出するXPath 例:
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
-
replace_xpath
(field_name, xpath, *processors, **kwargs)¶ add_xpath()
と似ていますが、収集されたデータを追加する代わりに置き換えます。
-
get_css
(css, *processors, **kwargs)¶ ItemLoader.get_value()
と同様ですが、値の代わりにCSSセレクタを受け取ります。このセレクタは、このItemLoader
に関連付けられたセレクタからUnicode文字列のリストを抽出するために使用されます。パラメータ: - css (str) -- データを抽出するCSSセレクタ
- re (str or compiled regex) -- CSSセレクタで選択された領域からデータを抽出するために使用される正規表現
例:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
-
add_css
(field_name, css, *processors, **kwargs)¶ ItemLoader.add_value()
と同様ですが、値の代わりにCSSセレクタを受け取ります。このセレクタは、このItemLoader
に関連付けられたセレクタからUnicode文字列のリストを抽出するために使用されます。kwargs
についてはget_css()
を参照してください。パラメータ: css (str) -- データを抽出するCSSセレクタ 例:
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')
-
nested_xpath
(xpath)¶ XPathセレクタでネストされたローダーを作成します。指定されたセレクタは、この
ItemLoader
に関連付けられたセレクタに対して相対的に適用されます。ネストされたローダーは親ItemLoader
とItem
を共有するため、add_xpath()
,add_value()
,replace_value()
などの呼び出しは期待どおりに動作します。
-
nested_css
(css)¶ CSSセレクタでネストされたローダーを作成します。指定されたセレクタは、この
ItemLoader
に関連付けられたセレクタに対して相対的に適用されます。ネストされたローダーは親ItemLoader
とItem
を共有するため、add_xpath()
,add_value()
,replace_value()
などの呼び出しは期待どおりに動作します。
-
get_collected_values
(field_name)¶ 指定されたフィールドの収集された値を返します。
-
get_output_value
(field_name)¶ 指定されたフィールドに対して、出力プロセッサを使用してパースされた収集値を返します。このメソッドはItemを作成したり変更したりしません。
-
get_input_processor
(field_name)¶ 指定されたフィールドの入力プロセッサを返します。
-
get_output_processor
(field_name)¶ 指定されたフィールドの出力プロセッサを返します。
ItemLoader
インスタンスには、以下の属性があります。-
default_item_class
¶ コンストラクタで指定されていないときに項目をインスタンス化するために使用されるItemクラス(またはファクトリ)。
-
default_input_processor
¶ 入力プロセッサが指定されていないフィールドに使用するデフォルトの入力プロセッサ。
-
default_output_processor
¶ 出力プロセッサが指定されていないフィールドに使用するデフォルトの出力プロセッサ。
-
default_selector_class
¶ コンストラクタでresponseのみが与えられている場合、この
ItemLoader
のselector
を構築するために使用されるクラスです。セレクタがコンストラクタで指定されている場合、この属性は無視されます。この属性はサブクラスでオーバーライドされることもあります。
-
selector
¶ データを抽出する
Selector
オブジェクト。これは、コンストラクタで指定されたセレクタか、default_selector_class
を使用してコンストラクタで指定されたresponseから作成されたセレクタです。この属性は読み取り専用です。
- item (
ネストされたローダー¶
関連する値をドキュメントのサブセクションから解析するときは、ネストされたローダーを作成すると便利です。例えば、次のようなページのフッターから詳細を抽出するとします。
例:
<footer>
<a class="social" href="https://facebook.com/whatever">Like Us</a>
<a class="social" href="https://twitter.com/whatever">Follow Us</a>
<a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
ネストされたローダーがなければ、抽出するそれぞれの値に対して完全なXPath(またはCSS)を指定する必要があります。
例:
loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath('social', '//footer/a[@class = "social"]/@href')
loader.add_xpath('email', '//footer/a[@class = "email"]/@href')
loader.load_item()
代わりに、footerセレクタでネストされたローダーを作成し、フッターを基準にして値を追加することができます。機能としては同じですが、footerセレクタの繰り返しを避けることが可能です。
例:
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
ローダーは任意にネストすることができ、XPathかCSSセレクタを使用して動きます。一般的なガイドラインとして、ネストされたローダーを使用するとコードが単純になりますが、極端にネストするとパーサーが読みにくくなることがあります。
Itemローダーの再利用と拡張¶
プロジェクトが大きくなり、多くのSpiderが動くようになると、メンテナンスが根本的な問題になります。特に、Spiderごとにさまざまな解析ルールを処理しなければならないときは、多くの例外があり、しかし一般的なプロセッサを再利用したい要望があります。
Itemローダーは、柔軟性を失うことなくルール解析のメンテナンスの負担を軽減すると同時に、それらを拡張したり上書きするための便利なメカニズムを提供するように設計されています。この理由から、Itemローダーは特定のSpider(またはSpiderのグループ)の違いに対処するために従来のPythonクラス継承をサポートしています。
たとえば、特定のサイトが製品名を3つのダッシュ(例:: ---Plasma TV---
)で囲んでいて、最終的に製品名にはこれらのダッシュをスクレイピングしたくないとします。
デフォルトのProduct Itemローダー (ProductLoader
) を再利用して拡張することによって、これらのダッシュを削除する方法は次のとおりです。
from scrapy.loader.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
def strip_dashes(x):
return x.strip('-')
class SiteSpecificLoader(ProductLoader):
name_in = MapCompose(strip_dashes, ProductLoader.name_in)
Itemローダーを拡張することが役に立つもう1つのケースは、複数のソースフォーマット(XMLやHTMLなど)がある場合です。XMLバージョンでは、 CDATA
を削除したいかも知れません。この例を次に示します。
from scrapy.loader.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata
class XmlProductLoader(ProductLoader):
name_in = MapCompose(remove_cdata, ProductLoader.name_in)
これが入力プロセッサを拡張する通常の方法です。
出力プロセッサに関しては、入力プロセッサとは違い、通常はフィールドにのみ依存し、サイトの特定のパースルールには依存しないので、フィールドメタデータで宣言する方が一般的です。 入力プロセッサおよび出力プロセッサの宣言 を参照してください。
Itemローダーを拡張、継承、オーバーライドする方法は他にもたくさんあります。また、プロジェクトごとに適したItemローダーの階層は異なるかもしれません。Scrapyはメカニズムを提供するだけです。ローダーの特定の組み立て方を強制するものではありません。それはあなたとプロジェクトのニーズによって変わります。
使用可能なビルトインプロセッサ¶
呼び出し可能ならばどんな関数も入出力プロセッサとして使用することができますが、Scrapyは一般的に使用されるいくつかのプロセッサを提供しています。これについては後述します。 MapCompose
(通常は入力プロセッサとして使用される)などは、最終的にパースされた値を生成するために、いくつかの関数を順番に実行して出力を構成します。
すべての組み込みプロセッサの一覧は次のとおりです。
-
class
scrapy.loader.processors.
Identity
¶ 最も単純なプロセッサ。何もしません。元の値をそのまま返します。コンストラクタ引数を受け取らず、ローダーコンテキストも受け入れません。
例:
>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']
-
class
scrapy.loader.processors.
TakeFirst
¶ 受け取った値から、nullでもなく空でもない最初の値を返します。したがって、通常、単一値フィールドの出力プロセッサとして使用されます。コンストラクタ引数を受け取らず、ローダーコンテキストも受け入れません。
例:
>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
-
class
scrapy.loader.processors.
Join
(separator=u' ')¶ コンストラクタで指定されたセパレータで連結された値を返します。デフォルトは
u' '
です。ローダーコンテキストは受け付けません。デフォルトのセパレータを使用する場合、このプロセッサは
u' '.join
と同等です。例:
>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
'one<br>two<br>three'
-
class
scrapy.loader.processors.
Compose
(*functions, **default_loader_context)¶ 与えられた関数の構成から合成されるプロセッサ。このプロセッサの各入力値が最初の関数に渡され、その関数の結果が2番目の関数に渡され、最後の関数の結果がこのプロセッサの出力値となります。
デフォルトでは
None
の値で処理を停止します。この動作は、キーワード引数stop_on_none=False
を渡すことで変更できます。例:
>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'
各関数は、オプションで
loader_context
パラメータを受け取ることができます。そうすると、このプロセッサは現在アクティブな ローダーコンテキスト をそのパラメータで渡します。コンストラクタで渡されるキーワード引数は、各関数に渡されるデフォルトのローダーコンテキスト値として使用されます。ただし、最終的に関数に渡されローダーコンテキスト値は、
ItemLoader.context()
属性を介してアクセス可能な現在アクティブなローダーコンテキストでオーバーライドされます。
-
class
scrapy.loader.processors.
MapCompose
(*functions, **default_loader_context)¶ Compose
プロセッサと同様に、与えられた関数の構成から合成されるプロセッサ。違いは、内部の結果が関数間で受け渡される方法です。このプロセッサの入力値が 反復 され、各要素に最初の関数が適用されます。これらの関数呼び出しの結果(各要素に1つずつ)は連結されて新しいiterableを生成し、次にそれを使用して2番目の関数を適用します。そうして収集されたリストの各値が、最後の関数まで適用されます。最後の関数の出力値は、このプロセッサの出力を生成するために連結されます。
それぞれの関数は値または値のリストを返すことができ、これは他の入力値に適用された同じ関数によって返される値のリストで平坦化されます。関数は
None
を返すこともできます。その場合、その関数の出力に続くチェーン処理は無視されます。このプロセッサは(iterableではなく)単一の値でしか機能しない関数を合成する便利な方法を提供します。このため、
MapCompose
プロセッサは通常、入力プロセッサとして使用されます。データは セレクタ のextract()
メソッドを使用して抽出されることが多く、Unicode文字列のリストを返すためです。どのように動作するかを明確にした例を示します。
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO', 'THIS', 'IS', 'SCRAPY']
Composeプロセッサと同様に、関数はローダーコンテキストを受け取ることができ、コンストラクタのキーワード引数はデフォルトのコンテキスト値として使用されます。詳細については、
Compose
プロセッサを参照してください。
-
class
scrapy.loader.processors.
SelectJmes
(json_path)¶ コンストラクタに渡されたJSONパスを使用して値を照会し、出力を返します。実行するにはjmespath (https://github.com/jmespath/jmespath.py) が必要です。このプロセッサは、一度に1つの入力のみを受け取ります。
例:
>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("foo")  # for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}
JSONの場合
>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("foo"))
>>> proc_single_json_str('{"foo": "bar"}')
'bar'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
>>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
['bar']
Scrapy shell¶
The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It's meant to be used for testing data extraction code, but you can actually use it for testing any kind of code as it is also a regular Python shell.
The shell is used for testing XPath or CSS expressions and seeing how they work and what data they extract from the web pages you're trying to scrape. It allows you to interactively test your expressions while you're writing your spider, without having to run the spider to test every change.
Once you get familiarized with the Scrapy shell, you'll see that it's an invaluable tool for developing and debugging your spiders.
Configuring the shell¶
If you have IPython installed, the Scrapy shell will use it (instead of the standard Python console). The IPython console is much more powerful and provides smart auto-completion and colorized output, among other things.
We highly recommend you install IPython, especially if you're working on Unix systems (where IPython excels). See the IPython installation guide for more info.
Scrapy also has support for bpython, and will try to use it where IPython is unavailable.
Through scrapy's settings you can configure it to use any one of
ipython
, bpython
or the standard python
shell, regardless of which
are installed. This is done by setting the SCRAPY_PYTHON_SHELL
environment
variable; or by defining it in your scrapy.cfg:
[settings]
shell = bpython
Launch the shell¶
To launch the Scrapy shell you can use the shell
command like
this:
scrapy shell <url>
Where the <url>
is the URL you want to scrape.
shell
also works for local files. This can be handy if you want
to play around with a local copy of a web page. shell
understands
the following syntaxes for local files:
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html
# File URI
scrapy shell file:///absolute/path/to/file.html
注釈
When using relative file paths, be explicit and prepend them
with ./
(or ../
when relevant).
scrapy shell index.html
will not work as one might expect (and
this is by design, not a bug).
Because shell
favors HTTP URLs over File URIs,
and index.html
being syntactically similar to example.com
,
shell
will treat index.html
as a domain name and trigger
a DNS lookup error:
$ scrapy shell index.html
[ ... scrapy shell starts ... ]
[ ... traceback ... ]
twisted.internet.error.DNSLookupError: DNS lookup failed:
address 'index.html' not found: [Errno -5] No address associated with hostname.
shell
will not test beforehand if a file called index.html
exists in the current directory. Again, be explicit.
Using the shell¶
The Scrapy shell is just a regular Python console (or IPython console if you have it available) which provides some additional shortcut functions for convenience.
Available Shortcuts¶
- shelp() - print a help with the list of available objects and shortcuts
- fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirections to not be followed by passing redirect=False
- fetch(request) - fetch a new response from the given request and update all related objects accordingly.
- view(response) - open the given response in your local web browser, for inspection. This will add a <base> tag to the response body in order for external links (such as images and style sheets) to display properly. Note, however, that this will create a temporary file in your computer, which won't be removed automatically.
Available Scrapy objects¶
The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the Response
object and the
Selector
objects (for both HTML and XML
content).
Those objects are:
- crawler - the current Crawler object.
- spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL
- request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut.
- response - a Response object containing the last fetched page
- settings - the current Scrapy settings
Example of shell session¶
Here's an example of a typical shell session where we start by scraping the https://scrapy.org page, and then proceed to scrape the https://reddit.com page. Finally, we modify the (Reddit) request method to POST and re-fetch it getting an error. We end the session by typing Ctrl-D (in Unix systems) or Ctrl-Z in Windows.
Keep in mind that the data extracted here may not be the same when you try it, as those pages are not static and could have changed by the time you test this. The only purpose of this example is to get you familiarized with how the Scrapy shell works.
First, we launch the shell:
scrapy shell 'https://scrapy.org' --nolog
Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and useful shortcuts (you'll notice that these lines
all start with the [s]
prefix):
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f07395dd690>
[s] item {}
[s] request <GET https://scrapy.org>
[s] response <200 https://scrapy.org/>
[s] settings <scrapy.settings.Settings object at 0x7f07395dd710>
[s] spider <DefaultSpider 'default' at 0x7f0735891690>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
After that, we can start playing with the objects:
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>> fetch("https://reddit.com")
>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'
>>> request = request.replace(method="POST")
>>> fetch(request)
>>> response.status
404
>>> from pprint import pprint
>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
'Cache-Control': ['max-age=0, must-revalidate'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
'Server': ['snooserv'],
'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
'Vary': ['accept-encoding'],
'Via': ['1.1 varnish'],
'X-Cache': ['MISS'],
'X-Cache-Hits': ['0'],
'X-Content-Type-Options': ['nosniff'],
'X-Frame-Options': ['SAMEORIGIN'],
'X-Moose': ['majestic'],
'X-Served-By': ['cache-cdg8730-CDG'],
'X-Timer': ['S1481214079.394283,VS0,VE159'],
'X-Ua-Compatible': ['IE=edge'],
'X-Xss-Protection': ['1; mode=block']}
>>>
Invoking the shell from spiders to inspect responses¶
Sometimes you want to inspect the responses that are being processed at a certain point of your spider, if only to check that the response you expect is getting there.
This can be achieved by using the scrapy.shell.inspect_response
function.
Here's an example of how you would call it from your spider:
import scrapy
class MySpider(scrapy.Spider):
name = "myspider"
start_urls = [
"http://example.com",
"http://example.org",
"http://example.net",
]
def parse(self, response):
# We want to inspect one specific response.
if ".org" in response.url:
from scrapy.shell import inspect_response
inspect_response(response, self)
# Rest of parsing code.
When you run the spider, you will get something similar to this:
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
Then, you can check if the extraction code is working:
>>> response.xpath('//h1[@class="fn"]')
[]
Nope, it doesn't. So you can open the response in your web browser and see if it's the response you were expecting:
>>> view(response)
True
Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Note that you can't use the fetch
shortcut here since the Scrapy engine is
blocked by the shell. However, after you leave the shell, the spider will
continue crawling where it stopped, as shown above.
Item Pipeline¶
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.
Each item pipeline component (sometimes referred to as just "Item Pipeline") is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
Writing your own item pipeline¶
Each item pipeline component is a Python class that must implement the following method:
-
process_item
(self, item, spider)¶ This method is called for every item pipeline component.
process_item()
must either: return a dict with data, return anItem
(or any descendant class) object, return a Twisted Deferred or raiseDropItem
exception. Dropped items are no longer processed by further pipeline components.パラメータ:
Additionally, they may also implement the following methods:
-
open_spider
(self, spider)¶ This method is called when the spider is opened.
パラメータ: spider ( Spider
object) -- the spider which was opened
-
close_spider
(self, spider)¶ This method is called when the spider is closed.
パラメータ: spider ( Spider
object) -- the spider which was closed
-
from_crawler
(cls, crawler)¶ If present, this classmethod is called to create a pipeline instance from a
Crawler
. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.パラメータ: crawler ( Crawler
object) -- crawler that uses this pipeline
Item pipeline example¶
Price validation and dropping items with no prices¶
Let's take a look at the following hypothetical pipeline that adjusts the
price
attribute for those items that do not include VAT
(price_excludes_vat
attribute), and drops those items which don't
contain a price:
from scrapy.exceptions import DropItem
class PricePipeline(object):
vat_factor = 1.15
def process_item(self, item, spider):
if item.get('price'):
if item.get('price_excludes_vat'):
item['price'] = item['price'] * self.vat_factor
return item
else:
raise DropItem("Missing price in %s" % item)
Write items to a JSON file¶
The following pipeline stores all scraped items (from all spiders) into a
single items.jl
file, containing one item per line serialized in JSON
format:
import json
class JsonWriterPipeline(object):
def open_spider(self, spider):
self.file = open('items.jl', 'w')
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
注釈
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
Write items to MongoDB¶
In this example we'll write items to MongoDB using pymongo. MongoDB address and database name are specified in Scrapy settings; MongoDB collection is named after item class.
The main point of this example is to show how to use from_crawler()
method and how to clean up the resources properly.:
import pymongo
class MongoPipeline(object):
collection_name = 'scrapy_items'
def __init__(self, mongo_uri, mongo_db):
self.mongo_uri = mongo_uri
self.mongo_db = mongo_db
@classmethod
def from_crawler(cls, crawler):
return cls(
mongo_uri=crawler.settings.get('MONGO_URI'),
mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
)
def open_spider(self, spider):
self.client = pymongo.MongoClient(self.mongo_uri)
self.db = self.client[self.mongo_db]
def close_spider(self, spider):
self.client.close()
def process_item(self, item, spider):
self.db[self.collection_name].insert_one(dict(item))
return item
Take screenshot of item¶
This example demonstrates how to return Deferred from process_item()
method.
It uses Splash to render a screenshot of the item URL. The pipeline
makes a request to a locally running instance of Splash. After the request is downloaded
and the Deferred callback fires, it saves the screenshot to a file and adds the filename to the item.
import scrapy
import hashlib
from urllib.parse import quote

class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
Duplicates filter¶
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Activating an Item Pipeline component¶
To activate an Item Pipeline component you must add its class to the
ITEM_PIPELINES
setting, like in the following example:
ITEM_PIPELINES = {
'myproject.pipelines.PricePipeline': 300,
'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It's customary to define these numbers in the 0-1000 range.
Feed exports¶
New in version 0.10.
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an "export file" with the scraped data (commonly called "export feed") to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the scraped items, using multiple serialization formats and storage backends.
Serialization formats¶
For serializing the scraped data, the feed exports use the Item exporters. These formats are supported out of the box: JSON, JSON lines, CSV, XML, Pickle and Marshal (see the sections below).
You can also extend the supported formats through the FEED_EXPORTERS setting.
JSON¶
- FEED_FORMAT: json
- Exporter used: JsonItemExporter
- See this warning if you're using JSON with large feeds.
JSON lines¶
- FEED_FORMAT: jsonlines
- Exporter used: JsonLinesItemExporter
CSV¶
- FEED_FORMAT: csv
- Exporter used: CsvItemExporter
- To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
XML¶
- FEED_FORMAT: xml
- Exporter used: XmlItemExporter
Pickle¶
- FEED_FORMAT: pickle
- Exporter used: PickleItemExporter
Marshal¶
- FEED_FORMAT: marshal
- Exporter used: MarshalItemExporter
Storages¶
When using the feed exports you define where to store the feed using a URI
(through the FEED_URI
setting). The feed exports support multiple
storage backend types which are defined by the URI scheme.
The storage backends supported out of the box are:
- Local filesystem
- FTP
- S3 (requires botocore or boto)
- Standard output
Some storage backends may be unavailable if the required external libraries are not available. For example, the S3 backend is only available if the botocore or boto library is installed (Scrapy supports boto only on Python 2).
Storage URI parameters¶
The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:
- %(time)s - gets replaced by a timestamp when the feed is being created
- %(name)s - gets replaced by the spider name
Any other named parameter gets replaced by the spider attribute of the same
name. For example, %(site_id)s
would get replaced by the spider.site_id
attribute the moment the feed is being created.
Here are some examples to illustrate:
- Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
- Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
Storage backends¶
Local filesystem¶
The feeds are stored in the local filesystem.
- URI scheme:
file
- Example URI:
file:///tmp/export.csv
- Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme if
you specify an absolute path like /tmp/export.csv
. This only works on Unix
systems though.
FTP¶
The feeds are stored in an FTP server.
- URI scheme:
ftp
- Example URI:
ftp://user:pass@ftp.example.com/path/to/export.csv
- Required external libraries: none
S3¶
The feeds are stored on Amazon S3.
- URI scheme: s3
- Required external libraries: botocore or boto
The AWS credentials can be passed as user/password in the URI, or they can be passed through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings.
Standard output¶
The feeds are written to the standard output of the Scrapy process.
- URI scheme:
stdout
- Example URI:
stdout:
- Required external libraries: none
Settings¶
These are the settings used for configuring the feed exports:
FEED_URI¶
Default: None
The URI of the export feed. See Storage backends for supported URI schemes.
This setting is required for enabling the feed exports.
FEED_FORMAT¶
The serialization format to be used for the feed. See Serialization formats for possible values.
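For instance, a minimal settings.py sketch that enables the feed exports by combining these two settings (the URI and format values below are placeholders, not defaults):
FEED_URI = 'file:///tmp/export.json'   # any storage backend URI listed under Storages
FEED_FORMAT = 'json'                   # any of the serialization formats listed above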
FEED_EXPORT_ENCODING¶
Default: None
The encoding to be used for the feed.
If unset or set to None
(default) it uses UTF-8 for everything except JSON output,
which uses safe numeric encoding (\uXXXX
sequences) for historic reasons.
Use utf-8
if you want UTF-8 for JSON too.
FEED_EXPORT_FIELDS¶
Default: None
A list of fields to export, optional.
Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]
.
Use the FEED_EXPORT_FIELDS option to define the fields to export and their order.
When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses the fields defined in dicts or Item subclasses a spider is yielding.
If an exporter requires a fixed set of fields (this is the case for CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer field names from the exported data - currently it uses field names from the first item.
FEED_EXPORT_INDENT¶
Default: 0
Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT
is a non-negative integer, then array elements and object members will be pretty-printed
with that indent level. An indent level of 0
(the default), or negative,
will put each item on a new line. None
selects the most compact representation.
Currently implemented only by JsonItemExporter
and XmlItemExporter
, i.e. when you are exporting
to .json
or .xml
.
FEED_STORAGES¶
Default: {}
A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.
FEED_STORAGES_BASE¶
Default:
{
'': 'scrapy.extensions.feedexport.FileFeedStorage',
'file': 'scrapy.extensions.feedexport.FileFeedStorage',
'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
's3': 'scrapy.extensions.feedexport.S3FeedStorage',
'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
A dict containing the built-in feed storage backends supported by Scrapy. You
can disable any of these backends by assigning None
to their URI scheme in
FEED_STORAGES
. E.g., to disable the built-in FTP storage backend
(without replacement), place this in your settings.py
:
FEED_STORAGES = {
'ftp': None,
}
FEED_EXPORTERS¶
Default: {}
A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.
FEED_EXPORTERS_BASE¶
Default:
{
'json': 'scrapy.exporters.JsonItemExporter',
'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
'jl': 'scrapy.exporters.JsonLinesItemExporter',
'csv': 'scrapy.exporters.CsvItemExporter',
'xml': 'scrapy.exporters.XmlItemExporter',
'marshal': 'scrapy.exporters.MarshalItemExporter',
'pickle': 'scrapy.exporters.PickleItemExporter',
}
A dict containing the built-in feed exporters supported by Scrapy. You can
disable any of these exporters by assigning None
to their serialization
format in FEED_EXPORTERS
. E.g., to disable the built-in CSV exporter
(without replacement), place this in your settings.py
:
FEED_EXPORTERS = {
'csv': None,
}
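Conversely, you can point a serialization format at your own exporter class to add or replace a format. A minimal sketch, where the module path myproject.exporters.SemicolonCsvItemExporter is a hypothetical example:
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.SemicolonCsvItemExporter',
}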
Requests and Responses¶
Scrapy uses Request
and Response
objects for crawling web
sites.
Typically, Request
objects are generated in the spiders and pass
across the system until they reach the Downloader, which executes the request
and returns a Response
object which travels back to the spider that
issued the request.
Both Request
and Response
classes have subclasses which add
functionality not required in the base classes. These are described
below in Request subclasses and
Response subclasses.
Request objects¶
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])¶
A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.
Parameters:
- url (string) -- the URL of this request
- callback (callable) -- the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
- method (string) -- the HTTP method of this request. Defaults to 'GET'.
- meta (dict) -- the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
- body (str or unicode) -- the request body. If a unicode is passed, then it's encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
- headers (dict) -- the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
- cookies (dict or list) -- the request cookies. These can be sent in two forms.
  Using a dict:
  request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'})
  Using a list of dicts:
  request_with_cookies = Request(url="http://www.example.com", cookies=[{'name': 'currency', 'value': 'USD', 'domain': 'example.com', 'path': '/currency'}])
  The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.
  When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in Request.meta.
  Example of a request without merging cookies:
  request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'}, meta={'dont_merge_cookies': True})
  For more info see CookiesMiddleware.
- encoding (string) -- the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to str (if given as unicode).
- priority (int) -- the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
- dont_filter (boolean) -- indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
- errback (callable) -- a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
- flags (list) -- flags sent to the request, can be used for logging or similar purposes.
url¶
A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.
This attribute is read-only. To change the URL of a Request use replace().
method¶
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.
headers¶
A dictionary-like object which contains the request headers.
body¶
A str that contains the request body.
This attribute is read-only. To change the body of a Request use replace().
meta¶
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
copy()¶
Return a new Request which is a copy of this Request. See also: Passing additional data to callback functions.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])¶
Return a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
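As a small illustrative sketch (not part of the reference above), replace() is often used inside a callback to re-issue a request with one attribute changed:
def parse(self, response):
    # Re-schedule the same request, this time bypassing the duplicates filter.
    yield response.request.replace(dont_filter=True)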
Passing additional data to callback functions¶
The callback of a request is a function that will be called when the response
of that request is downloaded. The callback function will be called with the
downloaded Response
object as its first argument.
Example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may be interested in passing arguments to those callback
functions so you can receive the arguments later, in the second callback. You
can use the Request.meta
attribute for that.
Here's an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Using errbacks to catch exceptions in request processing¶
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Twisted Failure instance as first parameter and can be used to track connection establishment timeouts, DNS errors etc.
Here's an example spider logging all errors and catching some specific errors if needed:
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Request.meta special keys¶
The Request.meta
attribute can contain any arbitrary data, but there
are some special keys recognized by Scrapy and its built-in extensions.
Those are:
- dont_redirect
- dont_retry
- handle_httpstatus_list
- handle_httpstatus_all
- dont_merge_cookies
- cookiejar
- dont_cache
- redirect_urls
- bindaddress
- dont_obey_robotstxt
- download_timeout
- download_maxsize
- download_latency
- download_fail_on_dataloss
- proxy
- ftp_user (See FTP_USER for more info)
- ftp_password (See FTP_PASSWORD for more info)
- referrer_policy
- max_retry_times
bindaddress¶
The outgoing IP address to use for performing the request.
download_timeout¶
The amount of time (in secs) that the downloader will wait before timing out.
See also: DOWNLOAD_TIMEOUT
.
download_latency¶
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss¶
Whether or not to fail on broken responses. See:
DOWNLOAD_FAIL_ON_DATALOSS
.
max_retry_times¶
This meta key is used to set the maximum retry times per request. When set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting.
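As an illustration of these meta keys, here is a minimal sketch of a request that sets a per-request timeout, proxy and retry budget; the URL, callback name and proxy address are placeholders:
def parse(self, response):
    yield scrapy.Request(
        'http://www.example.com/slow-page',
        callback=self.parse_slow_page,
        meta={
            'download_timeout': 30,            # per-request timeout, overrides DOWNLOAD_TIMEOUT
            'max_retry_times': 5,              # per-request retry budget, overrides RETRY_TIMES
            'proxy': 'http://127.0.0.1:8123',  # route this request through a proxy
        },
    )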
Request subclasses¶
Here is the list of built-in Request
subclasses. You can also subclass
it to implement your own custom functionality.
FormRequest objects¶
The FormRequest class extends the base Request
with functionality for
dealing with HTML forms. It uses lxml.html forms to pre-populate form
fields with form data from Response
objects.
class scrapy.http.FormRequest(url[, formdata, ...])¶
The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.
Parameters: formdata (dict or iterable of tuples) -- is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.
The FormRequest objects support the following class method in addition to the standard Request methods:
classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])¶
Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example see Using FormRequest.from_response() to simulate a user login.
The policy is to automatically simulate a click, by default, on any form control that looks clickable, like a <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using javascript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can also use the clickdata argument.
Caution
Using this method with select elements which have leading or trailing whitespace in the option values will not work due to a bug in lxml, which should be fixed in lxml 3.8 and above.
Parameters:
- response (Response object) -- the response containing a HTML form which will be used to pre-populate the form fields
- formname (string) -- if given, the form with name attribute set to this value will be used.
- formid (string) -- if given, the form with id attribute set to this value will be used.
- formxpath (string) -- if given, the first form that matches the xpath will be used.
- formcss (string) -- if given, the first form that matches the css selector will be used.
- formnumber (integer) -- the number of form to use, when the response contains multiple forms. The first one (and also the default) is 0.
- formdata (dict) -- fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter. If a value passed in this parameter is None, the field will not be included in the request, even if it was present in the response <form> element.
- clickdata (dict) -- attributes to lookup the control clicked. If it's not given, the form data will be submitted simulating a click on the first clickable element. In addition to html attributes, the control can be identified by its zero-based index relative to other submittable inputs inside the form, via the nr attribute.
- dont_click (boolean) -- If True, the form data will be submitted without clicking in any element.
The other parameters of this class method are passed directly to the FormRequest constructor.
New in version 0.10.3: The formname parameter.
New in version 0.17: The formxpath parameter.
New in version 1.1.0: The formcss parameter.
New in version 1.1.0: The formid parameter.
Request usage examples¶
Using FormRequest to send data via HTTP POST¶
If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login¶
It is usual for web sites to provide pre-populated form fields through <input
type="hidden">
elements, such as session related data or authentication
tokens (for login pages). When scraping, you'll want these fields to be
automatically pre-populated and only override a couple of them, such as the
user name and password. You can use the FormRequest.from_response()
method for this job. Here's an example spider which uses it:
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
Response objects¶
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])¶
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
Parameters:
- url (string) -- the URL of this response
- status (integer) -- the HTTP status of the response. Defaults to 200.
- headers (dict) -- the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
- body (bytes) -- the response body. To access the decoded text as str (unicode in Python 2) you can use response.text from an encoding-aware Response subclass, such as TextResponse.
- flags (list) -- is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
- request (Request object) -- the initial value of the Response.request attribute. This represents the Request that generated this response.
url¶
A string containing the URL of the response.
This attribute is read-only. To change the URL of a Response use replace().
status¶
An integer representing the HTTP status of the response. Example: 200, 404.
headers¶
A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:
response.headers.getlist('Set-Cookie')
body¶
The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
This attribute is read-only. To change the body of a Response use replace().
request¶
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:
- HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
- Response.request.url doesn't always equal Response.url.
- This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
meta¶
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).
Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
See also: the Request.meta attribute.
flags¶
A list that contains flags for this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They're shown on the string representation of the Response (__str__ method), which is used by the engine for logging.
copy()¶
Returns a new Response which is a copy of this Response.
replace([url, status, headers, body, request, flags, cls])¶
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
urljoin(url)¶
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:
urlparse.urljoin(response.url, url)
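For example, a short sketch of urljoin() inside a spider callback (the CSS selector and relative path are placeholders):
def parse(self, response):
    next_href = response.css('li.next a::attr(href)').get()  # e.g. '/page/2/'
    if next_href is not None:
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse)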
Response subclasses¶
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.
TextResponse objects¶
class scrapy.http.TextResponse(url[, encoding[, ...]])¶
TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.
TextResponse objects support a new constructor argument, in addition to the base Response objects. The remaining functionality is the same as for the Response class and is not documented here.
Parameters: encoding (string) -- is a string which contains the encoding to use for this response. If you create a TextResponse object with a unicode body, it will be encoded using this encoding (remember the body attribute is always a string). If encoding is None (default value), the encoding will be looked up in the response headers and body instead.
TextResponse objects support the following attributes in addition to the standard Response ones:
text¶
Response body, as unicode.
The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note
unicode(response.body) is not a correct way to convert the response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
encoding¶
A string with the encoding of this response. The encoding is resolved by trying the following mechanisms, in order:
- the encoding passed in the constructor encoding argument
- the encoding declared in the Content-Type HTTP header. If this encoding is not valid (ie. unknown), it is ignored and the next resolution mechanism is tried.
- the encoding declared in the response body. The TextResponse class doesn't provide any special functionality for this. However, the HtmlResponse and XmlResponse classes do.
- the encoding inferred by looking at the response body. This is the more fragile method but also the last one tried.
selector¶
A Selector instance using the response as target. The selector is lazily instantiated on first access.
TextResponse objects support the following methods in addition to the standard Response ones:
xpath(query)¶
A shortcut to TextResponse.selector.xpath(query):
response.xpath('//p')
css(query)¶
A shortcut to TextResponse.selector.css(query):
response.css('p')
HtmlResponse objects¶
class scrapy.http.HtmlResponse(url[, ...])¶
The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute. See TextResponse.encoding.
XmlResponse objects¶
class scrapy.http.XmlResponse(url[, ...])¶
The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the XML declaration line. See TextResponse.encoding.
Link Extractors¶
Link extractors are objects whose only purpose is to extract links from web
pages (scrapy.http.Response
objects) which will be eventually
followed.
There is scrapy.linkextractors.LinkExtractor
available
in Scrapy, but you can create your own custom Link Extractors to suit your
needs by implementing a simple interface.
The only public method that every link extractor has is extract_links
,
which receives a Response
object and returns a list
of scrapy.link.Link
objects. Link extractors are meant to be
instantiated once and their extract_links
method called several times
with different responses to extract links to follow.
Link extractors are used in the CrawlSpider
class (available in Scrapy), through a set of rules, but you can also use it in
your spiders, even if you don't subclass from
CrawlSpider
, as its purpose is very simple: to
extract links.
Built-in link extractors reference¶
Link extractors classes bundled with Scrapy are provided in the
scrapy.linkextractors
module.
The default link extractor is LinkExtractor
, which is the same as
LxmlLinkExtractor
:
from scrapy.linkextractors import LinkExtractor
There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
LxmlLinkExtractor¶
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)¶
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml's robust HTMLParser. (A usage sketch follows the parameter list below.)
Parameters:
- allow (a regular expression (or list of)) -- a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of)) -- a single regular expression (or list of regular expressions)
that the (absolute) urls must match in order to be excluded (ie. not
extracted). It has precedence over the
allow
parameter. If not given (or empty) it won't exclude any links. - allow_domains (str or list) -- a single value or a list of strings containing domains which will be considered for extracting the links
- deny_domains (str or list) -- a single value or a list of strings containing domains which won't be considered for extracting the links
- deny_extensions (list) -- a single value or list of strings containing
extensions that should be ignored when extracting links.
If not given, it will default to the
IGNORED_EXTENSIONS
list defined in the scrapy.linkextractors package. - restrict_xpaths (str or list) -- is an XPath (or list of XPath's) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links. See examples below.
- restrict_css (str or list) -- a CSS selector (or list of selectors) which defines
regions inside the response where links should be extracted from.
Has the same behaviour as
restrict_xpaths
. - tags (str or list) -- a tag or a list of tags to consider when extracting links.
Defaults to
('a', 'area')
. - attrs (list) -- an attribute or list of attributes which should be considered when looking
for links to extract (only for those tags specified in the
tags
parameter). Defaults to('href',)
- canonicalize (boolean) -- canonicalize each extracted url (using
w3lib.url.canonicalize_url). Defaults to
False
. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you're using LinkExtractor to follow links it is more robust to keep the defaultcanonicalize=False
. - unique (boolean) -- whether duplicate filtering should be applied to extracted links.
- process_value (callable) -- a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
  For example, to extract links from this code:
  <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
  You can use the following function in process_value:
  def process_value(value):
      m = re.search("javascript:goToPage\('(.*?)'", value)
      if m:
          return m.group(1)
- strip (boolean) -- whether to strip whitespaces from extracted attributes.
According to HTML5 standard, leading and trailing whitespaces
must be stripped from
href
attributes of<a>
,<area>
and many other elements,src
attribute of<img>
,<iframe>
elements, etc., so LinkExtractor strips space chars by default. Setstrip=False
to turn it off (e.g. if you're extracting urls from elements or attributes which allow leading/trailing whitespaces).
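Putting it together, here is the usage sketch referred to above: a link extractor instantiated once and reused across responses inside a spider (the allow pattern, restrict_css selector and parse_item callback are placeholders):
import scrapy
from scrapy.linkextractors import LinkExtractor

class CategorySpider(scrapy.Spider):
    name = 'categories'
    start_urls = ['http://www.example.com/']
    # Instantiated once, then its extract_links() method is reused per response.
    link_extractor = LinkExtractor(allow=r'/category/', restrict_css='div.listing')

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # each link is a scrapy.link.Link object carrying .url and .text
            yield scrapy.Request(link.url, callback=self.parse_item)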
Settings¶
The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).
For a list of available built-in settings see: Built-in settings reference.
Designating the settings¶
When you use Scrapy, you have to tell it which settings you're using. You can
do this by using an environment variable, SCRAPY_SETTINGS_MODULE
.
The value of SCRAPY_SETTINGS_MODULE
should be in Python path syntax, e.g.
myproject.settings
. Note that the settings module should be on the
Python import search path.
Populating the settings¶
Settings can be populated using different mechanisms, each of which has a different precedence. Here is the list of them in decreasing order of precedence:
- Command line options (highest precedence)
- Settings per-spider
- Project settings module
- Default settings per-command
- Default global settings (lowest precedence)
The population of these settings sources is taken care of internally, but a manual handling is possible using API calls. See the Settings API topic for reference.
These mechanisms are described in more detail below.
1. Command line options¶
Arguments provided by the command line are the ones that take most precedence,
overriding any other options. You can explicitly override one (or more)
settings using the -s
(or --set
) command line option.
Example:
scrapy crawl myspider -s LOG_FILE=scrapy.log
2. Settings per-spider¶
Spiders (See the Spider chapter for reference) can define their
own settings that will take precedence and override the project ones. They can
do so by setting their custom_settings
attribute:
class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
3. Project settings module¶
The project settings module is the standard configuration file for your Scrapy
project; it's where most of your custom settings will be populated. For a
standard Scrapy project, this means you'll be adding or changing the settings
in the settings.py
file created for your project.
4. Default settings per-command¶
Each Scrapy tool command can have its own default
settings, which override the global default settings. Those custom command
settings are specified in the default_settings
attribute of the command
class.
5. Default global settings¶
The global defaults are located in the scrapy.settings.default_settings
module and documented in the Built-in settings reference section.
How to access settings¶
In a spider, the settings are available through self.settings
:
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
Note
The settings
attribute is set in the base Spider class after the spider
is initialized. If you want to use the settings before the initialization
(e.g., in your spider's __init__()
method), you'll need to override the
from_crawler()
method.
Settings can be accessed through the scrapy.crawler.Crawler.settings
attribute of the Crawler that is passed to from_crawler
method in
extensions, middlewares and item pipelines:
class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
The settings object can be used like a dict (e.g.,
settings['LOG_ENABLED']
), but it's usually preferred to extract the setting
in the format you need it to avoid type errors, using one of the methods
provided by the Settings
API.
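A brief sketch of those typed getters in practice (the setting names are real; the surrounding fragment, e.g. running inside from_crawler(), is illustrative):
settings = crawler.settings
delay = settings.getfloat('DOWNLOAD_DELAY')                  # '0.25' -> 0.25
log_enabled = settings.getbool('LOG_ENABLED')                # 'False'/'0' -> False
max_requests = settings.getint('CONCURRENT_REQUESTS', 16)    # with a fallback default
fields = settings.getlist('FEED_EXPORT_FIELDS')              # comma-separated string -> list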
Rationale for setting names¶
Setting names are usually prefixed with the component that they configure. For
example, proper setting names for a fictional robots.txt extension would be
ROBOTSTXT_ENABLED
, ROBOTSTXT_OBEY
, ROBOTSTXT_CACHEDIR
, etc.
Built-in settings reference¶
Here's a list of all available Scrapy settings, in alphabetical order, along with their default values and the scope where they apply.
The scope, where available, shows where the setting is being used, if it's tied to any particular component. In that case the module of that component will be shown, typically an extension, middleware or pipeline. It also means that the component must be enabled in order for the setting to have any effect.
AWS_ACCESS_KEY_ID¶
Default: None
The AWS access key used by code that requires access to Amazon Web services, such as the S3 feed storage backend.
AWS_SECRET_ACCESS_KEY¶
Default: None
The AWS secret key used by code that requires access to Amazon Web services, such as the S3 feed storage backend.
AWS_ENDPOINT_URL¶
Default: None
Endpoint URL used for S3-like storage, for example Minio or s3.scality.
Only supported with botocore
library.
AWS_USE_SSL¶
Default: None
Use this option if you want to disable SSL connection for communication with
S3 or S3-like storage. By default SSL will be used.
Only supported with botocore
library.
AWS_VERIFY¶
Default: None
Verify SSL connection between Scrapy and S3 or S3-like storage. By default
SSL verification will occur. Only supported with botocore
library.
AWS_REGION_NAME¶
Default: None
The name of the region associated with the AWS client.
Only supported with botocore
library.
BOT_NAME¶
Default: 'scrapybot'
The name of the bot implemented by this Scrapy project (also known as the project name). This will be used to construct the User-Agent by default, and also for logging.
It's automatically populated with your project name when you create your
project with the startproject
command.
CONCURRENT_ITEMS¶
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
CONCURRENT_REQUESTS¶
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
CONCURRENT_REQUESTS_PER_DOMAIN¶
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its
AUTOTHROTTLE_TARGET_CONCURRENCY
option.
CONCURRENT_REQUESTS_PER_IP¶
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be
performed to any single IP. If non-zero, the
CONCURRENT_REQUESTS_PER_DOMAIN
setting is ignored, and this one is
used instead. In other words, concurrency limits will be applied per IP, not
per domain.
This setting also affects DOWNLOAD_DELAY
and
AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP
is non-zero, download delay is enforced per IP, not per domain.
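As a hedged sketch, the settings below throttle per IP rather than per domain (the values are illustrative, not defaults):
CONCURRENT_REQUESTS_PER_IP = 4   # non-zero: the per-IP limit applies and the per-domain limit is ignored
DOWNLOAD_DELAY = 0.5             # with a non-zero per-IP setting, this delay is enforced per IP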
DEFAULT_ITEM_CLASS¶
Default: 'scrapy.item.Item'
The default class that will be used for instantiating items in the Scrapy shell.
DEFAULT_REQUEST_HEADERS¶
Default:
{
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
The default headers used for Scrapy HTTP Requests. They're populated in the
DefaultHeadersMiddleware
.
DEPTH_LIMIT¶
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_PRIORITY¶
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
An integer that is used to adjust the request priority based on its depth:
- if zero (default), no priority adjustment is made from depth
- a positive value will decrease the priority, i.e. higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO)
- a negative value will increase priority, i.e. higher depth requests will be processed sooner (DFO)
See also: Does Scrapy crawl in breadth-first or depth-first order? about tuning Scrapy for BFO or DFO.
Note
This setting adjusts priority in the opposite way compared to
other priority settings REDIRECT_PRIORITY_ADJUST
and RETRY_PRIORITY_ADJUST
.
DEPTH_STATS_VERBOSE¶
Default: False
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
Whether to collect verbose depth stats. If this is enabled, the number of requests for each depth is collected in the stats.
DOWNLOADER_HTTPCLIENTFACTORY¶
Default: 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
Defines a Twisted protocol.ClientFactory
class to use for HTTP/1.0
connections (for HTTP10DownloadHandler
).
Note
HTTP/1.0 is rarely used nowadays so you can safely ignore this setting,
unless you use Twisted<11.1, or if you really want to use HTTP/1.0
and override DOWNLOAD_HANDLERS_BASE
for http(s)
scheme
accordingly, i.e. to
'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler'
.
DOWNLOADER_CLIENTCONTEXTFACTORY¶
Default: 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
Represents the classpath to the ContextFactory to use.
Here, "ContextFactory" is a Twisted term for SSL/TLS contexts, defining the TLS/SSL protocol version to use, whether to do certificate verification, or even enable client-side authentication (and various other things).
Note
Scrapy default context factory does NOT perform remote server certificate verification. This is usually fine for web scraping.
If you do need remote server certificate verification enabled,
Scrapy also has another context factory class that you can set,
'scrapy.core.downloader.contextfactory.BrowserLikeContextFactory'
,
which uses the platform's certificates to validate remote endpoints.
This is only available if you use Twisted>=14.0.
If you do use a custom ContextFactory, make sure it accepts a method
parameter at init (this is the OpenSSL.SSL
method mapping
DOWNLOADER_CLIENT_TLS_METHOD
).
DOWNLOADER_CLIENT_TLS_METHOD¶
Default: 'TLS'
Use this setting to customize the TLS/SSL method used by the default HTTP/1.1 downloader.
This setting must be one of these string values:
- 'TLS': maps to OpenSSL's TLS_method() (a.k.a SSLv23_method()), which allows protocol negotiation, starting from the highest supported by the platform; default, recommended
- 'TLSv1.0': this value forces HTTPS connections to use TLS version 1.0; set this if you want the behavior of Scrapy<1.1
- 'TLSv1.1': forces TLS version 1.1
- 'TLSv1.2': forces TLS version 1.2
- 'SSLv3': forces SSL version 3 (not recommended)
Note
We recommend that you use PyOpenSSL>=0.13 and Twisted>=0.13 or above (Twisted>=14.0 if you can).
DOWNLOADER_MIDDLEWARES¶
Default: {}
A dict containing the downloader middlewares enabled in your project, and their orders. For more info see Activating a downloader middleware.
DOWNLOADER_MIDDLEWARES_BASE¶
Default:
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
A dict containing the downloader middlewares enabled by default in Scrapy. Low
orders are closer to the engine, high orders are closer to the downloader. You
should never modify this setting in your project, modify
DOWNLOADER_MIDDLEWARES
instead. For more info see
Activating a downloader middleware.
DOWNLOAD_DELAY¶
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY
setting (which is enabled by default). By default, Scrapy doesn't wait a fixed
amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY
and 1.5 * DOWNLOAD_DELAY
.
When CONCURRENT_REQUESTS_PER_IP
is non-zero, delays are enforced
per ip address instead of per domain.
You can also change this setting per spider by setting download_delay
spider attribute.
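For example, a minimal sketch of a spider overriding the delay only for itself via the download_delay attribute (the spider name and value are placeholders):
import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite'
    download_delay = 1.5   # seconds between requests to the same site, overrides DOWNLOAD_DELAY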
DOWNLOAD_HANDLERS¶
Default: {}
A dict containing the request downloader handlers enabled in your project.
See DOWNLOAD_HANDLERS_BASE
for example format.
DOWNLOAD_HANDLERS_BASE¶
Default:
{
'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
A dict containing the request download handlers enabled by default in Scrapy.
You should never modify this setting in your project, modify
DOWNLOAD_HANDLERS
instead.
You can disable any of these download handlers by assigning None
to their
URI scheme in DOWNLOAD_HANDLERS
. E.g., to disable the built-in FTP
handler (without replacement), place this in your settings.py
:
DOWNLOAD_HANDLERS = {
'ftp': None,
}
DOWNLOAD_TIMEOUT¶
Default: 180
The amount of time (in secs) that the downloader will wait before timing out.
Note
This timeout can be set per spider using download_timeout
spider attribute and per-request using download_timeout
Request.meta key.
DOWNLOAD_MAXSIZE¶
Default: 1073741824 (1024MB)
The maximum response size (in bytes) that downloader will download.
If you want to disable it set to 0.
Note
This size can be set per spider using download_maxsize
spider attribute and per-request using download_maxsize
Request.meta key.
This feature needs Twisted >= 11.1.
DOWNLOAD_WARNSIZE¶
Default: 33554432 (32MB)
The response size (in bytes) that downloader will start to warn.
If you want to disable it set to 0.
Note
This size can be set per spider using download_warnsize
spider attribute and per-request using download_warnsize
Request.meta key.
This feature needs Twisted >= 11.1.
DOWNLOAD_FAIL_ON_DATALOSS¶
Default: True
Whether or not to fail on broken responses, that is, declared
Content-Length
does not match content sent by the server or chunked
response was not properly finished. If True
, these responses raise a
ResponseFailed([_DataLoss])
error. If False
, these responses
are passed through and the flag dataloss
is added to the response, i.e.:
'dataloss' in response.flags
is True
.
Optionally, this can be set per-request basis by using the
download_fail_on_dataloss
Request.meta key to False
.
Note
A broken response, or data loss error, may happen under several
circumstances, from server misconfiguration to network errors to data
corruption. It is up to the user to decide if it makes sense to process
broken responses considering they may contain partial or incomplete content.
If RETRY_ENABLED
is True
and this setting is set to True
,
the ResponseFailed([_DataLoss])
failure will be retried as usual.
DUPEFILTER_CLASS¶
Default: 'scrapy.dupefilters.RFPDupeFilter'
The class used to detect and filter duplicate requests.
The default (RFPDupeFilter
) filters based on request fingerprint using
the scrapy.utils.request.request_fingerprint
function. In order to change
the way duplicates are checked you could subclass RFPDupeFilter
and
override its request_fingerprint
method. This method should accept
scrapy Request
object and return its fingerprint
(a string).
You can disable filtering of duplicate requests by setting
DUPEFILTER_CLASS
to 'scrapy.dupefilters.BaseDupeFilter'
.
Be very careful about this however, because you can get into crawling loops.
It's usually a better idea to set the dont_filter
parameter to
True
on the specific Request
that should not be
filtered.
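As a hedged sketch of the subclassing approach described above (the class name and module path are made up for illustration), here is a filter that ignores the query string when computing fingerprints:
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class QueryAgnosticDupeFilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        # Fingerprint the request as if it carried no query string.
        bare_url = request.url.split('?', 1)[0]
        return request_fingerprint(request.replace(url=bare_url))
It would then be enabled with DUPEFILTER_CLASS = 'myproject.dupefilters.QueryAgnosticDupeFilter' (a hypothetical path).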
DUPEFILTER_DEBUG¶
Default: False
By default, RFPDupeFilter
only logs the first duplicate request.
Setting DUPEFILTER_DEBUG
to True
will make it log all duplicate requests.
EDITOR¶
Default: vi
(on Unix systems) or the IDLE editor (on Windows)
The editor to use for editing spiders with the edit
command.
Additionally, if the EDITOR
environment variable is set, the edit
command will prefer it over the default setting.
EXTENSIONS¶
Default: {}
A dict containing the extensions enabled in your project, and their orders.
EXTENSIONS_BASE¶
Default:
{
'scrapy.extensions.corestats.CoreStats': 0,
'scrapy.extensions.telnet.TelnetConsole': 0,
'scrapy.extensions.memusage.MemoryUsage': 0,
'scrapy.extensions.memdebug.MemoryDebugger': 0,
'scrapy.extensions.closespider.CloseSpider': 0,
'scrapy.extensions.feedexport.FeedExporter': 0,
'scrapy.extensions.logstats.LogStats': 0,
'scrapy.extensions.spiderstate.SpiderState': 0,
'scrapy.extensions.throttle.AutoThrottle': 0,
}
A dict containing the extensions available by default in Scrapy, and their orders. This setting contains all stable built-in extensions. Keep in mind that some of them need to be enabled through a setting.
For more information See the extensions user guide and the list of available extensions.
FEED_TEMPDIR¶
The Feed Temp dir allows you to set a custom folder to save crawler temporary files before uploading with FTP feed storage and Amazon S3.
FTP_PASSWORD¶
Default: "guest"
The password to use for FTP connections when there is no "ftp_password"
in Request
meta.
Note
Paraphrasing RFC 1635, although it is common to use either the password "guest" or one's e-mail address for anonymous FTP, some FTP servers explicitly ask for the user's e-mail address and will not allow login with the "guest" password.
FTP_USER¶
Default: "anonymous"
The username to use for FTP connections when there is no "ftp_user"
in Request
meta.
ITEM_PIPELINES¶
Default: {}
A dict containing the item pipelines to use, and their orders. Order values are arbitrary, but it is customary to define them in the 0-1000 range. Lower orders process before higher orders.
Example:
ITEM_PIPELINES = {
'mybot.pipelines.validate.ValidateMyItem': 300,
'mybot.pipelines.validate.StoreMyItem': 800,
}
ITEM_PIPELINES_BASE¶
Default: {}
A dict containing the pipelines enabled by default in Scrapy. You should never
modify this setting in your project, modify ITEM_PIPELINES
instead.
LOG_FORMAT¶
Default: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
String for formatting log messages. Refer to the Python logging documentation for the whole list of available placeholders.
LOG_DATEFORMAT¶
Default: '%Y-%m-%d %H:%M:%S'
String for formatting date/time, expansion of the %(asctime)s
placeholder
in LOG_FORMAT
. Refer to the Python datetime documentation for the whole list of available
directives.
LOG_LEVEL¶
Default: 'DEBUG'
Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG. For more info see Logging.
LOG_STDOUT¶
Default: False
If True
, all standard output (and error) of your process will be redirected
to the log. For example if you print('hello')
it will appear in the Scrapy
log.
LOG_SHORT_NAMES¶
Default: False
If True
, the logs will just contain the root path. If it is set to False
then it displays the component responsible for the log output.
MEMDEBUG_NOTIFY¶
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.
Example:
MEMDEBUG_NOTIFY = ['user@example.com']
MEMUSAGE_ENABLED¶
Default: True
Scope: scrapy.extensions.memusage
Whether to enable the memory usage extension. This extension keeps track of
the peak memory used by the process (it writes it to stats). It can also
optionally shutdown the Scrapy process when it exceeds a memory limit
(see MEMUSAGE_LIMIT_MB
), and notify by email when that happened
(see MEMUSAGE_NOTIFY_MAIL
).
MEMUSAGE_LIMIT_MB¶
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED is True). If zero, no check will be performed.
MEMUSAGE_CHECK_INTERVAL_SECONDS¶
New in version 1.1.
Default: 60.0
Scope: scrapy.extensions.memusage
The Memory usage extension
checks the current memory usage, versus the limits set by
MEMUSAGE_LIMIT_MB
and MEMUSAGE_WARNING_MB
,
at fixed time intervals.
This sets the length of these intervals, in seconds.
MEMUSAGE_NOTIFY_MAIL¶
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_WARNING_MB¶
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero, no warning will be produced.
NEWSPIDER_MODULE¶
Default: ''
Module where to create new spiders using the genspider
command.
Example:
NEWSPIDER_MODULE = 'mybot.spiders_dev'
RANDOMIZE_DOWNLOAD_DELAY¶
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY
and 1.5 * DOWNLOAD_DELAY
) while fetching requests from the same
website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait
option.
If DOWNLOAD_DELAY
is zero (default) this option has no effect.
REACTOR_THREADPOOL_MAXSIZE¶
Default: 10
The maximum limit for Twisted Reactor thread pool size. This is common multi-purpose thread pool used by various Scrapy components. Threaded DNS Resolver, BlockingFeedStorage, S3FilesStore just to name a few. Increase this value if you're experiencing problems with insufficient blocking IO.
REDIRECT_MAX_TIMES¶
Default: 20
Defines the maximum number of times a request can be redirected. After this maximum, the request's response is returned as is. The default matches the value Firefox uses for the same task.
REDIRECT_PRIORITY_ADJUST¶
Default: +2
Scope: scrapy.downloadermiddlewares.redirect.RedirectMiddleware
Adjust redirect request priority relative to original request:
- a positive priority adjust (default) means higher priority.
- a negative priority adjust means lower priority.
RETRY_PRIORITY_ADJUST¶
Default: -1
Scope: scrapy.downloadermiddlewares.retry.RetryMiddleware
Adjust retry request priority relative to original request:
- a positive priority adjust means higher priority.
- a negative priority adjust (default) means lower priority.
ROBOTSTXT_OBEY¶
Default: False
Scope: scrapy.downloadermiddlewares.robotstxt
If enabled, Scrapy will respect robots.txt policies. For more information see RobotsTxtMiddleware.
Note
While the default value is False for historical reasons, this option is enabled by default in the settings.py file generated by the scrapy startproject command.
SCHEDULER_DEBUG¶
Default: False
Setting to True
will log debug information about the requests scheduler.
This currently logs (only once) if the requests cannot be serialized to disk.
Stats counter (scheduler/unserializable
) tracks the number of times this happens.
Example entry in logs:
1956-01-31 00:00:00+0800 [scrapy.core.scheduler] ERROR: Unable to serialize request:
<GET http://example.com> - reason: cannot serialize <Request at 0x9a7c7ec>
(type Request)> - no more unserializable requests will be logged
(see 'scheduler/unserializable' stats counter)
SCHEDULER_DISK_QUEUE¶
Default: 'scrapy.squeues.PickleLifoDiskQueue'
Type of disk queue that will be used by scheduler. Other available types are
scrapy.squeues.PickleFifoDiskQueue
, scrapy.squeues.MarshalFifoDiskQueue
,
scrapy.squeues.MarshalLifoDiskQueue
.
SCHEDULER_MEMORY_QUEUE¶
Default: 'scrapy.squeues.LifoMemoryQueue'
Type of in-memory queue used by scheduler. Other available type is:
scrapy.squeues.FifoMemoryQueue
.
SCHEDULER_PRIORITY_QUEUE¶
Default: 'queuelib.PriorityQueue'
Type of priority queue used by scheduler.
SPIDER_CONTRACTS¶
Default: {}
A dict containing the spider contracts enabled in your project, used for testing spiders. For more info see Spiders Contracts.
SPIDER_CONTRACTS_BASE¶
Default:
{
'scrapy.contracts.default.UrlContract' : 1,
'scrapy.contracts.default.ReturnsContract': 2,
'scrapy.contracts.default.ScrapesContract': 3,
}
A dict containing the scrapy contracts enabled by default in Scrapy. You should
never modify this setting in your project, modify SPIDER_CONTRACTS
instead. For more info see Spiders Contracts.
You can disable any of these contracts by assigning None
to their class
path in SPIDER_CONTRACTS
. E.g., to disable the built-in
ScrapesContract
, place this in your settings.py
:
SPIDER_CONTRACTS = {
'scrapy.contracts.default.ScrapesContract': None,
}
SPIDER_LOADER_CLASS¶
Default: 'scrapy.spiderloader.SpiderLoader'
The class that will be used for loading spiders, which must implement the SpiderLoader API.
SPIDER_LOADER_WARN_ONLY¶
New in version 1.3.3.
Default: False
By default, when scrapy tries to import spider classes from SPIDER_MODULES
,
it will fail loudly if there is any ImportError
exception.
But you can choose to silence this exception and turn it into a simple
warning by setting SPIDER_LOADER_WARN_ONLY = True
.
Note
Some scrapy commands run with this setting set to True already (i.e. they will only issue a warning and will not fail) since they do not actually need to load spider classes to work: scrapy runspider, scrapy settings, scrapy startproject, scrapy version.
SPIDER_MIDDLEWARES¶
Default: {}
A dict containing the spider middlewares enabled in your project, and their orders. For more info see Activating a spider middleware.
SPIDER_MIDDLEWARES_BASE¶
Default:
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
A dict containing the spider middlewares enabled by default in Scrapy, and their orders. Low orders are closer to the engine, high orders are closer to the spider. For more info see Activating a spider middleware.
SPIDER_MODULES¶
Default: []
A list of modules where Scrapy will look for spiders.
Example:
SPIDER_MODULES = ['mybot.spiders_prod', 'mybot.spiders_dev']
STATS_CLASS¶
Default: 'scrapy.statscollectors.MemoryStatsCollector'
The class to use for collecting stats, which must implement the Stats Collector API.
STATS_DUMP¶
Default: True
Dump the Scrapy stats (to the Scrapy log) once the spider finishes.
For more info see: Stats Collection.
STATSMAILER_RCPTS¶
Default: [] (empty list)
Send Scrapy stats after spiders finish scraping. See
StatsMailer
for more info.
TELNETCONSOLE_ENABLED¶
Default: True
A boolean which specifies if the telnet console will be enabled (provided its extension is also enabled).
TELNETCONSOLE_PORT¶
Default: [6023, 6073]
The port range to use for the telnet console. If set to None
or 0
, a
dynamically assigned port is used. For more info see
Telnet Console.
TEMPLATES_DIR¶
Default: templates
dir inside scrapy module
The directory where to look for templates when creating new projects with
startproject
command and new spiders with genspider
command.
The project name must not conflict with the name of custom files or directories
in the project
subdirectory.
URLLENGTH_LIMIT¶
Default: 2083
Scope: spidermiddlewares.urllength
The maximum URL length to allow for crawled URLs. For more information about the default value for this setting see: https://boutell.com/newfaq/misc/urllength.html
USER_AGENT¶
Default: "Scrapy/VERSION (+https://scrapy.org)"
The default User-Agent to use when crawling, unless overridden.
Settings documented elsewhere:¶
The following settings are documented elsewhere, please check each specific case to see how to enable and use them.
- AJAXCRAWL_ENABLED
- AUTOTHROTTLE_DEBUG
- AUTOTHROTTLE_ENABLED
- AUTOTHROTTLE_MAX_DELAY
- AUTOTHROTTLE_START_DELAY
- AUTOTHROTTLE_TARGET_CONCURRENCY
- AWS_ACCESS_KEY_ID
- AWS_ENDPOINT_URL
- AWS_REGION_NAME
- AWS_SECRET_ACCESS_KEY
- AWS_USE_SSL
- AWS_VERIFY
- BOT_NAME
- CLOSESPIDER_ERRORCOUNT
- CLOSESPIDER_ITEMCOUNT
- CLOSESPIDER_PAGECOUNT
- CLOSESPIDER_TIMEOUT
- COMMANDS_MODULE
- COMPRESSION_ENABLED
- CONCURRENT_ITEMS
- CONCURRENT_REQUESTS
- CONCURRENT_REQUESTS_PER_DOMAIN
- CONCURRENT_REQUESTS_PER_IP
- COOKIES_DEBUG
- COOKIES_ENABLED
- DEFAULT_ITEM_CLASS
- DEFAULT_REQUEST_HEADERS
- DEPTH_LIMIT
- DEPTH_PRIORITY
- DEPTH_STATS_VERBOSE
- DNSCACHE_ENABLED
- DNSCACHE_SIZE
- DNS_TIMEOUT
- DOWNLOADER
- DOWNLOADER_CLIENTCONTEXTFACTORY
- DOWNLOADER_CLIENT_TLS_METHOD
- DOWNLOADER_HTTPCLIENTFACTORY
- DOWNLOADER_MIDDLEWARES
- DOWNLOADER_MIDDLEWARES_BASE
- DOWNLOADER_STATS
- DOWNLOAD_DELAY
- DOWNLOAD_FAIL_ON_DATALOSS
- DOWNLOAD_HANDLERS
- DOWNLOAD_HANDLERS_BASE
- DOWNLOAD_MAXSIZE
- DOWNLOAD_TIMEOUT
- DOWNLOAD_WARNSIZE
- DUPEFILTER_CLASS
- DUPEFILTER_DEBUG
- EDITOR
- EXTENSIONS
- EXTENSIONS_BASE
- FEED_EXPORTERS
- FEED_EXPORTERS_BASE
- FEED_EXPORT_ENCODING
- FEED_EXPORT_FIELDS
- FEED_EXPORT_INDENT
- FEED_FORMAT
- FEED_STORAGES
- FEED_STORAGES_BASE
- FEED_STORE_EMPTY
- FEED_TEMPDIR
- FEED_URI
- FILES_EXPIRES
- FILES_RESULT_FIELD
- FILES_STORE
- FILES_STORE_GCS_ACL
- FILES_STORE_S3_ACL
- FILES_URLS_FIELD
- FTP_PASSIVE_MODE
- FTP_PASSWORD
- FTP_USER
- GCS_PROJECT_ID
- HTTPCACHE_ALWAYS_STORE
- HTTPCACHE_DBM_MODULE
- HTTPCACHE_DIR
- HTTPCACHE_ENABLED
- HTTPCACHE_EXPIRATION_SECS
- HTTPCACHE_GZIP
- HTTPCACHE_IGNORE_HTTP_CODES
- HTTPCACHE_IGNORE_MISSING
- HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS
- HTTPCACHE_IGNORE_SCHEMES
- HTTPCACHE_POLICY
- HTTPCACHE_STORAGE
- HTTPERROR_ALLOWED_CODES
- HTTPERROR_ALLOW_ALL
- HTTPPROXY_AUTH_ENCODING
- HTTPPROXY_ENABLED
- IMAGES_EXPIRES
- IMAGES_MIN_HEIGHT
- IMAGES_MIN_WIDTH
- IMAGES_RESULT_FIELD
- IMAGES_STORE
- IMAGES_STORE_GCS_ACL
- IMAGES_STORE_S3_ACL
- IMAGES_THUMBS
- IMAGES_URLS_FIELD
- ITEM_PIPELINES
- ITEM_PIPELINES_BASE
- LOG_DATEFORMAT
- LOG_ENABLED
- LOG_ENCODING
- LOG_FILE
- LOG_FORMAT
- LOG_LEVEL
- LOG_SHORT_NAMES
- LOG_STDOUT
- MAIL_FROM
- MAIL_HOST
- MAIL_PASS
- MAIL_PORT
- MAIL_SSL
- MAIL_TLS
- MAIL_USER
- MEDIA_ALLOW_REDIRECTS
- MEMDEBUG_ENABLED
- MEMDEBUG_NOTIFY
- MEMUSAGE_CHECK_INTERVAL_SECONDS
- MEMUSAGE_ENABLED
- MEMUSAGE_LIMIT_MB
- MEMUSAGE_NOTIFY_MAIL
- MEMUSAGE_WARNING_MB
- METAREFRESH_ENABLED
- METAREFRESH_MAXDELAY
- NEWSPIDER_MODULE
- RANDOMIZE_DOWNLOAD_DELAY
- REACTOR_THREADPOOL_MAXSIZE
- REDIRECT_ENABLED
- REDIRECT_MAX_TIMES
- REDIRECT_PRIORITY_ADJUST
- REFERER_ENABLED
- REFERRER_POLICY
- RETRY_ENABLED
- RETRY_HTTP_CODES
- RETRY_PRIORITY_ADJUST
- RETRY_TIMES
- ROBOTSTXT_OBEY
- SCHEDULER
- SCHEDULER_DEBUG
- SCHEDULER_DISK_QUEUE
- SCHEDULER_MEMORY_QUEUE
- SCHEDULER_PRIORITY_QUEUE
- SPIDER_CONTRACTS
- SPIDER_CONTRACTS_BASE
- SPIDER_LOADER_CLASS
- SPIDER_LOADER_WARN_ONLY
- SPIDER_MIDDLEWARES
- SPIDER_MIDDLEWARES_BASE
- SPIDER_MODULES
- STATSMAILER_RCPTS
- STATS_CLASS
- STATS_DUMP
- TELNETCONSOLE_ENABLED
- TELNETCONSOLE_HOST
- TELNETCONSOLE_PASSWORD
- TELNETCONSOLE_PORT
- TELNETCONSOLE_USERNAME
- TEMPLATES_DIR
- URLLENGTH_LIMIT
- USER_AGENT
Exceptions¶
Built-in Exceptions reference¶
Here's a list of all exceptions included in Scrapy and their usage.
DropItem¶
exception scrapy.exceptions.DropItem¶
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item Pipeline.
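As an illustration, a minimal item pipeline sketch that drops items missing a field could look like this (the 'price' field and the pipeline name are made up for the example):
from scrapy.exceptions import DropItem

class RequirePricePipeline(object):
    """Illustrative pipeline: drop any item without a 'price' field."""
    def process_item(self, item, spider):
        if not item.get('price'):
            raise DropItem('Missing price in %s' % item)
        return item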
CloseSpider¶
exception scrapy.exceptions.CloseSpider(reason='cancelled')¶
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:
Parameters: reason (str) -- the reason for closing
For example:
def parse_page(self, response):
if 'Bandwidth exceeded' in response.body:
raise CloseSpider('bandwidth_exceeded')
DontCloseSpider¶
exception scrapy.exceptions.DontCloseSpider¶
This exception can be raised in a spider_idle signal handler to prevent the spider from being closed.
IgnoreRequest¶
exception scrapy.exceptions.IgnoreRequest¶
This exception can be raised by the Scheduler or any downloader middleware to indicate that the request should be ignored.
NotConfigured¶
exception scrapy.exceptions.NotConfigured¶
This exception can be raised by some components to indicate that they will remain disabled. Those components include:
- Extensions
- Item pipelines
- Downloader middlewares
- Spider middlewares
The exception must be raised in the component's __init__
method.
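For example, a minimal extension sketch could bail out of its __init__ when it is not enabled; the MYEXT_ENABLED setting name below is hypothetical:
from scrapy.exceptions import NotConfigured

class MyExtension(object):
    """Illustrative extension that stays disabled unless MYEXT_ENABLED is set."""
    def __init__(self, enabled):
        if not enabled:
            raise NotConfigured  # Scrapy will keep this component disabled

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getbool('MYEXT_ENABLED'))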
- Command line tool
- Learn about Scrapy's command-line tool.
- Spider
- Write the rules to crawl your websites.
- Selectors
- Extract data from web pages using XPath.
- Scrapy shell
- Try out your extraction code interactively.
- Item
- Define the data you want to scrape.
- Item Loaders
- Populate your Items with the extracted data.
- Item Pipeline
- Post-process your scraped data.
- Feed exports
- Output your scraped data using different formats and storages.
- Requests and Responses
- Understand the classes used to represent HTTP requests and responses.
- Link Extractors
- Convenient classes to extract links to follow from pages.
- Settings
- See all available settings and learn how to configure Scrapy.
- Exceptions
- A reference of all available exceptions.
Built-in services¶
Logging¶
Note
scrapy.log
has been deprecated alongside its functions in favor of
explicit calls to the Python standard logging. Keep reading to learn more
about the new logging system.
Scrapy uses Python's builtin logging system for event logging. We'll provide some simple examples to get you started, but for more advanced use-cases it's strongly suggested to read thoroughly its documentation.
Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.
Scrapy calls scrapy.utils.log.configure_logging()
to set some reasonable
defaults and handle those settings in Logging settings when
running commands, so it's recommended to manually call it if you're running
Scrapy from scripts as described in Run Scrapy from a script.
Log levels¶
Python's builtin logging defines 5 different levels to indicate the severity of a given log message. Here are the standard ones, listed in decreasing order:
- logging.CRITICAL - for critical errors (highest severity)
- logging.ERROR - for regular errors
- logging.WARNING - for warning messages
- logging.INFO - for informational messages
- logging.DEBUG - for debugging messages (lowest severity)
How to log messages¶
Here's a quick example of how to log a message using the logging.WARNING
level:
import logging
logging.warning("This is a warning")
There are shortcuts for issuing log messages on any of the standard 5 levels,
and there's also a general logging.log
method which takes a given level as
argument. If needed, the last example could be rewritten as:
import logging
logging.log(logging.WARNING, "This is a warning")
On top of that, you can create different "loggers" to encapsulate messages. (For example, a common practice is to create different loggers for every module). These loggers can be configured independently, and they allow hierarchical constructions.
The previous examples use the root logger behind the scenes, which is a top level
logger where all messages are propagated to (unless otherwise specified). Using
logging
helpers is merely a shortcut for getting the root logger
explicitly, so this is also an equivalent of the last snippets:
import logging
logger = logging.getLogger()
logger.warning("This is a warning")
You can use a different logger just by getting its name with the
logging.getLogger
function:
import logging
logger = logging.getLogger('mycustomlogger')
logger.warning("This is a warning")
Finally, you can ensure having a custom logger for any module you're working on
by using the __name__
variable, which is populated with current module's
path:
import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")
Logging from Spiders¶
Scrapy provides a logger
within each Spider
instance, which can be accessed and used like this:
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapinghub.com']
def parse(self, response):
self.logger.info('Parse function called on %s', response.url)
That logger is created using the Spider's name, but you can use any custom Python logger you want. For example:
import logging
import scrapy
logger = logging.getLogger('mycustomlogger')
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapinghub.com']
def parse(self, response):
logger.info('Parse function called on %s', response.url)
Logging configuration¶
Loggers on their own don't manage how messages sent through them are displayed. For this task, different "handlers" can be attached to any logger instance and they will redirect those messages to appropriate destinations, such as the standard output, files, emails, etc.
By default, Scrapy sets and configures a handler for the root logger, based on the settings below.
Logging settings¶
These settings can be used to configure the logging:
The first couple of settings define a destination for log messages. If
LOG_FILE
is set, messages sent through the root logger will be
redirected to a file named LOG_FILE
with encoding
LOG_ENCODING
. If unset and LOG_ENABLED
is True
, log
messages will be displayed on the standard error. Lastly, if
LOG_ENABLED
is False
, there won't be any visible log output.
LOG_LEVEL
determines the minimum level of severity to display, those
messages with lower severity will be filtered out. It ranges through the
possible levels listed in Log levels.
LOG_FORMAT
and LOG_DATEFORMAT
specify formatting strings
used as layouts for all messages. Those strings can contain any placeholders
listed in logging's logrecord attributes docs and
datetime's strftime and strptime directives
respectively.
If LOG_SHORT_NAMES
is set, then the logs will not display the scrapy
component that prints the log. It is unset by default, hence logs contain the
scrapy component responsible for that log output.
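Putting these together, one possible (purely illustrative) settings.py configuration could be:
# settings.py -- one possible logging setup, values are illustrative
LOG_ENABLED = True
LOG_FILE = 'crawl.log'       # log to a file instead of standard error
LOG_ENCODING = 'utf-8'
LOG_LEVEL = 'INFO'           # drop DEBUG messages
LOG_SHORT_NAMES = False      # keep full component names in log records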
Command-line options¶
There are command-line arguments, available for all commands, that you can use to override some of the Scrapy settings regarding logging.
--logfile FILE
- Overrides LOG_FILE
--loglevel/-L LEVEL
- Overrides LOG_LEVEL
--nolog
- Sets LOG_ENABLED to False
See also
- Module logging.handlers
- Further documentation on available handlers
Advanced customization¶
Because Scrapy uses stdlib logging module, you can customize logging using all features of stdlib logging.
For example, let's say you're scraping a website which returns many HTTP 404 and 500 responses, and you want to hide all messages like this:
2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring
response <500 http://quotes.toscrape.com/page/1-34/>: HTTP status code
is not handled or not allowed
The first thing to note is a logger name - it is in brackets:
[scrapy.spidermiddlewares.httperror]
. If you get just [scrapy]
then
LOG_SHORT_NAMES
is likely set to True; set it to False and re-run
the crawl.
Next, we can see that the message has INFO level. To hide it
we should set logging level for scrapy.spidermiddlewares.httperror
higher than INFO; next level after INFO is WARNING. It could be done
e.g. in the spider's __init__
method:
import logging
import scrapy
class MySpider(scrapy.Spider):
# ...
def __init__(self, *args, **kwargs):
logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
logger.setLevel(logging.WARNING)
super().__init__(*args, **kwargs)
If you run this spider again then INFO messages from
scrapy.spidermiddlewares.httperror
logger will be gone.
scrapy.utils.log module¶
Stats Collection¶
Scrapy provides a convenient facility for collecting stats in the form of
key/values, where values are often counters. The facility is called the Stats
Collector, and can be accessed through the stats
attribute of the Crawler API, as illustrated by the examples in
the Common Stats Collector uses section below.
However, the Stats Collector is always available, so you can always import it in your module and use its API (to increment or set new stat keys), regardless of whether the stats collection is enabled or not. If it's disabled, the API will still work but it won't collect anything. This is aimed at simplifying the stats collector usage: you should spend no more than one line of code for collecting stats in your spider, Scrapy extension, or whatever code you're using the Stats Collector from.
Another feature of the Stats Collector is that it's very efficient (when enabled) and extremely efficient (almost unnoticeable) when disabled.
The Stats Collector keeps a stats table per open spider which is automatically opened when the spider is opened, and closed when the spider is closed.
Common Stats Collector uses¶
Access the stats collector through the stats attribute. Here is an example of an extension that accesses stats:
class ExtensionThatAccessStats(object):
def __init__(self, stats):
self.stats = stats
@classmethod
def from_crawler(cls, crawler):
return cls(crawler.stats)
Set stat value:
stats.set_value('hostname', socket.gethostname())
Increment stat value:
stats.inc_value('custom_count')
Set stat value only if greater than previous:
stats.max_value('max_items_scraped', value)
Set stat value only if lower than previous:
stats.min_value('min_free_memory_percent', value)
Get stat value:
>>> stats.get_value('custom_count')
1
Get all stats:
>>> stats.get_stats()
{'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}
Available Stats Collectors¶
Besides the basic StatsCollector
there are other Stats Collectors
available in Scrapy which extend the basic Stats Collector. You can select
which Stats Collector to use through the STATS_CLASS
setting. The
default Stats Collector used is the MemoryStatsCollector
.
MemoryStatsCollector¶
class scrapy.statscollectors.MemoryStatsCollector¶
A simple stats collector that keeps the stats of the last scraping run (for each spider) in memory, after they're closed. The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name.
This is the default Stats Collector used in Scrapy.
spider_stats¶
A dict of dicts (keyed by spider name) containing the stats of the last scraping run for each spider.
DummyStatsCollector¶
class scrapy.statscollectors.DummyStatsCollector¶
A Stats collector which does nothing but is very efficient (because it does nothing). This stats collector can be set via the STATS_CLASS setting, to disable stats collection in order to improve performance. However, the performance penalty of stats collection is usually marginal compared to other Scrapy workload like parsing pages.
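For example, switching to this collector is just a matter of pointing STATS_CLASS at its documented class path in settings.py:
STATS_CLASS = 'scrapy.statscollectors.DummyStatsCollector'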
Sending e-mail¶
Although Python makes sending e-mails relatively easy via the smtplib library, Scrapy provides its own facility for sending e-mails which is very easy to use and it's implemented using Twisted non-blocking IO, to avoid interfering with the non-blocking IO of the crawler. It also provides a simple API for sending attachments and it's very easy to configure, with a few settings.
Quick example¶
There are two ways to instantiate the mail sender. You can instantiate it using the standard constructor:
from scrapy.mail import MailSender
mailer = MailSender()
Or you can instantiate it passing a Scrapy settings object, which will respect the settings:
mailer = MailSender.from_settings(settings)
And here is how to use it to send an e-mail (without attachments):
mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])
MailSender class reference¶
MailSender is the preferred class to use for sending emails from Scrapy, as it uses Twisted non-blocking IO, like the rest of the framework.
class scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None)¶
Parameters:
- smtphost (str or bytes) -- the SMTP host to use for sending the emails. If omitted, the MAIL_HOST setting will be used.
- mailfrom (str) -- the address used to send emails (in the From: header). If omitted, the MAIL_FROM setting will be used.
- smtpuser -- the SMTP user. If omitted, the MAIL_USER setting will be used. If not given, no SMTP authentication will be performed.
- smtppass (str or bytes) -- the SMTP pass for authentication.
- smtpport (int) -- the SMTP port to connect to
- smtptls (boolean) -- enforce using SMTP STARTTLS
- smtpssl (boolean) -- enforce using a secure SSL connection
classmethod from_settings(settings)¶
Instantiate using a Scrapy settings object, which will respect these Scrapy settings.
Parameters: settings (scrapy.settings.Settings object) -- the settings to read the e-mail configuration from
send(to, subject, body, cc=None, attachs=(), mimetype='text/plain', charset=None)¶
Send email to the given recipients.
Parameters:
- to (str or list of str) -- the e-mail recipients
- subject (str) -- the subject of the e-mail
- cc (str or list of str) -- the e-mails to CC
- body (str) -- the e-mail body
- attachs (iterable) -- an iterable of tuples (attach_name, mimetype, file_object) where attach_name is a string with the name that will appear on the e-mail's attachment, mimetype is the mimetype of the attachment and file_object is a readable file object with the contents of the attachment
- mimetype (str) -- the MIME type of the e-mail
- charset (str) -- the character encoding to use for the e-mail contents
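As a hedged sketch, sending an e-mail with an attachment could look like the following; it mirrors the attachs tuple format documented above, and the file name is made up:
from scrapy.mail import MailSender
from scrapy.utils.project import get_project_settings

mailer = MailSender.from_settings(get_project_settings())
with open('report.csv', 'rb') as f:  # hypothetical attachment file
    mailer.send(
        to=['someone@example.com'],
        subject='Crawl report',
        body='See the attached report.',
        attachs=[('report.csv', 'text/csv', f)],
    )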
Mail settings¶
These settings define the default constructor values of the MailSender
class, and can be used to configure e-mail notifications in your project without
writing any code (for those extensions and code that uses MailSender
).
MAIL_USER¶
Default: None
User to use for SMTP authentication. If disabled no SMTP authentication will be performed.
MAIL_TLS¶
Default: False
Enforce using STARTTLS. STARTTLS is a way to take an existing insecure connection, and upgrade it to a secure connection using SSL/TLS.
Telnet Console¶
Scrapy comes with a built-in telnet console for inspecting and controlling a Scrapy running process. The telnet console is just a regular python shell running inside the Scrapy process, so you can do literally anything from it.
The telnet console is a built-in Scrapy extension which comes enabled by default, but you can also disable it if you want. For more information about the extension itself see Telnet console extension.
Warning
It is not secure to use telnet console via public networks, as telnet doesn't provide any transport-layer security. Having username/password authentication doesn't change that.
Intended usage is connecting to a running Scrapy spider locally
(spider process and telnet client are on the same machine)
or over a secure connection (VPN, SSH tunnel).
Please avoid using telnet console over insecure connections,
or disable it completely using TELNETCONSOLE_ENABLED
option.
How to access the telnet console¶
The telnet console listens in the TCP port defined in the
TELNETCONSOLE_PORT
setting, which defaults to 6023
. To access
the console you need to type:
telnet localhost 6023
Trying localhost...
Connected to localhost.
Escape character is '^]'.
Username:
Password:
>>>
By default Username is scrapy and Password is autogenerated. The autogenerated Password can be seen in Scrapy logs, like the example below:
2018-10-16 14:35:21 [scrapy.extensions.telnet] INFO: Telnet Password: 16f92501e8a59326
Default Username and Password can be overridden by the settings TELNETCONSOLE_USERNAME and TELNETCONSOLE_PASSWORD.
Warning
Username and password provide only a limited protection, as telnet is not using secure transport - by default traffic is not encrypted even if username and password are set.
You need the telnet program which comes installed by default in Windows, and most Linux distros.
Available variables in the telnet console¶
The telnet console is like a regular Python shell running inside the Scrapy process, so you can do anything from it including importing new modules, etc.
However, the telnet console comes with some default variables defined for convenience:
Shortcut | Description
---|---
crawler | the Scrapy Crawler (scrapy.crawler.Crawler object)
engine | Crawler.engine attribute
spider | the active spider
slot | the engine slot
extensions | the Extension Manager (Crawler.extensions attribute)
stats | the Stats Collector (Crawler.stats attribute)
settings | the Scrapy settings object (Crawler.settings attribute)
est | print a report of the engine status
prefs | for memory debugging (see Debugging memory leaks)
p | a shortcut to the pprint.pprint function
hpy | for memory debugging (see Debugging memory leaks)
Telnet console usage examples¶
Here are some example tasks you can do with the telnet console:
View engine status¶
You can use the est()
method of the Scrapy engine to quickly show its state
using the telnet console:
telnet localhost 6023
>>> est()
Execution engine status
time()-engine.start_time : 8.62972998619
engine.has_capacity() : False
len(engine.downloader.active) : 16
engine.scraper.is_idle() : False
engine.spider.name : followall
engine.spider_is_idle(engine.spider) : False
engine.slot.closing : False
len(engine.slot.inprogress) : 16
len(engine.slot.scheduler.dqs or []) : 0
len(engine.slot.scheduler.mqs) : 92
len(engine.scraper.slot.queue) : 0
len(engine.scraper.slot.active) : 0
engine.scraper.slot.active_size : 0
engine.scraper.slot.itemproc_size : 0
engine.scraper.slot.needs_backout() : False
Pause, resume and stop the Scrapy engine¶
To pause:
telnet localhost 6023
>>> engine.pause()
>>>
To resume:
telnet localhost 6023
>>> engine.unpause()
>>>
To stop:
telnet localhost 6023
>>> engine.stop()
Connection closed by foreign host.
Telnet Console signals¶
scrapy.extensions.telnet.update_telnet_vars(telnet_vars)¶
Sent just before the telnet console is opened. You can hook up to this signal to add, remove or update the variables that will be available in the telnet local namespace. In order to do that, you need to update the telnet_vars dict in your handler.
Parameters: telnet_vars (dict) -- the dict of telnet variables
Telnet settings¶
These are the settings that control the telnet console's behaviour:
TELNETCONSOLE_PORT¶
Default: [6023, 6073]
The port range to use for the telnet console. If set to None
or 0
, a
dynamically assigned port is used.
TELNETCONSOLE_PASSWORD¶
Default: None
The password used for the telnet console; the default behaviour is to have it autogenerated.
- Logging
- Learn how to use Python's builtin logging with Scrapy.
- Stats Collection
- Collect statistics about your crawler.
- Sending e-mail
- Send e-mail notifications when certain events occur.
- Telnet Console
- Inspect a running crawler using the built-in Python console.
- Web Service
- Monitor and control a crawler using a web service.
Solving specific problems¶
Frequently Asked Questions¶
How does Scrapy compare to BeautifulSoup or lxml?¶
BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them.
Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them. After all, they're just parsing libraries which can be imported and used from any Python code.
In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.
Can I use Scrapy with BeautifulSoup?¶
Yes, you can.
As mentioned above, BeautifulSoup can be used
for parsing HTML responses in Scrapy callbacks.
You just have to feed the response's body into a BeautifulSoup
object
and extract whatever data you need from it.
Here's an example spider using BeautifulSoup API, with lxml
as the HTML parser:
from bs4 import BeautifulSoup
import scrapy
class ExampleSpider(scrapy.Spider):
name = "example"
allowed_domains = ["example.com"]
start_urls = (
'http://www.example.com/',
)
def parse(self, response):
# use lxml to get decent HTML parsing speed
soup = BeautifulSoup(response.text, 'lxml')
yield {
"url": response.url,
"title": soup.h1.string
}
Note
BeautifulSoup
supports several HTML/XML parsers.
See BeautifulSoup's official documentation on which ones are available.
What Python versions does Scrapy support?¶
Scrapy is supported under Python 2.7 and Python 3.4+ under CPython (default Python implementation) and PyPy (starting with PyPy 5.9). Python 2.6 support was dropped starting at Scrapy 0.20. Python 3 support was added in Scrapy 1.1. PyPy support was added in Scrapy 1.4, PyPy3 support was added in Scrapy 1.5.
Note
For Python 3 support on Windows, it is recommended to use Anaconda/Miniconda as outlined in the installation guide.
Did Scrapy "steal" X from Django?¶
Probably, but we don't like that word. We think Django is a great open source project and an example to follow, so we've used it as an inspiration for Scrapy.
We believe that, if something is already done well, there's no need to reinvent it. This concept, besides being one of the foundations for open source and free software, not only applies to software but also to documentation, procedures, policies, etc. So, instead of going through each problem ourselves, we choose to copy ideas from those projects that have already solved them properly, and focus on the real problems we need to solve.
We'd be proud if Scrapy serves as an inspiration for other projects. Feel free to steal from us!
Does Scrapy work with HTTP proxies?¶
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP
Proxy downloader middleware. See
HttpProxyMiddleware
.
How can I scrape an item with attributes in different pages?¶
Scrapy crashes with: ImportError: No module named win32api¶
You need to install pywin32 because of this Twisted bug.
How can I simulate a user login in my spider?¶
See Using FormRequest.from_response() to simulate a user login.
Does Scrapy crawl in breadth-first or depth-first order?¶
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
My Scrapy crawler has memory leaks. What can I do?¶
See Debugging memory leaks. Also, Python has a builtin memory leak issue which is described in Leaks without leaks.
How can I make Scrapy consume less memory?¶
See previous question.
Can I use Basic HTTP Authentication in my spiders?¶
Yes, see HttpAuthMiddleware
.
Why does Scrapy download pages in English instead of my native language?¶
Try changing the default Accept-Language request header by overriding the
DEFAULT_REQUEST_HEADERS
setting.
Can I run a spider without creating a project?¶
Yes. You can use the runspider
command. For example, if you have a
spider written in a my_spider.py
file you can run it with:
scrapy runspider my_spider.py
See runspider
command for more info.
I get "Filtered offsite request" messages. How can I fix them?¶
Those messages (logged with DEBUG
level) don't necessarily mean there is a
problem, so you may not need to fix them.
Those messages are thrown by the Offsite Spider Middleware, which is a spider middleware (enabled by default) whose purpose is to filter out requests to domains outside the ones covered by the spider.
For more info see:
OffsiteMiddleware
.
What is the recommended way to deploy a Scrapy crawler in production?¶
See Deploying Spiders.
Can I use JSON for large exports?¶
It'll depend on how large your output is. See this warning in JsonItemExporter
documentation.
Can I return (Twisted) deferreds from signal handlers?¶
Some signals support returning deferreds from their handlers, others don't. See the Built-in signals reference to know which ones.
What does the response status code 999 mean?¶
999 is a custom response status code used by Yahoo sites to throttle requests.
Try slowing down the crawling speed by using a download delay of 2
(or
higher) in your spider:
from scrapy.spiders import CrawlSpider
class MySpider(CrawlSpider):
name = 'myspider'
download_delay = 2
# [ ... rest of the spider code ... ]
Or by setting a global download delay in your project with the
DOWNLOAD_DELAY
setting.
Can I call pdb.set_trace()
from my spiders to debug them?¶
Yes, but you can also use the Scrapy shell which allows you to quickly analyze
(and even modify) the response being processed by your spider, which is, quite
often, more useful than plain old pdb.set_trace()
.
For more info see Invoking the shell from spiders to inspect responses.
Simplest way to dump all my scraped items into a JSON/CSV/XML file?¶
To dump into a JSON file:
scrapy crawl myspider -o items.json
To dump into a CSV file:
scrapy crawl myspider -o items.csv
To dump into an XML file:
scrapy crawl myspider -o items.xml
For more information see Feed exports.
What's this huge cryptic __VIEWSTATE
parameter used in some forms?¶
The __VIEWSTATE
parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see this page. Also, here's an example spider
which scrapes one of these sites.
What's the best way to parse big XML/CSV data feeds?¶
Parsing big feeds with XPath selectors can be problematic since they need to build the DOM of the entire feed in memory, and this can be quite slow and consume a lot of memory.
In order to avoid parsing the entire feed at once in memory, you can use the functions xmliter and csviter from the scrapy.utils.iterators module. In fact, this is what the feed spiders (see Spider) use under the cover.
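As a hedged sketch, a spider could iterate over the nodes of a large XML feed like this; the URL and the product node name are made up:
import scrapy
from scrapy.utils.iterators import xmliter

class BigFeedSpider(scrapy.Spider):
    """Illustrative spider parsing a large XML feed node by node."""
    name = 'bigfeed'
    start_urls = ['http://www.example.com/products.xml']

    def parse(self, response):
        # Yields one selector per <product> node instead of building the full DOM
        for node in xmliter(response, 'product'):
            yield {'name': node.xpath('name/text()').get()}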
Does Scrapy manage cookies automatically?¶
Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.
For more info see Requests and Responses and CookiesMiddleware.
How can I see the cookies being sent and received from Scrapy?¶
Enable the COOKIES_DEBUG
setting.
How can I instruct a spider to stop itself?¶
Raise the CloseSpider
exception from a callback. For
more info see: CloseSpider
.
How can I prevent my Scrapy bot from getting banned?¶
See Avoiding getting banned.
Should I use spider arguments or settings to configure my spider?¶
Both spider arguments and settings can be used to configure your spider. There is no strict rule that mandates using one or the other, but settings are more suited for parameters that, once set, don't change much, while spider arguments are meant to change more often, even on each spider run, and sometimes are required for the spider to run at all (for example, to set the start url of a spider).
To illustrate with an example, assume you have a spider that needs to log into a site to scrape data, and you only want to scrape data from a certain section of the site (which varies each time). In that case, the credentials to log in would be settings, while the url of the section to scrape would be a spider argument.
I'm scraping a XML document and my XPath selector doesn't return any items¶
You may need to remove namespaces. See Removing namespaces.
Debugging Spiders¶
This document explains the most common techniques for debugging spiders. Consider the following Scrapy spider:
import scrapy
from myproject.items import MyItem
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = (
'http://example.com/page1',
'http://example.com/page2',
)
def parse(self, response):
# <processing code not shown>
# collect `item_urls`
for item_url in item_urls:
yield scrapy.Request(item_url, self.parse_item)
def parse_item(self, response):
# <processing code not shown>
item = MyItem()
# populate `item` fields
# and extract item_details_url
yield scrapy.Request(item_details_url, self.parse_details, meta={'item': item})
def parse_details(self, response):
item = response.meta['item']
# populate more `item` fields
return item
Basically this is a simple spider which parses two pages of items (the
start_urls). Items also have a details page with additional information, so we
use the meta
functionality of Request
to pass a
partially populated item.
Parse Command¶
The most basic way of checking the output of your spider is to use the parse command. It allows you to check the behaviour of different parts of the spider at the method level. It has the advantage of being flexible and simple to use, but does not allow debugging code inside a method.
In order to see the item scraped from a specific url:
$ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 2 <<<
# Scraped Items ------------------------------------------------------------
[{'url': <item_url>}]
# Requests -----------------------------------------------------------------
[]
Using the --verbose
or -v
option we can see the status at each depth level:
$ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>
[ ... scrapy log lines crawling example.com spider ... ]
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[]
# Requests -----------------------------------------------------------------
[<GET item_details_url>]
>>> DEPTH LEVEL: 2 <<<
# Scraped Items ------------------------------------------------------------
[{'url': <item_url>}]
# Requests -----------------------------------------------------------------
[]
Checking items scraped from a single start_url can also be easily achieved using:
$ scrapy parse --spider=myspider -d 3 'http://example.com/page1'
Scrapy Shell¶
While the parse
command is very useful for checking behaviour of a
spider, it is of little help to check what happens inside a callback, besides
showing the response received and the output. How to debug the situation when
parse_details
sometimes receives no item?
Fortunately, the shell
is your bread and butter in this case (see
Invoking the shell from spiders to inspect responses):
from scrapy.shell import inspect_response
def parse_details(self, response):
item = response.meta.get('item', None)
if item:
# populate more `item` fields
return item
else:
inspect_response(response, self)
See also: Invoking the shell from spiders to inspect responses.
Open in browser¶
Sometimes you just want to see how a certain response looks in a browser; you can use the open_in_browser function for that. Here is an example of how you would use it:
from scrapy.utils.response import open_in_browser
def parse_details(self, response):
if "item name" not in response.body:
open_in_browser(response)
open_in_browser
will open a browser with the response received by Scrapy at
that point, adjusting the base tag so that images and styles are displayed
properly.
Logging¶
Logging is another useful option for getting information about your spider run. Although not as convenient, it comes with the advantage that the logs will be available in all future runs should they be necessary again:
def parse_details(self, response):
item = response.meta.get('item', None)
if item:
# populate more `item` fields
return item
else:
self.logger.warning('No item received for %s', response.url)
For more information, check the Logging section.
Spiders Contracts¶
New in version 0.15.
Note
This is a new feature (introduced in Scrapy 0.15) and may be subject to minor functionality/API updates. Check the release notes to be notified of updates.
Testing spiders can get particularly annoying and while nothing prevents you from writing unit tests the task gets cumbersome quickly. Scrapy offers an integrated way of testing your spiders by the means of contracts.
This allows you to test each callback of your spider by hardcoding a sample url and checking various constraints for how the callback processes the response. Each contract is prefixed with an @ and included in the docstring. See the following example:
def parse(self, response):
""" This function parses a sample response. Some contracts are mingled
with this docstring.
@url http://www.amazon.com/s?field-keywords=selfish+gene
@returns items 1 16
@returns requests 0 0
@scrapes Title Author Year Price
"""
This callback is tested using three built-in contracts:
class scrapy.contracts.default.UrlContract¶
This contract (@url) sets the sample url used when checking other contract conditions for this spider. This contract is mandatory. All callbacks lacking this contract are ignored when running the checks:
@url url
class scrapy.contracts.default.ReturnsContract¶
This contract (@returns) sets lower and upper bounds for the items and requests returned by the spider. The upper bound is optional:
@returns item(s)|request(s) [min [max]]
class scrapy.contracts.default.ScrapesContract¶
This contract (@scrapes) checks that all the items returned by the callback have the specified fields:
@scrapes field_1 field_2 ...
Use the check
command to run the contract checks.
Custom Contracts¶
If you find you need more power than the built-in scrapy contracts you can
create and load your own contracts in the project by using the
SPIDER_CONTRACTS
setting:
SPIDER_CONTRACTS = {
'myproject.contracts.ResponseCheck': 10,
'myproject.contracts.ItemValidate': 10,
}
Each contract must inherit from scrapy.contracts.Contract
and can
override three methods:
class scrapy.contracts.Contract(method, *args)¶
Parameters:
- method (function) -- callback function to which the contract is associated
- args (list) -- list of arguments passed into the docstring (whitespace separated)
adjust_request_args(args)¶
This receives a dict as an argument containing default arguments for the request object. Request is used by default, but this can be changed with the request_cls attribute. If multiple contracts in a chain have this attribute defined, the last one is used. Must return the same or a modified version of it.
pre_process(response)¶
This allows hooking in various checks on the response received from the sample request, before it's being passed to the callback.
post_process(output)¶
This allows processing the output of the callback. Iterators are converted to lists before being passed to this hook.
Here is a demo contract which checks the presence of a custom header in the
response received. Raise scrapy.exceptions.ContractFail
in order to
get the failures pretty printed:
from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail
class HasHeaderContract(Contract):
""" Demo contract which checks the presence of a custom header
@has_header X-CustomHeader
"""
name = 'has_header'
def pre_process(self, response):
for header in self.args:
if header not in response.headers:
raise ContractFail('X-CustomHeader not present')
Common Practices¶
This section documents common practices when using Scrapy. These are things that cover many topics and don't often fall into any other specific section.
Run Scrapy from a script¶
You can use the API to run Scrapy from a script, instead of
the typical way of running Scrapy via scrapy crawl
.
Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.
The first utility you can use to run your spiders is
scrapy.crawler.CrawlerProcess
. This class will start a Twisted reactor
for you, configuring the logging and setting shutdown handlers. This class is
the one used by all Scrapy commands.
Here's an example showing how to run a single spider with it.
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
Make sure to check CrawlerProcess
documentation to get
acquainted with its usage details.
If you are inside a Scrapy project there are some additional helpers you can
use to import those components within the project. You can automatically import
your spiders passing their name to CrawlerProcess
, and
use get_project_settings
to get a Settings
instance with your project settings.
What follows is a working example of how to do that, using the testspiders project as example.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
There's another Scrapy utility that provides more control over the crawling
process: scrapy.crawler.CrawlerRunner
. This class is a thin wrapper
that encapsulates some simple helpers to run multiple crawlers, but it won't
start or interfere with existing reactors in any way.
Using this class the reactor should be explicitly run after scheduling your
spiders. It's recommended you use CrawlerRunner
instead of CrawlerProcess
if your application is
already using Twisted and you want to run Scrapy in the same reactor.
Note that you will also have to shutdown the Twisted reactor yourself after the
spider is finished. This can be achieved by adding callbacks to the deferred
returned by the CrawlerRunner.crawl
method.
Here's an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider(scrapy.Spider):
# Your spider definition
...
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Running multiple spiders in the same process¶
By default, Scrapy runs a single spider per process when you run scrapy
crawl
. However, Scrapy supports running multiple spiders per process using
the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
Same example using CrawlerRunner
:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
Same example but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
yield runner.crawl(MySpider1)
yield runner.crawl(MySpider2)
reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
Distributed crawls¶
Scrapy doesn't provide any built-in facility for running crawls in a distributed (multi-server) manner. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among those.
If you instead want to run a single (big) spider through many machines, what you usually do is partition the urls to crawl and send them to each separate spider. Here is a concrete example:
First, you prepare the list of urls to crawl and put them into separate files/urls:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then you fire a spider run on 3 different Scrapyd servers. The spider would
receive a (spider) argument part
with the number of the partition to
crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
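A hedged sketch of what such a spider could look like follows; everything here (the spider name, the URL pattern, and the way the partition file is fetched) is illustrative only:
import scrapy

class PartitionedSpider(scrapy.Spider):
    """Illustrative spider that crawls one partition of a shared URL list."""
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        # The 'part' argument selects which partition file to fetch
        url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        for line in response.text.splitlines():
            if line.strip():
                yield scrapy.Request(line.strip(), callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}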
Avoiding getting banned¶
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see
COOKIES_ENABLED
) as some sites may use cookies to spot bot behaviour - use download delays (2 or higher). See
DOWNLOAD_DELAY
setting. - if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
If you are still unable to prevent your bot getting banned, consider contacting commercial support.
Broad Crawls¶
Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).
In addition to this "focused crawl", there is another common type of crawling which covers a large (potentially unlimited) number of domains, and is only limited by time or another arbitrary constraint, rather than stopping when the domain was crawled to completion or when there are no more requests to perform. These are called "broad crawls" and are the typical crawls employed by search engines.
These are some common properties often found in broad crawls:
- they crawl many domains (often, unbounded) instead of a specific set of sites
- they don't necessarily crawl domains to completion, because it would be impractical (or impossible) to do so, and instead limit the crawl by time or number of pages crawled
- they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often post-processed in a separate stage
- they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)
As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
Increase concurrency¶
Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit.
The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bound. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.
To increase the global concurrency use:
CONCURRENT_REQUESTS = 100
Increase Twisted IO thread pool maximum size¶
Currently Scrapy does DNS resolution in a blocking way with usage of a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will then be processed faster, speeding up connection establishment and crawling overall.
To increase maximum thread pool size use:
REACTOR_THREADPOOL_MAXSIZE = 20
Setup your own DNS¶
If you have multiple crawling processes and a single central DNS, it can act like a DoS attack on the DNS server, resulting in slowdown of the entire network or even blocking your machines. To avoid this, set up your own DNS server with a local cache and upstream to some large DNS like OpenDNS or Verizon.
Reduce log level¶
When doing broad crawls you are often only interested in the crawl rates you get and any errors found. These stats are reported by Scrapy when using the INFO log level. In order to save CPU (and log storage requirements) you should not use the DEBUG log level when performing large broad crawls in production. Using DEBUG level when developing your (broad) crawler may be fine though.
To set the log level use:
LOG_LEVEL = 'INFO'
Disable cookies¶
Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your Scrapy crawler.
To disable cookies use:
COOKIES_ENABLED = False
Disable retries¶
Retrying failed HTTP requests can slow down the crawls substantially, especially when sites are very slow (or fail) to respond, thus causing a timeout error which gets retried many times, unnecessarily, preventing crawler capacity from being reused for other domains.
To disable retries use:
RETRY_ENABLED = False
Reduce download timeout¶
Unless you are crawling from a very slow connection (which shouldn't be the case for broad crawls) reduce the download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.
To reduce the download timeout use:
DOWNLOAD_TIMEOUT = 15
Disable redirects¶
Consider disabling redirects, unless you are interested in following them. When doing broad crawls it's common to save redirects and resolve them when revisiting the site at a later crawl. This also helps to keep the number of requests constant per crawl batch, otherwise redirect loops may cause the crawler to dedicate too many resources to any specific domain.
To disable redirects use:
REDIRECT_ENABLED = False
Enable crawling of "Ajax Crawlable Pages"¶
Some pages (up to 1%, based on empirical data from 2013) declare themselves as AJAX crawlable. This means they provide a plain HTML version of content that is usually available only via AJAX. Pages can indicate this in two ways:
- by using #! in the URL - this is the default way;
- by using a special meta tag - this way is used on "main" / "index" website pages.
Scrapy handles (1) automatically; to handle (2) enable AjaxCrawlMiddleware:
AJAXCRAWL_ENABLED = True
When doing broad crawls it's common to crawl a lot of "index" web pages; AjaxCrawlMiddleware helps to crawl them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused crawls doesn't make much sense.
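Putting the suggestions from this page together, a broad-crawl oriented settings.py fragment might look like the following sketch; the values are simply the ones used in the examples above and should be tuned for your own hardware and target sites:
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
REDIRECT_ENABLED = False
AJAXCRAWL_ENABLED = True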
Using your browser's Developer Tools for scraping¶
Here is a general guide on how to use your browser's Developer Tools to ease the scraping process. Today almost all browsers come with built-in Developer Tools and, although we will use Firefox in this guide, the concepts are applicable to any other browser.
In this guide we'll introduce the basic tools to use from a browser's Developer Tools by scraping quotes.toscrape.com.
Caveats with inspecting the live browser DOM¶
Since Developer Tools operate on a live browser DOM, what you'll actually see
when inspecting the page source is not the original HTML, but a modified one
after applying some browser cleanup and executing JavaScript code. Firefox,
in particular, is known for adding <tbody>
elements to tables. Scrapy, on
the other hand, does not modify the original page HTML, so you won't be able to
extract any data if you use <tbody>
in your XPath expressions.
Therefore, you should keep in mind the following things:
- Disable JavaScript while inspecting the DOM looking for XPaths to be used in Scrapy (in the Developer Tools settings click Disable JavaScript).
- Never use full XPath paths; use relative and clever ones based on attributes (such as id, class, width, etc.) or any identifying features like contains(@href, 'image').
- Never include <tbody> elements in your XPath expressions unless you really know what you're doing.
Inspecting a website¶
By far the most handy feature of the Developer Tools is the Inspector feature, which allows you to inspect the underlying HTML code of any webpage. To demonstrate the Inspector, let's look at the quotes.toscrape.com-site.
On the site we have a total of ten quotes from various authors with specific tags, as well as the Top Ten Tags. Let's say we want to extract all the quotes on this page, without any meta-information about authors, tags, etc.
Instead of viewing the whole source code for the page, we can simply right click
on a quote and select Inspect Element (Q)
, which opens up the Inspector.
In it you should see something like this:

The interesting part for us is this:
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">(...)</span>
    <span>(...)</span>
    <div class="tags">(...)</div>
</div>
If you hover over the first div
directly above the span
tag highlighted
in the screenshot, you'll see that the corresponding section of the webpage gets
highlighted as well. So now we have a section, but we can't find our quote text
anywhere.
The advantage of the Inspector is that it automatically expands and collapses
sections and tags of a webpage, which greatly improves readability. You can
expand and collapse a tag by clicking on the arrow in front of it or by double
clicking directly on the tag. If we expand the span
tag with the class=
"text"
we will see the quote-text we clicked on. The Inspector lets you
copy XPaths to selected elements. Let's try it out: Right-click on the span
tag, select Copy > XPath
and paste it in the scrapy shell like so:
$ scrapy shell "http://quotes.toscrape.com/"
(...)
>>> response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
Adding text()
at the end we are able to extract the first quote with this
basic selector. But this XPath is not really that clever. All it does is
go down a desired path in the source code starting from html
. So let's
see if we can refine our XPath a bit:
If we check the Inspector again we'll see that directly beneath our
expanded div
tag we have nine identical div
tags, each with the
same attributes as our first. If we expand any of them, we'll see the same
structure as with our first quote: Two span
tags and one div
tag. We can
expand each span
tag with the class="text"
inside our div
tags and
see each quote:
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">
        “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    </span>
    <span>(...)</span>
    <div class="tags">(...)</div>
</div>
With this knowledge we can refine our XPath: Instead of a path to follow,
we'll simply select all span
tags with the class="text"
by using
the has-class-extension:
>>> response.xpath('//span[has-class("text")]/text()').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
(...)]
And with one simple, cleverer XPath we are able to extract all quotes from
the page. We could have constructed a loop over our first XPath to increase
the number of the last div
, but this would have been unnecessarily
complex and by simply constructing an XPath with has-class("text")
we were able to extract all quotes in one line.
The Inspector has a lot of other helpful features, such as searching in the source code or directly scrolling to an element you selected. Let's demonstrate a use case:
Say you want to find the Next
button on the page. Type Next
into the
search bar on the top right of the Inspector. You should get two results.
The first is a li
tag with the class="next"
, the second the text
of an a
tag. Right click on the a
tag and select Scroll into View
.
If you hover over the tag, you'll see the button highlighted. From here
we could easily create a Link Extractor to
follow the pagination. On a simple site such as this, there may not be
the need to find an element visually but the Scroll into View
function
can be quite useful on complex sites.
Note that the search bar can also be used to search for and test CSS
selectors. For example, you could search for span.text
to find
all quote texts. Instead of a full text search, this searches for
exactly the span
tag with the class="text"
in the page.
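The same check can also be done from the scrapy shell with the equivalent CSS selector (output abbreviated):
>>> response.css('span.text::text').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 ...]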
The Network-tool¶
While scraping you may come across dynamic webpages where some parts of the page are loaded dynamically through multiple requests. While this can be quite tricky, the Network-tool in the Developer Tools greatly facilitates this task. To demonstrate the Network-tool, let's take a look at the page quotes.toscrape.com/scroll.
The page is quite similar to the basic quotes.toscrape.com-page,
but instead of the above-mentioned Next
button, the page
automatically loads new quotes when you scroll to the bottom. We
could go ahead and try out different XPaths directly, but instead
we'll check another quite useful command from the scrapy shell:
$ scrapy shell "quotes.toscrape.com/scroll"
(...)
>>> view(response)
A browser window should open with the webpage but with one
crucial difference: Instead of the quotes we just see a greenish
bar with the word Loading...
.

The view(response)
command lets us view the response our
shell or later our spider receives from the server. Here we see
that some basic template is loaded which includes the title,
the login-button and the footer, but the quotes are missing. This
tells us that the quotes are being loaded from a different request
than quotes.toscrape.com/scroll
.
If you click on the Network
tab, you will probably only see
two entries. The first thing we do is enable persistent logs by
clicking on Persist Logs
. If this option is disabled, the
log is automatically cleared each time you navigate to a different
page. Enabling this option is a good default, since it gives us
control on when to clear the logs.
If we reload the page now, you'll see the log get populated with six new requests.

Here we see every request that has been made when reloading the page and can inspect each request and its response. So let's find out where our quotes are coming from:
First click on the request with the name scroll
. On the right
you can now inspect the request. In Headers
you'll find details
about the request headers, such as the URL, the method, the IP-address,
and so on. We'll ignore the other tabs and click directly on Response
.
What you should see in the Preview
pane is the rendered HTML-code,
that is exactly what we saw when we called view(response)
in the
shell. Accordingly the type
of the request in the log is html
.
The other requests have types like css
or js
, but what
interests us is the one request called quotes?page=1
with the
type json
.
If we click on this request, we see that the request URL is
http://quotes.toscrape.com/api/quotes?page=1
and the response
is a JSON-object that contains our quotes. We can also right-click
on the request and select Open in new tab
to get a better overview.

With this response we can now easily parse the JSON-object and also request each page to get every quote on the site:
import scrapy
import json


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    page = 1
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {"quote": quote["text"]}
        if data["has_next"]:
            self.page += 1
            url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)
This spider starts at the first page of the quotes-API. With each
response, we parse the response.text
and assign it to data
.
This lets us operate on the JSON-object like on a Python dictionary.
We iterate through the quotes
and yield the quote["text"]
.
If the handy has_next
element is true
(try loading
quotes.toscrape.com/api/quotes?page=10 in your browser or a
page-number greater than 10), we increment the page
attribute
and yield
a new request, inserting the incremented page-number
into our url
.
You can see that with a few inspections in the Network-tool we were able to easily replicate the dynamic requests of the scrolling functionality of the page. Crawling dynamic pages can be quite daunting and pages can be very complex, but it (mostly) boils down to identifying the correct request and replicating it in your spider.
Debugging memory leaks¶
In Scrapy, objects such as Requests, Responses and Items have a finite lifetime: they are created, used for a while, and finally destroyed.
From all those objects, the Request is probably the one with the longest lifetime, as it stays waiting in the Scheduler queue until it's time to process it. For more info see Architecture overview.
As these Scrapy objects have a (rather long) lifetime, there is always the risk of accumulating them in memory without releasing them properly and thus causing what is known as a "memory leak".
To help debugging memory leaks, Scrapy provides a built-in mechanism for tracking object references called trackref, and you can also use a third-party library called Guppy for more advanced memory debugging (see below for more info). Both mechanisms must be used from the Telnet Console.
Common causes of memory leaks¶
It happens quite often (sometimes by accident, sometimes on purpose) that the
Scrapy developer passes objects referenced in Requests (for example, using the
meta
attribute or the request callback function)
and that effectively bounds the lifetime of those referenced objects to the
lifetime of the Request. This is, by far, the most common cause of memory leaks
in Scrapy projects, and a quite difficult one to debug for newcomers.
In big projects, the spiders are typically written by different people and some of those spiders could be "leaking" and thus affecting the other (well-written) spiders when they run concurrently, which, in turn, affects the whole crawling process.
The leak could also come from a custom middleware, pipeline or extension that
you have written, if you are not releasing the (previously allocated) resources
properly. For example, allocating resources on spider_opened
but not releasing them on spider_closed
may cause problems if
you're running multiple spiders per process.
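For illustration, here is a minimal sketch of an extension that allocates one resource per spider on spider_opened and releases it on spider_closed; the class name and the sqlite resource are purely illustrative:
import sqlite3

from scrapy import signals


class PerSpiderResourceExtension:

    def __init__(self):
        # one resource per running spider, keyed by spider name
        self.connections = {}

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # allocate the resource when the spider starts
        self.connections[spider.name] = sqlite3.connect('%s.sqlite' % spider.name)

    def spider_closed(self, spider):
        # release it when the spider finishes, so nothing accumulates
        conn = self.connections.pop(spider.name, None)
        if conn is not None:
            conn.close()
Such an extension would be enabled through the EXTENSIONS setting (e.g. with a path like 'myproject.extensions.PerSpiderResourceExtension'; the path is illustrative).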
Too Many Requests?¶
By default Scrapy keeps the request queue in memory; it includes
Request
objects and all objects
referenced in Request attributes (e.g. in meta
).
While not necessarily a leak, this can take a lot of memory. Enabling
persistent job queue could help keep memory usage
under control.
Debugging memory leaks with trackref
¶
trackref
is a module provided by Scrapy to debug the most common cases of
memory leaks. It basically tracks the references to all live Requests,
Responses, Item and Selector objects.
You can enter the telnet console and inspect how many objects (of the classes
mentioned above) are currently alive using the prefs()
function which is an
alias to the print_live_refs()
function:
telnet localhost 6023
>>> prefs()
Live References
ExampleSpider 1 oldest: 15s ago
HtmlResponse 10 oldest: 1s ago
Selector 2 oldest: 0s ago
FormRequest 878 oldest: 7s ago
As you can see, that report also shows the "age" of the oldest object in each
class. If you're running multiple spiders per process chances are you can
figure out which spider is leaking by looking at the oldest request or response.
You can get the oldest object of each class using the
get_oldest()
function (from the telnet console).
Which objects are tracked?¶
The objects tracked by trackref are all from these classes (and all their subclasses); a short sketch of tracking a custom class follows the list:
scrapy.http.Request
scrapy.http.Response
scrapy.item.Item
scrapy.selector.Selector
scrapy.spiders.Spider
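Your own classes can opt into this tracking as well; a minimal sketch (ProductRecord is a hypothetical class used only for illustration):
from scrapy.utils.trackref import object_ref, print_live_refs


class ProductRecord(object_ref):
    # instances are registered with trackref because we inherit from object_ref
    def __init__(self, url):
        self.url = url


records = [ProductRecord('http://example.com/item/%d' % i) for i in range(3)]
print_live_refs()  # the report now also lists live ProductRecord objects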
A real example¶
Let's see a concrete example of a hypothetical case of memory leaks. Suppose we have some spider with a line similar to this one:
return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
               callback=self.parse, meta={'referer': response})
That line is passing a response reference inside a request, which effectively ties the response lifetime to the request's, and that would definitely cause memory leaks.
Let's see how we can discover the cause (without knowing it
a-priori, of course) by using the trackref
tool.
After the crawler is running for a few minutes and we notice its memory usage has grown a lot, we can enter its telnet console and check the live references:
>>> prefs()
Live References
SomenastySpider 1 oldest: 15s ago
HtmlResponse 3890 oldest: 265s ago
Selector 2 oldest: 0s ago
Request 3878 oldest: 250s ago
The fact that there are so many live responses (and that they're so old) is definitely suspicious, as responses should have a relatively short lifetime compared to Requests. The number of responses is similar to the number of requests, so it looks like they are tied in some way. We can now go and check the code of the spider to discover the nasty line that is generating the leaks (passing response references inside requests).
Sometimes extra information about live objects can be helpful. Let's check the oldest response:
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('HtmlResponse')
>>> r.url
'http://www.somenastyspider.com/product.php?pid=123'
If you want to iterate over all objects, instead of getting the oldest one, you
can use the scrapy.utils.trackref.iter_all()
function:
>>> from scrapy.utils.trackref import iter_all
>>> [r.url for r in iter_all('HtmlResponse')]
['http://www.somenastyspider.com/product.php?pid=123',
'http://www.somenastyspider.com/product.php?pid=584',
...
Too many spiders?¶
If your project has too many spiders executed in parallel,
the output of prefs()
can be difficult to read.
For this reason, that function has an ignore
argument which can be used to
ignore a particular class (and all its subclasses). For
example, this won't show any live references to spiders:
>>> from scrapy.spiders import Spider
>>> prefs(ignore=Spider)
scrapy.utils.trackref module¶
Here are the functions available in the trackref
module.
-
class
scrapy.utils.trackref.
object_ref
¶ Inherit from this class (instead of object) if you want to track live instances with the
trackref
module.
-
scrapy.utils.trackref.
print_live_refs
(class_name, ignore=NoneType)¶ Print a report of live references, grouped by class name.
Parameters: ignore (class or classes tuple) -- if given, all objects from the specified class (or tuple of classes) will be ignored.
-
scrapy.utils.trackref.
get_oldest
(class_name)¶ Return the oldest object alive with the given class name, or
None
if none is found. Use print_live_refs()
first to get a list of all tracked live objects per class name.
-
scrapy.utils.trackref.
iter_all
(class_name)¶ Return an iterator over all objects alive with the given class name, or
None
if none is found. Use print_live_refs()
first to get a list of all tracked live objects per class name.
Debugging memory leaks with Guppy¶
trackref
provides a very convenient mechanism for tracking down memory
leaks, but it only keeps track of the objects that are more likely to cause
memory leaks (Requests, Responses, Items, and Selectors). However, there are
other cases where the memory leaks could come from other (more or less obscure)
objects. If this is your case, and you can't find your leaks using trackref
,
you still have another resource: the Guppy library.
If you're using Python 3, see Debugging memory leaks with muppy.
If you use pip
, you can install Guppy with the following command:
pip install guppy
The telnet console also comes with a built-in shortcut (hpy
) for accessing
Guppy heap objects. Here's an example to view all Python objects available in
the heap using Guppy:
>>> x = hpy.heap()
>>> x.bytype
Partition of a set of 297033 objects. Total size = 52587824 bytes.
Index Count % Size % Cumulative % Type
0 22307 8 16423880 31 16423880 31 dict
1 122285 41 12441544 24 28865424 55 str
2 68346 23 5966696 11 34832120 66 tuple
3 227 0 5836528 11 40668648 77 unicode
4 2461 1 2222272 4 42890920 82 type
5 16870 6 2024400 4 44915320 85 function
6 13949 5 1673880 3 46589200 89 types.CodeType
7 13422 5 1653104 3 48242304 92 list
8 3735 1 1173680 2 49415984 94 _sre.SRE_Pattern
9 1209 0 456936 1 49872920 95 scrapy.http.headers.Headers
<1676 more rows. Type e.g. '_.more' to view.>
You can see that most space is used by dicts. Then, if you want to see from which attribute those dicts are referenced, you could do:
>>> x.bytype[0].byvia
Partition of a set of 22307 objects. Total size = 16423880 bytes.
Index Count % Size % Cumulative % Referred Via:
0 10982 49 9416336 57 9416336 57 '.__dict__'
1 1820 8 2681504 16 12097840 74 '.__dict__', '.func_globals'
2 3097 14 1122904 7 13220744 80
3 990 4 277200 2 13497944 82 "['cookies']"
4 987 4 276360 2 13774304 84 "['cache']"
5 985 4 275800 2 14050104 86 "['meta']"
6 897 4 251160 2 14301264 87 '[2]'
7 1 0 196888 1 14498152 88 "['moduleDict']", "['modules']"
8 672 3 188160 1 14686312 89 "['cb_kwargs']"
9 27 0 155016 1 14841328 90 '[1]'
<333 more rows. Type e.g. '_.more' to view.>
As you can see, the Guppy module is very powerful but also requires some deep knowledge about Python internals. For more info about Guppy, refer to the Guppy documentation.
Debugging memory leaks with muppy¶
If you're using Python 3, you can use muppy from Pympler.
If you use pip
, you can install muppy with the following command:
pip install Pympler
Here's an example to view all Python objects available in the heap using muppy:
>>> from pympler import muppy
>>> all_objects = muppy.get_objects()
>>> len(all_objects)
28667
>>> from pympler import summary
>>> suml = summary.summarize(all_objects)
>>> summary.print_(suml)
types | # objects | total size
==================================== | =========== | ============
<class 'str | 9822 | 1.10 MB
<class 'dict | 1658 | 856.62 KB
<class 'type | 436 | 443.60 KB
<class 'code | 2974 | 419.56 KB
<class '_io.BufferedWriter | 2 | 256.34 KB
<class 'set | 420 | 159.88 KB
<class '_io.BufferedReader | 1 | 128.17 KB
<class 'wrapper_descriptor | 1130 | 88.28 KB
<class 'tuple | 1304 | 86.57 KB
<class 'weakref | 1013 | 79.14 KB
<class 'builtin_function_or_method | 958 | 67.36 KB
<class 'method_descriptor | 865 | 60.82 KB
<class 'abc.ABCMeta | 62 | 59.96 KB
<class 'list | 446 | 58.52 KB
<class 'int | 1425 | 43.20 KB
For more info about muppy, refer to the muppy documentation.
Leaks without leaks¶
Sometimes, you may notice that the memory usage of your Scrapy process will only increase, but never decrease. Unfortunately, this can happen even though neither Scrapy nor your project are leaking memory. This is due to a (not so well known) problem of Python, which may not return released memory to the operating system in some cases. For more information on this issue see the discussion below.
The improvements proposed by Evan Jones, which are detailed in this paper, got merged in Python 2.5, but this only reduces the problem; it doesn't fix it completely. To quote the paper:
Unfortunately, this patch can only free an arena if there are no more objects allocated in it anymore. This means that fragmentation is a large issue. An application could have many megabytes of free memory, scattered throughout all the arenas, but it will be unable to free any of it. This is a problem experienced by all memory allocators. The only way to solve it is to move to a compacting garbage collector, which is able to move objects in memory. This would require significant changes to the Python interpreter.
To keep memory consumption reasonable you can split the job into several smaller jobs or enable persistent job queue and stop/start spider from time to time.
Downloading and processing files and images¶
Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally). These pipelines share a bit of functionality and structure (we refer to them as media pipelines), but typically you'll either use the Files Pipeline or the Images Pipeline.
Both pipelines implement these features:
- Avoid re-downloading media that was downloaded recently
- Specifying where to store the media (filesystem directory, Amazon S3 bucket, Google Cloud Storage bucket)
The Images Pipeline has a few extra functions for processing images:
- Convert all downloaded images to a common format (JPG) and mode (RGB)
- Thumbnail generation
- Check images width/height to make sure they meet a minimum constraint
The pipelines also keep an internal queue of those media URLs which are currently being scheduled for download, and connect those responses that arrive containing the same media to that queue. This avoids downloading the same media more than once when it's shared by several items.
Using the Files Pipeline¶
The typical workflow, when using the FilesPipeline, goes like this:
- In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field (a minimal spider following this workflow is sketched after this list).
- The item is returned from the spider and goes to the item pipeline.
- When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains "locked" at that particular pipeline stage until the files have finished downloading (or fail for some reason).
- When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won't be present in the files field.
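A minimal spider following this workflow might look like the sketch below; the domain, the CSS selector and the storage path are illustrative, and the FilesPipeline and FILES_STORE settings it relies on are described in the following sections:
import scrapy


class PdfSpider(scrapy.Spider):
    name = 'pdf_example'
    start_urls = ['http://example.com/reports/']

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE': '/path/to/valid/dir',
    }

    def parse(self, response):
        # put the URLs of the desired files into file_urls; the pipeline
        # populates the files field once the downloads have finished
        yield {
            'file_urls': [response.urljoin(href)
                          for href in response.css('a::attr(href)').getall()
                          if href.endswith('.pdf')],
        }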
Using the Images Pipeline¶
Using the ImagesPipeline
is a lot like using the FilesPipeline
,
except the default field names used are different: you use image_urls
for
the image URLs of an item and it will populate an images
field for the information
about the downloaded images.
The advantage of using the ImagesPipeline
for image files is that you
can configure some extra functions like generating thumbnails and filtering
the images based on their size.
The Images Pipeline uses Pillow for thumbnailing and normalizing images to JPEG/RGB format, so you need to install this library in order to use it. Python Imaging Library (PIL) should also work in most cases, but it is known to cause trouble in some setups, so we recommend using Pillow instead of PIL.
Enabling your Media Pipeline¶
To enable your media pipeline you must first add it to your project
ITEM_PIPELINES
setting.
For Images Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
For Files Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
Note
You can also use both the Files and Images Pipeline at the same time.
Then, configure the target storage setting to a valid value that will be used
for storing the downloaded images. Otherwise the pipeline will remain disabled,
even if you include it in the ITEM_PIPELINES
setting.
For the Files Pipeline, set the FILES_STORE
setting:
FILES_STORE = '/path/to/valid/dir'
For the Images Pipeline, set the IMAGES_STORE
setting:
IMAGES_STORE = '/path/to/valid/dir'
Supported Storage¶
File system is currently the only officially supported storage, but there is also support for storing files in Amazon S3 and Google Cloud Storage.
File system storage¶
The files are stored using a SHA1 hash of their URLs for the file names.
For example, the following image URL:
http://www.example.com/image.jpg
Whose SHA1 hash is:
3afec3b4765f8f0a07b78f98c07b83f013567a0a
Will be downloaded and stored in the following file:
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
Where:
- <IMAGES_STORE> is the directory defined in the IMAGES_STORE setting for the Images Pipeline.
- full is a sub-directory to separate full images from thumbnails (if used). For more info see Thumbnail generation for images.
Amazon S3 storage¶
FILES_STORE
and IMAGES_STORE
can represent an Amazon S3
bucket. Scrapy will automatically upload the files to the bucket.
For example, this is a valid IMAGES_STORE
value:
IMAGES_STORE = 's3://bucket/images'
You can modify the Access Control List (ACL) policy used for the stored files,
which is defined by the FILES_STORE_S3_ACL
and
IMAGES_STORE_S3_ACL
settings. By default, the ACL is set to
private
. To make the files publicly available use the public-read
policy:
IMAGES_STORE_S3_ACL = 'public-read'
For more information, see canned ACLs in the Amazon S3 Developer Guide.
Because Scrapy uses boto / botocore internally, you can also use other S3-like storages, such as self-hosted Minio or s3.scality. All you need to do is set the endpoint option in your Scrapy settings:
AWS_ENDPOINT_URL = 'http://minio.example.com:9000'
For self-hosted storage you might also want to disable SSL or skip verification of the SSL connection:
AWS_USE_SSL = False # or True (None by default)
AWS_VERIFY = False # or True (None by default)
Google Cloud Storage¶
FILES_STORE
and IMAGES_STORE
can represent a Google Cloud Storage
bucket. Scrapy will automatically upload the files to the bucket. (requires google-cloud-storage )
For example, these are valid IMAGES_STORE
and GCS_PROJECT_ID
settings:
IMAGES_STORE = 'gs://bucket/images/'
GCS_PROJECT_ID = 'project_id'
For information about authentication, see this documentation.
You can modify the Access Control List (ACL) policy used for the stored files,
which is defined by the FILES_STORE_GCS_ACL
and
IMAGES_STORE_GCS_ACL
settings. By default, the ACL is set to
''
(empty string) which means that Cloud Storage applies the bucket's default object ACL to the object.
To make the files publicly available use the publicRead
policy:
IMAGES_STORE_GCS_ACL = 'publicRead'
For more information, see Predefined ACLs in the Google Cloud Platform Developer Guide.
Usage example¶
In order to use a media pipeline, first enable it.
Then, if a spider returns a dict with the URLs key (file_urls or image_urls, for the Files or Images Pipeline respectively), the pipeline will put the results under the respective key (files or images).
If you prefer to use Item
, then define a custom item with the
necessary fields, like in this example for Images Pipeline:
import scrapy
class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
If you want to use another field name for the URLs key or for the results key, it is also possible to override it.
For the Files Pipeline, set FILES_URLS_FIELD
and/or
FILES_RESULT_FIELD
settings:
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
For the Images Pipeline, set IMAGES_URLS_FIELD
and/or
IMAGES_RESULT_FIELD
settings:
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
If you need something more complex and want to override the custom pipeline behaviour, see Extending the Media Pipelines.
If you have multiple image pipelines inheriting from ImagesPipeline and you want to have different settings in different pipelines, you can set setting keys preceded by the uppercase name of your pipeline class. E.g. if your pipeline is called MyPipeline and you want a custom IMAGES_URLS_FIELD, you define the setting MYPIPELINE_IMAGES_URLS_FIELD and your custom setting will be used; see the example below.
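For example, a sketch of such a configuration (the pipeline path and field name are illustrative):
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 1,
}
MYPIPELINE_IMAGES_URLS_FIELD = 'product_image_urls'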
Additional features¶
File expiration¶
The Image Pipeline avoids downloading files that were downloaded recently. To
adjust this retention delay use the FILES_EXPIRES
setting (or
IMAGES_EXPIRES
, in case of Images Pipeline), which
specifies the delay in number of days:
# 120 days of delay for files expiration
FILES_EXPIRES = 120
# 30 days of delay for images expiration
IMAGES_EXPIRES = 30
The default value for both settings is 90 days.
If you have a pipeline that subclasses FilesPipeline and you'd like to have a different setting for it, you can set setting keys preceded by the uppercase class name. E.g. given a pipeline class called MyPipeline you can set the setting key:
MYPIPELINE_FILES_EXPIRES = 180
and the pipeline class MyPipeline will have its expiration time set to 180 days.
Thumbnail generation for images¶
The Images Pipeline can automatically create thumbnails of the downloaded images.
In order to use this feature, you must set IMAGES_THUMBS
to a dictionary
where the keys are the thumbnail names and the values are their dimensions.
For example:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
When you use this feature, the Images Pipeline will create thumbnails of each specified size with this format:
<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
Where:
- <size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc.)
- <image_id> is the SHA1 hash of the image url
Example of image files stored using small
and big
thumbnail names:
<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
The first one is the full image, as downloaded from the site.
Filtering out small images¶
When using the Images Pipeline, you can drop images which are too small, by
specifying the minimum allowed size in the IMAGES_MIN_HEIGHT
and
IMAGES_MIN_WIDTH
settings.
For example:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
Note
The size constraints don't affect thumbnail generation at all.
It is possible to set just one size constraint or both. When setting both of them, only images that satisfy both minimum sizes will be saved. For the above example, images of sizes (105 x 105) or (105 x 200) or (200 x 105) will all be dropped because at least one dimension is shorter than the constraint.
By default, there are no size constraints, so all images are processed.
Allowing redirections¶
By default media pipelines ignore redirects, i.e. an HTTP redirection to a media file URL request will mean the media download is considered failed.
To handle media redirections, set this setting to True
:
MEDIA_ALLOW_REDIRECTS = True
Extending the Media Pipelines¶
See here the methods that you can override in your custom Files Pipeline:
-
class
scrapy.pipelines.files.
FilesPipeline
¶ -
get_media_requests
(item, info)¶ As seen in the workflow, the pipeline will get the URLs of the images to download from the item. In order to do this, you can override the get_media_requests() method and return a Request for each file URL:
def get_media_requests(self, item, info):
    for file_url in item['file_urls']:
        yield scrapy.Request(file_url)
Those requests will be processed by the pipeline and, when they have finished downloading, the results will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain (success, file_info_or_error) where:
- success is a boolean which is True if the image was downloaded successfully or False if it failed for some reason
- file_info_or_error is a dict containing the following keys (if success is True) or a Twisted Failure if there was a problem:
  - url - the url where the file was downloaded from. This is the url of the request returned from the get_media_requests() method.
  - path - the path (relative to FILES_STORE) where the file was stored
  - checksum - an MD5 hash of the image contents
The list of tuples received by
item_completed()
is guaranteed to retain the same order of the requests returned from the get_media_requests() method. Here's a typical value of the results argument:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False, Failure(...))]
By default the get_media_requests() method returns None, which means there are no files to download for the item.
-
item_completed
(results, item, info)¶ The
FilesPipeline.item_completed()
method is called when all file requests for a single item have completed (either finished downloading, or failed for some reason). The
item_completed()
method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline. Here is an example of the
item_completed()
method where we store the downloaded file paths (passed in results) in the file_paths
item field, and we drop the item if it doesn't contain any files:
from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    item['file_paths'] = file_paths
    return item
By default, the
item_completed()
method returns the item.
-
See here the methods that you can override in your custom Images Pipeline:
-
class
scrapy.pipelines.images.
ImagesPipeline
¶ - The
ImagesPipeline
is an extension of the FilesPipeline, customizing the field names and adding custom behavior for images.
-
get_media_requests
(item, info)¶ Works the same way as
FilesPipeline.get_media_requests()
method, but using a different field name for image urls. Must return a Request for each image URL.
-
item_completed
(results, item, info)¶ The
ImagesPipeline.item_completed()
method is called when all image requests for a single item have completed (either finished downloading, or failed for some reason). Works the same way as
FilesPipeline.item_completed()
method, but using different field names for storing image downloading results. By default, the
item_completed()
method returns the item.
-
Custom Images pipeline example¶
Here is a full example of the Images Pipeline whose methods are exemplified above:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
Deploying Spiders¶
This section describes the different options you have for deploying your Scrapy spiders to run them on a regular basis. Running Scrapy spiders on your local machine is very convenient for the (early) development stage, but not so much when you need to execute long-running spiders or move spiders to run in production continuously. This is where the solutions for deploying Scrapy spiders come in.
Popular choices for deploying Scrapy spiders are:
- Scrapyd (open source)
- Scrapy Cloud (cloud-based)
Deploying to a Scrapyd Server¶
Scrapyd is an open source application to run Scrapy spiders. It provides a server with an HTTP API, capable of running and monitoring Scrapy spiders.
To deploy spiders to Scrapyd, you can use the scrapyd-deploy tool provided by the scrapyd-client package. Please refer to the scrapyd-deploy documentation for more information.
Scrapyd is maintained by some of the Scrapy developers.
Deploying to Scrapy Cloud¶
Scrapy Cloud is a hosted, cloud-based service by Scrapinghub, the company behind Scrapy.
Scrapy Cloud removes the need to set up and monitor servers and provides a nice UI to manage spiders and review scraped items, logs and stats.
To deploy spiders to Scrapy Cloud you can use the shub command line tool. Please refer to the Scrapy Cloud documentation for more information.
Scrapy Cloud is compatible with Scrapyd and one can switch between
them as needed - the configuration is read from the scrapy.cfg
file
just like scrapyd-deploy
.
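For reference, a minimal scrapy.cfg with a deploy target might look like this sketch; the Scrapyd URL and project name are placeholders, and the keys follow the scrapyd-deploy conventions:
[settings]
default = myproject.settings

[deploy]
url = http://localhost:6800/
project = myproject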
AutoThrottle extension¶
This is an extension for automatically throttling crawling speed based on load of both the Scrapy server and the website you are crawling.
Design goals¶
- be nicer to sites instead of using default download delay of zero
- automatically adjust scrapy to the optimum crawling speed, so the user doesn't have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.
How it works¶
The AutoThrottle extension adjusts download delays dynamically to make the spider send AUTOTHROTTLE_TARGET_CONCURRENCY concurrent requests on average to each remote website.
It uses download latency to compute the delays. The main idea is the
following: if a server needs latency
seconds to respond, a client
should send a request each latency/N
seconds to have N
requests
processed in parallel.
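For example, assuming AUTOTHROTTLE_TARGET_CONCURRENCY is set to 2 and a server needs 0.5 seconds to respond, the target delay between requests to that site would be 0.5 / 2 = 0.25 seconds.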
Instead of adjusting the delays one can just set a small fixed
download delay and impose hard limits on concurrency using
CONCURRENT_REQUESTS_PER_DOMAIN
or
CONCURRENT_REQUESTS_PER_IP
options. It will provide a similar
effect, but there are some important differences:
- because the download delay is small there will be occasional bursts of requests;
- often non-200 (error) responses can be returned faster than regular responses, so with a small download delay and a hard concurrency limit the crawler will be sending requests to the server faster when the server starts to return errors. But this is the opposite of what a crawler should do - in case of errors it makes more sense to slow down: these errors may be caused by the high request rate.
AutoThrottle doesn't have these issues.
Throttling algorithm¶
The AutoThrottle algorithm adjusts download delays based on the following rules (a minimal sketch of the update step follows the list):
- spiders always start with a download delay of
AUTOTHROTTLE_START_DELAY
; - when a response is received, the target download delay is calculated as
latency / N
wherelatency
is a latency of the response, andN
isAUTOTHROTTLE_TARGET_CONCURRENCY
. - download delay for next requests is set to the average of previous download delay and the target download delay;
- latencies of non-200 responses are not allowed to decrease the delay;
- download delay can't become less than
DOWNLOAD_DELAY
or greater than AUTOTHROTTLE_MAX_DELAY.
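A minimal sketch of this update step in Python (an illustrative helper, not Scrapy's actual implementation):
def next_download_delay(prev_delay, latency, target_concurrency,
                        min_delay, max_delay, response_ok=True):
    # target delay derived from the observed latency
    target_delay = latency / target_concurrency
    # average of the previous delay and the target delay
    new_delay = (prev_delay + target_delay) / 2.0
    # latencies of non-200 responses are not allowed to decrease the delay
    if not response_ok:
        new_delay = max(new_delay, prev_delay)
    # clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY
    return min(max(new_delay, min_delay), max_delay)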
Note
The AutoThrottle extension honours the standard Scrapy settings for
concurrency and delay. This means that it will respect
CONCURRENT_REQUESTS_PER_DOMAIN
and
CONCURRENT_REQUESTS_PER_IP
options and
never set a download delay lower than DOWNLOAD_DELAY
.
In Scrapy, the download latency is measured as the time elapsed between establishing the TCP connection and receiving the HTTP headers.
Note that these latencies are very hard to measure accurately in a cooperative multitasking environment because Scrapy may be busy processing a spider callback, for example, and unable to attend downloads. However, these latencies should still give a reasonable estimate of how busy Scrapy (and ultimately, the server) is, and this extension builds on that premise.
Settings¶
The settings used to control the AutoThrottle extension are:
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_START_DELAY
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_TARGET_CONCURRENCY
AUTOTHROTTLE_DEBUG
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
For more information see How it works.
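A typical configuration enabling the extension might look like this sketch (the values shown here mirror the documented defaults and are only a starting point):
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False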
AUTOTHROTTLE_MAX_DELAY¶
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
AUTOTHROTTLE_TARGET_CONCURRENCY¶
New in version 1.1.
Default: 1.0
Average number of requests Scrapy should be sending in parallel to remote websites.
By default, AutoThrottle adjusts the delay to send a single
concurrent request to each of the remote websites. Set this option to
a higher value (e.g. 2.0
) to increase the throughput and the load on remote
servers. A lower AUTOTHROTTLE_TARGET_CONCURRENCY
value
(e.g. 0.5
) makes the crawler more conservative and polite.
Note that CONCURRENT_REQUESTS_PER_DOMAIN
and CONCURRENT_REQUESTS_PER_IP
options are still respected
when AutoThrottle extension is enabled. This means that if
AUTOTHROTTLE_TARGET_CONCURRENCY
is set to a value higher than
CONCURRENT_REQUESTS_PER_DOMAIN
or
CONCURRENT_REQUESTS_PER_IP
, the crawler won't reach this number
of concurrent requests.
At every given time point Scrapy can be sending more or less concurrent
requests than AUTOTHROTTLE_TARGET_CONCURRENCY
; it is a suggested
value the crawler tries to approach, not a hard limit.
AUTOTHROTTLE_DEBUG¶
Default: False
Enable AutoThrottle debug mode which will display stats on every response received, so you can see how the throttling parameters are being adjusted in real time.
Benchmarking¶
New in version 0.17.
Scrapy comes with a simple benchmarking suite that spawns a local HTTP server and crawls it at the maximum possible speed. The goal of this benchmarking is to get an idea of how Scrapy performs on your hardware, in order to have a common baseline for comparisons. It uses a simple spider that does nothing and just follows links.
To run it use:
scrapy bench
You should see an output like this:
2016-12-16 21:18:48 [scrapy.utils.log] INFO: Scrapy 1.2.2 started (bot: quotesbot)
2016-12-16 21:18:48 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotesbot.spiders'], 'LOGSTATS_INTERVAL': 1, 'BOT_NAME': 'quotesbot', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'quotesbot.spiders'}
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2016-12-16 21:18:49 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:18:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:50 [scrapy.extensions.logstats] INFO: Crawled 70 pages (at 4200 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:51 [scrapy.extensions.logstats] INFO: Crawled 134 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:52 [scrapy.extensions.logstats] INFO: Crawled 198 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:53 [scrapy.extensions.logstats] INFO: Crawled 254 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:54 [scrapy.extensions.logstats] INFO: Crawled 302 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:55 [scrapy.extensions.logstats] INFO: Crawled 358 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:56 [scrapy.extensions.logstats] INFO: Crawled 406 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:57 [scrapy.extensions.logstats] INFO: Crawled 438 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:58 [scrapy.extensions.logstats] INFO: Crawled 470 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:18:59 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2016-12-16 21:18:59 [scrapy.extensions.logstats] INFO: Crawled 518 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:19:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 229995,
'downloader/request_count': 534,
'downloader/request_method_count/GET': 534,
'downloader/response_bytes': 1565504,
'downloader/response_count': 534,
'downloader/response_status_count/200': 534,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2016, 12, 16, 16, 19, 0, 647725),
'log_count/INFO': 17,
'request_depth_max': 19,
'response_received_count': 534,
'scheduler/dequeued': 533,
'scheduler/dequeued/memory': 533,
'scheduler/enqueued': 10661,
'scheduler/enqueued/memory': 10661,
'start_time': datetime.datetime(2016, 12, 16, 16, 18, 49, 799869)}
2016-12-16 21:19:00 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
That tells you that Scrapy is able to crawl about 3000 pages per minute on the hardware where you run it. Note that this is a very simple spider intended only to follow links; any custom spider you write will probably do more work, which results in slower crawl rates. How much slower depends on how much your spider does and how well it's written.
In the future, more cases will be added to the benchmarking suite to cover other common scenarios.
Jobs: pausing and resuming crawls¶
Sometimes, for big sites, it's desirable to pause crawls and be able to resume them later.
Scrapy supports this functionality out of the box by providing the following facilities:
- a scheduler that persists scheduled requests on disk
- a duplicates filter that persists visited requests on disk
- an extension that keeps some spider state (key/value pairs) persistent between batches
Job directory¶
To enable persistence support you just need to define a job directory through
the JOBDIR
setting. This directory will be for storing all required data to
keep the state of a single job (i.e. a spider run). It's important to note that
this directory must not be shared by different spiders, or even different
jobs/runs of the same spider, as it's meant to be used for storing the state of
a single job.
How to use it¶
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Keeping persistent state between batches¶
Sometimes you'll want to keep some persistent spider state between pause/resume
batches. You can use the spider.state
attribute for that, which should be a
dict. There's a built-in extension that takes care of serializing, storing and
loading that attribute from the job directory, when the spider starts and
stops.
Here's an example of a callback that uses the spider state (other spider code is omitted for brevity):
def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1
Persistence gotchas¶
There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:
Cookies expiration¶
Cookies may expire. So, if you don't resume your spider quickly the requests scheduled may no longer work. This won't be an issue if your spider doesn't rely on cookies.
Request serialization¶
Requests must be serializable by the pickle module, in order for persistence to work, so you should make sure that your requests are serializable.
The most common issue here is to use lambda
functions on request callbacks that
can't be persisted.
So, for example, this won't work:
def some_callback(self, response):
    somearg = 'test'
    return scrapy.Request('http://www.example.com',
                          callback=lambda r: self.other_callback(r, somearg))

def other_callback(self, response, somearg):
    print("the argument passed is: %s" % somearg)
But this will:
def some_callback(self, response):
    somearg = 'test'
    return scrapy.Request('http://www.example.com',
                          callback=self.other_callback, meta={'somearg': somearg})

def other_callback(self, response):
    somearg = response.meta['somearg']
    print("the argument passed is: %s" % somearg)
If you wish to log the requests that couldn't be serialized, you can set the
SCHEDULER_DEBUG
setting to True
in the project's settings page.
It is False
by default.
- Frequently Asked Questions
- Get answers to most frequently asked questions.
- Debugging Spiders
- Learn how to debug common problems of your Scrapy spider.
- Spiders Contracts
- Learn how to use contracts for testing your spiders.
- Common Practices
- Get familiar with some Scrapy common practices.
- Broad Crawls
- Tune Scrapy for crawling a lot of domains in parallel.
- Using your browser's Developer Tools for scraping
- Learn how to scrape with your browser's developer tools.
- Debugging memory leaks
- Learn how to find and get rid of memory leaks in your crawler.
- Downloading and processing files and images
- Download files and/or images associated with your scraped items.
- Deploying Spiders
- Deploy your spiders and run them on a remote server.
- AutoThrottle extension
- Adjust crawl rate dynamically based on load.
- Benchmarking
- Check how Scrapy performs on your hardware.
- Jobs: pausing and resuming crawls
- Learn how to pause and resume crawls for large spiders.
Extending Scrapy¶
Architecture overview¶
This document describes the architecture of Scrapy and how its components interact.
Overview¶
The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow that takes place inside the system (shown by the red arrows). A brief description of the components is included below with links for more detailed information about them. The data flow is also described below.
Data flow¶

The data flow in Scrapy is controlled by the execution engine, and goes like this:
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the
Downloader, passing through the
Downloader Middlewares (see
process_request()
). - Once the page finishes downloading the
Downloader generates a Response (with
that page) and sends it to the Engine, passing through the
Downloader Middlewares (see
process_response()
). - The Engine receives the Response from the
Downloader and sends it to the
Spider for processing, passing
through the Spider Middleware (see
process_spider_input()
). - The Spider processes the Response and returns
scraped items and new Requests (to follow) to the
Engine, passing through the
Spider Middleware (see
process_spider_output()
- The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
- The process repeats (from step 1) until there are no more requests from the Scheduler.
Components¶
Scrapy Engine¶
The engine is responsible for controlling the data flow between all components of the system, and triggering events when certain actions occur. See the Data Flow section above for more details.
Scheduler¶
The Scheduler receives requests from the engine and enqueues them, so that it can feed them back to the engine later, when the engine requests them.
Downloader¶
The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.
Spiders¶
Spiders are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them or additional requests to follow. For more information see Spider.
Item Pipeline¶
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database). For more information see Item Pipeline.
Downloader middlewares¶
Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine.
Use a Downloader middleware if you need to do one of the following:
- process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
- change received response before passing it to a spider;
- send a new Request instead of passing received response to a spider;
- pass response to a spider without fetching a web page;
- silently drop some requests.
For more information see Downloader Middleware.
Spider middlewares¶
Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests).
Use a Spider middleware if you need to
- post-process output of spider callbacks - change/add/remove requests or items;
- post-process start_requests;
- handle spider exceptions;
- call errback instead of callback for some of the requests based on response content.
For more information see Spider Middleware.
Downloader Middleware¶
The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses.
Activating a downloader middleware¶
To activate a downloader middleware component, add it to the
DOWNLOADER_MIDDLEWARES
setting, which is a dict whose keys are the
middleware class paths and their values are the middleware orders.
Here's an example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
The DOWNLOADER_MIDDLEWARES
setting is merged with the
DOWNLOADER_MIDDLEWARES_BASE
setting defined in Scrapy (and not meant
to be overridden) and then sorted by order to get the final sorted list of
enabled middlewares: the first middleware is the one closer to the engine and
the last is the one closer to the downloader. In other words,
the process_request()
method of each middleware will be invoked in increasing
middleware order (100, 200, 300, ...) and the process_response()
method
of each middleware will be invoked in decreasing order.
To decide which order to assign to your middleware see the
DOWNLOADER_MIDDLEWARES_BASE
setting and pick a value according to
where you want to insert the middleware. The order does matter because each
middleware performs a different action and your middleware could depend on some
previous (or subsequent) middleware being applied.
If you want to disable a built-in middleware (the ones defined in
DOWNLOADER_MIDDLEWARES_BASE
and enabled by default) you must define it
in your project's DOWNLOADER_MIDDLEWARES
setting and assign None
as its value. For example, if you want to disable the user-agent middleware:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware documentation for more info.
Writing your own downloader middleware¶
Each middleware component is a Python class that defines one or more of the following methods:
- class scrapy.downloadermiddlewares.DownloaderMiddleware¶
  Note
  Any of the downloader middleware methods may also return a deferred.
  - process_request(request, spider)¶
    This method is called for each request that goes through the downloader middleware.
    process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.
    If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).
    If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it'll return that response. The process_response() methods of installed middleware are always called on every response.
    If it returns a Request object, Scrapy will stop calling process_request() methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
    If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
    Parameters:
  - process_response(request, response, spider)¶
    process_response() should either: return a Response object, return a Request object or raise an IgnoreRequest exception.
    If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
    If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
    If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
    Parameters:
  - process_exception(request, exception, spider)¶
    Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).
    process_exception() should return: either None, a Response object, or a Request object.
    If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of installed middleware, until no middleware is left and the default exception handling kicks in.
    If it returns a Response object, the process_response() method chain of installed middleware is started, and Scrapy won't bother calling any other process_exception() methods of middleware.
    If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of process_exception() methods of the middleware the same as returning a response would.
    Parameters:
  - from_crawler(cls, crawler)¶
    If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. The Crawler object provides access to all Scrapy core components like settings and signals; it is a way for middleware to access them and hook its functionality into Scrapy.
    Parameters: crawler (Crawler object) -- crawler that uses this middleware
-
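To make the method reference above concrete, here is a minimal downloader middleware sketch. It is only illustrative: the class name and the X-Example header are hypothetical, but the hook methods and return-value rules are the ones documented above.
class CustomDownloaderMiddleware(object):
    """Hypothetical middleware that adds a header to every outgoing request."""

    @classmethod
    def from_crawler(cls, crawler):
        # Settings and signals could be read from the Crawler object here if needed.
        return cls()

    def process_request(self, request, spider):
        # Returning None lets Scrapy keep processing this request.
        request.headers.setdefault('X-Example', 'some-value')
        return None

    def process_response(self, request, response, spider):
        # Must return a Response or Request object (or raise IgnoreRequest).
        return response

    def process_exception(self, request, exception, spider):
        # Returning None lets other middlewares (and the default handling) deal with the exception.
        return None
Such a class would then be enabled through the DOWNLOADER_MIDDLEWARES setting, as shown in the activation example earlier.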
Built-in downloader middleware reference¶
This page describes all downloader middleware components that come with Scrapy. For information on how to use them and how to write your own downloader middleware, see the downloader middleware usage guide.
For a list of the components enabled by default (and their orders) see the
DOWNLOADER_MIDDLEWARES_BASE
setting.
CookiesMiddleware¶
This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers, and sends them back on subsequent requests (from that spider), just like web browsers do.
The following settings can be used to configure the cookie middleware:
Multiple cookie sessions per spider¶
バージョン 0.15 で追加.
There is support for keeping multiple cookie sessions per spider by using the
cookiejar
Request meta key. By default it uses a single cookie jar
(session), but you can pass an identifier to use different ones.
For example:
for i, url in enumerate(urls):
yield scrapy.Request(url, meta={'cookiejar': i},
callback=self.parse_page)
Keep in mind that the cookiejar
meta key is not "sticky". You need to keep
passing it along on subsequent requests. For example:
def parse_page(self, response):
# do some processing
return scrapy.Request("http://www.example.com/otherpage",
meta={'cookiejar': response.meta['cookiejar']},
callback=self.parse_other_page)
COOKIES_ENABLED¶
Default: True
Whether to enable the cookies middleware. If disabled, no cookies will be sent to web servers.
Notice that, regardless of the value of the COOKIES_ENABLED setting, if Request.meta['dont_merge_cookies'] evaluates to True the request cookies will not be sent to the web server and the cookies received in the Response will not be merged with the existing cookies.
For more detailed information see the cookies
parameter in
Request
.
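For example, a request that keeps its cookies out of the shared cookie jar can be built by setting that meta key (a minimal sketch; the URL is a placeholder):
import scrapy

# Yielded from a spider callback; cookies are neither sent for this request
# nor merged from its response.
request = scrapy.Request('http://www.example.com/',
                         meta={'dont_merge_cookies': True})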
COOKIES_DEBUG¶
Default: False
If enabled, Scrapy will log all cookies sent in requests (ie. Cookie
header) and all cookies received in responses (ie. Set-Cookie
header).
Here's an example of a log with COOKIES_DEBUG
enabled:
2011-04-06 14:35:10-0300 [scrapy.core.engine] INFO: Spider opened
2011-04-06 14:35:10-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.diningcity.com/netherlands/index.html>
Cookie: clientlanguage_nl=en_EN
2011-04-06 14:35:14-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 http://www.diningcity.com/netherlands/index.html>
Set-Cookie: JSESSIONID=B~FA4DC0C496C8762AE4F1A620EAB34F38; Path=/
Set-Cookie: ip_isocode=US
Set-Cookie: clientlanguage_nl=en_EN; Expires=Thu, 07-Apr-2011 21:21:34 GMT; Path=/
2011-04-06 14:49:50-0300 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.diningcity.com/netherlands/index.html> (referer: None)
[...]
DefaultHeadersMiddleware¶
-
class
scrapy.downloadermiddlewares.defaultheaders.
DefaultHeadersMiddleware
¶ This middleware sets all default requests headers specified in the
DEFAULT_REQUEST_HEADERS
setting.
DownloadTimeoutMiddleware¶
- class scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware¶
  This middleware sets the download timeout for requests specified in the DOWNLOAD_TIMEOUT setting or the download_timeout spider attribute.
Note
You can also set the download timeout per-request using the download_timeout Request.meta key; this is supported even when DownloadTimeoutMiddleware is disabled.
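For example, a single slow endpoint could be given a longer timeout than the rest of the crawl (a sketch; the URL and the 180-second value are arbitrary):
import scrapy

# Yielded from a spider callback; only this request uses the longer timeout.
request = scrapy.Request('http://www.example.com/slow-report',
                         meta={'download_timeout': 180})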
HttpAuthMiddleware¶
- class scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware¶
  This middleware authenticates all requests generated from certain spiders using Basic access authentication (aka. HTTP auth).
  To enable HTTP authentication from certain spiders, set the http_user and http_pass attributes of those spiders.
  Example:
  from scrapy.spiders import CrawlSpider

  class SomeIntranetSiteSpider(CrawlSpider):

      http_user = 'someuser'
      http_pass = 'somepass'
      name = 'intranet.example.com'

      # .. rest of the spider code omitted ...
HttpCacheMiddleware¶
- class scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware¶
  This middleware provides low-level cache to all HTTP requests and responses. It has to be combined with a cache storage backend as well as a cache policy.
  Scrapy ships with three HTTP cache storage backends:
  - Filesystem storage backend (default)
  - DBM storage backend
  - LevelDB storage backend
  You can change the HTTP cache storage backend with the HTTPCACHE_STORAGE setting. Or you can also implement your own storage backend.
  Scrapy ships with two HTTP cache policies:
  - RFC2616 policy
  - Dummy policy (default)
  You can change the HTTP cache policy with the HTTPCACHE_POLICY setting. Or you can also implement your own policy.
  You can also avoid caching a response on every policy by setting the dont_cache meta key to True.
Dummy policy (default)¶
This policy has no awareness of any HTTP Cache-Control directives. Every request and its corresponding response are cached. When the same request is seen again, the response is returned without transferring anything from the Internet.
The Dummy policy is useful for testing spiders faster (without having to wait for downloads every time) and for trying your spider offline, when an Internet connection is not available. The goal is to be able to "replay" a spider run exactly as it ran before.
In order to use this policy, set:
HTTPCACHE_POLICY
toscrapy.extensions.httpcache.DummyPolicy
RFC2616 policy¶
This policy provides a RFC2616 compliant HTTP cache, i.e. with HTTP Cache-Control awareness, aimed at production and used in continuous runs to avoid downloading unmodified data (to save bandwidth and speed up crawls).
What is implemented:
- Do not attempt to store responses/requests with no-store cache-control directive set
- Do not serve responses from cache if no-cache cache-control directive is set even for fresh responses
- Compute freshness lifetime from max-age cache-control directive
- Compute freshness lifetime from Expires response header
- Compute freshness lifetime from Last-Modified response header (heuristic used by Firefox)
- Compute current age from Age response header
- Compute current age from Date header
- Revalidate stale responses based on Last-Modified response header
- Revalidate stale responses based on ETag response header
- Set Date header for any received response missing it
- Support max-stale cache-control directive in requests
This allows spiders to be configured with the full RFC2616 cache policy, but avoid revalidation on a request-by-request basis, while remaining conformant with the HTTP spec.
Example:
Add Cache-Control: max-stale=600 to Request headers to accept responses that have exceeded their expiration time by no more than 600 seconds.
See also: RFC2616, 14.9.3
What is missing:
- Pragma: no-cache support https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
- Vary header support https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.6
- Invalidation after updates or deletes https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10
- ... probably others ..
In order to use this policy, set:
HTTPCACHE_POLICY
toscrapy.extensions.httpcache.RFC2616Policy
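For example, a settings.py fragment that turns on the cache with the RFC2616 policy might look like this (a sketch of the settings named above, not a recommended configuration):
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'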
Filesystem storage backend (default)¶
File system storage backend is available for the HTTP cache middleware.
In order to use this storage backend, set:
HTTPCACHE_STORAGE
toscrapy.extensions.httpcache.FilesystemCacheStorage
Each request/response pair is stored in a different directory containing the following files:
- request_body - the plain request body
- request_headers - the request headers (in raw HTTP format)
- response_body - the plain response body
- response_headers - the response headers (in raw HTTP format)
- meta - some metadata of this cache resource in Python repr() format (grep-friendly format)
- pickled_meta - the same metadata in meta but pickled for more efficient deserialization
The directory name is made from the request fingerprint (see
scrapy.utils.request.fingerprint
), and one level of subdirectories is
used to avoid creating too many files into the same directory (which is
inefficient in many file systems). An example directory could be:
/path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
DBM storage backend¶
バージョン 0.13 で追加.
A DBM storage backend is also available for the HTTP cache middleware.
By default, it uses the anydbm module, but you can change it with the
HTTPCACHE_DBM_MODULE
setting.
In order to use this storage backend, set:
HTTPCACHE_STORAGE
toscrapy.extensions.httpcache.DbmCacheStorage
LevelDB storage backend¶
バージョン 0.23 で追加.
A LevelDB storage backend is also available for the HTTP cache middleware.
This backend is not recommended for development because only one process can access LevelDB databases at the same time, so you can't run a crawl and open the scrapy shell in parallel for the same spider.
In order to use this storage backend:
- set
HTTPCACHE_STORAGE
toscrapy.extensions.httpcache.LeveldbCacheStorage
- install LevelDB python bindings like
pip install leveldb
HTTPCache middleware settings¶
The HttpCacheMiddleware
can be configured through the following
settings:
HTTPCACHE_ENABLED¶
New in version 0.11.
Default: False
Whether the HTTP cache will be enabled.
Changed in version 0.11: Before 0.11, HTTPCACHE_DIR was used to enable the cache.
HTTPCACHE_EXPIRATION_SECS¶
Default: 0
Expiration time for cached requests, in seconds.
Cached requests older than this time will be re-downloaded. If zero, cached requests will never expire.
Changed in version 0.11: Before 0.11, zero meant cached requests always expire.
HTTPCACHE_DIR¶
Default: 'httpcache'
The directory to use for storing the (low-level) HTTP cache. If empty, the HTTP cache will be disabled. If a relative path is given, it is taken relative to the project data dir. For more info see: Scrapyプロジェクトのデフォルト構成.
HTTPCACHE_IGNORE_HTTP_CODES¶
New in version 0.10.
Default: []
Don't cache responses with these HTTP codes.
HTTPCACHE_IGNORE_MISSING¶
Default: False
If enabled, requests not found in the cache will be ignored instead of downloaded.
HTTPCACHE_IGNORE_SCHEMES¶
New in version 0.10.
Default: ['file']
Don't cache responses with these URI schemes.
HTTPCACHE_STORAGE¶
Default: 'scrapy.extensions.httpcache.FilesystemCacheStorage'
The class which implements the cache storage backend.
HTTPCACHE_DBM_MODULE¶
New in version 0.13.
Default: 'anydbm'
The database module to use in the DBM storage backend. This setting is specific to the DBM backend.
HTTPCACHE_POLICY¶
New in version 0.18.
Default: 'scrapy.extensions.httpcache.DummyPolicy'
The class which implements the cache policy.
HTTPCACHE_GZIP¶
New in version 1.0.
Default: False
If enabled, will compress all cached data with gzip. This setting is specific to the Filesystem backend.
HTTPCACHE_ALWAYS_STORE¶
New in version 1.1.
Default: False
If enabled, will cache pages unconditionally.
A spider may wish to have all responses available in the cache, for future use with Cache-Control: max-stale, for instance. The DummyPolicy caches all responses but never revalidates them, and sometimes a more nuanced policy is desirable.
This setting still respects Cache-Control: no-store directives in responses. If you don't want that, filter no-store out of the Cache-Control headers in responses you feed to the cache middleware.
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS¶
New in version 1.1.
Default: []
List of Cache-Control directives in responses to be ignored.
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get upset at the traffic a spider can generate if it respects those directives. This allows to selectively ignore Cache-Control directives that are known to be unimportant for the sites being crawled.
We assume that the spider will not issue Cache-Control directives in requests unless it actually needs them, so directives in requests are not filtered.
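Putting several of these settings together, a development-oriented configuration could look like the following (a sketch; the concrete values are arbitrary examples, not recommendations):
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_EXPIRATION_SECS = 86400          # re-download entries older than a day (arbitrary)
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]   # don't cache server errors (arbitrary)
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_GZIP = True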
HttpCompressionMiddleware¶
-
class
scrapy.downloadermiddlewares.httpcompression.
HttpCompressionMiddleware
¶ This middleware allows compressed (gzip, deflate) traffic to be sent/received from web sites.
This middleware also supports decoding brotli-compressed responses, provided brotlipy is installed.
HttpProxyMiddleware¶
バージョン 0.8 で追加.
-
class
scrapy.downloadermiddlewares.httpproxy.
HttpProxyMiddleware
¶ This middleware sets the HTTP proxy to use for requests, by setting the
proxy
meta value forRequest
objects.Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port or http://username:password@some_proxy_server:port. Keep in mind this value will take precedence over the http_proxy / https_proxy environment variables, and it will also ignore the no_proxy environment variable.
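For example, a single request can be routed through a proxy by setting that meta key (a sketch; the proxy address, credentials and URL are placeholders taken from the description above):
import scrapy

# Yielded from a spider callback; overrides the http_proxy/https_proxy environment variables.
request = scrapy.Request('http://www.example.com/',
                         meta={'proxy': 'http://username:password@some_proxy_server:port'})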
RedirectMiddleware¶
-
class
scrapy.downloadermiddlewares.redirect.
RedirectMiddleware
¶ This middleware handles redirection of requests based on response status.
The urls which the request goes through (while being redirected) can be found
in the redirect_urls
Request.meta
key.
The RedirectMiddleware
can be configured through the following
settings (see the settings documentation for more info):
If Request.meta
has dont_redirect
key set to True, the request will be ignored by this middleware.
If you want to handle some redirect status codes in your spider, you can
specify these in the handle_httpstatus_list
spider attribute.
For example, if you want the redirect middleware to ignore 301 and 302 responses (and pass them through to your spider) you can do this:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [301, 302]
The handle_httpstatus_list
key of Request.meta
can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key
handle_httpstatus_all
to True
if you want to allow any response code
for a request.
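For example, to let 301/302 responses for one particular request reach the spider callback instead of being followed (a sketch; the URL is a placeholder):
import scrapy

# Yielded from a spider callback; redirect responses for this request are
# handed to the callback instead of being handled by the redirect middleware.
request = scrapy.Request('http://www.example.com/old-url',
                         meta={'handle_httpstatus_list': [301, 302]})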
MetaRefreshMiddleware¶
-
class
scrapy.downloadermiddlewares.redirect.
MetaRefreshMiddleware
¶ This middleware handles redirection of requests based on meta-refresh html tag.
The MetaRefreshMiddleware
can be configured through the following
settings (see the settings documentation for more info):
This middleware obeys the REDIRECT_MAX_TIMES setting, and the dont_redirect and redirect_urls request meta keys, as described for RedirectMiddleware.
MetaRefreshMiddleware settings¶
METAREFRESH_ENABLED¶
New in version 0.17.
Default: True
Whether the Meta Refresh middleware will be enabled.
METAREFRESH_MAXDELAY¶
Default: 100
The maximum meta-refresh delay (in seconds) to follow the redirection. Some sites use meta-refresh for redirecting to a session expired page, so we restrict automatic redirection to the maximum delay.
RetryMiddleware¶
-
class
scrapy.downloadermiddlewares.retry.
RetryMiddleware
¶ A middleware to retry failed requests that are potentially caused by temporary problems such as a connection timeout or HTTP 500 error.
Failed pages are collected during the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non-failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.
The RetryMiddleware
can be configured through the following
settings (see the settings documentation for more info):
If Request.meta
has dont_retry
key
set to True, the request will be ignored by this middleware.
RetryMiddleware Settings¶
RETRY_TIMES¶
Default: 2
Maximum number of times to retry, in addition to the first download.
The maximum number of retries can also be specified per-request using the max_retry_times attribute of Request.meta.
When initialized, the max_retry_times meta key takes precedence over the RETRY_TIMES setting.
RETRY_HTTP_CODES¶
Default: [500, 502, 503, 504, 522, 524, 408]
Which HTTP response codes to retry. Other errors (DNS lookup issues, connections lost, etc) are always retried.
In some cases you may want to add 400 to RETRY_HTTP_CODES because it is a common code used to indicate server overload. It is not included by default because HTTP specs say so.
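For example, the retry behaviour could be tuned in settings.py like this (a sketch; the values are arbitrary and only use the settings described above):
# settings.py (arbitrary example values)
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
An individual request can still override the retry count through the max_retry_times key of Request.meta, as noted above.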
RobotsTxtMiddleware¶
-
class
scrapy.downloadermiddlewares.robotstxt.
RobotsTxtMiddleware
¶ This middleware filters out requests forbidden by the robots.txt exclusion standard.
To make sure Scrapy respects robots.txt make sure the middleware is enabled and the
ROBOTSTXT_OBEY
setting is enabled.
If Request.meta
has
dont_obey_robotstxt
key set to True
the request will be ignored by this middleware even if
ROBOTSTXT_OBEY
is enabled.
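For example, obeying robots.txt project-wide is a single setting (a minimal sketch):
# settings.py
ROBOTSTXT_OBEY = True
An individual request can then opt out by setting meta={'dont_obey_robotstxt': True}, as described above.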
DownloaderStats¶
-
class
scrapy.downloadermiddlewares.stats.
DownloaderStats
¶ Middleware that stores stats of all requests, responses and exceptions that pass through it.
To use this middleware you must enable the
DOWNLOADER_STATS
setting.
UserAgentMiddleware¶
-
class
scrapy.downloadermiddlewares.useragent.
UserAgentMiddleware
¶ Middleware that allows spiders to override the default user agent.
In order for a spider to override the default user agent, its user_agent attribute must be set.
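For example, a spider that overrides the default user agent could be sketched like this (the spider name, domain and user-agent string are placeholders):
import scrapy

class CustomUASpider(scrapy.Spider):
    name = 'custom_ua'  # hypothetical spider name
    user_agent = 'my-crawler/1.0 (+http://www.example.com)'  # placeholder UA string
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass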
AjaxCrawlMiddleware¶
-
class
scrapy.downloadermiddlewares.ajaxcrawl.
AjaxCrawlMiddleware
¶ Middleware that finds 'AJAX crawlable' page variants based on meta-fragment html tag. See https://developers.google.com/webmasters/ajax-crawling/docs/getting-started for more info.
注釈
Scrapy finds 'AJAX crawlable' pages for URLs like
'http://example.com/!#foo=bar'
even without this middleware. AjaxCrawlMiddleware is necessary when URL doesn't contain'!#'
. This is often a case for 'index' or 'main' website pages.
AjaxCrawlMiddleware Settings¶
AJAXCRAWL_ENABLED¶
New in version 0.21.
Default: False
Whether the AjaxCrawlMiddleware will be enabled. You may want to enable it for broad crawls.
Spider Middleware¶
The spider middleware is a framework of hooks into Scrapy's spider processing mechanism where you can plug custom functionality to process the responses that are sent to Spider for processing and to process the requests and items that are generated from spiders.
Activating a spider middleware¶
To activate a spider middleware component, add it to the
SPIDER_MIDDLEWARES
setting, which is a dict whose keys are the
middleware class path and their values are the middleware orders.
Here's an example:
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
}
The SPIDER_MIDDLEWARES
setting is merged with the
SPIDER_MIDDLEWARES_BASE
setting defined in Scrapy (and not meant to
be overridden) and then sorted by order to get the final sorted list of enabled
middlewares: the first middleware is the one closer to the engine and the last
is the one closer to the spider. In other words,
the process_spider_input()
method of each middleware will be invoked in increasing
middleware order (100, 200, 300, ...), and the
process_spider_output()
method
of each middleware will be invoked in decreasing order.
To decide which order to assign to your middleware see the
SPIDER_MIDDLEWARES_BASE
setting and pick a value according to where
you want to insert the middleware. The order does matter because each
middleware performs a different action and your middleware could depend on some
previous (or subsequent) middleware being applied.
If you want to disable a builtin middleware (the ones defined in
SPIDER_MIDDLEWARES_BASE
, and enabled by default) you must define it
in your project SPIDER_MIDDLEWARES
setting and assign None as its
value. For example, if you want to disable the off-site middleware:
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware documentation for more info.
Writing your own spider middleware¶
Each middleware component is a Python class that defines one or more of the following methods:
- class scrapy.spidermiddlewares.SpiderMiddleware¶
  - process_spider_input(response, spider)¶
    This method is called for each response that goes through the spider middleware and into the spider, for processing.
    process_spider_input() should return None or raise an exception.
    If it returns None, Scrapy will continue processing this response, executing all other middlewares until, finally, the response is handed to the spider for processing.
    If it raises an exception, Scrapy won't bother calling any other spider middleware process_spider_input() and will call the request errback. The output of the errback is chained back in the other direction for process_spider_output() to process it, or process_spider_exception() if it raised an exception.
    Parameters:
  - process_spider_output(response, result, spider)¶
    This method is called with the results returned from the Spider, after it has processed the response.
    process_spider_output() must return an iterable of Request, dict or Item objects.
    Parameters:
  - process_spider_exception(response, exception, spider)¶
    This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception.
    process_spider_exception() should return either None or an iterable of Request, dict or Item objects.
    If it returns None, Scrapy will continue processing this exception, executing any other process_spider_exception() in the following middleware components, until no middleware components are left and the exception reaches the engine (where it's logged and discarded).
    If it returns an iterable, the process_spider_output() pipeline kicks in, and no other process_spider_exception() will be called.
    Parameters:
  - process_start_requests(start_requests, spider)¶
    New in version 0.15.
    This method is called with the start requests of the spider, and works similarly to the process_spider_output() method, except that it doesn't have a response associated and must return only requests (not items).
    It receives an iterable (in the start_requests parameter) and must return another iterable of Request objects.
    Note
    When implementing this method in your spider middleware, you should always return an iterable (that follows the input one) and not consume all of the start_requests iterator, because it can be very large (or even unbounded) and cause a memory overflow. The Scrapy engine is designed to pull start requests while it has capacity to process them, so the start requests iterator can be effectively endless where there is some other condition for stopping the spider (like a time limit or item/page count).
    Parameters:
  - from_crawler(cls, crawler)¶
    If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. The Crawler object provides access to all Scrapy core components like settings and signals; it is a way for middleware to access them and hook its functionality into Scrapy.
    Parameters: crawler (Crawler object) -- crawler that uses this middleware
-
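As with downloader middlewares, a minimal spider middleware sketch can help tie the methods above together. The class name and the filtering rule are hypothetical; only the hook signatures come from the reference above.
class CustomSpiderMiddleware(object):
    """Hypothetical middleware that drops scraped dict items missing a 'name' field."""

    def process_spider_input(self, response, spider):
        # Returning None lets the response continue on to the spider.
        return None

    def process_spider_output(self, response, result, spider):
        # result is an iterable of Requests, dicts or Items returned by the spider.
        for element in result:
            if isinstance(element, dict) and not element.get('name'):
                spider.logger.debug('Dropping item without name from %s', response.url)
                continue
            yield element

    def process_start_requests(self, start_requests, spider):
        # Pass start requests through lazily, without consuming the whole iterator.
        for request in start_requests:
            yield request
Such a class would then be enabled through the SPIDER_MIDDLEWARES setting shown earlier.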
Built-in spider middleware reference¶
This page describes all spider middleware components that come with Scrapy. For information on how to use them and how to write your own spider middleware, see the spider middleware usage guide.
For a list of the components enabled by default (and their orders) see the
SPIDER_MIDDLEWARES_BASE
setting.
DepthMiddleware¶
-
class
scrapy.spidermiddlewares.depth.
DepthMiddleware
¶ DepthMiddleware is used for tracking the depth of each Request inside the site being scraped. It works by setting request.meta['depth'] = 0 whenever there is no value previously set (usually just the first Request) and incrementing it by 1 otherwise.
It can be used to limit the maximum depth to scrape, control Request priority based on their depth, and things like that.
The
DepthMiddleware
can be configured through the following settings (see the settings documentation for more info):DEPTH_LIMIT
- The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.DEPTH_STATS_VERBOSE
- Whether to collect the number of requests for each depth.DEPTH_PRIORITY
- Whether to prioritize the requests based on their depth.
HttpErrorMiddleware¶
-
class
scrapy.spidermiddlewares.httperror.
HttpErrorMiddleware
¶ Filter out unsuccessful (erroneous) HTTP responses so that spiders don't have to deal with them, which (most of the time) imposes an overhead, consumes more resources, and makes the spider logic more complex.
According to the HTTP standard, successful responses are those whose status codes are in the 200-300 range.
If you still want to process response codes outside that range, you can
specify which response codes the spider is able to handle using the
handle_httpstatus_list
spider attribute or
HTTPERROR_ALLOWED_CODES
setting.
For example, if you want your spider to handle 404 responses you can do this:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]
The handle_httpstatus_list
key of Request.meta
can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key handle_httpstatus_all
to True
if you want to allow any response code for a request.
Keep in mind, however, that it's usually a bad idea to handle non-200 responses, unless you really know what you're doing.
For more information see: HTTP Status Code Definitions.
OffsiteMiddleware¶
-
class
scrapy.spidermiddlewares.offsite.
OffsiteMiddleware
¶ Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren't in the spider's
allowed_domains
attribute. All subdomains of any domain in the list are also allowed. E.g. the rulewww.example.org
will also allowbob.www.example.org
but notwww2.example.com
norexample.com
.When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:
DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for
www.othersite.com
is filtered, no log message will be printed. But if a request forsomeothersite.com
is filtered, a message will be printed (but only for the first request filtered).If the spider doesn't define an
allowed_domains
attribute, or the attribute is empty, the offsite middleware will allow all requests.If the request has the
dont_filter
attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
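For example, a spider restricted to a single domain (and its subdomains) could be sketched like this (the spider name and URLs are placeholders):
import scrapy

class ExampleOnlySpider(scrapy.Spider):
    name = 'example_only'              # hypothetical spider name
    allowed_domains = ['example.org']  # requests to other hosts are filtered as offsite
    start_urls = ['http://www.example.org/']

    def parse(self, response):
        # Requests to e.g. othersite.com yielded from here would be dropped by OffsiteMiddleware.
        pass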
RefererMiddleware¶
-
class
scrapy.spidermiddlewares.referer.
RefererMiddleware
¶ Populates Request
Referer
header, based on the URL of the Response which generated it.
RefererMiddleware settings¶
REFERRER_POLICY¶
New in version 1.4.
Default: 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'
Referrer Policy to apply when populating the Request "Referer" header.
Note
You can also set the Referrer Policy per request, using the special "referrer_policy" Request.meta key, with the same acceptable values as for the REFERRER_POLICY setting.
Acceptable values for REFERRER_POLICY:
- either a path to a scrapy.spidermiddlewares.referer.ReferrerPolicy subclass (a custom policy or one of the built-in ones; see classes below),
- or one of the standard W3C-defined string values,
- or the special "scrapy-default".
| String value | Class name (as a string) |
|---|---|
| "scrapy-default" (default) | scrapy.spidermiddlewares.referer.DefaultReferrerPolicy |
| "no-referrer" | scrapy.spidermiddlewares.referer.NoReferrerPolicy |
| "no-referrer-when-downgrade" | scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy |
| "same-origin" | scrapy.spidermiddlewares.referer.SameOriginPolicy |
| "origin" | scrapy.spidermiddlewares.referer.OriginPolicy |
| "strict-origin" | scrapy.spidermiddlewares.referer.StrictOriginPolicy |
| "origin-when-cross-origin" | scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy |
| "strict-origin-when-cross-origin" | scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy |
| "unsafe-url" | scrapy.spidermiddlewares.referer.UnsafeUrlPolicy |
Warning
Scrapy's default referrer policy, just like "no-referrer-when-downgrade" (the W3C-recommended value for browsers), will send a non-empty "Referer" header from any http(s):// to any https:// URL, even if the domain is different.
"same-origin" may be a better choice if you want to remove referrer information for cross-domain requests.
Note
The "no-referrer-when-downgrade" policy is the W3C-recommended default, and is used by major web browsers.
However, it is NOT Scrapy's default referrer policy (see DefaultReferrerPolicy).
Warning
The "unsafe-url" policy is NOT recommended.
UrlLengthMiddleware¶
-
class
scrapy.spidermiddlewares.urllength.
UrlLengthMiddleware
¶ Filters out requests with URLs longer than URLLENGTH_LIMIT
The
UrlLengthMiddleware
can be configured through the following settings (see the settings documentation for more info):URLLENGTH_LIMIT
- The maximum URL length to allow for crawled URLs.
Extensions¶
The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.
Extensions are just regular classes that are instantiated at Scrapy startup, when extensions are initialized.
Extension settings¶
Extensions use the Scrapy settings to manage their settings, just like any other Scrapy code.
It is customary for extensions to prefix their settings with their own name, to avoid collisions with existing (and future) extensions. For example, a hypothetical extension to handle Google Sitemaps would use settings like GOOGLESITEMAP_ENABLED, GOOGLESITEMAP_DEPTH, and so on.
Loading & activating extensions¶
Extensions are loaded and activated at startup by instantiating a single
instance of the extension class. Therefore, all the extension initialization
code must be performed in the class constructor (__init__
method).
To make an extension available, add it to the EXTENSIONS
setting in
your Scrapy settings. In EXTENSIONS
, each extension is represented
by a string: the full Python path to the extension's class name. For example:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': 500,
'scrapy.extensions.telnet.TelnetConsole': 500,
}
As you can see, the EXTENSIONS
setting is a dict where the keys are
the extension paths, and their values are the orders, which define the
extension loading order. The EXTENSIONS
setting is merged with the
EXTENSIONS_BASE
setting defined in Scrapy (and not meant to be
overridden) and then sorted by order to get the final sorted list of enabled
extensions.
As extensions typically do not depend on each other, their loading order is
irrelevant in most cases. This is why the EXTENSIONS_BASE
setting
defines all extensions with the same order (0
). However, this feature can
be exploited if you need to add an extension which depends on other extensions
already loaded.
Available, enabled and disabled extensions¶
Not all available extensions will be enabled. Some of them usually depend on a
particular setting. For example, the HTTP Cache extension is available by default
but disabled unless the HTTPCACHE_ENABLED
setting is set.
Disabling an extension¶
In order to disable an extension that comes enabled by default (ie. those
included in the EXTENSIONS_BASE
setting) you must set its order to
None
. For example:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': None,
}
Writing your own extension¶
Each extension is a Python class. The main entry point for a Scrapy extension
(this also includes middlewares and pipelines) is the from_crawler
class method which receives a Crawler
instance. Through the Crawler object
you can access settings, signals, stats, and also control the crawling behaviour.
Typically, extensions connect to signals and perform tasks triggered by them.
Finally, if the from_crawler
method raises the
NotConfigured
exception, the extension will be
disabled. Otherwise, the extension will be enabled.
Sample extension¶
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:
- a spider is opened
- a spider is closed
- a specific number of items are scraped
The extension will be enabled through the MYEXT_ENABLED
setting and the
number of items will be specified through the MYEXT_ITEMCOUNT
setting.
Here is the code of such extension:
import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured
logger = logging.getLogger(__name__)
class SpiderOpenCloseLogging(object):
def __init__(self, item_count):
self.item_count = item_count
self.items_scraped = 0
@classmethod
def from_crawler(cls, crawler):
# first check if the extension should be enabled and raise
# NotConfigured otherwise
if not crawler.settings.getbool('MYEXT_ENABLED'):
raise NotConfigured
# get the number of items from settings
item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)
# instantiate the extension object
ext = cls(item_count)
# connect the extension object to signals
crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
# return the extension object
return ext
def spider_opened(self, spider):
logger.info("opened spider %s", spider.name)
def spider_closed(self, spider):
logger.info("closed spider %s", spider.name)
def item_scraped(self, item, spider):
self.items_scraped += 1
if self.items_scraped % self.item_count == 0:
logger.info("scraped %d items", self.items_scraped)
Built-in extensions reference¶
General purpose extensions¶
Log Stats extension¶
-
class
scrapy.extensions.logstats.
LogStats
¶
Log basic stats like crawled pages and scraped items.
Core Stats extension¶
-
class
scrapy.extensions.corestats.
CoreStats
¶
Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).
Telnet console extension¶
-
class
scrapy.extensions.telnet.
TelnetConsole
¶
Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can be very useful for debugging.
The telnet console must be enabled by the TELNETCONSOLE_ENABLED
setting, and the server will listen in the port specified in
TELNETCONSOLE_PORT
.
Memory usage extension¶
-
class
scrapy.extensions.memusage.
MemoryUsage
¶
注釈
This extension does not work in Windows.
Monitors the memory used by the Scrapy process that runs the spider and:
- sends a notification e-mail when it exceeds a certain value
- closes the spider when it exceeds a certain value
The notification e-mails can be triggered when a certain warning value is
reached (MEMUSAGE_WARNING_MB
) and when the maximum value is reached
(MEMUSAGE_LIMIT_MB
) which will also cause the spider to be closed
and the Scrapy process to be terminated.
This extension is enabled by the MEMUSAGE_ENABLED
setting and
can be configured with the following settings:
Memory debugger extension¶
-
class
scrapy.extensions.memdebug.
MemoryDebugger
¶
An extension for debugging memory usage. It collects information about:
- objects uncollected by the Python garbage collector
- objects left alive that shouldn't. For more info, see Debugging memory leaks with trackref
To enable this extension, turn on the MEMDEBUG_ENABLED
setting. The
info will be stored in the stats.
Close spider extension¶
-
class
scrapy.extensions.closespider.
CloseSpider
¶
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.
The conditions for closing a spider can be configured through the following settings:
CLOSESPIDER_TIMEOUT¶
Default: 0
An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won't be closed by timeout.
CLOSESPIDER_ITEMCOUNT¶
Default: 0
An integer which specifies a number of items. If the spider scrapes more than that amount and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. Requests which are currently in the downloader queue (up to CONCURRENT_REQUESTS requests) are still processed. If zero (or not set), spiders won't be closed by number of passed items.
CLOSESPIDER_PAGECOUNT¶
New in version 0.11.
Default: 0
An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won't be closed by number of crawled responses.
CLOSESPIDER_ERRORCOUNT¶
New in version 0.11.
Default: 0
An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won't be closed by number of errors.
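For example, a crawl that should stop after at most an hour, 10,000 items or 50 errors could be configured like this (a sketch; the numbers are arbitrary):
# settings.py (arbitrary example values)
CLOSESPIDER_TIMEOUT = 3600
CLOSESPIDER_ITEMCOUNT = 10000
CLOSESPIDER_ERRORCOUNT = 50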
StatsMailer extension¶
-
class
scrapy.extensions.statsmailer.
StatsMailer
¶
This simple extension can be used to send a notification e-mail every time a
domain has finished scraping, including the Scrapy stats collected. The email
will be sent to all recipients specified in the STATSMAILER_RCPTS
setting.
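For example (a sketch; the address is a placeholder, and mail-related settings such as MAIL_FROM and MAIL_HOST may also need to be configured for delivery to work):
# settings.py
STATSMAILER_RCPTS = ['scrapy-stats@example.com']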
Debugging extensions¶
Stack trace dump extension¶
-
class
scrapy.extensions.debug.
StackTraceDump
¶
Dumps information about the running process when a SIGQUIT or SIGUSR2 signal is received. The information dumped is the following:
- engine status (using
scrapy.utils.engine.get_engine_status()
) - live references (see Debugging memory leaks with trackref)
- stack trace of all threads
After the stack trace and engine status is dumped, the Scrapy process continues running normally.
This extension only works on POSIX-compliant platforms (ie. not Windows), because the SIGQUIT and SIGUSR2 signals are not available on Windows.
There are at least two ways to send Scrapy the SIGQUIT signal:
- By pressing Ctrl-\ while a Scrapy process is running (Linux only?)
- By running this command (assuming <pid> is the process id of the Scrapy process):
  kill -QUIT <pid>
Debugger extension¶
-
class
scrapy.extensions.debug.
Debugger
¶
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger is exited, the Scrapy process continues running normally.
For more info see Debugging in Python.
This extension only works on POSIX-compliant platforms (ie. not Windows).
Core API¶
バージョン 0.15 で追加.
This section documents the Scrapy core API, and it's intended for developers of extensions and middlewares.
Crawler API¶
The main entry point to Scrapy API is the Crawler
object, passed to extensions through the from_crawler
class method. This
object provides access to all Scrapy core components, and it's the only way for
extensions to access them and hook their functionality into Scrapy.
The Extension Manager is responsible for loading and keeping track of installed
extensions and it's configured through the EXTENSIONS
setting which
contains a dictionary of all available extensions and their order similar to
how you configure the downloader middlewares.
-
class
scrapy.crawler.
Crawler
(spidercls, settings)¶ The Crawler object must be instantiated with a
scrapy.spiders.Spider
subclass and ascrapy.settings.Settings
object.-
settings
¶ The settings manager of this crawler.
This is used by extensions & middlewares to access the Scrapy settings of this crawler.
For an introduction on Scrapy settings see Settings.
For the API see
Settings
class.
-
signals
¶ The signals manager of this crawler.
This is used by extensions & middlewares to hook themselves into Scrapy functionality.
For an introduction on signals see Signals.
For the API see
SignalManager
class.
-
stats
¶ The stats collector of this crawler.
This is used from extensions & middlewares to record stats of their behaviour, or access stats collected by other extensions.
For an introduction on stats collection see Stats Collection.
For the API see
StatsCollector
class.
-
extensions
¶ The extension manager that keeps track of enabled extensions.
Most extensions won't need to access this attribute.
For an introduction on extensions and a list of available extensions on Scrapy see Extensions.
-
engine
¶ The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.
Some extension may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.
-
spider
¶ Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created after the arguments given in the
crawl()
method.
-
crawl
(*args, **kwargs)¶ Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion.
Returns a deferred that is fired when the crawl is finished.
-
Settings API¶
-
scrapy.settings.
SETTINGS_PRIORITIES
¶ Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take more precedence over lesser ones when setting and retrieving values in the
Settings
class.SETTINGS_PRIORITIES = { 'default': 0, 'command': 10, 'project': 20, 'spider': 30, 'cmdline': 40, }
For a detailed explanation on each settings sources, see: Settings.
SpiderLoader API¶
-
class
scrapy.loader.
SpiderLoader
¶ This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the
SPIDER_LOADER_CLASS
project setting. They must fully implement thescrapy.interfaces.ISpiderLoader
interface to guarantee an errorless execution.-
from_settings
(settings)¶ This class method is used by Scrapy to create an instance of the class. It's called with the current project settings, and it loads the spiders found recursively in the modules of the
SPIDER_MODULES
setting.パラメータ: settings ( Settings
instance) -- project settings
-
load
(spider_name)¶ Get the Spider class with the given name. It'll look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.
パラメータ: spider_name (str) -- spider class name
-
list
()¶ Get the names of the available spiders in the project.
-
Signals API¶
Stats Collector API¶
There are several Stats Collectors available under the
scrapy.statscollectors
module and they all implement the Stats
Collector API defined by the StatsCollector
class (which they all inherit from).
-
class
scrapy.statscollectors.
StatsCollector
¶ -
get_value
(key, default=None)¶ Return the value for the given stats key or default if it doesn't exist.
-
get_stats
()¶ Get all stats from the currently running spider as a dict.
-
set_value
(key, value)¶ Set the given value for the given stats key.
-
set_stats
(stats)¶ Override the current stats with the dict passed in
stats
argument.
-
inc_value
(key, count=1, start=0)¶ Increment the value of the given stats key, by the given count, assuming the start value given (when it's not set).
-
max_value
(key, value)¶ Set the given value for the given key only if current value for the same key is lower than value. If there is no current value for the given key, the value is always set.
-
min_value
(key, value)¶ Set the given value for the given key only if current value for the same key is greater than value. If there is no current value for the given key, the value is always set.
-
clear_stats
()¶ Clear all stats.
The following methods are not part of the stats collection api but instead used when implementing custom stats collectors:
-
open_spider
(spider)¶ Open the given spider for stats collection.
-
close_spider
(spider)¶ Close the given spider. After this is called, no more specific stats can be accessed or collected.
-
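As a usage sketch, an extension or middleware could record its own stats through the crawler's stats attribute described in the Crawler API above (the class and stat key names here are made up):
class ItemSizeStatsExtension(object):
    """Hypothetical extension that records custom values via the Stats Collector API."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def record(self, item):
        # inc_value / max_value are part of the Stats Collector API shown above.
        self.stats.inc_value('myext/items_seen')
        self.stats.max_value('myext/max_item_fields', len(item))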
Signals¶
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not provided out of the box.
Even though signals provide several arguments, the handlers that catch them don't need to accept all of them - the signal dispatching mechanism will only deliver the arguments that the handler receives.
You can connect to signals (or send your own) through the Signals API.
Here is a simple example showing how you can catch signals and perform some action:
from scrapy import signals
from scrapy import Spider
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
return spider
def spider_closed(self, spider):
spider.logger.info('Spider closed: %s', spider.name)
def parse(self, response):
pass
Deferred signal handlers¶
Some signals support returning Twisted deferreds from their handlers, see the Built-in signals reference below to know which ones.
Built-in signals reference¶
Here's the list of Scrapy built-in signals and their meaning.
engine_started¶
-
scrapy.signals.
engine_started
()¶ Sent when the Scrapy engine has started crawling.
This signal supports returning deferreds from their handlers.
注釈
This signal may be fired after the spider_opened
signal,
depending on how the spider was started. So don't rely on this signal
getting fired before spider_opened
.
engine_stopped¶
-
scrapy.signals.
engine_stopped
()¶ Sent when the Scrapy engine is stopped (for example, when a crawling process has finished).
This signal supports returning deferreds from their handlers.
item_scraped¶
-
scrapy.signals.
item_scraped
(item, response, spider)¶ Sent when an item has been scraped, after it has passed all the Item Pipeline stages (without being dropped).
This signal supports returning deferreds from their handlers.
パラメータ:
item_dropped¶
-
scrapy.signals.
item_dropped
(item, response, exception, spider)¶ Sent after an item has been dropped from the Item Pipeline when some stage raised a
DropItem
exception.This signal supports returning deferreds from their handlers.
パラメータ: - item (dict or
Item
object) -- the item dropped from the Item Pipeline - spider (
Spider
object) -- the spider which scraped the item - response (
Response
object) -- the response from where the item was dropped - exception (
DropItem
exception) -- the exception (which must be aDropItem
subclass) which caused the item to be dropped
- item (dict or
item_error¶
-
scrapy.signals.
item_error
(item, response, spider, failure)¶ Sent when an Item Pipeline generates an error (ie. raises an exception), except
DropItem
exception.This signal supports returning deferreds from their handlers.
パラメータ:
spider_closed¶
-
scrapy.signals.
spider_closed
(spider, reason)¶ Sent after a spider has been closed. This can be used to release per-spider resources reserved on
spider_opened
.This signal supports returning deferreds from their handlers.
パラメータ: - spider (
Spider
object) -- the spider which has been closed - reason (str) -- a string which describes the reason why the spider was closed. If
it was closed because the spider has completed scraping, the reason
is
'finished'
. Otherwise, if the spider was manually closed by calling theclose_spider
engine method, then the reason is the one passed in thereason
argument of that method (which defaults to'cancelled'
). If the engine was shutdown (for example, by hitting Ctrl-C to stop it) the reason will be'shutdown'
.
- spider (
spider_opened¶
-
scrapy.signals.
spider_opened
(spider)¶ Sent after a spider has been opened for crawling. This is typically used to reserve per-spider resources, but can be used for any task that needs to be performed when a spider is opened.
This signal supports returning deferreds from their handlers.
パラメータ: spider ( Spider
object) -- the spider which has been opened
spider_idle¶
-
scrapy.signals.
spider_idle
(spider)¶ Sent when a spider has gone idle, which means the spider has no further:
- requests waiting to be downloaded
- requests scheduled
- items being processed in the item pipeline
If the idle state persists after all handlers of this signal have finished, the engine starts closing the spider. After the spider has finished closing, the
spider_closed
signal is sent.You may raise a
DontCloseSpider
exception to prevent the spider from being closed.This signal does not support returning deferreds from their handlers.
パラメータ: spider ( Spider
object) -- the spider which has gone idle
注釈
Scheduling some requests in your spider_idle
handler does
not guarantee that it can prevent the spider from being closed,
although it sometimes can. That's because the spider may still remain idle
if all the scheduled requests are rejected by the scheduler (e.g. filtered
due to duplication).
spider_error¶
-
scrapy.signals.
spider_error
(failure, response, spider)¶ Sent when a spider callback generates an error (ie. raises an exception).
This signal does not support returning deferreds from their handlers.
パラメータ:
request_scheduled¶
request_dropped¶
request_reached_downloader¶
response_received¶
Item Exporters¶
Once you have scraped your items, you often want to persist or export those items, to use the data in some other application. That is, after all, the whole purpose of the scraping process.
For this purpose Scrapy provides a collection of Item Exporters for different output formats, such as XML, CSV or JSON.
Using Item Exporters¶
If you are in a hurry, and just want to use an Item Exporter to output scraped data see the Feed exports. Otherwise, if you want to know how Item Exporters work or need more custom functionality (not covered by the default exports), continue reading below.
In order to use an Item Exporter, you must instantiate it with its required args. Each Item Exporter requires different arguments, so check each exporter documentation to be sure, in Built-in Item Exporters reference. After you have instantiated your exporter, you have to:
1. call the method start_exporting() in order to signal the beginning of the exporting process
2. call the export_item() method for each item you want to export
3. and finally call the finish_exporting() to signal the end of the exporting process
Here you can see an Item Pipeline which uses multiple Item Exporters to group scraped items to different files according to the value of one of their fields:
from scrapy.exporters import XmlItemExporter
class PerYearXmlExportPipeline(object):
"""Distribute items across multiple XML files according to their 'year' field"""
def open_spider(self, spider):
self.year_to_exporter = {}
def close_spider(self, spider):
for exporter in self.year_to_exporter.values():
exporter.finish_exporting()
exporter.file.close()
def _exporter_for_item(self, item):
year = item['year']
if year not in self.year_to_exporter:
f = open('{}.xml'.format(year), 'wb')
exporter = XmlItemExporter(f)
exporter.start_exporting()
self.year_to_exporter[year] = exporter
return self.year_to_exporter[year]
def process_item(self, item, spider):
exporter = self._exporter_for_item(item)
exporter.export_item(item)
return item
Serialization of item fields¶
By default, the field values are passed unmodified to the underlying serialization library, and the decision of how to serialize them is delegated to each particular serialization library.
However, you can customize how each field value is serialized before it is passed to the serialization library.
There are two ways to customize how a field will be serialized, which are described next.
1. Declaring a serializer in the field¶
If you use Item
you can declare a serializer in the
field metadata. The serializer must be
a callable which receives a value and returns its serialized form.
Example:
import scrapy
def serialize_price(value):
return '$ %s' % str(value)
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field(serializer=serialize_price)
2. Overriding the serialize_field() method¶
You can also override the serialize_field()
method to
customize how your field value will be exported.
Make sure you call the base class serialize_field()
method
after your custom code.
Example:
from scrapy.exporters import XmlItemExporter

class ProductXmlExporter(XmlItemExporter):

    def serialize_field(self, field, name, value):
        if name == 'price':
            return '$ %s' % str(value)
        return super(ProductXmlExporter, self).serialize_field(field, name, value)
Built-in Item Exporters reference¶
Here is a list of the Item Exporters bundled with Scrapy. Some of them contain output examples, which assume you're exporting these two items:
Item(name='Color TV', price='1200')
Item(name='DVD player', price='200')
BaseItemExporter¶
class scrapy.exporters.BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0)¶
This is the (abstract) base class for all Item Exporters. It provides support for common features used by all (concrete) Item Exporters, such as defining what fields to export, whether to export empty fields, or which encoding to use.
These features can be configured through the constructor arguments, which populate their respective instance attributes: fields_to_export, export_empty_fields, encoding, indent.

export_item(item)¶
Exports the given item. This method must be implemented in subclasses.

serialize_field(field, name, value)¶
Return the serialized value for the given field. You can override this method (in your custom Item Exporters) if you want to control how a particular field or value will be serialized/exported.
By default, this method looks for a serializer declared in the item field and returns the result of applying that serializer to the value. If no serializer is found, it returns the value unchanged, except for unicode values, which are encoded to str using the encoding declared in the encoding attribute.

start_exporting()¶
Signal the beginning of the exporting process. Some exporters may use this to generate some required header (for example, the XmlItemExporter). You must call this method before exporting any items.

finish_exporting()¶
Signal the end of the exporting process. Some exporters may use this to generate some required footer (for example, the XmlItemExporter). You must always call this method after you have no more items to export.

fields_to_export¶
A list with the names of the fields that will be exported, or None if you want to export all fields. Defaults to None.
Some exporters (like CsvItemExporter) respect the order of the fields defined in this attribute.
Some exporters may require the fields_to_export list in order to export the data properly when spiders return dicts (not Item instances).

export_empty_fields¶
Whether to include empty/unpopulated item fields in the exported data. Defaults to False. Some exporters (like CsvItemExporter) ignore this attribute and always export all empty fields.
This option is ignored for dict items.

encoding¶
The encoding that will be used to encode unicode values. This only affects unicode values (which are always serialized to str using this encoding). Other value types are passed unchanged to the specific serialization library.

indent¶
Amount of spaces used to indent the output on each level. Defaults to 0.
- indent=None selects the most compact representation: all items on the same line, with no indentation
- indent<=0: each item on its own line, no indentation
- indent>0: each item on its own line, indented with the provided numeric value
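As a quick sketch of how these indent values play out in practice with JsonItemExporter (the items are the two example items from this reference):
from io import BytesIO
from scrapy.exporters import JsonItemExporter

def dump(indent):
    buf = BytesIO()
    exporter = JsonItemExporter(buf, indent=indent)
    exporter.start_exporting()
    exporter.export_item({'name': 'Color TV', 'price': '1200'})
    exporter.export_item({'name': 'DVD player', 'price': '200'})
    exporter.finish_exporting()
    return buf.getvalue()

print(dump(indent=None))  # most compact: everything on one line
print(dump(indent=2))     # one item per line, keys indented by 2 spaces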
XmlItemExporter¶
class scrapy.exporters.XmlItemExporter(file, item_element='item', root_element='items', **kwargs)¶
Exports Items in XML format to the specified file object.
Parameters:
- file -- the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, an io.BytesIO object, etc.)
- root_element (str) -- the name of the root element in the exported XML.
- item_element (str) -- the name of each item element in the exported XML.
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor.
A typical output of this exporter would be:
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <name>Color TV</name>
    <price>1200</price>
  </item>
  <item>
    <name>DVD player</name>
    <price>200</price>
  </item>
</items>
Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.
For example, the item:
Item(name=['John', 'Doe'], age='23')
would be serialized as:
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <name>
      <value>John</value>
      <value>Doe</value>
    </name>
    <age>23</age>
  </item>
</items>
CsvItemExporter¶
class scrapy.exporters.CsvItemExporter(file, include_headers_line=True, join_multivalued=',', **kwargs)¶
Exports Items in CSV format to the given file-like object. If the fields_to_export attribute is set, it will be used to define the CSV columns and their order. The export_empty_fields attribute has no effect on this exporter.
Parameters:
- file -- the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, an io.BytesIO object, etc.)
- include_headers_line (bool) -- if enabled, makes the exporter output a header line with the field names taken from BaseItemExporter.fields_to_export or the fields of the first exported item.
- join_multivalued -- the char (or chars) that will be used for joining multi-valued fields, if found.
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor, and the leftover arguments to the csv.writer constructor, so you can use any csv.writer constructor argument to customize this exporter.
A typical output of this exporter would be:
name,price
Color TV,1200
DVD player,200
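A small sketch of the column-ordering behaviour described above (the file name and items are illustrative): passing fields_to_export fixes both which columns are written and their order:
from scrapy.exporters import CsvItemExporter

# Illustrative only: force the column order to price,name.
with open('products.csv', 'wb') as f:
    exporter = CsvItemExporter(f, fields_to_export=['price', 'name'])
    exporter.start_exporting()
    exporter.export_item({'name': 'Color TV', 'price': '1200'})
    exporter.export_item({'name': 'DVD player', 'price': '200'})
    exporter.finish_exporting()
# products.csv starts with the header line "price,name",
# and each row lists the price before the name.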
PickleItemExporter¶
class scrapy.exporters.PickleItemExporter(file, protocol=0, **kwargs)¶
Exports Items in pickle format to the given file-like object.
Parameters:
- file -- the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, an io.BytesIO object, etc.)
- protocol (int) -- the pickle protocol to use.
For more information, refer to the pickle module documentation.
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor.
Pickle isn't a human readable format, so no output examples are provided.
PprintItemExporter¶
class scrapy.exporters.PprintItemExporter(file, **kwargs)¶
Exports Items in pretty print format to the specified file object.
Parameters:
- file -- the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, an io.BytesIO object, etc.)
The additional keyword arguments of this constructor are passed to the BaseItemExporter constructor.
A typical output of this exporter would be:
{'name': 'Color TV', 'price': '1200'}
{'name': 'DVD player', 'price': '200'}
Longer lines (when present) are pretty-formatted.
JsonItemExporter¶
class scrapy.exporters.JsonItemExporter(file, **kwargs)¶
Exports Items in JSON format to the specified file-like object, writing all objects as a list of objects. The additional constructor arguments are passed to the BaseItemExporter constructor, and the leftover arguments to the JSONEncoder constructor, so you can use any JSONEncoder constructor argument to customize this exporter.
Parameters:
- file -- the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, an io.BytesIO object, etc.)
A typical output of this exporter would be:
[{"name": "Color TV", "price": "1200"}, {"name": "DVD player", "price": "200"}]
Warning
JSON is a very simple and flexible serialization format, but it doesn't scale well for large amounts of data, since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (in any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output into multiple chunks.
JsonLinesItemExporter¶
class scrapy.exporters.JsonLinesItemExporter(file, **kwargs)¶
Exports Items in JSON format to the specified file-like object, writing one JSON-encoded item per line. The additional constructor arguments are passed to the BaseItemExporter constructor, and the leftover arguments to the JSONEncoder constructor, so you can use any JSONEncoder constructor argument to customize this exporter.
Parameters:
- file -- the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, an io.BytesIO object, etc.)
A typical output of this exporter would be:
{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}
Unlike the one produced by JsonItemExporter, the format produced by this exporter is well suited for serializing large amounts of data.
- Architecture overview
- Understand the Scrapy architecture.
- Downloader Middleware
- Customize how pages are requested and downloaded.
- Spider Middleware
- Customize the input and output of your spiders.
- Extensions
- Extend Scrapy with your custom functionality.
- Core API
- The API that can be used by extensions and middlewares.
- Signals
- See all available signals and how to work with them.
- Item Exporters
- Quickly export your scraped Items to a file (XML, CSV, etc.).
All the rest¶
Release notes¶
Scrapy 1.6.0 (2019-01-30)¶
Highlights:
- better Windows support;
- Python 3.7 compatibility;
- big documentation improvements, including a switch from the .extract_first() + .extract() API to the .get() + .getall() API;
- feed exports, FilePipeline and MediaPipeline improvements;
- better extensibility: item_error and request_reached_downloader signals; from_crawler support for feed exporters, feed storages and dupefilters;
- scrapy.contracts fixes and new features;
- telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22);
- clean-up of the deprecated code;
- various bug fixes, small new features and usability improvements across the codebase.
Selector API changes¶
While these are not changes in Scrapy itself, but rather in the parsel
library which Scrapy uses for xpath/css selectors, these changes are
worth mentioning here. Scrapy now depends on parsel >= 1.5, and
Scrapy documentation is updated to follow recent parsel
API conventions.
The most visible change is that the .get() and .getall() selector methods are now preferred over .extract_first() and .extract(). We feel that these new methods result in more concise and readable code. See extract() and extract_first() for more details.
Note
There are currently no plans to deprecate .extract()
and .extract_first()
methods.
Another useful new feature is the introduction of Selector.attrib
and
SelectorList.attrib
properties, which make it easier to get
attributes of HTML elements. See Selecting element attributes.
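A short illustration of both additions (the HTML snippet is made up for the example; the same calls work on a Scrapy response):
from parsel import Selector

sel = Selector(text='<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>')

sel.css('a::attr(href)').get()     # '/a'          -- first match, like .extract_first()
sel.css('a::attr(href)').getall()  # ['/a', '/b']  -- all matches, like .extract()
sel.css('a').attrib['href']        # '/a'          -- attributes of the first matching element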
CSS selectors are cached in parsel >= 1.5, which makes them faster when the same CSS path is used many times. This is very common in case of Scrapy spiders: callbacks are usually called several times, on different pages.
If you're using custom Selector
or SelectorList
subclasses,
a backwards incompatible change in parsel may affect your code.
See parsel changelog for a detailed description, as well as for the
full list of improvements.
Telnet console¶
Backwards incompatible: Scrapy's telnet console now requires username and password. See Telnet Console for more details. This change fixes a security issue; see Scrapy 1.5.2 (2019-01-22) release notes for details.
New extensibility features¶
from_crawler
support is added to feed exporters and feed storages. This, among other things, allows to access Scrapy settings from custom feed storages and exporters (issue 1605, issue 3348).from_crawler
support is added to dupefilters (issue 2956); this allows to access e.g. settings or a spider from a dupefilter.item_error
is fired when an error happens in a pipeline (issue 3256);request_reached_downloader
is fired when Downloader gets a new Request; this signal can be useful e.g. for custom Schedulers (issue 3393).- new SitemapSpider
sitemap_filter()
method which allows to select sitemap entries based on their attributes in SitemapSpider subclasses (issue 3512). - Lazy loading of Downloader Handlers is now optional; this enables better initialization error handling in custom Downloader Handlers (issue 3394).
New FilePipeline and MediaPipeline features¶
- Expose more options for S3FilesStore:
AWS_ENDPOINT_URL
,AWS_USE_SSL
,AWS_VERIFY
,AWS_REGION_NAME
. For example, this allows to use alternative or self-hosted AWS-compatible providers (issue 2609, issue 3548). - ACL support for Google Cloud Storage:
FILES_STORE_GCS_ACL
andIMAGES_STORE_GCS_ACL
(issue 3199).
scrapy.contracts
improvements¶
- Exceptions in contracts code are handled better (issue 3377);
dont_filter=True
is used for contract requests, which allows to test different callbacks with the same URL (issue 3381);request_cls
attribute in Contract subclasses allow to use different Request classes in contracts, for example FormRequest (issue 3383).- Fixed errback handling in contracts, e.g. for cases where a contract is executed for URL which returns non-200 response (issue 3371).
Usability improvements¶
- more stats for RobotsTxtMiddleware (issue 3100)
- INFO log level is used to show telnet host/port (issue 3115)
- a message is added to IgnoreRequest in RobotsTxtMiddleware (issue 3113)
- better validation of
url
argument inResponse.follow
(issue 3131) - non-zero exit code is returned from Scrapy commands when error happens on spider inititalization (issue 3226)
- Link extraction improvements: "ftp" is added to scheme list (issue 3152); "flv" is added to common video extensions (issue 3165)
- better error message when an exporter is disabled (issue 3358);
- scrapy shell --help mentions the syntax required for local files (./file.html) - issue 3496.
- Referer header value is added to RFPDupeFilter log messages (issue 3588)
Bug fixes¶
- fixed issue with extra blank lines in .csv exports under Windows (issue 3039);
- proper handling of pickling errors in Python 3 when serializing objects for disk queues (issue 3082)
- flags are now preserved when copying Requests (issue 3342);
- FormRequest.from_response clickdata shouldn't ignore elements with
input[type=image]
(issue 3153). - FormRequest.from_response should preserve duplicate keys (issue 3247)
Documentation improvements¶
- Docs are re-written to suggest the .get/.getall API instead of .extract/.extract_first. Also, the Selectors docs are updated and re-structured to match the latest parsel docs; they now contain more topics, such as Selecting element attributes or Extensions to CSS Selectors (issue 3390).
- Using your browser's Developer Tools for scraping is a new tutorial which replaces old Firefox and Firebug tutorials (issue 3400).
- SCRAPY_PROJECT environment variable is documented (issue 3518);
- troubleshooting section is added to install instructions (issue 3517);
- improved links to beginner resources in the tutorial (issue 3367, issue 3468);
- fixed
RETRY_HTTP_CODES
default values in docs (issue 3335); - remove unused DEPTH_STATS option from docs (issue 3245);
- other cleanups (issue 3347, issue 3350, issue 3445, issue 3544, issue 3605).
Deprecation removals¶
Compatibility shims for pre-1.0 Scrapy module names are removed (issue 3318):
scrapy.command
scrapy.contrib
(with all submodules)scrapy.contrib_exp
(with all submodules)scrapy.dupefilter
scrapy.linkextractor
scrapy.project
scrapy.spider
scrapy.spidermanager
scrapy.squeue
scrapy.stats
scrapy.statscol
scrapy.utils.decorator
See Module Relocations for more information, or use suggestions from Scrapy 1.5.x deprecation warnings to update your code.
Other deprecation removals:
- Deprecated scrapy.interfaces.ISpiderManager is removed; please use scrapy.interfaces.ISpiderLoader.
- Deprecated
CrawlerSettings
class is removed (issue 3327). - Deprecated
Settings.overrides
andSettings.defaults
attributes are removed (issue 3327, issue 3359).
Other improvements, cleanups¶
- All Scrapy tests now pass on Windows; Scrapy testing suite is executed in a Windows environment on CI (issue 3315).
- Python 3.7 support (issue 3326, issue 3150, issue 3547).
- Testing and CI fixes (issue 3526, issue 3538, issue 3308, issue 3311, issue 3309, issue 3305, issue 3210, issue 3299)
scrapy.http.cookies.CookieJar.clear
accepts "domain", "path" and "name" optional arguments (issue 3231).- additional files are included to sdist (issue 3495);
- code style fixes (issue 3405, issue 3304);
- unneeded .strip() call is removed (issue 3519);
- collections.deque is used to store MiddlewareManager methods instead of a list (issue 3476)
Scrapy 1.5.2 (2019-01-22)¶
Security bugfix: the Telnet console extension can be easily exploited by rogue websites POSTing content to http://localhost:6023. We haven't found a way to exploit it from Scrapy itself, but it is very easy to trick a browser into doing so, which elevates the risk for local development environments.
The fix is backwards incompatible: it enables telnet user-password authentication by default, with a randomly generated password. If you can't upgrade right away, please consider setting TELNET_CONSOLE_PORT to something other than its default value.
See the telnet console documentation for more info.
Also backported: a fix for a CI build failure under the GCE environment caused by a boto import error.
Scrapy 1.5.1 (2018-07-12)¶
This is a maintenance release with important bug fixes, but no new features:
O(N^2)
gzip decompression issue which affected Python 3 and PyPy is fixed (issue 3281);- skipping of TLS validation errors is improved (issue 3166);
- Ctrl-C handling is fixed in Python 3.5+ (issue 3096);
- testing fixes (issue 3092, issue 3263);
- documentation improvements (issue 3058, issue 3059, issue 3089, issue 3123, issue 3127, issue 3189, issue 3224, issue 3280, issue 3279, issue 3201, issue 3260, issue 3284, issue 3298, issue 3294).
Scrapy 1.5.0 (2017-12-29)¶
This release brings small new features and improvements across the codebase. Some highlights:
- Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
- Crawling with proxy servers becomes more efficient, as connections to proxies can be reused now.
- Warnings, exception and logging messages are improved to make debugging easier.
scrapy parse
command now allows to set custom request meta via--meta
argument.- Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now supported officially, by running tests on CI.
- Better default handling of HTTP 308, 522 and 524 status codes.
- Documentation is improved, as usual.
Backwards Incompatible Changes¶
- Scrapy 1.5 drops support for Python 3.3.
- Default Scrapy User-Agent now uses https link to scrapy.org (issue 2983).
This is technically backwards-incompatible; override
USER_AGENT
if you relied on old value. - Logging of settings overridden by
custom_settings
is fixed; this is technically backwards-incompatible because the logger changes from[scrapy.utils.log]
to[scrapy.crawler]
. If you're parsing Scrapy logs, please update your log parsers (issue 1343). - LinkExtractor now ignores
m4v
extension by default; this is a change in behavior.
- 522 and 524 status codes are added to
RETRY_HTTP_CODES
(issue 2851)
New features¶
- Support
<link>
tags inResponse.follow
(issue 2785) - Support for
ptpython
REPL (issue 2654) - Google Cloud Storage support for FilesPipeline and ImagesPipeline (issue 2923).
- New
--meta
option of the "scrapy parse" command allows to pass additional request.meta (issue 2883) - Populate spider variable when using
shell.inspect_response
(issue 2812) - Handle HTTP 308 Permanent Redirect (issue 2844)
- Add 522 and 524 to
RETRY_HTTP_CODES
(issue 2851) - Log versions information at startup (issue 2857)
scrapy.mail.MailSender
now works in Python 3 (it requires Twisted 17.9.0)- Connections to proxy servers are reused (issue 2743)
- Add template for a downloader middleware (issue 2755)
- Explicit message for NotImplementedError when parse callback not defined (issue 2831)
- CrawlerProcess got an option to disable installation of root log handler (issue 2921)
- LinkExtractor now ignores
m4v
extension by default - Better log messages for responses over
DOWNLOAD_WARNSIZE
andDOWNLOAD_MAXSIZE
limits (issue 2927) - Show warning when a URL is put to
Spider.allowed_domains
instead of a domain (issue 2250).
Bug fixes¶
- Fix logging of settings overridden by
custom_settings
; this is technically backwards-incompatible because the logger changes from[scrapy.utils.log]
to[scrapy.crawler]
, so please update your log parsers if needed (issue 1343) - Default Scrapy User-Agent now uses https link to scrapy.org (issue 2983).
This is technically backwards-incompatible; override
USER_AGENT
if you relied on old value. - Fix PyPy and PyPy3 test failures, support them officially (issue 2793, issue 2935, issue 2990, issue 3050, issue 2213, issue 3048)
- Fix DNS resolver when
DNSCACHE_ENABLED=False
(issue 2811) - Add
cryptography
for Debian Jessie tox test env (issue 2848) - Add verification to check if Request callback is callable (issue 2766)
- Port
extras/qpsclient.py
to Python 3 (issue 2849) - Use getfullargspec under the scenes for Python 3 to stop DeprecationWarning (issue 2862)
- Update deprecated test aliases (issue 2876)
- Fix
SitemapSpider
support for alternate links (issue 2853)
Docs¶
- Added missing bullet point for the
AUTOTHROTTLE_TARGET_CONCURRENCY
setting. (issue 2756) - Update Contributing docs, document new support channels (issue 2762, issue:3038)
- Include references to Scrapy subreddit in the docs
- Fix broken links; use https:// for external links (issue 2978, issue 2982, issue 2958)
- Document CloseSpider extension better (issue 2759)
- Use
pymongo.collection.Collection.insert_one()
in MongoDB example (issue 2781) - Spelling mistake and typos (issue 2828, issue 2837, issue 2884, issue 2924)
- Clarify
CSVFeedSpider.headers
documentation (issue 2826) - Document
DontCloseSpider
exception and clarifyspider_idle
(issue 2791) - Update "Releases" section in README (issue 2764)
- Fix rst syntax in
DOWNLOAD_FAIL_ON_DATALOSS
docs (issue 2763) - Small fix in description of startproject arguments (issue 2866)
- Clarify data types in Response.body docs (issue 2922)
- Add a note about
request.meta['depth']
to DepthMiddleware docs (issue 2374) - Add a note about
request.meta['dont_merge_cookies']
to CookiesMiddleware docs (issue 2999) - Up-to-date example of project structure (issue 2964, issue 2976)
- A better example of ItemExporters usage (issue 2989)
- Document
from_crawler
methods for spider and downloader middlewares (issue 3019)
Scrapy 1.4.0 (2017-05-18)¶
Scrapy 1.4 does not bring that many breathtaking new features but quite a few handy improvements nonetheless.
Scrapy now supports anonymous FTP sessions with customizable user and
password via the new FTP_USER
and FTP_PASSWORD
settings.
And if you're using Twisted version 17.1.0 or above, FTP is now available
with Python 3.
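A minimal settings sketch (the credentials below are placeholders):
# settings.py -- placeholder FTP credentials
FTP_USER = 'anonymous'
FTP_PASSWORD = 'guest@example.com'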
There's a new response.follow
method
for creating requests; it is now a recommended way to create Requests
in Scrapy spiders. This method makes it easier to write correct
spiders; response.follow
has several advantages over creating
scrapy.Request
objects directly:
- it handles relative URLs;
- it works properly with non-ascii URLs on non-UTF8 pages;
- in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.
For example, instead of this:
for href in response.css('li.page a::attr(href)').extract():
    url = response.urljoin(href)
    yield scrapy.Request(url, self.parse, encoding=response.encoding)
One can now write this:
for a in response.css('li.page a'):
    yield response.follow(a, self.parse)
Link extractors are also improved. They work similarly to what a regular
modern browser would do: leading and trailing whitespace are removed
from attributes (think href=" http://example.com"
) when building
Link
objects. This whitespace-stripping also happens for action
attributes with FormRequest
.
Please also note that link extractors do not canonicalize URLs by default anymore. This was puzzling users every now and then, and it's not what browsers do in fact, so we removed that extra transformation on extracted links.
For those of you wanting more control on the Referer:
header that Scrapy
sends when following links, you can set your own Referrer Policy
.
Prior to Scrapy 1.4, the default RefererMiddleware
would simply and
blindly set it to the URL of the response that generated the HTTP request
(which could leak information on your URL seeds).
By default, Scrapy now behaves much like your regular browser does.
And this policy is fully customizable with W3C standard values
(or with something really custom of your own if you wish).
See REFERRER_POLICY
for details.
To make Scrapy spiders easier to debug, Scrapy logs more stats by default in 1.4: memory usage stats, detailed retry stats, detailed HTTP error code stats. A similar change is that HTTP cache path is also visible in logs now.
Last but not least, Scrapy now has the option to make JSON and XML items
more human-readable, with newlines between items and even custom indenting
offset, using the new FEED_EXPORT_INDENT
setting.
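For example (the value is arbitrary), adding this to your project settings pretty-prints JSON and XML feed output with 4 spaces per indentation level:
# settings.py
FEED_EXPORT_INDENT = 4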
Enjoy! (Or read on for the rest of changes in this release.)
Deprecations and Backwards Incompatible Changes¶
- Default to
canonicalize=False
inscrapy.linkextractors.LinkExtractor
(issue 2537, fixes issue 1941 and issue 1982): warning, this is technically backwards-incompatible - Enable memusage extension by default (issue 2539, fixes issue 2187);
this is technically backwards-incompatible so please check if you have
any non-default
MEMUSAGE_***
options set. EDITOR
environment variable now takes precedence overEDITOR
option defined in settings.py (issue 1829); Scrapy default settings no longer depend on environment variables. This is technically a backwards incompatible change.
- Spider.make_requests_from_url is deprecated (issue 1728, fixes issue 1495).
New Features¶
- Accept proxy credentials in
proxy
request meta key (issue 2526) - Support brotli-compressed content; requires optional brotlipy (issue 2535)
- New response.follow shortcut for creating requests (issue 1940)
- Added
flags
argument and attribute toRequest
objects (issue 2047) - Support Anonymous FTP (issue 2342)
- Added
retry/count
,retry/max_reached
andretry/reason_count/<reason>
stats toRetryMiddleware
(issue 2543) - Added
httperror/response_ignored_count
andhttperror/response_ignored_status_count/<status>
stats toHttpErrorMiddleware
(issue 2566) - Customizable
Referrer policy
inRefererMiddleware
(issue 2306) - New
data:
URI download handler (issue 2334, fixes issue 2156) - Log cache directory when HTTP Cache is used (issue 2611, fixes issue 2604)
- Warn users when project contains duplicate spider names (fixes issue 2181)
CaselessDict
now acceptsMapping
instances and not only dicts (issue 2646)- Media downloads, with
FilesPipelines
orImagesPipelines
, can now optionally handle HTTP redirects using the newMEDIA_ALLOW_REDIRECTS
setting (issue 2616, fixes issue 2004) - Accept non-complete responses from websites using a new
DOWNLOAD_FAIL_ON_DATALOSS
setting (issue 2590, fixes issue 2586) - Optional pretty-printing of JSON and XML items via
FEED_EXPORT_INDENT
setting (issue 2456, fixes issue 1327) - Allow dropping fields in
FormRequest.from_response
formdata whenNone
value is passed (issue 667) - Per-request retry times with the new
max_retry_times
meta key (issue 2642) python -m scrapy
as a more explicit alternative toscrapy
command (issue 2740)
Bug fixes¶
- LinkExtractor now strips leading and trailing whitespaces from attributes (issue 2547, fixes issue 1614)
- Properly handle whitespaces in action attribute in
FormRequest
(issue 2548) - Buffer CONNECT response bytes from proxy until all HTTP headers are received (issue 2495, fixes issue 2491)
- FTP downloader now works on Python 3, provided you use Twisted>=17.1 (issue 2599)
- Use body to choose response type after decompressing content (issue 2393, fixes issue 2145)
- Always decompress
Content-Encoding: gzip
atHttpCompressionMiddleware
stage (issue 2391) - Respect custom log level in
Spider.custom_settings
(issue 2581, fixes issue 1612) - 'make htmlview' fix for macOS (issue 2661)
- Remove "commands" from the command list (issue 2695)
- Fix duplicate Content-Length header for POST requests with empty body (issue 2677)
- Properly cancel large downloads, i.e. above
DOWNLOAD_MAXSIZE
(issue 1616) - ImagesPipeline: fixed processing of transparent PNG images with palette (issue 2675)
Cleanups & Refactoring¶
- Tests: remove temp files and folders (issue 2570), fixed ProjectUtilsTest on OS X (issue 2569), use portable pypy for Linux on Travis CI (issue 2710)
- Separate building request from
_requests_to_follow
in CrawlSpider (issue 2562) - Remove “Python 3 progress” badge (issue 2567)
- Add a couple more lines to
.gitignore
(issue 2557) - Remove bumpversion prerelease configuration (issue 2159)
- Add codecov.yml file (issue 2750)
- Set context factory implementation based on Twisted version (issue 2577, fixes issue 2560)
- Add omitted
self
arguments in default project middleware template (issue 2595) - Remove redundant
slot.add_request()
call in ExecutionEngine (issue 2617) - Catch more specific
os.error
exception inFSFilesStore
(issue 2644) - Change "localhost" test server certificate (issue 2720)
- Remove unused
MEMUSAGE_REPORT
setting (issue 2576)
Documentation¶
- Binary mode is required for exporters (issue 2564, fixes issue 2553)
- Mention issue with
FormRequest.from_response
due to bug in lxml (issue 2572) - Use single quotes uniformly in templates (issue 2596)
- Document
ftp_user
andftp_password
meta keys (issue 2587) - Removed section on deprecated
contrib/
(issue 2636) - Recommend Anaconda when installing Scrapy on Windows (issue 2477, fixes issue 2475)
- FAQ: rewrite note on Python 3 support on Windows (issue 2690)
- Rearrange selector sections (issue 2705)
- Remove
__nonzero__
fromSelectorList
docs (issue 2683) - Mention how to disable request filtering in documentation of
DUPEFILTER_CLASS
setting (issue 2714) - Add sphinx_rtd_theme to docs setup readme (issue 2668)
- Open file in text mode in JSON item writer example (issue 2729)
- Clarify
allowed_domains
example (issue 2670)
Scrapy 1.3.3 (2017-03-10)¶
Bug fixes¶
- Make
SpiderLoader
raiseImportError
again by default for missing dependencies and wrongSPIDER_MODULES
. These exceptions were silenced as warnings since 1.3.0. A new setting is introduced to toggle between warning or exception if needed ; seeSPIDER_LOADER_WARN_ONLY
for details.
Scrapy 1.3.2 (2017-02-13)¶
Bug fixes¶
- Preserve request class when converting to/from dicts (utils.reqser) (issue 2510).
- Use consistent selectors for author field in tutorial (issue 2551).
- Fix TLS compatibility in Twisted 17+ (issue 2558)
Scrapy 1.3.1 (2017-02-08)¶
New features¶
- Support
'True'
and'False'
string values for boolean settings (issue 2519); you can now do something likescrapy crawl myspider -s REDIRECT_ENABLED=False
. - Support kwargs with
response.xpath()
to use XPath variables and ad-hoc namespaces declarations ; this requires at least Parsel v1.1 (issue 2457). - Add support for Python 3.6 (issue 2485).
- Run tests on PyPy (warning: some tests still fail, so PyPy is not supported yet).
Bug fixes¶
- Enforce
DNS_TIMEOUT
setting (issue 2496). - Fix
view
command ; it was a regression in v1.3.0 (issue 2503). - Fix tests regarding
*_EXPIRES settings
with Files/Images pipelines (issue 2460). - Fix name of generated pipeline class when using basic project template (issue 2466).
- Fix compatibility with Twisted 17+ (issue 2496, issue 2528).
- Fix
scrapy.Item
inheritance on Python 3.6 (issue 2511). - Enforce numeric values for components order in
SPIDER_MIDDLEWARES
,DOWNLOADER_MIDDLEWARES
,EXTENIONS
andSPIDER_CONTRACTS
(issue 2420).
Documentation¶
- Reword Code of Conduct section and upgrade to Contributor Covenant v1.4 (issue 2469).
- Clarify that passing spider arguments converts them to spider attributes (issue 2483).
- Document
formid
argument onFormRequest.from_response()
(issue 2497). - Add .rst extension to README files (issue 2507).
- Mention LevelDB cache storage backend (issue 2525).
- Use
yield
in sample callback code (issue 2533). - Add note about HTML entities decoding with
.re()/.re_first()
(issue 1704). - Typos (issue 2512, issue 2534, issue 2531).
Cleanups¶
- Remove redundant check in
MetaRefreshMiddleware
(issue 2542). - Faster checks in
LinkExtractor
for allow/deny patterns (issue 2538). - Remove dead code supporting old Twisted versions (issue 2544).
Scrapy 1.3.0 (2016-12-21)¶
This release comes rather soon after 1.2.2 for one main reason:
it was found out that releases since 0.18 up to 1.2.2 (included) use
some backported code from Twisted (scrapy.xlib.tx.*
),
even if newer Twisted modules are available.
Scrapy now uses twisted.web.client
and twisted.internet.endpoints
directly.
(See also cleanups below.)
As it is a major change, we wanted to get the bug fix out quickly while not breaking any projects using the 1.2 series.
New Features¶
MailSender
now accepts single strings as values forto
andcc
arguments (issue 2272)scrapy fetch url
,scrapy shell url
andfetch(url)
inside scrapy shell now follow HTTP redirections by default (issue 2290); Seefetch
andshell
for details.HttpErrorMiddleware
now logs errors withINFO
level instead ofDEBUG
; this is technically backwards incompatible so please check your log parsers.- By default, logger names now use a long-form path, e.g.
[scrapy.extensions.logstats]
, instead of the shorter "top-level" variant of prior releases (e.g.[scrapy]
); this is backwards incompatible if you have log parsers expecting the short logger name part. You can switch back to short logger names usingLOG_SHORT_NAMES
set toTrue
.
Dependencies & Cleanups¶
- Scrapy now requires Twisted >= 13.1 which is the case for many Linux distributions already.
- As a consequence, we got rid of
scrapy.xlib.tx.*
modules, which copied some of Twisted code for users stuck with an "old" Twisted version ChunkedTransferMiddleware
is deprecated and removed from the default downloader middlewares.
Scrapy 1.2.3 (2017-03-03)¶
- Packaging fix: disallow unsupported Twisted versions in setup.py
Scrapy 1.2.2 (2016-12-06)¶
Bug fixes¶
- Fix a cryptic traceback when a pipeline fails on
open_spider()
(issue 2011) - Fix embedded IPython shell variables (fixing issue 396 that re-appeared in 1.2.0, fixed in issue 2418)
- A couple of patches when dealing with robots.txt:
- handle (non-standard) relative sitemap URLs (issue 2390)
- handle non-ASCII URLs and User-Agents in Python 2 (issue 2373)
Documentation¶
- Document
"download_latency"
key inRequest
'smeta
dict (issue 2033) - Remove page on (deprecated & unsupported) Ubuntu packages from ToC (issue 2335)
- A few fixed typos (issue 2346, issue 2369, issue 2369, issue 2380) and clarifications (issue 2354, issue 2325, issue 2414)
Other changes¶
- Advertize conda-forge as Scrapy's official conda channel (issue 2387)
- More helpful error messages when trying to use
.css()
or.xpath()
on non-Text Responses (issue 2264) startproject
command now generates a samplemiddlewares.py
file (issue 2335)- Add more dependencies' version info in
scrapy version
verbose output (issue 2404) - Remove all
*.pyc
files from source distribution (issue 2386)
Scrapy 1.2.1 (2016-10-21)¶
Bug fixes¶
- Include OpenSSL's more permissive default ciphers when establishing TLS/SSL connections (issue 2314).
- Fix "Location" HTTP header decoding on non-ASCII URL redirects (issue 2321).
Documentation¶
- Fix JsonWriterPipeline example (issue 2302).
- Various notes: issue 2330 on spider names, issue 2329 on middleware methods processing order, issue 2327 on getting multi-valued HTTP headers as lists.
Other changes¶
- Removed
www.
fromstart_urls
in built-in spider templates (issue 2299).
Scrapy 1.2.0 (2016-10-03)¶
New Features¶
- New
FEED_EXPORT_ENCODING
setting to customize the encoding used when writing items to a file. This can be used to turn off\uXXXX
escapes in JSON output. This is also useful for those wanting something else than UTF-8 for XML or CSV output (issue 2034). startproject
command now supports an optional destination directory to override the default one based on the project name (issue 2005).- New
SCHEDULER_DEBUG
setting to log requests serialization failures (issue 1610). - JSON encoder now supports serialization of
set
instances (issue 2058). - Interpret
application/json-amazonui-streaming
asTextResponse
(issue 1503). scrapy
is imported by default when using shell tools (shell
, inspect_response) (issue 2248).
Bug fixes¶
- DefaultRequestHeaders middleware now runs before UserAgent middleware (issue 2088). Warning: this is technically backwards incompatible, though we consider this a bug fix.
- HTTP cache extension and plugins that use the
.scrapy
data directory now work outside projects (issue 1581). Warning: this is technically backwards incompatible, though we consider this a bug fix. Selector
does not allow passing bothresponse
andtext
anymore (issue 2153).- Fixed logging of wrong callback name with
scrapy parse
(issue 2169). - Fix for an odd gzip decompression bug (issue 1606).
- Fix for selected callbacks when using
CrawlSpider
withscrapy parse
(issue 2225). - Fix for invalid JSON and XML files when spider yields no items (issue 872).
- Implement flush() for StreamLogger, avoiding a warning in logs (issue 2125).
Refactoring¶
canonicalize_url
has been moved to w3lib.url (issue 2168).
Tests & Requirements¶
Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously Ubuntu 12.04 Precise. What this means in practice is that we run continuous integration tests with these (main) packages versions at a minimum: Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.
Scrapy may very well work with older versions of these packages (the code base still has switches for older Twisted versions for example) but it is not guaranteed (because it's not tested anymore).
Documentation¶
- Grammar fixes: issue 2128, issue 1566.
- Download stats badge removed from README (issue 2160).
- New scrapy architecture diagram (issue 2165).
- Updated
Response
parameters documentation (issue 2197). - Reworded misleading
RANDOMIZE_DOWNLOAD_DELAY
description (issue 2190). - Add StackOverflow as a support channel (issue 2257).
Scrapy 1.1.4 (2017-03-03)¶
- Packaging fix: disallow unsupported Twisted versions in setup.py
Scrapy 1.1.3 (2016-09-22)¶
Bug fixes¶
- Class attributes for subclasses of
ImagesPipeline
andFilesPipeline
work as they did before 1.1.1 (issue 2243, fixes issue 2198)
Documentation¶
- Overview and tutorial rewritten to use http://toscrape.com websites (issue 2236, issue 2249, issue 2252).
Scrapy 1.1.2 (2016-08-18)¶
Bug fixes¶
- Introduce a missing
IMAGES_STORE_S3_ACL
setting to override the default ACL policy inImagesPipeline
when uploading images to S3 (note that default ACL policy is "private" -- instead of "public-read" -- since Scrapy 1.1.0) IMAGES_EXPIRES
default value set back to 90 (the regression was introduced in 1.1.1)
Scrapy 1.1.1 (2016-07-13)¶
Bug fixes¶
- Add "Host" header in CONNECT requests to HTTPS proxies (issue 2069)
- Use response
body
when choosing response class (issue 2001, fixes issue 2000) - Do not fail on canonicalizing URLs with wrong netlocs (issue 2038, fixes issue 2010)
- a few fixes for
HttpCompressionMiddleware
(andSitemapSpider
):- Do not decode HEAD responses (issue 2008, fixes issue 1899)
- Handle charset parameter in gzip Content-Type header (issue 2050, fixes issue 2049)
- Do not decompress gzip octet-stream responses (issue 2065, fixes issue 2063)
- Catch (and ignore with a warning) exception when verifying certificate against IP-address hosts (issue 2094, fixes issue 2092)
- Make
FilesPipeline
andImagesPipeline
backward compatible again regarding the use of legacy class attributes for customization (issue 1989, fixes issue 1985)
New features¶
- Enable genspider command outside project folder (issue 2052)
- Retry HTTPS CONNECT
TunnelError
by default (issue 1974)
Documentation¶
FEED_TEMPDIR
setting at lexicographical position (commit 9b3c72c)- Use idiomatic
.extract_first()
in overview (issue 1994) - Update years in copyright notice (commit c2c8036)
- Add information and example on errbacks (issue 1995)
- Use "url" variable in downloader middleware example (issue 2015)
- Grammar fixes (issue 2054, issue 2120)
- New FAQ entry on using BeautifulSoup in spider callbacks (issue 2048)
- Add notes about scrapy not working on Windows with Python 3 (issue 2060)
- Encourage complete titles in pull requests (issue 2026)
Tests¶
- Upgrade py.test requirement on Travis CI and Pin pytest-cov to 2.2.1 (issue 2095)
Scrapy 1.1.0 (2016-05-11)¶
This 1.1 release brings a lot of interesting features and bug fixes:
- Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See Beta Python 3 Support for more details and some limitations.
- Hot new features:
- Item loaders now support nested loaders (issue 1467).
FormRequest.from_response
improvements (issue 1382, issue 1137).- Added setting
AUTOTHROTTLE_TARGET_CONCURRENCY
and improved AutoThrottle docs (issue 1324). - Added
response.text
to get body as unicode (issue 1730). - Anonymous S3 connections (issue 1358).
- Deferreds in downloader middlewares (issue 1473). This enables better robots.txt handling (issue 1471).
- HTTP caching now follows RFC2616 more closely, added settings
HTTPCACHE_ALWAYS_STORE
andHTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS
(issue 1151). - Selectors were extracted to the parsel library (issue 1409). This means you can use Scrapy Selectors without Scrapy and also upgrade the selectors engine without needing to upgrade Scrapy.
- HTTPS downloader now does TLS protocol negotiation by default,
instead of forcing TLS 1.0. You can also set the SSL/TLS method
using the new
DOWNLOADER_CLIENT_TLS_METHOD
.
- These bug fixes may require your attention:
- Don't retry bad requests (HTTP 400) by default (issue 1289).
If you need the old behavior, add
400
toRETRY_HTTP_CODES
. - Fix shell files argument handling (issue 1710, issue 1550).
If you try
scrapy shell index.html
it will try to load the URL http://index.html, usescrapy shell ./index.html
to load a local file. - Robots.txt compliance is now enabled by default for newly-created projects
(issue 1724). Scrapy will also wait for robots.txt to be downloaded
before proceeding with the crawl (issue 1735). If you want to disable
this behavior, update
ROBOTSTXT_OBEY
insettings.py
file after creating a new project. - Exporters now work on unicode, instead of bytes by default (issue 1080).
If you use
PythonItemExporter
, you may want to update your code to disable binary mode which is now deprecated. - Accept XML node names containing dots as valid (issue 1533).
- When uploading files or images to S3 (with
FilesPipeline
orImagesPipeline
), the default ACL policy is now "private" instead of "public" Warning: backwards incompatible!. You can useFILES_STORE_S3_ACL
to change it. - We've reimplemented
canonicalize_url()
for more correct output, especially for URLs with non-ASCII characters (issue 1947). This could change link extractors output compared to previous scrapy versions. This may also invalidate some cache entries you could still have from pre-1.1 runs. Warning: backwards incompatible!.
- Don't retry bad requests (HTTP 400) by default (issue 1289).
If you need the old behavior, add
Keep reading for more details on other improvements and bug fixes.
Beta Python 3 Support¶
We have been hard at work to make Scrapy run on Python 3. As a result, now you can run spiders on Python 3.3, 3.4 and 3.5 (Twisted >= 15.5 required). Some features are still missing (and some may never be ported).
Almost all builtin extensions/middlewares are expected to work. However, we are aware of some limitations in Python 3:
- Scrapy does not work on Windows with Python 3
- Sending emails is not supported
- FTP download handler is not supported
- Telnet console is not supported
Additional New Features and Enhancements¶
- Scrapy now has a Code of Conduct (issue 1681).
- Command line tool now has completion for zsh (issue 934).
- Improvements to
scrapy shell
:- Support for bpython and configure preferred Python shell via
SCRAPY_PYTHON_SHELL
(issue 1100, issue 1444). - Support URLs without scheme (issue 1498) Warning: backwards incompatible!
- Bring back support for relative file path (issue 1710, issue 1550).
- Support for bpython and configure preferred Python shell via
- Added
MEMUSAGE_CHECK_INTERVAL_SECONDS
setting to change default check interval (issue 1282). - Download handlers are now lazy-loaded on first request using their scheme (issue 1390, issue 1421).
- HTTPS download handlers do not force TLS 1.0 anymore; instead,
OpenSSL's
SSLv23_method()/TLS_method()
is used allowing to try negotiating with the remote hosts the highest TLS protocol version it can (issue 1794, issue 1629). RedirectMiddleware
now skips the status codes fromhandle_httpstatus_list
on spider attribute or inRequest
'smeta
key (issue 1334, issue 1364, issue 1447).- Form submission:
- now works with
<button>
elements too (issue 1469). - an empty string is now used for submit buttons without a value (issue 1472)
- now works with
- Dict-like settings now have per-key priorities (issue 1135, issue 1149 and issue 1586).
- Sending non-ASCII emails (issue 1662)
CloseSpider
andSpiderState
extensions now get disabled if no relevant setting is set (issue 1723, issue 1725).- Added method
ExecutionEngine.close
(issue 1423). - Added method
CrawlerRunner.create_crawler
(issue 1528). - Scheduler priority queue can now be customized via
SCHEDULER_PRIORITY_QUEUE
(issue 1822). .pps
links are now ignored by default in link extractors (issue 1835).- temporary data folder for FTP and S3 feed storages can be customized
using a new
FEED_TEMPDIR
setting (issue 1847). FilesPipeline
andImagesPipeline
settings are now instance attributes instead of class attributes, enabling spider-specific behaviors (issue 1891).JsonItemExporter
now formats opening and closing square brackets on their own line (first and last lines of output file) (issue 1950).- If available,
botocore
is used forS3FeedStorage
,S3DownloadHandler
andS3FilesStore
(issue 1761, issue 1883). - Tons of documentation updates and related fixes (issue 1291, issue 1302, issue 1335, issue 1683, issue 1660, issue 1642, issue 1721, issue 1727, issue 1879).
- Other refactoring, optimizations and cleanup (issue 1476, issue 1481, issue 1477, issue 1315, issue 1290, issue 1750, issue 1881).
Deprecations and Removals¶
- Added
to_bytes
andto_unicode
, deprecatedstr_to_unicode
andunicode_to_str
functions (issue 778). binary_is_text
is introduced, to replace use ofisbinarytext
(but with inverse return value) (issue 1851)- The
optional_features
set has been removed (issue 1359). - The
--lsprof
command line option has been removed (issue 1689). Warning: backward incompatible, but doesn't break user code. - The following datatypes were deprecated (issue 1720):
scrapy.utils.datatypes.MultiValueDictKeyError
scrapy.utils.datatypes.MultiValueDict
scrapy.utils.datatypes.SiteNode
- The previously bundled
scrapy.xlib.pydispatch
library was deprecated and replaced by pydispatcher.
Relocations¶
telnetconsole
was relocated toextensions/
(issue 1524).- Note: telnet is not enabled on Python 3 (https://github.com/scrapy/scrapy/pull/1524#issuecomment-146985595)
Bugfixes¶
- Scrapy does not retry requests that got a
HTTP 400 Bad Request
response anymore (issue 1289). Warning: backwards incompatible! - Support empty password for http_proxy config (issue 1274).
- Interpret
application/x-json
asTextResponse
(issue 1333). - Support link rel attribute with multiple values (issue 1201).
- Fixed
scrapy.http.FormRequest.from_response
when there is a<base>
tag (issue 1564). - Fixed
TEMPLATES_DIR
handling (issue 1575). - Various
FormRequest
fixes (issue 1595, issue 1596, issue 1597). - Makes
_monkeypatches
more robust (issue 1634). - Fixed bug on
XMLItemExporter
with non-string fields in items (issue 1738). - Fixed startproject command in OS X (issue 1635).
- Fixed PythonItemExporter and CSVExporter for non-string item types (issue 1737).
- Various logging related fixes (issue 1294, issue 1419, issue 1263, issue 1624, issue 1654, issue 1722, issue 1726 and issue 1303).
- Fixed bug in
utils.template.render_templatefile()
(issue 1212). - sitemaps extraction from
robots.txt
is now case-insensitive (issue 1902). - HTTPS+CONNECT tunnels could get mixed up when using multiple proxies to same remote host (issue 1912).
Scrapy 1.0.7 (2017-03-03)¶
- Packaging fix: disallow unsupported Twisted versions in setup.py
Scrapy 1.0.6 (2016-05-04)¶
- FIX: RetryMiddleware is now robust to non-standard HTTP status codes (issue 1857)
- FIX: Filestorage HTTP cache was checking wrong modified time (issue 1875)
- DOC: Support for Sphinx 1.4+ (issue 1893)
- DOC: Consistency in selectors examples (issue 1869)
Scrapy 1.0.5 (2016-02-04)¶
- FIX: [Backport] Ignore bogus links in LinkExtractors (fixes issue 907, commit 108195e)
- TST: Changed buildbot makefile to use 'pytest' (commit 1f3d90a)
- DOC: Fixed typos in tutorial and media-pipeline (commit 808a9ea and commit 803bd87)
- DOC: Add AjaxCrawlMiddleware to DOWNLOADER_MIDDLEWARES_BASE in settings docs (commit aa94121)
Scrapy 1.0.4 (2015-12-30)¶
- Ignoring xlib/tx folder, depending on Twisted version. (commit 7dfa979)
- Run on new travis-ci infra (commit 6e42f0b)
- Spelling fixes (commit 823a1cc)
- escape nodename in xmliter regex (commit da3c155)
- test xml nodename with dots (commit 4418fc3)
- TST don't use broken Pillow version in tests (commit a55078c)
- disable log on version command. closes #1426 (commit 86fc330)
- disable log on startproject command (commit db4c9fe)
- Add PyPI download stats badge (commit df2b944)
- don't run tests twice on Travis if a PR is made from a scrapy/scrapy branch (commit a83ab41)
- Add Python 3 porting status badge to the README (commit 73ac80d)
- fixed RFPDupeFilter persistence (commit 97d080e)
- TST a test to show that dupefilter persistence is not working (commit 97f2fb3)
- explicit close file on file:// scheme handler (commit d9b4850)
- Disable dupefilter in shell (commit c0d0734)
- DOC: Add captions to toctrees which appear in sidebar (commit aa239ad)
- DOC Removed pywin32 from install instructions as it's already declared as dependency. (commit 10eb400)
- Added installation notes about using Conda for Windows and other OSes. (commit 1c3600a)
- Fixed minor grammar issues. (commit 7f4ddd5)
- fixed a typo in the documentation. (commit b71f677)
- Version 1 now exists (commit 5456c0e)
- fix another invalid xpath error (commit 0a1366e)
- fix ValueError: Invalid XPath: //div/[id="not-exists"]/text() on selectors.rst (commit ca8d60f)
- Typos corrections (commit 7067117)
- fix typos in downloader-middleware.rst and exceptions.rst, middlware -> middleware (commit 32f115c)
- Add note to ubuntu install section about debian compatibility (commit 23fda69)
- Replace alternative OSX install workaround with virtualenv (commit 98b63ee)
- Reference Homebrew's homepage for installation instructions (commit 1925db1)
- Add oldest supported tox version to contributing docs (commit 5d10d6d)
- Note in install docs about pip being already included in python>=2.7.9 (commit 85c980e)
- Add non-python dependencies to Ubuntu install section in the docs (commit fbd010d)
- Add OS X installation section to docs (commit d8f4cba)
- DOC(ENH): specify path to rtd theme explicitly (commit de73b1a)
- minor: scrapy.Spider docs grammar (commit 1ddcc7b)
- Make common practices sample code match the comments (commit 1b85bcf)
- nextcall repetitive calls (heartbeats). (commit 55f7104)
- Backport fix compatibility with Twisted 15.4.0 (commit b262411)
- pin pytest to 2.7.3 (commit a6535c2)
- Merge pull request #1512 from mgedmin/patch-1 (commit 8876111)
- Merge pull request #1513 from mgedmin/patch-2 (commit 5d4daf8)
- Typo (commit f8d0682)
- Fix list formatting (commit 5f83a93)
- fix scrapy squeue tests after recent changes to queuelib (commit 3365c01)
- Merge pull request #1475 from rweindl/patch-1 (commit 2d688cd)
- Update tutorial.rst (commit fbc1f25)
- Merge pull request #1449 from rhoekman/patch-1 (commit 7d6538c)
- Small grammatical change (commit 8752294)
- Add openssl version to version command (commit 13c45ac)
Scrapy 1.0.3 (2015-08-11)¶
- add service_identity to scrapy install_requires (commit cbc2501)
- Workaround for travis#296 (commit 66af9cd)
Scrapy 1.0.2 (2015-08-06)¶
- Twisted 15.3.0 does not raise PicklingError serializing lambda functions (commit b04dd7d)
- Minor method name fix (commit 6f85c7f)
- minor: scrapy.Spider grammar and clarity (commit 9c9d2e0)
- Put a blurb about support channels in CONTRIBUTING (commit c63882b)
- Fixed typos (commit a9ae7b0)
- Fix doc reference. (commit 7c8a4fe)
Scrapy 1.0.1 (2015-07-01)¶
- Unquote request path before passing to FTPClient; it already escapes paths (commit cc00ad2)
- include tests/ to source distribution in MANIFEST.in (commit eca227e)
- DOC Fix SelectJmes documentation (commit b8567bc)
- DOC Bring Ubuntu and Archlinux outside of Windows subsection (commit 392233f)
- DOC remove version suffix from ubuntu package (commit 5303c66)
- DOC Update release date for 1.0 (commit c89fa29)
Scrapy 1.0.0 (2015-06-19)¶
You will find a lot of new features and bugfixes in this major release. Make sure to check our updated overview to get a glance of some of the changes, along with our brushed tutorial.
Support for returning dictionaries in spiders¶
Declaring and returning Scrapy Items is no longer necessary to collect the scraped data from your spider; you can now return explicit dictionaries instead.
Classic version
class MyItem(scrapy.Item):
    url = scrapy.Field()


class MySpider(scrapy.Spider):

    def parse(self, response):
        return MyItem(url=response.url)
New version
class MySpider(scrapy.Spider):

    def parse(self, response):
        return {'url': response.url}
Per-spider settings (GSoC 2014)¶
Last Google Summer of Code project accomplished an important redesign of the mechanism used for populating settings, introducing explicit priorities to override any given setting. As an extension of that goal, we included a new level of priority for settings that act exclusively for a single spider, allowing them to redefine project settings.
Start using it by defining a custom_settings
class variable in your spider:
class MySpider(scrapy.Spider):
    custom_settings = {
        "DOWNLOAD_DELAY": 5.0,
        "RETRY_ENABLED": False,
    }
Read more about settings population: Settings
Python Logging¶
Scrapy 1.0 has moved away from Twisted logging to use Python's built-in logging as the default logging system. We're maintaining backward compatibility for most of the old custom interface to call logging functions, but you'll get warnings to switch to the Python logging API entirely.
Old version
from scrapy import log
log.msg('MESSAGE', log.INFO)
New version
import logging
logging.info('MESSAGE')
Logging with spiders remains the same, but on top of the
log()
method you’ll have access to a custom
logger
created for the spider to issue log
events:
class MySpider(scrapy.Spider):

    def parse(self, response):
        self.logger.info('Response received')
Read more in the logging documentation: Logging
Crawler API refactoring (GSoC 2014)¶
Another milestone for last Google Summer of Code was a refactoring of the internal API, seeking a simpler and easier usage. Check new core interface in: Core API
A common situation where you will face these changes is while running Scrapy from scripts. Here’s a quick example of how to run a Spider manually with the new API:
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
Bear in mind this feature is still under development and its API may change until it reaches a stable status.
See more examples for scripts running Scrapy: Common Practices
Module Relocations¶
There’s been a large rearrangement of modules trying to improve the general structure of Scrapy. The main changes were separating various subpackages into new projects and dissolving both scrapy.contrib and scrapy.contrib_exp into top level packages. Backward compatibility was kept for internal relocations; when importing deprecated modules, expect warnings indicating their new location.
Full list of relocations¶
Outsourced packages
Note
These extensions went through some minor changes, e.g. some setting names were changed. Please check the documentation in each new repository to get familiar with the new usage.
| Old location | New location |
| --- | --- |
| scrapy.commands.deploy | scrapyd-client (See other alternatives here: Deploying Spiders) |
| scrapy.contrib.djangoitem | scrapy-djangoitem |
| scrapy.webservice | scrapy-jsonrpc |
scrapy.contrib_exp and scrapy.contrib dissolutions
| Old location | New location |
| --- | --- |
| scrapy.contrib_exp.downloadermiddleware.decompression | scrapy.downloadermiddlewares.decompression |
| scrapy.contrib_exp.iterators | scrapy.utils.iterators |
| scrapy.contrib.downloadermiddleware | scrapy.downloadermiddlewares |
| scrapy.contrib.exporter | scrapy.exporters |
| scrapy.contrib.linkextractors | scrapy.linkextractors |
| scrapy.contrib.loader | scrapy.loader |
| scrapy.contrib.loader.processor | scrapy.loader.processors |
| scrapy.contrib.pipeline | scrapy.pipelines |
| scrapy.contrib.spidermiddleware | scrapy.spidermiddlewares |
| scrapy.contrib.spiders | scrapy.spiders |
| scrapy.contrib.* (remaining built-in extensions) | scrapy.extensions.* |
Plural renames and Modules unification
| Old location | New location |
| --- | --- |
| scrapy.command | scrapy.commands |
| scrapy.dupefilter | scrapy.dupefilters |
| scrapy.linkextractor | scrapy.linkextractors |
| scrapy.spider | scrapy.spiders |
| scrapy.squeue | scrapy.squeues |
| scrapy.statscol | scrapy.statscollectors |
| scrapy.utils.decorator | scrapy.utils.decorators |
Class renames
Old location | New location |
---|---|
scrapy.spidermanager.SpiderManager | scrapy.spiderloader.SpiderLoader |
Settings renames
Old location | New location |
---|---|
SPIDER_MANAGER_CLASS | SPIDER_LOADER_CLASS |
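As a hedged illustration of what these relocations mean in practice (based on the tables above; the exact deprecation message may differ), code written against the old paths keeps working in Scrapy 1.0 but should be updated to the new ones:

# Old, deprecated path -- still importable in Scrapy 1.0, but emits a deprecation warning
from scrapy.contrib.loader import ItemLoader

# New path after the relocation (preferred)
from scrapy.loader import ItemLoader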
Changelog¶
New Features and Enhancements
- Python logging (issue 1060, issue 1235, issue 1236, issue 1240, issue 1259, issue 1278, issue 1286)
- FEED_EXPORT_FIELDS option (issue 1159, issue 1224)
- Dns cache size and timeout options (issue 1132)
- support namespace prefix in xmliter_lxml (issue 963)
- Reactor threadpool max size setting (issue 1123)
- Allow spiders to return dicts. (issue 1081)
- Add Response.urljoin() helper (issue 1086)
- look in ~/.config/scrapy.cfg for user config (issue 1098)
- handle TLS SNI (issue 1101)
- Selectorlist extract first (issue 624, issue 1145)
- Added JmesSelect (issue 1016)
- add gzip compression to filesystem http cache backend (issue 1020)
- CSS support in link extractors (issue 983)
- httpcache dont_cache meta #19 #689 (issue 821)
- add signal to be sent when request is dropped by the scheduler (issue 961)
- avoid download large response (issue 946)
- Allow to specify the quotechar in CSVFeedSpider (issue 882)
- Add referer to "Spider error processing" log message (issue 795)
- process robots.txt once (issue 896)
- GSoC Per-spider settings (issue 854)
- Add project name validation (issue 817)
- GSoC API cleanup (issue 816, issue 1128, issue 1147, issue 1148, issue 1156, issue 1185, issue 1187, issue 1258, issue 1268, issue 1276, issue 1285, issue 1284)
- Be more responsive with IO operations (issue 1074 and issue 1075)
- Do leveldb compaction for httpcache on closing (issue 1297)
Deprecations and Removals
- Deprecate htmlparser link extractor (issue 1205)
- remove deprecated code from FeedExporter (issue 1155)
- a leftover for.15 compatibility (issue 925)
- drop support for CONCURRENT_REQUESTS_PER_SPIDER (issue 895)
- Drop old engine code (issue 911)
- Deprecate SgmlLinkExtractor (issue 777)
Relocations
- Move exporters/__init__.py to exporters.py (issue 1242)
- Move base classes to their packages (issue 1218, issue 1233)
- Module relocation (issue 1181, issue 1210)
- rename SpiderManager to SpiderLoader (issue 1166)
- Remove djangoitem (issue 1177)
- remove scrapy deploy command (issue 1102)
- dissolve contrib_exp (issue 1134)
- Deleted bin folder from root, fixes #913 (issue 914)
- Remove jsonrpc based webservice (issue 859)
- Move Test cases under project root dir (issue 827, issue 841)
- Fix backward incompatibility for relocated paths in settings (issue 1267)
Documentation
- CrawlerProcess documentation (issue 1190)
- Favoring web scraping over screen scraping in the descriptions (issue 1188)
- Some improvements for Scrapy tutorial (issue 1180)
- Documenting Files Pipeline together with Images Pipeline (issue 1150)
- deployment docs tweaks (issue 1164)
- Added deployment section covering scrapyd-deploy and shub (issue 1124)
- Adding more settings to project template (issue 1073)
- some improvements to overview page (issue 1106)
- Updated link in docs/topics/architecture.rst (issue 647)
- DOC reorder topics (issue 1022)
- updating list of Request.meta special keys (issue 1071)
- DOC document download_timeout (issue 898)
- DOC simplify extension docs (issue 893)
- Leaks docs (issue 894)
- DOC document from_crawler method for item pipelines (issue 904)
- Spider_error doesn't support deferreds (issue 1292)
- Corrections & Sphinx related fixes (issue 1220, issue 1219, issue 1196, issue 1172, issue 1171, issue 1169, issue 1160, issue 1154, issue 1127, issue 1112, issue 1105, issue 1041, issue 1082, issue 1033, issue 944, issue 866, issue 864, issue 796, issue 1260, issue 1271, issue 1293, issue 1298)
Bugfixes
- Item multi inheritance fix (issue 353, issue 1228)
- ItemLoader.load_item: iterate over copy of fields (issue 722)
- Fix Unhandled error in Deferred (RobotsTxtMiddleware) (issue 1131, issue 1197)
- Force to read DOWNLOAD_TIMEOUT as int (issue 954)
- scrapy.utils.misc.load_object should print full traceback (issue 902)
- Fix bug for ".local" host name (issue 878)
- Fix for Enabled extensions, middlewares, pipelines info not printed anymore (issue 879)
- fix dont_merge_cookies bad behaviour when set to false on meta (issue 846)
Python 3 In Progress Support
- disable scrapy.telnet if twisted.conch is not available (issue 1161)
- fix Python 3 syntax errors in ajaxcrawl.py (issue 1162)
- more python3 compatibility changes for urllib (issue 1121)
- assertItemsEqual was renamed to assertCountEqual in Python 3. (issue 1070)
- Import unittest.mock if available. (issue 1066)
- updated deprecated cgi.parse_qsl to use six's parse_qsl (issue 909)
- Prevent Python 3 port regressions (issue 830)
- PY3: use MutableMapping for python 3 (issue 810)
- PY3: use six.BytesIO and six.moves.cStringIO (issue 803)
- PY3: fix xmlrpclib and email imports (issue 801)
- PY3: use six for robotparser and urlparse (issue 800)
- PY3: use six.iterkeys, six.iteritems, and tempfile (issue 799)
- PY3: fix has_key and use six.moves.configparser (issue 798)
- PY3: use six.moves.cPickle (issue 797)
- PY3 make it possible to run some tests in Python3 (issue 776)
Tests
- remove unnecessary lines from py3-ignores (issue 1243)
- Fix remaining warnings from pytest while collecting tests (issue 1206)
- Add docs build to travis (issue 1234)
- TST don't collect tests from deprecated modules. (issue 1165)
- install service_identity package in tests to prevent warnings (issue 1168)
- Fix deprecated settings API in tests (issue 1152)
- Add test for webclient with POST method and no body given (issue 1089)
- py3-ignores.txt supports comments (issue 1044)
- modernize some of the asserts (issue 835)
- selector.__repr__ test (issue 779)
Code refactoring
- CSVFeedSpider cleanup: use iterate_spider_output (issue 1079)
- remove unnecessary check from scrapy.utils.spider.iter_spider_output (issue 1078)
- Pydispatch pep8 (issue 992)
- Removed unused 'load=False' parameter from walk_modules() (issue 871)
- For consistency, use job_dir helper in SpiderState extension. (issue 805)
- rename "sflo" local variables to less cryptic "log_observer" (issue 775)
Scrapy 0.24.6 (2015-04-20)¶
- encode invalid xpath with unicode_escape under PY2 (commit 07cb3e5)
- fix IPython shell scope issue and load IPython user config (commit 2c8e573)
- Fix small typo in the docs (commit d694019)
- Fix small typo (commit f92fa83)
- Converted sel.xpath() calls to response.xpath() in Extracting the data (commit c2c6d15)
Scrapy 0.24.5 (2015-02-25)¶
- Support new _getEndpoint Agent signatures on Twisted 15.0.0 (commit 540b9bc)
- DOC a couple more references are fixed (commit b4c454b)
- DOC fix a reference (commit e3c1260)
- t.i.b.ThreadedResolver is now a new-style class (commit 9e13f42)
- S3DownloadHandler: fix auth for requests with quoted paths/query params (commit cdb9a0b)
- fixed the variable types in mailsender documentation (commit bb3a848)
- Reset items_scraped instead of item_count (commit edb07a4)
- Tentative attention message about what document to read for contributions (commit 7ee6f7a)
- mitmproxy 0.10.1 needs netlib 0.10.1 too (commit 874fcdd)
- pin mitmproxy 0.10.1 as >0.11 does not work with tests (commit c6b21f0)
- Test the parse command locally instead of against an external url (commit c3a6628)
- Patches Twisted issue while closing the connection pool on HTTPDownloadHandler (commit d0bf957)
- Updates documentation on dynamic item classes. (commit eeb589a)
- Merge pull request #943 from Lazar-T/patch-3 (commit 5fdab02)
- typo (commit b0ae199)
- pywin32 is required by Twisted. closes #937 (commit 5cb0cfb)
- Update install.rst (commit 781286b)
- Merge pull request #928 from Lazar-T/patch-1 (commit b415d04)
- comma instead of fullstop (commit 627b9ba)
- Merge pull request #885 from jsma/patch-1 (commit de909ad)
- Update request-response.rst (commit 3f3263d)
- SgmlLinkExtractor - fix for parsing <area> tag with Unicode present (commit 49b40f0)
Scrapy 0.24.4 (2014-08-09)¶
- pem file is used by mockserver and required by scrapy bench (commit 5eddc68)
- scrapy bench needs scrapy.tests* (commit d6cb999)
Scrapy 0.24.3 (2014-08-09)¶
- no need to waste travis-ci time on py3 for 0.24 (commit 8e080c1)
- Update installation docs (commit 1d0c096)
- There is a trove classifier for Scrapy framework! (commit 4c701d7)
- update other places where w3lib version is mentioned (commit d109c13)
- Update w3lib requirement to 1.8.0 (commit 39d2ce5)
- Use w3lib.html.replace_entities() (remove_entities() is deprecated) (commit 180d3ad)
- set zip_safe=False (commit a51ee8b)
- do not ship tests package (commit ee3b371)
- scrapy.bat is not needed anymore (commit c3861cf)
- Modernize setup.py (commit 362e322)
- headers can not handle non-string values (commit 94a5c65)
- fix ftp test cases (commit a274a7f)
- The sum up of travis-ci builds are taking like 50min to complete (commit ae1e2cc)
- Update shell.rst typo (commit e49c96a)
- removes weird indentation in the shell results (commit 1ca489d)
- improved explanations, clarified blog post as source, added link for XPath string functions in the spec (commit 65c8f05)
- renamed UserTimeoutError and ServerTimeouterror #583 (commit 037f6ab)
- adding some xpath tips to selectors docs (commit 2d103e0)
- fix tests to account for https://github.com/scrapy/w3lib/pull/23 (commit f8d366a)
- get_func_args maximum recursion fix #728 (commit 81344ea)
- Updated input/ouput processor example according to #560. (commit f7c4ea8)
- Fixed Python syntax in tutorial. (commit db59ed9)
- Add test case for tunneling proxy (commit f090260)
- Bugfix for leaking Proxy-Authorization header to remote host when using tunneling (commit d8793af)
- Extract links from XHTML documents with MIME-Type "application/xml" (commit ed1f376)
- Merge pull request #793 from roysc/patch-1 (commit 91a1106)
- Fix typo in commands.rst (commit 743e1e2)
- better testcase for settings.overrides.setdefault (commit e22daaf)
- Using CRLF as line marker according to http 1.1 definition (commit 5ec430b)
Scrapy 0.24.2 (2014-07-08)¶
- Use a mutable mapping to proxy deprecated settings.overrides and settings.defaults attribute (commit e5e8133)
- there is not support for python3 yet (commit 3cd6146)
- Update python compatible version set to debian packages (commit fa5d76b)
- DOC fix formatting in release notes (commit c6a9e20)
Scrapy 0.24.1 (2014-06-27)¶
- Fix deprecated CrawlerSettings and increase backwards compatibility with .defaults attribute (commit 8e3f20a)
Scrapy 0.24.0 (2014-06-26)¶
Enhancements¶
- Improve Scrapy top-level namespace (issue 494, issue 684)
- Add selector shortcuts to responses (issue 554, issue 690)
- Add new lxml based LinkExtractor to replace unmaintained SgmlLinkExtractor (issue 559, issue 761, issue 763)
- Cleanup settings API - part of per-spider settings GSoC project (issue 737)
- Add UTF8 encoding header to templates (issue 688, issue 762)
- Telnet console now binds to 127.0.0.1 by default (issue 699)
- Update debian/ubuntu install instructions (issue 509, issue 549)
- Disable smart strings in lxml XPath evaluations (issue 535)
- Restore filesystem based cache as default for http cache middleware (issue 541, issue 500, issue 571)
- Expose current crawler in Scrapy shell (issue 557)
- Improve testsuite comparing CSV and XML exporters (issue 570)
- New offsite/filtered and offsite/domains stats (issue 566)
- Support process_links as generator in CrawlSpider (issue 555)
- Verbose logging and new stats counters for DupeFilter (issue 553)
- Add a mimetype parameter to MailSender.send() (issue 602)
- Generalize file pipeline log messages (issue 622)
- Replace unencodeable codepoints with html entities in SGMLLinkExtractor (issue 565)
- Converted SEP documents to rst format (issue 629, issue 630, issue 638, issue 632, issue 636, issue 640, issue 635, issue 634, issue 639, issue 637, issue 631, issue 633, issue 641, issue 642)
- Tests and docs for clickdata's nr index in FormRequest (issue 646, issue 645)
- Allow to disable a downloader handler just like any other component (issue 650)
- Log when a request is discarded after too many redirections (issue 654)
- Log error responses if they are not handled by spider callbacks (issue 612, issue 656)
- Add content-type check to http compression mw (issue 193, issue 660)
- Run pypy tests using latest pypi from ppa (issue 674)
- Run test suite using pytest instead of trial (issue 679)
- Build docs and check for dead links in tox environment (issue 687)
- Make scrapy.version_info a tuple of integers (issue 681, issue 692)
- Infer exporter's output format from filename extensions (issue 546, issue 659, issue 760)
- Support case-insensitive domains in url_is_from_any_domain() (issue 693)
- Remove pep8 warnings in project and spider templates (issue 698)
- Tests and docs for request_fingerprint function (issue 597)
- Update SEP-19 for GSoC project per-spider settings (issue 705)
- Set exit code to non-zero when contracts fails (issue 727)
- Add a setting to control what class is instantiated as Downloader component (issue 738)
- Pass response in item_dropped signal (issue 724)
- Improve scrapy check contracts command (issue 733, issue 752)
- Document spider.closed() shortcut (issue 719)
- Document request_scheduled signal (issue 746)
- Add a note about reporting security issues (issue 697)
- Add LevelDB http cache storage backend (issue 626, issue 500)
- Sort spider list output of scrapy list command (issue 742)
- Multiple documentation enhancements and fixes (issue 575, issue 587, issue 590, issue 596, issue 610, issue 617, issue 618, issue 627, issue 613, issue 643, issue 654, issue 675, issue 663, issue 711, issue 714)
Bugfixes¶
- Encode unicode URL value when creating Links in RegexLinkExtractor (issue 561)
- Ignore None values in ItemLoader processors (issue 556)
- Fix link text when there is an inner tag in SGMLLinkExtractor and HtmlParserLinkExtractor (issue 485, issue 574)
- Fix wrong checks on subclassing of deprecated classes (issue 581, issue 584)
- Handle errors caused by inspect.stack() failures (issue 582)
- Fix a reference to unexistent engine attribute (issue 593, issue 594)
- Fix dynamic itemclass example usage of type() (issue 603)
- Use lucasdemarchi/codespell to fix typos (issue 628)
- Fix default value of attrs argument in SgmlLinkExtractor to be tuple (issue 661)
- Fix XXE flaw in sitemap reader (issue 676)
- Fix engine to support filtered start requests (issue 707)
- Fix offsite middleware case on urls with no hostnames (issue 745)
- Testsuite doesn't require PIL anymore (issue 585)
Scrapy 0.22.2 (released 2014-02-14)¶
- fix a reference to unexistent engine.slots. closes #593 (commit 13c099a)
- downloaderMW doc typo (spiderMW doc copy remnant) (commit 8ae11bf)
- Correct typos (commit 1346037)
Scrapy 0.22.1 (released 2014-02-08)¶
- localhost666 can resolve under certain circumstances (commit 2ec2279)
- test inspect.stack failure (commit cc3eda3)
- Handle cases when inspect.stack() fails (commit 8cb44f9)
- Fix wrong checks on subclassing of deprecated classes. closes #581 (commit 46d98d6)
- Docs: 4-space indent for final spider example (commit 13846de)
- Fix HtmlParserLinkExtractor and tests after #485 merge (commit 368a946)
- BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag (commit b566388)
- BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (commit c1cb418)
- BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match the opening tag (commit 7e4d627)
- Fix tests for Travis-CI build (commit 76c7e20)
- replace unencodeable codepoints with html entities. fixes #562 and #285 (commit 5f87b17)
- RegexLinkExtractor: encode URL unicode value when creating Links (commit d0ee545)
- Updated the tutorial crawl output with latest output. (commit 8da65de)
- Updated shell docs with the crawler reference and fixed the actual shell output. (commit 875b9ab)
- PEP8 minor edits. (commit f89efaf)
- Expose current crawler in the scrapy shell. (commit 5349cec)
- Unused re import and PEP8 minor edits. (commit 387f414)
- Ignore None's values when using the ItemLoader. (commit 0632546)
- DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead Dbm. (commit cde9a8c)
- show ubuntu setup instructions as literal code (commit fb5c9c5)
- Update Ubuntu installation instructions (commit 70fb105)
- Merge pull request #550 from stray-leone/patch-1 (commit 6f70b6a)
- modify the version of scrapy ubuntu package (commit 725900d)
- fix 0.22.0 release date (commit af0219a)
- fix typos in news.rst and remove (not released yet) header (commit b7f58f4)
Scrapy 0.22.0 (released 2014-01-17)¶
Enhancements¶
- [Backwards incompatible] Switched HTTPCacheMiddleware backend to filesystem (issue 541) To restore old backend set HTTPCACHE_STORAGE to scrapy.contrib.httpcache.DbmCacheStorage
- Proxy https:// urls using CONNECT method (issue 392, issue 397)
- Add a middleware to crawl ajax crawleable pages as defined by google (issue 343)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (issue 510, issue 519)
- Selectors register EXSLT namespaces by default (issue 472)
- Unify item loaders similar to selectors renaming (issue 461)
- Make RFPDupeFilter class easily subclassable (issue 533)
- Improve test coverage and forthcoming Python 3 support (issue 525)
- Promote startup info on settings and middleware to INFO level (issue 520)
- Support partials in get_func_args util (issue 506, issue 504)
- Allow running individual tests via tox (issue 503)
- Update extensions ignored by link extractors (issue 498)
- Add middleware methods to get files/images/thumbs paths (issue 490)
- Improve offsite middleware tests (issue 478)
- Add a way to skip default Referer header set by RefererMiddleware (issue 475)
- Do not send x-gzip in default Accept-Encoding header (issue 469)
- Support defining http error handling using settings (issue 466)
- Use modern python idioms wherever you find legacies (issue 497)
- Improve and correct documentation (issue 527, issue 524, issue 521, issue 517, issue 512, issue 505, issue 502, issue 489, issue 465, issue 460, issue 425, issue 536)
Fixes¶
- Update Selector class imports in CrawlSpider template (issue 484)
- Fix unexistent reference to engine.slots (issue 464)
- Do not try to call body_as_unicode() on a non-TextResponse instance (issue 462)
- Warn when subclassing XPathItemLoader, previously it only warned on instantiation. (issue 523)
- Warn when subclassing XPathSelector, previously it only warned on instantiation. (issue 537)
- Multiple fixes to memory stats (issue 531, issue 530, issue 529)
- Fix overriding url in FormRequest.from_response() (issue 507)
- Fix tests runner under pip 1.5 (issue 513)
- Fix logging error when spider name is unicode (issue 479)
Scrapy 0.20.2 (released 2013-12-09)¶
- Update CrawlSpider Template with Selector changes (commit 6d1457d)
- fix method name in tutorial. closes GH-480 (commit b4fc359)
Scrapy 0.20.1 (released 2013-11-28)¶
- include_package_data is required to build wheels from published sources (commit 5ba1ad5)
- process_parallel was leaking the failures on its internal deferreds. closes #458 (commit 419a780)
Scrapy 0.20.0 (released 2013-11-08)¶
Enhancements¶
- New Selector's API including CSS selectors (issue 395 and issue 426),
- Request/Response url/body attributes are now immutable (modifying them had been deprecated for a long time)
- ITEM_PIPELINES is now defined as a dict (instead of a list)
- Sitemap spider can fetch alternate URLs (issue 360)
- Selector.remove_namespaces() now remove namespaces from element's attributes. (issue 416)
- Paved the road for Python 3.3+ (issue 435, issue 436, issue 431, issue 452)
- New item exporter using native python types with nesting support (issue 366)
- Tune HTTP1.1 pool size so it matches concurrency defined by settings (commit b43b5f575)
- scrapy.mail.MailSender now can connect over TLS or upgrade using STARTTLS (issue 327)
- New FilesPipeline with functionality factored out from ImagesPipeline (issue 370, issue 409)
- Recommend Pillow instead of PIL for image handling (issue 317)
- Added debian packages for Ubuntu quantal and raring (commit 86230c0)
- Mock server (used for tests) can listen for HTTPS requests (issue 410)
- Remove multi spider support from multiple core components (issue 422, issue 421, issue 420, issue 419, issue 423, issue 418)
- Travis-CI now tests Scrapy changes against development versions of w3lib and queuelib python packages.
- Add pypy 2.1 to continuous integration tests (commit ecfa7431)
- Pylinted, pep8 and removed old-style exceptions from source (issue 430, issue 432)
- Use importlib for parametric imports (issue 445)
- Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (issue 372)
- Bugfix crawling shutdown on SIGINT (issue 450)
- Do not submit reset type inputs in FormRequest.from_response (commit b326b87)
- Do not silence download errors when request errback raises an exception (commit 684cfc0)
Bugfixes¶
- Fix tests under Django 1.6 (commit b6bed44c)
- Lot of bugfixes to retry middleware under disconnections using HTTP 1.1 download handler
- Fix inconsistencies among Twisted releases (issue 406)
- Fix scrapy shell bugs (issue 418, issue 407)
- Fix invalid variable name in setup.py (issue 429)
- Fix tutorial references (issue 387)
- Improve request-response docs (issue 391)
- Improve best practices docs (issue 399, issue 400, issue 401, issue 402)
- Improve django integration docs (issue 404)
- Document bindaddress request meta (commit 37c24e01d7)
- Improve Request class documentation (issue 226)
Other¶
- Dropped Python 2.6 support (issue 448)
- Add cssselect python package as install dependency
- Drop libxml2 and multi selector's backend support, lxml is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires mock python library (issue 390)
Thanks¶
Thanks to everyone who contributed to this release!
List of contributors sorted by number of commits:
69 Daniel Graña <dangra@...>
37 Pablo Hoffman <pablo@...>
13 Mikhail Korobov <kmike84@...>
9 Alex Cepoi <alex.cepoi@...>
9 alexanderlukanin13 <alexander.lukanin.13@...>
8 Rolando Espinoza La fuente <darkrho@...>
8 Lukasz Biedrycki <lukasz.biedrycki@...>
6 Nicolas Ramirez <nramirez.uy@...>
3 Paul Tremberth <paul.tremberth@...>
2 Martin Olveyra <molveyra@...>
2 Stefan <misc@...>
2 Rolando Espinoza <darkrho@...>
2 Loren Davie <loren@...>
2 irgmedeiros <irgmedeiros@...>
1 Stefan Koch <taikano@...>
1 Stefan <cct@...>
1 scraperdragon <dragon@...>
1 Kumara Tharmalingam <ktharmal@...>
1 Francesco Piccinno <stack.box@...>
1 Marcos Campal <duendex@...>
1 Dragon Dave <dragon@...>
1 Capi Etheriel <barraponto@...>
1 cacovsky <amarquesferraz@...>
1 Berend Iwema <berend@...>
Scrapy 0.18.4 (released 2013-10-10)¶
- IPython refuses to update the namespace. fix #396 (commit 3d32c4f)
- Fix AlreadyCalledError replacing a request in shell command. closes #407 (commit b1d8919)
- Fix start_requests laziness and early hangs (commit 89faf52)
Scrapy 0.18.3 (released 2013-10-03)¶
- fix regression on lazy evaluation of start requests (commit 12693a5)
- forms: do not submit reset inputs (commit e429f63)
- increase unittest timeouts to decrease travis false positive failures (commit 912202e)
- backport master fixes to json exporter (commit cfc2d46)
- Fix permission and set umask before generating sdist tarball (commit 06149e0)
Scrapy 0.18.2 (released 2013-09-03)¶
- Backport scrapy check command fixes and backward compatible multi crawler process (issue 339)
Scrapy 0.18.1 (released 2013-08-27)¶
- remove extra import added by cherry picked changes (commit d20304e)
- fix crawling tests under twisted pre 11.0.0 (commit 1994f38)
- py26 can not format zero length fields {} (commit abf756f)
- test PotentiaDataLoss errors on unbound responses (commit b15470d)
- Treat responses without content-length or Transfer-Encoding as good responses (commit c4bf324)
- do no include ResponseFailed if http11 handler is not enabled (commit 6cbe684)
- New HTTP client wraps connection losts in ResponseFailed exception. fix #373 (commit 1a20bba)
- limit travis-ci build matrix (commit 3b01bb8)
- Merge pull request #375 from peterarenot/patch-1 (commit fa766d7)
- Fixed so it refers to the correct folder (commit 3283809)
- added quantal & raring to support ubuntu releases (commit 1411923)
- fix retry middleware which didn't retry certain connection errors after the upgrade to http1 client, closes GH-373 (commit bb35ed0)
- fix XmlItemExporter in Python 2.7.4 and 2.7.5 (commit de3e451)
- minor updates to 0.18 release notes (commit c45e5f1)
- fix contributters list format (commit 0b60031)
Scrapy 0.18.0 (released 2013-08-09)¶
- Lot of improvements to testsuite run using Tox, including a way to test on pypi
- Handle GET parameters for AJAX crawleable urls (commit 3fe2a32)
- Use lxml recover option to parse sitemaps (issue 347)
- Bugfix cookie merging by hostname and not by netloc (issue 352)
- Support disabling HttpCompressionMiddleware using a flag setting (issue 359)
- Support xml namespaces using iternodes parser in XMLFeedSpider (issue 12)
- Support dont_cache request meta flag (issue 19)
- Bugfix scrapy.utils.gz.gunzip broken by changes in python 2.7.4 (commit 4dc76e)
- Bugfix url encoding on SgmlLinkExtractor (issue 24)
- Bugfix TakeFirst processor shouldn't discard zero (0) value (issue 59)
- Support nested items in xml exporter (issue 66)
- Improve cookies handling performance (issue 77)
- Log dupe filtered requests once (issue 105)
- Split redirection middleware into status and meta based middlewares (issue 78)
- Use HTTP1.1 as default downloader handler (issue 109 and issue 318)
- Support xpath form selection on FormRequest.from_response (issue 185)
- Bugfix unicode decoding error on SgmlLinkExtractor (issue 199)
- Bugfix signal dispatching on pypi interpreter (issue 205)
- Improve request delay and concurrency handling (issue 206)
- Add RFC2616 cache policy to HttpCacheMiddleware (issue 212)
- Allow customization of messages logged by engine (issue 214)
- Multiples improvements to DjangoItem (issue 217, issue 218, issue 221)
- Extend Scrapy commands using setuptools entry points (issue 260)
- Allow spider allowed_domains value to be set/tuple (issue 261)
- Support settings.getdict (issue 269)
- Simplify internal scrapy.core.scraper slot handling (issue 271)
- Added Item.copy (issue 290)
- Collect idle downloader slots (issue 297)
- Add ftp:// scheme downloader handler (issue 329)
- Added downloader benchmark webserver and spider tools Benchmarking
- Moved persistent (on disk) queues to a separate project (queuelib) which scrapy now depends on
- Add scrapy commands using external libraries (issue 260)
- Added --pdb option to scrapy command line tool
- Added XPathSelector.remove_namespaces() which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in セレクタ.
- Several improvements to spider contracts
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections
- MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
- added from_crawler method to spiders
- added system tests with mock server
- more improvements to Mac OS compatibility (thanks Alex Cepoi)
- several more cleanups to singletons and multi-spider support (thanks Nicolas Ramirez)
- support custom download slots
- added --spider option to "shell" command.
- log overridden settings when scrapy starts
Thanks to everyone who contributed to this release. Here is a list of contributors sorted by number of commits:
130 Pablo Hoffman <pablo@...>
97 Daniel Graña <dangra@...>
20 Nicolás Ramírez <nramirez.uy@...>
13 Mikhail Korobov <kmike84@...>
12 Pedro Faustino <pedrobandim@...>
11 Steven Almeroth <sroth77@...>
5 Rolando Espinoza La fuente <darkrho@...>
4 Michal Danilak <mimino.coder@...>
4 Alex Cepoi <alex.cepoi@...>
4 Alexandr N Zamaraev (aka tonal) <tonal@...>
3 paul <paul.tremberth@...>
3 Martin Olveyra <molveyra@...>
3 Jordi Llonch <llonchj@...>
3 arijitchakraborty <myself.arijit@...>
2 Shane Evans <shane.evans@...>
2 joehillen <joehillen@...>
2 Hart <HartSimha@...>
2 Dan <ellisd23@...>
1 Zuhao Wan <wanzuhao@...>
1 whodatninja <blake@...>
1 vkrest <v.krestiannykov@...>
1 tpeng <pengtaoo@...>
1 Tom Mortimer-Jones <tom@...>
1 Rocio Aramberri <roschegel@...>
1 Pedro <pedro@...>
1 notsobad <wangxiaohugg@...>
1 Natan L <kuyanatan.nlao@...>
1 Mark Grey <mark.grey@...>
1 Luan <luanpab@...>
1 Libor Nenadál <libor.nenadal@...>
1 Juan M Uys <opyate@...>
1 Jonas Brunsgaard <jonas.brunsgaard@...>
1 Ilya Baryshev <baryshev@...>
1 Hasnain Lakhani <m.hasnain.lakhani@...>
1 Emanuel Schorsch <emschorsch@...>
1 Chris Tilden <chris.tilden@...>
1 Capi Etheriel <barraponto@...>
1 cacovsky <amarquesferraz@...>
1 Berend Iwema <berend@...>
Scrapy 0.16.5 (released 2013-05-30)¶
- obey request method when scrapy deploy is redirected to a new endpoint (commit 8c4fcee)
- fix inaccurate downloader middleware documentation. refs #280 (commit 40667cb)
- doc: remove links to diveintopython.org, which is no longer available. closes #246 (commit bd58bfa)
- Find form nodes in invalid html5 documents (commit e3d6945)
- Fix typo labeling attrs type bool instead of list (commit a274276)
Scrapy 0.16.4 (released 2013-01-23)¶
- fixes spelling errors in documentation (commit 6d2b3aa)
- add doc about disabling an extension. refs #132 (commit c90de33)
- Fixed error message formatting. log.err() doesn't support cool formatting and when error occurred, the message was: "ERROR: Error processing %(item)s" (commit c16150c)
- lint and improve images pipeline error logging (commit 56b45fc)
- fixed doc typos (commit 243be84)
- add documentation topics: Broad Crawls & Common Practies (commit 1fbb715)
- fix bug in scrapy parse command when spider is not specified explicitly. closes #209 (commit c72e682)
- Update docs/topics/commands.rst (commit 28eac7a)
Scrapy 0.16.3 (released 2012-12-07)¶
- Remove concurrency limitation when using download delays and still ensure inter-request delays are enforced (commit 487b9b5)
- add error details when image pipeline fails (commit 8232569)
- improve mac os compatibility (commit 8dcf8aa)
- setup.py: use README.rst to populate long_description (commit 7b5310d)
- doc: removed obsolete references to ClientForm (commit 80f9bb6)
- correct docs for default storage backend (commit 2aa491b)
- doc: removed broken proxyhub link from FAQ (commit bdf61c4)
- Fixed docs typo in SpiderOpenCloseLogging example (commit 7184094)
Scrapy 0.16.2 (released 2012-11-09)¶
- scrapy contracts: python2.6 compat (commit a4a9199)
- scrapy contracts verbose option (commit ec41673)
- proper unittest-like output for scrapy contracts (commit 86635e4)
- added open_in_browser to debugging doc (commit c9b690d)
- removed reference to global scrapy stats from settings doc (commit dd55067)
- Fix SpiderState bug in Windows platforms (commit 58998f4)
Scrapy 0.16.1 (released 2012-10-26)¶
- fixed LogStats extension, which got broken after a wrong merge before the 0.16 release (commit 8c780fd)
- better backwards compatibility for scrapy.conf.settings (commit 3403089)
- extended documentation on how to access crawler stats from extensions (commit c4da0b5)
- removed .hgtags (no longer needed now that scrapy uses git) (commit d52c188)
- fix dashes under rst headers (commit fa4f7f9)
- set release date for 0.16.0 in news (commit e292246)
Scrapy 0.16.0 (released 2012-10-18)¶
Scrapy changes:
- added Spiders Contracts, a mechanism for testing spiders in a formal/reproducible way
- added options -o and -t to the runspider command
- documented AutoThrottle extension and added to extensions installed by default. You still need to enable it with AUTOTHROTTLE_ENABLED
- major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (stats_spider_opened, etc). Stats are much simpler now, backwards compatibility is kept on the Stats Collector API and signals.
- added process_start_requests() method to spider middlewares
- dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
- dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
- documented Core API
- lxml is now the default selectors backend instead of libxml2
- ported FormRequest.from_response() to use lxml instead of ClientForm
- removed modules: scrapy.xlib.BeautifulSoup and scrapy.xlib.ClientForm
- SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (commit 10ed28b)
- StackTraceDump extension: also dump trackref live references (commit fe2ce93)
- nested items now fully supported in JSON and JSONLines exporters
- added
cookiejar
Request meta key to support multiple cookie sessions per spider - decoupled encoding detection code to w3lib.encoding, and ported Scrapy code to use that module
- dropped support for Python 2.5. See https://blog.scrapinghub.com/2012/02/27/scrapy-0-15-dropping-support-for-python-2-5/
- dropped support for Twisted 2.5
- added
REFERER_ENABLED
setting, to control referer middleware - changed default user agent to:
Scrapy/VERSION (+http://scrapy.org)
- removed (undocumented)
HTMLImageLinkExtractor
class fromscrapy.contrib.linkextractors.image
- removed per-spider settings (to be replaced by instantiating multiple crawler objects)
USER_AGENT
spider attribute will no longer work, useuser_agent
attribute insteadDOWNLOAD_TIMEOUT
spider attribute will no longer work, usedownload_timeout
attribute instead- removed
ENCODING_ALIASES
setting, as encoding auto-detection has been moved to the w3lib library - promoted DjangoItem to main contrib
- LogFormatter method now return dicts(instead of strings) to support lazy formatting (issue 164, commit dcef7b0)
- downloader handlers (
DOWNLOAD_HANDLERS
setting) now receive settings as the first argument of the constructor - replaced memory usage acounting with (more portable) resource module, removed
scrapy.utils.memory
module - removed signal:
scrapy.mail.mail_sent
- removed
TRACK_REFS
setting, now trackrefs is always enabled - DBM is now the default storage backend for HTTP cache middleware
- number of log messages (per level) are now tracked through Scrapy stats (stat name:
log_count/LEVEL
) - number received responses are now tracked through Scrapy stats (stat name:
response_received_count
) - removed
scrapy.log.started
attribute
Scrapy 0.14.4¶
- added precise to supported ubuntu distros (commit b7e46df)
- fixed bug in json-rpc webservice reported in https://groups.google.com/forum/#!topic/scrapy-users/qgVBmFybNAQ/discussion. also removed no longer supported 'run' command from extras/scrapy-ws.py (commit 340fbdb)
- meta tag attributes for content-type http equiv can be in any order. #123 (commit 0cb68af)
- replace "import Image" by more standard "from PIL import Image". closes #88 (commit 4d17048)
- return trial status as bin/runtests.sh exit value. #118 (commit b7b2e7f)
Scrapy 0.14.3¶
- forgot to include pydispatch license. #118 (commit fd85f9c)
- include egg files used by testsuite in source distribution. #118 (commit c897793)
- update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 (commit 2548dcc)
- added note to docs/topics/firebug.rst about google directory being shut down (commit 668e352)
- dont discard slot when empty, just save in another dict in order to recycle if needed again. (commit 8e9f607)
- do not fail handling unicode xpaths in libxml2 backed selectors (commit b830e95)
- fixed minor mistake in Request objects documentation (commit bf3c9ee)
- fixed minor defect in link extractors documentation (commit ba14f38)
- removed some obsolete remaining code related to sqlite support in scrapy (commit 0665175)
Scrapy 0.14.2¶
- move buffer pointing to start of file before computing checksum. refs #92 (commit 6a5bef2)
- Compute image checksum before persisting images. closes #92 (commit 9817df1)
- remove leaking references in cached failures (commit 673a120)
- fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) (commit 11133e9)
- fixed struct.error on http compression middleware. closes #87 (commit 1423140)
- ajax crawling wasn't expanding for unicode urls (commit 0de3fb4)
- Catch start_requests iterator errors. refs #83 (commit 454a21d)
- Speed-up libxml2 XPathSelector (commit 2fbd662)
- updated versioning doc according to recent changes (commit 0a070f5)
- scrapyd: fixed documentation link (commit 2b4e4c3)
- extras/makedeb.py: no longer obtaining version from git (commit caffe0e)
Scrapy 0.14.1¶
- extras/makedeb.py: no longer obtaining version from git (commit caffe0e)
- bumped version to 0.14.1 (commit 6cb9e1c)
- fixed reference to tutorial directory (commit 4b86bd6)
- doc: removed duplicated callback argument from Request.replace() (commit 1aeccdd)
- fixed formatting of scrapyd doc (commit 8bf19e6)
- Dump stacks for all running threads and fix engine status dumped by StackTraceDump extension (commit 14a8e6e)
- added comment about why we disable ssl on boto images upload (commit 5223575)
- SSL handshaking hangs when doing too many parallel connections to S3 (commit 63d583d)
- change tutorial to follow changes on dmoz site (commit bcb3198)
- Avoid _disconnectedDeferred AttributeError exception in Twisted>=11.1.0 (commit 98f3f87)
- allow spider to set autothrottle max concurrency (commit 175a4b5)
Scrapy 0.14¶
New features and settings¶
- Support for AJAX crawleable urls
- New persistent scheduler that stores requests on disk, allowing to suspend and resume crawls (r2737)
- added -o option to scrapy crawl, a shortcut for dumping scraped items into a file (or standard output using -)
- Added support for passing custom settings to Scrapyd schedule.json api (r2779, r2783)
- New ChunkedTransferMiddleware (enabled by default) to support chunked transfer encoding (r2769)
- Add boto 2.0 support for S3 downloader handler (r2763)
- Added marshal to formats supported by feed exports (r2744)
- In request errbacks, offending requests are now received in failure.request attribute (r2738)
- Big downloader refactoring to support per domain/ip concurrency limits (r2732)
- The CONCURRENT_REQUESTS_PER_SPIDER setting has been deprecated and replaced by new per-domain/per-IP concurrency settings; check the documentation for more details
- Added builtin caching DNS resolver (r2728)
- Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: [scaws](https://github.com/scrapinghub/scaws) (r2706, r2714)
- Moved spider queues to scrapyd: scrapy.spiderqueue -> scrapyd.spiderqueue (r2708)
- Moved sqlite utils to scrapyd: scrapy.utils.sqlite -> scrapyd.sqlite (r2781)
- Real support for returning iterators on start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (r2704)
- Added REDIRECT_ENABLED setting to quickly enable/disable the redirect middleware (r2697)
- Added RETRY_ENABLED setting to quickly enable/disable the retry middleware (r2694)
- Added CloseSpider exception to manually close spiders (r2691)
- Improved encoding detection by adding support for HTML5 meta charset declaration (r2690)
- Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (r2688)
- Added SitemapSpider (see documentation in Spiders page) (r2658)
- Added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) (r2657)
- Simplified !MemoryDebugger extension to use stats for dumping memory debugging info (r2639)
- Added new command to edit spiders:
scrapy edit
(r2636) and -e flag to genspider command that uses it (r2653) - Changed default representation of items to pretty-printed dicts. (r2631). This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines.
- Added
spider_error
signal (r2628) - Added
COOKIES_ENABLED
setting (r2625) - Stats are now dumped to Scrapy log (default value of
STATS_DUMP
setting has been changed to True). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there. - Added support for dynamically adjusting download delay and maximum concurrent requests (r2599)
- Added new DBM HTTP cache storage backend (r2576)
- Added
listjobs.json
API to Scrapyd (r2571) CsvItemExporter
: addedjoin_multivalued
parameter (r2578)- Added namespace support to
xmliter_lxml
(r2552) - Improved cookies middleware by making COOKIES_DEBUG nicer and documenting it (r2579)
- Several improvements to Scrapyd and Link extractors
Code rearranged and removed¶
- Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: (r2630)
- original item_scraped signal was removed
- original item_passed signal was renamed to item_scraped
- old log lines
Scraped Item...
were removed - old log lines
Passed Item...
were renamed toScraped Item...
lines and downgraded toDEBUG
level
- Removed unused function: scrapy.utils.request.request_info() (r2577)
- Removed googledir project from examples/googledir. There's now a new example project called dirbot available on github: https://github.com/scrapy/dirbot
- Removed support for default field values in Scrapy items (r2616)
- Removed experimental crawlspider v2 (r2632)
- Removed scheduler middleware to simplify architecture. Duplicates filter is now done in the scheduler itself, using the same dupe filtering class as before (DUPEFILTER_CLASS setting) (r2640)
- Removed support for passing urls to
scrapy crawl
command (usescrapy parse
instead) (r2704) - Removed deprecated Execution Queue (r2704)
- Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (r2780)
- removed
CONCURRENT_SPIDERS
setting (use scrapyd maxproc instead) (r2789) - Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (r2717, r2718)
- Renamed setting
CLOSESPIDER_ITEMPASSED
toCLOSESPIDER_ITEMCOUNT
(r2655). Backwards compatibility kept.
Scrapy 0.12¶
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features and improvements¶
- Passed item is now sent in the item argument of the item_passed (#273)
- Added verbose option to scrapy version command, useful for bug reports (#298)
command, useful for bug reports (#298) - HTTP cache now stored by default in the project data dir (#279)
- Added project data storage directory (#276, #277)
- Documented file structure of Scrapy projects (see command-line tool doc)
- New lxml backend for XPath selectors (#147)
- Per-spider settings (#245)
- Support exit codes to signal errors in Scrapy commands (#248)
- Added -c argument to scrapy shell command
- Made libxml2 optional (#260)
- New deploy command (#261)
- Added CLOSESPIDER_PAGECOUNT setting (#253)
- Added CLOSESPIDER_ERRORCOUNT setting (#254)
Scrapyd changes¶
- Scrapyd now uses one process per spider
- It stores one log file per spider run, and rotates them, keeping the latest 5 logs per spider (by default)
- A minimal web ui was added, available at http://localhost:6800 by default
- There is now a scrapy server command to start a Scrapyd server of the current project
Changes to settings¶
- added HTTPCACHE_ENABLED setting (False by default) to enable HTTP cache middleware
- changed HTTPCACHE_EXPIRATION_SECS semantics: now zero means "never expire".
Deprecated/obsoleted functionality¶
- Deprecated runserver command in favor of server command which starts a Scrapyd server. See also: Scrapyd changes
- Deprecated queue command in favor of using Scrapyd schedule.json API. See also: Scrapyd changes
- Removed the !LxmlItemLoader (experimental contrib which never graduated to main contrib)
Scrapy 0.10¶
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features and improvements¶
- New Scrapy service called
scrapyd
for deploying Scrapy crawlers in production (#218) (documentation available) - Simplified Images pipeline usage which doesn't require subclassing your own images pipeline now (#217)
- Scrapy shell now shows the Scrapy log by default (#206)
- Refactored execution queue in a common base code and pluggable backends called "spider queues" (#220)
- New persistent spider queue (based on SQLite) (#198), available by default, which allows to start Scrapy in server mode and then schedule spiders to run.
- Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation available)
- Feed exporters with pluggable backends (#197) (documentation available)
- Deferred signals (#193)
- Added two new methods to item pipeline open_spider(), close_spider() with deferred support (#195)
- Support for overriding default request headers per spider (#181)
- Replaced default Spider Manager with one with similar functionality but not depending on Twisted Plugins (#186)
- Splitted Debian package into two packages - the library and the service (#187)
- Scrapy log refactoring (#188)
- New extension for keeping persistent spider contexts among different runs (#203)
- Added dont_redirect request.meta key for avoiding redirects (#233)
- Added dont_retry request.meta key for avoiding retries (#234)
Command-line tool changes¶
- New scrapy command which replaces the old scrapy-ctl.py (#199) - there is only one global scrapy command now, instead of one scrapy-ctl.py per project
- Added scrapy.bat script for running more conveniently from Windows
- Added bash completion to command-line tool (#210)
- Renamed command start to runserver (#209)
API changes¶
url
andbody
attributes of Request objects are now read-only (#230)Request.copy()
andRequest.replace()
now also copies theircallback
anderrback
attributes (#231)- Removed
UrlFilterMiddleware
fromscrapy.contrib
(already disabled by default) - Offsite middelware doesn't filter out any request coming from a spider that doesn't have a allowed_domains attribute (#225)
- Removed Spider Manager
load()
method. Now spiders are loaded in the constructor itself. - Changes to Scrapy Manager (now called "Crawler"):
scrapy.core.manager.ScrapyManager
class renamed toscrapy.crawler.Crawler
scrapy.core.manager.scrapymanager
singleton moved toscrapy.project.crawler
- Moved module:
scrapy.contrib.spidermanager
toscrapy.spidermanager
- Spider Manager singleton moved from
scrapy.spider.spiders
to thespiders` attribute of ``scrapy.project.crawler
singleton. - moved Stats Collector classes: (#204)
scrapy.stats.collector.StatsCollector
toscrapy.statscol.StatsCollector
scrapy.stats.collector.SimpledbStatsCollector
toscrapy.contrib.statscol.SimpledbStatsCollector
- default per-command settings are now specified in the
default_settings
attribute of command object class (#201) - changed arguments of Item pipeline
process_item()
method from(spider, item)
to(item, spider)
- backwards compatibility kept (with deprecation warning)
- changed arguments of Item pipeline
- moved
scrapy.core.signals
module toscrapy.signals
- backwards compatibility kept (with deprecation warning)
- moved
- moved
scrapy.core.exceptions
module toscrapy.exceptions
- backwards compatibility kept (with deprecation warning)
- moved
- added
handles_request()
class method toBaseSpider
- dropped
scrapy.log.exc()
function (usescrapy.log.err()
instead) - dropped
component
argument ofscrapy.log.msg()
function - dropped
scrapy.log.log_level
attribute - Added
from_settings()
class methods to Spider Manager, and Item Pipeline Manager
Changes to settings¶
- Added HTTPCACHE_IGNORE_SCHEMES setting to ignore certain schemes on !HttpCacheMiddleware (#225)
- Added SPIDER_QUEUE_CLASS setting which defines the spider queue to use (#220)
- Added KEEP_ALIVE setting (#220)
- Removed SERVICE_QUEUE setting (#220)
- Removed COMMANDS_SETTINGS_MODULE setting (#201)
- Renamed REQUEST_HANDLERS to DOWNLOAD_HANDLERS and make download handlers classes (instead of functions)
Scrapy 0.9¶
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features and improvements¶
- Added SMTP-AUTH support to scrapy.mail
- New settings added:
MAIL_USER
,MAIL_PASS
(r2065 | #149) - Added new scrapy-ctl view command - To view URL in the browser, as seen by Scrapy (r2039)
- Added web service for controlling Scrapy process (this also deprecates the web console. (r2053 | #167)
- Support for running Scrapy as a service, for production systems (r1988, r2054, r2055, r2056, r2057 | #168)
- Added wrapper induction library (documentation only available in source code for now). (r2011)
- Simplified and improved response encoding support (r1961, r1969)
- Added
LOG_ENCODING
setting (r1956, documentation available) - Added
RANDOMIZE_DOWNLOAD_DELAY
setting (enabled by default) (r1923, doc available) MailSender
is no longer IO-blocking (r1955 | #146)- Linkextractors and new Crawlspider now handle relative base tag urls (r1960 | #148)
- Several improvements to Item Loaders and processors (r2022, r2023, r2024, r2025, r2026, r2027, r2028, r2029, r2030)
- Added support for adding variables to telnet console (r2047 | #165)
- Support for requests without callbacks (r2050 | #166)
API changes¶
- Change
Spider.domain_name
toSpider.name
(SEP-012, r1975) Response.encoding
is now the detected encoding (r1961)HttpErrorMiddleware
now returns None or raises an exception (r2006 | #157)scrapy.command
modules relocation (r2035, r2036, r2037)- Added
ExecutionQueue
for feeding spiders to scrape (r2034) - Removed
ExecutionEngine
singleton (r2039) - Ported
S3ImagesStore
(images pipeline) to use boto and threads (r2033) - Moved module:
scrapy.management.telnet
toscrapy.telnet
(r2047)
Scrapy 0.8¶
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features¶
- Added DEFAULT_RESPONSE_ENCODING setting (r1809)
- Added
dont_click
argument toFormRequest.from_response()
method (r1813, r1816) - Added
clickdata
argument toFormRequest.from_response()
method (r1802, r1803) - Added support for HTTP proxies (
HttpProxyMiddleware
) (r1781, r1785) - Offsite spider middleware now logs messages when filtering out requests (r1841)
Backwards-incompatible changes¶
- Changed
scrapy.utils.response.get_meta_refresh()
signature (r1804) - Removed deprecated
scrapy.item.ScrapedItem
class - usescrapy.item.Item instead
(r1838) - Removed deprecated
scrapy.xpath
module - usescrapy.selector
instead. (r1836) - Removed deprecated
core.signals.domain_open
signal - usecore.signals.domain_opened
instead (r1822) log.msg()
now receives aspider
argument (r1822)- Old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the
spider
argument and pass spider references. If you really want to pass a string, use thecomponent
argument instead.
- Old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the
- Changed core signals
domain_opened
,domain_closed
,domain_idle
- Changed Item pipeline to use spiders instead of domains
- The
domain
argument ofprocess_item()
item pipeline method was changed tospider
, the new signature is:process_item(spider, item)
(r1827 | #105) - To quickly port your code (to work with Scrapy 0.8) just use
spider.domain_name
where you previously useddomain
.
- The
- Changed Stats API to use spiders instead of domains (r1849 | #113)
StatsCollector
was changed to receive spider references (instead of domains) in its methods (set_value
,inc_value
, etc).- added
StatsCollector.iter_spider_stats()
method - removed
StatsCollector.list_domains()
method - Also, Stats signals were renamed and now pass around spider references (instead of domains). Here's a summary of the changes:
- To quickly port your code (to work with Scrapy 0.8) just use
spider.domain_name
where you previously useddomain
.spider_stats
contains exactly the same data asdomain_stats
.
CloseDomain
extension moved toscrapy.contrib.closespider.CloseSpider
(r1833)- Its settings were also renamed:
CLOSEDOMAIN_TIMEOUT
toCLOSESPIDER_TIMEOUT
CLOSEDOMAIN_ITEMCOUNT
toCLOSESPIDER_ITEMCOUNT
- Removed deprecated
SCRAPYSETTINGS_MODULE
environment variable - useSCRAPY_SETTINGS_MODULE
instead (r1840) - Renamed setting:
REQUESTS_PER_DOMAIN
toCONCURRENT_REQUESTS_PER_SPIDER
(r1830, r1844) - Renamed setting:
CONCURRENT_DOMAINS
toCONCURRENT_SPIDERS
(r1830) - Refactored HTTP Cache middleware
- HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization, which was removed. (r1843)
- Renamed exception:
DontCloseDomain
toDontCloseSpider
(r1859 | #120) - Renamed extension:
DelayedCloseDomain
toSpiderCloseDelay
(r1861 | #121) - Removed obsolete
scrapy.utils.markup.remove_escape_chars
function - usescrapy.utils.markup.replace_escape_chars
instead (r1865)
Scrapy 0.7¶
First release of Scrapy.
Contributing to Scrapy¶
重要
Double check that you are reading the most recent version of this document at https://docs.scrapy.org/en/master/contributing.html
There are many ways to contribute to Scrapy. Here are some of them:
- Blog about Scrapy. Tell the world how you're using Scrapy. This will help newcomers with more examples and will help the Scrapy project to increase its visibility.
- Report bugs and request features in the issue tracker, trying to follow the guidelines detailed in Reporting bugs below.
- Submit patches for new functionalities and/or bug fixes. Please read Writing patches and Submitting patches below for details on how to write and submit a patch.
- Join the Scrapy subreddit and share your ideas on how to improve Scrapy. We're always open to suggestions.
- Answer Scrapy questions at Stack Overflow.
Reporting bugs¶
注釈
Please report security issues only to scrapy-security@googlegroups.com. This is a private list only open to trusted Scrapy developers, and its archives are not public.
Well-written bug reports are very helpful, so keep in mind the following guidelines when you're going to report a new bug.
- check the FAQ first to see if your issue is addressed in a well-known question
- if you have a general question about scrapy usage, please ask it at Stack Overflow (use "scrapy" tag).
- check the open issues to see if the issue has already been reported. If it has, don't dismiss the report, but check the ticket history and comments. If you have additional useful information, please leave a comment, or consider sending a pull request with a fix.
- search the scrapy-users list and Scrapy subreddit to see if it has been discussed there, or if you're not sure if what you're seeing is a bug. You can also ask in the #scrapy IRC channel.
- write complete, reproducible, specific bug reports. The smaller the test case, the better. Remember that other developers won't have your project to reproduce the bug, so please include all relevant files required to reproduce it. See for example StackOverflow's guide on creating a Minimal, Complete, and Verifiable example exhibiting the issue.
- the most awesome way to provide a complete reproducible example is to send a pull request which adds a failing test case to the Scrapy testing suite (see Submitting patches). This is helpful even if you don't have an intention to fix the issue yourselves.
- include the output of
scrapy version -v
so developers working on your bug know exactly which version and platform it occurred on, which is often very helpful for reproducing it, or knowing if it was already fixed.
Writing patches¶
The better a patch is written, the higher the chances that it'll get accepted and the sooner it will be merged.
Well-written patches should:
- contain the minimum amount of code required for the specific change. Small patches are easier to review and merge. So, if you're doing more than one change (or bug fix), please consider submitting one patch per change. Do not collapse multiple changes into a single patch. For big changes consider using a patch queue.
- pass all unit-tests. See Running tests below.
- include one (or more) test cases that check the bug fixed or the new functionality added. See Writing tests below.
- if you're adding or changing a public (documented) API, please include the documentation changes in the same patch. See Documentation policies below.
Submitting patches¶
The best way to submit a patch is to issue a pull request on GitHub, optionally creating a new issue first.
Remember to explain what was fixed or what the new functionality is (what it does, why it's needed, etc). The more info you include, the easier it will be for core developers to understand and accept your patch.
You can also discuss the new functionality (or bug fix) before creating the patch, but it's always good to have a patch ready to illustrate your arguments and show that you have put some additional thought into the subject. A good starting point is to send a pull request on GitHub. It can be simple enough to illustrate your idea, and leave documentation/tests for later, after the idea has been validated and proven useful. Alternatively, you can start a conversation in the Scrapy subreddit to discuss your idea first.
Sometimes there is an existing pull request for the problem you'd like to solve, which is stalled for some reason. Often the pull request is in the right direction, but changes are requested by Scrapy maintainers, and the original pull request author hasn't had time to address them. In this case consider picking up this pull request: open a new pull request with all commits from the original pull request, as well as additional changes to address the raised issues. Doing so helps a lot; it is not considered rude as long as the original author is acknowledged by keeping his/her commits.
You can pull an existing pull request to a local branch by running:
git fetch upstream pull/$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE
(replace 'upstream' with a remote name for the scrapy repository, `$PR_NUMBER` with the ID of the pull request, and `$BRANCH_NAME_TO_CREATE` with the name of the branch you want to create locally).
See also: https://help.github.com/articles/checking-out-pull-requests-locally/#modifying-an-inactive-pull-request-locally.
When writing GitHub pull requests, try to keep titles short but descriptive. For example, for bug #411 ("Scrapy hangs if an exception raises in start_requests"), prefer "Fix hanging when exception occurs in start_requests (#411)" over "Fix for #411". Complete titles make it easy to skim through the issue tracker.
Finally, try to keep aesthetic changes (PEP 8 compliance, unused imports removal, etc) in separate commits from functional changes. This will make pull requests easier to review and more likely to get merged.
Coding style¶
Please follow these coding conventions when writing code for inclusion in Scrapy:
- Unless otherwise specified, follow PEP 8.
- It's OK to use lines longer than 80 chars if it improves the code readability.
- Don't put your name in the code you contribute; git provides enough metadata to identify the author of the code. See https://help.github.com/articles/setting-your-username-in-git/ for setup instructions.
Documentation policies¶
- Don't use docstrings for documenting classes or methods which are already documented in the official (sphinx) documentation. Alternatively, do provide a docstring, but make sure the sphinx documentation uses the autodoc extension to pull the docstring. For example, the `ItemLoader.add_value()` method should be either documented only in the sphinx documentation (not as a docstring), or it should have a docstring which is pulled into the sphinx documentation using the autodoc extension.
- Do use docstrings for documenting functions not present in the official (sphinx) documentation, such as functions from the `scrapy.utils` package and its sub-modules.
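As a sketch of the second rule, a hypothetical `scrapy.utils`-style helper (the name and behaviour below are invented purely for illustration) would carry its own docstring, since it is not covered by the sphinx documentation:

import re


def remove_html_comments(text):
    """Remove HTML comments from ``text`` and return the cleaned string.

    Hypothetical helper used only to illustrate the docstring policy:
    it is not documented in sphinx, so the docstring lives here.
    """
    return re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)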
Tests¶
Tests are implemented using the Twisted unit-testing framework. Running the tests requires tox.
Running tests¶
Make sure you have a recent enough tox installation:
tox --version
If your version is older than 1.7.0, please update it first:
pip install -U tox
To run all tests go to the root directory of Scrapy source code and run:
tox
To run a specific test (say `tests/test_loader.py`) use:
tox -- tests/test_loader.py
To see the coverage report, install coverage (`pip install coverage`) and run:
coverage report
See the output of `coverage --help` for more options like html or xml report.
Writing tests¶
All functionality (including new features and bug fixes) must include a test case to check that it works as expected, so please include tests for your patches if you want them to get accepted sooner.
Scrapy uses unit-tests, which are located in the tests/ directory. Their module name typically resembles the full path of the module they're testing. For example, the item loaders code is in:
scrapy.loader
And their unit-tests are in:
tests/test_loader.py
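For instance, a minimal test for the item loaders could look like the sketch below. The item class and the expected value are illustrative assumptions (with the default processors, `ItemLoader` collects values into lists):

import unittest

from scrapy import Field, Item
from scrapy.loader import ItemLoader


class NameItem(Item):
    # Hypothetical item defined only for this example.
    name = Field()


class ItemLoaderSmokeTest(unittest.TestCase):
    def test_add_value(self):
        loader = ItemLoader(item=NameItem())
        loader.add_value('name', 'example')
        item = loader.load_item()
        # The default output processor keeps the collected list as-is.
        self.assertEqual(item['name'], ['example'])


if __name__ == '__main__':
    unittest.main()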
Versioning and API Stability¶
Versioning¶
There are 3 numbers in a Scrapy version: A.B.C
- A is the major version. This will rarely change and will signify very large changes.
- B is the release number. This will include many changes, including features and things that possibly break backwards compatibility, although we strive to keep these cases to a minimum.
- C is the bugfix release number.
Backward-incompatibilities are explicitly mentioned in the release notes, and may require special attention before upgrading.
Development releases do not follow the 3-number versioning and are generally released as `dev`-suffixed versions, e.g. `1.3dev`.
Note
With Scrapy 0.* series, Scrapy used odd-numbered versions for development releases. This is not the case anymore from Scrapy 1.0 onwards.
Starting with Scrapy 1.0, all releases should be considered production-ready.
For example:
- 1.1.1 is the first bugfix release of the 1.1 series (safe to use in production)
API Stability¶
API stability was one of the major goals for the 1.0 release.
Methods or functions that start with a single underscore (`_`) are private and should never be relied upon as stable.
Also, keep in mind that stable doesn't mean complete: stable APIs could grow new methods or functionality but the existing methods should keep working the same way.
- Release notes
- See what has changed in recent Scrapy versions.
- Contributing to Scrapy
- Learn how to contribute to the Scrapy project.
- Versioning and API Stability
- Understand Scrapy versioning and API stability.