Crawler Step 2 Documentation
User Guide
Installation
At the command line:
easy_install crawler
Or, if you have pip installed:
pip install crawler
Support
The easiest way to get help with the project is to join the #crawler channel on Freenode.
We hang out there and you can get real-time help with your projects.
Another good way is to open an issue on GitHub.
A mailing list is also available for support.
Cookbook
Crawl a web page
The simplest way to use the program is to pass only a URL, with no other options. Simply run:
python main.py -u <url>
to crawl a webpage.
Crawl a page slowly
To add a delay to your crawler, use -d:
python main.py -d 10 -u <url>
This will wait 10 seconds between page fetches.
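As a rough illustration of what a delay between fetches means, the sketch below pauses before every fetch except the first. It is not Crawler's actual implementation: `crawl_with_delay`, the `fetch` callback, and the injectable `sleep` parameter are hypothetical names introduced only for this example.

```python
import time


def crawl_with_delay(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between fetches.

    Hypothetical helper for illustration; not part of Crawler's API.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every fetch except the first one.
            sleep(delay)
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the pause be replaced with a stub so the behavior can be checked without actually waiting.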
Crawl only your blog
You will want to use the -i flag, which will ignore URLs matching the passed regex:
python main.py -i "^blog" -u <url>
Because -i skips URLs that match the regex, the pattern should match the pages you want to exclude, so that only your blog URLs are crawled.
Programmer Reference
Command Line Options
These flags allow you to change the behavior of Crawler. Check out how to use them in the Cookbook.
-d <sec>, --delay <sec>
Use a delay between page fetches so we don’t overwhelm the remote server. Value in seconds.
Default: 1 second
-i <regex>, --ignore <regex>
Ignore pages that match a specific pattern.
Default: None
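To make the flags and their defaults concrete, here is a minimal sketch of how options like these could be declared with Python's standard `argparse` module. It is an assumption about structure, not Crawler's real `main.py`: `build_parser` is a hypothetical name, and since no long form is documented for `-u`, only the short flag is defined.

```python
import argparse


def build_parser():
    """Build a parser for the flags documented above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # The URL to crawl; only the short form -u appears in the Cookbook.
    parser.add_argument("-u", dest="url", required=True,
                        help="URL to start crawling from")
    # Delay between page fetches, in seconds (documented default: 1).
    parser.add_argument("-d", "--delay", type=float, default=1, metavar="<sec>",
                        help="seconds to wait between page fetches")
    # Regex for URLs to skip (documented default: None).
    parser.add_argument("-i", "--ignore", default=None, metavar="<regex>",
                        help="ignore pages whose URL matches this pattern")
    return parser


args = build_parser().parse_args(["-d", "10", "-u", "http://example.com/"])
print(args.delay, args.ignore, args.url)
```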
Crawler Python API
Getting started with Crawler is easy. The main class you need to care about is
crawler.utils.should_ignore(ignore_list, url)
Returns True if the URL should be ignored.
Parameters:
- ignore_list – The list of regexes to ignore.
- url – The fully qualified URL to compare against.
>>> should_ignore(['blog/$'], 'http://ericholscher.com/blog/')
True
>>> should_ignore(['home'], 'http://ericholscher.com/blog/')
False
>>> log('http://ericholscher.com/blog/', 200)
OK: 200 http://ericholscher.com/blog/
>>> log('http://ericholscher.com/blog/', 500)
ERR: 500 http://ericholscher.com/blog/
The other directive is testcode:
log('http://ericholscher.com/blog/', 500)
That requires a separate testoutput:
ERR: 500 http://ericholscher.com/blog/
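The behavior shown in the examples above can be reproduced with a short sketch. This is one possible implementation that is consistent with the documented outputs, not necessarily the project's own code. It assumes that patterns are matched anywhere in the URL with `re.search`, and that only status 200 is reported as OK.

```python
import re


def should_ignore(ignore_list, url):
    """Return True if `url` matches any regex in `ignore_list`."""
    # Assumption: re.search (match anywhere in the URL), which is what
    # the 'blog/$' example needs to return True.
    return any(re.search(pattern, url) for pattern in ignore_list)


def log(url, status):
    """Print a one-line summary of a fetch in the format shown above."""
    # Assumption: only HTTP 200 counts as OK; the real rule may differ.
    prefix = "OK" if status == 200 else "ERR"
    print(f"{prefix}: {status} {url}")
```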