Crawl Frontier 0.2.0 documentation¶
This documentation contains everything you need to know about Crawl Frontier.
First steps¶
Crawl Frontier at a glance¶
Crawl Frontier is an application framework that is meant to be used as part of a crawling system, allowing you to easily manage and define the tasks related to a crawling frontier.
Even though it was originally designed for Scrapy, it can also be used with any other crawling framework/system, since it offers generic frontier functionality.
The purpose of this document is to introduce you to the concepts behind Crawl Frontier so that you can get an idea of how it works and to decide if it is suited to your needs.
1. Create your crawler¶
Create your Scrapy project as you usually do. Enter a directory where you’d like to store your code and then run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
scrapy.cfg
tutorial/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
...
These are basically:
- scrapy.cfg: the project configuration file
- tutorial/: the project’s python module, you’ll later import your code from here.
- tutorial/items.py: the project’s items file.
- tutorial/pipelines.py: the project’s pipelines file.
- tutorial/settings.py: the project’s settings file.
- tutorial/spiders/: a directory where you’ll later put your spiders.
2. Integrate your crawler with the frontier¶
Add the Scrapy Crawl Frontier middlewares to your settings:
SPIDER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierDownloaderMiddleware': 1000,
})
Create a Crawl Frontier settings.py file and point your Scrapy settings to it:
FRONTIER_SETTINGS = 'tutorial/frontier/settings.py'
3. Choose your backend¶
Configure frontier settings to use a built-in backend like in-memory BFS:
BACKEND = 'crawlfrontier.contrib.backends.memory.heapq.BFS'
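For reference, the tutorial/frontier/settings.py file referenced in step 2 could be as small as the sketch below; BACKEND is the important line, and the other values shown are illustrative (they correspond to settings documented later in this documentation).
# tutorial/frontier/settings.py -- a minimal sketch
BACKEND = 'crawlfrontier.contrib.backends.memory.heapq.BFS'
MAX_REQUESTS = 0        # 0 = never stop on a request count
MAX_NEXT_REQUESTS = 10  # illustrative batch size for get_next_requests
TEST_MODE = False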
4. Run the spider¶
Run your Scrapy spider as usual from the command line:
scrapy crawl myspider
And that’s it! Your spider is now running, integrated with Crawl Frontier.
What else?¶
You’ve seen a simple example of how to use Crawl Frontier with Scrapy, but this is just the surface. Crawl Frontier provides many powerful features for making Frontier management easy and efficient, such as:
- Easy built-in integration with Scrapy and any other crawler through its API.
- Creating different crawling logic/policies by defining your own backend.
- Built-in support for database storage for crawled pages.
- Support for extending Crawl Frontier by plugging your own functionality using middlewares.
- Built-in middlewares for:
- Extracting domain info from page URLs.
- Creating unique fingerprints for page URLs and domain names.
- Creating fake sitemaps and reproducing crawls without a crawler using the Graph Manager.
- Tools for easy frontier testing.
- Recording your Scrapy crawls and using them later for frontier testing.
- A logging facility that you can hook into for catching errors and debugging your frontiers.
What’s next?¶
The next obvious steps are for you to install Crawl Frontier, and read the architecture overview and API docs. Thanks for your interest!
Installation Guide¶
The installation steps assume that you have the following things installed:
- Python 2.7
- pip and setuptools Python packages. Recent versions of pip install setuptools automatically if it is not already present.
You can install Crawl Frontier using pip (which is the canonical way to install Python packages).
To install using pip:
pip install crawl-frontier
- Crawl Frontier at a glance – Understand what Crawl Frontier is and how it can help you.
- Installation Guide – Get Crawl Frontier installed on your computer.
Basic concepts¶
What is a Crawl Frontier?¶
A crawl frontier is the part of a crawling system that decides the logic and policies to follow when a crawler is visiting websites (what pages should be crawled next, priorities and ordering, how often pages are revisited, etc).
A usual crawler-frontier scheme is:
[Diagram: typical crawler / frontier interaction]
The frontier is initialized with a list of start URLs, that we call the seeds. Once the frontier is initialized the crawler asks it what pages should be visited next. As the crawler starts to visit the pages and obtains results, it will inform the frontier of each page response and also of the extracted hyperlinks contained within the page. These links are added by the frontier as new requests to visit according to the frontier policies.
This process (ask for new requests/notify results) is repeated until the end condition for the crawl is reached. Some crawlers may never stop, that’s what we call continuous crawls.
Frontier policies can be based on almost any logic. Common use cases are score/priority systems computed from one or many page attributes (freshness, update times, content relevance for certain terms, etc.). They can also be based on logic as simple as FIFO/LIFO or DFS/BFS page visit ordering.
Depending on frontier logic, a persistent storage system may be needed to store, update or query information about the pages. Other systems can be 100% volatile and not share any information at all between different crawls.
Architecture overview¶
This document describes the architecture of Crawl Frontier and how its components interact.
Overview¶
The following diagram shows an overview of the Crawl Frontier architecture with its components (referenced by numbers) and an outline of the data flow that takes place inside the system. A brief description of the components is included below with links for more detailed information about them. The data flow is also described below.
[Diagram: Crawl Frontier architecture and data flow]
Components¶
Crawler¶
The Crawler (2) is responsible for fetching web pages from the sites (1) and feeding them to the frontier which manages what pages should be crawled next.
The crawler can be implemented using Scrapy or any other crawling framework/system, since Crawl Frontier offers generic frontier functionality.
Frontier API / Manager¶
The main entry point to Crawl Frontier API (3) is the FrontierManager object. Frontier users, in our case the Crawler (2), will communicate with the frontier through it.
Communication with the frontier can also be done through other mechanisms such as an HTTP API or a queue system. These functionalities are not available for the time being, but hopefully will be in future versions.
For more information see Frontier API.
Middlewares¶
Frontier middlewares (4) are specific hooks that sit between the Manager (3) and the Backend (5). These middlewares process Request and Response objects when they pass to and from the Frontier and the Backend. They provide a convenient mechanism for extending functionality by plugging custom code.
For more information see Middlewares.
Backend¶
The frontier backend (5) is where the crawling logic/policies lie. It’s responsible for receiving all the crawl info and selecting the next pages to be crawled.
Depending on the logic implemented, it may require a persistent storage (6) to manage Request and Response object data.
For more information see Backends.
Data Flow¶
The data flow in Crawl Frontier is controlled by the Frontier Manager; all data passes through the manager-middlewares-backend scheme and goes like this:
- The frontier is initialized with a list of seed requests (seed URLs) as the entry point for the crawl.
- The crawler asks for a list of requests to crawl.
- Each URL is crawled and the frontier is notified back of the crawl result, as well as of the links extracted from the page. If anything goes wrong during the crawl, the frontier is also informed of it.
Once all URLs have been crawled, steps 2-3 are repeated until the crawl or frontier end condition is reached. Each repetition of the loop (steps 2-3) is called a frontier iteration.
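As an illustration only, this loop can be sketched with the Frontier API described later in this documentation; fetch_page() here is a placeholder for whatever download and parse step your crawler performs, and is not part of Crawl Frontier.
from crawlfrontier import FrontierManager
from crawlfrontier.core.models import Request

frontier = FrontierManager.from_settings('my_project.frontier.settings')
frontier.add_seeds([Request(url='http://example.com')])          # 1. seeds
while not frontier.finished:
    requests = frontier.get_next_requests()                      # 2. ask for requests
    if not requests:
        break
    for request in requests:
        response, links = fetch_page(request)                    # hypothetical crawler call
        frontier.page_crawled(response=response, links=links)    # 3. notify results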
Frontier objects¶
Frontier uses 2 object types: Request and Response. They are used to represent crawling HTTP requests and responses respectively.
These classes are used by most Frontier API methods either as a parameter or as a return value depending on the method used.
Frontier also uses these objects to internally communicate between different components (middlewares and backend).
Request objects¶
- class crawlfrontier.core.models.Request(url, method='GET', headers=None, cookies=None, meta=None)¶
A Request object represents an HTTP request, which is generated for seeds, extracted page links and next pages to crawl. Each one should be associated with a Response object when crawled.
Parameters: - url (string) – URL to send.
- method (string) – HTTP method to use.
- headers (dict) – dictionary of headers to send.
- cookies (dict) – dictionary of cookies to attach to this request.
- meta (dict) – dictionary that contains arbitrary metadata for this request.
- cookies¶
Dictionary of cookies to attach to this request.
- headers¶
A dictionary which contains the request headers.
- meta¶
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Crawl-frontier components (middlewares, etc). So the data contained in this dict depends on the components you have enabled.
- method¶
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: GET, POST, PUT, etc
- url¶
A string containing the URL of this request.
Response objects¶
- class crawlfrontier.core.models.Response(url, status_code=200, headers=None, body='', request=None)¶
A Response object represents an HTTP response, which is usually downloaded (by the crawler) and sent back to the frontier for processing.
Parameters: - url (string) – URL of this response.
- status_code (int) – the HTTP status of the response. Defaults to 200.
- headers (dict) – dictionary of headers of this response.
- body (string) – the response body.
- request (Request) – the Request object that generated this response.
- body¶
A str containing the body of this Response.
- headers¶
A dictionary object which contains the response headers.
- meta¶
A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta).
- status_code¶
An integer representing the HTTP status of the response. Example: 200, 404, 500.
- url¶
A string containing the URL of the response.
Note: the domain and fingerprint fields are added by the built-in middlewares described in the sections below.
Identifying unique objects¶
As frontier objects are shared between the crawler and the frontier, some mechanism to uniquely identify objects is needed. This method may vary depending on the frontier logic (in most cases due to the backend used).
By default, Crawl Frontier activates the fingerprint middleware to generate a unique fingerprint calculated from the Request.url and Response.url fields, which is added to the Request.meta and Response.meta fields respectively. You can use this middleware or implement your own method to manage frontier objects identification.
An example of a generated fingerprint for a Request object:
>>> request.url
'http://thehackernews.com'
>>> request.meta['fingerprint']
'198d99a8b2284701d6c147174cd69a37a7dea90f'
Adding additional data to objects¶
In most cases frontier objects can be used to represent the information needed to manage the frontier logic/policy.
Also, additional data can be stored by components using the Request.meta and Response.meta fields.
For instance, the frontier domain middleware adds a domain info field to every Request.meta and Response.meta if it is activated:
>>> request.url
'http://www.scrapinghub.com'
>>> request.meta['domain']
{
"name": "scrapinghub.com",
"netloc": "www.scrapinghub.com",
"scheme": "http",
"sld": "scrapinghub",
"subdomain": "www",
"tld": "com"
}
Frontier API¶
This section documents the Crawl Frontier core API, and is intended for developers of middlewares and backends.
Crawl Frontier API / Manager¶
The main entry point to Crawl Frontier API is the FrontierManager object, passed to middlewares and backend through the from_manager class method. This object provides access to all Crawl Frontier core components, and is the only way for middlewares and backend to access them and hook their functionality into Crawl Frontier.
The FrontierManager is responsible for loading the installed middlewares and backend, as well as for managing the data flow around the whole frontier.
Loading from settings¶
Although FrontierManager can be initialized using parameters, the most common way of doing this is to use Frontier Settings.
This can be done through the from_settings class method, using either a string path:
>>> from crawlfrontier import FrontierManager
>>> frontier = FrontierManager.from_settings('my_project.frontier.settings')
or a Settings object instance:
>>> from crawlfrontier import FrontierManager, Settings
>>> settings = Settings()
>>> settings.MAX_PAGES = 0
>>> frontier = FrontierManager.from_settings(settings)
It can also be initialized without parameters, in this case the frontier will use the default settings:
>>> from crawlfrontier import FrontierManager, Settings
>>> frontier = FrontierManager.from_settings()
Frontier Manager¶
- class crawlfrontier.core.manager.FrontierManager(request_model, response_model, backend, logger, event_log_manager, middlewares=None, test_mode=False, max_requests=0, max_next_requests=0, auto_start=True, settings=None)¶
The FrontierManager object encapsulates the whole frontier, providing an API to interact with it. It’s also responsible for loading and communicating with all the different frontier components.
Parameters: - request_model (object/string) – The Request object to be used by the frontier.
- response_model (object/string) – The Response object to be used by the frontier.
- backend (object/string) – The Backend object to be used by the frontier.
- logger (object/string) – The Logger object to be used by the frontier.
- event_log_manager (object/string) – The EventLogger object to be used by the frontier.
- middlewares (list) – A list of Middleware objects to be used by the frontier.
- test_mode (bool) – Activate/deactivate frontier test mode.
- max_requests (int) – Number of pages after which the frontier would stop (See Finish conditions).
- max_next_requests (int) – Maximum number of requests returned by get_next_requests method.
- auto_start (bool) – Activate/deactivate automatic frontier start (See starting/stopping the frontier).
- settings (object/string) – The Settings object used by the frontier.
Attributes
- request_model¶
The Request object to be used by the frontier. Can be defined with REQUEST_MODEL setting.
- response_model¶
The Response object to be used by the frontier. Can be defined with RESPONSE_MODEL setting.
- event_log_manager¶
The EventLogger object to be used by the frontier. Can be defined with EVENT_LOGGER setting.
- middlewares¶
A list of Middleware objects to be used by the frontier. Can be defined with MIDDLEWARES setting.
- test_mode¶
Boolean value indicating if the frontier is using frontier test mode. Can be defined with TEST_MODE setting.
- max_requests¶
Number of pages after which the frontier would stop (See Finish conditions). Can be defined with MAX_REQUESTS setting.
- max_next_requests¶
Maximum number of requests returned by get_next_requests method. Can be defined with MAX_NEXT_REQUESTS setting.
- auto_start¶
Boolean value indicating if automatic frontier start is activated. See starting/stopping the frontier. Can be defined with AUTO_START setting.
- iteration¶
Current frontier iteration.
- n_requests¶
Number of accumulated requests returned by the frontier.
- finished¶
Boolean value indicating if the frontier has finished. See Finish conditions.
API Methods
- start()¶
Notifies all the components of the frontier start. Typically used for initializations (See starting/stopping the frontier).
Returns: None.
- stop()¶
Notifies all the components of the frontier stop. Typically used for finalizations (See starting/stopping the frontier).
Returns: None.
- add_seeds(seeds)¶
Adds a list of seed requests (seed URLs) as entry point for the crawl.
Parameters: seeds (list) – A list of Request objects. Returns: None.
- get_next_requests(max_next_requests=0)¶
Returns a list of next requests to be crawled. Optionally a maximum number of pages can be passed. If no value is passed, FrontierManager.max_next_requests will be used instead. (MAX_NEXT_REQUESTS setting).
Parameters: max_next_requests (int) – Maximum number of requests to be returned by this method. Returns: list of Request objects.
- page_crawled(response, links=None)¶
Informs the frontier about the crawl result and extracted links for the current page.
Parameters: - response (object) – The Response object of the crawled page.
- links (list) – A list of Request objects generated from the links extracted for the crawled page.
Returns: None.
- request_error(request, error)¶
Informs the frontier about a page crawl error. An error identifier must be provided.
Parameters: - request (object) – The Request object for the page that failed to crawl.
- error (string) – A string identifier for the error.
Returns: None.
Class Methods
- classmethod from_settings(settings=None)¶
Returns a FrontierManager instance initialized with the passed settings argument. The argument value can either be a string path pointing to a settings file or a Settings object instance. If no settings are given, the frontier default settings are used.
Starting/Stopping the frontier¶
Sometimes, frontier components need to perform initialization and finalization operations. The frontier mechanism to notify the different components of the frontier start and stop is done by the start() and stop() methods respectively.
By default the auto_start frontier value is activated; this means that components will be notified as soon as the FrontierManager object is created. If you need finer control over when the different components are initialized, deactivate auto_start and call the frontier API start() and stop() methods manually.
Note
The frontier stop() method is not automatically called when auto_start is active (because the frontier is not aware of the crawling state). If you need to notify components of the frontier end, you should call the method manually.
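A sketch of taking manual control (AUTO_START can equally be set in the frontier settings file):
from crawlfrontier import FrontierManager, Settings

settings = Settings()
settings.AUTO_START = False
frontier = FrontierManager.from_settings(settings)

frontier.start()   # components are notified of the frontier start
# ... crawl ...
frontier.stop()    # components are notified of the frontier end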
Frontier iterations¶
Once frontier is running, the usual process is the one described in the data flow section.
The crawler asks the frontier for the next pages using the get_next_requests() method. Each call in which the frontier returns a non-empty list of pages (that is, data is available) is what we call a frontier iteration.
Current frontier iteration can be accessed using the iteration attribute.
Finishing the frontier¶
A crawl can be finished either by the crawler or by Crawl Frontier. The frontier will finish once a maximum number of pages has been returned. This limit is controlled by the max_requests attribute (MAX_REQUESTS setting).
If max_requests has a value of 0 (default value) the frontier will continue indefinitely.
Once the frontier is finished, no more pages will be returned by the get_next_requests method and the finished attribute will be True.
Component objects¶
- class crawlfrontier.core.components.Component¶
Interface definition for a frontier component. The Component object is the base class for frontier Middleware and Backend objects.
FrontierManager communicates with the active components using the hook methods listed below.
Implementations are different for Middleware and Backend objects, therefore methods are not fully described here but in their corresponding section.
Attributes
- name¶
The component name
Abstract methods
- frontier_start()¶
Called when the frontier starts, see starting/stopping the frontier.
- frontier_stop()¶
Called when the frontier stops, see starting/stopping the frontier.
- add_seeds(seeds)¶
This method is called when new seeds are added to the frontier.
Parameters: seeds (list) – A list of Request objects.
- page_crawled(response, links)¶
This method is called each time a page has been crawled.
Parameters: - response (object) – The Response object of the crawled page.
- links (list) – A list of Request objects generated from the links extracted for the crawled page.
- request_error(page, error)¶
This method is called each time an error occurs when crawling a page.
Parameters: - request (object) – The Request object for the page that failed to crawl.
- error (string) – A string identifier for the error.
Class Methods
- classmethod from_manager(manager)¶
Class method called from FrontierManager passing the manager itself.
Example of usage:
def from_manager(cls, manager):
    return cls(settings=manager.settings)
Test mode¶
In some cases, while testing, frontier components need to act differently than they usually do (for instance, the domain middleware accepts non-valid URLs like 'A1' or 'B1' when parsing domain URLs in test mode).
Components can know if the frontier is in test mode via the boolean test_mode attribute.
Other ways of using the frontier¶
Communication with the frontier can also be done through other mechanisms such as an HTTP API or a queue system. These functionalities are not available for the time being, but hopefully will be included in future versions.
Settings¶
The Crawl Frontier settings allow you to customize the behaviour of all components, including the FrontierManager, Middleware and Backend themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that can be used to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
For a list of available built-in settings see: Built-in settings reference.
Designating the settings¶
When you use Crawl Frontier, you have to tell it which settings you’re using. As FrontierManager is the main entry point to Frontier usage, you can do this by using the method described in the Loading from settings section.
When using a string path pointing to a settings file for the frontier we propose the following directory structure:
my_project/
frontier/
__init__.py
settings.py
middlewares.py
backends.py
...
These are basically:
- frontier/settings.py: the frontier settings file.
- frontier/middlewares.py: the middlewares used by the frontier.
- frontier/backends.py: the backend(s) used by the frontier.
How to access settings¶
Settings can be accessed through the FrontierManager.settings attribute, which is passed to the Middleware.from_manager and Backend.from_manager class methods:
class MyMiddleware(Component):

    @classmethod
    def from_manager(cls, manager):
        settings = manager.settings
        if settings.TEST_MODE:
            print "test mode is enabled!"
        return cls(settings=settings)
In other words, settings can be accessed as attributes of the Settings object.
Settings class¶
Built-in frontier settings¶
Here’s a list of all available Crawl Frontier settings, in alphabetical order, along with their default values and the scope where they apply.
AUTO_START¶
Default: True
Whether to enable the frontier automatic start. See Starting/Stopping the frontier.
BACKEND¶
Default: 'crawlfrontier.contrib.backends.memory.FIFO'
The Backend to be used by the frontier. For more info see Activating a backend.
EVENT_LOGGER¶
Default: 'crawlfrontier.logger.events.EventLogManager'
The EventLogManager class to be used by the frontier.
MAX_NEXT_REQUESTS¶
Default: 0
The maximum number of requests returned by the get_next_requests API method. If the value is 0 (default), no maximum value will be used.
MAX_REQUESTS¶
Default: 0
Maximum number of returned requests after which Crawl Frontier is finished. If the value is 0 (default), the frontier will continue indefinitely. See Finishing the frontier.
MIDDLEWARES¶
A list containing the middlewares enabled in the frontier. For more info see Activating a middleware.
Default:
[
'crawlfrontier.contrib.middlewares.domain.DomainMiddleware',
'crawlfrontier.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
'crawlfrontier.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
REQUEST_MODEL¶
Default: 'crawlfrontier.core.models.Request'
The Request model to be used by the frontier.
Built-in fingerprint middleware settings¶
Settings used by the UrlFingerprintMiddleware and DomainFingerprintMiddleware.
URL_FINGERPRINT_FUNCTION¶
Default: crawlfrontier.utils.fingerprint.sha1
The function used to calculate the url fingerprint.
DOMAIN_FINGERPRINT_FUNCTION¶
Default: crawlfrontier.utils.fingerprint.sha1
The function used to calculate the domain fingerprint.
Default settings¶
If no settings are specified, frontier will use the built-in default ones. For a complete list of default values see: Built-in settings reference. All default settings can be overridden.
Frontier default settings¶
Values:
PAGE_MODEL = 'crawlfrontier.core.models.Page'
LINK_MODEL = 'crawlfrontier.core.models.Link'
FRONTIER = 'crawlfrontier.core.frontier.Frontier'
MIDDLEWARES = [
'crawlfrontier.contrib.middlewares.domain.DomainMiddleware',
'crawlfrontier.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
'crawlfrontier.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'
TEST_MODE = False
MAX_PAGES = 0
MAX_NEXT_PAGES = 0
AUTO_START = True
Fingerprints middleware default settings¶
Values:
URL_FINGERPRINT_FUNCTION = 'crawlfrontier.utils.fingerprint.sha1'
DOMAIN_FINGERPRINT_FUNCTION = 'crawlfrontier.utils.fingerprint.sha1'
Logging default settings¶
Values:
LOGGER = 'crawlfrontier.logger.FrontierLogger'
LOGGING_ENABLED = True
LOGGING_EVENTS_ENABLED = False
LOGGING_EVENTS_INCLUDE_METADATA = True
LOGGING_EVENTS_INCLUDE_DOMAIN = True
LOGGING_EVENTS_INCLUDE_DOMAIN_FIELDS = ['name', 'netloc', 'scheme', 'sld', 'tld', 'subdomain']
LOGGING_EVENTS_HANDLERS = [
"crawlfrontier.logger.handlers.COLOR_EVENTS",
]
LOGGING_MANAGER_ENABLED = False
LOGGING_MANAGER_LOGLEVEL = logging.DEBUG
LOGGING_MANAGER_HANDLERS = [
"crawlfrontier.logger.handlers.COLOR_CONSOLE_MANAGER",
]
LOGGING_BACKEND_ENABLED = False
LOGGING_BACKEND_LOGLEVEL = logging.DEBUG
LOGGING_BACKEND_HANDLERS = [
"crawlfrontier.logger.handlers.COLOR_CONSOLE_BACKEND",
]
LOGGING_DEBUGGING_ENABLED = False
LOGGING_DEBUGGING_LOGLEVEL = logging.DEBUG
LOGGING_DEBUGGING_HANDLERS = [
"crawlfrontier.logger.handlers.COLOR_CONSOLE_DEBUGGING",
]
EVENT_LOG_MANAGER = 'crawlfrontier.logger.events.EventLogManager'
- What is a Crawl Frontier? – Learn what a crawl frontier is and how to use it.
- Architecture overview – See how Crawl Frontier works and its different components.
- Frontier objects – Understand the classes used to represent links and pages.
- Frontier API – Learn how to use the frontier.
- Settings – See how to configure Crawl Frontier.
Extending Crawl Frontier¶
Middlewares¶
Frontier Middleware sits between FrontierManager and Backend objects, using hooks for Request and Response processing according to frontier data flow.
It’s a light, low-level system for filtering and altering Frontier’s requests and responses.
Activating a middleware¶
To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose values can be class paths or instances of Middleware objects.
Here’s an example:
MIDDLEWARES = [
'crawlfrontier.contrib.middlewares.domain.DomainMiddleware',
]
Middlewares are called in the same order they are defined in the list. To decide where to place your middleware, pick a position according to where you want it to run. The order does matter because each middleware performs a different action, and your middleware could depend on some previous (or subsequent) middleware having been applied.
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware documentation for more info.
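For example, to run a custom middleware after the built-in ones you would append it to the defaults (the myproject path below is hypothetical):
MIDDLEWARES = [
    'crawlfrontier.contrib.middlewares.domain.DomainMiddleware',
    'crawlfrontier.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
    'crawlfrontier.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
    'myproject.frontier.middlewares.MyMiddleware',  # hypothetical custom middleware
]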
Writing your own middleware¶
Writing your own frontier middleware is easy. Each Middleware component is a single Python class inherited from Component.
FrontierManager will communicate with all active middlewares through the methods described below.
- class crawlfrontier.core.components.Middleware¶
Interface definition for a Frontier Middleware.
Methods
- frontier_start()¶
Called when the frontier starts, see starting/stopping the frontier.
- frontier_stop()¶
Called when the frontier stops, see starting/stopping the frontier.
- add_seeds(seeds)¶
This method is called when new seeds are added to the frontier.
Parameters: seeds (list) – A list of Request objects.
Returns: A list of Request objects or None.
Should either return None or a list of Request objects.
If it returns None, FrontierManager won’t continue processing any other middleware and the seeds will never reach the Backend.
If it returns a list of Request objects, this list will be passed to the next middleware. The process repeats for all active middlewares until the result finally reaches the Backend.
If you want to filter any seed, just don’t include it in the returned object list.
- page_crawled(response, links)¶
This method is called each time a page has been crawled.
Parameters: - response (object) – The Response object of the crawled page.
- links (list) – A list of Request objects generated from the links extracted for the crawled page.
Returns: Response or None.
Should either return None or a Response object.
If it returns None, FrontierManager won’t continue processing any other middleware and Backend will never be notified.
If it returns a Response object, this will be passed to the next middleware. The process repeats for all active middlewares until the result finally reaches the Backend.
If you want to filter a page, just return None.
- request_error(page, error)¶
This method is called each time an error occurs when crawling a page.
Parameters: - request (object) – The Request object for the page that failed to crawl.
- error (string) – A string identifier for the error.
Returns: Request or None
Should either return None or a Request object.
If it returns None, FrontierManager won’t continue processing any other middleware and Backend will never be notified.
If it returns a Request object, this will be passed to the next middleware. The process repeats for all active middlewares until the result finally reaches the Backend.
If you want to filter a page error, just return None.
Class Methods
- classmethod from_manager(manager)¶
Class method called from FrontierManager passing the manager itself.
Example of usage:
def from_manager(cls, manager):
    return cls(settings=manager.settings)
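Putting the interface together, a minimal seed-filtering middleware could look like the following sketch; the http-only rule and the class/module names are purely illustrative, and the remaining hooks simply pass objects through unchanged.
from crawlfrontier.core.components import Middleware

class HttpSeedsOnlyMiddleware(Middleware):
    name = 'http-seeds-only'
    def __init__(self, settings=None):
        self.settings = settings
    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)
    def frontier_start(self):
        pass
    def frontier_stop(self):
        pass
    def add_seeds(self, seeds):
        # drop any seed whose URL is not http(s); returning the filtered list
        # passes it on to the next middleware (and finally to the backend)
        return [seed for seed in seeds if seed.url.startswith('http')]
    def page_crawled(self, response, links):
        return response
    def request_error(self, request, error):
        return request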
Built-in middleware reference¶
This page describes all the Middleware components that come with Crawl Frontier. For information on how to use them and how to write your own middleware, see the middleware usage guide.
For a list of the components enabled by default (and their orders) see the MIDDLEWARES setting.
DomainMiddleware¶
- class crawlfrontier.contrib.middlewares.domain.DomainMiddleware¶
This Middleware will add a domain info field to every Request.meta and Response.meta if it is activated.
The domain object will contain the following fields:
- netloc: URL netloc according to RFC 1808 syntax specifications
- name: Domain name
- scheme: URL scheme
- tld: Top level domain
- sld: Second level domain
- subdomain: URL subdomain(s)
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080/this/is/an/url'
>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
If TEST_MODE is active, it will accept testing URLs, parsing letter domains:
>>> request.url
'A1'
>>> request.meta['domain']
{
    "name": "A",
    "netloc": "A",
    "scheme": "-",
    "sld": "-",
    "subdomain": "-",
    "tld": "-"
}
UrlFingerprintMiddleware¶
- class crawlfrontier.contrib.middlewares.fingerprint.UrlFingerprintMiddleware¶
This Middleware will add a fingerprint field to every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the object URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting.
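If you do, a rough sketch could look like the following; the module path and the md5 choice are purely illustrative, and it assumes the configured function simply receives the URL string and returns a hexadecimal digest (the default settings listing later in this document gives the setting as a dotted path string):
# my_project/frontier/fingerprints.py -- hypothetical module
import hashlib

def md5_fingerprint(url):
    # assumed signature: take the URL string, return a hex digest string
    return hashlib.md5(url).hexdigest()
Then, in your frontier settings:
URL_FINGERPRINT_FUNCTION = 'my_project.frontier.fingerprints.md5_fingerprint'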
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['fingerprint']
'60d846bc2969e9706829d5f1690f11dafb70ed18'
DomainFingerprintMiddleware¶
- class crawlfrontier.contrib.middlewares.fingerprint.DomainFingerprintMiddleware¶
This Middleware will add a fingerprint field to the domain dict in every Request.meta and Response.meta if it is activated.
The fingerprint will be calculated from the domain name, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting.
An example for a Request object:
>>> request.url
'http://www.scrapinghub.com:8080'
>>> request.meta['domain']
{
    "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
Backends¶
The frontier Backend is where the crawling logic/policies lie. It’s responsible for receiving all the crawl info and selecting the next pages to be crawled. It’s called by the FrontierManager after the middlewares, using hooks for Request and Response processing according to the frontier data flow.
Unlike Middleware, of which many different instances can be activated, only one Backend can be used per frontier.
Some backends, depending on the logic implemented, require a persistent storage to manage Request and Response object data.
Activating a backend¶
To activate the frontier backend component, set it through the BACKEND setting.
Here’s an example:
BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'
Keep in mind that some backends may need to be enabled through a particular setting. See each backend documentation for more info.
Writing your own backend¶
Writing your own frontier backend is easy. Each Backend component is a single Python class inherited from Component.
FrontierManager will communicate with active Backend through the methods described below.
- class crawlfrontier.core.components.Backend¶
Interface definition for a Frontier Backend
Methods
- frontier_start()¶
Called when the frontier starts, see starting/stopping the frontier.
Returns: None.
- frontier_stop()¶
Called when the frontier stops, see starting/stopping the frontier.
Returns: None.
- add_seeds(seeds)¶
This method is called when new seeds are added to the frontier.
Parameters: seeds (list) – A list of Request objects. Returns: None.
- get_next_requests(max_n_requests)¶
Returns a list of next requests to be crawled.
Parameters: max_n_requests (int) – Maximum number of requests to be returned by this method. Returns: list of Request objects.
- page_crawled(response, links)¶
This method is called each time a page has been crawled.
Parameters: - response (object) – The Response object of the crawled page.
- links (list) – A list of Request objects generated from the links extracted for the crawled page.
Returns: None.
- request_error(page, error)¶
This method is called each time an error occurs when crawling a page.
Parameters: - request (object) – The Request object for the page that failed to crawl.
- error (string) – A string identifier for the error.
Returns: None.
Class Methods
- classmethod from_manager(manager)¶
Class method called from FrontierManager passing the manager itself.
Example of usage:
def from_manager(cls, manager):
    return cls(settings=manager.settings)
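As a concrete sketch of this interface, a naive in-memory FIFO backend could look like the code below; it is illustrative only (no de-duplication or persistence), and the built-in memory backends described next already cover this kind of use case.
from crawlfrontier.core.components import Backend

class SimpleFIFOBackend(Backend):
    name = 'simple-fifo'
    def __init__(self, manager):
        self.manager = manager
        self.queue = []
    @classmethod
    def from_manager(cls, manager):
        return cls(manager)
    def frontier_start(self):
        pass
    def frontier_stop(self):
        pass
    def add_seeds(self, seeds):
        self.queue.extend(seeds)
    def get_next_requests(self, max_n_requests):
        # pop the oldest requests first (FIFO ordering)
        batch = self.queue[:max_n_requests]
        self.queue = self.queue[max_n_requests:]
        return batch
    def page_crawled(self, response, links):
        # naive: enqueue every extracted link without de-duplication
        self.queue.extend(links)
    def request_error(self, request, error):
        pass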
Built-in backend reference¶
This page describes all the Backend components that come with Crawl Frontier. For information on how to use them and how to write your own backend, see the backend usage guide.
To know which Backend is activated by default, check the BACKEND setting.
Basic algorithms¶
Some of the built-in Backend objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.
The differences between them lie in the storage engine used. For instance, memory.FIFO and sqlalchemy.FIFO use the same logic but different storage engines.
Memory backends¶
This set of Backend objects uses a heapq object as storage for the basic algorithms.
- class crawlfrontier.contrib.backends.memory.FIFO¶
- class crawlfrontier.contrib.backends.memory.LIFO¶
- class crawlfrontier.contrib.backends.memory.BFS¶
- class crawlfrontier.contrib.backends.memory.DFS¶
SQLAlchemy backends¶
This set of Backend objects uses SQLAlchemy as storage for the basic algorithms.
By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.
Request and Response are represented by a declarative SQLAlchemy model:
# imports shown for completeness
from sqlalchemy import Column, Integer, String, TIMESTAMP, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Page(Base):
__tablename__ = 'pages'
__table_args__ = (
UniqueConstraint('url'),
)
class State:
NOT_CRAWLED = 'NOT CRAWLED'
QUEUED = 'QUEUED'
CRAWLED = 'CRAWLED'
ERROR = 'ERROR'
url = Column(String(1000), nullable=False)
fingerprint = Column(String(40), primary_key=True, nullable=False, index=True, unique=True)
depth = Column(Integer, nullable=False)
created_at = Column(TIMESTAMP, nullable=False)
status_code = Column(String(20))
state = Column(String(10))
error = Column(String(20))
If you need to create your own models, you can do it by using the DEFAULT_MODELS setting:
DEFAULT_MODELS = {
'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
}
This setting uses a dictionary where the key is the name of the model to define and the value is the path of the model class to use. If you want, for instance, to create a model to represent domains:
DEFAULT_MODELS = {
'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
'Domain': 'myproject.backends.sqlalchemy.models.Domain',
}
Models can be accessed from the Backend dictionary attribute models.
For a complete list of all settings used for sqlalchemy backends check the settings section.
- class crawlfrontier.contrib.backends.sqlalchemy.FIFO¶
- class crawlfrontier.contrib.backends.sqlalchemy.LIFO¶
- class crawlfrontier.contrib.backends.sqlalchemy.BFS¶
- class crawlfrontier.contrib.backends.sqlalchemy.DFS¶
OPIC backend¶
The OPIC backend takes its name from the “Online Page Importance Computation” algorithm, described in:
Adaptive On-Line Page Importance Computation
Abiteboul S., Preda M., Cobena G.
2003
The main idea is that we want to crawl pages which are important, and this importance depends on which pages the current page links to and which pages link to it, in a manner similar to PageRank. The implementation is in:
- class crawlfrontier.contrib.backends.opic.backend.OpicHitsBackend(manager, db_graph=None, db_pages=None, db_hits=None, scheduler=None, freq_estimator=None, change_detector=None, test=False)¶
Frontier backend implementation based on the OPIC algorithm adapted to HITS scoring.
Parameters: - manager (FrontierManager) – Frontier manager.
- db_graph (GraphInterface) – Graph database. If None use a new instance of graphdb.SQLite
- db_pages (PageDBInterface) – Page database. If None use a new instance of pagedb.SQLite.
- db_hits (HitsDBInterface) – HITS database. If None use a new instance of hitsdb.SQLite.
- scheduler (SchedulerInterface) – Decides which page to crawl next
- freq_estimator (FreqEstimatorInterface) – Frequency estimator.
- change_detector (PageChangeInterface) – Change detector.
- test (bool) – If True compute h_scores and a_scores prior to closing.
For more information about the implementation details read the annex OPIC Details.
OPIC Details¶
Since OPIC is just a way of computing the HITS score for each page we must first understand what the HITS score is. HITS stands for Hyperlink-Induced Topic Search, originally described in:
Authoritative sources in a hyperlinked environment
Kleinberg J.
1998
The idea is to compute for each page a pair of numbers called the Hub and Authority scores. Intuitively, we say a page is a hub when it points to a lot of pages with high authority, and we say that a page has high authority if it is pointed to by many hubs. Mathematically this is expressed as:
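\[\begin{split}h_i &= \sum_{j:\, i \to j} a_j \\ a_i &= \sum_{j:\, j \to i} h_j\end{split}\]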
Where \(h_i\) represents the hub score of page \(i\) and \(a_i\) represents its authority score. \(j: i \to j\) represents the set of pages \(j\) that are pointed to by \(i\) and \(j: j \to i\) represents the set of pages \(j\) that point to \(i\). For example, consider the following set of pages (the example graph matches the link matrix used in the code below: page 1 links to page 2, page 2 links to page 4, page 3 links to pages 1, 2 and 4, and page 4 has no outgoing links):
The equations to compute the scores would be:
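\[\begin{split}h_1 &= a_2, \quad h_2 = a_4, \quad h_3 = a_1 + a_2 + a_4, \quad h_4 = 0 \\ a_1 &= h_3, \quad a_2 = h_1 + h_3, \quad a_3 = 0, \quad a_4 = h_2 + h_3\end{split}\]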
The above equations can be rewritten in matrix form. Let’s define:
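\[L = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad H = LL^T, \qquad A = L^TL\]
(this \(L\) is the same matrix built with numpy in the code below).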
\(L\) is called the link matrix because it contains all the link information available in the graph. Notice that \(L_{ij}\) is 1 if there is a link between nodes \(i\) and \(j\) and 0 otherwise. If we group the hub and authorities scores in the vectors \(h\) and \(a\) we can write the HITS equations as:
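\[\begin{split}h &= La \\ a &= L^Th\end{split}\]
and, substituting one equation into the other,
\[\begin{split}h &= LL^Th = Hh \\ a &= L^TLa = Aa\end{split}\]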
The above equations always have the trivial solutions \(h=a=0\). The other non-zero solutions are given by the eigenvectors of the matrices \(H\) and \(A\).
A very simple and efficient way of getting the eigenvector with the largest eigenvalue is the power method: to compute the largest eigenvector \(v\) of a matrix \(M\), simply iterate long enough starting from a random vector \(v^0\):
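\[v^{k+1} = Mv^k\]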
Notice that unless the associated eigenvalue is exactly 1 you will need to re-normalize after each iteration.
Let’s apply this method to our toy problem. Using numpy the code for the power method is very simple:
import numpy as np
def power(M, eps=1e-6, max_iter=1000):
"""Compute the largest eigenvector of M.
Returns: the eigenvector and the eigenvalue
"""
v0 = np.random.rand(M.shape[0])
converged = False
for i in xrange(max_iter):
v1 = M.dot(v0)
v1 /= np.linalg.norm(v1)
err = np.max(np.abs(v1 - v0))
if err < eps:
converged = True
break
v0 = v1
if not converged:
print "Warning: didn't converge"
return v1, M[0, :].dot(v1)/v1[0]
And we can apply it to our toy problem:
L = np.array([
0, 1, 0, 0,
0, 0, 0, 1,
1, 1, 0, 1,
0, 0, 0, 0]).astype(float).reshape(4,4)
H = L.dot(L.transpose())
A = L.transpose().dot(L)
h, h_eigv = power(H)
a, a_eigv = power(A)
print "Hub scores: ", h
# Hub scores: [ 0.32505736 0.3250578 0.88807383 0. ]
print "Authority scores: ", a
# Authority scores: [ 0.45970084 0.62796324 0. 0.62796282]
If we represent the scores directly on the graph, we see how they correctly match our intuitive notion of what a hub or an authoritative page is: page 3 gets the highest hub score, while pages 2 and 4 get the highest authority scores.
Notice that the power method can fail. Take for example the matrix \(L\). No matter what initial value we start iterating from, it will always converge to a zero vector. To see why, notice the following identity, which gives the power method its name:
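\[v^k = M^k v^0\]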
Now let’s compute the different powers of \(L\):
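\[L^2 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad L^3 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad L^4 = 0\]
Since \(L^4 = 0\), we have \(L^k v^0 = 0\) for every \(k \ge 4\), whatever the starting vector.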
The reason for this behavior is that the only eigenvalue of \(L\) is 0. There is, however, a non-zero eigenvector associated with the 0 eigenvalue:
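\[L\,v = 0 \qquad \text{for} \qquad v = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}\]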
But the power method fails to find it. A trick to solve this problem is to apply a small perturbation \(\epsilon\) to \(L\) and apply the power method. It should converge to a small perturbation of the original eigenvector. Let’s try:
print power(L + 1e-5)
# (array([ 5.52652292e-02, 3.23859608e-03, 9.98466441e-01,
# 1.79833785e-04]), 0.058792257444600315)
As you can see it returns a result very close to the true answer. The perturbed eigenvector is \((0.055, 0.003, 0.998, 0.000)\) and the perturbed eigenvalue is 0.06.
We end with some final comments:
The power method is efficient because the link matrix is sparse, meaning that most of its elements are zero. However, we need to have it fully loaded in memory to perform an iteration of the algorithm.
We don’t need to compute the \(H\) and \(A\) matrices and instead we can perform the following iteration:
\[\begin{split}h_{k+1} &= La_k \\ a_{k+1} &= L^Th_k\end{split}\]
Which can also be rewritten as:
\[\begin{split}\begin{pmatrix} h \\ a \end{pmatrix}^{k+1} = J \begin{pmatrix} h \\ a \end{pmatrix}^{k}\end{split}\]
Where:
\[\begin{split}J = \begin{bmatrix} 0 & L \\ L^T & 0 \end{bmatrix}\end{split}\]
And we can check that the power method applied to the matrix \(J\) gives the same answer as before:
N = L.shape[0]
J = np.zeros((2*N, 2*N))
J[:N, N:] = L
J[N:, :N] = L.transpose()
h, a = np.split(power(J, max_iter=1000)[0], 2)
h /= np.linalg.norm(h)
a /= np.linalg.norm(a)
print "Hub scores: ", h
# Hub scores: [ 0.32505758 0.32505758 0.88807383 0. ]
print "Authority scores: ", a
# Authority scores: [ 0.45970084 0.62796303 0. 0.62796303]
Actually, it gives a warning about convergence. The reason is that the vector built by stacking together \(h\) and \(a\) oscillates between the following two vectors:
\[\begin{split}\begin{matrix} (0.200 & 0.200 & 0.545 & 0 & | & 0.363 & 0.496 & 0 & 0.496) \\ (0.257 & 0.257 & 0.701 & 0 & | & 0.282 & 0.386 & 0 & 0.386) \end{matrix}\end{split}\]
However, each half of the above vectors is equal up to a scaling constant, and so the method converges to the correct scores.
OPIC is just an alternative to the power method to compute the largest eigenvector and so it can be applied to compute the HITS score. It is described here:
Adaptive On-Line Page Importance Computation
Abiteboul S., Preda M., Cobena G.
2003
Given a page graph, the idea is to maintain two quantities for each page: the cash and the history. When we update a page, its cash is distributed to its children and the total quantity is added to the page history. For example, returning to our toy example, let's suppose that at a given instant in time we have the following state of cash and history associated with each page:
[Figure: cash and history state of the example graph before the update]
And we update page 3. Since it has 6 units of cash, it will distribute 2 units to each one of its 3 children. It will also increment its history by 6 units, since this is the total amount of cash it has distributed, and its cash will be reset to 0. The following image shows the change in the graph state after a single update.
If we keep performing these updates, making sure that all pages get updated, the history of the pages will converge, after normalization, to the same value as the power method applied to the following matrix:
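For our toy example (dividing each column of \(L^T\) by the corresponding page's out-degree, a reconstruction consistent with the description in the next sentence) this matrix is:
\[\bar{L} = \begin{bmatrix} 0 & 0 & \tfrac{1}{3} & 0 \\ 1 & 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & \tfrac{1}{3} & 0 \end{bmatrix}\]
(note that the column for page 4 is zero, since page 4 has no outgoing links).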
You can see that \(\bar{L}\) is very similar to \(L^T\): it is actually the same matrix transposed, where each column has been divided by a normalizing factor so as to make each column add to 1.
The utility of the OPIC method lies in the fact that it does not matter in which order we update the pages, nor does it matter if we update some pages more often than others, as long as all pages are updated periodically.
However, we must remember that the power method sometimes fails to converge, and we solved that by adding a small perturbation to the link matrix. We make an equivalent modification to the OPIC algorithm by adding a small modification to the graph: we add a virtual page that links to, and is linked by, all other pages. For our toy problem, each page gains a link to and from a new virtual page V (the code below builds exactly this modified graph).
Let’s check our claims with some code. The following code will apply the OPIC algorithm to an arbitrary graph. For simplicity the graph is represented as a dictionary where the key is the node name and the value is a list of outgoing nodes.
def add_virtual_page(graph):
new_graph = {node: children + ['V']
for node, children in graph.iteritems()}
new_graph['V'] = graph.keys()
return new_graph
example = add_virtual_page(
{1: [2],
2: [4],
3: [1, 2, 4],
4: []}
)
We now implement the algorithm, where the next page to update is randomly selected.
import random
def opic(graph, n_iter=100):
"""Simple implementation of OPIC where all children get the
same cash"""
# get the list of nodes
nodes = graph.keys()
# build a dictionary to hold cash and history
# we give each page an initial cash of 1
state = {node: (1.0, 0.0) for node in nodes}
for i in xrange(n_iter):
node = random.choice(nodes)
cash, history = state[node]
children = graph[node]
# distribute cash to children
n_children = len(children)
for child in children:
child_cash, child_history = state[child]
state[child] = (child_cash + cash/n_children,
child_history)
# update node state
state[node] = (0.0, history + cash)
# return normalized history
history = [state[n][1] for n in nodes]
total = sum(history)
return zip(nodes, [h/total for h in history])
To compare against the power method we build the equivalent link matrix, where the last row/column corresponds to the virtual page. First without normalization:
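\[L = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{bmatrix}\]
(this is the same matrix built with numpy in the code below).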
And after transposition and normalization:
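(reconstructed under the assumption that each column of the transposed matrix is divided by the corresponding page's out-degree, so every column sums to 1 as described earlier)
\[\bar{L} = \begin{bmatrix} 0 & 0 & \tfrac{1}{4} & 0 & \tfrac{1}{4} \\ \tfrac{1}{2} & 0 & \tfrac{1}{4} & 0 & \tfrac{1}{4} \\ 0 & 0 & 0 & 0 & \tfrac{1}{4} \\ 0 & \tfrac{1}{2} & \tfrac{1}{4} & 0 & \tfrac{1}{4} \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{4} & 1 & 0 \end{bmatrix}\]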
We can now compare the two methods:
for node, score in opic(example, n_iter=1000):
print node, score
# 1 0.198500025862
# 2 0.298524016644
# 3 0.157148867939
# 4 0.345827089555
L = np.array([
0, 1, 0, 0, 1,
0, 0, 0, 1, 1,
1, 1, 0, 1, 1,
0, 0, 0, 0, 1,
1, 1, 1, 1, 0]).astype(float).reshape(5, 5)
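# make_stochastic is a helper (not shown in this excerpt) that normalizes
# the transposed link matrix as described above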
p, k = power(make_stochastic(L.transpose()))
p = p[:4]
p /= np.linalg.norm(p, ord=1)
print p
# [ 0.19801978 0.29702981 0.15841572 0.34653469]
So, to summarize: if we have a matrix \(M\) where each row sums to 1, we can compute its largest eigenvector either by using the power method on that matrix or by applying the OPIC algorithm to an associated graph which has an edge from \(i\) to \(j\) if \(M_{ji} \neq 0\). When node \(i\), with cash \(c_i\) inside the node, is updated it will give \(M_{ji}\cdot c_i\) cash to node \(j\).
Notice that we really cannot apply OPIC to HITS without modifying the problem, since we have the requirement of all columns (or rows, depending on what you consider) summing to 1. This is a requirement for the OPIC algorithm to work, since it ensures that the amount of cash inside the graph remains constant. Otherwise the cash would go to 0 or to \(\infty\).
We will now use 4 scores for each page: hub cash, hub history, authority cash and authority history. When we update a page, it will distribute its hub cash among all its predecessors' authority cash and add the total to its hub history. It will also distribute its authority cash among all its children's hub cash and add the total to its authority history.
The following figure shows a single step of the algorithm applied to our toy problem, where we have ignored the virtual page to simplify things. Actually, it is not necessary to update both scores in the same step; however, we show both the hub and authority cash flows here.
Notice that with this algorithm what remains constant after each step is the total amount of cash, that is, the sum of the hub and authority cash.
If you want to see how OPIC compares against the power method, the following analysis shows a comparison between the two. We analyze a real crawl, obtained by running the example in the crawl-frontier repository (it may take some time):
crawl-frontier/examples/scrapy_frontier$ scrapy crawl example -s MAX_REQUESTS=2000
2015-01-30 22:32:33+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapy_frontier)
.
.
.
2015-01-30 22:46:46+0100 [example] INFO: Spider closed (finished)
After that you should have a directory crawl-opic populated with several databases:
crawl-frontier/examples/scrapy_frontier/crawl-opic$ ls -1sh
total 71M
540K freqs.sqlite
45M graph.sqlite
156K hash.sqlite
7,9M hits.sqlite
8,3M pages.sqlite
4,6M scheduler.sqlite
4,9M updates.sqlite
For a description of what the above databases are about see Persistence. In particular, two databases in the above list give us a snapshot of the OPIC algorithm state: graph and hits.
The following script even allows us to run the OPIC algorithm without the crawler running and see how the OPIC score converges to the score computed using the power method, therefore validating the correctness of our implementation.
You can also download the script.
#!/usr/bin/env python
"""
This script makes a plot comparing the hub/authority score of the OPIC
algorithm against an equivalent power method.
Requirements:
- numpy
- scipy
- matplotlib
- crawlfrontier
"""
usage = """Compares the precision of the Power Method against OPIC computing
the HITS score.
python opic-precision.py backend_opic_workdir
The main output is a plot of the page scores using both methods
"""
import sys
import os
import matplotlib.pylab as plt
import numpy as np
import scipy.sparse
from crawlfrontier.contrib.backends.opic.opichits import OpicHits
import crawlfrontier.contrib.backends.opic.graphdb as graphdb
import crawlfrontier.contrib.backends.opic.hitsdb as hitsdb
def graph_to_hits_matrix(db):
"""
Returns a tuple with:
i2n: mapping between row/column number and node identifier
H : hub matrix
A : authority matrix
h : virtual page hub vector
a : virtual page authority vector
"""
i2n = [n for n in db.inodes()]
n2i = {n: i for i, n in enumerate(i2n)}
N = len(i2n)
H = scipy.sparse.dok_matrix((N, N), dtype=float)
A = scipy.sparse.dok_matrix((N, N), dtype=float)
h = np.zeros((N,))
a = np.zeros((N,))
for node in i2n:
i = n2i[node]
succ = db.successors(node)
for s in succ:
w = 1.0/(len(succ) + 1)
A[n2i[s], i] = w
a[i] = w
pred = db.predecessors(node)
for p in pred:
w = 1.0/(len(pred) + 1)
H[n2i[p], i] = w
h[i] = w
return (i2n, H, A, h, a)
def hits_pm(H, A, h, a, err_max=1e-4, iter_max=500, verbose=False):
"""Return hub and authority scores.
Parameters:
H : hub matrix
A : authority matrix
h : virtual page hub vector
a : virtual page authority vector
"""
# Number of pages
N = H.shape[0]
# Hub score
x = np.random.rand(N)
x /= np.sum(x)
# Authority score
y = np.random.rand(N)
y /= np.sum(y)
vh = 0.0
va = 0.0
i = 0
while True:
xn = H.dot(y) + va/N
yn = A.dot(x) + vh/N
vh = h.dot(y)
va = a.dot(x)
sa = np.sum(yn)
sh = np.sum(xn)
va /= sa
yn /= sa
vh /= sh
xn /= sh
eps = max(
np.linalg.norm(xn - x, ord=np.inf),
np.linalg.norm(yn - y, ord=np.inf)
)
x = xn
y = yn
if verbose:
print "iter={0:06d} eps={1:e}".format(i, eps)
if eps < err_max:
break
i += 1
if i > iter_max:
break
return (x, y)
def precision_crawl(workdir):
# Load databases inside workdir
opic1 = OpicHits(
db_graph=graphdb.SQLite(os.path.join(workdir, 'graph.sqlite')),
db_scores=hitsdb.SQLite(os.path.join(workdir, 'hits.sqlite'))
)
print "Converting crawled graph to sparse matrix... ",
i2n, H, A, h, a = graph_to_hits_matrix(opic1._graph)
print "done"
print "Computing HITS scores using power method... ",
# h1: hub score, power method
# a1: authority score, power method
h_pm, a_pm = hits_pm(H, A, h, a, verbose=False)
print "done"
# Estimate error of OPIC
# ----------------------------------------------
# Matched against the same pages as power method.
# h2: hub score, OPIC
# a2: authority score, OPIC
h_iter_1, a_iter_1 = zip(*[opic1.get_scores(page_id)
for page_id in i2n])
h_iter_2 = H.dot(A.dot(h_iter_1))
h_iter_2 /= np.sum(h_iter_2)
# To compute the error of opic authority score
a_iter_2 = A.dot(H.dot(a_iter_1))
a_iter_2 /= np.sum(a_iter_2)
print "Error of OPIC algorithm (L^inf metric):"
print " Hub score : ", \
np.linalg.norm(h_iter_2 - h_iter_1, ord=np.inf)
print " Authority score: ", \
np.linalg.norm(a_iter_2 - a_iter_1, ord=np.inf)
# Compare OPIC against PM
# ----------------------------------------------
# h_dist_pm: ordered from lowest to highest hub scores for PM
# h_opic : OPIC hub scores following PM page order
h_dist_pm, h_pm_ids = zip(*sorted(zip(h_pm, i2n)))
h_opic = [opic1.get_scores(page_id)[0]
for page_id in h_pm_ids]
# a_dist_pm: ordered from lowest to highest authority scores for PM
# a_opic : OPIC authority scores following PM page order
a_dist_pm, a_pm_ids = zip(*sorted(zip(a_pm, i2n)))
a_opic = [opic1.get_scores(page_id)[1]
for page_id in a_pm_ids]
h_dist_opic = sorted(h_opic)
a_dist_opic = sorted(a_opic)
# Improve OPIC scores
# ----------------------------------------------
print "Additional opic iterations"
opic2 = OpicHits(db_graph=opic1._graph, db_scores=None)
for i in xrange(10):
opic2.update(n_iter=1000)
print " ", (i+1)*1000
h_opic_improved = [opic2.get_scores(page_id)[0] for page_id in h_pm_ids]
a_opic_improved = [opic2.get_scores(page_id)[1] for page_id in a_pm_ids]
h_dist_opic_improved = sorted(h_opic_improved)
a_dist_opic_improved = sorted(a_opic_improved)
# Plot figure
# ----------------------------------------------
fig = plt.figure()
fig.suptitle('Power method vs OPIC')
# Hub score PM vs OPIC
ax1 = plt.subplot(221)
plt.hold(True)
ax1.set_ylabel('Hub score')
ax1.set_title('Power method vs OPIC')
p1, = plt.plot(h_dist_pm, 'r-')
p2, = plt.plot(h_dist_opic, 'b-')
p3, = plt.plot(h_opic, 'b.')
ax1.legend(
[p1, p2, p3],
['Power method', 'OPIC (dist)', 'OPIC'],
loc='upper left'
)
# Authority score PM vs OPIC
ax2 = plt.subplot(223, sharex=ax1)
plt.hold(True)
ax2.set_ylabel('Authority score')
ax2.set_xlabel('Page')
p1, = plt.plot(a_dist_pm, 'r-')
p2, = plt.plot(a_dist_opic, 'b-')
p3, = plt.plot(a_opic, 'b.')
# Hub score PM vs OPIC improved
ax3 = plt.subplot(222, sharex=ax1, sharey=ax1)
plt.hold(True)
ax3.set_title('Power method vs OPIC improved')
p1, = plt.plot(h_dist_pm, 'r-')
p2, = plt.plot(h_dist_opic_improved, 'b-')
p3, = plt.plot(h_opic_improved, 'b.')
# Authority score PM vs OPIC improved
ax4 = plt.subplot(224, sharex=ax1, sharey=ax2)
plt.hold(True)
ax4.set_xlabel('Page')
p1, = plt.plot(a_dist_pm, 'r-')
p2, = plt.plot(a_dist_opic_improved, 'b-')
p3, = plt.plot(a_opic_improved, 'b.')
return fig
if __name__ == '__main__':
if len(sys.argv) != 2:
sys.exit(usage)
fig = precision_crawl(sys.argv[1])
plt.show()
We can run the above script:
crawl-frontier/examples/scrapy_frontier$ python opic-precision.py crawl-opic/
Converting crawled graph to sparse matrix... done
Computing HITS scores using power method... done
Error of OPIC algorithm (L^inf metric):
Hub score : 0.00589243418972
Authority score: 0.00451756698138
Additional opic iterations
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
And at the end it will generate a plot comparing the hub and authority scores as computed by the power method and OPIC. We include a zoom of the plot near the highest scored pages, which is the most interesting region, and discuss the meaning of the plot below.
[Figure: hub and authority scores, power method vs OPIC, zoomed near the highest scored pages]
first notice that we requested 2000 pages but if you look at the x-axis of the figure there are more than 40000 pages. This is because we not only show crawled pages, but also uncrawled links, which constitute most of the graph database.
There are two rows in the above plot. The first row shows hub scores and the second row shows authority scores. The first column shows a comparison of the power method vs the OPIC scores, just as they were in the database after the crawl finished. The second column shows the same result for the power method, but with the OPIC scores improved by performing additional iterations of the algorithm without crawling additional pages.
Each subplot has three lines. The red line shows the power method score for each page. The continuous blue line shows the OPIC score for each page, but without having the pages in the same ordering as the power method; this line is useful to see how the OPIC scores are distributed, but we cannot compare the red and blue lines directly. The dots show the OPIC score for each page, this time with the same page ordering as the power method. We can use the three lines to get an idea of the differences between the scores computed by OPIC and the power method. The following figure explains how:
With this in mind we can see in the plot of OPIC vs power method that:
- After the crawl is finished there is quite a difference in the scores, although there is a good correlation between OPIC and the power method. This means that both methods will likely agree that a page is in the top 5% of pages, but the exact score or rank within that margin could be very different.
- If we continue iterating the OPIC algorithm after the crawl is finished, the scores of the two methods converge. This is mainly because the OPIC algorithm plays with unfair rules: it computes the score while the graph is growing, while the power method was run with the graph frozen. This shows that if we run both methods under equal conditions they get the same results. That’s why we need the adaptive version of the OPIC algorithm; however, tuning the time window parameter is difficult. Another solution is to iterate the OPIC algorithm more frequently when the graph changes, but this consumes a lot of CPU.
The adaptive version of the OPIC algorithm takes into account the case where the graph is changing. In this case the pages that were inside the graph from the beginning have accumulated more history than the newer ones, even if the latter are more important. Of course, if we freeze the graph this imbalance will be corrected in the long term, but it could take a lot of time.
One possible solution would be to discard all the page history and start again, but this is overkill since we are probably already near the correct solution. The adaptive version of OPIC proposes a middle ground: discard the history before a certain time, counting back from now. To do this exactly we would need to save the history of each page at every point in time. A more efficient alternative is to estimate how much history we should discard. This is called history interpolation in the original paper and is the approach taken in this implementation.
Notice that the adaptive version has introduced a new parameter: a time window outside of which we should ignore history.
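As a rough illustration of the interpolation idea (a sketch only, not the backend’s actual code), the history accumulated by a page can be scaled down to the part estimated to fall inside the time window, assuming cash was accumulated at a uniform rate:
def interpolate_history(history, elapsed, time_window):
    # `history` is the cash accumulated by a page since it was first observed,
    # `elapsed` is the time since that first observation and `time_window` is
    # the adaptive OPIC parameter. Assuming a uniform accumulation rate, keep
    # only the fraction of the history estimated to fall inside the window.
    if not time_window or elapsed <= time_window:
        return history
    return history * (time_window / float(elapsed))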
Although one of the nice properties of HITS scoring is that it only requires link information, it has been demonstrated that adding content-based information can greatly improve the effectiveness of topic-focused crawlers. There are many ways in which we could combine an external relevance score with the HITS scores to give a global score of page relevance; we now describe the one developed in this case.
We will consider the external relevance score to be another measure of page authority (a good hub can be empty of content, as long as it points to relevant ones), ranging between 0 (not relevant at all) and 1 (completely relevant).
With this in mind we modify the distribution of authority cash when we update a page. Let’s call \(r_i\) the relevance score for page \(i\), which has \(P_i\) parent pages pointing to it. If page \(i\) is relevant we want to distribute all its authority cash back to its parents, but if it’s not relevant we want to avoid distributing any cash back. Since we don’t want to destroy cash what we will do instead is to give that cash to the virtual page.
Mathematically, if we have \(a_i\) authority cash:
- Give \(a_i\cdot z(r_i)\) cash to each parent page.
- Give \(a_i[1 - P_i\cdot z(r_i)]\) cash to the virtual page.
Where \(z\) is a function of the page relevance and must satisfy the following conditions:
Notice that the third condition is equivalent to saying that when the relevance is 0.5 the algorithm is equivalent to OPIC without external relevance score.
One possible function \(z(r)\) can be built by considering the second-order polynomial fitting all the above points:
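To make the cash-distribution rule above concrete, here is a minimal sketch; \(z\) is passed in as a plain Python function, since its exact expression is not reproduced here, and the cash bookkeeping is a hypothetical dict rather than the backend’s data structures:
def distribute_authority_cash(a_i, parents, r_i, z, cash):
    # Give a_i * z(r_i) cash to each parent page...
    z_r = z(r_i)
    for parent in parents:
        cash[parent] = cash.get(parent, 0.0) + a_i * z_r
    # ...and give the remaining a_i * (1 - P_i * z(r_i)) cash to the virtual
    # page, so that no cash is created or destroyed.
    cash['__virtual__'] = cash.get('__virtual__', 0.0) + a_i * (1.0 - len(parents) * z_r)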
The algorithm described in the previous sections is implemented inside this class:
- class crawlfrontier.contrib.backends.opic.opichits.OpicHits(db_graph=None, db_scores=None, time_window=None, db_relevance=None)¶
Implements the OPIC algorithm applied to the HITS scores problem
Parameters: - db_graph (GraphInterface) – Read only graph database. If None create a new one using SQLite
- db_scores (HitsDBInterface) – Scores database. If None create a new one using SQLite
- time_window (float) – Ignore cash flow out of this time window. Set to False/None to ignore.
- db_relevance (RelevanceDBInterface) – Read only relevance database. If None create a new one using SQLite
- a_mean¶
Mean of authority scores
- add_page(page_id)¶
Add a new page
Parameters: page_id (str) – Page identification Returns: HitsScore – The new score assigned to the page
- close()¶
Close any associated database
- get_scores(page_id)¶
Normalized hub and authority score
Parameters: page_id (str) – Page identification Returns: (float, float) – A tuple (hub score, authority score) for the given page_id
- h_mean¶
Mean of hub scores
- iscores()¶
Iterate over (page id, hub score, authority score)
- mark_update(page_id)¶
Add this to the list of pages to update
To decide which pages we should update next we use a heuristic: we select the pages with the highest accumulated authority or hub cash. This function makes it possible to externally add a given page to the set of pages to be updated, irrespective of its accumulated cash.
Parameters: page_id (str) – Page identification
- update(n_iter=1)¶
Run a full iteration of the OPIC-HITS algorithm
Parameters: n_iter (int) – number of iterations Returns: pair of lists – The first one contains pages with an updated hub score and the second one contains pages with an updated authority score.
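Before looking at the databases it depends on, here is a hedged usage sketch of the class above, built on a toy two-page graph and freshly created SQLite databases (the real backend wires these pieces together for you, so take it only as an illustration of the API):
from crawlfrontier.contrib.backends.opic import graphdb
from crawlfrontier.contrib.backends.opic.opichits import OpicHits

# A tiny read-only graph: page A links to page B
graph = graphdb.SQLite()        # new SQLite graph database (no file given)
graph.add_node('A')
graph.add_node('B')
graph.add_edge('A', 'B')

# OPIC-HITS state on top of the graph; let it create its own scores database
opic = OpicHits(db_graph=graph, db_scores=None)
opic.add_page('A')
opic.add_page('B')

# Run some iterations and read back the normalized scores
opic.update(n_iter=100)
for page_id in ('A', 'B'):
    hub, authority = opic.get_scores(page_id)
    print page_id, hub, authority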
As you can see it has one optional parameter, the time window of the adaptive version, and 3 external database components:
To facilitate adding different database backends, their interfaces have been extracted into abstract base classes.
Every class implementing the following interface can be a drop-in replacement for the included SQLite backend. It is not a requirement to inherit from this class, but it is recommended so that missing methods are detected early.
Notice that this database is only accessed from opichits.OpicHits via read operations.
- class crawlfrontier.contrib.backends.opic.graphdb.GraphInterface¶
Interface definition for a Graph database
The graph does not store any additional data inside the nodes or edges, only the link structure. All the functions below represent nodes as unique identifier strings and edges as pairs of nodes (Python tuples).
- add_edge(start, end)¶
Add a new edge to the graph, from start to end
- add_node(node)¶
Add a new node
- clear()¶
Delete all contents
- close()¶
Close connection and commit changes
- delete_edge(node)¶
Delete edge
- delete_node(node)¶
Delete node, and all edges connecting to this node
- end_batch()¶
Commit pending changes
- has_node(node)¶
True if node present inside graph
- iedges()¶
An iterator for all the edges
- inodes()¶
An iterator for all the nodes
- nedges()¶
Number of edges
- neighbours(node)¶
A set of nodes going to or from ‘node’
- nnodes()¶
Number of nodes
- predecessors(node)¶
A list of the predecessors for the given node
- start_batch()¶
Do not commit changes to the graph until this batch ends
- successors(node)¶
A list of the successors for the given node
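To make the contract concrete, a minimal in-memory sketch of this interface could look like the following (for illustration only; batching and closing are no-ops because nothing is persisted, and delete_edge takes a (start, end) pair here even though the listing above shows a single argument):
class MemoryGraph(object):
    # Illustrative in-memory drop-in for the graph database
    def __init__(self):
        self._nodes = set()
        self._edges = set()

    def add_node(self, node):
        self._nodes.add(node)

    def add_edge(self, start, end):
        self._edges.add((start, end))

    def delete_node(self, node):
        self._nodes.discard(node)
        self._edges = set(e for e in self._edges if node not in e)

    def delete_edge(self, start, end):
        self._edges.discard((start, end))

    def has_node(self, node):
        return node in self._nodes

    def inodes(self):
        return iter(self._nodes)

    def iedges(self):
        return iter(self._edges)

    def nnodes(self):
        return len(self._nodes)

    def nedges(self):
        return len(self._edges)

    def successors(self, node):
        return [end for start, end in self._edges if start == node]

    def predecessors(self, node):
        return [start for start, end in self._edges if end == node]

    def neighbours(self, node):
        return set(self.successors(node)) | set(self.predecessors(node))

    def clear(self):
        self._nodes.clear()
        self._edges.clear()

    def start_batch(self):
        pass  # nothing to batch in memory

    def end_batch(self):
        pass  # nothing to commit in memory

    def close(self):
        pass  # nothing to close in memory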
The following class is a simple implementation of the above interface using SQLite. It maintains a table with edges and a table with nodes, fully indexed for performance.
- class crawlfrontier.contrib.backends.opic.graphdb.SQLite(db=None)¶
SQLite implementation of GraphInterface
If this database is provided, opichits.OpicHits will consult the score associated with each page to decide how to distribute the authority cash.
Notice that this database is only accessed from opichits.OpicHits via read operations.
As usual, there is an interface which all backends must satisfy:
- class crawlfrontier.contrib.backends.opic.relevancedb.RelevanceDBInterface¶
Interface definition for relevance databases.
Page ID’s are unique string identifiers for pages and page scores are float numbers between 0 and 1.
- add(page_id, page_score)¶
Add a new association
- clear()¶
Delete all contents
- delete(page_id)¶
Delete page
- get(page_id)¶
Get score for the given page
- get_best_scores(n)¶
Get the n highest scores as a list of tuples of type (page_id, page_score)
- iscores()¶
An iterator over all tuples (page_id, page_score)
- set(page_id, page_score)¶
Change score
And there is also a batteries included SQLite implementation of the interface:
- class crawlfrontier.contrib.backends.opic.relevancedb.SQLite(db=None)¶
A SQLite implementation for RelevanceDBInterface
The HITS database contains the state of the OPIC algorithm. Since there are several values to maintain for each page a new class has been created for storing them:
- class crawlfrontier.contrib.backends.opic.hitsdb.HitsScore(h_history, h_cash, h_last, a_history, a_cash, a_last)¶
Just a container for the following (modifiable) fields:
Parameters: - h_history – Accumulated hub cash
- h_cash – Non-spent hub cash
- h_last – Total hub cash the last time it was updated
- a_history – Accumulated authority cash
- a_cash – Non-spent authority cash
- a_last – Total authority cash the last time it was updated
First, the interface.
- class crawlfrontier.contrib.backends.opic.hitsdb.HitsDBInterface¶
Interface definition for every HITS database.
The keys of this database are page id’s, unique string identifiers for each page, and the values are HitsScore instances.
- add(page_id, page_score)¶
Associate page_score with page_id
- clear()¶
Delete all contents
- close()¶
Close connection and commit changes
- delete(page_id)¶
Delete association
- get_a_total()¶
Total accumulated authority score
- get_count()¶
Get number of scores in database
- get_h_total()¶
Total accumulated hub score
- get_highest_a_cash(n=1)¶
Get the highest authority cash
- get_highest_h_cash(n=1)¶
Get the highest hub cash
- increase_a_cash(page_id_list, a_cash)¶
Increase the cash in the given pages in this amount
- increase_all_cash(h_cash, a_cash)¶
Increase the cash in all pages in this amount
- increase_h_cash(page_id_list, h_cash)¶
Increase the cash in the given pages in this amount
- iteritems()¶
Iterate over hits scores
Returns: iterator over pairs – Each pair has the form (page_id, page_score)
And of course, there is an already provided SQLite implementation:
- class crawlfrontier.contrib.backends.opic.hitsdb.SQLite(db=None)¶
SQLite based implementation for HitsDBInterface
Make a new connection to a HITS scores database or, if None is provided, create a new in-memory one
Having a measure of page importance is only the first step. Now we need to consider which page we should crawl next. In order for the scheduler to make that decision it must have some information about the pages. We will assume that it has two items of information about each page: the change rate of the page and a measure of the page value. With this information it must give back to us the next pages to crawl. With these considerations we define the interface that all schedulers must satisfy:
- class crawlfrontier.contrib.backends.opic.scheduler.SchedulerInterface¶
Interface that must be satisfied by all schedulers
- close()¶
Close all databases
- delete(page_id)¶
Remove page from scheduler
- get_next_pages(n_pages)¶
Return next pages to crawl
- set_rate(page_id, rate_new)¶
Set change rate for given page
- set_value(page_id, value_new)¶
Set page value for given page
One possible strategy for a scheduler is just to be greedy: visit the next page in the frontier that gives the most value. This type of crawler is implemented in:
- class crawlfrontier.contrib.backends.opic.scheduler.BestFirst(rate_value_db=None)¶
A BestFirst crawler always returns the next page with the highest value.
To be truly best-first, get_next_pages should be called with n_pages=1. However, the crawler runs fine if it’s asked for the next best n_pages.
Parameters: rate_value_db (SchedulerDBInterface) – database to store (page_id, rate, value) triplets
BestFirst as is has one obvious limitation: it doesn’t revisit web pages, since once a page is crawled it is deleted from the frontier. We could modify it by adding an estimation of each web page’s change rate and then reconsidering a page after some time has passed, depending on that rate.
Another possibility is to use a scheduler which is not greedy, but optimal in the sense that it maximizes the long-term effectiveness of the crawler if some assumptions about page change rates hold. This scheduler is implemented inside:
- class crawlfrontier.contrib.backends.opic.scheduler.Optimal(n_clusters=100, rate_value_db=None, freq_db=None)¶
Compute the optimal refresh frequency, with a fixed computational cost
Parameters: - n_clusters (int) – number of clusters to use for the approximation
- rate_value_db (SchedulerDBInterface) – database to store (page_id, rate, value) triplets
- freq_db (FreqDBInterface) – database to store the scheduler solution
- crawl_rate¶
Get crawl rate
- frequency(page_id)¶
Get the optimal refresh frequency for a page
As you can see there are two databases associated with the scheduler:
The scheduler database is just for maintaining an association between pages and page change rates and values. As usual it can be anything that adheres to the following interface:
- class crawlfrontier.contrib.backends.opic.schedulerdb.SchedulerDBInterface¶
- add(page_id, page_rate, page_value)¶
Add a new association
- clear()¶
Delete all contents
- get(page_id)¶
Get (page_rate, page_value) for the given page
- get_best_value(n_pages=1, delete=False)¶
Get the pages with highest value
Parameters: - n_pages (int) – number of pages to retrieve
- delete (bool) – if True remove the retrieved pages from the database
- iter()¶
An iterator over all tuples (rate, value)
- set(page_id, page_rate, page_value)¶
Change association
Of course there is an already provided SQLite implementation:
- class crawlfrontier.contrib.backends.opic.schedulerdb.SQLite(db=None)¶
A SQLite implementation for the SchedulerDBInterface
The “freqs” database associates pages with the optimal frequencies computed by the scheduler. However, it not only stores the association but also allows querying which page should be crawled next.
The interface is defined in:
- class crawlfrontier.contrib.backends.opic.freqdb.FreqDBInterface¶
Interface definition for every FreqDB database
- add(page_id, page_freq)¶
Associate page_freq with page_id, where:
Parameters: - page_id (str) – a string which identifies the page
- page_freq (float) – refresh frequency of the page (Hz)
- clear()¶
Delete all contents
- close()¶
Close connection and commit changes
- delete(page_id)¶
Delete association
- get_next_pages(n=1)¶
Return the next pages, to maintain the desired frequency
- set(page_id, page_freq)¶
Change the frequency associated with page_id
And the SQLite implementation:
- class crawlfrontier.contrib.backends.opic.freqdb.SQLite(db=None)¶
SQLite based implementation for FreqDBInterface
Make a new connection to a frequency database or, if None is provided, create a new in-memory one
The justification of the optimal scheduler algorithm is quite lengthy and so is not included here. If you are interested in the details, it is described here:
This algorithm is based mainly on the following paper:
Effective Page Refresh Policies For Web Crawlers
Junghoo C. and Garcia-Molina H.
I will keep the notation similar to the above paper, and I will summarize some results to make it easier to follow.
The problem is: we have several web pages, which have possibly changed since the last time we crawled them. We want to know the optimal refreshing rate for them, for some meaning of optimal. Obviously, the refreshing rate cannot be unbounded, since we have limited bandwidth and computer resources.
First, we repeat some concepts from the original paper.
- \(I_i\): time interval between consecutive refreshes for page \(i\), this is the unknown we want to calculate.
- \(f_i\): refreshing frequency, just \(f_i=1/I_i\)
- \(\lambda_i\): change rate for page \(i\). We make the assumption that change rates for web pages follow a Poisson process. It seems this is a very reasonable assumption.
- \(N\): total number of pages we want to refresh in a time interval \(I\)
- \(\bar{f}\): mean refreshing rate: \(\bar{f}=1/I\)
We can now state our refreshing rate constraint: during a time interval \(I\) we refresh page \(i\) a total of \(I/I_i\) times. It must be:
\[\sum_{i=1}^{N} \frac{I}{I_i} = N\]
Or equivalently, by simple manipulation:
\[\frac{1}{N}\sum_{i=1}^{N} f_i = \bar{f}\]
We now define the value of a new page refresh \(V_i\), which is our first step towards a global value function, which we want to maximize. Notice that this target function is different from the original paper, which considered two different ones: freshness and age. We will however follow the same process, but for this different objective.
Of course, we don’t know the value \(V_i\) until we refresh the page, but we can compute its expected value. Remember we said that page changes follow a Poisson process, so if \(X\) is the number of page modifications since the last time we refreshed:
\[P(X = k) = \frac{(\lambda_i t)^k e^{-\lambda_i t}}{k!}\]
Where \(t\) is the amount of time that has passed since the last refresh.
The expected value of a refresh, is then:
We could finish now and say that our strategy is to crawl, at each instant \(t\), the page that maximizes the expected refresh value \(E[V_i(t)]\). This, however, has two problems:
- Even if a page has the maximum expected value at a given time, it could be better to wait and refresh it later, when it may have an even higher expected value.
- The expected value is a function of time, so we would need to constantly recompute the expected value of all our candidate pages. Consider the following figure, which shows three possible curves of \(E[V_i(t)]\) for three different pages as a function of time. Notice how the page with the maximum expected value changes as time passes.
The problem of deciding which pages to refresh in the next period of time, so as to maximize the value gained in that period, is very difficult, but we can simplify it by letting that period of time go to infinity, meaning that we search for a strategy that maximizes the value gained over all future time. In this case we are looking for a stationary solution, and our unknowns are the frequencies at which we refresh each page.
If we refresh page \(i\) with frequency \(f_i\) the expected value for each refresh is:
And the value per unit of time we extract from the page is:
Our target function is the total value per unit of time:
And the optimization problem we want to solve is:
And for \(i=1\dotsc N\) we have the constraints:
Let’s have a look at the shape of our problem. The following figure shows the behavior of the function \(\bar{T}\) (with \(w=1\)):
The following figure shows the constraints for a problem with just two pages:
We can graphically solve the problem for a particular instance if we restrict ourselves to two pages. For example, consider the following weights and refresh rates for two pages:
If we plot the contour curves for \(\bar{T}\) and the constraints we see that there are two kinds of solutions, depending on whether the maximum happens in the interior or on the boundaries (\(f_1 = 0\) or \(f_2=0\)):
If we cut the surface \(\bar{T}\) with the plane defined by the constraint \((f_1 + f_2)/2=\bar{f}\) we transform the problem into a maximization problem in just one unknown:
We see that for low values of \(\bar{f}\) it’s better not to refresh the first page at all. However, as the available frequency increases it starts to make sense to refresh all pages. The following figure shows the refresh frequencies for the two pages as a function of the available frequency \(\bar{f}\). It’s interesting to note that although initially the first page is not refreshed at all, past a certain value of \(\bar{f}\) it is refreshed more frequently than the other one.
Introducing the Lagrange multipliers \(\alpha\) and \(\beta=(\beta_i, i=1\dotsc N)\) we build the Lagrangian:
A necessary condition for a solution to be a local optimum requires that the gradient of the Lagrangian vanishes (Karush–Kuhn–Tucker conditions):
Since \(\bar{T}\) is a linear combination of the \(\bar{T}_i\), and each \(\bar{T}_i\) is concave, \(\bar{T}\) is concave. Since our constraints are linear, the whole problem is concave and any local maximum is also a global maximum. Furthermore, there are no local minima, so every solution that satisfies the KKT conditions is a global maximum.
We now try to solve the KKT equations directly. Calling \(h = \nabla L\), we must make \(h = 0\) to find an equilibrium point. There are therefore \(2N + 1\) equations to solve:
Calling \(x=(f, \beta, \alpha)\), the Newton-Raphson method makes the following approximation of the function in the neighborhood of the point \(x_k\):
From the above, we have:
Solving for \(\Delta x\) is not difficult, since our equations are loosely coupled:
So we have:
And we can solve easily the above system of equations. For \(i =1 \dotsc N\) compute:
Compute then:
And then again for \(i = 1 \dotsc N\):
Notice that once \(f_i=0\) or \(\beta_i=0\) the algorithm gets stuck at that value. The problem with this is that as new pages are added, or already present pages change (in \(w_i\) or \(\lambda_i\)), \(L\) changes. It would be very convenient to start searching for a new solution from the current one, since it’s an iterative procedure; however, we cannot do that, because once a page has its refresh frequency set to zero, it cannot change.
Since our problem is concave, then we know that the solution to our original problem is also the solution to the unconstrained problem:
In more intuitive terms, the solution to the constrained maximization problem is given by a saddle point of the Lagrangian.
We could just use a gradient descent:
However we get the same problem of getting stuck when reaching \(\beta_i=0\). Notice that the update equation is:
We drop the inequality multipliers from the Lagrangian and impose these constraints directly while searching for a new point.
Our new Lagrangian is therefore:
For a fixed value of \(\alpha\) we first solve:
Subject to:
Notice that:
And we have that:
This means that:
Notice also that the Lagrangian can be decomposed in:
The maximization of \(L\) relative to \(f\) is equivalent to the maximization of each \(L_i\) relative to \(f_i\).
We can see that the maximization of \(L_i\) relative to \(f_i\) has the following solutions, depending on the value of \(\alpha\):
Otherwise we solve:
The above equation gives a solution of the form:
We can pre-compute the \(g(x)\) function for \(-1 < x < 0\) with:
Using the above we get the following figure:
We have now an efficient method to compute:
We now need to solve:
This is a one-dimensional minimization problem. We play it safe and use the golden section search algorithm, which is guaranteed to find the minimum even when the first derivative is not continuous. We search inside the region \(\alpha \in [-N\underset{i}{\max}\,w_i, 0]\).
For a description of the golden section search algorithm see for example:
Numerical Recipes, Third Edition, Section 10.2: Golden Section Search in One Dimension
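For reference, here is a generic textbook-style sketch of the technique, which repeatedly shrinks a bracketing interval by the golden ratio (it is not the implementation documented later for this backend):
import math

def golden_min(f, a, b, eps=1e-8):
    # Find an approximate minimizer of a unimodal function f on [a, b]
    invphi = (math.sqrt(5.0) - 1.0) / 2.0   # 1/phi ~ 0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while abs(b - a) > eps:
        if f(c) < f(d):
            b = d      # the minimum lies in [a, d]
        else:
            a = c      # the minimum lies in [c, b]
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
    return (a + b) / 2.0

# Example: the minimum of (x - 1)**2 on [-3, 3] is found near x = 1
print golden_min(lambda x: (x - 1.0)**2, -3.0, 3.0)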
Substituting
in the Lagrangian we get:
The following figure shows the shape of \(h\):
Now, imagine we cluster the values of the pages \(w\) in \(K\) clusters. Let’s call \(W[k]\) the value of the \(k^{th}\) cluster and \(I[k]\) will be the indices of pages inside the cluster:
Then we rewrite \(Q\) as:
And make the approximation:
We call the accumulated page refresh rate inside the cluster \(k\) as:
Finally, we rephrase our minimization problem:
Of course it remains open how to actually perform the clustering, an operation that can be in itself costly. As a first approximation we can assume that \(0 < w_i \le 1\) and divide this interval equally in \(K\) clusters:
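With that assumption the cluster assignment reduces to a small bucketing function; a sketch (the real WCluster keeps additional bookkeeping per cluster):
def cluster_index(w, n_clusters=100):
    # Clamp the page value into (0, 1] and map it to one of n_clusters
    # equal-width buckets covering that interval.
    w = min(max(w, 1e-12), 1.0)
    return min(int(w * n_clusters), n_clusters - 1)

print cluster_index(0.005)   # -> 0
print cluster_index(0.995)   # -> 99
print cluster_index(1.0)     # -> 99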
Another possibility would be doing some kind of online k-means:
Web-Scale K-Means Clustering
D. Sculley
The main class of the above algorithm is implemented in
- class crawlfrontier.contrib.backends.opic.scheduler.OptimalSolver(verbose=False)¶
Initialize the solver.
Parameters: verbose (bool) – If True print convergence info
- solve(values, rates, frequency, n_pages)¶
Get frequencies that maximize page value per unit of time
Parameters: - values (list) – a list of page values (float in (0,1])
- rates (list) – a list of page change rates (float > 0)
- n_pages – number of pages
It must be that:
n_pages >= len(values) == len(rates)
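A hedged usage sketch follows, interpreting the frequency argument as the available mean crawl frequency \(\bar{f}\) and assuming solve returns the optimal per-page refresh frequencies; check the source before relying on either assumption:
from crawlfrontier.contrib.backends.opic.scheduler import OptimalSolver

values = [0.9, 0.5, 0.1]   # page values w_i in (0, 1]
rates = [1.0, 0.2, 2.0]    # page change rates lambda_i (> 0)

solver = OptimalSolver(verbose=True)
# n_pages >= len(values) == len(rates), as required above
frequencies = solver.solve(values, rates, 0.5, len(values))
print frequencies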
And the supporting algorithms are also implemented inside the same module. The “clustering” algorithm is in
- class crawlfrontier.contrib.backends.opic.scheduler.WCluster(n_clusters=100)¶
Cluster the w_i values of the pages
Initialize clusters
Parameters: n_clusters (int) – Number of clusters to use
- add(page_value, page_rate)¶
Adds page to cluster
- static clamp(w)¶
Make sure 0 <= w <= 1
- cluster_index(w)¶
For the given page value, find the cluster index
- delete(page_value, page_rate)¶
Deletes page from cluster
And the golden section search has been also implemented to avoid external dependencies:
- crawlfrontier.contrib.backends.opic.scheduler.golden_section_search(f, xmin, xmax, eps=1e-08, verbose=False)¶
Find the minimum of a function using the golden section search algorithm
Parameters: - f (function) – function to bracket
- xmin (float) – left side of the start interval
- xmax (float) – right side of the start interval
- eps (float) – required precision. Stop when the minimum is found inside an interval with length less than this value
- verbose (bool) – If True print convergence results
Returns: a pair of triples – (xa, xb, xc), (fa, fb, fc) or None if failure
The returned values obey:
f(xa) > f(xb), f(xc) > f(xb), xa < xb < xc, |xc - xa| < eps
Failure can happen only if the search could not be initialized; otherwise the algorithm is guaranteed to converge.
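For example, bracketing the minimum of a simple parabola (a usage sketch; the exact numbers in the returned triples depend on eps):
from crawlfrontier.contrib.backends.opic.scheduler import golden_section_search

result = golden_section_search(lambda x: (x - 1.0)**2, -3.0, 3.0, eps=1e-6)
if result is not None:
    (xa, xb, xc), (fa, fb, fc) = result
    print "minimum bracketed in", (xa, xc), "around", xb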
So, we have refreshed a page with the following intervals of time between refreshes, for \(M\) different refreshes:
We partition the set of refreshes into two sets: \(U\) is the set of refreshes when the page has been updated and \(S\) is the set of refreshes where we observed the page remained the same. More exactly:
The likelihood of this observation:
The log-likelihood:
Maximization of the log-likelihood gives us the following equation, where we apply the change \(\theta = e^{-\lambda}\):
Now, the above equation is very easy to solve if all intervals are equal: \(I^i=I\)
Or in terms of the original \(\lambda\):
The formula for \(\lambda\) when the intervals are constant is not only simple, it is also very convenient if we want to keep estimating \(\lambda\) over time, since we just need to track how many times the page was updated, \(|U|\), and how many times it remained the same, \(|S|\).
It’s interesting to note that we could very easily estimate \(\lambda\) if we could observe all changes in the web page. In that case, since \(\lambda\) is the estimated number of changes in an interval of time we could just say:
We will use this latter formula, which will always give a biased lower bound for \(\lambda\), as a simple approximation to estimate \(\lambda\).
Right now it is assumed that page changes follow a Poisson process. The details are given in the description of the scheduler algorithm, Optimal revisiting algorithm.
As always any implementation satisfying the following interface can be used:
- class crawlfrontier.contrib.backends.opic.freqest.FreqEstimatorInterface¶
- add(page_id)¶
Add initial frequency estimation for page_id
page_id – A string which identifies the page
- close()¶
Persist or flush all necessary information
- delete(page_id)¶
Stop tracking page_id
- frequency(page_id)¶
Return the estimated refresh frequency for the page. If the page is not being tracked return None
- refresh(page_id, updated)¶
Add new refresh information for page_id
updated – A boolean indicating if the page has changed
And we already have a working SQLite implementation, which just counts the number of updates in a given time interval:
- class crawlfrontier.contrib.backends.opic.freqest.Simple(db=None, clock=None, default_freq=0.0)¶
The simple estimator just computes the frequency as the total number of updates divided by the total observation time
Initialize estimator
- Arguments:
- db – updates database to use. If None provided create a new in-memory one.
- clock – A function that returns elapsed time in seconds from a fixed origin in time.
- default_freq – Return this frequency if not enough info available to compute an estimate
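The underlying computation is simple enough to sketch directly; here is a toy version that keeps its counters in a plain dict instead of the updates database (an illustration, not a replacement for the class above):
import time

class ToyFreqEstimator(object):
    # Toy sketch: frequency = observed updates / total observation time
    def __init__(self, clock=time.time, default_freq=0.0):
        self._clock = clock
        self._default_freq = default_freq
        self._pages = {}    # page_id -> (first_seen_time, observed_changes)

    def add(self, page_id):
        self._pages[page_id] = (self._clock(), 0)

    def refresh(self, page_id, updated):
        first_seen, changes = self._pages[page_id]
        self._pages[page_id] = (first_seen, changes + (1 if updated else 0))

    def frequency(self, page_id):
        if page_id not in self._pages:
            return None
        first_seen, changes = self._pages[page_id]
        elapsed = self._clock() - first_seen
        if elapsed <= 0 or changes == 0:
            return self._default_freq
        return changes / float(elapsed)

    def delete(self, page_id):
        self._pages.pop(page_id, None)

    def close(self):
        pass  # nothing to persist in this toy version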
The frequency estimator expects as input whether or not a page has changed, but it does not make any assumption about what constitutes a change, since this is the responsibility of a page change estimator, which has a very simple interface: given a page and its body, it returns the change status of the page:
- class crawlfrontier.contrib.backends.opic.pagechange.PageChangeInterface¶
Interface for all page change detectors
- update(page_id, page_body)¶
Returns one of the change status for the page
The value returned is one of the fields of Status
The current implementation is very simple: it just tests for any change in the body of the page by computing its hash. Notice that very simple changes, which don’t really add new content, will still be detected as changes.
- class crawlfrontier.contrib.backends.opic.pagechange.BodySHA1(db=None)¶
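The hashing idea itself fits in a few lines; the following sketch stores the last SHA1 of each body in a dict and reports plain strings instead of the Status fields used by the real class:
import hashlib

class ToyBodyHashDetector(object):
    # Toy page change detector: a page "changed" if the SHA1 of its body changed
    def __init__(self):
        self._hashes = {}   # page_id -> last seen SHA1 hex digest of the body

    def update(self, page_id, page_body):
        digest = hashlib.sha1(page_body).hexdigest()
        previous = self._hashes.get(page_id)
        self._hashes[page_id] = digest
        if previous is None:
            return 'new'        # first time we see this page
        return 'updated' if previous != digest else 'not_updated'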
Configuration¶
The constructor has a lot of arguments, but they will be filled in automatically from the global settings. Apart from the settings common to other backends, this backend uses the following additional settings:
BACKEND_OPIC_IN_MEMORY (default False)
If True all information will be kept in memory. This will make it run faster but will also consume more memory and, more importantly, you will not be able to resume the crawl after you shut down the spider.
BACKEND_OPIC_WORKDIR
If BACKEND_OPIC_IN_MEMORY is False, then all the state information necessary to resume the crawl will be kept inside this directory. If this directory is not set, a default one will be generated in the current directory following the pattern:
crawl-opic-DYYYY.MM.DD-THH.mm.SS
Where YYYY is the current year, MM is the month, etc...
BACKEND_OPIC_SCHEDULER
‘optimal’ will use Optimal. Any other value, or if not set, will use BestFirst.
BACKEND_TEST
If True the backend will save some information to inspect after the databases are closed to test the backend performance.
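For example, a frontier settings file using this backend with on-disk persistence and the optimal scheduler could include the following (values are illustrative):
BACKEND_OPIC_IN_MEMORY = False
BACKEND_OPIC_WORKDIR = 'crawl-opic-state'   # illustrative path
BACKEND_OPIC_SCHEDULER = 'optimal'
BACKEND_TEST = False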
Persistence¶
The following SQLite databases are generated inside the BACKEND_OPIC_WORKDIR:
- graph.sqlite: the graph database. See Graph database.
- freqs.sqlite: the output of the scheduler. How frequently each page should be crawled. See Revisiting scheduler.
- hash.sqlite: a hash of the body of each page the last time it was visited. It allows tracking whether a page has changed. See Page change rate estimator
- hits.sqlite: the HITS score information and additional information for the OPIC algorithm. See HITS scores database.
- pages.sqlite: additional info about pages, like URL and domain.
- scheduler.sqlite: page value and page change rate used by the scheduler algorithm. See Revisiting scheduler.
- updates.sqlite: number of updates in a given interval of time. Used for page change rate estimation. See Page change rate estimator
Since they are standard SQLite databases they can be accessed using any tool of your choice (for example sqlitebrowser), which is useful for debugging or interfacing with other tools. An example would be accessing the data inside the databases to compare the precision of OPIC and the power method, as explained in OPIC vs power method.
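For instance, the tables of any of these files can be listed from Python without extra dependencies (the workdir path is illustrative, and the table names depend on the backend’s schema, which is not documented here):
import sqlite3

conn = sqlite3.connect('crawl-opic-D2015.01.12-T12.00.00/graph.sqlite')
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print name
conn.close()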
- Middlewares
- Filter or alter information for links and pages.
- Backends
- Define your own crawling logic.
Built-in services and tools¶
Using the Frontier with Scrapy¶
Using Crawl Frontier with Scrapy is quite easy: it includes a set of Scrapy middlewares that encapsulate frontier usage and can be easily configured using Scrapy settings.
Activating the frontier¶
The frontier uses 2 different middlewares: CrawlFrontierSpiderMiddleware and CrawlFrontierDownloaderMiddleware.
To activate the frontier in your Scrapy project, just add them to the SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES settings:
SPIDER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierDownloaderMiddleware': 1000,
})
Create a Crawl Frontier settings.py file and add it to your Scrapy settings:
FRONTIER_SETTINGS = 'tutorial/frontier/settings.py'
Organizing files¶
When using frontier with a Scrapy project, we propose the following directory structure:
my_scrapy_project/
my_scrapy_project/
frontier/
__init__.py
settings.py
middlewares.py
backends.py
spiders/
...
__init__.py
settings.py
scrapy.cfg
These are basically:
- my_scrapy_project/frontier/settings.py: the frontier settings file.
- my_scrapy_project/frontier/middlewares.py: the middlewares used by the frontier.
- my_scrapy_project/frontier/backends.py: the backend(s) used by the frontier.
- my_scrapy_project/spiders: the Scrapy spiders folder
- my_scrapy_project/settings.py: the Scrapy settings file
- scrapy.cfg: the Scrapy config file
Running the Crawl¶
Just run your Scrapy spider as usual from the command line:
scrapy crawl myspider
In case you need to disable frontier, you can do it by overriding the FRONTIER_ENABLED setting:
scrapy crawl myspider -s FRONTIER_ENABLED=False
Frontier Scrapy settings¶
Here’s a list of all available Crawl Frontier Scrapy settings, in alphabetical order, along with their default values and the scope where they apply:
FRONTIER_SCHEDULER_CONCURRENT_REQUESTS¶
Default: 256
Number of concurrent requests that the middleware will maintain while asking for next pages.
FRONTIER_SCHEDULER_INTERVAL¶
Default: 0.01
Interval, in seconds, between checks of the number of pending requests. Indicates how often the frontier will be asked for new pages if there is room for new requests.
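Both can be overridden from your Scrapy settings file, for example (values are illustrative):
FRONTIER_SCHEDULER_CONCURRENT_REQUESTS = 64
FRONTIER_SCHEDULER_INTERVAL = 0.1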
Using the Frontier with Requests¶
To integrate frontier with Requests library, there is a RequestsFrontierManager class available.
This class is just a simple FrontierManager wrapper that uses Requests objects (Request/Response) and converts them from and to frontier ones for you.
Use it in the same way as FrontierManager: initialize it with your settings and use Requests Request and Response objects. The get_next_requests method will return Requests Request objects.
An example:
import re
import requests
from urlparse import urljoin
from crawlfrontier.contrib.requests.manager import RequestsFrontierManager
from crawlfrontier import Settings
SETTINGS = Settings()
SETTINGS.BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'
SETTINGS.LOGGING_MANAGER_ENABLED = True
SETTINGS.LOGGING_BACKEND_ENABLED = True
SETTINGS.MAX_REQUESTS = 100
SETTINGS.MAX_NEXT_REQUESTS = 10
SEEDS = [
'http://www.imdb.com',
]
LINK_RE = re.compile(r'href="(.*?)"')
def extract_page_links(response):
return [urljoin(response.url, link) for link in LINK_RE.findall(response.text)]
if __name__ == '__main__':
frontier = RequestsFrontierManager(SETTINGS)
frontier.add_seeds([requests.Request(url=url) for url in SEEDS])
while True:
next_requests = frontier.get_next_requests()
if not next_requests:
break
for request in next_requests:
try:
response = requests.get(request.url)
links = [requests.Request(url=url) for url in extract_page_links(response)]
frontier.page_crawled(response=response, links=links)
except requests.RequestException, e:
error_code = type(e).__name__
frontier.request_error(request, error_code)
Graph Manager¶
The Graph Manager is a tool to represent web sitemaps as a graph.
It can easily be used to test frontiers. We can “fake” crawler requests/responses by querying pages from the graph manager, and we also know the links extracted from each one without using a crawler at all. You can write your own fake tests or use the Frontier Tester tool.
You can use it by defining your own sites for testing or use the Scrapy Recorder to record crawlings that can be reproduced later.
Defining a Site Graph¶
Pages from a web site and its links can be easily defined as a directed graph, where each node represents a page and the edges the links between them.
Let’s use a really simple site representation: a starting page A that has links to three pages B, C and D. We can represent the site with this graph:

We use a list to represent the different site pages and one tuple to define each page and its links. For the previous example:
site = [
('A', ['B', 'C', 'D']),
]
Note that we don’t need to define pages without links, but including them is also a valid representation:
site = [
('A', ['B', 'C', 'D']),
('B', []),
('C', []),
('D', []),
]
A more complex site:

Can be represented as:
site = [
('A', ['B', 'C', 'D']),
('D', ['A', 'D', 'E', 'F']),
]
Note that D links to itself and to its parent A.
In the same way, a page can have several parents:

site = [
('A', ['B', 'C', 'D']),
('B', ['C']),
('D', ['C']),
]
In order to simplify the examples we’re not using URLs for page representation, but of course URLs are the intended use for site graphs:

site = [
('http://example.com', ['http://example.com/anotherpage', 'http://othersite.com']),
]
Using the Graph Manager¶
Once we have defined our site represented as a graph, we can start using it with the Graph Manager.
We must first create our graph manager:
>>> from crawlfrontier import graphs
>>> g = graphs.Manager()
And add the site using the add_site method:
>>> site = [('A', ['B', 'C', 'D'])]
>>> g.add_site(site)
The manager is now initialized and ready to be used.
We can get all the pages in the graph:
>>> g.pages
[<1:A*>, <2:B>, <3:C>, <4:D>]
The asterisk indicates that the page is a seed. If we want to get just the seeds of the site graph:
>>> g.seeds
[<1:A*>]
We can get individual pages using get_page; if a page does not exist, None is returned:
>>> g.get_page('A')
<1:A*>
>>> g.get_page('F')
None
CrawlPage objects¶
Pages are represented as a CrawlPage object:
- class CrawlPage¶
A CrawlPage object represents a Graph Manager page, and is usually generated by the Graph Manager.
- id¶
Autonumeric page id.
- url¶
The url of the page.
- status¶
Represents the HTTP code status of the page.
- is_seed¶
Boolean value indicating if the page is seed or not.
- links¶
List of pages the current page links to.
- referers¶
List of pages that link to the current page.
In our example:
>>> p = g.get_page('A')
>>> p.id
1
>>> p.url
u'A'
>>> p.status # defaults to 200
u'200'
>>> p.is_seed
True
>>> p.links
[<2:B>, <3:C>, <4:D>]
>>> p.referers # No referers for A
[]
>>> g.get_page('B').referers # referers for B
[<1:A*>]
Adding pages and Links¶
Site graphs can also be defined by adding pages and links individually; the same graph from our example can be defined this way:
>>> g = graphs.Manager()
>>> a = g.add_page(url='A', is_seed=True)
>>> b = g.add_link(page=a, url='B')
>>> c = g.add_link(page=a, url='C')
>>> d = g.add_link(page=a, url='D')
add_page and add_link can be combined with add_site and used anytime:
>>> site = [('A', ['B', 'C', 'D'])]
>>> g = graphs.Manager()
>>> g.add_site(site)
>>> d = g.get_page('D')
>>> g.add_link(d, 'E')
Adding multiple sites¶
Multiple sites can be added to the manager:
>>> site1 = [('A1', ['B1', 'C1', 'D1'])]
>>> site2 = [('A2', ['B2', 'C2', 'D2'])]
>>> g = graphs.Manager()
>>> g.add_site(site1)
>>> g.add_site(site2)
>>> g.pages
[<1:A1*>, <2:B1>, <3:C1>, <4:D1>, <5:A2*>, <6:B2>, <7:C2>, <8:D2>]
>>> g.seeds
[<1:A1*>, <5:A2*>]
Or as a list of sites with add_site_list method:
>>> site_list = [
[('A1', ['B1', 'C1', 'D1'])],
[('A2', ['B2', 'C2', 'D2'])],
]
>>> g = graphs.Manager()
>>> g.add_site_list(site_list)
Graphs Database¶
Graph Manager uses SQLAlchemy to store and represent graphs.
By default it uses an in-memory SQLite database as a storage engine, but any databases supported by SQLAlchemy can be used.
An example using SQLite:
>>> g = graphs.Manager(engine='sqlite:///graph.db')
Changes are committed with every new add by default, and graphs can be loaded later:
>>> graph = graphs.Manager(engine='sqlite:///graph.db')
>>> graph.add_site([('A', [])])
>>> another_graph = graphs.Manager(engine='sqlite:///graph.db')
>>> another_graph.pages
[<1:A*>]
A database content reset can be done using the clear_content parameter:
>>> g = graphs.Manager(engine='sqlite:///graph.db', clear_content=True)
Using graphs with status codes¶
In order to recreate/simulate crawling using graphs, HTTP response codes can be defined for each page.
Example for a 404 error:
>>> g = graphs.Manager()
>>> g.add_page(url='A', status=404)
Status codes can be defined for sites in the following way using a list of tuples:
>>> site_with_status_codes = [
((200, "A"), ["B", "C"]),
((404, "B"), ["D", "E"]),
((500, "C"), ["F", "G"]),
]
>>> g = graphs.Manager()
>>> g.add_site(site_with_status_codes)
The default status code for new pages is 200.
A simple crawl faking example¶
Frontier tests can be done more easily using the Frontier Tester tool, but here’s an example of how to fake a crawl with a frontier:
from crawlfrontier import FrontierManager, graphs, Request, Response
if __name__ == '__main__':
# Load graph from existing database
graph = graphs.Manager('sqlite:///graph.db')
# Create frontier from default settings
frontier = FrontierManager.from_settings()
# Create and add seeds
seeds = [Request(seed.url) for seed in graph.seeds]
frontier.add_seeds(seeds)
# Get next requests
next_requests = frontier.get_next_requests()
# Crawl pages
while (next_requests):
for request in next_requests:
# Fake page crawling
crawled_page = graph.get_page(request.url)
# Create response
response = Response(url=crawled_page.url, status_code=crawled_page.status)
# Update Page
page = frontier.page_crawled(response=response,
links=[link.url for link in crawled_page.links])
# Get next requests
next_requests = frontier.get_next_requests()
Rendering graphs¶
Graphs can be rendered to png files:
>>> g.render(filename='graph.png', label='A simple Graph')
Rendering graphs uses pydot, a Python interface to Graphviz‘s Dot language.
How to use it¶
Graph Manager can be used to test frontiers in conjunction with Frontier Tester and also with Scrapy Recordings.
Testing a Frontier¶
Frontier Tester is a helper class for easy frontier testing.
Basically, it runs a fake crawl against a Frontier; crawl info is faked using a Graph Manager instance.
Creating a Frontier Tester¶
FrontierTester needs Graph Manager and FrontierManager instances:
>>> from crawlfrontier import FrontierManager, FrontierTester, graphs
>>> graph = graphs.Manager('sqlite:///graph.db') # Crawl fake data loading
>>> frontier = FrontierManager.from_settings() # Create frontier from default settings
>>> tester = FrontierTester(frontier, graph)
Running a Test¶
The tester is now initialized; to run the test just call the run method:
>>> tester.run()
When the run method is called, the tester will:
- Add all the seeds from the graph.
- Ask the frontier about next pages.
- Fake page response and inform the frontier about page crawl and its links.
Steps 2 and 3 are repeated until the crawl or the frontier ends.
Once the test is finished, the crawling page sequence is available in the tester’s sequence attribute as a list of frontier Request objects.
Test Parameters¶
In some test cases you may want to add all graph pages as seeds; this can be done with the add_all_pages parameter:
>>> tester.run(add_all_pages=True)
The maximum number of returned pages per get_next_requests call can be set using frontier settings, but it can also be modified when creating the FrontierTester with the max_next_pages argument:
>>> tester = FrontierTester(frontier, graph, max_next_pages=10)
An example of use¶
A working example using test data from graphs and basic backends:
from crawlfrontier import FrontierManager, Settings, FrontierTester, graphs
def test_backend(backend):
# Graph
graph = graphs.Manager()
graph.add_site_list(graphs.data.SITE_LIST_02)
# Frontier
settings = Settings()
settings.BACKEND = backend
settings.TEST_MODE = True
frontier = FrontierManager.from_settings(settings)
# Tester
tester = FrontierTester(frontier, graph)
tester.run(add_all_pages=True)
# Show crawling sequence
print '-'*40
print frontier.backend.name
print '-'*40
for page in tester.sequence:
print page.url
if __name__ == '__main__':
test_backend('crawlfrontier.contrib.backends.memory.heapq.FIFO')
test_backend('crawlfrontier.contrib.backends.memory.heapq.LIFO')
test_backend('crawlfrontier.contrib.backends.memory.heapq.BFS')
test_backend('crawlfrontier.contrib.backends.memory.heapq.DFS')
Recording a Scrapy crawl¶
Scrapy Recorder is a set of Scrapy middlewares that will allow you to record a scrapy crawl and store it into a Graph Manager.
This can be useful to perform frontier tests without having to crawl the entire site again, or even without using Scrapy.
Activating the recorder¶
The recorder uses 2 different middlewares: CrawlRecorderSpiderMiddleware and CrawlRecorderDownloaderMiddleware.
To activate the recording in your Scrapy project, just add them to the SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES settings:
SPIDER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.recording.CrawlRecorderSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.recording.CrawlRecorderDownloaderMiddleware': 1000,
})
Choosing your storage engine¶
As Graph Manager is internally used by the recorder to store crawled pages, you can choose between different storage engines.
We can set the storage engine with the RECORDER_STORAGE_ENGINE setting:
RECORDER_STORAGE_ENGINE = 'sqlite:///my_record.db'
You can also choose to reset the database tables or just reset the data with these settings:
RECORDER_STORAGE_DROP_ALL_TABLES = True
RECORDER_STORAGE_CLEAR_CONTENT = True
Running the Crawl¶
Just run your Scrapy spider as usual from the command line:
scrapy crawl myspider
Once it’s finished you should have the recording available and ready for use.
In case you need to disable recording, you can do it by overriding the RECORDER_ENABLED setting:
scrapy crawl myspider -s RECORDER_ENABLED=False
Recorder settings¶
Here’s a list of all available Scrapy Recorder settings, in alphabetical order, along with their default values and the scope where they apply.
RECORDER_STORAGE_CLEAR_CONTENT¶
Default: True
Deletes table content from storage database in Graph Manager.
RECORDER_STORAGE_ENGINE¶
Default: None
Sets Graph Manager storage engine used to store the recording.
Scrapy Seed Loaders¶
Crawl Frontier has some built-in Scrapy middlewares for seed loading.
Seed loaders use the process_start_requests method to generate requests from a source that are added later to the FrontierManager.
Activating a Seed loader¶
Just add the Seed Loader middleware to the SPIDER_MIDDLEWARES scrapy settings:
SPIDER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.seeds.FileSeedLoader': 650
})
FileSeedLoader¶
Load seed URLs from a file. The file must be formatted to contain one URL per line:
http://www.asite.com
http://www.anothersite.com
...
You can disable URLs using the # character:
...
#http://www.acommentedsite.com
...
Settings:
- SEEDS_SOURCE: Path to the seeds file
S3SeedLoader¶
Load seeds from a file stored in an Amazon S3 bucket
The file format should be the same as the one used in FileSeedLoader.
Settings:
- SEEDS_SOURCE: Path to S3 bucket file. eg: s3://some-project/seed-urls/
- SEEDS_AWS_ACCESS_KEY: S3 credentials Access Key
- SEEDS_AWS_SECRET_ACCESS_KEY: S3 credentials Secret Access Key
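A hypothetical configuration using placeholder values (only the setting names come from the list above):
SEEDS_SOURCE = 's3://some-project/seed-urls/'
SEEDS_AWS_ACCESS_KEY = 'YOUR_ACCESS_KEY'
SEEDS_AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'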
- Using the Frontier with Scrapy
- Learn how to use Crawl Frontier with Scrapy.
- Using the Frontier with Requests
- Learn how to use Crawl Frontier with Requests.
- Graph Manager
- Define fake crawlings for websites to test your frontier.
- Testing a Frontier
- Test your frontier in an easy way.
- Recording a Scrapy crawl
- Create Scrapy crawl recordings and reproduce them later.
- Scrapy Seed Loaders
- Scrapy middlewares for seed loading
All the rest¶
Examples¶
The project repo includes an examples folder with some scripts and projects using CrawlFrontier:
examples/
requests/
scrapy_frontier/
scrapy_recording/
scripts/
- requests: Example script with Requests library.
- scrapy_frontier: Scrapy Frontier example project.
- scrapy_recording: Scrapy Recording example project.
- scripts: Some simple scripts.
Note
These examples may need additional libraries installed in order to work.
You can install them using pip:
pip install -r requirements/examples.txt
requests¶
A simple script that follows all the links from a site using the Requests library.
How to run it:
python links_follower.py
scrapy_frontier¶
An example project with a spider that follows all the links for the sites defined in a seeds.txt file.
How to run it:
scrapy crawl example
scrapy_recording¶
An example project with a spider that follows all the links for a site, recording crawling results.
How to run it:
scrapy crawl recorder
scripts¶
Some sample scripts on how to use different frontier components.
Tests¶
Crawl Frontier tests are implemented using the pytest tool.
You can install pytest and the additional required libraries used in the tests using pip:
pip install -r requirements/tests.txt
Writing tests¶
All functionality (including new features and bug fixes) must include a test case to check that it works as expected, so please include tests for your patches if you want them to get accepted sooner.
Backend testing¶
A base pytest class for Backend testing is provided: BackendTest
Let’s say, for instance, that you want to test your backend MyBackend, creating a new frontier instance for each test method call. You can define a test class like this:
class TestMyBackend(backends.BackendTest):
backend_class = 'crawlfrontier.contrib.backend.abackend.MyBackend'
def test_one(self):
frontier = self.get_frontier()
...
def test_two(self):
frontier = self.get_frontier()
...
...
And let’s also say that it uses a database file that you need to clean before and after each test:
class TestMyBackend(backends.BackendTest):
backend_class = 'crawlfrontier.contrib.backend.abackend.MyBackend'
def setup_backend(self, method):
self._delete_test_db()
def teardown_backend(self, method):
self._delete_test_db()
def _delete_test_db(self):
try:
os.remove('mytestdb.db')
except OSError:
pass
def test_one(self):
frontier = self.get_frontier()
...
def test_two(self):
frontier = self.get_frontier()
...
...
Testing backend sequences¶
To test Backend crawling sequences you can use the BackendSequenceTest class.
The BackendSequenceTest class will run a complete crawl of the passed site graphs and return the sequence used by the backend for visiting the different pages.
Let’s say you want to test a backend that sorts pages in alphabetical order. You can define the following test:
class TestAlphabeticSortBackend(backends.BackendSequenceTest):
backend_class = 'crawlfrontier.contrib.backend.abackend.AlphabeticSortBackend'
SITE_LIST = [
[
('C', []),
('B', []),
('A', []),
],
]
def test_one(self):
# Check sequence is the expected one
self.assert_sequence(site_list=self.SITE_LIST,
expected_sequence=['A', 'B', 'C'],
max_next_requests=0)
def test_two(self):
# Get sequence and work with it
sequence = self.get_sequence(site_list=self.SITE_LIST,
max_next_requests=0)
assert len(sequence) > 2
...
Testing basic algorithms¶
If your backend uses any of the basic algorithm logics, you can just inherit the corresponding test base class for each logic, and sequences will be automatically tested for it:
from crawlfrontier.tests import backends
class TestMyBackendFIFO(backends.FIFOBackendTest):
backend_class = 'crawlfrontier.contrib.backends.abackend.MyBackendFIFO'
class TestMyBackendLIFO(backends.LIFOBackendTest):
backend_class = 'crawlfrontier.contrib.backends.abackend.MyBackendLIFO'
class TestMyBackendDFS(backends.DFSBackendTest):
backend_class = 'crawlfrontier.contrib.backends.abackend.MyBackendDFS'
class TestMyBackendBFS(backends.BFSBackendTest):
backend_class = 'crawlfrontier.contrib.backends.abackend.MyBackendBFS'
class TestMyBackendRANDOM(backends.RANDOMBackendTest):
backend_class = 'crawlfrontier.contrib.backends.abackend.MyBackendRANDOM'
Release Notes¶
0.2.0 (released 2015-01-12)¶
- Added documentation (Scrapy Seed Loaders+Tests+Examples) (8e5f60d)
- Refactored backend tests (00910bf, 5702bef, 9567566)
- Added requests library example (8796011)
- Added requests library manager and object converters (d6590b6)
- Added FrontierManagerWrapper (4f04a48)
- Added frontier object converters (7da51a4)
- Fixed script examples for new changes (101ea27)
- Optional Color logging (only if available) (c0ba0ba)
- Changed Scrapy frontier and recorder integration to scheduler+middlewares (cbe5f4f / 2fcdc06 / f7bf02b / 0d15dc1)
- Changed default frontier backend (03cd307)
- Added comment support to seeds (7d48973)
- Added doc requirements for RTD build (27daea4)
- Removed optional dependencies for setup.py and requirements (c6099f3 / 79a4e4d / e6910e3)
- Changed tests to pytest (848d2bf / edc9c01 / c318d14)
- Updated docstrings and documentation (fdccd92 / 9dec38c / 71d626f / 0977bbf)
- Changed frontier components (Backend and Middleware) to abc (1e74467)
- Modified Scrapy frontier example to use seed loaders (0ad905d)
- Refactored Scrapy Seed loaders (a0eac84)
- Added new fields to Request and Response frontier objects (bb64afb)
- Added ScrapyFrontierManager (Scrapy wrapper for Frontier Manager) (8e50dc0)
- Changed frontier core objects (Page/Link to Request/Response) (74b54c8)
0.1¶
First release of Crawl Frontier.
- Examples
- Some example projects and scripts using Crawl Frontier.
- Tests
- How to run and write Crawl Frontier tests.
- Release Notes
- See what has changed in recent Crawl Frontier versions.